In my last blog, I discussed how to threshold an image, and we ended up doing quite a good job of cleaning up the image with the creased paper. But I finished that blog by showing that our algorithm had a problem: it didn’t cope well with coffee cup stains.
The problem is that coffee cup stains are dark, so our existing algorithm cannot distinguish between dark writing and dark stains. We need to find some features that tell the two apart.
Looking at the image, we can see that even though the stains are dark, the writing is darker. The stains are dark compared to the overall image, but lighter than the writing within them.
So we can hypothesise that we might be able to separate the writing from the stains by using thresholding that looks at pixel brightness over a localised set of pixels, rather than across the whole image. That way we might separate the darker writing from the lighter (but still dark) background stain.
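To make the idea concrete before reaching for any packages, here is a minimal sketch of local thresholding in base R. The function name, window size and offset are my own illustrative choices (and the double loop is far too slow for production use), but it shows the intuition: compare each pixel to the mean brightness of its neighbourhood, rather than to a single global threshold.

# a minimal sketch of local thresholding, assuming img is a greyscale
# matrix with values in [0, 1] (as returned by png::readPNG)
localThreshold = function(img, k = 5, offset = 0.05) {
  pad = k %/% 2
  out = matrix(1, nrow(img), ncol(img))
  for (i in 1:nrow(img)) {
    for (j in 1:ncol(img)) {
      rows = max(1, i - pad):min(nrow(img), i + pad)
      cols = max(1, j - pad):min(ncol(img), j + pad)
      # keep a pixel as writing only if it is darker than its neighbourhood
      if (img[i, j] < mean(img[rows, cols]) - offset) out[i, j] = 0
    }
  }
  return (out)
}

In practice we want something faster and smarter than this, so let’s see what the R package ecosystem offers.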
While I am an R programmer and a fan of R, I have to admit that R can be a frustrating platform for image processing. For example, I could not get the adimpro package to work at all, despite much experimentation, tinkering and googling for solutions. Then I discovered that the biOps package had been removed from CRAN, and the archived versions refuse to install on my PC. But eventually I found ways to get image processing done in R.
First, I tried the EbayesThresh package.
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, EbayesThresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using EbayesThresh
et = ebayesthresh(img)
plot(raster(et))
range(et)
All of the values are zero! That isn’t very helpful.
Second, I tried the treethresh package.
# experiment with treethresh
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using tree thresholding
tt = treethresh(img)
tt.pruned <- prune(tt)
tt.thresholded <- thresh(tt.pruned)
plot(raster(tt.thresholded))
range(tt.thresholded)
Once again I only got zeroes.
Finally, I tried wavelet thresholding; the imwd, threshold and imwr functions used below come from the wavethresh package, which treethresh builds upon.
# experiment with wavelet thresholding
if (!require("pacman")) install.packages("pacman")
# imwd, threshold and imwr come from the wavethresh package
pacman::p_load(png, raster, treethresh, wavethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using wavelet thresholding
# need a square image whose side length is a power of 2
# compute the wavelet transform
wt.square <- imwd(img[1:256, 1:256])
# perform the thresholding
wt.thresholded = threshold(wt.square)
# compute the inverse wavelet transform
wt.denoised <- imwr(wt.thresholded)
plot(raster(wt.denoised))
It wasn’t what I was looking for, but at least it’s not just a matrix of zeroes!
After these disappointing results with CRAN packages, I decided that I needed to look beyond CRAN and try out the EBImage package on Bioconductor.
if (!require("EBImage")) { source("http://bioconductor.org/biocLite.R") biocLite("EBImage") } # libraries if (!require("pacman")) install.packages("pacman") pacman::p_load(png, raster) # sample image containing coffee cup stain img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png") # using adaptive thresholding img.eb = readImage("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png") img.thresholded.3 = thresh(img.eb, 3, 3) display(img.thresholded.3) img.thresholded.5 = thresh(img.eb, 5, 5) display(img.thresholded.5) img.thresholded.7 = thresh(img.eb, 7, 7) display(img.thresholded.7) img.thresholded.9 = thresh(img.eb, 9, 9) display(img.thresholded.9)
These images show that we are on the right track. The adaptive thresholding is good at separating the letters from the background, but it also adds black to the background. While the adaptive thresholding may not be perfect, any feature that adds to our knowledge will probably improve our ensemble model.
Before we add adaptive thresholding to our model, we will do a bit more pre-processing. Based on the observation that adaptive thresholding keeps the letters but adds some unwanted black to the background, we can combine the five adaptive thresholding images with the k-means thresholding image from my last blog, and use the maximum pixel brightness across each set of six images.
# a function to do k-means thresholding
kmeansThreshold = function(img) {
  # fit 3 clusters
  v = img2vec(img)
  km.mod = kmeans(v, 3)
  # allow for the random ordering of the clusters
  oc = order(km.mod$centers)
  # the higher threshold is the halfway point between the top of the middle cluster
  # and the bottom of the highest cluster
  hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

  # using upper threshold
  imgHi = v
  imgHi[imgHi <= hiThresh] = 0
  imgHi[imgHi > hiThresh] = 1

  return (imgHi)
}

# a function to turn a matrix image into a vector
img2vec = function(img) {
  return (matrix(img, nrow(img) * ncol(img), 1))
}

img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# a function to convert an Image into a matrix
Image2Mat = function(Img) {
  m1 = t(matrix(Img, nrow(Img), ncol(Img)))
  return(m1)
}

# combine the adaptive thresholding images with the k-means thresholding image,
# keeping the maximum (brightest) value for each pixel
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)),
              img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)),
              img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
plot(raster(ttt.3))
It’s starting to look much better. The coffee cup stain isn’t removed, but we are starting to successfully clean it up. This suggests that adaptive thresholding will be an important predictor in our ensemble.
Let’s bring it all together, and build a new ensemble model, extending the model from my last blog on this subject.
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW)
if (!require("EBImage")) {
  source("http://bioconductor.org/biocLite.R")
  biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img) {
  # fit 3 clusters
  v = img2vec(img)
  km.mod = kmeans(v, 3)
  # allow for the random ordering of the clusters
  oc = order(km.mod$centers)
  # the higher threshold is the halfway point between the top of the middle cluster
  # and the bottom of the highest cluster
  hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

  # using upper threshold
  imgHi = v
  imgHi[imgHi <= hiThresh] = 0
  imgHi[imgHi > hiThresh] = 1

  return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img) {
  img.eb <- Image(t(img))
  img.thresholded.3 = thresh(img.eb, 3, 3)
  img.thresholded.5 = thresh(img.eb, 5, 5)
  img.thresholded.7 = thresh(img.eb, 7, 7)
  img.thresholded.9 = thresh(img.eb, 9, 9)
  img.thresholded.11 = thresh(img.eb, 11, 11)
  img.kmThresh = kmeansThreshold(img)

  # combine the adaptive thresholding images with the k-means thresholding image
  ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)),
                img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)),
                img2vec(Image2Mat(img.thresholded.11)), img2vec(img.kmThresh))
  ttt.2 = apply(ttt.1, 1, max)
  ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
  return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img) {
  return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img) {
  m1 = t(matrix(Img, nrow(Img), ncol(Img)))
  return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames) {
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  # turn the images into vectors
  x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
  y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

  # threshold the image
  x2 = kmeansThreshold(imgX)

  # adaptive thresholding
  x3 = img2vec(adaptiveThresholding(imgX))

  dat = data.table(cbind(y, x, x2, x3))
  setnames(dat, c("y", "raw", "thresholded", "adaptive"))
  write.table(dat, file = outPath, append = (f != filenames[1]), sep = ",",
              row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive, data = dat[rows,], n.trees = 7500,
              cv.folds = 3, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod, method = "cv", oobag.curve = FALSE)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 # change the 6 to your number of CPU cores, or maybe lower due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat)) %% numCores)
yHat = foreach(i = unique(split_testing), .combine = c, .packages = c("gbm")) %dopar% {
  as.numeric(predict(gbm.mod, newdata = dat[split_testing == i,], n.trees = best.iter))
}
stopCluster(cl)

# what score do we get on the training data?
rmse = sqrt(mean((yHat - dat$y) ^ 2))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)))
setnames(x, c("raw", "thresholded", "adaptive"))
yHat = predict(gbm.mod, newdata = x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
The adaptive thresholding gets a relative importance score of 2.4%, but full image thresholding loses all of its importance in the model.
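Those relative importance scores come from the summary call in the script above; printing the result shows one row per feature with its relative influence percentage.

# the relative influence figures quoted above come from the
# s = summary(gbm.mod) call in the script; each row pairs a feature
# with its rel.inf percentage (summary() also plots them by default)
print(s)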
The result may not be perfect, but you can see how the coffee cup stain is starting to be erased. We will need to find some more features to continue the clean up, but that will be left to future blogs.
This blog’s model has improved the RMSE on the training data from the 6.5% reported in my last blog to 5.4% in this latest version. There’s a long way to go to have a top-10 performing model, but this process is steadily moving us in the right direction.
I always learn so much with each of your posts about approaching these kinds of problems and the R tools that can help, thank you so much!
One lingering question that I had from this post concerns line 119 in your code. Is cannyDilated a feature we’ve covered? Sorry if I somehow missed it. I suppose it’s a thresholding technique related to edge detection?
Thanks again!
Hi Bobby,
Glad you’re enjoying the series. How are you doing in the competition? I’m busy researching my final model to submit next month before the competition ends.
That line of R script doesn’t belong in this blog. It’s a snippet from my next blog in this series. It is a composite feature that I developed that combines canny edge detection and morphology. I will probably edit it out of this blog.
Colin
I’ve been climbing up the leaderboard thanks to the improvements from each insight you provide in your blog, which has been satisfying but, most importantly, educational. I’ve also been poring over as much literature as I can about particular models and their strengths and shortcomings, but I still find myself asking very fundamental questions.
For example, I remember back in your first post that we used a linear model for our predictions. But instead of images, let’s say we had a dataset for predicting housing prices, with many features. Obviously it would be hard to visualise such a dataset due to its multidimensionality, so my question is how we can ascertain whether the features have a linear or nonlinear relationship with the target without the aid of a graph.
Ah, ok! So that line of code was a teaser :).
How have you approached tuning the parameters for gbm and the k-means thresholding? Do you do any auto-tuning?
Also, yesterday I came across an R package called “xgboost”. It’s an optimised general-purpose gradient boosting library that supports parallelisation. I’m not sure how it holds up against R’s gbm library, but I thought I’d mention it.
That’s a really good question. I think that you should always visualise relationships, and use graphs whenever possible, because our brains are better at processing visual information than abstractions. Begin with one-way analysis: plot the target value against one of your predictors or features. Then try something like I suggest in the blog I posted this morning, which you can find at https://colinpriest.com/2015/08/14/denoising-dirty-documents-part-3/ and which approaches this issue using generalized additive models. You can extend that blog’s approach by building a multi-dimensional GAM, and you can even do something similar with GBMs, but that would take a full blog to explain.
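To make that concrete, here is a minimal sketch using the mgcv package; the houses data frame and its price and area columns are made up purely for illustration.

# a minimal sketch, assuming a hypothetical data frame named houses
# with a price column and an area predictor
library(mgcv)
gam.mod = gam(price ~ s(area), data = houses)
summary(gam.mod)  # an effective degrees of freedom (edf) near 1 for s(area) suggests a linear effect
plot(gam.mod)     # a roughly straight fitted smooth also indicates linearity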
So far I haven’t done much tuning of my model hyperparameters, because the initial focus should be on feature extraction rather than fine-tuning the machine learning. When I did do some hyperparameter tuning, it was nothing complex: just a manual grid search over a few different hyperparameter values. My best model so far takes a day to fit. Unlike some competitors, I don’t have much computing power (e.g. a cluster) at my disposal, nor do I have the budget to rent an AWS cluster, so a full hyperparameter tuning run would take too long on my one workstation PC. I can tell you, though, that the hyperparameters I post in my blogs are not adequate to do well in the competition; they were selected to get a reasonable result in a short run time, so that I can demonstrate the basic principles. I leave it to my readers to experiment with the learning rate, tree depth and number of trees, along the lines of the sketch below.
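For what it’s worth, a manual grid search can be as simple as this sketch, reusing the dat and rows objects from the script in this post; the grid values here are illustrative choices, not my competition settings.

# a sketch of a manual grid search over gbm hyperparameters; the grid
# values are illustrative only
grid = expand.grid(depth = c(3, 5, 7), shrink = c(0.1, 0.05, 0.01))
grid$cv.error = NA
for (g in 1:nrow(grid)) {
  mod = gbm(y ~ raw + thresholded + adaptive, data = dat[rows,],
            n.trees = 2000, interaction.depth = grid$depth[g],
            shrinkage = grid$shrink[g], cv.folds = 3)
  # record the best cross-validated error for this combination
  grid$cv.error[g] = min(mod$cv.error)
}
grid[order(grid$cv.error),]  # best combinations first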
xgboost comes highly recommended to me by Xavier Conort, one of the top Kaggle competitors, but I haven’t tried it yet. I wouldn’t be surprised if it outperforms gbm for this problem, and I’m planning to try it out soon. My competition submission doesn’t use the gbm library, because I started to run out of RAM on my more complex models; instead I’m using Revolution R’s boosted trees library, which scales up to big data much better. At present I’m testing out deep learning, which is showing some promise, although I haven’t yet had the time to do much with it.
Great advice, thank you!
How would you approach feature engineering with anonymous features/variables when not dealing with images? For our competition, it’s been easy to validate our pre-processing just by looking at a single image output to see whether the stains were cleaned up a bit more, but I’m guessing we wouldn’t have such luxuries in other contests.
Hi Bobby,
That’s why I chose this competition as my first – it’s a lot more intuitive to do the feedback loop!
The other day I attended a presentation by Xavier Conort, one of the top Kagglers, about how his team placed 3rd in the KDD Cup 2015. He doesn’t use as much of a feedback loop, but he does hypothesise new features and then tests how much they change the score on the model, and tests their variable importance. He also looks for poorly fitting training examples, and tries to find anything about those bad fits that could be used in his model.
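In code terms, that last idea can be as simple as ranking the training rows by absolute error; here is a sketch using the dat and yHat objects from the script in this post.

# a sketch of hunting for poorly fitting training pixels, assuming the
# dat and yHat objects from the main script in this post
err = abs(dat$y - yHat)
worst = order(err, decreasing = TRUE)[1:10000]
summary(dat[worst,])  # do the worst-fitting pixels share feature values?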
In my latest blog I show the variable importance of the new feature that has been added. I am working with a data example that was chosen just for its poor fit.
Colin