In my last blog, I discussed how to threshold an image, and we ended up doing quite a good job of cleaning up the image with the creased paper. But I finished that blog by showing that our algorithm had a problem: it didn’t cope well with coffee cup stains.
The problem is that coffee cup stains are dark, so our existing algorithm does not distinguish between dark writing and dark stains. We need to find some features that distinguish between dark stains and dark writing.
Looking at the image, we can see that even though the stains are dark, the writing is darker still. The stains are dark compared to the overall image, but lighter than the writing within them.
So we can hypothesise that we might be able to separate the writing from the stains by thresholding based on the pixel brightnesses over a localised window rather than across the whole image. That way we might separate the darker writing from the dark background stain.
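To illustrate the idea, here is a rough sketch of local thresholding, where each pixel is compared against the mean brightness of its own neighbourhood instead of a global cutoff. This is just a toy version of the concept; the window size and offset are made up for the example, and it is not the approach used later in this post.

```r
# a toy sketch of local thresholding: a pixel is classified as writing (0)
# only if it is clearly darker than the mean of its surrounding window
localThreshold = function(img, windowSize = 15, offset = 0.05) {
  pad = windowSize %/% 2
  out = matrix(1, nrow(img), ncol(img))
  for (i in 1:nrow(img)) {
    for (j in 1:ncol(img)) {
      rows = max(1, i - pad):min(nrow(img), i + pad)
      cols = max(1, j - pad):min(ncol(img), j + pad)
      # compare the pixel to its local neighbourhood, not the whole image
      if (img[i, j] < mean(img[rows, cols]) - offset) out[i, j] = 0
    }
  }
  return (out)
}
```

This brute-force double loop would be far too slow for full-sized images; dedicated image processing functions do the same thing much more efficiently.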
While I am an R programmer and a fan of R, I have to admit that R can be a frustrating platform for image processing. For example, I could not get the adimpro package to work at all, despite much experimentation, tinkering and googling for solutions. Then I discovered that the biOps package had been removed from CRAN, and the archived versions refuse to install on my PC. But eventually I found ways to get image processing done in R.
First, I tried the EbayesThresh package.
```r
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, EbayesThresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using EbayesThresh
et = ebayesthresh(img)
plot(raster(et))
range(et)
```
All of the values are zero! That isn’t very helpful.
Second, I tried the treethresh package.
```r
# experiment with treethresh
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using tree thresholding
tt = treethresh(img)
tt.pruned <- prune(tt)
tt.thresholded <- thresh(tt.pruned)
plot(raster(tt.thresholded))
range(tt.thresholded)
```
Once again I only got zeroes.
Finally, I tried wavelet thresholding, using the imwd, threshold and imwr functions from the wavethresh package (which is loaded as a dependency of treethresh).
```r
# experiment with wavelet thresholding
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using wavelet thresholding
# need a square image with a width that is a power of 2

# compute wavelet transform
wt.square <- imwd(img[1:256, 1:256])

# perform the thresholding
wt.thresholded = threshold(wt.square)

# compute inverse wavelet transform
wt.denoised <- imwr(wt.thresholded)
plot(raster(wt.denoised))
```
It wasn’t what I was looking for, but at least it’s not just a matrix of zeroes!
After these failures with CRAN packages, I decided that I needed to look beyond CRAN and try out the EBImage package on Bioconductor.
```r
# install EBImage from Bioconductor if it isn't already installed
if (!require("EBImage")) {
  source("http://bioconductor.org/biocLite.R")
  biocLite("EBImage")
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using adaptive thresholding with a range of window sizes
img.eb = readImage("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
img.thresholded.3 = thresh(img.eb, 3, 3)
display(img.thresholded.3)
img.thresholded.5 = thresh(img.eb, 5, 5)
display(img.thresholded.5)
img.thresholded.7 = thresh(img.eb, 7, 7)
display(img.thresholded.7)
img.thresholded.9 = thresh(img.eb, 9, 9)
display(img.thresholded.9)
```
These images show that we are on the right track. The adaptive thresholding is good at separating the letters from the background, but it also adds black to the background. While the adaptive thresholding may not be perfect, any feature extraction that adds to our knowledge will probably improve our ensemble model.
Before we add adaptive thresholding to our model, we will do a bit more pre-processing. Based on the observation that adaptive thresholding is keeping the letters but adding some unwanted black to the background, we can combine the five adaptive thresholding images with the k-means thresholded image, and use the maximum pixel brightness across the set.
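As a toy illustration of that combination rule (the 2×2 matrices here are made up for the example; the real code uses apply over the stacked image vectors):

```r
# per-pixel maximum across two thresholded images:
# a pixel stays black (0) only if every method classifies it as black,
# so spurious black pixels produced by a single method get washed out
a = matrix(c(0, 1, 0, 1), 2, 2)  # e.g. an adaptive thresholding output
b = matrix(c(0, 0, 1, 1), 2, 2)  # e.g. the k-means thresholding output
pmax(a, b)                       # only pixels that are 0 in both stay 0
```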
```r
# a function to turn a matrix image into a vector
img2vec = function(img) {
  return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to do k-means thresholding
kmeansThreshold = function(img) {
  # fit 3 clusters
  v = img2vec(img)
  km.mod = kmeans(v, 3)
  # allow for the random ordering of the clusters
  oc = order(km.mod$centers)
  # the higher threshold is the halfway point between the top of the middle cluster
  # and the bottom of the highest cluster
  hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

  # apply the upper threshold
  imgHi = v
  imgHi[imgHi <= hiThresh] = 0
  imgHi[imgHi > hiThresh] = 1

  return (imgHi)
}

# a function to convert an Image into a matrix
Image2Mat = function(Img) {
  m1 = t(matrix(Img, nrow(Img), ncol(Img)))
  return (m1)
}

img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# combine the adaptive thresholding images with the k-means thresholding,
# taking the maximum brightness at each pixel
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)),
              img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)),
              img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
plot(raster(ttt.3))
```
It’s starting to look much better. The coffee cup stain isn’t removed, but we are starting to successfully clean it up. This suggests that adaptive thresholding will be an important predictor in our ensemble.
Let’s bring it all together, and build a new ensemble model, extending the model from my last blog on this subject.
```r
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW)
if (!require("EBImage")) {
  source("http://bioconductor.org/biocLite.R")
  biocLite("EBImage")
}

# a function to turn a matrix image into a vector
img2vec = function(img) {
  return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img) {
  m1 = t(matrix(Img, nrow(Img), ncol(Img)))
  return (m1)
}

# a function to do k-means thresholding
kmeansThreshold = function(img) {
  # fit 3 clusters
  v = img2vec(img)
  km.mod = kmeans(v, 3)
  # allow for the random ordering of the clusters
  oc = order(km.mod$centers)
  # the higher threshold is the halfway point between the top of the middle cluster
  # and the bottom of the highest cluster
  hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

  # apply the upper threshold
  imgHi = v
  imgHi[imgHi <= hiThresh] = 0
  imgHi[imgHi > hiThresh] = 1

  return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img) {
  img.eb <- Image(t(img))
  img.thresholded.3 = thresh(img.eb, 3, 3)
  img.thresholded.5 = thresh(img.eb, 5, 5)
  img.thresholded.7 = thresh(img.eb, 7, 7)
  img.thresholded.9 = thresh(img.eb, 9, 9)
  img.thresholded.11 = thresh(img.eb, 11, 11)
  img.kmThresh = kmeansThreshold(img)

  # combine the adaptive thresholding images with the k-means thresholding,
  # taking the maximum brightness at each pixel
  ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)),
                img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)),
                img2vec(Image2Mat(img.thresholded.11)), img2vec(img.kmThresh))
  ttt.2 = apply(ttt.1, 1, max)
  ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
  return (ttt.3)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames) {
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  # turn the images into vectors
  x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
  y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

  # threshold the image
  x2 = kmeansThreshold(imgX)

  # adaptive thresholding
  x3 = img2vec(adaptiveThresholding(imgX))

  dat = data.table(cbind(y, x, x2, x3))
  setnames(dat, c("y", "raw", "thresholded", "adaptive"))
  write.table(dat, file = outPath, append = (f != filenames[1]), sep = ",",
              row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive, data = dat[rows,], n.trees = 7500,
              cv.folds = 3, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod, method = "cv", oobag.curve = FALSE)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 # change this to your number of CPU cores, or maybe fewer due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat)) %% numCores)
yHat = foreach(i = unique(split_testing), .combine = c, .packages = c("gbm")) %dopar% {
  as.numeric(predict(gbm.mod, newdata = dat[split_testing == i,], n.trees = best.iter))
}
stopCluster(cl)

# what score do we get on the training data?
rmse = sqrt(mean((yHat - dat$y) ^ 2))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img),
               img2vec(adaptiveThresholding(img)))
setnames(x, c("raw", "thresholded", "adaptive"))
yHat = predict(gbm.mod, newdata = x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
```
The adaptive thresholding gets a relative importance score of 2.4%, but full image thresholding loses all of its importance in the model.
The result may not be perfect, but you can see how the coffee cup stain is starting to be erased. We will need to find some more features to finish the clean-up, but that will be left for future blogs.
This blog’s model has improved the RMSE on the training data from the 6.5% reported in my last blog to 5.4% in this latest version. There’s a long way to go before we have a top 10 performing model, but this process is steadily moving us in the right direction.