In the first blog in this series, I explained the nature of the problem to be solved, and showed how to create a simple linear model that gives an RMSE score on the training data of 7.8%. In this blog I introduce the feedback loop for creating features, and use it to extend and improve the existing competition model with a couple of new features.
I have been building machine learning models for almost 20 years now, and never once has my first model been satisfactory, nor has that model been the model that I finally use. The Denoising Dirty Documents competition is no different. In the plots above, you can see my progress through the public leaderboard, as I iteratively improved my model. The same has occurred more broadly in the competition overall, as the best score has iteratively improved since the competition began.
Coming from an actuarial / statistical background, I tend to think in terms of a “control cycle”. For actuaries, the control cycle looks something like this:
We can categorise the content of my last blog as:
1) Define the Problem
- remove noise from images
- images are three-dimensional surfaces
2) Hypothesise Solutions
- we hypothesised that the pixel brightnesses needed to be rescaled
3) Implement
- we fitted a least squares linear model to rescale the pixel brightnesses, and we capped and floored the values to keep them within the [0, 1] range
4) Monitor Results
- we calculated the RMSE on the training data before and after the implementation
- we looked at one of the output images
So we have completed one cycle. Now we need to do another cycle.
What I like about the Denoising Dirty Documents competition is that the data visualisation and model visualisation are obvious. We don’t need to develop clever visualisation tools to understand the problem or the model output. We can just look at the dirty and predicted images. Let’s begin by reviewing the example input and output images from the model used in the last blog, i.e. by recalling the Monitor Results step from the previous cycle:
Now we can progress to the beginning of the next cycle, the Define the Problem stage. This time the problem to be solved is to remove one or more of the errors from the output of the existing model. Once we have a model, we can define the problem by asking ourselves the following question.
Question: In the predicted image, where have prediction errors been made?
Answer: The predicted image contains some of the crease lines.
So in this cycle, the problem is to remove the remaining crease lines.
It’s time to begin the Hypothesise Solutions stage of the cycle by asking ourselves the following questions:
- In the dirty image, what is a single characteristic of what we want to keep or remove that we haven’t captured in our existing model?
- What characteristics do the errors in the predicted image have?
- What characteristics does the dirty image have at the locations where these errors occur?
Here are a few observations that I made in answer to these questions:
- In the dirty image, the writing is darker than the background around it. We haven’t captured this locality information.
- The predicted image retains some of the crease lines. Those remaining crease lines are narrower and not as dark as the writing.
- In the dirty image, the locations where these errors occur are the shadows next to the white centres of the creases.
This led me to the hypothesis that the model for a single pixel needs to include information about the brightness of other pixels in the image. There are multiple ways that we could consider the brightnesses of other pixels, but in today’s blog I will limit myself to considering the brightness of the pixel under consideration versus the range of pixel brightnesses across the entire image.
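To make that idea concrete, here is a hypothetical sketch (the function name rangeScaled is mine, and this is not the feature we end up building) that describes each pixel’s brightness relative to the full range of brightnesses in its own image:

# a hypothetical illustration of the idea (not the feature we build below):
# place each pixel's brightness within the range spanned by its own image
rangeScaled = function(img)
{
  v = as.vector(img)
  return (matrix((v - min(v)) / (max(v) - min(v)), nrow(img), ncol(img)))
}

Thresholding, which we turn to next, is a more aggressive way of using the same image-wide information.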
To move to the Implement stage and test this hypothesis, we need to know some theory about image processing. There are some good textbooks that cover this material. The image processing / machine vision textbook that I have been using is:
Computer and Machine Vision, Fourth Edition: Theory, Algorithms, Practicalities
The machine vision technique that I will apply today is thresholding, which is the process of turning an image into pixels that can only be black or white, with no grey shades or colours. Writing code to do thresholding is the easy part. The trickier part is deciding the threshold value at which pixels are split into black or white.
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")

# turn the image into a vector
img2vec = function(img)
{
  return (matrix(img, nrow(img) * ncol(img), 1))
}

# show a histogram
hist(img2vec(img))
One way of finding a threshold value is to look at a histogram of the pixel brightnesses and look for a natural break between local maxima in the histogram. Since the histogram above is tri-modal, this would leave us with two choices of thresholds, one around 0.3 and one around 0.65.
# threshold at 0.3
img0.3 = img
img0.3[img0.3 <= 0.3] = 0
img0.3[img0.3 > 0.3] = 1
plot(raster(img0.3))
Now we can Monitor Results. Thresholding at 0.3 gives us no false positives for this image, but leaves out a lot of the writing.
# threshold at 0.65
img0.65 = img
img0.65[img0.65 <= 0.65] = 0
img0.65[img0.65 > 0.65] = 1
plot(raster(img0.65))
Thresholding at 0.65 gives us no false positives for this image, and correctly flags the writing without flagging the creases. That makes it quite a good feature for use on this image, as it will help us to remove the residual crease mark that we found when reviewing the output of the last blog’s model. We should definitely include a thresholding feature in our model!
But it feels very clumsy to plot a histogram and manually select the threshold, and such a feature definitely isn’t “machine learning”. Can we automate this?
Otsu’s Method is a well-known technique in image processing that can automatically suggest a threshold. But it assumes a bi-modal histogram, which we have seen is not the case here.
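For readers who haven’t met it before, here is a minimal sketch of the idea behind Otsu’s Method (the function name otsuThreshold is mine, and this is the textbook algorithm rather than a tuned implementation): try every candidate threshold and keep the one that maximises the between-class variance of the two resulting groups of pixels.

# a minimal sketch of Otsu's Method: pick the threshold that maximises
# the between-class variance of the two resulting pixel groups
otsuThreshold = function(v, nBins = 256)
{
  h = hist(v, breaks = seq(0, 1, length.out = nBins + 1), plot = FALSE)
  counts = h$counts
  mids = h$mids
  total = sum(counts)
  bestVar = 0
  bestThresh = 0
  for (i in 1:(nBins - 1))
  {
    w0 = sum(counts[1:i]) / total             # weight of the darker class
    w1 = 1 - w0                               # weight of the lighter class
    if (w0 == 0 || w1 == 0) next
    mu0 = sum(counts[1:i] * mids[1:i]) / sum(counts[1:i])
    mu1 = sum(counts[(i+1):nBins] * mids[(i+1):nBins]) / sum(counts[(i+1):nBins])
    betweenVar = w0 * w1 * (mu0 - mu1) ^ 2    # between-class variance
    if (betweenVar > bestVar)
    {
      bestVar = betweenVar
      bestThresh = mids[i]
    }
  }
  return (bestThresh)
}

# for example, applied to the image loaded earlier:
# otsuThreshold(img2vec(img))

On a tri-modal histogram like ours, this single split will land somewhere between two of the three modes, which is exactly why we need something more flexible.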
Inspired by Otsu’s method, we could use cluster analysis to generate three clusters of pixel brightnesses, and use the splits between those clusters to threshold the image.
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the lower threshold is the halfway point between the top of the lowest
# cluster and the bottom of the middle cluster
loThresh = 0.5 * (max(v[km.mod$cluster == oc[1]]) + min(v[km.mod$cluster == oc[2]]))
# the higher threshold is the halfway point between the top of the middle
# cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using lower threshold
imgLo = img
imgLo[imgLo <= loThresh] = 0
imgLo[imgLo > loThresh] = 1
plot(raster(imgLo))

# using upper threshold
imgHi = img
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1
plot(raster(imgHi))
Now we can Monitor Results. Once again, the lower threshold choice doesn’t capture enough of the writing, while the upper threshold choice works well for this image. And this time there wasn’t any manual work required!
Let’s put it all together and combine last blog’s feature (a linear transformation of the raw pixel brightnesses) with this blog’s feature (thresholding). This time we will use something more sophisticated than linear regression: our model will be based upon R’s gbm package, fitting a gradient boosted machine (GBM) to combine the raw pixel brightnesses with the thresholded pixels. GBMs are good all-purpose models: they are simple to set up, perform well, and are almost always my first choice when creating machine learning models.
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm)

# turn an image into a vector of pixel brightnesses
img2vec = function(img)
{
  return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
  # fit 3 clusters
  v = img2vec(img)
  km.mod = kmeans(v, 3)
  # allow for the random ordering of the clusters
  oc = order(km.mod$centers)
  # the higher threshold is the halfway point between the top of the middle
  # cluster and the bottom of the highest cluster
  hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

  # using upper threshold
  imgHi = v
  imgHi[imgHi <= hiThresh] = 0
  imgHi[imgHi > hiThresh] = 1

  return (imgHi)
}

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  # turn the images into vectors
  x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
  y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

  # threshold the image
  x2 = kmeansThreshold(imgX)

  dat = data.table(cbind(y, x, x2))
  setnames(dat, c("y", "raw", "thresholded"))
  write.table(dat, file = outPath, append = (f != filenames[1]), sep = ",",
              row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)
}

# view the data
dat = read.csv(outPath)
rows = sample(nrow(dat), 10000)
d1 = dat[rows,]
plot(d1$raw[d1$thresholded == 0], d1$y[d1$thresholded == 0], col = "blue")
lines(d1$raw[d1$thresholded == 1], d1$y[d1$thresholded == 1], col = "red", type = "p")

# fit a model to a subset of the data
rows = sample(nrow(dat), 100000)
gbm.mod = gbm(y ~ raw + thresholded, data = dat[rows,], n.trees = 5000,
              cv.folds = 10, train.fraction = 0.5)
best.iter = gbm.perf(gbm.mod, method = "cv")

# what score do we get on the training data?
yHat = predict(gbm.mod, newdata = dat, n.trees = best.iter)
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata = x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
# cap and floor the predictions to keep them within the [0, 1] range
imgOut = pmin(pmax(imgOut, 0), 1)
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
Notice how the sample output image is different from the thresholded image? The edges of the writing are lighter than the centre of the letters. That’s good, because that’s one of the characteristics of the clean image.
The sample image that we created from the predicted values is looking really good. The residual crease marks have disappeared, yet we have kept the writing. And the RMSE score on the training data has improved to 6.5%.
But before this blog ends, let’s consider an image that wasn’t cleaned so well:
# here's a sample image that doesn't perform as well
img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata = x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
# cap and floor the predictions to keep them within the [0, 1] range
imgOut = pmin(pmax(imgOut, 0), 1)
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
Damn! I feel like cursing the jerk who put their coffee cup on top of the important document and left a stain!
In my next blog, I will discuss the first steps towards removing that coffee cup stain.
Another great post! Thanks again, Colin.
So did you use a variation of Otsu’s Method in the code since the histogram was tri-modal and not bi-modal as Otsu expects? That was a good idea using an optimal threshold for each image and using that result as an extra feature. I had begun to play with some thresholding but didn’t think to use it as another feature! So I was trying to find a relationship between the grayscale yHat and the binary values and it was a bit confusing haha. What you did makes a lot more sense.
I’ll have to do more research about Gradient Boosted Machines as I’m not familiar with them. Are these similar to regression trees? I might be quite off the mark here, apologies.
The updated algorithm scored an RMSE of 0.089 which was quite a jump! I look forward to seeing how you’re going to approach getting rid of those pesky coffee stains! Might some morphological processing be involved?
Cheers.
Hey Bobby,
If you haven’t already, could you go on Kaggle and vote for my scripts please?
Colin
Hi Bobby,
Yes the choice of clustering was due to the tri-modal nature of the histogram, and confirmed by the failure of standard two-level thresholding techniques, as tested by looking at the thresholded images.
Gradient boosted machines are forests of trees, a bit like random forests, except that each new tree is fitted to the residuals of the existing model, whereas in random forests each tree is fitted independently of the others.
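To illustrate the residual-fitting idea, here is a toy sketch (illustrative only; this is not how the gbm package is implemented internally):

# a toy sketch of boosting: each new tree is fitted to the residuals
# left behind by the trees that came before it
library(rpart)
boostSketch = function(dat, nTrees = 50, shrinkage = 0.1)
{
  pred = rep(mean(dat$y), nrow(dat))
  for (i in 1:nTrees)
  {
    dat$resid = dat$y - pred               # what the model still gets wrong
    tree = rpart(resid ~ x, data = dat)    # fit a small tree to the residuals
    pred = pred + shrinkage * predict(tree, newdata = dat)
  }
  return (pred)
}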
In my competition model, I am using morphology to find the stains. But there are multiple ways to do it. As I write my blogs I think of different ways to solve the problems, so the blogs don’t exactly match my competition submission.
Colin
Putting an ad link to amazon without mentioning the title or the author of the book is sneaky. All the best on the competition and thanks for sharing the insights.
Hi,
Sorry the link didn’t work correctly. It is supposed to show the title and book cover. I used the html that Amazon gave me, but it isn’t showing correctly.
Colin
OK, I think that the link is fixed now. Amazon links don’t seem to work out-of-the-box in a WordPress blog.