I recently blogged about my learning curve in my first Kaggle competition. This has become my most popular blog to date, and some readers have asked for more. So this blog is the first in a series of blogs about how to put together a reasonable solution to Kaggle’s Denoising Dirty Documents competition.
Some other competitors have been posting scripts, but those scripts are usually written in Python, whereas my background is in R. So I will be writing scripts in R.
The Structure of the Problem
We have been given a series of training images, both dirty (with stains and creased paper) and clean (with a white background and black letters). We are asked to develop an algorithm that converts, as close as possible, the dirty images into clean images.
A greyscale image (such as the one shown above) can be thought of as a three-dimensional surface. The x and y axes are the location within the image, and the z axis is the brightness of the image at that location. The greater the brightness, the whiter the image at that location.
So from a mathematical perspective, we are being asked to transform one three-dimensional surface into another three-dimensional surface.
Our task is to clean the images: remove the stains, remove the paper creases, improve the contrast, and leave just the writing.
Loading the Image Data
In R, images are stored as matrices, with the row being the y-axis, the column being the x-axis, and the numerical value being the brightness of the pixel. Since Kaggle has stored the images in png format, we can use the png package to load the images.
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
head(img)
plot(raster(img))
You can see that the brightness values lie within the [0, 1] range, with 0 being black and 1 being white.
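As a quick check, and to see the "three-dimensional surface" view described earlier, a minimal sketch along these lines can be run once img has been loaded (the downsampling factor of 4 is just an assumption to keep persp() responsive):

# confirm the value range and view the image as a surface (assumes img was loaded with readPNG as above)
dim(img)    # rows (the y-axis) by columns (the x-axis)
range(img)  # should lie within [0, 1]

# downsample before plotting, otherwise persp() is slow on a full-size image
small = img[seq(1, nrow(img), by = 4), seq(1, ncol(img), by = 4)]
persp(z = small, theta = 30, phi = 30, xlab = "row", ylab = "column", zlab = "brightness")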
Restructuring the Data for Machine Learning
Instead of modelling the entire image at once, we should predict the cleaned-up brightness for each pixel within the image, and then construct a cleaned image by combining the set of predicted pixel brightnesses. We want a vector of y values, and a matrix of x values. The simplest data set is where the x values are just the pixel brightnesses of the dirty images.
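To make the reshaping concrete, here is a toy sketch (not part of the competition script) showing that R flattens a matrix column by column; the script below relies on the dirty and clean images being flattened in the same order, so that each row of the training table pairs up the same pixel in both images:

# toy example: R stores and flattens matrices column by column
m = matrix(1:6, nrow = 2, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
matrix(m, nrow(m) * ncol(m), 1)
# a single column containing 1 2 3 4 5 6, i.e. the columns stacked on top of each other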
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table)

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
outPath = file.path(outFolder, "trainingdata.csv")

filenames = list.files(dirtyFolder)
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  # turn the images into vectors
  x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
  y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

  dat = data.table(cbind(y, x))
  setnames(dat, c("y", "x"))
  write.table(dat, file=outPath, append=(f != filenames[1]), sep=",",
              row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# view the data
dat = read.csv(outPath)
head(dat)
rows = sample(nrow(dat), 10000)
plot(dat$x[rows], dat$y[rows])
The data is now in a familiar format, with each row representing a data point, the first column containing the target value, and the remaining column containing the predictor.
Our First Predictive Model
Look at the relationship between x and y.
Except at the extremes, there is a linear relationship between the brightness of the dirty images and the cleaned images. There is some noise around this linear relationship, and a clump of pixels that are halfway between white and black. There is a broad spread of x values as y approaches 1, and these pixels probably represent stains that need to be removed.
So the obvious first model would be a linear transformation, with truncation to ensure that the predicted brightnesses remain within the [0, 1] range.
# fit a linear model, ignoring the data points at the extremes
lm.mod.1 = lm(y ~ x, data=dat[dat$y > 0.05 & dat$y < 0.95,])
summary(lm.mod.1)

# predict each pixel's cleaned brightness, truncated to the [0, 1] range
dat$predicted = sapply(predict(lm.mod.1, newdata=dat), function(x) max(min(x, 1), 0))
plot(dat$predicted[rows], dat$y[rows])

# RMSE of the raw dirty pixels versus RMSE of the model's predictions
rmse1 = sqrt(mean( (dat$y - dat$x) ^ 2))
rmse2 = sqrt(mean( (dat$predicted - dat$y) ^ 2))
c(rmse1, rmse2)
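As a side note, and not part of the original script, the same truncation can also be written in vectorised form with pmin() and pmax(), which avoids the sapply() loop over millions of pixels:

# equivalent vectorised truncation to the [0, 1] range
dat$predicted = pmin(pmax(predict(lm.mod.1, newdata=dat), 0), 1)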
The linear model has done a brightness and contrast correction. This reduces the RMSE score from 0.157 to 0.078. Let’s see an output image:
# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1))
setnames(x, "x")

# predict the cleaned brightness of each pixel, truncated to the [0, 1] range
yHat = sapply(predict(lm.mod.1, newdata=x), function(x) max(min(x, 1), 0))

# reshape the predictions back into an image, save it, and display it
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
Although we have used a very simple model, we have been able to clean up this image:
Our predicted image is:
That’s quite good performance for a simple least squares linear regression!
To be fair though, I deliberately chose an example image that performs well. In my next blog in this series, I will discuss the use of a feedback loop in model design, and how to design new features to use as predictors.
Wow, great post! So many things clicked while going through it and your explanations were very intuitive. I did have a question regarding the RMSE calculations (and please pardon my ignorance here). We calculated RMSE with regards to the cleaned images and the dirty images (rmse1). Then we calculated RMSE with regards to our predicted images and the dirty images (rmse2), which turned out to be lower.
I know rmse2 is supposed to be lower than rmse1 due to the fact that our predicted values are between the cleaned image values and the dirty image values. My question is why we’re using this decrease in RMSE value as confirmation that our linear regression model performed well? Shouldn’t we instead compute RMSE with regards to our predicted values and the target values (i.e. the cleaned image values) to see how well the model performed? When I did this, I got an RMSE score of 0.07786137. I assume that means our overall prediction is 93% accurate? Like I said, if my ignorance is showing, my apologies haha.
Once again, excellent post. I’m super eager to read your next so keep them coming, this stuff is gold :).
Hi Bobby,
Thanks for the positive feedback! 🙂
You found a typo in my script, and your suggestion is correct. Well done. I have now corrected the blog to reflect this.
Are you competing? Have you tried running this script against the test data and making a submission? I’d be interested to know what score it gets in the competition.
Colin
Ok, cool, I wasn’t exactly sure if I was on the right track or not.. but it helps to confirm that I’m at least following along I think :).
So I made my first submission using what you’ve taught in this post to get a score of 0.12449, woo! This all is quite fun :D. Looking forward to your post about feature engineering and this feedback loop in the design of the model.
Thanks again, Colin!
This competition challenges you to give these documents a machine learning makeover. Given a dataset of images of scanned text that has seen better days, you’re challenged to remove the noise.
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
I am a bit new to R. I am unable to understand why you used
append=(f != filenames[1]),
The output is different if I write append=FALSE.
Can you please explain the difference?
Thanks in advance.
We want the text file to contain the values from all of the images, but we are only writing the values from one image at a time. The append parameter tells R whether to start a new text file, or add this data to the end of the existing data in the file. We want to start a new text file when saving the values from the first image. We want to append to the existing text file when saving the values from the subsequent images. So we want a value of FALSE when the filename matches the filename of the first image, and TRUE for every other filename.
"!=" means "not equal to"
"f != filenames[1]" means TRUE whenever the filename is not the same as the first in the list, and FALSE when it matches.
The syntax for this is very similar to C++.
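If it helps, here is a tiny standalone sketch (using a temporary file, purely for illustration) of what append does:

# toy illustration of the append parameter in write.table
tmp = tempfile(fileext = ".csv")
write.table(data.frame(y = 1, x = 2), file = tmp, sep = ",",
            row.names = FALSE, col.names = TRUE, quote = FALSE)   # first image: start a new file, with a header row
write.table(data.frame(y = 3, x = 4), file = tmp, sep = ",", append = TRUE,
            row.names = FALSE, col.names = FALSE, quote = FALSE)  # later images: add rows to the end of the file
readLines(tmp)
# "y,x" "1,2" "3,4" -- with append=FALSE the second call would have overwritten the first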
Thank you so much. I have one more problem for you:
> lm.mod.1 = lm(y ~ x, data=dat[dat$y > 0.05 & dat$y < 0.95,])
Error: unexpected ';' in "lm.mod.1 = lm(y ~ x, data=dat[dat$y >"
Why is this showing an error?
That error suggests that you have a typo in your code.
achinsngh, with regard to your question:
> lm.mod.1 = lm(y ~ x, data=dat[dat$y > 0.05 & dat$y < 0.95,])
Error: unexpected ';' in "lm.mod.1 = lm(y ~ x, data=dat[dat$y >"
I had the same error when pasting the code directly from the box on this page:
lm(y ~ x, data=dat[dat$y > 0.05 & dat$y < 0.95,])
It can be fixed by replacing the "&gt;" in the pasted code with ">" and the "&lt;" with "<".
Hi Ryan. Sometimes WordPress doesn't use the correct character for the greater than sign or the less than sign. If you get a problem, just manually edit the line of code to use the correct character.