
I recently blogged about my learning curve in my first Kaggle competition. This has become my most popular blog to date, and some readers have asked for more. So this blog is the first in a series of blogs about how to put together a reasonable solution to Kaggle’s Denoising Dirty Documents competition.

Some other competitors have been posting scripts, but those scripts are usually written in Python, whereas my background makes me an R programmer. So I will be writing scripts that make use of R.

The Structure of the Problem

We have been given a series of training images, both dirty (with stains and creased paper) and clean (with a white background and black letters). We are asked to develop an algorithm that converts the dirty images into clean images as closely as possible.

[Image: the problem to be solved]

A greyscale image (such as shown above) can be thought of as a three-dimensional surface. The x and y axes are the location within the image, and the z axis is the brightness of the image at that location. The greater the brightness, the whiter the image at that location.

So from a mathematical perspective, we are being asked to transform one three-dimensional surface into another three-dimensional surface.


Our task is to clean the images, to remove the stains, remove the paper creases, improve the contrast, and just leave the writing.

Loading the Image Data

In R, images are stored as matrices, with the row being the y-axis, the column being the x-axis, and the numerical value being the brightness of the pixel. Since Kaggle has stored the images in png format, we can use the png package to load the images.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")

[Image: 20150801 output 1] [Image: 20150801 output 2]

You can see that the brightness values lie within the [0, 1] range, with 0 being black and 1 being white.
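The matrix layout described above can be illustrated without any image file. This is only a minimal sketch using a hand-built matrix; the values are invented for illustration and do not come from the competition data.

```r
# Hypothetical 2 x 3 greyscale "image": 2 rows (height) by 3 columns (width).
# Values are made up; 0 is black and 1 is white.
img <- matrix(c(0.0, 0.5, 1.0,
                1.0, 1.0, 0.2),
              nrow = 2, byrow = TRUE)

dim(img)     # number of rows (y-axis) first, then number of columns (x-axis)
img[2, 3]    # brightness of the pixel in row 2 (y), column 3 (x)
range(img)   # smallest and largest brightness in the image
```

A matrix returned by readPNG for a greyscale image can be inspected in the same way.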

Restructuring the Data for Machine Learning

Instead of modelling the entire image at once, we should predict the cleaned-up brightness for each pixel within the image, and construct a cleaned image by combining the predicted pixel brightnesses. We want a vector of y values, and a matrix of x values. The simplest data set is where the x values are just the pixel brightnesses of the dirty images.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table)

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
    imgX = readPNG(file.path(dirtyFolder, f))
    imgY = readPNG(file.path(cleanFolder, f))

    # turn the images into vectors
    x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
    y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

    # append this image's pixels to the csv file, writing the header only once
    dat = data.table(cbind(y, x))
    setnames(dat,c("y", "x"))
    write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# view the data
dat = read.csv(outPath)
rows = sample(nrow(dat), 10000)
plot(dat$x[rows], dat$y[rows])

[Image: 20150801 output 4]

The data is now in a familiar format, with each row representing a data point, the first column being the target value, and the remaining column being the predictor.
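To make the restructuring concrete, here is a small sketch on two invented 2 x 2 images. It uses base R only (a plain data.frame rather than data.table), and the names imgBack and the pixel values are assumptions made for illustration.

```r
# Hypothetical 2 x 2 dirty and clean images (values invented for illustration).
imgX <- matrix(c(0.8, 0.3, 0.9, 0.7), nrow = 2)  # dirty
imgY <- matrix(c(1.0, 0.0, 1.0, 1.0), nrow = 2)  # clean

# as.vector() unrolls a matrix column by column, so both images are
# flattened in the same pixel order and each row pairs matching pixels.
dat <- data.frame(y = as.vector(imgY), x = as.vector(imgX))
print(dat)

# Reshaping with the original dimensions restores the image layout.
imgBack <- matrix(dat$x, nrow(imgX), ncol(imgX))
```

Because the flattening order is fixed, per-pixel predictions can later be reshaped back into an image with the same matrix() call.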

Our First Predictive Model

Look at the relationship between x and y.

[Image: 20150801 output 3]

Except at the extremes, there is a linear relationship between the brightness of the dirty images and the cleaned images. There is some noise around this linear relationship, and a clump of pixels that are halfway between white and black. There is a broad spread of x values as y approaches 1, and these pixels probably represent stains that need to be removed.

So the obvious first model would be a linear transformation, with truncation to ensure that the predicted brightnesses remain within the [0, 1] range.

# fit a linear model, ignoring the data points at the extremes
lm.mod.1 = lm(y ~ x, data=dat[dat$y > 0.05 & dat$y < 0.95,])
dat$predicted = sapply(predict(lm.mod.1, newdata=dat), function(x) max(min(x, 1),0))
plot(dat$predicted[rows], dat$y[rows])
rmse1 = sqrt(mean( (dat$y - dat$x) ^ 2))
rmse2 = sqrt(mean( (dat$predicted - dat$y) ^ 2))
c(rmse1, rmse2)

[Image: 20150801 output 5] [Image: 20150801 output 6]

The linear model has done a brightness and contrast correction. This reduces the RMSE score from 0.157 to 0.078. Let’s see an output image:
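The truncation and scoring steps can also be written with vectorised helpers. The sketch below runs on synthetic data rather than the competition images, and the helper names clamp01 and rmse are hypothetical, not part of the scripts above. One way to read the fitted line is that the intercept acts as a brightness shift and the slope as a contrast adjustment.

```r
# Hypothetical helpers, introduced only for this illustration.
clamp01 <- function(v) pmax(pmin(v, 1), 0)   # truncate values to the [0, 1] range
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Synthetic "dirty" (x) and "clean" (y) brightnesses; the slope 1.5 and
# intercept -0.3 are arbitrary choices, not estimates from the real data.
set.seed(1)
x <- runif(1000)
y <- clamp01(1.5 * x - 0.3 + rnorm(1000, sd = 0.05))

# Fit a straight line, then truncate the predictions.
mod <- lm(y ~ x)
pred <- clamp01(predict(mod, newdata = data.frame(x = x)))

# Compare the error of the untouched input with the error of the fitted model.
c(raw = rmse(y, x), fitted = rmse(y, pred))
```

pmax and pmin work on whole vectors, so they give the same truncation as the sapply call above without looping over individual pixels.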

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1))
setnames(x, "x")
yHat = sapply(predict(lm.mod.1, newdata=x), function(x) max(min(x, 1),0))
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

[Image: 20150801 output 7]

Although we have used a very simple model, we have been able to clean up this image:

[Image: 20150801 - before]

Our predicted image is:

[Image: 20150801 - after]

That’s quite good performance for a simple least squares linear regression!

To be fair though, I deliberately chose an example image that performs well. In my next blog in this series, I will discuss the use of a feedback loop in model design, and how to design new features to use as predictors.