
Now that Kaggle’s Denoising Dirty Documents Competition has closed, it’s time to start posting the secrets to getting a very good score in this competition. In this blog, I describe how to take advantage of the first of two information leakages that I used.


Information leakage occurs in predictive modelling when the training and test data include values that would not be known at the time a prediction was being made. For example, I once worked on a direct marketing sales propensity modelling project where our predictive model fitted the training data far too well, making us suspicious. We eventually tracked it down to an incorrectly designed data extract that used status values as at the data extraction date, instead of as at the date at which a prediction would have been run. The sale of the product to the customer changed the value of that data field, so the extract needed to contain the value of the field before the sale occurred. In real life projects, information leakage is a bad thing because it overstates the model accuracy. Therefore you need to ensure that it does not occur; otherwise your predictive model will not perform as well in production as it did in testing. In data science competitions, information leakage is something to be taken advantage of. It enables you to obtain a higher score.

The first information leakage in this competition comes from how the training data was created. There are only 8 different page backgrounds. There are 2 coffee cup stains, 2 folded pages, 2 watermarks and 2 crumpled pages.

20151015 output 1

This gives us a huge advantage in removing the background and the stains. Instead of using an uncertain estimate based upon only a single image, we can group together the images so that each group has the same background. Then, for each pixel location, calculate the brightest pixel across all of the images in that group, and use the brightest value as the pixel brightness in the background. You can find a couple of scripts on Kaggle that do this.
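The per-pixel maximum can be illustrated with a toy example. This is only a sketch with made-up 2x3 "pages" (the names `page1` to `page3` are hypothetical), assuming that the text never covers the same pixel on every page in the group:

```r
# Toy 2x3 "pages" that share one background; values are brightness in [0, 1].
background = matrix(c(0.9, 0.8, 0.7,
                      0.9, 0.6, 0.9), nrow = 2, byrow = TRUE)

# Each page darkens a different pixel, as if text were printed there.
page1 = background; page1[1, 1] = 0.1
page2 = background; page2[2, 2] = 0.0
page3 = background; page3[1, 3] = 0.2

# Element-wise maximum across the group. Every pixel is free of text on at
# least one page, so the brightest value at each location is the background.
estimate = pmax(page1, page2, page3)
```

If a pixel were covered by text on every page in the group, the maximum would still be too dark at that location, so the estimate improves as the group gets larger.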

The problem is that you can’t do the same for the test images because they have different backgrounds, and there are only four of them. You couldn’t have known this without manually looking at the test images, and that breaks the rules prohibiting manual processing of the test images.

20151015 output 2

You can also run into trouble if your model just assumes that the test images are arranged so that there are 8 different background images that repeat every 8 images.

A better solution is to let your model decide which images can be grouped together, and let your model decide which background belongs with each image. The way to do this is to apply a median filter to each image, to obtain an estimate of the background of each image, and then group the median images together according to similarity.


library(png)

# a function to apply a square median filter to a matrix image
median_Filter = function(img, filterWidth)
{
    pad = floor(filterWidth / 2)
    # pad the edges with NA so that the filter can run right up to the border
    padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
    padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

    # each column of tab holds the image shifted to one position in the filter window
    tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
    k = 1
    for (i in seq_len(filterWidth))
    {
        for (j in seq_len(filterWidth))
        {
            tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
            k = k + 1
        }
    }

    # median of each window, ignoring the NA padding
    filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
    return (matrix(filtered, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
    return (matrix(img, nrow(img) * ncol(img), 1))
}

# training data
dirtyFolder = "C:/Users/Colin/dropbox/Kaggle/Denoising Dirty Documents/data/train"
outPath = "D:/CNN with background removal/train median25"
filenames = list.files(dirtyFolder)
# use a 25x25 median filter to get the background
for (f in filenames)
{
    imgX = readPNG(file.path(dirtyFolder, f))

    median25 = median_Filter(imgX, 25)

    outFile = file.path(outPath, f)
    writePNG(median25, outFile)
}

The script above calculates the median filter for each image and stores it in a folder. I have used a filter size of 25 pixels because I want broad patterns of the background, and I want the text removed.
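To see why a wide median filter removes the text, here is a one-dimensional sketch using `runmed()` from R's stats package. The pixel values are made up for illustration, and the 2-D filter in this post applies the same idea across both rows and columns:

```r
# One row of pixels: a bright page (0.9) with two one-pixel-wide dark strokes.
pixels = c(0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9)

# Running median with a window of 5 pixels. Dark pixels are always a minority
# inside the window, so the median takes the page brightness instead.
smoothed = as.numeric(runmed(pixels, k = 5))
```

A stroke only survives the filter if it fills more than half of the window, which is why the window needs to be much wider than the pen strokes but narrower than the stains.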

Now that I have the median images, I can iterate through each training image and link it to other images that have similar median images. I have used root mean squared error (RMSE) as a measure of similarity. The cutoff RMSE of 2.5% (0.025 on the 0 to 1 brightness scale returned by readPNG) was determined by looking at the range of RMSE values and looking for a natural break.
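I picked the cutoff by eye, but the search for a natural break could also be automated. The following is only a sketch of one possible approach, using made-up RMSE scores rather than values from the competition data: sort the scores and place the cutoff in the middle of the widest gap.

```r
# Hypothetical RMSE scores between one median image and eight others
# (illustrative numbers only, not taken from the competition data).
scores = c(0.004, 0.009, 0.012, 0.015, 0.081, 0.094, 0.120, 0.150)

sorted = sort(scores)
gaps = diff(sorted)                  # distance between neighbouring sorted scores
i = which.max(gaps)                  # position of the widest gap
cutoff = mean(sorted[c(i, i + 1)])   # midpoint of the widest gap

# images on the low side of the gap are treated as having the same background
sameBackground = scores <= cutoff
```

This assumes that there is one clear gap between matching and non-matching backgrounds, so it is worth plotting the sorted scores to check before trusting the result.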

20151015 output 3

# find the images with matching background
outPath2 = "D:/CNN with background removal/train background"
outPath3 = "D:/CNN with background removal/train foreground"
for (i in seq_along(filenames))
{
    f = filenames[i]

    imgX = readPNG(file.path(outPath, f))

    # RMSE between this median image and every other median image
    scores = matrix(1, length(filenames))
    for (j in seq_along(filenames))
    {
        imgY = readPNG(file.path(outPath, filenames[j]))
        rmse = 1
        # only compare images that are at least as large as the current image
        if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX)) rmse = sqrt(mean((imgX - imgY[1:nrow(imgX), 1:ncol(imgX)]) ^ 2))
        scores[j] = rmse
    }

    sameStains = filenames[scores <= 0.025]
    nImages = length(sameStains)

    # one column of raw pixel values per image with a matching background
    rawData = matrix(0, ncol(imgX) * nrow(imgX), nImages)
    for (j in 1:nImages)
    {
        imgY = readPNG(file.path(dirtyFolder, sameStains[j]))
        rawData[,j] = img2vec(imgY[1:nrow(imgX), 1:ncol(imgX)])
    }

    background = matrix(unlist(apply(rawData,1,max)), nrow(imgX), ncol(imgX)) # background is defined as the lightest pixel of images with similar median transformations
    writePNG(background, file.path(outPath2, f))

    imgX = readPNG(file.path(dirtyFolder, f))
    foreground = (imgX - background) / background
    # rescale to the range [0, 1] before saving
    r = range(foreground)
    foreground = (foreground - r[1]) / (r[2] - r[1])
    writePNG(foreground, file.path(outPath3, f))
}

One of the tricks in the script above was that I didn’t simply subtract the background from the image. If I had done that, then the result would not be consistent in areas where the stains occur:

20151015 output 4

20151015 foreground-bad

Notice that in the image above, the writing is faded where the stain was dark. That’s because it was created with the code:

foreground = (imgX - background)
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1])

Where the background is dark, there is less opportunity for the writing to contrast against the background. To fix this, I changed the above script to be:

foreground = (imgX - background) / background
r = range(foreground)
foreground = (foreground - r[1]) / (r[2] - r[1])

By dividing the difference by the background brightness, I have rescaled the contrast to allow for the limitation on the maximum contrast at this location. The result is shown below:

20151015 foreground-good

This time the writing has consistent darkness across the entire image, regardless of the brightness of the background pixels.
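A small worked example shows the effect of the division. It rests on a simplifying assumption of mine, namely that ink leaves a fixed fraction (here 30%) of whatever brightness the paper has underneath it, and the numbers are made up:

```r
# The same ink on a clean area (background 1.0) and on a stained area
# (background 0.5). Assumption: ink leaves 30% of the local brightness.
background = c(1.0, 0.5)
ink = 0.3 * background                       # observed pixels: 0.30 and 0.15

difference = ink - background                # -0.70 and -0.35
relative = (ink - background) / background   # -0.70 and -0.70
```

With plain subtraction, the writing over the stain has half the contrast of the writing on clean paper. After dividing by the background, both come out the same.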

My competition submission consisted of 4 stages, and this leakage-based background removal was the first stage. The final stage also took advantage of an information leakage. But you will have to wait for a couple of blogs to see what that was…