An Even Dozen – Denoising Dirty Documents: Part 12 | Keeping Up With The Latest Techniques

Tags

Image Processing, Kaggle, Machine Learning, R, Stacking, XGBoost

Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.

The final blog in this series brings the count to an even dozen, and will achieve two aims:

ensemble the models that we have built
take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if each model has statistically independent errors, and each model performs with similar accuracy, then the average prediction across the 4 models will have half the RMSE score of the individual models. One way to increase the statistical independence of the models is to use different feature sets and / or types of models on each. I therefore chose the following combination of models:

deep learning – thresholding based features
deep learning – edge based features
deep learning – median based features
images with backgrounds removed using information leakage
xgboost – wide selection of features
convolutional neural network – using raw images without background removal pre-processing
convolutional neural network – using images with backgrounds removed using information leakage
deep convolutional neural network – using raw images without background removal pre-processing
deep convolutional neural network – using images with backgrounds removed using information leakage

It turned out that some of these models had errors that weren’t strongly independent to other models. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn’t have time to experiment.

I didn’t experiment much with different ensemble models. However I did test xgboost versus a simple average or a least square linear regression, and it outperformed both. Maybe an elastic net could have done a good job.

Here is the R code for my ensemble:


.libPaths(c(.libPaths(), "./rlibs"))
library(png)
library(data.table)
library(xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

cleanFolder = "./data/train_cleaned"
inFolder1 = "./threshold based model/training data"
inFolder2 = "./edge based model/training data"
inFolder3 = "./median based model/training data"
inFolder4 = "./foreground/train foreground"
inFolder5 = "./submission 11/train_postprocessed"
inFolder6 = "./convnet/train_predicted"
inFolder7 = "./cnn_leakage/train_predicted"
inFolder8 = "./CNN based model/training"
inFolder9 = "./deep CNN/train_predicted"

outPath = "./stacked/stacking.csv"

filenames = list.files(cleanFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
y = img2vec(imgY)
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(y, x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("y", "threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
#rows = sample(nrow(dat), 15000000)
dat[is.na(dat)] = 0
#dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
dtrain <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
#
nThreads = 30
# do cross validation first
#xgb.tab = xgb.cv(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = 1000, early.stop.round = 15, nfold = 4, print.every.n = 10)
# what is the best number of rounds?
#min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
min.error.idx = 300 # was 268
xgb.mod = xgboost(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

dat_predicted = predict(xgb.mod, newdata=as.matrix(dat[,-1]))
sqrt( mean( (dat$y - dat_predicted) ^ 2 )) # 0.00759027

save (xgb.mod, file = "./model/xgb.rData")

#####################################################################################################################################

imgFolder = "./data/test"
inFolder1 = "./threshold based model/test data"
inFolder2 = "./edge based model/test data"
inFolder3 = "./median based model/test data"
inFolder4 = "./foreground/test foreground"
inFolder5 = "./submission 11/test_postprocessed"
inFolder6 = "./convnet/test_predicted"
inFolder7 = "./cnn_leakage/test_predicted"
inFolder8 = "./CNN based model/test"
inFolder9 = "./deep CNN/test_predicted"

outFolder = "./stacked/test data"
outFolder2 = "./stacked/test images"

filenames = list.files(imgFolder)
for (f in filenames)
{
print(f)
imgX1 = readPNG(file.path(inFolder1, f))
imgX2 = readPNG(file.path(inFolder2, f))
imgX3 = readPNG(file.path(inFolder3, f))
imgX4 = readPNG(file.path(inFolder4, f))
imgX5 = readPNG(file.path(inFolder5, f))
imgX6 = readPNG(file.path(inFolder6, f))
imgX7 = readPNG(file.path(inFolder7, f))
imgX8 = readPNG(file.path(inFolder8, f))
imgX9 = readPNG(file.path(inFolder9, f))

# turn the images into vectors
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)
x6 = img2vec(imgX6)
x7 = img2vec(imgX7)
x8 = img2vec(imgX8)
x9 = img2vec(imgX9)

dat = data.table(cbind(x1, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,c("threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
yHat = predict(xgb.mod, newdata=as.matrix(dat))
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
imgY = matrix(yHat, nrow(imgX1), ncol(imgX1))
writePNG(imgY, file.path(outFolder2, f))
save(imgY, file = file.path(outFolder, gsub(".png", ".rData", f)))
}

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real world situation I would have stopped model development here, because the image quality exceeds the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare a cleaned images to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that were different across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.


library(png)
library(data.table)

inFolder = "./stacked/test data"
outFolder = "./information leakage/data"
outFolder2 = "./information leakage/images"

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

filenames = list.files(inFolder, pattern = "\\.rData$")
for (f in filenames)
{
print(f)

load(file.path(inFolder, f))
imgX = imgY

# look for the closest matched images
scores = matrix(1, length(filenames))
for (i in 1:length(filenames))
{
load(file.path(inFolder, filenames[i]))
rmse = 1
if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX))
{
imgY = imgY[1:nrow(imgX), 1:ncol(imgX)]
rmse = sqrt(mean( (imgX - imgY)^2 ))
}
scores[i] = rmse
}

dat = matrix(1, ncol(imgX) * nrow(imgX), 4)
for (i in 1:4)
{
f2 = filenames[order(scores)][i]
load(file.path(inFolder, f2))
dat[,i] = img2vec(imgY)
}

dat2 = apply(dat, 1, median)
#dat2 = apply(dat, 1, mean)

imgOut = matrix(dat2, nrow(imgX), ncol(imgX))
writePNG(imgOut, file.path(outFolder2, gsub(".rData", ".png", f)))
save(imgOut, file = file.path(outFolder, f))
}

This information leakage halved the RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores less than 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others, and sharing ideas is a fun way to learn.

3 thoughts on “An Even Dozen – Denoising Dirty Documents: Part 12”

Pingback: Image Processing + Machine Learning in R: Denoising Dirty Documents Tutorial Series | no free hunch
Bobby said:

March 8, 2016 at 7:32 pm

Well done with this competition Colin! I meant to keep in touch soon after the competition but school got in the way. Still enjoying your blog posts! I’ve got some to catch up on. Keep up the good work!

LikeLike

Pingback: Image Processing + Machine Learning – FPGA, image processing, machine learning, deep learning

Keeping Up With The Latest Techniques

~ brief insights

An Even Dozen – Denoising Dirty Documents: Part 12

3 thoughts on “An Even Dozen – Denoising Dirty Documents: Part 12”

Leave a comment Cancel reply

Share this:

Related

3 thoughts on “An Even Dozen – Denoising Dirty Documents: Part 12”

Leave a comment Cancel reply