
Over the past 11 blogs in this series, I have discussed how to build machine learning models for Kaggle’s Denoising Dirty Documents competition.


The final blog in this series brings the count to an even dozen, and will achieve two aims:

  1. ensemble the models that we have built
  2. take advantage of the second information leakage in the competition

Ensembling, the combining of individual models into a single model, performs best when the individual models have errors that are not strongly correlated. For example, if the models have statistically independent, zero-mean errors of similar size, then averaging the predictions of 4 such models halves the RMSE relative to any one of them, because averaging n independent errors shrinks the RMSE by a factor of the square root of n. One way to increase the statistical independence of the models is to use different feature sets and / or different types of model for each. I therefore chose the following combination of models:

  1. deep learning – thresholding based features
  2. deep learning – edge based features
  3. deep learning – median based features
  4. images with backgrounds removed using information leakage
  5. xgboost – wide selection of features
  6. convolutional neural network – using raw images without background removal pre-processing
  7. convolutional neural network – using images with backgrounds removed using information leakage
  8. deep convolutional neural network – using raw images without background removal pre-processing
  9. deep convolutional neural network – using images with backgrounds removed using information leakage

[Figure: ensemble structure]
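
The claim above about averaging four models with independent errors can be checked with a small simulation. This is only a sketch on synthetic data: the pixel count, the noise level and the names `truth`, `preds` and `shared` are my own assumptions for illustration and are not taken from the competition code. It also shows how little averaging helps once the errors share a common component.

```r
# Simulate 4 models whose errors are independent, zero-mean and equally sized.
set.seed(42)                      # arbitrary seed, for reproducibility only
n_pixels <- 100000                # hypothetical number of pixels
n_models <- 4
truth <- runif(n_pixels)          # stand-in for the clean pixel values

# each column holds one model's predictions: truth plus independent noise
errors <- matrix(rnorm(n_pixels * n_models, sd = 0.05), n_pixels, n_models)
preds <- truth + errors

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

individual_rmse <- apply(preds, 2, function(p) rmse(truth, p))
ensemble_rmse <- rmse(truth, rowMeans(preds))

# with independent errors the ratio should be close to 1 / sqrt(4) = 0.5
ratio <- ensemble_rmse / mean(individual_rmse)
print(ratio)

# now make the errors strongly correlated by sharing a common component
shared <- rnorm(n_pixels, sd = 0.05)
preds_corr <- truth + 0.9 * shared + 0.1 * errors
ratio_corr <- rmse(truth, rowMeans(preds_corr)) /
  mean(apply(preds_corr, 2, function(p) rmse(truth, p)))
print(ratio_corr)                 # close to 1: averaging helps far less
```

In the second case the shared component survives the averaging untouched, which is why models built on different features or different model types are worth the effort.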

It turned out that some of these models had errors that were fairly strongly correlated with the errors of other models. But I was rushing to improve my leaderboard score in the final 48 hours of the competition, so I didn’t have time to experiment.

I didn’t experiment much with different ensemble models. However, I did test xgboost against a simple average and a least squares linear regression, and xgboost outperformed both. An elastic net might also have done a good job.
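
To make the comparison between stacking methods concrete, here is a minimal sketch of two of the simpler options mentioned above, a simple average and a least squares regression, evaluated on held-out rows. Everything in it is an assumption for illustration: the data is synthetic, the three base models `m1`, `m2` and `m3` are hypothetical, and xgboost is deliberately left out so that the example runs in base R.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 20000                        # hypothetical number of pixels
y <- runif(n)                     # stand-in for the clean pixel brightness

# three hypothetical base models with different noise levels and one bias
m1 <- y + rnorm(n, sd = 0.02)
m2 <- y + rnorm(n, sd = 0.05)
m3 <- 0.8 * y + 0.1 + rnorm(n, sd = 0.03)   # compresses the contrast
dat <- data.frame(y = y, m1 = m1, m2 = m2, m3 = m3)

# hold out half of the rows so that the comparison is out of sample
train <- dat[1:(n / 2), ]
test <- dat[(n / 2 + 1):n, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# stacker 1: simple average of the base model predictions
avg_pred <- rowMeans(test[, c("m1", "m2", "m3")])

# stacker 2: least squares regression fitted on the training rows
fit <- lm(y ~ m1 + m2 + m3, data = train)
lm_pred <- predict(fit, newdata = test)

# keep the stacked predictions inside the valid brightness range [0, 1]
lm_pred <- pmin(pmax(lm_pred, 0), 1)

rmse_avg <- rmse(test$y, avg_pred)
rmse_lm <- rmse(test$y, lm_pred)
print(c(average = rmse_avg, least_squares = rmse_lm))
```

The regression can down-weight the noisier model and undo the bias in `m3`, which the simple average cannot. A tree-based stacker such as xgboost can additionally pick up non-linear interactions between the base models, which may be part of why it did better in my tests.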

Here is the R code for my ensemble:


.libPaths(c(.libPaths(), "./rlibs"))
library(png)
library(data.table)
library(xgboost)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
    return (matrix(img, nrow(img) * ncol(img), 1))
}

cleanFolder = "./data/train_cleaned"
inFolder1 = "./threshold based model/training data"
inFolder2 = "./edge based model/training data"
inFolder3 = "./median based model/training data"
inFolder4 = "./foreground/train foreground"
inFolder5 = "./submission 11/train_postprocessed"
inFolder6 = "./convnet/train_predicted"
inFolder7 = "./cnn_leakage/train_predicted"
inFolder8 = "./CNN based model/training"
inFolder9 = "./deep CNN/train_predicted"

outPath = "./stacked/stacking.csv"

filenames = list.files(cleanFolder)
for (f in filenames)
{
    print(f)
    # read each model's prediction for this image, plus the clean target image
    imgX1 = readPNG(file.path(inFolder1, f))
    imgX2 = readPNG(file.path(inFolder2, f))
    imgX3 = readPNG(file.path(inFolder3, f))
    imgX4 = readPNG(file.path(inFolder4, f))
    imgX5 = readPNG(file.path(inFolder5, f))
    imgX6 = readPNG(file.path(inFolder6, f))
    imgX7 = readPNG(file.path(inFolder7, f))
    imgX8 = readPNG(file.path(inFolder8, f))
    imgX9 = readPNG(file.path(inFolder9, f))
    imgY = readPNG(file.path(cleanFolder, f))

    # turn the images into vectors
    y = img2vec(imgY)
    x1 = img2vec(imgX1)
    x2 = img2vec(imgX2)
    x3 = img2vec(imgX3)
    x4 = img2vec(imgX4)
    x5 = img2vec(imgX5)
    x6 = img2vec(imgX6)
    x7 = img2vec(imgX7)
    x8 = img2vec(imgX8)
    x9 = img2vec(imgX9)

    # one row per pixel: the target followed by the nine model predictions
    dat = data.table(cbind(y, x1, x2, x3, x4, x5, x6, x7, x8, x9))
    setnames(dat,c("y", "threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
    # append to a single CSV file, writing the header only for the first image
    write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to all of the data (the commented-out lines use a random subset of rows instead)
set.seed(1)
#rows = sample(nrow(dat), 15000000)
dat[is.na(dat)] = 0
#dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
dtrain <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
#
nThreads = 30
# do cross validation first
#xgb.tab = xgb.cv(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = 1000, early.stop.round = 15, nfold = 4, print.every.n = 10)
# what is the best number of rounds?
#min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
min.error.idx = 300 # was 268
xgb.mod = xgboost(data = dtrain, nthread = nThreads, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# in-sample RMSE on the training pixels, so an optimistic estimate
dat_predicted = predict(xgb.mod, newdata=as.matrix(dat[,-1]))
sqrt( mean( (dat$y - dat_predicted) ^ 2 )) # 0.00759027

save (xgb.mod, file = "./model/xgb.rData")

#####################################################################################################################################

imgFolder = "./data/test"
inFolder1 = "./threshold based model/test data"
inFolder2 = "./edge based model/test data"
inFolder3 = "./median based model/test data"
inFolder4 = "./foreground/test foreground"
inFolder5 = "./submission 11/test_postprocessed"
inFolder6 = "./convnet/test_predicted"
inFolder7 = "./cnn_leakage/test_predicted"
inFolder8 = "./CNN based model/test"
inFolder9 = "./deep CNN/test_predicted"

outFolder = "./stacked/test data"
outFolder2 = "./stacked/test images"

filenames = list.files(imgFolder)
for (f in filenames)
{
    print(f)
    # read each model's prediction for this test image
    imgX1 = readPNG(file.path(inFolder1, f))
    imgX2 = readPNG(file.path(inFolder2, f))
    imgX3 = readPNG(file.path(inFolder3, f))
    imgX4 = readPNG(file.path(inFolder4, f))
    imgX5 = readPNG(file.path(inFolder5, f))
    imgX6 = readPNG(file.path(inFolder6, f))
    imgX7 = readPNG(file.path(inFolder7, f))
    imgX8 = readPNG(file.path(inFolder8, f))
    imgX9 = readPNG(file.path(inFolder9, f))

    # turn the images into vectors
    x1 = img2vec(imgX1)
    x2 = img2vec(imgX2)
    x3 = img2vec(imgX3)
    x4 = img2vec(imgX4)
    x5 = img2vec(imgX5)
    x6 = img2vec(imgX6)
    x7 = img2vec(imgX7)
    x8 = img2vec(imgX8)
    x9 = img2vec(imgX9)

    dat = data.table(cbind(x1, x2, x3, x4, x5, x6, x7, x8, x9))
    setnames(dat,c("threshold", "edge", "median", "foreground", "submission11", "convnet", "cnn_leakage", "CNN", "deepCNN"))
    yHat = predict(xgb.mod, newdata=as.matrix(dat))
    # clamp the predictions to the valid brightness range [0, 1]
    yHat[yHat < 0] = 0
    yHat[yHat > 1] = 1
    imgY = matrix(yHat, nrow(imgX1), ncol(imgX1))
    writePNG(imgY, file.path(outFolder2, f))
    # escape the dot and anchor the pattern so only the file extension is replaced
    save(imgY, file = file.path(outFolder, gsub("\\.png$", ".rData", f)))
}

Ensembling materially improved my leaderboard score versus any of the individual models. I feel that this was due to the use of different features across my 3 deep learning models. So now I had a set of images that looked quite good:

[Figure: output 1]

[Figure: output 2]

To my eyes, my predicted images were indistinguishable from the clean images in the training data. In a real-world situation I would have stopped model development here, because the image quality exceeded the minimum requirements for OCR. However, since this was a competition, I wanted the best score I could get.

So I took advantage of the second data leakage in the competition – the fact that the cleaned images were repeated across the dataset. This meant that I could compare each cleaned image to other cleaned images that appeared to have the same text and the same font, and clean up any pixels that differed across the set of images. I experimented with using the mean of the pixel brightness across the images, but using the median performed better.


library(png)
library(data.table)

inFolder = "./stacked/test data"
outFolder = "./information leakage/data"
outFolder2 = "./information leakage/images"

# a function to turn a matrix image into a vector
img2vec = function(img)
{
    return (matrix(img, nrow(img) * ncol(img), 1))
}

filenames = list.files(inFolder, pattern = "\\.rData$")
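for (f in filenames)
{
    print(f)

    load(file.path(inFolder, f))
    imgX = imgY

    # look for the closest matched images
    scores = matrix(1, length(filenames))
    for (i in 1:length(filenames))
    {
        load(file.path(inFolder, filenames[i]))
        rmse = 1
        # only compare against images at least as large as the current one, cropped to its size
        if (nrow(imgY) >= nrow(imgX) && ncol(imgY) >= ncol(imgX))
        {
            imgY = imgY[1:nrow(imgX), 1:ncol(imgX)]
            rmse = sqrt(mean( (imgX - imgY)^2 ))
        }
        scores[i] = rmse
    }

    # stack the 4 closest images (the current image itself is one of them) as columns
    dat = matrix(1, ncol(imgX) * nrow(imgX), 4)
    for (i in 1:4)
    {
        f2 = filenames[order(scores)][i]
        load(file.path(inFolder, f2))
        # crop as in the scoring loop, so that a larger matched image still fits the column
        dat[,i] = img2vec(imgY[1:nrow(imgX), 1:ncol(imgX)])
    }

    # take the per-pixel median across the matched images
    dat2 = apply(dat, 1, median)
    #dat2 = apply(dat, 1, mean)

    imgOut = matrix(dat2, nrow(imgX), ncol(imgX))
    # escape the dot and anchor the pattern so only the file extension is replaced
    writePNG(imgOut, file.path(outFolder2, gsub("\\.rData$", ".png", f)))
    save(imgOut, file = file.path(outFolder, f))
}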
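
To illustrate the matching and median step in the script above without needing the competition files, here is a minimal sketch on synthetic data. The two random "pages", the noise level, and the names `page_a`, `page_b` and `add_noise` are all hypothetical, and it uses the 3 closest images rather than 4 simply to keep the toy example small.

```r
set.seed(7)                       # arbitrary seed, for reproducibility only
rmse <- function(a, b) sqrt(mean((a - b)^2))

# two hypothetical 20x30 "clean pages" with different black and white content
page_a <- matrix(round(runif(600)), 20, 30)
page_b <- matrix(round(runif(600)), 20, 30)

# a stand-in for an imperfect model prediction: add noise, then clamp to [0, 1]
add_noise <- function(img)
{
    noisy <- img + rnorm(length(img), sd = 0.1)
    matrix(pmin(pmax(noisy, 0), 1), nrow(img), ncol(img))
}

# five predicted images: three show page A, two show page B
images <- list(add_noise(page_a), add_noise(page_b), add_noise(page_a),
               add_noise(page_b), add_noise(page_a))

target <- images[[1]]

# score every image against the target (the target itself scores 0)
scores <- sapply(images, function(img) rmse(target, img))
closest <- order(scores)[1:3]
print(closest)

# stack the matched images as columns, one row per pixel, and take the median
stacked <- sapply(images[closest], as.vector)
cleaned <- matrix(apply(stacked, 1, median), nrow(target), ncol(target))

rmse_before <- rmse(page_a, target)
rmse_after <- rmse(page_a, cleaned)
print(c(before = rmse_before, after = rmse_after))
```

Because the noise in each prediction is independent while the underlying page is the same, the per-pixel median discards most of the outlying values. The median is also less sensitive than the mean to a single badly wrong pixel, which is consistent with it performing better for me.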

Exploiting this information leakage halved my RMSE, and I suspect that it was what allowed the top two competitors to obtain RMSE scores below 1%.

So that’s it for this series of blogs. I learned a lot from my first Kaggle competition. Competing against others and sharing ideas is a fun way to learn.