Keeping Up With The Latest Techniques

~ brief insights

Keeping Up With The Latest Techniques

Category Archives: GBM

Denoising Dirty Documents: Part 3

14 Friday Aug 2015

Posted by Colin Priest in Adaptive Thresholding, GBM, Gradient Boosting Machine, Image Processing, Kaggle, Machine Learning, R

≈ 9 Comments

Tags

Adaptive Thresholding, GBMs, Image Processing, Kaggle, Machine Learning, R

In my last blog, I discussed how to threshold an image, and we ended up doing quite a good job of cleaning up the image with the creased paper. But then I finished the blog by showing that our algorithm had a problem, it didn’t cope well with coffee cup stains.

coffee stain cropped

The problem is that coffee cup stains are dark, so our existing algorithm does not distinguish between dark writing and dark stains. We need to find some features that distinguish between dark stains and dark writing.20150808 plot 7

Looking at the image, we can see that even though the stains are dark, the writing is darker. The stains are dark compared to the overall image, but lighter than the writing within them,

So we can hypothesise that we might be able to separate the writing from the stains by using thresholding that looks at the pixel brightnesses over a localised set of pixels rather than across all of the image. That way we might separate the darker writing from the dark background stain.

While I am an R programmer and a fan of R, I have to admit that R can be a frustrating platform for image processing. For example, I cannot get the adimpro package to work at all, despite much experimentation and tinkering, and googling for solutions. Then I discovered that the biOps package was removed from CRAN, and the archived versions refuse to install on my PC. But eventually I found ways to get image processing done in R.

First, I tried the EbayesThresh package.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, EbayesThresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")


# using EbayesThresh
et = ebayesthresh(img)
plot(raster(et))
range(et)

20150815 output 1
All of the values are zero! That isn’t very helpful.

Second, I tried the treethresh package.


# experiment with treethresh
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using tree thresholding
tt = treethresh(img)
tt.pruned <- prune(tt)
tt.thresholded <- thresh(tt.pruned)
plot(raster(tt.thresholded))
range(tt.thresholded)

Once again I only got zeroes.
Finally, I tried wavelet thresholding using the treethresh package.

# experiment with wavelet thresholding
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, treethresh)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using wavelet thresholding
# need a square image that has with that is a power of 2
# compute wavelet transform
wt.square <- imwd(img[1:256,1:256])
# Perform the thresholding
wt.thresholded = threshold(wt.square)
# Compute inverse wavelet transform
wt.denoised <- imwr(wt.thresholded)
plot(raster(wt.denoised))

20150815 output 2
It wasn’t what I was looking for, but at least it’s not just a matrix of zeroes!

After these two failures with CRAN packages, I decided that I needed to look beyond CRAN and try out the EBImage package on Bioconductor.


if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# sample image containing coffee cup stain
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# using adaptive thresholding
img.eb = readImage("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
img.thresholded.3 = thresh(img.eb, 3, 3)
display(img.thresholded.3)
img.thresholded.5 = thresh(img.eb, 5, 5)
display(img.thresholded.5)
img.thresholded.7 = thresh(img.eb, 7, 7)
display(img.thresholded.7)
img.thresholded.9 = thresh(img.eb, 9, 9)
display(img.thresholded.9)

20150815 output 3

These images show that we are on the right track. The adaptive thresholding is good at separating the letters from the background, but it also adds black to the background. While the adaptive thresholding may not be perfect, any feature extraction that adds to our knowledge will probably improve our ensemble model.

Before we add adaptive thresholding to our model, we will do a bit more pre-processing. Based on the observation that adaptive thresholding is keeping the letters but adding some unwanted black to the background, we can combine all of the adaptive thresholding with normal thresholding images, and use the maximum pixel brightness across each 5 image set.


# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)),
img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
plot(raster(ttt.3))

20150815 output 4

It’s starting to look much better. The coffee cup stain isn’t removed, but we are starting to successfully clean it up. This suggests that adaptive thresholding will be an important predictor in our ensemble.

Let’s bring it all together, and build a new ensemble model, extending the model from my last blog on this subject.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

dat = data.table(cbind(y, x, x2, x3))
setnames(dat,c("y", "raw", "thresholded", "adaptive"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive, data = dat[rows,], n.trees = 7500, cv.folds = 3, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod,method="cv",oobag.curve = FALSE)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 #change the 6 to your number of CPU cores. or maybe lower due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat))%%numCores)
yHat = foreach(i=unique(split_testing),.combine=c,.packages=c("gbm")) %dopar% {
as.numeric(predict(gbm.mod, newdata=dat[split_testing==i,], n.trees = best.iter))
}
stopCluster(cl)

# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)))
setnames(x, c("raw", "thresholded", "adaptive"))
yHat = predict(gbm.mod, newdata=x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150815 output 6

The adaptive thresholding gets a relative importance score of 2.4%, but full image thresholding loses all of its importance in the model.

20150815 output 5

The result may not be perfect, but you can see how the coffee cup stain is starting to be erased. We will need to find some more features to continue the clean up, but that will be left to future blogs.

This blog’s model has improved the RMSE on the training data from 6.5% reported in my last blog to 5.4% in this latest version. There’s a long way to go to have a top 10 performing model, but this process is steadily moving us in the right direction.

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Denoising Dirty Documents: Part 2

07 Friday Aug 2015

Posted by Colin Priest in Cluster Analysis, GBM, Gradient Boosting Machine, Image Processing, k-means, Kaggle, Machine Learning, R

≈ 9 Comments

Tags

Cluster Analysis, GBMs, Image Processing, k-means, Kaggle, Machine Learning, R

In the first blog in this series, I explained the nature of the problem to be solved, and showed how to create a simple linear model that gives an RMSE score on the training data of 7.8%. In this blog I introduce the feedback loop for creating features, and we extend and improve the existing competition model to include a couple of new features.

20150731 leaderboard

I have been building machine learning models for almost 20 years now, and never once has my first model been satisfactory, nor has that model been the model that I finally use. The Denoising Dirty Documents competition is no different. In the plots above, you can see my progress through the public leaderboard, as I iteratively improved my model. The same has occurred more broadly in the competition overall, as the best score has iteratively improved since the competition began.

Coming from an actuarial / statistical background, I tend to think in terms of a “control cycle”. For actuaries, the control cycle looks something like this:

control cycle

We can categorise the content of my last blog as:

1) Define the Problem

  • remove noise from images
  • images are three-dimensional surfaces

2) Hypothesise Solutions

  • we hypothesised that the pixel brightnesses needed to be rescaled

3) Implement

  • we fitted a least squares linear model to rescale the pixel brightnesses, and we capped and floored the values to keep them within the [0, 1] range

4) Monitor Results

  • we calculated the RMSE on the training data before and after the implementation
  • we looked at one of the output images

So we have completed one cycle. Now we need to do another cycle.

What I like about the Denoising Dirty Images Competition is that the data visualisation and model visualisation are obvious. We don’t need to develop clever visualisation tools to understand the problem or the model output. We can just look at the dirty and predicted images. Let’s begin with reviewing the example input and output images from the model used in the last blog i.e. let’s begin by recalling the Monitor Results step from the previous cycle:

20150801 - before

20150801 - after

Now we can progress to the beginning of the next cycle, the Define the Problem stage. This time the problem to be solved is to remove one or more of the errors from the output of the existing model. Once we already have a model, we can define the problem by asking ourselves the following question.

Question: In the predicted image, where have prediction errors have been made?

Answer: The predicted image contains some of the crease lines.

So in this cycle, the problem is to remove the remaining crease lines.

It’s time to begin the Hypothesise Solutions stage of the cycle by asking ourselves the following questions:

  • In the dirty image, what is a single characteristic of what we want to keep or remove that we haven’t captured in our existing model?
  • What characteristics do the errors in the predicted image have?
  • What characteristics does the dirty image have at the locations where these errors occur?

Here are a few observations that I made in answer to these questions:

  • In the dirty image, the writing is darker than the background around it. We haven’t captured this locality information.
  • The predicted image retains some of the crease lines. Those remaining crease lines are narrower and not as dark as the writing.
  • In the dirty image, the shadows next to the white centres of the crease are the remaining crease lines from the predicted image.

This led me to the hypothesis that the model for a single pixel needs to include information about the brightness of other pixels in the image. There are multiple ways that we could consider the brightnesses of other pixels, but in today’s blog I will limit myself to considering the brightness of the pixel under consideration versus the range of pixel brightnesses across the entire image.

To Implement a hypothesis test, and solution, we need to know some theory about image processing. There are some good textbooks that cover this material. The image processing / machine vision textbook that I have been using is


Computer and Machine Vision, Fourth Edition: Theory, Algorithms, Practicalities

The machine vision technique that I will apply this today is thresholding, which is the process of turning an image into pixels that can only be black or white, with no grey shades or colours. Writing code to do thresholding is the easy part. The trickier part is to decide the threshold value at which pixels are split into either black or white.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")

# turn the image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# show a histogram
hist(img2vec(img))

20150808 plot 1

One way of finding a threshold value is to look at a histogram of the pixel brightnesses and look for a natural break between local maxima in the histogram. Since the histogram above is tri-modal, this would leave us with two choices of thresholds, one around 0.3 and one around 0.65.


# threshold at 0.3
img0.3 = img
img0.3[img0.3 <= 0.3] = 0 img0.3[img0.3 > 0.3] = 1
plot(raster(img0.3))

20150808 plot 2

Now we can Monitor Results. Thresholding at 0.3 gives us no false positives for this image, but leaves out a lot of the writing.


# threshold at 0.65
img0.65 = img
img0.65[img0.65 <= 0.65] = 0 img0.65[img0.65 > 0.65] = 1
plot(raster(img0.65))

20150808 plot 3

Thresholding at 0.65 gives us no false positives for this image, and correctly flags the writing without flagging the creases. That makes it quite a good feature for use on this image, as it will help us to remove the residual crease mark that we found when reviewing the output of the last blog’s model. We should definitely include a thresholding feature in our model!

But it feels very clumsy to plot a histogram and manually select the threshold, and such a feature definitely isn’t “machine learning”. Can we automate this?

Otsu’s Method is a well-known technique in image processing that can automatically suggest a threshold. But it assumes a bi-modal histogram, which we have seen is not true.

Inspired by Otsu’s method, we could use cluster analysis to generate three clusters of pixel brightnesses, and use the splits between those clusters to threshold the image.


# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the lower threshold is the halfway point between the top of the lowest cluster and the bottom of the middle cluster
loThresh = 0.5 * (max(v[km.mod$cluster == oc[1]]) + min(v[km.mod$cluster == oc[2]]))
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using lower threshold
imgLo = img
imgLo[imgLo <= loThresh] = 0 imgLo[imgLo > loThresh] = 1
plot(raster(imgLo))

# using upper threshold
imgHi = img
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1
plot(raster(imgHi))

20150808 plot 4 20150808 plot 5

Now we can Monitor Results. Once again, the lower threshold choice doesn’t capture enough of the writing, while the upper threshold choice works well for this image. And this time there wasn’t any manual work required!

Let’s put it all together and combine last blog’s feature (a linear transformation of the raw pixel brightnesses) with this blog’s feature (thresholding). This time we will use something more sophisticated than linear regression. Our model will be based upon R’s GBM package, a gradient boosted machine to combine the raw pixel brightnesses with the thresholded pixels. GBMs are good all-purpose models, are simple to set up, perform well, and are almost always my first choice when creating machine learning models.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm)

# a function to do k-means thresholding

kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

dat = data.table(cbind(y, x, x2))
setnames(dat,c("y", "raw", "thresholded"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# view the data
dat = read.csv(outPath)
rows = sample(nrow(dat), 10000)
d1 = dat[rows,]
plot(d1$raw[dat$thresholded == 0], d1$y[dat$thresholded == 0], col = "blue")
lines(d1$raw[dat$thresholded == 1], d1$y[dat$thresholded == 1], col = "red", type="p")

# fit a model to a subset of the data
rows = sample(nrow(dat), 100000)
gbm.mod = gbm(y ~ raw + thresholded, data = dat[rows,], n.trees = 5000, cv.folds = 10, train.fraction = 0.5)
best.iter <- gbm.perf(gbm.mod,method="cv")

# what score do we get on the training data?
yHat = predict(gbm.mod, newdata=dat, n.trees = best.iter)
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata=x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150808 plot 6
Notice how the sample output image is different from the thresholded image? The edges of the writing are lighter than the centre of the letters. That’s good, because that’s one of the characteristics of the clean image.
20150808 sample output
The sample image that we created from the predicted values is looking really good. The residual crease marks have disappeared, yet we have kept the writing. And the RMSE score on the training data has improved to 6.5%.

But before this blog ends, let’s consider an image that wasn’t cleaned so well:

# here's a sample image that doesn't perform as well
img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata=x)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150808 plot 7
Damn! I feel like cursing the jerk who put their coffee cup on top of the important document and left a stain!

In my next blog, I will discuss the first steps towards removing that coffee cup stain.
http://wms-na.amazon-adsystem.com/20070822/US/js/link-enhancer-common.js?tag=keeupwitthela-20&linkId=VYH2LC2X67R2SDNO

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Blogroll

  • Discover New Voices
  • Discuss
  • Get Inspired
  • Get Mobile
  • Get Polling
  • Get Support
  • Great Reads
  • Learn WordPress.com
  • Theme Showcase
  • WordPress.com News
  • www.r-bloggers.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 276 other followers

Blog at WordPress.com.

Cancel
%d bloggers like this: