Denoising Dirty Documents : Part 5

Tags

Background Removal, Kaggle, Median Filter, R, Variable Importance

In my last blog we had faded the coffee cup stains, but there was more work to be done. So far we had used adaptive thresholding and edge detection. Today we will use median filters and background removal.

A median filter is an image filter that replaces a pixel with the median value of the pixels surrounding it. In doing this, it smoothes the image, and the result is often thought of as the “background” of the image, since it tends to wipe away small features, but maintains broad features.

While the biOps package has a median filter implementation, it isn’t difficult to write a function to do that ourselves, and it can be quite instructive to see how a median filter works.

median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# use the median filter and save the result
filtered = median_Filter(img, 17)
writePNG(filtered, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

What we get after applying a median filter is something that looks like the background of the image. It contains the coffee cup stains and also the shade of the paper upon which the writing appears. With a filter width of 17, the writing has almost entirely faded away.
I didn’t choose that filter width of 17 randomly. It was the result of running models with several different filter widths and seeing which had the best predictive powers.
Median filters are not fast, especially once the filter width increases. This function runs with similar speed to that in the biOps package. I shall leave the task of writing a parallel processing version of the median filter function to the more impatient readers.

While we now have the background, what we really wanted was the foreground – the writing, without the coffee cup stains. The foreground is the difference between the original image and the background. But in this case we know that the writing is always darker than the background, so our foreground should only show pixels that are darker than the background. I have also rescaled the result to lie in the interval [0, 1]. Here is an R script to implement background removal.

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

# run the background removal function and save the result
foreground = background_Removal(img)
writePNG(foreground, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

While it’s not perfect, the resulting filtered image has done a reasonably good job of separating the writing from the background. It is reasonable to expect that it will be a useful predictor in our model.
This time a filter width of 5 was chosen purely for the purpose of speed. You could have used a grid search to find the “best” filter width parameter.

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
imgClean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

bestRMSE = 1
bestWidth = 5

widths = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25)

for (w in widths)
{
 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 # score the result
 rmse = sqrt(mean( (foreground - imgClean) ^ 2 ))
 if (rmse < bestRMSE)
 {
 bestRMSE = rmse
 bestWidth = w
 print(c(bestWidth, rmse))
 }
}

In the past few blogs I have used the GBM package to create a predictive model. But as we have added more predictors or features, it has started to take a long time to fit the model and to calculate the predictions. So today I’m switching to the xgboost package because the top Kagglers are using it (Owen Zhang, Kaggle’s top ranked competitor says “when in doubt use xgboost“), and because it runs much faster than GBM. I’d never used xgboost until this week, and I must say that I’m quite impressed with its speed.

Here is an R script that fits an xgboost model, using all of the features that we have come up with over the 5 blogs to date.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
 source("http://bioconductor.org/biocLite.R")
 biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
 # fit 3 clusters
 v = img2vec(img)
 km.mod = kmeans(v, 3)
 # allow for the random ordering of the clusters
 oc = order(km.mod$centers)
 # the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
 hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))
 
 # using upper threshold
 imgHi = v
 imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1
 
 return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
 img.eb <- Image(t(img))
 img.thresholded.3 = thresh(img.eb, 3, 3)
 img.thresholded.5 = thresh(img.eb, 5, 5)
 img.thresholded.7 = thresh(img.eb, 7, 7)
 img.thresholded.9 = thresh(img.eb, 9, 9)
 img.thresholded.11 = thresh(img.eb, 11, 11)
 img.kmThresh = kmeansThreshold(img)

 # combine the adaptive thresholding
 ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
 ttt.2 = apply(ttt.1, 1, max)
 ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
 return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
 m1 = t(matrix(Img, nrow(Img), ncol(Img)))
 return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3) img.dilation = imgBinaryDilation(img.canny, mask) img.erosion = imgBinaryErosion(img.dilation, mask) img.erosion.2 = imgBinaryErosion(img.erosion, mask) img.dilation.2 = imgBinaryDilation(img.erosion.2, mask) return(matrix(img.dilation.2 / 255, nrow(img), ncol(img))) } # a function to do a median filter median_Filter = function(img, filterWidth) { pad = floor(filterWidth / 2) padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad) padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img tab = NULL for (i in seq_len(filterWidth)) { for (j in seq_len(filterWidth)) { if (i == 1 && j == 1) { tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]) } else { tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])) } } } filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)]))) return (matrix(filtered, nrow(img), ncol(img))) } # a function that uses median filter to get the background then finds the dark foreground background_Removal = function(img) { w = 5 # the background is found via a median filter background = median_Filter(img, w) # the foreground is darker than the background foreground = img - background foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
 
outPath = file.path(outFolder, "trainingdata_blog5.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
 y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)
 
 # threshold the image
 x2 = kmeansThreshold(imgX)

 # adaptive thresholding
 x3 = img2vec(adaptiveThresholding(imgX))
 
 # canny edge detector and related features
 x4 = img2vec(cannyEdges(imgX))
 x5 = img2vec(cannyDilated1(imgX))
 x6 = img2vec(cannyDilated2(imgX))
 
 # median filter and related features
 x7 = img2vec(median_Filter(imgX, 17))
 x8 = img2vec(background_Removal(imgX))

 dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8))
 setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 2.4% vs 4.1%

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)


# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), 
 img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)),img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)) )
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0 yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

Both the background removal and the median filter features are important, with even higher importance scores than last blog’s canny edge detector features.

The coffee cup stain has been mostly removed, especially the speckles that we saw after applying the model in the last blog.

The result represents an improvement in RMSE score on the training data from 4.1% in the last blog to 2.4% in this blog. In the next blog we will further improve on this score.

12 thoughts on “Denoising Dirty Documents : Part 5”

Bobby said:

August 28, 2015 at 9:25 pm

Great stuff Colin! I’ve been looking for sliding window code so I’m glad you chose to implement the median filter from scratch instead of using a library.

In one of my submissions, I used a gaussian filter instead which helped my score a bit. I haven’t compared the median filter to the gaussian filter yet but I was wondering if there should be some particular preference here when it comes to the image data we’re working with?

Cheers.

LikeLiked by 1 person

Colin Priest said:

August 29, 2015 at 2:17 am

Hi Bobby,

Gaussian filters are much faster because they can use a convolutional filter, and our CPUs and GPUs are optimised for convolution operations. A median filter is good for finding a “typical” pixel brightness, because it ignores extremes. A Gaussian filter reacts more to one or two very different pixels, and is intended for smoothing rather than finding the background. A difference-of-gaussians (DoG) is good for finding features with a particular width.

Colin

LikeLike

- Bobby said:
  
  August 30, 2015 at 4:30 pm
  
  Ah ok, thanks for the explanation.
  I was also wondering if we’re just using the backgroundRemoval feature to train the xgboost model? The reason I ask is because that’s what the following code looks like its doing:
  
  dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
  
  If that is the case, why not use all the features?
  
  LikeLike
  
  - Colin Priest said:
    
    August 31, 2015 at 1:39 am
    
    Bobby,
    
    We are using all of the features. That line of code just formats the data into a DMatrix, splitting the target (the first column) from the predictors, which is what xgboost wants as input.
    
    You can tell that xgboost is using all of the features because the variable importance shows that they are being used.
    
    Colin
    
    LikeLike
Pingback: Denoising Dirty Documents: Part 6 | Keeping Up With The Latest Techniques
Pingback: Denoising Dirty Documents: Part 7 | Keeping Up With The Latest Techniques
Ramit Das said:

November 5, 2015 at 12:07 pm

biOps package is removed from R-3.2.2. Any alternative package from some other source. biOps is not installing from any outside source

LikeLike

- Colin Priest said:
  
  November 5, 2015 at 12:51 pm
  
  Hi Ramit,
  
  Have you had any success installing biOps from the CRAN archive?
  
  LikeLike
  
Pingback: Image Processing + Machine Learning in R: Denoising Dirty Documents Tutorial Series | No Free Hunch
Sparsh said:

November 17, 2016 at 11:38 am

which version 0f R you have used?

LikeLike

- Colin Priest said:
  
  November 17, 2016 at 12:57 pm
  
  Hi Sparsh. Sorry I can’t remember which version of R I was using when I wrote this. But the R scripts should work on any recent version of R. Colin
  
  LikeLike
  
Pingback: learn Data Science by doing | Egyptian AI & Big Data Geeks Offical Blog

Keeping Up With The Latest Techniques

~ brief insights

Denoising Dirty Documents : Part 5

12 thoughts on “Denoising Dirty Documents : Part 5”

Leave a comment Cancel reply

Share this:

Related

12 thoughts on “Denoising Dirty Documents : Part 5”

Leave a comment Cancel reply