Keeping Up With The Latest Techniques

~ brief insights


Category Archives: Variable Importance

Denoising Dirty Documents: Part 6

Monday, 7 September 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost


So far in this series of blogs we have used image processing techniques to improve the images, and then ensembled together the results of that image processing using GBM or XGBoost. But I have noticed that some competitors have achieved reasonable results using purely machine learning approaches. While these pure machine learning approaches aren’t enough for those competitors to get to the top of the leaderboard, they have outperformed some of the models that I have presented in this series of blogs. However, these scripts were invariably written in Python, and I thought that it would be great to see how to use R to build a similar type of model (except better, because we will include all of the image processing predictors that we have developed so far). So today we will add a brute-force machine learning approach to our model.

At the time of writing this blog, the top-ranked competitor who has shared their script is placed 17th, with an RMSE of 2.6%. While I am not very experienced in Python, I can usually figure out what a Python script is doing if I read it. So here’s what I think that script is doing:

  1. pad out each image by an extra 2 pixels (see my last blog for how to pad out an image in R)
  2. run a 3×3 sliding window along the image pixels (see my last blog for how to create a sliding window in R)
  3. use all 9 pixels within the sliding window as predictors for the pixel in the centre of the sliding window
  4. use a random forest model to predict the pixel brightnesses

This is a pure machine learning approach because it doesn’t do any image processing to pre-process the image. It simply says that if you want to predict a particular pixel’s brightness, then you should probably look at the brightnesses of the pixels on either side of that pixel.
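To make that concrete, here is a minimal sketch of this pure machine learning approach in R, assuming the randomForest package and a 3×3 window; the sample size and number of trees are illustrative choices, not taken from the competitor's script.


# a minimal sketch of the pure machine learning approach described above
# (randomForest, sample size and ntree are illustrative choices)
library(png)
library(randomForest)

img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
clean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

# pad the image by 1 pixel, then use the 3x3 window around each pixel as predictors
pad = 1
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[, k] = as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

# fit a random forest to a sample of pixels, then predict every pixel
set.seed(1)
rows = sample(nrow(tab), 20000)
rf.mod = randomForest(x = tab[rows, ], y = as.vector(clean)[rows], ntree = 100)
yHat = predict(rf.mod, tab)
imgOut = matrix(yHat, nrow(img), ncol(img))

As I understand it, this mirrors steps 1 to 4 above, although the competitor's actual script may differ in its sampling and tuning.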


While I don’t want to fuel the R versus Python language wars, I do want to create a model in R that can outperform my competitors. So instead of using a 3×3 sliding window, I will use a 5×5 sliding window. Because the random forest implementation in R tends to run out of RAM on my computer, I will use XGBoost.


Here is how I create a 5×5 sliding window.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# read in a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# create a padded image within which we will embed our source image
pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# create a matrix of predictor values - each column is a pixel from the sliding window
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

head(tab[,1:4])

[Output: the first few rows of the predictor matrix]

When I noticed that this script ran a bit slowly, I looked at how I wrote the loops. R does not like manual looping, and it is very inefficient at appending data. So I rewrote the script to pre-allocate space for all of the cells rather than appending each column as it was calculated.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# pre-allocate one column per pixel in the sliding window
tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

This modification gave me a tenfold improvement in speed! It’s a reminder that I shouldn’t be lazy when writing R scripts.
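If you want to verify the speedup on your own machine, you can wrap the two versions in functions and time them with system.time. This is just a sketch; the wrapper names are illustrative:


# hypothetical wrappers around the two loop versions shown above
buildTabSlow = function(padded, img, width)
{
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
newCol = as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
tab = if (is.null(tab)) matrix(newCol) else cbind(tab, newCol)
}
}
return (tab)
}

buildTabFast = function(padded, img, width)
{
tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[, k] = as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}
return (tab)
}

# compare the run times
system.time(buildTabSlow(padded, img, width))
system.time(buildTabFast(padded, img, width))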

There’s one small issue to consider before we put this all together for a new, more powerful predictive model. In the code above I have used a brightness value of 1 when padding out the image, but that doesn’t look natural. The image below shows a clear white border around the image. This could confuse the machine learning algorithm, as it could waste time learning what to do with pure white pixels.

[Figure: the image padded with white, showing a clear white border]

So instead it makes sense to pad out the image with a background brightness. One way to do this is to replicate the edge of the original image.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
# fill in the padded edges
padded[,1:pad] = padded[,pad + 1]
padded[,ncol(img) + pad + 1:pad] = padded[,ncol(padded) - pad]
# the rows need rep(..., each=pad) because matrix assignment recycles column-major
padded[1:pad,] = rep(padded[pad + 1,], each = pad)
padded[nrow(img) + pad + 1:pad,] = rep(padded[nrow(padded) - pad,], each = pad)

[Figure: the image padded by replicating its edges]

This looks more natural, except where there is writing at the edge of the image. Another way is to pad out the pixels using the median of the entire image.


pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

[Figure: the image padded with the median brightness]

I haven’t tested which approach works best. So I leave it to the reader to compare the results from the three different padding approaches, and use whichever gives the best result (although I suspect that there won’t be much difference between the second and third approaches).
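To make that comparison easier, the three approaches can be wrapped into a single helper with a style argument. This is a hypothetical convenience wrapper, not code from my competition entry:


# a hypothetical helper that wraps the three padding styles shown above
padImage = function(img, pad, style = c("median", "edge", "white"))
{
style = match.arg(style)
fill = switch(style, white = 1, median = median(img), edge = 1)
padded = matrix(fill, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
if (style == "edge")
{
# replicate the edges; rows need rep(..., each=pad) because matrix
# assignment recycles column-major
padded[,1:pad] = padded[,pad + 1]
padded[,ncol(padded) - pad + 1:pad] = padded[,ncol(padded) - pad]
padded[1:pad,] = rep(padded[pad + 1,], each = pad)
padded[nrow(padded) - pad + 1:pad,] = rep(padded[nrow(padded) - pad,], each = pad)
}
return (padded)
}

# for example
padded = padImage(img, 2, "median")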

Let’s pull it all together, and add the surrounding pixels as a predictor to a more complete model.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb = Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# img2vec, Image2Mat, cannyEdges, cannyDilated1, cannyDilated2, median_Filter
# and background_Removal are as defined in Part 5 (see the next post below)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog6.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# what score do we get on the training data?
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 1.4% vs 2.4%

# compute the feature importance matrix
model = xgb.dump(xgb.mod, with.stats=TRUE)
importance_matrix = xgb.importance(names(dat)[-1], model=xgb.mod)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)),
 img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)), img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)), proximalPixels(img))
setnames(x, append(c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

[Figure: the predicted output for the sample image]

None of the individual pixels that we added have strong predictive powers, yet our RMSE on the training data dropped from 2.4% for the last blog’s model to 1.4% for this model! That’s because it is the combination of nearby pixels that is predictive, not any individual pixel. I haven’t scored this model on the test set, but I suspect that it will get you a good ranking.
[Figure: the predicted clean image]

There is hardly any trace of the coffee cup stain remaining. This is a pretty good candidate model.

In order to understand the effect of the nearby pixels, it is useful to visualise their variable importance in a grid.

# map the 25 sliding-window features back onto a 5x5 grid
pixelNames = paste("x", 1:25, sep="")
pixelNames[13] = "raw" # the centre pixel is the raw image itself
grid = t(matrix(sapply(1:25, FUN = function(x) importance_matrix$Gain[importance_matrix$Feature == pixelNames[x]]), 5, 5))
grid[3, 3] = NA # blank out the target pixel, which isn't a predictor
plot(raster(grid))

[Figure: variable importance of each pixel in the 5×5 window]

This graphic shows that we didn’t need all of the surrounding pixels to create a good predictive model. The pixels that don’t lie on the same row or column as the target pixel aren’t as important. If we wanted to expand beyond the 3×3 sliding window used by our competitors, then we didn’t need to add all of the extra pixels. We could have just added the pixels at (1, 3) and (3, 1) and (3, 5) and (5, 3), as in the sketch below.
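Here is a sketch of how to keep just those predictors, assuming the column ordering produced by proximalPixels, where column k corresponds to window cell (i, j) via k = (i - 1) * 5 + j:


# keep the 3x3 inner window plus the four extra pixels (1,3), (3,1), (3,5) and (5,3)
keep = c(3, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 23)
tabReduced = proximalPixels(img)[, keep]

The model would then be refitted with these 13 columns in place of the full 25, roughly halving the size of the training data.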

The model that we developed in this blog is pushing the boundaries of what my PC can do. Without access to very powerful computers and/or a cluster, we need to use a different approach if we want to improve on this blog’s model.


Denoising Dirty Documents: Part 5

Friday, 28 August 2015

Posted by Colin Priest in Background Removal, Kaggle, Median Filter, R, Variable Importance


In my last blog we had faded the coffee cup stains, but there was more work to be done. So far we had used adaptive thresholding and edge detection. Today we will use median filters and background removal.


A median filter is an image filter that replaces a pixel with the median value of the pixels surrounding it. In doing this, it smooths the image, and the result is often thought of as the “background” of the image, since it tends to wipe away small features while maintaining broad features.

While the biOps package has a median filter implementation, it isn’t difficult to write a function to do that ourselves, and it can be quite instructive to see how a median filter works.

median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# use the median filter and save the result
filtered = median_Filter(img, 17)
writePNG(filtered, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

[Figure: the output of the median filter with width 17]

What we get after applying a median filter is something that looks like the background of the image. It contains the coffee cup stains and also the shade of the paper upon which the writing appears. With a filter width of 17, the writing has almost entirely faded away.

I didn’t choose that filter width of 17 randomly. It was the result of running models with several different filter widths and seeing which had the best predictive powers.

Median filters are not fast, especially once the filter width increases. This function runs with similar speed to that in the biOps package. I shall leave the task of writing a parallel processing version of the median filter function to the more impatient readers.
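For those impatient readers, here is one possible starting point: a sketch of a parallel median filter using the foreach and doSNOW packages (which the scripts below already load). I haven’t benchmarked it, and splitting the image into horizontal bands is just one choice among many:


# a sketch of a parallel median filter - each worker processes one
# horizontal band of the image
library(foreach)
library(doSNOW)

parallel_median_Filter = function(img, filterWidth, nCores = 4)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 cl = makeCluster(nCores, type = "SOCK")
 registerDoSNOW(cl)
 bands = split(seq_len(nrow(img)), cut(seq_len(nrow(img)), nCores))
 filtered = foreach(rows = bands, .combine = rbind) %dopar%
 {
 out = matrix(0, length(rows), ncol(img))
 for (r in seq_along(rows))
 {
 for (c in seq_len(ncol(img)))
 {
 # the window around image pixel (rows[r], c) sits at padded rows
 # rows[r]..rows[r]+2*pad and columns c..c+2*pad
 window = padded[rows[r] + 0:(2 * pad), c + 0:(2 * pad)]
 out[r, c] = median(window[!is.na(window)])
 }
 }
 out
 }
 stopCluster(cl)
 return (filtered)
}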

While we now have the background, what we really wanted was the foreground – the writing, without the coffee cup stains. The foreground is the difference between the original image and the background. But in this case we know that the writing is always darker than the background, so our foreground should only show pixels that are darker than the background. I have also rescaled the result to lie in the interval [0, 1]. Here is an R script to implement background removal.

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

# run the background removal function and save the result
foreground = background_Removal(img)
writePNG(foreground, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

[Figure: the result of background removal]

While it’s not perfect, the resulting filtered image has done a reasonably good job of separating the writing from the background. It is reasonable to expect that it will be a useful predictor in our model.

This time a filter width of 5 was chosen purely for the purpose of speed. You could have used a grid search to find the “best” filter width parameter.

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
imgClean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

bestRMSE = 1
bestWidth = 5

widths = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25)

for (w in widths)
{
 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 # score the result
 rmse = sqrt(mean( (foreground - imgClean) ^ 2 ))
 if (rmse < bestRMSE)
 {
 bestRMSE = rmse
 bestWidth = w
 print(c(bestWidth, rmse))
 }
}

In the past few blogs I have used the GBM package to create a predictive model. But as we have added more predictors or features, it has started to take a long time to fit the model and to calculate the predictions. So today I’m switching to the xgboost package because the top Kagglers are using it (Owen Zhang, Kaggle’s top ranked competitor says “when in doubt use xgboost“), and because it runs much faster than GBM. I’d never used xgboost until this week, and I must say that I’m quite impressed with its speed.

Here is an R script that fits an xgboost model, using all of the features that we have come up with over the 5 blogs to date.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
 source("http://bioconductor.org/biocLite.R")
 biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
 # fit 3 clusters
 v = img2vec(img)
 km.mod = kmeans(v, 3)
 # allow for the random ordering of the clusters
 oc = order(km.mod$centers)
 # the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
 hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))
 
 # using upper threshold
 imgHi = v
 imgHi[imgHi <= hiThresh] = 0
 imgHi[imgHi > hiThresh] = 1
 
 return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
 img.eb <- Image(t(img))
 img.thresholded.3 = thresh(img.eb, 3, 3)
 img.thresholded.5 = thresh(img.eb, 5, 5)
 img.thresholded.7 = thresh(img.eb, 7, 7)
 img.thresholded.9 = thresh(img.eb, 9, 9)
 img.thresholded.11 = thresh(img.eb, 11, 11)
 img.kmThresh = kmeansThreshold(img)

 # combine the adaptive thresholding
 ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
 ttt.2 = apply(ttt.1, 1, max)
 ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
 return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
 m1 = t(matrix(Img, nrow(Img), ncol(Img)))
 return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 img.erosion.2 = imgBinaryErosion(img.erosion, mask)
 img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
 return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to do a median filter
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }

 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
 
outPath = file.path(outFolder, "trainingdata_blog5.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
 y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)
 
 # threshold the image
 x2 = kmeansThreshold(imgX)

 # adaptive thresholding
 x3 = img2vec(adaptiveThresholding(imgX))
 
 # canny edge detector and related features
 x4 = img2vec(cannyEdges(imgX))
 x5 = img2vec(cannyDilated1(imgX))
 x6 = img2vec(cannyDilated2(imgX))
 
 # median filter and related features
 x7 = img2vec(median_Filter(imgX, 17))
 x8 = img2vec(background_Removal(imgX))

 dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8))
 setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 2.4% vs 4.1%

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)


# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), 
 img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)),img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)) )
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

[Figure: variable importance plot for the xgboost model]

Both the background removal and the median filter features are important, with even higher importance scores than last blog’s canny edge detector features.

[Figure: the predicted output for the sample image]

The coffee cup stain has been mostly removed, especially the speckles that we saw after applying the model in the last blog.

The result represents an improvement in RMSE score on the training data from 4.1% in the last blog to 2.4% in this blog. In the next blog we will further improve on this score.
