Keeping Up With The Latest Techniques

~ brief insights

Tag Archives: Edge Detection

Denoising Dirty Documents: Part 11

Sunday, 8 November 2015

Posted by Colin Priest in Adaptive Thresholding, Background Removal, Deep Learning, Edge Detection, h2o, Image Processing, Kaggle, Machine Learning, Median Filter, Morphology, R


In my last blog I showed how to use convolutional neural networks to build a model that removed stains from an image. While convolutional neural networks seem to be well suited to image processing, in this competition I found that plain fully connected deep neural networks performed better. In this blog I show how to build these models.

[Image: "deep water" warning sign]

Since I wanted to use R, had limited RAM, and didn't have a powerful GPU, I chose to use h2o to build the models. That way I could do the feature engineering in R, pass the data to h2o, let h2o build a model, then get the predicted values back in R. The memory management would be done in h2o, whose deep learning algorithms adapt to the available RAM. So I guess this combination of deep learning and h2o could be called "deep water" 😉

For my final competition submission I used an ensemble of models, including 3 deep learning models built with R and h2o. Each of the 3 deep learning models used different feature engineering:

  • median based feature engineering
  • edge based feature engineering
  • threshold based feature engineering

This blog shows the details of the median based model. I leave it to the reader to implement the edge based and threshold based models using the image processing scripts from my earlier blogs in this series.
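For example, a threshold based variant only needs the feature engineering swapped out. Here is a hedged sketch of what that swap might look like, reusing the kmeansThreshold and adaptiveThresholding functions that are reproduced in the Part 4 script further down this page; note that thresholdFeatures is a hypothetical helper name for illustration, not something from my original model:


# hypothetical helper: per-pixel threshold based features, one column per
# feature, ready to be windowed the same way that img2tab windows its features;
# kmeansThreshold and adaptiveThresholding are defined in the Part 4 script below
thresholdFeatures = function(imgX)
{
  cbind(kmeansThreshold(imgX), img2vec(adaptiveThresholding(imgX)))
}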

If you don’t already have h2o installed on your computer, then you can install it directly from R. At the time of writing this blog, you could install h2o using the following script:


# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/R")))

That script will need to be changed as new versions of h2o are released, so use the latest installation instructions from the h2o download page.

Once h2o is installed, you can interface with h2o from R using the CRAN package.


install.packages("h2o")
library(h2o)

Median based image processing is used for feature engineering in this example, but you could use any combination of image processing techniques for your feature engineering. I got better performance using separate deep learning models for different types of image processing, but that may be because I had limited computing resources. If you have more computing resources than I do, then maybe you will be successful with a single large model that uses all of the image processing techniques to create features.


# a function to turn a matrix image into a vector
img2vec = function(img)
{
  return (matrix(img, nrow(img) * ncol(img), 1))
}

median_Filter = function(img, filterWidth)
{
  pad = floor(filterWidth / 2)
  padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
  padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

  tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
  k = 1
  for (i in seq_len(filterWidth))
  {
    for (j in seq_len(filterWidth))
    {
      tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
      k = k + 1
    }
  }

  filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
  return (matrix(filtered, nrow(img), ncol(img)))
}
 
# a function that uses a median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
  w = 5
  p = 1.39
  th = 240

  # the background is found via a median filter
  background = median_Filter(img, w)

  # the foreground is darker than the background
  foreground = img / background
  foreground[foreground > 1] = 1

  foreground2 = foreground ^ p
  foreground2[foreground2 >= (th / 255)] = 1

  return (matrix(foreground2, nrow(img), ncol(img)))
}

img2tab = function(imgX, f)
{
  median5 = img2vec(median_Filter(imgX, 5))
  median17 = img2vec(median_Filter(imgX, 17))
  median25 = img2vec(median_Filter(imgX, 25))
  backgroundRemoval = img2vec(background_Removal(imgX))
  foreground = readPNG(file.path(foregroundFolder, f))

  # pad out imgX
  padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
  offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))

  # raw pixels window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
  x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # median5 window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
  x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # median17 window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
  x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # median25 window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
  x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # background removal window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
  x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # foreground window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
  x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  dat = data.table(cbind(x, x2, x3, x4, x5, x6))
  setnames(dat, c(
    paste("x", seq_len((2*padding+1)^2), sep=""),
    paste("median5", seq_len((2*padding+1)^2), sep=""),
    paste("median17", seq_len((2*padding+1)^2), sep=""),
    paste("median25", seq_len((2*padding+1)^2), sep=""),
    paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
    paste("foreground", seq_len((2*padding+1)^2), sep="")
  ))

  return (dat)
}

If you’ve been following my blog, then you will see that there’s nothing new in the two image processing functions shown above.
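As a quick sanity check, you can run the median filter over a toy matrix and confirm that an isolated bright speckle gets smoothed away (a made-up example, not part of the competition pipeline):


# a toy image: mid-grey everywhere except one bright speckle
img = matrix(0.5, 4, 4)
img[2, 3] = 1
round(median_Filter(img, 3), 2) # the speckle is replaced by the local median, 0.5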

To build the model you will need to start h2o, import the data and tell h2o to create a deep learning model.
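Note that the two h2o.importFile calls below assume that the feature tables have already been written to CSV, a step that doesn't appear in this post. Here is a minimal sketch of how the training file might be generated, reusing img2tab and following the same write.table pattern as the Part 4 script below (and assuming dirtyFolder, cleanFolder, foregroundFolder, outPath and padding are set as in the full script at the end of this post):


# hedged reconstruction of the missing training data step: the cleaned pixel
# brightnesses go in the first column, matching y = 1 in h2o.deeplearning below
filenames = list.files(dirtyFolder)
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  dat = img2tab(imgX, f)
  dat = cbind(data.table(y = as.numeric(img2vec(imgY))), dat)

  write.table(dat, file = outPath, append = (f != filenames[1]), sep = ",",
              row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)
}

A similar loop over held-out images would write the validation file to outPath2.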

h2oServer = h2o.init(nthreads = 6, max_mem_size = "10G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1,
  training_frame = trainData, validation_frame = testData,
  score_training_samples = 0,
  overwrite_with_best_model = TRUE,
  activation = "Rectifier", seed = 1,
  hidden = c(200, 200, 200), epochs = 15,
  adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive",
  loss = "MeanSquare",
  fast_mode = T, diagnostics = T, ignore_const_cols = T,
  force_load_balance = T)


You should change the h2o.init parameters according to the hardware on your computer. I’m running my model on a PC with 8 CPUs and 16GB of RAM, so I left a couple of CPUs free to do the user interface and core operating system functionality, plus some RAM for the operating system. Scale these parameters up or down if your PC specifications are more or less powerful than mine.
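If you are not sure how many cores your machine has, base R can tell you; a small hedged variant of the initialisation above:


# count the logical cores, then leave a couple free for the OS and user interface
nCores = parallel::detectCores()
h2oServer = h2o.init(nthreads = nCores - 2, max_mem_size = "10G")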

The model may take a few hours to fit, and during that time the R session will be blocked. So if you want to see how the model is progressing, point your browser to localhost (port 54321 on my PC, but maybe a different port on yours) and use the h2o web interface to see what is happening.

You can get the predicted values using the following script:


filenames = list.files(dirtyFolder)
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))

  dat = img2tab(imgX, f)

  x.h2o = as.h2o(h2oServer, dat)
  predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
  imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))

  # correct the pixel brightnesses that are out of bounds
  imgOut[imgOut > 1] = 1
  imgOut[imgOut < 0] = 0

  writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()

Running predictions is as simple as creating a data file, importing it to h2o, and then asking h2o to give you the predicted values from your already fitted model. I found that some of the raw predicted values were out of the [0, 1] range, and improved my leaderboard score by limiting the predicted values to lie within this range.

You do not need to shut down h2o after you finish running a model. In fact you may wish to leave it running so that you can do model diagnostics or run more predictions.
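For example, while the cluster is still up you can pull diagnostics straight from the fitted model, as the full script below does:


# training and validation metrics for the fitted deep learning model
summary(model.dl.median)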

If you wish to save a copy of your model, for later reuse, then you can use the following syntax:


modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

Just remember that h2o needs to be running when you save models or load previously saved models.
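A hedged example of reloading, using the modelPath returned by h2o.saveModel above:


# h2o must be running before you can reload a saved model
model.dl.median = h2o.loadModel(modelPath)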

In my next, and final, blog in this series, I will show how to take advantage of the second information leakage in the competition.

For those who want the entire R script to try out for themselves, here it is:


install.packages("h2o")
library(h2o)
library(png)
library(data.table)

# a function to turn a matrix image into a vector
img2vec = function(img)
{
  return (matrix(img, nrow(img) * ncol(img), 1))
}

median_Filter = function(img, filterWidth)
{
  pad = floor(filterWidth / 2)
  padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
  padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

  tab = matrix(0, nrow(img) * ncol(img), filterWidth * filterWidth)
  k = 1
  for (i in seq_len(filterWidth))
  {
    for (j in seq_len(filterWidth))
    {
      tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
      k = k + 1
    }
  }

  filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
  return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses a median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
  w = 5
  p = 1.39
  th = 240

  # the background is found via a median filter
  background = median_Filter(img, w)

  # the foreground is darker than the background
  foreground = img / background
  foreground[foreground > 1] = 1

  foreground2 = foreground ^ p
  foreground2[foreground2 >= (th / 255)] = 1

  return (matrix(foreground2, nrow(img), ncol(img)))
}

dirtyFolder = "./data/train"
cleanFolder = "./data/train_cleaned"
outFolder = "./model"
foregroundFolder = "./foreground/train foreground"

outPath = file.path(outFolder, "trainingdata.csv")
outPath2 = file.path(outFolder, "testdata.csv")
filenames = list.files(dirtyFolder)
padding = 2
set.seed(1)
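
# NOTE: the step that writes trainingdata.csv and testdata.csv was missing from
# this script as published; the hedged reconstruction from the body of this post
# goes here, and it needs img2tab, which is defined further down, so run that
# definition first
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))
  dat = img2tab(imgX, f)
  dat = cbind(data.table(y = as.numeric(img2vec(imgY))), dat)
  write.table(dat, file = outPath, append = (f != filenames[1]), sep = ",",
              row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)
}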
h2oServer = h2o.init(nthreads = 15, max_mem_size = "110G")

trainData = h2o.importFile(h2oServer, path = outPath)
testData = h2o.importFile(h2oServer, path = outPath2)

model.dl.median <- h2o.deeplearning(x = 2:ncol(trainData), y = 1,
  training_frame = trainData, validation_frame = testData,
  score_training_samples = 0,
  overwrite_with_best_model = TRUE,
  activation = "Rectifier", seed = 1,
  hidden = c(200, 200, 200), epochs = 15,
  adaptive_rate = TRUE, initial_weight_distribution = "UniformAdaptive",
  loss = "MeanSquare",
  fast_mode = T, diagnostics = T, ignore_const_cols = T,
  force_load_balance = T)

summary(model.dl.median)

modelPath = h2o.saveModel(model.dl.median, dir = "./model", name = "model_dnn_median", force = TRUE)

outFolder = "./model/training data"

img2tab = function(imgX, f)
{
  median5 = img2vec(median_Filter(imgX, 5))
  median17 = img2vec(median_Filter(imgX, 17))
  median25 = img2vec(median_Filter(imgX, 25))
  backgroundRemoval = img2vec(background_Removal(imgX))
  foreground = readPNG(file.path(foregroundFolder, f))

  # pad out imgX
  padded = matrix(0, nrow(imgX) + padding * 2, ncol(imgX) + padding * 2)
  offsets = expand.grid(seq_len(2*padding+1), seq_len(2*padding+1))

  # raw pixels window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = imgX
  x = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # median5 window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median5
  x2 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # median17 window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median17
  x3 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # median25 window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = median25
  x4 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # background removal window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = backgroundRemoval
  x5 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  # foreground window
  padded[padding + seq_len(nrow(imgX)), padding + seq_len(ncol(imgX))] = foreground
  x6 = sapply(seq_len((2*padding+1)^2), function(x) img2vec(padded[offsets[x, 2] - 1 + seq_len(nrow(imgX)), offsets[x, 1] - 1 + seq_len(ncol(imgX))]))

  dat = data.table(cbind(x, x2, x3, x4, x5, x6))
  setnames(dat, c(
    paste("x", seq_len((2*padding+1)^2), sep=""),
    paste("median5", seq_len((2*padding+1)^2), sep=""),
    paste("median17", seq_len((2*padding+1)^2), sep=""),
    paste("median25", seq_len((2*padding+1)^2), sep=""),
    paste("backgroundRemoval", seq_len((2*padding+1)^2), sep=""),
    paste("foreground", seq_len((2*padding+1)^2), sep="")
  ))

  return (dat)
}

dirtyFolder = "./data/test"
outFolder = "./model/test data"
foregroundFolder = "./foreground/test foreground"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))

  dat = img2tab(imgX, f)

  x.h2o = as.h2o(h2oServer, dat)
  predict.dl = as.data.frame(h2o.predict(model.dl.median, newdata = x.h2o))
  imgOut = matrix(as.numeric(predict.dl$predict), nrow(imgX), ncol(imgX))

  # correct the pixel brightnesses that are out of bounds
  imgOut[imgOut > 1] = 1
  imgOut[imgOut < 0] = 0

  writePNG(imgOut, file.path(outFolder, f))
}

h2o.shutdown()


Denoising Dirty Documents: Part 4

Friday, 21 August 2015

Posted by Colin Priest in Edge Detection, Image Processing, Kaggle, Machine Learning, Morphology, R


At the end of the last blog, we had made some progress in removing the coffee cup stain from the image, but we needed to do more.

[Figure: output image from the end of the previous post]

Adaptive thresholding has started to separate the writing from the stain, but it has also created a speckled pattern within the stains. We need to engineer a feature that can tell a stroke of writing apart from a speckled local maximum, i.e. distinguish a ridge from a peak in the 3D surface.


In image processing, we do this via edge detection, which is the process of calculating the slope of the 3D surface of the image and retaining lines where the slope is high. There are several standard algorithms for edge detection, and today we will use the Canny edge detector.

The biOps package, which has an implementation of the Canny edge detector, has been removed from CRAN and migrated to Google Code, so you will need to follow the installation instructions that can be found there. Since I am using 64 bit Windows, I mostly followed these instructions from the web site:

Windows 64 bit

    1. Note: we dropped jpeg and tiff IO functions.
    2. Download (go to the DLL’s page then download the raw file) libfftw3-3.dll, libfftw3f-3.dll, libfftw3l-3.dll, and zlib1.dll to C:\Program Files\R\R-3.x.x\bin\x64 (x.x needs to be edited) or somewhere present in the PATH variables. Make sure that the downloaded dll files are MB in file size, not KB (which would mean you saved the web page rather than the raw file).
    3. Download biOps_0.2.2.zip (go to the DLL’s page then download the raw file). Make sure that the file size is around 700KB.
    4. Run 64 bit R.
    5. Choose biOps_0.2.2.zip from Packages>Install package(s)…
    6. Load the library.
> library(biOps)

However, for step 5 I used the following R code:


install.packages("C:/Program Files/R/R-3.1.3/bin/biOps_0.2.2.zip")

That’s because I had downloaded biOps_0.2.2.zip into the C:/Program Files/R/R-3.1.3/bin folder. You should substitute the folder path that you downloaded the zip file into.

Update: Google has since switched off Google Code, but I have found the dlls inside an archive that you can download from here. Warning: the download is large (613.91MB).

Now we can start to experiment with edge detection. Note that biOps images have pixel brightnesses from 0 to 255 rather than from 0 to 1. So we have to rescale whenever we switch packages.


if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, biOps)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
plot(raster(img))

# convert the image to a biOps image
img.biOps = imagedata(img * 255)
plot.imagedata(img.biOps)

# canny edge detection
img.canny = imgCanny(img.biOps, 0.7)
plot.imagedata(img.canny)

[Figure: Canny edge detection applied to the coffee cup stain image]

One of the things I like about this particular edge detection algorithm is that it has automatically thresholded the edges so that we are only left with the strong edges. So here are my observations about the behaviour of the edges:

  • they surround the writing
  • they surround the stains
  • there are few edges within a stain, so a lack of edges in a region may be a useful feature for removing the speckles within a stain
  • writing has a pair of parallel edges around each stroke, while the boundary of the stain has only a single edge, so the presence of a pair of edges may be a useful feature for separating stains from writing

To take advantage of these observations we shall use image morphology. Dilation is the process of making a line or blob thicker by expanding its boundary by one pixel. Erosion is the opposite: removing a one pixel thick layer from the boundary of an object. If we dilate the edges, then the pair of edges around the writing will expand to include the writing inside, and the edge of the stain will also expand.


# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
plot.imagedata(img.dilation)

[Figure: edges after dilation]

The writing is all black, whereas most of the stain is white. This will probably be a useful feature for removing the coffee cup stain. But we can do more: now that we have dilated the edges, we can erode them to remove where we started with single edges.

[Figure: edges after dilation followed by erosion]

This looks pretty good: all of the writing is black, but only a small part of the stain remains. The stain has a thin line, while the writing has thick lines. So we can erode once, then dilate once, and the thin lines will disappear.


# do some morphology to remove the stain lines
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
img.dilation.3 = imgBinaryDilation(img.dilation.2, mask)
plot.imagedata(img.dilation.3)

[Figure: edges after the extra erode / dilate sequence, with the stain lines mostly removed]

The stain is now almost completely removed! But unfortunately some of the writing has been removed too. So it is an imperfect feature for removing the coffee cup stain.

Let’s put it all together with the existing features that we have developed over the past few blogs, by adding canny edges and the dilated / eroded edges to the gradient boosted model:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  # turn the images into vectors
  x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
  y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

  # threshold the image
  x2 = kmeansThreshold(imgX)

  # adaptive thresholding
  x3 = img2vec(adaptiveThresholding(imgX))

  # Canny edge detector and related features
  x4 = img2vec(cannyEdges(imgX))
  x5 = img2vec(cannyDilated1(imgX))
  x6 = img2vec(cannyDilated2(imgX))

  dat = data.table(cbind(y, x, x2, x3, x4, x5, x6))
  setnames(dat, c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
  write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive + canny + cannyDilated1 + cannyDilated2, data = dat[rows,], n.trees = 10000, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 #change the 6 to your number of CPU cores. or maybe lower due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat)) %% numCores)
yHat = foreach(i = unique(split_testing), .combine = c, .packages = c("gbm")) %dopar% {
  as.numeric(predict(gbm.mod, newdata = dat[split_testing == i,], n.trees = best.iter))
}
stopCluster(cl)
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 4.1%

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)))
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
yHatImg = predict(gbm.mod, newdata=x, n.trees = best.iter)
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

[Figure: predicted output for training image 3]

The first dilation operation that we tried in this blog is the most powerful predictor that we developed today, even more powerful than the adaptive thresholding that we used in the last blog. We have improved the RMSE score on the training set from 5.4% to 4.1%.

[Figure: the cleaned sample image written by the script above]

The model doesn’t completely remove the coffee cup stain, but it has faded the stain enough that we have a good chance at removing it later. In the next blog in this series, we will do more to clean up the coffee cup stain.

