At the end of the last blog, we had made some progress in removing the coffee cup stain from the image, but we needed to do more.

[Image: 20150815 output 5]

Adaptive thresholding has started to separate the writing from the stain, but it has also created a speckled pattern within the stains. We need to engineer a feature that can tell a stroke of writing apart from a speckled local maximum, i.e. distinguish a ridge from a peak in the 3D surface.

[Image: illustration of a ridge versus a peak]

In image processing, we do this via edge detection, which is the process of calculating the slope of the 3D surface of the image and retaining lines where the slope is high. There are several standard edge detection algorithms, and today we will use the Canny edge detector.
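Before diving into a proper edge detector, here is a rough sketch (my own illustration, not part of the competition pipeline) of what "the slope of the 3D surface" means: it approximates the gradient magnitude of a greyscale image matrix, such as the [0, 1] matrices returned by readPNG, by differencing neighbouring pixels. The function name and approach are mine; a real edge detector such as Canny's does much more than this (smoothing, thresholding and edge thinning).

# a rough sketch only: approximate the slope of the image surface by
# differencing neighbouring pixels in the horizontal and vertical directions
gradientMagnitude = function(img)
{
# horizontal differences, padded with the last column to keep the original size
dx = cbind(img[, -1], img[, ncol(img)]) - img
# vertical differences, padded with the last row to keep the original size
dy = rbind(img[-1, ], img[nrow(img), ]) - img
# the slope is high wherever either difference is large
return (sqrt(dx ^ 2 + dy ^ 2))
}

# e.g. plot(raster(gradientMagnitude(img))) to see where the slopes are strong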

The biOps package, which has an implementation of the Canny edge detector, has been removed from CRAN and migrated to Google Code. You will need to follow the installation instructions that can be found here. For example, since I am using 64-bit Windows, I mostly followed these instructions from the web site:

Windows 64 bit

    1. Note: we dropped jpeg and tiff IO functions.
    2. Download (go to the DLL’s page then download the raw file) libfftw3-3.dll, libfftw3f-3.dll, libfftw3l-3.dll, and zlib1.dll to C:\Program Files\R\R-3.x.x\bin\x64 (x.x needs to be edited) or somewhere present in the PATH variables. Make sure that the downloaded dll files are MB in file size.
    3. Download biOps_0.2.2.zip (go to the DLL’s page then download the raw file). Make sure that the file size is around 700KB.
    4. Run 64 bit R.
    5. Choose biOps_0.2.2.zip from Packages>Install package(s)…
    6. Load the library.
> library(biOps)

However, for step 5 I used the following R code:


install.packages("C:/Program Files/R/R-3.1.3/bin/biOps_0.2.2.zip")

That’s because I had downloaded biOps_0.2.2.zip into the C:/Program Files/R/R-3.1.3/bin folder. You should substitute the folder path that you downloaded the zip file into.

Update!!!: Google has switched off Google Code. But I have found the dlls inside an archive that you can download from here. Warning: the download is large (613.91MB).

Now we can start to experiment with edge detection. Note that biOps images have pixel brightnesses from 0 to 255 rather than from 0 to 1. So we have to rescale whenever we switch packages.
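For example (a quick sketch, assuming img is the [0, 1] matrix returned by readPNG):

# readPNG scale (0 to 1) to biOps scale (0 to 255) ...
img.biOps = imagedata(img * 255)
# ... and back again to a plain [0, 1] matrix
img.back = matrix(img.biOps / 255, nrow(img), ncol(img))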


if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, biOps)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
plot(raster(img))

# convert the image to a biOps image
img.biOps = imagedata(img * 255)
plot.imagedata(img.biOps)

# canny edge detection
img.canny = imgCanny(img.biOps, 0.7)
plot.imagedata(img.canny)

[Image: 20150822 output 1]

One of the things I like about this particular edge detection algorithm is that it has automatically thresholded the edges, so that we are only left with the strong edges. Here are my observations about the behaviour of the edges:

  • they surround the writing
  • they surround the stains
  • there are few edges within a stain, so a lack of edges in a region may be a useful feature for removing the speckles within a stain
  • writing has a pair of parallel edges around each stroke, while the boundary of the stain has only a single edge, so the presence of a pair of edges may be a useful feature for separating stains from writing

To take advantage of these observations we shall use image morphology. Dilation is the process of making a line or blob thicker by expanding its boundary by one pixel. Erosion is the opposite, removing a one-pixel-thick layer from the boundary of an object. If we dilate the edges, then the pair of edges around each stroke of writing will expand to include the writing inside, and the edge of the stain will also expand.


# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
plot.imagedata(img.dilation)

[Image: 20150822 output 3]

The writing is all black, whereas most of the stain is white. This will probably be a useful feature for removing the coffee cup stain. But we can do more: now that we have dilated the edges, we can erode them, which shrinks back the regions where we started with only a single edge.
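The code for that erosion step isn't shown in the snippet above, but it is just one more call on top of the dilation code (a sketch; img.dilation and mask come from the previous snippet, and the same imgBinaryErosion call appears in the full script further down):

# erode the dilated edges: thin single edges shrink back, while the solid writing stays black
img.erosion = imgBinaryErosion(img.dilation, mask)
plot.imagedata(img.erosion)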

[Image: 20150822 output 2]

This looks pretty good – all of the writing is black, but only a small part of the stain remains. The stain is left as a thin line, while the writing has thick lines. So we can erode once, then dilate once, and the thin lines will disappear.


# do some morphology to remove the stain lines
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
img.dilation.3 = imgBinaryDilation(img.dilation.2, mask)
plot.imagedata(img.dilation.3)

[Image: 20150822 output 4]

The stain is now almost completely removed! But unfortunately some of the writing has been removed too. So it is an imperfect feature for removing the coffee cup stain.

Let’s put it all together with the existing features that we have developed over the past few blogs, by adding the Canny edges and the dilated / eroded edges to the gradient boosted model:


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(img.kmThresh))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6))
setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive + canny + cannyDilated1 + cannyDilated2, data = dat[rows,], n.trees = 10000, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 # change the 6 to your number of CPU cores, or fewer if RAM is limited
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat))%%numCores)
yHat = foreach(i=unique(split_testing),.combine=c,.packages=c("gbm")) %dopar% {
as.numeric(predict(gbm.mod, newdata=dat[split_testing==i,], n.trees = best.iter))
}
stopCluster(cl)
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 4.1%

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)))
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
yHatImg = predict(gbm.mod, newdata=x, n.trees = best.iter)
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

[Image: 20150822 output 5]

The first dilation / erosion operation that we tried in this blog (the cannyDilated1 feature) is the most powerful predictor that we developed today, even more powerful than the adaptive thresholding that we used in the last blog. We have improved the RMSE score on the training set from 5.4% to 4.1%.
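You can check the relative importance of the features for yourself from the table stored in s by the summary(gbm.mod) call in the script above (summary.gbm also draws a bar chart of relative influence by default). For example, a quick sketch:

# the features ranked by relative influence in the gbm
print(s)
# or plot the ranking explicitly
barplot(s$rel.inf, names.arg = as.character(s$var), las = 2, ylab = "relative influence")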

[Image: 20150822 sample]

The model doesn’t completely remove the coffee cup stain, but it has faded the stain enough that we have a good chance at removing it later. In the next blog in this series, we will do more to clean up the coffee cup stain.