At the end of the last blog, we had made some progress in removing the coffee cup stain from the image, but we needed to do more.
Adaptive thresholding has started to separate the writing from the stain, but it has also created a speckled pattern within the stains. We need to engineer a feature that can tell a stroke of writing apart from a speckled local maximum, i.e. distinguish a ridge from a peak in the 3D surface of the image.
In image processing, we do this via edge detection: calculating the slope of the 3D surface of the image and retaining lines where the slope is high. There are several standard edge detection algorithms, and today we will use the Canny edge detector.
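To make that definition concrete, here is a minimal sketch of a gradient-based edge detector in base R. This is my own toy illustration, not the Canny algorithm (Canny adds Gaussian smoothing, non-maximum suppression and hysteresis thresholding on top of the same slope idea), and the cutoff value is an arbitrary assumption. It assumes img is a matrix of brightnesses in [0, 1], such as the output of readPNG:

# a rough gradient-based edge detector: the gradient magnitude
# measures the slope of the 3D surface of the image
gradientEdges = function(img, cutoff = 0.1) {
  nr = nrow(img)
  nc = ncol(img)
  # horizontal and vertical slopes via one-pixel forward differences,
  # duplicating the last row / column at the border
  gx = img[, c(2:nc, nc)] - img
  gy = img[c(2:nr, nr), ] - img
  mag = sqrt(gx ^ 2 + gy ^ 2)
  # keep only the pixels where the slope is steep
  return ((mag > cutoff) * 1)
}

Flat regions of the surface produce no edges; only steep transitions survive the cutoff. Canny's extra smoothing and thresholding steps are what give the clean, thin edges we rely on below.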
The biOps package, which has an implementation of the Canny edge detector, has been removed from CRAN and migrated to Google Code. You will need to follow the installation instructions that can be found here. For example, since I am using 64-bit Windows, I mostly followed these instructions from the website:
Windows 64 bit
- Note: we dropped jpeg and tiff IO functions.
- Download (go to the DLL’s page then download the raw file) libfftw3-3.dll, libfftw3f-3.dll, libfftw3l-3.dll, and zlib1.dll to C:\Program Files\R\R-3.x.x\bin\x64 (x.x needs to be edited) or somewhere on your PATH. Make sure that the downloaded dll files are megabytes in size.
- Download biOps_0.2.2.zip (go to the DLL’s page then download the raw file). Make sure that the file size is around 700KB.
- Run 64 bit R.
- Choose biOps_0.2.2.zip from Packages>Install package(s)…
- Load the library.
> library(biOps)
However, for the package installation step (choosing the zip file from the Packages menu), I used the following R code:
install.packages("C:/Program Files/R/R-3.1.3/bin/biOps_0.2.2.zip")
That’s because I had downloaded biOps_0.2.2.zip into the C:/Program Files/R/R-3.1.3/bin folder. You should substitute the path of the folder into which you downloaded the zip file.
Update!!!: Google has switched off Google Code, but I have found the dlls inside an archive that you can download from here. Warning: the download is large (613.91 MB).
Now we can start to experiment with edge detection. Note that biOps images have pixel brightnesses from 0 to 255 rather than from 0 to 1. So we have to rescale whenever we switch packages.
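As a quick sketch of that rescaling (assuming img is a [0, 1] matrix from readPNG, and mirroring the conversions used in the code below):

# png gives brightnesses in [0, 1]; biOps expects [0, 255]
img.biOps = imagedata(img * 255)                       # into a biOps image
img.back = matrix(img.biOps / 255, nrow(img), ncol(img)) # back to a [0, 1] matrix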
if (!require("pacman")) install.packages("pacman") pacman::p_load(png, raster, biOps) # read in the coffee cup stain image img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png") plot(raster(img)) # convert the image to a biOps image img.biOps = imagedata(img * 255) plot.imagedata(img.biOps) # canny edge detection img.canny = imgCanny(img.biOps, 0.7) plot.imagedata(img.canny)
One of the things I like about this particular edge detection algorithm is that it automatically thresholds the edges so that we are left with only the strong edges. Here are my observations about the behaviour of these edges:
- they surround the writing
- they surround the stains
- there are few edges within a stain, so a lack of edges in a region may be a useful feature for removing the speckles within a stain
- writing has a pair of parallel edges around each stroke, while the boundary of the stain has only a single edge, so the presence of a pair of edges may be a useful feature for separating stains from writing
To take advantage of these observations we shall use image morphology. Dilation is the process of making a line or blob thicker by expanding its boundary by one pixel. Erosion is the opposite, removing a one-pixel-thick layer from the boundary of an object. If we dilate the edges, then the pair of edges around the writing will expand to include the writing inside, and the edge of the stain will also expand.
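To see the mechanics of these operations, here is a minimal base-R sketch: a toy that dilates a 0/1 matrix with a 3×3 structuring element (treating 1 as the foreground), with erosion defined by duality. It is only an illustration under those assumptions, not the biOps implementation used below:

# toy binary dilation: each foreground pixel (1) turns on every pixel
# under a 3x3 structuring element centred on it
dilate3x3 = function(img) {
  nr = nrow(img)
  nc = ncol(img)
  out = img
  for (dr in -1:1) {
    for (dc in -1:1) {
      # shift the image by (dr, dc), replicating pixels at the borders
      rows = pmin(pmax(1:nr + dr, 1), nr)
      cols = pmin(pmax(1:nc + dc, 1), nc)
      out = pmax(out, img[rows, cols])
    }
  }
  return (out)
}

# erosion is the dual operation: dilate the background
erode3x3 = function(img) {
  return (1 - dilate3x3(1 - img))
}

Running dilate3x3 then erode3x3 on a toy image of two close parallel lines fills the gap between them, while a lone line shrinks back to its original width; that is exactly the behaviour we exploit below with imgBinaryDilation and imgBinaryErosion.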
# do some morphology on the edges to fill the gaps between them
mat <- matrix(0, 3, 3)
mask <- imagedata(mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
plot.imagedata(img.dilation)
The writing is all black, whereas most of the stain is white. This will probably be a useful feature for removing the coffee cup stain. But we can do more: now that we have dilated the edges, we can erode them to remove the regions where we started with only a single edge.
This looks pretty good – all of the writing is black, but only a small part of the stain remains. The stain has only a thin line, while the writing has thick lines. So we can erode once, then dilate once, and the thin lines will disappear.
# do some morphology to remove the stain lines
mat <- matrix(0, 3, 3)
mask <- imagedata(mat, "grey", 3, 3)
img.dilation = imgBinaryDilation(img.canny, mask)
img.erosion = imgBinaryErosion(img.dilation, mask)
img.erosion.2 = imgBinaryErosion(img.erosion, mask)
img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
img.dilation.3 = imgBinaryDilation(img.dilation.2, mask)
plot.imagedata(img.dilation.3)
The stain is now almost completely removed! But unfortunately some of the writing has been removed too. So it is an imperfect feature for removing the coffee cup stain.
Let’s put it all together with the existing features that we have developed over the past few blogs, by adding Canny edges and the dilated/eroded edges to the gradient boosted model:
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps)
if (!require("EBImage")) {
  source("http://bioconductor.org/biocLite.R")
  biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img) {
  # fit 3 clusters
  v = img2vec(img)
  km.mod = kmeans(v, 3)
  # allow for the random ordering of the clusters
  oc = order(km.mod$centers)
  # the higher threshold is the halfway point between the top of the middle cluster
  # and the bottom of the highest cluster
  hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

  # using upper threshold
  imgHi = v
  imgHi[imgHi <= hiThresh] = 0
  imgHi[imgHi > hiThresh] = 1

  return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img) {
  img.eb <- Image(t(img))
  img.thresholded.3 = thresh(img.eb, 3, 3)
  img.thresholded.5 = thresh(img.eb, 5, 5)
  img.thresholded.7 = thresh(img.eb, 7, 7)
  img.thresholded.9 = thresh(img.eb, 9, 9)
  img.thresholded.11 = thresh(img.eb, 11, 11)
  img.kmThresh = kmeansThreshold(img)

  # combine the adaptive thresholding
  ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)),
                img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)),
                img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
  ttt.2 = apply(ttt.1, 1, max)
  ttt.3 = matrix(ttt.2, nrow(img), ncol(img))

  return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img) {
  return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img) {
  m1 = t(matrix(Img, nrow(Img), ncol(Img)))
  return(m1)
}

# a function to do canny edge detection
cannyEdges = function(img) {
  img.biOps = imagedata(img * 255)
  img.canny = imgCanny(img.biOps, 0.7)
  return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detection with morphology
cannyDilated1 = function(img) {
  img.biOps = imagedata(img * 255)
  img.canny = imgCanny(img.biOps, 0.7)
  # do some morphology on the edges to fill the gaps between them
  mat <- matrix(0, 3, 3)
  mask <- imagedata(mat, "grey", 3, 3)
  img.dilation = imgBinaryDilation(img.canny, mask)
  img.erosion = imgBinaryErosion(img.dilation, mask)
  return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detection with morphology
cannyDilated2 = function(img) {
  img.biOps = imagedata(img * 255)
  img.canny = imgCanny(img.biOps, 0.7)
  # do some morphology on the edges to fill the gaps between them
  mat <- matrix(0, 3, 3)
  mask <- imagedata(mat, "grey", 3, 3)
  img.dilation = imgBinaryDilation(img.canny, mask)
  img.erosion = imgBinaryErosion(img.dilation, mask)
  img.erosion.2 = imgBinaryErosion(img.erosion, mask)
  img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
  return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames) {
  print(f)
  imgX = readPNG(file.path(dirtyFolder, f))
  imgY = readPNG(file.path(cleanFolder, f))

  # turn the images into vectors
  x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
  y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

  # threshold the image
  x2 = kmeansThreshold(imgX)

  # adaptive thresholding
  x3 = img2vec(adaptiveThresholding(imgX))

  # canny edge detector and related features
  x4 = img2vec(cannyEdges(imgX))
  x5 = img2vec(cannyDilated1(imgX))
  x6 = img2vec(cannyDilated2(imgX))

  dat = data.table(cbind(y, x, x2, x3, x4, x5, x6))
  setnames(dat, c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
  write.table(dat, file = outPath, append = (f != filenames[1]), sep = ",",
              row.names = FALSE, col.names = (f == filenames[1]), quote = FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive + canny + cannyDilated1 + cannyDilated2,
              data = dat[rows,], n.trees = 10000, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 # change the 6 to your number of CPU cores, or maybe lower due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat)) %% numCores)
yHat = foreach(i = unique(split_testing), .combine = c, .packages = c("gbm")) %dopar% {
  as.numeric(predict(gbm.mod, newdata = dat[split_testing == i,], n.trees = best.iter))
}
stopCluster(cl)
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1

# what score do we get on the training data?
rmse = sqrt(mean((yHat - dat$y) ^ 2))
print(rmse) # 4.1%

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img),
               img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)),
               img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)))
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
yHatImg = predict(gbm.mod, newdata = x, n.trees = best.iter)
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
The first dilation operation that we tried in this blog is the most powerful predictor that we developed today, even more powerful than the adaptive thresholding that we used in the last blog. We have improved the RMSE score on the training set from 5.4% to 4.1%.
The model doesn’t completely remove the coffee cup stain, but it has faded the stain enough that we have a good chance at removing it later. In the next blog in this series, we will do more to clean up the coffee cup stain.
Great post! I also wanted to note, for any Mac users, that in order to install the biOps package, you can run the following lines of code:
# run this line in the command line
brew install libtiff jpeg fftw
# run these lines in R
install.packages("devtools")
library(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/biOps/biOps_0.2.2.tar.gz")
library(biOps)
Hello, Colin,
Great article. I am trying to install biOps as instructed here, but it seems that the dll files and the package itself are not available any more from the address you provided.
Could you share a copy of them somewhere?
Thanks
The links work for me. I note that biOps was removed well before I wrote this blog, so you must follow the instructions in my blog about how to install it. You can’t just install from CRAN.
Colin
Hi, Colin,
The links work, but when you click the dll files or the package zip file, it shows “SVN hosting has been permanently disabled.”
It seems that those files are not available any more.
The link from the blog https://code.google.com/archive/p/rimagebook/wikis/biOpsInstallationEn.wiki takes you to the instructions to install.
Thanks for the heads up. Google has switched off Google Code:
https://opensource.googleblog.com/2015/03/farewell-to-google-code.html
I found the dlls inside an archive that you can download from https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/rimagebook/source-archive.zip
Warning: the download is large (613.91 MB).
Great! Thank you very much, Colin.
I got biOps installed successfully.
I tried to install biOps using the old archive until I found this blog.
Thank you for this information.
Hello, Collin!
Thank you for your work. I tried to install biOps, but I repeatedly got the message “Failed with error: ‘package ‘biOps’ is not installed for ‘arch = x64’’”.
Can you tell me what the problem is?
Hi Kim,
That error message sounds like a system incompatibility problem. Are you using Windows?
By the way, my name is Colin not Collin.
Colin
I’m so sorry for that…. Colin
I’m using Windows 10. Can’t I use the biOps package in Windows?