Keeping Up With The Latest Techniques

~ brief insights

Tag Archives: Variable Importance

Denoising Dirty Documents: Part 6

07 Monday Sep 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

≈ 7 Comments

Tags

Image Processing, Kaggle, Machine Learning, Median Filter, R, Variable Importance, XGBoost

So far in this series of blogs we have used image processing techniques to improve the images, and then ensembled together the results of that image processing using GBM or XGBoost. But I have noticed that some competitors have achieved reasonable results using purely machine learning approaches. While these pure machine learning approaches aren’t enough for the competitors to get to the top of the leader board, they have outperformed some of the models that I have presented in this series of blogs. However these scripts were invariably written in Python and I thought that it would be great to see how to use R to build a similar type of model (except better, because we will include all of the image processing predictors that we have developed so far). So today we will add a brute-force machine learning approach to our model.

At the time of writing this blog, the top-ranked competitor who has shared their script is placed 17th with an RMSE of 2.6%. While I am not very experienced in Python, I can usually figure out what a Python script is doing if I read it. So here's what I think that script is doing:

  1. pad out each image by an extra 2 pixels (see my last blog for how to pad out an image in R)
  2. run a 3×3 sliding window along the image pixels (see my last blog for how to create a sliding window in R)
  3. use all 9 pixels within the sliding window as predictors for the pixel in the centre of the sliding window
  4. use a random forest model to predict the pixel brightnesses

This is a pure machine learning approach because it doesn’t do any image processing to pre-process the image. It simply says that if you want to predict a particular pixel brightness, then you should probably look around at the brightnesses of the pixels either side of that pixel.
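Out of curiosity, here is a minimal R sketch of that pure machine-learning pipeline, assuming the png and randomForest packages and illustrative file paths; I pad by 1 pixel, which is all a 3×3 window needs, and fit the random forest on a subsample to keep memory under control.

library(png)
library(randomForest)

# turn a matrix image into a column vector
img2vec = function(img) matrix(img, nrow(img) * ncol(img), 1)

# build one predictor column per cell of a (2*pad+1) x (2*pad+1) window centred on each pixel
neighbourhoodFeatures = function(img, pad = 1)
{
 width = 2 * pad + 1
 padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 tab = matrix(0, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
 {
 for (j in seq_len(width))
 {
 tab[, k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }
 return (tab)
}

dirty = readPNG("train/3.png")          # illustrative paths
clean = readPNG("train_cleaned/3.png")

x = neighbourhoodFeatures(dirty)        # 9 predictors per pixel for a 3x3 window
y = img2vec(clean)

# fit on a subsample to keep memory manageable
set.seed(1)
rows = sample(nrow(x), 50000)
rf = randomForest(x = x[rows, ], y = y[rows], ntree = 100)

# predict every pixel and reshape back into an image
denoised = matrix(predict(rf, x), nrow(dirty), ncol(dirty))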

While I don’t want to fuel the R versus Python language wars, I do want to create a model in R that can outperform my competitors. So instead of using a 3×3 sliding window, I will use a 5×5 sliding window. Because the random forest implementation in R tends to run out of RAM on my computer, I will use XGBoost.

Here is how I create a 5×5 sliding window.


# read in a sample image (img2vec is the matrix-to-vector helper defined in Part 5 of this series)
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# create a padded image within which we will embed our source image
pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

# create a matrix of predictor values - each column is a pixel from the sliding window
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

head(tab[,1:4])

20150904 output 1

When I noticed that this script ran a bit slowly, I looked at how I wrote the loops. R does not like manual looping, and it is very inefficient at appending data. So I rewrote the script to pre-allocate space for all of the cells rather than appending each column as it was calculated.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

This modification gave me a tenfold improvement in speed! It’s a reminder that I shouldn’t be lazy when writing R scripts.
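If you want to verify the speed-up on your own machine, base R's system.time() is enough; slidingWindowAppend() and slidingWindowPrealloc() below are hypothetical wrappers around the two versions of the loop shown above.

timeAppend = system.time(tabA <- slidingWindowAppend(img))     # the cbind version
timePrealloc = system.time(tabB <- slidingWindowPrealloc(img)) # the pre-allocated version
print(rbind(append = timeAppend, prealloc = timePrealloc))
all.equal(tabA, tabB)   # both versions should produce the same feature table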

There's one small issue to consider before we put this all together into a new, more powerful predictive model. In the code above I used a brightness value of 1 when padding out the image, but that doesn't look natural. The image below shows a clear white border around the page. This could confuse the machine learning algorithm, as it could waste time learning what to do with pure white pixels.

20150904 output 2

So instead it makes sense to pad out the image with a background brightness. One way to do this is to replicate the edge of the original image.


pad = 2
width = 2 * pad + 1
padded = matrix(1, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
# fill in the padded edges
padded[,1:pad] = padded[,pad + 1]
padded[,ncol(img) + pad + 1:pad] = padded[,ncol(padded) - pad]
padded[1:pad,] = padded[pad + 1,]
padded[nrow(img) + pad + 1:pad,] = padded[nrow(padded) - pad,]

20150904 output 3

This looks more natural, except where there is writing at the edge of the image. Another way is to pad out the pixels using the median of the entire image.


pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

20150904 output 4

I haven’t tested which approach works best. So I leave it to the reader to compare the results from the three different padding approaches, and use whichever gives the best result (although I suspect that there won’t be much difference between the second and third approaches).
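Here is a sketch of one way to run that comparison, assuming you wrap the three padding snippets above in hypothetical helper functions padWhite(img, pad), padReplicate(img, pad) and padMedian(img, pad), each returning the padded matrix; the xgb.cv call follows the same (older) xgboost conventions as the scripts later in this post.

comparePadding = function(img, imgClean, padFun, pad = 2)
{
 width = 2 * pad + 1
 padded = padFun(img, pad)

 # build the 5x5 sliding window features from the padded image
 tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
 k = 1
 for (i in seq_len(width))
 {
 for (j in seq_len(width))
 {
 tab[, k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 # cross-validate a small xgboost model and report its best RMSE
 cv = xgb.cv(data = xgb.DMatrix(tab, label = img2vec(imgClean)), nrounds = 200, nfold = 3, eval_metric = "rmse", early.stop.round = 20, print.every.n = 50)
 return (min(cv[, test.rmse.mean]))
}

img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
imgClean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")
sapply(list(white = padWhite, edge = padReplicate, median = padMedian), function(f) comparePadding(img, imgClean, f))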

Let’s pull it all together, and add the surrounding pixels as a predictor to a more complete model.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
img.eb = Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# combine the adaptive thresholding
ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# img2vec, Image2Mat, cannyEdges, cannyDilated1, cannyDilated2, median_Filter
# and background_Removal are unchanged from the previous blog in this series
# (their definitions are reproduced in the Part 5 post below)

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
pad = 2
width = 2 * pad + 1
padded = matrix(median(img), nrow(img) + 2 * pad, ncol(img) + 2 * pad)
padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog6.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# adaptive thresholding
x3 = img2vec(adaptiveThresholding(imgX))

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 1.4% vs 2.4%

# compute the feature importance matrix
model = xgb.dump(xgb.mod, with.stats=TRUE)
names = names(dat)[-1]
importance_matrix = xgb.importance(names, model=xgb.mod)
gp = xgb.plot.importance(importance_matrix)
print(gp)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)),
 img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)), img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)), proximalPixels(img))
setnames(x, append(c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150904 output 5

None of the individual pixels that we added have strong predictive powers, yet our RMSE on the training data dropped from 2.4% for the last blog’s model to 1.4% for this model! That’s because it is the combination of nearby pixels that are predictive, not just any individual pixel. I haven’t scored this model on the test set, but I suspect that it will get you a good ranking.
20150904 output 6

There is hardly any trace of the coffee cup stain remaining. This is a pretty good candidate model.

In order to understand the effect of the nearby pixels, it is useful to visualise their variable importance in a grid.

pixelNames = paste("x", 1:25, sep="")
pixelNames[13] = "raw"
grid = t(matrix(sapply(1:25, FUN = function(x) importance_matrix$Gain[importance_matrix$Feature == pixelNames[x]]), 5, 5))
grid[3, 3] = NA
plot(raster(grid))

20150904 output 7
This graphic shows that we didn't need all of the surrounding pixels to create a good predictive model. The pixels that don't lie on the same row or column as the target pixel aren't as important. If we wanted to expand beyond the 3×3 sliding window used by our competitors, we wouldn't have needed to add all of the extra pixels. We could have just added the pixels at (1, 3), (3, 1), (3, 5) and (5, 3).
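Here is a sketch of how you could cut the 25-column window table tab built earlier down to just those 13 cells (the 3×3 core plus the four distance-2 pixels on the centre row and column); the cell-to-column mapping k = (i - 1) * 5 + j matches the loop order used above.

# keep the 3x3 core plus the pixels at (1, 3), (3, 1), (3, 5) and (5, 3)
keepCells = rbind(expand.grid(i = 2:4, j = 2:4), data.frame(i = c(1, 3, 3, 5), j = c(3, 1, 5, 3)))
keepCols = (keepCells$i - 1) * 5 + keepCells$j
tabReduced = tab[, sort(keepCols)]   # 13 predictors instead of 25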

The model that we developed in this blog is pushing the boundaries of what my PC can do. Without access to very powerful computers and / or a cluster, we need to use a different approach if we want to improve on this blog’s model.


Denoising Dirty Documents : Part 5

28 Friday Aug 2015

Posted by Colin Priest in Background Removal, Kaggle, Median Filter, R, Variable Importance

≈ 12 Comments

Tags

Background Removal, Kaggle, Median Filter, R, Variable Importance

In my last blog we had faded the coffee cup stains, but there was more work to be done. So far we had used adaptive thresholding and edge detection. Today we will use median filters and background removal.

A median filter is an image filter that replaces a pixel with the median value of the pixels surrounding it. In doing this, it smoothes the image, and the result is often thought of as the “background” of the image, since it tends to wipe away small features, but maintains broad features.

While the biOps package has a median filter implementation, it isn’t difficult to write a function to do that ourselves, and it can be quite instructive to see how a median filter works.

median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img
 
 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }
 
 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")

# use the median filter and save the result
filtered = median_Filter(img, 17)
writePNG(filtered, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

20150829 output 1

What we get after applying a median filter is something that looks like the background of the image. It contains the coffee cup stains and also the shade of the paper upon which the writing appears. With a filter width of 17, the writing has almost entirely faded away.
I didn’t choose that filter width of 17 randomly. It was the result of running models with several different filter widths and seeing which had the best predictive powers.
Median filters are not fast, especially once the filter width increases. This function runs with similar speed to that in the biOps package. I shall leave the task of writing a parallel processing version of the median filter function to the more impatient readers.
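For what it's worth, here is one possible sketch of a parallel version, using the foreach and doSNOW packages that are loaded elsewhere in this series; median_Filter_parallel is a name I have made up, and the only part that is parallelised is the per-pixel median, which is where most of the time goes.

library(foreach)
library(doSNOW)

median_Filter_parallel = function(img, filterWidth, nCores = 4)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 # build the window table exactly as in median_Filter, but pre-allocated
 n = nrow(img) * ncol(img)
 tab = matrix(NA, n, filterWidth ^ 2)
 k = 1
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 tab[, k] = as.vector(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 k = k + 1
 }
 }

 # the slow part is the per-pixel median, so split the rows across the workers
 cl = makeCluster(nCores)
 registerDoSNOW(cl)
 chunks = split(seq_len(n), cut(seq_len(n), nCores, labels = FALSE))
 filtered = foreach(idx = chunks, .combine = c) %dopar% apply(tab[idx, , drop = FALSE], 1, median, na.rm = TRUE)
 stopCluster(cl)

 return (matrix(filtered, nrow(img), ncol(img)))
}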

While we now have the background, what we really wanted was the foreground – the writing, without the coffee cup stains. The foreground is the difference between the original image and the background. But in this case we know that the writing is always darker than the background, so our foreground should only show pixels that are darker than the background. I have also rescaled the result to lie in the interval [0, 1]. Here is an R script to implement background removal.

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

# run the background removal function and save the result
foreground = background_Removal(img)
writePNG(foreground, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")

20150829 output 2

While it’s not perfect, the resulting filtered image has done a reasonably good job of separating the writing from the background. It is reasonable to expect that it will be a useful predictor in our model.
This time a filter width of 5 was chosen purely for the purpose of speed. You could have used a grid search to find the “best” filter width parameter.

# read in the coffee cup stain image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
imgClean = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned\\3.png")

bestRMSE = 1
bestWidth = 5

widths = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25)

for (w in widths)
{
 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 # score the result
 rmse = sqrt(mean( (foreground - imgClean) ^ 2 ))
 if (rmse < bestRMSE)
 {
 bestRMSE = rmse
 bestWidth = w
 print(c(bestWidth, rmse))
 }
}

In the past few blogs I have used the GBM package to create a predictive model. But as we have added more predictors or features, it has started to take a long time to fit the model and to calculate the predictions. So today I’m switching to the xgboost package because the top Kagglers are using it (Owen Zhang, Kaggle’s top ranked competitor says “when in doubt use xgboost“), and because it runs much faster than GBM. I’d never used xgboost until this week, and I must say that I’m quite impressed with its speed.

Here is an R script that fits an xgboost model, using all of the features that we have come up with over the 5 blogs to date.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
 source("http://bioconductor.org/biocLite.R")
 biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
 # fit 3 clusters
 v = img2vec(img)
 km.mod = kmeans(v, 3)
 # allow for the random ordering of the clusters
 oc = order(km.mod$centers)
 # the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
 hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))
 
 # using upper threshold
 imgHi = v
 imgHi[imgHi <= hiThresh] = 0
 imgHi[imgHi > hiThresh] = 1
 
 return (imgHi)
}

# a function that applies adaptive thresholding
adaptiveThresholding = function(img)
{
 img.eb <- Image(t(img))
 img.thresholded.3 = thresh(img.eb, 3, 3)
 img.thresholded.5 = thresh(img.eb, 5, 5)
 img.thresholded.7 = thresh(img.eb, 7, 7)
 img.thresholded.9 = thresh(img.eb, 9, 9)
 img.thresholded.11 = thresh(img.eb, 11, 11)
 img.kmThresh = kmeansThreshold(img)

 # combine the adaptive thresholding
 ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
 ttt.2 = apply(ttt.1, 1, max)
 ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
 return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
 return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
 m1 = t(matrix(Img, nrow(Img), ncol(Img)))
 return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
 img.biOps = imagedata(img * 255)
 img.canny = imgCanny(img.biOps, 0.7)
 # do some morphology on the edges to fill the gaps between them
 mat <- matrix (0, 3, 3)
 mask <- imagedata (mat, "grey", 3, 3)
 img.dilation = imgBinaryDilation(img.canny, mask)
 img.erosion = imgBinaryErosion(img.dilation, mask)
 img.erosion.2 = imgBinaryErosion(img.erosion, mask)
 img.dilation.2 = imgBinaryDilation(img.erosion.2, mask)
 return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to do a median filter
median_Filter = function(img, filterWidth)
{
 pad = floor(filterWidth / 2)
 padded = matrix(NA, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
 padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] = img

 tab = NULL
 for (i in seq_len(filterWidth))
 {
 for (j in seq_len(filterWidth))
 {
 if (i == 1 && j == 1)
 {
 tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
 } else {
 tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
 }
 }
 }

 filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
 return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
 w = 5

 # the background is found via a median filter
 background = median_Filter(img, w)

 # the foreground is darker than the background
 foreground = img - background
 foreground[foreground > 0] = 0
 m1 = min(foreground)
 m2 = max(foreground)
 foreground = (foreground - m1) / (m2 - m1)
 
 return (matrix(foreground, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"
 
outPath = file.path(outFolder, "trainingdata_blog5.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
 print(f)
 imgX = readPNG(file.path(dirtyFolder, f))
 imgY = readPNG(file.path(cleanFolder, f))
 
 # turn the images into vectors
 x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
 y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)
 
 # threshold the image
 x2 = kmeansThreshold(imgX)

 # adaptive thresholding
 x3 = img2vec(adaptiveThresholding(imgX))
 
 # canny edge detector and related features
 x4 = img2vec(cannyEdges(imgX))
 x5 = img2vec(cannyDilated1(imgX))
 x6 = img2vec(cannyDilated2(imgX))
 
 # median filter and related features
 x7 = img2vec(median_Filter(imgX, 17))
 x8 = img2vec(background_Removal(imgX))

 dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8))
 setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
 write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# read in the full data table
dat = read.csv(outPath)

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean]) 
# now fit an xgboost model 
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse) # 2.4% vs 4.1%

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)


# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), 
 img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)),img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)) )
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150829 output 3

Both the background removal and the median filter features are important, with even higher importance scores than last blog’s canny edge detector features.

20150829 output 4

The coffee cup stain has been mostly removed, especially the speckles that we saw after applying the model in the last blog.

The result represents an improvement in RMSE score on the training data from 4.1% in the last blog to 2.4% in this blog. In the next blog we will further improve on this score.


Efficiently Building GLMs: Part 1

10 Friday Jul 2015

Posted by Colin Priest in Generalized Linear Models, Gradient Boosting Machine, IntelliM, Machine Learning, R

≈ 5 Comments

Tags

Automation, GBMs, Generalized Linear Models, GLMs, IntelliM, Machine Learning, R, Variable Importance

It was way back in 1972 that John Nelder and Robert Wedderburn first introduced GLMs to the world (you can find their original paper here), and by the 1990s actuaries in the insurance industry (I am an actuary) had started using GLMs for technical pricing, empowered by the increased accessibility of modern powerful computers. In some parts of the world there are now huge teams, consisting of dozens of actuaries, sometimes more than 100 of them, building generalized linear models (GLMs) for technical pricing of insurance policies. But is this efficient or is it as out of date as a dinosaur riding a penny farthing?

Dinosaur on Penny Farthing

At university, when I was doing my master's degree in Applied Statistics, I was taught that the "best" GLM is the one with the lowest AIC or BIC (whether AIC and BIC are good model benchmarks will be the subject of a future blog). But the information criterion is not a well-behaved function of the choice of predictors (it can have many local minima and maxima), so to find the model with the lowest information criterion one cannot follow a steepest-gradient approach; instead one must systematically search through the different combinations of predictors.

Consider a scenario in which there are 30 possible predictors, consisting of rating factors collected from the customer (e.g. zip code / post code) and / or data variables collected from other sources (e.g. credit rating of the insured), and all of these values are either categorical, or are numeric with a linear relationship to the variable being modelled (whether that be claim frequency, severity or claims cost per policy). In such a situation, there are 2^30 = 1,073,741,824 possible combinations of predictors that could be included in the GLM formula. If each of your actuarial or statistical staff took 10 minutes to test a particular combination of predictors, and each of those staff worked 8 hours per day, 5 days per week, for 50 weeks per year, it would take almost 90,000 man-years to find the best model! Once you include the search for 2-way interactions between predictors, the search time blows out to longer than the life-span of the universe!!!
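Here is the arithmetic behind that estimate, assuming a 5-day working week:

combinations = 2 ^ 30
minutesPerModel = 10
workingMinutesPerYear = 8 * 60 * 5 * 50                  # 8-hour days, 5-day weeks, 50 weeks
combinations * minutesPerModel / workingMinutesPerYear   # roughly 89,478 man-years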

In practice, actuaries and statisticians are producing GLM models faster than that estimate because they are using heuristics to substantially reduce the number of combinations to search. Those heuristics are usually based upon variants of step-wise regression, whereby they add or remove one predictor at a time. This is still extremely time consuming, and the process still does not necessarily produce the model with the best AIC or BIC.
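As a point of comparison, here is a minimal sketch of that step-wise heuristic using base R's step(), assuming the diabetes readmission data introduced below has been read into a data frame td with a 0/1 response called readmitted_flag (the same names used in the GBM script further down):

full = glm(readmitted_flag ~ ., family = binomial, data = td)
stepped = step(full, direction = "both", trace = 1)   # adds/drops one predictor at a time by AIC
summary(stepped)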

How can you make this process more efficient?

Let’s make this more real by considering some publicly available data tracking hospital readmission of diabetes patients in USA from 1999 to 2008. You can find the data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Improvement 1: Starting With a Reasonable Choice of Predictors

One can improve upon the step-wise regression approach by starting with a model that already has a few of the most useful predictors.

Instead of beginning with a GLM that uses all the predictors and removes them one-by-one, or starting with no predictors and adding more predictors one-by-one, you can start by understanding which predictors are likely to be the best candidates, and this involves taking a step beyond GLMs into some more modern techniques, including:

  • lasso regularisation
  • variable importance measures via random forests or gradient boosting machines

For the sake of simplicity, this blog will apply just one of these approaches: the variable importance measure based upon a gradient boosting machine. But any of them will usually achieve the task.
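For completeness, here is a sketch of the lasso alternative using the glmnet package; the predictors that keep non-zero coefficients at a sensible penalty are good starting candidates. It again assumes the data is in td with response readmitted_flag, as in the GBM script below.

library(glmnet)
X = model.matrix(readmitted_flag ~ ., data = td)[, -1]    # drop the intercept column
cvfit = cv.glmnet(X, td$readmitted_flag, family = "binomial", alpha = 1)
coefs = as.matrix(coef(cvfit, s = "lambda.1se"))
setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")  # the predictors selected by the lasso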

A gradient boosting machine (GBM) is a very flexible type of machine learning algorithm. It can be used for regression and classification problems. You can use GBMs via the GBM package available in R. GBMs are a forest of trees whereby each successive tree is fitted to the residuals of the previous iteration of the forest i.e. each new tree predicts the errors from the existing forest. The GBM package has a measure of “relative influence” that is quite similar to a variable importance measure, and can be used for the same purpose.

Variable importance or relative influence is a measure of how much of the variation in outcomes is explained by the inclusion of the predictor in the model. A predictor will explain more of the variation in outcomes if:

  • it is statistically significant i.e. the difference isn’t random,
  • the difference is large for different values of the predictor i.e. the predictor differentiates well, and
  • there is considerable variation in the predictor value between observations i.e. more than just a small number of predictor observations are different to the average.

Here is some sample R code to give an indication of how one would do this with the diabetes readmission data:

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(gbm)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\";

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# GBM variable importance
set.seed(1)
gbm_imp = gbm(formula = readmitted_flag ~ ., distribution = "bernoulli", data=td, n.trees = 1000, interaction.depth = 1, verbose=TRUE, shrinkage = 0.01, cv.folds=0, keep.data = F)
s = summary(gbm_imp)
head(s)

As I wrote in my last blog, I'm a fan of the pacman package in R. It conveniently ensures that I have installed packages before I load them, and then installs and loads the packages as required.

The next step is to read the diabetes readmission data into R. I am reading the data from a comma delimited file that I have created previously after downloading the UCI Machine Learning Repository data. You should edit the sample R script to use the folder and file name of your data.

Finally I fitted a GBM model. For the purposes of this blog I set the random seed, to make the results replicable. Note that the model hyperparameters were not optimised – I am just creating a model for the sake of understanding which predictors are important, not trying to fit the best possible GBM model. But sometimes the interaction.depth hyperparameter does matter. In my script above I have used interaction.depth = 1, which excludes the possibility of 2-way interaction effects between two predictors. I chose a value of 1 for simplicity for this example, and because most of the time it doesn’t make a difference to the discovery of the most important predictors. However, if you have strong reason to believe that your data exhibits strong 2-way effects between predictors, and also have reason to believe that the one-way effect of those same predictors is weak, then try higher values for this hyperparameter.

variable importance console screenshot

The summary function gets the variable importance measures from the GBM model. It displays a plot of the most important predictors, and also stores them in a table. As you can see from the screenshot above, R shows that just 5 predictors will give most of the predictive power. Note that the variable importance scores have been automatically scaled to add to 100.

R variable importance plot

The default plot for GBM variable importance is rather difficult to read and interpret. Lately I have been using IntelliM to build my GLMs, because it automates the feature extraction, model building and validation process for GLMs, and it does so visually instead of through hand-written R scripts; it produces the easier-to-read graph shown further below. But if I am working on a project that is purely within R, I usually script a better-looking plot using the ggplot2 package.
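Something like the following sketch, where s is the table returned by summary() in the GBM script above:

library(ggplot2)
top = head(s[order(-s$rel.inf), ], 10)                    # the 10 most influential predictors
ggplot(top, aes(x = reorder(var, rel.inf), y = rel.inf)) +
 geom_col() +
 coord_flip() +
 labs(x = NULL, y = "Relative influence (%)")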

Variable Importance

Later in this series of blogs, I will discuss dimensionality reduction, but since one approach to dimensionality reduction relates to what I have just shown in this blog, I will give you a sneak peek.

The diabetes readmission data contains categorical predictors that contain codes for diagnoses and other data of interest. Some of these predictors have several hundred different possible codes. Unless we had many millions of observations in total, there is no way that all of those codes are going to have enough observations to provide statistically valid predictions. What we expect is that just a small number of these codes are important.

Factor Importance

So we can apply the same variable importance approach to the codes within a categorical predictor, and then group the unimportant codes together, making for a better model.
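Here is a rough sketch of that idea, using one of the diagnosis code columns from the raw UCI data (diag_1) purely as an illustration; the 1% relative-influence cut-off is arbitrary, and it assumes td and readmitted_flag as in the script above.

# one indicator column per code of the chosen categorical predictor
codes = model.matrix(~ diag_1 - 1, data = td)
codeImp = gbm.fit(x = as.data.frame(codes), y = td$readmitted_flag, distribution = "bernoulli", n.trees = 500, interaction.depth = 1, shrinkage = 0.01, verbose = FALSE)
codeInf = summary(codeImp, plotit = FALSE)

# keep the codes that carry some relative influence, lump the rest into "Other"
keep = gsub("^diag_1", "", codeInf$var[codeInf$rel.inf > 1])
td$diag_1_grouped = ifelse(td$diag_1 %in% keep, as.character(td$diag_1), "Other")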
