# Denoising Dirty Documents: Part 7

Tags

By the time I’d finished building the model in my last blog, I’d started to overload my computer’s RAM and CPU, so much so that I couldn’t add any more features. One solution could be for me to upgrade my hardware, or rent out a cluster in the cloud, but I’m trying to save money at the moment. So today I’m going to restructure my predictive solution into separate predictive models, each of which do not individually overload my computer, but which are combined via stacking to give an overall solution that is more predictive than any of the individual models. Stacking would also allow us to answer Bobby’s question:

I thought it was interesting how the importance chart didn’t show any one individual pixel as being significant. Is there another similar importance chart or metric that can show the importance of a combination of pixels?

So let’s start by breaking up the current monolithic model into discrete chunks. Once we have done this, it will be easier for us to add new features (which I will do in the next blog). I will store the predicted outputs from each model in image format with separate folders for each model. The R script for creating the individual models is primarily recycled from my past blogs. Since images contain integer values for pixel brightness, one may argue that storing the predicted values as images loses some of the information content, but the rounding of floating point values to integer values only affects the precision by a maximum of 0.2%, and the noise in the predicted values is an order of magnitude greater than that. If you want that fractional extra predictive power, then I leave it to you to adapt the script to write the floating point predictions into a data.table.

# Linear Transformation

The linear transformation model was developed here, and just involves a linear transformation, then constraining the output to be in the [0, 1] range. ```
# libraries
if (!require("pacman")) install.packages("pacman")

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"

# predict using linear transformation
outModel1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)

# linear transformation
yHat = -0.126655 + 1.360662 * x

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel1, f))
}

```

# Thresholding

The thresholding model was developed here, and involves a combination of adaptive thresholding processes, each with different sizes of localisation ranges, plus a global thresholding process. But this time we will use xgboost instead of gbm. ```
# libraries
if (!require("pacman")) install.packages("pacman")

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# threshold the image
x2 = kmeansThreshold(imgX)

dat = data.table(cbind(y, x, x2, x3))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
for (f in filenames)
{
print(f)

# turn the images into vectors
x = img2vec(imgX)

# threshold the image
x2 = kmeansThreshold(imgX)

dat = data.table(cbind(x, x2, x3))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel2, f))
}

```

# Edges

The edge model was developed here, and involves a combination of canny edge detection and image morphology. ```
# libraries
if (!require("pacman")) install.packages("pacman")

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x4, x5, x6))
setnames(dat,c("y", "raw", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
for (f in filenames)
{
print(f)

# turn the images into vectors
x = img2vec(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(x, x4, x5, x6))
setnames(dat,c("raw", "canny", "cannyDilated1", "cannyDilated2"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel3, f))
}

```

# Background Removal

The background removal model was developed here, and involves the use of median filters. An xgboost model is used to combine the median filter results. ```
# libraries
if (!require("pacman")) install.packages("pacman")
# a function to do a median filter
median_Filter = function(img, filterWidth)
{

tab = matrix(NA, nrow(img) * ncol(img), filterWidth ^ 2)
k = 1
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

filtered = unlist(apply(tab, 1, function(x) median(x, na.rm = TRUE)))
return (matrix(filtered, nrow(img), ncol(img)))
}

# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = img2vec(imgX)
y = img2vec(imgY)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(y, x, x7a, x7b, x7c, x8))
setnames(dat,c("y", "raw", "median5", "median11", "median17", "background"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
for (f in filenames)
{
print(f)

# turn the images into vectors
x = img2vec(imgX)

# median filter and related features
x7a = img2vec(median_Filter(imgX, 5))
x7b = img2vec(median_Filter(imgX, 11))
x7c = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(x, x7a, x7b, x7c, x8))
setnames(dat,c("raw", "median5", "median11", "median17", "background"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel4, f))
}

```

# Nearby Pixels

The spacial model was developed here, and involves the use of a sliding window and xgboost. ```
# libraries
if (!require("pacman")) install.packages("pacman")

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
width = 2 * pad + 1

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5.csv"
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
#x = img2vec(imgX)
y = img2vec(imgY)

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x9))
setnames(dat,append("y", paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predicted result for each image
outModel5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"
for (f in filenames)
{
print(f)

# turn the images into vectors
#x = img2vec(imgX)

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(x9)
setnames(dat,paste("x", 1:25, sep=""))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX), ncol(imgX))

# save the predicted value
writePNG(imgYHat, file.path(outModel5, f))
}
```

# Combining the Individual Models

There are many ways that the predictions of the models could be ensembled together. I have used an xgboost model because I want to allow for the complexity of the problem that is being solved.

```# libraries
if (!require("pacman")) install.packages("pacman")

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

inPath1 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model1"
inPath2 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model2"
inPath3 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model3"
inPath4 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model4"
inPath5 = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\model5"

cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outPath = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacking.csv"

filenames = list.files(inPath1)
for (f in filenames)
{
print(f)

# turn the images into vectors
#x = img2vec(imgX)
y = img2vec(imgY)

# contributing models
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)

dat = data.table(cbind(y, x1, x2, x3, x4, x5))
setnames(dat,c("y", "linear", "thresholding", "edges", "backgroundRemoval", "proximal"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 5000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 1000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)

# get the predicted result for each image
outModel = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\stacking\\stacked"
for (f in filenames)
{
print(f)

# contributing models
x1 = img2vec(imgX1)
x2 = img2vec(imgX2)
x3 = img2vec(imgX3)
x4 = img2vec(imgX4)
x5 = img2vec(imgX5)

dat = data.table(cbind(x1, x2, x3, x4, x5))
setnames(dat,c("linear", "thresholding", "edges", "backgroundRemoval", "proximal"))

# predicted values
yHat = predict(xgb.mod, newdata=as.matrix(dat))

# constrain the range to be in [0, 1]
yHat = sapply(yHat, function(x) max(min(x, 1),0))

# turn the vector into an image
imgYHat = matrix(yHat, nrow(imgX1), ncol(imgX1))

# save the predicted value
writePNG(imgYHat, file.path(outModel, f))
}
``` The graph above answers Bobby’s question – when considered together, the nearby pixels provide most of the additional predictive power beyond that of the raw pixel brightness. It seems that the key to separating dark text from noise is to consider how the pixel fits into the local region within the image.
Now that we have set up a stacking model structure, we can recommence the process of adding more image processing and features to the model. And that’s what the next blog will do…

# Denoising Dirty Documents: Part 6

So far in this series of blogs we have used image processing techniques to improve the images, and then ensembled together the results of that image processing using GBM or XGBoost. But I have noticed that some competitors have achieved reasonable results using purely machine learning approaches. While these pure machine learning approaches aren’t enough for the competitors to get to the top of the leader board, they have outperformed some of the models that I have presented in this series of blogs. However these scripts were invariably written in Python and I thought that it would be great to see how to use R to build a similar type of model (except better, because we will include all of the image processing predictors that we have developed so far). So today we will add a brute-force machine learning approach to our model.

At the time of writing this blog, the ranked top competitor who has shared their script is placed 17th with an RMSE of 2.6%. While I am not very experienced in Python, I can usually figure out what a Python script is doing if I read it. So here’s what I think that script ise doing:

1. pad out each image by an extra 2 pixels (see my last blog for how to pad out an image in R)
2. run a 3×3 sliding window along the image pixels (see my last blog for how to create a sliding window in R)
3. use all 9 pixels within the sliding window as predictors for the pixel in the centre of the sliding window
4. use a random forest model to predict the pixel brightnesses

This is a pure machine learning approach because it doesn’t do any image processing to pre-process the image. It simply says that if you want to predict a particular pixel brightness, then you should probably look around at the brightnesses of the pixels either side of that pixel. While I don’t want to fuel the R versus Python language wars, I do want to create a model in R that can outperform my competitors. So instead of using a 3×3 sliding window, I will use a 5×5 sliding window. Because the random forest implementation in R tends to run out of RAM on my computer, I will use XGBoost. Here is how I create a 5×5 sliding window.

```
# show the predicted result for a sample image

# create a padded image within which we will embed our source image
width = 2 * pad + 1

# create a matrix of predictor values - each column is a pixel from the sliding window
tab = NULL
for (i in seq_len(width))
{
for (j in seq_len(width))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

``` When I noticed that this script ran a bit slowly, I looked at how I wrote the loops. R does not like manual looping, and it is very inefficient at appending data. So I rewrote the script to pre-allocate space for all of the cells rather than appending each column as it was calculated.

```
width = 2 * pad + 1

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

```

This modification gave me a tenfold improvement in speed! It’s a reminder that I shouldn’t be lazy when writing R scripts.

There’s one small issue to consider before we put this all together for a new more powerful predictive model. In the code above I have used a brightness value of 1 when padding out the image, but that doesn’t look natural. The image below shows a clear white border around the image. This could confuse the machine learning algorithm, as it could waste time learning what to do with pure white pixels. So instead it makes sense to pad out the image with a background brightness. One way to do this is to replicate the edge of the original image.

```
width = 2 * pad + 1
# fill in the padded edges

``` This looks more natural, except where there is writing at the edge of the image. Another way is to pad out the pixels using the median of the entire image.

```
width = 2 * pad + 1

``` I haven’t tested which approach works best. So I leave it to the reader to compare the results from the three different padding approaches, and use whichever gives the best result (although I suspect that there won’t be much difference between the second and third approaches).

Let’s pull it all together, and add the surrounding pixels as a predictor to a more complete model.

```
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
{
img.eb  0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# a function that groups together the pixels contained within a sliding window around each pixel of interest
proximalPixels = function(img)
{
width = 2 * pad + 1

tab = matrix(1, nrow(img) * ncol(img), width ^ 2)
k = 1
for (i in seq_len(width))
{
for (j in seq_len(width))
{
tab[,k] = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
k = k + 1
}
}

return (tab)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog6.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

# surrounding pixels
x9 = proximalPixels(imgX)

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8, x9))
setnames(dat,append(c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"), paste("x", 1:25, sep="")))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain  1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

``` None of the individual pixels that we added have strong predictive powers, yet our RMSE on the training data dropped from 2.4% for the last blog’s model to 1.4% for this model! That’s because it is the combination of nearby pixels that are predictive, not just any individual pixel. I haven’t scored this model on the test set, but I suspect that it will get you a good ranking. There is hardly any trace of the coffee cup stain remaining. This is a pretty good candidate model.

In order to understand the effect of the nearby pixels, it is useful to visualise their variable importance in a grid.

```pixelNames = paste("x", 1:25, sep="")
pixelNames = "raw"
grid = t(matrix(sapply(1:25, FUN = function(x) importance_matrix\$Gain[importance_matrix\$Feature == pixelNames[x]]), 5, 5))
grid[3, 3] = NA
plot(raster(grid))
``` This graphic shows that we didn’t need all of the surrounding pixels to create a good predictive model. The pixels that don’t lie on the same row or column as the target pixel aren’t as important. If we wanted to expand beyond the 3×3 sliding window used by our competitors, then we didn’t need to add all of the extra pixels. We could have just added the pixels at (1, 3) and (3, 1) and (3, 5) and (5, 3).

The model that we developed in this blog is pushing the boundaries of what my PC can do. Without access to very powerful computers and / or a cluster, we need to use a different approach if we want to improve on this blog’s model.

# Denoising Dirty Documents : Part 5

In my last blog we had faded the coffee cup stains, but there was more work to be done. So far we had used adaptive thresholding and edge detection. Today we will use median filters and background removal. A median filter is an image filter that replaces a pixel with the median value of the pixels surrounding it. In doing this, it smoothes the image, and the result is often thought of as the “background” of the image, since it tends to wipe away small features, but maintains broad features.

While the biOps package has a median filter implementation, it isn’t difficult to write a function to do that ourselves, and it can be quite instructive to see how a median filter works.

```median_Filter = function(img, filterWidth)
{

tab = NULL
for (i in seq_len(filterWidth))
{
for (j in seq_len(filterWidth))
{
if (i == 1 && j == 1)
{
tab = img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))])
} else {
tab = cbind(tab, img2vec(padded[i - 1 + seq_len(nrow(img)), j - 1 + seq_len(ncol(img))]))
}
}
}

filtered = unlist(apply(tab, 1, function(x) median(x[!is.na(x)])))
return (matrix(filtered, nrow(img), ncol(img)))
}

# libraries
if (!require("pacman")) install.packages("pacman")

# read in the coffee cup stain image

# use the median filter and save the result
filtered = median_Filter(img, 17)
writePNG(filtered, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
``` What we get after applying a median filter is something that looks like the background of the image. It contains the coffee cup stains and also the shade of the paper upon which the writing appears. With a filter width of 17, the writing has almost entirely faded away.
I didn’t choose that filter width of 17 randomly. It was the result of running models with several different filter widths and seeing which had the best predictive powers.
Median filters are not fast, especially once the filter width increases. This function runs with similar speed to that in the biOps package. I shall leave the task of writing a parallel processing version of the median filter function to the more impatient readers.

While we now have the background, what we really wanted was the foreground – the writing, without the coffee cup stains. The foreground is the difference between the original image and the background. But in this case we know that the writing is always darker than the background, so our foreground should only show pixels that are darker than the background. I have also rescaled the result to lie in the interval [0, 1]. Here is an R script to implement background removal.

```# a function that uses median filter to get the background then finds the dark foreground
background_Removal = function(img)
{
w = 5

# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

# run the background removal function and save the result
foreground = background_Removal(img)
writePNG(foreground, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
``` While it’s not perfect, the resulting filtered image has done a reasonably good job of separating the writing from the background. It is reasonable to expect that it will be a useful predictor in our model.
This time a filter width of 5 was chosen purely for the purpose of speed. You could have used a grid search to find the “best” filter width parameter.

```# read in the coffee cup stain image

bestRMSE = 1
bestWidth = 5

widths = c(3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25)

for (w in widths)
{
# the background is found via a median filter
background = median_Filter(img, w)

# the foreground is darker than the background
foreground = img - background
foreground[foreground > 0] = 0
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

# score the result
rmse = sqrt(mean( (foreground - imgClean) ^ 2 ))
if (rmse < bestRMSE)
{
bestRMSE = rmse
bestWidth = w
print(c(bestWidth, rmse))
}
}
```

In the past few blogs I have used the GBM package to create a predictive model. But as we have added more predictors or features, it has started to take a long time to fit the model and to calculate the predictions. So today I’m switching to the xgboost package because the top Kagglers are using it (Owen Zhang, Kaggle’s top ranked competitor says “when in doubt use xgboost“), and because it runs much faster than GBM. I’d never used xgboost until this week, and I must say that I’m quite impressed with its speed.

Here is an R script that fits an xgboost model, using all of the features that we have come up with over the 5 blogs to date.

```# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps, xgboost, Ckmeans.1d.dp)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
m1 = min(foreground)
m2 = max(foreground)
foreground = (foreground - m1) / (m2 - m1)

return (matrix(foreground, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata_blog5.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

# median filter and related features
x7 = img2vec(median_Filter(imgX, 17))
x8 = img2vec(background_Removal(imgX))

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6, x7, x8))
setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit an xgboost model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 2000000)
dat[is.na(dat)] = 0
dtrain <- xgb.DMatrix(as.matrix(dat[rows,-1]), label = as.matrix(dat[rows,1]))
# do cross validation first
xgb.tab = xgb.cv(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = 10000, early.stop.round = 50, nfold = 5, print.every.n = 10)
# what is the best number of rounds?
min.error.idx = which.min(xgb.tab[, test.rmse.mean])
# now fit an xgboost model
xgb.mod = xgboost(data = dtrain, nthread = 8, eval_metric = "rmse", nrounds = min.error.idx, print.every.n = 10)

# get the predictions
dtrainFull <- xgb.DMatrix(as.matrix(dat[,-1]), label = as.matrix(dat[,1]))
yHat = predict(xgb.mod, newdata=dtrainFull)
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat\$y) ^ 2 ))
print(rmse) # 2.4% vs 4.1%

# get the trained model
model = xgb.dump(xgb.mod, with.stats=TRUE)
# get the feature real names
names = names(dat)[-1]
# compute feature importance matrix
importance_matrix = xgb.importance(names, model=xgb.mod)
# plot the variable importance
gp = xgb.plot.importance(importance_matrix)
print(gp)

# show the predicted result for a sample image
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)),
img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)),img2vec(median_Filter(img, 17)), img2vec(background_Removal(img)) )
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2", "median17", "backgroundRemoval"))
yHatImg = predict(xgb.mod, newdata=as.matrix(x))
yHatImg[yHatImg < 0] = 0 yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
``` Both the background removal and the median filter features are important, with even higher importance scores than last blog’s canny edge detector features. The coffee cup stain has been mostly removed, especially the speckles that we saw after applying the model in the last blog.

The result represents an improvement in RMSE score on the training data from 4.1% in the last blog to 2.4% in this blog. In the next blog we will further improve on this score.

# Denoising Dirty Documents : Part 4

At the end of the last blog, we had made some progress in removing the coffee cup stain from the image, but we needed to do more. Adaptive thresholding has started to separate the writing from the stain, but it has also created a speckled pattern within the stains. We need to engineer a feature that can tell apart a stroke of writing from a speckled local maxima i.e. distinguish a ridge from a peak in the 3D surface. In image processing, we do this via edge detection, which is the process of calculating the slope of the 3D surface of the image, and retaining lines where the slope is high. There are several different standard algorithms to do edge detection, and today we will use the canny edge detector.

The biOps package, which has an implementation of the canny edge detector, has been removed from CRAN. It has been migrated to Google Code. You will need to follow the installation instructions that can be found here. For example, since I am using Windows 64, I mostly followed the following instructions from the web site:

# Windows 64 bit

1. Note: we dropped jpeg and tiff IO functions.
2. Download (go to the DLL’s page then download the raw file) libfftw3-3.dll, libfftw3f-f.dll, libfftw3l-3.dll, and zlib1.dll to C:\Program Files\R\R-3.x.x\bin\x64 (x.x needs to be edited) or somewhere present in the PATH variables. Make sure that the downloaded dll files are MB in file size.
4. Run 64 bit R.
5. Choose biOps_0.2.2.zip from Packages>Install package(s)…
`> library(biOps)`

However for step E) I used the following R code:

```
install.packages("C:/Program Files/R/R-3.1.3/bin/biOps_0.2.2.zip")

```

Now we can start to experiment with edge detection. Note that biOps images have pixel brightnesses from 0 to 255 rather than from 0 to 1. So we have to rescale whenever we switch packages.

```
if (!require("pacman")) install.packages("pacman")

# read in the coffee cup stain image
plot(raster(img))

# convert the image to a biOps image
img.biOps = imagedata(img * 255)
plot.imagedata(img.biOps)

# canny edge detection
img.canny = imgCanny(img.biOps, 0.7)
plot.imagedata(img.canny)

``` One of the things I like about this particular edge detection algorithm is that it has automatically thresholded the edges so that we are only left with the strong edges. So here are my observations about the behaviour of the edges:

• they surround the writing
• they surround the stains
• there are few edges within a stain, so a lack of edges in a region may be a useful feature for removing the speckles within a stain
• writing has a pair of parallel edges around each stroke, while the boundary of the stain has only a single edge, so the presence of a pair of edges may be a useful feature for separating stains from writing

To take advantage of these observations we shall use image morphology. Dilation is the process of making a line or blog thicker by expanding its boundary by one pixel. Erosion is the opposite, removing a 1 pixel thick layer from the boundary of an object. If we dilate the edges, then the pair of edges around the writing will expand to include the writing inside, and the edge of the stain will also expand.

```
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
plot.imagedata(img.dilation)

``` The writing is all black, whereas most of the stain is white. This will probably be a useful feature for removing the coffee cup stain. But we can do more: now that we have dilated the edges, we can erode them to remove where we started with single edges. This looks pretty good – all of the writing is black, but only a small part of the stain remains. the stain has a thin line, while the writing has thick lines. So we can erode once, then dilate once, and the thin lines will disappear.

```
# do some morphology to remove the stain lines
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
plot.imagedata(img.dilation.3)

``` The stain is now almost completely removed! But unfortunately some of the writing has been removed too. So it is an imperfect feature for removing the coffee cup stain.

Let’s put it all together with the existing features that we have developed over the past few blogs, by adding canny edges and the dilated / eroded edges to the gradient boosted model:

```
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW, biOps)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0
imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
{
img.eb <- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

# a function to do canny edge detector
cannyEdges = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
return (matrix(img.canny / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated1 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
return(matrix(img.erosion / 255, nrow(img), ncol(img)))
}

# a function combining canny edge detector with morphology
cannyDilated2 = function(img)
{
img.biOps = imagedata(img * 255)
img.canny = imgCanny(img.biOps, 0.7)
# do some morphology on the edges to fill the gaps between them
mat <- matrix (0, 3, 3)
mask <- imagedata (mat, "grey", 3, 3)
return(matrix(img.dilation.2 / 255, nrow(img), ncol(img)))
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

# canny edge detector and related features
x4 = img2vec(cannyEdges(imgX))
x5 = img2vec(cannyDilated1(imgX))
x6 = img2vec(cannyDilated2(imgX))

dat = data.table(cbind(y, x, x2, x3, x4, x5, x6))
setnames(dat,c("y", "raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive + canny + cannyDilated1 + cannyDilated2, data = dat[rows,], n.trees = 10000, train.fraction = 0.5, interaction.depth = 5)
best.iter <- gbm.perf(gbm.mod)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 #change the 6 to your number of CPU cores. or maybe lower due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat))%%numCores)
yHat = foreach(i=unique(split_testing),.combine=c,.packages=c("gbm")) %dopar% {
as.numeric(predict(gbm.mod, newdata=dat[split_testing==i,], n.trees = best.iter))
}
stopCluster(cl)
yHat[yHat < 0] = 0
yHat[yHat > 1] = 1
# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat\$y) ^ 2 ))
print(rmse) # 4.1%

# show the predicted result for a sample image
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)), img2vec(cannyEdges(img)), img2vec(cannyDilated1(img)), img2vec(cannyDilated2(img)))
setnames(x, c("raw", "thresholded", "adaptive", "canny", "cannyDilated1", "cannyDilated2"))
yHatImg = predict(gbm.mod, newdata=x, n.trees = best.iter)
yHatImg[yHatImg < 0] = 0
yHatImg[yHatImg > 1] = 1
imgOut = matrix(yHatImg, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

``` The first dilation operation that we tried in this blog is the most powerful predictor that we developed today, even more powerful than the adaptive thresholding that we used in the last blog. We have improved the RMSE score on the training set from 5.4% to 4.1% The model doesn’t completely remove the coffee cup stain, but it has faded the stain enough that we have a good chance at removing it later. In the next blog in this series, we will do more to clean up the coffee cup stain.

# Efficiently Building GLMs: Part 5

In the last few blogs, I discussed how you can automate the process of choosing which predictor columns to use in your model. This week I will meander away from that path to discuss how feature engineering can be automated.

The “L” in GLM stands for “Linear”, and that gives you a huge hint about one of the key assumptions: a GLM assumes that the relationship between a predictor and the outcome must be linear, or more pedantically, that relationship must be linear after the link function has been applied. For example, in my blog about Denoising Dirty Documents, I showed a graph that demonstrated a linear relationship between the pixel brightness from the input images versus the pixel brightness of the target images. But while linear relationships do naturally occur, I find that I frequently have to transform the input data to create a linear relationship. This is especially so when using monetary values, where a logarithmic transformation often helps. But until recently, I used a manual process, making guesses and testing different transformations. It saved me a lot of time once I automated the process.

Once again, we are using the diabetes readmission data from the UCI Machine Learning Repository. You can download my slightly altered version of the data that I am using from here.

The key tool that we are going to use to linearise the relationships is a generalized additive model (GAM). GAMs are an extension of generalized linear models that allow for non-linear relationships using non-parametric smoothing functions. It’s a bit like using rolling averages, but much more sophisticated.

The diabetes readmission data has a few numeric columns, but today we will just consider the column containing the number of inpatient visits to the hospital over the previous 12 months. Let’s start by doing a one-way analysis of the relationship with the rate of hospital readmission.

```# libraries
if (!require(&quot;pacman&quot;)) install.packages(&quot;pacman&quot;)

# the working folder for this batch job
folderPath = &quot;C:\\Users\\Colin\\Dropbox\\IntelliM\\&quot;;

nm = names(td)
names(td) = gsub(&quot;.&quot;, &quot;_&quot;, nm, fixed=TRUE)

# summarise by number_inpatient
dt &lt;- data.table(td)
plot(summary\$number_inpatient, summary\$mean)

# fit a straight line
model.lm = lm(mean ~ number_inpatient, data = summary)
summary\$straight_line = coef(model.lm) + summary\$number_inpatient * coef(model.lm)
plot(summary\$number_inpatient, summary\$mean)
lines(summary\$number_inpatient, summary\$straight_line, col = &quot;red&quot;)
``` The relationship shows a non-linear relationship. There is curvature, and the relationship appears to be concave.

When we fit a GLM to this data, we use a logit transformation, because the data has a Bernoulli distribution. So instead of linearising the relationship shown in the one-way analysis above, we should linearise the relationship after the logit link function has been applied. We can see the relationship, after the logit link function, by fitting a GAM model.
R has two packages for building GAMs. I use the mgcv package because it includes automatic regularisation, and because it allows me to obtain a table of results from which I can fit a linearising function. The mgcv package supports both tensor and spline smoothing functions. In the sample below I have specified to use a tensor smoothing function, by using the te() syntax applied to the numeric predictor. Note that you will need mgcv package version 1.8.6 or higher for this R script to work.

```# fit a GAM

# check the version of mgcv - must be at least 1.8.6
v = packageVersion(&quot;mgcv&quot;) # need 1.8.6 or higher
nMajor = as.integer(substr(v, 1, 1))
nMinor = as.integer(substr(v, 3, 3))
nRevision = as.integer(substr(v, 5, 5))
updateRequired = TRUE
if (nMajor &gt; 1) updateRequired = FALSE
if (nMajor == 1 &amp;&amp; nMinor &gt; 8) updateRequired = FALSE
if (nMajor == 1 &amp;&amp; nMinor == 8 &amp;&amp; nRevision &gt; 5) updateRequired = FALSE
if (updateRequired)
{
stop(&quot;You need to install a newer version of the mgcv package&quot;)
}

# get a table of partial relativities
pd = plot.gam(model.gam, residuals = FALSE, n=200, select = NULL)
if (is.null(pd))
{
stop(&quot;You need to install a newer version of the mgcv package&quot;)
}
``` The GAM plot shows strong curvature, and also shows a confidence interval. Since the confidence interval becomes very broad for high numbers of inpatient visits (due to the low number of patients with these high number of visits) we should be very suspicious of any apparent curvature out in those ranges.
The pd variable contains the values that were used to create the plot shown above. We will fit linearising functions against the data within the pd variable.
Today we will only consider a power transformation and a truncated version of the power transformation. Power transformations are commonly used as linearising function, and I remember being taught to use them when doing my generalized linear models course at university. Given the concave nature of this relationship, you should probably try out a log transformation too.

```# create a table of the predicted values and standard errors
pdCol = pd[]
tab = data.frame(pdCol\$x, pdCol\$fit, pdCol\$se)
names(tab) = c(&quot;x&quot;, &quot;y&quot;, &quot;se&quot;)

plot(tab\$x, tab\$y)

# fit a straight line
model.lm.2 = lm(y ~ x, data = tab, weights = 1 / se ^ 2)
tab\$linear = coef(model.lm.2) + coef(model.lm.2) * tab\$x

# fit a power transformation
initA = tab\$y
initB = tab\$y[nrow(tab)]
initC = 0
minX = min(tab\$x)
maxX = max(tab\$x)
model.power = nlsLM(y ~ a + (b - a) / (maxX - minX) * (x - minX) ^ (1 + c) , data = tab, weights = 1/se^2,
start = c(a = initA, b = initB, c = initC),
trace = FALSE,
control = nls.control(maxiter = 1000, tol = 1e-05, minFactor = 1/2048, printEval = FALSE, warnOnly = TRUE)
)
tab\$power = predict(model.power)

# fit a truncated power transformation
# fit a power transformation
initA = tab\$y
initB = tab\$y[nrow(tab)]
initC = 0
tab\$x.truncated = tab\$x
tab\$x.truncated[tab\$x.truncated &gt; 12] = 12
minX = min(tab\$x.truncated)
maxX = max(tab\$x.truncated)
model.power.truncated = nlsLM(y ~ a + (b - a) / (maxX - minX) * (x.truncated - minX) ^ (1 + c) , data = tab, weights = 1/se^2,
start = c(a = initA, b = initB, c = initC),
trace = FALSE,
control = nls.control(maxiter = 1000, tol = 1e-05, minFactor = 1/2048, printEval = FALSE, warnOnly = TRUE)
)
tab\$power.truncated = predict(model.power.truncated)

# plot the result
ylim = c(min(pdCol\$fit - pdCol\$se), max(pdCol\$fit + pdCol\$se))
plot(tab\$x, tab\$y, ylim=ylim)
polygon(c(pdCol\$x, rev(pdCol\$x)), c(pdCol\$fit - pdCol\$se, rev(pdCol\$fit + pdCol\$se)), col=&quot;lightgrey&quot;, border=NA)
lines(tab\$x, tab\$y, type=&quot;l&quot;, col=&quot;black&quot;)
lines(tab\$x, tab\$linear, type=&quot;l&quot;, col = &quot;blue&quot;)
lines(tab\$x, tab\$power, type=&quot;l&quot;, col = &quot;red&quot;)
lines(tab\$x, tab\$power.truncated, type=&quot;l&quot;, col = &quot;green&quot;)
``` A linear relationship (the blue line) is clearly a poor fit to the data. The power transformation (the red line) does a pretty good job of capturing the curvature of the relationship over most of the range, staying within the confidence interval. It is not clear whether truncation (the green line) is necessary.
To choose between the linearising functions, one should score the goodness of fit, and choose the least complex function that provides an adequate fit. The definition of “adequate fit” varies according to what you need to use the model for. Remember also that the purpose of linearising functions is not just to fit well, but to capture the shape of the relationship, so I suggest that your goodness of fit score should include some measure of shape, such as the number of runs of the linearising function either side of the GAM smoother.

```# choosing between the models based upon the number of runs
# keep only the data points that lie closest to an actual data point
tab2 = tab[sapply(summary\$number_inpatient, function(x) which.min(abs(tab\$x - x))),]
runsTest.linear = runs.test(x = (tab2\$y - tab2\$linear), alternative = &quot;two.sided&quot;, threshold = 0, pvalue = &quot;exact&quot;, plot = FALSE)\$p.value
runsTest.power = runs.test(x = (tab2\$y - tab2\$power), alternative = &quot;two.sided&quot;, threshold = 0, pvalue = &quot;exact&quot;, plot = FALSE)\$p.value
runsTest.power.truncated = runs.test(x = (tab2\$y - tab2\$power.truncated), alternative = &quot;two.sided&quot;, threshold = 0, pvalue = &quot;exact&quot;, plot = FALSE)\$p.value
c(runsTest.linear, runsTest.power, runsTest.power.truncated)
```

To fully automate this R script, you would also need to consider different possible truncation points, and choose the best location to truncate. I leave it to the reader to implement that extra code.
.

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.

# Denoising Dirty Documents: Part 3

In my last blog, I discussed how to threshold an image, and we ended up doing quite a good job of cleaning up the image with the creased paper. But then I finished the blog by showing that our algorithm had a problem, it didn’t cope well with coffee cup stains. The problem is that coffee cup stains are dark, so our existing algorithm does not distinguish between dark writing and dark stains. We need to find some features that distinguish between dark stains and dark writing. Looking at the image, we can see that even though the stains are dark, the writing is darker. The stains are dark compared to the overall image, but lighter than the writing within them,

So we can hypothesise that we might be able to separate the writing from the stains by using thresholding that looks at the pixel brightnesses over a localised set of pixels rather than across all of the image. That way we might separate the darker writing from the dark background stain.

While I am an R programmer and a fan of R, I have to admit that R can be a frustrating platform for image processing. For example, I cannot get the adimpro package to work at all, despite much experimentation and tinkering, and googling for solutions. Then I discovered that the biOps package was removed from CRAN, and the archived versions refuse to install on my PC. But eventually I found ways to get image processing done in R.

First, I tried the EbayesThresh package.

```
# libraries
if (!require("pacman")) install.packages("pacman")

# sample image containing coffee cup stain

# using EbayesThresh
et = ebayesthresh(img)
plot(raster(et))
range(et)

``` All of the values are zero! That isn’t very helpful.

Second, I tried the treethresh package.

```
# experiment with treethresh
if (!require("pacman")) install.packages("pacman")

# sample image containing coffee cup stain

# using tree thresholding
tt = treethresh(img)
tt.pruned &lt;- prune(tt)
tt.thresholded &lt;- thresh(tt.pruned)
plot(raster(tt.thresholded))
range(tt.thresholded)

```

Once again I only got zeroes.
Finally, I tried wavelet thresholding using the treethresh package.

```# experiment with wavelet thresholding
if (!require("pacman")) install.packages("pacman")

# sample image containing coffee cup stain

# using wavelet thresholding
# need a square image that has with that is a power of 2
# compute wavelet transform
wt.square &lt;- imwd(img[1:256,1:256])
# Perform the thresholding
wt.thresholded = threshold(wt.square)
# Compute inverse wavelet transform
wt.denoised &lt;- imwr(wt.thresholded)
plot(raster(wt.denoised))
``` It wasn’t what I was looking for, but at least it’s not just a matrix of zeroes!

After these two failures with CRAN packages, I decided that I needed to look beyond CRAN and try out the EBImage package on Bioconductor.

```
if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# libraries
if (!require("pacman")) install.packages("pacman")

# sample image containing coffee cup stain

img.thresholded.3 = thresh(img.eb, 3, 3)
display(img.thresholded.3)
img.thresholded.5 = thresh(img.eb, 5, 5)
display(img.thresholded.5)
img.thresholded.7 = thresh(img.eb, 7, 7)
display(img.thresholded.7)
img.thresholded.9 = thresh(img.eb, 9, 9)
display(img.thresholded.9)

``` These images show that we are on the right track. The adaptive thresholding is good at separating the letters from the background, but it also adds black to the background. While the adaptive thresholding may not be perfect, any feature extraction that adds to our knowledge will probably improve our ensemble model.

Before we add adaptive thresholding to our model, we will do a bit more pre-processing. Based on the observation that adaptive thresholding is keeping the letters but adding some unwanted black to the background, we can combine all of the adaptive thresholding with normal thresholding images, and use the maximum pixel brightness across each 5 image set.

```
# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi &lt;= hiThresh] = 0
imgHi[imgHi &gt; hiThresh] = 1

return (imgHi)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)),
img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
plot(raster(ttt.3))

``` It’s starting to look much better. The coffee cup stain isn’t removed, but we are starting to successfully clean it up. This suggests that adaptive thresholding will be an important predictor in our ensemble.

Let’s bring it all together, and build a new ensemble model, extending the model from my last blog on this subject.

```
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm, foreach, doSNOW)

if (!require("EBImage"))
{
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
}

# a function to do k-means thresholding
kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi &lt;= hiThresh] = 0
imgHi[imgHi &gt; hiThresh] = 1

return (imgHi)
}

# a function that applies adaptive thresholding
{
img.eb &lt;- Image(t(img))
img.thresholded.3 = thresh(img.eb, 3, 3)
img.thresholded.5 = thresh(img.eb, 5, 5)
img.thresholded.7 = thresh(img.eb, 7, 7)
img.thresholded.9 = thresh(img.eb, 9, 9)
img.thresholded.11 = thresh(img.eb, 11, 11)
img.kmThresh = kmeansThreshold(img)

ttt.1 = cbind(img2vec(Image2Mat(img.thresholded.3)), img2vec(Image2Mat(img.thresholded.5)), img2vec(Image2Mat(img.thresholded.7)), img2vec(Image2Mat(img.thresholded.9)), img2vec(Image2Mat(img.thresholded.11)), img2vec(kmeansThreshold(img)))
ttt.2 = apply(ttt.1, 1, max)
ttt.3 = matrix(ttt.2, nrow(img), ncol(img))
return (ttt.3)
}

# a function to turn a matrix image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# a function to convert an Image into a matrix
Image2Mat = function(Img)
{
m1 = t(matrix(Img, nrow(Img), ncol(Img)))
return(m1)
}

dirtyFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

dat = data.table(cbind(y, x, x2, x3))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# read in the full data table

# fit a model to a subset of the data
set.seed(1)
rows = sample(nrow(dat), 1000000)
gbm.mod = gbm(y ~ raw + thresholded + adaptive, data = dat[rows,], n.trees = 7500, cv.folds = 3, train.fraction = 0.5, interaction.depth = 5)
best.iter &lt;- gbm.perf(gbm.mod,method="cv",oobag.curve = FALSE)

s = summary(gbm.mod)

# get the predictions - using parallel processing to save time
numCores = 6 #change the 6 to your number of CPU cores. or maybe lower due to RAM limits
cl = makeCluster(numCores)
registerDoSNOW(cl)
num_splits = numCores
split_testing = sort(rank(1:nrow(dat))%%numCores)
yHat = foreach(i=unique(split_testing),.combine=c,.packages=c("gbm")) %dopar% {
as.numeric(predict(gbm.mod, newdata=dat[split_testing==i,], n.trees = best.iter))
}
stopCluster(cl)

# what score do we get on the training data?
rmse = sqrt(mean( (yHat - dat\$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img), img2vec(adaptiveThresholding(img)))
yHat = predict(gbm.mod, newdata=x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

``` The adaptive thresholding gets a relative importance score of 2.4%, but full image thresholding loses all of its importance in the model. The result may not be perfect, but you can see how the coffee cup stain is starting to be erased. We will need to find some more features to continue the clean up, but that will be left to future blogs.

This blog’s model has improved the RMSE on the training data from 6.5% reported in my last blog to 5.4% in this latest version. There’s a long way to go to have a top 10 performing model, but this process is steadily moving us in the right direction.

# Efficiently Building GLMs: Part 4

In my previous blog in this series, I discussed how you can write custom R functions that can recognise separation and missing cells in multivariate data. This week I discuss how you can customise the process that searches for the best combination of predictor columns.

While the bestglm and glmulti packages are great for getting you started in automation, I found that I wanted more. In particular I wanted:

• automation of data with more predictor columns than glmulti allows,
• parallel processing to satisfy my impatience, and
• customisation of the model selection, to allow for issues such as separation and collinearity.

So I created by own functions as alternatives.

# Restructuring the Problem

Before we implement a search algorithm, we need to restructure the GLM model building problem so that it becomes an optimisation problem. We do this by creating a vector of binary values that flags which columns are to be included in the GLM model formula.

 1 0 1 0

Consider the binary vector above. Since the first value equals 1, the first predictor column is included in the model formula. Since the third slot in the vector equals 1, then the third predictor column is included in the model formula. However the second and fourth predictor columns are not included, because the indicator variables in those slots have a value of 0. We can code this in R:

```# a function to create GLM formulae from an indicator vector
createFormula = function(indicators)
{
# do the special case of no predictor columns, so use NULL model
if (sum(indicators) == 0)
return (formula(paste(target, " ~ 1", sep="")))

# the more common case
return (formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + "))))
}
```

So the optimisation problem is to find the binary vector that creates the optimal GLM formula, where “optimal” relates to the quality of the GLM that you created (allowing for whatever measure you choose to you). To keep this example simple, I will use the BIC score of the GLM, but that does not mean that I am recommending that you should use BIC. I just means that this blog is only about automation, and there isn’t space here to discuss how to measure what the “best” GLM measure would be.

```
# libraries
if (!require("pacman")) install.packages("pacman")

# the function to be optimised
# the function to be optimised
glmScore = function(indicators)
{
# return a NULL model if any errors or warnings occur
glm.mod = tryCatch (
glm(createFormula(indicators), data = td, family = binomial(link=logit)),
error = function(e) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit)),
warning = function(w) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit))
)
# if there are problems with NA parameters (e.g. via separation), then use a NULL model
if (any(is.na(coef(glm.mod))))
glm.mod = glm(createFormula(indicators * 0), data = td, family = binomial(link=logit))
print(glm.mod\$formula)
return(BIC(glm.mod))
}

glmScr = memoise(glmScore)

```

As you can see, you can customise how a particular GLM is scored, to allow for your preferences about issues such as separation or collinearity.

The memoise package is another excellent utility that I have recently discovered. It remembers the return value for a function for a given parameter value, and instead of wasting time re-evaluating the function the next time that parameter value is used, it returns the cached return value for that particular parameter input. This can save a lot of processing time when a function is numerically demanding (e.g. fitting a GLM) and called repeatedly using the same parameter values (e.g. when iteratively searching through candidate formulae).

# Imputing Missing Values

Since we are going to be comparing scores from different GLM models, we need to ensure that those models and scores are comparable. Missing values can be an issue here. When there are missing values in some predictors, R drops out the rows with the missing values, and only fits the GLM to the rows with no missing values in the predictors. The model scores are based upon the fit to this reduce data set, and therefore have lower scores. But R does not warn the user that it is doing this!

In the diabetes data used in this series of blogs, there are missing values. In particular, more than half of the values for the patients’ weight are missing. A optimisation based on BIC would choose a model that included weight_num as a predictor because the small number of data rows used results in a much lower score for goodness of fit.

To overcome the problem of missing values, we must either remove all rows containing any missing values, or impute values where they are missing.  R provides a few ways to do this. For example, you can impute values using the mice package.

```
# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";

nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)
# clean up the missing values
if (!require("pacman")) install.packages("pacman")

cleaned = mice(td)

```

In this blog I will take the dubious but quick approach of dropping out any rows that contain missing values in the numeric fields.

```
# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";

nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)
# remove all rows containing missing values
td = td[! is.na(td\$weight_num),]

```

For the diabetes readmission data this reduces the data from 101,000 rows to 3,000 rows, which is hardly a suitable solution. A better solution would have been to simply drop the weight_num column from the data.

# Customised Step-wise Model Building

While step-wise model building is not the best way to put together a GLM, it is the quickest way to get to a reasonable model. I use the step-wise approach whenever I am feeling very, very impatient!

But instead of using an out-of-the-box version of step-wise model building, I wrote my own function. It begins by using a GBM to measure variable importance, and to create a reasonable starting model. Then it uses a version of the step-wise approach that has been enhanced to allow for one predictor to be swapped out for another predictor. Usually step-wise implementations are greedy algorithms, i.e. they only look one step ahead. But swapping one column for another involves two steps – removing one column, then adding another column. A greedy algorithm won’t see past the first step of removing a column, and therefore won’t see the full benefit of the swap, and won’t choose that path. It gets trapped in a local minima. This enhancement overcomes the problem by treating column swaps as if they involved just a single step. Here is my script:

```# function to show a formula as text
f2text = function(f)
{
paste(as.character(f)[c(2,1,3)], collapse = " ")
}

# a vector of names of predictor columns
predictors = names(td)
predictors = predictors[predictors != target]

# reset the cache on the glmScore function
forget(glmScr)
# create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model
indicators = matrix(0, length(predictors), 1)
bestScore = glmScr(indicators)
bestParameters = indicators

searching = TRUE
while (searching)
{
# alternate formulae, with just one column added, or one column removed
alternates = sapply(seq_len(length(indicators)), function(x) {temp = indicators; temp[x] = 1 - temp[x]; return(temp) })

# alternate formulae, replacing one column for another
existing = seq_len(length(indicators))[indicators == 1]
for (i in existing)
{
for (j in seq_len(length(indicators)))
{
if (i != j && indicators[j] == 0)
{
temp = indicators
temp[i] = 0
temp[j] = 1
alternates = cbind(alternates, temp)
}
}
}

# evaluate all of the alternatives
scores = apply(alternates, 2, glmScr)

if (min(scores) < bestScore)
{
bestScore = min(scores)
bestParameters = alternates[,which.min(scores)]
print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": "))
indicators = bestParameters
} else
{
searching = FALSE
}
}
``` When you’re impatient, like me, you want things to run as fast as possible. From a speed perspective it makes sense to start with a null model, and add predictors, rather than starting from an overspecified model and removing predictors. You also want parallel processing. Rather than processing one GLM at a time, I want to simultaneously process as many GLMs as my PC will allow! R supports parallel processing, but Windows users have fewer parallel processing options in R than Linux users. Nevertheless, the snowfall package gives me exactly what I want for this particular task.

```
# libraries
if (!require("pacman")) install.packages("pacman")

sfInit( parallel=TRUE, cpus=3 )
if( sfParallel() )
{
cat( "Running in parallel mode on", sfCpus(), "nodes.\n" )
} else
{
cat( "Running in sequential mode.\n" ) }

# a vector of names of predictor columns
predictors = names(td)
predictors = predictors[predictors != target]

# reset the cache on the glmScore function
forget(glmScr)
# create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model
indicators = matrix(0, length(predictors), 1)
bestScore = glmScr(indicators)
bestParameters = indicators

searching = TRUE
while (searching)
{
# alternate formulae, with just one column added, or one column removed
alternates = sapply(seq_len(length(indicators)), function(x) {temp = indicators; temp[x] = 1 - temp[x]; return(temp) })

# alternate formulae, replacing one column for another
existing = seq_len(length(indicators))[indicators == 1]
for (i in existing)
{
for (j in seq_len(length(indicators)))
{
if (i != j && indicators[j] == 0)
{
temp = indicators
temp[i] = 0
temp[j] = 1
alternates = cbind(alternates, temp)
}
}
}

sfExportAll()
# evaluate all of the alternatives
#scores = apply(alternates, 2, glmScr)
scores = unlist(sfClusterApplyLB(seq_len(ncol(alternates)), function(x) glmScr(alternates[,x])))

if (min(scores) < bestScore)
{
bestScore = min(scores)
bestParameters = alternates[,which.min(scores)]
print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": "))
indicators = bestParameters
} else
{
searching = FALSE
}
}

sfStop()

```

Step-wise model building works OK when there aren’t too many predictors, and when I don’t want any interaction terms. But once there are many interaction terms, step-wise model building has to work through too many permutations. The problem becomes too highly dimensional for grid based searches.

We need a new approach. We need our GLMs to evolve into intelligent models. Let’s consider the case where we want two-way interaction terms. Once again we use an indicator vector to define the GLM formula, but this time we need to include a vector cell for each and every combination of individual predictors.

The first step is to create a list of all of the individual predictors, and each two-way combination of predictors.

```
# a vector of names of predictor columns
predictors = names(td)
predictors = predictors[predictors != target]
indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors)))
indices = indices[indices[,1] != indices[,2],]
twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":"))
predictors = c(predictors, twoWays)
length(predictors)

```

Once we include two-way interaction terms, there are 2025 terms that could be included in the GLM formula! A thorough search of all possible formulae would require evaluating 2^2025 GLMs!!!

R has a few packages that implement genetic algorithms. Today I will use the GA package to optimise the indicator vector. Note that genetic algorithms try to maximise a fitness function, whereas we are trying to minimise a score. So we will need to define a fitness function that is -1 times the GLM score.

```
# libraries
if (!require("pacman")) install.packages("pacman")

# define a glm fitness function
glmFitness = function(x) return(-1 * glmScr(x))

ga.mod = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), popSize = 100, parallel = TRUE, run = 5)

``` From the output we can see that our first attempt at a genetic algorithm didn’t work very well. That’s because most formulae contain 2-way interactions, and most 2-way interactions contain missing cells, and therefore these formulae have been rejected. To fix this we need to bias the initial population towards formulae that do not contain 2-way terms, and towards formulae than contain very few terms.

```
# custom function to create an initial population
initPopn = function (object, ...)
{
population = matrix(0, nrow = object@popSize, ncol = object@nBits)
n = ncol(td) - 1
probability = min(1, 2 / n)
ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n)
population[,seq_len(n)] = ones
return(population)
}

set.seed(1)

ga.mod.2 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 100, parallel = TRUE, run = 5)

``` This time we had more success. But we need to tune the hyperparameters, using a bigger population size and slower stopping criteria, in order to get better results. I also added a custom function to show me which GLM model formula is the best performing at the end of each epoch. The data preparation and support functions have also been updated to make the process more robust, and I changed the initial population to ensure more genetic diversity.

```
# libraries
if (!require("pacman")) install.packages("pacman")

# function to show a formula as text
f2text = function(f)
{
if (class(f) != "formula")
stop("f is not a formula!")
return(paste(as.character(f)[c(2,1,3)], sep = "", collapse = " "))
}

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";

# remove all rows containing missing values
td = td[! is.na(td\$weight_num),]

# remove low variance columns
td = td[,-nearZeroVar(td)]

# remove the diagnosis columns - just to make the example work faster
td = td[,-c(11, 12, 13)]

nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)

# a vector of names of predictor columns
predictors = names(td)
predictors = predictors[predictors != target]
indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors)))
indices = indices[indices[,1] < indices[,2],]
twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":"))
predictors = c(predictors, twoWays)
length(predictors)

# a function to create GLM formulae from an indicator vector
createFormula = function(indicators)
{
if (is.null(indicators))
stop("createFormula: indicators is NULL!")
if (any(is.na(indicators)))
stop("createFormula: NA found in indicators!")
# do the special case of no predictor columns, so use NULL model
if (sum(indicators) == 0)
return (formula(paste(target, " ~ 1", sep="")))

# the more common case
result = formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + ")))
return (result)
}

# the function to be optimised
glmScore = function(indicators)
{
if (is.null(indicators))
stop("glmScore: indicators is NULL!")
if (any(is.na(indicators)))
{
if (colin_verbose)
print("glmScore: NA found in indicators!")
}

# return a NULL model if any errors or warnings occur
fff = createFormula(indicators)
useMod = FALSE
glm.mod = NULL
tryCatch (
{
glm.mod = glm(fff, data = td, family = binomial(link=logit));
useMod = TRUE;
}
,
error = function(e) {
if (colin_verbose)
{
print(f2text(fff));
print(e);
}
},
warning = function(w) {
if (colin_verbose)
{
print(f2text(fff));
print(w);
}
}
)
if (! useMod)
{
return (9e99)
}
# if there are problems with NA parameters (e.g. via separation), then use a NULL model
if (any(is.na(coef(glm.mod))))
{
if (colin_verbose)
print("NAs in coefficients")
return (9e99)
}
result = BIC(glm.mod)
return(result)
}

# create a version of GLM score that remembers previous results
glmScr = memoise(glmScore)

# define a glm fitness function
glmFitness = function(x) return(-1 * glmScr(x))

# custom function to create an initial population
initPopn = function (object, ...)
{
population = matrix(0, nrow = object@popSize, ncol = object@nBits)
n = ncol(td) - 1
probability = min(1, 2 / n)
ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n)
population[,seq_len(n)] = ones
# make the first n rows be the nth predictor
n2 = min(n, object@popSize)
for (i in seq_len(n2))
{
population[i,] = 0
population[i, i] = 1
}
return(population)
}

# custom function to monitor the genetic algorithm
glmMonitor = function (object, digits = getOption("digits"), ...)
{
fitness <- na.exclude(object@fitness)
i = which.max(fitness[! is.na(object@fitness)])
p = object@population[! is.na(object@fitness),]
x = p[i,]
cat(paste("Iter =", object@iter, " | Median =",
format(-1 * median(fitness), digits = 0),
" | Best =",
format(-1 * max(fitness), digits = 0),
f2text(createFormula(x)),
"\n"))
}

# run a genetic algorithm to find the "best" GLM with interaction terms
forget(glmScr)
colin_verbose = FALSE
ga.mod.3 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
population = initPopn,
popSize = 500,
parallel = TRUE,
run = 25,
monitor = glmMonitor,
seed = 1)
``` The “best” GLM has evolved to include 2-way interactions. Since it doesn’t take too long to run, it is helpful to rerun the genetic algorithm using other random seeds, to check that we have converged on the best score.

```
ga.mod.4 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
population = initPopn,
popSize = 500,
parallel = TRUE,
run = 25,
monitor = glmMonitor,
seed = 2)
ga.mod.5 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
population = initPopn,
popSize = 500,
parallel = TRUE,
run = 25,
monitor = glmMonitor,
seed = 3)

```

Comparing the results of the tree genetic algorithm runs, we find that the best scoring model is:

readmitted_flag ~ number_emergency + number_inpatient + discharge_id_desc + age_num:num_lab_procedures + number_emergency:number_inpatient

This model scored 2191. This model also makes intuitive sense. The following data is predictive of the probability of another hospital visit in the next 30 days:

• the number of emergency visits to hospital over the previous 12 months
• the number of hospital visits over the previous 12 months
• the health of the patient when they were discharged (including whether they died in hospital!)
• the relationship between the patient’s age and the number of treatments that they receive
• the relationship between the number of emergency visits versus the total number of visits to the hospital over the past 12 months

.

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.

# Denoising Dirty Documents: Part 2

In the first blog in this series, I explained the nature of the problem to be solved, and showed how to create a simple linear model that gives an RMSE score on the training data of 7.8%. In this blog I introduce the feedback loop for creating features, and we extend and improve the existing competition model to include a couple of new features. I have been building machine learning models for almost 20 years now, and never once has my first model been satisfactory, nor has that model been the model that I finally use. The Denoising Dirty Documents competition is no different. In the plots above, you can see my progress through the public leaderboard, as I iteratively improved my model. The same has occurred more broadly in the competition overall, as the best score has iteratively improved since the competition began.

Coming from an actuarial / statistical background, I tend to think in terms of a “control cycle”. For actuaries, the control cycle looks something like this: We can categorise the content of my last blog as:

1) Define the Problem

• remove noise from images
• images are three-dimensional surfaces

2) Hypothesise Solutions

• we hypothesised that the pixel brightnesses needed to be rescaled

3) Implement

• we fitted a least squares linear model to rescale the pixel brightnesses, and we capped and floored the values to keep them within the [0, 1] range

4) Monitor Results

• we calculated the RMSE on the training data before and after the implementation
• we looked at one of the output images

So we have completed one cycle. Now we need to do another cycle.

What I like about the Denoising Dirty Images Competition is that the data visualisation and model visualisation are obvious. We don’t need to develop clever visualisation tools to understand the problem or the model output. We can just look at the dirty and predicted images. Let’s begin with reviewing the example input and output images from the model used in the last blog i.e. let’s begin by recalling the Monitor Results step from the previous cycle:  Now we can progress to the beginning of the next cycle, the Define the Problem stage. This time the problem to be solved is to remove one or more of the errors from the output of the existing model. Once we already have a model, we can define the problem by asking ourselves the following question.

Question: In the predicted image, where have prediction errors have been made?

Answer: The predicted image contains some of the crease lines.

So in this cycle, the problem is to remove the remaining crease lines.

It’s time to begin the Hypothesise Solutions stage of the cycle by asking ourselves the following questions:

• In the dirty image, what is a single characteristic of what we want to keep or remove that we haven’t captured in our existing model?
• What characteristics do the errors in the predicted image have?
• What characteristics does the dirty image have at the locations where these errors occur?

Here are a few observations that I made in answer to these questions:

• In the dirty image, the writing is darker than the background around it. We haven’t captured this locality information.
• The predicted image retains some of the crease lines. Those remaining crease lines are narrower and not as dark as the writing.
• In the dirty image, the shadows next to the white centres of the crease are the remaining crease lines from the predicted image.

This led me to the hypothesis that the model for a single pixel needs to include information about the brightness of other pixels in the image. There are multiple ways that we could consider the brightnesses of other pixels, but in today’s blog I will limit myself to considering the brightness of the pixel under consideration versus the range of pixel brightnesses across the entire image.

To Implement a hypothesis test, and solution, we need to know some theory about image processing. There are some good textbooks that cover this material. The image processing / machine vision textbook that I have been using is  Computer and Machine Vision, Fourth Edition: Theory, Algorithms, Practicalities The machine vision technique that I will apply this today is thresholding, which is the process of turning an image into pixels that can only be black or white, with no grey shades or colours. Writing code to do thresholding is the easy part. The trickier part is to decide the threshold value at which pixels are split into either black or white.

```
# libraries
if (!require("pacman")) install.packages("pacman")

# turn the image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# show a histogram
hist(img2vec(img))

``` One way of finding a threshold value is to look at a histogram of the pixel brightnesses and look for a natural break between local maxima in the histogram. Since the histogram above is tri-modal, this would leave us with two choices of thresholds, one around 0.3 and one around 0.65.

```
# threshold at 0.3
img0.3 = img
img0.3[img0.3 <= 0.3] = 0 img0.3[img0.3 > 0.3] = 1
plot(raster(img0.3))

``` Now we can Monitor Results. Thresholding at 0.3 gives us no false positives for this image, but leaves out a lot of the writing.

```
# threshold at 0.65
img0.65 = img
img0.65[img0.65 <= 0.65] = 0 img0.65[img0.65 > 0.65] = 1
plot(raster(img0.65))

``` Thresholding at 0.65 gives us no false positives for this image, and correctly flags the writing without flagging the creases. That makes it quite a good feature for use on this image, as it will help us to remove the residual crease mark that we found when reviewing the output of the last blog’s model. We should definitely include a thresholding feature in our model!

But it feels very clumsy to plot a histogram and manually select the threshold, and such a feature definitely isn’t “machine learning”. Can we automate this?

Otsu’s Method is a well-known technique in image processing that can automatically suggest a threshold. But it assumes a bi-modal histogram, which we have seen is not true.

Inspired by Otsu’s method, we could use cluster analysis to generate three clusters of pixel brightnesses, and use the splits between those clusters to threshold the image.

```
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the lower threshold is the halfway point between the top of the lowest cluster and the bottom of the middle cluster
loThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using lower threshold
imgLo = img
imgLo[imgLo <= loThresh] = 0 imgLo[imgLo > loThresh] = 1
plot(raster(imgLo))

# using upper threshold
imgHi = img
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1
plot(raster(imgHi))

```

Now we can Monitor Results. Once again, the lower threshold choice doesn’t capture enough of the writing, while the upper threshold choice works well for this image. And this time there wasn’t any manual work required!

Let’s put it all together and combine last blog’s feature (a linear transformation of the raw pixel brightnesses) with this blog’s feature (thresholding). This time we will use something more sophisticated than linear regression. Our model will be based upon R’s GBM package, a gradient boosted machine to combine the raw pixel brightnesses with the thresholded pixels. GBMs are good all-purpose models, are simple to set up, perform well, and are almost always my first choice when creating machine learning models.

```
# libraries
if (!require("pacman")) install.packages("pacman")

# a function to do k-means thresholding

kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod\$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod\$cluster == oc]) + min(v[km.mod\$cluster == oc]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

dat = data.table(cbind(y, x, x2))
setnames(dat,c("y", "raw", "thresholded"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}

# view the data
rows = sample(nrow(dat), 10000)
d1 = dat[rows,]
plot(d1\$raw[dat\$thresholded == 0], d1\$y[dat\$thresholded == 0], col = "blue")
lines(d1\$raw[dat\$thresholded == 1], d1\$y[dat\$thresholded == 1], col = "red", type="p")

# fit a model to a subset of the data
rows = sample(nrow(dat), 100000)
gbm.mod = gbm(y ~ raw + thresholded, data = dat[rows,], n.trees = 5000, cv.folds = 10, train.fraction = 0.5)
best.iter <- gbm.perf(gbm.mod,method="cv")

# what score do we get on the training data?
yHat = predict(gbm.mod, newdata=dat, n.trees = best.iter)
rmse = sqrt(mean( (yHat - dat\$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata=x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

``` Notice how the sample output image is different from the thresholded image? The edges of the writing are lighter than the centre of the letters. That’s good, because that’s one of the characteristics of the clean image. The sample image that we created from the predicted values is looking really good. The residual crease marks have disappeared, yet we have kept the writing. And the RMSE score on the training data has improved to 6.5%.

But before this blog ends, let’s consider an image that wasn’t cleaned so well:

```# here's a sample image that doesn't perform as well
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata=x)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))
``` Damn! I feel like cursing the jerk who put their coffee cup on top of the important document and left a stain!

In my next blog, I will discuss the first steps towards removing that coffee cup stain. # Denoising Dirty Documents: Part 1

Tags

I recently blogged about my learning curve in my first Kaggle competition. This has become my most popular blog to date, and some readers have asked for more. So this blog is the first in a series of blogs about how to put together a reasonable solution to Kaggle’s Denoising Dirty Documents competition.

Some other competitors have been posting scripts, but those scripts are usually written in Python, whereas my background makes me an R programmer. So I will be writing scripts that make use of R.

# The Structure of the Problem

We have been given a series of training images, both dirty (with stains and creased paper) and clean (with a white background and black letters). We are asked to develop an algorithm that converts, as close as possible, the dirty images into clean images. A greyscale image (such as shown above) can be thought of as a three-dimensional surface. The x and y axes are the location within the image, and the z axis is the brightness of the image at that location. The great the brightness, the whiter the image at that location.

So from a mathematical perspective, we are being asked to transform one three-dimensional surface into another three dimensional surface. Our task is to clean the images, to remove the stains, remove the paper creases, improve the contrast, and just leave the writing.

In R, images are stored as matrices, with the row being the y-axis, the column being the x-axis, and the numerical value being the brightness of the pixel. Since Kaggle has stored the images in png format, we can use the png package to load the images.

```
# libraries
if (!require("pacman")) install.packages("pacman")

plot(raster(img))

```

You can see that the brightness values lie within the [0, 1] range, with 0 being black and 1 being white.

# Restructuring the Data for Machine Learning

Instead of modelling the entire image at once, we should predict the cleaned-up brightness for each pixel within the image, and construct a cleaned image by combining together a set of predicted pixel brightnesses. We want a vector of y values, and a matrix of x values. The simplest data set is where the x values are just the pixel brightnesses of the dirty images.

```
# libraries
if (!require("pacman")) install.packages("pacman")

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

dat = data.table(cbind(y, x))
setnames(dat,c("y", "x"))
write.table(dat, file=outPath, append=(f != filenames), sep=",", row.names=FALSE, col.names=(f == filenames), quote=FALSE)
}
# view the data
rows = sample(nrow(dat), 10000)
plot(dat\$x[rows], dat\$y[rows])

``` The data is now in a familiar format, which each row representing a data point, the first column being the target value, and the remaining column being the predictors.

# Our First Predictive Model

Look at the relationship between x and y. Except at the extremes, there is a linear relationship between the brightness of the dirty images and the cleaned images. There is some noise around this linear relationship, and a clump of pixels that are halfway between white and black. There is a broad spread of x values as y approaches 1, and these pixels probably represent stains that need to be removed.

So the obvious first model would be a linear transformation, with truncation to ensure that the predicted brightnesses remain within the [0, 1] range.

```
# fit a linear model, ignoring the data points at the extremes
lm.mod.1 = lm(y ~ x, data=dat[dat\$y &gt; 0.05 & dat\$y &lt; 0.95,])
summary(lm.mod.1)
dat\$predicted = sapply(predict(lm.mod.1, newdata=dat), function(x) max(min(x, 1),0))
plot(dat\$predicted[rows], dat\$y[rows])
rmse1 = sqrt(mean( (dat\$y - dat\$x) ^ 2))
rmse2 = sqrt(mean( (dat\$predicted - dat\$y) ^ 2))
c(rmse1, rmse2)

```

The linear model has done a brightness and contrast correction. This reduces the RMSE score from 0.157 to 0.078. Let’s see an output image:

```
# show the predicted result for a sample image
x = data.table(matrix(img, nrow(img) * ncol(img), 1))
setnames(x, "x")
yHat = sapply(predict(lm.mod.1, newdata=x), function(x) max(min(x, 1),0))
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

``` Although we have used a very simple model, we have been able to clean up this image: Our predicted image is: That’s quite good performance for a simple least squares linear regression!

To be fair though, I deliberately chose an example image that performs well. In my next blog in this series, I will discuss the use of a feedback loop in model design, and how to design new features to use as predictors.

# Efficiently Building GLMs: Part 3

Tags

In my last blog in this series, I described some R packages that can automatically choose the best GLM formula for you. Some people wanted more on this topic. In particular, Asko Kauppinen asked about quasi-complete separation, “empty cells in multidimensional space”. So in this blog, I will show you how to customise the automation process to deal with this issue. Once again, we are using the diabetes readmission data from the UCI Machine Repository. You can download my slightly altered version of the data that I am using from here.

Sometimes R gives you a warning when separation occurs, for example when running the following code.

```
# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# fit a GLM that has separation

```

After fitting the model, R shows the following warning: Note that this is a warning, and not an error. That’s because it may be perfectly valid for a fitted probability of 0 or 1 to occur. In the diabetes readmission data, it is reasonable to expect that no patients having a discharge description of “expired” (i.e. the patient died in hospital) will ever be readmitted to hospital again!

But, to simplify the example, let’s assume that expired patients can be readmitted, and therefore that we don’t want fitted probabilities of zero, and that medical treatment usually has some benefit and therefore that we don’t want fitted probabilities of one.

We could catch the warning thrown by R, and tell our model building algorithm that we don’t want to use these types of models. You could achieve this via R script that looks something like this:

```
# use a null model if this model has separation
diabetes.glm.2 = tryCatch (
,
warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) }
)
diabetes.glm.2

```

The tryCatch function notes when a warning is thrown, and then runs the alternative null model. The result can be seen to be the null model: The null model will be the worst performing GLM, so this model will not be selected by the algorithm as being the “best” model.

This simple solution looks too good to be true, and that’s because it is too good to be true. Consider the following R script and output:

```
# use a null model if this model has separation
diabetes.glm.3 = tryCatch (
,
warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) }
)
diabetes.glm.3

``` This time, R didn’t throw a warning, and, without informing the user, R has predicted the existence of zombies! So we can’t rely on R to consistently identify when separation occurs. We have to code for it ourselves…

```
# use a null model if this model has separation
# look for counts of zero
if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == 0) > 0)
{
print("separation with zeroes!")
} else
{
# look for counts equal to number of observations
if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == table(td\$discharge_id_desc)) > 0)
{
print("separation with zeroes!")
} else
{
}
}
diabetes.glm.4

``` This time we have been successful at manually identifying separation. A null model has been created. But this script only works for models with a single predictor. To work with models that use more than one predictor, we need to improve our approach. Since the script is getting more complex, I will refactor it into a function that takes a formula and a data frame, and returns a GLM. That GLM will be the null model when separation is found.

```# expand a dot formula to show all of the terms
expandDotFormula = function(f, dat)
{
target = gsub("()", "", f, fixed=TRUE)

# allow for dot-style formula
if (gsub("()", "", f, fixed=TRUE) == ".")
{
nm = names(dat)
nm = nm[nm != target]
fAdj = formula(paste(target, " ~ ", paste(nm, collapse=" + "), sep=""))
}

}

# test whether a specific term in a formula will cause separation
termHasSeparation = function(target, predictor, dat)
{
cols = unlist(strsplit(predictor, ":"))
isFactor = sapply(cols, function(x) class(dat[,x]) == "character" || class(dat[,x]) == "factor")
if (sum(isFactor) == 0)
return (FALSE)
fff = formula(paste(target, " ~ ", paste(cols[isFactor], collapse=" + "), sep = ""))
# check whether there are cells in which there are all zeroes
if (sum(xtabs(fff, data = dat) == 0) > 0)
return (TRUE)
# check whether there are cells in which there are all ones
if (sum(xtabs(fff, data = dat) == table(dat[,cols[isFactor]])) > 0)
return (TRUE)
# check whether there are cells with no exposure
nPossibleCells = prod(sapply(cols[isFactor], function(x) length(unique(dat[,x]))))
nActualCells = prod(dim(xtabs(fff, data = dat)))
if (nActualCells < nPossibleCells)
return (TRUE)

return (FALSE)
}

# test whether a formula will cause separation
fHasSeparation = function(f, dat)
{
ff = expandDotFormula(f, dat)
tt = terms(ff)
ttf = attr(tt, "factors")

# ttf has columns that represent individual terms of the formula
fTerms = colnames(ttf)
# check each term whether it exhibits separation
reject = any(sapply(fTerms, function(x) termHasSeparation(target, x, dat)))

return (reject)
}

# this function returns a NULL model if the formula causes separation
noSeparationGLM = function(f, data)
{
if (fHasSeparation(f, data))
{
target = gsub("()", "", f, fixed=TRUE)
f = formula(paste(target, " ~ 1", sep=""))
return (glm(f, data = data, family = binomial(link=logit)))
} else
{
return (glm(f, data = data, family = binomial(link=logit)))
}
}

# test some formulae
fHasSeparation(readmitted_flag ~ weight_num : discharge_id_desc, td)

# test some GLMs
noSeparationGLM(readmitted_flag ~ race + weight_num, td)
```  The fHasSeparation function now allows us to flag when a formula has terms that will cause separation, and the noSeparationGLM function returns a null model if you try to fit a model that has separation. We can use the noSeparationGLM function directly in glmulti to force it to reject GLMs that have separation. Sample code is shown below:

```
# fit a GLM allowing that possible models will be rejected due to separation
if (!require("pacman")) install.packages("pacman")

glms = glmulti(readmitted_flag ~ ., data = td[1:1000,1:15], level = 1, method="h", fitfunc = noSeparationGLM)

# show the result - note the absence of columns that cause separation

summary(glms)\$bestmodel

```

You should also check which columns and pairs of columns will cause separation. Some of these might be valid e.g. the “expired” patients in the diabetes readmission data, and it can also point out to you where you may need to do some dimensionality reduction. The R script to do this is:

```
# list the columns that cause separation

# list the pairs of columns that cause separation
colPairs = expand.grid(names(td[,-1]), names(td[,-1]))
colPairs = colPairs[apply(colPairs, 1, function(x) x != x),]
combinations = apply(colPairs, 1, function(x) paste(x, collapse=":")) 