Keeping Up With The Latest Techniques

~ brief insights

Keeping Up With The Latest Techniques

Category Archives: Machine Learning

Efficiently Building GLMs: Part 4

08 Saturday Aug 2015

Posted by Colin Priest in Automation, Generalized Linear Models, Genetic Algorithms, Machine Learning, R

≈ 2 Comments

Tags

Automation, Genetic Algorithms, GLMs, Machine Learning, R

In my previous blog in this series, I discussed how you can write custom R functions that can recognise separation and missing cells in multivariate data. This week I discuss how you can customise the process that searches for the best combination of predictor columns.

While the bestglm and glmulti packages are great for getting you started in automation, I found that I wanted more. In particular I wanted:

  • automation of data with more predictor columns than glmulti allows,
  • parallel processing to satisfy my impatience, and
  • customisation of the model selection, to allow for issues such as separation and collinearity.

So I created by own functions as alternatives.

Restructuring the Problem

Before we implement a search algorithm, we need to restructure the GLM model building problem so that it becomes an optimisation problem. We do this by creating a vector of binary values that flags which columns are to be included in the GLM model formula.

1 0 1 0

Consider the binary vector above. Since the first value equals 1, the first predictor column is included in the model formula. Since the third slot in the vector equals 1, then the third predictor column is included in the model formula. However the second and fourth predictor columns are not included, because the indicator variables in those slots have a value of 0. We can code this in R:

# a function to create GLM formulae from an indicator vector
createFormula = function(indicators)
{
 # do the special case of no predictor columns, so use NULL model
 if (sum(indicators) == 0)
 return (formula(paste(target, " ~ 1", sep="")))
 
 # the more common case
 return (formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + "))))
}

So the optimisation problem is to find the binary vector that creates the optimal GLM formula, where “optimal” relates to the quality of the GLM that you created (allowing for whatever measure you choose to you). To keep this example simple, I will use the BIC score of the GLM, but that does not mean that I am recommending that you should use BIC. I just means that this blog is only about automation, and there isn’t space here to discuss how to measure what the “best” GLM measure would be.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(memoise)

# the function to be optimised
# the function to be optimised
glmScore = function(indicators)
{
 # return a NULL model if any errors or warnings occur
 glm.mod = tryCatch (
 glm(createFormula(indicators), data = td, family = binomial(link=logit)),
 error = function(e) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit)),
 warning = function(w) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit))
 )
 # if there are problems with NA parameters (e.g. via separation), then use a NULL model
 if (any(is.na(coef(glm.mod))))
 glm.mod = glm(createFormula(indicators * 0), data = td, family = binomial(link=logit))
 print(glm.mod$formula)
 return(BIC(glm.mod))
}



glmScr = memoise(glmScore)

As you can see, you can customise how a particular GLM is scored, to allow for your preferences about issues such as separation or collinearity.

The memoise package is another excellent utility that I have recently discovered. It remembers the return value for a function for a given parameter value, and instead of wasting time re-evaluating the function the next time that parameter value is used, it returns the cached return value for that particular parameter input. This can save a lot of processing time when a function is numerically demanding (e.g. fitting a GLM) and called repeatedly using the same parameter values (e.g. when iteratively searching through candidate formulae).

Imputing Missing Values

Since we are going to be comparing scores from different GLM models, we need to ensure that those models and scores are comparable. Missing values can be an issue here. When there are missing values in some predictors, R drops out the rows with the missing values, and only fits the GLM to the rows with no missing values in the predictors. The model scores are based upon the fit to this reduce data set, and therefore have lower scores. But R does not warn the user that it is doing this!

In the diabetes data used in this series of blogs, there are missing values. In particular, more than half of the values for the patients’ weight are missing. A optimisation based on BIC would choose a model that included weight_num as a predictor because the small number of data rows used results in a much lower score for goodness of fit.

To overcome the problem of missing values, we must either remove all rows containing any missing values, or impute values where they are missing.  R provides a few ways to do this. For example, you can impute values using the mice package.


# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))
# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)
# clean up the missing values
if (!require("pacman")) install.packages("pacman")
pacman::p_load(mice)

cleaned = mice(td)

In this blog I will take the dubious but quick approach of dropping out any rows that contain missing values in the numeric fields.


# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)
# remove all rows containing missing values
td = td[! is.na(td$weight_num),]

For the diabetes readmission data this reduces the data from 101,000 rows to 3,000 rows, which is hardly a suitable solution. A better solution would have been to simply drop the weight_num column from the data.

Customised Step-wise Model Building

While step-wise model building is not the best way to put together a GLM, it is the quickest way to get to a reasonable model. I use the step-wise approach whenever I am feeling very, very impatient!

But instead of using an out-of-the-box version of step-wise model building, I wrote my own function. It begins by using a GBM to measure variable importance, and to create a reasonable starting model. Then it uses a version of the step-wise approach that has been enhanced to allow for one predictor to be swapped out for another predictor. Usually step-wise implementations are greedy algorithms, i.e. they only look one step ahead. But swapping one column for another involves two steps – removing one column, then adding another column. A greedy algorithm won’t see past the first step of removing a column, and therefore won’t see the full benefit of the swap, and won’t choose that path. It gets trapped in a local minima. This enhancement overcomes the problem by treating column swaps as if they involved just a single step. Here is my script:

# function to show a formula as text
f2text = function(f)
{
 paste(as.character(f)[c(2,1,3)], collapse = " ")
}

# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]

# reset the cache on the glmScore function
forget(glmScr)
# create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model
indicators = matrix(0, length(predictors), 1)
bestScore = glmScr(indicators)
bestParameters = indicators

searching = TRUE
while (searching)
{
 # alternate formulae, with just one column added, or one column removed
 alternates = sapply(seq_len(length(indicators)), function(x) {temp = indicators; temp[x] = 1 - temp[x]; return(temp) })
 
 # alternate formulae, replacing one column for another
 existing = seq_len(length(indicators))[indicators == 1]
 for (i in existing)
 {
 for (j in seq_len(length(indicators)))
 {
 if (i != j && indicators[j] == 0)
 {
 temp = indicators
 temp[i] = 0
 temp[j] = 1
 alternates = cbind(alternates, temp)
 }
 }
 }

 # evaluate all of the alternatives
 scores = apply(alternates, 2, glmScr)
 
 if (min(scores) < bestScore)
 {
 bestScore = min(scores)
 bestParameters = alternates[,which.min(scores)]
 print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": "))
 indicators = bestParameters
 } else
 {
 searching = FALSE
 }
}

20150807 output 1

When you’re impatient, like me, you want things to run as fast as possible. From a speed perspective it makes sense to start with a null model, and add predictors, rather than starting from an overspecified model and removing predictors. You also want parallel processing. Rather than processing one GLM at a time, I want to simultaneously process as many GLMs as my PC will allow! R supports parallel processing, but Windows users have fewer parallel processing options in R than Linux users. Nevertheless, the snowfall package gives me exactly what I want for this particular task.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(snowfall)

sfInit( parallel=TRUE, cpus=3 )
if( sfParallel() )
{
cat( "Running in parallel mode on", sfCpus(), "nodes.\n" )
} else
{
cat( "Running in sequential mode.\n" ) }

# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]

# reset the cache on the glmScore function
forget(glmScr)
# create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model
indicators = matrix(0, length(predictors), 1)
bestScore = glmScr(indicators)
bestParameters = indicators

searching = TRUE
while (searching)
{
# alternate formulae, with just one column added, or one column removed
alternates = sapply(seq_len(length(indicators)), function(x) {temp = indicators; temp[x] = 1 - temp[x]; return(temp) })

# alternate formulae, replacing one column for another
existing = seq_len(length(indicators))[indicators == 1]
for (i in existing)
{
for (j in seq_len(length(indicators)))
{
if (i != j && indicators[j] == 0)
{
temp = indicators
temp[i] = 0
temp[j] = 1
alternates = cbind(alternates, temp)
}
}
}

sfExportAll()
# evaluate all of the alternatives
#scores = apply(alternates, 2, glmScr)
scores = unlist(sfClusterApplyLB(seq_len(ncol(alternates)), function(x) glmScr(alternates[,x])))

if (min(scores) < bestScore)
{
bestScore = min(scores)
bestParameters = alternates[,which.min(scores)]
print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": "))
indicators = bestParameters
} else
{
searching = FALSE
}
}

sfStop()

Evolving Your GLM

Step-wise model building works OK when there aren’t too many predictors, and when I don’t want any interaction terms. But once there are many interaction terms, step-wise model building has to work through too many permutations. The problem becomes too highly dimensional for grid based searches.

We need a new approach. We need our GLMs to evolve into intelligent models.

Evolution

Let’s consider the case where we want two-way interaction terms. Once again we use an indicator vector to define the GLM formula, but this time we need to include a vector cell for each and every combination of individual predictors.

The first step is to create a list of all of the individual predictors, and each two-way combination of predictors.


# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]
indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors)))
indices = indices[indices[,1] != indices[,2],]
twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":"))
predictors = c(predictors, twoWays)
length(predictors)

Once we include two-way interaction terms, there are 2025 terms that could be included in the GLM formula! A thorough search of all possible formulae would require evaluating 2^2025 GLMs!!!

R has a few packages that implement genetic algorithms. Today I will use the GA package to optimise the indicator vector. Note that genetic algorithms try to maximise a fitness function, whereas we are trying to minimise a score. So we will need to define a fitness function that is -1 times the GLM score.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(GA)

# define a glm fitness function
glmFitness = function(x) return(-1 * glmScr(x))

ga.mod = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), popSize = 100, parallel = TRUE, run = 5)

20150807 output 2

From the output we can see that our first attempt at a genetic algorithm didn’t work very well. That’s because most formulae contain 2-way interactions, and most 2-way interactions contain missing cells, and therefore these formulae have been rejected. To fix this we need to bias the initial population towards formulae that do not contain 2-way terms, and towards formulae than contain very few terms.


# custom function to create an initial population
initPopn = function (object, ...) 
{
 population = matrix(0, nrow = object@popSize, ncol = object@nBits)
 n = ncol(td) - 1
 probability = min(1, 2 / n)
 ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n)
 population[,seq_len(n)] = ones
 return(population)
}

set.seed(1)

ga.mod.2 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 100, parallel = TRUE, run = 5)

20150807 output 5

This time we had more success. But we need to tune the hyperparameters, using a bigger population size and slower stopping criteria, in order to get better results. I also added a custom function to show me which GLM model formula is the best performing at the end of each epoch. The data preparation and support functions have also been updated to make the process more robust, and I changed the initial population to ensure more genetic diversity.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(memoise, caret, GA)

# function to show a formula as text
f2text = function(f)
{
 if (class(f) != "formula")
 stop("f is not a formula!")
 return(paste(as.character(f)[c(2,1,3)], sep = "", collapse = " "))
}

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# remove all rows containing missing values
td = td[! is.na(td$weight_num),]

# remove low variance columns
td = td[,-nearZeroVar(td)]

# remove the diagnosis columns - just to make the example work faster
td = td[,-c(11, 12, 13)]

# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)

# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]
indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors)))
indices = indices[indices[,1] < indices[,2],]
twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":"))
predictors = c(predictors, twoWays)
length(predictors)

# a function to create GLM formulae from an indicator vector
createFormula = function(indicators)
{
 if (is.null(indicators))
 stop("createFormula: indicators is NULL!")
 if (any(is.na(indicators)))
 stop("createFormula: NA found in indicators!")
 # do the special case of no predictor columns, so use NULL model
 if (sum(indicators) == 0)
 return (formula(paste(target, " ~ 1", sep="")))
 
 # the more common case
 result = formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + ")))
 return (result)
}

# the function to be optimised
glmScore = function(indicators)
{
 if (is.null(indicators))
 stop("glmScore: indicators is NULL!")
 if (any(is.na(indicators)))
 {
 if (colin_verbose)
 print("glmScore: NA found in indicators!")
 }

 # return a NULL model if any errors or warnings occur
 fff = createFormula(indicators)
 useMod = FALSE
 glm.mod = NULL
 tryCatch (
 {
 glm.mod = glm(fff, data = td, family = binomial(link=logit));
 useMod = TRUE;
 }
 ,
 error = function(e) {
 if (colin_verbose)
 {
 print(f2text(fff));
 print(e);
 }
 },
 warning = function(w) {
 if (colin_verbose)
 {
 print(f2text(fff));
 print(w);
 }
 }
 )
 if (! useMod) 
 {
 return (9e99)
 }
 # if there are problems with NA parameters (e.g. via separation), then use a NULL model
 if (any(is.na(coef(glm.mod))))
 {
 if (colin_verbose)
 print("NAs in coefficients")
 return (9e99)
 }
 result = BIC(glm.mod)
 return(result)
}

# create a version of GLM score that remembers previous results
glmScr = memoise(glmScore)

# define a glm fitness function
glmFitness = function(x) return(-1 * glmScr(x))

# custom function to create an initial population
initPopn = function (object, ...) 
{
 population = matrix(0, nrow = object@popSize, ncol = object@nBits)
 n = ncol(td) - 1
 probability = min(1, 2 / n)
 ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n)
 population[,seq_len(n)] = ones
 # make the first n rows be the nth predictor
 n2 = min(n, object@popSize)
 for (i in seq_len(n2))
 {
 population[i,] = 0
 population[i, i] = 1
 }
 return(population)
}

# custom function to monitor the genetic algorithm
glmMonitor = function (object, digits = getOption("digits"), ...) 
{
 fitness <- na.exclude(object@fitness)
 i = which.max(fitness[! is.na(object@fitness)])
 p = object@population[! is.na(object@fitness),]
 x = p[i,]
 cat(paste("Iter =", object@iter, " | Median =", 
 format(-1 * median(fitness), digits = 0), 
 " | Best =", 
 format(-1 * max(fitness), digits = 0), 
 f2text(createFormula(x)),
 "\n"))
}

# run a genetic algorithm to find the "best" GLM with interaction terms
forget(glmScr)
colin_verbose = FALSE
ga.mod.3 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), 
 population = initPopn, 
 popSize = 500, 
 parallel = TRUE, 
 run = 25, 
 monitor = glmMonitor, 
 seed = 1)

20150807 output 6

The “best” GLM has evolved to include 2-way interactions. Since it doesn’t take too long to run, it is helpful to rerun the genetic algorithm using other random seeds, to check that we have converged on the best score.


ga.mod.4 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
population = initPopn,
popSize = 500,
parallel = TRUE,
run = 25,
monitor = glmMonitor,
seed = 2)
ga.mod.5 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
population = initPopn,
popSize = 500,
parallel = TRUE,
run = 25,
monitor = glmMonitor,
seed = 3)

Comparing the results of the tree genetic algorithm runs, we find that the best scoring model is:

readmitted_flag ~ number_emergency + number_inpatient + discharge_id_desc + age_num:num_lab_procedures + number_emergency:number_inpatient

This model scored 2191. This model also makes intuitive sense. The following data is predictive of the probability of another hospital visit in the next 30 days:

  • the number of emergency visits to hospital over the previous 12 months
  • the number of hospital visits over the previous 12 months
  • the health of the patient when they were discharged (including whether they died in hospital!)
  • the relationship between the patient’s age and the number of treatments that they receive
  • the relationship between the number of emergency visits versus the total number of visits to the hospital over the past 12 months

.

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Denoising Dirty Documents: Part 2

07 Friday Aug 2015

Posted by Colin Priest in Cluster Analysis, GBM, Gradient Boosting Machine, Image Processing, k-means, Kaggle, Machine Learning, R

≈ 9 Comments

Tags

Cluster Analysis, GBMs, Image Processing, k-means, Kaggle, Machine Learning, R

In the first blog in this series, I explained the nature of the problem to be solved, and showed how to create a simple linear model that gives an RMSE score on the training data of 7.8%. In this blog I introduce the feedback loop for creating features, and we extend and improve the existing competition model to include a couple of new features.

20150731 leaderboard

I have been building machine learning models for almost 20 years now, and never once has my first model been satisfactory, nor has that model been the model that I finally use. The Denoising Dirty Documents competition is no different. In the plots above, you can see my progress through the public leaderboard, as I iteratively improved my model. The same has occurred more broadly in the competition overall, as the best score has iteratively improved since the competition began.

Coming from an actuarial / statistical background, I tend to think in terms of a “control cycle”. For actuaries, the control cycle looks something like this:

control cycle

We can categorise the content of my last blog as:

1) Define the Problem

  • remove noise from images
  • images are three-dimensional surfaces

2) Hypothesise Solutions

  • we hypothesised that the pixel brightnesses needed to be rescaled

3) Implement

  • we fitted a least squares linear model to rescale the pixel brightnesses, and we capped and floored the values to keep them within the [0, 1] range

4) Monitor Results

  • we calculated the RMSE on the training data before and after the implementation
  • we looked at one of the output images

So we have completed one cycle. Now we need to do another cycle.

What I like about the Denoising Dirty Images Competition is that the data visualisation and model visualisation are obvious. We don’t need to develop clever visualisation tools to understand the problem or the model output. We can just look at the dirty and predicted images. Let’s begin with reviewing the example input and output images from the model used in the last blog i.e. let’s begin by recalling the Monitor Results step from the previous cycle:

20150801 - before

20150801 - after

Now we can progress to the beginning of the next cycle, the Define the Problem stage. This time the problem to be solved is to remove one or more of the errors from the output of the existing model. Once we already have a model, we can define the problem by asking ourselves the following question.

Question: In the predicted image, where have prediction errors have been made?

Answer: The predicted image contains some of the crease lines.

So in this cycle, the problem is to remove the remaining crease lines.

It’s time to begin the Hypothesise Solutions stage of the cycle by asking ourselves the following questions:

  • In the dirty image, what is a single characteristic of what we want to keep or remove that we haven’t captured in our existing model?
  • What characteristics do the errors in the predicted image have?
  • What characteristics does the dirty image have at the locations where these errors occur?

Here are a few observations that I made in answer to these questions:

  • In the dirty image, the writing is darker than the background around it. We haven’t captured this locality information.
  • The predicted image retains some of the crease lines. Those remaining crease lines are narrower and not as dark as the writing.
  • In the dirty image, the shadows next to the white centres of the crease are the remaining crease lines from the predicted image.

This led me to the hypothesis that the model for a single pixel needs to include information about the brightness of other pixels in the image. There are multiple ways that we could consider the brightnesses of other pixels, but in today’s blog I will limit myself to considering the brightness of the pixel under consideration versus the range of pixel brightnesses across the entire image.

To Implement a hypothesis test, and solution, we need to know some theory about image processing. There are some good textbooks that cover this material. The image processing / machine vision textbook that I have been using is


Computer and Machine Vision, Fourth Edition: Theory, Algorithms, Practicalities

The machine vision technique that I will apply this today is thresholding, which is the process of turning an image into pixels that can only be black or white, with no grey shades or colours. Writing code to do thresholding is the easy part. The trickier part is to decide the threshold value at which pixels are split into either black or white.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")

# turn the image into a vector
img2vec = function(img)
{
return (matrix(img, nrow(img) * ncol(img), 1))
}

# show a histogram
hist(img2vec(img))

20150808 plot 1

One way of finding a threshold value is to look at a histogram of the pixel brightnesses and look for a natural break between local maxima in the histogram. Since the histogram above is tri-modal, this would leave us with two choices of thresholds, one around 0.3 and one around 0.65.


# threshold at 0.3
img0.3 = img
img0.3[img0.3 <= 0.3] = 0 img0.3[img0.3 > 0.3] = 1
plot(raster(img0.3))

20150808 plot 2

Now we can Monitor Results. Thresholding at 0.3 gives us no false positives for this image, but leaves out a lot of the writing.


# threshold at 0.65
img0.65 = img
img0.65[img0.65 <= 0.65] = 0 img0.65[img0.65 > 0.65] = 1
plot(raster(img0.65))

20150808 plot 3

Thresholding at 0.65 gives us no false positives for this image, and correctly flags the writing without flagging the creases. That makes it quite a good feature for use on this image, as it will help us to remove the residual crease mark that we found when reviewing the output of the last blog’s model. We should definitely include a thresholding feature in our model!

But it feels very clumsy to plot a histogram and manually select the threshold, and such a feature definitely isn’t “machine learning”. Can we automate this?

Otsu’s Method is a well-known technique in image processing that can automatically suggest a threshold. But it assumes a bi-modal histogram, which we have seen is not true.

Inspired by Otsu’s method, we could use cluster analysis to generate three clusters of pixel brightnesses, and use the splits between those clusters to threshold the image.


# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the lower threshold is the halfway point between the top of the lowest cluster and the bottom of the middle cluster
loThresh = 0.5 * (max(v[km.mod$cluster == oc[1]]) + min(v[km.mod$cluster == oc[2]]))
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using lower threshold
imgLo = img
imgLo[imgLo <= loThresh] = 0 imgLo[imgLo > loThresh] = 1
plot(raster(imgLo))

# using upper threshold
imgHi = img
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1
plot(raster(imgHi))

20150808 plot 4 20150808 plot 5

Now we can Monitor Results. Once again, the lower threshold choice doesn’t capture enough of the writing, while the upper threshold choice works well for this image. And this time there wasn’t any manual work required!

Let’s put it all together and combine last blog’s feature (a linear transformation of the raw pixel brightnesses) with this blog’s feature (thresholding). This time we will use something more sophisticated than linear regression. Our model will be based upon R’s GBM package, a gradient boosted machine to combine the raw pixel brightnesses with the thresholded pixels. GBMs are good all-purpose models, are simple to set up, perform well, and are almost always my first choice when creating machine learning models.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table, gbm)

# a function to do k-means thresholding

kmeansThreshold = function(img)
{
# fit 3 clusters
v = img2vec(img)
km.mod = kmeans(v, 3)
# allow for the random ordering of the clusters
oc = order(km.mod$centers)
# the higher threshold is the halfway point between the top of the middle cluster and the bottom of the highest cluster
hiThresh = 0.5 * (max(v[km.mod$cluster == oc[2]]) + min(v[km.mod$cluster == oc[3]]))

# using upper threshold
imgHi = v
imgHi[imgHi <= hiThresh] = 0 imgHi[imgHi > hiThresh] = 1

return (imgHi)
}

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

# threshold the image
x2 = kmeansThreshold(imgX)

dat = data.table(cbind(y, x, x2))
setnames(dat,c("y", "raw", "thresholded"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}

# view the data
dat = read.csv(outPath)
rows = sample(nrow(dat), 10000)
d1 = dat[rows,]
plot(d1$raw[dat$thresholded == 0], d1$y[dat$thresholded == 0], col = "blue")
lines(d1$raw[dat$thresholded == 1], d1$y[dat$thresholded == 1], col = "red", type="p")

# fit a model to a subset of the data
rows = sample(nrow(dat), 100000)
gbm.mod = gbm(y ~ raw + thresholded, data = dat[rows,], n.trees = 5000, cv.folds = 10, train.fraction = 0.5)
best.iter <- gbm.perf(gbm.mod,method="cv")

# what score do we get on the training data?
yHat = predict(gbm.mod, newdata=dat, n.trees = best.iter)
rmse = sqrt(mean( (yHat - dat$y) ^ 2 ))
print(rmse)

# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata=x, n.trees = best.iter)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150808 plot 6
Notice how the sample output image is different from the thresholded image? The edges of the writing are lighter than the centre of the letters. That’s good, because that’s one of the characteristics of the clean image.
20150808 sample output
The sample image that we created from the predicted values is looking really good. The residual crease marks have disappeared, yet we have kept the writing. And the RMSE score on the training data has improved to 6.5%.

But before this blog ends, let’s consider an image that wasn’t cleaned so well:

# here's a sample image that doesn't perform as well
img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\3.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1), kmeansThreshold(img))
setnames(x, c("raw", "thresholded"))
yHat = predict(gbm.mod, newdata=x)
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150808 plot 7
Damn! I feel like cursing the jerk who put their coffee cup on top of the important document and left a stain!

In my next blog, I will discuss the first steps towards removing that coffee cup stain.
http://wms-na.amazon-adsystem.com/20070822/US/js/link-enhancer-common.js?tag=keeupwitthela-20&linkId=VYH2LC2X67R2SDNO

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Denoising Dirty Documents: Part 1

01 Saturday Aug 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning, R

≈ 16 Comments

Tags

Image Processing, Kaggle, Machine Learning, R

I recently blogged about my learning curve in my first Kaggle competition. This has become my most popular blog to date, and some readers have asked for more. So this blog is the first in a series of blogs about how to put together a reasonable solution to Kaggle’s Denoising Dirty Documents competition.

Some other competitors have been posting scripts, but those scripts are usually written in Python, whereas my background makes me an R programmer. So I will be writing scripts that make use of R.

The Structure of the Problem

We have been given a series of training images, both dirty (with stains and creased paper) and clean (with a white background and black letters). We are asked to develop an algorithm that converts, as close as possible, the dirty images into clean images.

the problem to be solved

A greyscale image (such as shown above) can be thought of as a three-dimensional surface. The x and y axes are the location within the image, and the z axis is the brightness of the image at that location. The great the brightness, the whiter the image at that location.

So from a mathematical perspective, we are being asked to transform one three-dimensional surface into another three dimensional surface.

436772_SMPNG_7WS66170WB3494916

Our task is to clean the images, to remove the stains, remove the paper creases, improve the contrast, and just leave the writing.

Loading the Image Data

In R, images are stored as matrices, with the row being the y-axis, the column being the x-axis, and the numerical value being the brightness of the pixel. Since Kaggle has stored the images in png format, we can use the png package to load the images.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster)

img = readPNG("C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
head(img)
plot(raster(img))

20150801 output 1 20150801 output 2

You can see that the brightness values lie within the [0, 1] range, with 0 being black and 1 being white.

Restructuring the Data for Machine Learning

Instead of modelling the entire image at once, we should predict the cleaned-up brightness for each pixel within the image, and construct a cleaned image by combining together a set of predicted pixel brightnesses. We want a vector of y values, and a matrix of x values. The simplest data set is where the x values are just the pixel brightnesses of the dirty images.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(png, raster, data.table)

dirtyFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train"
cleanFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_cleaned"
outFolder = "C:\\Users\\Colin\\Kaggle\\Denoising Dirty Documents\\data\\train_predicted"

outPath = file.path(outFolder, "trainingdata.csv")
filenames = list.files(dirtyFolder)
for (f in filenames)
{
print(f)
imgX = readPNG(file.path(dirtyFolder, f))
imgY = readPNG(file.path(cleanFolder, f))

# turn the images into vectors
x = matrix(imgX, nrow(imgX) * ncol(imgX), 1)
y = matrix(imgY, nrow(imgY) * ncol(imgY), 1)

dat = data.table(cbind(y, x))
setnames(dat,c("y", "x"))
write.table(dat, file=outPath, append=(f != filenames[1]), sep=",", row.names=FALSE, col.names=(f == filenames[1]), quote=FALSE)
}
# view the data
dat = read.csv(outPath)
head(dat)
rows = sample(nrow(dat), 10000)
plot(dat$x[rows], dat$y[rows])

20150801 output 4

The data is now in a familiar format, which each row representing a data point, the first column being the target value, and the remaining column being the predictors.

Our First Predictive Model

Look at the relationship between x and y.

20150801 output 3

Except at the extremes, there is a linear relationship between the brightness of the dirty images and the cleaned images. There is some noise around this linear relationship, and a clump of pixels that are halfway between white and black. There is a broad spread of x values as y approaches 1, and these pixels probably represent stains that need to be removed.

So the obvious first model would be a linear transformation, with truncation to ensure that the predicted brightnesses remain within the [0, 1] range.


# fit a linear model, ignoring the data points at the extremes
lm.mod.1 = lm(y ~ x, data=dat[dat$y &gt; 0.05 & dat$y &lt; 0.95,])
summary(lm.mod.1)
dat$predicted = sapply(predict(lm.mod.1, newdata=dat), function(x) max(min(x, 1),0))
plot(dat$predicted[rows], dat$y[rows])
rmse1 = sqrt(mean( (dat$y - dat$x) ^ 2))
rmse2 = sqrt(mean( (dat$predicted - dat$y) ^ 2))
c(rmse1, rmse2)

20150801 output 5 20150801 output 6

The linear model has done a brightness and contrast correction. This reduces the RMSE score from 0.157 to 0.078. Let’s see an output image:


# show the predicted result for a sample image
img = readPNG("C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\train\\6.png")
x = data.table(matrix(img, nrow(img) * ncol(img), 1))
setnames(x, "x")
yHat = sapply(predict(lm.mod.1, newdata=x), function(x) max(min(x, 1),0))
imgOut = matrix(yHat, nrow(img), ncol(img))
writePNG(imgOut, "C:\\Users\\Colin\\Dropbox\\Kaggle\\Denoising Dirty Documents\\data\\sample.png")
plot(raster(imgOut))

20150801 output 7

Although we have used a very simple model, we have been able to clean up this image:

20150801 - before

Our predicted image is:

20150801 - after

That’s quite good performance for a simple least squares linear regression!

To be fair though, I deliberately chose an example image that performs well. In my next blog in this series, I will discuss the use of a feedback loop in model design, and how to design new features to use as predictors.

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

My Kaggle Learning Curve: Artificial Stupidity

25 Saturday Jul 2015

Posted by Colin Priest in Image Processing, Kaggle, Machine Learning

≈ 10 Comments

Tags

Image Processing, Kaggle, Machine Learning

This month I started competing in my very first Kaggle competition, Denoising Dirty Documents. I was first introduced to Kaggle a few years ago by Xavier Conort, an insurance industry colleague who also lives here in Singapore. But I had been passive with my Kaggle membership, and hadn’t even considered competing.

This year two things changed. Firstly, I joined IntelliM, an image processing, machine learning and software house, and I needed to get out into the real world and make business connections and start adding value in these fields. Secondly, Kaggle opened the Denoising Dirty Documents competition, which is about pre-processing scanned documents so that they are suitable for optical character recognition, and this competition required both image processing skills and machine learning skills. So this competition looked like a great match for me, and hopefully would be an easy transition to build some experience within Kaggle.

COPR

Although I am an actuary by training, I have not always stayed within the traditional bounds of actuarial work. Back in the 1990s I first started playing with machine learning, using neural networks to predict which customers will renew their insurance policies. Then, inspired by Kim and Nelson’s book, I developed a state space regime switching model for predicting periods of massive builder insolvencies. That model has subsequently been adapted for cancer research, to measure the timing of genes switching off and on. In the 2000s I started getting involved in image processing, firstly to create optical character recognition for a web scraper software package, and later developing COPR, license plate recognition software. Over the past decade I have been using machine learning for customer analytics and insurance pricing.

the problem to be solved

So I thought that just doing some pre-processing for optical character recognition would be quick and easy. When I looked at the examples (see one example above), my eyes could quickly see what the answer should look like even before I peeked at the example cleaned image. I was so wrong…

Lesson: Avoid Artificial Stupidity

Machine learning is sometimes called artificial intelligence. After all, aren’t neural networks based upon the architecture of the human brain?

My first competition submission was a pure machine learning solution. I modelled the target image one pixel at a time. For predictors, I got the raw pixel brightnesses for a region around each pixel location. This is a brute force approach that I have used in the past for optical character recognition. I figured that the machine learning algorithm would learn what the character strokes looked like, and thereby know which pixels should be background.

What really happened was that the machine learning algorithm simply adjusted the brightness and contrast of the image, to better match the required solution. So I scored 8.58%, giving me 24th ranking, much higher than I was expecting, and much closer to some naive benchmarks than I was comfortable with.

submission 1

I wanted a top ten placing, but I was a long way away from it. So I fine-tuned the model hyperparameters. This moderately improved the score, and only moved me up 3 ranks. My next competition submission actually scored far worse than my preceding two submissions! I needed to rethink my approach because I was going backwards, and the better submissions were almost an order of magnitude better than mine.

The reason my submission scored so poorly was because I was asking the machine learning model to learn complex interactions between pixels, without any guidance from me. There are heuristics about text images that I intuitively know, but I hadn’t passed on any of that knowledge to the machine learning algorithm, either via predictors or model structure.

My algorithm wasn’t artificially intelligent; it was artificially stupid!

So I stopped making submissions to the competitions, and started looking at the raw images and cleaned images, and I applied some common image processing algorithms. I asked myself these questions:

  • what is it about the text that is different to the background?
  • what are the typical characteristics of text?
  • what are the typical characteristics of stains?
  • what are the typical characteristics of folded or crinkled paper?
  • how does a dark stain differ from dark text?
  • what does the output from a image processing algorithm tell me about whether a pixel is text or background?
  • what are the shortcomings of a particular image processing algorithm?
  • what makes an image processing algorithm drop out some of the text?
  • what makes an image processing algorithm think that a stain is text?
  • what makes an image processing algorithm think that a paper fold is text?
  • which algorithms have opposing types of classification errors?

3

For example, in the image above, the algorithm thins out the text too much, does not remove the outer edges of stains, and does not remove small stains. That prompted me to think that maybe an edge finding algorithm would complement this algorithm.

leaderboard 20150725

After a week of experimentation and feature extraction, I finally made a new competition submission, and it jumped me up in the rankings. Then I started fine tuning my model, and split the one all-encompassing machine learning model into multiple specialist models. At the time of writing this blob I am ranked 4th in the competition, and after looking at the scores of the top 3 competitors, I realise that I will have to do more than just fine tune my algorithm. It’s time for me to get back into research mode and find a new feature that identifies the blob stain at the end of the first paragraph in this image:

3-postprocessed

Kaggle is addictive. I can’t wait to solve this problem!

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Efficiently Building GLMs: Part 2

17 Friday Jul 2015

Posted by Colin Priest in Generalized Linear Models, Genetic Algorithms, Machine Learning, R

≈ 1 Comment

Tags

Automation, Generalized Linear Models, Genetic Algorithms, GLMs, Machine Learning, R

In my last blog, I started this series of blogs discussing how to make the GLM model building process more efficient, and I showed an example of how R can help you find a good set of starting predictors that capture the essence of what can be explained.

In this blog I will show you a few ways that R can also help you to fine tune the choice of predictors.

In my experience, most statisticians and actuaries follow a heuristic process for fine tuning their GLMs. They look for predictors with low statistical significance, that can be dropped. They try to add predictors that they expect might be valuable. It’s quite a manual process, with experimentation and tinkering, and frequently isn’t documented.

Before the invention of the printing press, books were hand written. Monasteries had rooms called scriptoria where monks would copy manuscripts, painstakingly drawing and writing, copying pages of existing books. Later, as the first universities emerged, a new type of scribe, who wasn’t a monk, would carry out the same process in scriptoria that were located within those universities.

Escribano

"Escribano" by Jean Le Tavernier - [1]. Licensed under Public Domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Escribano.jpg#/media/File:Escribano.jpg

Are we using statisticians and actuaries like scribes, doing manual work that can (and should) be automated? That would free them up to make more value-added contributions, such as sensibility checks, contextualisation, and recommending practical improvements to underwriting, pricing, risk management and marketing.

Improvement 2: Automating the Process for Selection of Predictors

You can automate the search process. A modern computer can exhaustively search through models at a rate that is several orders of magnitude faster than a human. If you include some of the same heuristics that the humans use, you can save a lot of time, and reduce operational risk.

There are a few R packages that can do this for you, without requiring customisation. Today I will consider two of them, bestglm and glmulti.

By default, but only for normally distributed residuals, the bestglm package uses the “leaps and bounds” algorithm, which was developed by Furnival and Wilson back in 1974. Click here to download and read the original paper. The algorithm begins with an overspecified model, and then recursively considers dropping out each remaining predictor. It rules out searching further when the child models (caused by choosing whether to keep or drop out a specific predictor) do not improve the quality of the model. It can be visualised as searching through the set of models, using a tree structure.

For non-normally distributed GLMs, an exhaustive search is carried out i.e. bestglm considers every possible subset of predictors, and evaluates the quality of the model.

To give a practical demonstration of this process, I will again use the diabetes readmission data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008. I have reformatted the data slightly, so if you wish to exactly replicate my analysis, then I suggest that you download a copy of the reformatted data from here.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(bestglm)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# remember that the target must be the last column in the data
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 1000),c("gender", "time_in_hospital", "number_inpatient", "diag_1", "diag_2", "diag_3", "discharge_id_desc", "readmitted_flag")]

# find the model with the best BIC
bestBIC = bestglm(td, family = binomial, IC = "BIC")

# Show top 5 models
bestBIC$BestModels

# show a summary of the best model
summary(bestBIC$BestModel)</pre>
# show the relationship between the number of predictors and the model quality
plot(seq_len(nrow(bestBIC$Subsets)) - 1, bestBIC$Subsets[,"BIC"], type="b", xlab = "Number of Predictors", ylab = "BIC")

# show again, but cap it at 3 predictors
plot(seq_len(4) - 1, bestBIC$Subsets[seq_len(4),"BIC"], type="b", xlab = "Number of Predictors", ylab = "BIC")

For the purpose of this example, to keep the run time down to a few minutes, I have only used a subset of the columns and the rows in the data. In practice you would use as many of the columns and rows as you need, limited by the computing power available to you. While I have used the BIC as the measure of the “best” GLM model, this is purely for illustrative purposes. The discussion of which measure to use will be the topic of another blog.

bestglm top 5

The BIC criteria heavily penalises the use of diag_1, diag_2 and diag_3 as predictors because of their very high degrees of freedom (each is a factor containing hundreds of different diagnosis codes). Interestingly, the null model (no predictors) ranks number 4 amongst the top 5 models! However, that is probably because we haven’t transformed the numeric predictors to have linear relationships to the target, and we haven’t grouped together any of the factor levels in diag_1, diag_2 and diag_3. Note that the model rankings will change with the number of rows that you include, and the choice of information criteria by which to measure the model performance.

bestglm top summary

The summary of the best model shows a simple model with highly significant predictors.

bestglm plot 2

The plot shows that the best BIC score happens with the use of 2 predictors.

But the bestglm package has its limitations. Exhaustively searching through every possible model is time consuming, even when automated, and especially when the data has many columns and/or rows. And bestglm doesn’t automatically consider interaction terms.

Enter the glmulti package. First let’s get glmulti to replicate the results of the bestglm script.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmulti)
# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 1000),c("gender", "time_in_hospital", "number_inpatient", "diag_1", "diag_2", "diag_3", "discharge_id_desc", "readmitted_flag")]

# replicate the analysis done by bestglm
bestBIC = glmulti(readmitted_flag ~ ., data = td, family = binomial, level = 1, crit=bic, fitfunc=glm, method="h",
confsetsize = 256, plotty = TRUE, report = TRUE)

print(bestBIC)

The “level” parameter has been set to a value of 1, which means that we are telling glmulti to not consider interaction effects.

glmulti example 1

It can be seen that the results line up with those of bestglm.

To consider 2-way interaction effects, use “level = 2”.

</pre>
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# allow for 2-way interactions
bestBIC2 = glmulti(readmitted_flag ~ gender + time_in_hospital + num_procedures + number_inpatient, data = td[sample(nrow(td), 10000),], family = binomial, level = 2, crit=bic, fitfunc=glm, method="h", confsetsize = 256, plotty = TRUE, report = TRUE)
print(bestBIC2)
<pre>

glmulti exhaustive 2 way results glmulti exhaustive 2 way plot

Including 2-way interaction effects increases the run time exponentially. So in this example I have used fewer predictors. Normally you would not do this.

Once we start including 2-way interactions and all of the possible predictor columns, an exhaustive search becomes prohibitively time consuming. In such cases, a better approach is to use genetic algorithms to search through possible models. To do this, change the “method” parameter to “g”.

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 10000),c("race", "gender", "time_in_hospital", "num_medications", "number_emergency", "number_inpatient", "discharge_id_desc", "medical_specialty_desc", "readmitted_flag")]

# solve using genetic algorithms
bestBIC.ga = glmulti(readmitted_flag ~ ., data = td, family = binomial, level = 1, crit=bic, fitfunc=glm, method="g",
confsetsize = 256, plotty = TRUE, report = TRUE)
print(bestBIC.ga)

glmulti ga results 1

glmulti hides the complexity of the genetic algorithm implementation from you, so it won’t take long for you to get up and running with your own model optimisation using genetic algorithms. Despite this ease of use, this year I have switched from using glmulti, to writing my own customised genetic algorithm implementations because:

  • glmulti’s genetic algorithm implementation doesn’t have parallel processing abilities, which would speed up the search (am I being impatient?),
  • I often want to customise the initial population, to allow for knowledge I already have about what constitutes a reasonable starting model (for example, by including the results from the analysis explained in my last blog,
  • the inclusion of interaction terms can often lead to GLM fitting errors, such as collinearity or overspecification, that can crash or freeze up glmulti’s genetic algorithm (and this problem occurs on the diabetes readmission data that I use in this blog),
  • glmulti’s genetic algorithm often has duplicate model choices within its population, and so runs the GLM fitting process for exactly the same choice of columns more than once, and this comes with a computational expense, but by customising the implementation I can cache the previously fitted models and just pull out the results without refitting that model all over again,
  • glmulti has an undocumented upper limit on how many predictors it will consider, throwing the error message “too many predictors” if that limit is exceeded, and that limit is low (low enough to be triggered for the diabetes dataset example that I am using in this blog), and
  • sometimes I want to retain more information about intermediate models e.g. AIC, BIC, coefficient values, leverage

If there are enough requests from readers wanting to know how to create custom genetic algorithm searches for GLM model building, then I will make that the topic of a future blog 🙂
Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Efficiently Building GLMs: Part 1

10 Friday Jul 2015

Posted by Colin Priest in Generalized Linear Models, Gradient Boosting Machine, IntelliM, Machine Learning, R

≈ 5 Comments

Tags

Automation, GBMs, Generalized Linear Models, GLMs, IntelliM, Machine Learning, R, Variable Importance

It was way back in 1972 that John Nelder and Robert Wedderburn first introduced GLMs to the world (you can find their original paper here), and by the 1990s actuaries in the insurance industry (I am an actuary) had started using GLMs for technical pricing, empowered by the increased accessibility of modern powerful computers. In some parts of the world there are now huge teams, consisting of dozens of actuaries, sometimes more than 100 of them, building generalized linear models (GLMs) for technical pricing of insurance policies. But is this efficient or is it as out of date as a dinosaur riding a penny farthing?

Dinosaur on Penny Farthing

At university, when I was doing my masters degree in Applied Statistics, I was taught that the “best” GLM model is the one with the lowest AIC or BIC (the topic of whether AIC and BIC are good model benchmarks will be the topic of a future blog). But the information criterion is not a well behaved function (it can have many local minima and maxima). To find the model with the lowest information criterion, one cannot follow a steepest gradient approach, but instead one must systematically search through the different combinations of predictors.

Consider a scenario in which there are 30 possible predictors, consisting of rating factors collected from the customer (e.g. zip code / post code) and / or data variables collected from other sources (e.g. credit rating of the insured) and all of the these values are either categorical, or are numeric and have a linear relationship to the variable being modelled (whether that be claim frequency, severity or claims cost per policy). In such a situation, there are 2^30=1,073,741,824 possible combinations of predictors that could be included in the GLM formula. If each of your actuarial or statistical staff took 10 minutes to test a particular combination of predictors, and each of those staff worked 8 hours per day, for 50 weeks per year, it would take 11,185 man-years to find the best model! Once you include the search for 2-way interactions between predictors, the search time blows out to longer than the life-span of the universe!!!

In practice, actuaries and statisticians are producing GLM models faster than that estimate because they are using heuristics to substantially reduce the number of combinations to search. Those heuristics are usually based upon variants of step-wise regression, whereby they add or remove one predictor at a time. This is still extremely time consuming, and the process still does not necessarily produce the model with the best AIC or BIC.

How can you make this process more efficient?

Let’s make this more real by considering some publicly available data tracking hospital readmission of diabetes patients in USA from 1999 to 2008. You can find the data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Improvement 1: Starting With a Reasonable Choice of Predictors

One can improve upon the step-wise regression approach by starting with a model that already has a few of the most useful predictors.

Instead of beginning with a GLM that uses all the predictors and removes them one-by-one, or starting with no predictors and adding more predictors one-by-one, you can start by understanding which predictors are likely to be the best candidates, and this involves taking a step beyond GLMs into some more modern techniques, including:

  • lasso regularisation
  • variable importance measures via random forests or gradient boosting machines

For the sake of simplicity, this blog will only apply just one of these approaches, the variable importance measure based upon a gradient boosting machine. But any of these three approaches will usually achieve the task.

A gradient boosting machine (GBM) is a very flexible type of machine learning algorithm. It can be used for regression and classification problems. You can use GBMs via the GBM package available in R. GBMs are a forest of trees whereby each successive tree is fitted to the residuals of the previous iteration of the forest i.e. each new tree predicts the errors from the existing forest. The GBM package has a measure of “relative influence” that is quite similar to a variable importance measure, and can be used for the same purpose.

Variable importance or relative influence is a measure of how much of the variation in outcomes is explained by the inclusion of the predictor in the model. A predictor will explain more of the variation in outcomes if:

  • it is statistically significant i.e. the difference isn’t random,
  • the difference is large for different values of the predictor i.e. the predictor differentiates well, and
  • there is considerable variation in the predictor value between observations i.e. more than just a small number of predictor observations are different to the average.

Here is some sample R code to give an indication of how one would do this with the diabetes readmission data:

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(gbm)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\";

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# GBM variable importance
set.seed(1)
gbm_imp = gbm(formula = readmitted_flag ~ ., distribution = &quot;bernoulli&quot;, data=td, n.trees = 1000, interaction.depth = 1, verbose=TRUE, shrinkage = 0.01, cv.folds=0, keep.data = F)
s = summary(gbm_imp)
head(s)

As I write in my last blog, I’m a fan of the pacman package in R. It conveniently ensures that I have installed packages before I load them, and then installs and loads the packages as required.

The next step is to read the diabetes readmission data into R. I am reading the data from a comma delimited file that I have created previously after downloading the UCI Machine Learning Repository data. You should edit the sample R script to use the folder and file name of your data.

Finally I fitted a GBM model. For the purposes of this blog I set the random seed, to make the results replicable. Note that the model hyperparameters were not optimised – I am just creating a model for the sake of understanding which predictors are important, not trying to fit the best possible GBM model. But sometimes the interaction.depth hyperparameter does matter. In my script above I have used interaction.depth = 1, which excludes the possibility of 2-way interaction effects between two predictors. I chose a value of 1 for simplicity for this example, and because most of the time it doesn’t make a difference to the discovery of the most important predictors. However, if you have strong reason to believe that your data exhibits strong 2-way effects between predictors, and also have reason to believe that the one-way effect of those same predictors is weak, then try higher values for this hyperparameter.

variable importance console screenshot

The summary function gets the variable importance measures from the GBM model. It displays a plot of the most important predictors, and also stores them in a table. As you can see from the screenshot above, R shows that just 5 predictors will give most of the predictive power. Note that the variable importance scores have been automatically scaled to add to 100.

R variable importance plot

The default plot for GBM variable importance is rather difficult to read and interpret. If I am working on a project that is purely within R, then I usually script for a better looking plot based upon the ggplot2 package in R. Lately I have been using IntelliM to build my GLMs because it automates the feature extraction, model building and validation process for GLMs, and it does it visually instead of manually writing R scripts. It gives an easier to read graph, shown below.

Variable Importance

Later in this series of blogs, I will discuss dimensionality reduction, but since one approach to dimensionality reduction relates to what I have just shown in this blog, I will give you a sneak peak.

The diabetes readmission data contains categorical predictors that contain codes for diagnoses and other data of interest. Some of these predictors have several hundred different possible codes. Unless we had many millions of observations in total, there is no way that all of those codes are going to have enough observations to provide statistically valid predictions. What we expect is that just a small number of these codes are important.

Factor Importance

So we can apply the same variable importance approach to the codes within a categorical predictor, and then group the unimportant codes together, making for a better model.

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...

Tutorial: Using R and Twitter to Analyse Consumer Sentiment

04 Saturday Jul 2015

Posted by Colin Priest in R, Text Mining, Twitter

≈ 145 Comments

Tags

R, Text Mining, Twitter

Tutorial: Using R and Twitter to Analyse Consumer Sentiment
Content

This year I have been working with a Singapore Actuarial Society working party to introduce Singaporean actuaries to big data applications, and the new techniques and tools they need in order to keep up with this technology. The working group’s presentation at the 2015 General Insurance Seminar was well received, and people want more. So we are going to run some training tutorials, and want to extend our work.
One of those extensions is text mining. Inspired by a CAS paper by Roosevelt C. Mosly Jnr, I thought that I’d try to create a simple working example of twitter text mining, using R. I thought that I could just Google for an R script, make some minor changes, and run it. If only it were as simple as that…
I quickly ran into problems that none of the on-line blogs and documentation fully deal with:

    • Twitter changed its search API to require authorisation. That authorisation process is a bit time-consuming and even the most useful blogs got some minor but important details wrong.
    • CRAN has withdrawn its sentiment package, meaning that I couldn’t access the key R library that makes the example interesting.

After much experimentation, and with the help of some R experts, I finally created a working example. Here it goes, step by step:

STEP 1: Log on to https://apps.twitter.com/

Just use your normal Twitter account login. The screen should look like this:
step 1

STEP 2: Create a New Twitter Application

Click on the “Create New App” button, then you will be asked to fill in the following form:
step 2
Choose your own application name, and your own application description. The website needs to be a valid URL. If you don’t have your own URL, then JULIANHI recommends that you use http://test.de/ , then scroll down the page.
step 2b
Click “Yes, I Agree” for the Developer Agreement, and then click the “Create your Twitter application” button. You will see something like this:

step 2c

Go to the “Keys and Access Tokens” tab. Then look for the Consumer Key and the Consumer Secret. I have circled them in the image below. We will use these keys later in our R script, to authorise R to access the Twitter API.

step 2d2

Scroll down to the bottom of the page, where you will find the “Your Access Token” section.

step 2e

Click on the button labelled “Create my access token”.step 2f

Look for the Access Token and Access Token Secret. We will use these in the next step, to authorise R to access the Twitter API.

STEP 3: Authorise R to Access Twitter

First we need to load the Twitter authorisation libraries. I like to use the pacman package to install and load my packages. The other packages we need are:

    • twitteR : which gives an R interface to the Twitter API
    • ROAuth : OAuth authentication to web servers
    • RCurl : http requests and processing the results returned by a web server

The R script is below. But first remember to replace each “xxx” with the respective token or secret you obtained from the Twitter app page.


# authorisation
if (!require('pacman')) install.packages('pacman')
pacman::p_load(twitteR, ROAuth, RCurl)

api_key = 'xxx'
api_secret = 'xxx'
access_token = 'xxx'
access_token_secret = 'xxx'

# Set SSL certs globally
options(RCurlOptions = list(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')))

# set up the URLs
reqURL = 'https://api.twitter.com/oauth/request_token'
accessURL = 'https://api.twitter.com/oauth/access_token'
authURL = 'https://api.twitter.com/oauth/authorize'

twitCred = OAuthFactory$new(consumerKey = api_key, consumerSecret = api_secret, requestURL = reqURL, accessURL = accessURL, authURL = authURL)

twitCred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl'))

After substituting your own token and secrets for “xxx”, run the script. It will open a web page in your browser. Note that on some systems R can’t open the browser automatically, so you will have to copy the URL from R, open your browser, then paste the link into your browser. If R gives you any error messages, then check that you have pasted the token and secret strings correctly, and ensure that you have the latest versions of the twitteR, ROAuth and RCurl libraries by reinstalling them using the install.packages command.

The web page will look something like this:

step 3a

Click the “Authorise app” button, and you will be given a PIN (note that your PIN will be different to the one in my example).

step 3b

Copy this PIN to the clipboard and then return to R, which is asking you to enter the PIN.

step 3c

Paste in, or type, the PIN from the Twitter web page, then click enter. R is now authorised to run Twitter searches. You only need to do this once, but you do need to use your token strings and secret strings again in your R search scripts.

Go back to https://apps.twitter.com/ and go to the “Setup” tab for your application.

step 3d

For the Callback URL enter http://127.0.0.1:1410 . This will allow us the option of an alternative authorisation method later.

STEP 4: Install the Sentiment Package

Since the sentiment package is no longer available on CRAN, we have to download the archived source code and install it via this RScript:

if (!require('pacman')) install.packages('pacman&')
pacman::p_load(devtools, installr)
install.Rtools()
install_url('http://cran.r-project.org/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz')
install_url('http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz')

Note that we only have to download and install the sentiment package once.

UPDATE: There’s a new package on CRAN for sentiment analysis, and I have written a tutorial about it.

STEP 5: Create A Script to Search Twitter

Finally we can create a script to search twitter. The first step is to set up the authorisation credentials for your script. This requires the following packages:

  • twitteR : which gives an R interface to the Twitter API
  • sentiment : classifies the emotions of text
  • plyr : for splitting text
  • ggplot2 : for plots of the categorised results
  • wordcloud : creates word clouds of the results
  • RColorBrewer :  colour schemes for the plots and wordcloud
  • httpuv : required for the alternative web authorisation process
  • RCurl : http requests and processing the results returned by a web server

if (!require('pacman')) install.packages('pacman')
pacman::p_load(twitteR, sentiment, plyr, ggplot2, wordcloud, RColorBrewer, httpuv, RCurl, base64enc)

options(RCurlOptions = list(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')))

api_key = 'xxx'
api_secret = 'xxx'
access_token = 'xxx'
access_token_secret = 'xxx'

setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)

Remember to replace the “xxx” strings with your token strings and secret strings.

Using the setup_twitter_oauth function with all four parameters avoids the case where R opens a web browser again. But I have found that it can be problematic to get this function to work on some computers. If you are having problems, then I suggest that you try the alternative call with just two parameters:

setup_twitter_oauth(api_key,api_secret)

This alternative way opens your browser and uses your login credentials from your current Twitter session.

Once authorisation is complete, we can run a search. For this example, I am doing a search on tweets mentioning a well-known brand: Starbucks. I am restricting the results to tweets written in English, and I am getting a sample of 10,000 tweets. It is also possible to give date range and geographic restrictions.


# harvest some tweets
some_tweets = searchTwitter('starbucks', n=10000, lang='en')

# get the text
some_txt = sapply(some_tweets, function(x) x$getText())

Please note that the Twitter search API does not return an exhaustive list of tweets that match your search criteria, as Twitter only makes available a sample of recent tweets. For a more comprehensive search, you will need to use the Twitter streaming API, creating a database of results and regularly updating them, or use an online service that can do this for you.

Now that we have tweet texts, we need to clean them up before doing any analysis. This involves removing content, such as punctuation, that has no emotional content, and removing any content that causes errors.


# remove retweet entities
some_txt = gsub('(RT|via)((?:\\b\\W*@\\w+)+)', '', some_txt)
# remove at people
some_txt = gsub('@\\w+', '', some_txt)
# remove punctuation
some_txt = gsub('[[:punct:]]', '', some_txt)
# remove numbers
some_txt = gsub('[[:digit:]]', '', some_txt)
# remove html links
some_txt = gsub('http\\w+', '', some_txt)
# remove unnecessary spaces
some_txt = gsub('[ \t]{2,}', '', some_txt)
some_txt = gsub('^\\s+|\\s+$', '', some_txt)

# define 'tolower error handling' function
try.error = function(x)
{
# create missing value
y = NA
# tryCatch error
try_error = tryCatch(tolower(x), error=function(e) e)
# if not an error
if (!inherits(try_error, 'error'))
y = tolower(x)
# result
return(y)
}
# lower case using try.error with sapply
some_txt = sapply(some_txt, try.error)

# remove NAs in some_txt
some_txt = some_txt[!is.na(some_txt)]
names(some_txt) = NULL

Now that we have clean text for analysis, we can do sentiment analysis. The classify_emotion function is from the sentiment package and “classifies the emotion (e.g. anger, disgust, fear, joy, sadness, surprise) of a set of texts using a naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon.”

# Perform Sentiment Analysis
# classify emotion
class_emo = classify_emotion(some_txt, algorithm='bayes', prior=1.0)
# get emotion best fit
emotion = class_emo[,7]
# substitute NA's by 'unknown'
emotion[is.na(emotion)] = 'unknown'

# classify polarity
class_pol = classify_polarity(some_txt, algorithm='bayes')
# get polarity best fit
polarity = class_pol[,4]
# Create data frame with the results and obtain some general statistics
# data frame with results
sent_df = data.frame(text=some_txt, emotion=emotion,
polarity=polarity, stringsAsFactors=FALSE)

# sort data frame
sent_df = within(sent_df,
emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

With the sentiment analysis done, we can start to look at the results. Let’s look at a histogram of the number of tweets with each emotion:

# Let’s do some plots of the obtained results
# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
geom_bar(aes(y=..count.., fill=emotion)) +
scale_fill_brewer(palette='Dark2') +
labs(x='emotion categories', y='number of tweets') +
ggtitle('Sentiment Analysis of Tweets about Starbucks\n(classification by emotion)') +
theme(plot.title = element_text(size=12, face='bold'))

step 5a.jpg

Most of the tweets have unknown emotional content. But that sort of makes sense when there are tweets such as “With risky, diantri, and Rizky at Starbucks Coffee Big Mal”.

Let’s get a simpler plot, that just tells us whether the tweet is positive or negative.


# plot distribution of polarity
ggplot(sent_df, aes(x=polarity)) +
geom_bar(aes(y=..count.., fill=polarity)) +
scale_fill_brewer(palette='RdGy') +
labs(x='polarity categories', y='number of tweets') +
ggtitle('Sentiment Analysis of Tweets about Starbucks\n(classification by polarity)') +
theme(plot.title = element_text(size=12, face='bold'))

step 5b

So it’s clear that most of the tweets are positive. That would explain why there are more than 21,000 Starbucks stores around the world!

Finally, let’s look at the words in the tweets, and create a word cloud that uses the emotions of the words to determine their locations within the cloud.

# Separate the text by emotions and visualize the words with a comparison cloud
# separating text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep('', nemo)
for (i in 1:nemo)
{
tmp = some_txt[emotion == emos[i]]
emo.docs[i] = paste(tmp, collapse=' ')
}

# remove stopwords
emo.docs = removeWords(emo.docs, stopwords('english'))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, 'Dark2'),
scale = c(3,.5), random.order = FALSE, title.size = 1.5)

step 5c

Word clouds give a more intuitive feel for what people are tweeting. This can help you validate the categorical results you saw earlier.

And that’s it for this post! I hope that you can get Twitter sentiment analysis working on your computer too.

UPDATE: There’s a new package on CRAN for sentiment analysis, and I have written a tutorial about it.

Share this:

  • Twitter
  • Facebook

Like this:

Like Loading...
Newer posts →

Blogroll

  • Discover New Voices
  • Discuss
  • Get Inspired
  • Get Mobile
  • Get Polling
  • Get Support
  • Great Reads
  • Learn WordPress.com
  • Theme Showcase
  • WordPress.com News
  • www.r-bloggers.com

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 277 other subscribers

Blog at WordPress.com.

  • Follow Following
    • Keeping Up With The Latest Techniques
    • Join 86 other followers
    • Already have a WordPress.com account? Log in now.
    • Keeping Up With The Latest Techniques
    • Customize
    • Follow Following
    • Sign up
    • Log in
    • Report this content
    • View site in Reader
    • Manage subscriptions
    • Collapse this bar
 

Loading Comments...
 

    %d bloggers like this: