Keeping Up With The Latest Techniques

~ brief insights


Tag Archives: Generalized Linear Models

Efficiently Building GLMs: Part 5

Wednesday, 19 August 2015

Posted by Colin Priest in Automation, GAMs, Generalized Additive Models, Generalized Linear Models, GLMs, R


Tags

Automation, GAMs, Generalized Additive Models, Generalized Linear Models, GLMs, R

In the last few blogs, I discussed how you can automate the process of choosing which predictor columns to use in your model. This week I will meander away from that path to discuss how feature engineering can be automated.

The “L” in GLM stands for “Linear”, and that gives you a huge hint about one of the key assumptions: a GLM assumes that the relationship between a predictor and the outcome must be linear, or more pedantically, that relationship must be linear after the link function has been applied. For example, in my blog about Denoising Dirty Documents, I showed a graph that demonstrated a linear relationship between the pixel brightness from the input images versus the pixel brightness of the target images.

[Plot: pixel brightness of the input images versus pixel brightness of the target images, showing a linear relationship]

But while linear relationships do naturally occur, I find that I frequently have to transform the input data to create a linear relationship. This is especially so when using monetary values, where a logarithmic transformation often helps. But until recently, I used a manual process, making guesses and testing different transformations. It saved me a lot of time once I automated the process.
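
For example, here is a minimal sketch of the kind of manual experimentation I mean; the data frame claims and its columns are hypothetical:

# hypothetical data: claim_flag is 0/1 and sum_insured is a monetary predictor
glm.raw = glm(claim_flag ~ sum_insured, data = claims, family = binomial(link=logit))
glm.log = glm(claim_flag ~ log(sum_insured), data = claims, family = binomial(link=logit))
# the transformation giving the lower AIC is the more linear candidate
c(AIC(glm.raw), AIC(glm.log))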

Once again, we are using the diabetes readmission data from the UCI Machine Learning Repository. You can download my slightly altered version of the data that I am using from here.

The key tool that we are going to use to linearise the relationships is a generalized additive model (GAM). GAMs are an extension of generalized linear models that allow for non-linear relationships using non-parametric smoothing functions. It’s a bit like using rolling averages, but much more sophisticated.

The diabetes readmission data has a few numeric columns, but today we will just consider the column containing the number of inpatient visits to the hospital over the previous 12 months. Let’s start by doing a one-way analysis of the relationship with the rate of hospital readmission.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(data.table)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))
# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)

# summarise by number_inpatient
dt <- data.table(td)
summary = dt[,list(mean=mean(readmitted_flag)),by=number_inpatient]
plot(summary$number_inpatient, summary$mean)

# fit a straight line
model.lm = lm(mean ~ number_inpatient, data = summary)
summary$straight_line = coef(model.lm)[1] + summary$number_inpatient * coef(model.lm)[2]
plot(summary$number_inpatient, summary$mean)
lines(summary$number_inpatient, summary$straight_line, col = "red")

[Plot: one-way readmission rate by number of inpatient visits, with the fitted straight line]

The plot shows a non-linear relationship: there is curvature, and the relationship appears to be concave.

When we fit a GLM to this data, we use a logit link function, because the target has a Bernoulli distribution. So instead of linearising the relationship shown in the one-way analysis above, we should linearise the relationship after the logit link function has been applied. We can see the relationship, after the logit link function, by fitting a GAM model.

R has two packages for building GAMs. I use the mgcv package because it includes automatic regularisation, and because it allows me to obtain a table of results from which I can fit a linearising function. The mgcv package supports both tensor and spline smoothing functions. In the sample below I have specified a tensor smoothing function, using the te() syntax applied to the numeric predictor. Note that you will need mgcv package version 1.8.6 or higher for this R script to work.

# fit a GAM
pacman::p_load(mgcv, minpack.lm)

# check the version of mgcv - must be at least 1.8.6
# (comparing version objects directly avoids fragile substring parsing,
#  which breaks for two-digit minor or revision numbers)
if (packageVersion("mgcv") < "1.8.6")
{
  stop("You need to install a newer version of the mgcv package")
}

model.gam = gam(readmitted_flag ~ te(number_inpatient), data = td, family = binomial(link=logit))

# get a table of partial relativities
pd = plot.gam(model.gam, residuals = FALSE, n=200, select = NULL)
if (is.null(pd))
{
  stop("You need to install a newer version of the mgcv package")
}

[Plot: GAM partial effect of number_inpatient on the logit scale, with confidence interval]

The GAM plot shows strong curvature, and also shows a confidence interval. Since the confidence interval becomes very broad for high numbers of inpatient visits (due to the low number of patients with such high numbers of visits), we should be very suspicious of any apparent curvature out in those ranges.

The pd variable contains the values that were used to create the plot shown above. We will fit linearising functions against the data within the pd variable.

Today we will only consider a power transformation and a truncated version of the power transformation. Power transformations are commonly used as linearising functions, and I remember being taught to use them when doing my generalized linear models course at university. Given the concave nature of this relationship, you should probably try out a log transformation too; a sketch of that follows the power fit below.

# create a table of the predicted values and standard errors
pdCol = pd[[1]]
tab = data.frame(pdCol$x, pdCol$fit, pdCol$se)
names(tab) = c("x", "y", "se")
head(tab)

plot(tab$x, tab$y)

# fit a straight line
model.lm.2 = lm(y ~ x, data = tab, weights = 1 / se ^ 2)
tab$linear = coef(model.lm.2)[1] + coef(model.lm.2)[2] * tab$x

# fit a power transformation
initA = tab$y[1]
initB = tab$y[nrow(tab)]
initC = 0
minX = min(tab$x)
maxX = max(tab$x)
model.power = nlsLM(y ~ a + (b - a) / (maxX - minX) * (x - minX) ^ (1 + c), data = tab,
                    weights = 1 / se ^ 2,
                    start = c(a = initA, b = initB, c = initC),
                    trace = FALSE,
                    control = nls.control(maxiter = 1000, tol = 1e-05, minFactor = 1/2048,
                                          printEval = FALSE, warnOnly = TRUE))
tab$power = predict(model.power)
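
As flagged above, a log linearising function is also worth trying. Here is a minimal sketch reusing the same tab data frame; the "+ 1" offset inside the log is my assumption, to cope with zero inpatient visits:

# fit a log linearising function: y ~ a + b * log(1 + x)
initB2 = (tab$y[nrow(tab)] - tab$y[1]) / log(1 + max(tab$x))
model.log = nlsLM(y ~ a + b * log(1 + x), data = tab, weights = 1 / se ^ 2,
                  start = c(a = tab$y[1], b = initB2))
tab$logfit = predict(model.log)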

# fit a truncated power transformation
initA = tab$y[1]
initB = tab$y[nrow(tab)]
initC = 0
tab$x.truncated = tab$x
tab$x.truncated[tab$x.truncated > 12] = 12
minX = min(tab$x.truncated)
maxX = max(tab$x.truncated)
model.power.truncated = nlsLM(y ~ a + (b - a) / (maxX - minX) * (x.truncated - minX) ^ (1 + c),
                              data = tab, weights = 1 / se ^ 2,
                              start = c(a = initA, b = initB, c = initC),
                              trace = FALSE,
                              control = nls.control(maxiter = 1000, tol = 1e-05, minFactor = 1/2048,
                                                    printEval = FALSE, warnOnly = TRUE))
tab$power.truncated = predict(model.power.truncated)

# plot the result
ylim = c(min(pdCol$fit - pdCol$se), max(pdCol$fit + pdCol$se))
plot(tab$x, tab$y, ylim = ylim, type = "n")  # type = "n" so the grey band doesn't hide the data
polygon(c(pdCol$x, rev(pdCol$x)), c(pdCol$fit - pdCol$se, rev(pdCol$fit + pdCol$se)), col = "lightgrey", border = NA)
lines(tab$x, tab$y, type = "l", col = "black")
lines(tab$x, tab$linear, type = "l", col = "blue")
lines(tab$x, tab$power, type = "l", col = "red")
lines(tab$x, tab$power.truncated, type = "l", col = "green")

[Plot: GAM smoother with confidence band, plus the straight line (blue), power (red) and truncated power (green) fits]

A linear relationship (the blue line) is clearly a poor fit to the data. The power transformation (the red line) does a pretty good job of capturing the curvature of the relationship over most of the range, staying within the confidence interval. It is not clear whether truncation (the green line) is necessary.

To choose between the linearising functions, one should score the goodness of fit, and choose the least complex function that provides an adequate fit. The definition of “adequate fit” varies according to what you need the model for. Remember also that the purpose of a linearising function is not just to fit well, but to capture the shape of the relationship, so I suggest that your goodness of fit score should include some measure of shape, such as the number of runs of the linearising function on either side of the GAM smoother.

# choosing between the models based upon the number of runs
pacman::p_load(randtests)
# keep only the data points that lie closest to an actual data point
tab2 = tab[sapply(summary$number_inpatient, function(x) which.min(abs(tab$x - x))),]
runsTest.linear = runs.test(x = (tab2$y - tab2$linear), alternative = "two.sided", threshold = 0, pvalue = "exact", plot = FALSE)$p.value
runsTest.power = runs.test(x = (tab2$y - tab2$power), alternative = "two.sided", threshold = 0, pvalue = "exact", plot = FALSE)$p.value
runsTest.power.truncated = runs.test(x = (tab2$y - tab2$power.truncated), alternative = "two.sided", threshold = 0, pvalue = "exact", plot = FALSE)$p.value
c(runsTest.linear, runsTest.power, runsTest.power.truncated)

To fully automate this R script, you would also need to consider different possible truncation points, and choose the best location to truncate. I leave it to the reader to implement that extra code.
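
As a starting point, here is a minimal sketch of how that search might look, reusing the tab data frame from above; the integer candidate grid and the weighted sum of squared errors score are my assumptions:

# score each candidate truncation point by weighted sum of squared errors
candidates = 2:floor(max(tab$x))
scores = sapply(candidates, function(tp) {
  x.t = pmin(tab$x, tp)
  mn = min(x.t)
  mx = max(x.t)
  fit = try(nlsLM(y ~ a + (b - a) / (mx - mn) * (x.t - mn) ^ (1 + c), data = tab,
                  weights = 1 / se ^ 2,
                  start = c(a = tab$y[1], b = tab$y[nrow(tab)], c = 0)), silent = TRUE)
  if (inherits(fit, "try-error")) return (Inf)
  sum((tab$y - predict(fit)) ^ 2 / se ^ 2)
})
candidates[which.min(scores)]  # the best truncation point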

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.


Efficiently Building GLMs: Part 3

Friday, 31 July 2015

Posted by Colin Priest in Generalized Linear Models, R


Tags

Automation, Generalized Linear Models, GLMs, R

In my last blog in this series, I described some R packages that can automatically choose the best GLM formula for you. Some people wanted more on this topic. In particular, Asko Kauppinen asked about quasi-complete separation, “empty cells in multidimensional space”. So in this blog, I will show you how to customise the automation process to deal with this issue. Once again, we are using the diabetes readmission data from the UCI Machine Learning Repository. You can download my slightly altered version of the data that I am using from here.

Sometimes R gives you a warning when separation occurs, for example when running the following code.


# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# fit a GLM that has separation
diabetes.glm.1 = glm(readmitted_flag ~ admission_type_desc + discharge_id_desc, data = td, family = binomial(link=logit))


After fitting the model, R shows the following warning:

[R console warning: "glm.fit: fitted probabilities numerically 0 or 1 occurred"]

Note that this is a warning, and not an error. That’s because it may be perfectly valid for a fitted probability of 0 or 1 to occur. In the diabetes readmission data, it is reasonable to expect that no patients having a discharge description of “expired” (i.e. the patient died in hospital) will ever be readmitted to hospital again!

But, to simplify the example, let’s assume that expired patients can be readmitted, and therefore that we don’t want fitted probabilities of zero, and that medical treatment usually has some benefit and therefore that we don’t want fitted probabilities of one.

We could catch the warning thrown by R, and tell our model building algorithm that we don’t want to use these types of models. You could achieve this via R script that looks something like this:


# use a null model if this model has separation
diabetes.glm.2 = tryCatch(
  glm(readmitted_flag ~ admission_type_desc + discharge_id_desc, data = td, family = binomial(link=logit)),
  warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) }
)
diabetes.glm.2

The tryCatch function notices when a warning is thrown, and then fits the alternative null model instead. The result can be seen to be the null model:

[R console output: the fitted model is the null model]

The null model will be the worst performing GLM, so this model will not be selected by the algorithm as being the “best” model.

This simple solution looks too good to be true, and that’s because it is too good to be true. Consider the following R script and output:


# use a null model if this model has separation
diabetes.glm.3 = tryCatch(
  glm(readmitted_flag ~ discharge_id_desc, data = td, family = binomial(link=logit)),
  warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) }
)
diabetes.glm.3

[R console output: fitted coefficients, with no warning shown]

This time, R didn’t throw a warning, and, without informing the user, R has predicted the existence of zombies!


So we can’t rely on R to consistently identify when separation occurs. We have to code for it ourselves…


# use a null model if this model has separation
# look for counts of zero
if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == 0) > 0)
{
  diabetes.glm.4 = glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit))
  print("separation with zeroes!")
} else
{
  # look for counts equal to the number of observations
  if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == table(td$discharge_id_desc)) > 0)
  {
    diabetes.glm.4 = glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit))
    print("separation with ones!")
  } else
  {
    diabetes.glm.4 = glm(readmitted_flag ~ discharge_id_desc, data = td, family = binomial(link=logit))
  }
}
diabetes.glm.4

[R console output: "separation with zeroes!" followed by the null model]

This time we have been successful at manually identifying separation. A null model has been created. But this script only works for models with a single predictor. To work with models that use more than one predictor, we need to improve our approach. Since the script is getting more complex, I will refactor it into a function that takes a formula and a data frame, and returns a GLM. That GLM will be the null model when separation is found.

# expand a dot formula to show all of the terms
expandDotFormula = function(f, dat)
{
  fAdj = f
  target = gsub("()", "", f[2], fixed=TRUE)

  # allow for dot-style formula
  if (gsub("()", "", f[3], fixed=TRUE) == ".")
  {
    nm = names(dat)
    nm = nm[nm != target]
    fAdj = formula(paste(target, " ~ ", paste(nm, collapse=" + "), sep=""))
  }

  return (fAdj)
}

# test whether a specific term in a formula will cause separation
termHasSeparation = function(target, predictor, dat)
{
  cols = unlist(strsplit(predictor, ":"))
  isFactor = sapply(cols, function(x) class(dat[,x]) == "character" || class(dat[,x]) == "factor")
  if (sum(isFactor) == 0)
    return (FALSE)
  fff = formula(paste(target, " ~ ", paste(cols[isFactor], collapse=" + "), sep = ""))
  # check whether there are cells in which there are all zeroes
  if (sum(xtabs(fff, data = dat) == 0) > 0)
    return (TRUE)
  # check whether there are cells in which there are all ones
  if (sum(xtabs(fff, data = dat) == table(dat[,cols[isFactor]])) > 0)
    return (TRUE)
  # check whether there are cells with no exposure
  nPossibleCells = prod(sapply(cols[isFactor], function(x) length(unique(dat[,x]))))
  nActualCells = prod(dim(xtabs(fff, data = dat)))
  if (nActualCells < nPossibleCells)
    return (TRUE)

  return (FALSE)
}

# test whether a formula will cause separation
fHasSeparation = function(f, dat)
{
  ff = expandDotFormula(f, dat)
  target = gsub("()", "", ff[2], fixed=TRUE)
  tt = terms(ff)
  ttf = attr(tt, "factors")

  # ttf has columns that represent individual terms of the formula
  fTerms = colnames(ttf)
  # check each term for separation
  reject = any(sapply(fTerms, function(x) termHasSeparation(target, x, dat)))

  return (reject)
}

# this function returns a NULL model if the formula causes separation
noSeparationGLM = function(f, data)
{
  if (fHasSeparation(f, data))
  {
    target = gsub("()", "", f[2], fixed=TRUE)
    f = formula(paste(target, " ~ 1", sep=""))
  }
  return (glm(f, data = data, family = binomial(link=logit)))
}

# test some formulae
fHasSeparation(readmitted_flag ~ ., td)
fHasSeparation(readmitted_flag ~ race, td)
fHasSeparation(readmitted_flag ~ gender, td)
fHasSeparation(readmitted_flag ~ admission_type_desc + discharge_id_desc, td)
fHasSeparation(readmitted_flag ~ admission_type_desc : discharge_id_desc, td)
fHasSeparation(readmitted_flag ~ weight_num : discharge_id_desc, td)
fHasSeparation(readmitted_flag ~ diag_1, td)

# test some GLMs
noSeparationGLM(readmitted_flag ~ ., td)
noSeparationGLM(readmitted_flag ~ race + weight_num, td)

[R console output: results of the fHasSeparation tests, and the two fitted GLMs]

The fHasSeparation function now allows us to flag when a formula has terms that will cause separation, and the noSeparationGLM function returns a null model if you try to fit a model that has separation. We can use the noSeparationGLM function directly in glmulti to force it to reject GLMs that have separation. Sample code is shown below:


# fit a GLM allowing that possible models will be rejected due to separation
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmulti)

glms = glmulti(readmitted_flag ~ ., data = td[1:1000,1:15], level = 1, method="h", fitfunc = noSeparationGLM)

# show the result - note the absence of columns that cause separation

summary(glms)$bestmodel

You should also check which individual columns and pairs of columns cause separation. Some of these might be valid, e.g. the “expired” patients in the diabetes readmission data, and the check can also point out where you may need to do some dimensionality reduction. The R script to do this is:


# list the columns that cause separation
names(td[,-1])[sapply(names(td[,-1]), function(x) termHasSeparation("readmitted_flag", x, td))]

# list the pairs of columns that cause separation
colPairs = expand.grid(names(td[,-1]), names(td[,-1]))
colPairs = colPairs[apply(colPairs, 1, function(x) x[1] != x[2]),]
combinations = apply(colPairs, 1, function(x) paste(x, collapse=":"))
combinations[sapply(combinations, function(x) termHasSeparation("readmitted_flag", x, td))]

There are 18 predictor columns that have problems with separation, and 2,024 paired interactions of predictor columns that have problems with separation.

[R console output: the lists of columns and column pairs that cause separation]


Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.


Efficiently Building GLMs: Part 2

Friday, 17 July 2015

Posted by Colin Priest in Generalized Linear Models, Genetic Algorithms, Machine Learning, R


Tags

Automation, Generalized Linear Models, Genetic Algorithms, GLMs, Machine Learning, R

In my last blog, I started this series of blogs discussing how to make the GLM model building process more efficient, and I showed an example of how R can help you find a good set of starting predictors that capture the essence of what can be explained.

In this blog I will show you a few ways that R can also help you to fine tune the choice of predictors.

In my experience, most statisticians and actuaries follow a heuristic process for fine tuning their GLMs. They look for predictors with low statistical significance that can be dropped. They try to add predictors that they expect might be valuable. It’s quite a manual process, with experimentation and tinkering, and frequently it isn’t documented.

Before the invention of the printing press, books were hand written. Monasteries had rooms called scriptoria where monks would copy manuscripts, painstakingly drawing and writing, copying pages of existing books. Later, as the first universities emerged, a new type of scribe, who wasn’t a monk, would carry out the same process in scriptoria that were located within those universities.

[Image: “Escribano” by Jean Le Tavernier, licensed under Public Domain via Wikimedia Commons]

Are we using statisticians and actuaries like scribes, doing manual work that can (and should) be automated? That would free them up to make more value-added contributions, such as sensibility checks, contextualisation, and recommending practical improvements to underwriting, pricing, risk management and marketing.

Improvement 2: Automating the Process for Selection of Predictors

You can automate the search process. A modern computer can exhaustively search through models at a rate that is several orders of magnitude faster than a human. If you include some of the same heuristics that the humans use, you can save a lot of time, and reduce operational risk.

There are a few R packages that can do this for you, without requiring customisation. Today I will consider two of them, bestglm and glmulti.

By default, but only for normally distributed residuals, the bestglm package uses the “leaps and bounds” algorithm, which was developed by Furnival and Wilson back in 1974. Click here to download and read the original paper. The algorithm begins with an overspecified model, and then recursively considers dropping out each remaining predictor. It rules out searching further when the child models (caused by choosing whether to keep or drop out a specific predictor) do not improve the quality of the model. It can be visualised as searching through the set of models, using a tree structure.

For non-normally distributed GLMs, an exhaustive search is carried out i.e. bestglm considers every possible subset of predictors, and evaluates the quality of the model.

To give a practical demonstration of this process, I will again use the diabetes readmission data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008. I have reformatted the data slightly, so if you wish to exactly replicate my analysis, then I suggest that you download a copy of the reformatted data from here.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(bestglm)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# remember that the target must be the last column in the data
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 1000),c("gender", "time_in_hospital", "number_inpatient", "diag_1", "diag_2", "diag_3", "discharge_id_desc", "readmitted_flag")]

# find the model with the best BIC
bestBIC = bestglm(td, family = binomial, IC = "BIC")

# Show top 5 models
bestBIC$BestModels

# show a summary of the best model
summary(bestBIC$BestModel)

# show the relationship between the number of predictors and the model quality
plot(seq_len(nrow(bestBIC$Subsets)) - 1, bestBIC$Subsets[,"BIC"], type="b", xlab = "Number of Predictors", ylab = "BIC")

# show again, but cap it at 3 predictors
plot(seq_len(4) - 1, bestBIC$Subsets[seq_len(4),"BIC"], type="b", xlab = "Number of Predictors", ylab = "BIC")

For the purpose of this example, to keep the run time down to a few minutes, I have only used a subset of the columns and the rows in the data. In practice you would use as many of the columns and rows as you need, limited by the computing power available to you. While I have used the BIC as the measure of the “best” GLM model, this is purely for illustrative purposes. The discussion of which measure to use will be the topic of another blog.

[R console output: top 5 models by BIC, from bestglm]

The BIC criterion heavily penalises the use of diag_1, diag_2 and diag_3 as predictors because of their very high degrees of freedom (each is a factor containing hundreds of different diagnosis codes). Interestingly, the null model (no predictors) ranks number 4 amongst the top 5 models! However, that is probably because we haven’t transformed the numeric predictors to have linear relationships to the target, and we haven’t grouped together any of the factor levels in diag_1, diag_2 and diag_3. Note that the model rankings will change with the number of rows that you include, and with the choice of information criterion by which you measure model performance.

[R console output: summary of the best model]

The summary of the best model shows a simple model with highly significant predictors.

[Plot: BIC versus number of predictors]

The plot shows that the best BIC score happens with the use of 2 predictors.

But the bestglm package has its limitations. Exhaustively searching through every possible model is time consuming, even when automated, and especially when the data has many columns and/or rows. And bestglm doesn’t automatically consider interaction terms.

Enter the glmulti package. First let’s get glmulti to replicate the results of the bestglm script.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmulti)
# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 1000),c("gender", "time_in_hospital", "number_inpatient", "diag_1", "diag_2", "diag_3", "discharge_id_desc", "readmitted_flag")]

# replicate the analysis done by bestglm
bestBIC = glmulti(readmitted_flag ~ ., data = td, family = binomial, level = 1, crit=bic, fitfunc=glm, method="h",
confsetsize = 256, plotty = TRUE, report = TRUE)

print(bestBIC)

The “level” parameter has been set to a value of 1, which means that we are telling glmulti to not consider interaction effects.

[R console output: glmulti best model, matching the bestglm result]

It can be seen that the results line up with those of bestglm.

To consider 2-way interaction effects, use “level = 2”.

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# allow for 2-way interactions
bestBIC2 = glmulti(readmitted_flag ~ gender + time_in_hospital + num_procedures + number_inpatient,
                   data = td[sample(nrow(td), 10000),], family = binomial, level = 2,
                   crit = bic, fitfunc = glm, method = "h", confsetsize = 256,
                   plotty = TRUE, report = TRUE)
print(bestBIC2)

[R console output and plot: glmulti exhaustive search results with 2-way interactions]

Including 2-way interaction effects increases the run time exponentially. So in this example I have used fewer predictors. Normally you would not do this.

Once we start including 2-way interactions and all of the possible predictor columns, an exhaustive search becomes prohibitively time consuming. In such cases, a better approach is to use genetic algorithms to search through possible models. To do this, change the “method” parameter to “g”.

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 10000),c("race", "gender", "time_in_hospital", "num_medications", "number_emergency", "number_inpatient", "discharge_id_desc", "medical_specialty_desc", "readmitted_flag")]

# solve using genetic algorithms
bestBIC.ga = glmulti(readmitted_flag ~ ., data = td, family = binomial, level = 1, crit=bic, fitfunc=glm, method="g",
confsetsize = 256, plotty = TRUE, report = TRUE)
print(bestBIC.ga)

[R console output: glmulti genetic algorithm best model]

glmulti hides the complexity of the genetic algorithm implementation from you, so it won’t take long for you to get up and running with your own model optimisation using genetic algorithms. Despite this ease of use, this year I have switched from using glmulti, to writing my own customised genetic algorithm implementations because:

  • glmulti’s genetic algorithm implementation doesn’t have parallel processing abilities, which would speed up the search (am I being impatient?),
  • I often want to customise the initial population, to allow for knowledge I already have about what constitutes a reasonable starting model (for example, by including the results from the analysis explained in my last blog),
  • the inclusion of interaction terms can often lead to GLM fitting errors, such as collinearity or overspecification, that can crash or freeze up glmulti’s genetic algorithm (and this problem occurs on the diabetes readmission data that I use in this blog),
  • glmulti’s genetic algorithm often has duplicate model choices within its population, and so runs the GLM fitting process for exactly the same choice of columns more than once; this comes with a computational expense, but by customising the implementation I can cache the previously fitted models and just pull out the results without refitting (see the sketch after this list),
  • glmulti has an undocumented upper limit on how many predictors it will consider, throwing the error message “too many predictors” if that limit is exceeded, and that limit is low (low enough to be triggered for the diabetes dataset example that I am using in this blog), and
  • sometimes I want to retain more information about intermediate models, e.g. AIC, BIC, coefficient values and leverage.
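
To illustrate the caching point above, here is a minimal sketch of memoising GLM fits by their predictor set, so that a genetic algorithm never refits an identical model. The function and environment names are my own, not part of glmulti:

# cache fitted GLMs in an environment, keyed by the sorted predictor terms
modelCache = new.env()
cachedGLM = function(f, data)
{
  key = paste(sort(attr(terms(f, data = data), "term.labels")), collapse = "+")
  if (key == "") key = "(null model)"
  if (is.null(modelCache[[key]]))
  {
    modelCache[[key]] = glm(f, data = data, family = binomial(link=logit))
  }
  return (modelCache[[key]])
}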

If there are enough requests from readers wanting to know how to create custom genetic algorithm searches for GLM model building, then I will make that the topic of a future blog 🙂
Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.


Efficiently Building GLMs: Part 1

Friday, 10 July 2015

Posted by Colin Priest in Generalized Linear Models, Gradient Boosting Machine, IntelliM, Machine Learning, R


Tags

Automation, GBMs, Generalized Linear Models, GLMs, IntelliM, Machine Learning, R, Variable Importance

It was way back in 1972 that John Nelder and Robert Wedderburn first introduced GLMs to the world (you can find their original paper here), and by the 1990s actuaries in the insurance industry (I am an actuary) had started using GLMs for technical pricing, empowered by the increased accessibility of modern powerful computers. In some parts of the world there are now huge teams, consisting of dozens of actuaries, sometimes more than 100 of them, building generalized linear models (GLMs) for technical pricing of insurance policies. But is this efficient or is it as out of date as a dinosaur riding a penny farthing?

[Image: a dinosaur riding a penny farthing]

At university, when I was doing my masters degree in Applied Statistics, I was taught that the “best” GLM is the one with the lowest AIC or BIC (whether AIC and BIC are good model benchmarks will be the topic of a future blog). But an information criterion is not a well behaved function of the choice of predictors: it can have many local minima and maxima. To find the model with the lowest information criterion, one cannot follow a steepest gradient approach; instead, one must systematically search through the different combinations of predictors.

Consider a scenario in which there are 30 possible predictors, consisting of rating factors collected from the customer (e.g. zip code / post code) and/or data variables collected from other sources (e.g. credit rating of the insured), where all of these values are either categorical, or are numeric and have a linear relationship to the variable being modelled (whether that be claim frequency, severity or claims cost per policy). In such a situation, there are 2^30 = 1,073,741,824 possible combinations of predictors that could be included in the GLM formula. If each of your actuarial or statistical staff took 10 minutes to test a particular combination of predictors, and each of those staff worked 8 hours per day, 5 days per week, for 50 weeks per year, it would take almost 90,000 man-years to find the best model! Once you include the search for 2-way interactions between predictors, the search time blows out to longer than the life-span of the universe!!!
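
A quick back-of-envelope check of that estimate in R (the 5-day working week is my assumption):

# 2^30 combinations at 10 minutes each, versus one person's working minutes per year
minutesPerYear = 8 * 60 * 5 * 50   # 8 hours/day, 5 days/week, 50 weeks/year
(2 ^ 30 * 10) / minutesPerYear     # roughly 90,000 man-years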

In practice, actuaries and statisticians are producing GLM models faster than that estimate because they are using heuristics to substantially reduce the number of combinations to search. Those heuristics are usually based upon variants of step-wise regression, whereby they add or remove one predictor at a time. This is still extremely time consuming, and the process still does not necessarily produce the model with the best AIC or BIC.
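
For reference, the classic automated form of that heuristic is step-wise selection by AIC, as implemented in the base R step function. A minimal sketch, assuming the training data td and the columns used later in this blog:

# step-wise search: add or remove one predictor at a time, guided by AIC
base.glm = glm(readmitted_flag ~ number_inpatient, data = td, family = binomial(link=logit))
step.glm = step(base.glm,
                scope = readmitted_flag ~ number_inpatient + time_in_hospital + gender,
                direction = "both", trace = FALSE)
summary(step.glm)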

How can you make this process more efficient?

Let’s make this more real by considering some publicly available data tracking hospital readmission of diabetes patients in USA from 1999 to 2008. You can find the data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Improvement 1: Starting With a Reasonable Choice of Predictors

One can improve upon the step-wise regression approach by starting with a model that already has a few of the most useful predictors.

Instead of beginning with a GLM that uses all the predictors and removes them one-by-one, or starting with no predictors and adding more predictors one-by-one, you can start by understanding which predictors are likely to be the best candidates, and this involves taking a step beyond GLMs into some more modern techniques, including:

  • lasso regularisation
  • variable importance measures via random forests or gradient boosting machines

For the sake of simplicity, this blog will apply just one of these approaches, the variable importance measure based upon a gradient boosting machine. But any of these approaches will usually achieve the task.
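
For readers who prefer the lasso route, a minimal sketch using the glmnet package might look like the following, assuming the training data td that is read in the script below; glmnet needs a numeric model matrix, and the choice of the 1-standard-error lambda is my assumption:

pacman::p_load(glmnet)

# the lasso needs a numeric matrix, so dummy-encode the factors
X = model.matrix(readmitted_flag ~ ., data = td)[, -1]
cvfit = cv.glmnet(X, td$readmitted_flag, family = "binomial", alpha = 1)

# predictors with non-zero coefficients at the 1-standard-error lambda
cf = coef(cvfit, s = "lambda.1se")
rownames(cf)[as.vector(cf) != 0]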

A gradient boosting machine (GBM) is a very flexible type of machine learning algorithm. It can be used for regression and classification problems. You can use GBMs via the GBM package available in R. GBMs are a forest of trees whereby each successive tree is fitted to the residuals of the previous iteration of the forest i.e. each new tree predicts the errors from the existing forest. The GBM package has a measure of “relative influence” that is quite similar to a variable importance measure, and can be used for the same purpose.

Variable importance or relative influence is a measure of how much of the variation in outcomes is explained by the inclusion of the predictor in the model. A predictor will explain more of the variation in outcomes if:

  • it is statistically significant, i.e. the difference isn’t random,
  • the difference is large for different values of the predictor, i.e. the predictor differentiates well, and
  • there is considerable variation in the predictor value between observations, i.e. more than just a small number of predictor observations differ from the average.

Here is some sample R code to give an indication of how one would do this with the diabetes readmission data:

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(gbm)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# GBM variable importance
set.seed(1)
gbm_imp = gbm(formula = readmitted_flag ~ ., distribution = "bernoulli", data = td,
              n.trees = 1000, interaction.depth = 1, verbose = TRUE, shrinkage = 0.01,
              cv.folds = 0, keep.data = FALSE)
s = summary(gbm_imp)
head(s)

As I wrote in my last blog, I’m a fan of the pacman package in R. It conveniently checks that packages are installed before loading them, installing any that are missing and then loading them as required.

The next step is to read the diabetes readmission data into R. I am reading the data from a comma delimited file that I have created previously after downloading the UCI Machine Learning Repository data. You should edit the sample R script to use the folder and file name of your data.

Finally I fitted a GBM model. For the purposes of this blog I set the random seed, to make the results replicable. Note that the model hyperparameters were not optimised – I am just creating a model for the sake of understanding which predictors are important, not trying to fit the best possible GBM model. But sometimes the interaction.depth hyperparameter does matter. In my script above I have used interaction.depth = 1, which excludes the possibility of 2-way interaction effects between two predictors. I chose a value of 1 for simplicity for this example, and because most of the time it doesn’t make a difference to the discovery of the most important predictors. However, if you have strong reason to believe that your data exhibits strong 2-way effects between predictors, and also have reason to believe that the one-way effect of those same predictors is weak, then try higher values for this hyperparameter.

[R console output: the most important predictors and their relative influence]

The summary function gets the variable importance measures from the GBM model. It displays a plot of the most important predictors, and also stores them in a table. As you can see from the screenshot above, R shows that just 5 predictors will give most of the predictive power. Note that the variable importance scores have been automatically scaled to add to 100.

[Plot: default GBM relative influence bar chart]

The default plot for GBM variable importance is rather difficult to read and interpret. If I am working on a project that is purely within R, then I usually script a better looking plot based upon the ggplot2 package. Lately I have been using IntelliM to build my GLMs, because it automates the feature extraction, model building and validation process for GLMs, and it does so visually instead of via manually written R scripts. It gives an easier to read graph, shown below.

[Plot: variable importance, as displayed by IntelliM]
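
For the pure-R route mentioned above, here is a minimal ggplot2 sketch of a cleaner importance plot, assuming the importance table s from the earlier script (the summary of a gbm model returns the columns var and rel.inf):

pacman::p_load(ggplot2)

# plot the ten most important predictors, largest at the top
top = head(s, 10)
ggplot(top, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_col() +
  coord_flip() +
  labs(x = "Predictor", y = "Relative influence (%)")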

Later in this series of blogs, I will discuss dimensionality reduction, but since one approach to dimensionality reduction relates to what I have just shown in this blog, I will give you a sneak peak.

The diabetes readmission data contains categorical predictors that contain codes for diagnoses and other data of interest. Some of these predictors have several hundred different possible codes. Unless we had many millions of observations in total, there is no way that all of those codes are going to have enough observations to provide statistically valid predictions. What we expect is that just a small number of these codes are important.

[Plot: importance of individual codes within a categorical predictor]

So we can apply the same variable importance approach to the codes within a categorical predictor, and then group the unimportant codes together, making for a better model.
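
Here is a minimal sketch of that grouping idea, applied to the diag_1 codes; the one-hot encoding and the 1% influence threshold are my assumptions:

# give each diagnosis code its own 0/1 column, so each code gets an influence score
dummies = model.matrix(~ diag_1 - 1, data = td)
dat = data.frame(readmitted_flag = td$readmitted_flag, dummies)

set.seed(1)
gbm_codes = gbm(readmitted_flag ~ ., distribution = "bernoulli", data = dat,
                n.trees = 1000, interaction.depth = 1, shrinkage = 0.01, keep.data = FALSE)
imp = summary(gbm_codes, plotit = FALSE)

# keep codes above a 1% influence threshold and group the rest together
importantCodes = gsub("diag_1", "", imp$var[imp$rel.inf > 1])
td$diag_1_grouped = ifelse(td$diag_1 %in% importantCodes, as.character(td$diag_1), "OTHER")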

