Keeping Up With The Latest Techniques
Efficiently Building GLMs: Part 4

08 Saturday Aug 2015

Posted by Colin Priest in Automation, Generalized Linear Models, Genetic Algorithms, Machine Learning, R


Tags

Automation, Genetic Algorithms, GLMs, Machine Learning, R

In my previous blog in this series, I discussed how you can write custom R functions that can recognise separation and missing cells in multivariate data. This week I discuss how you can customise the process that searches for the best combination of predictor columns.

While the bestglm and glmulti packages are great for getting you started in automation, I found that I wanted more. In particular I wanted:

  • automation of data with more predictor columns than glmulti allows,
  • parallel processing to satisfy my impatience, and
  • customisation of the model selection, to allow for issues such as separation and collinearity.

So I created my own functions as alternatives.

Restructuring the Problem

Before we implement a search algorithm, we need to restructure the GLM model building problem so that it becomes an optimisation problem. We do this by creating a vector of binary values that flags which columns are to be included in the GLM model formula.

1 0 1 0

Consider the binary vector above. Since the first value equals 1, the first predictor column is included in the model formula. Since the third slot in the vector equals 1, the third predictor column is also included. However, the second and fourth predictor columns are not included, because the indicators in those slots have a value of 0. We can code this in R:

# a function to create GLM formulae from an indicator vector
createFormula = function(indicators)
{
  # special case of no predictor columns, so use the NULL model
  if (sum(indicators) == 0)
    return (formula(paste(target, " ~ 1", sep="")))

  # the more common case
  return (formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + "))))
}
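
For example, given the indicator vector shown above, the function behaves as follows. This is a minimal sketch: the target and predictor names below are hypothetical stand-ins, chosen only to mirror the columns discussed later in this post.

# hypothetical set-up: a target and four candidate predictor columns
target = "readmitted_flag"
predictors = c("age_num", "weight_num", "number_emergency", "number_inpatient")

# the indicator vector 1 0 1 0 keeps the first and third predictors
createFormula(c(1, 0, 1, 0))
# readmitted_flag ~ age_num + number_emergency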

So the optimisation problem is to find the binary vector that creates the optimal GLM formula, where “optimal” relates to the quality of the GLM that you created (allowing for whatever measure you choose to use). To keep this example simple, I will use the BIC score of the GLM, but that does not mean that I am recommending that you should use BIC. It just means that this blog is only about automation, and there isn’t space here to discuss how best to measure the quality of a GLM.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(memoise)

# the function to be optimised
glmScore = function(indicators)
{
  # return a NULL model if any errors or warnings occur
  glm.mod = tryCatch (
    glm(createFormula(indicators), data = td, family = binomial(link=logit)),
    error = function(e) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit)),
    warning = function(w) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit))
  )
  # if there are problems with NA parameters (e.g. via separation), then use a NULL model
  if (any(is.na(coef(glm.mod))))
    glm.mod = glm(createFormula(indicators * 0), data = td, family = binomial(link=logit))
  print(glm.mod$formula)
  return(BIC(glm.mod))
}



# create a version of glmScore that remembers (caches) previous results
glmScr = memoise(glmScore)

As you can see, you can customise how a particular GLM is scored, to allow for your preferences about issues such as separation or collinearity.

The memoise package is another excellent utility that I have recently discovered. It remembers the return value for a function for a given parameter value, and instead of wasting time re-evaluating the function the next time that parameter value is used, it returns the cached return value for that particular parameter input. This can save a lot of processing time when a function is numerically demanding (e.g. fitting a GLM) and called repeatedly using the same parameter values (e.g. when iteratively searching through candidate formulae).
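
As a quick toy illustration of that caching behaviour (not part of the model building code; the slow function here is invented purely for the demonstration):

library(memoise)

# a deliberately slow function
slowSquare = function(x) { Sys.sleep(2); return(x^2) }
fastSquare = memoise(slowSquare)

system.time(fastSquare(10)) # about 2 seconds - the function is actually evaluated
system.time(fastSquare(10)) # almost instant - the cached result is returned
forget(fastSquare)          # clears the cache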

Imputing Missing Values

Since we are going to be comparing scores from different GLM models, we need to ensure that those models and scores are comparable. Missing values can be an issue here. When there are missing values in some predictors, R drops the rows containing the missing values, and only fits the GLM to the rows with no missing values in the predictors. The model scores are based upon the fit to this reduced data set, and a model fitted to fewer rows will typically have a lower (apparently better) score. But R does not warn the user that it is doing this!
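
One way to spot this is to compare the number of rows in your data with the number of rows that R actually used in the fit. A small sketch, using the weight_num column that is discussed below:

# fit a GLM using a predictor that contains missing values
glm.chk = glm(readmitted_flag ~ weight_num, data = td, family = binomial(link = logit))

# compare the rows available with the rows actually used in the fit
nrow(td)      # total rows in the training data
nobs(glm.chk) # rows remaining after R silently dropped those with missing weight_num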

In the diabetes data used in this series of blogs, there are missing values. In particular, more than half of the values for the patients’ weight are missing. An optimisation based on BIC would choose a model that includes weight_num as a predictor, because the small number of data rows used results in a much lower (apparently better) goodness-of-fit score.

To overcome the problem of missing values, we must either remove all rows containing any missing values, or impute values where they are missing.  R provides a few ways to do this. For example, you can impute values using the mice package.


# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))
# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)
# clean up the missing values
if (!require("pacman")) install.packages("pacman")
pacman::p_load(mice)

cleaned = mice(td)
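
Note that mice() only builds the imputations; to obtain a data set with the missing values filled in, you would then call mice's complete() function. A minimal sketch, taking the first of the imputed data sets:

# extract the first completed (imputed) copy of the training data
td.imputed = complete(cleaned, 1)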

In this blog I will take the dubious but quick approach of dropping out any rows that contain missing values in the numeric fields.


# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)
# remove all rows containing missing values
td = td[! is.na(td$weight_num),]

For the diabetes readmission data this reduces the data from 101,000 rows to 3,000 rows, which is hardly a suitable solution. A better solution would have been to simply drop the weight_num column from the data.
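
For the record, dropping the column instead of the rows is a one-liner, and keeps all of the rows:

# drop the mostly-missing weight column rather than dropping rows
td$weight_num = NULL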

Customised Step-wise Model Building

While step-wise model building is not the best way to put together a GLM, it is the quickest way to get to a reasonable model. I use the step-wise approach whenever I am feeling very, very impatient!

But instead of using an out-of-the-box version of step-wise model building, I wrote my own function. It begins by using a GBM to measure variable importance, and to create a reasonable starting model. Then it uses a version of the step-wise approach that has been enhanced to allow for one predictor to be swapped out for another predictor. Usually step-wise implementations are greedy algorithms, i.e. they only look one step ahead. But swapping one column for another involves two steps – removing one column, then adding another column. A greedy algorithm won’t see past the first step of removing a column, and therefore won’t see the full benefit of the swap, and won’t choose that path. It gets trapped in a local minimum. This enhancement overcomes the problem by treating column swaps as if they involved just a single step. Here is my script:

# function to show a formula as text
f2text = function(f)
{
  paste(as.character(f)[c(2,1,3)], collapse = " ")
}

# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]

# reset the cache on the glmScore function
forget(glmScr)

# create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model
indicators = matrix(0, length(predictors), 1)
bestScore = glmScr(indicators)
bestParameters = indicators

searching = TRUE
while (searching)
{
  # alternate formulae, with just one column added, or one column removed
  alternates = sapply(seq_len(length(indicators)), function(x) { temp = indicators; temp[x] = 1 - temp[x]; return(temp) })

  # alternate formulae, replacing one column with another
  existing = seq_len(length(indicators))[indicators == 1]
  for (i in existing)
  {
    for (j in seq_len(length(indicators)))
    {
      if (i != j && indicators[j] == 0)
      {
        temp = indicators
        temp[i] = 0
        temp[j] = 1
        alternates = cbind(alternates, temp)
      }
    }
  }

  # evaluate all of the alternatives
  scores = apply(alternates, 2, glmScr)

  if (min(scores) < bestScore)
  {
    bestScore = min(scores)
    bestParameters = alternates[, which.min(scores)]
    print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": "))
    indicators = bestParameters
  } else
  {
    searching = FALSE
  }
}

[Output: step-wise search progress, showing the BIC and formula of the best model found at each step]

When you’re impatient, like me, you want things to run as fast as possible. From a speed perspective it makes sense to start with a null model, and add predictors, rather than starting from an overspecified model and removing predictors. You also want parallel processing. Rather than processing one GLM at a time, I want to simultaneously process as many GLMs as my PC will allow! R supports parallel processing, but Windows users have fewer parallel processing options in R than Linux users. Nevertheless, the snowfall package gives me exactly what I want for this particular task.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(snowfall)

# start a parallel cluster
sfInit( parallel=TRUE, cpus=3 )
if( sfParallel() )
{
  cat( "Running in parallel mode on", sfCpus(), "nodes.\n" )
} else
{
  cat( "Running in sequential mode.\n" )
}

# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]

# reset the cache on the glmScore function
forget(glmScr)

# create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model
indicators = matrix(0, length(predictors), 1)
bestScore = glmScr(indicators)
bestParameters = indicators

searching = TRUE
while (searching)
{
  # alternate formulae, with just one column added, or one column removed
  alternates = sapply(seq_len(length(indicators)), function(x) { temp = indicators; temp[x] = 1 - temp[x]; return(temp) })

  # alternate formulae, replacing one column with another
  existing = seq_len(length(indicators))[indicators == 1]
  for (i in existing)
  {
    for (j in seq_len(length(indicators)))
    {
      if (i != j && indicators[j] == 0)
      {
        temp = indicators
        temp[i] = 0
        temp[j] = 1
        alternates = cbind(alternates, temp)
      }
    }
  }

  # export the workspace to the worker nodes
  sfExportAll()

  # evaluate all of the alternatives in parallel
  #scores = apply(alternates, 2, glmScr)
  scores = unlist(sfClusterApplyLB(seq_len(ncol(alternates)), function(x) glmScr(alternates[,x])))

  if (min(scores) < bestScore)
  {
    bestScore = min(scores)
    bestParameters = alternates[, which.min(scores)]
    print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": "))
    indicators = bestParameters
  } else
  {
    searching = FALSE
  }
}

sfStop()

Evolving Your GLM

Step-wise model building works OK when there aren’t too many predictors, and when I don’t want any interaction terms. But once there are many interaction terms, step-wise model building has to work through too many permutations. The problem becomes too high-dimensional for grid-based searches.

We need a new approach. We need our GLMs to evolve into intelligent models.

[Image: Evolution]

Let’s consider the case where we want two-way interaction terms. Once again we use an indicator vector to define the GLM formula, but this time we need to include a vector cell for each and every combination of individual predictors.

The first step is to create a list of all of the individual predictors, and each two-way combination of predictors.


# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]
indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors)))
indices = indices[indices[,1] != indices[,2],]
twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":"))
predictors = c(predictors, twoWays)
length(predictors)

Once we include two-way interaction terms, there are 2,025 terms that could be included in the GLM formula (45 individual predictors, plus 45 × 44 = 1,980 ordered two-way combinations)! A thorough search of all possible formulae would require evaluating 2^2025 GLMs!!!

R has a few packages that implement genetic algorithms. Today I will use the GA package to optimise the indicator vector. Note that genetic algorithms try to maximise a fitness function, whereas we are trying to minimise a score. So we will need to define a fitness function that is -1 times the GLM score.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(GA)

# define a glm fitness function
glmFitness = function(x) return(-1 * glmScr(x))

ga.mod = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), popSize = 100, parallel = TRUE, run = 5)

[Output: first genetic algorithm run (ga.mod)]

From the output we can see that our first attempt at a genetic algorithm didn’t work very well. That’s because most formulae contain 2-way interactions, most 2-way interactions contain missing cells, and therefore these formulae have been rejected. To fix this we need to bias the initial population towards formulae that do not contain 2-way terms, and towards formulae that contain very few terms.


# custom function to create an initial population
initPopn = function (object, ...)
{
  population = matrix(0, nrow = object@popSize, ncol = object@nBits)
  n = ncol(td) - 1
  probability = min(1, 2 / n)
  ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n)
  population[, seq_len(n)] = ones
  return(population)
}

set.seed(1)

ga.mod.2 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 100, parallel = TRUE, run = 5)

[Output: genetic algorithm run with the custom initial population (ga.mod.2)]

This time we had more success. But we need to tune the hyperparameters, using a bigger population size and slower stopping criteria, in order to get better results. I also added a custom function to show me which GLM model formula is the best performing at the end of each epoch. The data preparation and support functions have also been updated to make the process more robust, and I changed the initial population to ensure more genetic diversity.


# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(memoise, caret, GA)

# function to show a formula as text
f2text = function(f)
{
  if (class(f) != "formula")
    stop("f is not a formula!")
  return(paste(as.character(f)[c(2,1,3)], sep = "", collapse = " "))
}

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";
 
# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# remove all rows containing missing values
td = td[! is.na(td$weight_num),]

# remove low variance columns
td = td[,-nearZeroVar(td)]

# remove the diagnosis columns - just to make the example work faster
td = td[,-c(11, 12, 13)]

# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)

# a vector of names of predictor columns
target = "readmitted_flag"
predictors = names(td)
predictors = predictors[predictors != target]
indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors)))
indices = indices[indices[,1] < indices[,2],]
twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":"))
predictors = c(predictors, twoWays)
length(predictors)

# a function to create GLM formulae from an indicator vector
createFormula = function(indicators)
{
  if (is.null(indicators))
    stop("createFormula: indicators is NULL!")
  if (any(is.na(indicators)))
    stop("createFormula: NA found in indicators!")

  # special case of no predictor columns, so use the NULL model
  if (sum(indicators) == 0)
    return (formula(paste(target, " ~ 1", sep="")))

  # the more common case
  result = formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + ")))
  return (result)
}

# the function to be optimised
glmScore = function(indicators)
{
  if (is.null(indicators))
    stop("glmScore: indicators is NULL!")
  if (any(is.na(indicators)))
  {
    if (colin_verbose)
      print("glmScore: NA found in indicators!")
  }

  # if any errors or warnings occur, skip the model and return a penalty score
  fff = createFormula(indicators)
  useMod = FALSE
  glm.mod = NULL
  tryCatch (
    {
      glm.mod = glm(fff, data = td, family = binomial(link=logit));
      useMod = TRUE;
    }
    ,
    error = function(e) {
      if (colin_verbose)
      {
        print(f2text(fff));
        print(e);
      }
    },
    warning = function(w) {
      if (colin_verbose)
      {
        print(f2text(fff));
        print(w);
      }
    }
  )
  if (! useMod)
  {
    return (9e99)
  }
  # if there are problems with NA parameters (e.g. via separation), then return a penalty score
  if (any(is.na(coef(glm.mod))))
  {
    if (colin_verbose)
      print("NAs in coefficients")
    return (9e99)
  }
  result = BIC(glm.mod)
  return(result)
}

# create a version of GLM score that remembers previous results
glmScr = memoise(glmScore)

# define a glm fitness function
glmFitness = function(x) return(-1 * glmScr(x))

# custom function to create an initial population
initPopn = function (object, ...)
{
  population = matrix(0, nrow = object@popSize, ncol = object@nBits)
  n = ncol(td) - 1
  probability = min(1, 2 / n)
  ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n)
  population[, seq_len(n)] = ones
  # make each of the first n rows contain exactly one main-effect predictor
  n2 = min(n, object@popSize)
  for (i in seq_len(n2))
  {
    population[i, ] = 0
    population[i, i] = 1
  }
  return(population)
}

# custom function to monitor the genetic algorithm
glmMonitor = function (object, digits = getOption("digits"), ...)
{
  fitness <- na.exclude(object@fitness)
  i = which.max(fitness[! is.na(object@fitness)])
  p = object@population[! is.na(object@fitness), ]
  x = p[i, ]
  cat(paste("Iter =", object@iter, " | Median =",
            format(-1 * median(fitness), digits = 0),
            " | Best =",
            format(-1 * max(fitness), digits = 0),
            f2text(createFormula(x)),
            "\n"))
}

# run a genetic algorithm to find the "best" GLM with interaction terms
forget(glmScr)
colin_verbose = FALSE
ga.mod.3 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
              population = initPopn,
              popSize = 500,
              parallel = TRUE,
              run = 25,
              monitor = glmMonitor,
              seed = 1)

[Output: genetic algorithm run with the larger population and custom monitor (ga.mod.3)]

The “best” GLM has evolved to include 2-way interactions. Since it doesn’t take too long to run, it is helpful to rerun the genetic algorithm using other random seeds, to check that we have converged on the best score.


ga.mod.4 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
              population = initPopn,
              popSize = 500,
              parallel = TRUE,
              run = 25,
              monitor = glmMonitor,
              seed = 2)

ga.mod.5 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors),
              population = initPopn,
              popSize = 500,
              parallel = TRUE,
              run = 25,
              monitor = glmMonitor,
              seed = 3)

Comparing the results of the three genetic algorithm runs, we find that the best scoring model is:

readmitted_flag ~ number_emergency + number_inpatient + discharge_id_desc + age_num:num_lab_procedures + number_emergency:number_inpatient

This model scored 2191. This model also makes intuitive sense. The following data is predictive of the probability of another hospital visit in the next 30 days:

  • the number of emergency visits to hospital over the previous 12 months
  • the number of hospital visits over the previous 12 months
  • the health of the patient when they were discharged (including whether they died in hospital!)
  • the relationship between the patient’s age and the number of treatments that they receive
  • the relationship between the number of emergency visits and the total number of visits to the hospital over the past 12 months


Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.


Efficiently Building GLMs: Part 2

17 Friday Jul 2015

Posted by Colin Priest in Generalized Linear Models, Genetic Algorithms, Machine Learning, R


Tags

Automation, Generalized Linear Models, Genetic Algorithms, GLMs, Machine Learning, R

In my last blog, I started this series of blogs discussing how to make the GLM model building process more efficient, and I showed an example of how R can help you find a good set of starting predictors that capture the essence of what can be explained.

In this blog I will show you a few ways that R can also help you to fine tune the choice of predictors.

In my experience, most statisticians and actuaries follow a heuristic process for fine tuning their GLMs. They look for predictors with low statistical significance that can be dropped. They try to add predictors that they expect might be valuable. It’s quite a manual process, with experimentation and tinkering, and frequently it isn’t documented.

Before the invention of the printing press, books were hand written. Monasteries had rooms called scriptoria where monks would copy manuscripts, painstakingly drawing and writing, copying pages of existing books. Later, as the first universities emerged, a new type of scribe, who wasn’t a monk, would carry out the same process in scriptoria that were located within those universities.

[Image: "Escribano" by Jean Le Tavernier. Licensed under Public Domain via Wikimedia Commons: https://commons.wikimedia.org/wiki/File:Escribano.jpg]

Are we using statisticians and actuaries like scribes, doing manual work that can (and should) be automated? That would free them up to make more value-added contributions, such as sensibility checks, contextualisation, and recommending practical improvements to underwriting, pricing, risk management and marketing.

Improvement 2: Automating the Process for Selection of Predictors

You can automate the search process. A modern computer can exhaustively search through models at a rate that is several orders of magnitude faster than a human. If you include some of the same heuristics that the humans use, you can save a lot of time, and reduce operational risk.

There are a few R packages that can do this for you, without requiring customisation. Today I will consider two of them, bestglm and glmulti.

By default, but only for normally distributed residuals, the bestglm package uses the “leaps and bounds” algorithm, which was developed by Furnival and Wilson back in 1974. The algorithm begins with an overspecified model, and then recursively considers dropping out each remaining predictor. It rules out searching further down a branch when the child models (created by choosing whether to keep or drop a specific predictor) cannot improve the quality of the model. It can be visualised as searching through the set of candidate models using a tree structure.

For non-normally distributed GLMs, an exhaustive search is carried out i.e. bestglm considers every possible subset of predictors, and evaluates the quality of the model.

To give a practical demonstration of this process, I will again use the diabetes readmission data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008. I have reformatted the data slightly, so if you wish to exactly replicate my analysis, then I suggest that you download a copy of the reformatted data from here.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(bestglm)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# remember that the target must be the last column in the data
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 1000),c("gender", "time_in_hospital", "number_inpatient", "diag_1", "diag_2", "diag_3", "discharge_id_desc", "readmitted_flag")]

# find the model with the best BIC
bestBIC = bestglm(td, family = binomial, IC = "BIC")

# Show top 5 models
bestBIC$BestModels

# show a summary of the best model
summary(bestBIC$BestModel)

# show the relationship between the number of predictors and the model quality
plot(seq_len(nrow(bestBIC$Subsets)) - 1, bestBIC$Subsets[,"BIC"], type="b", xlab = "Number of Predictors", ylab = "BIC")

# show again, but cap it at 3 predictors
plot(seq_len(4) - 1, bestBIC$Subsets[seq_len(4),"BIC"], type="b", xlab = "Number of Predictors", ylab = "BIC")

For the purpose of this example, to keep the run time down to a few minutes, I have only used a subset of the columns and the rows in the data. In practice you would use as many of the columns and rows as you need, limited by the computing power available to you. While I have used the BIC as the measure of the “best” GLM model, this is purely for illustrative purposes. The discussion of which measure to use will be the topic of another blog.

[Output: bestglm's top 5 models]

The BIC criterion heavily penalises the use of diag_1, diag_2 and diag_3 as predictors because of their very high degrees of freedom (each is a factor containing hundreds of different diagnosis codes). Interestingly, the null model (no predictors) ranks number 4 amongst the top 5 models! However, that is probably because we haven’t transformed the numeric predictors to have linear relationships with the target, and we haven’t grouped together any of the factor levels in diag_1, diag_2 and diag_3. Note that the model rankings will change with the number of rows that you include, and with the choice of information criterion by which to measure model performance.
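
To see the size of that penalty, recall that BIC = k log(n) - 2 log(L), so each extra parameter costs log(n). A rough, purely illustrative calculation (the level count of 700 is a made-up number, not taken from the data):

# BIC penalty per extra parameter is log(n), so a factor with hundreds of
# levels (hence hundreds of dummy-variable parameters) is penalised heavily
n = 1000                   # number of rows used in this example
nLevels = 700              # hypothetical number of distinct diagnosis codes
penalty = (nLevels - 1) * log(n)
penalty                    # roughly 4,800 added to the BIC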

[Output: summary of the best model]

The summary of the best model shows a simple model with highly significant predictors.

[Plot: BIC versus number of predictors]

The plot shows that the best BIC score happens with the use of 2 predictors.

But the bestglm package has its limitations. Exhaustively searching through every possible model is time consuming, even when automated, and especially when the data has many columns and/or rows. And bestglm doesn’t automatically consider interaction terms.

Enter the glmulti package. First let’s get glmulti to replicate the results of the bestglm script.

# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmulti)
# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 1000),c("gender", "time_in_hospital", "number_inpatient", "diag_1", "diag_2", "diag_3", "discharge_id_desc", "readmitted_flag")]

# replicate the analysis done by bestglm
bestBIC = glmulti(readmitted_flag ~ ., data = td, family = binomial, level = 1, crit=bic, fitfunc=glm,
                  method="h", confsetsize = 256, plotty = TRUE, report = TRUE)

print(bestBIC)

The “level” parameter has been set to a value of 1, which means that we are telling glmulti to not consider interaction effects.

[Output: glmulti exhaustive search results]

It can be seen that the results line up with those of bestglm.

To consider 2-way interaction effects, use “level = 2”.

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# allow for 2-way interactions
bestBIC2 = glmulti(readmitted_flag ~ gender + time_in_hospital + num_procedures + number_inpatient,
                   data = td[sample(nrow(td), 10000),], family = binomial, level = 2, crit=bic,
                   fitfunc=glm, method="h", confsetsize = 256, plotty = TRUE, report = TRUE)
print(bestBIC2)

[Output and plot: glmulti exhaustive search with 2-way interactions]

Including 2-way interaction effects increases the run time exponentially. So in this example I have used fewer predictors. Normally you would not do this.

Once we start including 2-way interactions and all of the possible predictor columns, an exhaustive search becomes prohibitively time consuming. In such cases, a better approach is to use genetic algorithms to search through possible models. To do this, change the “method” parameter to “g”.

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# pretend that we only have the target plus these predictors
# just use a subset of the rows, because this is just an example
set.seed(1)
td = td[sample(nrow(td), 10000),c("race", "gender", "time_in_hospital", "num_medications", "number_emergency", "number_inpatient", "discharge_id_desc", "medical_specialty_desc", "readmitted_flag")]

# solve using genetic algorithms
bestBIC.ga = glmulti(readmitted_flag ~ ., data = td, family = binomial, level = 1, crit=bic, fitfunc=glm,
                     method="g", confsetsize = 256, plotty = TRUE, report = TRUE)
print(bestBIC.ga)

[Output: glmulti genetic algorithm search results]

glmulti hides the complexity of the genetic algorithm implementation from you, so it won’t take long for you to get up and running with your own model optimisation using genetic algorithms. Despite this ease of use, this year I have switched from using glmulti to writing my own customised genetic algorithm implementations because:

  • glmulti’s genetic algorithm implementation doesn’t have parallel processing abilities, which would speed up the search (am I being impatient?),
  • I often want to customise the initial population, to allow for knowledge I already have about what constitutes a reasonable starting model (for example, by including the results from the analysis explained in my last blog),
  • the inclusion of interaction terms can often lead to GLM fitting errors, such as collinearity or overspecification, that can crash or freeze up glmulti’s genetic algorithm (and this problem occurs on the diabetes readmission data that I use in this blog),
  • glmulti’s genetic algorithm often has duplicate model choices within its population, so it runs the GLM fitting process more than once for exactly the same choice of columns, which comes at a computational expense; by customising the implementation I can cache previously fitted models and just pull out the results without refitting (see the sketch after this list),
  • glmulti has an undocumented upper limit on how many predictors it will consider, throwing the error message “too many predictors” if that limit is exceeded, and that limit is low (low enough to be triggered for the diabetes dataset example that I am using in this blog), and
  • sometimes I want to retain more information about intermediate models, e.g. AIC, BIC, coefficient values, leverage.
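
To illustrate the caching point, here is a minimal sketch using the memoise package (the same trick used in the Part 4 post above). Here glmScore is assumed to be a scoring function that fits a GLM for a given indicator vector, and population is assumed to be a matrix whose rows are candidate indicator vectors:

library(memoise)

# cache GLM scores so that duplicate indicator vectors in the population
# are looked up rather than refitted
glmScoreCached = memoise(glmScore)

# score a whole population; repeated candidates hit the cache instead of refitting
populationScores = apply(population, 1, glmScoreCached)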

If there are enough requests from readers wanting to know how to create custom genetic algorithm searches for GLM model building, then I will make that the topic of a future blog 🙂

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.

