In my previous blog in this series, I discussed how you can write custom R functions that can recognise separation and missing cells in multivariate data. This week I discuss how you can customise the process that searches for the best combination of predictor columns.

While the bestglm and glmulti packages are great for getting you started in automation, I found that I wanted more. In particular I wanted:

- automation of data with more predictor columns than glmulti allows,
- parallel processing to satisfy my impatience, and
- customisation of the model selection, to allow for issues such as separation and collinearity.

So I created by own functions as alternatives.

**Restructuring the Problem**

Before we implement a search algorithm, we need to restructure the GLM model building problem so that it becomes an optimisation problem. We do this by creating a vector of binary values that flags which columns are to be included in the GLM model formula.

1 | 0 | 1 | 0 |

Consider the binary vector above. Since the first value equals 1, the first predictor column is included in the model formula. Since the third slot in the vector equals 1, then the third predictor column is included in the model formula. However the second and fourth predictor columns are not included, because the indicator variables in those slots have a value of 0. We can code this in R:

# a function to create GLM formulae from an indicator vector createFormula = function(indicators) { # do the special case of no predictor columns, so use NULL model if (sum(indicators) == 0) return (formula(paste(target, " ~ 1", sep=""))) # the more common case return (formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + ")))) }

So the optimisation problem is to find the binary vector that creates the optimal GLM formula, where “optimal” relates to the quality of the GLM that you created (allowing for whatever measure you choose to you). To keep this example simple, I will use the BIC score of the GLM, but that does not mean that I am recommending that you should use BIC. I just means that this blog is only about automation, and there isn’t space here to discuss how to measure what the “best” GLM measure would be.

# libraries if (!require("pacman")) install.packages("pacman") pacman::p_load(memoise) # the function to be optimised # the function to be optimised glmScore = function(indicators) { # return a NULL model if any errors or warnings occur glm.mod = tryCatch ( glm(createFormula(indicators), data = td, family = binomial(link=logit)), error = function(e) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit)), warning = function(w) glm(createFormula(indicators * 0), data = td, family = binomial(link=logit)) ) # if there are problems with NA parameters (e.g. via separation), then use a NULL model if (any(is.na(coef(glm.mod)))) glm.mod = glm(createFormula(indicators * 0), data = td, family = binomial(link=logit)) print(glm.mod$formula) return(BIC(glm.mod)) } glmScr = memoise(glmScore)

As you can see, you can customise how a particular GLM is scored, to allow for your preferences about issues such as separation or collinearity.

The memoise package is another excellent utility that I have recently discovered. It remembers the return value for a function for a given parameter value, and instead of wasting time re-evaluating the function the next time that parameter value is used, it returns the cached return value for that particular parameter input. This can save a lot of processing time when a function is numerically demanding (e.g. fitting a GLM) and called repeatedly using the same parameter values (e.g. when iteratively searching through candidate formulae).

**Imputing Missing Values**

Since we are going to be comparing scores from different GLM models, we need to ensure that those models and scores are comparable. Missing values can be an issue here. When there are missing values in some predictors, R drops out the rows with the missing values, and only fits the GLM to the rows with no missing values in the predictors. The model scores are based upon the fit to this reduce data set, and therefore have lower scores. But R does not warn the user that it is doing this!

In the diabetes data used in this series of blogs, there are missing values. In particular, more than half of the values for the patients’ weight are missing. A optimisation based on BIC would choose a model that included weight_num as a predictor because the small number of data rows used results in a much lower score for goodness of fit.

To overcome the problem of missing values, we must either remove all rows containing any missing values, or impute values where they are missing. R provides a few ways to do this. For example, you can impute values using the mice package.

# the working folder for this batch job folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\"; # read the training data td = read.csv(paste(folderPath, "training.csv", sep="")) # fix the column headers nm = names(td) names(td) = gsub(".", "_", nm, fixed=TRUE) # clean up the missing values if (!require("pacman")) install.packages("pacman") pacman::p_load(mice) cleaned = mice(td)

In this blog I will take the dubious but quick approach of dropping out any rows that contain missing values in the numeric fields.

# the working folder for this batch job folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\"; # read the training data td = read.csv(paste(folderPath, "training.csv", sep="")) # fix the column headers nm = names(td) names(td) = gsub(".", "_", nm, fixed=TRUE) # remove all rows containing missing values td = td[! is.na(td$weight_num),]

For the diabetes readmission data this reduces the data from 101,000 rows to 3,000 rows, which is hardly a suitable solution. A better solution would have been to simply drop the weight_num column from the data.

**Customised Step-wise Model Building**

While step-wise model building is not the best way to put together a GLM, it is the quickest way to get to a reasonable model. I use the step-wise approach whenever I am feeling very, very impatient!

But instead of using an out-of-the-box version of step-wise model building, I wrote my own function. It begins by using a GBM to measure variable importance, and to create a reasonable starting model. Then it uses a version of the step-wise approach that has been enhanced to allow for one predictor to be swapped out for another predictor. Usually step-wise implementations are greedy algorithms, i.e. they only look one step ahead. But swapping one column for another involves two steps – removing one column, then adding another column. A greedy algorithm won’t see past the first step of removing a column, and therefore won’t see the full benefit of the swap, and won’t choose that path. It gets trapped in a local minima. This enhancement overcomes the problem by treating column swaps as if they involved just a single step. Here is my script:

# function to show a formula as text f2text = function(f) { paste(as.character(f)[c(2,1,3)], collapse = " ") } # a vector of names of predictor columns target = "readmitted_flag" predictors = names(td) predictors = predictors[predictors != target] # reset the cache on the glmScore function forget(glmScr) # create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model indicators = matrix(0, length(predictors), 1) bestScore = glmScr(indicators) bestParameters = indicators searching = TRUE while (searching) { # alternate formulae, with just one column added, or one column removed alternates = sapply(seq_len(length(indicators)), function(x) {temp = indicators; temp[x] = 1 - temp[x]; return(temp) }) # alternate formulae, replacing one column for another existing = seq_len(length(indicators))[indicators == 1] for (i in existing) { for (j in seq_len(length(indicators))) { if (i != j && indicators[j] == 0) { temp = indicators temp[i] = 0 temp[j] = 1 alternates = cbind(alternates, temp) } } } # evaluate all of the alternatives scores = apply(alternates, 2, glmScr) if (min(scores) < bestScore) { bestScore = min(scores) bestParameters = alternates[,which.min(scores)] print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": ")) indicators = bestParameters } else { searching = FALSE } }

When you’re impatient, like me, you want things to run as fast as possible. From a speed perspective it makes sense to start with a null model, and add predictors, rather than starting from an overspecified model and removing predictors. You also want parallel processing. Rather than processing one GLM at a time, I want to simultaneously process as many GLMs as my PC will allow! R supports parallel processing, but Windows users have fewer parallel processing options in R than Linux users. Nevertheless, the snowfall package gives me exactly what I want for this particular task.

# libraries if (!require("pacman")) install.packages("pacman") pacman::p_load(snowfall) sfInit( parallel=TRUE, cpus=3 ) if( sfParallel() ) { cat( "Running in parallel mode on", sfCpus(), "nodes.\n" ) } else { cat( "Running in sequential mode.\n" ) } # a vector of names of predictor columns target = "readmitted_flag" predictors = names(td) predictors = predictors[predictors != target] # reset the cache on the glmScore function forget(glmScr) # create an indicator vector - in this case we will use a vector of zeroes to start with a NULL model indicators = matrix(0, length(predictors), 1) bestScore = glmScr(indicators) bestParameters = indicators searching = TRUE while (searching) { # alternate formulae, with just one column added, or one column removed alternates = sapply(seq_len(length(indicators)), function(x) {temp = indicators; temp[x] = 1 - temp[x]; return(temp) }) # alternate formulae, replacing one column for another existing = seq_len(length(indicators))[indicators == 1] for (i in existing) { for (j in seq_len(length(indicators))) { if (i != j && indicators[j] == 0) { temp = indicators temp[i] = 0 temp[j] = 1 alternates = cbind(alternates, temp) } } } sfExportAll() # evaluate all of the alternatives #scores = apply(alternates, 2, glmScr) scores = unlist(sfClusterApplyLB(seq_len(ncol(alternates)), function(x) glmScr(alternates[,x]))) if (min(scores) < bestScore) { bestScore = min(scores) bestParameters = alternates[,which.min(scores)] print(paste(round(bestScore), f2text(createFormula(bestParameters)), sep = ": ")) indicators = bestParameters } else { searching = FALSE } } sfStop()

**Evolving Your GLM**

Step-wise model building works OK when there aren’t too many predictors, and when I don’t want any interaction terms. But once there are many interaction terms, step-wise model building has to work through too many permutations. The problem becomes too highly dimensional for grid based searches.

We need a new approach. We need our GLMs to evolve into intelligent models.

Let’s consider the case where we want two-way interaction terms. Once again we use an indicator vector to define the GLM formula, but this time we need to include a vector cell for each and every combination of individual predictors.

The first step is to create a list of all of the individual predictors, and each two-way combination of predictors.

# a vector of names of predictor columns target = "readmitted_flag" predictors = names(td) predictors = predictors[predictors != target] indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors))) indices = indices[indices[,1] != indices[,2],] twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":")) predictors = c(predictors, twoWays) length(predictors)

Once we include two-way interaction terms, there are 2025 terms that could be included in the GLM formula! A thorough search of all possible formulae would require evaluating 2^2025 GLMs!!!

R has a few packages that implement genetic algorithms. Today I will use the GA package to optimise the indicator vector. Note that genetic algorithms try to maximise a fitness function, whereas we are trying to minimise a score. So we will need to define a fitness function that is -1 times the GLM score.

# libraries if (!require("pacman")) install.packages("pacman") pacman::p_load(GA) # define a glm fitness function glmFitness = function(x) return(-1 * glmScr(x)) ga.mod = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), popSize = 100, parallel = TRUE, run = 5)

From the output we can see that our first attempt at a genetic algorithm didn’t work very well. That’s because most formulae contain 2-way interactions, and most 2-way interactions contain missing cells, and therefore these formulae have been rejected. To fix this we need to bias the initial population towards formulae that do not contain 2-way terms, and towards formulae than contain very few terms.

# custom function to create an initial population initPopn = function (object, ...) { population = matrix(0, nrow = object@popSize, ncol = object@nBits) n = ncol(td) - 1 probability = min(1, 2 / n) ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n) population[,seq_len(n)] = ones return(population) } set.seed(1) ga.mod.2 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 100, parallel = TRUE, run = 5)

This time we had more success. But we need to tune the hyperparameters, using a bigger population size and slower stopping criteria, in order to get better results. I also added a custom function to show me which GLM model formula is the best performing at the end of each epoch. The data preparation and support functions have also been updated to make the process more robust, and I changed the initial population to ensure more genetic diversity.

# libraries if (!require("pacman")) install.packages("pacman") pacman::p_load(memoise, caret, GA) # function to show a formula as text f2text = function(f) { if (class(f) != "formula") stop("f is not a formula!") return(paste(as.character(f)[c(2,1,3)], sep = "", collapse = " ")) } # the working folder for this batch job folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\"; # read the training data td = read.csv(paste(folderPath, "training.csv", sep="")) # remove all rows containing missing values td = td[! is.na(td$weight_num),] # remove low variance columns td = td[,-nearZeroVar(td)] # remove the diagnosis columns - just to make the example work faster td = td[,-c(11, 12, 13)] # fix the column headers nm = names(td) names(td) = gsub(".", "_", nm, fixed=TRUE) # a vector of names of predictor columns target = "readmitted_flag" predictors = names(td) predictors = predictors[predictors != target] indices = expand.grid(seq_len(length(predictors)), seq_len(length(predictors))) indices = indices[indices[,1] < indices[,2],] twoWays = apply(indices, 1, function(x) paste(predictors[x], collapse = ":")) predictors = c(predictors, twoWays) length(predictors) # a function to create GLM formulae from an indicator vector createFormula = function(indicators) { if (is.null(indicators)) stop("createFormula: indicators is NULL!") if (any(is.na(indicators))) stop("createFormula: NA found in indicators!") # do the special case of no predictor columns, so use NULL model if (sum(indicators) == 0) return (formula(paste(target, " ~ 1", sep=""))) # the more common case result = formula(paste(target, " ~ ", paste(predictors[indicators == 1], collapse = " + "))) return (result) } # the function to be optimised glmScore = function(indicators) { if (is.null(indicators)) stop("glmScore: indicators is NULL!") if (any(is.na(indicators))) { if (colin_verbose) print("glmScore: NA found in indicators!") } # return a NULL model if any errors or warnings occur fff = createFormula(indicators) useMod = FALSE glm.mod = NULL tryCatch ( { glm.mod = glm(fff, data = td, family = binomial(link=logit)); useMod = TRUE; } , error = function(e) { if (colin_verbose) { print(f2text(fff)); print(e); } }, warning = function(w) { if (colin_verbose) { print(f2text(fff)); print(w); } } ) if (! useMod) { return (9e99) } # if there are problems with NA parameters (e.g. via separation), then use a NULL model if (any(is.na(coef(glm.mod)))) { if (colin_verbose) print("NAs in coefficients") return (9e99) } result = BIC(glm.mod) return(result) } # create a version of GLM score that remembers previous results glmScr = memoise(glmScore) # define a glm fitness function glmFitness = function(x) return(-1 * glmScr(x)) # custom function to create an initial population initPopn = function (object, ...) { population = matrix(0, nrow = object@popSize, ncol = object@nBits) n = ncol(td) - 1 probability = min(1, 2 / n) ones = matrix(rbinom(n * object@popSize, 1, probability), object@popSize, n) population[,seq_len(n)] = ones # make the first n rows be the nth predictor n2 = min(n, object@popSize) for (i in seq_len(n2)) { population[i,] = 0 population[i, i] = 1 } return(population) } # custom function to monitor the genetic algorithm glmMonitor = function (object, digits = getOption("digits"), ...) { fitness <- na.exclude(object@fitness) i = which.max(fitness[! is.na(object@fitness)]) p = object@population[! is.na(object@fitness),] x = p[i,] cat(paste("Iter =", object@iter, " | Median =", format(-1 * median(fitness), digits = 0), " | Best =", format(-1 * max(fitness), digits = 0), f2text(createFormula(x)), "\n")) } # run a genetic algorithm to find the "best" GLM with interaction terms forget(glmScr) colin_verbose = FALSE ga.mod.3 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 500, parallel = TRUE, run = 25, monitor = glmMonitor, seed = 1)

The “best” GLM has evolved to include 2-way interactions. Since it doesn’t take too long to run, it is helpful to rerun the genetic algorithm using other random seeds, to check that we have converged on the best score.

ga.mod.4 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 500, parallel = TRUE, run = 25, monitor = glmMonitor, seed = 2) ga.mod.5 = ga(type = "binary", fitness = glmFitness, nBits = length(predictors), population = initPopn, popSize = 500, parallel = TRUE, run = 25, monitor = glmMonitor, seed = 3)

Comparing the results of the tree genetic algorithm runs, we find that the best scoring model is:

readmitted_flag ~ number_emergency + number_inpatient + discharge_id_desc + age_num:num_lab_procedures + number_emergency:number_inpatient

This model scored 2191. This model also makes intuitive sense. The following data is predictive of the probability of another hospital visit in the next 30 days:

- the number of emergency visits to hospital over the previous 12 months
- the number of hospital visits over the previous 12 months
- the health of the patient when they were discharged (including whether they died in hospital!)
- the relationship between the patient’s age and the number of treatments that they receive
- the relationship between the number of emergency visits versus the total number of visits to the hospital over the past 12 months

.

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.