It was way back in 1972 that John Nelder and Robert Wedderburn first introduced GLMs to the world (you can find their original paper here), and by the 1990s actuaries in the insurance industry (I am an actuary) had started using GLMs for technical pricing, empowered by the increased accessibility of modern powerful computers. In some parts of the world there are now huge teams, consisting of dozens of actuaries, sometimes more than 100 of them, building generalized linear models (GLMs) for technical pricing of insurance policies. But is this efficient or is it as out of date as a dinosaur riding a penny farthing?
At university, when I was doing my masters degree in Applied Statistics, I was taught that the “best” GLM model is the one with the lowest AIC or BIC (the topic of whether AIC and BIC are good model benchmarks will be the topic of a future blog). But the information criterion is not a well behaved function (it can have many local minima and maxima). To find the model with the lowest information criterion, one cannot follow a steepest gradient approach, but instead one must systematically search through the different combinations of predictors.
Consider a scenario in which there are 30 possible predictors, consisting of rating factors collected from the customer (e.g. zip code / post code) and / or data variables collected from other sources (e.g. credit rating of the insured) and all of the these values are either categorical, or are numeric and have a linear relationship to the variable being modelled (whether that be claim frequency, severity or claims cost per policy). In such a situation, there are 2^30=1,073,741,824 possible combinations of predictors that could be included in the GLM formula. If each of your actuarial or statistical staff took 10 minutes to test a particular combination of predictors, and each of those staff worked 8 hours per day, for 50 weeks per year, it would take 11,185 man-years to find the best model! Once you include the search for 2-way interactions between predictors, the search time blows out to longer than the life-span of the universe!!!
In practice, actuaries and statisticians are producing GLM models faster than that estimate because they are using heuristics to substantially reduce the number of combinations to search. Those heuristics are usually based upon variants of step-wise regression, whereby they add or remove one predictor at a time. This is still extremely time consuming, and the process still does not necessarily produce the model with the best AIC or BIC.
How can you make this process more efficient?
Let’s make this more real by considering some publicly available data tracking hospital readmission of diabetes patients in USA from 1999 to 2008. You can find the data in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Improvement 1: Starting With a Reasonable Choice of Predictors
One can improve upon the step-wise regression approach by starting with a model that already has a few of the most useful predictors.
Instead of beginning with a GLM that uses all the predictors and removes them one-by-one, or starting with no predictors and adding more predictors one-by-one, you can start by understanding which predictors are likely to be the best candidates, and this involves taking a step beyond GLMs into some more modern techniques, including:
For the sake of simplicity, this blog will only apply just one of these approaches, the variable importance measure based upon a gradient boosting machine. But any of these three approaches will usually achieve the task.
A gradient boosting machine (GBM) is a very flexible type of machine learning algorithm. It can be used for regression and classification problems. You can use GBMs via the GBM package available in R. GBMs are a forest of trees whereby each successive tree is fitted to the residuals of the previous iteration of the forest i.e. each new tree predicts the errors from the existing forest. The GBM package has a measure of “relative influence” that is quite similar to a variable importance measure, and can be used for the same purpose.
Variable importance or relative influence is a measure of how much of the variation in outcomes is explained by the inclusion of the predictor in the model. A predictor will explain more of the variation in outcomes if:
- it is statistically significant i.e. the difference isn’t random,
- the difference is large for different values of the predictor i.e. the predictor differentiates well, and
- there is considerable variation in the predictor value between observations i.e. more than just a small number of predictor observations are different to the average.
Here is some sample R code to give an indication of how one would do this with the diabetes readmission data:
# libraries if (!require("pacman")) install.packages("pacman") pacman::p_load(gbm) # the working folder for this batch job folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"; # read the training data td = read.csv(paste(folderPath, "training.csv", sep="")) # GBM variable importance set.seed(1) gbm_imp = gbm(formula = readmitted_flag ~ ., distribution = "bernoulli", data=td, n.trees = 1000, interaction.depth = 1, verbose=TRUE, shrinkage = 0.01, cv.folds=0, keep.data = F) s = summary(gbm_imp) head(s)
The next step is to read the diabetes readmission data into R. I am reading the data from a comma delimited file that I have created previously after downloading the UCI Machine Learning Repository data. You should edit the sample R script to use the folder and file name of your data.
Finally I fitted a GBM model. For the purposes of this blog I set the random seed, to make the results replicable. Note that the model hyperparameters were not optimised – I am just creating a model for the sake of understanding which predictors are important, not trying to fit the best possible GBM model. But sometimes the interaction.depth hyperparameter does matter. In my script above I have used interaction.depth = 1, which excludes the possibility of 2-way interaction effects between two predictors. I chose a value of 1 for simplicity for this example, and because most of the time it doesn’t make a difference to the discovery of the most important predictors. However, if you have strong reason to believe that your data exhibits strong 2-way effects between predictors, and also have reason to believe that the one-way effect of those same predictors is weak, then try higher values for this hyperparameter.
The summary function gets the variable importance measures from the GBM model. It displays a plot of the most important predictors, and also stores them in a table. As you can see from the screenshot above, R shows that just 5 predictors will give most of the predictive power. Note that the variable importance scores have been automatically scaled to add to 100.
The default plot for GBM variable importance is rather difficult to read and interpret. If I am working on a project that is purely within R, then I usually script for a better looking plot based upon the ggplot2 package in R. Lately I have been using IntelliM to build my GLMs because it automates the feature extraction, model building and validation process for GLMs, and it does it visually instead of manually writing R scripts. It gives an easier to read graph, shown below.
Later in this series of blogs, I will discuss dimensionality reduction, but since one approach to dimensionality reduction relates to what I have just shown in this blog, I will give you a sneak peak.
The diabetes readmission data contains categorical predictors that contain codes for diagnoses and other data of interest. Some of these predictors have several hundred different possible codes. Unless we had many millions of observations in total, there is no way that all of those codes are going to have enough observations to provide statistically valid predictions. What we expect is that just a small number of these codes are important.
So we can apply the same variable importance approach to the codes within a categorical predictor, and then group the unimportant codes together, making for a better model.