In the last few blogs, I discussed how you can automate the process of choosing which predictor columns to use in your model. This week I will meander away from that path to discuss how feature engineering can be automated.
The “L” in GLM stands for “Linear”, and that gives you a huge hint about one of the key assumptions: a GLM assumes that the relationship between each predictor and the outcome is linear, or more pedantically, that the relationship is linear after the link function has been applied. For example, in my blog about Denoising Dirty Documents, I showed a graph demonstrating a linear relationship between the pixel brightness of the input images and the pixel brightness of the target images.
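To make that assumption concrete, here is a minimal sketch using simulated data (the numbers are purely illustrative): the outcome probability is non-linear in the predictor on the original scale, but linear on the logit scale, so a GLM with a logit link recovers the true coefficients.

# simulated example: linear on the logit scale (illustrative values only)
set.seed(42)
x = runif(5000, -2, 2)
p = plogis(-1 + 0.8 * x)    # plogis is the inverse logit
y = rbinom(5000, 1, p)
model.sim = glm(y ~ x, family = binomial(link = logit))
coef(model.sim)             # roughly (-1, 0.8)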
But while linear relationships do occur naturally, I find that I frequently have to transform the input data to create a linear relationship. This is especially so for monetary values, where a logarithmic transformation often helps. Until recently I used a manual process, guessing at and testing different transformations; automating the process has saved me a lot of time.
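For example, here is a hedged sketch with simulated monetary data (the variable names and values are mine, not from the diabetes data): the relationship looks curved on the raw scale but roughly linear after a log transformation.

# simulated monetary predictor: skewed, with a multiplicative relationship
set.seed(1)
income = exp(rnorm(200, mean = 10, sd = 1))
outcome = 2 + 0.5 * log(income) + rnorm(200, sd = 0.2)
par(mfrow = c(1, 2))
plot(income, outcome, main = "raw scale")        # curved
plot(log(income), outcome, main = "log scale")   # roughly linear
par(mfrow = c(1, 1))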
Once again, we are using the diabetes readmission data from the UCI Machine Learning Repository. You can download my slightly altered version of the data that I am using from here.
The key tool that we are going to use to linearise the relationships is a generalized additive model (GAM). GAMs are an extension of generalized linear models that allow for non-linear relationships using non-parametric smoothing functions. It’s a bit like using rolling averages, but much more sophisticated.
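As a quick illustration of what a GAM smoother does, here is a minimal sketch on simulated data (not the diabetes data; mgcv ships with R, and s() is its spline smoother):

# a GAM smoother recovers a non-linear shape that a straight line would miss
library(mgcv)
set.seed(2)
x = runif(300)
y = sin(2 * pi * x) + rnorm(300, sd = 0.3)
fit.sim = gam(y ~ s(x))
plot(fit.sim)   # the fitted smooth follows the sine shape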
The diabetes readmission data has a few numeric columns, but today we will just consider the column containing the number of inpatient visits to the hospital over the previous 12 months. Let’s start by doing a one-way analysis of the relationship with the rate of hospital readmission.
# libraries
if (!require("pacman")) install.packages("pacman")
pacman::p_load(data.table)

# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Dropbox\\IntelliM\\";

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# fix the column headers
nm = names(td)
names(td) = gsub(".", "_", nm, fixed=TRUE)

# summarise by number_inpatient
dt <- data.table(td)
summary = dt[,list(mean=mean(readmitted_flag)),by=number_inpatient]
plot(summary$number_inpatient, summary$mean)

# fit a straight line
model.lm = lm(mean ~ number_inpatient, data = summary)
summary$straight_line = coef(model.lm)[1] + summary$number_inpatient * coef(model.lm)[2]
plot(summary$number_inpatient, summary$mean)
lines(summary$number_inpatient, summary$straight_line, col = "red")
The one-way plot shows that the relationship is non-linear: there is curvature, and the relationship appears to be concave.
When we fit a GLM to this data, we use a logit link function, because the target has a Bernoulli distribution. So instead of linearising the relationship shown in the one-way analysis above, we should linearise the relationship after the logit link function has been applied. We can see that post-link relationship by fitting a GAM.
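For reference, the logit link maps probabilities onto the whole real line; here is a quick sketch (qlogis() and plogis() are the built-in equivalents):

# the logit link and its inverse
logit = function(p) log(p / (1 - p))
logit(0.25)          # -1.0986, the same as qlogis(0.25)
plogis(logit(0.25))  # recovers 0.25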
R has two packages for building GAMs: gam and mgcv. I use the mgcv package because it includes automatic regularisation, and because it returns a table of results from which I can fit a linearising function. The mgcv package supports both spline and tensor product smoothing functions; in the sample below I have specified a tensor product smoother by applying the te() syntax to the numeric predictor. Note that you will need mgcv version 1.8.6 or higher for this R script to work, because earlier versions of plot.gam do not return the plotting data.
# fit a GAM
pacman::p_load(mgcv, minpack.lm)

# check the version of mgcv - must be at least 1.8.6
# (earlier versions of plot.gam do not return the plotting data)
if (packageVersion("mgcv") < "1.8.6") {
  stop("You need to install a newer version of the mgcv package")
}

model.gam = gam(readmitted_flag ~ te(number_inpatient), data = td,
                family = binomial(link = logit))

# get a table of partial relativities
pd = plot.gam(model.gam, residuals = FALSE, n = 200, select = NULL)
if (is.null(pd)) {
  stop("You need to install a newer version of the mgcv package")
}
The GAM plot shows strong curvature, and also shows a confidence interval. Since the confidence interval becomes very broad for high numbers of inpatient visits (due to the small number of patients with that many visits), we should be very suspicious of any apparent curvature in that range.
The pd variable contains the values that were used to create the plot shown above. We will fit linearising functions to the data held in pd.
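To see what we are working with, the snippet below (run after the code above) inspects the three fields of pd that we will use: the evaluation grid x, the fitted smooth fit, and its standard error se.

# inspect the plotting data returned by plot.gam
str(pd[[1]][c("x", "fit", "se")])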
Today we will only consider a power transformation and a truncated version of the power transformation. Power transformations are commonly used as linearising functions, and I remember being taught to use them in my generalized linear models course at university. Given the concave nature of this relationship, a log transformation is also worth trying; a sketch of that appears after the plot discussion below.
# create a table of the predicted values and standard errors
pdCol = pd[[1]]
tab = data.frame(pdCol$x, pdCol$fit, pdCol$se)
names(tab) = c("x", "y", "se")
head(tab)
plot(tab$x, tab$y)

# fit a straight line
model.lm.2 = lm(y ~ x, data = tab, weights = 1 / se ^ 2)
tab$linear = coef(model.lm.2)[1] + coef(model.lm.2)[2] * tab$x

# fit a power transformation
initA = tab$y[1]
initB = tab$y[nrow(tab)]
initC = 0
minX = min(tab$x)
maxX = max(tab$x)
model.power = nlsLM(y ~ a + (b - a) / (maxX - minX) * (x - minX) ^ (1 + c),
                    data = tab,
                    weights = 1 / se ^ 2,
                    start = c(a = initA, b = initB, c = initC),
                    trace = FALSE,
                    control = nls.control(maxiter = 1000, tol = 1e-05,
                                          minFactor = 1/2048,
                                          printEval = FALSE, warnOnly = TRUE))
tab$power = predict(model.power)

# fit a truncated power transformation
initA = tab$y[1]
initB = tab$y[nrow(tab)]
initC = 0
tab$x.truncated = tab$x
tab$x.truncated[tab$x.truncated > 12] = 12
minX = min(tab$x.truncated)
maxX = max(tab$x.truncated)
model.power.truncated = nlsLM(y ~ a + (b - a) / (maxX - minX) * (x.truncated - minX) ^ (1 + c),
                              data = tab,
                              weights = 1 / se ^ 2,
                              start = c(a = initA, b = initB, c = initC),
                              trace = FALSE,
                              control = nls.control(maxiter = 1000, tol = 1e-05,
                                                    minFactor = 1/2048,
                                                    printEval = FALSE, warnOnly = TRUE))
tab$power.truncated = predict(model.power.truncated)

# plot the result
ylim = c(min(pdCol$fit - pdCol$se), max(pdCol$fit + pdCol$se))
plot(tab$x, tab$y, ylim = ylim)
polygon(c(pdCol$x, rev(pdCol$x)),
        c(pdCol$fit - pdCol$se, rev(pdCol$fit + pdCol$se)),
        col = "lightgrey", border = NA)
lines(tab$x, tab$y, type = "l", col = "black")
lines(tab$x, tab$linear, type = "l", col = "blue")
lines(tab$x, tab$power, type = "l", col = "red")
lines(tab$x, tab$power.truncated, type = "l", col = "green")
A linear relationship (the blue line) is clearly a poor fit to the data. The power transformation (the red line) does a pretty good job of capturing the curvature of the relationship over most of the range, staying within the confidence interval. It is not clear whether truncation (the green line) is necessary.
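As flagged earlier, a log transformation is also worth trying for a concave relationship. Here is a hedged sketch in the same style as the power fits; log1p() is used because number_inpatient starts at zero, and this parameterisation is my assumption rather than the only sensible choice.

# fit a log linearising function (illustrative parameterisation)
model.log = nlsLM(y ~ a + b * log1p(x),
                  data = tab,
                  weights = 1 / se ^ 2,
                  start = c(a = tab$y[1], b = 1),
                  control = nls.control(maxiter = 1000, warnOnly = TRUE))
tab$logfit = predict(model.log)
lines(tab$x, tab$logfit, type = "l", col = "purple")

Since this form is linear in its parameters, a weighted lm() would fit it equally well; nlsLM() is used here only for consistency with the code above.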
To choose between the linearising functions, one should score the goodness of fit, and choose the least complex function that provides an adequate fit. The definition of “adequate fit” varies according to what you need to use the model for. Remember also that the purpose of linearising functions is not just to fit well, but to capture the shape of the relationship, so I suggest that your goodness of fit score should include some measure of shape, such as the number of runs of the linearising function either side of the GAM smoother.
# choosing between the models based upon the number of runs
pacman::p_load(randtests)

# keep only the data points that lie closest to an actual data point
tab2 = tab[sapply(summary$number_inpatient, function(x) which.min(abs(tab$x - x))), ]

runsTest.linear = runs.test(x = (tab2$y - tab2$linear),
                            alternative = "two.sided", threshold = 0,
                            pvalue = "exact", plot = FALSE)$p.value
runsTest.power = runs.test(x = (tab2$y - tab2$power),
                           alternative = "two.sided", threshold = 0,
                           pvalue = "exact", plot = FALSE)$p.value
runsTest.power.truncated = runs.test(x = (tab2$y - tab2$power.truncated),
                                     alternative = "two.sided", threshold = 0,
                                     pvalue = "exact", plot = FALSE)$p.value
c(runsTest.linear, runsTest.power, runsTest.power.truncated)
To fully automate this R script, you would also need to consider different possible truncation points and choose the best location to truncate. I leave a full implementation to the reader, but a sketch of one possible approach follows.
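Here is one possible sketch (the candidate range and the weighted sum-of-squares score are my assumptions; it reuses tab and nlsLM from the code above):

# try a range of truncation points and keep the best-scoring one
candidates = 3:15   # hypothetical range of truncation points to try
score = sapply(candidates, function(k) {
  d = tab
  d$xt = pmin(d$x, k)   # truncate the predictor at candidate point k
  fit = nlsLM(y ~ a + (b - a) / (max(xt) - min(xt)) * (xt - min(xt)) ^ (1 + c),
              data = d, weights = 1 / se ^ 2,
              start = c(a = d$y[1], b = d$y[nrow(d)], c = 0),
              control = nls.control(maxiter = 1000, warnOnly = TRUE))
  sum((d$y - predict(fit)) ^ 2 / d$se ^ 2)   # weighted residual sum of squares
})
bestTruncation = candidates[which.min(score)]
bestTruncation

A fuller version would also compare the runs test p-values, as in the model selection code above.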
Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.