Tags
In my last blog in this series, I described some R packages that can automatically choose the best GLM formula for you. Some people wanted more on this topic. In particular, Asko Kauppinen asked about quasi-complete separation, “empty cells in multidimensional space”. So in this blog, I will show you how to customise the automation process to deal with this issue. Once again, we are using the diabetes readmission data from the UCI Machine Repository. You can download my slightly altered version of the data that I am using from here.
Sometimes R gives you a warning when separation occurs, for example when running the following code.
# the working folder for this batch job folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\" # read the training data td = read.csv(paste(folderPath, "training.csv", sep="")) # fit a GLM that has separation diabetes.glm.1 = glm(readmitted_flag ~ admission_type_desc + discharge_id_desc, data = td, family = binomial(link=logit))
After fitting the model, R shows the following warning:
Note that this is a warning, and not an error. That’s because it may be perfectly valid for a fitted probability of 0 or 1 to occur. In the diabetes readmission data, it is reasonable to expect that no patients having a discharge description of “expired” (i.e. the patient died in hospital) will ever be readmitted to hospital again!
But, to simplify the example, let’s assume that expired patients can be readmitted, and therefore that we don’t want fitted probabilities of zero, and that medical treatment usually has some benefit and therefore that we don’t want fitted probabilities of one.
We could catch the warning thrown by R, and tell our model building algorithm that we don’t want to use these types of models. You could achieve this via R script that looks something like this:
# use a null model if this model has separation diabetes.glm.2 = tryCatch ( glm(readmitted_flag ~ admission_type_desc + discharge_id_desc, data = td, family = binomial(link=logit)) , warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) } ) diabetes.glm.2
The tryCatch function notes when a warning is thrown, and then runs the alternative null model. The result can be seen to be the null model:
The null model will be the worst performing GLM, so this model will not be selected by the algorithm as being the “best” model.
This simple solution looks too good to be true, and that’s because it is too good to be true. Consider the following R script and output:
# use a null model if this model has separation diabetes.glm.3 = tryCatch ( glm(readmitted_flag ~ discharge_id_desc, data = td, family = binomial(link=logit)) , warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) } ) diabetes.glm.3
This time, R didn’t throw a warning, and, without informing the user, R has predicted the existence of zombies!
So we can’t rely on R to consistently identify when separation occurs. We have to code for it ourselves…
# use a null model if this model has separation # look for counts of zero if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == 0) > 0) { diabetes.glm.4 = glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) print("separation with zeroes!") } else { # look for counts equal to number of observations if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == table(td$discharge_id_desc)) > 0) { diabetes.glm.4 = glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) print("separation with zeroes!") } else { diabetes.glm.4 = glm(readmitted_flag ~ discharge_id_desc, data = td, family = binomial(link=logit)) } } diabetes.glm.4
This time we have been successful at manually identifying separation. A null model has been created. But this script only works for models with a single predictor. To work with models that use more than one predictor, we need to improve our approach. Since the script is getting more complex, I will refactor it into a function that takes a formula and a data frame, and returns a GLM. That GLM will be the null model when separation is found.
# expand a dot formula to show all of the terms expandDotFormula = function(f, dat) { fAdj = f target = gsub("()", "", f[2], fixed=TRUE) # allow for dot-style formula if (gsub("()", "", f[3], fixed=TRUE) == ".") { nm = names(dat) nm = nm[nm != target] fAdj = formula(paste(target, " ~ ", paste(nm, collapse=" + "), sep="")) } return (fAdj) } # test whether a specific term in a formula will cause separation termHasSeparation = function(target, predictor, dat) { cols = unlist(strsplit(predictor, ":")) isFactor = sapply(cols, function(x) class(dat[,x]) == "character" || class(dat[,x]) == "factor") if (sum(isFactor) == 0) return (FALSE) fff = formula(paste(target, " ~ ", paste(cols[isFactor], collapse=" + "), sep = "")) # check whether there are cells in which there are all zeroes if (sum(xtabs(fff, data = dat) == 0) > 0) return (TRUE) # check whether there are cells in which there are all ones if (sum(xtabs(fff, data = dat) == table(dat[,cols[isFactor]])) > 0) return (TRUE) # check whether there are cells with no exposure nPossibleCells = prod(sapply(cols[isFactor], function(x) length(unique(dat[,x])))) nActualCells = prod(dim(xtabs(fff, data = dat))) if (nActualCells < nPossibleCells) return (TRUE) return (FALSE) } # test whether a formula will cause separation fHasSeparation = function(f, dat) { ff = expandDotFormula(f, dat) tt = terms(ff) ttf = attr(tt, "factors") # ttf has columns that represent individual terms of the formula fTerms = colnames(ttf) # check each term whether it exhibits separation reject = any(sapply(fTerms, function(x) termHasSeparation(target, x, dat))) return (reject) } # this function returns a NULL model if the formula causes separation noSeparationGLM = function(f, data) { if (fHasSeparation(f, data)) { target = gsub("()", "", f[2], fixed=TRUE) f = formula(paste(target, " ~ 1", sep="")) return (glm(f, data = data, family = binomial(link=logit))) } else { return (glm(f, data = data, family = binomial(link=logit))) } } # test some formulae fHasSeparation(readmitted_flag ~ ., td) fHasSeparation(readmitted_flag ~ race, td) fHasSeparation(readmitted_flag ~ gender, td) fHasSeparation(readmitted_flag ~ admission_type_desc + discharge_id_desc, td) fHasSeparation(readmitted_flag ~ admission_type_desc : discharge_id_desc, td) fHasSeparation(readmitted_flag ~ weight_num : discharge_id_desc, td) fHasSeparation(readmitted_flag ~ diag_1, td) # test some GLMs noSeparationGLM(readmitted_flag ~ ., td) noSeparationGLM(readmitted_flag ~ race + weight_num, td)
The fHasSeparation function now allows us to flag when a formula has terms that will cause separation, and the noSeparationGLM function returns a null model if you try to fit a model that has separation. We can use the noSeparationGLM function directly in glmulti to force it to reject GLMs that have separation. Sample code is shown below:
# fit a GLM allowing that possible models will be rejected due to separation if (!require("pacman")) install.packages("pacman") pacman::p_load(glmulti) glms = glmulti(readmitted_flag ~ ., data = td[1:1000,1:15], level = 1, method="h", fitfunc = noSeparationGLM) # show the result - note the absence of columns that cause separation summary(glms)$bestmodel
You should also check which columns and pairs of columns will cause separation. Some of these might be valid e.g. the “expired” patients in the diabetes readmission data, and it can also point out to you where you may need to do some dimensionality reduction. The R script to do this is:
# list the columns that cause separation names(td[,-1])[sapply(names(td[,-1]), function(x) termHasSeparation("readmitted_flag", x, td))] # list the pairs of columns that cause separation colPairs = expand.grid(names(td[,-1]), names(td[,-1])) colPairs = colPairs[apply(colPairs, 1, function(x) x[1] != x[2]),] combinations = apply(colPairs, 1, function(x) paste(x, collapse=":")) combinations[sapply(combinations, function(x) termHasSeparation("readmitted_flag", x, td))]
There are 18 predictor columns that have problems with separation, and 2,024 paired interactions of predictor columns that have problems with separation.
.
Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.
Pingback: Efficiently Building GLMs: Part 4 | Keeping Up With The Latest Techniques