Tags

, , ,

In my last blog in this series, I described some R packages that can automatically choose the best GLM formula for you. Some people wanted more on this topic. In particular, Asko Kauppinen asked about quasi-complete separation, “empty cells in multidimensional space”. So in this blog, I will show you how to customise the automation process to deal with this issue. Once again, we are using the diabetes readmission data from the UCI Machine Repository. You can download my slightly altered version of the data that I am using from here.

Sometimes R gives you a warning when separation occurs, for example when running the following code.


# the working folder for this batch job
folderPath = "C:\\Users\\Colin\\Documents\\IntelliM\\"

# read the training data
td = read.csv(paste(folderPath, "training.csv", sep=""))

# fit a GLM that has separation
diabetes.glm.1 = glm(readmitted_flag ~ admission_type_desc + discharge_id_desc, data = td, family = binomial(link=logit))


After fitting the model, R shows the following warning:

output 1 - cropped

Note that this is a warning, and not an error. That’s because it may be perfectly valid for a fitted probability of 0 or 1 to occur. In the diabetes readmission data, it is reasonable to expect that no patients having a discharge description of “expired” (i.e. the patient died in hospital) will ever be readmitted to hospital again!

But, to simplify the example, let’s assume that expired patients can be readmitted, and therefore that we don’t want fitted probabilities of zero, and that medical treatment usually has some benefit and therefore that we don’t want fitted probabilities of one.

We could catch the warning thrown by R, and tell our model building algorithm that we don’t want to use these types of models. You could achieve this via R script that looks something like this:


# use a null model if this model has separation
diabetes.glm.2 = tryCatch (
 glm(readmitted_flag ~ admission_type_desc + discharge_id_desc, data = td, family = binomial(link=logit))
 ,
 warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) }
 )
diabetes.glm.2

The tryCatch function notes when a warning is thrown, and then runs the alternative null model. The result can be seen to be the null model:

output 2

The null model will be the worst performing GLM, so this model will not be selected by the algorithm as being the “best” model.

This simple solution looks too good to be true, and that’s because it is too good to be true. Consider the following R script and output:


# use a null model if this model has separation
diabetes.glm.3 = tryCatch (
 glm(readmitted_flag ~ discharge_id_desc, data = td, family = binomial(link=logit))
 ,
 warning = function(w) { glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit)) }
)
diabetes.glm.3

output 3

This time, R didn’t throw a warning, and, without informing the user, R has predicted the existence of zombies!

-zombie

So we can’t rely on R to consistently identify when separation occurs. We have to code for it ourselves…


# use a null model if this model has separation
# look for counts of zero
if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == 0) > 0)
{
diabetes.glm.4 = glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit))
print("separation with zeroes!")
} else
{
# look for counts equal to number of observations
if (sum(xtabs(readmitted_flag ~ discharge_id_desc, data = td) == table(td$discharge_id_desc)) > 0)
{
diabetes.glm.4 = glm(readmitted_flag ~ 1, data = td, family = binomial(link=logit))
print("separation with zeroes!")
} else
{
diabetes.glm.4 = glm(readmitted_flag ~ discharge_id_desc, data = td, family = binomial(link=logit))
}
}
diabetes.glm.4

output 4

This time we have been successful at manually identifying separation. A null model has been created. But this script only works for models with a single predictor. To work with models that use more than one predictor, we need to improve our approach. Since the script is getting more complex, I will refactor it into a function that takes a formula and a data frame, and returns a GLM. That GLM will be the null model when separation is found.

# expand a dot formula to show all of the terms
expandDotFormula = function(f, dat)
{
 fAdj = f
 target = gsub("()", "", f[2], fixed=TRUE)
 
 # allow for dot-style formula
 if (gsub("()", "", f[3], fixed=TRUE) == ".")
 {
 nm = names(dat)
 nm = nm[nm != target]
 fAdj = formula(paste(target, " ~ ", paste(nm, collapse=" + "), sep=""))
 }
 
 return (fAdj)
}

# test whether a specific term in a formula will cause separation
termHasSeparation = function(target, predictor, dat)
{
 cols = unlist(strsplit(predictor, ":"))
 isFactor = sapply(cols, function(x) class(dat[,x]) == "character" || class(dat[,x]) == "factor")
 if (sum(isFactor) == 0)
 return (FALSE)
 fff = formula(paste(target, " ~ ", paste(cols[isFactor], collapse=" + "), sep = ""))
 # check whether there are cells in which there are all zeroes
 if (sum(xtabs(fff, data = dat) == 0) > 0)
 return (TRUE)
 # check whether there are cells in which there are all ones
 if (sum(xtabs(fff, data = dat) == table(dat[,cols[isFactor]])) > 0)
 return (TRUE)
 # check whether there are cells with no exposure
 nPossibleCells = prod(sapply(cols[isFactor], function(x) length(unique(dat[,x]))))
 nActualCells = prod(dim(xtabs(fff, data = dat)))
 if (nActualCells < nPossibleCells)
 return (TRUE)
 
 return (FALSE)
}

# test whether a formula will cause separation
fHasSeparation = function(f, dat)
{
 ff = expandDotFormula(f, dat)
 tt = terms(ff)
 ttf = attr(tt, "factors")
 
 # ttf has columns that represent individual terms of the formula
 fTerms = colnames(ttf)
 # check each term whether it exhibits separation
 reject = any(sapply(fTerms, function(x) termHasSeparation(target, x, dat)))
 
 return (reject)
}

# this function returns a NULL model if the formula causes separation
noSeparationGLM = function(f, data)
{
 if (fHasSeparation(f, data))
 {
 target = gsub("()", "", f[2], fixed=TRUE)
 f = formula(paste(target, " ~ 1", sep=""))
 return (glm(f, data = data, family = binomial(link=logit)))
 } else
 {
 return (glm(f, data = data, family = binomial(link=logit)))
 }
}

# test some formulae
fHasSeparation(readmitted_flag ~ ., td)
fHasSeparation(readmitted_flag ~ race, td)
fHasSeparation(readmitted_flag ~ gender, td)
fHasSeparation(readmitted_flag ~ admission_type_desc + discharge_id_desc, td)
fHasSeparation(readmitted_flag ~ admission_type_desc : discharge_id_desc, td)
fHasSeparation(readmitted_flag ~ weight_num : discharge_id_desc, td)
fHasSeparation(readmitted_flag ~ diag_1, td)

# test some GLMs
noSeparationGLM(readmitted_flag ~ ., td)
noSeparationGLM(readmitted_flag ~ race + weight_num, td)

output 5

output 6

The fHasSeparation function now allows us to flag when a formula has terms that will cause separation, and the noSeparationGLM function returns a null model if you try to fit a model that has separation. We can use the noSeparationGLM function directly in glmulti to force it to reject GLMs that have separation. Sample code is shown below:


# fit a GLM allowing that possible models will be rejected due to separation
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmulti)

glms = glmulti(readmitted_flag ~ ., data = td[1:1000,1:15], level = 1, method="h", fitfunc = noSeparationGLM)

# show the result - note the absence of columns that cause separation

summary(glms)$bestmodel

You should also check which columns and pairs of columns will cause separation. Some of these might be valid e.g. the “expired” patients in the diabetes readmission data, and it can also point out to you where you may need to do some dimensionality reduction. The R script to do this is:


# list the columns that cause separation
names(td[,-1])[sapply(names(td[,-1]), function(x) termHasSeparation("readmitted_flag", x, td))]

# list the pairs of columns that cause separation
colPairs = expand.grid(names(td[,-1]), names(td[,-1]))
colPairs = colPairs[apply(colPairs, 1, function(x) x[1] != x[2]),]
combinations = apply(colPairs, 1, function(x) paste(x, collapse=":"))
combinations[sapply(combinations, function(x) termHasSeparation("readmitted_flag", x, td))]

There are 18 predictor columns that have problems with separation, and 2,024 paired interactions of predictor columns that have problems with separation.

output 7

.

Would you prefer to apply these process improvements without the effort of writing R scripts? Consider using a visual user interface, such as IntelliM.