Generalized Linear Modeling

The purpose of this TP is to practice using R for two types of generalized linear modeling: logistic regression and Poisson regression. We will apply GLM to two problems with data in the HSAUR3 package. To carry out these analyses, first load R and the HSAUR3 package.

As usual, make sure that you read the help for any new functions and data that you use.

Blood plasma data

Load the plasma data and read the help for this data set so that you understand the scientific background and question of interest as well as the variables included in the data. Then start off making a graphical exploration of the data. Here, we will look at conditional density plots of the response variable ESR given each each of the (numerical) explanatory variables fibrinogen and gamma globulin change.

data("plasma", package = "HSAUR3")
layout(matrix(1:2, nrow = 2))
cdplot(ESR ~ fibrinogen, data = plasma)
cdplot(ESR ~ globulin, data = plasma)

The conditional density for each ESR category (smaller or larger than 20) is shown in a different shade, with the numerical value on the right hand vertical scale. Here, we can see that higher levels of each protein are associated with ESR values above 20, since the conditional probability for ESR bigger than 20 in both plots is larger as we move to the right of the plot horizontally (larger values of fibrinogen/gamma globulin).

Now, fit a logistic regression model to the data with the glm function, including only the single variable fibrinogen:

plasma.glm.1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial())

As is the case for the lm function, an intercept term is automatically included in the model. The family argument for glm specifies the distribution of the response - in this case, a binomial distribution. The default link function for the binomial family is logistic.

Apply the summary function to your fitted model to obtain a description of the fitting. Is the coefficient for fibrinogen significant at the 5% level? Interpret the meaning of this coefficient.

You can obtain a 95% confidence interval for the coefficient (make sure to read the help):

confint(plasma.glm.1, parm = "fibrinogen")

Since these values correspond to the log-odds, we can get the odds by exponentiating the estimate (subsetting coef according to the value that we want; here it is for fibrinogen and not the intercept) and confidence interval:

exp(coef(plasma.glm.1)["fibrinogen"])
exp(confint(plasma.glm.1, parm = "fibrinogen"))

We can see that the confidence interval is very wide. Can you think of any reason why this might be?

Now, fit a logistic regression model using both explanatory variables:

plasma.glm.2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())

and output the results using summary. Is the coefficient for gamma globulin significantly different from 0?

You can perform a likelihood ratio (chi-square) test by subtracting the residual deviance of the second (bigger) model from that of the first (smaller) model, then comparing the result to a chi-square distribution with degrees of freedom equal to the difference in degrees of freedom between the two models:

anova(plasma.glm.1, plasma.glm.2, test = "Chisq")

Here, we see that the p-value is large, meaning that we do not reject the null hypothesis that the coefficient for globulin is zero.

Even though we would not include globulin in our final model, for purposes of illustration we will use model 2 to obtain predicted values for each observation, then plot these against the values of both explanatory variables using the symbols to create a bubbleplot. The estimated conditional probability of an ESR value larger than 20 is obtained by:

prob <- predict(plasma.glm.2, type = "response")

and then assign a larger circle to observations with larger probability:

plot(globulin ~ fibrinogen, data = plasma, xlim = c(2,6), ylim = c(25,55), pch = "*")
symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)

This plot shows an increasing probability of ESR bigger than 20 (larger circles) with increasing fibrinogen and, to a lesser extent, with increasing gamma globulin.

Briefly summarize your findings.

Colon polyps data

Load the polyps data and read the help.

You will see that the response variable is a count. We have already used multiple regression to model a numerical response that is continuous, but there are problems with using this approach for count data: a count can only take on positive values, and a count is unlikely to be normally distributed. So here, we will fit a GLM with a log link function, thus ensuring that the fitted values are positive, and with a Poisson error distribution - i.e., Poisson regression:

data("polyps", package = "HSAUR3")
polyps.glm.1 <- glm(number ~ treat + age, data = polyps, family = poisson())

By default, the link function for the Poisson family is the log function.

Look at the results of the fitting (summary). Does there seem to be a larger variance in observed counts than expected from the Poisson assumption? How can you tell?

Now use the quasi-likelihood approach to deal with this over-dispersion, by specifying the quasipoisson family:

polyps.glm.2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson())

then look at the results of the fitting. How do these results compare with those of the first fitting? Can you explain why they are different?

Briefly summarize your results and conclusions: does the drug treatment appear to be effective in reducing the number of polyps? Does the age of the patient have a large effect?