We started this course with simple linear regression and we saw how increasing the number of parameters can soak up variance and improve a model’s explanatory power, how we can use mixed effects models to account for hierarchical data structures, how we can modify the variance structure to account for heteroskedasticity, and how we can modify the correlation matrix to correct for temporal, spatial, or phylogenetic autocorrelation. All of these improvements took place in a linear regression framework, with models of the general form:
We can get a lot of mileage out of this formulation, but the problem with this structure is that the range of the Gaussian distribution is \(-\infty, \infty\). This means that if we set up our problem this way our models tell us that our residuals should be normally distributed and that our response can, theoretically, take any value between \(-\infty, \infty\). For many datasets, this assumption is perfectly appropriate. But what if it’s not? Does this make sense for response variables that can only take positive values? Or for discrete outcomes? Or if we’re working with proportions that are bound between 0 and 1?
There are three steps you can take if you think that Gaussian distributed residuals is not a reasonable assumption for your data:
You are asked to complete the following exercises and submit to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and resulting pdf document. Only knit documents (.pdf, .doc, or .html) will be accepted. Unknit .Rmd files will not be graded.
``In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.’’ — Wikipedia
In 1972, Nelder & Wedderburn (1972) worked out a generalisation of the linear regression model that extended the models we’ve been working with so far to any member of the family of exponential distributions (Gaussian, Poisson, binomial, negative binomial, gamma, etc.). In particular, they showed how all of these distributions can be expressed by the general formulation:
This means that a single set of equations can be used when modelling random variables drawn from any of the distributions from the exponential family. Now that we have a general expression for the stochastic component of our model, we just need to find a way to `link’ the expectation value of our model with the expectation value of the distribution. To do this we need to carry out 3 steps when fitting GLMs:
Although it can be challenging to work with these complex
distributions, the glm()
function in R
streamlines this process for us. Note: This is not the only function for
fitting GLMs, but it’s a good place to start.
In this Practical we will:
As biologists we often find ourselves with count data. Count data usually range between 0 and \(\infty\). They’re also usually discrete integers because we don’t count fractions of things (unless those things were very unlucky…). The Gaussian distribution is continuous and has support between \(-\infty, \infty\), so we can already tell it’s probably not a good option. If we model count data using a linear regression model, the stochastic part of our model will be misspecified, so there is a good chance that our model’s predictive power will be low. This means that if we’re working with count data, we’re probably going to need to build our models in a GLM framework. To explore this concept we are going to use data from:
These data are openly available as part of the faraway
package.
In Johnson and Raven’s paper, the authors used regression analyses to explore the relationship between an island’s area and the number of plant species found on the island. They measured species abundance and area for 30 Galapagos islands. The motivation of their work was to build on the theory of Island Biogeography. Variables in this dataset include:
Species
: The number of plant species found on the
islandEndemics
: The number of endemic speciesArea
: The area of the island (km\(^2\))Elevation
: The highest elevation of the island (m)Nearest
: The distance from the nearest island (km)Scruz
: The distance from Santa Cruz island (km)Adjacent
: The area of the adjacent island (km\(^2\))One of the simples places to start when fitting a model to a datset is with a simple linear regression model. This can work fine in many cases, so let’s see if we can get away with this here.
faraway
and nlme
package and import the gala
dataset. – 0.5 point(s)Area
data and store this as a new
variable in the same dataset called log.area
. – 0.5
point(s)Area
on
the X and Species
on the Y. Do you expect there to be a
significant relationship? Does a Gaussian linear model look appropriate
for these data? – 1 point(s)Species
from log10
transformed Area
using the gls
function. – 0.5
point(s)Area
on
the X and Species
on the Y. Use the abline()
function to plot the fitted model. – 0.5 point(s)predict()
function on this model to predict the
most likely outcomes for the fitted values. Do these numbers seem
plausible? – 1 point(s)The data look non-linear, so if we knew nothing about GLMs, we could try to fit a second order polynomial model to correct for the disconnect between the fitted model and the data.
We could keep trying ad hoc corrections to improve the model (e.g., maybe log-scaling the species abundance data to smooth out some of the non-linearity), but these are all going to be half measures because the real issue here is that we are using the wrong distribution to model these data. Because we’re working with counts of species, what we’re looking for is a discrete distribution with support between 0 and \(\infty\). Given these requirements, the Poisson distribution is a good candidate for modelling these data.
In order to model these data using a Poisson distribution to describe the model’s stochastic component, we need to switch to a GLM workflow. To do this we need to carry out 3 steps before fitting our GLM:
We just said that we think working with a Poisson distribution is a good place to start, so after step 1 we get:
The second step of a GLM is to specify deterministic part:
The last step is to link \(\eta\) and \(\mu_i\). Because our species abundance data can only be positive, we can’t use an identity link. Instead, we use a log-link to ensure the fitted values are always positive:
This whole process is streamlined via the glm()
function, which we will now use to try and improve our model
Species
from
log10 transformed Area
using the glm
function.
Be sure to specify the distribution you want to model from and the link
function. – 1 point(s)Area
on
the X and Species
on the Y. – 1 point(s)abline()
to plot the fitted model. What happened? –
1 point(s)
In a GLM framework we have a link function that lies between \(\eta\) and the response variable. If we simply plot the fitted model on a linear scale without factoring in the link function, we can not place our fitted model on the appropriate scale. We need to work around this when visualising GLMs.
New_Data
that is comprised
of a single column called log.area
that is made up of a
sequence of 200 numbers between -2 to 4. – 1 point(s)predict()
function
to generate predictions from the fitted GLM model. You will need to
correctly specify some of the arguments of this function, so set
newdata = New_Data
, type = "link"
, and
se = TRUE
. Run class()
and
names()
on the output of predict()
. – 1
point(s)fit
slot of this list contains the model’s
deterministic prediction. The fit.se
slot contains the
standard errors of the predictions. Use the fit
slot to
plot the fitted GLM overlayed on the original data. Remember \(\mu_i = e^{\eta}\), so you need to
transform the predictions before plotting them. – 2 point(s)fit.se
slot to plot the 95% confidence bands.
Note: the 95% CIs = \(\mu_i \pm SE \times
1.96\). – 1 point(s)
Remember that the potential problem with Poisson GLMs is overdispersion. Overdispersion means that the variance is larger than the mean (when we compare what we would expect from a Poisson distribution).
1-pchisq(residual deviance, degrees of freedom on the residual deviance)
– 1 point(s)Switching from a Gaussian distribution to a Poisson distribution is often a good fix for modelling count data, but it’s not always the most appropriate distribution for count data. One of the primary reasons why a Poisson won’t work very well on count data is over-dispersion (because the variance is tied to the mean and therefore less flexible). The negative binomial distribution is often a viable alternative to the Poisson distribution as it allows for more heterogeneity because variance \(\neq\) mean.
MASS
package.Species
from
log10 transformed Area
using the glm.nb()
function from the MASS
package. Be sure to specify the link
function. – 1 point(s)Johnson M.P. & Raven, P.H. (1973) Species number and endemism: The Galapagos Archipelago revisited Science, 179, 893-895.
Nelder, J.A. & Wedderburn, R.W. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135, 370–384.
More information on plotting GLMs can be found here