Last practical we saw how generalised linear models (GLMs) can give you more flexibility when modelling count data that do not easily fit into the standard Gaussian framework. We learned that to fit a GLM we need to carry out 3 steps beyond those required by conventional regressional analyses:
Because R
functions such as glm()
streamline the process of fitting GLMs, the key step that’s left in your
hands is knowing when you will need to switch from a Gaussian model to a
GLM, and how to identify the correct distribution for your dataset. In
this practical we are going to learn how to extend what we covered last
practical about modelling count data to two other common data types in
biology:
Applying GLMs to these data is also commonly referred to as ‘logistic regression’. The term ‘logistic regression’ comes from the fact that the link function we use fits a logistic curve to the relationship between \(x\) and \(y\). The three step process we covered last practical is identical for these data types, we just need to familiarise ourselves with a new set of distributions, and a new link function.
Logistic regression is a method for fitting a regression curve, \(y = f(x)\), when \(y\) consists of proportions, probabilities, or binary coded (0,1–failure,success) data (i.e., anything bound between 0 and 1). Like linear regression, logistic regression makes a number of key assumptions:
The first step of fitting a GLM is to make a distributional assumption on our 0,1 or proportion data. A good candidate for data that scale between 0 and 1 is the binomial distribution. The binomial distribution describes the probability of obtaining \(k\) yes/no successes in a sample of size \(n\), or in other words, the distribution of the number of successful trials among a defined number of trials. The Probability Mass Function (PMF) of the binomial distribution is given by:
The second step is to specify the deterministic model. This is no different from fitting other regressional models.
The last step is to specify a link function that maps the values between 0 and 1. The ‘logit’ link is a link function that maps values between \(0, 1\) and the most routinely used link function for modelling \(0, 1\) data.
\(Y_i \sim Binomial(1, \pi_i) \quad \quad E(Y_i) = \pi_i \quad \mathrm{and} \quad \mathrm{var}(Y_i) = \pi_i \times(1-\pi_i)\)
\(\pi_i = \frac{e^{\beta_0 + \beta_1 X + \ldots}} {1 + e^{\beta_0 + \beta_1 X + \ldots}}\)Although it can be challenging to work with these complex
distributions and link functions, the glm()
function in
R
streamlines this process for us. Again, this is not the
only function for fitting GLMs, but it’s a good place to start.
In this Practical we will:
You are asked to complete the following exercises and submit to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and resulting pdf document. Only knit documents (.pdf, .doc, or .html) will be accepted. Unknit .Rmd files will not be graded.
As biologists we often find ourselves with discrete data with values that be either 0 or 1. The Gaussian distribution is continuous and has support between \(-\infty, \infty\), so it’s clearly not a good option when have this kind of data. If we model these data using a linear regression model, the stochastic part of our model will be misspecified, so our model’s predictive power will be low, the residuals will almost always look terrible and, typically, no amount of transformations will help us. We need a deterministic function that maps the values between 0 and 1, and a distribution that makes more sense. In other words, if we’re working with discrete, 0,1 data, we’re probably going to need to build our models in a logistic regression framework. To explore this concept we are going to use data from:
These data are openly available here.
In Ozgul et al’s paper, the authors explored factors influencing the prevalence of upper respiratory tract disease (URTD) caused by Mycoplasma agassizii in gopher tortoises (Gopherus polyphemus). Variables in this dataset include:
TortID
: Tortoise identification numberDate
: Date of capture and samplingYEAR
: Year of capture and samplingCL
: Carapace length in millimetersSex
: Sex of tortoise (M: Male, F: Female, Juv:
Juvenile)ELISA
: Result of ELISA test (POSITIVE, negative)SITE
: Study site ID (all data from CF site)status
: Result of ELISA test (1: M. agassizii
Positive, 0: Negative)age
: Equivalent to carapace lengthThe first thing to do when working with a new dataset is to carry exploratory data visualisation. Visualising binary data can be challenging, however, and many standard data visualisation tools will not result in readily interpretable figures. Here we will compare traditional scatter plots and box plots with two new approaches.
gophertortoise
dataset. Make sure that sex
is imported as a factor.CL
). Add a regression line to this as a visual aid.
Do you expect there to be a relationship? Do you expect it to be strong?
– 1 point(s)cdplot()
function to create a conditional density plot of
infection status as a function of carapace length (CL
). If
this more or less clear than the scatter plot? Note: For plotting
purposes, status
will have to be a factor here. – 0.5
point(s)After getting a feel for the data, the next step is to fit a logistic regression model and perform model selection to identify the best fit model for the data.
status
from
CL
, Sex
, and the interaction between these
using the glm
function. Be sure to specify the correct
distribution and link function for these data. – 1 point(s)CVbinary()
function from
the DAAG
package to cross-validate the model. – 1
point(s)In a GLM framework we have a link function that lies between the fitted model and the response variable. As we saw last practical, if we simply plot the fitted model on a linear scale without factoring in the link function, we can not place our fitted model on the appropriate scale. We need to work around this when visualising GLMs. We also have the complication here that we have two parameters to visualise, carapace length, and sex.
New_Data
that is comprised
of two columns. The first should be called CL
that is made
up of a sequence of numbers between the minimum and maximum values of
CL
by steps of 0.1, the second should be called
Sex
, and should be a column filled with the factor
"F"
. – 1 point(s)predict()
function to
generate predictions from the fitted GLM model. You will need to
correctly specify some of the arguments of this function, so set
newdata = New_Data
, and type = "response"
. – 1
point(s)As many as ~15% of papers in ecology include some kind of proportional data. Proportions scale between 0-1 (or 0 and 100 for percentages), but they can also take any value between these limits. Most of the time, ecologists model proportion data using an \(\arcsin(\sqrt{p})\) transformation, but this is not an ideal solution (if you’re interested in knowing why this is the case, I encourage you to read this paper)
Fitting a logistic regression to proportion data is very similar to fitting a logistic regression to 0,1 data. To learn how to do this we will use data collected as part of a series of laboratory experiments on the density- and size-dependent predation rate of an African reed frog, Hyperolius spinigularis, and used in the following publication:
These data are openly available as part of the emdbook
package. Variables in this dataset include:
density
: initial tadpole density (number of tadpoles in
a 1.2 x 0.8 x 0.4 m tank) [experiment 1]pred
: factor: predators present or absent [experiment
1]size
: factor: big or small tadpoles [experiment 1]surv
: number survivingpropsurv
: proportion surviving (=surv/density)
[experiment 1]emdbook
package and import the
ReedfrogPred
dataset.propsurv
from size
, pred
, and the interaction between
these using the glm
function. – 1 point(s)Ozgul, A., Oli, M.K., Bolker, B.M. and Perez-Heydrich, C. (2009), Upper respiratory tract disease, force of infection, and effects on survival of gopher tortoises. Ecological Applications, 19: 786-798.
Vonesh and Bolker (2005) Compensatory larval responses shift trade-offs associated with predator-induced hatching plasticity. Ecology 86:1580-1591