The linear regression models we’ve been working with so far assume that the residuals are Independent and Identically Distributed (IID). More specifically, they should be normally distributed with a constant variance \(\sigma^{2}\): \(\varepsilon_{i} \sim \mathcal{N}(0, \sigma^{2})\).
Biological data are often very noisy and, in real systems, variances can differ across groups, times, etc. The IID assumption is therefore likely to be broken in many real datasets. Modifying the variance-covariance matrix can help correct for issues related to autocorrelation and/or variances that differ systematically.
In this Practical we will explore temporal autocorrelation in residuals, fit models with compound symmetric and AR-1 correlation structures using the nlme package, and use simulation to explore the consequences of unmodelled heteroskedasticity.
Anything that causes some data points in a dataset to be more similar to each other than others can result in statistically detectable autocorrelation. Temporal autocorrelation arises because events that occur close together in time tend to be more similar than events that occur further apart. This pattern can be found in many different situations, not only in biology but in any field where data are measured over time. As biologists we often find ourselves working in systems where the best (or at least the most pragmatic) way to understand what is going on is to collect data over time.
To this end, standard linear regression models are of the form

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\ \sigma^{2}
\begin{pmatrix}
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix}\right)
\]
The diagonal in the variance-covariance matrix defines the variances. All 1s indicate homogeneity of variances. The off-diagonals define the co-variances and the 0s indicate independence. This standard format does not permit any covariance between the residuals, and violation of this assumption can bias parameter estimates and negatively impact our capacity to learn from our models. Correcting for autocorrelation ‘simply’ involves identifying the autocorrelation structure of the residuals and modifying the variance-covariance matrix such that the off-diagonals \(\neq 0\).
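Purely for illustration, with three observations a modified variance-covariance matrix might take the form

\[
\sigma^{2}
\begin{pmatrix}
1 & \rho_{12} & \rho_{13}\\
\rho_{12} & 1 & \rho_{23}\\
\rho_{13} & \rho_{23} & 1
\end{pmatrix}
\]

where each \(\rho_{ij}\) describes the correlation between residuals \(i\) and \(j\). The correlation structures introduced below simply constrain the \(\rho_{ij}\) in different ways.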
We are going to use data from:
Freitas, C., E. M. Olsen, H. Knutsen, J. Albretsen, and E. Moland. 2016. Temperature-associated habitat selection in a cold-water marine fish. Journal of Animal Ecology 85:628–637.
In this paper, the authors looked at the effect of ocean temperature on how deep individually tagged Atlantic cod (Gadus morhua) dove along the Norwegian coast. Variables in this dataset include:
fish: The ID of the fish
date: The date the data were collected
depth_mean_day: The mean dive depth during the day
depth_mean_night: The mean dive depth at night
temperature: The surface water temperature
temperature_centered: The surface water temperature, centred

Fit a model of dive depth as a function of temperature (with a random intercept and slope grouped by fish). For this model, log transform the dive depth data. Use the lme() function from the nlme package.
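A minimal sketch of what this fit could look like, assuming the data are in a data frame called cod and that we model the daytime depths (the data frame name, object names, and choice of response are illustrative assumptions, not part of the original instructions):

library(nlme)

# log dive depth as a function of centred temperature, with a random
# intercept and slope for temperature grouped by fish
m_base <- lme(log(depth_mean_day) ~ temperature_centered,
              random = ~ temperature_centered | fish,
              data = cod)

summary(m_base)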
The simplest case is when all the co-variances take a constant, non-zero value (referred to as ‘compound symmetry’). With compound symmetric error structures the degree of correlation between residuals (\(\rho\)) is equal to \(\rho=\frac{\theta}{\theta + \sigma^2}\).
Compound symmetric errors can be applied via the corCompSymm() function from the nlme package. Fit the same model as above with lme(), but this time add a compound symmetric correlation structure.
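One way this might look, reusing the illustrative cod data frame and model from the sketch above (grouping the correlation by fish is an assumption here, not something stated in the text):

# same fixed and random effects, but now with compound symmetric
# correlation among the residuals within each fish
m_cs <- lme(log(depth_mean_day) ~ temperature_centered,
            random = ~ temperature_centered | fish,
            correlation = corCompSymm(form = ~ 1 | fish),
            data = cod)

summary(m_cs)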
The first-order auto-regressive (AR-1) structure defines a correlation structure in which the degree of correlation between two observations declines with the amount of time elapsed between them. With AR-1 error structures the degree of correlation between a pair of residuals is defined as \(\rho^{\vert t-s\vert}\), where \(\vert t-s\vert\) is the absolute difference between the current time (\(t\)) and the previous time (\(s\)).
AR-1 errors can be applied via the corAR1() function from the nlme package. Fit the same model as above with lme(), but this time add an AR-1 correlation structure.
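Again a minimal sketch under the same illustrative assumptions as above; note that corAR1() with this form assumes the observations within each fish are already ordered in time and roughly equally spaced:

# AR-1 correlation among the residuals within each fish
m_ar1 <- lme(log(depth_mean_day) ~ temperature_centered,
             random = ~ temperature_centered | fish,
             correlation = corAR1(form = ~ 1 | fish),
             data = cod)

summary(m_ar1)

# rough comparison of the candidate correlation structures
# (all three sketches share the same fixed effects)
AIC(m_base, m_cs, m_ar1)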
Note: The following two exercises are bonus problems worth a total of 0.5% on your final grade. They involve the use of nested for loops, which are detailed on the R help page.
This week we learned that when the IID assumption is violated by unmodelled heteroskedasticity, variances, and hence standard errors, can be poorly estimated, resulting in misleading p-values. When this is the case, the deterministic part of the model may be correctly specified, but any inference is likely to be unreliable. Note the emphasis on unmodelled: unless there are extreme issues with the data, it is by ignoring key features of the data that we run into problems, not simply because heteroskedasticity is present. Let’s explore that concept using a simulation experiment.
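As a loose illustration of the idea (not the bonus exercise itself; the sample sizes, variances, and object names below are made up), one could simulate data in which the true group effect is zero but the error variance differs between groups, fit an ordinary linear model that ignores the difference, and check how often the p-value falls below 0.05:

set.seed(42)

n_sims <- 1000
p_values <- numeric(n_sims)

# unbalanced groups with very different error variances; the true group
# effect is zero, so roughly 5% of p-values should fall below 0.05 if the
# standard errors were well estimated
group <- rep(c("A", "B"), times = c(80, 20))

for (i in 1:n_sims) {
  y <- rnorm(length(group), mean = 0, sd = ifelse(group == "A", 1, 5))
  fit <- lm(y ~ group)
  p_values[i] <- summary(fit)$coefficients["groupB", "Pr(>|t|)"]
}

# observed false-positive rate at the 5% level
mean(p_values < 0.05)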
These R functions can help make this exercise easier: aggregate() and do.call().
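For example (a purely illustrative toy snippet, not part of the graded exercise), do.call() can stack a list of per-iteration results into a single data frame, and aggregate() can then summarise those results by a grouping variable:

# a list of one-row data frames, as might be built up inside a simulation loop
results <- list(
  data.frame(sim = 1, error_sd = 1, p = 0.03),
  data.frame(sim = 2, error_sd = 5, p = 0.40)
)
results <- do.call(rbind, results)

# rejection rate at the 5% level for each simulated error variance
aggregate(p ~ error_sd, data = results, FUN = function(p) mean(p < 0.05))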