The linear regression models we’ve been working with so far assume that the residuals are Independent and Identically Distributed (IID). More specifically, they should be normally distributed with a constant variance \(\sigma^{2}\): \(\varepsilon_{i} \sim \mathcal{N}(0, \sigma^{2})\).
Biological data are often very noisy and, in real systems, variances can differ across groups, times, etc. The IID assumption is therefore likely to be broken in many real datasets. Modifying the variance-covariance matrix can help correct for issues related to autocorrelation and/or variances that differ systematically.
In this Practical we will explore temporal autocorrelation in residuals, fit models with compound symmetric and AR-1 correlation structures using the nlme package, and use simulation to explore the consequences of unmodelled heteroskedasticity.
Anything that causes some data points in a dataset to be more similar to each other than others can result in statistically detectable autocorrelation. Temporal autocorrelation arises because events that occur close together in time tend to be more similar than events that occur further apart. This pattern can be found in many different situations, not only in biology but in any field where data are measured over time. As biologists we often find ourselves working in systems where the best (or at least the most pragmatic) way to understand what is going on is to collect data over time.
To this end, standard linear regression models are of the form

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\ \sigma^{2}
\begin{pmatrix}
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix}\right)
\]
The diagonal in the variance-covariance matrix defines the variances. All 1s indicate homogeneity of variances. The off-diagonals define the co-variances and the 0s indicate independence. This standard format does not permit any covariance between the residuals, and violation of this assumption can bias parameter estimates and negatively impact our capacity to learn from our models. Correcting for autocorrelation ‘simply’ involves identifying the autocorrelation structure of the residuals and modifying the variance-covariance matrix such that the off-diagonals \(\neq 0\).
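Purely for illustration, with three observations a modified variance-covariance matrix might take the form

\[
\sigma^{2}
\begin{pmatrix}
1 & \rho_{12} & \rho_{13}\\
\rho_{12} & 1 & \rho_{23}\\
\rho_{13} & \rho_{23} & 1
\end{pmatrix}
\]

where each \(\rho_{ij}\) describes the correlation between residuals \(i\) and \(j\). The correlation structures introduced below simply constrain the \(\rho_{ij}\) in different ways.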
We are going to use data from:
Freitas, C., E. M. Olsen, H. Knutsen, J. Albretsen, and E. Moland. 2016. Temperature-associated habitat selection in a cold-water marine fish. Journal of Animal Ecology 85:628–637.
In this paper, the authors looked at the effect of ocean temperature on how deep individually tagged Atlantic cod (Gadus morhua) dove along the Norwegian coast. Variables in this dataset include:
fish: The ID of the fish
date: The date the data were collected
depth_mean_day: The mean dive depth during the day
depth_mean_night: The mean dive depth at night
temperature: The surface water temperature
temperature_centered: The surface water temperature, centred

Fit a model of dive depth as a function of temperature (with a random intercept and slope grouped by fish). For this model, log transform the dive depth data. Use the lme() function from the nlme package.
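A minimal sketch of what this fit could look like, assuming the data are in a data frame called cod and that we model the daytime depths (the data frame name, object names, and choice of response are illustrative assumptions, not part of the original instructions):

library(nlme)

# log dive depth as a function of centred temperature, with a random
# intercept and slope for temperature grouped by fish
m_base <- lme(log(depth_mean_day) ~ temperature_centered,
              random = ~ temperature_centered | fish,
              data = cod)

summary(m_base)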
The simplest case is when all the co-variances take a constant, non-zero value (referred to as ‘compound symmetry’). With compound symmetric error structures the degree of correlation between residuals (\(\rho\)) is equal to \(\rho=\frac{\theta}{\theta + \sigma^2}\).
Compound symmetric errors can be applied via the corCompSymm() function from the nlme package. Fit the same model as above with lme(), but this time add a compound symmetric correlation structure.
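One way this might look, reusing the illustrative cod data frame and model from the sketch above (grouping the correlation by fish is an assumption here, not something stated in the text):

# same fixed and random effects, but now with compound symmetric
# correlation among the residuals within each fish
m_cs <- lme(log(depth_mean_day) ~ temperature_centered,
            random = ~ temperature_centered | fish,
            correlation = corCompSymm(form = ~ 1 | fish),
            data = cod)

summary(m_cs)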
The first-order auto-regressive (AR-1) structure defines a correlation structure in which the degree of correlation between two observations declines with the amount of time elapsed between them. With AR-1 error structures the degree of correlation between a pair of residuals is defined as \(\rho^{\vert t-s\vert}\), where \(\vert t-s\vert\) is the absolute difference between the current time (\(t\)) and the previous time (\(s\)).
AR-1 errors can be applied via the corAR1() function from the nlme package. Fit the same model as above with lme(), but this time add an AR-1 correlation structure.
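Again a minimal sketch under the same illustrative assumptions as above; note that corAR1() with this form assumes the observations within each fish are already ordered in time and roughly equally spaced:

# AR-1 correlation among the residuals within each fish
m_ar1 <- lme(log(depth_mean_day) ~ temperature_centered,
             random = ~ temperature_centered | fish,
             correlation = corAR1(form = ~ 1 | fish),
             data = cod)

summary(m_ar1)

# rough comparison of the candidate correlation structures
# (all three sketches share the same fixed effects)
AIC(m_base, m_cs, m_ar1)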
Note: The following two exercises are bonus problems worth a total of 0.5% on your final grade. They involve the use of nested for loops, which are detailed on the R help page.
This week we learned that when the IID assumption is violated by unmodelled heteroskedasticity, variances, and hence standard errors, can be poorly estimated, resulting in misleading p-values. When this is the case, the deterministic part of the model may be correctly specified, but any inference is likely to be unreliable. Note the emphasis on unmodelled: unless there are extreme issues with the data, it is by ignoring key features of the data that we run into problems, not simply because heteroskedasticity is present. Let’s explore that concept using a simulation experiment.
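As a loose illustration of the idea (not the bonus exercise itself; the sample sizes, variances, and object names below are made up), one could simulate data in which the true group effect is zero but the error variance differs between groups, fit an ordinary linear model that ignores the difference, and check how often the p-value falls below 0.05:

set.seed(42)

n_sims <- 1000
p_values <- numeric(n_sims)

# unbalanced groups with very different error variances; the true group
# effect is zero, so roughly 5% of p-values should fall below 0.05 if the
# standard errors were well estimated
group <- rep(c("A", "B"), times = c(80, 20))

for (i in 1:n_sims) {
  y <- rnorm(length(group), mean = 0, sd = ifelse(group == "A", 1, 5))
  fit <- lm(y ~ group)
  p_values[i] <- summary(fit)$coefficients["groupB", "Pr(>|t|)"]
}

# observed false-positive rate at the 5% level
mean(p_values < 0.05)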
These R functions can help make this exercise easier: aggregate() and do.call().
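For example (a purely illustrative toy snippet, not part of the graded exercise), do.call() can stack a list of per-iteration results into a single data frame, and aggregate() can then summarise those results by a grouping variable:

# a list of one-row data frames, as might be built up inside a simulation loop
results <- list(
  data.frame(sim = 1, error_sd = 1, p = 0.03),
  data.frame(sim = 2, error_sd = 5, p = 0.40)
)
results <- do.call(rbind, results)

# rejection rate at the 5% level for each simulated error variance
aggregate(p ~ error_sd, data = results, FUN = function(p) mean(p < 0.05))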