Biological data are often collected by measuring quantities over space, time, or comparing differences across species. When collecting data in this way, measurements that are close together in space or time, or from species that are closely related (depending on what you’re measuring) will be more similar to each other than other measurements. This phenomenon is known as autocorrelation. Autocorrelated data can be highly informative, and the autocorrelation structure may actually help you learn more about how a system operates, but the data must be treated with greater statistical care than IID data. In statistically independent data, the nominal sample size \(n\) accurately reflects the true amount of information contained within the data (i.e., the effective sample size \(n_{\textrm{effective}}\)). This is not the case in autocorrelated data and \(n \neq n_{\textrm{effective}}\). If the discrepancy between \(n\) and \(n_{\textrm{effective}}\) is not accounted for in the analyses, the results can be misrepresentative and should not be trusted. The larger the discrepancy between \(n\) and \(n_{\textrm{effective}}\), the more untrustworthy the results. Seen another way, collecting data in a way that increases the amount of autocorrelation (e.g., closer in time/space or for many closely related species) can actually make our models worse if not accounted for.
To this end, standard linear regression models are of the form
As we have seen previously, the diagonal in the variance-covariance matrix defines the variances. All 1s indicates homogeneity of variances. The off-diagonals define the co-variances and the 0s indicate independence. This standard format does not permit any covariance between the residuals, and violation of this assumption can bias parameter estimates and negatively impact our capacity to learn from our models. Correcting for autocorrelation (spatial, temporal, phylogenetic, etc.) requires modifying the variance-covariance matrix such that the off-diagonals \(\neq\) 0.
In this Practical we will:
R
You are asked to complete the following exercises and submit to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and resulting pdf document. Only knit documents (.pdf, .doc, or .html) will be accepted. Unknit .Rmd files will not be graded.
Spatial autocorrelation can arise when the variation between the values of the datapoints is affected by their spatial distance. The underlying reason for this is typically that many of the drivers of biological patterns such as environmental conditions, topography, ecosystem structure/composition act at large spatial scales, making data that are spatially close more similar than data collected further apart. To explore this we are going to use data from:
Cruikshanks, R., Lauridsen, R., Harrison, A., Hartl, M. G., Kelly-Quinn, M., Giller, P. S., & O’Halloran, J. (2006). Evaluation of the use of the Sodium Dominance Index as a potential measure of acid sensitivity. WATERAC Final Report. EPA and COFORD, Ireland.
And follow a workflow based partially on that described by:
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer Science & Business Media.
In Cruikshanks et al.’s technical report, the authors looked at different ways of identifying the acid-sensitivity of rivers. Methods at the time relied on pH, but pH is highly variable within catchments as it is sensitive to local flow conditions, geology, etc. They proposed the Sodium Dominance Index (SDI) as an alternative. SDI is essentially a ratio of the concentration of Na+, divided by the sum of the concentrations of Na+, Ca2+, and Mg2+, expressed as a percentage. The motivation of their work was to demonstrate a need for increasing the amount of forest cover in Irish landscapes. Variables in this dataset include:
ID
: The ID of the datapointRivername
: The name of the river where the data were
collectedEasting
: The easting of the sample siteNorthing
: The northing of the sample siteAltitude
: The altitude of the sample siteForested
: A factor indicating whether the site was
forested or not (1 = forested 2= not forested)Date
: The date of data collectionDate2
: The date of data collection transformedFieldpH
: The field pHpH
: The pH of the waterTemperature
: The water temperatureSDI
: Sodium Dominance Index expressed as a
percentagenlme
package.Forested
is a factor).SDI
on the X and
pH
on the Y. Do you expect there to be a significant
relationship? – 1 point(s)gls
function. (Fit the model via ML not REML). – 1
point(s)library(nlme)
data <- read.csv("https://noonanm.github.io/Biol520C/Practical_Datasets/SDI2003.csv")
Spatial autocorrelation can be difficult to see in a simple residuals vs. fitted plot because they are not designed for this purpose.`Bubble plots’ are an easy tool to quickly assess the residuals for potential autocorrelation, but can be hard to read and not very formal. We will cover two tools for assessing your data for spatial autocorrelation: Moran’s I and variograms.
Moran’s I is a correlation coefficient that measures the overall spatial autocorrelation of a data set (think of it as \(\sim\) weighted covariance):
\(N\) is the number of spatial units indexed by \(i\) and \(j\); \(x\) is the variable of interest and \(\bar{x}\) is the mean of \(x\); \(w_{ij}\) is a matrix of spatial weights and \(W\) is the sum of all \(w_{ij}\).
Values of Moran’s I usually range from -1 to +1. There are Many
R
packages for calculating Moran’s I but today we will work
with the implementation in the ape
package.
ape
and fields
packages.rdist
function from the fields
package to generate a matrix of distances between the coordinates that
will be used as the weights. – 0.5 point(s)Moran.I
function from the ape
package. – 1 point(s)Moran’s I can be a useful tool for identifying the presence of autocorrelation and is quite popular. The challenge is how to act on this information? In other words, it lets you know if you have a problem or not, but doesn’t help in finding an appropriate spatial correlation structure to act as a solution. Semi-variograms (or just variograms) are spatial data’s equivalent of the ACF facilitate visual assessment of spatial autocorrelations in the data. Semi-variance is a measure of the degree of similarity between pairs of points separated by a specific distance \(h\). Plot of semi-variance vs. separation distance is called a variogram. Usefully, the different spatial correlation models all have differently shaped theoretical variograms. In other words the shape of a dataset’s empirical variogram can provide clues on which spatial correlation model is most appropriate. Let’s see what that looks like for the SDI data.
Variogram
function from the nlme
package to calculate the semi-variogram of the linear model. Be sure to
include a nugget
effect, and remember to specify the
spatial coordinates via the form
argument (see
help(Variogram.lme)
). – 1 point(s)As biologists, we often find ourselves trying to understand drivers of patterns across species. If we can identify such patterns, it allows us to make general statements about the way that life on earth operates, and the results can be highly informative. These types of analyses can be used to test evolutionary hypotheses and have a long history in evolutionary biology. Indeed, Charles Darwin used differences and similarities between species as a major source of evidence in ‘The Origin of Species’. When we’re carrying out these types of analyses, we’re interested in between species comparisons. As we saw in this week’s lecture, however, the problem here is that species are part of a hierarchically structured phylogeny, and cannot be considered to be statistically independent. This type of non-independence can be accounted for by incorporating information on the species’ phylogenetic relationships.
We will explore how that works in practice on a dataset of rockfish traits. Rockfish (Sebastes spp) are a diverse group of fishes that live primarily in the north Pacific Ocean. Rockfish range from the inter-tidal zone to depths of \(\sim\) 3,000 m. The are usually benthic, living in and around various substrates, often rock outcrops. Some rockfish species are extremely long-lived for their size, and are amongst the longest-living fish on earth, with a maximum reported age of 205 years.
We are going to use data on 56 species gathered from a mix of sources and obtained from here
Variables in this dataset include:
species
: The species ID of the datapoint
(Genus.species)maxlength.cm
: The species’ maximum body size in cmmaxage.y
: The maximum recorded age of the species (in
years)maxdepth.m
: The maximum recorded depth the species
lives ar (in m)We will also use a is a consensus tree of the 56 species based on 7 mitochondrial and 2 nuclear genes, based on:
Hyde, J. R., & Vetter, R. D. (2007). The origin, evolution, and diversification of rockfishes of the genus Sebastes (Cuvier). Molecular phylogenetics and evolution, 44(2), 790-811.
We will use these data to explore the correlation between lifespan, body size, and depth, and we’ll begin with a naive analysis that ignores phylogenetic relationships, so that we have a baseline for comparison.
gls()
via ML
not REML. – 1 point(s)data <- read.csv("https://noonanm.github.io/Biol520C/Practical_Datasets/rockfish.csv")
The first step in a phylogentically controlled regression is
importing and inspecting the phylogentic tree, and generating some basic
visual diagnostics. The phylogenetic trees we will be working with get
imported as a special type of list of the class “phylo”, defined by the
developers of the ape
package. There are multiple slots in
this list that contain some of the information stored in the phylo
object. this includes:
tip.label
(the names of the taxa)edge.length
(the lengths of all tree
branches)edge
(the identity of branches from node, to
node)read.tree()
function from the ape
package.summary()
and plot()
commands on
the tree to get a feel for what you’re working with. – 1 point(s)tip.color
argument. – 1 point(s)tree <- ape::read.tree("https://noonanm.github.io/Biol520C/Practical_Datasets/rockfish.phy.txt")
With a phylogeny in hand we can now correct our original model for
phylogenetic dependence using the correlation structures provided by the
ape
package.