In this Practical you will:

- Fit multiple linear regression models in R
- Check for correlations between predictor variables
- Compare R-squared and adjusted R-squared as measures of model performance
- Simulate and diagnose heteroscedasticity
You are asked to complete the following exercises and submit to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and the resulting document. Only knitted documents (.pdf, .doc, or .html) will be accepted; unknitted .Rmd files will not be graded.
In most biological systems there are multiple variables that influence outcomes. To account for this, we fit multiple linear regression models to our data. Multiple linear regression models typically take the form:
\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_n x_{n,i} + \varepsilon_i \]

where \(y_i\) represents our response variable (indexed by \(i\)), \(\beta_0 \dots \beta_n\) represent the regression coefficients that we are trying to estimate, \(x_1 \dots x_n\) represent our predictor variables, and \(\varepsilon_i\) our error term (indexed by \(i\)). Assuming this model structure is appropriate for our system, the challenge is accurately estimating the regression coefficients from our data.
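In R, a model of this form can be fit with the `lm()` function. Here is a minimal sketch using simulated data (the variable names and coefficient values are arbitrary, not from this practical):

```r
# Hypothetical example: two predictors, one response
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(50)  # true betas: 1, 2, -0.5
fit <- lm(y ~ x1 + x2, data = df)                # estimates beta_0, beta_1, beta_2
summary(fit)                                     # coefficients and standard errors
```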
In this practical we will explore multiple linear regression using data that were collected by Carrascal and Palomino to explore the influence of island characteristics on the abundance of terrestrial bird species in Selvagem and the Canary Islands. Carrascal and Palomino compiled data from the peer-reviewed literature to explore whether factors such as an island's size, distance to the mainland, diversity of habitats, etc. correlated with terrestrial bird species diversity. The dataset includes a number of different measures, but here we will be focusing on: i) `bird_species`, the number of terrestrial bird species found on each island; ii) `island_area`, the area of each island (in km\(^2\)); iii) `maximum_altitude`, the maximum altitude on each island (in meters); and iv) `habitat_diversity`, a unitless metric based on the Shannon-Weaver index that represents the diversity of different habitat types on each island.
First, import the dataset into R:

```r
data <- read.csv("https://noonanm.github.io/Biol520C/Practical_Datasets/Island_Biogeography.csv")
```

Then work through the following exercises:

- Use the `plot()` and `cor()` functions to check for any correlations between the 3 predictor variables: `island_area`, `maximum_altitude`, and `habitat_diversity` (see the sketch after this list). – 1 point(s)
- Fit a linear regression model between `bird_species` and `island_area`. – 1 point(s)
- Inspect this model's residuals using the `residuals()` function. – 1 point(s)
- Fit a multiple linear regression model including any of `island_area`, `maximum_altitude`, and `habitat_diversity` that you feel should be included based on your previous analyses. What happened to the R-squared? – 1 point(s)
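As a minimal sketch of how the first exercise might begin (assuming the column names described above):

```r
# Pairwise scatterplots and correlations between the three predictors
predictors <- data[, c("island_area", "maximum_altitude", "habitat_diversity")]
plot(predictors)  # visual check for pairwise relationships
cor(predictors)   # Pearson correlation coefficients
```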
The residuals measure how accurately a fitted model predicts the observed data, and they are also used to calculate the coefficient of determination (i.e., R-squared). A model's R-squared value is the proportion of the variance in the response variable that is predictable from the predictor(s). You will often see ecologists say things like "our model explained 87% of the variance in our response variable"; in most cases they are getting this number from the model's R-squared.
- Obtain the R-squared value of your model using `summary()`. – 1 point(s)
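To see where this number comes from, R-squared can also be computed by hand from the residuals. A sketch (the simple model here is just an assumed example):

```r
# R-squared = 1 - SS_residual / SS_total
fit <- lm(bird_species ~ island_area, data = data)
rss <- sum(residuals(fit)^2)                                 # residual sum of squares
tss <- sum((data$bird_species - mean(data$bird_species))^2)  # total sum of squares
1 - rss / tss   # identical to summary(fit)$r.squared
```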
The problem with R-squared values is that they don't take into account the number of parameters in the model. If we keep adding meaningless parameters to a model to soak up some of the residual variation, the R-squared will necessarily increase, which can trick us into thinking we have a better model.
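You can demonstrate this with a predictor that is pure noise (hypothetical; `noise` is not a variable in the dataset):

```r
# R-squared can only increase when a predictor is added, even pure noise
set.seed(42)
data$noise <- rnorm(nrow(data))
summary(lm(bird_species ~ island_area, data = data))$r.squared
summary(lm(bird_species ~ island_area + noise, data = data))$r.squared  # >= the value above
```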
- Add `island_age` to your model. What happened to the R-squared? Is this a meaningful measure of model performance? – 1 point(s)

When you add more parameters to the model, the R-squared converges towards 1. This suggests that we have an increasingly better model, but we know for a fact that it will do this even if the added parameters are just noise. The reason is that the R-squared value doesn't factor in the number of parameters in the model. As a correction, the adjusted R-squared is 'adjusted' based on the number of parameters in the model.
Sometimes data will exhibit "heteroscedasticity", meaning that the residuals (i.e., the variance in the data that is not explained by the model) get larger as the prediction moves from small to large values of x (or from large to small). We can simulate heteroscedastic data with the function below:
```r
# Simulate y = B_0 + B_1 * x with heteroscedastic errors:
# the error variance grows with x^2, so the residual spread fans out
hetero <- function(x) {
  B_0 <- 0                                             # intercept
  B_1 <- 2                                             # slope
  sigma <- x^2                                         # variance increases with x
  eps <- rnorm(length(x), mean = 0, sd = sqrt(sigma))  # one draw per x value
  y <- B_0 + B_1 * x + eps
  y
}
```
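To see the pattern this produces, simulate some data, fit an ordinary linear model, and plot the residuals (a sketch; the particular x values are arbitrary):

```r
# Simulate heteroscedastic data, fit a linear model, and inspect the residuals
set.seed(1)
x <- seq(1, 50, length.out = 200)
y <- hetero(x)
fit <- lm(y ~ x)
summary(fit)$coefficients  # estimates remain close to B_0 = 0 and B_1 = 2
plot(x, residuals(fit))    # the fan shape is the heteroscedasticity
```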
Heteroscedasticity on its own does not inherently create a problem (notice that the parameter estimates in the sketch above are still accurate), but it is often an indication that your model can be improved. Importantly, if your sample size is small and you can't fix the issue, your p-values may be a bit higher or lower than they should be, so a variable that sits right on the border of significance may end up erroneously on the wrong side of that border. Your regression coefficients will still be generally accurate, though. The precision of your predictions will also be off (prediction intervals either too narrow or too wide, depending on where you fall along the x-axis).
The most frequent suggestion is to transform a variable; however, more often than not, heteroscedasticity indicates that your model is missing a component, and transformations are an attempt to mask interesting features rather than incorporate them into your model.
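If you want to incorporate the variance structure rather than mask it, one option (a sketch using the `nlme` package, which this practical does not otherwise cover) is to model the variance directly:

```r
# Model the error variance as a power function of x instead of transforming y
library(nlme)
fit_gls <- gls(y ~ x, weights = varPower(form = ~ x))  # Var(eps_i) proportional to |x_i|^(2*delta)
summary(fit_gls)
```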