Practical 03 - Multiple linear regression and model residuals

Background

In this Practical you will:

Fit a multiple linear regression model to empirical data.
Use simulated data to explore the difference between R-squared and adjusted R-squared
Explore the residuals of models for data with heteroskedastic errors
Explore the residuals of models for data with lots of 0 entries

You are asked to complete the following exercises and submit to Canvas before the deadline. In addition to the points detailed below, 5 points are assigned to the cleanliness of the code and resulting pdf document. Only knit documents (.pdf, .doc, or .html) will be accepted. Unknit .Rmd files will not be graded.

Multiple linear regression model

For most biological systems there are typically multiple variables that influence outcomes. To account for this, we fit multiple-linear regression models to our data. Multiple linear regression models typically take the form:

\(y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... \beta_m x_m + \varepsilon_i\)

, where \(y_i\) represents our response variable (indexed by \(i\)), \(beta_0 ... beta_n\) represent the regression coefficients that we are trying to estimate, \(x_1 ... x_n\) represent our predictor variables, and \(varepsilon_i\) our error (indexed by \(i\)). Assuming this model structure is appropriate for our system, the challenge is accurately estimating the regression coefficients from our data.

In this practical we will explore multiple linear regression using data that were collected by Carrascal and Palomino to explore the influence of island characteristics and the abundance of terrestrial bird species in Selvagem and the Canary Islands. Carrascal and Palomino compiled data from the peer reviewed literature to explore whether factors such as an island’s size, distance to the mainland, diversity of habitats, etc… correlated with terrestrial bird species diversity. The dataset includes a number of different measures, but here we will be focusing on i) bird_species, the number of terrestrial bird species found on each island; ii) island_area, the area of each island (in km\(^2\)); iii) maximum_altitude, the maximum altitude on each island (in meters), and iv) habitat_diversity, which is a unitless metric based on the Shannon-Weaver index that represents the diversity different habitat types on each island.

6 points
- Import the Island_Biogeography dataset.
- Use the plot() and cor() functions to check for any correlations between the 3 predictor variables; island_area, maximum_altitude, and habitat_diversity. – 1 point(s)
- Describe the outcome. Is there any cause for concern? Are there any variables you should consider dropping? Why/why not? – 1 point(s)
- Fit a simple linear regression model to model the relationship between bird_species and island_area. – 1 point(s)
- Manually calculate the the ordinary residuals, then compare them to the output of the residuals() function. – 1 point(s)
- Plot the ordinary residuals against the fitted values, plot a histogram of the ordinary residuals, describe what you see. – 0.5 point(s)
- Fit a model that includes all of the predictor variables from island_area, maximum_altitude, and habitat_diversity that you feel should be included based on your previous analyses. what happened to the R-squared? – 1 point(s)
- Plot and inspect the ordinary residuals, describe what you see. – 0.5 point(s)

data <- read.csv("https://noonanm.github.io/Biol520C/Practical_Datasets/Island_Biogeography.csv")

R-squared and adjusted R-squared

R-squared

The residuals measure how accurately a fitted model predicts the observed data, and they are also used to calculate the coefficient of determination (i.e., R-squared). A model’s R-squared value is the proportion of the variance in the response variable that is predictable from the predictor(s). You will often see ecologists say things like our model explained 87% of the variance in our response variable. In most cases they are getting this number from a model’s R-squared.

2 points
- Manually calculate the R-squared value for the full model you just fit to the bird species abundance data. – 1 point(s)
- Compare this value to the output of summary(). – 1 point(s)

The problem with R-squared values is that they don’t take into account the number of parameters in the model. If we keep adding meaningless parameters to a model to soak up some of the residual variation, the R-square will necessarily decrease, which can trick us into thinking we have a better model.

4 points
- Make a figure showing the relationship between island age and bird diversity. From these data, do you think that island age is driving bird diversity. – 1 point(s)
- Refit the model you previously fit, but now include island_age. What happened to the R-squared? Is this a meaningful measure of model performance? – 1 point(s)
- Generate 12 randomly distributed numbers from a distribution of your choice. Add these to your dataset, and refit the model you just fit but now include these random numbers. Describe the effect on the R-squared. – 2 point(s)

Adjusted R-squared

When you add more parameters to the model, the r-squared converges to 1. This suggests that we have an increasingly better model, but we know for a fact that it will do this even if the parameters are just noise. The reason for this is that the R-squared value doesn’t factor in the number of parameters in the model. As a correction, the adjusted R-square is ‘adjusted’ based on the number of parameters in the model.

2 points
- Compare the adjusted R-squared value you obtained in exercise 2, with that from the model with the random noise. Describe what you see now. – 2 point(s)

Heteroskedasticity

Problem

Sometime data will exhibit “heteroscedasticity”, meaning that the residuals (i.e., variance in the data that is not explained by the model) get larger as the prediction moves from small to large values of x (or from large to small).

3 points
- Describe how the following function works to generate heteroskedastic data: – 1 point(s)
```
hetero <- function(x) {
  B_0 <- 0
  B_1 <- 2
  sigma = x^2
  eps = rnorm(x,mean=0,sd=sqrt(sigma))
  y = B_0 + B_1*x + eps
  y
}
```
- Use this function to generate 400 uniformly sampled data points between 1 and 60. – 0.5 point(s)
- Plot the data and describe what you see. – 0.5 point(s)
- Fit a model to these data. – 0.5 point(s)
- Plot the residuals against the fitted values and describe the effect. – 0.5 point(s)

Implications

Heteroskedasticity on its own does not inherently create a problem (notice the parameter estimates), but it’s often an indication that your model can be improved. Importantly, if your sample size is small, and you can’t fix the issue, your p-values may be a bit be a bit higher or lower than they should be, so possibly a variable that is right on the border of significance may end up erroneously on the wrong side of that border. Your regression coefficients will still be generally accurate though. The accuracy of any of your predictions will also be off (either too narrow, or too wide), depending on where you fall along the x-axis.

How to solve the issue

The most frequent suggestion is to transform a variable, however…
More often than not, heteroscedasticity indicates that your model is missing a component and transformations are attempt to mask interesting features rather than incorporate them into your model.

2 points
- Attempt to fix the heteroscedasticity in these data by square root transforming the y values. – 0.5 point(s)
- Plot the transformed data and fitted regression line. – 0.5 point(s)
- Plot the residuals against the fitted values and describe the effect of the transformation. – 1 point(s)