Background
In this Practical you will:
- Write R functions for simulating data
- Explore the concepts of deterministic and stochastic components of
models.
- Explore the concept of error distributions and independent and
identically distributed (IID) errors.
- Fit a simple linear regression model to simulated and empirical
data.
- Estimate a parameter via maximum likelihood estimation
- Estimate confidence intervals via log-likelihood profiles
Deterministic models
Statistical models are typically composed of a deterministic
component, which describes the basic trend in a system, and a stochastic
component, which describes the system's randomness.
Let us suppose that we are interested in the relationship between body
mass and age for some species, and our hypothesis is that mass is
proportional to age. This verbal description of our hypothesis does not
give us any clear way of actually testing it. We are instead
going to formalise this hypothesis according to a simple linear
deterministic model:
\(mass_i = \beta_0 + \beta_1 age_i\)
The parameter \(\beta_0\) describes
the intercept of this model, and \(\beta_1\) the slope. When an individual is
born, it starts out with some mass, so the intercept is probably
non-zero. Let’s say our intercept here is \(\beta_0 = 1\), meaning individuals at age 0
have a mass of 1kg. Let’s also say that we know for every unit of time
the animal ages, the mass increases by 2kg (i.e., \(\beta_1 = 2\)). We can then more formally
express the deterministic relationship as:
\(mass_i = 1 + 2 \times age_i\)
- 3 points
- Write an R function that expresses this relationship. – 1 point(s)
- Generate a vector of ages from 0 to 20 years, separated by 1 year. –
0.5 point(s)
- Use the deterministic model to predict the mass of animals of these
ages – 0.5 point(s)
- Plot the data to inspect the results – 0.5 point(s)
- Generate predictions for 20 animals all of age 5 and inspect the
results – 0.5 point(s)
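The steps above can be sketched as follows. This is a minimal sketch, not a model answer: the function and object names (`mass_det`, `ages`, `masses`) are illustrative choices, not prescribed by the exercise.

```r
# Deterministic model: mass_i = beta0 + beta1 * age_i
# (names here are illustrative, not prescribed by the exercise)
mass_det <- function(age, beta0 = 1, beta1 = 2) {
  beta0 + beta1 * age
}

ages <- seq(0, 20, by = 1)       # ages from 0 to 20 in 1-year steps
masses <- mass_det(ages)         # predicted masses
plot(ages, masses, type = "b",
     xlab = "Age (years)", ylab = "Mass (kg)")

# 20 animals, all of age 5: with no stochastic component,
# the model returns the same value (11 kg) for every animal
mass_det(rep(5, 20))
```

Note that the deterministic predictions for the 20 age-5 animals are all identical, which is exactly what motivates adding a stochastic component in the next section.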
Stochastic models
Stochastic models describe the randomness of the process. Simple
linear regression accounts for stochasticity by assuming that each
observation \(mass_i\) is drawn from a
normal distribution centred on the deterministic prediction; equivalently,
the errors are normally distributed:
\[
mass_i = \beta_0 + \beta_1 age_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\]
Going back to our numbers from above, what we are now saying is that
all animals of age 5 will have a mass that is Gaussian distributed around
11 kg (the deterministic prediction \(1 + 2 \times 5\)), with standard
deviation \(\sigma\).
- 3 points
- Write a new function for our previous model that also includes a
Gaussian distributed stochastic component with standard deviation of 2.
Note: You will need to make use of the
rnorm() function. –
1.5 point(s)
- Generate 60 uniformly distributed ages between 0 and 20. Note: You
will need to make use of the
runif() function. – 0.5
point(s)
- Use the stochastic model to predict the mass of these 60 animals. –
0.5 point(s)
- Plot the data to inspect the results. – 0.5 point(s)
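A sketch of the stochastic version is below. As before, the function name `mass_stoch` is an illustrative choice; the key addition is the `rnorm()` term implementing \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\).

```r
# Stochastic model: deterministic trend plus Gaussian noise
# (function name is illustrative, not prescribed by the exercise)
mass_stoch <- function(age, beta0 = 1, beta1 = 2, sigma = 2) {
  beta0 + beta1 * age + rnorm(length(age), mean = 0, sd = sigma)
}

ages <- runif(60, min = 0, max = 20)   # 60 uniformly distributed ages
masses <- mass_stoch(ages)             # simulated masses
plot(ages, masses, xlab = "Age (years)", ylab = "Mass (kg)")
```

Unlike the deterministic model, calling `mass_stoch(rep(5, 20))` now returns 20 different values scattered around 11.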
Independent and identically distributed (IID) errors
Beyond making an assumption of normality on the errors, simple linear
regression also assumes that the errors are independent and identically
distributed
\(\varepsilon_i \sim
\mathcal{N}(0,\,\sigma^{2})\)
For our example, the identical part of IID means that the
distribution of errors around each \(mass_i\) is the same. In other words, \(\sigma^{2}\) does not change with age and
the error distribution for animals of age 2 is the same as the
distribution of animals at age 3, which is the same as animals at age 4,
and so on. The independent part means that there is no relationship
among the errors. We will be discussing independence in greater detail
in later lectures/practicals, but for now, let’s focus on what is meant
by identically distributed errors.
The first thing our model says is that, for any age, we can expect mass
to be symmetrically distributed around the expected value, and the
errors \(\varepsilon_i\) to be symmetrically distributed
around 0 (remember that \(\varepsilon_i = y_i
- (\beta_0 + \beta_1 x_i)\)).
- 4 points
- Use the
set.seed() function to set the random seed as 1
(this will ensure consistent results). – 0.5 point(s)
- Generate predictions for 1000 animals all of age 5. – 0.5
point(s)
- Use a histogram to inspect the results. – 0.5 point(s)
- Calculate the errors for these predictions. – 0.5 point(s)
- Calculate the mean of the errors. – 0.5 point(s)
- Use a histogram to inspect the distribution of errors. – 0.5
point(s)
- Describe how the histogram allows us to diagnose the IID assumption.
– 1 point(s)
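The error calculation can be sketched as follows, assuming a stochastic simulator like the one from the previous exercise (redefined here, under the same illustrative name, so the snippet is self-contained):

```r
# Assumes the stochastic simulator sketched earlier
mass_stoch <- function(age, beta0 = 1, beta1 = 2, sigma = 2) {
  beta0 + beta1 * age + rnorm(length(age), mean = 0, sd = sigma)
}

set.seed(1)                          # reproducible results
masses <- mass_stoch(rep(5, 1000))   # 1000 animals, all of age 5
hist(masses)                         # roughly bell-shaped around 11

# errors: epsilon_i = y_i - (beta0 + beta1 * x_i)
errors <- masses - (1 + 2 * 5)
mean(errors)                         # should be close to 0
hist(errors)                         # roughly bell-shaped around 0
```

If the normality part of the assumption holds, the histogram of errors should look approximately Gaussian and centred on zero.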
The second thing our model says is that, for any age, we can expect the
distribution of errors to be identical.
- 4 points
- Quantify the errors on predictions for 1000 animals all of age 5. –
0.5 point(s)
- Quantify the errors on predictions for 1000 animals all of age 30. –
0.5 point(s)
- Calculate the means of these error distributions. – 0.5
point(s)
- Use histograms to inspect the results. Note: Use different colours
for the two distributions. You can also use
add = TRUE to
overlay the plots. – 1 point(s)
- What would the figure look like if the IID assumption were broken? –
1 point(s)
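The comparison of the two error distributions can be sketched like this (again assuming the illustrative `mass_stoch` simulator; the semi-transparent colours via `rgb()` are one way to make the overlaid histograms readable):

```r
# Assumes the stochastic simulator sketched earlier
mass_stoch <- function(age, beta0 = 1, beta1 = 2, sigma = 2) {
  beta0 + beta1 * age + rnorm(length(age), mean = 0, sd = sigma)
}

set.seed(1)
err5  <- mass_stoch(rep(5, 1000))  - (1 + 2 * 5)    # errors at age 5
err30 <- mass_stoch(rep(30, 1000)) - (1 + 2 * 30)   # errors at age 30

mean(err5); mean(err30)   # both should be close to 0

# overlay the two error distributions with semi-transparent colours
hist(err5, col = rgb(0, 0, 1, 0.5),
     xlab = "Error", main = "Errors at age 5 vs age 30")
hist(err30, col = rgb(1, 0, 0, 0.5), add = TRUE)
```

Because \(\sigma\) does not depend on age in this simulation, the two histograms should overlap almost completely; under a broken IID assumption (e.g. \(\sigma\) growing with age) one histogram would be visibly wider than the other.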
Fitting simple linear models in R
Simple linear regression models can be easily fit in R
either by manually estimating the regression coefficients, or via the
lm() function.
- 9 points
Using a dataset of musteloid traits taken from a study by Noonan et al. (2016), we
will determine whether there is any evidence of a linear relationship
between female body mass and mean litter size (i.e., mean number of
offspring a female has at a given time). The data are available on the
course website and the code to import them is detailed below. From these
data, you will need to complete the following steps:
- Remove any instances of missing data. – 0.5 point(s)
- Make a scatterplot showing the relationship between litter size and
female body mass. – 1 point(s)
- Estimate the regression parameters using the least-squares matrix
estimator. – 2 point(s)
- Fit the same linear regression to the data using the
lm() function. – 1 point(s)
- Use the summary() function to inspect the results. – 0.5
point(s)
- Plot the data and the fitted regression line. – 1 point(s)
- Use a histogram to plot the residuals for visual inspection.
Describe how the residuals compare to those in exercise 4. – 1
point(s)
- Based on these residuals, do you think these data meet the
assumptions of simple linear regression? What led you to make that
decision? – 2 point(s)
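The matrix estimator and the `lm()` fit can be sketched as below. Since the real musteloid data come from the course website, this sketch uses simulated stand-in data; the variable names `mass` and `litter` are illustrative, not the actual column names in the dataset.

```r
# Least-squares matrix estimator: beta_hat = (X'X)^{-1} X'y
# Stand-in data (the real data are the Noonan et al. 2016 musteloid traits)
set.seed(1)
mass   <- runif(30, min = 1, max = 10)            # stand-in body mass
litter <- 2 + 0.3 * mass + rnorm(30, sd = 0.5)    # stand-in litter size

X <- cbind(1, mass)                    # design matrix: intercept + slope
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% litter
beta_hat                               # manual estimates

# The same fit via lm()
fit <- lm(litter ~ mass)
summary(fit)                           # coefficients match beta_hat

plot(mass, litter)
abline(fit)                            # fitted regression line
hist(resid(fit))                       # residuals for visual inspection
```

With real data, missing values would first be dropped, e.g. with `na.omit()`, before building the design matrix.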
Maximum likelihood
- 14 points
A tiny species of parasitic wasp, Trichogramma brassicae, rides on
female cabbage white butterflies, Pieris brassicae. When a butterfly
lays her eggs, the wasp climbs off the female and parasitizes the
freshly laid eggs. Fatouros et
al. (2005) carried out trials to determine whether the wasps can
distinguish mated female butterflies from unmated females (recently
mated females are more likely to lay eggs than unmated females, and so
would make a better choice to ride and follow). In each trial a single wasp
was presented with two female cabbage white butterflies, one an unmated
female, the other recently mated, and the choice of which butterfly to
ride was recorded. The question is: are the wasps making an informed
decision, or are they randomly selecting butterflies?
Answer this by applying maximum likelihood estimation of the
proportion of parasitic wasp individuals that choose the mated
butterflies. You will need to carry out the following steps:
- Import and inspect the mate choice data. Note: use the
table() function to create a frequency table. – 1
point(s)
- Calculate the likelihood and log-likelihood of random host choice.
Note these are yes/no data, what distribution should you use? What
parameter values would result in a random, 50/50 split? – 1
point(s)
- Calculate the negative log-likelihood of a range of parameter values
that allow you to find the maximum likelihood. Note, you can make use of
the
seq() function. – 2 point(s)
- Find the minimum negative log-likelihood estimate. – 1 point(s)
- Plot the negative log-likelihood curve. – 1 point(s)
- Explain why the likelihood surface has the shape you observe. – 2
point(s)
- Calculate the likelihood-based 95% confidence interval. Note: We
learned in the lecture that an approximate 95% confidence interval is
within 1.92 of the minimum negative log-likelihood. – 3 point(s)
- What do the ML estimate and 95% confidence interval tell you about
host choice? Is it random, or non-random? – 1 point(s)
- How would increasing/decreasing the sample size change the inference
you just made? Be specific in your answer with respect to the likelihood
profile. – 2 point(s)
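The likelihood machinery can be sketched as follows. Since these are yes/no choices, the binomial distribution applies, and `dbinom()` with `log = TRUE` gives the log-likelihood directly. The counts below are placeholders, not the actual Fatouros et al. (2005) data, which should be imported from the course website.

```r
# Placeholder counts: NOT the real data, just for illustrating the method
n_mated <- 23    # wasps that chose the mated female
n_total <- 32    # total number of trials

# negative log-likelihood over a grid of candidate proportions
p_grid <- seq(0.01, 0.99, by = 0.001)
nll <- -dbinom(n_mated, size = n_total, prob = p_grid, log = TRUE)

p_hat <- p_grid[which.min(nll)]    # maximum likelihood estimate
plot(p_grid, nll, type = "l",
     xlab = "Proportion choosing mated female",
     ylab = "Negative log-likelihood")

# approximate 95% CI: all p within 1.92 units of the minimum NLL
ci <- range(p_grid[nll <= min(nll) + 1.92])
p_hat
ci
```

Random (50/50) choice corresponds to \(p = 0.5\); if the confidence interval excludes 0.5, the data are inconsistent with random host choice. A larger sample size makes the likelihood profile narrower around its minimum, tightening the interval.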