Background

In this Practical you will:

  • Write R functions for simulating data.
  • Explore the concepts of deterministic and stochastic components of models.
  • Explore the concept of error distributions and independent and identically distributed (IID) errors.
  • Fit a simple linear regression model to simulated and empirical data.
  • Estimate a parameter via maximum likelihood estimation.
  • Estimate confidence intervals via log-likelihood profiles.

Deterministic models

Statistical models typically comprise a deterministic component, which describes the basic trend in a system, and a stochastic component, which describes the system’s randomness.

Let us suppose that we are interested in the relationship between body mass and age for some species, and our hypothesis is that mass is proportional to age. This verbal description of our hypothesis does not give us any clear way of actually testing it. We are instead going to formalise this hypothesis according to a simple linear deterministic model:

\(mass_i = \beta_0 + \beta_1 age_i\)


The parameter \(\beta_0\) describes the intercept of this model, and \(\beta_1\) the slope. When an individual is born, it starts out with some mass, so the intercept is probably non-zero. Let’s say our intercept here is \(\beta_0 = 1\), meaning individuals at age 0 have a mass of 1 kg. Let’s also say that we know that for every unit of time an animal ages, its mass increases by 2 kg (i.e., \(\beta_1 = 2\)). We can then express the deterministic relationship more formally as:

\(mass_i = 1 + 2\times age_i\)
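
A deterministic model like this translates directly into a one-line R function. The sketch below is illustrative only (the function name `det_mass` and the use of default arguments are our own choices, not the required answer to the exercise below):

```r
# Sketch of the deterministic growth model: mass = beta0 + beta1 * age.
# The name det_mass and the defaults (beta0 = 1, beta1 = 2) are illustrative.
det_mass <- function(age, beta0 = 1, beta1 = 2) {
  beta0 + beta1 * age
}

det_mass(0)   # 1: at age 0, mass equals the intercept
det_mass(5)   # 11: 1 + 2 * 5
```

Because R is vectorised, passing a vector of ages returns a vector of predicted masses with no extra code.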


  1. 3 points
    • Write an R function that expresses this relationship. – 1 point(s)
    • Generate a vector of ages from 0 to 20 years, separated by 1 year. – 0.5 point(s)
    • Use the deterministic model to predict the mass of animals of these ages – 0.5 point(s)
    • Plot the data to inspect the results – 0.5 point(s)
    • Generate predictions for 20 animals all of age 5 and inspect the results – 0.5 point(s)

Stochastic models

Stochastic models describe the randomness of the process. Simple linear regression accounts for stochasticity by assuming that each error \(\varepsilon_i\) is drawn from a normal distribution:

\[ mass_i = \beta_0 + \beta_1 age_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]


Returning to our numbers from above, what we’re now saying is that all animals of age 5 will have a mass that is Gaussian distributed around a mean of 11 (\(1 + 2\times5\)).
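
In R, the stochastic component is simulated with `rnorm()`. A minimal sketch, again with illustrative names and defaults of our own choosing:

```r
# Deterministic trend plus IID Gaussian noise:
# mass = beta0 + beta1 * age + eps, with eps ~ N(0, sd^2).
stoch_mass <- function(age, beta0 = 1, beta1 = 2, sd = 2) {
  beta0 + beta1 * age + rnorm(length(age), mean = 0, sd = sd)
}

set.seed(42)                    # reproducible draws
stoch_mass(rep(5, 10))          # 10 animals of age 5: values scatter around 11
```

Note that `rnorm(length(age), ...)` draws one independent error per animal, so two animals of the same age get different masses.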

  1. 3 points
    • Write a new function for our previous model that also includes a Gaussian distributed stochastic component with standard deviation of 2. Note: You will need to make use of the rnorm() function. – 1.5 point(s)
    • Generate 60 uniformly distributed ages between 0 and 20. Note: You will need to make use of the runif() function. – 0.5 point(s)
    • Use the stochastic model to predict the mass of these 60 animals. – 0.5 point(s)
    • Plot the data to inspect the results. – 0.5 point(s)

Independent and identically distributed (IID) errors

Beyond assuming normality of the errors, simple linear regression also assumes that the errors are independent and identically distributed:

\(\varepsilon_i \sim \mathcal{N}(0,\,\sigma^{2})\)


For our example, the identical part of IID means that the distribution of errors around each \(mass_i\) is the same. In other words, \(\sigma^{2}\) does not change with age: the error distribution for animals of age 2 is the same as the distribution for animals of age 3, which is the same as for animals of age 4, and so on. The independent part means that there is no relationship among the errors. We will be discussing independence in greater detail in later lectures/practicals, but for now, let’s focus on what is meant by identically distributed errors.

The first thing our model says is that for any age \(i\), we can expect mass to be symmetrically (normally) distributed around the expected value, and our errors \(\varepsilon_i\) to be centred on 0 (remember that \(\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)\)).
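
One way to see this, as a sketch using the parameter values from above (not the exercise solution):

```r
# Simulate masses at a single age and check that the errors centre on 0.
set.seed(1)
age  <- 5
pred <- 1 + 2 * age + rnorm(1000, mean = 0, sd = 2)  # simulated masses
err  <- pred - (1 + 2 * age)                         # eps_i = y_i - (beta0 + beta1 * x_i)
mean(err)                                            # close to 0
hist(err)                                            # roughly symmetric and bell-shaped
```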


  1. 4 points
    • Use the set.seed() function to set the random seed as 1 (this will ensure consistent results). – 0.5 point(s)
    • Generate predictions for 1000 animals all of age 5. – 0.5 point(s)
    • Use a histogram to inspect the results. – 0.5 point(s)
    • Calculate the errors for these predictions. – 0.5 point(s)
    • Calculate the mean of the errors. – 0.5 point(s)
    • Use a histogram to inspect the distribution of errors. – 0.5 point(s)
    • Describe how the histogram allows us to diagnose the IID assumption. – 1 point(s)

The second thing our model says is that for any age \(i\), we can expect the distribution of errors to be identical.

  1. 4 points
    • Quantify the errors on predictions for 1000 animals all of age 5. – 0.5 point(s)
    • Quantify the errors on predictions for 1000 animals all of age 30. – 0.5 point(s)
    • Calculate the means of these error distributions. – 0.5 point(s)
    • Use histograms to inspect the results. Note: Use different colours for the two distributions. You can also use add = TRUE to overlay the plots. – 1 point(s)
    • What would the figure look like if the IID assumption was broken? – 1 point(s)
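
The overlay technique mentioned in the note above can be sketched as follows (using our simulation parameters; under the IID assumption the errors do not depend on age, so the two histograms should sit on top of one another):

```r
# Errors for two age groups. Because the model's errors are N(0, sd^2)
# regardless of age, both sets are draws from the same distribution.
set.seed(1)
err5  <- rnorm(1000, mean = 0, sd = 2)   # errors for age-5 animals
err30 <- rnorm(1000, mean = 0, sd = 2)   # errors for age-30 animals

hist(err5,  col = rgb(0, 0, 1, 0.5), xlim = c(-8, 8),
     main = "Error distributions at two ages", xlab = "error")
hist(err30, col = rgb(1, 0, 0, 0.5), add = TRUE)   # overlay the second histogram
```

Semi-transparent colours (the fourth argument of `rgb()`) make the overlap visible.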

Fitting simple linear models in R

Simple linear regression models can be easily fit in R either by manually estimating the regression coefficients, or via the lm() function.
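
As a sketch on simulated data (the musteloid data import for the exercise is provided on the course website), the least-squares matrix estimator \(\hat{\beta} = (X^\top X)^{-1} X^\top y\) and `lm()` should give identical coefficients:

```r
# Compare the least-squares matrix estimator with lm() on simulated data.
set.seed(1)
x <- runif(50, 0, 20)
y <- 1 + 2 * x + rnorm(50, mean = 0, sd = 2)

X        <- cbind(1, x)                        # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y

fit <- lm(y ~ x)
cbind(matrix = beta_hat, lm = coef(fit))       # the two columns should match
```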

  1. 9 points
      Using a dataset of musteloid traits taken from a study by Noonan et al. (2016), we will determine whether there is any evidence of a linear relationship between female body mass and mean litter size (i.e., mean number of offspring a female has at a given time). The data are available on the course website and the code to import them is detailed below. From these data, you will need to complete the following steps:
    • Remove any instances of missing data. – 0.5 point(s)
    • Make a scatterplot showing the relationship between litter size and female body mass. – 1 point(s)
    • Estimate the regression parameters using the least-squares matrix estimator. – 2 point(s)
    • Fit a similar linear regression to the same data using the lm() function.– 1 point(s)
    • Use the summary() function to inspect the results. – 0.5 point(s)
    • Plot the data and the fitted regression line. – 1 point(s)
    • Use a histogram to plot the residuals for visual inspection. Describe how the residuals compare to those in exercise 4. – 1 point(s)
    • Based on these residuals, do you think these data meet the assumptions of simple linear regression? What led you to make that decision? – 2 point(s)

Maximum likelihood

  1. 14 points
      A tiny species of parasitic wasp, Trichogramma brassicae, rides on female cabbage white butterflies, Pieris brassicae. When a butterfly lays her eggs, the wasp climbs off the female and parasitizes the freshly laid eggs. Fatouros et al. (2005) carried out trials to determine whether the wasps can distinguish mated female butterflies from unmated females (recently mated females are more likely to lay eggs than unmated females, and so would be a better choice to ride on and follow around). In each trial a single wasp was presented with two female cabbage white butterflies, one an unmated female, the other recently mated, and the choice of which butterfly to ride was recorded. The question is: are the wasps making an informed decision, or are they selecting butterflies at random?


    Answer this by applying maximum likelihood estimation to the proportion of parasitic wasp individuals that chose the mated butterflies. You will need to carry out the following steps:

    • Import and inspect the mate choice data. Note: use the table() function to create a frequency table. – 1 point(s)
    • Calculate the likelihood and log-likelihood of random host choice. Note: these are yes/no data, so what distribution should you use? What parameter value would result in a random, 50/50 split? – 1 point(s)
    • Calculate the negative log-likelihood of a range of parameter values that allow you to find the maximum likelihood. Note, you can make use of the seq() function. – 2 point(s)
    • Find the minimum negative log-likelihood estimate. – 1 point(s)
    • Plot the negative log-likelihood curve. – 1 point(s)
    • Explain why the likelihood surface has the shape you observe. – 2 point(s)
    • Calculate the likelihood-based 95% confidence interval. Note: We learned in the lecture that an approximate 95% confidence interval is within 1.92 of the minimum negative log-likelihood. – 3 point(s)
    • What do the ML estimate and 95% confidence interval tell you about host choice? Is it random, or non-random? – 1 point(s)
    • How would increasing/decreasing the sample size change the inference you just made? Be specific in your answer with respect to the likelihood profile. – 2 point(s)
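
The workflow above can be sketched with hypothetical counts. These are NOT the real Fatouros et al. (2005) data; substitute the frequencies you obtain from the imported dataset:

```r
# Hypothetical binomial data: k of n wasps chose the mated female.
n <- 40
k <- 28

p   <- seq(0.001, 0.999, by = 0.001)               # candidate parameter values
nll <- -dbinom(k, size = n, prob = p, log = TRUE)  # negative log-likelihood

p_hat <- p[which.min(nll)]                # ML estimate, approximately k / n
ci    <- range(p[nll <= min(nll) + 1.92]) # likelihood-based 95% CI

plot(p, nll, type = "l", xlab = "p (prob. of choosing mated female)",
     ylab = "negative log-likelihood")
abline(h = min(nll) + 1.92, lty = 2)      # values below this line are inside the CI
abline(v = 0.5, col = "grey")             # random, 50/50 host choice
```

If the whole confidence interval sits above 0.5, the data are inconsistent with random host choice; a larger sample size narrows the likelihood profile and hence the interval.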