

Stochastic Simulation and the Monte Carlo Method

Everything we’ve been doing so far has focused on fitting/selecting models and estimating their parameters from a given dataset. This can be seen as ‘inverse’ modelling: we have the data in hand and we backtrack to try to identify the deterministic and stochastic processes that generated it. Simulation, on the other hand, can be seen as ‘forward’ modelling. With forward modelling we don’t start with data, but with a model. If we pick and/or build a model to describe our system, our goal is to explore what patterns we might expect to see in the data. Because the computer code required to generate simulations is easier to learn than high-level mathematics, simulation studies can put practicing biologists on an equal footing with even the most experienced mathematicians. This makes simulations powerful tools for understanding biological systems, generalising the results of our analyses, and making predictions.

If we can write down a model that describes our system, or are prepared to make assumptions about what that model might look like, then we can generate simulated data. These simulated data can help us understand how a system might be expected to respond to conditions that we cannot or did not measure, or show us what patterns we would expect to see. If we chain together simulations from multiple models we can generate incredibly rich and complex descriptions of biological systems. There is an important caveat that comes with designing simulation studies, however. It is critical to remember that the more moving pieces a simulation setup has, the harder the outcomes can be to understand (and this is even more true when weaving together non-linear models). Consequently, simple, carefully tailored simulations are often more informative than complex simulations that we can’t keep track of.

‘Stochastic’ simulation is a special type of simulation that relies on computational algorithms to randomly sample values from probability distributions (e.g., rnorm(), runif(), rbinom(), etc.) to emulate a system’s stochasticity. Stochastic simulation is also regularly referred to as Monte Carlo sampling. In this practical we will focus on how to use stochastic simulation as a tool for understanding and predicting biological systems.
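To make this concrete, here is a minimal sketch of the Monte Carlo idea: we draw many random values from a known distribution and check that their sample statistics converge on the true values.

## Approximate the mean and variance of a Normal(0, 1) distribution
## by random sampling
set.seed(42)                   # make the random draws reproducible
draws <- rnorm(10000, mean = 0, sd = 1)
mean(draws)                    # close to the true mean of 0
var(draws)                     # close to the true variance of 1

## The same idea works for any distribution, e.g., simulated coin flips
flips <- rbinom(10000, size = 1, prob = 0.5)
mean(flips)                    # close to the true probability of 0.5

The more draws we take, the closer these sample statistics get to the true values; this is the engine behind all of the simulations in this practical.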

Biologists typically use simulations for two main tasks:

  • Exploring patterns that would emerge from a given model(s); or
  • Calculating power and/or planning future studies.

We will explore both of these concepts in this practical.

Exploring Patterns That Would Emerge From a Fitted Model

We’re going to work with a dataset collected by Vicente et al. (2006). They analysed the distribution of tuberculosis-like lesions in wild boar (Sus scrofa) to assess the species’ potential importance in the persistence of tuberculosis in south-central Spain. In class we saw that there was a positive relationship between body size and Tb prevalence. The pieces required to fit a logistic regression to these data are:


\(Y_i \sim Binomial(1, \pi_i) \quad \quad E(Y_i) = \pi_i \quad \mathrm{and} \quad var(Y_i) = \pi_i \times(1-\pi_i)\)


\(\pi_i = \frac{e^{\beta_0 + \beta_1 \times \mathrm{Length}}} {1 + e^{\beta_0 + \beta_1 \times \mathrm{Length}}}\)


We’re going to pair this model with stochastic simulations to get an idea of what proportion of individuals in a population might be infected depending on the mean body size of the population. We will also look at how we might use the information from the model to inform management decisions to control Tb in this population.

    • Import the wild boar dataset and remove any NA values.
    • Fit a logistic regression predicting Tb status from body length.
    • Plot the data and the fitted regression line.
    • Use the parameter values from the fitted model to build a function for predicting Tb status. This function should have both deterministic and stochastic components.
    • If we assume that the body size of animals in the population is normally distributed with \(\mu\) = 120 cm and \(\sigma\) = 20, use a simulation to find out what proportion of a population of 1000 animals can be expected to be Tb positive. What if the mean body size in the population was 140 cm instead? (A sketch of one possible workflow follows this list.)
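Below is a minimal sketch of these steps, assuming the file is called wild_boar.csv and contains columns Tb (0/1 infection status) and Length (body length in cm); adjust the file and column names to match your copy of the data.

## Import the data and drop rows with missing values
boar <- na.omit(read.csv("wild_boar.csv"))

## Fit the logistic regression of Tb status on body length
fit <- glm(Tb ~ Length, data = boar, family = binomial)

## Prediction function with a deterministic component (the inverse-logit
## of the linear predictor) and a stochastic component (a Bernoulli draw)
predict_tb <- function(body_length, model = fit) {
  eta  <- coef(model)[1] + coef(model)[2] * body_length
  pi_i <- exp(eta) / (1 + exp(eta))
  rbinom(length(pi_i), size = 1, prob = pi_i)
}

## Simulate a population of 1000 animals with mean body length 120 cm
set.seed(1)
mean(predict_tb(rnorm(1000, mean = 120, sd = 20)))   # proportion Tb positive

## And again with a mean body length of 140 cm
mean(predict_tb(rnorm(1000, mean = 140, sd = 20)))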

Let’s say that to control Tb in a population of wild boar consisting of 1,000 individuals we are going to undertake an antibiotic treatment program. Testing wild animals for Tb can be difficult, expensive and unreliable, and there are insufficient funds to treat the entire population. That means we can’t test each individual and treat only the Tb-positive animals. Because Tb infection status correlates positively with body size, a reasonable recommendation would be to treat only animals above a certain size threshold. But what should that threshold be? One way to answer that question is via simulations.

Note: In reality, Tb is very infectious, requires long courses of antibiotics to treat, and vaccines are not always effective. Tb is therefore usually controlled by culling populations to prevent further spread. This is a simulation though, so we can be kind to our simulated boar.

    • Use simulations to explore how the proportion of Tb positive individuals in the population changes depending on the size threshold for treatment. Assume all animals above the size threshold get treated and can no longer be Tb positive, and that the body size of animals in the population is normally distributed with \(\mu\) = 120 cm and \(\sigma\) = 20. Test size thresholds ranging between 80 cm and 160 cm. (A sketch of one possible approach follows this list.)
    • Visualise the results of your simulation by plotting the proportion of Tb positive animals in the population against the treatment size threshold.
    • Based on these simulations, what is the size threshold needed to keep the proportion of infected individuals below 10% of the population?
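One possible implementation, reusing the fitted model and the predict_tb() function from the previous sketch:

## Explore treatment size thresholds from 80 to 160 cm
set.seed(2)
thresholds <- seq(80, 160, by = 5)
prop_positive <- sapply(thresholds, function(thr) {
  lengths <- rnorm(1000, mean = 120, sd = 20)
  tb <- predict_tb(lengths)
  tb[lengths > thr] <- 0      # treated animals can no longer be Tb positive
  mean(tb)
})

## Proportion of Tb positive animals against the treatment threshold
plot(thresholds, prop_positive, type = "b",
     xlab = "Treatment size threshold (cm)",
     ylab = "Proportion Tb positive")
abline(h = 0.1, lty = 2)      # the 10% management target

Because each run draws a new random population, the curve will wobble from run to run; averaging several replicates per threshold would smooth it out.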

Simulation-Based Power Analysis

Power analysis, in the traditional sense, means identifying the minimum sample size needed to detect an effect if one truly exists. It is traditionally associated with design-based inference and pairwise hypothesis testing (ANOVAs, t-tests, etc.). In model-based inference each parameter has its own effect, so the concept translates, but the tools required are different. Power analysis in our context involves a special kind of simulation study aimed at exploring how much data you would need in order to get reasonably accurate estimates of your parameters, detect the significance of parameters with true effects, and/or detect differences between groups.

We will explore power analysis for a multiple linear regression model using the data collected by Ismail et al. (2014) to explore the influence of forest degradation on the recruitment of Dysoxylum malabaricum. As we saw in Practical 03, D. malabaricum is a large canopy tree species found in highly fragmented forest patches within a complex agro-forest landscape of the Western Ghats biodiversity hotspot, South India. The measures of degradation include:

  • Canopy_closure: The percentage of closed canopy.
  • Cv_juvenile_density: The density of Clerodendrum viscosum, an early successional light demanding shrub that establishes itself well in degraded forests.
  • Termite_nest_density: The number of arboreal termite nests attached to branches and trees. Because termites are sensitive to disturbance and canopy loss, nest density has been shown to be an effective indicator of forest degradation.

Ismail et al. collected data on the densities of seedlings along with indicators of degradation in 17 forest patches, so it is possible that the small sample size limited the power to detect significant effects in these data. But what was the power of the model fit to these data? To answer that question we will rely on an approach that involves simulating data from an empirically defined model to re-create the study post hoc. This is termed ‘empirically guided power analysis’. With this type of approach, instead of defining a model prior to collecting data as we saw in the lectures, we assume that the fitted model is true and ask what the power of that model is given our empirical sample size. To test the power in this context, we use the fitted model and random number generation to simulate data that look like the empirical data, re-fit the model to the simulated data, and record whether each parameter is identified as significant. We then repeat this process enough times to quantify the power for each of our parameters.

    • Write the pseudo-code for how you would expect to test the power of detecting the effects of Canopy_closure, Cv_juvenile_density, and Termite_nest_density on Dm_seedling_density from the seedling dataset. Note: use 1000 replicates to estimate the power.
    • Carry out the power analysis. (One possible implementation follows this list.)
    • Do you have equal confidence in your ability to detect an effect for all of the parameters? Remember that a reasonable power is 0.8.
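Once you have written your own pseudo-code, compare it with the sketch below, which assumes the data file is called seedling.csv (a hypothetical name; adjust it, and the column names if needed, to match your copy of the data).

## Fit the 'true' model to the empirical data
seedling <- na.omit(read.csv("seedling.csv"))
fit <- lm(Dm_seedling_density ~ Canopy_closure + Cv_juvenile_density +
            Termite_nest_density, data = seedling)

set.seed(3)
n_reps <- 1000
## One row per replicate, one column per slope parameter
sig_mat <- matrix(NA, nrow = n_reps, ncol = 3,
                  dimnames = list(NULL, names(coef(fit))[-1]))

for (i in 1:n_reps) {
  ## Simulate a response: deterministic part (fitted values) plus
  ## Normal noise with the model's residual standard deviation
  sim_dat <- seedling
  sim_dat$y_sim <- fitted(fit) + rnorm(nrow(seedling), sd = sigma(fit))
  refit <- lm(y_sim ~ Canopy_closure + Cv_juvenile_density +
                Termite_nest_density, data = sim_dat)
  ## Record which slopes are significant at the 0.05 level
  sig_mat[i, ] <- summary(refit)$coefficients[-1, "Pr(>|t|)"] < 0.05
}

## Power for each parameter: the proportion of replicates in which
## its effect was detected
colMeans(sig_mat)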

Fitting Non-Linear Models

All of the models we’ve been working with so far are categorised as ‘linear’ deterministic functions. We say linear not because of the shape of the relationship, but because the deterministic part of the model is a linear combination of the regression parameters, \(\beta_n\).


\(\mu = \beta_0 + \beta_1x_{1} + \beta_2x_{2}\)


Even if we add higher order polynomial terms, or fit GLMs with link functions that produce non-linear trends, the regression parameters still combine linearly. Biological systems are not always linear, and it is important to become familiar with a broad range of deterministic functions. Here we will learn how to fit non-linear models to data in R.

In terms of the computer code required, fitting non-linear models in R is only slightly more challenging than fitting linear regressions. The nls() function allows fitting of non-linear relationships between a response variable and one or more explanatory variables using non-linear least squares. To learn how to do this we will use records of individual deer’s jaw bone lengths and ages from the R Book (Crawley, 2012). This dataset comprises two variables:

  • age: The animal’s age.
  • bone: The length of the jaw bone.

    • Import the jaw_bone.csv dataset and inspect the data by creating a scatter plot depicting jaw bone length as a function of age.
    • Describe the relationship between deer jaw bone length and age.
    • Fit a linear regression model predicting bone from age, using the gls() function from the nlme package. Fit the model using maximum likelihood.
    • Inspect the summary of the fitted model and plot the residuals against the fitted values. Does this fit look reasonable? (A sketch of these steps follows this list.)
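A minimal sketch of these steps, assuming jaw_bone.csv sits in your working directory with columns age and bone as described above:

## gls() lives in the nlme package
library(nlme)

jaws <- read.csv("jaw_bone.csv")

## Scatter plot of jaw bone length as a function of age
plot(bone ~ age, data = jaws, xlab = "Age", ylab = "Jaw bone length")

## Linear regression fitted by maximum likelihood
fit_lin <- gls(bone ~ age, data = jaws, method = "ML")
summary(fit_lin)

## Residuals against fitted values to check the fit
plot(fitted(fit_lin), resid(fit_lin),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)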

Polynomial Functions

Polynomial functions have the general form \(y = \sum _{i=0}^{n} \beta_ix^i\). For instance:

  • Linear: \(f(x) = \beta_0 + \beta_1x\)

  • Quadratic: \(f(x) = \beta_0 + \beta_1x+ \beta_2x^2\)

  • Cubic: \(f(x) = \beta_0 + \beta_1x+ \beta_2x^2+ \beta_3x^3\)

  • and so on…

Polynomials benefit from being easy to understand, easy to reduce, easy to extend to higher orders, and able to fit arbitrarily complex data. Before we attempt fitting a non-linear model to these data, it’s worth checking whether a polynomial model would fit our data better than the linear model above. Again, it’s important to note that polynomial deterministic functions still take the same general form as the linear model we just fit, despite the fact that the relationship is not a straight line.

    • Fit a second order polynomial (i.e., quadratic) regression model predicting bone from age and I(age^2), using the gls() function. Fit the model using maximum likelihood.
    • Fit a third order polynomial (i.e., cubic) regression model predicting bone from age, I(age^2), and I(age^3), using the gls() function. Fit the model using maximum likelihood.
    • Plot the data and the fitted polynomial regression lines. Do these fits look better than a linear model?
    • Plot the data and the fitted polynomial regression lines again, but this time ensure that the ages used to generate the predicted values range from the minimum observed age to 100 (i.e., extrapolate beyond the sampled values). Make sure the axis limits extend far enough to see how the fitted models do at predicting jaw bone size beyond the observed ages. Do these fits still look like good options? (A sketch follows this list.)
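A sketch of the polynomial fits and the extrapolated plot, reusing jaws from the previous sketch:

## Quadratic and cubic fits by maximum likelihood
fit_quad <- gls(bone ~ age + I(age^2), data = jaws, method = "ML")
fit_cub  <- gls(bone ~ age + I(age^2) + I(age^3), data = jaws, method = "ML")

## Predictions from the minimum observed age out to age 100
ages <- seq(min(jaws$age), 100, length.out = 200)

plot(bone ~ age, data = jaws, xlim = c(0, 100),
     ylim = c(0, max(jaws$bone) * 1.5),   # widen as needed
     xlab = "Age", ylab = "Jaw bone length")
lines(ages, predict(fit_quad, newdata = data.frame(age = ages)), lty = 1)
lines(ages, predict(fit_cub,  newdata = data.frame(age = ages)), lty = 2)
legend("topleft", legend = c("Quadratic", "Cubic"), lty = 1:2)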

Non-Linear Functions

As we just saw, a challenge of working with higher order polynomial functions is that they do not level off as \(X \to \infty~\mathrm{or}~-\infty\). As a result, extrapolations are often very unrealistic (would you expect a deer’s jaw size to shrink to 0 or grow without limit as it ages?). Consequently, higher order polynomials can be unstable and difficult to interpret. We could try piecewise polynomials, but really what we need here is a non-linear model with asymptotic behaviour.

A good option for these data would be a Michaelis-Menten function.
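Writing its two parameters as \(a\) and \(b\), the Michaelis-Menten function takes the form:


\(f(x) = \frac{ax}{b + x}\)


Here \(a\) is the asymptote that \(f(x)\) approaches as \(x \to \infty\), and \(b\) is the value of \(x\) at which \(f(x)\) reaches half of \(a\). These interpretations make it straightforward to read reasonable starting values for the parameters straight off a scatter plot.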

    • Fit a Michaelis-Menten model to the data using the nls() function. Remember to provide reasonable starting guesstimates of the parameter values.
    • Re-create the same two plots from the previous exercise, but this time overlay the fitted Michaelis-Menten model. Which of these models do you expect to make more accurate predictions?
    • Use AIC based model selection to find out which of the four models we just fit is the most appropriate for this dataset. (A sketch follows this list.)
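A sketch of the Michaelis-Menten fit and the AIC comparison, reusing the objects from the previous sketches. The starting values are rough guesstimates (the asymptote \(a\) near the largest observed bone length, and \(b\) a small age); adjust them if nls() fails to converge.

## Michaelis-Menten model fitted with non-linear least squares
fit_mm <- nls(bone ~ a * age / (b + age), data = jaws,
              start = list(a = max(jaws$bone), b = 2))

## Overlay the fitted curve on the extrapolated plot from before
ages <- seq(min(jaws$age), 100, length.out = 200)
lines(ages, predict(fit_mm, newdata = data.frame(age = ages)), lty = 3)

## Compare all four models (all fitted by ML, so AICs are comparable)
AIC(fit_lin, fit_quad, fit_cub, fit_mm)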

References

Bolker, B. M. (2008). Ecological models and data in R. Princeton University Press.

Crawley, M. J. (2012). The R book. John Wiley & Sons.

Ismail, S. A., Ghazoul, J., Ravikanth, G., Kushalappa, C. G., Shaanker, R. U., & Kettle, C. J. (2014). Forest trees in human modified landscapes: ecological and genetic drivers of recruitment failure in Dysoxylum malabaricum (Meliaceae). PLoS One, 9(2), e89437.

Vicente, J., Höfle, U., Garrido, J. M., Fernández-De-Mera, I. G., Juste, R., Barral, M., & Gortazar, C. (2006). Wild boar and red deer display high prevalences of tuberculosis-like lesions in Spain. Veterinary research, 37(1), 107-119.