Everything we have done so far has focused on fitting/selecting models and estimating their parameters from a given dataset. This can be seen as 'inverse' modelling: we have the data in hand and we backtrack to try to identify the deterministic and stochastic processes that generated it. Simulation, on the other hand, can be seen as 'forward' modelling. With forward modelling we don't start with data, but with a model. If we pick and/or build a model to describe our system, our goal is to explore what patterns we might expect to see in the data. Because the computer code required to generate simulations is easier to learn than high-level mathematics, simulation studies can put practicing biologists on an equal footing with even the most experienced mathematicians. This makes simulations powerful tools for understanding biological systems, generalising the results of our analyses, and making predictions.
If we can write down a model that describes our system, or are prepared to make assumptions about what that model might look like, then we can generate simulated data. These simulated data can help us understand how a system might be expected to respond to conditions that we cannot or did not measure, or show us what patterns we would expect to see. If we chain together simulations from multiple models we can generate incredibly rich and complex descriptions of biological systems. There is an important caveat that comes with designing simulation studies, however. It is critical to remember that the more moving pieces a simulation setup has, the harder the outcomes can be to understand (and this is even more true when weaving together non-linear models). Consequently, simple, carefully tailored simulations are often more informative than complex simulations that we can't keep track of.
'Stochastic' simulation is a special type of simulation that relies on computational algorithms to randomly sample values from probability distributions to emulate a system's stochasticity (e.g., rnorm(), runif(), rbinom(), etc.). Stochastic simulation is also regularly referred to as Monte Carlo sampling. In this practical we will focus on how to use stochastic simulation as a tool for understanding and predicting biological systems.
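As a minimal illustration, each of the functions mentioned above draws random values from a named probability distribution (the arguments below are arbitrary examples; the seed is set only so the draws are reproducible):

```r
set.seed(42)  # make the random draws reproducible

rnorm(n = 5, mean = 0, sd = 1)        # 5 draws from a Normal(0, 1)
runif(n = 5, min = 0, max = 10)       # 5 draws from a Uniform(0, 10)
rbinom(n = 5, size = 1, prob = 0.3)   # 5 Bernoulli trials with success probability 0.3
```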
Biologists typically use simulations for two main tasks:

- Making predictions: exploring how a system would be expected to behave, and what data we would expect to see, under a given model.
- Power analysis: working out how much data would be needed to reliably detect an effect of interest.

We will explore both of these concepts in this practical.
We're going to work with a dataset collected by Vicente et al. (2006). They analysed the distribution of tuberculosis-like (Tb) lesions in wild boar (Sus scrofa) to assess their potential importance for the persistence of tuberculosis in south-central Spain. In class we saw that there was a positive relationship between body size and Tb prevalence. The pieces required to fit a logistic regression to these data are:
\(Y_i \sim Binomial(1, \pi_i) \quad \quad E(Y_i) = \pi_i \quad \mathrm{and} \quad var(Y_i) = \pi_i \times(1-\pi_i)\)
\(\pi_i = \frac{e^{\beta_0 + \beta_1 \times \mathrm{Length}_i}} {1 + e^{\beta_0 + \beta_1 \times \mathrm{Length}_i}}\)

We're going to pair this model with stochastic simulations to get an idea of what proportion of individuals in a population might be infected depending on the mean body size of the population. We will also look at how we might use the information from the model to inform management decisions to control Tb in this population.
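As a rough sketch of how this might look in R, assuming a data frame called boar with columns Tb (0/1 infection status) and Length (both names are placeholders for whatever the boar dataset actually uses), we could fit the logistic regression with glm() and then use rnorm(), predict(), and rbinom() to simulate infection status for a hypothetical population:

```r
# Fit the logistic regression described above (column names are assumptions)
m_tb <- glm(Tb ~ Length, family = binomial, data = boar)

# Simulate a hypothetical population of 1000 boar
set.seed(1)
n <- 1000
lengths <- rnorm(n, mean = mean(boar$Length, na.rm = TRUE),
                 sd = sd(boar$Length, na.rm = TRUE))          # simulated body lengths
p_inf   <- predict(m_tb, newdata = data.frame(Length = lengths),
                   type = "response")                          # P(infected) per individual
status  <- rbinom(n, size = 1, prob = p_inf)                   # stochastic infection status

mean(status)   # expected proportion of the population that is infected
```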
Let's say that to control Tb in a population of wild boar consisting of 1,000 individuals we are going to undertake an antibiotic treatment program. Testing wild animals for Tb can be difficult, expensive and unreliable, and there are insufficient funds to treat the entire population. That means we can't test each individual and treat only the Tb-positive animals. Because Tb infection status correlates positively with body size, a reasonable recommendation would be to treat only animals above a certain size threshold. But what should that threshold be? One way to answer that question is via simulation.
Note: In reality, Tb is very infectious, takes a long time to treat with regular antibiotic treatments, and vaccines are not always effective. Tb is therefore usually controlled by culling populations to prevent further spread. This is a simulation though, so we can be kind to our simulated boar.
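Continuing with the hypothetical objects from the sketch above (lengths and status), one way to explore candidate size thresholds is to ask, for each threshold, what fraction of the simulated infected animals would receive treatment and how many animals we would have to treat:

```r
# Candidate size thresholds spanning the simulated body lengths
thresholds <- seq(min(lengths), max(lengths), length.out = 50)

# Proportion of infected boar that would be treated at each threshold
prop_infected_treated <- sapply(thresholds, function(thr) {
  treated <- lengths >= thr
  sum(status[treated]) / sum(status)
})

# Total number of animals treated at each threshold (i.e., the cost)
n_treated <- sapply(thresholds, function(thr) sum(lengths >= thr))

plot(thresholds, prop_infected_treated, type = "l",
     xlab = "Size threshold", ylab = "Proportion of infected boar treated")
```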
Power analysis, in a traditional sense, means identifying the minimum sample size needed to detect a real effect if one is present. It is traditionally associated with design-based inference and pairwise hypothesis testing (ANOVAs, t-tests, etc.). In model-based inference each parameter has its own effect, so the concept translates, but the tools required are different. Power analysis in our context involves a special kind of simulation study aimed at exploring how much data you would need in order to get reasonably accurate estimates of your parameters, detect significance of parameters with true effects, and/or detect differences between groups.
We will explore power analysis for a multiple linear regression model using the data collected by Ismail et al. (2014) to explore the influence of forest degradation on the recruitment of Dysoxylum malabaricum. As we saw in Practical 03, D. malabaricum is a large canopy tree species found in highly fragmented forest patches within a complex agro-forest landscape of the Western Ghats biodiversity hotspot, South India. The measures of degradation include:
- Canopy_closure: The percentage of closed canopy.
- Cv_juvenile_density: The density of Clerodendrum viscosum, an early successional, light-demanding shrub that establishes itself well in degraded forests.
- Termite_nest_density: The number of arboreal termite nests attached to branches and trees, which has been shown to be an effective indicator of forest disturbance, because termites are sensitive to forest disturbance and to canopy loss.

Ismail et al. collected data on the densities of seedlings along with these indicators of degradation in 17 forest patches, so it is possible that the small sample size would limit the power to detect significant effects in these data. But what was the power of the model fit to these data? To answer that question we will rely on an approach that involves simulating data from an empirically defined model to re-create the study post hoc. This is termed 'empirically guided power analysis'. With this type of approach, instead of defining a model prior to collecting data as we saw in the lectures, we assume that the fitted model is true and are interested in the power of our fitted model given our empirical sample size. To test the power in this context, we use the fitted model and random number generation to simulate data that look like the empirical data, re-fit the model to the simulated data, and check whether the parameters are identified as significant or not. We then repeat this process a sufficient number of times to allow us to quantify the power for each of our parameters.
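A minimal sketch of this procedure, assuming the seedling data live in a data frame called seedlings with the variables named above and no missing values (the object name is a placeholder), and using an ordinary lm() fit, might look like this:

```r
# Fit the 'true' model to the empirical data
m_obs <- lm(Dm_seedling_density ~ Canopy_closure + Cv_juvenile_density +
              Termite_nest_density, data = seedlings)

n_rep <- 1000
sig <- matrix(NA, nrow = n_rep, ncol = 3,
              dimnames = list(NULL, c("Canopy_closure", "Cv_juvenile_density",
                                      "Termite_nest_density")))

set.seed(123)
for (i in seq_len(n_rep)) {
  d_sim <- seedlings
  d_sim$Dm_seedling_density <- simulate(m_obs)[[1]]  # new response: fitted values + Normal noise
  m_sim <- update(m_obs, data = d_sim)               # re-fit the same model to the simulated data
  p_vals <- summary(m_sim)$coefficients[-1, "Pr(>|t|)"]
  sig[i, ] <- p_vals < 0.05                          # was each slope detected as significant?
}

colMeans(sig)   # estimated power for each predictor
```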
Estimate the power to detect the effects of Canopy_closure, Cv_juvenile_density, and Termite_nest_density on Dm_seedling_density from the seedling dataset. Note: use 1000 replicates to estimate the power.

All of the models we've been working with so far are categorised as 'linear' deterministic functions. We say linear not because of the shape of the relationship, but because the model is a linear combination of the regression parameters, \(\beta_n\).
Even if we add higher order polynomial terms, or fit GLMs with link functions that produce non-linear trends, the regression parameters still combine linearly. Biological systems are not always linear, and it is important to become familiar with a broad range of deterministic functions. Here we will learn how to fit non-linear models to data in R.
In terms of the computer code required, fitting non-linear models in R is only slightly more challenging than fitting linear regressions. The nls() function allows fitting of non-linear relationships between a response variable and one or more explanatory variables using non-linear least squares. To learn how to do this we will use records of the jaw bone lengths and ages of individual deer from the R Book by Crawley (2012). This dataset comprises two variables:
- age: The animal's age.
- bone: The length of the jaw bone.

Fit a linear model predicting bone from age, using the gls function. Fit the model using maximum likelihood.

Polynomial functions have the general form \(y = \sum _{i=0}^{n} \beta_ix^i\). For instance:
Linear: \(f(x) = \beta_0 + \beta_1x\)
Quadratic: \(f(x) = \beta_0 + \beta_1x+ \beta_2x^2\)
Cubic: \(f(x) = \beta_0 + \beta_1x+ \beta_2x^2+ \beta_3x^3\)
and so on…
Polynomials benefit from being easy to understand, easy to reduce, easy to extend to higher orders, and they can fit arbitrarily complex data. Before attempting to fit a non-linear model to these data, it's worth checking whether a polynomial model would fit our data better than the linear model above. Again, it's important to note that polynomial deterministic functions still take the same general form as the linear model we just fit, despite the fact that the relationship is not a straight line.
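For example, a sketch of such a comparison, assuming the deer data are stored in a data frame called jaws with columns age and bone (the object name is an assumption), could use gls() from the nlme package with maximum likelihood and compare the fits with AIC:

```r
library(nlme)

# Linear, quadratic, and cubic fits, all with maximum likelihood so they are comparable
m1 <- gls(bone ~ age,                       data = jaws, method = "ML")
m2 <- gls(bone ~ age + I(age^2),            data = jaws, method = "ML")
m3 <- gls(bone ~ age + I(age^2) + I(age^3), data = jaws, method = "ML")

AIC(m1, m2, m3)   # lower AIC = better supported model
```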
Fit a model predicting bone from age and I(age^2), using the gls function. Fit the model using maximum likelihood.

Fit a model predicting bone from age, I(age^2), and I(age^3), using the gls function. Fit the model using maximum likelihood.

As we just saw, a challenge of working with higher order polynomial functions is that they do not level off as \(X \to \infty~\mathrm{or}~-\infty\). As a result, extrapolations are often very unrealistic (would you expect a deer's jaw size to shrink to 0 or grow exponentially as it ages?). Consequently, higher order polynomials can be unstable and difficult to interpret. We could try piecewise polynomials, but really what we need here is a non-linear model with asymptotic behaviour.
A good option for these data would be a Michaelis-Menten function.
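A common parameterisation is \(f(x) = \frac{ax}{b + x}\), where \(a\) is the asymptote and \(b\) is the value of \(x\) at which half of the asymptote is reached. A sketch of fitting this curve with nls(), again assuming the hypothetical jaws data frame from above and using rough guesstimates for the starting values, might look like this:

```r
# Michaelis-Menten fit; the starting values are rough guesses, not definitive choices
mm <- nls(bone ~ a * age / (b + age),
          data  = jaws,
          start = list(a = 120, b = 3))  # a ~ asymptotic bone length, b ~ half-saturation age

summary(mm)

# Overlay the fitted curve on the data
plot(bone ~ age, data = jaws)
age_seq <- seq(min(jaws$age), max(jaws$age), length.out = 200)
lines(age_seq, predict(mm, newdata = data.frame(age = age_seq)))
```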
Fit a Michaelis-Menten model predicting bone from age, using the nls() function. Remember to provide reasonable starting guesstimates of the parameter values.

Bolker, B. M. (2008). Ecological models and data in R. Princeton University Press.
Crawley, M. J. (2012). The R book. John Wiley & Sons.
Ismail, S. A., Ghazoul, J., Ravikanth, G., Kushalappa, C. G., Shaanker, R. U., & Kettle, C. J. (2014). Forest trees in human modified landscapes: ecological and genetic drivers of recruitment failure in Dysoxylum malabaricum (Meliaceae). PLoS One, 9(2), e89437.
Vicente, J., Höfle, U., Garrido, J. M., Fernández-De-Mera, I. G., Juste, R., Barral, M., & Gortazar, C. (2006). Wild boar and red deer display high prevalences of tuberculosis-like lesions in Spain. Veterinary research, 37(1), 107-119.