Course website for Biol 520C: Statistical modelling for biological data

Course Description

Advanced course on biostatistics focused on combining data with models to generate mechanistic descriptions of biological patterns. Building from simple linear regression, the course teaches regression methods for handling the most routinely encountered features in biological data. Emphasis is placed on understanding and interpreting the analyses in biological terms and linking analyses to theoretical or applied questions. The course is divided into core lectures and practicals.

Course Objectives

This course combines lectures on theory and concepts with time spent practising statistical tools in R-based software packages. The course is designed to equip students with the tools and knowledge they need to perform a variety of regression-based analyses that can model common features of biological data. For each topic, there will be a core lecture module and a lab-based practical module. In the lectures, students will learn the theory behind core modelling tools including multiple linear regression, mixed effects models, generalised linear models, and generalised least squares, while the practicals are intended to provide a pathway that students can follow over the longer term as their skills develop.

Learning Outcomes:

Fit regression models using the principle of maximum likelihood.
Analyze and interpret residuals from regression models to assess model fit and determine model improvement strategies.
Fit and interpret models that include both fixed and random effects.
Apply regression techniques to fit models with non-Gaussian error distributions.
Fit and assess non-linear models, and interpret model parameters.
Identify and manage non-independent data structures, including those influenced by phylogenetic inertia and temporal/spatial autocorrelation.
Apply AIC-based model selection to compare and evaluate competing statistical models.
Build and implement data simulation models and conduct power analyses.
Integrate research questions, study design, and analytical approaches.

Course Format

Lectures: Lectures will cover the core concepts of the course. Lecture slides will be posted on the course website the evening prior to the lecture. Students are encouraged to take notes and to ask questions in the lectures. Lectures will be given on Zoom, and a link to the lecture is available in the Canvas navigation sidebar. All lectures will be recorded and made available to students. Calculators will be useful in the lectures but are not mandatory. Questions will be used throughout the lectures to reinforce key concepts. No grades are assigned to these questions, but students are encouraged to answer them to the best of their ability to help track their progress.

Practicals: The practicals use structured tutorials to guide students in using the open-source software program R to apply the methods learned in the lectures to data. The lectures will cover the material required for students to complete the practical assignments; however, the two are designed to be complementary, so not all the material in the practicals will be covered in the lectures and vice versa. The practicals also provide training in communicating the results of statistical analyses, with a focus on reproducibility. Assessment of student learning will be based on submitted assignments (see detailed schedule below).

Core project: The core research project is a major component of the course and is designed to train students in data analysis, interpretation, and scientific communication using best practices in open science. The project is divided into two components that mirror a typical analytical workflow: 1) Development of an initial hypothesis and expected outcome(s), and 2) A verbal presentation of the analyses and findings.

Pre-course Checklist

The material learned in most entry-level statistics courses provides important background and context for the topics covered in this course. It is recommended that you re-familiarise yourself with this material before the course (e.g., understand the differences between means, medians, and modes, variances, standard deviations, etc.).
Engaging fully with the course material requires a basic familiarity with R. Working through the book 'R for Beginners' by Paradis (available at https://cran.r-project.org under Documentation > Contributed) before beginning the course is a good idea if you want to improve your R or coding skills. Alternatively, the swirl R package (available at https://swirlstats.com/students.html) teaches R programming and data science interactively, at your own pace, and right in the R console.
If you intend to use your personal computer during the course, it is a good idea to install R and RStudio before the course begins.

Optional Material

There is no textbook for this course, but if students are interested in expanding their knowledge beyond what is covered in the course, the following textbooks are recommended:

Bolker, B. M. (2008). Ecological models and data in R. Princeton University Press.
Hilborn, R., & Mangel, M. (1997). The ecological detective: Confronting models with data. Princeton University Press. ISBN 978-0691034973.
Zuur, A., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects models and extensions in ecology with R. Springer Science & Business Media.

Course Evaluation

The assignments and evaluation for this course are structured along two distinct but complementary lines: i) guided practical assignments; and ii) a core research project. The practical assignments are designed to provide students with the opportunity to learn how to apply the methods learned in the lectures to real data using the statistical software R. The core research project is designed to train students in data analysis and presentation using best practices in open science. The core project is divided into two components that mirror a typical analytical workflow: 1) an initial hypothesis and expected outcome(s); and 2) a verbal presentation of the analyses and findings.

Component	Weight	Due
Hypothesis and expected outcome(s)	5%	Week 4
Presentation	30%	Weeks 12 & 13
Practicals (10)	25%	Due on ~weekly basis (see schedule below)
Participation from practicals	10%	Due on ~weekly basis (see schedule below)
Take-home exam	30%	Week 14
Total	100%

Practical Assignments and Participation (35%)

Beginning in the first week of class, students will complete practical assignments on an approximately weekly basis (see schedule below). A total of ten practical assignments will be completed throughout the course. These practicals are designed to provide hands-on experience applying the statistical methods introduced in lectures to real biological datasets using the statistical software R. The primary purpose of the practical assignments is to ensure that students develop the practical skills required to conduct effective statistical analyses and communicate results clearly and reproducibly. The skills developed through these assignments are directly transferable to students’ thesis work and independent research projects. Practical assignments, along with the associated datasets, will be hosted on the course GitHub webpage. All submissions must be completed electronically via Canvas. Lectures will be held on Mondays and Wednesdays. By the end of the Wednesday lecture, all material required to complete that week’s practical will have been covered. Practical assignments are due before the start of the following Wednesday lecture.

Grading: Each practical assignment is worth 3.5% of the final course grade, for a total of 35% across all ten practicals. Of this:

1% is awarded for submitting the assignment on time, regardless of correctness. This component constitutes the participation grade for the course.
The remaining 2.5% is based on the accuracy of the responses, the quality and clarity of the analysis, well-structured and well-commented code, and the quality of figures and data presentation.

Solutions to each practical assignment will be posted after the submission deadline. Late practicals will be accepted after solutions have been released; however, late submissions will not receive the participation component of the grade. No extensions will be granted for practical assignments except under documented exceptional circumstances.

Hypothesis and expected outcome(s) (5%)

By the end of Week 4, students must submit a one-page proposal describing the study system they will use for their course project. The proposal should clearly state the initial hypothesis, the expected outcome(s), and the dataset(s) that will be used to address the research question. This assignment is intended to function as a light form of preregistration. Its purpose is to encourage students to articulate their analytical questions, hypotheses, and expectations a priori, before conducting formal analyses. By doing so, this assignment helps distinguish between confirmatory analyses (testing pre-specified hypotheses) and exploratory analyses, and reduces the risk of post hoc hypothesis generation, fishing expeditions, or p-hacking. Students are not penalised if the results ultimately do not support the initial hypothesis. Instead, the emphasis is on clearly documenting the original expectations and analytical intent prior to seeing the results.

Grading: This assignment is assessed on a completion basis. All proposals that meet the submission requirements will receive full credit (5/5). Late submissions will receive a grade of 0. Submission of this assignment is a prerequisite for all subsequent core project components. No further project-related assignments will be accepted until the hypothesis and expected outcome(s) proposal has been submitted.

Presentations (30%)

Students will complete a core data analysis assignment as the capstone component of the course. Each student will deliver a 15-minute conference-style presentation based on their analyses. This assignment is designed to provide students with an opportunity to apply the statistical modelling tools introduced in lectures and practicals in an unguided setting, analogous to analysing data outside of a classroom environment. For this presentation, students will apply the modelling approaches covered in the course to an empirical dataset of their choosing (see Datasets below) and present their work in a professional scientific format. Presentations should be organised into the following sections:

Presentation structure

Introduction: Provide a brief description of the study system and clearly state the research question(s) being addressed. Relevant background literature should be cited. Recommended length: ~2–3 minutes.

Methods: Briefly describe how the data were collected and what variables are included. Provide a detailed description of the analytical workflow applied to the data, including model structure, assumptions, and justification for methodological choices. All relevant literature and statistical packages should be cited. The description should be sufficiently detailed that the analysis could be reproduced by others with access to the data. Recommended length: ~6–7 minutes.

Results: Present the statistical results using appropriate figures and tables. Results should be explicitly contrasted with the predictions stated in the Hypothesis and Expected Outcome(s) assignment, noting where outcomes align with or differ from initial expectations. Students are not evaluated on whether results support their initial hypotheses, but on the clarity, appropriateness, and transparency of their analytical decisions and interpretation. Recommended length: ~3–5 minutes.

Discussion / Conclusion: Summarise the key findings and place them in a biological context. Unexpected or null results should be discussed where relevant. Recommended length: ~1–2 minutes.

References: Include references for all data sources, methods, and background literature.

Note: The recommended section lengths are guidelines rather than strict requirements. The primary emphasis of the presentation should be on the statistical modelling itself, including rationale for analytical choices, assumptions of the methods used, and potential sources of bias or uncertainty. Overly long introductions or discussions should be avoided.

Reproducibility appendix

In addition to the in-class presentation, students must submit a reproducibility appendix generated using R Markdown (PDF, HTML, or Word format). This document should detail every step of the analysis presented, including all data processing, model fitting, and figure generation. Code should be clearly commented and structured to allow full reproducibility of the results shown in the presentation.

Datasets: Students may select from a set of pre-approved datasets provided on the course website. Students are strongly encouraged to use their own data where possible; however, instructor approval is required. In general, there are no restrictions on data type, provided that the statistical methods covered in the course are appropriate. Datasets requiring specialised bioinformatics or preprocessing pipelines (e.g., RNA-seq pipelines) are not permitted. Students intending to use their own data are encouraged to discuss suitability with the instructor as early as possible.

Submission: Presentation slides must be submitted in advance of the presentation date (exact deadline to be announced later in the semester). The order of presentations will be determined randomly in class on the day of the presentations.

Grading: The presentation is worth 20% of the final course grade and will be graded using a rubric provided in advance. The reproducibility appendix is worth 10% of the final course grade.

Take-Home Final Exam (30%)

The take-home final exam will consist of a set of questions designed to assess students’ understanding of the core statistical concepts covered in the course. The exam will be comprehensive, and material from both lectures and practical assignments may be tested. The exam will emphasise statistical reasoning, interpretation, and model evaluation, rather than rote calculation. Questions may require students to interpret model outputs, assess assumptions, justify analytical choices, and evaluate alternative modelling approaches.

Grading: The final exam is worth 30% of the final course grade.

Lecture Outline

(Approximate schedule of topics covered in lectures)

Week	Lecture Topics	Practical Assignment
1	Course introduction; Regression refresher	Practical 01: Introduction to R as a programming language
2	Probability theory; Likelihood; Maximum likelihood	Practical 02: Deterministic and stochastic models, simple linear regression, and maximum likelihood
3	Multiple linear regression; Parameter interactions; Interpreting residuals	Practical 03: Multiple linear regression, Residuals
4	Mixed effects models; Model Selection; Information criterion	Practical 04: Data visualisation; Hypothesis and predictions due
5	Model Selection; Model averaging; Heteroskedasticity	Practical 05: Mixed effect models, model selection, and model averaging
6	Temporal autocorrelation; Spatial Autocorrelation	Practical 06: Heteroskedasticity; Autocorrelation Function; Models with autocorrelated errors
7	Phylogenetic inertia; Phylogenetically controlled regression	Practical 07: Spatial and Phylogenetic Autocorrelation
8	Non-Gaussian residuals; Generalised linear regression; Poisson and logistic regression	Practical 08: Generalised Linear Models for Count Data
9	Zero-inflated data; Non-linear modelling; Deterministic functions	Practical 09: GLMs for Binary and Proportion Data
10	Stochastic simulation and power analysis; Course Overview	Practical 10: Stochastic Simulations, Power Analysis, Non-linear models
11	Independent Project Work	No Practical
12	Student presentations	No Practical
13	Student presentations	Term presentations due (30%)