Project 1: Developing species distribution models to predict geographical distributions of species of interest in infectious disease

**Supervisors: Professor Christl Donnelly (Professor of Applied Statistics, Department of Statistics) and Dr Sarah Hayes (Postdoctoral Researcher, Department of Statistics)**

Description and Outcome

Species distribution models (SDMs) use computer algorithms to predict the distribution of species across geographical space and time, most often using environmental data to inform these predictions. They have traditionally been used in ecology, but have recently been applied to other areas, such as infectious disease epidemiology, where they have been used to predict the geographical distribution of pathogens.

Understanding the geographical distribution of species can provide information that enhances our understanding of infectious disease transmission and control. Across south-east Tanzania, almost half of the animal rabies cases observed occur in jackals, which is an unusually high proportion. Information on the distribution of jackals across this area could help us to understand the role they play in maintaining rabies in this region. For highly pathogenic avian influenza, understanding the spatial distribution of the virus in wild bird hosts could aid in planning targeted surveillance strategies.

In this project, we will develop species distribution models to predict either the distribution of jackals across south-east Tanzania[1] or the spatial distribution of avian influenza in wild birds across Great Britain[2]. The choice of scenario will be based on both student preference and data availability. Both options will involve working with spatial data and implementing an existing SDM algorithm, such as maximum entropy models or Bayesian additive regression trees[3, 4]. Experience of working with GIS and programming in R would be beneficial, but is not essential.

Bibliography

1. Serva, Davide, et al. "A shifting carnivore’s community: habitat modeling suggests increased overlap between the golden jackal and the Eurasian lynx in Europe." Frontiers in Ecology and Evolution 11 (2023): 1165968.

2. Belkhiria, Jaber, Moh A. Alkhamis, and Beatriz Martínez-López. "Application of species distribution modeling for avian influenza surveillance in the United States considering the North America migratory flyways." Scientific reports 6.1 (2016): 33161.

3. Li, Xinhai, and Yuan Wang. "Applying various algorithms for species distribution modelling." Integrative zoology 8.2 (2013): 124-135.

4. Guisan, Antoine, and Niklaus E. Zimmermann. "Predictive habitat distribution models in ecology." Ecological modelling 135.2-3 (2000): 147-186.

Project 2: The median-of-medians estimator in multivariable two-sample Mendelian randomisation

**Supervisor: Professor Frank Windmeijer (Professorial Research Fellow, Department of Statistics)**

Description

In Mendelian randomisation (MR), genetic markers are used as instrumental variables to identify and estimate causal effects of modifiable phenotypes on outcomes, for example, the effect of body weight on blood pressure. Multivariable models aim to determine the effects of multiple exposures on an outcome at the same time. A recently proposed estimator is the median-of-medians estimator, which is robust to some of the genetic markers being invalid instruments. The aim of the project is to evaluate the performance of this estimator in the two-sample MR setting and to compare it to other robust methods, such as the least absolute deviations estimator. Both simulation methods and data applications, using the R package TwoSampleMR, will be used for this assessment.
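To give a feel for why median-based estimators tolerate some invalid instruments, here is a minimal sketch of a much simpler relative: the simple median of per-instrument ratio estimates for a single exposure. It is not the multivariable median-of-medians estimator studied in the project, and all numbers (the assumed true effect, the associations, and the pleiotropic effects) are synthetic values chosen for illustration.

```python
from statistics import mean, median

TRUE_EFFECT = 0.5  # assumed causal effect used to build the synthetic data

# Synthetic marker-exposure associations (illustrative numbers, not real data).
beta_exposure = [0.10, 0.20, 0.15, 0.30, 0.25, 0.12, 0.18]

# Direct (pleiotropic) effects of each marker on the outcome.
# Non-zero entries make the last two markers invalid instruments.
pleiotropy = [0.0, 0.0, 0.0, 0.0, 0.0, 0.12, 0.09]

# Marker-outcome associations implied by the causal effect plus pleiotropy.
beta_outcome = [TRUE_EFFECT * bx + a for bx, a in zip(beta_exposure, pleiotropy)]

# Per-instrument ratio estimates of the causal effect.
ratios = [by / bx for by, bx in zip(beta_outcome, beta_exposure)]

median_estimate = median(ratios)  # unaffected while most instruments are valid
mean_estimate = mean(ratios)      # pulled away from the truth by invalid ones

print(f"median of ratios: {median_estimate:.3f}")
print(f"mean of ratios:   {mean_estimate:.3f}")
```

In this noise-free toy setting, five of the seven ratios equal the assumed effect, so the median recovers it while the mean is biased upwards by the two invalid markers.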

Outcome

A paper describing the results, including estimation results for a substantive application.

Bibliography

Grant, AJ and S Burgess, (2021), Pleiotropy robust methods for multivariable Mendelian randomization, Statistics in Medicine 40, 5813-5830.

Liang, X, E Sanderson and F Windmeijer, (2022), Selecting Valid Instrumental Variables in Linear Models with Multiple Exposure Variables: Adaptive Lasso and the Median-of-Medians Estimator, arXiv:2208.05278.

Sanderson, E, G Davey Smith, F Windmeijer, and J Bowden, (2019), An Examination of Multivariable Mendelian Randomization in the Single-Sample and Two-Sample Summary Data Settings, International Journal of Epidemiology, 48, 713–727.

TwoSampleMR, R-package, https://mrcieu.github.io/TwoSampleMR/index.html.

Project 3: Within the Research Area of Bioinformatics

**Supervisor: Professor Jotun Hein (Professor of Bioinformatics, Department of Statistics)**

Descriptions

The projects below are suggestions. Interns are welcome to propose their own project within Bioinformatics. The specific project will be allocated through discussions to identify areas of interest.

Inferring Recombinations - There are methods that reconstruct the history of a set of homologous sequences, including recombination events, by minimizing the total number of events. This can be criticized from a statistical standpoint; in addition, it seems that a very large set of histories gives the same number of events. It is suggested that the size of the solution space is mapped for a few sequences with a few events. Two lectures (with useful references) give the background to this problem: Lecture 1 and Lecture 2.

Probability of Recombination Detectability - This project concerns the probability that a recombination is detectable in the Coalescent with Recombination and Mutation Model, conditioning on different events such as one recombination, two mutations, two recombinations, etc. This is very illuminating for interpreting analyses based on minimizing events, as in the first project. There are many open questions in this field, which is surprising since it started in 1985. Hudson and Kaplan (1985) showed that a majority of recombination events are fundamentally invisible, Myers (2004 PhD Thesis) added mutations to this scenario, Hein, Schierup and Wiuf (2005) simplified the Hudson-Kaplan calculations, and Hayman, Ignatieva and Hein (2023) extended these calculations to more complicated scenarios. This project needs strong programming skills and a good understanding of basic probability theory.

Statistical Alignments & Carrillo-Lipman Bounds - For more than 2 sequences (say k), computational methods can be strongly accelerated by using information from alignments of pairs of sequences. This has been richly explored for pairs of sequences, but not for probabilistic models of sequence evolution. It is suggested that this is explored for 3 sequences, i.e. if you have the pairwise alignments of 3 sequences, how strongly can these constrain the alignment of all 3 sequences? Useful material can be found in these lectures with voice on alignment: Lecture 1, Lecture 2, Lecture 3, Lecture 4 and Lecture 5.

Inferring Insertion-Deletion Parameters from observed sequence lengths - This is again a statistical alignment project, but without the alignment! Statistical alignment is an advanced technique that gives a probability for any possible alignment, but it is computationally slow. There could be much information in the lengths of the sequences alone (thus ignoring the actual content of the sequences). All models used are Markov models, so it is possible to write down the probability distribution of the combined lengths of the input sequences and thus make inferences about parameters. The lectures for the above project also provide a suitable background for this project.
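As a rough illustration of how sequence lengths alone can carry information about insertion-deletion parameters, the sketch below assumes the equilibrium length distribution of the TKF91 links model, which is geometric in the ratio r = λ/μ of insertion to deletion rates: P(L = n) = (1 − r) rⁿ. The choice of TKF91, the treatment of the sequences as independent draws (ignoring their shared ancestry), and the observed lengths are all assumptions made for this example rather than part of the project description.

```python
import math


def log_likelihood(r, lengths):
    """Log-likelihood of r = lambda/mu under P(L = n) = (1 - r) * r**n, n >= 0."""
    return sum(math.log(1.0 - r) + n * math.log(r) for n in lengths)


def fit_length_ratio(lengths):
    """Maximum-likelihood estimate of r for the geometric length distribution.

    Setting the derivative of the log-likelihood to zero gives
    r_hat = mean_length / (1 + mean_length).
    """
    mean_length = sum(lengths) / len(lengths)
    return mean_length / (1.0 + mean_length)


# Hypothetical observed sequence lengths (illustrative only).
observed_lengths = [90, 110, 100]

r_hat = fit_length_ratio(observed_lengths)
print(f"estimated lambda/mu: {r_hat:.4f}")
```

Lengths at equilibrium only identify the ratio λ/μ; separating the two rates would need the joint distribution of the lengths of related sequences, which is the kind of calculation the project has in mind.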

Additionally, several interesting problems that could provide inspiration are mentioned in the course Topics in Computational Biology, given by Jotun Hein.

Project 4: Integrating a function in high dimensions using Regression Trees and Gaussian Processes

**Supervisor: Dr Ben Lambert (Academic Director Schmidt Futures, Department of Statistics)**

Description and Outcome

The trapezium rule for numerical integration works by breaking a function up into a series of discrete bins. By assuming that the function varies linearly within a bin, this then allows an integral to be approximated as the sum of trapezia.
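The rule described above can be sketched in a few lines of Python. The function name `trapezium` and the test integrand (sin on [0, π], whose exact integral is 2) are choices made for this illustration.

```python
import math


def trapezium(f, a, b, n_bins):
    """Approximate the integral of f over [a, b] using n_bins equal-width bins."""
    width = (b - a) / n_bins
    total = 0.0
    for i in range(n_bins):
        left = a + i * width
        right = left + width
        # Area of one trapezium: mean of the two edge heights times the bin width.
        total += 0.5 * (f(left) + f(right)) * width
    return total


estimate = trapezium(math.sin, 0.0, math.pi, 100)
print(f"trapezium estimate: {estimate:.6f} (exact value is 2)")
```

Increasing `n_bins` shrinks the error, since the straight-line assumption within each bin becomes more accurate as the bins narrow.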

Recent approaches to numerical integration are even simpler: they split a function up into a series of bins and assume that the function is constant within them (1). This approximation then turns an integral into a sum of cuboidal volumes.

A benefit of this simple approach is that it extends easily to high dimensions, and regression tree algorithms provide a powerful way to build such function approximations (1). Bayesian Additive Regression Trees (BART) take this approach one step further and can capture a degree of the integral approximation error by representing the uncertainty in the underlying space of trees (2).
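The idea of summing cuboidal volumes can be sketched as follows. This is a deliberately simplified stand-in: it uses a fixed-depth binary partition that always halves the longest side of a box, not a regression tree fitted to function evaluations as in (1), and it carries no uncertainty estimate as BART does in (2). The function name `cuboid_sum` and the test integrand are hypothetical choices for this example.

```python
import math


def cuboid_sum(f, lower, upper, depth):
    """Integrate f over the box [lower, upper] by recursive binary splitting.

    At depth 0 the function is treated as constant within the box, so the
    box contributes f(centre) * volume. Otherwise the box is halved along
    its longest side and the two halves are integrated separately.
    """
    if depth == 0:
        centre = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
        volume = math.prod(hi - lo for lo, hi in zip(lower, upper))
        return f(centre) * volume
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    axis = widths.index(max(widths))  # first longest side
    mid = (lower[axis] + upper[axis]) / 2
    left_upper = list(upper)
    left_upper[axis] = mid
    right_lower = list(lower)
    right_lower[axis] = mid
    return (cuboid_sum(f, lower, left_upper, depth - 1)
            + cuboid_sum(f, right_lower, upper, depth - 1))


def gaussian_bump(x):
    """Test integrand exp(-|x|^2), whose integral over the unit cube is known."""
    return math.exp(-sum(xi * xi for xi in x))


# Three-dimensional unit cube split into 2**12 = 4096 cuboids.
estimate = cuboid_sum(gaussian_bump, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], depth=12)
exact = (math.sqrt(math.pi) / 2 * math.erf(1.0)) ** 3
print(f"estimate: {estimate:.5f}, exact: {exact:.5f}")
```

A fitted tree would instead place its splits where the function changes most, so that fewer cuboids are needed for the same accuracy. That matters in high dimensions, where a uniform partition like this one quickly becomes too expensive.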

This project could go in many directions depending on the interests of the student. One direction would be to meld regression-tree approaches with Gaussian Processes (GPs), where GPs are used to better approximate function behaviour within a cuboidal volume. Another would be to develop software consisting of a set of benchmark problems for existing high-dimensional integration methods, in order to explore their relative strengths and weaknesses.

The project will involve considerable coding in R and/or Python. It will also aim to create software that can be used by other researchers in the community. As such, we will make use of test-driven software development practices and collaborative code review over GitHub.

Bibliography

(1) Foster, T., Lei, C. L., Robinson, M., Gavaghan, D., & Lambert, B. (2020). Model evidence with fast tree-based quadrature. arXiv preprint arXiv:2005.11300.

(2) Zhu, H., Liu, X., Kang, R., Shen, Z., Flaxman, S., & Briol, F. X. (2020). Bayesian probabilistic numerical integration with tree-based models. Advances in Neural Information Processing Systems, 33, 5837-5849.

Project 5: Characterising mosquito dispersal from a meta-analysis of mark-release-recapture experiments

**Supervisor: Dr Ben Lambert (Academic Director Schmidt Futures, Department of Statistics)**

Description and Outcome

In terms of the people they kill and sicken, mosquitoes are the most dangerous animals on Earth. Yet, despite their notoriety, we still know relatively little about their ecology. Interventions against mosquito-borne diseases are affected by how far mosquitoes travel during their lifetimes, which in turn influences the spread of insecticide and drug resistance. Mosquito dispersal is also of great importance for interventions involving the release of genetically modified insects.

In this project, the student will conduct a statistical meta-analysis of a database of mark-release-recapture experiments (the predominant method of determining mosquito flight distances) to characterise the dispersal of different vector species (1). In so doing, the student will gain experience in data analysis and an appreciation of mosquito ecology – of great value if considering a career in public health or infectious disease modelling.

The project will involve applied statistical modelling using Bayesian methods and will include coding in either R or Python. Throughout the work, we will use approaches that ensure the work is reproducible by other researchers, including hosting the code openly on GitHub and using data analysis pipelines.

Bibliography

(1) Guerra, C. A., Reiner, R. C., Perkins, T. A., Lindsay, S. W., Midega, J. T., Brady, O. J., ... & Smith, D. L. (2014). A global assembly of adult female mosquito mark-release-recapture data to inform the control of mosquito-borne pathogens. Parasites & vectors, 7(1), 1-15.

Summer Internship Research Projects

