Below are some of the projects our summer interns have completed in the past.
The recorded cases of COVID-19 represent a fraction of the underlying infections with the SARS-CoV-2 virus as mildly symptomatic and asymptomatic infections were less likely to be tested and recorded than more serious cases. COVID-19-related deaths were much more reliably recorded, so a compartmental transmission model was fitted to these data. After fitting to the entire population of Florida assuming homogeneous mixing, the SEIR model (with compartments for susceptible, exposed, infectious and recovered/removed individuals) was generalised to allow for mixing within and between five county-level populations in Florida. Access to greater computational power would have allowed the model to include more counties and to allow for mixing between populations from non-adjacent counties.
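As a rough illustration of the kind of model described above, the sketch below simulates a deterministic SEIR model for several populations that mix within and between themselves. It is not the fitted model from the project: the function name `simulate_seir`, the forward-Euler discretisation, the mixing-matrix convention, and every parameter value are assumptions chosen for illustration only.

```python
def simulate_seir(populations, initial_infectious, mixing,
                  beta, sigma, gamma, days, dt=0.1):
    """Forward-Euler simulation of a multi-population SEIR model.

    mixing[i][j] is the (assumed) share of population i's contacts made
    with population j, so each row should sum to 1.
    beta: transmission rate, sigma: 1 / latent period,
    gamma: 1 / infectious period (all per day).
    Returns one (S, E, I, R) snapshot per simulated day.
    """
    n = len(populations)
    S = [populations[i] - initial_infectious[i] for i in range(n)]
    E = [0.0] * n
    I = [float(x) for x in initial_infectious]
    R = [0.0] * n
    history = []
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            # Force of infection on population i: contacts with each
            # population j weighted by the prevalence in j.
            force = [
                beta * sum(mixing[i][j] * I[j] / populations[j]
                           for j in range(n))
                for i in range(n)
            ]
            for i in range(n):
                new_exposed = force[i] * S[i] * dt
                new_infectious = sigma * E[i] * dt
                new_removed = gamma * I[i] * dt
                S[i] -= new_exposed
                E[i] += new_exposed - new_infectious
                I[i] += new_infectious - new_removed
                R[i] += new_removed
        history.append((list(S), list(E), list(I), list(R)))
    return history


# Hypothetical example: two populations, infection seeded in the first only.
example = simulate_seir(
    populations=[1000.0, 2000.0],
    initial_infectious=[10.0, 0.0],
    mixing=[[0.9, 0.1], [0.1, 0.9]],
    beta=0.5, sigma=0.2, gamma=0.1, days=50,
)
```

Setting the mixing matrix to the identity recovers independent, homogeneously mixing populations. Fitting such a model to death data would additionally need an observation model linking infections to recorded deaths, which this sketch leaves out.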
Supervised by Professor Christl Donnelly and Dr Nicholas Irons.
Research Areas: Statistical Genetics and Epidemiology.
Uncertainty estimation aims to quantify how confident a model is in its predictions, which is essential for deploying machine learning systems in real-world settings. Conformal prediction is a post hoc, distribution-free method that uses calibration data to construct prediction intervals or sets that contain the true label with a guaranteed probability, while requiring minimal assumptions about the data or underlying model. Conformal training extends this framework by training the model specifically to perform well under conformal prediction.
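To make the calibration step concrete, here is a minimal sketch of split conformal prediction for regression, assuming absolute residuals as the nonconformity score. The function names and the toy predictor in the usage example are hypothetical. The coverage guarantee is marginal and relies on the calibration and test points being exchangeable.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, no finite threshold gives the requested
    coverage, so the threshold is infinite.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Prediction interval for x_new with target coverage 1 - alpha.

    `predict` is any already-fitted point predictor; it is treated as a
    black box and must not have been trained on the calibration data.
    """
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    threshold = conformal_quantile(scores, alpha)
    centre = predict(x_new)
    return centre - threshold, centre + threshold


# Hypothetical usage with a toy predictor and made-up calibration data.
def toy_predict(x):
    return 2 * x


calib_x = list(range(1, 11))
calib_noise = [1, -2, 3, -4, 5, -6, 7, -8, 9, -10]
calib_y = [toy_predict(x) + e for x, e in zip(calib_x, calib_noise)]
interval = split_conformal_interval(toy_predict, calib_x, calib_y,
                                    x_new=3, alpha=0.2)
```

For classification, the same quantile can be applied to a score such as one minus the predicted probability of the true class, giving prediction sets rather than intervals.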
The project explored how conformally trained models differ from normally trained models. It did so by using techniques from explainable AI to analyse which features the models relied on when making predictions. We found that conformally trained models typically focus on fewer but larger, contiguous regions of an image, while normally trained models spread attention across more, smaller regions.
Supervised by Dr Fergus Imrie.
Research Areas: Computational Statistics and Machine Learning, Statistical Theory and Methodology.
Description
You will learn how to apply the latest machine learning and AI technologies to help discover new inhibitors of a key drug target in SARS-CoV-2, the virus that causes COVID-19. By training models on binding data and 3D atomic structures of inhibitors of the SARS-CoV-2 main protease, you will advance our understanding of how to block viral maturation and how to develop new drugs to treat COVID-19.
Outcomes
You will explore data from the COVID Moonshot project to develop a variety of classical ML models and more advanced methods such as Graph Neural Networks, Atomic Environment Vector-based models, and molecular transformers.
Supervised by Professor Garrett Morris.
Research Areas: Computational Biology and Bioinformatics, Computational Statistics and Machine Learning.
Preference or ranking data record a set of items in order of preference or rank. For example, we might ask each person in a study to rank 5 sushi dishes, novels, or election candidates, from most preferred to least preferred. Or we might have an observation of a queue in which people higher up the queue have greater social status. Or we might have the outcome of a multiplayer game in which we record the order in which the players finished. We need statistical models to help us interpret these data. For example, for the queue data, if we observe enough queues involving the same people, can we estimate the underlying social hierarchy which determines position in the queue? This is an active area of research and one our group has been looking at for the last couple of years.
Several models have been proposed for this sort of data. It would be interesting to get a better understanding of how the models are related. This could involve methods from probability, computer simulation, or model fitting using, for example, maximum likelihood or Bayesian inference and MCMC.
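As one example of a model for ranking data, the sketch below implements the Plackett-Luce model, in which a ranking is built by repeatedly choosing the next item with probability proportional to its "worth". This is offered as an illustration of the simulation and likelihood ideas mentioned above, not as the specific models studied in the project; the function names and the worth values are assumptions.

```python
import math
import random


def plackett_luce_log_likelihood(ranking, worth):
    """Log-probability of `ranking` (most to least preferred).

    `worth` maps each item to a positive number; at each position the
    ranked item is chosen from those not yet ranked with probability
    proportional to its worth.
    """
    log_lik = 0.0
    for position, item in enumerate(ranking):
        remaining_worth = sum(worth[other] for other in ranking[position:])
        log_lik += math.log(worth[item] / remaining_worth)
    return log_lik


def sample_plackett_luce(worth, rng):
    """Draw one ranking of all items in `worth` from the model."""
    items = list(worth)
    ranking = []
    while items:
        weights = [worth[item] for item in items]
        chosen = rng.choices(items, weights=weights, k=1)[0]
        ranking.append(chosen)
        items.remove(chosen)
    return ranking


# Hypothetical worths for three sushi dishes.
sushi_worth = {"tuna": 2.0, "eel": 1.0, "egg": 1.0}
simulated = sample_plackett_luce(sushi_worth, random.Random(1))
```

Summing the log-likelihood over many observed rankings gives an objective that could be maximised over the worths, or combined with a prior for Bayesian inference.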
Supervised by Professor Geoff Nicholls.
Research Areas: Statistical Theory and Methodology, Computational Statistics and Machine Learning.
Data in -omics fields are characterised by high dimensionality and low sample size, which makes typical statistical analyses challenging because of the potential for overfitting and poor interpretability. Variable selection is a common way to overcome this, supported by the biological idea that only a small number of genes are relevant for prediction or categorisation. In this project we will develop new methods for Bayesian unsupervised variable selection, extending the approach taken by Eliseussen, Fleischer, and Vitelli (2022). We convert continuous multivariate data to rankings and use rank-based analysis for robust inference.
Eliseussen et al. use a Mallows ranking model for their analysis. This imposes a total order relationship between variables. It seems likely that a more general partial order model will better represent the true data as it does not require all variables to be comparable (genes may interact in groups, corresponding to expression pathways, with little or no interaction between genes on different pathways).
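For readers unfamiliar with Mallows models, the sketch below shows the basic idea: the probability of a ranking decays exponentially with its distance from a central (consensus) ranking. The Kendall tau distance is used here as an assumption; other distances are possible and the cited paper may use a different one. The normalising constant is computed by brute-force enumeration, which is only feasible for a handful of items. Function names are hypothetical.

```python
import math
from itertools import permutations


def kendall_tau_distance(ranking_a, ranking_b):
    """Number of item pairs that the two rankings order differently."""
    position_in_b = {item: idx for idx, item in enumerate(ranking_b)}
    mapped = [position_in_b[item] for item in ranking_a]
    return sum(
        1
        for i in range(len(mapped))
        for j in range(i + 1, len(mapped))
        if mapped[i] > mapped[j]
    )


def mallows_probability(ranking, centre, theta):
    """P(ranking) proportional to exp(-theta * d(ranking, centre)).

    theta >= 0 controls concentration around `centre`; theta = 0 gives
    the uniform distribution over all total orders of the items.
    """
    weight = math.exp(-theta * kendall_tau_distance(ranking, centre))
    normaliser = sum(
        math.exp(-theta * kendall_tau_distance(candidate, centre))
        for candidate in permutations(centre)
    )
    return weight / normaliser


# Hypothetical consensus ordering of three genes.
centre = ["g1", "g2", "g3"]
p_centre = mallows_probability(centre, centre, theta=1.0)
```

Every ranking here is a total order, so any two items are always comparable. A partial order model relaxes exactly this requirement.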
The aim of the project will be to explore these ideas. Quite a bit of code is available and can be adapted to this analysis. However, the research is not just applied work and coding (though it could be if you like). There is scope for developing new statistical models, calculating their properties, and evaluating their performance.
The full project is available here.
Supervised by Professor Geoff Nicholls.
Research Areas: Statistical Theory and Methodology, Computational Biology and Bioinformatics.
Species distribution models (SDMs) use computer algorithms to predict the distribution of species across geographical space and time, most often using environmental data to inform these predictions. They have traditionally been used in ecology, but have recently been applied to other areas, such as infectious disease epidemiology, where they have been used to predict the geographical distribution of pathogens.
Understanding the geographical distribution of species can provide information that can enhance our understanding of infectious disease transmission and control. Across south-east Tanzania, almost half of the animal rabies cases observed occur in jackals, which is an unusually high proportion. Information on the distribution of jackals across this area could help us to understand the role they are playing in maintaining rabies in this region. For highly pathogenic avian influenza, understanding the spatial distribution of the virus in wild bird hosts could aid in planning targeted surveillance strategies.
In this project, we will develop species distribution models to predict either the distribution of jackals across south-east Tanzania[1] or the spatial distribution of avian influenza in wild birds across Great Britain[2]. The choice of scenario will be based on both student preference and data availability. Both options will involve working with spatial data and implementing an existing SDM algorithm, such as maximum entropy models or Bayesian additive regression trees[3, 4]. Experience working with GIS and programming in R would be beneficial, but not essential.
Supervised by Professor Christl Donnelly and Dr Sarah Hayes.
In Mendelian randomisation (MR), genetic markers are used as instrumental variables to identify and estimate the causal effects of modifiable phenotypes on outcomes, for example, the effect of body weight on blood pressure. Multivariable models aim to determine the effects of multiple exposures on an outcome at the same time. A recently proposed estimator is the median-of-medians estimator, which is robust to some genetic markers being invalid instruments. The project aim is to evaluate the performance of this estimator in the two-sample MR setting and to compare it to other robust methods, such as the least absolute deviations estimator. Both simulation methods and data applications, using the R package TwoSampleMR, will be used for this assessment.
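To illustrate why median-based estimators tolerate invalid instruments, the sketch below computes per-marker ratio (Wald) estimates for a single exposure and takes their median. This is the simpler single-exposure "simple median" idea, swapped in for illustration; it is not the multivariable median-of-medians estimator evaluated in the project. The function names and the summary statistics are made up.

```python
from statistics import median


def wald_ratios(beta_exposure, beta_outcome):
    """Per-marker causal effect estimates.

    Each ratio divides the marker-outcome association by the
    marker-exposure association (two-sample summary statistics).
    """
    return [b_out / b_exp for b_exp, b_out in zip(beta_exposure, beta_outcome)]


def simple_median_estimate(beta_exposure, beta_outcome):
    """Median of the ratio estimates.

    The median is unaffected by a minority of outlying ratios, such as
    those produced by invalid instruments.
    """
    return median(wald_ratios(beta_exposure, beta_outcome))


# Made-up summary statistics: four markers agree on an effect of 0.5,
# while the fifth (e.g. a pleiotropic marker) suggests a much larger one.
beta_exposure = [0.1, 0.2, 0.4, 0.5, 0.25]
beta_outcome = [0.05, 0.1, 0.2, 0.25, 0.5]
estimate = simple_median_estimate(beta_exposure, beta_outcome)
```

A mean of the same ratios would be pulled towards the outlying marker, which is the contrast a simulation study of robust estimators would typically quantify.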
Supervised by Professor Frank Windmeijer.
The trapezium rule for numerical integration works by breaking a function up into a series of discrete bins. By assuming that the function varies linearly within a bin, this then allows an integral to be approximated as the sum of trapezia.
Recent approaches to numerical integration are even simpler: they split a function up into a series of bins and assume that the function is constant within them (1). This approximation then turns an integral into a sum of cuboidal volumes.
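The two one-dimensional rules described above can be sketched as follows. The function names are hypothetical, and evaluating the piecewise-constant approximation at each bin's midpoint is an assumption made here; tree-based methods choose both the bins and the constant values adaptively.

```python
def trapezium_rule(f, lower, upper, n_bins):
    """Approximate the integral of f by assuming f is linear in each bin."""
    width = (upper - lower) / n_bins
    # End points count once with weight 1/2; interior edges are shared
    # by two neighbouring trapezia and so count with weight 1.
    total = 0.5 * (f(lower) + f(upper))
    for k in range(1, n_bins):
        total += f(lower + k * width)
    return total * width


def piecewise_constant_rule(f, lower, upper, n_bins):
    """Approximate the integral by assuming f is constant in each bin.

    Each bin contributes (bin width) x (value of f at the bin midpoint).
    """
    width = (upper - lower) / n_bins
    return width * sum(f(lower + (k + 0.5) * width) for k in range(n_bins))


def square(x):
    return x * x


# Both approximate the integral of x^2 over [0, 1], whose exact value is 1/3.
trapezium_estimate = trapezium_rule(square, 0.0, 1.0, 100)
constant_estimate = piecewise_constant_rule(square, 0.0, 1.0, 100)
```

In d dimensions the piecewise-constant version only needs a value and a volume for each cuboid, which is why it carries over so directly to the partitions produced by regression trees.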
A benefit of this simple approach is that it extends easily to high dimensions, and regression tree algorithms provide a powerful way to build such function approximations (1). Bayesian Additive Regression Trees (BART) take this approach one step further and can capture a degree of the integral approximation error by representing the uncertainty in the underlying space of trees (2).
This project could go in many directions depending on the interests of the student. One direction would be to meld regression-tree approaches with Gaussian Processes (GPs), where GPs are used to better approximate function behaviour within a cuboidal volume. Another direction would be to develop software consisting of a set of benchmark problems for existing high-dimensional integration methods in order to explore their relative strengths and weaknesses.
The project will involve considerable coding in R and/or Python. It will also aim to create software that can be used by other researchers in the community. As such, we will make use of test-driven software development practices and collaborative code review over GitHub.
Supervised by Dr Ben Lambert.
In terms of the people they kill and sicken, mosquitoes are the most dangerous animals on Earth. Yet, despite their notoriety, we still know relatively little about their ecology. How far mosquitoes travel during their lifetimes affects both interventions against mosquito-borne diseases and the spread of insecticide and drug resistance. Mosquito dispersal is also of great importance for interventions involving the release of genetically modified insects.
In this project, the student will conduct a statistical meta-analysis of a database of mark-release-recapture experiments (the predominant method of determining mosquito flight distances) to characterise the dispersal of different vector species (1). In so doing, the student will gain experience in data analysis and an appreciation of mosquito ecology – of great value if considering a career in public health or infectious disease modelling.
The project will involve applied statistical modelling using Bayesian methods and will include coding in either R or Python. Throughout the work, we will use approaches to ensure that the work is reproducible by other researchers, including hosting the code openly on GitHub and using data analysis pipelines.
Supervised by Dr Ben Lambert.