Supervisors: Professor Christl Donnelly (Professor of Applied Statistics, Department of Statistics) and Dr Sarah Hayes (Postdoctoral Researcher, Department of Statistics) in collaboration with Professor Katie Hampson (Professor of Infectious Disease Ecology, University of Glasgow)
Description
Zero by 30 is a global strategic plan to eliminate dog-mediated human rabies deaths by 2030. Understanding the burden of disease and targeting interventions are important parts of this plan. Cluster detection can identify areas of increased disease incidence and so help target interventions to the areas of greatest need.
Spatio-temporal scan statistics can be used for cluster detection. Whilst most rely on information about the background population at risk, a permutation method exists that can be applied to case data alone. Separately, the distribution of space-time windows between cases can be used to infer what proportion of circulating cases has been detected by surveillance.
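As a rough illustration of the permutation idea, the sketch below scores a single space-time cylinder using case data alone. It assumes the usual space-time permutation formulation, in which the expected count for a window is (cases in the window's zones) x (cases on the window's days) / (all cases). The function name, the (zone, day) input format and the toy data are hypothetical. A full analysis would scan many cylinders and judge the largest score by randomly permuting case dates, which is not shown here.

```python
import math
from collections import Counter

def scan_window_stats(cases, zones, days):
    """Observed count, expected count and log likelihood ratio for one cylinder.

    cases: list of (zone, day) pairs, one per case (hypothetical format).
    zones, days: the spatial units and time points that make up the cylinder.
    """
    total = len(cases)
    zone_counts = Counter(zone for zone, _ in cases)
    day_counts = Counter(day for _, day in cases)

    observed = sum(1 for zone, day in cases if zone in zones and day in days)
    # Expected count if the location and date of a case were independent.
    expected = (sum(zone_counts[z] for z in zones)
                * sum(day_counts[d] for d in days) / total)

    # Only windows with an excess of cases are candidate clusters.
    if observed <= expected:
        return observed, expected, 0.0
    log_glr = observed * math.log(observed / expected)
    if observed < total:
        log_glr += (total - observed) * math.log((total - observed) / (total - expected))
    return observed, expected, log_glr

# Toy data: six cases in three zones over three days (made up for illustration).
toy_cases = [("A", 1), ("A", 1), ("A", 2), ("B", 1), ("B", 3), ("C", 3)]
observed, expected, log_glr = scan_window_stats(toy_cases, {"A"}, {1, 2})
print(observed, expected, round(log_glr, 3))
```

Censoring the data at an earlier time point, as in the "real-time" assessment described in this project, would amount to filtering `toy_cases` by day before calling the function.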
This project will apply several cluster detection methods and/or space-time case detection algorithms to data on animal rabies cases across different parts of Tanzania. Results from these methods will be compared and analysed to assess within- and between-population transmission and levels of case detection in these different settings where the epidemiology of rabies and implementation of dog vaccination differ. The ability to detect clusters in “real-time” will be assessed by comparing the results from data subsets obtained by censoring the full dataset at different time points, while the robustness of case detection will be compared using sampling techniques.
Outcome
To identify clusters of animal rabies cases and assess levels of case detection across different parts of Tanzania with a view to generating information that could be useful in targeting interventions and improving surveillance.
Supervisor: Professor Robin Evans (Associate Professor of Statistics, Department of Statistics) supported by Daniel Manela (DPhil in Statistics student, Department of Statistics)
Description
A critical assumption in causal inference is that all confounders of the relationship between a treatment and an outcome have been identified and measured. Satisfying this requirement in practice is extremely challenging.
Proximal Learning methods (Tchetgen Tchetgen et al., 2020) are an exciting new development which allows users to bypass the “no unmeasured confounders” assumption, and instead, perform inference with imperfect proxies of the true confounding mechanisms. These may be easier and cheaper to measure, making them highly attractive to those working in medical and social science domains where data is difficult to measure without error.
The central aim of the project is to assess the performance of Proximal Learning approaches and compare them with existing causal methods. Additionally, we would like the student to explore the robustness and sensitivity of these methods to model misspecification, and perhaps to explore to what extent recent kernelized approaches help mitigate this issue (Mastouri et al., 2021). These experiments would be conducted on a mixture of simulated data and real clinical and socio-economic datasets.
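To make the idea of inference with proxies concrete, here is a minimal simulated sketch of a linear two-stage proximal estimator: regress the outcome-side proxy W on the treatment A and the treatment-side proxy Z, then regress the outcome Y on A and the fitted proxy. The data-generating model, variable names and helper function are assumptions chosen for illustration (all relationships linear, proxies depending only on the unmeasured confounder U). It is not the experimental design of the project.

```python
import random

def ols_two_regressors(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (intercept included via centring)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)
n, true_effect = 20000, 2.0
u = [rng.gauss(0, 1) for _ in range(n)]      # unmeasured confounder
z = [ui + rng.gauss(0, 1) for ui in u]       # treatment-side proxy
w = [ui + rng.gauss(0, 1) for ui in u]       # outcome-side proxy
a = [ui + rng.gauss(0, 1) for ui in u]       # treatment
y = [true_effect * ai + 2.0 * ui + rng.gauss(0, 1) for ai, ui in zip(a, u)]

# Naive regression of Y on A alone is biased by the unmeasured confounder.
ma, my = sum(a) / n, sum(y) / n
naive_effect = (sum((ai - ma) * (yi - my) for ai, yi in zip(a, y))
                / sum((ai - ma) ** 2 for ai in a))

# Stage 1: predict W from (A, Z). The intercept is dropped because the
# stage-2 fit is centred, so a constant shift in w_hat does not matter.
coef_a, coef_z = ols_two_regressors(a, z, w)
w_hat = [coef_a * ai + coef_z * zi for ai, zi in zip(a, z)]
# Stage 2: regress Y on (A, fitted W); the slope on A estimates the effect.
proximal_effect, _ = ols_two_regressors(a, w_hat, y)

print(f"true {true_effect:.2f}  naive {naive_effect:.2f}  proximal {proximal_effect:.2f}")
```

In this linear setting the naive slope should land well above the true effect, while the two-stage estimate should sit close to it. Misspecifying either stage is one simple way to start probing the sensitivity questions raised above.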
Strong programming (R and/or Python) and linear algebra skills are crucial for this project.
Outcome
We hope to have enough results for a submission to a machine learning conference.
References
A. Mastouri, Y. Zhu, L. Gultchin, A. Korba, R. Silva, M. Kusner, A. Gretton, and K. Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International Conference on Machine Learning, pages 7512–7523. PMLR, 2021.
E. J. Tchetgen Tchetgen, A. Ying, Y. Cui, X. Shi, and W. Miao. An introduction to proximal causal learning. arXiv preprint arXiv:2009.10982, 2020.
Supervisors: Professor Christl Donnelly (Professor of Applied Statistics, Department of Statistics) and Matthew Penn (DPhil in Statistics student, Department of Statistics)
Description
Recent advances in computer vision have enabled football clubs to collect a vast amount of data from each match they play. Using this data effectively can have a substantial impact on a football team, both in their tactical planning and their recruitment strategy.
A key challenge in analysing this data is assessing defensive players. For example, weaker teams which will (in general) have weaker defenders will do more defending, meaning that their defenders will have more opportunities to make key challenges and interceptions than their counterparts playing for stronger teams. Thus, simply counting these key interventions does not provide a good way of comparing defenders from different teams.
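One simple way to see the problem is to rescale a defender's counts by how much of the ball the opposition had. The sketch below is only an illustration of that idea, not a metric proposed by the project: the function name, the 50% reference possession and the two made-up defenders are all assumptions.

```python
def possession_adjusted_per_90(interventions, minutes_played, opponent_possession):
    """Interventions per 90 minutes, rescaled to a 50% opponent-possession baseline.

    opponent_possession is the share of possession (0 to 1) held by the
    opposition while the defender was on the pitch (hypothetical input).
    """
    per_90 = interventions * 90 / minutes_played
    return per_90 * 0.5 / opponent_possession

# Two made-up defenders with the same minutes but different team contexts.
weak_team_defender = possession_adjusted_per_90(60, 900, opponent_possession=0.60)
strong_team_defender = possession_adjusted_per_90(40, 900, opponent_possession=0.40)
print(weak_team_defender, strong_team_defender)
```

The raw counts (6.0 against 4.0 per 90 minutes) favour the defender on the weaker team, but after the adjustment the two are level. A real metric would need to account for much more than possession, such as where on the pitch and against whom the interventions were made.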
This project will seek to explore ways of assessing defenders with the aim of producing a software package that calculates various metrics that can be used to compare players across the same division.
The project will be carried out in association with Oxford City Football Club, giving the successful applicant invaluable insight into the world of elite football. This will also provide the applicant with the chance to directly apply the tools they develop.
Outcome
The development of a software package that analyses match data to assess the performance and ability of defensive players.
Preference will be given to students with Python programming skills, although students who already know R could pick up Python during the internship.
Supervisors: Professor Judith Rousseau (Professor of Statistics, Department of Statistics), Dr Raiha Browning (Post-doctorate, Warwick University and BDI Oxford) and Dan Moss (PhD student, Department of Statistics)
Description
A key indicator of disorder in society is the occurrence of conflict events, such as protests, riots, and battles between organised armed groups. An understanding of the real-time risk of conflict events around the world is crucial for a number of parties, including governments, journalists and researchers, especially during times of instability and unrest. Longer-term trends are also of interest to these users. In this project, we use spatio-temporal self-exciting processes to understand how both temporal dynamics and spatial location affect the risk of these events occurring. Ultimately, this provides an estimate of the risk of conflict events over space and time. A further interest is how certain factors, such as the type of violence and demographic factors, impact the risk of conflict.
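In a self-exciting process, each event temporarily raises the rate at which further events occur. The sketch below shows the temporal part only, assuming an exponential triggering kernel, so that the intensity is lambda(t) = mu + sum over past events of alpha * beta * exp(-beta * (t - t_i)). The kernel choice, parameter values and function name are assumptions for illustration, and the spatial component used in the project is omitted.

```python
import math

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a temporal Hawkes process with exponential kernel.

    mu: background rate; alpha: expected number of events triggered by each
    event; beta: decay rate of the excitation. Only events strictly before t count.
    """
    excitation = sum(alpha * beta * math.exp(-beta * (t - ti))
                     for ti in event_times if ti < t)
    return mu + excitation

events = [1.0, 2.0]   # made-up event times, e.g. in days
for t in (0.5, 1.1, 2.1, 6.0):
    print(t, round(hawkes_intensity(t, events, mu=0.2, alpha=0.5, beta=1.0), 3))
```

Before the first event the intensity equals the background rate, it jumps after each event, and it decays back towards the background rate as time passes without further events.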
Outcome
Development of software to analyse conflict data from the ACLED data set.
Supervisor: Professor Geoff Nicholls (Associate Professor of Statistics, Department of Statistics)
Description
Preference or ranking data are lists of preferences or ranks. For example, we might ask each person in a study to rank 5 sushi dishes, novels, or election candidates, from most preferred to least preferred. Or we might have an observation of a queue in which people higher up the queue have greater social status. Or we might have the outcome of a multiplayer game in which we record the order in which the players finished. We need statistical models to help us interpret these data. For example, for the queue data, if we observe enough queues involving the same people, can we estimate the underlying social hierarchy that determines position in the queue? This is an active area of research and one our group has been looking at for the last couple of years.
Outcome
Several models have been proposed for this sort of data. It would be interesting to get a better understanding of how the models are related. This could involve methods from probability, computer simulation, or model fitting using, for example, maximum likelihood or Bayesian inference and MCMC.
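As one example of such a model, the sketch below uses the Plackett-Luce model, chosen here purely for illustration since the project does not name specific models. Each item has a "worth", and a ranking is built by repeatedly choosing the next item with probability proportional to its worth among those remaining. The item names and worth values are made up.

```python
import itertools
import math
import random

def plackett_luce_log_prob(ranking, log_worth):
    """Log-probability of a complete ranking (most preferred first)."""
    remaining = list(ranking)
    total = 0.0
    for item in ranking:
        denom = sum(math.exp(log_worth[j]) for j in remaining)
        total += log_worth[item] - math.log(denom)
        remaining.remove(item)   # the chosen item leaves the choice set
    return total

def sample_ranking(log_worth, rng):
    """Simulate one ranking by sequentially drawing the next-preferred item."""
    remaining = list(log_worth)
    ranking = []
    while remaining:
        weights = [math.exp(log_worth[j]) for j in remaining]
        choice = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(choice)
        remaining.remove(choice)
    return ranking

log_worth = {"tuna": 1.0, "salmon": 0.5, "eel": 0.0}   # hypothetical values
rng = random.Random(0)
print(sample_ranking(log_worth, rng))
for perm in itertools.permutations(log_worth):
    print(perm, round(math.exp(plackett_luce_log_prob(perm, log_worth)), 3))
```

Summing `plackett_luce_log_prob` over observed rankings gives a log-likelihood that could be maximised or used inside an MCMC sampler, which is one route into the model-fitting work mentioned above.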
Supervisor: Professor Chris Holmes (Professor of Biostatistics, Department of Statistics)
Description
The aim is to explore, and produce a scientific report on, the potential use of unlabelled data to improve machine learning (ML) classification on labelled data.
We will consider a classification task such as medical diagnosis where we have access to class-labelled training data, e.g. medical images with associated patient diagnoses, in addition to unlabelled data sets, e.g. medical images without diagnoses. The task is to explore whether classification accuracy of machine learning can be improved by utilising unlabelled data (which is often more abundant than labelled data) alongside the labelled data.
In particular, we will explore whether ML classifiers (like logistic regression or deep neural networks) can be improved in performance by comparing 3 strategies:
1. Training a “teacher” ML model only on the labelled data.
2. Training a “student” ML model that uses both labelled data and the unlabelled data that has now been machine labelled by the teacher model.
3. Iteratively repeating step 2, by replacing the teacher model with the previously learned student model.
The researchers will try different modifications of the above approach such as only augmenting with data observations that the teacher model predicts confidently, or giving the predicted data less weight than other observations.
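The three strategies, together with the two modifications just mentioned, can be sketched on a toy one-dimensional problem. Everything below is an assumption made for illustration: the simulated two-class data, the small hand-written logistic regression, and the parameter names `threshold` (confidence cut-off for keeping a machine label) and `pseudo_weight` (down-weighting of machine-labelled points).

```python
import math
import random

def sigmoid(z):
    # Written in two branches to avoid overflow in math.exp.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, weights, steps=300, lr=0.5):
    """Weighted 1-D logistic regression by gradient descent; returns (slope, intercept)."""
    w = b = 0.0
    total = sum(weights)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y, wt in zip(xs, ys, weights):
            err = wt * (sigmoid(w * x + b) - y)
            gw += err * x
            gb += err
        w -= lr * gw / total
        b -= lr * gb / total
    return w, b

def self_train(x_lab, y_lab, x_unlab, rounds=3, threshold=0.9, pseudo_weight=0.5):
    model = fit_logistic(x_lab, y_lab, [1.0] * len(x_lab))       # step 1: teacher
    for _ in range(rounds):                                       # step 3: iterate
        w, b = model
        probs = [sigmoid(w * x + b) for x in x_unlab]
        # Keep only unlabelled points the current model labels confidently.
        kept = [(x, 1 if p >= 0.5 else 0) for x, p in zip(x_unlab, probs)
                if max(p, 1 - p) >= threshold]
        xs = x_lab + [x for x, _ in kept]
        ys = y_lab + [label for _, label in kept]
        wts = [1.0] * len(x_lab) + [pseudo_weight] * len(kept)
        model = fit_logistic(xs, ys, wts)                         # step 2: student
    return model

def accuracy(model, xs, ys):
    w, b = model
    return sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in zip(xs, ys)) / len(xs)

def draw(rng, n_per_class):
    """Toy data: class 0 centred at -2, class 1 centred at +2, unit variance."""
    xs = ([rng.gauss(-2, 1) for _ in range(n_per_class)]
          + [rng.gauss(2, 1) for _ in range(n_per_class)])
    return xs, [0] * n_per_class + [1] * n_per_class

rng = random.Random(1)
x_lab, y_lab = draw(rng, 10)        # small labelled set
x_unlab, _ = draw(rng, 200)         # larger unlabelled set (labels discarded)
x_test, y_test = draw(rng, 250)

teacher = fit_logistic(x_lab, y_lab, [1.0] * len(x_lab))
student = self_train(x_lab, y_lab, x_unlab)
print("teacher accuracy:", accuracy(teacher, x_test, y_test))
print("student accuracy:", accuracy(student, x_test, y_test))
```

Whether the student actually improves on the teacher depends on the data and on the settings of `threshold` and `pseudo_weight`, which is exactly the kind of question the report is meant to investigate.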
Outcome
A report into the use of data augmentation and self-training in AI.
Supervisors: Professor Frank Windmeijer (Professorial Research Fellow, Department of Statistics) and Jeffrey Tse (DPhil in Statistics student, Department of Statistics)
Description
For the standard linear model with endogenous explanatory variables, the method of instrumental variables (IVs) is one technique for obtaining consistent and asymptotically normal estimates of the causal effects. Instruments affect the endogenous treatment variables but have no direct or indirect effects on the outcome variable, other than through the treatments.
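For a single treatment and a single instrument, the IV estimate reduces to cov(z, y) / cov(z, x). The simulation below is a minimal sketch of this simplest case, with a made-up data-generating model in which the instrument is strong. It is only meant to show why ordinary least squares is biased under endogeneity while IV is not, and does not address the weak-instrument or subvector settings that the project studies.

```python
import random

def cov(a, b):
    """Sample covariance (dividing by n, which cancels in the ratios below)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
n, beta = 20000, 1.0
u = [rng.gauss(0, 1) for _ in range(n)]                      # unobserved confounder
z = [rng.gauss(0, 1) for _ in range(n)]                      # instrument
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]      # endogenous treatment
y = [beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

ols_estimate = cov(x, y) / cov(x, x)   # biased: x is correlated with the error via u
iv_estimate = cov(z, y) / cov(z, x)    # uses only the variation in x driven by z

print(f"true {beta:.2f}  OLS {ols_estimate:.2f}  IV {iv_estimate:.2f}")
```

Shrinking the coefficient on `zi` in the treatment equation towards zero weakens the instrument, and repeating the simulation then gives a feel for how erratic the IV estimate becomes.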
The distributions of IV estimators become nonstandard when the instruments are only weakly correlated with the treatments, leading e.g. to Wald tests with incorrect size. Inference robust to weak instruments is therefore an important research topic.
The Anderson-Rubin (AR) subvector test applies to the case where interest is focused, and hypotheses formulated, on parameters of a subset of the endogenous variables only. With weak instruments, this test is undersized. To improve power, Guggenberger, Kleibergen and Mavroeidis (2019) have proposed an adjustment to the critical value, conditioning on the largest eigenvalue of a concentration matrix. This concentration matrix conveys information about the strength of the instruments.
A better indication of instrument strength, however, is the value of the smallest eigenvalue of this concentration matrix. Simulations show that conditioning on this smallest eigenvalue performs very well, further improving power whilst maintaining the correct size. This project aims to formalise these results by deriving the distribution of the subvector AR test conditional on the smallest eigenvalue of the concentration matrix, analytically and/or by appropriate Monte Carlo methods.
Outcome
A paper to be submitted to conferences and for publication.
Reference
Guggenberger, P., F. Kleibergen, and S. Mavroeidis (2019), “A More Powerful Subvector Anderson Rubin Test in Linear Instrumental Variables Regression”, Quantitative Economics, 10, 487-526.
Supervisor: Professor George Deligiannidis (Associate Professor of Statistics, Department of Statistics)
Description
Given samples from some unknown distribution, generative modelling aims to learn to produce new samples. Generative models find applications in many areas and have been used to generate images, audio, and text. Recent examples include the famous ChatGPT chatbot and text-to-image generators. Generative models, fuelled by advances in deep learning, have made huge progress in the last decade. This progress has been largely driven by certain types of models, the most famous being Generative Adversarial Networks (GANs). Recently, however, a new type of generative model, the Score-Based Generative Model, has been outperforming GANs in a wide variety of tasks, including imaging. Score-based generative models corrupt the data by progressively adding noise until the data become indistinguishable from noise. The model then learns to reverse this noising process and can therefore generate fresh samples from noise. There is a very elegant formulation in terms of time-reversed stochastic differential equations.
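The time-reversal idea can be tried on a toy example where no neural network is needed. The sketch below assumes one-dimensional Gaussian "data" and the forward noising process dx = -x/2 dt + dW, for which the score of the noised distribution is known exactly. It then simulates the reverse-time equation with a simple Euler-Maruyama scheme. The data distribution, time horizon and step count are arbitrary choices, and in a real score-based model the function `score` would be replaced by a learned network.

```python
import math
import random
import statistics

DATA_MEAN, DATA_SD = 3.0, 0.5   # toy data distribution N(3, 0.5^2), an assumption

def score(x, t):
    """Exact score (gradient of the log density) of the noised data at time t."""
    mean_t = DATA_MEAN * math.exp(-t / 2)
    var_t = DATA_SD ** 2 * math.exp(-t) + 1 - math.exp(-t)
    return -(x - mean_t) / var_t

def sample_reverse(rng, horizon=8.0, steps=400):
    """Run the reverse-time SDE from t = horizon down to t = 0."""
    dt = horizon / steps
    x = rng.gauss(0, 1)                      # start from (approximately) pure noise
    for k in range(steps):
        t = horizon - k * dt
        # Reverse drift is f(x, t) - g^2 * score(x, t), with f = -x/2 and g = 1.
        drift = -0.5 * x - score(x, t)
        x = x - drift * dt + math.sqrt(dt) * rng.gauss(0, 1)
    return x

rng = random.Random(0)
samples = [sample_reverse(rng) for _ in range(500)]
print("sample mean:", round(statistics.mean(samples), 2))
print("sample sd:  ", round(statistics.stdev(samples), 2))
```

The generated samples should have mean and standard deviation close to those of the toy data distribution. With a learned score in place of the exact one, comparing generated samples with the training points is one way to approach the question of whether the model produces fresh samples or memorises the data.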
Outcome
The student will implement score-based generative models in Python and perform various numerical experiments to gain insight into their properties, gaining experience in Python and generative modelling along the way. The student will produce a report highlighting any insights gained from the experiments, in particular testing under various scenarios whether the model truly generates fresh samples or simply memorises the data.