Professor Charlotte Deane from the Department of Statistics has been announced as a senior principal investigator on an £8 million government-backed consortium that will create the world's largest dataset for AI-driven drug discovery.
OpenBind, the new consortium, will create the world's largest open dataset of drug-protein interactions, generating more than 500,000 experimentally validated protein-ligand complex structures over the next five years – a 20-fold increase over all publicly available data collected in the past half-century.
‘OpenBind realises a major gear-shift for AI in drug discovery by investing in the data that powers it,’ said Professor Deane. ‘This funding will mean we can begin generating a catalogue that not only dwarfs in quantity everything messily accumulated over half a century, but transcends it in quality and is geared towards powering the AI algorithms.’
Most medicines work by binding to specific proteins – the building blocks that make our bodies function – but researchers have historically lacked sufficient high-quality data about these interactions to train AI systems effectively. This data shortage has been a major barrier to using artificial intelligence to predict which new compounds might work as drugs, leaving pharmaceutical companies reliant on empirical testing methods that can take decades and cost billions. OpenBind promises to bridge that gap by creating structured, comprehensive data specifically designed for machine learning applications.
The consortium will deploy automated chemistry and high-throughput X-ray crystallography at Diamond Light Source, the UK's national synchrotron facility in Oxfordshire, to generate unprecedented volumes of precise molecular interaction data structured for AI training.
Professor Deane is working alongside an international team of researchers, including colleagues Professor Frank von Delft (who also holds a position at Diamond Light Source) and Professor Paul Brennan, both from Oxford’s Nuffield Department of Medicine. The consortium also includes Nobel Prize winner Professor David Baker from the University of Washington, and leading computational scientists from institutions including Memorial Sloan Kettering Cancer Center, MIT, and Columbia University.
The OpenBind dataset is designed to support multiple areas of computational innovation, including structure prediction, generative molecular design, docking algorithms, and active learning workflows. These applications demonstrate how statistical methods developed for one domain can have far-reaching impacts across multiple fields of scientific inquiry. The project also has potential applications beyond healthcare, supporting research into engineering biology solutions for challenges such as developing new enzymes to tackle plastic waste.
OpenBind is backed by the UK government's newly established Sovereign AI Unit and positions the UK at the forefront of AI-driven scientific discovery. The project will help train the next generation of AI models for drug discovery while establishing new standards for open scientific data sharing. The announcement comes as part of the government's broader Plan for Change, highlighting how statistical and computational expertise developed at Oxford is directly contributing to national economic growth and international scientific leadership.
The project also demonstrates how statistical expertise developed in the department is being applied to accelerate medical breakthroughs that could benefit patients worldwide and underpin decades of future innovation in computational biology and pharmaceutical research.