Researchers from the Department of Statistics, as part of the OpenBind consortium, have published a new dataset and predictive AI model, strengthening the data foundations needed for AI in drug discovery.

Most medicines work by binding to specific disease-related proteins in the body. Predicting which molecules will bind, and how strongly, is a central part of early-stage drug design. Although AI has transformed areas such as protein structure prediction, its impact on predicting how drugs interact with their targets has been more limited, in large part because experimental data on these interactions is scarce.

The OpenBind consortium generates its experimental data using high-throughput pipelines at Diamond Light Source in Oxfordshire. These pipelines combine automated chemistry, robust binding measurements and crystallography, and the resulting data is processed into formats suitable for machine learning.

The new release provides detailed X-ray images of 699 compounds binding to the 2A protease of the EV-A71 virus, with binding strength measurements for 601 of them – one of the largest public datasets for a single protein target.

‘This first release is an important step because it shows we can now generate high-quality, standardised data at scale, specifically designed for AI in drug discovery,’ said Professor Charlotte Deane, Professor of Structural Bioinformatics at the University of Oxford and a senior OpenBind investigator. ‘As the dataset grows, it will give researchers the kind of consistent, reliable information needed to improve how these models perform.’

Even the most advanced AI systems used in structural biology and drug discovery, such as AlphaFold and Boltz, are limited by the data they are trained on. While they can model biological structures similar to those in their training data, making predictions for new targets that look significantly different remains a challenge.

OpenBind is addressing that limitation by generating large volumes of new, experimental data. The aim is to give AI models the examples needed to move beyond recognising patterns in existing data and start making more reliable predictions about new drugs, helping to streamline a process where narrowing down viable compounds is often slow and costly.

‘High-quality experimental data is essential for developing new and improved AI models. As AI performance improves, this in turn helps guide future experiments, helping to accelerate discovery. The lessons from these early cycles are already helping us improve the speed, consistency, and reproducibility of the pipeline, which will be critical as OpenBind grows,’ said Dr Fergus Imrie, Associate Professor at the Department of Statistics and OpenBind computational researcher.

The dataset and accompanying EV-A71 2A protease target-specific AI model are openly available to researchers worldwide as a basis for developing and testing new computational approaches. Because the data is consistently structured, it could also provide a more rigorous test of how well current AI approaches perform, and where improvements are needed.

OpenBind was co-founded by the University of Oxford and Diamond Light Source as the first programme dedicated to producing drug discovery datasets at industrial scale, designed specifically for AI, and released openly on a continuous basis. It is backed by an £8 million grant from the Department for Science, Innovation and Technology (DSIT).

The consortium also includes researchers from Columbia University, Memorial Sloan Kettering Cancer Center, the Open Molecular Software Foundation and the University of Washington, alongside industry partners. Further data releases are planned as the programme expands to include more targets and larger datasets. A new general predictive model, OpenBind v1, is also expected to be released at the end of the month.

‘We couldn't have made such rapid progress without the contributions of our consortium members and operational team,’ said Professor Frank von Delft, Professor of Structural Chemical Biology in Oxford's Nuffield Department of Medicine and Principal Scientist at Diamond Light Source. ‘We will now implement the lessons from this foundation phase to ramp up a long-term operation that links high-volume production of AI data with active discovery projects.’

The OpenBind dataset can be accessed on the OpenBind website.