OpenBind releases first open dataset and AI model for drug discovery | Department of Statistics

Dr Lizbé Koekemoer, team leader at the Centre for Medicines Discovery, University of Oxford, and Dr Jasmin Aschenbrenner, researcher at Diamond Light Source, reviewing a molecular structure in the Diamond laboratory. Credit: Stuart March – DNDi.

13 May 2026 | Research

Researchers from the Department of Statistics, as part of the OpenBind consortium, have published a new dataset and predictive AI model, strengthening the data foundations needed for AI in drug discovery.

Most medicines work by binding to specific disease-related proteins in the body. Predicting which molecules will bind, and how strongly, is a central part of early-stage drug design. Although AI has transformed areas such as protein structure prediction, its impact on predicting how drugs interact with their targets has been more limited, largely because experimental data on these interactions is scarce.

The OpenBind consortium's experimental data, generated using high-throughput pipelines at Diamond Light Source in Oxfordshire, combines automated chemistry, robust binding measurements and crystallography, with the data processed into formats suitable for machine learning.

The new release provides detailed X-ray crystal structures of 699 compounds bound to the EV-A71 virus's 2A protease, with binding strength measurements for 601 of them, making it one of the largest public datasets for a single protein target.

‘This first release is an important step because it shows we can now generate high-quality, standardised data at scale, specifically designed for AI in drug discovery,’ said Professor Charlotte Deane, Professor of Structural Bioinformatics at the University of Oxford and a senior OpenBind investigator. ‘As the dataset grows, it will give researchers the kind of consistent, reliable information needed to improve how these models perform.’

Even the most advanced AI systems used in structural biology and drug discovery, such as AlphaFold and Boltz, are limited by the data they are trained on. They can model biological structures similar to those in their training data, but making accurate predictions for new targets that differ significantly from that data remains a challenge.

OpenBind is addressing that limitation by generating large volumes of new, experimental data. The aim is to give AI models the examples needed to move beyond recognising patterns in existing data and start making more reliable predictions about new drugs, helping to streamline a process where narrowing down viable compounds is often slow and costly.

‘High-quality experimental data is essential for developing new and improved AI models. As AI performance improves, this in turn helps guide future experiments, helping to accelerate discovery. The lessons from these early cycles are already helping us improve the speed, consistency, and reproducibility of the pipeline, which will be critical as OpenBind grows,’ said Dr Fergus Imrie, Associate Professor at the Department of Statistics and OpenBind computational researcher.

The dataset and accompanying EV-A71 2A protease target-specific AI model are openly available to researchers worldwide as a basis for developing and testing new computational approaches. Because the data is consistently structured, it could also provide a more rigorous test of how well current AI approaches perform, and where improvements are needed.

OpenBind was co-founded by the University of Oxford and Diamond Light Source as the first programme dedicated to producing drug discovery datasets at industrial scale, designed specifically for AI, and released openly on a continuous basis. It is backed by an £8 million grant from the Department for Science, Innovation and Technology (DSIT).

The consortium also includes researchers from Columbia University, Memorial Sloan Kettering Cancer Center, the Open Molecular Software Foundation and the University of Washington, alongside industry partners. Further data releases are planned as the programme expands to include more targets and larger datasets. A new general predictive model, OpenBind v1, is also expected to be released at the end of May 2026.

‘We couldn't have made such rapid progress without the contributions of our consortium members and operational team,’ said Professor Frank von Delft, Professor of Structural Chemical Biology in Oxford's Nuffield Department of Medicine and Principal Scientist at Diamond Light Source. ‘We will now implement the lessons from this foundation phase to ramp up a long-term operation that links high-volume production of AI data with active discovery projects.’

The OpenBind dataset can be accessed on the OpenBind website.
