Artificial intelligence models that promise to accelerate antibody drug development are falling short when put to rigorous tests, according to recent research from the Department of Statistics’ Oxford Protein Informatics Group.
 

The study, published in Nature Computational Science, shows that building AI that can reliably predict how molecular changes affect antibody performance will require experimental datasets orders of magnitude larger than those currently available.

The findings highlight a key bottleneck in computational drug discovery, where impressive-looking AI results often don't translate to real-world pharmaceutical applications – particularly in the development of antibody-based therapies.

Lead author Dr Alissa Hummer said: 'Overconfidence in the performance of AI models does not serve their ultimate purpose. We need to have an unbiased understanding of how well models work to move them beyond publications to improving drug development.'

In medicine, antibodies are engineered versions of the proteins that our immune systems use to recognise and neutralise threats, and they have become some of the most successful treatments for cancer, autoimmune diseases and other conditions. How tightly an antibody binds to its target, known as its binding affinity, largely determines whether a therapy will work.

Optimising antibodies has typically required thousands of laboratory experiments. AI promises to reduce that effort by predicting which changes might improve binding before costly laboratory testing begins. But current approaches suffer from a basic flaw: they appear accurate under standard testing but fail when evaluated more stringently, because the models memorise patterns from very similar examples rather than learning the underlying principles.
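The quantity such models typically predict is the change in binding free energy caused by a mutation, the ΔΔG of the paper's title. As a rough guide only (standard thermodynamic relations, not a derivation from the paper), binding free energy is linked to the dissociation constant Kd, so ΔΔG compares the mutant and wild-type affinities:

```latex
% Illustrative only: K_d is the dissociation constant, R the gas constant,
% T the temperature and c^{\circ} the standard-state concentration.
\Delta G_{\mathrm{bind}} = RT \ln\frac{K_d}{c^{\circ}}, \qquad
\Delta\Delta G = \Delta G_{\mathrm{mut}} - \Delta G_{\mathrm{wt}}
             = RT \ln\frac{K_d^{\mathrm{mut}}}{K_d^{\mathrm{wt}}}
```

A negative ΔΔG means the mutation tightens binding; a positive value means it weakens it.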

Oxford researchers developed an AI model called Graphinity that reads the three-dimensional structure around where an amino acid change occurs in an antibody–target complex. The model appeared highly accurate when tested using standard approaches, but when the team applied stricter evaluations that prevented similar antibodies from appearing in both training and test sets, performance dropped by more than 60 per cent. The model was overfitting to the limited diversity in current datasets rather than learning transferable scientific principles.
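A minimal sketch of what such a stricter evaluation can look like in practice, assuming a hypothetical table of mutations labelled with the antibody–antigen complex they come from (the file name, column names and stand-in regressor are all assumptions, not the study's code):

```python
# Hedged sketch: keep every mutation from a given antibody-antigen complex on
# one side of the split, so near-identical antibodies cannot appear in both
# training and test sets. File and column names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

df = pd.read_csv("ddg_mutations.csv")                      # hypothetical mutation table
feature_cols = [c for c in df.columns if c not in ("ddg", "complex_id")]

for train_idx, test_idx in GroupKFold(n_splits=5).split(
        df[feature_cols], df["ddg"], groups=df["complex_id"]):
    model = RandomForestRegressor(n_estimators=200, random_state=0)  # stand-in predictor
    model.fit(df[feature_cols].iloc[train_idx], df["ddg"].iloc[train_idx])
    preds = model.predict(df[feature_cols].iloc[test_idx])
    r, _ = pearsonr(df["ddg"].iloc[test_idx], preds)
    print(f"held-out complexes: Pearson r = {r:.2f}")
```

Random splits allow mutations of the same antibody to fall on both sides of the split, which is how optimistic headline numbers arise; grouping by complex is the simplest way to remove that leakage.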

The problem is not limited to this approach; it affects the entire field. Previous methods showed similar failures when subjected to rigorous evaluation.

The underlying problem is the small size and limited diversity of current experimental datasets, which contain only a few hundred mutations from a small number of antibody–target pairs.

The research team created synthetic datasets more than 1,000 times larger than current experimental collections to understand what would be needed for robust predictions. Using physics-based computational tools, they generated binding affinity data for almost one million antibody mutations, and on these larger, more diverse datasets AI performance remained strong even under strict testing conditions. Learning curve analyses revealed that meaningful progress likely requires at least 90,000 experimentally measured mutations – roughly 100 times more than the largest current experimental dataset.
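One way to picture a learning-curve analysis of this kind: train the same model on progressively larger subsets of a large (here synthetic) dataset and watch where held-out performance stops improving. The file, column names and model below are placeholders, not the study's pipeline:

```python
# Hedged sketch of a learning-curve analysis: fit the same model on growing
# training subsets and track held-out correlation. A plain random split is
# used for brevity; the stricter complex-grouped split above is the fairer test.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("synthetic_ddg.csv")                      # hypothetical synthetic dataset
features = [c for c in df.columns if c != "ddg"]
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

for n in (1_000, 10_000, 90_000):                          # sizes echoing the article's estimate
    subset = train_df.sample(n=min(n, len(train_df)), random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(subset[features], subset["ddg"])
    r, _ = pearsonr(test_df["ddg"], model.predict(test_df[features]))
    print(f"{len(subset):>7,} training mutations -> test Pearson r = {r:.2f}")
```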

To validate the approach on real experimental data, Graphinity was applied to a dataset of over 36,000 variants of trastuzumab, the breast cancer drug sold as Herceptin. The model successfully distinguished binding from non-binding variants, achieving performance comparable to previous methods whilst offering better potential for generalisation to new antibody–target pairs.
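Because the trastuzumab set is labelled binder versus non-binder rather than with measured affinities, this kind of validation is naturally scored with threshold-free ranking metrics. A hedged sketch, with hypothetical prediction and label columns:

```python
# Hedged sketch: score binder / non-binder discrimination with ranking metrics.
# Lower predicted ddG means tighter binding, so the sign is flipped to make
# higher scores correspond to predicted binders. Column names are hypothetical.
import pandas as pd
from sklearn.metrics import average_precision_score, roc_auc_score

results = pd.read_csv("trastuzumab_predictions.csv")       # hypothetical predictions file
scores = -results["predicted_ddg"]
print("ROC AUC:          ", roc_auc_score(results["binds"], scores))
print("Average precision:", average_precision_score(results["binds"], scores))
```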

Current experimental datasets are heavily skewed, with over half the mutations in one major database involving changes to one amino acid, alanine. This lack of diversity means models struggle to generalise beyond the narrow patterns they have seen during training. 'Our study shows that robust AI models need not just more data, but more varied data,' said co-author Dr Lewis Chinery.
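Skew of this kind is easy to audit. A small illustrative check, assuming mutations are written in the common wild-type/position/mutant form such as 'Y102A' (the toy list below is made up, not data from the study):

```python
# Hedged sketch: count how often each amino acid appears as the introduced
# (mutant) residue, to spot datasets dominated by alanine substitutions.
from collections import Counter

mutations = ["Y102A", "D58A", "S30G", "W47A", "N31D", "T28A"]  # toy examples only
mutant_residue_counts = Counter(m[-1] for m in mutations)

total = sum(mutant_residue_counts.values())
for aa, count in mutant_residue_counts.most_common():
    print(f"{aa}: {count}/{total} mutations ({count/total:.0%})")
```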

More varied data, together with fairer evaluation through blind community challenges such as CASP, AIntibody and Ginkgo’s AbDev, will be important to the development of realistic benchmarks for antibody AI.

The need for more diverse datasets and more rigorous benchmarks points to a broader lesson across computational biology. Progress now depends more on systematic data collection than on algorithmic innovations. 'Data that has been generated to answer specific biological questions is inherently different from the data that is needed to build generalisable AI models,' said Dr Hummer. 'Until we build datasets specifically for AI development, there will be a ceiling to the predictive power we can achieve.'

The full paper ‘Investigating the volume and diversity of data needed for generalizable antibody–antigen ΔΔG prediction’ was published in Nature Computational Science.

The research was supported by the UKRI Medical Research Council, Engineering and Physical Sciences Research Council, Biotechnology and Biological Sciences Research Council, AstraZeneca, and GlaxoSmithKline.