Statistics is not just an academic exercise – the work done in the department advances scientific knowledge and brings real benefits to society. It has helped to predict protein structure, which is used in the development of new drugs, and to predict ‘missing’ genetic data, which makes it easier to identify genes that may be associated with a given disease.
Predicting Proteins in Three Dimensions
Researchers in the Department of Statistics have developed novel rapid methods for predicting protein structure, which are now used by major pharmaceutical companies in the development of new drugs.
Proteins perform crucial functions in all biological processes. What they do is generally determined by their three-dimensional structure, which makes information about protein structure critical to processes such as drug discovery. Experimental methods such as x-ray crystallography, however, cannot always give all the necessary information, especially when trying to establish the loop-structures of proteins. Loops are the most variable regions of protein structure and tend to be the aspect most closely related to protein function. Computational methods for loop prediction can therefore offer a powerful addition to the data derived from experiments.
As there are a vast number of different proteins, it had been assumed that the most powerful approach was to predict protein structure from first principles. However, research led by Professor Charlotte Deane identified that the power of database search methods had been underestimated. Deane and her team re-evaluated FREAD, a database search program and loop-modelling algorithm, and developed this into a new program, pyFREAD.
pyFREAD’s approach was based on the fact that, given fixed anchor structures, the loop’s structure is independent of that of the rest of the protein and is solely determined by its amino acid sequence. Thus, accurate loop modelling can be achieved by searching the database of known structures for a stretch of amino acids with similar sequence and similar anchor structures. The revised program incorporated a completely new scoring system which, combined with bigger databases of protein structures and faster computers, resulted in a significant improvement in the ability to model loops.
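The database-search idea described above can be illustrated with a minimal sketch. It is a toy and not pyFREAD's actual algorithm or scoring system: the `LoopEntry` record, the single anchor-distance number standing in for "similar anchor structures", the sequence-identity score and the thresholds are all assumptions made for illustration, and the database entries are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoopEntry:
    """One loop from a hypothetical database of known structures."""
    name: str
    sequence: str            # amino-acid sequence of the loop
    anchor_distance: float   # distance between the two anchors (angstroms)


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def search_loops(query_seq: str,
                 query_anchor_distance: float,
                 database: List[LoopEntry],
                 max_anchor_diff: float = 1.0,
                 min_identity: float = 0.5) -> List[LoopEntry]:
    """Return database loops of the same length whose anchors and sequence
    are both similar to the query, best match first (toy criteria)."""
    hits = []
    for entry in database:
        if len(entry.sequence) != len(query_seq):
            continue  # only loops of the same length are comparable here
        anchor_diff = abs(entry.anchor_distance - query_anchor_distance)
        if anchor_diff > max_anchor_diff:
            continue  # anchors too different for the loop to fit
        identity = sequence_identity(query_seq, entry.sequence)
        if identity >= min_identity:
            hits.append((identity, anchor_diff, entry))
    # Highest sequence identity first; ties broken by closest anchors.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [entry for _, _, entry in hits]


# Invented example data, for illustration only.
DATABASE = [
    LoopEntry("loop_a", "GSGDK", 9.8),
    LoopEntry("loop_b", "GSGNK", 10.1),
    LoopEntry("loop_c", "GSGDK", 14.0),   # same sequence, anchors too far apart
    LoopEntry("loop_d", "PAPLE", 10.0),   # anchors fit, sequence unrelated
    LoopEntry("loop_e", "GSGDKT", 10.0),  # different length
]

matches = search_loops("GSGDK", 10.0, DATABASE)
print([m.name for m in matches])
```

In this toy, a candidate is kept only if both its anchors and its sequence resemble the query, which mirrors the two conditions named in the text. A realistic implementation would compare full anchor geometry and use a more informative sequence score.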
Subsequent research was undertaken in collaboration with UCB Pharma, a large pharmaceutical company operating in 40 countries worldwide, with a global revenue of €3.4 billion in 2012. This work established that pyFREAD could also be used when multiple segments of data were missing, and thus help to model residues not defined owing to the experimental limitations of x-ray crystallography and molecular dynamics simulations. pyFREAD’s algorithms were also made significantly faster, and the method was generalised to allow modelling of any fragment of the protein, not just loop structures.
UCB Pharma have made extensive use of pyFREAD in their drug discovery work. The company has found the program to be at least 1000 times faster than comparable commercial packages, and to produce more accurate results. Lead compound optimisation is one of the most costly steps in drug discovery and development, requiring on average £6m per campaign, and pyFREAD is expected to save the company over £5m per drug approval.
pyFREAD is also available as a free, downloadable version coded in C, and as a web-based version which in 2013 performed an average of over 60 predictions per month and was visited by over 200 unique users per month from around the world. It is used regularly by, among others, Oxford Drug Design, an Oxford spin-out computational drug discovery company.
The predictive power of FREAD: The black loop is the actual protein loop structure. The grey loop shows the prediction made by the original FREAD program. The white loop is the prediction made by Deane’s new version of FREAD – very close to the actual structure.
“The research work at Professor Deane’s laboratory has generated significant economic value for UCB Pharma through the acceleration of the drug discovery process. More importantly, faster drug discovery means that patients receive better treatment sooner. While the impact on patients’ quality of life is hard to quantify, it is what matters most”
Director of Computational Structural Biology, UCB Pharma
Research funded by EPSRC, BBSRC and the Wellcome Trust
IMPUTE – a powerful new statistical tool for identifying ‘disease genes’
IMPUTE, developed in the Department of Statistics, has changed the field of human genetics by enabling the accurate prediction of ‘missing’ genetic data. This allows much easier identification of genes that may be associated with any given disease.
In genetic studies of human disease it is now routine to collect genetic data on thousands of individuals. A typical study will measure up to a million variable positions across the genome (single nucleotide polymorphisms, or SNPs) in thousands of subjects, and look for significant differences between individuals with and without a particular disease. The identification of these ‘disease genes’ can help understanding of the disease mechanisms. However, the genetic data collected is incomplete, with many millions of sites of the genome unmeasured. Any method of predicting, or imputing, the unobserved genetic data would be of enormous use in genetic studies.
The first model able to do this was developed by Professor Jonathan Marchini and Professor Peter Donnelly as part of their involvement in the Wellcome Trust Case Control Consortium (WTCCC) from 2006 to 2007. They realised that existing reference databases such as the 1000 Genomes Project (which contains millions of SNPs) could be used to help predict unobserved genotypes, and that Hidden Markov models recently developed in the area of population genetics could be adapted to carry out this task.
Professor Marchini wrote IMPUTE to predict missing data using patterns of haplotypes (a set of SNPs that are associated statistically) that are shared between two datasets: a reference database and a genetic study. IMPUTE was applied successfully to all 7 disease studies carried out by the WTCCC. For common genetic variants of interest, the accuracy of imputation is over 95%. Further refinements to IMPUTE enabled the method to scale better as reference databases increased in size, allowing the selection of relevant subsets of the reference database for each individual. This also allowed predictions to be matched to an individual’s ancestry (e.g. European or African).
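The idea of borrowing shared haplotype patterns from a reference database can be shown with a deliberately simplified sketch. This is not IMPUTE's method: IMPUTE adapts Hidden Markov models and produces probabilistic predictions, whereas the toy below simply copies missing alleles from the single reference haplotype that best matches the observed positions. The function name `impute_haplotype` and all data are invented for illustration.

```python
from typing import List, Optional, Sequence


def impute_haplotype(study: Sequence[Optional[int]],
                     reference: Sequence[Sequence[int]]) -> List[int]:
    """Fill in missing alleles (None) in a study haplotype by copying them
    from the closest-matching reference haplotype (toy nearest-match rule)."""
    if not reference:
        raise ValueError("reference panel is empty")
    if any(len(ref_hap) != len(study) for ref_hap in reference):
        raise ValueError("all haplotypes must cover the same SNP positions")

    def mismatches(ref_hap: Sequence[int]) -> int:
        # Count disagreements at the positions the study actually measured.
        return sum(1 for s, r in zip(study, ref_hap)
                   if s is not None and s != r)

    best = min(reference, key=mismatches)
    # Keep observed alleles; copy the rest from the best reference match.
    return [r if s is None else s for s, r in zip(study, best)]


# Invented data: alleles coded 0/1 at six SNP positions.
REFERENCE = [
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0],
]
# The study measured positions 0, 2 and 4 only.
STUDY = [1, None, 0, None, 1, None]

imputed = impute_haplotype(STUDY, REFERENCE)
print(imputed)
```

A single best match ignores uncertainty and recombination between haplotypes, which is part of what a Hidden Markov model approach is designed to handle.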
One key benefit of the method is that once unobserved genotypes have been predicted in several different studies, they can then be combined, via meta-analysis, to produce much more powerful studies. This approach has changed the field of human genetics and groups now routinely share data via this approach. One of the earliest examples of this was in the study of type 2 diabetes; a meta-analysis of three studies involving over 10,000 individuals and 2.2 million SNPs led to the discovery of 6 new genes that were strongly associated with the inheritance of type 2 diabetes.
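To show why combining studies gives more power, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis for a single SNP. The text does not say which meta-analysis method the studies used, so this standard technique is an assumption, and the effect sizes and standard errors below are invented.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(betas: Sequence[float],
                      ses: Sequence[float]) -> Tuple[float, float, float, float]:
    """Combine per-study effect estimates for one SNP by inverse-variance
    weighting. Returns (pooled beta, pooled SE, z-score, two-sided p-value)."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in ses]   # more precise studies count more
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
    return beta, se, z, p


# Invented per-study results for the same imputed SNP.
study_betas = [0.10, 0.14, 0.12]
study_ses = [0.05, 0.05, 0.05]

beta, se, z, p = fixed_effect_meta(study_betas, study_ses)
print(f"pooled beta={beta:.3f}, SE={se:.4f}, z={z:.2f}, p={p:.2e}")
```

The pooled standard error is smaller than that of any single study, so an association too weak to stand out in one study can become clear once the studies are combined. This only works when every study has a genotype, measured or imputed, at the same SNP.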
IMPUTE has had a significant impact on two major companies working in the field of genetics and pharmaceuticals: Affymetrix and Roche. Affymetrix uses IMPUTE as a central part of the process of designing all its genotyping products including SNP arrays, which are used to study slight variations between genomes and thus determine susceptibility to disease. The design of these arrays has helped the company to win a genotyping contract worth around £25m.
Roche saved around $1m by using IMPUTE in a study of drug response. Many medications exhibit a variable response rate that is thought to be partly genetic. Roche used IMPUTE to analyse the genetics of response to the drug tocilizumab (used to treat rheumatoid arthritis). The study implicated 8 loci in patient response to tocilizumab treatment, and showed that patients carrying the specific genetic markers had a higher remission rate than those who did not.
“The impute software that we licenced from Oxford University has been used extensively at Affymetrix […] This has made a significant impact in the way we design arrays and could not have happened without using IMPUTE2, which has been shown to be the most accurate method of imputation in the literature”
Vice President for Informatics, Affymetrix
Genome-wide scan for seven common diseases from the Wellcome Trust Case Control Consortium, using IMPUTE
Research funded by the Wellcome Trust