Trinity Term 2009

1st Year Presentations

Independence preserving graphs

In this talk we introduce DAG models that capture the conditional independence statements generated by the joint density of a probability distribution defined for a random vector. We then introduce three known classes of independence-preserving graphs with three types of edge, called MC, summary and maximal ancestral graphs, that are closed under marginalization and conditioning and contain all DAG independence models.

Kayvan Sadeghi

-----------------------------

Estimating mutation rates using high-throughput sequencing

Some rare congenital disorders (such as Apert, Crouzon, Pfeiffer, Muenke, Costello and Noonan syndromes and achondroplasia) originate from spontaneous mutations in the germline of healthy fathers who are older than average (the paternal age effect). The prevalence of these specific mutations may reflect a protein-driven, positive selection of mutant cells according to the functional consequences of the encoded amino acid substitution. We aimed to use new sequencing technologies to quantify all possible point mutations at codon K650 (AAG) of the fibroblast growth factor receptor 3 (FGFR3) leading to achondroplasia.

78 sperm and 8 blood samples were sequenced using the Illumina sequencing technology. An attempt to use Illumina's inbuilt quality scores to estimate the rate of de novo mutations could not account for the underlying error structure. Therefore, a Bayesian approach was used to fit a model to the observed counts of each codon in the sequencing data, accounting for errors and bias derived from the rounds of PCR and digestion during sample preparation and from the sequencing process. Titration data were analysed together with biological samples to validate our method down to the level of 10^-5. Whilst mutation rates in blood were low, 73% of the total mutations quantified in the sperm samples were caused by a 1948A>G mutation. This mutation reached high levels in sperm samples (with a maximum of 2.1 x 10^-4), and its level was significantly correlated with donor age (Spearman rank r=0.34, P=0.002). Several other substitutions attained levels >10^-5 in a minority of sperm samples. These results show the utility of advanced statistical methods for estimating mutation rates in human sperm from high-throughput sequencing data down to the level of 10^-5, whilst capturing subtle features of the machine- and run-dependent error structure.

Pfeifer

---------------------------

Embedding Levy Process in Brownian motion

Yu Xue

---------------------------

Stochastic simulation of the yeast cell cycle

Enuo He

---------------------------

Bayesian Methods of Estimating Human Ancestry using whole genome SNP data

Estimation of the genetic ancestry of an individual is useful for association studies, disease risk prediction and population genetic analyses, and is of inherent interest to the individuals themselves. There has been a recent rise in interest in ancestry inference from personal genomics companies, such as 23andMe. I have investigated two Bayesian methods of estimating ancestry using whole-genome SNP data on each individual. The first method performs well in producing a global estimate of overall ancestry proportion, while the second method is superior in its ability to infer ancestry locally, along the chromosomes.

Claire Churchhouse

-------------------------------

The TASEP and Related Models

The totally asymmetric simple exclusion process (or TASEP) is a simple model for an interacting particle system in which particles move on the integer line and interact by exclusion. It relates to various other mathematical models like the corner growth model, last-passage percolation and competition interfaces.

Philipp Schmidt
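
As an illustration of the exclusion rule described in the abstract above, here is a minimal simulation sketch. It is not taken from the talk. It assumes a finite number of particles on the integer line, a step initial condition (particles at 0, -1, -2, ...), and a discrete update in which one particle is chosen uniformly at random per attempt. The function name `simulate_tasep` and its parameters are hypothetical.

```python
import random


def simulate_tasep(n_particles=5, n_attempts=1000, seed=0):
    """Simulate a finite TASEP on the integer line (illustrative sketch).

    Assumptions (not from the abstract): step initial condition and a
    random-sequential update, i.e. one uniformly chosen particle per attempt.
    Returns particle positions, with index 0 the rightmost particle.
    """
    rng = random.Random(seed)
    # Step initial condition: particles occupy sites 0, -1, ..., -(n - 1).
    pos = [-k for k in range(n_particles)]
    for _ in range(n_attempts):
        k = rng.randrange(n_particles)
        # Totally asymmetric: a particle only ever tries to jump one site right.
        # Exclusion: the jump is suppressed if the target site is occupied.
        # The rightmost particle (k == 0) is never blocked.
        if k == 0 or pos[k - 1] > pos[k] + 1:
            pos[k] += 1
    return pos


if __name__ == "__main__":
    print(simulate_tasep())
```

Because particles cannot overtake one another, the returned positions are always strictly decreasing in the particle index.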

-------------------------------

Quality score recalibration of next generation sequencing reads: application to the 1000 Genomes data

Androniki Menelaou

-------------------------------

Massively Parallel Advanced Monte Carlo Methods on Many-Core Processors

A recent trend in desktop computer architecture is the move from traditional, single-core processors to multi-core processors and further to many-core or massively multi-core processors. Therefore, statistical methods that can take advantage of many-core architectures are likely to make the best use of the latest technology. A particularly promising avenue in this regard is the implementation of statistical algorithms for execution on graphics processing units (GPUs), since they are low-cost, low-maintenance, energy-efficient devices that are becoming increasingly easy to program. I present a case study on the suitability of GPUs for three population-based Monte Carlo algorithms (population-based MCMC, sequential Monte Carlo samplers and the particle filter), with speedups ranging from 35- to 500-fold over conventional single-threaded computation.

Anthony Lee

------------------------------------

Lasso Isotone for High Dimensional Additive Monotone Regression

In this talk, we introduce the Liso (Lasso ISOtone) estimate, involving a Lasso style penalty on the range of each component function. This estimator has the desirable property of invariance under monotone transformations of predictors, can be computed efficiently, and produces simple sparse representations of estimated functions.

We identify a backfitting-type algorithm for calculating Liso that is efficient in both computational intensity and memory usage. We demonstrate its performance on simulated and real data, with comparison to other non-parametric estimators. Finally, if there is time, we will discuss the advantages and shortcomings of the estimator and the algorithm.

Zhou Fang

------------------------------------

Do differences in the transcriptional activity of the HLA influence disease susceptibility? Outline of a case-control based approach

Alexander Dilthey

-----------------------------------

Estimating Heterogeneous Treatment Effects in Randomized Experiments

Randomized experiments have become increasingly important for both political researchers and practitioners. With few exceptions, these experiments have addressed the overall causal effect of an intervention across the entire population, known as the Average Treatment Effect (ATE). A much broader set of questions can often be addressed by allowing for heterogeneous treatment effects.

We discuss methods for estimating such effects in other disciplines and introduce key concepts, especially the Conditional Average Treatment Effect (CATE), to the analysis of randomized experiments in politics. We expand on this literature by proposing a novel application of Generalized Additive Models (GAMs) to estimate non-linear heterogeneous treatment effects. We demonstrate the practical importance of these techniques by re-analyzing a major experimental study published in the American Political Science Review in 2008 and a previously unpublished experiment from the 2008 US election.

Avi Feller
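
For reference, the two quantities named in the abstract above are usually written as follows in potential-outcomes notation. The notation is an assumption, not taken from the talk: Y_i(1) and Y_i(0) denote unit i's outcomes with and without the intervention, and X_i denotes its covariates.

```latex
\mathrm{ATE} = \mathbb{E}\bigl[\, Y_i(1) - Y_i(0) \,\bigr],
\qquad
\mathrm{CATE}(x) = \mathbb{E}\bigl[\, Y_i(1) - Y_i(0) \mid X_i = x \,\bigr],
\qquad
\mathrm{ATE} = \mathbb{E}_{X}\bigl[\, \mathrm{CATE}(X_i) \,\bigr].
```

The last identity shows that the ATE is the average of the CATE over the covariate distribution. A treatment effect is heterogeneous when CATE(x) varies with x.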

---------------------------