From exact marginals to good importance sampling
DNA sequences sampled from a population have an unobservable genealogical history. In analyzing such data, a crucial stepping stone is the ability to integrate over evolutionary histories, weighting each by its probability under a model and given the data. This has been a focus of research for more than two decades. The basic probability model for genealogical histories without recombination was given by Watterson (1975) and Kingman (1982). Until Griffiths and Tavaré (1994), this model was used solely as a tool for simulating genealogical histories without regard to the content (mutational configuration) of the sequences. Since the late 1990s there has been a string of attempts to apply stochastic integration methods (importance sampling, MCMC, ...) to this integration. In the absence of recombination it is a hard but doable problem. Given the enormous increase in DNA sequence data from populations and the role of this computation in genetic mapping, the problem remains as important as ever. Major contributions can be found in Stephens and Donnelly (2000) and Hobolth et al. (2008).
The basic idea of this project is to make maximal use of the cases where exact computations can be performed, and to use them to construct either a pseudo-likelihood function for the data (Hudson, 2001) or an importance sampler. We will only consider the case without recombination. In this case, we think the approach has considerable potential and could make it possible to analyze data sets of fully realistic size, for instance 100 segregating sites and many hundreds of samples.
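
To illustrate why exact computations can help in building an importance sampler, here is a minimal sketch in Python. It is a toy example, not the method of the project: the "histories" are just three abstract states, and the prior, likelihood values and function name (importance_estimate) are all invented for illustration. The likelihood L = sum over h of P(h) P(D | h) is estimated by drawing h from a proposal Q and averaging the weights P(h) P(D | h) / Q(h). When Q equals the exact posterior P(h | D), every weight equals L, so the estimator has zero variance; a proposal built from exact marginals aims to come close to this ideal.

```python
import random


def importance_estimate(prior, lik, proposal, n, rng):
    """Estimate L = sum_h prior[h] * lik[h] by drawing h from `proposal`.

    Returns the estimate and the list of importance weights.
    """
    states = list(range(len(prior)))
    draws = rng.choices(states, weights=proposal, k=n)
    # Importance weight: target (prior * likelihood) over proposal density.
    weights = [prior[h] * lik[h] / proposal[h] for h in draws]
    return sum(weights) / n, weights


# Hypothetical toy model: three latent "histories" (values are made up).
prior = [0.5, 0.3, 0.2]   # P(h) under the model
lik = [0.01, 0.2, 0.6]    # P(D | h): probability of the data given h

# In this tiny example the exact answer is available by enumeration.
exact = sum(p * l for p, l in zip(prior, lik))
posterior = [p * l / exact for p, l in zip(prior, lik)]

rng = random.Random(0)

# Naive proposal: sample histories from the prior, ignoring the data.
est_prior, w_prior = importance_estimate(prior, lik, prior, 2000, rng)

# Ideal proposal: sample from the exact posterior; all weights equal L.
est_post, w_post = importance_estimate(prior, lik, posterior, 2000, rng)

print("exact likelihood:        ", exact)
print("prior proposal estimate: ", est_prior)
print("posterior proposal est.: ", est_post)
print("weight spread (prior):   ", max(w_prior) - min(w_prior))
print("weight spread (post.):   ", max(w_post) - min(w_post))
```

In realistic problems the exact posterior over histories is not available, which is why the choice of proposal matters; the sketch only shows the limiting case that a well-constructed proposal tries to approximate.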