Statistical Alignment via k-Restricted Steiner Trees

The problem of inferring the phylogeny relating a set of homologous sequences together with the ancestral sequences is really just a special case of a more general problem called the Steiner tree problem. The Steiner tree problem is the problem of connecting a set of points by a tree structure, usually such that the total length of the tree is minimised. A common example is connecting a number of houses by a phone line network such that the amount of cable used is minimal. The Steiner tree problem is in general a hard problem to solve, but numerous methods to achieve a good solution have been proposed. Many, if not most, of the most succesful such methods are based on connection optimal solutions for small subsets of the points that need to be connected at shared points. A tree based on subsets of at most k points is called a k-restricted Steiner tree.

This translates nicely into a heuristic for  boosting the power of statistical alignment. If we can align k sequences, we can align more sequences by defining (non-disjoint) subsets of k sequences and connect alignments of these subsets at shared sequences. Once we have a k-restricted Steiner tree relating the sequences, the probability of an alignment or of the sequences having evolved according the tree factors nicely into a product of independent contributions from each component. This project intends to explore how well statistical alignment combines with methods for constructing the k-restricted Steiner trees based on state-of-the-art approximation algorithms as well as from phylogenies computed by standard phylogeny reconstruction programs.