SAPF: Statistical Aligner and Phylogenetic Footprinter
Rahul Satija, Lior Pachter, Jotun Hein

USAGE:

After initial setup (see SETUP), the program can be run with a command of the form:

    perl Sapf.pl alignment tree model [-noparams]

Results are stored in the file results.prob. The results consist of base-by-base probabilities, for one reference sequence, that the base was generated by a "slow" state in the model. The estimated parameter set is reported in params.output.

Example usage:

    perl Sapf.pl TestThree.aln example.tree models/FullModel3.composed

This command runs SAPF on one of the simulated alignments from the first simulated dataset. The number of sequences has been reduced to three so that the example runs quickly. In the alignment file TestThree.aln, capital letters represent bases that were simulated as functional.

ARGUMENTS:

alignment: an initial multiple sequence alignment of the sequences to be analyzed, in ClustalW format. The alignment needs to be only a rough approximation: SAPF will calculate a probability distribution over all alignment columns within 50bp of this first alignment. A fast, heuristic aligner like ClustalW is ideal. While tree and model files can be located anywhere, the alignment file should be located in the directory with the perl scripts.

tree: a tree file, in newick format, relating the species in the alignment. The tree must contain all species in the alignment, but additional species are OK (SAPF automatically prunes the tree to consider only relevant species). The reference sequence, for which results are reported, is specified by the left-most node of the tree.

model: a model file generated by phyloComposer [1]. phyloComposer generates a multiple HMM based on a branch HMM and a guide tree. phyloComposer is available as part of the DART package at biowiki.org/DART if you would like to make your own models, but I have included sample models for 4, 3, and 2 sequences in the models directory.

-noparams (optional): skips the parameter estimation process and uses the initial guess.

SETUP:

After installing (see INSTALL), there are two files to set up for SAPF:

1. sapf.dir: to save time and memory, SAPF pre-computes emission and transition probabilities. Edit the file sapf.dir to specify the directory where they will be stored.

2. initial.param: holds the initial parameter guess. The order is the same as Table 3 in the Supplementary Material. The output file params.output uses the same format.

ADVANCED:

Corner Cutting. SAPF first calculates a set of pairwise alignments and then uses them to restrict the path through the multidimensional DP matrix. Pairwise homologies below a certain probability cutoff are discarded. The cutoff is currently set to ln(10), approximately 2.3. This means that all pairwise alignment columns less than 1/10 as probable as the Viterbi alignment column will be discarded. Increasing this cutoff, set by the $ProbCutoff variable at the beginning of Sapf.pl, will increase the number of alignment columns considered but will also increase the running time. While we currently use the Viterbi alignment as the reference for corner cutting, we hope to replace this with the MPP alignment (see manuscript) in a future version.

CONTACT

Please address all questions and comments to Rahul Satija at satija@stats.ox.ac.uk

REFERENCES

[1] Holmes, I. (2007). Phylocomposer and Phylodirector: Analysis and Visualization of Transducer Indel Models. Bioinformatics.
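
NOTE ON THE CORNER-CUTTING CUTOFF

The cutoff described under ADVANCED is a margin in natural-log units: a pairwise column is kept only if its log probability lies within the cutoff of the Viterbi column's log probability. The Python sketch below illustrates that arithmetic only. It is not taken from Sapf.pl; the names PROB_CUTOFF and keep_columns, the column labels, and the probabilities are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical stand-in for the $ProbCutoff variable in Sapf.pl (natural-log units).
PROB_CUTOFF = math.log(10)  # roughly 2.3, i.e. a factor of 10 in probability


def keep_columns(column_log_probs, viterbi_log_prob, cutoff=PROB_CUTOFF):
    """Return labels of columns whose log probability is within `cutoff`
    of the reference (Viterbi) column's log probability."""
    return [
        label
        for label, log_p in column_log_probs.items()
        if viterbi_log_prob - log_p <= cutoff
    ]


# Made-up probabilities, compared against a Viterbi column of probability 0.5.
viterbi = math.log(0.5)
columns = {
    "a": math.log(0.5),   # as probable as the Viterbi column
    "b": math.log(0.1),   # 1/5 as probable
    "c": math.log(0.01),  # 1/50 as probable
}

kept = keep_columns(columns, viterbi)                              # default cutoff ln(10)
kept_wider = keep_columns(columns, viterbi, cutoff=math.log(100))  # larger cutoff

print(kept)
print(kept_wider)
```

With the default margin, column "c" falls outside the factor-of-10 window and is dropped. Raising the margin to ln(100) retains it, which is why a larger cutoff means more columns for the multidimensional DP to consider.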