The program RecMin.c calculates a minimum number of recombination events required in a sample history under the assumption that each segregating site in the sample has mutated exactly once. It is suitable for use on phased datasets with diallelic segregating sites. The program can be used to give an overall bound for the region, as well as producing a matrix giving the minimum number of recombinations between every pair of sites. For further details of the algorithms employed and properties of the minima produced, see Myers and Griffiths (2003).

The input data format required is a binary text file (input_filename below), with the start of data indicated by a double forward slash. The next line should contain the positions of each site, separated by spaces, and subsequent lines must contain the data sequences to be read in. The reader should refer to the file testdata.txt, in this directory, for an example of the format accepted by the program, as well as details of how multiple input datasets can be included in the same file.

To compile, type e.g.

gcc RecMin.c -o RecMin.exe -lm

(optimisations such as -03 could also help with certain compilers).

To run, type

RecMin.exe input_filename output_filename options

Options:
        -s valss (maximum subset size value, default valss=5)
        -w valw  (maximum width value, default valw=12; pick valw >= valss)
        -t known (ancestral types known if known=1, unknown if known=0, default unknown)
        -d disp  (display parameters, affects what is put on screen:
                  disp=0 show bounds only to file only
                  disp=1 show bounds only to screen and file
                  disp=2 show bounds, solution and final matrix in file only
                  disp=3 show bounds, solution and matrix on screen and file
                  disp=4 as 3 but also show bounds matrix both places
                  disp=5 as 4 but also output additional information, default
                  disp=3)
        -e enh.  (enhanced history based bound used if enhance=1,default 0)

For example:

RecMin.exe input.txt output.txt -s 8 -w 20 -t 0 -d 1 

will put valss=8, valw=20, known=0, disp=1 and use sequence file input.txt, outputting the results to output.txt.

Hudson and Kaplan's (1985) bound R_m is always calculated first with a solution determined through the method of the program (the computation process of this bound is needed by the program). The program then calculates an improved bound, which is dependent on the parameters valss, valw and enh., which may be set by the user. Setting known=1 improves enables the program to used knowledge of which types are ancestral (the 0 type at each site is assumed ancestral in this case; this might be appropriate if sequence information is available for a suitable outgroup) to possibly improve the bounds.

The program currently works for haplotype data, which may contain positions where there is some missing data (actually, only the minima R_m and R_h may be calculated in the case of missing data; the algorithm for R_s is more difficult to adapt to this case, so the program gives an error message if the user chooses enhance=1 and missing data is found). Unfortunately, the program does not handle genotype data due to the difficulties involved in developing local bound sets for such data.

According to whether enhance is 0 or 1 respectively, the program calculates the (normally quick) "haplotype" bound R_h or attempts to calculate a "history based" bound R_s which always improves this, but is often uncalculable for larger datasets (see Myers and Griffiths (2003) for more details). In both cases, the program calculates a large set of minimum numbers of recombination events that apply to sub-regions of the data, by choosing appropriate subsets of the set of segregating sites. This results in a set of bounds that are then combined in an optimal way using a dynamic programming algorithm, which improves these bounds using information from the full set simultaneously. The two statistics R_h and R_s differ in the method used to calculate the initial bound set.

The other parameters valss and valw only have an effect if enh. is 0, and in this case determine the maximal subset size and endpoint distance of the segregating sites in subsets considered in calculating R_h. Increasing either of these, especially the latter, improves R_h but increases the runtime somewhat; a prudent approach for a given dataset is to begin with reasonable values (eg the default values) and then increase first valss and then valw and repeat until either the bounds stop increasing (so no more useful information is being extracted) or the computational time becomes too long. This approach is valid since any choice of the parameters valss and valw maintains the lower bound properties of the statistic R_h. Looking at the raw bound matrix (see below), the enhance bound is likely to be usable only if the raw haplotype bounds are all smaller than about 9, and so it is wise to use enh.=0 (the default) initially.

The parameter disp controls what is put on the screen and written to the output file given.

The program always outputs a summary of the results obtained across the datasets analysed; this is of the form of giving the expectation of the statistics and their squares, minimum and maximum values obtained, and number of datasets analysed. (Of course if only one set of sequences is input, these numbers will just reflect the single two minima obtained.) In addition to this information, disp can be set to show only the bounds produced (to only the file or the screen too) using disp=0 or 1. If disp is increased to 2 or 3, the program outputs a particular minimal solution for the recombination endpoint positions (which places all recombination events as far along the sequence as possible towards the right), as well as a matrix R. R gives the minimal number of recombinations needed to satisfy the local bounds for every pair of sites, and so gives a more full picture of the detection landscape than any single optimal solution (there are often very many other possible optimal solutions). Most users will probably not wish to increase disp beyond 3; if set to four or more, the program outputs the raw bound matrix used before all local bounds are combined (of which some entries may be smaller than the corresponding entries of R, or even zero if subsets with such endpoints did not need to be considered by the program), and other internally employed information for even larger values.

Refs: 
      Myers and Griffiths, Genetics 163(1):375-394, January, 2003
      Hudson and Kaplan,   Genetics 111   :147-164, September, 1985