Phasing Algorithm Benchmark Datasets


This webpage describes the benchmark datasets used to evaluate phasing methods in the paper

J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z.S. Qin, H.M. Munro, G.R. Abecasis, P. Donnelly, and International HapMap Consortium (2006) A Comparison of Phasing Algorithms for Trios and Unrelated Individuals. American Journal of Human Genetics, 78:437-450

It is our intention that the datasets used in this paper form the basis of a benchmark set of data made freely available for the further development and open assessment of methods.

The datasets are distributed without the answers, but investigators may send sets of imputed haplotypes to us and we will assess their accuracy using the same measures used in the above paper. Investigators should be prepared for these results to be posted on this webpage at a mutually agreed later date.

Datasets

The following simulated and real datasets were used in the assessment of the different methods

SU1
100 data sets of 90 unrelated individuals simulated with constant recombination rate across the region, constant population size, and random mating. Each of the 100 data sets consisted of 1 Mb of sequence.
SU2
Same as SU1, but with the addition of a variable recombination rate across the region.
SU3
Same as SU2, except a model of demography consistent with white Americans was used.
SU4
Same as SU3, with 2% missing data (missing at random).
SU-100kb
Since some studies may be concerned only with the performance of phasing algorithms on lengths of sequence shorter than 1 Mb, we simulated a set of data sets identical to set SU3, except that the sequences were only 100 kb in length. Each of these 100-kb data sets was created by subsampling a set of 1,180 simulated haplotypes. The remaining 1,000 haplotypes were used to estimate the "true" population haplotype frequencies. This allowed a comparison of each method's ability to predict the haplotype frequencies in a small region of interest.
ST1
100 data sets of 30 trios simulated with constant recombination rate across the region, constant population size, and random mating. Each of the 100 data sets consisted of 1 Mb of sequence.
ST2
Same as ST1, but with the addition of a variable recombination rate across the region.
ST3
Same as ST2, except a model of demography consistent with white Americans was used.
ST4
Same as ST3, with 2% missing data (missing at random).
RU
We used the HapMap CEU sample to create artificial data sets of unrelated individuals by simply removing the children from each of the trios. Since the phase of a large number of heterozygous genotypes will be known from the trios, we can use these phase-known sites to assess the performance of the algorithms for unrelated data. One hundred 1-Mb regions were selected at random from the CEU sample and processed in this way.
RT-CEU
100 data sets consisting of 30 HapMap CEU trios across 1 Mb of sequence. For each data set, we created 30 new data sets, each with a different trio altered so that the transmission status of the alleles in one of the parents is switched. By switching only one trio at a time to create a new data set, the majority of the genotypes are unaltered, and a minimum amount of new missing data is introduced. In each region, the error rates for the different algorithms were calculated using only the phase estimates in the altered trios.
RT-YRI
Same as RT-CEU, except 30 HapMap YRI trios were used.

The datasets can be downloaded from here - test.data.tgz

The subdirectories SU1, SU2, SU3, SU4, SU-100kb and RU contain datasets of unphased unrelated individuals. There are 100 datasets in each directory. For each dataset there is a file containing the genotypes and a file containing the positions of the SNPs in base pairs.

For example, SU1/genos.haps.1 contains the genotypes for the first dataset in the SU1 set. The genotype of each individual is split across 2 lines. The alleles are coded 1 and 2, a 9 is used for a heterozygous site, and 0 is the code for missing data. An individual ID appears at the end of each line. SU1/posinfo.1 gives the positions of the sites for the first dataset in the SU1 set. The third column of this file gives the positions of the sites in base pairs.
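The file layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function names `parse_genotype_lines` and `read_positions` are hypothetical, fields are assumed to be separated by whitespace, and the allele codes are accepted either as one string or as one field per site, since the exact spacing is not specified here.

```python
from typing import List, Tuple


def parse_genotype_lines(lines: List[str]) -> List[Tuple[str, str, str]]:
    """Return (individual_id, first_line_codes, second_line_codes) per individual.

    Assumption: the individual ID is the last whitespace-separated field and
    everything before it is made of the codes 1, 2, 9 (heterozygous), 0 (missing).
    """
    rows = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        ind_id = fields[-1]
        codes = "".join(fields[:-1])
        if set(codes) - set("0129"):
            raise ValueError(f"unexpected genotype code for {ind_id}")
        rows.append((ind_id, codes))
    if len(rows) % 2:
        raise ValueError("expected two lines per individual")
    individuals = []
    for (id_a, codes_a), (id_b, codes_b) in zip(rows[0::2], rows[1::2]):
        if id_a != id_b or len(codes_a) != len(codes_b):
            raise ValueError(f"lines for {id_a} and {id_b} do not pair up")
        individuals.append((id_a, codes_a, codes_b))
    return individuals


def read_positions(lines: List[str]) -> List[int]:
    """Return the base-pair positions from the third column of a posinfo file."""
    return [int(float(line.split()[2])) for line in lines if line.strip()]
```

To use it on a real file, pass the result of `open(path).read().splitlines()` for a genotype file or a posinfo file.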

The subdirectories ST1, ST2, ST3 and ST4 contain datasets of simulated unphased father-mother-child trios. There are 100 datasets in each directory. For each dataset there is a file containing the genotypes of the parents (pgenos.haps prefix), a file containing the genotypes of the children (cgenos.haps prefix) and a file containing the positions of the SNPs in base pairs (posinfo prefix). The format of the genotype and positions files is as above. The genotypes for the parents of each trio occur consecutively in the pgenos.haps files. The genotypes of the children in the cgenos.haps files are in the same family order as the parents in the pgenos.haps files. An individual ID appears at the end of each line in the pgenos.haps and cgenos.haps files. For example, the first 12 lines of one of the pgenos.haps files end with the following IDs, which correspond to the parents of the first 3 trios

FAM1:FATH
FAM1:FATH
FAM1:MOTH
FAM1:MOTH
FAM2:FATH
FAM2:FATH
FAM2:MOTH
FAM2:MOTH
FAM3:FATH
FAM3:FATH
FAM3:MOTH
FAM3:MOTH

and the first 6 lines of the associated cgenos.haps file end with the following IDs, which correspond to the children of the first 3 trios

FAM1:CHILD
FAM1:CHILD
FAM2:CHILD
FAM2:CHILD
FAM3:CHILD
FAM3:CHILD

The subdirectories RT-CEU and RT-YRI contain datasets of real unphased father-mother-child trios. For each of the 100 pgenos.haps files there are 30 associated cgenos.haps files. Each of these files has had a different child's genotypes replaced with the genotypes of a child created by switching the transmission status of the alleles in one of the parents. Please note that, as the RT-CEU and RT-YRI datasets are real datasets, there are a small number of Mendel errors in some trios, which may cause problems for some algorithms if they are not detected. In the above paper these errors were set to missing data before the algorithms were run.
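As an illustration of the Mendel error screening mentioned above, the following sketch flags sites where a child's genotype cannot be formed from one allele of each parent and sets them to missing. It is not the procedure used in the paper: the function names are hypothetical, each individual is represented by one code per site (1, 2, 9 for heterozygous, 0 for missing), and setting the site to missing in all three trio members is an assumption, since the text does not say which individuals were masked.

```python
from typing import Tuple

# Alleles an individual can carry for each genotype code.
# A missing parent (0) is treated as able to transmit either allele.
ALLELES = {"1": {"1"}, "2": {"2"}, "9": {"1", "2"}, "0": {"1", "2"}}


def is_mendel_error(father: str, mother: str, child: str) -> bool:
    """True if the child's genotype at one site is impossible given the parents."""
    if child == "0":
        return False  # a missing child genotype cannot conflict with anything
    possible = {
        frozenset((a, b)) for a in ALLELES[father] for b in ALLELES[mother]
    }
    return frozenset(ALLELES[child]) not in possible


def mask_mendel_errors(father: str, mother: str, child: str) -> Tuple[str, str, str]:
    """Set every site with a Mendel error to missing (0) in all three individuals."""
    out_f, out_m, out_c = [], [], []
    for f, m, c in zip(father, mother, child):
        if is_mendel_error(f, m, c):
            f = m = c = "0"  # assumption: mask the whole trio at this site
        out_f.append(f)
        out_m.append(m)
        out_c.append(c)
    return "".join(out_f), "".join(out_m), "".join(out_c)
```

For example, a child coded 9 with both parents coded 1 is flagged, while any child genotype is accepted when both parents are heterozygous.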

Please refer to the paper above for more details about how these datasets were created.

In addition, a set of trial datasets with answer files is available from here - trial.data.tgz
These datasets were used by the authors of the above paper in the extension of the algorithms to handle trio data. The datasets consist of smaller versions of the ST1, ST2, ST3 and ST4 datasets described above.

Assessing Performance

Please contact Jonathan Marchini (marchini <at> stats.ox.ac.uk) if you would like to assess the performance of your method on these test datasets. Please do not email a set of results to this email address. Due to the size of the datasets it may be necessary to find an alternative method of submitting the results, for example by posting them on a website.

We will only assess the accuracy of results for the following sets of datasets
I. Simulated Unrelated Individuals Datasets (SU1, SU2, SU3, SU4, SU-100kb)
II. Simulated Trio Datasets (ST1, ST2, ST3, ST4)
III. Real Unrelated Individuals Datasets (RU)
IV. Real Trio Datasets (RT-CEU, RT-YRI)

Please submit the results in the following format
(a) One line for each imputed haplotype.
(b) Pairs of haplotypes corresponding to an individual should appear in the file in the same order as the individuals appear in the input files.
(c) Do not include any individual ids in the files.
(d) Use the same coding of alleles as used in the input files.
(e) We would strongly recommend you use the naming convention for the answer files you submit illustrated using the following examples

Input files                                     Answer files
SU1/genos.haps.1                                SU1/genos.haps.res.1
ST1/pgenos.haps.1, ST1/cgenos.haps.1            ST1/genos.haps.res.1
RT-CEU/pgenos.haps.1, RT-CEU/cgenos.haps.1.5    RT-CEU/genos.haps.res.1.5

Please organize results files in a set of directories with the same names as we have used for each set of datasets. Please note that it is not necessary to submit the files containing the estimated haplotypes of the children for the trio datasets.
(f) If your method produces estimates of haplotype frequencies then please submit frequency estimates for the SU-100kb datasets for comparison with other methods. A simple list of haplotypes with their frequency estimates will be fine, e.g.

222222222222222122 0.005282
222222112211212222 0.005306
222221222222222122 0.031351
222221222222221122 0.103530
222221222222221121 0.008754
222221222222211122 0.022704
.
.
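A list in the layout shown above could be produced as follows. This is a minimal sketch: the function name is hypothetical, and the frequencies are obtained by simply counting the estimated haplotypes, which is an assumption; a method with its own model-based frequency estimates would write those instead.

```python
from collections import Counter
from typing import List


def haplotype_frequency_lines(haplotypes: List[str]) -> List[str]:
    """Return 'haplotype frequency' lines from a list of estimated haplotypes."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    # One line per distinct haplotype, frequency printed to six decimal places.
    return [f"{hap} {n / total:.6f}" for hap, n in sorted(counts.items())]
```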

Please also provide details of the running time of your algorithm and the computer architecture on which the program was run for comparison with other methods.
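The naming convention in (e) and the layout rules in (a) to (c) can be sketched in Python as below. The helper names `answer_path` and `format_answer` are hypothetical, and the haplotype strings are written exactly as supplied, so the caller is assumed to have kept the allele coding of the input files.

```python
import re
from typing import List, Tuple


def answer_path(input_path: str) -> str:
    """Map an input genotype file name to the suggested answer file name."""
    directory, _, name = input_path.rpartition("/")
    match = re.fullmatch(r"[pc]?genos\.haps\.(.+)", name)
    if match is None:
        raise ValueError(f"not a genotype file name: {input_path}")
    result = f"genos.haps.res.{match.group(1)}"
    return f"{directory}/{result}" if directory else result


def format_answer(haplotype_pairs: List[Tuple[str, str]]) -> str:
    """One line per haplotype, pairs in input order, no individual IDs."""
    lines = []
    for hap_a, hap_b in haplotype_pairs:
        lines.extend([hap_a, hap_b])
    return "\n".join(lines) + "\n"
```

With these helpers, `answer_path("RT-CEU/cgenos.haps.1.5")` gives `RT-CEU/genos.haps.res.1.5`, matching the third example in the table.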


Performance Results

So far the following methods have been applied to the benchmark datasets.

HAP
Lin S, Cutler DJ, Zwick ME, Chakravarti A (2002) Haplotype inference in random population samples. Am J Hum Genet 71:1129-1137
HAP2
Eskin E, Halperin E, Karp R (2003) Efficient reconstruction of haplotype structure via perfect phylogeny. J Bioinform Comput Biol 1:1-20
PLEM
Qin ZS, Niu T, Liu JS (2002) Partition-ligation expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet 71:1242-1247
wphase
Unpublished algorithm by Nick Patterson, Broad Institute of MIT and Harvard (see Marchini et al. (2006) for more details).
PHASE v2.1
Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73:1162-1169
fastPHASE
As yet unpublished algorithm written by Paul Scheet and Matthew Stephens.


The following error measures were applied to each dataset

Switch
Switch error is the percentage of possible switches in haplotype orientation that are needed to recover the correct phase in an individual or trio (Lin et al. 2004).
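One common way to compute a switch error for unrelated individuals is sketched below. It is an illustration, not the evaluation code used for the tables: the function names are hypothetical, haplotypes are strings of allele codes with no missing data, and each estimated pair is assumed to be consistent with the true genotype (the two estimated haplotypes differ exactly at the truly heterozygous sites).

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def switch_counts(true_pair: Pair, est_pair: Pair) -> Tuple[int, int]:
    """Return (switches needed, possible switches) for one individual."""
    t1, t2 = true_pair
    e1, _ = est_pair
    het_sites = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # Orientation at a heterozygous site: does estimated haplotype 1 follow true haplotype 1?
    orientation = [e1[i] == t1[i] for i in het_sites]
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches, max(len(het_sites) - 1, 0)


def switch_error_percent(true_pairs: List[Pair], est_pairs: List[Pair]) -> float:
    """Switches needed as a percentage of possible switches, pooled over individuals."""
    switches = possible = 0
    for true_pair, est_pair in zip(true_pairs, est_pairs):
        s, p = switch_counts(true_pair, est_pair)
        switches += s
        possible += p
    return 100.0 * switches / possible if possible else 0.0
```

An estimate that is the true pair in the opposite order needs no switches, because only changes in orientation between consecutive heterozygous sites are counted.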
IHP
Incorrect Haplotype Percentage (IHP) is the percentage of ambiguous individuals whose haplotype estimates are not completely correct (Stephens et al. 2001). It is worth noting that, as the length of the considered region increases, all methods will find it harder to correctly infer entire haplotypes. Thus, this measure will increase with genetic distance and eventually reach 100%, once the region becomes long enough.
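A minimal sketch of the IHP calculation, under stated assumptions: the function name is hypothetical, there is no missing data, and an individual is treated as ambiguous when it has at least two heterozygous sites. Individuals with missing genotypes would also be ambiguous, but they are not handled here.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def incorrect_haplotype_percentage(true_pairs: List[Pair], est_pairs: List[Pair]) -> float:
    """Percentage of ambiguous individuals whose haplotype pair is not exactly right."""
    ambiguous = incorrect = 0
    for (t1, t2), (e1, e2) in zip(true_pairs, est_pairs):
        n_het = sum(a != b for a, b in zip(t1, t2))
        if n_het < 2:
            continue  # phase is determined by the genotype alone
        ambiguous += 1
        if {t1, t2} != {e1, e2}:  # the order of the two haplotypes is irrelevant
            incorrect += 1
    return 100.0 * incorrect / ambiguous if ambiguous else 0.0
```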
IGP
Incorrect Genotype Percentage (IGP). We counted the number of genotypes (ambiguous heterozygotes and missing genotypes) that had their phase incorrectly inferred and expressed them as a percentage of the total number of genotypes. To calculate this measure, we first aligned the estimated haplotypes with the true haplotypes, to minimize the number of sites at which there were phase differences. For the trio data, this alignment is fixed by the known transmission status of alleles at nonambiguous sites. For the real data sets in which the truth for the missing data was not known, we removed such sites from consideration in both the numerator and the denominator. We believe the utility of this measure lies in its comparison with levels of genotyping error and missing data.
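The alignment step for unrelated individuals can be sketched as below: both orderings of the estimated pair are tried and the one with fewer differing sites is kept. This is an illustration only. The function name is hypothetical, the truth is assumed to be known at every site, and the trio case, where the alignment is fixed by transmission status, is not covered.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def incorrect_genotype_percentage(true_pairs: List[Pair], est_pairs: List[Pair]) -> float:
    """Sites with incorrectly inferred phase as a percentage of all genotypes."""
    wrong = total = 0
    for (t1, t2), (e1, e2) in zip(true_pairs, est_pairs):
        # Count differing sites under each ordering of the estimated pair.
        direct = sum((a, b) != (c, d) for a, b, c, d in zip(t1, t2, e1, e2))
        swapped = sum((a, b) != (d, c) for a, b, c, d in zip(t1, t2, e1, e2))
        wrong += min(direct, swapped)  # alignment that minimizes the differences
        total += len(t1)
    return 100.0 * wrong / total if total else 0.0
```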
Missing Error
Missing error is the percentage of incorrectly inferred missing data. To calculate this measure, we first aligned the estimated haplotypes with the true haplotypes, to minimize the number of sites at which there were phase differences. This alignment ignored the sites at which there was missing data. We then compared the estimated and true haplotypes at the sites of missing data, counted the number of incorrectly imputed alleles, and expressed this as a percentage of the total number of missing data.
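The two steps described above (align on the observed sites, then compare at the missing sites) might look like this in Python. It is a sketch under assumptions: the function name is hypothetical, the missing sites of each individual are given as a set of site indices, and the denominator counts two alleles per missing genotype, which is one reading of "the total number of missing data".

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def missing_error_percentage(
    true_pairs: List[Pair], est_pairs: List[Pair], missing_sites: List[Set[int]]
) -> float:
    """Incorrectly imputed alleles as a percentage of all missing alleles."""
    wrong = total = 0
    for (t1, t2), (e1, e2), missing in zip(true_pairs, est_pairs, missing_sites):
        observed = [i for i in range(len(t1)) if i not in missing]
        # Align using observed sites only, keeping the ordering with fewer differences.
        direct = sum(t1[i] != e1[i] for i in observed)
        swapped = sum(t1[i] != e2[i] for i in observed)
        if swapped < direct:
            e1, e2 = e2, e1
        for i in missing:
            wrong += (t1[i] != e1[i]) + (t2[i] != e2[i])
            total += 2  # assumption: two alleles per missing genotype
    return 100.0 * wrong / total if total else 0.0
```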


The following tables detail the results of the above algorithms for each of the datasets. In each table the methods are ordered by increasing Switch error. A few of these results differ slightly from those given in the above paper due to a small bug in the original calculations.

SU1          Switch   IHP     IGP
PHASE v2.1   2.41     35.46   2.54
wphase       3.65     48.02   3.50
fastPHASE    4.47     65.25   5.65
HAP          6.53     88.62   7.86
HAP2         6.92     73.50   7.11
PLEM         8.98     61.13   5.81


SU2          Switch   IHP     IGP
PHASE v2.1   2.21     40.39   2.45
wphase       3.66     55.51   4.29
fastPHASE    6.92     88.68   7.84
HAP          9.75     97.15   9.47
PLEM         13.18    83.42   9.47
HAP2         15.14    99.00   11.03


SU3          Switch   IHP     IGP
PHASE v2.1   4.79     59.08   5.08
fastPHASE    5.64     76.22   7.00
wphase       6.62     66.42   5.79
HAP          7.13     90.06   8.47
HAP2         8.21     85.10   8.59
PLEM         11.02    81.42   8.21


SU4          Switch   IHP     IGP     Missing Error
PHASE v2.1   5.04     60.80   5.24    7.29
fastPHASE    5.75     75.98   6.75    9.06
wphase       6.60     67.97   5.98    10.14
HAP          7.44     90.64   8.39    11.59
HAP2         8.73     87.06   8.65    15.02
PLEM         10.86    81.49   8.03    19.36


SU-100kb     Switch   IHP     IGP
PHASE v2.1   4.16     17.17   1.53
wphase       5.08     19.44   1.77
HAP          5.39     21.83   1.94
HAP2         5.44     22.22   2.05
PLEM         7.92     24.69   2.34


ST1          Switch   IHP     IGP
PHASE v2.1   0.74     5.55    0.06
wphase       0.98     6.52    0.08
HAP          2.14     12.79   0.17
HAP2         2.58     17.16   0.23
PLEM         3.03     18.59   0.24


ST2          Switch   IHP     IGP
PHASE v2.1   0.22     1.89    0.02
wphase       0.22     1.89    0.02
HAP          1.52     11.44   0.11
PLEM         2.88     21.15   0.20
HAP2         5.97     36.23   0.43


ST3          Switch   IHP     IGP
PHASE v2.1   1.36     10.36   0.13
wphase       2.23     14.21   0.20
HAP          2.40     17.01   0.21
HAP2         2.95     20.76   0.27
PLEM         3.81     24.80   0.33


ST4          Switch   IHP     IGP     Missing Error
PHASE v2.1   1.48     10.30   0.13    1.46
wphase       2.34     14.67   0.20    1.89
HAP          2.63     17.83   0.22    4.36
HAP2         3.17     21.31   0.30    5.26
PLEM         4.12     25.06   0.36    3.38


RU           Switch   IHP     IGP
PHASE v2.1   8.41     77.66   2.69
fastPHASE    9.21     83.57   3.02
HAP          10.72    87.96   3.26
HAP2         12.56    87.67   3.39


RT-CEU       Switch   IHP     IGP
PHASE v2.1   0.53     6.20    0.05
HAP2         2.05     20.42   0.33
HAP          2.95     20.78   0.40


RT-YRI       Switch   IHP     IGP
PHASE v2.1   2.16     15.7    0.16
HAP          4.44     29.25   0.33