HAPGEN


HAPGEN is a program that simulates case-control datasets at SNP markers and can output data in the file format used by IMPUTE, SNPTEST and GTOOL. The approach can handle markers in LD and can simulate datasets over large regions such as whole chromosomes. HAPGEN simulates haplotypes by conditioning on a set of population haplotypes and an estimate of the fine-scale recombination rate across the region. The disease model is specified through the choice of a single SNP as the disease-causing variant, together with the relative risks of the genotypes at the disease SNP. The program is designed to work with publicly available files that contain the haplotypes estimated as part of the HapMap project and the estimated fine-scale recombination map derived from that data.

HAPGEN is computationally tractable: on a modern desktop it can simulate several thousand cases and controls on a whole chromosome at HapMap Phase 2 marker density within minutes. The program has been used to assess the power of several commercially available genotyping chips [1], in the design stage of the 7 genome-wide association studies carried out by the Wellcome Trust Case-Control Consortium (WTCCC) [3], and to evaluate the power of different methods for detecting association in genome-wide studies [2].

Contributors

The following people have contributed to the development of the methodology and software for HAPGEN.

Zhan Su, Jonathan Marchini, Peter Donnelly

Download


Pre-compiled versions of the program and example files can be downloaded from the links below. We've supplied both static and dynamic versions of the Linux executables. If you intend to run HAPGEN on a machine running an old kernel then you probably want to use the dynamic version. If you have any problems getting the program to work on your machine please contact me.

Platform                                      File
Linux (x86_64) Static Executable              hapgen_v1.3.0_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)   hapgen_v1.3.0_SuSE9.3_x86_64_static.tgz
Linux (x86_64) Dynamic Executable             hapgen_v1.3.0_x86_64_dynamic.tgz
Linux (i386) Static Executable                hapgen_v1.3.0_i386_static.tgz
Linux (i386) Dynamic Executable               hapgen_v1.3.0_i386_dynamic.tgz
Mac OS X 10.4.11 Tiger (Intel)                hapgen_v1.3.0_MacOSX_10.4_Intel.tgz
Mac OS X 10.5.1 Leopard (Intel)               hapgen_v1.3.0_MacOSX_10.5_Intel.tgz
Mac OS X (PowerPC)                            hapgen_v1.3.0_MacOSX_PowerPC.tgz
Solaris 5.8 (Sun SPARC)                       hapgen_v1.3.0_Solaris5.8_SPARC.tgz
Solaris 5.10 (AMD Opteron)                    hapgen_v1.3.0_Solaris5.10_Opteron.tgz
SLES 10 (Intel Itanium2)                      hapgen_v1.3.0_SLES10_Itanium2.tgz
Windows MS-DOS (Intel)                        hapgen_v1.3.0_Windows_Intel.tgz

Please fill out the registration form to receive emails about updates to this software.

To unpack the files use a command like

tar zxvf hapgen_vX.X.X_i386.tgz

This will create an executable called hapgen and a directory called example that contains the example files.

Version History

1.0.5  07-06-2007  First version made available.
1.2.0  26-07-2007
  • Addition of -hap and -gen flags that specify whether haplotype or genotype data should be output.
  • Addition of -snptest flag to output genotype data in the file format used for SNPTEST and IMPUTE.
  • File name convention for output files has changed (see below).
1.2.1  22-10-2007  Added LICENCE.
1.3.0  17-01-2008
  • Addition of -int option to specify region of simulation.

Running HAPGEN

HAPGEN is a command-line program. To illustrate its use we have included an example dataset in the directory example.

To run the program on the example file use

./hapgen -h example/ex.haps -l example/ex.leg -r example/ex.map -o sim -n 2 2 -gen -rr 2.0 4.0 -dl 14439734

This will produce files sim.all.g, sim.y and sim.aux that contain the results of the simulation. See below for a description of the options, input file formats and output file formats.

NOTE: HAPGEN seeds its random number generator using the time of day to the nearest second. Be aware of this when running multiple simulations, because runs that are started very close together in time will produce identical results.
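
One way to work around the seeding behaviour described in the note is to launch repeated simulations with a pause of more than one second between them. The sketch below is a hypothetical wrapper, not part of HAPGEN: the function names, the 2-second default delay and the output prefixes are assumptions, and the command line only reuses the flags from the example command above.

```python
import subprocess
import time


def build_command(prefix, n_controls, n_cases, het_rr, hom_rr, disease_pos):
    """Build a HAPGEN command line that mirrors the documented example run."""
    return [
        "./hapgen",
        "-h", "example/ex.haps",
        "-l", "example/ex.leg",
        "-r", "example/ex.map",
        "-o", prefix,
        "-n", str(n_controls), str(n_cases),
        "-gen",
        "-rr", str(het_rr), str(hom_rr),
        "-dl", str(disease_pos),
    ]


def run_staggered(commands, delay_seconds=2.0, runner=subprocess.run, sleep=time.sleep):
    """Run the commands one after another, pausing before every run but the first.

    The pause is an assumption-based safeguard: it makes consecutive runs
    start in different clock seconds so that they get different seeds.
    """
    results = []
    for index, command in enumerate(commands):
        if index > 0:
            sleep(delay_seconds)
        results.append(runner(command))
    return results
```

For example, run_staggered([build_command(f"sim{i}", 2, 2, 2.0, 4.0, 14439734) for i in range(10)]) would write ten replicate datasets with prefixes sim0 to sim9. Runs launched in parallel on several machines would need the same kind of offset between their start times.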

Options

Flags
Required/Optional
Description
-h <file>
Required
A file containing a set of known haplotypes. Each line of this file should specify one haplotype. Each haplotype should be a sequence of 1's and 0's that correspond to the alleles at each locus of the haplotype. The length of each haplotype must be the same. For example, the example input file (ex.haps) contains 5 haplotypes at 10 SNPs. See the following section for links to the relevant HapMap files.
-l <file>
Required
A legend file for the SNP markers. This file should have 4 columns with one line for each SNP. The columns should contain an ID for each SNP (e.g. the rs ID of the marker), the base pair position of each SNP, the base represented by 0 and the base represented by 1. The first line of the legend file contains column labels (these are not used by the program but the file is required to contain a header line). See the example file ex.leg. See the following section for links to the relevant HapMap files.
-r <file>
Optional
The program can also take a file containing the fine-scale recombination rate across the region. This file should have 3 columns with one line for each SNP. The columns should contain the physical location, the rate in cM/Mb to the right of the marker and the cumulative rate in cM to the left of the marker. A header line containing the column labels is required. See the example file ex.map. See the following section for links to the relevant HapMap files. If no recombination file is specified and the option -rho (see below) is not used then the recombination rate between all loci is set to 0.
-n <int> <int>
Recommended
Sets the number of control and the number of case individuals to simulate. For example -n 100 100 simulates 100 control and 100 case individuals. The default is to generate 1 control and 1 case individual.
-gen
Optional
Output files that contain genotypes for each individual. The files will have the suffix .g. The genotypes will be given as one line per individual and coded 0, 1 or 2. The genotypes will be given at the set of SNPs specified by the user (see -t option below).
-snptest
Optional
Output files in the file format used by IMPUTE, SNPTEST and GTOOL. Separate genotype and sample files are written for cases and controls. The files will have the suffixes .gen and .sample. The genotypes will be given at the set of SNPs specified by the user (see -t option below).
-hap
Optional
Output files that contain pairs of haplotypes for each individual. The files will have the suffix .h. There will be one line for each haplotype (pairs of consecutive lines contain the two haplotypes of each individual). The haplotypes will be given as a sequence of 1's and 0's. The haplotypes will be given at the set of SNPs specified by the user (see -t option below).
-int <int> <int>
Optional
Specify the lower and upper boundaries of the region in which you wish to carry out simulation.
-o <file>
Required
Output file prefix. For example, -o foo creates files foo.*. The files that are produced depend upon the combination of the flags -gen, -hap, -snptest, -all and -t. Output files not described elsewhere include

foo.y - File containing the phenotype of each individual. The file has one line for each individual, and 1 and 0 are used to denote cases and controls respectively. This file is only produced when the -gen or -hap flag is specified.

foo.aux - File containing summaries of some of the parameters used in the simulation. The file has the following format:

disease marker index
physical position
minor allele at disease locus
minor allele frequency in haplotype set
heterozygous relative risk
homozygous relative risk
-t <file>
Optional
SNP subset file. This option allows the user to output data at only a subset of the SNP markers in the simulated dataset, e.g. at a set of tag SNPs. The file should contain the physical locations of the markers to include in the output, with one line per SNP. The physical locations must match those in the legend file. If this option is selected then a .tags output file will be produced that contains the positions of the SNPs in the output file.
-all
Optional
Only relevant if the -t option is used. In addition to the output files at the set of SNPs selected by the -t option a set of files at all SNPs will be produced. The additional files will include .all. in their names. If the -t option is not used then the -all option is automatically turned on, thus producing files that contain data at all SNPs.
-dl <int>
Optional
Sets the disease locus location. For example, -dl 1000 sets the disease locus to be the marker with physical location 1000. The location must be one of the loci in the legend file. If none is given, a random locus is chosen that satisfies the minor allele frequency (MAF) range specified by the -freq option below.
-freq <real> <real>
Optional
Sets min and max MAF for a disease locus. Only relevant if no valid disease locus is specified using the -dl option. For example, -freq 0.1 0.3 sets the minimum and maximum MAF at the disease locus to be 0.1 and 0.3. The default values are 0.05 and 0.5.
-rr <real> <real>
Optional
Sets the heterozygous and homozygous relative risks. For example, -rr 1.5 2.25 sets the heterozygous relative risk to 1.5 and the homozygous relative risk to 2.25. The default relative risks for heterozygotes and homozygotes are both 1.0.
-flip
Optional
Specifies that the major allele will be the disease allele. The minor allele is the disease allele by default.
-Ne <int>
Required if -r is used
Sets the effective population size that scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we highly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese and Japanese populations.
-rho <real>
Optional
Sets a constant recombination rate between all loci. Only relevant if no recombination rate file is given. For example, -rho 10 sets the recombination rate between all loci to 10 cM/Mb and this will then be scaled by the user-supplied value of Ne. If -r and -rho are not used then the recombination rate is set to 0.
-theta <real>
Optional
Sets the mutation rate in the model. For example, -theta 10 sets the scaled mutation rate to 10. By default the mutation rate is set so that the expected number of mutations at a given SNP is equal to 1.
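
To make the -rr option concrete, the following sketch computes the expected genotype proportions among cases at the disease SNP. It is a textbook calculation that assumes Hardy-Weinberg proportions in the population and relative risks measured against the homozygote that carries no copies of the disease allele. It is not taken from the HAPGEN source and may differ from how the program samples cases internally. The function name is hypothetical.

```python
def case_genotype_probs(disease_allele_freq, het_rr, hom_rr):
    """Return P(genotype | case) for 0, 1 and 2 copies of the disease allele.

    Assumes Hardy-Weinberg genotype proportions in the population; each
    genotype is then reweighted by its relative risk and renormalised.
    """
    p = disease_allele_freq
    q = 1.0 - p
    population = [q * q, 2.0 * p * q, p * p]   # 0, 1, 2 copies of the disease allele
    relative_risk = [1.0, het_rr, hom_rr]      # baseline genotype has risk 1
    weights = [f * r for f, r in zip(population, relative_risk)]
    total = sum(weights)
    return [w / total for w in weights]
```

With the default relative risks (1.0 and 1.0) the function returns the population proportions unchanged, which corresponds to simulating under the null model of no association. Values such as -rr 1.5 2.25 shift the case distribution towards genotypes that carry the disease allele.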

HapMap Data Files

HAPGEN requires a set of haplotypes (-h), an associated legend file (-l) and a recombination rate map across the region (-r). We recommend using the following HapMap files, which are in the correct format.


HapMap rel#22 - NCBI Build 36 (dbSNP b126)
  Haplotype and Legend files: http://www.hapmap.org/downloads/phasing/2007-08_rel22/phased/
  Recombination rate files: https://mathgen.stats.ox.ac.uk/wtccc-software/recombination_rates/

HapMap rel#21 - NCBI Build 35 (dbSNP b125)
  Haplotype and Legend files: http://www.hapmap.org/downloads/phasing/2006-07_phaseII/
  Recombination rate files: https://mathgen.stats.ox.ac.uk/wtccc-software/recombination_rates/
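
Before launching a long simulation it can help to confirm that a haplotype file and its legend file agree with each other. The sketch below is not part of HAPGEN. It checks the properties stated in the Options section (alleles coded 0 or 1, haplotypes of equal length, a 4-column legend with a header line) plus one inferred property, namely that the legend has one row per SNP in the haplotypes. The function names are hypothetical, and the whitespace handling is an assumption: alleles are treated as single characters that may or may not be separated by spaces.

```python
def check_haps(lines):
    """Return (n_haplotypes, n_snps); raise ValueError on malformed input."""
    n_snps = None
    n_haps = 0
    for number, line in enumerate(lines, start=1):
        # Assumption: one character per allele, optional whitespace between them.
        alleles = "".join(line.split())
        if not alleles:
            continue  # ignore blank lines
        if set(alleles) - {"0", "1"}:
            raise ValueError(f"haplotype line {number}: alleles must be 0 or 1")
        if n_snps is None:
            n_snps = len(alleles)
        elif len(alleles) != n_snps:
            raise ValueError(f"haplotype line {number}: expected {n_snps} alleles")
        n_haps += 1
    if n_snps is None:
        raise ValueError("no haplotypes found")
    return n_haps, n_snps


def check_legend(lines):
    """Return the base pair positions from a legend with a header line."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("legend file is empty")
    positions = []
    for number, fields in enumerate(rows[1:], start=2):  # rows[0] is the header
        if len(fields) != 4:
            raise ValueError(f"legend row {number}: expected 4 columns")
        positions.append(int(fields[1]))  # column 2 holds the position
    return positions


def check_inputs(haps_lines, legend_lines):
    """Check both files together and return (n_haplotypes, n_snps)."""
    n_haps, n_snps = check_haps(haps_lines)
    positions = check_legend(legend_lines)
    if len(positions) != n_snps:
        raise ValueError(f"legend has {len(positions)} SNPs, haplotypes have {n_snps}")
    return n_haps, n_snps
```

Both functions accept any iterable of text lines, so open file objects can be passed directly, for example check_inputs(open("example/ex.haps"), open("example/ex.leg")).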


References

[1] C. C. A. Spencer, Z. Su, P. Donnelly and J. Marchini (2009) Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip. PLoS Genet 5(5).
[2] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913.
[3] The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-78. PMID: 17554300. DOI: 10.1038/nature05911

Contact Information

If you have any questions regarding the use of this program please send an email to Dr Zhan Su (zhan <at> well <dot> ox <dot> ac <dot> uk)