HAPGEN is a program that simulates case-control datasets at SNP markers and
can output data in the FILE FORMAT used by IMPUTE, SNPTEST and GTOOL. The
approach can handle markers in LD and can simulate datasets over large regions
such as whole chromosomes. HAPGEN simulates haplotypes by conditioning on a set
of population haplotypes and an estimate of the fine-scale recombination rate
across the region. The disease model is specified by choosing a single SNP as
the disease-causing variant, together with the relative risks of the genotypes
at that SNP. The program is designed to work with publicly available files that
contain the haplotypes estimated as part of the HapMap project and the
estimated fine-scale recombination map derived from that data. HAPGEN is
computationally tractable: on a modern desktop it can simulate several thousand
cases and controls on a whole chromosome at HapMap Phase 2 marker density
within minutes. The program has been used to assess the power of several
commercially available genotyping chips [1], in the design stage of the 7
genome-wide association studies carried out by the Wellcome Trust Case-Control
Consortium (WTCCC) [3], and to evaluate the power of different methods for
detecting association in genome-wide studies [2].
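To make the disease model concrete, the sketch below shows how a pair of relative risks changes the genotype frequencies at the disease SNP in cases and controls. It is an illustration, not HAPGEN's own sampling code. It assumes Hardy-Weinberg proportions in the population, and `baseline_risk` (the disease risk for the non-risk homozygote) is a hypothetical parameter introduced only for this example.

```python
def genotype_freqs(risk_allele_freq, rr_het, rr_hom, baseline_risk=0.01):
    """Return (case_freqs, control_freqs) for genotypes with 0, 1, 2 risk alleles."""
    p = risk_allele_freq
    # Population genotype frequencies under Hardy-Weinberg equilibrium (assumption).
    population = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # Penetrance of each genotype: baseline risk scaled by its relative risk.
    # baseline_risk is a hypothetical illustration parameter, not a HAPGEN flag.
    penetrance = [baseline_risk, baseline_risk * rr_het, baseline_risk * rr_hom]
    if max(penetrance) > 1:
        raise ValueError("baseline_risk times relative risk must not exceed 1")
    # Bayes' rule: P(genotype | case) is proportional to P(genotype) * P(case | genotype).
    case_weights = [g * f for g, f in zip(population, penetrance)]
    control_weights = [g * (1 - f) for g, f in zip(population, penetrance)]
    cases = [w / sum(case_weights) for w in case_weights]
    controls = [w / sum(control_weights) for w in control_weights]
    return cases, controls


# Risk allele frequency 0.2 with heterozygous and homozygous relative risks 1.5 and 2.25.
cases, controls = genotype_freqs(0.2, 1.5, 2.25)
print("cases:   ", [round(x, 4) for x in cases])
print("controls:", [round(x, 4) for x in controls])
```

With both relative risks set to 1.0 the case and control frequencies coincide, which corresponds to simulating under the null.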
| Platform | File |
| Linux (x86_64) Static Executable | hapgen_v1.3.0_x86_64_static.tgz |
| Linux (x86_64) Static Executable (SuSE 9.3) | hapgen_v1.3.0_SuSE9.3_x86_64_static.tgz |
| Linux (x86_64) Dynamic Executable | hapgen_v1.3.0_x86_64_dynamic.tgz |
| Linux (i386) Static Executable | hapgen_v1.3.0_i386_static.tgz |
| Linux (i386) Dynamic Executable | hapgen_v1.3.0_i386_dynamic.tgz |
| Mac OS X 10.4.11 Tiger (Intel) | hapgen_v1.3.0_MacOSX_10.4_Intel.tgz |
| Mac OS X 10.5.1 Leopard (Intel) | hapgen_v1.3.0_MacOSX_10.5_Intel.tgz |
| Mac OS X (PowerPC) | hapgen_v1.3.0_MacOSX_PowerPC.tgz |
| Solaris 5.8 (Sun SPARC) | hapgen_v1.3.0_Solaris5.8_SPARC.tgz |
| Solaris 5.10 (AMD Opteron) | hapgen_v1.3.0_Solaris5.10_Opteron.tgz |
| SLES 10 (Intel Itanium2) | hapgen_v1.3.0_SLES10_Itanium2.tgz |
| Windows MS-DOS (Intel) | hapgen_v1.3.0_Windows_Intel.tgz |
| tar zxvf hapgen_vX.X.X_i386.tgz |
| 1.0.5 | 07-06-2007 | First version made available |
| 1.2.0 | 26-07-2007 | |
| 1.2.1 | 22-10-2007 | Added LICENCE |
| 1.3.0 | 17-01-2008 | |
| ./hapgen -h example/ex.haps -l example/ex.leg -r example/ex.map -o sim -n 2 2 -gen -rr 2.0 4.0 -dl 14439734 |
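Before running a command like the one above, it can help to check that the input files follow the layouts described in the flags table. The following is a minimal sketch of such a check, not part of HAPGEN. The function name `check_inputs` is hypothetical, and the code assumes that any separators between alleles on a haplotype line are whitespace.

```python
def check_inputs(haps_lines, legend_lines, map_lines=None):
    """Validate haplotype, legend and (optional) recombination map lines.

    Returns (number of haplotypes, number of SNPs) or raises ValueError.
    """
    # Haplotypes: one per line, alleles coded 0/1. Whitespace between
    # alleles (if present) is stripped; that separator is an assumption.
    haps = ["".join(line.split()) for line in haps_lines if line.strip()]
    if not haps:
        raise ValueError("haplotype file is empty")
    n_snps = len(haps[0])
    for i, hap in enumerate(haps, start=1):
        if set(hap) - {"0", "1"}:
            raise ValueError(f"haplotype {i} contains characters other than 0/1")
        if len(hap) != n_snps:
            raise ValueError(f"haplotype {i} has {len(hap)} alleles, expected {n_snps}")

    # Legend: header line, then 4 columns per SNP (ID, position, allele 0, allele 1).
    legend = [line.split() for line in legend_lines if line.strip()][1:]
    if len(legend) != n_snps:
        raise ValueError(f"legend has {len(legend)} SNPs, haplotypes have {n_snps}")
    for row in legend:
        if len(row) != 4:
            raise ValueError(f"legend row does not have 4 columns: {row}")
        int(row[1])  # the base pair position must be an integer

    # Recombination map: header line, then 3 numeric columns per line.
    if map_lines is not None:
        for row in [line.split() for line in map_lines if line.strip()][1:]:
            if len(row) != 3:
                raise ValueError(f"map row does not have 3 columns: {row}")
            [float(value) for value in row]

    return len(haps), n_snps


# Tiny in-memory example: 2 haplotypes at 3 SNPs (made-up data).
haps = ["0 1 0", "1 1 0"]
legend = ["rs position a0 a1", "rs1 100 A G", "rs2 200 C T", "rs3 300 G A"]
rec_map = ["position rate_cM_per_Mb cumulative_cM", "100 0.5 0.0", "200 0.7 0.00005", "300 0.2 0.00012"]
print(check_inputs(haps, legend, rec_map))
```

Open file objects can be passed in place of the lists, for example `check_inputs(open("example/ex.haps"), open("example/ex.leg"), open("example/ex.map"))`.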
| Flags | Required/Optional | Description |
| -h <file> | Required | A file containing a set of known haplotypes. Each line of this file should specify one haplotype, given as a sequence of 1's and 0's that correspond to the alleles at each locus. All haplotypes must have the same length. For example, the example input file (ex.haps) contains 5 haplotypes at 10 SNPs. See the following section for links to the relevant HapMap files. |
| -l <file> | Required | A legend file for the SNP markers. This file should have 4 columns with one line for each SNP. The columns should contain an ID for each SNP (e.g. the rs ID of the marker), the base pair position of the SNP, the base represented by 0 and the base represented by 1. The first line of the legend file contains column labels (these are not used by the program, but the file is required to contain a header line). See the example file ex.leg. See the following section for links to the relevant HapMap files. |
| -r <file> | Optional | A file containing the fine-scale recombination rate across the region. This file should have 3 columns with one line for each SNP. The columns should contain the physical location, the rate in cM/Mb to the right of the marker and the cumulative rate in cM to the left of the marker. A header line containing the column labels is required. See the example file ex.map. See the following section for links to the relevant HapMap files. If no recombination file is specified and the option -rho (see below) is not used, then the recombination rate between all loci is set to 0. |
| -n <int> <int> | Recommended | Sets the number of control and the number of case individuals to simulate. For example, -n 100 100 simulates 100 control and 100 case individuals. The default is to generate 1 control and 1 case individual. |
| -gen | Optional | Output files that contain genotypes for each individual. The files will have the suffix .g. The genotypes will be given as one line per individual and coded 0, 1 or 2. The genotypes will be given at the set of SNPs specified by the user (see -t option below). |
| -snptest | Optional | Output files in the FILE FORMAT used by IMPUTE, SNPTEST and GTOOL. Separate genotype and sample files are written for cases and controls. The files will have the suffixes .gen and .sample. The genotypes will be given at the set of SNPs specified by the user (see -t option below). |
| -hap | Optional | Output files that contain pairs of haplotypes for each individual. The files will have the suffix .h. There will be one line for each haplotype (pairs of consecutive lines contain the two haplotypes of each individual). The haplotypes will be given as a sequence of 1's and 0's at the set of SNPs specified by the user (see -t option below). |
| -int <int> <int> | Optional | Specify the lower and upper boundaries of the region in which you wish to carry out simulation. |
| -o <file> | Required | Output file prefix. For example, -o foo creates files foo.*. The files that are produced depend upon the combination of the flags -gen, -hap, -snptest, -all and -t. Output files not described elsewhere include: foo.y, a file containing the phenotype of each individual, with one line for each individual, where 1 and 0 denote cases and controls respectively (this file is only produced when the -gen and -hap flags are specified); and foo.aux, a file containing summaries of some of the parameters used in the simulation, in the following format: disease marker index, physical position, minor allele at disease locus, minor allele frequency in haplotype set, heterozygous relative risk, homozygous relative risk. |
| -t <file> | Optional | SNP subset file. This option allows the user to output data at only a subset of the SNP markers in the simulated dataset, e.g. at a set of tag SNPs. The file should contain the physical locations of the markers to be included in the output, with one line per SNP. The physical locations must match those in the legend file. If this option is selected then a .tags output file will be produced that contains the positions of the SNPs in the output file. |
| -all | Optional | Only relevant if the -t option is used. In addition to the output files at the set of SNPs selected by the -t option, a set of files at all SNPs will be produced. The additional files will include .all. in their names. If the -t option is not used then the -all option is automatically turned on, thus producing files that contain data at all SNPs. |
| -dl <int> | Optional | Sets the disease locus location. For example, -dl 1000 sets the disease locus to be the marker with physical location 1000. The location must be one of the loci in the legend file. If none is given, a random locus is chosen that satisfies the minor allele frequency (MAF) range specified by the -freq option below. |
| -freq <real> <real> | Optional | Sets the minimum and maximum MAF for a disease locus. Only relevant if no valid disease locus is specified using the -dl option. For example, -freq 0.1 0.3 sets the minimum and maximum MAF at the disease locus to be 0.1 and 0.3. The default values are 0.05 and 0.5. |
| -rr <real> <real> | Optional | Sets the heterozygous and homozygous relative risks. For example, -rr 1.5 2.25 sets the heterozygous relative risk to 1.5 and the homozygous relative risk to 2.25. The default relative risks for heterozygotes and homozygotes are both 1.0. |
| -flip | Optional | Specifies that the major allele will be the disease allele. The minor allele is the disease allele by default. |
| -Ne <int> | Required if -r is used | Sets effective population size that scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we highly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese Japanese populations. |
| -rho <real> | Optional | Sets a constant recombination rate between all loci. Only relevant if no recombination rate file is given. For example, -rho 10 sets the recombination rate between all loci to 10 cM/Mb, and this will then be scaled by the user-supplied value of Ne. If -r and -rho are not used then the recombination rate is set to 0. |
| -theta <real> | Optional | Sets the mutation rate in the model. For example, -theta 10 sets the scaled mutation rate to 10. By default, the mutation rate is set so that the expected number of mutations at a given SNP is equal to 1. |
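As an illustration of how the -gen output and the .y phenotype file fit together, the sketch below computes per-SNP allele frequencies separately for cases and controls. It is a rough example under stated assumptions: genotype codes 0, 1 and 2 are taken to count copies of the allele coded 1, individuals are assumed to appear in the same order in both files, and the function name `allele_freq_by_status` is hypothetical.

```python
def allele_freq_by_status(gen_lines, pheno_lines):
    """Return {status: [frequency of allele 1 at each SNP]} for status 0 and 1."""
    # One line per individual; each genotype is a single digit 0, 1 or 2.
    # Whitespace between genotypes (if present) is removed (assumption).
    genotypes = [[int(c) for c in "".join(line.split())]
                 for line in gen_lines if line.strip()]
    # One line per individual: 1 denotes a case, 0 a control.
    status = [int(line.split()[0]) for line in pheno_lines if line.strip()]
    if len(genotypes) != len(status):
        raise ValueError("genotype and phenotype files differ in length")

    n_snps = len(genotypes[0])
    totals = {0: [0] * n_snps, 1: [0] * n_snps}
    counts = {0: 0, 1: 0}
    for row, y in zip(genotypes, status):
        counts[y] += 1
        for j, g in enumerate(row):
            totals[y][j] += g
    # Each individual carries two alleles per SNP, hence the factor of 2.
    return {y: [t / (2 * counts[y]) for t in totals[y]]
            for y in (0, 1) if counts[y] > 0}


# Made-up data: 2 cases followed by 2 controls at 3 SNPs.
gen = ["0 1 2", "2 1 0", "1 1 1", "0 0 0"]
pheno = ["1", "1", "0", "0"]
freqs = allele_freq_by_status(gen, pheno)
print("cases:   ", freqs[1])
print("controls:", freqs[0])
```

At the disease SNP, relative risks above 1.0 would be expected to push the risk allele frequency in cases above that in controls.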
| | HapMap rel#22 - NCBI Build 36 (dbSNP b126) | HapMap rel#21 - NCBI Build 35 (dbSNP b125) |
| Haplotype and Legend files | http://www.hapmap.org/downloads/phasing/2007-08_rel22/phased/ | http://www.hapmap.org/downloads/phasing/2006-07_phaseII/ |
| Recombination rate files | https://mathgen.stats.ox.ac.uk/wtccc-software/recombination_rates/ | https://mathgen.stats.ox.ac.uk/wtccc-software/recombination_rates/ |
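Once the HapMap files for a panel have been downloaded, the command line can be assembled with the matching -Ne value from the flags table. The helper below is a small sketch of that bookkeeping. The function name and the file names in the example are hypothetical placeholders, and the panel keys CEU, YRI and JPT+CHB are the usual HapMap labels for the CEPH, Yoruban and Chinese and Japanese samples, used here as an assumption about how the files are organised.

```python
import shlex

# Recommended effective population sizes for autosomes, taken from the -Ne description.
RECOMMENDED_NE = {"CEU": 11418, "YRI": 17469, "JPT+CHB": 14269}


def hapgen_command(haps, legend, rec_map, panel, out_prefix, n_controls, n_cases):
    """Build an argument list for a HAPGEN run using only flags documented above."""
    return [
        "./hapgen",
        "-h", haps,
        "-l", legend,
        "-r", rec_map,
        "-Ne", str(RECOMMENDED_NE[panel]),        # required because -r is used
        "-o", out_prefix,
        "-n", str(n_controls), str(n_cases),       # controls first, then cases
        "-snptest",
    ]


# Hypothetical file names; substitute the files downloaded for the chosen panel.
cmd = hapgen_command("chr22.haps", "chr22.leg", "chr22.map", "CEU", "sim", 1000, 1000)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the hapgen executable is present in the working directory.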