GTOOL is a
program for transforming sets of genotype data for use with the programs SNPTEST
and IMPUTE.
GTOOL can
be used to:
(a) generate subsets of genotype data,
(b) convert genotype data between the PED file format and the FILE
FORMAT used by SNPTEST and IMPUTE,
(c) merge genotype datasets together.
Pre-compiled
versions of the program and example files can be downloaded from the links
below. We've supplied both static and dynamic versions of the Linux
executables. If you intend to run GTOOL on a machine running an old kernel
then you probably want to use the dynamic version. If you have any problems
getting the program to work on your machine please contact me.
0.5.0 - 28-07-2008 - Addition of Merge mode (-M). Addition of logging and --log option. Missing data is represented by 0 0 0. Samples can be excluded in Subset mode (--sample_excl). Bug fix for PED to GEN: if a SNP in the PED file was e.g. 22 44 22 22 then in GEN the alleles were (C -) not (C T).
0.4.1 - 12-02-2008 - Addition of --sex option to specify which column of the SAMPLE file is used for sex when converting GEN to PED files.
0.4.0 - 10-01-2008 - Read and write gzipped files. Bug fix for space/tab discrimination issue. Pass chromosome field between formats, if applicable.
0.3.0 - 07-09-2007 - Addition of the GEN to PED conversion option. Default output file names.
0.2.0 - 17-07-2007 - Addition of conversion option. Change made to sample file format to be in line with the FILE FORMAT used for SNPTEST and IMPUTE.
0.1.6 - 07-06-2007 - First version made available.
GTOOL can be used to create subsets of
datasets using the -S
option in conjunction with several sub-options. These options are illustrated
in the following examples. In these examples the genotype and sample files
to be subsetted are specified using the --g
and --s
options. The genotype and sample files should be in the format specified by
the FILE
FORMAT webpage.
The output files are specified using the --og
and --os
options.
Selecting a subset
of individuals specified by a list (--sample_id)
The --sample_id
file should be a list of sample_ids, one per line.
Selecting
a subset of SNPs based on their base-pair position (--start, --end)
SNP subsets can be generated by position
using the --start
and --end
flags. All SNPs in that range are included in the output. If --start
is defined but not --end,
--end is set to the position of the last SNP in the data set. If
--end is defined but not --start, --start is set to 0.
NOTE : The options described above can be
used together to create subsets of SNPs and samples at the same time.
The priority for SNP subsets, from highest to lowest,
is: Inclusion > Exclusion > Position. For example, if a SNP is selected by
position but is on the exclusion list, then it is not output. If it is also
on the inclusion list then it is output regardless.
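The priority rule above can be illustrated with a minimal Python sketch. This is not GTOOL's implementation: the function name select_snps and its arguments are hypothetical, the position range is assumed to be inclusive at both ends, and the sketch assumes that when neither bound is given the position filter covers the whole data set.

```python
def select_snps(snps, start=None, end=None, include=(), exclude=()):
    """Return the IDs of the SNPs to output.

    snps: list of (snp_id, position) tuples (hypothetical input layout).
    Priority, highest first: inclusion list > exclusion list > position.
    """
    include, exclude = set(include), set(exclude)
    # Defaults described in the text: start -> 0, end -> last SNP position.
    lo = 0 if start is None else start
    hi = max((pos for _, pos in snps), default=0) if end is None else end
    kept = []
    for snp_id, pos in snps:
        if snp_id in include:
            kept.append(snp_id)      # inclusion always wins
        elif snp_id in exclude:
            continue                 # excluded even if inside the range
        elif lo <= pos <= hi:        # inclusive range is an assumption
            kept.append(snp_id)
    return kept


snps = [("rs1", 100), ("rs2", 200), ("rs3", 300), ("rs4", 400)]
result = select_snps(snps, start=150, end=350, include=["rs4"], exclude=["rs2"])
```

Here rs3 is kept by position, rs2 is dropped by the exclusion list, and rs4 is kept by the inclusion list although it lies outside the range.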
GTOOL
can be used to convert datasets stored in PED files into the
FILE
FORMAT used by SNPTEST and IMPUTE.
PED and associated
MAP files are specified using the --ped
and --map options.
PED files usually use
a genotype coding scheme of A,C,G,T,N or 1,2,3,4,0. GTOOL can use either.
GTOOL assumes that the
PED file has the following first 6 columns: Family ID,
Individual ID, Paternal ID, Maternal ID, Sex (1=male; 2=female;
other=unknown), Phenotype. The IDs are alphanumeric: the combination of
family and individual ID should uniquely identify a person. A PED file
must have 1 and only 1 phenotype in the sixth column. The phenotype can
be either a quantitative trait or an affection status column.
The usual MAP
file format is:
chromosome  SNP_id  genetic_distance  position
GTOOL will place the
chromosome number in the first column of the GEN file. The
genetic_distance values are not used.
If you have
allelic type information for the SNPs you can add it as extra columns in
the MAP file e.g.
chromosome  SNP_id  genetic_distance  position  allele1  allele2
22          rs1234  0.001             1000000   A        G
Otherwise, it will
be inferred from the data and the output convention in the genotype file
will be alphabetic order for the alleles.
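As a rough illustration of the MAP layout and the allele inference described above, the following Python sketch parses one MAP line with optional allele columns and, when they are absent, infers the alleles from observed genotypes in alphabetical order. The function names, the dictionary layout and the handling of the missing codes N and 0 are assumptions made for this example, not GTOOL's own code.

```python
MISSING_CODES = {"N", "0"}  # missing-allele codes from the PED coding schemes


def parse_map_line(line):
    """Split a whitespace-delimited MAP line into named fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 columns in a MAP line")
    return {
        "chromosome": fields[0],
        "snp_id": fields[1],
        # fields[2] is genetic_distance, which is not used
        "position": int(fields[3]),
        # Optional extra columns: allele1, allele2
        "alleles": tuple(fields[4:6]) if len(fields) >= 6 else None,
    }


def infer_alleles(genotypes):
    """Infer the alleles of one SNP from (allele, allele) pairs.

    Missing codes are ignored and the result is sorted alphabetically.
    """
    seen = set()
    for a1, a2 in genotypes:
        seen.update(x for x in (a1, a2) if x not in MISSING_CODES)
    return tuple(sorted(seen))


with_alleles = parse_map_line("22 rs1234 0.001 1000000 A G")
without_alleles = parse_map_line("22 rs1234 0.001 1000000")
inferred = infer_alleles([("G", "G"), ("A", "G"), ("N", "N")])
```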
The names of the
genotype and sample files to be created are specified using the --g
and --s
options.
The --discrete_phenotype option is used to
control how the phenotype information in the PED file is used. If
--discrete_phenotype 1 is used
then for each value of phenotype GTOOL outputs a separate
genotype and a sample file with output filenames appended by the
phenotype value. If --discrete_phenotype is 0 then GTOOL outputs one genotype and
sample file.
An example of using
GTOOL to convert a PED/MAP file
pair is given below
GTOOL
can be used to convert datasets stored in GEN file format into PED
files.
In the GEN format each SNP is represented as a set of three probabilities
which correspond to the allele pairs AA,AB,BB. If one of the probabilities is
over the threshold specified by --threshold, then the genotype in the PED
file is expressed as the corresponding allele pair. The genotypes are
expressed as pairs of A,C,G,T. If none of the probabilities are over the
threshold then the pair is unknown, NN.
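The thresholding step can be sketched in Python as follows. This is only an illustration of the rule described above: the function name call_genotype is hypothetical, the default of 0.9 is chosen for the example, and "over the threshold" is read as strictly greater than.

```python
def call_genotype(probs, allele_a, allele_b, threshold=0.9):
    """Turn GEN probabilities (AA, AB, BB) into a PED allele pair.

    Returns "NN" when no probability is over the threshold.
    """
    pairs = (allele_a + allele_a, allele_a + allele_b, allele_b + allele_b)
    # Pick the most probable genotype, then check it against the threshold.
    best = max(range(3), key=lambda i: probs[i])
    if probs[best] > threshold:
        return pairs[best]
    return "NN"


confident = call_genotype((0.03, 0.0, 0.97), "C", "T", threshold=0.9)
heterozygous = call_genotype((0.0, 1.0, 0.0), "C", "T")
uncertain = call_genotype((0.5, 0.5, 0.0), "C", "T")
```

With these inputs the three calls give TT, CT and NN respectively.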
You can use any one of the phenotypes in the SAMPLE file as the phenotype in
the PED file. The name of the phenotype is specified with --phenotype.
This should correspond to a field on the first line of the SAMPLE file. If
the phenotype does not exist or you don't wish to set a phenotype, the
phenotype is given a value of -9 in the PED file. The name of the sex column
in the SAMPLE file is specified using the --sex
option. If unspecified, GTOOL will look for a column named "sex" or "gender". If no
column is found then the sex column in the PED file will be set to -9.
An
example of using GTOOL to convert a GEN and sample
file pair is given below
If the first column of the GEN file
contains the chromosome number of each SNP in the file (i.e.
example/example_chr.gen)
then these numbers are placed in the
chromosome column of the generated MAP file. Otherwise, the column is filled
with zeros. The following example illustrates this
An example of using GTOOL to convert a GEN and sample
file pair, using the default values for --ped, --map, --phenotype, --sex
and --threshold,
is given below
GTOOL can be
used to merge two or more datasets stored in GEN file format.
SNPs in the output
GEN file are ordered by position. Samples are output in the order that they
are read in. Missing data is represented as 0 0 0. If a SNP is not in a
dataset, then it is represented as missing for those samples that occur
only in that dataset.
If a given locus
(SNP + Sample) occurs in more than one file, when merging, there are four
possible outcomes:
(a) Identical, e.g. FileA 0 0 1, FileB 0 0 1. In this case the output is 0 0 1.
(b) Different, e.g. FileA 0 0 1, FileB 1 0 0. In this case the output is set to 0 0 0.
(c) Similar, e.g. FileA 0 0 1, FileB 0.03 0 0.97. In this case the threshold (--threshold) is applied. With a threshold of 0.9, the output would be 0 0 1.
(d) Different, but one of the datasets has missing data, e.g. FileA 0 0 0, FileB 0 0 1. In this case the output would be 0 0 1.
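The four outcomes above can be expressed as a small Python sketch. It is an interpretation of the description rather than GTOOL's code: the names merge_locus and hard_call are hypothetical, and the sketch assumes that two non-identical entries are merged only when both pass the threshold for the same genotype, with everything else set to missing.

```python
MISSING = (0.0, 0.0, 0.0)


def hard_call(probs, threshold):
    """Index of the genotype whose probability is over the threshold, else None."""
    best = max(range(3), key=lambda i: probs[i])
    return best if probs[best] > threshold else None


def merge_locus(a, b, threshold=0.9):
    """Merge the probability triples of one SNP/sample pair from two files."""
    a, b = tuple(a), tuple(b)
    if a == b:
        return a                       # identical
    if a == MISSING:
        return b                       # missing in one file: use the other
    if b == MISSING:
        return a
    call_a, call_b = hard_call(a, threshold), hard_call(b, threshold)
    if call_a is not None and call_a == call_b:
        merged = [0.0, 0.0, 0.0]       # similar: both give the same call
        merged[call_a] = 1.0
        return tuple(merged)
    return MISSING                     # different: set to missing


identical = merge_locus((0, 0, 1), (0, 0, 1))
different = merge_locus((0, 0, 1), (1, 0, 0))
similar = merge_locus((0, 0, 1), (0.03, 0, 0.97), threshold=0.9)
one_missing = merge_locus((0, 0, 0), (0, 0, 1))
```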
If a SNP occurs in
more than one file but the allele type information (alleleA and alleleB,
columns 4 and 5 in the GEN file) is different then there are four possible
outcomes:
(a) Reverse, e.g. FileA C T, FileB T C. In this case the probabilities in FileB are reversed, so 0 0 1 → 1 0 0 and vice versa.
(b) Reverse-Complement, e.g. FileA C T, FileB A G. In this case the probabilities in FileB are reversed.
(c) Complement, e.g. FileA C T, FileB G A. In this case the probabilities in FileB are unchanged.
(d) Different, e.g. FileA C T, FileB G T. In this case the SNP is removed from the output.
GTOOL is unable to
determine the relative strand of A/T and C/G SNPs, which may lead to some SNPs of
this type having missing data. A solution for this is being developed.
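The allele comparison above can be sketched in Python as shown here. The function names and the returned labels ("keep", "reverse", "drop", "ambiguous") are hypothetical and chosen for this example; in particular, flagging A/T and C/G SNPs as "ambiguous" is only one way to represent the strand problem just described, not a statement of what GTOOL does internally.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reconcile_alleles(ref, other):
    """Compare (alleleA, alleleB) of FileB (other) against FileA (ref)."""
    ref, other = tuple(ref), tuple(other)
    if COMPLEMENT[ref[0]] == ref[1]:
        return "ambiguous"             # A/T or C/G SNP: strand cannot be told
    comp = (COMPLEMENT[other[0]], COMPLEMENT[other[1]])
    if other == ref:
        return "keep"                  # same alleles, same order
    if other == ref[::-1]:
        return "reverse"               # Reverse
    if comp == ref:
        return "keep"                  # Complement
    if comp == ref[::-1]:
        return "reverse"               # Reverse-Complement
    return "drop"                      # Different


def apply_action(probs, action):
    """Reverse the probability triple when the alleles are swapped."""
    return tuple(reversed(probs)) if action == "reverse" else tuple(probs)


actions = [reconcile_alleles(("C", "T"), b)
           for b in (("T", "C"), ("A", "G"), ("G", "A"), ("G", "T"))]
flipped = apply_action((0, 0, 1), "reverse")
```

For the four FileB examples in the list above, the sketch yields reverse, reverse, keep and drop, in that order.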
The sample
information for the genotypes, including phenotype and covariate information,
is also merged. If a sample occurs in two datasets and each file holds different
phenotype and covariate fields, the union of those fields
is output to the sample (--os) file. If a sample occurs in two
datasets but has different values for the same field, that field is set to -9 in the
output.
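A minimal Python sketch of this sample merging rule follows. The function name merge_sample and the dictionary representation of a sample record are assumptions made for illustration; the sketch also assumes that any disagreement, including one where one of the values is already -9, results in -9.

```python
def merge_sample(rec_a, rec_b, missing="-9"):
    """Merge the phenotype/covariate fields of one sample from two files.

    The result holds the union of the fields; a field present in both
    records with different values is set to the missing code.
    """
    merged = dict(rec_a)
    for field, value in rec_b.items():
        if field not in merged:
            merged[field] = value      # field only in the second file
        elif merged[field] != value:
            merged[field] = missing    # conflicting values
    return merged


merged = merge_sample(
    {"pheno1": "1", "age": "40"},
    {"pheno1": "2", "bmi": "23"},
)
```

In this example pheno1 conflicts and becomes -9, while age and bmi are both carried into the merged record.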
An example of using GTOOL
to merge two files is given below.