GTOOL

GTOOL is a program for transforming sets of genotype data for use with the programs SNPTEST and IMPUTE.

GTOOL can be used to:
(a) generate subsets of genotype data,
(b) convert genotype data between the PED file format and the FILE FORMAT used by SNPTEST and IMPUTE,
(c) merge genotype datasets together.



Contributors

The following people have contributed to the development of the methodology and software for GTOOL:

Colin Freeman, Jonathan Marchini


Download

Pre-compiled versions of the program and example files can be downloaded from the links below. We've supplied both static and dynamic versions of the Linux executables. If you intend to run GTOOL on a machine running an old kernel then you probably want to use the dynamic version. If you have any problems getting the program to work on your machine please contact me.


Platform                                       File
Linux (x86_64) Static Executable               gtool_v0.5.0_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)    gtool_v0.5.0_SuSE9.3_x86_64_static.tgz
Linux (x86_64) Dynamic Executable              gtool_v0.5.0_x86_64_dynamic.tgz
Linux (i386) Static Executable                 gtool_v0.5.0_i386_static.tgz
Linux (i386) Dynamic Executable                gtool_v0.5.0_i386_dynamic.tgz
Solaris 5.10 (AMD Opteron)                     gtool_v0.5.0_Solaris5.10_Opteron.tgz

The previous version (v0.4.1) can be found here.

Please fill out the registration form to receive emails about updates to this software.

To unpack the files use a command like

tar zxvf gtool_vX.X.X_i386.tgz

This will create an executable called gtool and a directory called example that contains the following example data files:

example.gen                                            A genotype file (containing data at 11 SNPs and 5 samples).
example.sample                                         A sample file.
rs_id.txt                                              A file containing a list of rs_ids.
sample_id.txt                                          A file containing a list of sample_ids.
exclusion.txt                                          A file containing a list of rs_ids to exclude.
sample_excl.txt                                        A file containing a list of sample_ids to exclude.
example.ped                                            An example PED file.
example.map                                            An example MAP file.
example10.gen, example11.gen, example12.gen            Example genotype files for merging.
example10.sample, example11.sample, example12.sample   Example sample files for merging.

Version History

0.5.0 - 28-07-2008 - Addition of Merge mode (-M). Addition of logging and the --log option. Missing data is represented by 0 0 0. Samples can be excluded in Subset mode (--sample_excl). Bug fix for PED to GEN: if a SNP in the PED file was coded as e.g. 22 44 22 22, the alleles in the GEN file were written as (C -) rather than (C T).
0.4.1 - 12-02-2008 - Addition of the --sex option to specify which column of the SAMPLE file is used for sex when converting GEN to PED files.
0.4.0 - 10-01-2008 - Read and write gzipped files. Bug fix for space/tab discrimination issue. Pass chromosome field between formats, if applicable.
0.3.0 - 07-09-2007 - Addition of the GEN to PED conversion option. Default output file names.
0.2.0 - 17-07-2007 - Addition of conversion option. Change made to sample file format to be in line with the FILE FORMAT used for SNPTEST and IMPUTE.
0.1.6 - 07-06-2007 - First version made available.


Running GTOOL

To run GTOOL, and see the parameters that it requires, type:

./gtool

GTOOL can be run in one of four different modes: Subset Mode (-S), PED to GEN Conversion Mode (-P), GEN to PED Conversion Mode (-G) and Merge Mode (-M).

GTOOL will read gzipped or non-compressed files as input. Output files will be gzipped if the main input data file (GEN or PED) is gzipped.
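Because the output may or may not be gzipped depending on the input, a downstream script has to cope with both cases. The following is a minimal Python sketch of one way to do that; it is not part of GTOOL, and the helper name open_text is hypothetical. It detects compression from the two-byte gzip magic number rather than from the file extension.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading as text."""
    # Peek at the first two bytes: gzip streams start with 0x1f 0x8b.
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")
```

A caller can then write "with open_text(path) as fh:" and iterate over lines without knowing in advance whether GTOOL compressed the file.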

Subset Mode

GTOOL can be used to create subsets of datasets using the -S option in conjunction with several sub-options. These options are illustrated in the following examples.  In these examples the genotype and sample files to be subsetted are specified using the --g and --s options. The genotype and sample files should be in the format specified by the FILE FORMAT webpage.

The output files are specified using the --og and --os options.

Selecting a subset of individuals specified by a list (--sample_id)

The --sample_id file should be a list of sample_ids, one per line.

./gtool -S --g example/example.gen --s example/example.sample --og example/out.gen --os example/out.sample --sample_id example/sample_id.txt

Excluding a subset of individuals (--sample_excl)

The --sample_excl file should be a list of sample_ids, one per line.

./gtool -S --g example/example.gen --s example/example.sample --og example/out.gen --os example/out.sample --sample_excl example/sample_excl.txt

Selecting a subset of SNPs specified by a list (--inclusion)

The --inclusion file should be a single column file with a SNP ID on each line.

./gtool -S --g example/example.gen --s example/example.sample --og example/out.gen --os example/out.sample --inclusion example/rs_id.txt

Excluding subsets of SNPs (--exclusion)

The --exclusion file should be a single column file with a SNP ID on each line.

./gtool -S --g example/example.gen --s example/example.sample --og example/out.gen --os example/out.sample --exclusion example/exclusion.txt

Selecting a subset of SNPs based on their base-pair position (--start, --end)

SNP subsets can be generated by position using the --start and --end flags. All SNPs in that range are included in the output. If --start is defined but not --end, --end is set to the position of the last SNP in the data set. If --end is defined but not --start, --start is set to 0.

./gtool -S --g example/example.gen --s example/example.sample --og example/out.gen --os example/out.sample --start 10015000 --end 10075000
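The defaulting behaviour of --start and --end can be illustrated with a short Python sketch. This is not GTOOL's code; the function name select_by_position is hypothetical, and it assumes that "the last SNP in the data set" means the SNP with the largest position and that the range is inclusive at both ends (start ≤ position ≤ end).

```python
def select_by_position(positions, start=None, end=None):
    """Return the positions that fall in the inclusive range [start, end]."""
    if not positions:
        return []
    # A missing --start is treated as 0.
    lo = 0 if start is None else start
    # A missing --end is treated as the last (largest) position in the data.
    hi = max(positions) if end is None else end
    return [p for p in positions if lo <= p <= hi]
```

For example, with made-up positions [10001000, 10015000, 10050000, 10075000, 10090000], passing start=10015000 and end=10075000 keeps the middle three, while passing only start=10050000 keeps everything from 10050000 to the end of the data.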

NOTE: The options described above can be used together to create subsets of SNPs and samples at the same time. The priority for SNP subsets, from highest to lowest, is Inclusion > Exclusion > Position. That is, if a SNP is selected by position but is on the exclusion list, it is not output; if it is also on the inclusion list, it is output regardless.
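The priority rule can be written down as a small decision function. The Python sketch below is one reading of the note above, not GTOOL's actual implementation; the name keep_snp and its arguments are hypothetical. It also assumes that when an inclusion list is given without any position range, only listed SNPs are kept, which the note does not state explicitly.

```python
def keep_snp(rs_id, position, inclusion=None, exclusion=None, start=None, end=None):
    """Decide whether a SNP is output, using Inclusion > Exclusion > Position."""
    # Highest priority: a SNP on the inclusion list is always output.
    if inclusion is not None and rs_id in inclusion:
        return True
    # Next: a SNP on the exclusion list is never output.
    if exclusion is not None and rs_id in exclusion:
        return False
    # Lowest priority: the position range, if one was requested.
    if start is not None or end is not None:
        lo = 0 if start is None else start
        return position >= lo and (end is None or position <= end)
    # Assumption: with no position range, an inclusion list acts as a whitelist.
    return inclusion is None
```

Here keep_snp("rs1", 100, inclusion={"rs1"}, exclusion={"rs1"}, start=200, end=300) is True because inclusion wins, whereas a SNP inside the range that appears only on the exclusion list is dropped.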

PED to GEN Conversion Mode

GTOOL can be used to convert datasets stored in PED files into the FILE FORMAT used by SNPTEST and IMPUTE.
An example of using GTOOL to convert a PED/MAP file pair is given below

./gtool -P --ped example/example.ped --map example/example.map --og example/out.gen --os example/out.sample --discrete_phenotype 1

An example of using GTOOL to convert a PED/MAP file pair, using the default values for --og and --os, is given below

./gtool -P --ped example/example.ped --map example/example.map --discrete_phenotype 1

GEN to PED Conversion Mode

GTOOL can be used to convert datasets stored in GEN file format into PED files.

In the GEN format each SNP is represented as a set of three probabilities which correspond to the allele pairs AA,AB,BB. If one of the probabilities is over the threshold specified by --threshold, then the genotype in the PED file is expressed as the corresponding allele pair. The genotypes are expressed as pairs of A,C,G,T. If none of the probabilities are over the threshold then the pair is unknown, NN.
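The thresholding rule described above can be sketched in a few lines of Python. This is an illustration only, not GTOOL's code: the function name call_genotype is hypothetical, and "over the threshold" is read as a strict greater-than comparison, which is an assumption.

```python
def call_genotype(p_aa, p_ab, p_bb, allele_a, allele_b, threshold=0.9):
    """Turn the three GEN probabilities for one sample into an allele pair."""
    # Each probability corresponds to one allele pair: AA, AB or BB.
    if p_aa > threshold:
        return allele_a + allele_a
    if p_ab > threshold:
        return allele_a + allele_b
    if p_bb > threshold:
        return allele_b + allele_b
    # No probability exceeds the threshold: the genotype is unknown.
    return "NN"
```

With alleles C and T, the probabilities 0.02 0.97 0.01 give CT at the default threshold, while 0.5 0.5 0 and the missing-data triple 0 0 0 both give NN.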

You can use any one of the phenotypes in the SAMPLE file as the phenotype in the PED file. The name of the phenotype is specified with --phenotype. This should correspond to a field on the first line of the SAMPLE file. If the phenotype does not exist or you don't wish to set a phenotype, the phenotype is given a value of -9 in the PED file. The name of the sex column in the SAMPLE file is specified using the --sex option. If --sex is unspecified, GTOOL will look for a column named "sex" or "gender". If no column is found then the sex column in the PED file will be set to -9.
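The column lookup for sex can be sketched as follows in Python. This is a hypothetical illustration rather than GTOOL's code: the function name resolve_sex_column is invented, the matching is assumed to be exact and case-sensitive, and a --sex name that is absent from the header is assumed to be treated the same as no column being found.

```python
def resolve_sex_column(header, sex_option=None):
    """Return the SAMPLE column to use for sex, or None if the PED value is -9."""
    if sex_option is not None:
        # Assumption: an explicitly named column must exist in the header.
        return sex_option if sex_option in header else None
    # Without --sex, fall back to a column named "sex" or "gender".
    for name in ("sex", "gender"):
        if name in header:
            return name
    return None
```

A return value of None corresponds to writing -9 in the sex column of the PED file.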

An example of using GTOOL to convert a GEN and sample file pair is given below

./gtool -G --g example/example.gen --s example/example.sample --ped example/out.ped --map example/out.map --phenotype phenotype_1 --threshold 0.9

If the first column of the GEN file contains the chromosome number of each SNP in the file (e.g. example/example_chr.gen) then these numbers are placed in the chromosome column of the generated MAP file. Otherwise, the column is filled with zeros. The following example illustrates this

./gtool -G --g example/example_chr.gen --s example/example.sample --ped example/out.ped --map example/out.map --phenotype phenotype_1 --threshold 0.9

An example of using GTOOL to convert a GEN and sample file pair, using the default values for --ped, --map, --phenotype, --sex and --threshold, is given below

./gtool -G --g example/example.gen --s example/example.sample

Merge Mode

GTOOL can be used to merge two or more datasets stored in GEN file format.

SNPs in the output GEN file are ordered by position. Samples are output in the order that they are read in. Missing data is represented as 0 0 0. If a SNP is not in a dataset, then it is represented as missing in those samples which are uniquely in the dataset.
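The ordering and missing-data rules above can be sketched in Python for the simplest case, where the datasets share no samples and agree on alleles. This is not GTOOL's implementation; the function name merge_disjoint and the in-memory layout (a dict with a "samples" list and a "snps" mapping from rs_id to a position and a list of probability triples) are assumptions made for the illustration.

```python
MISSING = (0, 0, 0)  # missing data is written as 0 0 0


def merge_disjoint(datasets):
    """Merge datasets with non-overlapping samples into position-ordered rows."""
    # Samples are output in the order they are read in.
    samples = [s for d in datasets for s in d["samples"]]
    # Collect every SNP seen in any dataset, remembering its position.
    positions = {}
    for d in datasets:
        for rs_id, (pos, _) in d["snps"].items():
            positions.setdefault(rs_id, pos)
    merged = []
    # SNPs in the output are ordered by position.
    for rs_id in sorted(positions, key=lambda r: positions[r]):
        row = []
        for d in datasets:
            if rs_id in d["snps"]:
                row.extend(d["snps"][rs_id][1])
            else:
                # SNP absent from this dataset: its samples are missing.
                row.extend([MISSING] * len(d["samples"]))
        merged.append((rs_id, positions[rs_id], row))
    return samples, merged
```

Overlapping samples and conflicting alleles are deliberately left out of this sketch, since they need the additional rules discussed in the rest of this section.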

If a given locus (SNP + Sample) occurs in more than one file, when merging, there are four possible outcomes:

If a SNP occurs in more than one file but the allele type information (alleleA and alleleB, columns 4 and 5 in the GEN file) is different then there are four possible outcomes:

GTOOL is unable to determine the relative strand of A/T and C/G SNPs, which may lead to some SNPs of this type having missing data. A solution for this is being developed.

The sample information for the genotypes, including phenotype and covariate information, is also merged. If a sample occurs in two datasets and each file holds different phenotype and covariate fields for it, the union of those fields is output to the sample (--os) file. If a sample occurs in two datasets but has different values for the same field, that field is set to -9 in the output.
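The sample-merging rule can be sketched in Python as follows. This illustrates the rule as stated, not GTOOL's code; the function name merge_sample_fields and the representation of one sample's record as a dict of field name to string value are assumptions, and the field names age and bmi in the usage note are made up.

```python
def merge_sample_fields(a, b, conflict="-9"):
    """Merge two records for the same sample: union of fields, -9 on conflict."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            # Same field, different values in the two datasets.
            merged[key] = conflict
        else:
            # Field only in b, or identical in both.
            merged[key] = value
    return merged
```

For instance, merging {"phenotype_1": "1", "age": "40"} with {"phenotype_1": "0", "bmi": "22.5"} yields a record with age and bmi preserved and phenotype_1 set to -9.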

An example of using GTOOL to merge two files is given below.

./gtool -M --g example/example10.gen example/example11.gen --s example/example10.sample example/example11.sample --log example/example10_example11.log

A further example,

./gtool -M --g example/example11.gen example/example12.gen --s example/example11.sample example/example12.sample --threshold 0.9 --log example/example11_example12.log

Options

A complete set of options is given in the following table

Parameters              Type      Description

-S                                Subset mode
--g                     File      Input genotype file
--s                     File      Input sample file
--og                    File      Output genotype file. Default, append .subset to genotype file name
--os                    File      Output sample file. Default, append .subset to sample file name
--sample_id             File      Define a subset of genotypes from a list of sample_ids
--start                 Integer   Define a subset of SNPs by position (in basepairs) in the range start ≤ position ≤ end
--end                   Integer   Define a subset of SNPs by position (in basepairs) in the range start ≤ position ≤ end
--inclusion             File      Define a subset of SNPs to include. The --inclusion file should be a single column file with a SNP ID on each line.
--exclusion             File      Define a subset of SNPs to exclude. The --exclusion file should be a single column file with a SNP ID on each line.

-P                                PED to GEN Conversion mode
--ped                   File      PED format genotype file
--map                   File      MAP SNP file which accompanies the --ped file
--og                    File      Output genotype file. Default, append .gen to PED file name
--os                    File      Output sample file. Default, append .sample to PED file name
--discrete_phenotype    Integer   1 if the PED phenotype is discrete e.g. affectation status. 0 if the PED phenotype is continuous e.g. height

-G                                GEN to PED Conversion mode
--g                     File      Input genotype file
--s                     File      Input sample file
--ped                   File      Output PED file. Default, append .ped to genotype file name
--map                   File      Output MAP file. Default, append .map to genotype file name
--phenotype             String    Name of the phenotype column in the SAMPLE file to output to PED. Default -9 (Unknown)
--sex                   String    Name of the sex column in the SAMPLE file to output to PED. If unspecified, GTOOL will look for a column named "sex" or "gender". Default -9 (Unknown)
--threshold             Double    Threshold for calling genotypes from GEN probability. Default 0.9

-M                                Merge mode
--g                     File      List of input genotype files
--s                     File      List of input sample files
--og                    File      Output genotype file. Default format File1_File2_...FileN.gen
--os                    File      Output sample file. Default File1_File2_...FileN.sample
--threshold             Double    Threshold for calling genotypes from GEN probability. Default >1 (strict comparison).

--log                   File      Name of log file. Use with -S,-P,-G,-M. Default gtool.log

References

[1] J. Marchini, C. Spencer, Y.Y. Teo and P. Donnelly (2007) A Bayesian Hierarchical Mixture Model for Genotype Calling in a multi-cohort study. (in preparation)
[2] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913.
[3] The Wellcome Trust Case Control Consortium (2007) Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-78. PMID: 17554300. DOI: 10.1038/nature05911

Contact Information

If you have any questions regarding the use of this program please send an email to Dr Colin Freeman (cfreeman <at> well <dot> ox <dot> ac <dot> uk)