FILE FORMATS

CHIAMO, HAPGEN, IMPUTE, SNPTEST and GTOOL are designed to work together in a seamless fashion. As such, there is a single file format that links the programs. This format consists of two parts: (a) a genotype file that contains genotype data in a one-line-per-SNP format, and (b) a sample file that contains the information about each individual, i.e. individual IDs, covariates, phenotypes and missing data proportions. This format is described below with examples.


NOTE : the sample file format has changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the sample file has changed. See below for more details. If you have been using SNPTEST v1.1.5 you will need to edit the sample files of your data so that SNPTEST v2 will work with your data.

Genotype File Format

The genotype file stores data in a one-line-per-SNP format. The first 5 entries of each line should be the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line should be the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers should be the genotype probabilities for the second individual in the cohort. The next three numbers are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The three probabilities for an individual need not sum to 1, which allows for the possibility of a NULL genotype call. In this way the format allows for genotype uncertainty. This genotype file format is the same as that produced by the genotype calling algorithm CHIAMO.

NOTE : We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.
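
As a rough illustration of the layout described above, the following Python sketch splits one genotype-file line into its five SNP fields and one probability triple per individual. The function name parse_genotype_line and the dictionary keys are hypothetical names chosen for this example; they are not part of any of the programs.

```python
def parse_genotype_line(line):
    """Split one genotype-file line into SNP fields and per-individual triples."""
    fields = line.split()
    # Five SNP fields, then exactly three probabilities per individual.
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 SNP fields followed by probability triples")
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    values = [float(x) for x in fields[5:]]
    # Group the numbers into (P(AA), P(AB), P(BB)) triples, one per individual.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probs": triples,
    }


record = parse_genotype_line("SNP3 rs3 3000 C T 1 0 0 0 1 0")
print(record["probs"])  # [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```

The sketch deliberately does not check that each triple sums to 1, since the format permits triples that do not.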

Example

Suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are

SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG

The correct genotype file would be

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1

So, at SNP3 the two alleles are C and T, and the set of 3 probabilities for each individual corresponds to the genotypes CC, CT and TT respectively.

Note : columns 2 and 3 (that contain the RS ID and base-pair position of the SNPs) are set arbitrarily in this example.
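
The conversion used in the example above, from known genotypes to probability triples, can be sketched in Python as follows. The helper names genotype_to_probs and format_genotype_line are hypothetical, and the sketch assumes every genotype is a two-letter call made up of the two alleles given for the SNP.

```python
def genotype_to_probs(genotype, allele_a, allele_b):
    """Map a hard call such as "CT" to its (AA, AB, BB) probability triple."""
    a_count = sum(1 for allele in genotype if allele == allele_a)
    b_count = sum(1 for allele in genotype if allele == allele_b)
    if len(genotype) != 2 or a_count + b_count != 2:
        raise ValueError(
            f"genotype {genotype!r} does not match alleles {allele_a}/{allele_b}"
        )
    probs = [0, 0, 0]
    # The number of B alleles (0, 1 or 2) selects AA, AB or BB.
    probs[b_count] = 1
    return probs


def format_genotype_line(snp_id, rs_id, position, allele_a, allele_b, genotypes):
    """Build one space-separated genotype-file line for a single SNP."""
    fields = [snp_id, rs_id, str(position), allele_a, allele_b]
    for genotype in genotypes:
        fields.extend(str(p) for p in genotype_to_probs(genotype, allele_a, allele_b))
    return " ".join(fields)


line = format_genotype_line("SNP5", "rs5", 5000, "A", "G", ["AG", "GG"])
print(line)  # SNP5 rs5 5000 A G 0 1 0 0 0 1
```

The printed line matches the SNP5 row of the example genotype file.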

Sample File Format

NOTE : the sample file format has changed with the release of SNPTEST v2. Specifically, the way in which covariates and phenotypes are coded on the second line of the sample file has changed. If you have been using SNPTEST v1.1.5 you will need to edit the sample files of your data so that SNPTEST v2 will work with your data.

The sample file has three parts (a) a header line detailing the names of the columns in the file, (b) a line detailing the types of variables stored in each column, and (c) a line for each individual detailing the information for that individual. Here is an example of the start of a sample file for reference

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
3 3 0.005 1 2 0.0025 0.0028 6.121 1
4 4 0.007 2 1 0.0017 -0.011 3.234 1
5 5 0.004 3 2 -0.012 0.0236 2.786 0

The header line

This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4, a continuous phenotype named pheno1 and a binary phenotype named bin1.
NOTE : All phenotypes should appear after the covariates in this file.

The second line (the variable type line)

The second line of the file details the type of variables included in each column. The first three entries of this line should be set to 0. Subsequent entries in this line for covariates and phenotypes should be specified by the following rules

D : Discrete covariate (coded using positive integers)
C : Continuous covariate
P : Continuous phenotype
B : Binary phenotype (0 = Controls, 1 = Cases)

Individual information

The remainder of the file should consist of a line for each individual containing the information specified by the entries of the header line (see example above).
Use spaces to separate the entries of the sample file and not TABS.
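
The layout rules for the three parts of the sample file can be checked with a short Python sketch like the one below. The function name read_sample_file and the error messages are hypothetical, and the checks only reflect the rules stated in this section; the programs themselves may accept or reject files differently.

```python
COVARIATE_TYPES = {"D", "C"}
PHENOTYPE_TYPES = {"P", "B"}


def read_sample_file(text):
    """Parse sample-file text into (names, types, rows) and check the layout."""
    if "\t" in text:
        raise ValueError("entries must be separated by spaces, not tabs")
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("a sample file needs a header line and a variable type line")
    names, types, rows = lines[0], lines[1], lines[2:]
    if names[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("the first three columns must be ID_1, ID_2 and missing")
    if len(types) != len(names) or types[:3] != ["0", "0", "0"]:
        raise ValueError("the type line must start with 0 0 0 and match the header")
    seen_phenotype = False
    for name, kind in zip(names[3:], types[3:]):
        if kind in PHENOTYPE_TYPES:
            seen_phenotype = True
        elif kind in COVARIATE_TYPES:
            # All phenotypes should appear after the covariates.
            if seen_phenotype:
                raise ValueError(f"covariate {name} appears after a phenotype")
        else:
            raise ValueError(f"unknown variable type {kind!r} for column {name}")
    for row in rows:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} entries, found {len(row)}")
    return names, types, rows


example = """ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
"""
names, types, rows = read_sample_file(example)
print(len(rows), names[3:])
```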

Missing values - Specifying missing values for covariates and phenotypes is possible. We used to recommend that you use -9 for missing values. This was the default value assumed by SNPTEST v1, although the -missing_code option in SNPTEST v1 meant that you could use other numeric values for the missing code. In SNPTEST v2 we changed the behaviour of the -missing_code option so that it now takes a comma-separated list of values, each of which is treated as missing when encountered in the sample file(s). The default missing value is now the two-character string "NA".
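
To make the effect of a comma-separated list of missing codes concrete, here is a minimal Python sketch of how a reader of the sample file might treat such codes. The function parse_value is hypothetical and compares entries as plain strings; this is an assumption made for illustration, not a description of how SNPTEST itself matches values.

```python
def parse_value(entry, missing_codes=("NA",)):
    """Return None for a missing entry, otherwise the entry as a float."""
    # Assumption: an entry is missing only if it equals one of the codes exactly.
    if entry in missing_codes:
        return None
    return float(entry)


# A comma-separated list of codes, in the style passed to -missing_code.
codes = "NA,-9".split(",")
values = [parse_value(x, codes) for x in ["1.233", "NA", "-9", "0"]]
print(values)  # [1.233, None, None, 0.0]
```

With only the default code, an entry of -9 would be read as an ordinary number, which is why older sample files that use -9 need the code to be listed explicitly.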