FILE FORMATS

CHIAMO, HAPGEN, IMPUTE, SNPTEST and GTOOL are designed to work together seamlessly. As such, a single file format links the programs. This format consists of two parts: (a) a genotype file that contains genotype data in a one-line-per-SNP format, and (b) a sample file that contains information about each individual, i.e. individual IDs, covariates, phenotypes and missing data proportions. This format is described below with examples.

Genotype File Format

The genotype file stores data in a one-line-per-SNP format. The first 5 entries of each line should be the SNP ID, the RS ID of the SNP, the base-pair position of the SNP, the allele coded A and the allele coded B. The SNP ID can be used to denote the chromosome number of each SNP. The next three numbers on the line should be the probabilities of the three genotypes AA, AB and BB at the SNP for the first individual in the cohort. The next three numbers should be the genotype probabilities for the second individual, the next three are for the third individual, and so on. The order of individuals in the genotype file should match the order of the individuals in the sample file (see below). The format allows for genotype uncertainty, and the three probabilities need not sum to 1, which allows for the possibility of a NULL genotype call. This genotype file format is the same as that produced by the genotype calling algorithm CHIAMO.
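
As an illustration of the layout described above, the following Python sketch splits one genotype-file line into its five SNP fields and one (AA, AB, BB) probability triple per individual. The function name parse_genotype_line and the dictionary keys are hypothetical, chosen for this example only; they are not part of any of the programs named above.

```python
def parse_genotype_line(line):
    """Split one genotype-file line into SNP fields and probability triples."""
    fields = line.split()
    # Five SNP fields, then three probabilities per individual.
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 SNP fields followed by triples of probabilities")
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    values = [float(x) for x in fields[5:]]
    # Group the values into (P(AA), P(AB), P(BB)), one triple per individual.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "probs": triples,
    }


record = parse_genotype_line("SNP3 rs3 3000 C T 1 0 0 0 1 0")
print(record["probs"])  # [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```

The sketch does not require each triple to sum to 1, since the format permits a shortfall for NULL genotype calls.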

NOTE : We recommend that you arrange SNPs in base-pair order in the genotype files. This is required if you want to use the files with IMPUTE and will make viewing the output of SNPTEST somewhat easier.
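
A quick way to check the ordering recommended in this note is sketched below in Python. The helper name positions_in_order is hypothetical, and the sketch assumes that all lines passed in belong to a single chromosome, so that positions are expected never to decrease.

```python
def positions_in_order(lines):
    """Return True if the base-pair positions (third entry) never decrease."""
    positions = [int(line.split()[2]) for line in lines if line.strip()]
    return all(a <= b for a, b in zip(positions, positions[1:]))


example = [
    "SNP1 rs1 1000 A C 1 0 0 1 0 0",
    "SNP2 rs2 2000 G T 1 0 0 0 1 0",
    "SNP3 rs3 3000 C T 1 0 0 0 1 0",
]
print(positions_in_order(example))  # True
```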

Example

Suppose you want to create a genotype file for 2 individuals at 5 SNPs whose genotypes are

SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG

The correct genotype file would be

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1

At SNP3 the two alleles are C and T, so the set of 3 probabilities for each individual corresponds to the genotypes CC, CT and TT respectively.

Note : columns 2 and 3 (which contain the RS ID and base-pair position of the SNPs) are set arbitrarily in this example.
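
The conversion used in this example can be sketched in Python as follows. The function name genotype_line is hypothetical and the sketch assumes hard (certain) genotype calls written as two-letter strings such as "CT"; it counts copies of the allele coded B to decide which of the three probabilities is set to 1.

```python
def genotype_line(snp_id, rs_id, position, allele_a, allele_b, calls):
    """Build one genotype-file line from hard genotype calls such as "CT"."""
    entries = []
    for call in calls:
        if len(call) != 2 or any(base not in (allele_a, allele_b) for base in call):
            raise ValueError(f"call {call!r} does not match alleles {allele_a}/{allele_b}")
        # Copies of allele B: 0 -> AA, 1 -> AB, 2 -> BB.
        n_b = sum(1 for base in call if base == allele_b)
        probs = [0, 0, 0]
        probs[n_b] = 1
        entries.extend(str(p) for p in probs)
    return " ".join([snp_id, rs_id, str(position), allele_a, allele_b] + entries)


snps = [
    ("SNP1", "rs1", 1000, "A", "C", ["AA", "AA"]),
    ("SNP2", "rs2", 2000, "G", "T", ["GG", "GT"]),
    ("SNP3", "rs3", 3000, "C", "T", ["CC", "CT"]),
    ("SNP4", "rs4", 4000, "C", "T", ["CT", "CT"]),
    ("SNP5", "rs5", 5000, "A", "G", ["AG", "GG"]),
]
lines = [genotype_line(*snp) for snp in snps]
print("\n".join(lines))
```

Run on the five SNPs above, this reproduces the genotype file shown in the example.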

Sample File Format

The sample file has three parts: (a) a header line giving the names of the columns in the file, (b) a line giving the type of variable stored in each column, and (c) one line per individual containing the information for that individual. Here is an example of the start of a sample file, for reference:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 phenotype_1
0 0 0 1 2 3 3 P
1 1 0.007 1 2 0.0019 -0.008 1.233
2 2 0.009 1 2 0.0022 -0.001 6.234
3 3 0.005 1 2 0.0025 0.0028 6.121
4 4 0.007 2 1 0.0017 -0.011 3.234
5 5 0.004 3 2 -0.012 0.0236 2.786

The header line

This line needs a minimum of three entries. The first three entries should always be ID_1, ID_2 and missing. They denote that the first three columns contain the first ID, second ID and missing data proportion of each individual. Additional entries on this line should be the names of covariates or phenotypes that are included in the file. In the above example, there are 4 covariates named cov_1, cov_2, cov_3, cov_4 and a phenotype named phenotype_1.
NOTE : All phenotypes should appear after the covariates in this file.

The second line (the variable type line)

The second line of the file details the type of variables included in each column. The first three entries of this line should be set to 0. Subsequent entries in this line for covariates and phenotypes should be specified by the following rules

1
Discrete covariate (coded using positive integers) that you want to use to carry out a Mantel-Haenszel test of association, i.e. a test for a common genetic effect across groups that allows for a different baseline risk in each group.
2
Discrete covariate (coded using positive integers) that you want to use to carry out a combined test of association across groups, i.e. carry out a separate test of association in each group and combine the results.
3
Continuous covariates
P
Phenotype

Individual information

The remainder of the file should consist of one line for each individual, containing the information specified by the entries of the header line (see example above).
Missing covariate and phenotype values should be coded as -9.

Separate the entries of the sample file with spaces, not tabs.
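
The rules for the sample file can be collected into a small checker, sketched here in Python. The function name parse_sample_file and the returned structure are hypothetical and not part of any of the programs named above. The sketch assumes that a missing value is written exactly as -9 and represents it as None.

```python
MISSING = "-9"  # code for a missing covariate or phenotype value


def parse_sample_file(text):
    """Parse sample-file text into (column names, column types, individuals)."""
    if "\t" in text:
        raise ValueError("entries must be separated by spaces, not tabs")
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("need a header line and a variable type line")
    header, types, rows = lines[0], lines[1], lines[2:]
    if header[:3] != ["ID_1", "ID_2", "missing"]:
        raise ValueError("header must start with ID_1 ID_2 missing")
    if len(types) != len(header) or types[:3] != ["0", "0", "0"]:
        raise ValueError("type line must match the header and start with 0 0 0")
    seen_phenotype = False
    for code in types[3:]:
        if code not in ("1", "2", "3", "P"):
            raise ValueError(f"unknown variable type: {code}")
        if code == "P":
            seen_phenotype = True
        elif seen_phenotype:
            raise ValueError("phenotypes must appear after all covariates")
    individuals = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError("each individual needs one entry per column")
        person = {"ID_1": row[0], "ID_2": row[1], "missing": float(row[2])}
        for name, code, value in zip(header[3:], types[3:], row[3:]):
            if value == MISSING:
                person[name] = None  # missing value
            elif code in ("1", "2"):
                person[name] = int(value)  # discrete covariate
            else:
                person[name] = float(value)  # continuous covariate or phenotype
        individuals.append(person)
    return header, types, individuals


# First two individuals from the example sample file above.
sample_text = """ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 phenotype_1
0 0 0 1 2 3 3 P
1 1 0.007 1 2 0.0019 -0.008 1.233
2 2 0.009 1 2 0.0022 -0.001 6.234
"""
header, types, individuals = parse_sample_file(sample_text)
print(individuals[0]["phenotype_1"])  # 1.233
```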