shellfish: Parallel PCA and data processing for genome-wide SNP data

Table of Contents

Introduction

shellfish carries out a variety of tasks related to principal component analysis of genome-wide SNP data. Unlike other available software, PCA computations can be carried out in parallel (both on a computing cluster running the Sun Grid Engine, and also in the simple case of a machine with multiple processors). In addition to the PCA calculations, it automates the process of data subsetting and allele-matching, using plink and gtool for file format interconversion where necessary. The aim is that tasks that would otherwise require a complex series of shell commands and/or work in R, can be carried out with a single, straightforward, command.

Linear algebra computations make use of standard BLAS and LAPACK libraries, and data manipulations are performed using standard shell commands or one of several bundled command-line utilities written in C. shellfish is therefore very efficient and can be used on data sets involving over 10,000 individuals typed at hundreds of thousands of SNPs.

Please note that shellfish is beta software and is not currently under active development. While writing and maintaining software for the scientific community is important, it's not always clear that it's particularly beneficial to one's academic career.

Download

Setting things up

  1. shellfish lives on linux/unix/OS X machines; not Windows. It is written in python, so the machine must have python installed.
  2. Download and compile shellfish.
wget http://www.stats.ox.ac.uk/~davison/software/shellfish/shellfish.tgz
tar -xzvf shellfish.tgz
cd shellfish/src
make all
cd ..
  1. If you're going to work with plink format files (.ped, .bed, etc), you also need to have gtool and plink installed.

Data formats

shellfish can use the following input file formats

  • {.ped, .map} plink files
  • {.bed, .bim, .fam} binary plink files
  • {.gen, .sample} files uses by the chiamo/impute/gtool/etc suite of programs
  • {.gen.gz, .sample} gzipped files uses by the chiamo/impute/gtool/etc suite of programs
  • {.geno, .map} a simple, compact genotype format used by shellfish

Example usage

Make shellfish-format data

Shellfish automatically converts data to the necessary format, but this example illustrates the basic arguments, and shows how to form a dataset for analysis containing a subset of the SNPs in the original data file:

./shellfish --make-geno --file bigdata --file2 smalldata --out smalldata_output

Note the following:

  1. The requested action is supplied with the --make-geno option
  2. The main input is specified as --file bigdata. Whenever you supply an argument like --file basename, that implies that, for example {basename.gen and basename.sample} or {basename.geno, basename.map} exist.
  3. --file2 smalldata is used to specify the subset of SNPs which are required in the output data file. This option implies that a map file smalldata.map exists, but not necessarily any associated genotype data (e.g. smalldata.ped).

The corollary of this is that, when keeping lists of SNPs, keep them in plink .map format. This might be a lists of SNPs that are not in strong LD with each other, or a list of SNPs with PCA loadings. Any extra information like PCA loadings and allele frequencies can be stuck on as extra columns, after the mandatory .map format columns (chrom=1, rs=2, cM=3, bp=4, allele1=5, allele2=6).

Parallel principal component analysis

Sun Grid Engine

The following command computes the first 10 principal components on a cluster running the Sun Grid Engine, using a maximum of 200 processes. Jobs are submitted at priority level 4.

./shellfish.py --pca --numpcs 10 --sge --sge-level 4 --maxprocs 200 --file basename --out outputname

Of course, you might want to add nohup at the beginning, redirect the output and background the process if it's going to be running for a long time.

Normal multiprocessor machine

./shellfish.py --pca --numpcs 10 --maxprocs 6 --file basename --out outputname

Project individuals into pre-computed principal component space

Suppose you want to investigate the locations of a collection of genotyped individuals with respect to the 3 HapMap clusters. Rather than carrying out the principal component analysis from scratch, all that is needed are the "loadings" of your SNPs (or a large subset thereof) on the two principal components that describe structure in the HapMap. These loadings are available for download as follows:

After uncompressing (use gunzip) you'll see that those are tab-separated files that follow the conventions of plink .map files. Here's the first two lines of one of them:

1       rs3094315       0       792429  A       G       0.295238        -6.79218        1.628052
1       rs12562034      0       808311  A       G       0.75    -4.819747       -3.31086

This is a general shellfish feature: files containing information about SNPs always follow the .map format used by plink, for the first 4 columns at least. Columns are tab-separated, and there are at least 4 columns. The columns contain:

  1. Chromosome number
  2. rs ID
  3. cM location (can be anything, but the column must be present)
  4. physical location
  5. allele 1
  6. allele 2
  7. allele frequency used when computing the PCA
  8. PC1 loading
  9. PC2 loading

Briefly, suppose that

  1. you have genotype data for Illumina SNPs in binary plink format, contained in files called genotypes.bed, genotypes.bim and genotypes.fam
  2. you have downloaded the HapMap PCA SNP loadings file for Illumina provided above.
  3. Your genotypes.bim file identifes SNPs by rs ID, in the same way as the snpload-ill.map file does.

The shellfish command for projecting those individuals into the HapMap principal component space is:

./shellfish.py --project --numpcs 2 --file genotypes --file2 snpload-ill

Note that shellfish commands generally follow plink conventions:

  • In particular, you supply the file names genotypes and snpload-ill without the extensions (.bed,.map,.bim, etc).

If you know how to put executables into a directory listed in your $PATH variable then fine; but if you are unsure what that means, then for now you will be best off executing all commands from the shellfish directory that you downloaded and unzipped. Also, you should put plink and gtool in there, or else be in a position where you can just type plink and gtool and the system will know what you mean.