Tutorial: hapmap data and the LCT gene¶
To show how to use the program we provide an example of analysis on a small portion of HapMap phase III data around the LCT gene. A known causal mutation for the lactase persistent phenotype in the CEU population is located at position 136325116. To perform the analysis, you need to download files from the "Documents" tab above.
Files description¶
The data consits in 370 founder individuals from the HapMap Phase III datasets, of the CEU, TSI, CHB and JPT populations. Only a small portion of the HapMap data is provided: a 4Mb region (coodinates 134-138 Mb) on Human chromosome 2. Moreover, to reduce computational time, the data has been thinned so that only 25% (497 SNPs) of SNPs from the HapMap data are included.
These three files are needed to run hapflk:
- hapmap3-lct.ped : the ped file (plink format) with a particularity. First column of the file (FID) contains the population identifier of individuals (rather than a family ID).
- hapmap3-lct.map : the map file
- kinship.txt: the kinship matrix estimated from the Reynolds distances
For information we provide the following files:
- Reynolds.txt : the Reynolds distances estimated from a sample of ~ 70K SNPs on the whole genome
- reynolds2kinship.R : the R script to estimate kinship matrix from the reynolds distances. Yoruba population (YRI) is used as the outgroup.
- hapmap_tree.txt : the population tree estimated genome wide, in Newick format.
Running hapflk on the data¶
To run a simple analysis on the data, use the following command line:
hapflk --file hapmap3-lct --kinship kinship.txt -K 15 --nfit=1 --ncpu=2
It will take approximately 1 minute to run.
Explanation:
- --file is similar to plink options: it specifies input files are hapmap3-lct.ped and hapmap3-lct.map
- --kinship specifies the file containing the kinship matrix (if not specified, it will be computed taking all data into account). Because we focus on a specific region here, we need to specify it to use a whole genome estimate.
- -K 15 specifies the number of clusters to use for the Scheet and Stephens (2006) model. Note that not specifying a -K value will turn off the hapFLK calculations (only single marker statistic FLK will be computed).
- --nfit=1 specifies the number of EM runs to be performed. For the tutorial we only perform one fit of the EM algorithm. In the general case, it is a good idea to increase this number (to 10, 15 or so, depending on the dataset)
- --ncpu=2 specifies to use 2 CPUs for fitting the LD model. If you have more CPU available, you can safely increase this number :) If you have less (ie 1), do not use this option.
Output files¶
This command produces the following output files:
- hapflk.flk contains the single SNP FLK, corresponding theoretical p-value and estimated frequency in the ancestral population (pzero)
- hapflk.hapflk contains the haplotype based statistic hapFLK.
- hapflk.frq contains the estimated allele frequency of ref_all allele for each SNP in each population.
- hapflk.kfrq.fit_0.bz2 gives the cluster posterior probabilities for each population, locus and cluster. If multiple EM runs are used, one file is produced for each EM run. This file is used for representing local cluster frequencies in R.