Building local population trees to identify selected populations.

In Fariello et al. (2013), we built local population trees to help identifying selected populations.
We provide a pair of utilities that allow to construct these trees, they can be downloaded from the Documents section.

The approach is a two step procedure:

  1. Compute local Reynolds distances
  2. Build local population trees

1. Computing local Reynolds distances

We need to compute local reynolds distances from the allele, or haplotype cluster frequencies. This is done using the local_reynolds.py script. This should be run as python local_reynolds.py. You need version 2.7 of python, and the numpy package installed. The options are as follows:

usage: local_reynolds.py [-h] [-p PREFIX] [-l LEFT] [-r RIGHT] [-o OPREFIX]

optional arguments:
  -h, --help  show this help message and exit
  -p PREFIX   Work on files with prefix PREFIX
  -l LEFT     Leftmost coordinate to consider
  -r RIGHT    rightmost coordinate to consider
  -o OPREFIX  prefix for output files

Suppose we have performed the analysis in the Tutorial, using the following command:

hapflk --file hapmap3-lct --kinship kinship.txt -K15 --nfit=1 --ncpu=2 -p hapmap_tutorial

Among the FLK and hapFLK results, we get a file named hapmap_tutorial.frq, which contains the allele frequencies in the LCT region and hapmap_tutorial.kfrq.fit_0.bz2 which contains the haplotype cluster frequencies in the same regions. These are the two files that local_reynolds.py will use to estimate local Reynolds distances. Now run:

python local_reynolds.py -p hapmap_tutorial 

It will create two new files: hapflk_snp_reynolds.txt and hapflk_hap_reynolds.txt that contain the local Reynolds distance matrices for SNP and haplotype clusters respectively.

You may have noticed that the region included in the tutorial is slightly larger than the LCT signal, which actually spans roughly between 135.3 Mb and 136.7 Mb. To focus on this region, type:

python local_reynolds.py -p hapmap_tutorial -o hapmap_tutorial -l 135.3e6 -r 136.7e6

Note: If you have performed more than one fit of the LD model (which you should), the script will recognize all the kfrq files and average its computations over model fits.

2. Producing local population trees

This is done using the R script called local_trees.R. To use it you must edit it with a text editor and fill in the USER INPUT section, reproduced below:

######################## USER INPUT #########################
### Pre filled with files from the tutorial
## tree file : the whole genome tree as obtained
## from the hapflk analysis
tree_file='hapmap_tree.txt'
## wg_dist_file : the Global Reynolds distances between population,
## as obtained from the hapflk analysis
wg_dist_file='Reynolds.txt'
## Name of the outgroup in the previous file (if any)
outgroup='YRI'
## local reynolds distances for SNPs and Haplotype clusters
## To be computed using the local_reynolds.py script
loc_dist_file_snp='hapflk_snp_reynolds.txt'
loc_dist_file_hap='hapflk_hap_reynolds.txt'
## prefix for outputfiles
prefix='hapmap_tutorial'
##################### END USER INPUT #########################

Here, you need to indicate where the relevent files are located. You can use full path information in case the files are located in different folders. You can also use paths relative to where you are running the script from.

Then to produce local trees, just type:

R CMD BATCH local_trees.R

This will provide three plots in png format, in our case:

  • hapmap_tutorial_WholeGenomeTree.png
  • hapmap_tutorial_LocalSnpTree.png
  • hapmap_tutorial_LocalHapTree.png

The names are self explanatory. In this case, you should see that the branch leading to the CEU population is much longer than on the whole genome tree, consistent with selection along this branch. The local tree branches are colored according to the p-value for increased branch length along the tree. You can get the actual numbers (coefficients and associated p-values) from the hapmap_tutorial_snp_lmtab.csv and hapmap_tutorial_hap_lmtab.csv .

Note : In earlier versions of the hapflk software, the tree file (PREFIX_tree.txt) was not written. You can get this file back by adding the following line at the end of the PREFIX_kinship.R:

write.tree(F$tree,file='PREFIX_tree.txt') 

And run the script again:

R CMD BATCH PREFIX_tree.txt

where PREFIX is the prefix for output files you used (hapflk by default).