Description of the files available for download
Directory comp_stat/ contains Python scripts related to the computation of summary statistics from simulated or observed genomic samples:
- simul_data.py simulates genomic samples with ms and computes AFS, zygotic LD and IBS statistics from these samples. Results are stored in sub-directory res/. Simulation settings (sample size, genome size, prior distributions ...) are as described in the manuscript.
- simul_data_ex1.py is a modified version of this script, where simulation settings have been adapted to the quick start example.
- simul_stat.py computes the same summary statistics, but uses genomic samples that have already been simulated and stored by a previous call of simul_data.py. This makes it possible to compute summary statistics using a smaller sample size or different minor allele frequency (MAF) thresholds.
- stat_from_vcf.py computes the same summary statistics, for observed genomic samples in vcf format.
- stat_from_vcf_ex1.py is a modified version of this script used for the quick start example.
- stat_from_beagle.py computes the same summary statistics, for observed genomic samples in beagle output format.
- other files and sub-directories include internal routines that are not directly called by the user. They are only of interest for advanced users who would like to modify the program.
Directory comp_stat.1/ is equivalent to directory comp_stat/, except that the internal functions used to simulate genomic samples are based on msprime instead of ms.
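As an illustration of what an AFS statistic with a minor allele count filter looks like, here is a minimal sketch in plain Python. It is not the code used in comp_stat/: the function name `folded_afs`, the `mac_min` argument and the 0/1 haplotype matrix layout are assumptions made for this example only, and the scripts above may normalize or bin the spectrum differently.

```python
from collections import Counter


def folded_afs(haplotypes, mac_min=1):
    """Folded allele frequency spectrum from a 0/1 haplotype matrix.

    haplotypes: list of rows, one row per haploid genome, each row a list
                of 0/1 alleles with one entry per SNP (assumed layout).
    mac_min:    hypothetical threshold; SNPs whose minor allele count is
                below this value are discarded.
    Returns a dict mapping minor allele count -> proportion of kept SNPs.
    """
    n = len(haplotypes)
    counts = Counter()
    for site in zip(*haplotypes):          # iterate over SNPs (columns)
        ones = sum(site)
        mac = min(ones, n - ones)          # fold: count the rarer allele
        if mac >= max(mac_min, 1):         # monomorphic sites are never kept
            counts[mac] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in sorted(counts)}


# Toy sample: 4 haploid genomes, 4 sites (the last one is monomorphic).
haps = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
afs_all = folded_afs(haps)                 # no filtering beyond polymorphism
afs_filtered = folded_afs(haps, mac_min=2) # keep SNPs with MAC >= 2 only
```

In this toy sample two sites have a minor allele count of 1 and one site has a minor allele count of 2, so raising `mac_min` to 2 leaves a single SNP class.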
Directory cattle_data/ contains two example data sets, taken from the cattle data analyzed in the manuscript. These data can be used to test the functions stat_from_vcf.py and stat_from_beagle.py described above:
- File Chr1.vcf.gz provides raw genotypes of the 1000 bull genomes project phase II, in vcf format, for a small region of chromosome 1.
- Files Chr1.phase.gz and Chr1.map.gz provide haplotypes inferred by Beagle for the 1000 bull genomes project phase II, for a small region of chromosome 1.
- File indiv.ped provides the pedigree of all animals, and in particular their breed of origin.
- Files list_indiv_Angus.txt, list_indiv_Fleckvieh.txt, list_indiv_Holstein.txt and list_indiv_Jersey.txt indicate the subset of animals selected in each breed for computing summary statistics.
URLs to download all original BAM files produced by the 1000 bull genomes project phase II, and scripts to process them into vcf and beagle files, can be found in Daetwyler et al. (2014).
Directories simu_stat/, simu_stat2/ and cattle_stat/ contain the main data sets that have been used in the manuscript:
- simu_n50_s100.params and simu_n50_s100_mac1_macld1.stat include the parameter values and summary statistics for approximately 250,000 simulated histories, as obtained from simul_data.py when no MAF threshold is used. simu_n50_s100_mac10_macld10.stat includes the same summary statistics, for the same histories, when using a MAF threshold of 20%. These simulated datasets are a subset of the ones we used in the manuscript for ABC estimation (which generally included 200,000 additional replicates).
- simu_n30_s100_mac6_macld6.stat includes the same summary statistics, for the same histories, using a MAF threshold of 20% and a sample size of 15 rather than 25 diploids. This was the data set used for estimating population size history in the Jersey breed.
- Angus_n50_mac10_macld10.stat, Holstein_n50_mac10_macld10.stat, Fleckvieh_n50_mac10_macld10.stat and Jersey_n30_mac6_macld6.stat provide the summary statistics computed in the 4 cattle breeds with a MAF threshold of 20%, as obtained from stat_from_beagle.py. For the Holstein breed, we also provide the statistics when no MAF threshold was used, in Holstein_n50_mac1_macld1.stat.
Directory estim/ contains examples of R scripts that can be used for ABC estimation:
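The file names above appear to encode the sample size and the thresholds: n50 goes with 25 diploids and n30 with 15, while mac10 out of 50 and mac6 out of 30 both give the 20% MAF threshold quoted in the text, and mac1 corresponds to no MAF threshold. The short sketch below makes that arithmetic explicit. The helper `parse_stat_name` is hypothetical and not part of the package; reading `n` as a number of haploid genomes, `mac` as a minimum minor allele count and `macld` as the analogous threshold for the LD statistics is an assumption inferred from the descriptions, and the `s100` field is left uninterpreted.

```python
import re


def parse_stat_name(filename):
    """Extract sample size and thresholds from a .stat file name.

    Hypothetical helper: the naming convention is inferred from the file
    descriptions, not taken from the package itself.
    """
    m = re.search(r"_n(\d+)(?:_s\d+)?_mac(\d+)_macld(\d+)\.stat$", filename)
    if m is None:
        raise ValueError(f"unrecognized file name: {filename}")
    n, mac, macld = (int(g) for g in m.groups())
    return {
        "haploid_n": n,                 # assumed: number of haploid genomes
        "diploids": n // 2,             # e.g. n50 -> 25 diploid animals
        "mac": mac,
        "macld": macld,                 # assumed: MAC threshold for LD statistics
        # mac1 keeps every polymorphic SNP, i.e. no MAF threshold
        "maf_threshold": mac / n if mac > 1 else 0.0,
    }


jersey = parse_stat_name("Jersey_n30_mac6_macld6.stat")
simu = parse_stat_name("simu_n50_s100_mac10_macld10.stat")
holstein_all = parse_stat_name("Holstein_n50_mac1_macld1.stat")
```

Under these assumptions, both `jersey["maf_threshold"]` and `simu["maf_threshold"]` evaluate to 0.2, which matches the 20% threshold stated for these files.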
- abc_cv.R computes prediction errors estimated from 100 random histories, and plots these errors as we did in the manuscript. ABC estimations are based on the simulated data sets described above. As this is only an illustration, some parameters were modified compared to the analysis shown in the manuscript, in order to reduce computation time. In particular, only 100 (rather than 2,000) histories were used, and ridge regression rather than neural network regression was used for ABC estimation.
- abc_cv_ex1.py is a modified version of this script, where parameter values have been adapted to the quick start example.
- abc_holstein.R estimates the population size history in Holstein, based on the simulated and observed statistics described above.
- abc_ex1.py is a modified version of this script, where parameter values have been adapted to the quick start example.
- other files include internal functions that are not directly called by the user. They are only of interest for advanced users who would like to modify the program.