DistMiss

Authors

  • M. San Cristobal, INRA
  • C. Chevalet, INRA
  • G. Laval, UP Human Evolutionary Genetics, Institut Pasteur, Paris, France

Description

The program DistMiss computes a matrix of genetic distances between populations on the basis of their allele frequencies at L loci.

This kind of calculation is available in several population genetics packages, in particular PHYLIP.
Our program allows some loci to be completely missing in some populations.
At most 80 populations, 150 loci and 50 alleles per locus are allowed (these limits can be changed in the source code).

Input of DistMiss

  • file containing allele frequencies in PHYLIP format (freqall)
  • file containing the number of genes typed in each population x locus combination, together with the minimal sample size for a combination to be accepted in the subsequent calculations (sample size 10)
  • the number of bootstraps (0 if original data)
  • an indicator of the required distance method (see Details below)
  • file containing a (sub)list of populations for the current analysis (listpop)
  • file containing indicators for markers chosen for the current analysis (listmark)

Allele frequency file

The first line contains the number of populations and the number of loci.
The second line contains the number of alleles at each locus.
Then, for each population, one line gives the name of the population followed by its allele frequencies.
Each population name is a character string of length 10.
The sum of allele frequencies is checked for each population x locus combination.
Rounding errors are allowed (e.g. sum = 1 +/- 0.001). If the sum is equal to 0, then
the locus is considered missing in the population, as in the following example for
the last locus (4 alleles) in the first population.

3 4
2 2 3 4
Pop1        0.3000 0.7000 0.5000 0.5000 0.2000 0.2000 0.6000 0.0000 0.0000 0.0000 0.0000
Pop2        0.4000 0.6000 0.1000 0.9000 0.3000 0.4000 0.3000 0.2500 0.2500 0.2500 0.2500
Pop3        0.1000 0.9000 0.2000 0.8000 0.3333 0.3333 0.3333 0.2000 0.3000 0.2000 0.3000
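The checks described above can be sketched in Python. This is only an illustration and not the DistMiss source code: the function name parse_frequencies, the tolerance argument tol and the use of None to mark a missing locus are assumptions made for this example. The first line is read as 3 populations and 4 loci, which is what the example data contain.

```python
EXAMPLE = """\
3 4
2 2 3 4
Pop1        0.3000 0.7000 0.5000 0.5000 0.2000 0.2000 0.6000 0.0000 0.0000 0.0000 0.0000
Pop2        0.4000 0.6000 0.1000 0.9000 0.3000 0.4000 0.3000 0.2500 0.2500 0.2500 0.2500
Pop3        0.1000 0.9000 0.2000 0.8000 0.3333 0.3333 0.3333 0.2000 0.3000 0.2000 0.3000
"""


def parse_frequencies(text, tol=1e-3):
    """Return {population: [frequencies per locus, or None if the locus is missing]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_pops, n_loci = (int(x) for x in lines[0].split())
    n_alleles = [int(x) for x in lines[1].split()]
    if len(n_alleles) != n_loci:
        raise ValueError("allele-count line does not match the number of loci")

    result = {}
    for ln in lines[2:2 + n_pops]:
        name = ln[:10].strip()          # names occupy the first 10 characters
        values = [float(x) for x in ln[10:].split()]
        if len(values) != sum(n_alleles):
            raise ValueError(f"{name}: unexpected number of frequencies")

        loci, start = [], 0
        for k in n_alleles:
            block = values[start:start + k]
            start += k
            total = sum(block)
            if total == 0:
                loci.append(None)       # sum of 0: locus missing in this population
            elif abs(total - 1.0) <= tol:
                loci.append(block)      # sum of 1 up to rounding error
            else:
                raise ValueError(f"{name}: frequencies sum to {total}")
        result[name] = loci
    return result


freqs = parse_frequencies(EXAMPLE)
print([locus is None for locus in freqs["Pop1"]])  # [False, False, False, True]
```
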

Sample size file

The first two lines are identical to those of the allele frequency file.
Then, for each population, a first line contains the name of the population and the
number of individuals, and a second line gives, for each locus, the number of haplotypes
that were used to calculate the allele frequencies.
In the example below, the first locus in the first population has 80 haplotypes typed out of 2*50=100 possible.
The last locus has no genotypes in the first population.

3 4
2 2 3 4
Pop1 50
80 100 90 0
Pop2 30
60 60 60 60
Pop3 60
120 110 100 120
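As an illustration, the sample size file can be read and compared with the minimal sample size as follows. The function names parse_sample_sizes and accepted are hypothetical, and treating a combination as accepted when its count is greater than or equal to the minimum is an assumption; the source does not state whether the comparison is strict.

```python
EXAMPLE_SIZES = """\
3 4
2 2 3 4
Pop1 50
80 100 90 0
Pop2 30
60 60 60 60
Pop3 60
120 110 100 120
"""


def parse_sample_sizes(text):
    """Return {population: (number of individuals, [genes typed per locus])}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_pops, n_loci = (int(x) for x in lines[0].split())
    body = lines[2:2 + 2 * n_pops]      # two lines per population
    sizes = {}
    for header, counts in zip(body[0::2], body[1::2]):
        name, n_individuals = header.split()
        genes = [int(x) for x in counts.split()]
        if len(genes) != n_loci:
            raise ValueError(f"{name}: expected {n_loci} counts")
        sizes[name] = (int(n_individuals), genes)
    return sizes


def accepted(sizes, min_size=10):
    """Flag each population x locus combination that reaches min_size (assumed >=)."""
    return {name: [g >= min_size for g in genes]
            for name, (_, genes) in sizes.items()}


sizes = parse_sample_sizes(EXAMPLE_SIZES)
print(accepted(sizes)["Pop1"])  # [True, True, True, False]
```
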

Indicator of distance method

1 = NEIMB = Nei minimum
2 = NEIMC = Nei minimum corrected for sample size
3 = NEISB = Nei standard
4 = NEISC = Nei standard corrected for sample size
5 = REYNB = Reynolds et al (1983)
6 = REYNC = Reynolds et al (1983) corrected for sample size (Laval et al 2000)
7 = MORTB = Morton
8 = MORTC = Morton corrected for sample size
9 = NEI87 = Nei (1987)
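For orientation, the uncorrected versions of three of these distances can be sketched in Python using their usual textbook forms. This is not the DistMiss implementation: the function names are hypothetical, the sample size corrections (codes 2, 4, 6, 8) are not reproduced, and skipping every locus that is missing (None) in either population is an assumption about how missing loci might be handled.

```python
import math


def _shared(p, q):
    """Keep only loci present in both populations (assumed handling of missing loci)."""
    return [(a, b) for a, b in zip(p, q) if a is not None and b is not None]


def _identities(p, q):
    """Mean gene identities Jx, Jy and Jxy over the shared loci."""
    loci = _shared(p, q)
    n = len(loci)
    jx = sum(sum(x * x for x in a) for a, _ in loci) / n
    jy = sum(sum(y * y for y in b) for _, b in loci) / n
    jxy = sum(sum(x * y for x, y in zip(a, b)) for a, b in loci) / n
    return jx, jy, jxy


def nei_minimum(p, q):
    """Nei's minimum distance: (Jx + Jy) / 2 - Jxy."""
    jx, jy, jxy = _identities(p, q)
    return (jx + jy) / 2 - jxy


def nei_standard(p, q):
    """Nei's standard distance: -ln(Jxy / sqrt(Jx * Jy))."""
    jx, jy, jxy = _identities(p, q)
    return -math.log(jxy / math.sqrt(jx * jy))


def reynolds(p, q):
    """Reynolds et al. distance: sum of squared differences over 2 * sum(1 - sum(x * y))."""
    loci = _shared(p, q)
    num = sum(sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in loci)
    den = 2 * sum(1 - sum(x * y for x, y in zip(a, b)) for a, b in loci)
    return num / den


# Two populations, two loci; the second locus is missing in the first population.
pop_a = [[0.5, 0.5], None]
pop_b = [[1.0, 0.0], [0.2, 0.8]]
print(nei_minimum(pop_a, pop_b))  # 0.25
print(reynolds(pop_a, pop_b))     # 0.5
```

Only the first locus contributes in this example, because the second one is missing in pop_a.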

List of populations

The number of each chosen population (its rank in the allele frequency file)
and its name. In the example, population 2 is excluded from the current analysis.

1 Pop1 
3 Pop3
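A minimal sketch of how such a list could be read and checked against the order of the allele frequency file. The function name parse_population_list and the consistency check are illustrative assumptions, not part of DistMiss.

```python
def parse_population_list(text):
    """Return [(rank in the allele frequency file, name), ...]; ranks start at 1."""
    chosen = []
    for ln in text.splitlines():
        if ln.strip():
            rank, name = ln.split()
            chosen.append((int(rank), name))
    return chosen


# Order of the populations in the allele frequency file of the example.
all_names = ["Pop1", "Pop2", "Pop3"]

chosen = parse_population_list("1 Pop1 \n3 Pop3\n")
for rank, name in chosen:
    # Each rank must point at the population of the same name.
    if all_names[rank - 1] != name:
        raise ValueError(f"rank {rank} does not match population {name}")

print(chosen)  # [(1, 'Pop1'), (3, 'Pop3')]
```
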

Marker file

The first line (optional) contains the names of the markers; the second line contains 1 if
the marker is used in the current analysis and 0 otherwise.
In the example, marker Mark3 is not taken into account in the calculations, even though
its data are still present in the allele frequency file.

Mark1 Mark2 Mark3 Mark4
1     1     0     1
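The marker file can be read along these lines. The function name parse_marker_file is hypothetical, and deciding that the names line is present whenever the file has two non-empty lines is an assumption made for this sketch.

```python
def parse_marker_file(text):
    """Return (names or None, [True if the marker is used, ...])."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if len(lines) == 2:
        names, flags = lines            # optional names line is present
    elif len(lines) == 1:
        names, flags = None, lines[0]   # indicators only
    else:
        raise ValueError("expected one or two non-empty lines")
    if any(f not in ("0", "1") for f in flags):
        raise ValueError("indicators must be 0 or 1")
    if names is not None and len(names) != len(flags):
        raise ValueError("names and indicators differ in length")
    return names, [f == "1" for f in flags]


names, used = parse_marker_file("Mark1 Mark2 Mark3 Mark4\n1     1     0   1\n")
print([n for n, u in zip(names, used) if u])  # ['Mark1', 'Mark2', 'Mark4']
```
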

Output

  • summary file (text format), named code_of_distance.res
  • file with the distance matrix (PHYLIP format), named code_of_distance.ngh
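For readers who want to post-process the distance matrix, the general PHYLIP square-matrix layout (number of populations on the first line, then one row per population with the name padded to 10 characters) can be written as sketched here. The function name format_phylip_matrix and the number format are assumptions; the exact layout written by DistMiss may differ.

```python
def format_phylip_matrix(names, matrix):
    """Format a square distance matrix in the general PHYLIP layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        cells = "".join(f"{d:10.6f}" for d in row)
        lines.append(f"{name[:10]:<10}{cells}")   # name padded to 10 characters
    return "\n".join(lines) + "\n"


text = format_phylip_matrix(["Pop1", "Pop3"], [[0.0, 0.25], [0.25, 0.0]])
print(text, end="")
```
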

Download source code

You can download the source code of the software: DistMiss.zip