Select SNP subset for Imputation:

Select SNP subset for Imputation (SS4I) is a tool dedicated to the selection of a sub-panel of SNPs for the design of a low-density genotyping chip. This program, written in Python, allows the selection of a SNP subset according to a chosen linkage disequilibrium (r2) threshold at the chromosome scale.

DOI: https://doi.org/10.15454/4KVRWY

Input files:

Three files are required to run SS4I.
  • A « .ped » (PLINK format) genotype file.

The PED file is a white-space (space or tab) delimited file. The first six columns are mandatory and are followed by the genotypes:

Family ID (alphanumeric)
Individual ID (alphanumeric)
Paternal ID (alphanumeric)
Maternal ID (alphanumeric)
Sex (1=male; 2=female; other=unknown)
Phenotype (set to missing=-9)
Genotype

The combination of family and individual ID should uniquely identify a person.
Genotypes (column 7 onwards) should also be white-space delimited; they can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0, which is, by default, the missing genotype character. All markers should be biallelic. All SNPs (whether haploid or not) must have two alleles specified. Either both alleles should be missing (i.e. 0) or neither. No header row should be given.

Example:

Uto_124 Uto_124 0 0 0 -9 G G T T T T
Uto_13 Uto_13  0 0 0 -9 G G T T C T
Uto_243 Uto_243 0 0 0 -9 A G T T T T
Uto_244 Uto_244 0 0 0 -9 G G T T C T
Uto_273 Uto_273 0 0 0 -9 G G T T T T
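
The PED rules above can be checked line by line. The following is a minimal sketch, not part of SS4I: the function name parse_ped_line and the returned dictionary keys are hypothetical and chosen for illustration only.

```python
def parse_ped_line(line):
    """Split one PED line into its six leading fields and a list of allele pairs.

    Illustrative helper only (hypothetical name, not an SS4I function).
    """
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 6:
        raise ValueError("a PED line needs at least six columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("every SNP must have two alleles specified")
    # Pair up consecutive alleles: columns 7-8 are SNP 1, columns 9-10 are SNP 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    for a1, a2 in genotypes:
        # Either both alleles are missing ("0") or neither is.
        if (a1 == "0") != (a2 == "0"):
            raise ValueError("half-missing genotype: %s %s" % (a1, a2))
    return {
        "family_id": fields[0],
        "individual_id": fields[1],
        "paternal_id": fields[2],
        "maternal_id": fields[3],
        "sex": fields[4],
        "phenotype": fields[5],
        "genotypes": genotypes,
    }
```

Applied to the second example line, the helper returns three allele pairs, one per SNP.
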
  • A « .map » (PLINK format) SNP location information file.

By default, each line of the MAP file describes a single marker and must contain exactly 4 columns:

Chromosome
SNP identifier
Genetic distance (morgans)
Base-pair position (bp units)

Example:

25    AX-77283174    0    5208
24    AX-80751587    0    6369
28    AX-76388168    0    6695
28    AX-76388358    0    7122
25    AX-80909995    0    7629

SNP order must be the same in the .ped and .map files.
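
A quick consistency check between the two files can catch obvious mismatches before a run. The sketch below is an illustration only (parse_map_line and ped_matches_map are hypothetical names, not SS4I functions). It verifies that a PED line carries two alleles per MAP marker; it cannot verify that the SNP order itself is identical, since the PED file holds no SNP identifiers.

```python
def parse_map_line(line):
    """Parse one 4-column MAP line (hypothetical helper, for illustration)."""
    # Unpacking raises ValueError if the line does not have exactly 4 columns.
    chromosome, snp_id, genetic_distance, bp_position = line.split()
    return chromosome, snp_id, float(genetic_distance), int(bp_position)


def ped_matches_map(ped_line, map_lines):
    """Return True if the PED line holds exactly two alleles per MAP marker."""
    n_snps = sum(1 for line in map_lines if line.strip())
    n_alleles = len(ped_line.split()) - 6  # the first six columns are not genotypes
    return n_alleles == 2 * n_snps
```
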

  • A parameters file:

The six pieces of information required by the SS4I program are:

INPUT FILES
    - in_geno = path to the input genotype file
    - in_map = path to the input map file
ANALYSIS PARAMETERS
    - threshold = LD threshold for clustering SNPs
    - chromosome = Chromosome to analyse
    - SNP_window = Maximum number of consecutive SNPs to be considered in the analysis

OUTPUT
    - path_out = Location for output files 
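
The six parameters can be sanity-checked before a run. The sketch below is an assumption-laden illustration, not SS4I code: the exact syntax of the parameters file is not described here, so the hypothetical validate_params function starts from values already read into a dictionary. The only range check it applies is that an r2 threshold lies between 0 and 1.

```python
REQUIRED_KEYS = ("in_geno", "in_map", "threshold", "chromosome", "SNP_window", "path_out")


def validate_params(params):
    """Check the six SS4I parameters held in a dict (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError("missing parameters: " + ", ".join(missing))
    threshold = float(params["threshold"])
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is an r2 value and must lie between 0 and 1")
    snp_window = int(params["SNP_window"])
    if snp_window < 1:
        raise ValueError("SNP_window must be a positive number of SNPs")
    # Return a copy with the numeric fields converted; other fields stay as given.
    checked = dict(params)
    checked["threshold"] = threshold
    checked["SNP_window"] = snp_window
    return checked
```
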

Output files:

Two files are generated by SS4I.

  • A log file:
    This text file contains all parameters used and describes all steps followed by the program.
  • A « Selected_SNP » file.
    This file contains all selected SNPs. Each line corresponds to a single selected SNP and reports the following information:
    - SNP name, chromosome and position of the selected SNP
    - MAF: minor allele frequency of the selected SNP
    - Number of clustered SNPs: number of SNPs included in the cluster
    - Minimum, average and standard deviation of the LD of all SNP pairs included in the cluster
    - Minimum, maximum, average and standard deviation of the LD of the selected SNP with all other SNPs included in the cluster
    - Minimum, maximum, average and standard deviation of the distance between all SNP pairs included in the cluster
    - Minimum, maximum, average and standard deviation of the distance between the selected SNP and all other SNPs included in the cluster

SS4I operation:

SNP selection is carried out chromosome by chromosome.
If the analyzed chromosome carries more SNPs than the number specified in « SNP_window » in the parameters file, SNP selection is carried out in several steps.
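
How the chromosome is divided into steps is not detailed here. As a rough illustration only, the sketch below assumes consecutive, non-overlapping windows of at most SNP_window SNPs; split_into_windows is a hypothetical name and the actual SS4I behaviour may differ (for example, windows could overlap).

```python
def split_into_windows(snp_ids, snp_window):
    """Cut an ordered list of SNPs into consecutive chunks of at most snp_window SNPs.

    Assumption: non-overlapping windows. This is an illustration, not SS4I code.
    """
    if snp_window < 1:
        raise ValueError("snp_window must be at least 1")
    return [snp_ids[start:start + snp_window]
            for start in range(0, len(snp_ids), snp_window)]
```

With five SNPs and a window of two, this yields three steps, the last one holding a single SNP.
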

Allele frequency and linkage disequilibrium (r2) are computed using the PLINK software.
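
SS4I delegates these computations to PLINK. The sketch below only illustrates the two quantities on small inputs and is not how SS4I obtains them. It assumes r2 is taken as the squared Pearson correlation between allele counts, which is one common definition, and all function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP from a list of (allele1, allele2) pairs; "0" marks missing."""
    counts = {}
    for a1, a2 in genotypes:
        if a1 == "0" or a2 == "0":
            continue  # skip missing genotypes
        for allele in (a1, a2):
            counts[allele] = counts.get(allele, 0) + 1
    if len(counts) > 2:
        raise ValueError("marker is not biallelic")
    if len(counts) < 2:
        return 0.0  # monomorphic or entirely missing
    return min(counts.values()) / sum(counts.values())


def allele_counts(genotypes, ref):
    """Number of copies (0, 1 or 2) of allele ref per individual; None if missing."""
    return [None if "0" in (a1, a2) else (a1 == ref) + (a2 == ref)
            for a1, a2 in genotypes]


def r_squared(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)
```
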

All SNPs with no LD value above the threshold are removed from the analysis and listed in the log file.
A hierarchical clustering is performed based on linkage disequilibrium (SciPy library, average method) and a dendrogram is created.
SNP clusters are selected by scanning the dendrogram to get SNP clusters that contain SNP pairs with linkage disequilibrium above the threshold set in the parameters file.
For each cluster, the selected SNP is the one with the highest MAF among the SNPs whose average distance from the other SNPs lies within the average distance between all SNPs in the cluster ± 2 sd.
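
The steps above can be sketched on a small r2 matrix. This is a dependency-free illustration under several assumptions, not the SS4I implementation. A plain average-linkage loop stands in for the SciPy call, the clustering distance is taken as 1 - r2 with a cut at 1 - threshold, "above the threshold" is read as "greater than or equal to", the distance in the ± 2 sd rule is taken as the base-pair distance, the population standard deviation is used, and the fallback when no SNP passes the ± 2 sd rule is invented. All function names are hypothetical.

```python
from statistics import mean, pstdev


def split_by_ld(r2, threshold):
    """Separate SNPs with at least one partner at r2 >= threshold from the rest."""
    kept, removed = [], []
    n = len(r2)
    for i in range(n):
        has_partner = any(r2[i][j] >= threshold for j in range(n) if j != i)
        (kept if has_partner else removed).append(i)
    return kept, removed


def average_linkage_clusters(r2, snps, threshold):
    """Merge the two closest clusters while their mean distance (1 - r2) is <= 1 - threshold."""
    clusters = [[i] for i in snps]
    cutoff = 1.0 - threshold
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(1.0 - r2[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break  # the closest pair of clusters is already too far apart
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def select_snp(cluster, positions, maf):
    """Pick the highest-MAF SNP among those passing the +/- 2 sd distance rule."""
    if len(cluster) == 1:
        return cluster[0]
    pair_d = [abs(positions[i] - positions[j])
              for k, i in enumerate(cluster) for j in cluster[k + 1:]]
    mu, sd = mean(pair_d), pstdev(pair_d)
    candidates = []
    for i in cluster:
        avg_i = mean(abs(positions[i] - positions[j]) for j in cluster if j != i)
        if mu - 2 * sd <= avg_i <= mu + 2 * sd:
            candidates.append(i)
    if not candidates:
        candidates = cluster  # assumption: fall back to the whole cluster
    return max(candidates, key=lambda i: maf[i])
```

For example, with four SNPs where the first three are in strong pairwise LD (r2 = 0.9) and the fourth is not (r2 = 0.1), a threshold of 0.8 removes the fourth SNP, groups the first three into one cluster, and the SNP with the highest MAF in that cluster is selected.
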