The Glint Wiki

The glint source: glint-1.0.rc12.826_833.tgz

System requirements

Glint should work on any standard 64-bit Linux environment with gcc (4.3.x preferred).
A reasonable amount of physical memory (12 GB to start with) is recommended when working with large genomes.

Compiling instructions

Edit the makefile in the top directory and set CXX, CXXFLAGS and DEFINES to suit your needs.
However, the makefile should work without any modification.

make all
make tests        # all tests should succeed
make install

The binary is now in the bin/ directory; you can move it wherever you want (/usr/local/bin may be a good choice).

Usage

Glint is provided as a single executable with a set of commands.

The first command, glint index, indexes the database (a set of contigs, a chromosome or a set of chromosomes stored in a single multi-fasta file).
The second command, glint align, compares a sequence or a set of sequences to the previously indexed database.

The examples subdirectory contains an example that can easily be tested:

>cd examples
>../bin/glint index bank.fasta
>../bin/glint align bank.fasta bac_ends.fasta

glint commands

The available commands can be listed by invoking --help (glint --help):

Usage: glint <command> [options]

Command:    align          align a genomic sequence against a whole genome
            index          index a bank of sequences (a genome) for later comparison
            self           align a genomic sequence against itself
            map            map reads against a genome
            mappe          map paired-end reads against a genome
            chain          chain glint (genomic) or blast (proteic) hsps
            convert        convert an alignment file between some common formats
            slice          cut a BAM file into several slices

For each command, the set of parameters (general or command-specific) can be obtained
by typing glint <command> --help

How the program works

Glint is an alignment program using a BLAST-like seed-and-extend mechanism, followed by a dropoff local alignment heuristic, to align a query sequence to a subject sequence of a bank. Glint is particularly well suited to comparing sequences against whole chromosomes.

Instead of preprocessing the query sequence into a lookup table, as BLAST does, the glint algorithm is based on a preprocessing of the database. In this preprocessing step, the glint index command builds an index made of k-mers in a particular fashion. So that alignments can be performed with different sensitivity/specificity trade-offs without rebuilding the database index, the index is made of interleaved non-consecutive k-mers, or seeds. Non-consecutive means that the k letters forming the seed are not necessarily consecutive letters of the text. This idea of non-consecutive seeds follows the original work of Ma et al. (2002) in PatternHunter.
The principle of interleaved non-consecutive seeds explains the particular format of what is called the mask structure. The default mask structure of glint index is 130110211012110311. This indicates that the lookup table is made of all "non-consecutive" k-mers (k=14) of the database spanning 18 letters, where only the positions of the mask with a non-zero value are relevant. If we define the weight of a seed as the number of non-zero positions in the mask, the default mask makes it possible to use 3 different seeds of weights 10, 12 and 14:

1) 100110011010110011 : seed weight 10 (only mask positions with a non-zero value <= 1)
2) 200220222022220022 : seed weight 12 (only mask positions with a non-zero value <= 2)
3) 330330333033330333 : seed weight 14 (only mask positions with a non-zero value <= 3)

The glint program can subsequently be invoked using type 1 seeds (-w 1, sensitive and slow),
type 3 seeds (-w 3, faster at the expense of sensitivity) or type 2 seeds (-w 2, somewhere in between).
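
The relation between the default mask and its three seeds can be checked with a few lines of Python. This is only an illustrative sketch, not code from glint; the function names seed_from_mask and seed_weight are hypothetical.

```python
DEFAULT_MASK = "130110211012110311"  # default mask quoted above


def seed_from_mask(mask, level):
    """Keep the mask positions whose value is non-zero and <= level.

    Kept positions are written with the digit `level`, all others with 0,
    which mirrors the notation used in the list above.
    """
    return "".join(str(level) if 0 < int(c) <= level else "0" for c in mask)


def seed_weight(seed):
    """Weight of a seed = number of non-zero positions."""
    return sum(c != "0" for c in seed)


for level in (1, 2, 3):
    seed = seed_from_mask(DEFAULT_MASK, level)
    print(level, seed, "weight", seed_weight(seed))
# prints:
# 1 100110011010110011 weight 10
# 2 200220222022220022 weight 12
# 3 330330333033330333 weight 14
```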

Programs overview:

The glint program is provided as a single executable, glint, with a set of commands:

align          align a genomic sequence against a whole genome
self           align a genomic sequence against itself
map            map reads against a genome
mappe          map paired-end reads against a genome
chain          chain glint (genomic) or blast (proteic) hsps
convert        convert an alignment file between some common formats
slice          cut a BAM file into several slices

Type glint --help to get the list of commands.

Bank indexing : the glint index command

Constructs the lookup table of the database and stores the corresponding information in 4 index files, each named with the sequence_file file name as its prefix:
*.bi (the actual index: a serialized tree corresponding to the lookup table)
*.fa (a fasta file prepared for the glint alignment, i.e. possibly masked according to the -C parameter)
*.bl (a file containing a blacklist of the over-represented words, as defined by the -M option)
*.he (a header file describing the content of the *.bi index file)
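
Before running an alignment it can be handy to check that the four index files are present. The Python sketch below is not part of glint; it assumes that each index file is named by appending the extension to the full sequence file name (for example bank.fasta.bi), which is one reading of the description above and should be checked against the files glint index actually writes. The helper names are hypothetical.

```python
from pathlib import Path

# The four index file extensions listed above.
INDEX_SUFFIXES = (".bi", ".fa", ".bl", ".he")


def index_files(sequence_file):
    """Expected index file paths (ASSUMPTION: sequence file name + extension)."""
    return [Path(str(sequence_file) + suffix) for suffix in INDEX_SUFFIXES]


def missing_index_files(sequence_file):
    """Return the expected index files that do not exist on disk."""
    return [p for p in index_files(sequence_file) if not p.exists()]


if __name__ == "__main__":
    missing = missing_index_files("bank.fasta")
    if missing:
        print("missing index files:", ", ".join(str(p) for p in missing))
    else:
        print("all index files found")
```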

Sequence comparison : the glint align command

The program works in the following manner:
  • the index file is loaded into memory as a factor-tree-like structure of k-words (k = size of the mask).
    If the indexing step was skipped, indexing is performed on the fly and the index files are not kept.
  • the query sequence is scanned in order to detect seeds, using the factor-tree-like structure as an index.
    When a group of seeds meets a given criterion, a dropoff alignment is performed in the neighborhood.
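
The two steps above can be illustrated with a toy seed-and-scan sketch in Python. It is not glint's implementation: a plain dictionary stands in for the factor-tree-like structure, the seed 1101 and the sequences are made up for the example, the function names are hypothetical, and the seed grouping criterion and the dropoff alignment are left out.

```python
from collections import defaultdict

SEED = "1101"  # toy spaced seed (weight 3, span 4); hypothetical, not a glint mask


def spaced_word(text, start, seed):
    """Letters of `text` under the non-zero positions of `seed`, from `start`."""
    return "".join(text[start + i] for i, c in enumerate(seed) if c != "0")


def build_index(subject, seed):
    """Map each spaced word to the list of subject positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(subject) - len(seed) + 1):
        index[spaced_word(subject, pos, seed)].append(pos)
    return index


def find_hits(query, index, seed):
    """Return (query_pos, subject_pos) pairs that share a spaced word."""
    hits = []
    for qpos in range(len(query) - len(seed) + 1):
        for spos in index.get(spaced_word(query, qpos, seed), []):
            hits.append((qpos, spos))
    return hits


subject = "ACGTACGGTCA"  # toy database sequence
query = "TACTGTC"        # toy query sequence
index = build_index(subject, SEED)
for qpos, spos in find_hits(query, index, SEED):
    print("query", qpos, "subject", spos, "diagonal", spos - qpos)
```

With these toy sequences the scan reports a single hit, query position 1 against subject position 4, even though the two 4-letter windows (ACTG and ACGG) differ at the position the seed ignores.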

glint-1.0.rc12.826_827.tgz (2.345 MB) Thomas Faraut, 12/04/2017 13:27

glint-1.0.rc12.826_833.tgz (2.345 MB) Thomas Faraut, 22/05/2017 14:44