The Glint Wiki¶
The glint source : glint-1.0.rc12.826_833.tgz
System requirements¶
Glint should work on any standard 64-bit Linux environment with gcc (4.3.x preferred).
A reasonable amount of physical memory (12GB to start with) is recommended if working with large genomes.
Compiling instructions¶
If needed, edit the makefile in the top directory and adjust CXX, CXXFLAGS and DEFINES.
The makefile should, however, work without any modification.
make all
make tests
All tests should be successful.
make install
The binary is now in the bin/ directory; you can move it wherever you want (/usr/local/bin may be a good choice).
Usage¶
Glint is provided as a single executable with a set of commands.
The first command, glint index, indexes the database (a set of contigs, a chromosome or a set of chromosomes stored in a single multi-fasta file).
The second command, glint align, compares a sequence or a set of sequences to the previously indexed database.
The examples subdirectory contains an example that can be tested easily:
>cd examples
>../bin/glint index bank.fasta
>../bin/glint align bank.fasta bac_ends.fasta
glint commands¶
The list of available commands can be obtained by invoking --help (glint --help):
Usage: glint <command> [options]
Command:
    align    align a genomic sequence against a whole genome
    index    index a bank of sequences (a genome) for later comparison
    self     align a genomic sequence against itself
    map      map reads against a genome
    mappe    map paired-end reads against a genome
    chain    chain glint (genomic) or blast (proteic) hsps
    convert  convert an alignment file between some common formats
    slice    cut a BAM file into several slices
For each command, the set of parameters (general or command-specific) can be obtained by typing glint <command> --help
How the program works¶
Glint is an alignment program using a Blast-like seed and extend mechanism, followed by a dropoff local alignment heuristic to align a query sequence to a subject sequence of a bank. Glint is particularly well adapted for comparing sequences against whole chromosomes.
Instead of preprocessing the query sequence into a lookup table, as the Blast software does, the glint algorithm preprocesses the database. In this preprocessing step, the glintindex program builds an index of k-mers in a particular fashion. So that alignments can be performed with different sensitivity/specificity trade-offs without rebuilding the database index, the index is made of interleaved non-consecutive k-mers, or seeds. Non-consecutive means that the k letters forming the seed are not necessarily consecutive letters of the text. This idea of non-consecutive seeds follows the original work of Ma et al. (2002) on PatternHunter.
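To make the notion of a non-consecutive seed concrete, here is a minimal Python sketch. It is not glint code: the function name spaced_kmers and the toy mask 1101 are hypothetical, chosen only for illustration. A mask is read left to right over a window of the text, and only the letters under a non-zero mask position are kept.

```python
def spaced_kmers(text, mask):
    """Yield (position, seed) for every window of len(mask) letters in text.

    Only the letters under a non-zero mask position are kept, so the seed
    has as many letters as the mask has non-zero positions.
    """
    keep = [i for i, m in enumerate(mask) if m != "0"]
    span = len(mask)
    for pos in range(len(text) - span + 1):
        window = text[pos:pos + span]
        yield pos, "".join(window[i] for i in keep)


# Toy mask (hypothetical, not a glint mask): spans 4 letters, keeps 3.
print(list(spaced_kmers("ACGTAC", "1101")))
# [(0, 'ACT'), (1, 'CGA'), (2, 'GTC')]
```

Because the third position of the toy mask is ignored, the windows ACGT and ACTT produce the same seed ACT: a mismatch under a zero position does not prevent a seed hit.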
The principle of interleaved non-consecutive seeds explains the particular format of what is called the mask structure. The default mask structure of the glintindex program is 130110211012110311. This indicates that the lookup table is made of all "non-consecutive" k-mers (k=14) of the database spanning 18 letters, where only the positions with a non-zero mask value are relevant. If we define the weight of a seed as the number of non-zero positions in its mask, the default mask allows 3 different seeds of weights 10, 12 and 14 to be used:
1) 100110011010110011 : seed weight 10 (only positions with a non-zero value <= 1)
2) 200220222022220022 : seed weight 12 (only positions with a non-zero value <= 2)
3) 330330333033330333 : seed weight 14 (only positions with a non-zero value <= 3)
The glint program can then be invoked using type 1 seeds (-w 1, sensitive and slow), type 3 seeds (-w 3, faster at the expense of sensitivity) or type 2 seeds (-w 2, somewhere in between).
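The three seeds listed above can be derived mechanically from the default mask. The following Python sketch is only an illustration of that rule, not glint's implementation; the helper names seed_for_level and weight are hypothetical. A position belongs to the seed of a given level when its mask value is non-zero and not greater than that level.

```python
DEFAULT_MASK = "130110211012110311"  # default mask quoted in the text


def seed_for_level(mask, level):
    """Return the seed of the given level, written as in the list above.

    A position is kept when its mask value is non-zero and <= level;
    kept positions are written with the level digit, the others with 0.
    """
    return "".join(
        str(level) if c != "0" and int(c) <= level else "0" for c in mask
    )


def weight(seed):
    """Number of non-zero positions in a seed."""
    return sum(c != "0" for c in seed)


for level in (1, 2, 3):
    seed = seed_for_level(DEFAULT_MASK, level)
    print(level, seed, weight(seed))
# 1 100110011010110011 10
# 2 200220222022220022 12
# 3 330330333033330333 14
```

Every position used by a lighter seed is also used by the heavier ones, which is why a single index over the 14 non-zero positions can serve all three seed types.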
Programs overview:
The glint program is provided as a single executable, glint, with a set of commands:
    align    align a genomic sequence against a whole genome
    self     align a genomic sequence against itself
    map      map reads against a genome
    mappe    map paired-end reads against a genome
    chain    chain glint (genomic) or blast (proteic) hsps
    convert  convert an alignment file between some common formats
    slice    cut a BAM file into several slices
Type glint --help to get the list of commands.
Bank indexing : the glint index command¶
Constructs the lookup table of the database and stores the corresponding information in 4 index files, each named with the sequence_file file name as prefix:
*.bi (the actual index: a serialized tree corresponding to the lookup table)
*.fa (a fasta file prepared for the glint alignment, i.e. possibly masked according to the -C parameter)
*.bl (a file containing a blacklist of the over-represented words, as defined by the -M option)
*.he (a header file describing the content of the *.bi index file)
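As a convenience, a script can check whether the four index files are present before running an alignment. The Python sketch below is an illustration only. It assumes that the complete sequence file name is used as the prefix (for example bank.fasta.bi), which the text suggests but does not state precisely; the function names are hypothetical.

```python
from pathlib import Path

# The four index file extensions listed above.
INDEX_SUFFIXES = (".bi", ".fa", ".bl", ".he")


def expected_index_files(sequence_file):
    """List the index files expected next to sequence_file.

    ASSUMPTION: the full file name is the prefix (bank.fasta -> bank.fasta.bi).
    """
    path = Path(sequence_file)
    return [path.with_name(path.name + suffix) for suffix in INDEX_SUFFIXES]


def missing_index_files(sequence_file):
    """Return the expected index files that do not exist on disk."""
    return [f for f in expected_index_files(sequence_file) if not f.exists()]


print([f.name for f in expected_index_files("examples/bank.fasta")])
# ['bank.fasta.bi', 'bank.fasta.fa', 'bank.fasta.bl', 'bank.fasta.he']
```

If missing_index_files returns a non-empty list, the bank can be indexed first with glint index.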
Sequence comparison : the glint align command¶
The program works in the following manner :
- the index file is loaded into memory as a factor-tree-like structure of k-words (k=size of the mask). If the indexing step was skipped, the index is built on the fly and is not kept.
- the query sequence is scanned in order to detect seeds, using the factor-tree-like structure as an index.
- when a group of seeds meets a given criterion, a dropoff alignment is performed in the neighborhood.
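The steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration under several assumptions, not glint's algorithm: a dictionary replaces the factor-tree-like structure, seeds are consecutive k-mers rather than masked ones, the grouping criterion is taken to be "at least min_seeds hits on the same diagonal", and the dropoff alignment is an ungapped extension that stops once the score falls more than dropoff below its best value. All names and default values are hypothetical.

```python
from collections import defaultdict


def build_index(subject, k):
    """Map every k-mer of the subject to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend(query, subject, q, s, step, dropoff):
    """Ungapped extension from (q, s) in direction step (+1 or -1).

    Scores +1 per match and -1 per mismatch, and returns the length of the
    best-scoring extension. Stops when the score drops more than dropoff
    below the best score seen so far.
    """
    score = best = best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += 1 if query[q] == subject[s] else -1
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > dropoff:
            break
        q += step
        s += step
    return best_len


def align(query, subject, k=4, min_seeds=2, dropoff=3):
    """Return sorted (query_start, subject_start, length) ungapped hits."""
    index = build_index(subject, k)
    # Group seed hits by diagonal (subject position - query position).
    diagonals = defaultdict(list)
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], ()):
            diagonals[s - q].append(q)
    hits = []
    for diag, q_positions in diagonals.items():
        if len(q_positions) < min_seeds:  # assumed grouping criterion
            continue
        q0 = q_positions[0]  # leftmost seed on this diagonal
        s0 = q0 + diag
        left = extend(query, subject, q0 - 1, s0 - 1, -1, dropoff)
        right = extend(query, subject, q0 + k, s0 + k, +1, dropoff)
        hits.append((q0 - left, s0 - left, left + k + right))
    return sorted(hits)


subject = "GGGGGACGTTCAGCTAGGGGG"
print(align("ACGTTCAGCTA", subject))  # exact copy of subject[5:16]
# [(0, 5, 11)]
print(align("ACGTTCTGCTA", subject))  # same region with one mismatch
# [(0, 5, 11)]
```

In the second call the mismatch destroys four of the eight seeds, but the remaining seeds on the same diagonal still trigger the extension, which crosses the mismatch because the score never falls more than dropoff below its best value.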