Wiki

CR réunions

15/09/2022 : Natacha Nikolic - Sarah Maman
  • Besoin rapide (avant fin sept.) de comparer les échantillons entre eux afin de mettre en évidence les régions plus ou moins conservées. Les échantillons étant anciens, les dégradations sont variables.
  • Si des zones partagées entre séquences sont mises en évidence, cela permettra de définir plus finement une couverture de séquençage pour une demande de budget (WGS ou Capture) auprès de GetPlage.

1. Description du projet

1.1- Contacts

Natacha Nikolic
Stefanie Wagner <>

1.2- Contexte biologique du projet PaleoFish

1.3- Question

Stéfanie a fait tourner une partie des données (il manque encore quelques échantillons) qui comprend l’alignement sur Anguilla anguilla, Anguilla rostrata, Salmo salar, et Salmo trutta.
Comme Stéfanie n’a pas le temps de faire des analyses pour nous fournir le % des génomes partagés entre nos échantillons etc.
Je voulais voir avec toi si quelqu’un de votre équipe pourrait nous aider et nous fournir le % de couverture et % de génome partagé entre échantillons.
Informations sur les échantillons :
Les résultats de nos échantillons passés via Paleomix sont hébergés chez vous « Project/ECOBIOP_ADNa/Data_PaleoFish (comb = alignement sur génome complet donc nucléiare et mitochondrial) :
Parmi ces résultats nous avons les fichiers Coverage, dephs, BAM etc. :
Nous savons que 2 échantillons (L5-1 et L9-5) ne sont probablement pas du saumon ou de l’anguille, puisque avec Galaxy en faisant un BWA nous avions vu que le premier s’alignait avec Hucho hucho et le second avec l’homme.
Donc pour le moment, ce n’est pas la peine de regarder ces échantillons.

???? En attachées, les informations récupérées avec Galaxy au besoin (après nettoyage par fastp, info qualité par fastQC et alignement par BWA). Le % de couverture semblait variable mais nous n’avions pas l’info sur le % de génome partagé entre échantillon.

2. Pipeline

Répertoire de travail

/work/project/sigenae/sarah/PALEOFISH/Reference_Genome_Whole_genome/Poissons

Indexation des génomes de référence

Anguilla_anguilla

Vérification des FASTQ

Question 1 :
Certains FASTQ Anguilla anguilla sont corrompus et d'autres sont manquants:

Pour les foward:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae $ for filename in /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/*.fastq.gz;do
>  gunzip -t $filename/forward.fastqsanger.gz ; [ $? == 0 ] || echo pb $filename/forward.fastqsanger.gz
> done

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_10_R9_54_e_S122.fastq.gz/forward.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_10_R9_54_e_S122.fastq.gz/forward.fastqsanger.gz

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_14_SCO_Aa_227_S126.fastq.gz/forward.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_14_SCO_Aa_227_S126.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_15_SCO_Aa_229_S127.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_15_SCO_Aa_229_S127.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_10_SCO_2112_S138.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_10_SCO_2112_S138.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_4_SCO_1451_S132.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_4_SCO_1451_S132.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_5_SCO_2_S133.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_5_SCO_2_S133.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_8_SCO_519_S136.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_8_SCO_519_S136.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_9_SCO_1492_S137.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_9_SCO_1492_S137.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_11_Aa_Sou266_248_1_S91.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_11_Aa_Sou266_248_1_S91.fastq.gz/forward.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_12_Aa_Sou266_794_1_S92.fastq.gz/forward.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_12_Aa_Sou266_794_1_S92.fastq.gz/forward.fastqsanger.gz

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_2_Aa_Sou266_943_1_S82.fastq.gz/forward.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_2_Aa_Sou266_943_1_S82.fastq.gz/forward.fastqsanger.gz

Pour les reverse:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae $ for filename in /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/*.fastq.gz;do
> gunzip -t $filename/reverse.fastqsanger.gz ; [ $? == 0 ] || echo pb $filename/reverse.fastqsanger.gz
> done

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_10_R9_54_e_S122.fastq.gz/reverse.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_10_R9_54_e_S122.fastq.gz/reverse.fastqsanger.gz

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_14_SCO_Aa_227_S126.fastq.gz/reverse.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_14_SCO_Aa_227_S126.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_15_SCO_Aa_229_S127.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_15_SCO_Aa_229_S127.fastq.gz/reverse.fastqsanger.gz

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_8_R9_54_a_S120.fastq.gz/reverse.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_8_R9_54_a_S120.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_10_SCO_2112_S138.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_10_SCO_2112_S138.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_4_SCO_1451_S132.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_4_SCO_1451_S132.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_5_SCO_2_S133.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_5_SCO_2_S133.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_8_SCO_519_S136.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_8_SCO_519_S136.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_9_SCO_1492_S137.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L11_9_SCO_1492_S137.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_11_Aa_Sou266_248_1_S91.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_11_Aa_Sou266_248_1_S91.fastq.gz/reverse.fastqsanger.gz
gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_12_Aa_Sou266_794_1_S92.fastq.gz/reverse.fastqsanger.gz: No such file or directory
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_12_Aa_Sou266_794_1_S92.fastq.gz/reverse.fastqsanger.gz

gzip: /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_2_Aa_Sou266_943_1_S82.fastq.gz/reverse.fastqsanger.gz: unexpected end of file
pb /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L8_2_Aa_Sou266_943_1_S82.fastq.gz/reverse.fastqsanger.gz

Question 2 :
Les 69 FASTQ Salmonidae dans /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Salmonidae/, mais lequels sont Salmo trutta et Salmo salar ? J'aurai besoin de les distinguer pour travailler sur la bonne référence.
Pas de fichiers manquants, pas de fichiers foward ni reverse corrompu.

Après mis à jour des FASTQ par Natacha, atres points à vérifier: bien le même nombre de r1 et r2, et donnés dans le même ordre ?

sigenae@genologin1 /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/EG10.fastq $ for filename in /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/*.fastq.gz;do
> grep @ reverse.fastqsanger | wc -l
> done
740333
740333
740333
...
Idem pour forward.
  • Nettoyage du répertoire AUD10199-1.fastq pour ne garder que AUD10199-1.fastq/forward.fastqsanger AUD10199-1.fastq/reverse.fastqsanger:
    sigenae@genologin1 /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae $ ls AUD10199-1.fastq/*
    AUD10199-1.fastq/forward.fastqsanger  AUD10199-1.fastq/reverse.fastqsanger
    
    AUD10199-1.fastq/AUD10199-1.fastq:
    
    AUD10199-1.fastq/AUD10310-2.fastq:
    
    AUD10199-1.fastq/AUD10367-1.fastq:
    
    AUD10199-1.fastq/AUD10487-1.fastq:
    forward.fastqsanger  reverse.fastqsanger
    
    AUD10199-1.fastq/AUD10671-2.fastq:
    
    AUD10199-1.fastq/AUD10857-2.fastq:
    forward.fastqsanger
    
    AUD10199-1.fastq/AUD10996-1.fastq:
    forward.fastqsanger  reverse.fastqsanger
    
    AUD10199-1.fastq/AUD11096-1.fastq:
    
    AUD10199-1.fastq/AUD11719-1.fastq:
    
    AUD10199-1.fastq/AUD11719-2.fastq:
    
    AUD10199-1.fastq/AUD11747-9.fastq:
    
    AUD10199-1.fastq/AUD11782-2.fastq:
    
    AUD10199-1.fastq/AUD20457-2.fastq:
    
    AUD10199-1.fastq/CD2:
    fastp on collection 5_ Paired-end output/
    
    AUD10199-1.fastq/EG10.fastq:
    
  • Le répertoire CD2 était présent 2 fois:

ls /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/CD2/fastp\ on\ collection\ 5_\ Paired-end\ output/CD2.fastq.gz/
forward.fastqsanger.gz reverse.fastqsanger.gz
sigenae@genologin1 /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp $ ls /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/CD2.fastq.gz/
--> j'ai récupèré les fichiers forward et reverse de
CD2/fastp\ on\ collection\ 5_\ Paired-end\ output/CD2.fastq.gz/
pour les mettre dans:
CD2.fastq.gz

Objectif

Distinguons:
  • Taux de couverture : profondeur de séquençage = depth = nombre de lectures.
  • Taux de recouvrement : overlapping = breadth = chevauchement des reads pour s'aligner sur la référence, pourcentage en longeur du génome qui se trouve aligné avec des reads, nombre de bases sur la référence.

Taux de recouvrement sur la référence:
bedtools coverage -a reference.VCF (description des chromosomes) -b sample1.bed sample2.bed ... -hist

Taux de recouvrement d'un échantillon par rapport à un autre (s1 sur s2): voir les étapes ci-dessous

Indexation de la référence

bwa index Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta

(base) smaman@node140 /work/project/sigenae/sarah/PALEOFISH/Reference_Genome_Whole_genome/Poissons $ bwa index  Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta
[bwa_index] Pack FASTA... 9.22 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=1958058486, availableWord=149775832
[BWTIncConstructFromPacked] 10 iterations done. 99999990 characters processed.
...
[BWTIncConstructFromPacked] 260 iterations done. 1950183782 characters processed.
[bwt_gen] Finished constructing BWT in 264 iterations.
[bwa_index] 1152.95 seconds elapse.
[bwa_index] Update BWT... 6.12 sec
[bwa_index] Pack forward-only FASTA... 4.96 sec
[bwa_index] Construct SA from BWT and Occ... 408.09 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta
[main] Real time: 1582.803 sec; CPU: 1581.352 sec

Référence indexée:

(base) smaman@node140 /work/project/sigenae/sarah/PALEOFISH/Reference_Genome_Whole_genome/Poissons $ ls -ltrah
-rw-r--r-- 1 smaman  SIGENAE 934M 16 sept. 16:56 Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta.bwt
-rw-r--r-- 1 smaman  SIGENAE 234M 16 sept. 16:56 Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta.pac
-rw-r--r-- 1 smaman  SIGENAE 6,4K 16 sept. 16:56 Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta.ann
-rw-r--r-- 1 smaman  SIGENAE  10K 16 sept. 16:56 Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta.amb
drwxrwsr-x 3 sigenae SIGENAE 8,0K 16 sept. 17:03 .
-rw-r--r-- 1 smaman  SIGENAE 467M 16 sept. 17:03 Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta.sa

Aligner sur la référence

s1.fastq.gz -> s1.bam

bwa aln sampe est plus précis mais mappe moins. En effet, les positions alternatives ne sont pas mappées. S'il y a trop de distances entre les reads et le génome de référence, les reads ne sont pas mappées, il n'y a pas d'alignement local.
Il est préférable d'utiliser bwa mem pour plus de mapping. L'alignement local permet les alignements secondaires donc forcera le recouvrement. Ensuite, il sera donc nécessaire de filtrer les reads non mappées et les alignements secondaires:

Préparation du répertoire RESULTATS/:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ mkdir /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/

Préparation du fichier de lancement 1-mapp.sh (#bwa mem "index_prefix" "s1_R1.fq" "s1_R2.fq" > s1.sam):

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ for filename in /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/*.fastq.gz; do echo " module purge && module load bioinfo/bwa-0.7.17  &&  bwa mem $ref $filename/forward.fastqsanger.gz  $filename/reverse.fastqsanger.gz > /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/$(basename $filename).$i.sam &&  cd /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/ && ln -s $(basename $filename).$i.sam  $i.sam "; i=$(( $i + 1 )); done

Visualisation du fichier de lancement:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ more 1-mapp.sh 
 module purge && module load bioinfo/bwa-0.7.17  &&  bwa mem /work/project/sigenae/sarah/PALEOFISH/Reference_Genome_Whole_genome/Poissons/Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta /work/project/sigenae/sarah/PALEOFISH/C
lean_data_after_fastp/Anguillidae/CD2.fastq.gz/forward.fastqsanger.gz  /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/CD2.fastq.gz/reverse.fastqsanger.gz > /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre
_samples_Anguilla_anguilla/RESULTATS/CD2.fastq.gz.47.sam &&  cd /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/ && ln -s CD2.fastq.gz.47.sam  47.sam 
 module purge && module load bioinfo/bwa-0.7.17  &&  bwa mem /work/project/sigenae/sarah/PALEOFISH/Reference_Genome_Whole_genome/Poissons/Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta /work/project/sigenae/sarah/PALEOFISH/C
lean_data_after_fastp/Anguillidae/L10_10_R9_54_e_S122.fastq.gz/forward.fastqsanger.gz  /work/project/sigenae/sarah/PALEOFISH/Clean_data_after_fastp/Anguillidae/L10_10_R9_54_e_S122.fastq.gz/reverse.fastqsanger.gz > /work/project/sigenae/sa
rah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/L10_10_R9_54_e_S122.fastq.gz.48.sam &&  cd /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/ && ln -s L10_10_R9_54_e_S122.fas
tq.gz.48.sam  48.sam 

Lancement du job:

cd  /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/
sarray -c 4 --mem=10G 1-mapp.sh
Submitted batch job 37324668

et

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sarray -c 4 --mem=10G  1-mapp-suite.sh
Submitted batch job 37376508

Filtre

samtools view -F 256 -F 4 sam
-F : exclude
256: flag des secondary alignements
4: flag des reads non mappées
https://broadinstitute.github.io/picard/explain-flags.html

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ more 2-filter-convert.sh 
#!/bin/bash
#SBATCH -J paleofish
#SBATCH -p unlimitq 
#SBATCH --mem=20G

module purge
module load bioinfo/samtools-1.16.1
module load bioinfo/bedtools-2.27.1

# Filter + Convertir le BAM en BED: 
cd  /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/
for ((i=1; i<=60; i++)); do
    samtools view -Sb -F 256 -F 4 $i.sam | samtools sort -@ 4 -m 1G  - | bamToBed -i - > $i.bed
done

Lancement du job:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sbatch 2-filter-convert.sh
Submitted batch job 37379842  (de 1 à 48)
et
(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sbatch 2-filter-convert-suite.sh
Submitted batch job 37395053 (de 49 à 60)

Regrouper les intervals avec bedtools merge

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ more 3-merge-intervals.sh 
#!/bin/bash
#SBATCH -J paleofish
#SBATCH -p unlimitq 
#SBATCH --mem=20G

module purge
module load bioinfo/bedtools-2.27.1

# bedtools merge -i s1.bed > s1.merge.bed 
cd  /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/
for ((i=1; i<=60; i++)); do
    bedtools merge -i $i.bed > $i.merge.bed 
done

Lancement du job:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sbatch  3-merge-intervals.sh 
Submitted batch job 37402064

Récupérer le nombre de bases communes entre s1 et s2 avec bedtools coverage

#https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html
bedtools coverage -a s1.merge.bed -b s2.merge.bed

Voici l'exemple donné par les bedtools:


$ cat A.bed
chr1  0   100
chr1  100 200
chr2  0   100

$ cat B.bed
chr1  10  20
chr1  20  30
chr1  30  40
chr1  100 200

$ bedtools coverage -a A.bed -b B.bed
chr1  0   100  3  30  100 0.3000000
chr1  100 200  1  100 100 1.0000000
chr2  0   100  0  0   100 0.0000000

Colonne 1: Nom du chromosome
Colonne 2: 0 : position start dans A.bed (chr1  0   100)
Colonne 3: 100 : position stop dans A.bed (chr1  0   100)
Colonne 4: le profondeur = la position 0-100 chr1 de A est vu 3 fois dans B : chr1 10-20, chr1 20-30 et chr1 30-40, à ne pas prendre en compte car nos échantillons sont mergés.
Colonne 5: recouvrement = 30 nucleotides sont communs de 10 à 40 (40-10=30)
Colonne 6: longeur dans A : 100 bases 
Colonne 7: taux de recouvrement = 30/100 = col 5/ col 6 = 0.3

Traitement des données Anguilla anguilla:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ more 4-bedtools-coverage.sh 
#!/bin/bash
#SBATCH -J paleofish
#SBATCH -p unlimitq 
#SBATCH --mem=15G

module purge
module load bioinfo/bedtools-2.27.1

cd /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/;
for ((i=1; i<=59; i++)); do
    for ((j=i; j<=60; j++)); do
        bedtools coverage -a $i.merge.bed -b $j.merge.bed  -hist > coverage_$i-$j.out
    done    
done

Lancement du job:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sbatch  4-bedtools-coverage.sh 
Submitted batch job 37402700

Calcul du taux de recouvrement

taux de recouvremement = nb tot bases (5ieme colonne)/ (long chromosomes +scaffolds)
avec

1/ Long chromosomes +scaffolds

#https://www.bioinformatics.nl/cgi-bin/emboss/help/infoseq
(infoseq -sequence /bank/ebi/ensembl/ensembl_sus_scrofa_genome/current/fasta/ensembl_sus_scrofa_genome -length -only -noheading 2>/dev/null | perl -lane '$n+=$F0;END{print $n;}'
ou)

srun --mem 10G --pty bash 
module load system/datamash-1.3
module load bioinfo/EMBOSS-6.6.0
infoseq -sequence /work/project/sigenae/sarah/PALEOFISH/Reference_Genome_Whole_genome/Poissons/Anguilla_anguilla_GCA_013347855.1_fAngAng1.pri_genomic.fasta -length -only -noheading 2>/dev/null| sed -e 's/\s//g'|datamash sum 1 

Résultat:
Long chromosomes +scaffolds = 979029243

2/ Nb tot bases (5ieme colonne)

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH $ search_module datamash
system/datamash-1.3
(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH $ module load system/datamash-1.3
(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH $ datamash sum 5  coverage_$i-$j.out

Enlever toutes les lignes "all":
(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ more 6-clean-coverage.sh

#!/bin/bash
#SBATCH -J paleofish
#SBATCH -p unlimitq 
#SBATCH --mem=15G

module purge
module load bioinfo/bedtools-2.27.1

cd /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/;
for ((i=1; i<=59; i++)); do
    for ((j=i; j<=60; j++)); do
        grep -v '^all' coverage_$i-$j.out > cleaned_coverage_$i-$j.out
    done    
done

Lancement du job:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sbatch  6-clean-coverage.sh 
Submitted batch job 37417120

Préparation du job:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ more 5-calculate-coverage-by-sample.sh 
#!/bin/bash
#SBATCH -J paleofish
#SBATCH -p unlimitq 
#SBATCH --mem=15G

module purge
module load system/datamash-1.3

cd /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla/RESULTATS/;
for ((i=1; i<=59; i++)); do
    for ((j=i; j<=60; j++)); do
        datamash -s sum 5  < cleaned_coverage_$i-$j.out > cleaned_coverage_$i-$j.txt
    done    
done

Lancement du job:

(base) smaman@genologin1 /work/project/sigenae/sarah/PALEOFISH/recouvrement_entre_samples_Anguilla_anguilla $ sbatch 5-calculate-coverage-by-sample.sh 
Submitted batch job 37417446