MGmapper A tool to map MetaGenomics data

Presentation on theme: "MGmapper A tool to map MetaGenomics data"— Presentation transcript:

1 MGmapper A tool to map MetaGenomics data
A tool to map fastq files against one or more reference sequence databases.
Produces output that is “fairly” easy to understand.
Gives an overview of taxonomy, abundance, depth and coverage in relation to the reference sequences.
Estimates insert size.
Produces assembled contig fasta files.

2 https://cge.cbs.dtu.dk/services/MGmapper/

3 Command line versions
module load mgmapper/2.2
Sets the path to bwa, samtools, cutadapt, bedtools and the perl/python scripts.
MGmapper_PE.pl -h or MGmapper_SE.pl -h
MGmapper_classify.pl -h
Classifier to read .annot files from MGmapper (in the misc/ dir).
Start the paired-end mapping now:

4 Workflow (flow chart)
Pre-processing of reads: adaptor removal/trimming, identify paired reads, keep reads that don’t map to the phiX genome.
Bwa mem mapping of all reads against the reference databases, then removal of bad hits (in the figure: AS=28, AS=45, AS=90 before; AS=45, AS=90 after).
Filter hits based on alignment criteria: Alignment Score >= 30.
Best mode: re-arrange database hits and keep only the read pairs with the best sum of alignment scores.
Full mode: keep all hits even if present in several databases.
Example database hits:
Bacteria pair1 forward AS=55 reverse AS=60
Bacteria pair2 forward AS=90 reverse AS=100
Human pair1 forward AS=60 reverse AS=60
Fungi pair1 forward AS=50 reverse AS=55
Hits kept:
Bacteria pair2 forward AS=90 reverse AS=100
Human pair1 forward AS=60 reverse AS=60
Fungi pair1 forward AS=50 reverse AS=55
Abundance and read count statistics, fasta contigs.
Taxonomy annotation, post-processing (confidence).
Final output.
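The filtering step can be sketched in Python. This is only an illustration of the rules stated on the slide, not MGmapper's own code. Assumptions: the Alignment Score >= 30 criterion is applied to each read of a pair, Fungi is treated as a full-mode database (which would explain why its pair1 hit survives in the slide's example), and ties between best-mode databases are resolved in favour of the first hit seen. The function name filter_hits and the tuple layout are hypothetical.

```python
MIN_AS = 30  # alignment score threshold from the slide


def filter_hits(hits, full_mode_dbs=frozenset()):
    """Filter (database, pair_id, forward_AS, reverse_AS) tuples.

    Best-mode databases compete: per read pair, only the hit with the
    highest forward_AS + reverse_AS is kept. Hits in full-mode databases
    are always kept once they pass the score threshold.
    """
    # Assumption: both reads of a pair must reach the threshold.
    passing = [h for h in hits if h[2] >= MIN_AS and h[3] >= MIN_AS]

    kept = []
    best = {}  # pair_id -> best hit among best-mode databases
    for hit in passing:
        db, pair_id, fwd, rev = hit
        if db in full_mode_dbs:
            kept.append(hit)
        elif pair_id not in best or fwd + rev > best[pair_id][2] + best[pair_id][3]:
            best[pair_id] = hit
    kept.extend(best.values())
    return kept


# The example hits from the slide.
hits = [
    ("Bacteria", "pair1", 55, 60),
    ("Bacteria", "pair2", 90, 100),
    ("Human", "pair1", 60, 60),
    ("Fungi", "pair1", 50, 55),
]
result = sorted(filter_hits(hits, full_mode_dbs={"Fungi"}))
for db, pair_id, fwd, rev in result:
    print(db, pair_id, fwd, rev)
```

With these inputs, pair1 goes to Human (sum 120) rather than Bacteria (sum 115), and the Fungi hit is kept because that database is assumed to run in full mode.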

5 Properly paired reads
YES: both reads map to the same reference sequence (Ref Seq1) and the InsertSize (5´ to 5´ distance) is within the upper and lower boundaries determined by bwa.
NO: paired but mapped to different reference sequence entries (Ref Seq1 and Ref Seq2).

6 Inferred InsertSize
All reads that are mapped to databases in ‘bestmode’ are used to make an inferred insert size distribution (5’ to 5’ distance).
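A minimal Python sketch of the 5’ to 5’ distance, under stated assumptions: positions are 1-based leftmost coordinates as in SAM, the forward read is the leftmost read of the pair, and the aligned length of the reverse read equals its span on the reference. The function name insert_size and the example coordinates are hypothetical.

```python
from statistics import median


def insert_size(fwd_start, rev_start, rev_aligned_len):
    """Return the 5' to 5' distance of a forward/reverse read pair.

    The 5' end of the forward read is its leftmost aligned base; the
    5' end of the reverse read is its rightmost aligned base.
    """
    rev_five_prime = rev_start + rev_aligned_len - 1
    return rev_five_prime - fwd_start + 1


# Hypothetical pairs: (forward start, reverse start, reverse aligned length).
pairs = [(100, 350, 100), (2000, 2210, 100), (5000, 5280, 100)]
sizes = [insert_size(*p) for p in pairs]
print(sizes)          # [350, 310, 380]
print(median(sizes))  # 350
```

Collecting these values over all best-mode pairs gives the distribution described on the slide.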

7 Annotation of fastq reads via MGmapper
Local alignment of fastq reads against reference sequences via bwa.
Best alignment, but not all alignments.
Taxonomy annotation: ref seq names have been assigned full taxonomy (.tax and .kch in )

8 MGmapper options command line version

9 databases.txt

10 MGmapper options command line version

11 MGmapper options command line version

12 Command
xqsub -V -d workDir -l nodes=1:ppn=8,mem=40gb,walltime=2:00:00 -de MGmapper_PE.pl -c 8 -i /home/databases/metagenomics/fastq/China_2_500k_R1.fastq.gz -j /home/databases/metagenomics/fastq/China_2_500k_R2.fastq.gz -C 1,3,4,5,11 -F 6 -d outputDir -1 'touch yt.PE'
Abundance.databases.txt

13 Output files

14 Post-processing *.annot files

15 Output files

16 Numbers
Strain abundance (paired-end): Abundance (%) = 100*readCount/size*2
Strain abundance (single-end): Abundance (%) = 100*readCount/size
Species abundance: Abundance_species (%) = Σ Abundance_strain
Covered_positions = number of positions in a ref seq that are observed at >= 1X
Coverage = covered_positions/size
Depth = nucleotides/size
ReadCountUniq = reads where AS > XS, where AS is the alignment score and XS is the score of the second-best hit
Size = number of bp in the reference sequence
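The single-end formulas can be written out as a small Python sketch. It only restates the slide's definitions; the function names and the example numbers are hypothetical. The paired-end variant is left out because the slide does not make the grouping of its factor 2 explicit.

```python
def abundance_single_end(read_count, size):
    """Strain abundance (%) as defined on the slide: 100*readCount/size."""
    return 100 * read_count / size


def species_abundance(strain_abundances):
    """Species abundance (%) is the sum over its strains."""
    return sum(strain_abundances)


def coverage(covered_positions, size):
    """Fraction of reference positions observed at >= 1X."""
    return covered_positions / size


def depth(nucleotides, size):
    """Mapped nucleotides per reference position."""
    return nucleotides / size


# Hypothetical strains of one species: (readCount, size in bp).
strains = [(1000, 5_000_000), (2000, 4_000_000)]
per_strain = [abundance_single_end(rc, sz) for rc, sz in strains]
print(per_strain)                     # [0.02, 0.05]
print(species_abundance(per_strain))  # approximately 0.07

print(coverage(2_500_000, 5_000_000))  # 0.5
print(depth(10_000_000, 5_000_000))    # 2.0
```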

17 Benchmark dataset: 11 bacterial species spiked in distilled water
15 classification methods tested.
No abundance threshold => many false positives.

18 Abundance thresholds FW in-vitro dataset, <233bp> (table 3, species)
Peabody et al. BMC Bioinformatics (2015) 16:363

19 Abundance thresholds FW in-silico dataset <250bp> (table 4, species)

20 Reliability
What criteria should be used to extract reliable results?
Table columns: refSeq db Size nucleotides reads reads_uniq nm Name (rows shown for Bacteria)

21 Exercise
Run MGmapper_PE.pl against a small air-toilet sample (China).
Learn to use MGmapper_classify.pl to extract reliable hits:
Collapse hits between different databases.
Collapse hits at different taxonomy levels.
Abundance criteria.
uniqReadCount fraction.
misMatch fraction.
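The kind of filtering that such criteria imply can be sketched in Python. This is not MGmapper_classify.pl and does not reproduce its options. Assumptions: each hit is a dict using the column names Size, nucleotides, reads, reads_uniq and nm from the reliability slide, nm is the total number of mismatches, the misMatch fraction is nm/nucleotides, and the uniqReadCount fraction is reads_uniq/reads. The cutoff of more than 10 unique reads appears on the benchmark slides; the other two thresholds are purely illustrative.

```python
def is_reliable(row, min_uniq_reads=10, min_uniq_fraction=0.5,
                max_mismatch_fraction=0.05):
    """Return True if a hit passes all three (partly hypothetical) criteria."""
    uniq_fraction = row["reads_uniq"] / row["reads"] if row["reads"] else 0.0
    mismatch_fraction = (row["nm"] / row["nucleotides"]
                         if row["nucleotides"] else 1.0)
    return (row["reads_uniq"] > min_uniq_reads
            and uniq_fraction >= min_uniq_fraction
            and mismatch_fraction <= max_mismatch_fraction)


# Hypothetical rows; the names are placeholders, not real results.
rows = [
    {"Name": "strain_A", "reads": 400, "reads_uniq": 380,
     "nucleotides": 40_000, "nm": 200},
    {"Name": "strain_B", "reads": 50, "reads_uniq": 4,
     "nucleotides": 5_000, "nm": 30},
    {"Name": "strain_C", "reads": 300, "reads_uniq": 250,
     "nucleotides": 30_000, "nm": 6_000},
]
reliable = [r["Name"] for r in rows if is_reliable(r)]
print(reliable)  # ['strain_A']
```

Here strain_B fails on unique reads and strain_C on the mismatch fraction (0.2), so only strain_A remains.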

22 Abundance thresholds FW in-vitro dataset, <233bp> (table 3)
MGmapper_SE (uniq reads > 10)
Peabody et al. BMC Bioinformatics (2015) 16:363

23 Abundance thresholds FW in-silico dataset <250bp> (table 4)
MGmapper_PE (uniq reads > 10)
* Nocardioides sp. JS614 is included in the MGmapper_PE benchmark, but not in the table.

24 Fastq files
A_L01_R1.fq A_L02_R1.fq
A_L01_R2.fq A_L02_R2.fq
Options (lists of files): -I listF -J listR
Options (a single pair of files): -i A_L01_R1.fq -j A_L01_R2.fq

25 Trimming
Raw Fastq => Trim, adaptor removal (cutadapt) => Trimmed Fastq
cutadapt -f fastq -q 30 -m 30 -b "Illumina adaptors" infile > outfile
Infile.R1 => Outfile.R1, and likewise for R2 (Outfile.R2)
Option: -S : skip cutadapt [off]

26 Reads In Common
Only in paired-end mode.
Trimmed Fastq => Identify pairs (readsInCommon) => Trimmed & paired
ReadsInCommon -i all.R1 -j all.R2 -a InCommon.R1 -b InCommon.R2
all.R1, all.R2 => InCommon.R1, InCommon.R2
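The idea behind this step can be sketched in Python: after trimming, some reads may have lost their mate, so only reads whose identifier occurs in both files are kept. This is an illustration and not the ReadsInCommon program itself. Assumptions: four-line fastq records, read identifiers that match between the two files once a trailing /1 or /2 is removed, and inputs small enough to hold in memory. The function names are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, record_text) from a 4-line-per-record fastq stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq, plus, qual = (handle.readline() for _ in range(3))
        read_id = header[1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]  # drop the mate suffix
        yield read_id, header + seq + plus + qual


def reads_in_common(r1_handle, r2_handle):
    """Return fastq text for R1 and R2 restricted to shared read ids."""
    r1 = dict(read_fastq(r1_handle))
    r2 = dict(read_fastq(r2_handle))
    common = [rid for rid in r1 if rid in r2]  # keeps the R1 order
    return ("".join(r1[rid] for rid in common),
            "".join(r2[rid] for rid in common))


# Tiny in-memory example: only read "b" is present in both files.
all_r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nGGCC\n+\nIIII\n")
all_r2 = io.StringIO("@b/2\nTTAA\n+\nIIII\n@c/2\nCCGG\n+\nIIII\n")
in_common_r1, in_common_r2 = reads_in_common(all_r1, all_r2)
print(in_common_r1, end="")
print(in_common_r2, end="")
```

Both outputs list the shared reads in the same order, which is what a paired-end mapper expects from its two input files.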

27 Remove positive control reads
Fastq (ReadsInCommon) => Bwa mem against PhiX database => notPhiX bam
bwa mem -t cores phiXdb fastqF fastqR | samtools view -f1 -f12 -Sb - | cat > notPhiX.bam
-f1 (read paired), -f4 (read unmapped), -f8 (mate unmapped); -f12 = 4 + 8, i.e. both the read and its mate are unmapped.
PhiX: bacteriophage PhiX 174
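The flag arithmetic can be checked with a short Python sketch. In the SAM format the FLAG field is a bit mask, and a -f filter keeps a record only if every bit of the given value is set. The bit values 1, 4 and 8 are those listed on the slide; the helper name has_all_bits and the example FLAG values are illustrative. A record that satisfies both -f1 and -f12 has all three bits set, which corresponds to the mask 13.

```python
READ_PAIRED = 0x1    # -f1
READ_UNMAPPED = 0x4  # -f4
MATE_UNMAPPED = 0x8  # -f8


def has_all_bits(flag, required):
    """Mimic a '-f required' test: every required bit must be set."""
    return (flag & required) == required


both_unmapped = READ_UNMAPPED | MATE_UNMAPPED  # 12
paired_and_both_unmapped = READ_PAIRED | both_unmapped  # 13

# Example FLAG values:
# 77 = 1 + 4 + 8 + 64  (paired, read and mate unmapped, first in pair)
# 73 = 1 + 8 + 64      (paired, read mapped, mate unmapped)
# 99 = 1 + 2 + 32 + 64 (properly paired, both mapped)
for flag in (77, 73, 99):
    print(flag, has_all_bits(flag, paired_and_both_unmapped))
```

Only the first record (77) is kept: pairs where at least one read aligned to PhiX are excluded from notPhiX.bam.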

