MGmapper A tool to map MetaGenomics data

A tool to map fastq files against one or more reference sequence databases
Make output that is "fairly" easy to understand
Give an overview of taxonomy, abundance, depth and coverage in relation to the reference sequences
Estimate insert size
Make assembled contig fasta files

https://cge.cbs.dtu.dk/services/MGmapper/

Command line versions
module load mgmapper/2.2
  sets path to: bwa, samtools, cutadapt, bedtools & perl/python scripts
MGmapper_PE.pl -h or MGmapper_SE.pl -h
MGmapper_classify.pl -h
  Classifier to read .annot files from MGmapper (in misc/ dir)
Start the paired-end mapping now:

Pre-processing of reads
  Adaptor removal/trimming
  Identify paired reads
  Keep reads that don't map to the phiX genome
Bwa mem mapping of all reads against reference databases and removal of bad hits
  Filter hits based on alignment criteria: Alignment Score >= 30
  (diagram: AS=28, AS=45, AS=90 before filtering; AS=45, AS=90 after)
Best mode: re-arrange database hits and keep only the read pairs with the best sum of alignment scores.
Full mode: keep all hits even if present in several databases.
  Diagram, first panel:
    Bacteria pair1 forward AS=55 reverse AS=60
    Bacteria pair2 forward AS=90 reverse AS=100
    Human pair1 forward AS=60 reverse AS=60
    Fungi pair1 forward AS=50 reverse AS=55
  Diagram, second panel:
    Bacteria pair2 forward AS=90 reverse AS=100
    Human pair1 forward AS=60 reverse AS=60
    Fungi pair1 forward AS=50 reverse AS=55
Abundance and read count statistics, fasta contigs
Taxonomy annotation, post-processing (confidence)
Final output
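
The difference between best mode and full mode can be illustrated with a small Python sketch. It is not MGmapper's code. It assumes that the alignment score filter applies to both mates of a pair, and that "pair1" listed under different databases refers to the same read pair; the slide's diagram may treat these cases differently. The function names and the record layout are hypothetical.

```python
from collections import defaultdict

MIN_ALIGNMENT_SCORE = 30  # threshold quoted on the slide

# Example hits taken from the slide diagram (db, pair, forward AS, reverse AS).
HITS = [
    {"db": "Bacteria", "pair": "pair1", "fwd_as": 55, "rev_as": 60},
    {"db": "Bacteria", "pair": "pair2", "fwd_as": 90, "rev_as": 100},
    {"db": "Human", "pair": "pair1", "fwd_as": 60, "rev_as": 60},
    {"db": "Fungi", "pair": "pair1", "fwd_as": 50, "rev_as": 55},
]


def filter_hits(hits, min_as=MIN_ALIGNMENT_SCORE):
    """Drop hits where either mate scores below the threshold (assumption)."""
    return [h for h in hits if h["fwd_as"] >= min_as and h["rev_as"] >= min_as]


def best_mode(hits):
    """Keep, per read pair, the single database hit with the highest summed AS."""
    best = {}
    for h in hits:
        total = h["fwd_as"] + h["rev_as"]
        current = best.get(h["pair"])
        if current is None or total > current["fwd_as"] + current["rev_as"]:
            best[h["pair"]] = h
    return best


def full_mode(hits):
    """Keep every hit, grouped by read pair, even across several databases."""
    grouped = defaultdict(list)
    for h in hits:
        grouped[h["pair"]].append(h)
    return dict(grouped)


if __name__ == "__main__":
    kept = filter_hits(HITS)
    for pair, hit in sorted(best_mode(kept).items()):
        print(pair, hit["db"], hit["fwd_as"] + hit["rev_as"])
    for pair, hits in sorted(full_mode(kept).items()):
        print(pair, [h["db"] for h in hits])
```

Under these assumptions, best mode assigns each pair to one database only, while full mode reports the same pair once per database it hits.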

Properly paired reads
YES: both reads map to the same reference sequence (Ref Seq1) and the InsertSize (5´ to 5´) is within the upper and lower boundaries determined by bwa
NO: paired but mapped to different reference sequence entries (Ref Seq1 and Ref Seq2)

Inferred InsertSize
All reads that are mapped to databases in best mode are used to make an inferred insert size distribution (5' to 5' distance).
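
As a rough illustration of the 5' to 5' distance, the Python sketch below computes it for a forward/reverse pair on the same reference sequence and summarises a few values. The coordinates are invented, positions are assumed to be 1-based leftmost mapping positions as in SAM, and soft clipping is ignored. The function name is hypothetical and this is not how MGmapper itself is implemented.

```python
from statistics import mean, median


def insert_size(fwd_start, rev_start, rev_aligned_len):
    """5'-to-5' distance (inclusive) for a forward/reverse pair on one reference.

    fwd_start:       leftmost position of the forward read, which is its 5' end
    rev_start:       leftmost position of the reverse read
    rev_aligned_len: number of reference bases the reverse read covers
    """
    # The 5' end of a reverse-strand read is its rightmost aligned position.
    rev_five_prime = rev_start + rev_aligned_len - 1
    return rev_five_prime - fwd_start + 1


# Invented example pairs: (forward start, reverse start, reverse aligned length)
PAIRS = [(100, 351, 100), (2000, 2220, 100), (5000, 5310, 90)]
SIZES = [insert_size(*p) for p in PAIRS]

if __name__ == "__main__":
    print("insert sizes:", SIZES)
    print("mean:", mean(SIZES), "median:", median(SIZES))
```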

Annotation of fastq reads via MGmapper
Local alignment of fastq reads against reference sequences via bwa
  Best alignment, but not all alignments
Taxonomy annotation
  Ref seq names have been assigned full taxonomy (.tax and .kch in )

MGmapper options command line version

databases.txt

MGmapper options command line version

Command
xqsub -V -d workDir -l nodes=1:ppn=8,mem=40gb,walltime=2:00:00 -de \
  MGmapper_PE.pl -c 8 \
  -i /home/databases/metagenomics/fastq/China_2_500k_R1.fastq.gz \
  -j /home/databases/metagenomics/fastq/China_2_500k_R2.fastq.gz \
  -C 1,3,4,5,11 -F 6 -d outputDir -1 'touch yt.PE'
Abundance.databases.txt

Output files

Post-processing *.annot files

Output files

Numbers
Strain abundance (paired-end): Abundance (%) = 100*readCount/size*2
Strain abundance (single-end): Abundance (%) = 100*readCount/size
Species abundance: Abundance_species (%) = Σ Abundance_strain
Covered_positions = number of positions in a ref seq that are observed at >= 1X
Coverage = covered_positions/size
Depth = nucleotides/size
ReadCountUniq = reads where AS > XS, where AS is the alignment score and XS is the score of the second-best hit
Size = number of bp in the reference sequence
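
The formulas on this slide can be tried out with a few lines of Python. This is a sketch with invented numbers, not MGmapper's code. One assumption to flag: the paired-end formula is read literally, left to right, so the single-end value is multiplied by 2. All function names are hypothetical.

```python
def strain_abundance(read_count, size, paired_end=False):
    """Abundance (%) as on the slide.

    Single-end: 100 * readCount / size
    Paired-end: 100 * readCount / size * 2 (read left to right; an assumption)
    """
    abundance = 100 * read_count / size
    return abundance * 2 if paired_end else abundance


def species_abundance(strain_abundances):
    """Species abundance is the sum of the abundances of its strains."""
    return sum(strain_abundances)


def coverage(covered_positions, size):
    """Fraction of reference positions observed at >= 1X."""
    return covered_positions / size


def depth(nucleotides, size):
    """Average number of mapped nucleotides per reference position."""
    return nucleotides / size


if __name__ == "__main__":
    size = 5_000_000  # invented reference length in bp
    print(strain_abundance(500, size))                   # single-end
    print(strain_abundance(500, size, paired_end=True))  # paired-end
    print(coverage(4_000_000, size), depth(75_000, size))
```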

Benchmark dataset: 11 bacterial species spiked in distilled water
15 classification methods tested
No abundance threshold => many false positives

Abundance thresholds FW in-vitro dataset, <233bp> (table 3, species)
Peabody et al. BMC Bioinformatics (2015) 16:363

Abundance thresholds FW in-silico dataset <250bp> (table 4, species)

Reliability
What criteria should be used to extract reliable results?
Table columns: refSeq, db, Size, nucleotides, reads, reads_uniq, nm, Name
Bacteria

Exercise
Run MGmapper_PE.pl against a small air-toilet sample (China)
Learn to use MGmapper_classify.pl to extract reliable hits
  Collapse hits between different databases
  Collapse hits at different taxonomy levels
  Abundance criteria
  uniqReadCount fraction
  misMatch fraction
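
To get a feeling for what such criteria do before running the classifier, the Python sketch below filters a couple of invented hits. It does not reproduce MGmapper_classify.pl or its options. It assumes that the uniqReadCount fraction is reads_uniq/reads and that the misMatch fraction is nm/nucleotides, using the column names from the Reliability slide. Only the "uniq reads > 10" cutoff comes from the benchmark slides; the other thresholds are hypothetical.

```python
def is_reliable(hit,
                min_abundance=0.01,          # hypothetical threshold (%)
                min_uniq_reads=10,           # "uniq reads > 10" from the benchmark slides
                min_uniq_fraction=0.1,       # hypothetical threshold
                max_mismatch_fraction=0.05): # hypothetical threshold
    """Return True when a hit passes all four illustrative criteria."""
    abundance = 100 * hit["reads"] / hit["Size"]       # single-end formula
    uniq_fraction = hit["reads_uniq"] / hit["reads"]   # assumed definition
    mismatch_fraction = hit["nm"] / hit["nucleotides"] # assumed definition
    return (abundance >= min_abundance
            and hit["reads_uniq"] > min_uniq_reads
            and uniq_fraction >= min_uniq_fraction
            and mismatch_fraction <= max_mismatch_fraction)


# Invented example rows; the names are placeholders, not real results.
EXAMPLE_HITS = [
    {"Name": "strain A", "Size": 5_000_000, "nucleotides": 150_000,
     "reads": 1000, "reads_uniq": 400, "nm": 1500},
    {"Name": "strain B", "Size": 4_000_000, "nucleotides": 1800,
     "reads": 12, "reads_uniq": 2, "nm": 200},
]

if __name__ == "__main__":
    for hit in EXAMPLE_HITS:
        print(hit["Name"], "reliable" if is_reliable(hit) else "rejected")
```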

Abundance thresholds FW in-vitro dataset, <233bp> (table 3)
MGmapper_SE (uniq reads > 10)
Peabody et al. BMC Bioinformatics (2015) 16:363

Abundance thresholds FW in-silico dataset <250bp> (table 4)
MGmapper_PE (uniq reads > 10)
* Nocardioides sp. JS614 included by MGmapper_PE benchmark, but not in table

Fastq files
A_L01_R1.fq A_L02_R1.fq
A_L01_R2.fq A_L02_R2.fq
Options: -I listF -J listR
Options: -i A_L01_R1.fq -j A_L01_R2.fq

Trimming
Raw Fastq => trim, adaptor removal (cutadapt) => Trimmed
cutadapt -f fastq -q 30 -m 30 -b "Illumina adaptors" infile > outfile
Infile.R1 Outfile.R1 => Outfile.R2
Option: -S : skip cutadapt [off]

Reads In Common
Only in paired-end mode
Trimmed Fastq => identify pairs (readsInCommon) => Trimmed & paired
ReadsInCommon -i all.R1 -j all.R2 -a InCommon.R1 -b InCommon.R2
all.R1 => all.R2 InCommon.R1 InCommon.R2
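
The idea behind this step, keeping only reads whose mate survived trimming, can be sketched in Python as below. This is not the ReadsInCommon program. It assumes that mates share a read name once a trailing /1 or /2 and any header comment are stripped, and the function names are hypothetical.

```python
def read_id(header):
    """Read name without the leading '@', header comment, or /1 and /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def parse_fastq(text):
    """Split FASTQ text into 4-line records (header, sequence, '+', qualities)."""
    lines = text.strip().splitlines()
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]


def reads_in_common(records_r1, records_r2):
    """Return the R1 and R2 records present in both inputs, in the same order."""
    by_id_r2 = {read_id(rec[0]): rec for rec in records_r2}
    keep_r1 = [rec for rec in records_r1 if read_id(rec[0]) in by_id_r2]
    keep_r2 = [by_id_r2[read_id(rec[0])] for rec in keep_r1]
    return keep_r1, keep_r2


# Invented example: read b lost its mate, read d lost its mate.
R1_TEXT = "@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n@c/1\nACGT\n+\nIIII\n"
R2_TEXT = "@c/2\nTTGA\n+\nIIII\n@a/2\nTTGA\n+\nIIII\n@d/2\nTTGA\n+\nIIII\n"

if __name__ == "__main__":
    r1, r2 = reads_in_common(parse_fastq(R1_TEXT), parse_fastq(R2_TEXT))
    print([rec[0] for rec in r1])
    print([rec[0] for rec in r2])
```

Writing both outputs in the same read order matters because bwa mem pairs the two files record by record.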

Remove positive control reads
Fastq (ReadsInCommon) => bwa mem against PhiX database => notPhix bam
bwa mem -t cores phiXdb fastqF fastqR | samtools view -f1 -f12 -Sb - | cat > notPhiX.bam
  -f1 (read paired)
  -f4 (read unmapped)
  -f8 (mate unmapped)
PhiX: bacteriophage PhiX 174
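
The flag values listed above are bits of the SAM FLAG field, and samtools view -f keeps a record only when all requested bits are set. The Python sketch below shows that bit test on a few invented flag values. The helper name is hypothetical, and the sketch illustrates the flag arithmetic only, not the full pipeline.

```python
# Bit values from the SAM specification
PAIRED = 0x1          # read paired
UNMAPPED = 0x4        # read unmapped
MATE_UNMAPPED = 0x8   # mate unmapped


def has_all_bits(flag, required):
    """True when every bit in 'required' is set in 'flag' (the -f test)."""
    return (flag & required) == required


# 12 is the combination of "read unmapped" and "mate unmapped".
BOTH_UNMAPPED = UNMAPPED | MATE_UNMAPPED

if __name__ == "__main__":
    # 77 = paired + unmapped + mate unmapped + first in pair
    # 99 = paired + proper pair + mate reverse strand + first in pair
    # 73 = paired + mate unmapped + first in pair
    for flag in (77, 99, 73):
        print(flag, has_all_bits(flag, PAIRED), has_all_bits(flag, BOTH_UNMAPPED))
```

A pair where neither read aligns to PhiX carries both unmapped bits, which is why such reads are the ones passed on to the metagenomic mapping.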
