DNAseq analysis Bioinformatics Analysis Team


DNAseq analysis Bioinformatics Analysis Team McGill University and Genome Quebec Innovation Center bioinformatics.service@mail.mcgill.ca


What is DNAseq? DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

Why DNAseq? Whole genome sequencing: whole-genome SNV detection; structural variants; captures regulatory region information; cancer analysis; de novo genome assembly. Whole exome sequencing: cheaper; captures coding region information; rare disease analysis.

What is the DNAseq problem about? The pieces are strings of 100 to ≈1,000 letters. The puzzle is 3,000,000,000 letters long. You usually have 120,000,000,000 letters to fit. Many pieces don't fit: sequencing errors, SNPs, structural variants. Many pieces fit in many places: low-complexity regions, microsatellites, repeats.
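The numbers on this slide imply an average sequencing depth, which can be worked out directly. The sketch below only divides the figures quoted above; the function names are illustrative and not part of any pipeline.

```python
# Figures quoted on the slide.
GENOME_SIZE = 3_000_000_000        # letters in the puzzle (genome length)
SEQUENCED_BASES = 120_000_000_000  # letters you need to fit (total read bases)


def mean_depth(total_bases, genome_size):
    """Average number of times each genome position is covered."""
    return total_bases / genome_size


def number_of_reads(total_bases, read_length):
    """How many reads of a given length make up the sequenced bases."""
    return total_bases // read_length


depth = mean_depth(SEQUENCED_BASES, GENOME_SIZE)
print(f"mean depth: {depth:.0f}x")  # mean depth: 40x
print(f"100 bp reads: {number_of_reads(SEQUENCED_BASES, 100):,}")
```

With 100-letter pieces this is more than a billion reads, which is why mapping speed and memory use matter later in the workflow.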

DNAseq overview

DNAseq: Input Data

Input Data: FASTQ. End 1: Sample1_R1.fastq.gz. End 2: Sample1_R2.fastq.gz. Each sample will generate between 5Gb (100x WES) and 300Gb (100x WGS) of data.

Q = -10 log_10(p), where Q is the base quality score and p is the probability that the base call is incorrect. For example, Q = 30 corresponds to p = 0.001 (one error in 1,000 bases).
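The quality formula can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the ASCII offset of 33 used to decode a FASTQ quality string is an assumption (the common Phred+33 encoding), not something stated on the slide.

```python
import math


def quality_to_error_prob(q):
    """Invert Q = -10 log_10(p): return p for a given quality Q."""
    return 10 ** (-q / 10)


def error_prob_to_quality(p):
    """Apply Q = -10 log_10(p)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into integer scores.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return [ord(char) - offset for char in qual]


for q in (10, 20, 30, 40):
    print(q, quality_to_error_prob(q))
```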

QC of raw sequences

QC of raw sequences: low-quality bases can bias subsequent analysis (e.g., SNP and SV calling, …)

QC of raw sequences: positional base content
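Positional base content is the fraction of each nucleotide observed at each read position. A minimal sketch of the computation is given below; the function name and the toy reads are hypothetical, and real QC tools compute this over millions of reads.

```python
from collections import Counter


def positional_base_content(reads):
    """Return one dict per read position mapping base -> fraction of reads."""
    counts = []  # counts[i] holds the base counts at read position i
    for read in reads:
        for i, base in enumerate(read):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return [
        {base: n / sum(counter.values()) for base, n in counter.items()}
        for counter in counts
    ]


# Toy input, not real data.
toy_reads = ["ACGT", "ACGA", "TCG"]
for position, content in enumerate(positional_base_content(toy_reads)):
    print(position, content)
```

A strong deviation from the expected composition at particular positions is the kind of signal this QC plot is meant to expose.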


QC of raw sequences: species composition (via BLAST)

DNA-Seq: Trimming and aligning

Read Filtering (Trimmomatic, usadellab.org): clip Illumina adapters; trim trailing bases with quality < 30; keep only reads of length ≥ 32 bp.
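The trailing-quality trim and the length filter can be sketched in a few lines. This illustrates the logic only and is not the trimming tool's implementation: adapter clipping is left out, and the function names are hypothetical. The thresholds (30 and 32) are the ones quoted on the slide.

```python
def trim_trailing(seq, quals, min_quality=30):
    """Remove bases from the 3' end while their quality is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def filter_read(seq, quals, min_quality=30, min_length=32):
    """Trim the read, then drop it (return None) if it became too short."""
    seq, quals = trim_trailing(seq, quals, min_quality)
    if len(seq) < min_length:
        return None
    return seq, quals


# Toy read: 35 good bases followed by 5 low-quality bases.
kept = filter_read("A" * 40, [35] * 35 + [10] * 5)
print(len(kept[0]))  # 35
```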

Assembly vs. Mapping. Mapping: all reads vs the reference. Assembly: all reads vs all reads, producing contigs (e.g. contig1).

DNA-seq: Assembly vs Mapping. Reference-based mapping: DNA-seq reads are aligned to the reference genome. De novo assembly: reads are assembled into contigs (contig1, contig2).

Read Mapping. The mapping problem is challenging: millions of short reads must be mapped to a genome; the genome is a text with billions of letters; many mapping locations are possible; matching is NOT exact because of sequencing errors and biological variants (substitutions, insertions, deletions, splicing). Clever use of the Burrows-Wheeler Transform increases speed and reduces memory footprint. Mapper used: BWA. Other mappers: Bowtie, STAR, GEM, etc.
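To give an idea of how the Burrows-Wheeler Transform helps, the sketch below builds the transform naively (by sorting all rotations) and counts exact occurrences of a pattern with backward search. It is a teaching sketch under simplifying assumptions: exact matching only, a quadratic-memory construction, and no precomputed occurrence tables. It is not how BWA is implemented.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c]: index of the first row starting with character c
    first = {}
    for i, char in enumerate(sorted(bwt_string)):
        first.setdefault(char, i)

    def occ(char, i):
        # Occurrences of char in bwt_string[:i]; real indexes precompute this.
        return bwt_string[:i].count(char)

    lo, hi = 0, len(bwt_string)  # current range of matching rows
    for char in reversed(pattern):
        if char not in first:
            return 0
        lo = first[char] + occ(char, lo)
        hi = first[char] + occ(char, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGT")
print(count_occurrences(index, "ACG"))  # 2
```

The search touches the pattern one letter at a time and never scans the genome itself, which is where the speed and memory gains come from.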

SAM/BAM (SAM: Sequence Alignment/Map format). Used to store alignments. SAM = text, BAM = binary. Sample1.bam, Sample2.bam: between 10Gb and 500Gb per BAM file.
Example record: SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@> …
Fields labeled on the slide: read name, flag, reference, position, CIGAR, mate position, bases, base qualities.
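The flag column packs several yes/no properties into one integer. The sketch below decodes it and splits the leading columns of the example record. The bit meanings follow the SAM specification; the function names are illustrative, and splitting on any whitespace is a convenience for the slide text (real SAM files are tab-separated).

```python
# Bit values defined by the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "QC fail",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}

CORE_FIELDS = ["qname", "flag", "rname", "pos", "mapq",
               "cigar", "rnext", "pnext", "tlen"]


def decode_flag(flag):
    """List the properties whose bit is set in the flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_core_fields(line):
    """Parse the first nine columns of an alignment line into a dict."""
    record = dict(zip(CORE_FIELDS, line.split()[:9]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


record = parse_core_fields("SRR013667.1 99 19 8882171 60 76M = 8882214 119")
print(record["rname"], record["pos"], decode_flag(record["flag"]))
```

For the example record, flag 99 = 1 + 2 + 32 + 64: a paired read, properly paired, whose mate is on the reverse strand, and which is the first read of the pair.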

The BAM/SAM format: sort, view, index, statistics, etc. Tools: samtools.sourceforge.net, picard.sourceforge.net
$ samtools flagstat C1.bam
110247820 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
110247820 + 0 mapped (100.00%:nan%)
110247820 + 0 paired in sequencing
55137592 + 0 read1
55110228 + 0 read2
93772158 + 0 properly paired (85.06%:nan%)
106460688 + 0 with itself and mate mapped
3787132 + 0 singletons (3.44%:nan%)
1962254 + 0 with mate mapped to a different chr
738766 + 0 with mate mapped to a different chr (mapQ>=5)
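The percentages in the flagstat output are each count divided by the total number of reads. The short sketch below recomputes them from the counts shown above; the dictionary keys and the helper name are illustrative.

```python
# QC-passed counts copied from the flagstat output above.
flagstat_counts = {
    "total": 110247820,
    "mapped": 110247820,
    "properly paired": 93772158,
    "singletons": 3787132,
}


def percent_of_total(count, total):
    """Percentage rounded to two decimals, as flagstat prints it."""
    return round(100 * count / total, 2)


for name in ("mapped", "properly paired", "singletons"):
    value = percent_of_total(flagstat_counts[name], flagstat_counts["total"])
    print(f"{name}: {value:.2f}%")
```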

DNA-Seq: metrics

Included metrics: metrics are collected from the output of Trimmomatic, Samtools and Picard.

DNA-Seq: Alignment refinement

Local indel realignment. Primary alignment with BWA (bio-bwa.sourceforge.net). Local realignment around indels with GATK. Possible mate inconsistencies are fixed using Fixmate. Image: broadinstitute.org/gatk

Duplicates and recalibration. Mark duplicates with Picard. Base Quality Score Recalibration with GATK. Example: bias in the reported qualities depending on the nucleotide context. Image: broadinstitute.org/gatk

DNA-Seq: SNV calling

Single Nucleotide Variant calling. Aim: differentiate real SNPs from sequencing errors. Accurate SNP discovery depends on good base quality and sufficient depth of coverage. Figure labels: sequencing errors, SNP.

SNP and genotype calling workflow. Variants from multiple samples are called simultaneously using the mpileup method from samtools and quality filtered using bcftools. Figure labels: Bayesian approaches, MLE approaches. Figure from Nielsen et al., June 2011.
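The difference between the maximum-likelihood and Bayesian approaches can be illustrated with a toy diploid genotype caller. This is a deliberately simplified sketch and not the model used by samtools or bcftools: it assumes a single fixed error rate for every base, independent reads, and one biallelic site. All names and default values are hypothetical.

```python
import math

GENOTYPES = (("0/0", 0), ("0/1", 1), ("1/1", 2))  # (label, alt allele count)


def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.01):
    """log P(reads | genotype) under a uniform per-base error rate."""
    log_likelihoods = {}
    for label, alt_dose in GENOTYPES:
        # Probability that a single read shows the alternate allele.
        p_alt = (alt_dose / 2) * (1 - error_rate) + (1 - alt_dose / 2) * error_rate
        log_likelihoods[label] = (
            n_alt * math.log(p_alt) + n_ref * math.log(1 - p_alt)
        )
    return log_likelihoods


def call_genotype(n_ref, n_alt, error_rate=0.01, priors=None):
    """Return (best genotype, posterior probabilities).

    With flat priors (the default) the call equals the maximum-likelihood
    genotype; informative priors give the Bayesian call.
    """
    log_lik = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    if priors is None:
        priors = {label: 1 / 3 for label in log_lik}
    log_post = {g: log_lik[g] + math.log(priors[g]) for g in log_lik}
    shift = max(log_post.values())  # subtract the max for numerical stability
    unnormalised = {g: math.exp(v - shift) for g, v in log_post.items()}
    total = sum(unnormalised.values())
    posteriors = {g: v / total for g, v in unnormalised.items()}
    return max(posteriors, key=posteriors.get), posteriors


print(call_genotype(n_ref=5, n_alt=5)[0])
print(call_genotype(n_ref=9, n_alt=1)[0])
```

In this toy model a single alternate read among ten is better explained as a sequencing error than as a heterozygous site, which is the distinction the SNV calling step has to make.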

The variant format: VCF (Variant Call Format). The FORMAT column defines ":"-separated values, e.g. GT = genotype, DP = depth, … Specification: vcftools.sourceforge.net/specs.html
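Because the FORMAT column names the ":"-separated values of each sample column, the two can be paired up directly. The sketch below shows this; the function name and the sample values are made up for illustration.

```python
def parse_sample_field(format_column, sample_column):
    """Pair FORMAT keys with the matching values of one sample column."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    return dict(zip(keys, values))


# Hypothetical sample entry: heterozygous genotype, depth 23.
sample = parse_sample_field("GT:DP", "0/1:23")
print(sample["GT"], int(sample["DP"]))
```

Values are kept as strings here; a real parser would convert them according to the types declared in the VCF header.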

VCF visualization in IGV broadinstitute.org/igv/viewing_vcf_files

DNA-Seq: SNV annotation and metrics

Variant annotation. Hypo- or hyper-mappability flag: marks SNVs in low-confidence regions. dbSNP [SnpSift]: marks already known variants. Variant effects [SnpEff]: predicts the effects of variants on genes (such as amino acid changes). dbNSFP [SnpSift]: functional annotations of the change. COSMIC [SnpSift]: known somatic mutations.

SNV statistics. Statistics are generated from the SnpEff stats outputs. Example of one of the SNV metrics graphs.

DNA-Seq: Generate report

Generate report (home-made Rscript). Files generated: a Nozzle-based HTML report that describes the entire analysis and provides QC, summary statistics and the entire set of results; index.html, with links to detailed statistics and plots. For examples of reports generated with our pipeline, please visit our website.