From Reads to Results Exome-seq analysis at CCBR


1 From Reads to Results Exome-seq analysis at CCBR
Justin Lack, March 8, 2015

2 Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation

3 Workflow for Data Analysis
Read Generation (FASTQ format, QC analysis, read trimming) → Read Mapping → BAM Processing → Variant Calling → Variant Annotation

4 FASTQ Data Format
Each FASTQ record holds a sequence ID, the sequence, and a quality score for every base. Quality scores are Phred-scaled: Q10 means a 1/10 error rate, Q20 means 1/100, etc.
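The Phred scale on this slide follows Q = -10 * log10(P), where P is the probability that the base call is wrong. A minimal sketch of that conversion and of reading one four-line FASTQ record is given here; the function names are illustrative, and the Phred+33 ASCII offset is an assumption (it is the common modern encoding, but older files can use +64).

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into a Phred quality Q = -10*log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in qual]


def parse_fastq_record(lines):
    """Parse one FASTQ record given as four lines: @ID, sequence, +, qualities."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], seq, decode_quality_string(qual)
```

For example, the quality characters `I`, `5`, and `+` decode to Q40, Q20, and Q10 under Phred+33.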

5 Read Quality Assessment
Read quality analysis: crucial to ensure high-quality data; can reveal issues in library preparation and sequence generation.

6 Read Quality Assessment
Read trimming: trims reads for both adapter contamination and low quality; absolutely essential for variant detection.

7 Read Quality Assessment
FastQC and trimming (Trimmomatic)
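To make the two trimming goals concrete (adapter contamination and low quality), here is a simplified sketch. It is not Trimmomatic's implementation: the sliding-window rule, the exact-match adapter search, and the default parameters are assumptions chosen for illustration.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Cut the read at the start of the first window whose mean quality
    falls below min_avg, scanning from the 5' end."""
    n = len(seq)
    if n < window:
        # Read shorter than one window: keep or drop it as a whole.
        if n and sum(quals) / n >= min_avg:
            return seq, quals
        return "", []
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals


def clip_adapter(seq, quals, adapter, min_overlap=5):
    """Remove an adapter and everything after it.

    Handles a full adapter anywhere in the read, or a partial adapter
    (a prefix of at least min_overlap bases) at the 3' end.
    Exact matching only; real tools tolerate mismatches.
    """
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx], quals[:idx]
    for k in range(min(len(adapter), len(seq)), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k], quals[:-k]
    return seq, quals
```

A typical order is to clip adapters first and then quality-trim, so that adapter bases with high quality scores do not survive the quality filter.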

8 Workflow for Data Analysis
Read Generation → Read Mapping (map reads to reference genome, alignment QC) → BAM Processing → Variant Calling → Variant Annotation

9 Read Mapping Challenge:
Compare billions of short sequence reads against the human genome (3 Gb)

10 Different Alignment Algorithms
BWA (2009), BWA-SW (2010), BWA-MEM (2013), Bowtie (2009), Bowtie2 (2012), Gem (2012), Cushaw2 (2014), Novoalign. Source: Li, arXiv (2013)

12 SAM/BAM Format
SAM (Sequence Alignment/Map) format: a single unified format for storing read alignments to a reference genome. BAM (Binary Alignment/Map) format: the binary equivalent of SAM. Advantages: supports indexing, compact size.
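As an illustration of what a SAM alignment line contains, the sketch below splits one tab-separated line into the 11 mandatory fields defined by the SAM specification and checks a few FLAG bits. The class and function names are illustrative, and the example read in the usage note is made up.

```python
from dataclasses import dataclass, field

# Selected FLAG bits from the SAM specification.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_DUPLICATE = 0x400


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)
    tags: list = field(default_factory=list)  # optional TAG:TYPE:VALUE fields

    def has_flag(self, bit: int) -> bool:
        return bool(self.flag & bit)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("SAM alignment lines need 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])
```

Calling `parse_sam_line("read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0")` yields a paired, forward-strand record at chr1:100. In practice BAM files are read through a library such as htslib rather than parsed by hand.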

13 BAM File Format Header Data

14 Alignment QC
Crucial for examining and summarizing quality of alignment at exome targets. Tool shown: GATK Depth of Coverage

15 Alignment QC
Crucial for examining and summarizing quality of alignment at exome targets. Tool shown: Qualimap
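The kind of summary these alignment QC tools report at exome targets can be sketched as follows: per-base depth over a target interval, then mean depth and the fraction of target bases covered at or above chosen thresholds. This is an illustrative toy, not how GATK or Qualimap compute their reports; intervals are assumed to be half-open (start, end) on one chromosome.

```python
def per_base_depth(target, reads):
    """Depth at each base of a half-open target interval (start, end),
    given reads as half-open (start, end) intervals on the same chromosome."""
    t_start, t_end = target
    depths = [0] * (t_end - t_start)
    for r_start, r_end in reads:
        lo = max(r_start, t_start)
        hi = min(r_end, t_end)
        for pos in range(lo, hi):
            depths[pos - t_start] += 1
    return depths


def coverage_summary(depths, thresholds=(10, 20, 30)):
    """Mean depth and fraction of bases with depth >= each threshold."""
    n = len(depths)
    return {
        "mean_depth": sum(depths) / n,
        "fraction_at_or_above": {
            t: sum(d >= t for d in depths) / n for t in thresholds
        },
    }
```

Low mean depth or a small fraction of bases above the chosen threshold at a target usually indicates poor capture at that exon.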

16 BAM Visualization - IGV
Integrative Genomics Viewer (IGV) screenshot; labeled features: mismatches, reference

17 Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing (BAM/SAM alignment improvement) → Variant Calling → Variant Annotation

18 BAM Improvement
Short-read mappers are designed to balance accuracy and speed. The algorithm can result in errors, especially at challenging indels. Tools are designed to target specific systematic errors: remove duplicates, local realignment, base quality recalibration.

19 Library Duplicates
Next-generation sequencing platforms are generally not single-molecule sequencers: library preparation includes a PCR amplification step, which can result in duplicate DNA fragments in the final library prep. PCR-free protocols do exist but require large amounts of input DNA. Duplicates can result in false SNP calls; they manifest themselves as high read-depth support.

20 Duplicates and False SNP Calls

21 Remove Duplicates
Identify read pairs whose outer ends map to the same position on the genome and remove all but one copy. Samtools: samtools rmdup or samtools rmdupse. Picard/GATK: MarkDuplicates.
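The rule on this slide can be sketched in a few lines: group read pairs by the coordinates of their outer ends and keep one pair per group. The sketch below is a simplification, not the samtools or Picard implementation; the `ReadPair` fields are hypothetical, and keeping the pair with the highest summed base quality is an assumed tie-break.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    left: int               # leftmost outer coordinate of the pair
    right: int              # rightmost outer coordinate of the pair
    orientation: str        # e.g. "FR"
    base_quality_sum: int   # summed base qualities of both mates


def find_duplicates(pairs):
    """Return the names of pairs to flag as duplicates.

    Pairs sharing chromosome, both outer coordinates, and orientation are
    treated as copies of one fragment; the highest-quality copy is kept.
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair.chrom, pair.left, pair.right, pair.orientation)
        groups[key].append(pair)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.base_quality_sum)
        duplicates.update(p.name for p in group if p is not best)
    return duplicates
```

Flagging rather than deleting duplicates keeps the information in the BAM file while letting downstream callers ignore those reads.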

22 Local Realignment - indels
The trouble with mapping approaches

23 Local Realignment - indels
The trouble with mapping approaches

24 Local Realignment - indels
The trouble with mapping approaches

25 Local Realignment - indels

26 Local realignment in GATK
Uses information from known SNPs/indels (dbSNP, 1000 Genomes). Uses information from other reads. Smith-Waterman exhaustive alignment on select reads. Similar to GATK Haplotype Caller.
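Smith-Waterman is the classic dynamic-programming algorithm for local alignment. A textbook sketch with a linear gap penalty follows; it is not GATK's implementation, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The matrix fill costs time proportional to the product of the two sequence lengths, which is why exhaustive alignment is reserved for selected reads around suspected indels rather than applied genome-wide.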

27 Quality scores issued by sequencers are inaccurate and biased
Quality scores are critical for all downstream analysis. Systematic biases are a major contributor to bad calls.

28 Base Quality Recalibration
Sequence context refers to base composition skews

29 Base Quality Recalibration in GATK
Align a subsample of reads from a lane to the human reference. Exclude all known dbSNP sites. Assume all other mismatches are sequencing errors. Compute a new calibration table based on mismatch rates per position on the read.
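The recalibration steps above can be sketched as a counting exercise: tally mismatches per (reported quality, read position) bin while skipping known variant sites, then convert each bin's mismatch rate into an empirical Phred score. This is a toy illustration, not GATK's model; the input tuple layout is hypothetical, and the +1/+2 smoothing is an assumption added to avoid taking the log of zero.

```python
import math
from collections import defaultdict


def build_recalibration_table(observations, known_sites):
    """Build {(reported_q, cycle): empirical_q}.

    observations: iterable of (chrom, pos, cycle, reported_q, is_mismatch),
                  one entry per aligned base.
    known_sites:  set of (chrom, pos) for known variants (e.g. from dbSNP);
                  bases there are skipped because a mismatch may be real.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for chrom, pos, cycle, reported_q, is_mismatch in observations:
        if (chrom, pos) in known_sites:
            continue
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        error_rate = (mismatches + 1) / (total + 2)  # smoothed estimate
        table[key] = -10 * math.log10(error_rate)
    return table
```

If bases reported as Q30 at a given read position mismatch about 10% of the time, the table assigns that bin an empirical quality near Q10, which is the correction recalibration then applies.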

30 Base Quality Recalibration

31 Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling (germline variant detection, somatic variant detection, VCF files) → Variant Annotation

32 Germline Variant Detection
Mutations are hidden in the noise!

33 Germline Variant Detection
Mutations are hidden in the noise! Utilize GATK Haplotype Caller

34 Germline Variant Detection
Mutations are hidden in the noise! Utilize GATK Haplotype Caller Genotype jointly to maximize information

35 Germline Variant Detection
Mutations are hidden in the noise! Utilize GATK Haplotype Caller Genotype jointly to maximize information
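One way to see how a caller separates a real variant from noise is a simple diploid genotype likelihood computed from the bases and qualities at one site. The sketch below is a textbook per-site model, not the Haplotype Caller algorithm (which assembles haplotypes and genotypes samples jointly); it assumes independent reads, errors spread evenly over the three other bases, and no prior.

```python
import math


def genotype_log10_likelihoods(bases, quals, ref, alt):
    """log10 P(observed bases | genotype) for 0/0, 0/1, and 1/1."""
    result = {}
    for genotype, alt_fraction in (("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0)):
        total = 0.0
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)
            # Probability of seeing this base if the read came from each allele.
            p_from_ref = 1 - err if base == ref else err / 3
            p_from_alt = 1 - err if base == alt else err / 3
            total += math.log10((1 - alt_fraction) * p_from_ref
                                + alt_fraction * p_from_alt)
        result[genotype] = total
    return result


def most_likely_genotype(bases, quals, ref, alt):
    likelihoods = genotype_log10_likelihoods(bases, quals, ref, alt)
    return max(likelihoods, key=likelihoods.get)
```

With 5 reference and 5 alternate reads at Q30 the heterozygous genotype wins, while a single alternate read among 30 is better explained as a sequencing error.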

36 Somatic Variant Detection
Genes and chromosomes can mutate in either somatic or germline tissue Mutation Detection

37 An Example of Germline Variants
Robinson et al. 2011

38 An Example of Somatic Variants
Normal Tumor

39 Somatic Variant Detection
But somatic variant detection can be EXTREMELY difficult: allelic fractions do not scale to ploidy.
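A short calculation shows why somatic allelic fractions depart from the 0, 0.5, or 1 expected for germline diploid genotypes. The sketch below uses the standard mixture formula for a clonal mutation; the parameter names are illustrative, and subclonality is ignored.

```python
def expected_somatic_vaf(purity, mutant_copies, tumor_copy_number,
                         normal_copy_number=2):
    """Expected variant allele fraction of a clonal somatic mutation.

    purity:             fraction of cells in the sample that are tumor cells
    mutant_copies:      copies of the locus carrying the mutation per tumor cell
    tumor_copy_number:  total copies of the locus per tumor cell
    normal_copy_number: total copies of the locus per normal cell
    """
    mutant = purity * mutant_copies
    total = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    return mutant / total
```

A heterozygous mutation in a pure diploid tumor gives 0.5, but at 40% purity the same mutation is expected at 0.2, and a single mutant copy in a region amplified to four copies at 50% purity is expected near 0.17.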

40 Somatic Variant Detection
But somatic variant detection can be EXTREMELY difficult. Multiple additional sources of errors: low depth and/or tumor-contaminated normal; noise vs. event.

41 MuTect2
Somatic caller that attempts to account for and model all of these sources of errors

42 Variant Call Format (VCF)
VCF is a standardized format for storing DNA polymorphism data: SNPs, insertions, deletions, and structural variants, with rich annotations. It is indexed for fast retrieval of variants from a range of positions, stores variant information across many samples, and records metadata about each site (dbSNP accession, filter status, validation status). A very flexible format.
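To show how the pieces listed above fit together in one record, here is a minimal sketch that parses a single tab-separated VCF data line into its eight fixed columns plus per-sample fields. The function name and the example record in the usage note are made up; production code would normally use a dedicated VCF library and read sample names from the `#CHROM` header line.

```python
def parse_vcf_line(line, sample_names=()):
    """Parse one VCF data line (not a header line starting with '#')."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 8:
        raise ValueError("VCF data lines need at least 8 columns")
    chrom, pos, vid, ref, alt, qual, flt, info = f[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    # FORMAT (column 9) names the ':'-separated fields in each sample column.
    samples = {}
    if len(f) > 9:
        keys = f[8].split(":")
        for name, column in zip(sample_names, f[9:]):
            samples[name] = dict(zip(keys, column.split(":")))

    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": info_dict, "samples": samples,
    }
```

For a line such as `chr1  12345  rs123  A  G  50  PASS  DP=100;DB  GT:DP  0/1:40  1/1:60` (tab-separated, with a made-up ID), the ID column carries the dbSNP accession, FILTER carries the filter status, and each sample column carries that sample's genotype.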

43 Example VCF

44 Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation (genome annotation databases, AVIA…)

45 Annotation and Functional Prediction

46 dbSNP
dbSNP is a free public archive for genetic variation within and across different species, developed by NCBI (Sherry, Genome Res. 1999)

47 1000 Genomes Project
15 million SNPs, 1 million short insertions/deletions, 20,000 structural variants (The 1000 Genomes Project Consortium, Nature 2010)

48 COSMIC
COSMIC is the most comprehensive resource for exploring the impact of somatic mutations in human cancer (Forbes, Nucleic Acids Research 2015)

49 COSMIC

50 Lots, lots more in AVIA!

51 Thank you! Any Questions?

