Genome & Exome Sequencing Read Mapping Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.

Whole Genome Sequencing
Usually needs 30-50X coverage (about 3 lanes of 100 bp paired-end HiSeq 2000 sequencing)
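The coverage figure can be turned into a read count with the usual relation coverage = (number of reads x read length) / genome size. The sketch below is only a back-of-the-envelope illustration: the 3 Gb genome size is an assumption borrowed from the read mapping slide, the function names are made up for this example, and it ignores duplicates, unmappable reads and uneven coverage.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: ~3 Gb human genome


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Smallest number of reads giving at least the requested mean coverage."""
    total_bases = genome_size_bp * coverage
    # Integer ceiling division, so the result never falls short of the target.
    return (total_bases + read_length_bp - 1) // read_length_bp


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Mean coverage implied by a given number of reads."""
    return n_reads * read_length_bp / genome_size_bp


n_reads = reads_needed(GENOME_SIZE_BP, 30, 100)
print(n_reads)        # individual 100 bp reads for 30X
print(n_reads // 2)   # the same amount expressed as read pairs
print(mean_coverage(n_reads, 100, GENOME_SIZE_BP))
```

Under these assumptions 30X corresponds to 900 million 100 bp reads, or 450 million read pairs.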

Exome Sequencing

Exome Sequencing
Solution Hybrid Selection: probes in solution can capture all exons (the exome) for high-throughput sequencing
Target is 1-2% of the whole genome sequence
Easily multiplex 20 samples in one lane

Comparative Sequencing
Somatic mutation detection between normal / cancer pairs
WGS or WES
More mutation yield and better causal gene identification than Mendelian disorders
Meyerson et al, Nat Rev Genet 2010

Hallmark of Mendelian Disease Gene Discovery
Gilissen, Genome Biol 2011

Mutation Targets vs Disorder Frequency
Rarer disorders are focused on fewer mutated genes
Gilissen, Genome Biol 2011

Whole Genome or Exome Seq?
Enabling technologies: NGS machines, open-source algorithms, capture reagents, lowering cost, big sample collections
Exomes more cost effective: sequence patient DNA and filter common SNPs; compare parent-child trios; compare paired normal and cancer samples
Challenges:
–Still can’t interpret many Mendelian disorders
–Rare variants need large sample sizes
–Exome might miss regions (e.g. novel non-coding genes)
–Unsuccessful at using exome-seq to interpret clinical data
Shendure, Genome Biol 2011

Read Mapping
Mapping hundreds of millions of reads back to the reference genome is CPU and RAM intensive, and slow
Read quality decreases with length (small single-nucleotide mismatches or indels)
Very few mappers handle indels, and most allow ~2 mismatches within the first 30 bp (the 28 matching bases still give 4^28 possible sequences, enough to uniquely identify most 30 bp sequences in a 3 Gb genome)
Mapping output: SAM (BAM) or BED
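The 4^28 argument can be checked with a few lines of arithmetic. This is a rough sketch that treats the genome as a random 3 Gb sequence (an idealization: real genomes contain repeats, which is why the slide says "most" rather than "all"); the function names are illustrative only.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 Gb, the genome size quoted on the slide


def distinct_sequences(k):
    """Number of different DNA sequences of length k over A, C, G, T."""
    return 4 ** k


def shortest_unique_length(genome_size_bp):
    """Smallest k for which there are more possible k-mers than genome positions."""
    k = 1
    while distinct_sequences(k) <= genome_size_bp:
        k += 1
    return k


print(distinct_sequences(28))                    # possible 28-base sequences
print(distinct_sequences(28) // GENOME_SIZE_BP)  # how many times larger than the genome
print(shortest_unique_length(GENOME_SIZE_BP))    # idealized minimum unique length
```

Under the random-sequence assumption 16 matching bases would already exceed the number of genome positions, so 28 matching bases leave a very wide margin.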

Spaced seed alignment
Tags and tag-sized pieces of reference are cut into small “seeds.”
Pairs of spaced seeds are stored in an index.
Look up spaced seeds for each tag.
For each “hit,” confirm the remaining positions.
Report results to the user.
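The steps above can be sketched in a few lines. The layout used here is an assumption, not something the slide specifies: each tag is cut into four equal seeds and all six seed pairs are indexed, so a tag with at most two mismatches always has at least one pair of untouched seeds. All names are hypothetical, and the tag length is assumed to be a multiple of four.

```python
from collections import defaultdict
from itertools import combinations


def split_seeds(sequence, n_seeds=4):
    """Cut a tag-sized sequence into n_seeds equal pieces."""
    step = len(sequence) // n_seeds
    return [sequence[i * step:(i + 1) * step] for i in range(n_seeds)]


def build_index(reference, tag_length, n_seeds=4):
    """Index every tag-sized window of the reference by each pair of its seeds."""
    index = defaultdict(list)
    for pos in range(len(reference) - tag_length + 1):
        seeds = split_seeds(reference[pos:pos + tag_length], n_seeds)
        for a, b in combinations(range(n_seeds), 2):
            index[(a, b, seeds[a], seeds[b])].append(pos)
    return index


def map_tag(tag, reference, index, max_mismatches=2, n_seeds=4):
    """Look up seed pairs, then confirm the remaining positions of each hit."""
    seeds = split_seeds(tag, n_seeds)
    candidates = set()
    for a, b in combinations(range(n_seeds), 2):
        candidates.update(index.get((a, b, seeds[a], seeds[b]), []))
    hits = []
    for pos in sorted(candidates):
        window = reference[pos:pos + len(tag)]
        mismatches = sum(x != y for x, y in zip(tag, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTTGACCAGTAGGCATCGA"   # toy reference, made up for illustration
index = build_index(reference, tag_length=8)
print(map_tag("TTGACCAG", reference, index))  # exact copy of reference[8:16]
print(map_tag("ATGACGAG", reference, index))  # same tag with two substitutions
```

Indexing pairs of seeds costs memory (six entries per reference window here), which is the trade-off against the Burrows-Wheeler approach on the next slide.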

Burrows-Wheeler Store entire reference genome. Align tag base by base from the end. When tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution. Trapnell & Salzberg, Nat Biotech 2009

Burrows-Wheeler Transform
Reversible permutation used originally in compression
Once BWT(T) is built, all else shown here is discarded
–Matrix will be shown for illustration only
[Figure: T, its Burrows-Wheeler Matrix, and the last column BWT(T)]
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994
Slides from Ben Langmead
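A minimal sketch of the construction, assuming the text is short enough to sort all rotations explicitly (real indexers use suffix-array methods instead). The "$" terminator and the example text "acaacg", which also appears on the backtracking slide, are the only inputs; the function name is illustrative.

```python
def bwt(text):
    """Return BWT(T): the last column of the sorted rotation matrix of T + '$'."""
    if "$" in text:
        raise ValueError("text must not contain the terminator '$'")
    terminated = text + "$"  # '$' sorts before the letters a, c, g, t
    # Build the Burrows-Wheeler matrix: all rotations, in sorted order.
    matrix = sorted(terminated[i:] + terminated[:i] for i in range(len(terminated)))
    # Only the last column is kept; the matrix itself is discarded.
    return "".join(row[-1] for row in matrix)


print(bwt("acaacg"))  # gc$aaac
```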

Burrows-Wheeler Transform
Property that makes BWT(T) reversible is “LF Mapping”
–The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column
[Figure: T, BWT(T) and the Burrows-Wheeler Matrix, with one character occurrence annotated “Rank: 2”]

Burrows-Wheeler Transform
To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i)
–Where LF(i) maps row i to the row whose first character corresponds to row i’s last character, per the LF Mapping
[Figure: final reconstructed T]
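The reconstruction rule can be sketched as follows. This is an illustrative version, not the slide author's code: it starts at row 0 (the row beginning with "$"), prepends BWT[i] and then steps to LF(i), which is the same walk as the rule above with the index shifted by one step. The helper names are made up.

```python
def inverse_bwt(bwt_string):
    """Rebuild T from BWT(T) by walking the LF mapping backwards through T."""
    # ranks[i]: how many times bwt_string[i] already occurred before row i.
    seen = {}
    ranks = []
    for ch in bwt_string:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # first_row[ch]: first row of the sorted matrix that begins with ch.
    first_row = {}
    total = 0
    for ch in sorted(seen):
        first_row[ch] = total
        total += seen[ch]

    def lf(i):
        # The k-th occurrence of a character in the Last column is the
        # k-th occurrence of that character in the First column.
        return first_row[bwt_string[i]] + ranks[i]

    text = ""
    i = 0  # row 0 starts with '$', so its last character is the end of T
    while bwt_string[i] != "$":
        text = bwt_string[i] + text
        i = lf(i)
    return text


print(inverse_bwt("gc$aaac"))  # acaacg
```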

Exact Matching with FM Index
To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc)
–Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc
In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q
If range becomes empty (top = bot) the query suffix (and therefore the query) does not occur in the text
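A compact sketch of this backward search, under stated assumptions: the occurrence table is stored in full for every prefix of BWT(T), which is simple but not memory-efficient (real FM indexes sample it), and the range is kept half-open as [top, bot) so that "empty" means top >= bot. Function names are illustrative.

```python
def build_tables(bwt_string):
    """Return (first, occ): first row per character and prefix occurrence counts."""
    alphabet = sorted(set(bwt_string))
    # occ[ch][i] = number of times ch occurs in bwt_string[:i]
    occ = {ch: [0] for ch in alphabet}
    for current in bwt_string:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (1 if ch == current else 0))
    first = {}
    total = 0
    for ch in alphabet:
        first[ch] = total
        total += occ[ch][-1]
    return first, occ


def count_matches(query, bwt_string):
    """Count exact occurrences of query in T, given only BWT(T)."""
    first, occ = build_tables(bwt_string)
    top, bot = 0, len(bwt_string)  # start with every row of the matrix
    for qc in reversed(query):     # consume Q right-to-left
        if qc not in first:
            return 0
        top = first[qc] + occ[qc][top]  # LF(top, qc)
        bot = first[qc] + occ[qc][bot]  # LF(bot, qc)
        if top >= bot:                  # empty range: suffix absent from T
            return 0
    return bot - top


BWT_T = "gc$aaac"  # BWT of "acaacg"
print(count_matches("aac", BWT_T))  # 1
print(count_matches("ac", BWT_T))   # 2
print(count_matches("agc", BWT_T))  # 0
```

Each query character costs two table lookups regardless of the size of T, which is why this approach scales to a whole genome.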

Backtracking
Consider an attempt to find Q = “agc” in T = “acaacg”: the suffix “gc” does not occur in the text
Instead of giving up, try to “backtrack” to a previous position and try a different base (much slower)
For 50bp reads, need to have ~25bp perfect match
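The idea can be illustrated with a small depth-first search over the same FM index tables. This is a simplified sketch rather than any particular mapper's policy: it explores every substitution within a mismatch budget instead of trying the exact path first and backing up only on failure, and it reports the matching substrings of T rather than their positions. Names are hypothetical.

```python
def search_with_substitutions(query, bwt_string, max_mismatches=1):
    """Return substrings of T within max_mismatches substitutions of query."""
    alphabet = sorted(set(bwt_string))
    # occ[ch][i] = number of times ch occurs in bwt_string[:i]
    occ = {ch: [0] for ch in alphabet}
    for current in bwt_string:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (1 if ch == current else 0))
    first = {}
    total = 0
    for ch in alphabet:
        first[ch] = total
        total += occ[ch][-1]

    bases = [ch for ch in alphabet if ch != "$"]
    found = set()

    def extend(pos, top, bot, mismatches, matched):
        if pos < 0:  # whole query consumed with a non-empty range
            found.add(matched)
            return
        for ch in bases:
            cost = 0 if ch == query[pos] else 1  # substituting costs one mismatch
            if mismatches + cost > max_mismatches:
                continue
            new_top = first[ch] + occ[ch][top]
            new_bot = first[ch] + occ[ch][bot]
            if new_top < new_bot:  # range still non-empty, keep going
                extend(pos - 1, new_top, new_bot, mismatches + cost, ch + matched)

    extend(len(query) - 1, 0, len(bwt_string), 0, "")
    return sorted(found)


BWT_T = "gc$aaac"  # BWT of "acaacg"
print(search_with_substitutions("agc", BWT_T, max_mismatches=0))  # []
print(search_with_substitutions("agc", BWT_T, max_mismatches=1))  # ['aac']
```

The number of branches grows quickly with the mismatch budget, which is the reason a long perfectly matching stretch is needed to keep the search fast.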

Seq Files
Raw FASTQ
–Sequence ID, sequence
–Quality ID, quality score
Mapped SAM
–Flag: 0 OK (mapped, forward strand), 4 unmapped, 16 mapped to reverse strand
–XA (mapper-specific)
–MD: mismatch info
–NM: edit distance (number of mismatches, plus any inserted or deleted bases)
Mapped BED
–Chr, start, end, strand
FASTQ example (fragment):
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
SAM example (fragments):
HWUSI-EAS366_0112:6:1:1298:18828#0/1 16 chr M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSI-EAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\ XM:i:1
HWUSI-EAS366_0112:6:1:1315:19529#0/1 16 chr M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT` XA:i:0 MD:Z:38 NM:i:0
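As a small illustration of how the SAM fields listed above are read, the sketch below decodes the two flag bits mentioned on the slide (0x4 unmapped, 0x10 reverse strand) and parses optional TAG:TYPE:VALUE fields. It is a teaching sketch with hypothetical function names, not a replacement for a SAM library, and it ignores all other flag bits.

```python
def decode_flag(flag):
    """Interpret the unmapped (0x4) and reverse-strand (0x10) bits of a SAM flag."""
    unmapped = bool(flag & 0x4)
    return {
        "mapped": not unmapped,
        # Strand is only meaningful for a mapped read.
        "reverse_strand": (not unmapped) and bool(flag & 0x10),
    }


def parse_tags(fields):
    """Turn optional TAG:TYPE:VALUE fields into a dict; type 'i' becomes int."""
    tags = {}
    for field in fields:
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


for flag in (0, 4, 16):
    print(flag, decode_flag(flag))

# Optional fields of the last SAM record shown on the slide.
print(parse_tags(["XA:i:0", "MD:Z:38", "NM:i:0"]))
```

Here MD:Z:38 with NM:i:0 describes a 38-base read that matches the reference at every position.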

Data Analysis
Heuristic filtering to identify novel genes for Mendelian disorders
Stitziel et al, Genome Biol 2011

Genomic Structural Variation
Baker et al, Nat Meth 2012

Structural Variation Detection
BreakDancer (Chen et al, Nat Meth 2009)
–Only looks at anomalous read pairs

Structural Variation Detection
CREST (Wang et al, Nat Meth 2011)
–Uses soft-clipped reads, somewhat like a bidirectional BLAST

Copy Number Variation Detection
Change in read coverage
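One simple way to quantify a change in read coverage is a per-window log2 ratio between a sample and a control, after scaling each by its total read count. The sketch below is an assumption-laden illustration (fixed windows, library-size normalization only, no GC or mappability correction, no segmentation) and does not describe any specific CNV caller; the counts are made up.

```python
import math


def log2_coverage_ratios(sample_counts, control_counts):
    """Per-window log2(sample / control) after library-size normalization."""
    if len(sample_counts) != len(control_counts):
        raise ValueError("both inputs must cover the same windows")
    sample_total = sum(sample_counts)
    control_total = sum(control_counts)
    ratios = []
    for sample, control in zip(sample_counts, control_counts):
        if sample == 0 or control == 0:
            ratios.append(None)  # no usable coverage in this window
            continue
        normalized = (sample / sample_total) / (control / control_total)
        ratios.append(math.log2(normalized))
    return ratios


# Hypothetical read counts in four equal-sized windows.
tumor = [100, 100, 200, 100]
normal = [100, 100, 100, 100]
print(log2_coverage_ratios(tumor, normal))
```

The third window stands out with a clearly higher ratio than its neighbours, which is the kind of signal a coverage-based method looks for; note that normalizing by total reads shifts the baseline slightly when a large region is gained.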

Representation: VCF Format

Summary
Whole genome and whole exome sequencing
–Solution hybrid selection
–Specific locus for rare diseases
Bioinformatics issues:
–Read mapping
–SNP, indel detection
–Heuristic filtering
–Structural variation detection