Jin Zhang, Jiayin Wang and Yufeng Wu

Presentation transcript:

An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data
Jin Zhang, Jiayin Wang and Yufeng Wu
Department of Computer Science and Engineering, University of Connecticut
RECOMB-seq 2012

Structural Variation (SV)
[Figure: a deletion and an insertion, each drawn as reference vs. alternative sequence; label: mean insert size + 3σ]

SV calling using HTS sequencing data
Table columns: Method, Pair or single, Coverage, Exact breakpoints
  Assembly: coverage higher
  Read depth: exact breakpoints no
  Read pair: pair only
  Split read
[Figure: two reference vs. alternative panels showing a deletion; label: exact breakpoint]

Mills et al. (Nature, 2011): "..., which facilitated analysing their origin and functional impact."
Lam et al. (Nature Biotechnology, 2010): classification and annotation

Problem: finding SVs with exact breakpoints using low-coverage paired-end reads

Split-read mapping (e.g. deletion)
Read-mapping tools either do not map a split read or soft-clip it.
[Figure: reference and alternative sequences with a deletion; labels: focal region, maximum event size]

Because of repeats in the sequence, a longer maximum event size (e.g. 1 Mbp) may cause false positives.
The different ways of splitting a read may cause even more false positives.
A shorter maximum event size may reduce false positives, but may also fail to find some larger deletions.

Table columns: Method, Algorithm, Max Deletion Size Cutoff, Insert Size, Focal Region
  Pindel, Ye et al. (Bioinformatics 2009): pattern growth; Yes [column not recoverable]
  SVseq1, Zhang et al. (Bioinformatics 2011): BWT
  SVseq2 (this work, RECOMB-seq 2012): dynamic programming
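A minimal sketch of how soft-clipped reads could be picked out of a mapper's output, assuming SAM-format CIGAR strings. The function names and the `min_clip` threshold are hypothetical and are not taken from SVseq2.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped lengths of a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return (0, 0)  # e.g. "*" for an unmapped read
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def is_split_read_candidate(cigar, min_clip=10):
    """True if either end is soft-clipped by at least min_clip bases.

    min_clip is an illustrative value, not a parameter from the slides.
    """
    left, right = soft_clip_lengths(cigar)
    return max(left, right) >= min_clip


print(soft_clip_lengths("60M40S"))
print(is_split_read_candidate("100M"))
```

The clipped portion is the part a split-read caller would then try to place elsewhere on the reference.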

SVseq2: a pattern for deletion calling
Finding the focal region with the help of a spanning pair

li: library mean
σ: standard deviation
l: read length

[Figure: a spanning pair on the reference and on the (unknown) alternative sequence across a deletion; labels: li + 3σ, li + 3σ - 2l, known breakpoint, unknown breakpoint. The two breakpoints are the same position on the alternative sequence.]

E.g. li + 3σ - 2l = 400 + 3*50 - 200 = 350 bp
Note: the maximum event size can be 1 Mbp.

(a) Within length li + 3σ - 2l from [figure element], find [figure element].
(b) Find [figure element] by using [figure element], because they are a pair.
(c) Find [figure element] by mapping the soft-clipped portion within length li + 3σ - 2l of [figure element].

Using the focal region:
(1) Search in a much smaller space
(2) Reduce the number of ways to split a read
(3) Able to find large deletions
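The focal-region arithmetic from this slide can be checked with a short sketch. The function name is hypothetical; the formula li + 3σ - 2l and the example values (li = 400, σ = 50, l = 100) come from the slide.

```python
def focal_region_length(library_mean, std_dev, read_length):
    """Length of the focal region: li + 3*sigma - 2*l."""
    return library_mean + 3 * std_dev - 2 * read_length


# Values from the slide's example: li = 400, sigma = 50, l = 100.
length = focal_region_length(400, 50, 100)
print(length)  # 350

# Compare with a maximum event size of 1 Mbp, as mentioned on the slide.
max_event_size = 1_000_000
print(max_event_size / length)
```

The ratio printed last gives a rough sense of how much smaller the search space becomes when the split portion is looked for only inside the focal region.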

SVseq2: another pattern in deletion calling
[Figure: reference and alternative sequences with a deletion; labels: anchor, li + 3σ - 2l]
The pair itself is also a spanning pair.

Dynamic alignment algorithm (semi-global)
Similarity: 1 for matches and -1 for mismatches.
Penalty: 3 for gaps inside the sequence, 0 outside.

GTTCTAAGCCAGTGGTTCTACCAACTTGAGTATGCATCAGAATCACTTGGA
----------AGTGGTTCT-CCAACTTGAGAATGCATCA------------
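A minimal sketch of a semi-global alignment score under the scoring stated on this slide (match 1, mismatch -1, internal gap 3, gaps outside free). It assumes "outside" means unaligned reference bases before and after the read cost nothing while the read itself is aligned end to end. The function name is hypothetical and this is not SVseq2's implementation.

```python
def semiglobal_score(read, reference, match=1, mismatch=-1, gap=3):
    """Best score of aligning the whole read to any stretch of the reference.

    Reference bases left of the alignment start and right of its end are free;
    gaps inside the alignment cost `gap` each.
    """
    n = len(reference)
    prev = [0] * (n + 1)  # row 0: skipping leading reference bases is free
    for i in range(1, len(read) + 1):
        curr = [prev[0] - gap] + [0] * n  # read bases with no reference: gaps
        for j in range(1, n + 1):
            sub = match if read[i - 1] == reference[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align read[i-1] with reference[j-1]
                prev[j] - gap,      # read base opposite a gap
                curr[j - 1] - gap,  # reference base opposite a gap
            )
        prev = curr
    return max(prev)  # free choice of where the alignment ends


# Sequences from the slide's example alignment.
reference = "GTTCTAAGCCAGTGGTTCTACCAACTTGAGTATGCATCAGAATCACTTGGA"
read = "AGTGGTTCTCCAACTTGAGAATGCATCA"
print(semiglobal_score(read, reference))
```

For the alignment drawn on the slide (27 matches, 1 mismatch, 1 internal gap) this scoring gives 27 - 1 - 3 = 23.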

SVseq2: Type III pattern for insertion calling
[Figure: Read 1 and Read 2 on the alternative and reference sequences; labels: overlap, region 1, portions 1 to 4]

Mapping score: penalties are the same as in the deletion case.
Calling: score / length of overlap < threshold
SVseq2 currently does not reconstruct inserted sequences and still uses a cutoff.

Results

Simulation on deletions
Simulate on chromosome 15 (100,338,915 bp).
Introduce deletions with exact breakpoints from the 1000 Genomes Project release union.2010_06.deletions.genotypes.vcf.gz (132 deletions), 45 individuals.
Simulate reads with wgsim (https://github.com/lh3/wgsim), error rate 0.02.
Paired-end reads of length 100, outer distance 500.
Mapped by BWA.
Cutoff: SVseq2 cutoff 3; SVseq1 cutoff 3; Pindel 0.2.4d cutoff 3.
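A sketch of how the read simulation settings on this slide might translate into a wgsim call, written as a Python helper that only builds the argument list. The flags `-e`, `-d`, `-N`, `-1` and `-2` are wgsim options; the file names and the 4x coverage are placeholder assumptions, since the slide does not state the coverage or the exact command used.

```python
def read_pairs_for_coverage(genome_length, coverage, read_length):
    """Number of read pairs giving roughly `coverage`x over the genome."""
    return int(coverage * genome_length) // (2 * read_length)


def wgsim_command(reference_fasta, out_fq1, out_fq2, n_pairs,
                  read_length=100, outer_distance=500, error_rate=0.02):
    """Build a wgsim argument list with the settings listed on the slide."""
    return [
        "wgsim",
        "-e", str(error_rate),      # base error rate
        "-d", str(outer_distance),  # outer distance between the two ends
        "-N", str(n_pairs),         # number of read pairs
        "-1", str(read_length),     # length of the first read
        "-2", str(read_length),     # length of the second read
        reference_fasta, out_fq1, out_fq2,
    ]


# Chromosome 15 length from the slide; 4x coverage is only an assumption.
pairs = read_pairs_for_coverage(100_338_915, 4, 100)
cmd = wgsim_command("chr15_with_deletions.fa", "sim_1.fq", "sim_2.fq", pairs)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once wgsim is installed and the placeholder file names are replaced.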

Real data
20101123 Illumina datasets of 18 individuals on chromosome 20 (9 CEU, 9 YRI), analyzed as individual data and as pooled data.
Mapped by BWA on NCBI37.
Benchmark: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/
Contains SVs called using BreakDancerMax1.1, CNVnator, GenomeStrip, EMBL/Delly and Pindel (with data of 1094 individuals).
SVseq2 cutoff 3 (no cutoff for type I) and 4; SVseq1 and Pindel 0.2.4d cutoff 3.

[Tables: individual data; pooled data]
** F: findings; SE: supported by exact breakpoint; SO: supported by overlap

Running time
NA19312, one thread

Acknowledgement
Supported by NSF grant IIS-0953563