Next Generation Sequencing, Assembly, and Alignment Methods

Next Generation Sequencing, Assembly, and Alignment Methods Andy Nagar

Agenda: Background; Next Generation Sequencing; Sequence Assembly; Sequence Alignment; Traditional Alignment Algorithms; Next Generation Alignment Algorithms; Conclusion.

Background: Earlier sequencing methods were based on Sanger sequencing, which dates back to the 1970s. Sequencing was slow because bases were read one at a time. Fragments are separated by electrophoresis and read out via fluorescent tags. Source:[Wikipedia]

Background: To complete second-generation genome projects such as the Human Genome Project, faster and higher-throughput sequencing was needed. Next-Generation Sequencing technologies are based on various implementations of cyclic array sequencing. Cyclic array sequencing is the sequencing of an array of DNA features by iterative cycles of enzymatic manipulation and imaging-based data collection.

Growth in Sequencing: Growth of Next-Gen Sequencing: doubles every month. Source:[6]

Next Generation Sequencing Workflow: (1) DNA is fragmented. (2) Adaptors are ligated to the fragments. (3) One of several possible protocols yields an array of PCR colonies. (4) Enzymatic extension with fluorescently tagged nucleotides. (5) Cyclic readout by imaging the array. Source:[10]

Next Generation Sequencing: Reads are done in parallel to speed up the sequencing. Source:[11]

NGS - Products: Products based on cyclic array sequencing include Roche’s 454, Illumina’s Genome Analyzer, ABI’s SOLiD, and HeliScope. They allow millions of short sequences (reads) to be sequenced simultaneously, and can sequence an entire human genome in a few days [Magi et al 2010].

NGS - Products Source:[13]

Comparison of existing methods Source:[4]

Whole Genome Shotgun Sequences (WGS): DNA is broken up randomly into numerous small segments. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.

Sequencing Source:[9]

How to ensure enough coverage Source:[9]
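
The slide itself survives only as a title, so the following is a general note rather than a reconstruction of its content. One standard way to reason about coverage is the Lander-Waterman estimate. The symbols N (number of reads), L (read length), and G (genome length) are introduced here for illustration, and the model assumes reads are placed uniformly at random:

```latex
c = \frac{N L}{G}, \qquad P(\text{a given base is not covered}) \approx e^{-c}
```

Under this idealized model, a coverage of c = 8 leaves a fraction of roughly e^{-8}, about 0.0003, of bases uncovered. Real data deviate from this because of sequencing bias and repeats.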

Whole Genome Shotgun Sequences (WGS) Source: http://www.nature.com/scitable/topicpage/complex-genomes-shotgun-sequencing-609

Assembly - Reconstructing the Genome: 2 possible methods of assembly. (1) Overlap consensus assembly: this method uses the overlap between sequence reads to create a link between them. The contig is eventually formed by reading along the links as far as possible. It is problematic for short reads: overlaps must be calculated over a large proportion of the read, and the huge number of reads increases the number of links, so the contig path is difficult to compute.
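
The overlap idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the function names `overlap`, `overlap_links`, and `merge`, the `min_len` threshold, and the toy reads are all assumptions made for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_links(reads, min_len=3):
    """All-against-all comparison: one link per ordered pair that overlaps."""
    links = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    links[(a, b)] = n
    return links


def merge(a, b, n):
    """Join two reads that overlap by n characters."""
    return a + b[n:]


reads = ["ACGTAC", "TACGGA"]
links = overlap_links(reads)
print(links)
print(merge("ACGTAC", "TACGGA", links[("ACGTAC", "TACGGA")]))
```

The all-against-all loop makes the cost quadratic in the number of reads, which is one way to see why the slide calls this approach problematic for very large sets of short reads.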

Assembly - Reconstructing the Genome: 2 possible methods of assembly. (2) de Bruijn graph approach: all k-mers are computed and the reads are represented as a path through the k-mers. A de Bruijn graph is a graph in which the nodes are short sequences of symbols (here, nucleotides) and the edges represent overlaps between those sequences. This is a convenient way to represent data such as overlapping sequence reads. de Bruijn graphs handle redundancy better than the overlap approach and can assemble sequences more efficiently.

Assembly - Reconstructing the Genome Source:[13]

Assembly - Reconstructing the Genome Source:[12]

Assembly - de Bruijn Graph: Reads are parsed into 4-mers. Matches are found and the de Bruijn graph is created. There can be more than one path in the graph, which leads to practical problems of assembly. Source:[12]
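
As a rough sketch of the construction, the Python below parses reads into 4-mers and links them. It assumes one common convention that the slide's figure may or may not follow: each distinct k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `de_bruijn` and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k=4):
    """Build a de Bruijn graph as an adjacency list.

    Nodes are (k-1)-mers; every distinct k-mer found in the reads adds
    one edge from its prefix node to its suffix node.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # the same k-mer seen in another read adds nothing new
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping reads collapse into a single unbranched path.
print(de_bruijn(["ACGTC", "CGTCA"]))

# A repeated 3-mer (ACG) creates a branch, so more than one path is possible.
print(de_bruijn(["ACGTACGA"]))
```

In the second example the node ACG has two outgoing edges, which is the kind of ambiguity the slide refers to.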

What can we do about repeats? Two main approaches: cluster the reads, or link the reads. Source:[9]

Traditional Sequence Alignment: 2 types of traditional sequence alignment algorithms. 1. Hash-table based, e.g. BLAST (and its variants), which keeps track of each k-mer in a hash table with the sequence as the key [14][15], and SSAHA, which builds a position-sensitive hash table [17]. Advantage: fast search; allows gapped searches. Drawback: large memory requirement to store the hash table.
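
The hash-table idea can be illustrated with a small k-mer index. This is a simplified sketch, not how BLAST or SSAHA are actually implemented; the names `build_kmer_index` and `seed_hits` and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(index, query, k):
    """Look up each k-mer of the query; return (query_offset, reference_position) pairs."""
    hits = []
    for j in range(len(query) - k + 1):
        for pos in index.get(query[j:j + k], []):
            hits.append((j, pos))
    return hits


index = build_kmer_index("ACGTACGT", 3)
print(seed_hits(index, "TACG", 3))
```

Each lookup is a single dictionary access, which is where the speed comes from. The index stores one entry per reference position, which is where the memory cost comes from.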

Traditional Sequence Alignment: 2. Tree-based search, e.g. suffix and prefix tries. Advantage: fast search; can easily search for substrings or patterns. Drawback: inserting new sequences requires rebuilding the tree.

Traditional Sequence Alignment - Suffix Tree: Suffix tree for the string BANANA. Each suffix is terminated with the special character $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links are drawn dashed: for example, the node representing “ANA” has a suffix link to the node representing “NA”, because NA is a suffix of ANA. Source:[19]
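
To make the structure concrete, here is a minimal Python sketch of an uncompressed suffix trie for BANANA. It is deliberately simpler than the suffix tree in the figure: edges are single characters, there are no suffix links, and construction is quadratic. The function names and the 0-based start positions are assumptions made for the example.

```python
def build_suffix_trie(text, end="$"):
    """Insert every suffix of text (each terminated by end) into a nested-dict trie.

    The leaf of each suffix stores its start position under the key "start".
    """
    s = text + end
    root = {}
    for start in range(len(text)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start
    return root


def occurrences(root, pattern):
    """Start positions of pattern in the text: walk the pattern, then collect leaves."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("BANANA")
print(occurrences(trie, "ANA"))
print(occurrences(trie, "NA"))
```

Every substring of the text is a prefix of some suffix, so a substring query only has to follow the pattern down from the root.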

Next Generation Sequence Alignment: With high-throughput sequencing, millions of reads are obtained in a single run. The “read-mapping” problem: how do the reads fit in the reference genome? Find hits where these reads occur in the genome, and report the position(s) and frequency of the hits. A short read may map to many chromosomes in a genome.
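
The read-mapping problem can be stated as a naive Python sketch that handles exact matches only. The function name `map_read`, the dictionary layout of the genome, and the toy sequences are hypothetical. Real mappers use an index rather than scanning every chromosome.

```python
def map_read(read, genome):
    """Report every exact occurrence of read as a (chromosome, position) pair.

    genome is a dict mapping a chromosome name to its sequence.
    """
    hits = []
    for chrom, seq in genome.items():
        pos = seq.find(read)
        while pos != -1:
            hits.append((chrom, pos))
            pos = seq.find(read, pos + 1)  # continue after the last hit, allowing overlaps
    return hits


genome = {"chr1": "ACGTACGT", "chr2": "TTACGA"}
hits = map_read("ACG", genome)
print(hits)        # positions of the hits
print(len(hits))   # frequency of the hits
```

Here the short read ACG maps to both chromosomes, which illustrates the ambiguity mentioned above.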

Next Generation Sequence Alignment Source:[25]

Next Generation Sequence Alignment: The Burrows-Wheeler Transform can be used to find matches of a query string inside a reference string. Steps: 1. Create an array (the starting point for the suffix array) in which each element is a cyclic permutation of the original string terminated by the end character “$”. Example, for the string “googol”: original string => googol$; 1st circular permutation => oogol$g; 2nd circular permutation => ogol$go; and so on until $ moves to the front of the string; last circular permutation => $googol. Source:[27]
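
Step 1 can be sketched directly in Python. The function name `rotations` is an assumption made for the example.

```python
def rotations(text, end="$"):
    """All cyclic permutations of text + end, in order of the number of shifts."""
    s = text + end
    return [s[i:] + s[:i] for i in range(len(s))]


for shift, row in enumerate(rotations("googol")):
    print(shift, row)
```

The element at index i starts at position i of the original string, so the index doubles as the start position used in the later steps.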

Next Generation Sequence Alignment: Steps: 2. Sort the elements of the suffix array in lexicographic order, where $ is lexicographically the smallest character. The sorted array is the BW Array: i is the index of an element in the BW Array, and S(i) is its index in the suffix array before sorting. Note: all occurrences of any substring occur next to each other in the BW Array. Such a range is called the Suffix Array Interval (SA Interval). For example, “go” occurs as a prefix at positions 1 and 2, so the SA Interval of “go” = [1,2]. Source:[27]
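
Step 2 and the SA Interval can be sketched as follows. This is a naive illustration that sorts full rotations and scans them linearly, which is fine for “googol” but not for a genome. The function names `bw_array` and `sa_interval` are assumptions made for the example.

```python
def bw_array(text, end="$"):
    """Sort the cyclic permutations of text + end.

    Returns (S, rows, bwt): S[i] is the pre-sort index of sorted row i,
    rows are the sorted permutations, and bwt is their last column.
    The sort relies on "$" ordering before letters, which holds in ASCII.
    """
    s = text + end
    S = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    rows = [s[i:] + s[:i] for i in S]
    bwt = "".join(row[-1] for row in rows)
    return S, rows, bwt


def sa_interval(rows, query):
    """Inclusive range of sorted rows that start with query, or None if absent."""
    hits = [i for i, row in enumerate(rows) if row.startswith(query)]
    return (hits[0], hits[-1]) if hits else None


S, rows, bwt = bw_array("googol")
for i, row in enumerate(rows):
    print(i, S[i], row)
print(bwt)
print(sa_interval(rows, "go"))
```

Because the rows are sorted, all rows sharing a prefix are adjacent, which is why a single interval is enough to describe every occurrence.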

Next Generation Sequence Alignment: Steps: SA Interval of “go” = [1,2]. The values of S(i) give the corresponding positions in the original string X = googol$. Here the S(i) values are 3 and 0. This algorithm has many extensions for finding inexact and gapped matches. More details are in reference [27]. Source:[27]
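
The slides do not spell out how the interval is found efficiently. One widely used technique is backward search over the BWT, sketched below as an assumption about what such a search can look like, not as the implementation from reference [27]. The function name `backward_search` and the helper names `C` and `occ` are chosen for the example, and the counting is done naively for clarity.

```python
def backward_search(text, query, end="$"):
    """Find start positions of query in text using the BWT of text + end.

    Processes the query from its last character to its first, narrowing a
    half-open interval [lo, hi) of rows in the sorted array.
    """
    s = text + end
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # S(i) for each sorted row
    bwt = "".join(s[i - 1] for i in sa)               # character preceding each suffix
    # C[c]: number of characters in s that are strictly smaller than c
    C = {c: sum(ch < c for ch in s) for c in set(s)}

    def occ(c, i):
        """Number of occurrences of c in the first i characters of the BWT."""
        return bwt[:i].count(c)

    lo, hi = 0, len(s)
    for c in reversed(query):
        if c not in C:
            return []
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(backward_search("googol", "go"))
```

For “go” the final half-open interval is [1, 3), which is the inclusive SA Interval [1,2] from the slide, and the positions recovered from S(i) are 0 and 3.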

Conclusion: Next Generation Sequencing is transforming the fields of genetics, molecular biology and bioinformatics. Sequencing projects produce enormous amounts of data, and computing and data analysis are lagging behind. There is a need for more efficient data analysis and storage methods, and for data mining that finds useful information quickly without having to store the entire data set.

Conclusion: More efficient assembly and alignment techniques are needed. There is also a need for “metagenomic” analysis: finding out which organisms or species are present in a biological or environmental sample.

References