Alignment of Long Sequences

Slides:

Advertisements

Similar presentations

CS 336 March 19, 2012 Tandy Warnow.

Advertisements

Indexing DNA Sequences Using q-Grams

Combinatorial Pattern Matching CS 466 Saurabh Sinha.

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU),

4. Lecture WS 2003/04Bioinformatics III1 Whole Genome Alignment (WGA) When the genomic DNA sequences of closely related organisms become available one.

Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.

15-853Page : Algorithms in the Real World Suffix Trees.

OUTLINE Suffix trees Suffix arrays Suffix trees Indexing techniques are used to locate highest – scoring alignments. One method of indexing uses the.

296.3: Algorithms in the Real World

© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.

Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

3 -1 Chapter 3 String Matching String Matching Problem Given a text string T of length n and a pattern string P of length m, the exact string matching.

Tries Standard Tries Compressed Tries Suffix Tries.

Combinatorial Pattern Matching CS 466 Saurabh Sinha.

Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.

Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.

Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,

Phylogenetic Tree Construction and Related Problems Bioinformatics.

Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago.

Sequence comparison: Local alignment

Sequencing a genome and Basic Sequence Alignment

Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,

Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.

Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.

1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen

Genome Alignment. Alignment Methods Needleman-Wunsch (global) and Smith- Waterman (local) use dynamic programming Guaranteed to find an optimal alignment.

Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.

JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.

Sequencing a genome and Basic Sequence Alignment

Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)

Techniques for Protein Sequence Alignment and Database Searching (part2) G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Foundation of Computing Systems

1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.

Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)

More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.

Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.

Aligning Genomes Genome Analysis, 12 Nov 2007 Several slides shamelessly stolen from Chr. Storm.

Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.

Hidden Markov Models BMI/CS 576

Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.

15-853:Algorithms in the Real World

Genome alignment Usman Roshan.

Sequence comparison: Local alignment

The short-read alignment in distributed memory environment

Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.

Learning Sequence Motif Models Using Expectation Maximization (EM)

Inferring Models of cis-Regulatory Modules using Information Theory

Strings: Tries, Suffix Trees

Interpolated Markov Models for Gene Finding

RNA Secondary Structure Prediction

Eukaryotic Gene Finding

Linking Genetic Variation to Important Phenotypes

String Data Structures and Algorithms

Stochastic Context Free Grammars for RNA Structure Modeling

CSE 589 Applied Algorithms Spring 1999

String Data Structures and Algorithms

3. Brute Force Selection sort Brute-Force string matching

Tries 2/27/2019 5:37 PM Tries Tries.

3. Brute Force Selection sort Brute-Force string matching

Strings: Tries, Suffix Trees

Important Problem Types and Fundamental Data Structures

Analysis of Algorithms CS 477/677

Fragment Assembly 7/30/2019.

3. Brute Force Selection sort Brute-Force string matching

Presentation transcript:

Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2018 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Mark Craven, Colin Dewey, and Anthony Gitter

Goals for Lecture Key concepts how large-scale alignment differs from the simple case the canonical three step approach of large-scale aligners using suffix trees to find maximal unique matching subsequences (MUMs) If time permits using tries and threaded tries to find alignment seeds constrained dynamic programming to align between/around anchors using sparse dynamic programming (DP) to find a chain of local alignments

Pairwise Large-Scale Alignment: Task Definition Given a pair of large-scale sequences (e.g. chromosomes) a method for scoring the alignment (e.g. substitution matrices, insertion/deletion parameters) Do construct global alignment: identify all matching positions between the two sequences

Large Scale Alignment Example Mouse Chr6 vs. Human Chr12 Figure from: Delcher et al., Nucleic Acids Research 27, 1999

Why the Problem is Challenging Sequences too big to make O(n2) dynamic-programming methods practical Long sequences are less likely to be colinear because of rearrangements initially we’ll assume colinearity we’ll consider rearrangements in next lecture (or never)

General Strategy find a good chain of anchors Figure from: Brudno et al. Genome Research, 2003 find a good chain of anchors fill in remainder with standard but constrained alignment method perform pattern matching to find seeds for global alignment

The MUMmer System Delcher et al., Nucleic Acids Research, 1999 Given: genomes A and B find all maximal unique matching subsequences (MUMs) extract the longest possible set of matches that occur in the same order in both genomes close the gaps

Step 1: Finding Seeds in MUMmer Maximal unique match: occurs exactly once in both genomes A and B not contained in any longer MUM mismatches Key insight: a significantly long MUM is certain to be part of the global alignment

Suffix Trees Substring problem: given text S of length m preprocess S in O(m) time such that, given query string Q of length n, find occurrence (if any) of Q in S in O(n) time Suffix trees solve this problem and others

Suffix Tree Definition A suffix tree T for a string S of length m is a tree with the following properties: rooted and directed m leaves, labeled 1 to m each edge labeled by a substring of S concatenation of edge labels on path from root to leaf i is suffix i of S (we will denote this by Si...m) each internal non-root node has at least two children edges out of a node must begin with different characters key property

Suffixes S = “banana$” suffixes of S $ (special character) a$ na$ ana$

Suffix Tree Example S = “banana$” Add ‘$’ to end so that suffix tree exists (no suffix is a prefix of another suffix) a b n n a a a $ n a $ n n n a $ a a $ $ 7 $ $ 2 4 6 1 3 5

Solving the Substring Problem Assume we have suffix tree T and query string Q FindMatch(Q, T): follow (unique) path down from root of T according to characters in Q if all of Q is found to be a prefix of such a path return label of some leaf below this path else, return no match found

Solving the Substring Problem Q = nan $ 1 b a n 2 3 4 5 6 7 Q = anab STOP return no match found a b n n a a a $ n a $ n n n a $ a a $ $ 7 $ $ 2 4 6 1 3 5 return 3

MUMs and Generalized Suffix Trees Build one suffix tree for both genomes A and B Label each leaf node with genome it represents each internal node represents a repeated sequence Genome A: ccacg# Genome B: cct$ t$ acg# c g# A, 3 B, 3 A, 5 acg# t$ c g# A, 2 A, 4 B, 2 acg# t$ A, 1 B, 1 each leaf represents a suffix and its position in sequence

MUMs and Suffix Trees Unique match: internal node with 2 children, leaf nodes from different genomes But these matches are not necessarily maximal Genome A: ccacg# Genome B: cct$ t$ acg# c g# A, 3 B, 3 A, 5 acg# t$ c g# A, 2 A, 4 B, 2 acg# t$ represents unique match A, 1 B, 1

MUMs and Suffix Trees To identify maximal matches, can compare suffixes following unique match nodes Genome A: acat# Genome B: acaa$ a ca t# a$ A, 2 A, 3 A, 4 A, 1 B, 4 $ B, 3 B, 2 B, 1 the suffixes following these two match nodes are the same; the left one represents a longer match (aca)

Using Suffix Trees to Find MUMs O(n) time to construct suffix tree for both sequences (of lengths ≤ n) O(n) time to find MUMs - one scan of the tree (which is O(n) in size) O(n) possible MUMs in contrast to O(n2) possible exact matches Main parameter of approach: length of shortest MUM that should be identified (20 – 50 bases)

Step 2: Chaining in MUMmer Sort MUMs according to position in genome A Solve variation of Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes Figure from: Delcher et al., Nucleic Acids Research 27, 1999

Finding Longest Subsequence Unlike ordinary LIS problems, MUMmer takes into account lengths of sequences represented by MUMs overlaps Requires time where k is number of MUMs

Recall: Three Main Steps of Large-Scale Alignment Brudno et al. Genome Research, 2003 General Pattern matching to find seeds for global alignment Find a good chain of anchors Fill in with standard but constrained alignment MUMmer Suffix trees to obtain MUMs LIS to find colinear MUMs Smith-Waterman and recursive MUMmer for gap filling

Types of Gaps in a MUMmer Alignment Figure from: Delcher et al., Nucleic Acids Research 27, 1999

Step 3: Close the Gaps SNPs: between MUMs: trivial to detect otherwise: handle like repeats Insertions simple insertions: trivial to detect transpositions (subsequences that were deleted from one location and inserted elsewhere): look for out-of-sequence MUMs

Step 3: Close the Gaps Polymorphic regions short ones: align them with dynamic programming method long ones: call MUMmer recursively with reduced minimum MUM length Repeats detected by overlapping MUMs Figure from: Delcher et al. Nucleic Acids Research 27, 1999

MUMmer Performance FASTA on 1000 base pair segments MUMmer Figure from: Delcher et al. Nucleic Acids Research 27, 1999

MUMmer Performance Mycoplasma test case Suffix tree: 6.5s LIS: 0.02s DEC Alpha 4100 Mycoplasma test case Suffix tree: 6.5s LIS: 0.02s Smith-Waterman: 116s FASTA baseline: many hours Centre for Computing History

Longevity of MUMmer Antimicrobial Resistance Identification By Assembly (ARIBA) Identify antimicrobial resistance genes from Illumina reads Figure from: Hunt et al. bioRxiv 2017

Longevity of MUMmer Whole genome alignment still an active area of research Jain et al. 2018 (Mashmap2): “we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about one minute total execution time and < 4 GB memory using 8 CPU threads” Uses MUMmer as ground truth in evaluation

Limitations of MUMmer MUMs are perfect matches, typically ≥ 20-50 base pairs Evolutionarily distant may not have sufficient MUMs to anchor global alignment How can we tolerate minor variation in the seeds?

LAGAN: Three Main Steps Brudno et al. Genome Research, 2003 General Pattern matching to find seeds for global alignment Find a good chain of anchors Fill in with standard but constrained alignment LAGAN Threaded tries to obtain seeds Sparse dynamic programming for chaining Dynamic programming for gap filling

Step 1: Finding Seeds in LAGAN Degenerate k-mers: matching k-long sequences with a small number of mismatches allowed By default, LAGAN uses 10-mers and allows 1 mismatch cacg cgcgctacat acct acta cgcggtacat cgta

Finding Seeds in LAGAN Example: a trie to represent all 3-mers of the sequence gaaccgacct a c g 3, 7 2 4 5 8 1 6 t One sequence is used to build the trie The other sequence (the query) is “walked” through to find matching k-mers

Allowing Degenerate Matches Suppose we’re allowing 1 base to mismatch in looking for matches to the 3-mer acc; need to explore green nodes a c g a c c g a c c g t a a c 2 3, 7 4 8 5 1 6

LAGAN Uses Threaded Tries In a threaded trie, each leaf for word w1...wk has a back pointer to the node for w2...wk a c g a c c g a c c g t a a c 2 3, 7 4 8 5 1 6

Traversing a Threaded Trie Consider traversing the trie to find 3-mer matches for the query sequence: accgt a c g 3, 7 2 4 5 8 1 6 t Usually requires following only two pointers to match against the next k-mer, instead of traversing tree from root for each

Comparing MUMmer and LAGAN Baboon Chimpanzee Mouse Rat Cow Pig Cat Dog Chicken Zebrafish Fugu Overall Exons 232 176 230 224 174 182 68 48 150 1914 MUMmer (% human exons covered by ≥ 90% alignment) 100 8 9 40 44 47 37 41 LAGAN (% human exons covered by ≥ 90% alignment) 99 88 77 98

Comparing MUMmer and LAGAN Brudno et al. Genome Research, 2003 Pattern matching to find seeds for global alignment Find a good chain of anchors Fill in with standard but constrained alignment Suffix trees to obtain MUMs Longest Increasing Subsequence Smith-Waterman, recursive MUMmer MUMmer k-mer trie to obtain seeds Spare dynamic programming Dynamic programming LAGAN

Multiple Whole Genome Alignment: Task Definition Given A set of n > 2 genomes (or other large-scale sequences) Do Identify all corresponding positions between all genomes, allowing for substitutions, insertions/deletions, and rearrangements

Progressive Alignment Given a guide tree relating n genomes Construct multiple alignment by performing n-1 pairwise alignments

Progressive Alignment: MLAGAN Example human chimpanzee mouse rat align pairs of sequences align multi-sequences (alignments) chicken align multi-sequence with sequence

Progressive Alignment: MLAGAN Example Suppose we’re aligning the multi-sequence X/Y with Z anchors from X-Z and Y-Z become anchors for X/Y-Z overlapping anchors are reweighted LIS algorithm is used to chain anchors Figure from: Brudno et al. Genome Research, 2003

Genome Rearrangements ancestor ancestor x y a b c d e a b c d e extant species a d c b e extant species d e a b c x y inversion translocation Can occur within a chromosome or across chromosomes Can have combinations of these events

Mercator: Rough Orthology Map k-partite graph with edge weights vertices = anchors, edges = sequence similarity

Refining the Map: Finding Breakpoints Breakpoints: the positions at which genomic rearrangements disrupt colinearity of segments Mercator finds breakpoints by using inference in an undirected graphical model

The Breakpoint Graph 1 2 3 4 5 6 7 8 9 10 11 12 11 10 6 5 4 7 12 9 3 8 2 1 some prefix of region 2 and some prefix of region 11 should be aligned

Comparing MLAGAN and Mercator Requires phylogenetic tree Greedy solution with local refinement Mercator Define probabilistic model to solve globally Inference is intractable, resort to approximations