Space-Efficient Sequence Alignment. Bioinformatics 202, University of California, San Diego. Lecture Notes No. 7. Dr. Pavel.

Space-Efficient Sequence Alignment. Bioinformatics 202, University of California, San Diego. Lecture Notes No. 7. Dr. Pavel Pevzner (prepared by Iman Famili)

Outline
New computational ideas for sequence comparison: the divide-and-conquer technique, recursive programs, and hash tables.

Edit Graphs
An edit graph is used to find similarities between two sequences. Every alignment corresponds to a path from a source to a sink in the edit graph, so finding the best alignment becomes a longest path problem. There are 3 types of edges in the edit graph: horizontal (H), diagonal (D), and vertical (V), corresponding to insertion (I), match/mismatch (M), and deletion (D), respectively. Every edge of the edit graph (i.e. every movement) has a weight corresponding to the penalty or premium for that action. The best path is the path with the maximum total weight.
[Figure: edit graph for the sequences TGCATA and ATCTGAT, with the source and sink marked and the edges labeled as deletions, mismatches, insertions, and matches.]

Computational Complexity of Dynamic Programming
Sequence alignment is limited by:
Time: four operations are needed at each vertex, so the required time is proportional to the number of edges in the edit graph, i.e. O(nm), where n and m are the sequence lengths.
Space: the required memory is proportional to the number of vertices in the edit graph, i.e. O(nm).

Computational Complexity of Dynamic Programming
To compute only the score of the alignment, it is enough to keep 2 columns in memory at any time, because the score of each cell of the dynamic programming (DP) matrix depends only on three previously calculated cells. Therefore only linear memory is required for the forward calculation.
To compute the alignment itself (backtracking through the matrix), however, all columns are needed, since every score may be required to trace the best alignment. This takes quadratic memory, O(nm), which is O(n^2) for sequences of equal length.
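The two-column idea can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name alignment_score and the scoring values (match = 1, mismatch = -1, indel = -1) are assumptions chosen for the example.

```python
def alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best global alignment of v and w.

    Only two columns of the DP matrix are kept at a time, so the memory
    used is linear in len(v). The scoring values are illustrative
    assumptions, not taken from the lecture notes.
    """
    n = len(v)
    # Column 0: aligning the prefix v[:i] against the empty string.
    prev = [i * indel for i in range(n + 1)]
    for j in range(1, len(w) + 1):
        # First cell of column j: empty prefix of v against w[:j].
        curr = [j * indel] + [0] * n
        for i in range(1, n + 1):
            diagonal = prev[i - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            horizontal = prev[i] + indel      # gap in v
            vertical = curr[i - 1] + indel    # gap in w
            curr[i] = max(diagonal, horizontal, vertical)
        prev = curr  # the older column is no longer needed
    return prev[n]


if __name__ == "__main__":
    print(alignment_score("TGCATA", "ATCTGAT"))
```

The function returns only the score. Recovering the alignment itself from two columns is not possible, which is the limitation the notes describe.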

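Space-Efficient Sequence Alignment
To reduce the space complexity of sequence alignment, find the middle vertex of the longest path between the source (0,0) and the sink (n,m). For every i, compute s(i, m/2), the score of the longest path from (0,0) to (i, m/2), and s_reverse(i, m/2), the score of the longest path from (i, m/2) to (n,m). The middle vertex is the vertex (i, m/2) that maximizes s(i, m/2) + s_reverse(i, m/2); both sets of scores can be computed in linear space. Repeat this process recursively for the path between the source and the middle vertex and for the path between the middle vertex and the sink.
[Figure: the rectangle between the source (0,0) and the sink (n,m) is split at the middle column m/2, and each part is split again at its own middle vertex.]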

Space-Efficient Sequence Alignment
The computing time is proportional to the area of the rectangles. The total time to find the middle vertices is therefore: area + area/2 + area/4 + ... ≤ 2*area. The space complexity is linear, O(n). Pseudocode for this algorithm is:
Path(source, sink)
  if source and sink are in consecutive columns
    output the longest path from the source to the sink
  else
    middle ← middle vertex between source and sink
    Path(source, middle)
    Path(middle, sink)
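The Path recursion can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the names last_column, sub, and align, and the scoring constants MATCH, MISMATCH, and INDEL, are invented for the example. For brevity the sketch slices strings instead of passing indices, so its memory use is not strictly O(n); a careful implementation would work on index ranges.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1  # illustrative scoring, an assumption


def sub(a, b):
    """Score of aligning character a against character b."""
    return MATCH if a == b else MISMATCH


def last_column(v, w):
    """Best scores of every prefix v[:i] against all of w, keeping two columns."""
    prev = [i * INDEL for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * INDEL]
        for i in range(1, len(v) + 1):
            curr.append(max(prev[i - 1] + sub(v[i - 1], w[j - 1]),
                            prev[i] + INDEL,
                            curr[i - 1] + INDEL))
        prev = curr
    return prev


def align(v, w):
    """Return an optimal global alignment of v and w as two gapped strings."""
    if len(w) == 0:
        return v, "-" * len(v)
    if len(v) == 0:
        return "-" * len(w), w
    if len(w) == 1:
        # Base case: source and sink are in consecutive columns.
        best_i = max(range(len(v)), key=lambda i: sub(v[i], w))
        paired = (len(v) - 1) * INDEL + sub(v[best_i], w)
        unpaired = (len(v) + 1) * INDEL
        if paired >= unpaired:
            return v, "-" * best_i + w + "-" * (len(v) - best_i - 1)
        return v + "-", "-" * len(v) + w
    mid = len(w) // 2
    n = len(v)
    # Scores from the source to the middle column, and (by reversing both
    # strings) from the middle column to the sink.
    prefix = last_column(v, w[:mid])
    suffix = last_column(v[::-1], w[mid:][::-1])
    # Middle vertex: the row i that maximizes the sum of the two scores.
    i = max(range(n + 1), key=lambda r: prefix[r] + suffix[n - r])
    left = align(v[:i], w[:mid])
    right = align(v[i:], w[mid:])
    return left[0] + right[0], left[1] + right[1]


if __name__ == "__main__":
    top, bottom = align("TGCATA", "ATCTGAT")
    print(top)
    print(bottom)
```

Each recursive call works on a rectangle of about half the area of its parent, which is where the area + area/2 + area/4 + ... bound on the running time comes from.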

String Matching: naïve approach
Let's say we want to compare a sequence of length l = 10 against a database of length, for example, n = 10^9, and we want to find the exact occurrences of the l = 10 sequence in the database. We can:
1. Move the sequence along the database one base at a time and check for a match at each position. This takes a long time, on the order of l*n comparisons in the worst case, and essentially amounts to moving along every diagonal of the alignment between the sequence and the database.
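The sliding comparison can be written in a few lines of Python. This is a minimal sketch; the function name naive_find and the toy strings are assumptions made for illustration.

```python
def naive_find(pattern, text):
    """Return every start position where pattern occurs exactly in text.

    The pattern is compared against each window of the text in turn,
    so the worst-case cost is about len(pattern) * len(text) comparisons.
    """
    l = len(pattern)
    return [i for i in range(len(text) - l + 1) if text[i:i + l] == pattern]


if __name__ == "__main__":
    print(naive_find("ATA", "GATATATA"))
```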

String Matching: hashing
2. Create a hash table of all l-length strings that occur in the database, and look up your l-length string in the hash table.
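A hash table of l-length substrings can be sketched with a Python dictionary. The names build_index and lookup are assumptions for this example, and a real database of 10^9 bases would need a far more compact index than a dictionary of string slices.

```python
from collections import defaultdict


def build_index(text, l):
    """Map every l-length substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - l + 1):
        index[text[i:i + l]].append(i)
    return index


def lookup(index, pattern):
    """Start positions of pattern in the indexed text (empty list if absent)."""
    return index.get(pattern, [])


if __name__ == "__main__":
    index = build_index("GATATATA", 3)
    print(lookup(index, "ATA"))
```

Building the index takes one pass over the database. After that, each exact query of length l costs a single table lookup instead of a scan.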

Approximate String Matching
Now if instead of l = 10 we have l = 1000, we can apply the same method by dividing the query into overlapping strings 10 bases long and combining the resulting matches. String matching in this fashion may be done using filtration/verification algorithms, which are described next.

Filtration/Verification Method
Let's say we want to find a string in a database with up to 2 mismatches. More generally, given a text t_1 ... t_n and a query q_1 ... q_p, the query matching problem is to find all m-substrings of the query and the text that match with at most k mismatches. Filtration/verification algorithms perform this task in two stages. First, a set of positions in the text that are potentially similar to the query is preselected. Second, each potential position is verified by walking in both directions while the number of mismatches does not exceed k: it is accepted if at most k mismatches are found and rejected otherwise.

Filtration/Verification Method
The filtration algorithm is done in 2 steps:
1. Potential match detection: find all matches of l-tuples in both the query and the text for l = floor(m/(k+1)). Any two m-substrings that match with at most k mismatches must share an exact l-tuple of this length, and such l-tuple matches are sparse because they happen rarely.
2. Potential match verification: verify each potential match by extending it to the left and to the right until either (i) the first k+1 mismatches are found or (ii) the beginning or end of the query or the text is reached.
This is the idea behind BLAST and FASTA.
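The two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the BLAST or FASTA implementation: the names hamming and query_matches are invented, and the verification step is simplified. Instead of extending left and right until k+1 mismatches, it directly checks every m-window on the seed's diagonal that contains the seed.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between equal-length strings a and b."""
    return sum(x != y for x, y in zip(a, b))


def query_matches(query, text, m, k):
    """All pairs (i, j) where query[i:i+m] and text[j:j+m] differ in at most k positions."""
    l = m // (k + 1)  # seed length from the notes
    if l == 0:
        raise ValueError("m must be at least k + 1")
    # Filtration: index every l-tuple of the text in a hash table.
    index = defaultdict(list)
    for j in range(len(text) - l + 1):
        index[text[j:j + l]].append(j)
    hits = set()
    for qi in range(len(query) - l + 1):
        for tj in index.get(query[qi:qi + l], []):
            # Verification (simplified): try each m-window on this diagonal
            # that contains the exact l-tuple match.
            for d in range(m - l + 1):
                i, j = qi - d, tj - d
                if i < 0 or j < 0 or i + m > len(query) or j + m > len(text):
                    continue
                if hamming(query[i:i + m], text[j:j + m]) <= k:
                    hits.add((i, j))
    return sorted(hits)


if __name__ == "__main__":
    print(query_matches("ACGTACGT", "TTACGAACGTTT", m=8, k=1))
```

Only positions that share an exact l-tuple are ever verified, so most of the text is skipped when l-tuple matches are rare.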