Combinational Pattern Matching Shashank Kadaveru.

Slides:



Advertisements
Similar presentations
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Space-for-Time Tradeoffs
BLAST Sequence alignment, E-value & Extreme value distribution.
A new method of finding similarity regions in DNA sequences Laurent Noé Gregory Kucherov LORIA/UHP Nancy, France LORIA/INRIA Nancy, France Corresponding.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Pattern Matching1. 2 Outline and Reading Strings (§9.1.1) Pattern matching algorithms Brute-force algorithm (§9.1.2) Boyer-Moore algorithm (§9.1.3) Knuth-Morris-Pratt.
Goodrich, Tamassia String Processing1 Pattern Matching.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Heuristic alignment algorithms and cost matrices
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Pattern Matching1. 2 Outline Strings Pattern matching algorithms Brute-force algorithm Boyer-Moore algorithm Knuth-Morris-Pratt algorithm.
Sequence alignment, E-value & Extreme value distribution
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Protein Sequence Alignment and Database Searching.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, All rights reserved.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (continued) Advanced Implementation of Tables.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST, which stands for basic local alignment search tool, is a heuristic algorithm that is used to find similar sequences of amino acids or nucleotides.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
BLAST, which stands for basic local alignment search tool, is a heuristic algorithm that is used to find similar sequences of amino acids or nucleotides.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
13 Text Processing Hongfei Yan June 1, 2016.
Local alignment and BLAST
CSE 5290: Algorithms for Bioinformatics Fall 2011
CSE 5290: Algorithms for Bioinformatics Fall 2009
Fast Sequence Alignments
Pattern Matching 12/8/ :21 PM Pattern Matching Pattern Matching
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Pattern Matching 2/15/2019 6:17 PM Pattern Matching Pattern Matching.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Pattern Matching 4/27/2019 1:16 AM Pattern Matching Pattern Matching
Sequences 5/17/ :43 AM Pattern Matching.
Basic Local Alignment Search Tool
Sequence alignment, E-value & Extreme value distribution
CSE 5290: Algorithms for Bioinformatics Fall 2009
Presentation transcript:

Combinational Pattern Matching Shashank Kadaveru

Introduction In Motif Finding problem, no particular pattern is given to search for. We infer it from the sample. In Combinational Pattern Matching, we look for exact or appropriate occurrences of given patterns in a long text.

Pattern Matching Might seem simpler than Motif finding because we know what we are looking for. But due to the large size of genomes, in practice it is difficult. Here, we develop number of ways to make pattern matching in a long string.

Repeat Finding Genetic diseases are associated with deletions, duplications and rearrangements of long chromosomal regions. For example, DiGeorge syndrome which results in impaired immune system and heart defects, is associated with a 3MB deletion on human chromosome 22. The deletion removes important genes which leads to a disease.

Repeat Finding Repeats in DNA hold many evolutionary secrets. A striking and still unexplained phenomenon is the large number of repeats in many genomes. For example, repeats account for about 50% of the human genome.

Duplicate Removal Problem Find all unique entries in a list of integers. For example, the list (8, 1, 5, 1, 0, 4, 5, 10, 1) has the elements 1 and 5 repeated multiple times, so the resulting list would be (8, 1, 5, 0, 4, 10)

Duplicate Removal Algorithm DUPLICATEREMOVAL(a, n) 1 m <- largest element of a 2 for i <- 0 to m 3 bi <- 0 4 for i <- 0 to n 5 b ai <- 1 6 for i <- 0 to m 7 if bi = 1 8 output i 9 return

Duplicate Removal Algorithm This approach requires O(n log n) time to sort the list of size n. Difficult to use when the entries are arbitrarily large. Creating a list with so many values is not efficient and may not even be possible.

Duplicate Removal Algorithm If we can map any number of entries to some number between 1 and 1000 then we can use binning strategy efficiently. The function used for mapping is called Hash Function. The second list is called Hash Table.

Hash Function Should be easy to calculate and integer valued. A simple hash function takes h(x) to integer remainder of x/|b|. We want different integers in the array to map to different integers in b. But this isn’t possible when the size of array a is larger than b.

Hash Function Example: ◦ If h(x) = x/1000 ◦ 7, 1007, 2007 collide and fall into the same bin. This is called a collision. To deal with collisions, we use Chaining.

Chaining Elements collapsing to the same bin are organized into a linked list. Also, elements falling into the same bins reveal duplicate elements and hence solves the duplicate removal problem.

Chaining This concept can apply to many other problems. Ex: to find exact repeats of l-mer in a genome. If we have to find exact repeats for large values like l = 40, then we need not construct 4 40 entries. We can design a suitable hashing function and use a table of much smaller size.

Exact Pattern Matching If we have a pattern p = p 1 p 2.. p n and a string t = t 1 t 2.. t m Pattern matching problem is used to find any or all occurrences of pattern p in the text t. Brute force algorithm is used for solving Pattern Matching problem.

Exact Pattern Matching PATTERNMATCHING(p, t) 1 n <- length of pattern p 2 m <- length of text t 3 for i <- 1 to m − n if ti = p 5 output i

Exact Pattern Matching Worst case running time is O (mn) Such a worst-case scenario can be seen by searching for p = AAAAT in t =AAAA...AAAAA. Peter Weiner invented a data structure called Suffix Tree that solves Pattern Matching problem in linear time O(m) for any text or pattern. To understand this, we need to learn about Keyword Trees.

Keyword Trees Multiple Pattern Matching Problem: ◦ Given a set of patterns and a text, find all occurrences of any of the patterns in the text. ◦ For example, if t = ATGGTCGGT and p 1 = GGT, p 2 = GGG, p 3 = ATG, and p4 = CG, then the result of an algorithm that solves the Multiple Pattern Matching problem would be positions 1, 3, 6, and 7.

Keyword Trees Multiple Pattern Matching problem with k patterns can be reduced to k individual Pattern Matching problems and solved in O(knm) time. However, there exists an even faster way to solve this problem in O(n + m) time where n is the total length of patterns p 1, p 2,..., p k

Keyword Trees Keyword Trees The key- word tree for the set of patterns apple, apropos, banana, bandana, and orange is shown in figure.

Keyword Trees The keyword tree has at most N vertices where N is the total length of all patterns, but may have fewer. One can construct the keyword tree in O(N) time by progressively extending the keyword tree.

Suffix Trees Suffix trees allow one to preprocess a text in such a way that for any pattern of length n, one can answer whether or not it occurs in the text, using only O(n) time. It doesn’t depend on the length of the text.

Suffix Trees Patterns: proud, perfect, muggle, ugly, rivet t = “mr and mrs dursley of number 4 privet drive were proud to say that they were perfectly normal thank you very much”

Suffix Trees

A suffix tree for a text of length m can be constructed in O(m) time, which leads immediately to a linear O(n + m) algorithm for the Pattern Matching problem Construct the suffix tree for t, in O(m) time, and then check whether p occurs in the tree, which requires O(n) time.

Suffix Trees The suffix tree for a text t = t1...tm is a rooted labeled tree with m leaves (numbered from 1 to m) satisfying the following conditions ◦ Each edge is labeled with a substring of the text ◦ Each internal vertex (except possibly the root) has atleast 2 children, ◦ Any two edges out of the same vertex start with a different letter, ◦ Every suffix of text t is spelled out on a path from the root to some leaf.

Suffix Trees

SUFFIXTREEPATTERNMATCHING(p, t) ◦ Build the suffix tree for text t ◦ Thread pattern p through the suffix tree. ◦ if threading is complete  output positions of every p-matching leaf in the tree ◦ else  output “pattern does not appear anywhere in the text”

Heuristic Similarity Search Algorithms The suffix tree algorithm above, while fast, can only find exact, rather than approximate, occurrences of a gene in a database. When we are trying to find an approximate match to a gene of length 10 3 in a database of size 10 10, the programming algorithms (like the Smith- Waterman local alignment algorithm) may be too slow.

Approximate Pattern Matching Finds all approximate occurrences of a pattern in a text. The naive brute force algorithm for approximate pattern matching runs in O(nm) time. The following algorithm will output all locations in t where p occurs with no more than k mismatches.

Approximate Pattern Matching APPROXIMATEPATTERNMATCHING(p, t, k) n ← length of pattern p m ← length of text t for i ← 1tom − n+1 ◦ dist ← 0 ◦ for j ← 1to n  if ti+j − 1 != pj dist ← dist + 1 if dist ≤ k ◦ output i

Approximate Pattern Matching In 1985 Gadi Landau and Udi Vishkin found an algorithm for approximate string matching with O(km) worst-case running time. Although this algorithm yields the best known worst-case performance, it is not necessarily the best in practice.

Heuristic Similarity Search Algorithms Many heuristics for fast database search in molecular biology use the same filtration idea. In 1973 Donald Knuth suggested a method for pattern matching with one mismatch based on the observation that strings differing by a single mismatch must match exactly in either the first or second half.

Heuristic Similarity Search Algorithms In 1985 the idea of filtration in computational molecular biology was used by David Lipman and Bill Pearson, in their FASTA algorithm. It was further developed in BLAST, now the dominant database search tool in molecular biology.

Heuristic Similarity Search Algorithms Biologists frequently depict similarities between two sequences in the form of dot matrices. A dot matrix is simply a matrix with each entry either 0 or 1, where a 1 at position (i, j) indicates some similarity between the ith position of the first sequence and the jth position of the second sequence.

Heuristic Similarity Search Algorithms

BLAST: Comparing a Sequence against a Database Using shared l-mers for finding similarities, as FASTA does, has some disadvantages. For example, two proteins can have different amino acid sequences but still be biologically similar. A common construct in many heuristic similarity search algorithms, including BLAST, is that of a scoring matrix similar to the scoring matrices introduced in chapter 6. These scoring matrices reveal similarities between proteins even if they do not share a single l-mer.

BLAST: Comparing a Sequence against a Database BLAST, uses scoring matrices to improve the efficiency of filtration and to introduce more accurate rules for locating potential matches. Another powerful feature of BLAST is the use of Altschul-Dembo-Karlin statistics for estimating the statistical significance of found matches.

BLAST: Comparing a Sequence against a Database A segment pair is just a pair of l-mers, one from each sequence. The maximal segment pair is a segment pair with the best score over all segment pairs in the two sequences. A segment pair is locally maximal if its score cannot be improved either by extending or by shortening both segments.

BLAST: Comparing a Sequence against a Database BLAST attempts to find all locally maximal segment pairs in the query and database sequences with scores above some set threshold. A fast algorithm for finding such l-mers is the key ingredient of BLAST.

BLAST: Comparing a Sequence against a Database An important observation is that if the threshold is high enough, then the set of all l- mers that have scores above a threshold is not too large. In this case the database can be searched for exact occurrences of the strings from this set using Multiple Pattern Matching Algorithm. After the potential matches are located, BLAST attempts to extend them to see whether the resulting score is above the threshold.

BLAST: Comparing a Sequence against a Database In recent years BLAST has been further improved by allowing insertions and deletions and combining matches on the same and close diagonals. BLAST reports matches to sequences that either have one segment score above the threshold or that have several closely located segment pairs that are statistically significant when combined.

BLAST: Comparing a Sequence against a Database According to the Altschul-Dembo-Karlin statistics, the num- ber of matches with scores above θ is approximately Poisson-distributed, with mean where m and n are the lengths of compared sequences, and K is a constant. Parameter λ is a positive root of the equation

BLAST: Comparing a Sequence against a Database where px and py are frequencies of amino acids x and y from the twenty- letter alphabet A and δ is the scoring matrix. The probability that there is a match of score greater that θ between two “random” sequences of length n and m is computed as 1 − e E( θ ).

Any Questions? Thank You