Bar Ilan University and Georgia Tech. Artistic Consultant: Aviya Amir

Given: a glass ball and an n-story building. Find: the floor k such that the ball breaks when dropped from floor k but does not break when dropped from floor k-1.

STRATEGY 1: Only one ball is given to experiment with. O(n) experiments are necessary: sequential search, dropping the ball from floors 1, 2, 3, … until it breaks.

STRATEGY 2: As many balls as necessary are given to experiment with. O(log n) experiments necessary. Binary search.

STRATEGY 3: Only two balls are given to experiment with. O(√n) experiments are necessary: a bounded divide-and-conquer. The first ball is dropped at floors √n, 2√n, 3√n, …; once it breaks, the second ball scans the preceding √n floors one at a time.
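
A minimal sketch of the two-ball strategy (my own illustration, not from the slides), assuming a hypothetical monotone oracle breaks(floor); it uses O(√n) drops in the worst case.

    import math

    def find_breaking_floor(n, breaks):
        """Two-ball strategy: find the lowest floor k with breaks(k) == True,
        using O(sqrt(n)) drops. Assumes breaks is monotone in the floor number."""
        step = max(1, math.isqrt(n))
        # First ball: jump in strides of ~sqrt(n) until it breaks.
        floor, prev = step, 0
        while floor <= n and not breaks(floor):
            prev, floor = floor, floor + step
        # Second ball: scan the preceding stride one floor at a time.
        for k in range(prev + 1, min(floor, n) + 1):
            if breaks(k):
                return k
        return None  # the ball does not break even from floor n

    # Example: the ball breaks from floor 37 of a 100-story building.
    print(find_breaking_floor(100, lambda f: f >= 37))  # -> 37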

The meaning of the two balls: a "bounded" divide-and-conquer. In reality, two different paradigms: 1. Work on large groups. 2. Work within a group.

In pattern matching: 1. Working on large groups: convolutions, computed in time O(n log m) using the FFT.

Problem: the O(n log m) bound holds only in algebraically closed fields, e.g. ℂ. Solution: reduce the problem to (Boolean/integer/real) multiplication. This reduction costs! Example: Hamming distance. Counting mismatches is equivalent to counting matches: at every location the number of mismatches is m minus the number of matches (example text: ABABCABBBA).

Example: Count all “hits” of 1 in pattern and 1 in text
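
A small sketch of this step (assumed details, not the authors' code): count, for every alignment, how many 1s in the pattern land on 1s in the text with a single FFT-based cross-correlation.

    import numpy as np
    from scipy.signal import fftconvolve

    def count_hits(text_bits, pattern_bits):
        """For every text location i, count the positions j with
        text[i + j] == 1 and pattern[j] == 1.
        One correlation over the whole text is O(n log n); the standard trick of
        cutting the text into overlapping blocks of length 2m gives O(n log m)."""
        t = np.asarray(text_bits, dtype=float)
        p = np.asarray(pattern_bits, dtype=float)
        # Cross-correlation = convolution with the reversed pattern.
        return np.rint(fftconvolve(t, p[::-1], mode="valid")).astype(int)

    text    = [1, 0, 1, 1, 0, 1, 0, 1]
    pattern = [1, 0, 1]
    print(count_hits(text, pattern))  # hit count at every location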

For a, b ∈ Σ define χ_a(b) = 1 if a = b and 0 otherwise. Applied position by position to a string, χ_a replaces every occurrence of a by 1 and every other symbol by 0.

For Σ = {a, b, c} compute χ_a(T) ⊗ χ_a(P) + χ_b(T) ⊗ χ_b(P) + χ_c(T) ⊗ χ_c(P), where ⊗ is the correlation computed by FFT above. Result: the number of times a in the pattern matches a in the text + the number of times b in the pattern matches b in the text + the number of times c in the pattern matches c in the text, i.e. the total number of matches at every location.
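
Putting the per-symbol indicators together, a hedged sketch (my own code, with an arbitrary example pattern) that sums one correlation per symbol and turns the match counts into Hamming distances:

    import numpy as np
    from scipy.signal import fftconvolve

    def hamming_distances(text, pattern):
        """Hamming distance of the pattern at every text location:
        one FFT correlation of chi_a(T) with chi_a(P) per alphabet symbol a."""
        n, m = len(text), len(pattern)
        matches = np.zeros(n - m + 1)
        for a in set(pattern):  # symbols absent from P contribute no matches
            chi_t = np.array([1.0 if c == a else 0.0 for c in text])
            chi_p = np.array([1.0 if c == a else 0.0 for c in pattern])
            matches += fftconvolve(chi_t, chi_p[::-1], mode="valid")
        return m - np.rint(matches).astype(int)  # mismatches = m - matches

    print(hamming_distances("ABABCABBBA", "ABB"))  # Hamming distance at each location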

So for an alphabet with |Σ| symbols (|Σ| fixed) the time is O(|Σ| n log m) = O(n log m). Problem: infinite alphabets.

Without loss of generality |Σ| = m + 1, since every element of T that does not appear in P can be replaced by a single symbol x that does not appear in P. Example: text ABCDEFGH has the same number of errors against pattern ABBBBBGH as the text ABXXXXGH does.
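
A one-line illustration of this reduction (my own sketch; the wildcard symbol is an assumption): map every text symbol that does not occur in the pattern to a single fresh symbol, leaving at most m + 1 distinct symbols.

    def reduce_alphabet(text, pattern, wildcard="#"):
        """Replace every text symbol that does not occur in the pattern by one
        fresh symbol; mismatch counts at every location are unchanged."""
        in_p = set(pattern)
        assert wildcard not in in_p  # the fresh symbol must not occur in P
        return "".join(c if c in in_p else wildcard for c in text)

    print(reduce_alphabet("ABCDEFGH", "ABBBBBGH"))  # -> AB####GH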

Divide and Conquer Idea (Wrong). Split Σ into Σ_1 ∪ Σ_2 of size m/2 each. Construct T_1, P_1 and T_2, P_2, where for S ∈ {T, P} and e ∈ {1, 2}: S_e[i] = S[i] if S[i] ∈ Σ_e, and a fixed symbol outside Σ_e otherwise.

The Algorithm: 1. Find num1 = the number of matches of P_1 in T_1. 2. Find num2 = the number of matches of P_2 in T_2. 3. matches ← num1 + num2. Time: O(n) at every recursive call for changing the alphabet.

Time: T(m) = 2T(m/2) + n. Closed form: T(m) = 2^i T(m/2^i) + (2^(i-1) + 2^(i-2) + … + 1) n; taking i = log m gives (2^(log m) + 2^(log m - 1) + … + 1) n = O(nm). THIS IS BAD !!!

Needed: a faster way to count the matches of a symbol x in the text against x in the pattern. Such a method exists if x appears in the pattern a very small number of times. Assume x appears in the pattern c times. For every occurrence of x in the text, update just the appropriate match counters of the c occurrences of x in the pattern. Time: O(nc).

Problem: in general x may occur in the pattern Θ(m) times, and then the total time becomes O(nm). BAD again. Tradeoff: if x appears in the pattern more than c times, count its matches by FFT, in time O(n log m) per such x. For all x's that appear in the pattern fewer than c times, count matches (simultaneously) in time O(nc).

How many symbols can appear in the pattern at least c times? At most m/c. For these symbols the time is O((m/c) n log m). For all other symbols the time is O(nc). The optimum is when the two terms are equal, i.e. c = √(m log m). Total time: O(n √(m log m)) (A-87, K-87).
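
A hedged sketch of the resulting hybrid (my own illustration of the idea, not the authors' implementation; the threshold handling and helper names are assumptions): frequent pattern symbols go through FFT correlations, rare ones through direct counter updates.

    import math
    from collections import Counter, defaultdict
    import numpy as np
    from scipy.signal import fftconvolve

    def count_matches(text, pattern, c=None):
        """Number of matching positions of the pattern at every text location.
        Symbols occurring at least c times in the pattern: one FFT correlation each.
        Rare symbols: walk their text occurrences and bump counters directly.
        With c ~ sqrt(m log m) the total work is O(n sqrt(m log m))."""
        n, m = len(text), len(pattern)
        if c is None:
            c = max(1, int(math.sqrt(m * max(1.0, math.log2(m)))))
        matches = np.zeros(n - m + 1)

        freq = Counter(pattern)
        frequent = {a for a, k in freq.items() if k >= c}

        # Frequent symbols: chi_a(T) correlated with chi_a(P).
        for a in frequent:
            chi_t = np.array([1.0 if ch == a else 0.0 for ch in text])
            chi_p = np.array([1.0 if ch == a else 0.0 for ch in pattern])
            matches += fftconvolve(chi_t, chi_p[::-1], mode="valid")
        matches = np.rint(matches)

        # Rare symbols: for each text occurrence, update the counter of every
        # alignment that puts one of its (< c) pattern occurrences on top of it.
        rare_positions = defaultdict(list)
        for j, a in enumerate(pattern):
            if a not in frequent:
                rare_positions[a].append(j)
        for i, a in enumerate(text):
            for j in rare_positions.get(a, ()):
                loc = i - j
                if 0 <= loc <= n - m:
                    matches[loc] += 1
        return matches.astype(int)

    cm = count_matches("ABABCABBBA", "ABB")
    print(cm)                   # matches at each location
    print(len("ABB") - cm)      # mismatches at each location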

In our tower metaphor: a separate convolution is done for each group of floors with more than c repetitions of a number x; every element of the groups with fewer than c repetitions is taken care of individually, but all those groups are "scanned" together.

Weighted Sequences. Alignment of "similar" sequences is one of the challenges of string matching. Assume that from a set of sequences over alphabet Σ = {a_1, a_2, …, a_k} a set of "probabilities" is constructed as follows: every text location i carries the values p_i(a_1), p_i(a_2), …, p_i(a_k), where p_i(a_j) is the probability that symbol a_j occurs at text location i.

This text of probabilities is called a weighted sequence. Our problem: Given: a weighted sequence T, a pattern P = s_1, …, s_m, and a probability ε. Find: all text locations i such that P occurs there with probability > ε, i.e. p_i(s_1) · p_{i+1}(s_2) · … · p_{i+m-1}(s_m) > ε. Example: pattern ACDB occurs at location 2 of the example text with the probability given by this product.
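
A naive sketch of this definition (my own toy probability table, not the paper's example): multiply the per-location probabilities of the pattern's symbols and compare with ε.

    def weighted_occurrences(weighted_text, pattern, eps):
        """weighted_text[i][a] = probability of symbol a at text location i.
        Return the locations where the pattern occurs with probability > eps."""
        n, m = len(weighted_text), len(pattern)
        hits = []
        for i in range(n - m + 1):
            prob = 1.0
            for j, s in enumerate(pattern):
                prob *= weighted_text[i + j].get(s, 0.0)
            if prob > eps:
                hits.append((i, prob))
        return hits

    # Toy weighted text over {A, B, C, D}: a probability distribution per location.
    T = [{"A": 0.5, "B": 0.5},
         {"A": 0.9, "C": 0.1},
         {"C": 1.0},
         {"D": 0.8, "B": 0.2},
         {"B": 1.0}]
    print(weighted_occurrences(T, "ACDB", eps=1/3))  # ACDB matches starting at index 1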

Iliopoulos et al., in a number of recent papers, answer the following questions about weighted sequences: 1. Do exact matching. 2. Construct a weighted suffix tree for indexing. Exact matching: 1. Convert the probabilities to logarithms; now we use sums rather than products. 2. Consider every text row separately: let T_a be the text row of symbol a, i.e. the sequence of values log p_i(a), for each a ∈ Σ. Then the log probability of the pattern at every location is given by the formula Σ_{a ∈ Σ} χ_a(P) ⊗ T_a, with χ and ⊗ as before.

Example: P = ABABCAB. χ_A(P) ⊗ T_A gives the sum of the log probabilities of the A positions, χ_B(P) ⊗ T_B gives the sum of the log probabilities of the B positions, and χ_C(P) ⊗ T_C gives the sum of the log probabilities of the C positions. Add them all up and get the result. Time: O(n log m) for a fixed alphabet.
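
A sketch of this convolution formulation (my own code, reusing the toy table from above; the tiny-probability floor is an assumption to stand in for log 0):

    import numpy as np
    from scipy.signal import fftconvolve

    def log_prob_occurrences(weighted_text, pattern):
        """Per-location log probability of the pattern: the sum over symbols a of
        chi_a(P) correlated with T_a, the row of log p_i(a) values."""
        n, m = len(weighted_text), len(pattern)
        tiny = 1e-300                      # floor so that log(0) stays finite
        total = np.zeros(n - m + 1)
        for a in set(pattern):
            T_a = np.array([np.log(loc.get(a, 0.0) + tiny) for loc in weighted_text])
            chi_p = np.array([1.0 if s == a else 0.0 for s in pattern])
            total += fftconvolve(T_a, chi_p[::-1], mode="valid")
        return total                       # compare against log(eps)

    T = [{"A": 0.5, "B": 0.5}, {"A": 0.9, "C": 0.1}, {"C": 1.0},
         {"D": 0.8, "B": 0.2}, {"B": 1.0}]
    logp = log_prob_occurrences(T, "ACDB")
    print(np.flatnonzero(logp > np.log(1 / 3)))  # same hit as the direct computation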

Weighted Hamming Distance (A, Iliopoulos, Kapah – 06). Compute the smallest number of mismatches for every location. Mismatches are not symmetric. If the errors are assumed to be in the text: how many text elements need to be changed (so that they match the corresponding pattern symbol with probability 1) in order to produce a match at each location? Example: for the example weighted text, pattern ACDB and ε = 1/3, there is a match at location 2 with 1 mismatch.

If the errors are assumed to be in the pattern: how many pattern symbols need to be replaced in order to have a match at a given location? Example: for the same weighted text, pattern ACDB and ε = 1/3, no match exists at location 2 even with 4 mismatches, since every aligned element already has the highest available probability; changing the pattern letter D to A, B, or C leaves the probability unchanged.

We solve both types of mismatch problems for weighted sequences (as well as a few flavors of edit distance). Here we show the simpler of the two mismatch definitions: changes to the text. We solve a more general problem. Input: a text T = t_1, …, t_n of non-negative numbers, a pattern P = p_1, …, p_m with p_j ∈ {0, 1}, and a natural number e. Find: for every text location i, the smallest number of text locations that, when changed to 0, bring the convolution result at i to be no greater than e. (We change the negatives to positives and drop the requirement that the numbers be log probabilities of a weighted sequence.)

We use the tower metaphor. Assumption: n < 2m + 1. Observation: for every text location we need the O(m) aligned text elements in sorted order, and we need the precise point at which zeroing the largest of them brings the remaining sum down to at most e.
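
A per-location sketch of this observation (my own naive illustration; it ignores the blocked speed-up described next): sort the aligned values that face a 1 in the pattern and zero the largest ones until the remaining sum is at most e.

    def min_zeroings(text_vals, pattern_bits, e):
        """For every location, the smallest number of aligned text values (facing a
        1 in the pattern) that must be set to 0 so the remaining sum is <= e.
        Naive O(n m log m) version of the per-location observation."""
        n, m = len(text_vals), len(pattern_bits)
        out = []
        for i in range(n - m + 1):
            vals = sorted((text_vals[i + j] for j in range(m) if pattern_bits[j] == 1),
                          reverse=True)
            total, removed = sum(vals), 0
            for v in vals:
                if total <= e:
                    break
                total -= v       # zero out the largest remaining element
                removed += 1
            out.append(removed)
        return out

    print(min_zeroings([5, 1, 7, 2, 3, 9], [1, 0, 1], e=6))  # -> [1, 0, 1, 1]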

The text elements are sorted into blocks (biggest at the bottom, smallest on top). First find the block at which the sum accumulated from the top is still at most e; then find where the crossing occurs within the next block (the seam block). Needed: for each text location, how many text elements fall into each block.

How many text elements fall into each block? One convolution per block: let T^j be the indicator text with T^j[i] = 1 if t_i belongs to block j and 0 otherwise, and convolve T^j with the pattern for every block j, saving the count at each text location. Time: O((m/c) n log m) for blocks of size c.

Similarly, let T'^j be the text with T'^j[i] = t_i if t_i belongs to block j and 0 otherwise, and do a convolution with the pattern for every block j. We now know, for each text location, the accumulated sum of block values that is still at most e and how many values it contains. All that remains is to find the exact number within the seam block. For every text location, go over the elements of the seam block from top to bottom; whenever an element aligns with a 1 in the pattern, add it to the running sum, until the sum exceeds e.

Implementation: keep, for every element of every block, its index in the text; subtract the index of the text location i and check whether that offset hits a 1 in the pattern. Total time for this correction step: O(nc), where c is the block size.

As always, taking block sizes of √(m log m) rather than √m will make the total time O(n √(m log m)).