1 Faster algorithms for string matching with k mismatches Adviser : R. C. T. Lee Speaker: C. C. Yen Journal of Algorithms, Volume 50, Issue 2, February.

Presentation transcript:

1 Faster algorithms for string matching with k mismatches
Adviser: R. C. T. Lee
Speaker: C. C. Yen
Journal of Algorithms, Volume 50, Issue 2, February 2004
Amihood Amir, Moshe Lewenstein and Ely Porat

2 String matching with k mismatches
Input: A text T of length n, a pattern P of length m, and a mismatch threshold k.
Output: Every substring S of T of length m with HD(S, P) ≤ k, where HD denotes the Hamming distance.
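As a point of reference, the problem statement can be sketched as a brute-force O(nm) baseline. This is not the paper's algorithm; the function names are illustrative, and the threshold test HD(S, P) ≤ k is the usual reading of "k mismatches".

```python
def hamming_distance(s, p):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(s, p) if a != b)


def k_mismatch_naive(text, pattern, k):
    """Start positions i where text[i:i+m] is within Hamming distance k of pattern."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if hamming_distance(text[i:i + m], pattern) <= k
    ]
```

The faster algorithms on the following slides aim to produce the same set of positions without comparing every window character by character.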

3 The basic idea of the following algorithms
The authors consider the number of distinct symbols in the pattern and design an efficient algorithm for each case.
Example: P = ACAABD. The number of distinct symbols of P is 4.

4 Three cases of the number of distinct symbols in the pattern
The paper discusses the following three cases; k is the maximal number of mismatches allowed.
1. There are at least 2k distinct symbols.
2. There are fewer than 2√k distinct symbols.
3. The number of distinct symbols is between 2√k and 2k.
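The case split can be sketched as a small classifier. The thresholds 2√k and 2k used here are an assumption inferred from the worked examples on later slides (for instance, 4 distinct symbols with k = 4 falling into Case 3); the function name is illustrative.

```python
def classify_pattern(pattern, k):
    """Return 1, 2 or 3 according to the number of distinct pattern symbols.

    Assumed thresholds: Case 1 if d >= 2k, Case 2 if d < 2*sqrt(k), else Case 3.
    The square-root test is done on integers (d*d < 4k) to avoid float rounding.
    """
    d = len(set(pattern))
    if d >= 2 * k:
        return 1
    if d * d < 4 * k:
        return 2
    return 3
```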

5 Case 1: At least 2k distinct symbols
There are two stages in the algorithm.
1. Marking: Identify potential starts of the pattern and do a crude pruning of the potential candidates.
2. Verification: Verify which of the potential candidates are indeed occurrences of the pattern.
In this case, the algorithm solves the string matching with k mismatches problem in linear time.

6 The basic idea of this paper is as follows
1) Let A = {a_1, a_2, …, a_2k} be a set of 2k distinct symbols appearing in P.
2) Let P' be the shortest prefix of P containing every symbol of A.
3) Let the length of P' be C.
4) Let S be a substring of T of length C.
5) Suppose that, among the 2k distinct symbols of A, there are d matches between P' and S.
6) Then, among the 2k corresponding locations in P', there are 2k − d mismatches.
7) If 2k − d > k, that is, d < k, we may ignore S entirely.
(Figure: S aligned with P', both of length C, with d matches.)

7 But, how can we determine d ? We may use a position table

8 Marking stage of Case 1
Let {a_1, …, a_2k} be 2k different alphabet symbols appearing in the pattern, and let i_j be the smallest index in the pattern where a_j appears, j = 1, …, 2k.
Create a position table in which S_0 … S_{2k−1} are the distinct symbols of pattern P and pos_0 … pos_{2k−1} are the locations of their first appearance in P.
Example:
T = ACBBDACTADIKQDABD… = T_0 … T_{n−1}
P = ACABDAE
k = 2

          S_0  S_1  S_2  S_3
symbols   A    C    B    D
pos       0    1    3    4

9 We scan the text T. For each t_i, if we can find an s_j such that t_i = s_j, we add 1 to location i − pos_j of an array X. If i − pos_j is less than 0, we ignore it. X is an array of size n, and all elements of X are initially 0.

          S_0  S_1  S_2  S_3
symbols   A    C    B    D
pos       0    1    3    4

S_0 … S_3 represent the 2k distinct symbols in pattern P, and pos_0 … pos_3 are the locations of their first appearance in P.
T = ACBBDACTADIKQDABD… = T_0 … T_{n−1}
P = ACABDAE
k = 2
X = …

10 After the scanning is completed, we obtain the array X. For every X(a) = b, we know that b of the 2k distinct symbols match between T(a, a+C−1) and P(0, C−1), so there are at least 2k − b mismatches. If b < k, then 2k − b > k, and we may ignore T(a, a+C−1). In our example, we need to examine only T(0,4) and T(5,9); we ignore all other substrings.
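The marking stage described above can be sketched as follows. The function names are illustrative, and the 2k symbols are taken to be the first 2k distinct symbols of the pattern, which is one possible choice rather than something the slides prescribe.

```python
def first_positions(pattern, count):
    """Map each of the first `count` distinct pattern symbols to its first index."""
    pos = {}
    for i, ch in enumerate(pattern):
        if ch not in pos:
            pos[ch] = i
            if len(pos) == count:
                break
    return pos


def mark(text, pattern, k):
    """Build the array X: X[a] counts table symbols aligned at start a."""
    pos = first_positions(pattern, 2 * k)
    x = [0] * len(text)
    for i, ch in enumerate(text):
        j = pos.get(ch)
        if j is not None and i - j >= 0:  # ignore negative start positions
            x[i - j] += 1
    return x


def candidates(x, k):
    """Start positions that cannot be ruled out, i.e. X[a] >= k."""
    return [a for a, b in enumerate(x) if b >= k]
```

On the visible prefix of the slide's text, `mark("ACBBDACTADIKQDABD", "ACABDAE", 2)` gives X[0] = 4 and X[5] = 3, matching the two substrings singled out above. Each text symbol triggers at most one increment, which is the fact Lemma 1 relies on.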

11 Lemma 1
For Case 1, let n denote the length of the text and k the maximal number of mismatches allowed. There are at most n/k candidate locations.
Proof: The total number of additions to the X array is at most n, because the algorithm tests each T(i), i = 1, 2, …, n, once, and each test adds at most 1. Let a be the number of locations whose counters are at least k. Then a·k ≤ n, so a ≤ n/k.

12 By Lemma 1, we know that at most n/k candidate locations remain. But not all candidate locations are starting points of matches with at most k mismatches.
Take T(5) as an example:
P = ACABDAE
T = ACBBDACTADIKQDABD…
X = …
There are four mismatches between P and the substring starting at T(5), which exceeds k, so this candidate location is not a starting point of a match with at most k mismatches.

13 We must verify which candidate locations are starting points of matches with at most k mismatches.

14 Verification stage of Case 1
The authors use the Kangaroo Method to verify in O(k) time whether a location has at most k mismatches.
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…
P = ETBDBCCDFDC
We shall not elaborate on this method because it was presented before.
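For readers who have not seen the earlier presentation, the jumping structure of the Kangaroo Method can be sketched as below. This is a simplification: the longest-common-extension query is done here by a naive scan, whereas the O(k) bound assumes constant-time queries from a preprocessed structure such as a suffix tree with lowest-common-ancestor support. Function names are illustrative.

```python
def lce(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:].

    Naive scan used as a stand-in for a constant-time query.
    """
    n = 0
    while (i + n < len(text) and j + n < len(pattern)
           and text[i + n] == pattern[j + n]):
        n += 1
    return n


def kangaroo_verify(text, pattern, start, k):
    """True if pattern matches text at `start` with at most k mismatches."""
    m = len(pattern)
    if start + m > len(text):
        return False
    j = 0
    mismatches = 0
    while True:
        j += lce(text, start + j, pattern, j)  # jump over a matching run
        if j >= m:
            return True
        mismatches += 1                        # position j is a mismatch
        if mismatches > k:
            return False
        j += 1                                 # step past the mismatch
```

With constant-time queries, the loop performs at most k + 1 jumps per candidate, which is where the O(k) verification cost comes from.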

15 Time complexity of Case 1
The marking stage takes O(n) time, where n is the length of the text. According to Lemma 1, there are at most n/k candidate locations. Using the Kangaroo Method, it takes O(k) time to verify each remaining candidate location. Thus, the verification stage takes (n/k) · O(k) = O(n) time.

16 Case 2: Fewer than 2√k distinct symbols
We can use the Boolean convolution method to solve the problem for this case.

17 The Hamming distance can be found by convolution.
Let A = abac and B = acdc. For this case, HD(A, B) = 2.
A           = a b a c
B           = a c d c
B reversed  = c d c a (the form used in the convolution)
The convolution counts 2 matches, so HD(A, B) = 4 − 2 = 2.

18 Using Fast Fourier Transforms (FFT), a Boolean convolution can be done in O(n log m) time. Our alphabet size is less than 2√k, and one convolution is needed per symbol. We take O(n√k log m) time to solve the problem for Case 2.
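The per-symbol counting behind this case can be sketched as follows. For each symbol, the text and the pattern are turned into 0/1 indicator vectors and correlated; summing over symbols gives the number of matches at every alignment. The inner sum here is computed directly in O(nm) per symbol purely for readability; in the method described above it would be replaced by an FFT-based convolution. Function names are illustrative.

```python
def matches_by_symbol(text, pattern):
    """Number of matching positions for every alignment of pattern in text."""
    n, m = len(text), len(pattern)
    matches = [0] * (n - m + 1)
    for sym in set(pattern):
        t = [1 if c == sym else 0 for c in text]     # indicator of sym in text
        p = [1 if c == sym else 0 for c in pattern]  # indicator of sym in pattern
        for i in range(n - m + 1):
            # Direct correlation; an FFT would compute all i at once.
            matches[i] += sum(t[i + j] * p[j] for j in range(m))
    return matches


def mismatches_by_convolution(text, pattern):
    """Hamming distance at every alignment: m minus the number of matches."""
    m = len(pattern)
    return [m - c for c in matches_by_symbol(text, pattern)]
```

The outer loop runs once per distinct pattern symbol, which is why a small alphabet keeps the total cost low.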

19 Case 3: The number of distinct symbols is between 2√k and 2k
Definition: A frequent symbol is a symbol that appears in the pattern at least 2√k times.
Example: k = 2, 2√k ≈ 2.83, P = baccdbdd. The symbol d appears 3 times, so d is a frequent symbol.
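The definition can be checked mechanically. The sketch below assumes the threshold is 2√k occurrences, which agrees with the examples on these slides; the comparison is done on integers (c² ≥ 4k) to avoid floating-point rounding. The function name is illustrative.

```python
from collections import Counter


def frequent_symbols(pattern, k):
    """Symbols occurring at least 2*sqrt(k) times in the pattern (assumed threshold)."""
    counts = Counter(pattern)
    return sorted(s for s, c in counts.items() if c * c >= 4 * k)
```

For the slide's example, `frequent_symbols("baccdbdd", 2)` returns `['d']`.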

20 Two sub-cases of Case 3
Case 3-1: There are at least √k frequent symbols in the pattern.
Case 3-2: There are fewer than √k frequent symbols in the pattern.

21 Case 3-1: At least √k frequent symbols
There are two stages in the algorithm for this case.
(1) Marking stage: Identify potential starts of the pattern and do a crude pruning of the potential candidates.
(2) Verification stage: Verify which of the potential candidates are indeed occurrences of the pattern. The verification stage is done by the Kangaroo Method.

22 Marking stage of Case 3-1
Example: Let P = ABCAABBDBAA and k = 4. There are 4 distinct symbols in P (4 is between 2√k = 4 and 2k = 8), and A and B are frequent symbols. There are 2 (= √k) frequent symbols.
We arbitrarily pick √k frequent symbols and convert this problem to a mismatch problem with don't cares.
T = ABCABDCABBCFADDABC

23 Mismatch problem with don't cares
Input: A text T of length n and a pattern P of length m, where g characters of the pattern are not don't care symbols and the rest are Φ (don't care).
Output: The number of mismatches between the pattern and each substring of T of length m. Only mismatches of the g pattern characters are counted.
T = ABCABDCABBCFADDABD
P = ABΦAABBΦBAΦ
The number of mismatches: 4

24 Mismatch problem with don't cares (same input and output; P shifted to the next location)
The number of mismatches: 4 7

25 Mismatch problem with don't cares (same input and output; P shifted to the next location)
The number of mismatches: 4 7 7

26 Mismatch problem with don't cares (same input and output; P shifted to the next location)
The number of mismatches: 4 7 7 2

27 The mismatch problem with don't cares can be solved in O(n√g log m) time (Amir et al., 1997), where n is the length of the text T, m is the length of the pattern P, and g is the number of characters in the pattern that are not don't care symbols.
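The input and output of the don't care problem can be illustrated with a direct O(nm) count. This only shows what is being computed; it is not the faster algorithm cited above. The constant and function names are illustrative.

```python
DONT_CARE = "Φ"


def mismatches_with_dont_care(text, pattern):
    """Mismatch count at every alignment, skipping don't care pattern positions."""
    m = len(pattern)
    return [
        sum(
            1
            for j, p in enumerate(pattern)
            if p != DONT_CARE and text[i + j] != p
        )
        for i in range(len(text) - m + 1)
    ]
```

With T = ABCABDCABBCFADDABD and P = ABΦAABBΦBAΦ, the first four counts are 4, 7, 7 and 2, the values built up on the preceding slides.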

28 All locations with at most k mismatches of frequent symbols are our candidate locations, where matches with at most k mismatches may start.
Example: k = 4
T = ABCABDCABBCFADDABD
P = ABΦAABBΦBAΦ
The number of mismatches at each location is compared with k.

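29 Lemma 2 for Case 3-1
Let {a_1, …, a_√k} be frequent symbols. Then there exist in the text at most 2n/√k locations where there is a pattern occurrence with no more than k errors.
Proof: Count 2√k occurrences of each of the √k frequent symbols in the pattern, 2k pattern positions in all. Each T(i), i = 1, 2, …, n, contributes at most 2√k marks, so the total number of marks is at most 2√k·n. A pattern occurrence with no more than k errors needs at least k marks. Let a be the number of locations with at least k marks. Then a·k ≤ 2√k·n, so a ≤ 2n/√k.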

30 We convert the marking stage to a mismatch problem with don't cares and take O(n√k log m) time to solve it. According to Lemma 2 for Case 3-1, there are at most 2n/√k candidate locations, and we take O(k) time to verify one candidate location. The verification stage for Case 3-1 takes O(n√k) time.

31 Case 3-2: Fewer than √k frequent symbols
First, we check the number of mismatches after converting all frequent symbols to Φ (don't care symbol).
Example: Let P = ABCAABGDBAA and k = 5. There are 5 distinct symbols in P (5 is between 2√k ≈ 4.47 and 2k = 10), and A is the only frequent symbol. There is 1 (< √k ≈ 2.24) frequent symbol.
T = ABCABDCABBCFADDABC

32 Two cases are discussed after we convert all frequent symbols to Φ:
Case 3-2-1: There are fewer than 2k remaining symbols.
Case 3-2-2: There are at least 2k remaining symbols.

33 Case 3-2-1: There are fewer than 2k remaining symbols
There are fewer than 2k remaining symbols, and the rest are don't care symbols. Finding the mismatches of the remaining symbols can be solved as a mismatch problem with don't cares and takes O(n√k log m) time.
T = ABCABDCABBCFADDABC
P = ΦBCΦΦBGDBΦΦ
Mismatches of remaining symbols: 3

34 Case 3-2-1 (same text; P shifted to the next location)
Mismatches of remaining symbols: 3 5

35 Case 3-2-1 (same text; P shifted to the next location)
Mismatches of remaining symbols: 3 5 6

36 Case 3-2-1 (same text; P shifted to the next location)
Mismatches of remaining symbols: 3 5 6 4

37 Case 3-2-1 (same text; P shifted to the next location)
Mismatches of remaining symbols: 3 5 6 4 4

38 All locations whose total number of mismatches, over the frequent symbols and the remaining symbols together, is at most k are the matches we want.
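The bookkeeping for this sub-case, counting mismatches of frequent symbols and of remaining symbols separately and then adding them, can be sketched as below. The direct O(nm) loop only illustrates what is summed; it does not reproduce the running time of the method. Function names are illustrative, and the set of frequent symbols is passed in by the caller.

```python
def split_mismatches(text, pattern, frequent):
    """Per alignment: (mismatches at frequent symbols, mismatches at remaining symbols)."""
    m = len(pattern)
    freq_counts, rest_counts = [], []
    for i in range(len(text) - m + 1):
        freq = rest = 0
        for j, p in enumerate(pattern):
            if text[i + j] != p:
                if p in frequent:
                    freq += 1
                else:
                    rest += 1
        freq_counts.append(freq)
        rest_counts.append(rest)
    return freq_counts, rest_counts


def matches_within_k(text, pattern, frequent, k):
    """Locations whose combined mismatch count is at most k."""
    freq, rest = split_mismatches(text, pattern, frequent)
    return [i for i, (f, r) in enumerate(zip(freq, rest)) if f + r <= k]
```

With T = ABCABDCABBCFADDABC, P = ABCAABGDBAA and frequent symbol A, the remaining-symbol counts begin 3, 5, 6, 4, 4, the same values as in the Case 3-2-1 example.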

39 Conclusion: The problem for Case 3-2-1 can be solved in O(n√k log m) time.

40 Case 3-2-2: There are at least 2k remaining symbols
There are two stages in the algorithm for this case.
(1) Marking stage: Identify potential starts of the pattern and do a crude pruning of the potential candidates.
(2) Verification stage: Verify which of the potential candidates are indeed occurrences of the pattern. The verification stage is done by the Kangaroo Method.

41 Marking stage of Case 3-2-2
We arbitrarily pick 2k remaining symbols and convert all other symbols to Φ (don't care symbols). The marking stage of Case 3-2-2 can be solved as a mismatch problem with don't cares in O(n√k log m) time.

42 Conclusion: The problem for Case 3-2-2 can be solved in O(n√k log m) time.

43 Thank you