Information Retrieval (Recuperació de la informació)

1 Information Retrieval (Recuperació de la informació)
- Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
- Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
http://www-igm.univ-mlv.fr/~lecroq/string/index.html

2 String Matching
String matching: definition of the problem (text, pattern). The approach depends on what we have: the text or the patterns.
Exact matching:
- The patterns ---> Data structures for the patterns
  - 1 pattern ---> The algorithm depends on |p| and |Σ|
  - k patterns ---> The algorithm depends on k, |p| and |Σ|
- The text ---> Data structure for the text (suffix tree, ...)
Approximate matching:
- Dynamic programming
- Sequence alignment (pairwise and multiple)
- Sequence assembly: hash algorithm
Probabilistic search:
- Hidden Markov Models
Extensions:
- Regular expressions

3 Regular expression
A regular expression ℛ is a string over the set of symbols Σ ∪ { ε, |, ·, *, (, ) } that is recursively defined as follows:
- ε (the empty string) is a regular expression
- A character of Σ is a regular expression
- ( ℛ ) is a regular expression
- ℛ1 · ℛ2 is a regular expression
- ℛ* is a regular expression
- ℛ1 | ℛ2 is a regular expression

4 Regular language
The language defined by a regular expression ℛ is the set of strings generated by ℛ. The problem of searching for a regular expression in a text T is to find all the factors of T that belong to this language.
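The problem statement can be checked by brute force. The sketch below is only an illustration and is not one of the methods covered in these slides: it relies on Python's standard re module and tests every factor of the text with fullmatch. The function name factors_in_language and the half-open (start, end) index convention are assumptions made for this example.

```python
import re

def factors_in_language(pattern, text):
    """Return all (start, end) pairs such that text[start:end] is in the language."""
    compiled = re.compile(pattern)
    hits = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch succeeds only if the whole factor is generated by the pattern
            if compiled.fullmatch(text[start:end]):
                hits.append((start, end))
    return hits

TEXT = "ababbbaab"
hits = factors_in_language(r"bb*(b|b*a)", TEXT)
print([TEXT[s:e] for s, e in hits])
# prints ['ba', 'bb', 'bbb', 'bbba', 'bb', 'bba', 'ba']
```

This takes a quadratic number of match attempts, which is why the automaton-based methods that follow scan the text only once.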

5 Methods
[Diagram: Regular expression → Parse tree → NFA → Strings found, by two routes:
- NFA → DFA → Search with a deterministic finite automaton
- NFA → Search with the bit-parallel Thompson automaton]

6 Methods
[Diagram repeated from slide 5. Next: search with a deterministic finite automaton.]

7 Search with a deterministic finite automaton
Given the regular expression bb*(b|b*a), the NFA is:
[Figure: NFA with states 0 to 3 and transitions labeled a and b.]
Because the text cannot be spelled out directly on the NFA, the NFA is transformed into a DFA:
[Figure: DFA with states 0, 1, 12 and 3 and transitions labeled a and b.]
And the search process… What is the cost?

8 Search example with DFA
Given the regular expression bb*(b|b*a) and the DFA:
[Figure: DFA with states 0, 1, 12 and 3 and transitions labeled a and b.]
The search on the text: b b b a a b a a b b …
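A minimal sketch of searching with a DFA, under stated assumptions. The ε-free NFA below (states 0 to 3) is one possible NFA for bb*(b|b*a) and may not be the one drawn on the slide. The DFA is obtained by the subset construction, and the initial state is kept in every subset so that an occurrence can start at any text position. On cost: the scan performs one table lookup per text character, while in the worst case the DFA can have a number of states exponential in the size of the NFA.

```python
from collections import deque

# Assumed epsilon-free NFA for bb*(b|b*a); the state numbering is illustrative.
NFA = {
    (0, "b"): {1},
    (1, "b"): {1, 2},
    (1, "a"): {3},
}
START, FINALS, ALPHABET = 0, {2, 3}, "ab"

def build_search_dfa():
    """Subset construction; START is re-added to every subset for searching."""
    start = frozenset({START})
    table, queue = {}, deque([start])
    while queue:
        subset = queue.popleft()
        if subset in table:
            continue
        table[subset] = {}
        for ch in ALPHABET:
            target = {START}
            for q in subset:
                target |= NFA.get((q, ch), set())
            target = frozenset(target)
            table[subset][ch] = target
            queue.append(target)
    return start, table

def dfa_search(text):
    """Return the 1-based positions where a factor in the language ends.

    Assumes the text only contains characters of ALPHABET.
    """
    state, table = build_search_dfa()
    ends = []
    for pos, ch in enumerate(text, start=1):
        state = table[state][ch]      # one lookup per text character
        if state & FINALS:            # the subset contains a final NFA state
            ends.append(pos)
    return ends

# Visible prefix of the text shown on the slide.
print(dfa_search("bbbaabaabb"))  # prints [2, 3, 4, 7, 10]
```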

9 Methods
[Diagram repeated from slide 5. Next: parse tree, NFA and search with the bit-parallel Thompson automaton.]

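10 Parse tree
A parse tree is a tree such that:
- internal nodes are labeled by operators
- leaves are labeled by characters of Σ and ε
[Figure: parse trees for ( ℛ ), ℛ1 · ℛ2, ℛ1 | ℛ2 and ℛ*. The root is the operator and its children are the parse trees of the subexpressions.]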

11 Parse tree: example
Given the regular expression bb*(b|b*a), the parse tree is:
[Figure: parse tree with internal nodes ·, | and * and leaves b, b, b, b, a.]
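One way to write down such a parse tree is with nested tuples. This is a sketch under assumptions: the tags "char", "cat", "alt" and "star" are hypothetical names chosen for this example, and concatenation is grouped to the right, which may differ from the grouping in the slide's figure.

```python
# Node shapes (illustrative):
#   ("char", c)   ("cat", left, right)   ("alt", left, right)   ("star", child)
TREE = ("cat",
        ("char", "b"),
        ("cat",
         ("star", ("char", "b")),
         ("alt",
          ("char", "b"),
          ("cat", ("star", ("char", "b")), ("char", "a")))))

def to_regex(node):
    """Rebuild the regular expression string from the parse tree."""
    kind = node[0]
    if kind == "char":
        return node[1]
    if kind == "star":
        inner = to_regex(node[1])
        return inner + "*" if node[1][0] == "char" else "(" + inner + ")*"
    if kind == "cat":
        parts = []
        for child in node[1:]:
            text = to_regex(child)
            # a union inside a concatenation needs parentheses
            parts.append("(" + text + ")" if child[0] == "alt" else text)
        return "".join(parts)
    if kind == "alt":
        return to_regex(node[1]) + "|" + to_regex(node[2])
    raise ValueError(kind)

def leaves(node):
    """List the leaf characters from left to right."""
    if node[0] == "char":
        return [node[1]]
    result = []
    for child in node[1:]:
        result.extend(leaves(child))
    return result

print(to_regex(TREE))  # prints bb*(b|b*a)
print(leaves(TREE))    # prints ['b', 'b', 'b', 'b', 'a']
```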

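12 NFA (Thompson automaton)
From the regular expression or from the parse tree we define the automaton, with one case per construct:
- For a character a of Σ
- For a concatenation ℛ1 · ℛ2
- For a union ℛ1 | ℛ2
- For a closure ℛ*
[Figures: the automaton fragment for each case; the fragments are connected with ε-transitions.]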

13 Thompson automaton construction
[Figure: the parse tree of bb*(b|b*a) and the Thompson automaton built from it.]
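The construction can be sketched as a recursive walk over the parse tree. This is a Thompson-style sketch under assumptions, not necessarily the exact rules drawn on the slides: the tuple tags are hypothetical, states are numbered in order of creation, and the operands of the union are listed as (b*a, b) because that order reproduces the state numbering used on the following slides (union is commutative, so the language is unchanged).

```python
# Node shapes (illustrative): ("char", c), ("cat", l, r), ("alt", l, r), ("star", child)
TREE = ("cat",
        ("char", "b"),
        ("cat",
         ("star", ("char", "b")),
         ("alt",
          ("cat", ("star", ("char", "b")), ("char", "a")),
          ("char", "b"))))

def thompson(tree):
    """Return (number_of_states, char_edges, eps_edges) for the parse tree."""
    char_edges = []   # (source, symbol, target)
    eps_edges = []    # (source, target)
    counter = [0]     # state 0 is the initial state

    def new_state():
        counter[0] += 1
        return counter[0]

    def build(node, start):
        """Attach the fragment for node at state start; return its end state."""
        kind = node[0]
        if kind == "char":
            end = new_state()
            char_edges.append((start, node[1], end))
            return end
        if kind == "cat":
            # the end state of the left part is the start state of the right part
            return build(node[2], build(node[1], start))
        if kind == "star":
            inner_start = new_state()
            inner_end = build(node[1], inner_start)
            end = new_state()
            eps_edges.extend([(start, inner_start), (start, end),
                              (inner_end, inner_start), (inner_end, end)])
            return end
        if kind == "alt":
            ends = []
            for child in node[1:]:
                branch_start = new_state()
                eps_edges.append((start, branch_start))
                ends.append(build(child, branch_start))
            end = new_state()
            eps_edges.extend((e, end) for e in ends)
            return end
        raise ValueError(kind)

    final = build(tree, 0)
    return final + 1, char_edges, eps_edges

n_states, char_edges, eps_edges = thompson(TREE)
print(n_states)     # prints 13
print(char_edges)
```

Every character transition goes from a state i to state i + 1, which is the property the bit-parallel simulation relies on when it shifts the state vector.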

14 NFA: ε-closure (ε-equivalent states)
[Figure: Thompson automaton for bb*(b|b*a) with states 0 to 12 and transitions labeled a, b and ε.]
ε-closures:
  1: 1, 2, 4, 5, 6, 8, 10
  3: 2, 3, 4, 5, 6, 8, 10
  4: 4, 5, 6, 8, 10
  5: 5, 6, 8
  7: 6, 7, 8
  9: 9, 12
  11: 11, 12
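The ε-closure of a state is the set of states reachable from it using ε-transitions only. A minimal sketch follows; the ε-edges in EPS are an assumption, inferred so that they are consistent with the closures listed on the slides, since the figure itself is not available.

```python
# Assumed epsilon-transitions of the Thompson automaton for bb*(b|b*a).
EPS = {
    1: [2, 4], 3: [2, 4],      # around the first b*
    4: [5, 10],                # entry of the union
    5: [6, 8], 7: [6, 8],      # around the second b*
    9: [12], 11: [12],         # exit of the union
}

def eps_closure(state):
    """Return the sorted list of states reachable from state via epsilon edges."""
    seen, stack = {state}, [state]
    while stack:
        q = stack.pop()
        for nxt in EPS.get(q, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)

for s in (1, 3, 4, 5, 7, 9, 11):
    print(s, eps_closure(s))
```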

15 Bit-parallel Thompson algorithm
Regular expression: bb*(b|b*a). Text: ababbbaab.
[Figure: Thompson automaton with states 0 to 12; b-transitions 0→1, 2→3, 6→7, 10→11 and a-transition 8→9.]
E (ε-closures):
  1: 1, 2, 4, 5, 6, 8, 10
  3: 2, 3, 4, 5, 6, 8, 10
  4: 4, 5, 6, 8, 10
  5: 5, 6, 8
  7: 6, 7, 8
  9: 9, 12
  11: 11, 12
The masks are:
  B       0 1 2 3 4 5 6 7 8 9 10 11 12
  a       0 0 0 0 0 0 0 0 0 1  0  0  0
  b       0 1 0 1 0 0 0 1 0 0  0  1  0
The bit-vector D marks the active states. At the beginning only state 0 is active:
  D       0 1 2 3 4 5 6 7 8 9 10 11 12
  D0      1 0 0 0 0 0 0 0 0 0  0  0  0
At every step we shift D to the right, apply an AND with the mask of the character just read, and extend the result with the ε-closure of the active states. State 0 is active in every D, so that an occurrence can start at any text position.
Step 1 (read a):
  D0      1 0 0 0 0 0 0 0 0 0  0  0  0
  shift   0 1 0 0 0 0 0 0 0 0  0  0  0
  B[a]    0 0 0 0 0 0 0 0 0 1  0  0  0
  AND     0 0 0 0 0 0 0 0 0 0  0  0  0
  D1      1 0 0 0 0 0 0 0 0 0  0  0  0

16-17 Bit-parallel Thompson algorithm
Text: ababbbaab. Step 2 (read b):
  D       0 1 2 3 4 5 6 7 8 9 10 11 12
  D1      1 0 0 0 0 0 0 0 0 0  0  0  0
  shift   0 1 0 0 0 0 0 0 0 0  0  0  0
  B[b]    0 1 0 1 0 0 0 1 0 0  0  1  0
  AND     0 1 0 0 0 0 0 0 0 0  0  0  0
  D2      1 1 1 0 1 1 1 0 1 0  1  0  0

18-20 Bit-parallel Thompson algorithm
Text: ababbbaab. Step 3 (read a):
  D       0 1 2 3 4 5 6 7 8 9 10 11 12
  D2      1 1 1 0 1 1 1 0 1 0  1  0  0
  shift   0 1 1 1 0 1 1 1 0 1  0  1  0
  B[a]    0 0 0 0 0 0 0 0 0 1  0  0  0
  AND     0 0 0 0 0 0 0 0 0 1  0  0  0
  D3      1 0 0 0 0 0 0 0 0 1  0  0  1

21-23 Bit-parallel Thompson algorithm
Text: ababbbaab. Step 4 (read b):
  D       0 1 2 3 4 5 6 7 8 9 10 11 12
  D3      1 0 0 0 0 0 0 0 0 1  0  0  1
  shift   0 1 0 0 0 0 0 0 0 0  1  0  0
  B[b]    0 1 0 1 0 0 0 1 0 0  0  1  0
  AND     0 1 0 0 0 0 0 0 0 0  0  0  0
  D4      1 1 1 0 1 1 1 0 1 0  1  0  0

24-26 Bit-parallel Thompson algorithm
Text: ababbbaab. Step 5 (read b):
  D       0 1 2 3 4 5 6 7 8 9 10 11 12
  D4      1 1 1 0 1 1 1 0 1 0  1  0  0
  shift   0 1 1 1 0 1 1 1 0 1  0  1  0
  B[b]    0 1 0 1 0 0 0 1 0 0  0  1  0
  AND     0 1 0 1 0 0 0 1 0 0  0  1  0
  D5      1 1 1 1 1 1 1 1 1 0  1  1  1

27-29 Bit-parallel Thompson algorithm
Text: ababbbaab. Step 6 (read b):
  D       0 1 2 3 4 5 6 7 8 9 10 11 12
  D5      1 1 1 1 1 1 1 1 1 0  1  1  1
  shift   0 1 1 1 1 1 1 1 1 1  0  1  1
  B[b]    0 1 0 1 0 0 0 1 0 0  0  1  0
  AND     0 1 0 1 0 0 0 1 0 0  0  1  0
  D6      1 1 1 1 1 1 1 1 1 0  1  1  1
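The whole walkthrough can be reproduced with a short simulation. This is a sketch under assumptions rather than a tuned implementation: states are stored in a Python integer with bit i standing for state i, so the slides' "shift to the right" becomes a left shift on the integer; the ε-closure is applied state by state instead of through a precomputed table; state 12 is taken to be the final state; and the names to_mask, show, closure and search are hypothetical.

```python
N_STATES = 13
FINAL_STATE = 12   # assumed final state of the Thompson automaton

def to_mask(states):
    """Pack state numbers into an integer (bit i = state i)."""
    mask = 0
    for s in states:
        mask |= 1 << s
    return mask

# B[c]: states entered by a transition labeled c (values taken from the slides).
B = {"a": to_mask([9]), "b": to_mask([1, 3, 7, 11])}

# E[s]: epsilon-closure of state s (values taken from the slides).
E = {
    1: to_mask([1, 2, 4, 5, 6, 8, 10]),
    3: to_mask([2, 3, 4, 5, 6, 8, 10]),
    4: to_mask([4, 5, 6, 8, 10]),
    5: to_mask([5, 6, 8]),
    7: to_mask([6, 7, 8]),
    9: to_mask([9, 12]),
    11: to_mask([11, 12]),
}

def show(d):
    """Render D as on the slides: state 0 first, state 12 last."""
    return " ".join(str((d >> i) & 1) for i in range(N_STATES))

def closure(d):
    """Extend D with the epsilon-closure of every active state."""
    out = d
    for state, mask in E.items():
        if (d >> state) & 1:
            out |= mask
    return out

def search(text):
    """Return the rows D0, D1, ... and the 1-based end positions of occurrences."""
    d = 1                                   # D0: only state 0 is active
    rows, ends = [show(d)], []
    for pos, ch in enumerate(text, start=1):
        d = (d << 1) & B.get(ch, 0)         # shift, then AND with the mask
        d = closure(d) | 1                  # epsilon-closure; state 0 stays active
        rows.append(show(d))
        if (d >> FINAL_STATE) & 1:
            ends.append(pos)
    return rows, ends

rows, ends = search("ababbbaab")
for step, row in enumerate(rows):
    print(f"D{step}: {row}")
print("occurrences end at positions:", ends)  # prints [3, 5, 6, 7]
```

The rows D0 to D6 printed by this sketch agree with the steps worked out on the slides, and the reported end positions are those of the steps in which state 12 becomes active.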

