Hokkaido University, Lecture on Information Knowledge Network (2010/11/10): "Information retrieval and pattern matching"


Lecture on Information Knowledge Network "Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
2010/11/10

The 1st. Preliminary: terms and definitions
–What is the pattern matching problem?
–Basic terms of text algorithms
–About finite automaton
–Difference with text retrieval using index data structure

What is the pattern matching problem?
The problem of finding occurrences of pattern P in text T.
Well-known algorithms:
–KMP method (Knuth & Morris & Pratt 1974)
–BM method (Boyer & Moore 1977)
–Karp-Rabin method (Karp & Rabin 1987)
Example:
Pattern P: compress
Text T: We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed text presented by Amir, Benson and Farach [Amir94].

Existence problem and all-occurrences problem
Text T: tekumakumayakontekumakumayakon
Pattern P: kumakuma
–Existence problem: Yes!
–All-occurrences problem: positions 3 and 18
Although solving the existence problem is enough when we retrieve a document from a set of documents, solving the pattern matching problem often means solving the all-occurrences problem.
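The two problem variants can be sketched with a naive scan. This is a minimal illustration, not one of the algorithms named in the lecture; the function names `exists` and `find_all` are hypothetical, and positions are reported 1-based to match the slide.

```python
def exists(text: str, pattern: str) -> bool:
    """Existence problem: does pattern occur anywhere in text?"""
    return pattern in text


def find_all(text: str, pattern: str) -> list[int]:
    """All-occurrences problem: 1-based start position of every occurrence.

    Naive scan: compare the pattern against every window of the text,
    which takes O(n * m) time in the worst case.
    """
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            positions.append(i + 1)  # convert 0-based index to 1-based position
    return positions


T = "tekumakumayakontekumakumayakon"
P = "kumakuma"
print(exists(T, P))    # True
print(find_all(T, P))  # [3, 18]
```

The existence check can stop at the first hit, while the all-occurrences version always has to look at the whole text.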

Basic terms (computational complexity)
To clarify the quality of an algorithm, we have to judge its computational complexity.
–How much time and memory space do we need to process input data of length n?
Big-O notation
–Definition: Let f and g be functions from integers to integers. If there exist constants C and N such that f(n) ≤ C·g(n) for all n > N, then we write f(n) = O(g(n)). (f is said to be of order g.)
–If f and g are of the same order, that is, f(n) = O(g(n)) and g(n) = O(f(n)), then we write f = Θ(g).
–Let T(n) be the calculation time for an input of length n, and assume that T(n) = O(g(n)). This means that "asymptotically, it takes at most time proportional to g(n)."
Example: O(n), O(n log n)
O notation indicates the upper bound of asymptotic complexity. (There is also Ω notation, which indicates the lower bound.)
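As a small worked instance of the definition, read as "f(n) ≤ C·g(n) for all n > N", the following checks that 3n + 5 = O(n). The function and the constants C = 4, N = 5 are chosen here for illustration and do not come from the lecture.

```latex
\[
f(n) = 3n + 5,\qquad g(n) = n,\qquad C = 4,\qquad N = 5
\]
\[
n > 5 \;\Longrightarrow\; 3n + 5 < 3n + n = 4n = C\,g(n)
\;\Longrightarrow\; f(n) = O(n)
\]
```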

Basic terms (alphabet and string)
Definition of terms
–Σ: a non-empty finite set of characters. (The set is called the alphabet.) A character a ∈ Σ is also called a letter or a symbol.
Example: Σ={a,b,c,…}, Σ={0,1}, Σ={0x00, 0x01, 0x02,…, 0xFF}
–x ∈ Σ*: a string, a word, or a text.
|x|: length of string x. Example: |aba|=3.
ε: the string whose length is 0, called the empty string. That is, |ε|=0.
–x[i]: the character at the i-th position of string x.
–x[i..j]: the consecutive sequence of characters from position i to position j of string x. It is called a factor, substring, or subword of x. We assume x[i..j]=ε for i>j for convenience.
x[1..i] is called a prefix of x.
x[i..|x|] is called a suffix of x.
–For strings x and y, we say x is a subsequence of y if x can be obtained by removing 0 or more characters from y. Note the difference from a factor!
Example: x=abba is a subsequence of y=aaababaab.
–For x = a1a2…ak ∈ Σ*, we denote the reversed string akak-1…a1 by x^R.
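The difference between a factor and a subsequence can be checked directly on the slide's example. This is a minimal sketch; the helper names `is_factor`, `is_subsequence`, and `reverse` are hypothetical.

```python
def is_factor(x: str, y: str) -> bool:
    """x is a factor (substring) of y: its characters appear consecutively in y."""
    return x in y


def is_subsequence(x: str, y: str) -> bool:
    """x is a subsequence of y: x is obtained by deleting 0 or more characters of y."""
    remaining = iter(y)
    # 'ch in remaining' advances the iterator past the first match,
    # so the characters of x must be found in order, left to right.
    return all(ch in remaining for ch in x)


def reverse(x: str) -> str:
    """The reversed string x^R."""
    return x[::-1]


x, y = "abba", "aaababaab"
print(is_subsequence(x, y))  # True
print(is_factor(x, y))       # False
print(reverse("abc"))        # cba
```

Every factor is a subsequence, but as the example shows, the converse does not hold.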

Example of prefix, factor, and suffix
w = cocoa
–Prefixes: c, co, coc, coco
–Suffixes: a, oa, coa, ocoa
–Both prefix and suffix: ε, cocoa
–Other factors: o, oc, oco
Exercise 1: Enumerate all the prefixes of w=ROYCE.
Exercise 2: Enumerate all the factors of w=ABABC.
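The sets in the cocoa example can be enumerated mechanically. This is a minimal sketch with hypothetical helper names; the empty string ε is represented by `""`.

```python
def prefixes(w: str) -> list[str]:
    """All prefixes w[1..i], including the empty string and w itself."""
    return [w[:i] for i in range(len(w) + 1)]


def suffixes(w: str) -> list[str]:
    """All suffixes w[i..|w|], including w itself and the empty string."""
    return [w[i:] for i in range(len(w) + 1)]


def factors(w: str) -> set[str]:
    """All factors (substrings) w[i..j]; a set, since a factor may occur twice."""
    n = len(w)
    return {w[i:j] for i in range(n + 1) for j in range(i, n + 1)}


w = "cocoa"
print(prefixes(w))  # ['', 'c', 'co', 'coc', 'coco', 'cocoa']
print(suffixes(w))  # ['cocoa', 'ocoa', 'coa', 'oa', 'a', '']
# Factors that are neither a prefix nor a suffix:
print(sorted(factors(w) - set(prefixes(w)) - set(suffixes(w))))  # ['o', 'oc', 'oco']
```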

Finite automaton
What is a finite automaton?
–An automatic machine (automaton) whose output is determined by an input and its internal states.
–The number of states is finite.
–There are several variations according to the definition of its state transitions and output, and the use of auxiliary space. For example, nondeterministic automaton, sequential machine, pushdown automaton, and so on.
–It has the ability to define (computer) languages; it is often used for lexical analysis of strings.
It is one of the most fundamental concepts and appears in all fields of computer science! It is also used for software design (state transition diagrams of UML)! Of course, it is deeply related to the pattern matching problem!
References:
"Automaton and computability," S. Arikawa and S. Miyano, BAIFU-KAN (written in Japanese)
"Formal language and automaton," E. Moriya, SAIENSU-SHA (written in Japanese)

Definition of deterministic finite automaton
(Deterministic) finite automaton M = (K, Σ, δ, q0, F)
–K = {q0, q1, …, qn}: a set of states,
–Σ = {a, b, …, c}: a set of characters (alphabet),
–q0: the initial state,
–δ: the transition function K×Σ → K,
–F: a set of accept states (a subset of K).
About the transition function
–We extend it as follows:
δ(q, ε) = q (q ∈ K)
δ(q, ax) = δ(δ(q, a), x) (q ∈ K, a ∈ Σ, x ∈ Σ*)
–For string w, δ(q0, w) indicates the state reached when w is input.
–We say that M accepts w if and only if δ(q0, w) = p ∈ F.
–The set L(M) = {w | δ(q0, w) ∈ F} of all strings that M accepts is called the language accepted by finite automaton M.
[Figure: a head with internal state q reading the input a1 a2 … ai …]

Example of deterministic finite automaton
[Figure: state transition diagram with states q0, q1, q2 and edges labeled a, b, b, a, and a,b; the head moves over the input aba]
State transition:
δ(q0, aba) = δ(δ(q0, a), ba) = δ(q1, ba) = δ(δ(q1, b), a) = δ(q0, a) = q1 ∈ F
L(M) = {a(ba)^n | n ≥ 0}
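The example automaton can be simulated with a transition table. This is a sketch under assumptions: δ(q0, a) = q1 and δ(q1, b) = q0 follow from the trace above, F = {q1} follows from "q1 ∈ F", but treating q2 as a dead state that absorbs every other transition is a reading of the flattened diagram labels, chosen so that the accepted language is a(ba)^n.

```python
# Transition function delta: (state, character) -> state.
DELTA = {
    ("q0", "a"): "q1",  # from the trace on the slide
    ("q1", "b"): "q0",  # from the trace on the slide
    ("q0", "b"): "q2",  # assumption: q2 is a dead state
    ("q1", "a"): "q2",  # assumption
    ("q2", "a"): "q2",  # assumption
    ("q2", "b"): "q2",  # assumption
}
START = "q0"
ACCEPT = {"q1"}


def run(state: str, w: str) -> str:
    """Extended transition function: delta(q, ax) = delta(delta(q, a), x)."""
    for ch in w:
        state = DELTA[(state, ch)]
    return state


def accepts(w: str) -> bool:
    """M accepts w if and only if delta(q0, w) is an accept state."""
    return run(START, w) in ACCEPT


print(run(START, "aba"))  # q1
print(accepts("aba"))     # True
print(accepts("ab"))      # False
```

Because the automaton is deterministic, the simulation keeps a single current state and takes one table lookup per input character.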

Definition of nondeterministic finite automaton
Nondeterministic finite automaton M = (K, Σ, δ, Q0, F)
–K = {q0, q1, …, qn}: a set of states,
–Σ = {a, b, …, c}: a set of characters (alphabet),
–Q0 ⊆ K: a set of initial states,
–δ: the transition function K×Σ → 2^K,
–F: a set of accept states (a subset of K).
In other words, the destination of each transition is not unique! There exist several current states!
About the transition function:
–We extend the domain from K×Σ to K×Σ* as follows:
δ(q, ε) = {q}
δ(q, ax) = ⋃_{p ∈ δ(q, a)} δ(p, x) (q ∈ K, a ∈ Σ, x ∈ Σ*)
–Moreover, we extend it to 2^K×Σ*:
δ(S, x) = ⋃_{q ∈ S} δ(q, x)
–For x ∈ Σ*, we say that M accepts x when δ(Q0, x) ∩ F ≠ ∅.

Example of nondeterministic automaton
[Figure: state diagram of a nondeterministic finite automaton with states q0, q1, q2 and edges labeled a,b, b, and b]
For abb, for example,
δ(q0, abb) = δ(q0, bb) = δ(q0, b) ∪ δ(q1, b) = {q0} ∪ {q1} ∪ {q2} = {q0, q1, q2}.
Then, abb ∈ L(M) since q2 ∈ F.
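A nondeterministic automaton can be simulated by tracking the set of current states. This is a sketch under assumptions: the table below (q0 loops on a and b, q0 goes to q1 on b, q1 goes to q2 on b, with Q0 = {q0} and F = {q2}) is read off the flattened diagram labels and the trace above, and is not stated explicitly in the transcript.

```python
# Transition function delta: (state, character) -> set of states.
# Missing entries mean "no transition" (the empty set).
DELTA = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q0", "q1"},
    ("q1", "b"): {"q2"},
}
INITIAL = {"q0"}
ACCEPT = {"q2"}


def step(states: set[str], ch: str) -> set[str]:
    """delta(S, a): union of delta(q, a) over every current state q in S."""
    result: set[str] = set()
    for q in states:
        result |= DELTA.get((q, ch), set())
    return result


def run(states: set[str], w: str) -> set[str]:
    """Extended transition function on sets of states."""
    for ch in w:
        states = step(states, ch)
    return states


def accepts(w: str) -> bool:
    """M accepts w when delta(Q0, w) contains at least one accept state."""
    return bool(run(INITIAL, w) & ACCEPT)


print(sorted(run(INITIAL, "abb")))  # ['q0', 'q1', 'q2']
print(accepts("abb"))               # True
print(accepts("ab"))                # False
```

Under these assumed transitions the automaton accepts exactly the strings over {a, b} that end in bb.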

Sequential machine
–Finite automaton: atcgaatccg... → Yes or No
–Sequential machine: atcgaatccg... → output string
What is a sequential machine? A kind of translator!

Conceptual diagram of sequential machine
[Figure: a head with internal state q reads the input a1 a2 … ai … and writes the output b1 b2 … bi …]

Example of sequential machine
[Figure: state transition diagram with states q0 to q5; edges are labeled input/output: 0/0, 1/1, 0/0, 1/0, 1/1]
λ(q0, 011) = λ(q0, 0)λ(δ(q0, 0), 11) = 0λ(q4, 11) = 0λ(q4, 1)λ(δ(q4, 1), 1) = 01λ(q5, 1) = 010
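The output function can be sketched as a table-driven translator. Only the three transitions used in the trace above are recoverable from the transcript, so the table is deliberately partial: the rest of the six-state machine is not reconstructed, and the next state after reading 1 in q5 is left as `None` because the slide does not show it. The name `translate` is hypothetical.

```python
# (state, input character) -> (next state, output character)
# Partial table: only the transitions that appear in the slide's trace.
TABLE = {
    ("q0", "0"): ("q4", "0"),
    ("q4", "1"): ("q5", "1"),
    ("q5", "1"): (None, "0"),  # next state not recoverable from the slide
}


def translate(state, w: str) -> str:
    """Extended output function: lambda(q, ax) = lambda(q, a) lambda(delta(q, a), x)."""
    output = []
    for ch in w:
        state, out = TABLE[(state, ch)]  # KeyError for transitions not in the table
        output.append(out)
    return "".join(output)


print(translate("q0", "011"))  # 010
```

Unlike a finite automaton, which answers only yes or no at the end, the machine emits one output character per input character.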

Difference with text retrieval using index data structure
Text retrieval by doing pattern matching: O(n)
–Merit: no extra data structure; flexible about data updating
–Weak point: slow; low scalability
–For small-scale groups of documents (example: grep of UNIX)
Text retrieval by using index data structure: O(m log n)
–Merit: very fast; high scalability
–Weak point: the index has to be constructed; little flexibility for updating; space for the index is necessary
–For large-scale DB (example: Namazu, sufary, mg, Google)
There exists a large-scale full-text search DB system based on pattern matching!
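The index-based side of the comparison can be illustrated with a suffix array, one index structure for which a lookup costs O(m log n). The slide does not say which index it has in mind, so the choice of a suffix array, the simple sort-based construction, and the function names are all assumptions made for this sketch.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start indices of all suffixes in lexicographic order.

    Simple construction by sorting; it is slow and is only meant to show
    that the index must be built (and stored) before any query.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def search_indexed(text: str, sa: list[int], pattern: str) -> list[int]:
    """1-based positions of pattern, found by two binary searches over sa.

    Each comparison looks at no more than m characters and there are
    O(log n) comparisons, giving O(m log n) before reporting the hits.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(i + 1 for i in sa[first:lo])


T = "tekumakumayakontekumakumayakon"
SA = build_suffix_array(T)  # built once, reused for every query
print(search_indexed(T, SA, "kumakuma"))  # [3, 18]
```

The trade-off on the slide shows up directly: queries are fast, but the array takes extra space and has to be rebuilt when the text changes.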

Fujitsu Interstage Shunsaku Data Manager
Feature:
–Regarding a text file in XML form as a DB, it achieves high-speed data access by using fast serial search (pattern matching). NO index! The composition of the data format can be flexibly changed.
–The core part is based on the search engine "SIGMA", which was developed by the research group of Prof. Setsuo Arikawa (currently a vice-president of Kyushu University).
Adoption cases:
–National Institute of Genetics, Research Organization of Information and Systems, DDBJ center: ARSA (All-round Retrieval of Sequence and Annotation) system of DDBJ (the Japanese DNA data bank), which is one of the three major international DNA data banks.
–Fujitsu in-house: production management system, and new electronic telephone book system
The theoretical basis is the Aho-Corasick algorithm.

Summary (the 1st)
What is the pattern matching problem?
–The problem of finding the occurrences of pattern P included in text T.
–There are the existence problem and the all-occurrences problem.
Basic terms of text algorithms
–Notation of computational complexity: Big-O notation
–Alphabet, string, prefix, factor, and suffix
Finite automaton
–Deterministic finite automaton: it can define computer languages.
–Nondeterministic finite automaton: there can be several current states and several destinations for each transition.
–Sequential machine: it produces an output for each input character.
Difference with text retrieval using index data structure
–Although text retrieval by pattern matching is usually considered slower than retrieval using an index data structure, the former has some good aspects!