Regular Expression: Syntax for Specifying String Patterns
Jing-Shin Chang


1. Regular Expression: Syntax for Specifying String Patterns

Basic alphabet
  - ε: the empty string
  - a: any symbol a in the input symbol set
Basic operators
  - disjunction (OR, union): s | t
  - concatenation (AND): s t
  - closure (repetition): s*
Extended operators: ?, +, [a-z], {m,n}, escape, meta-symbols, registers
Chomsky hierarchy
  - regular set (R.E.)
  - context-free
  - context-sensitive
  - recursively enumerable (Turing machine)
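The operators listed above can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the example patterns and the helper name `accepts` are illustrative assumptions, not part of the slides. Note that registers (back-references) are an extension that goes beyond regular sets.

```python
import re

# Basic operators from the slide, written in Python's `re` syntax.
union = re.compile(r"cat|dog")        # disjunction: s | t
concat = re.compile(r"ab")            # concatenation: s t
closure = re.compile(r"(ab)*")        # closure: s*, zero or more repetitions

# Extended operators: mostly shorthand for combinations of the basic ones.
optional = re.compile(r"colou?r")     # ?     : zero or one occurrence
one_or_more = re.compile(r"a+")       # +     : one or more, same as aa*
char_class = re.compile(r"[a-z]")     # [a-z] : a | b | ... | z
bounded = re.compile(r"[0-9]{2,4}")   # {m,n} : between m and n repetitions
escaped = re.compile(r"a\.b")         # escape: a literal dot, not "any symbol"
register = re.compile(r"([a-z])\1")   # register (back-reference), not regular


def accepts(pattern, text):
    """Hypothetical helper: True if the whole of `text` is in the pattern's language."""
    return pattern.fullmatch(text) is not None


print(accepts(closure, "ababab"), accepts(closure, "aba"))
```

`fullmatch` is used so that the test is membership in the language of the expression, not a search for a matching substring.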

2. Regular Expression: Syntax for Specifying String Patterns

Applications:
  - wildcard characters (shell commands, filename expansion)
  - string pattern matching (grep, awk)
  - search engine (keyword matching, fuzzy match)
  - string pattern editing/processing (sed, vi, tr)
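As a rough illustration of these applications, the sketch below imitates wildcard expansion, grep-style line selection, sed-style substitution, and tr-style translation with Python's standard library. The sample file names, log lines, and patterns are made up for the example.

```python
import fnmatch
import re

# Wildcard matching, as in shell filename expansion of *.txt
names = ["notes.txt", "image.png", "todo.txt"]
text_files = fnmatch.filter(names, "*.txt")

# grep-like selection: keep the lines that match the pattern
lines = ["error: disk full", "ok", "error: timeout"]
errors = [line for line in lines if re.search(r"^error:", line)]

# sed-like substitution, roughly s/[0-9]+/N/g
masked = re.sub(r"[0-9]+", "N", "port 8080, pid 42")

# tr-like character translation, roughly tr 'abc' 'xyz'
translated = "aabbcc".translate(str.maketrans("abc", "xyz"))

print(text_files, errors, masked, translated)
```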

3. Recognition of Regular Expression

Finite (State) Automata
Definition:
  - a set of states: S
  - a set of input symbols: Σ (the input symbol alphabet)
  - a transition (move) function: δ(s, a) = s'
  - an initial (start) state: s0
  - a set of final (accepting) states: F
Implementation: state transition table
Deterministic (DFA): a single transition for every state on every input symbol
Non-deterministic (NFA): more than one transition for at least one state on some input symbol
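One way to picture the definition is as a small record holding the five components, with the transition function stored as a table. This is a sketch only: the class name `DFA`, the state names, and the example language (strings over {a, b} ending in "ab") are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: frozenset      # S
    alphabet: frozenset    # input symbol alphabet
    delta: dict            # transition table: (state, symbol) -> state
    start: str             # s0
    finals: frozenset      # F

    def is_total(self):
        """True if exactly one transition exists for every state on every symbol."""
        return all((s, a) in self.delta for s in self.states for a in self.alphabet)


# Hypothetical example: accepts (a|b)*ab
ends_in_ab = DFA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({"a", "b"}),
    delta={
        ("q0", "a"): "q1", ("q0", "b"): "q0",
        ("q1", "a"): "q1", ("q1", "b"): "q2",
        ("q2", "a"): "q1", ("q2", "b"): "q0",
    },
    start="q0",
    finals=frozenset({"q2"}),
)

# An NFA for the same language: the table maps to a *set* of next states,
# and state p0 has two possible moves on "a".
nfa_delta = {
    ("p0", "a"): {"p0", "p1"},
    ("p0", "b"): {"p0"},
    ("p1", "b"): {"p2"},
}

print(ends_in_ab.is_total())
```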

4. Recognition of Regular Expression

Simulating a Deterministic Finite Automaton (DFA)
    initialization: current_state = s0; input_symbol = first symbol
    while (current_state not in final_states && input_symbol != EOF)
        current_state = δ(current_state, input_symbol)
        input_symbol = next_input_symbol
    accept if current_state is in final_states

Simulating a Non-deterministic Finite Automaton (NFA)
  - Backtrack/backup: when alternative choices are possible, remember the next alternative configuration (current input and next alternative state)
  - Parallelism: trace every possible alternative in parallel
  - Look-ahead: look at more input symbols to make the choice deterministic
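The two simulation styles can be sketched as follows. The function names, the example tables for (a|b)*ab, and the choice to consume the whole input before testing for acceptance (the slide's loop stops as soon as a final state is reached) are assumptions of this sketch. The NFA version uses the parallelism strategy and assumes an NFA without epsilon-moves.

```python
def run_dfa(delta, start, finals, text):
    """Return True if the DFA accepts the whole of `text`."""
    state = start
    for symbol in text:
        state = delta.get((state, symbol))
        if state is None:          # no transition defined: reject
            return False
    return state in finals


def run_nfa(delta, start, finals, text):
    """Parallel simulation: track every state the NFA could be in."""
    current = {start}
    for symbol in text:
        current = {t for s in current for t in delta.get((s, symbol), ())}
        if not current:            # every alternative has died out
            return False
    return bool(current & finals)


# Hypothetical tables, both for (a|b)*ab
DFA_DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
NFA_DELTA = {
    ("p0", "a"): {"p0", "p1"},
    ("p0", "b"): {"p0"},
    ("p1", "b"): {"p2"},
}

print(run_dfa(DFA_DELTA, "q0", {"q2"}, "aab"), run_nfa(NFA_DELTA, "p0", {"p2"}, "aab"))
```

Tracking a set of states costs more per input symbol than a single table lookup, but it never needs to back up in the input.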

5. Constructing Automata from R.E.

R.E. => NFA (Thompson's construction) => DFA => state minimization
  - decompose the R.E. into basic alphabet symbols and operators
  - construct an FA for each basic alphabet symbol
  - merge the FAs according to each operator
R.E. => DFA directly: a state transition corresponds to a position transition in the pattern
  - annotate the RE symbols with position labels
  - get the syntax tree of the annotated pattern
  - compute {nullable, firstpos, lastpos}
  - compute follow(i)
  - s0 = firstpos(root)
  - construct the transition function according to follow(i)
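The NFA => DFA step of the first pipeline is commonly done with the subset construction, sketched below. This covers only that one step: Thompson's construction, state minimization, and the position-based method are not shown. The table layout, the `EPS` label, and the hand-written NFA for a*b are assumptions of the example.

```python
from collections import deque

EPS = None  # label used in this sketch for epsilon-moves


def epsilon_closure(delta, states):
    """All NFA states reachable from `states` using epsilon-moves only."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(delta, start, finals, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    d_start = epsilon_closure(delta, {start})
    d_delta, d_finals = {}, set()
    seen, queue = {d_start}, deque([d_start])
    while queue:
        current = queue.popleft()
        if current & finals:               # contains an NFA final state
            d_finals.add(current)
        for a in alphabet:
            moved = {t for s in current for t in delta.get((s, a), ())}
            if not moved:
                continue                   # no transition on this symbol
            target = epsilon_closure(delta, moved)
            d_delta[(current, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_delta, d_start, d_finals


# Hypothetical NFA for a*b, with epsilon-moves around the loop on "a"
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, EPS): {1, 3},
    (3, "b"): {4},
}
d_delta, d_start, d_finals = subset_construction(NFA, 0, {4}, "ab")
print(sorted(d_start), len(d_delta))
```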

6. R.E. and Pattern Matching

Naïve pattern matching
  - Specify the pattern with a regular expression (one R.E. for each keyword).
  - Construct an FA for each such R.E. and conduct left-to-right matching:

    DFA = State_Transition_Table = Construct_DFA(R.E.)
    while (input_pointer != EOF)
        stop_state = recognize(input_pointer, DFA)
        if fail (stop_state not in final_states):
            move the input pointer forward by one character
        if success (stop_state in final_states):
            output the matching status and skip over the matched pattern

Why is it slow?
  - Multiple keywords are matched one at a time, so the input is scanned once per keyword.
  - On failure, the input pointer moves back to the character after the start of the last matching attempt and the automaton is reset to its initial state, even though some repeated pattern might appear in the recently matched partial string.
  - In most applications failure is far more probable than success (only a few positions match), so most of the time the next matching session starts just one character past the starting position of the previous one.
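The naïve loop above might look like this in runnable form. It is a sketch under simplifying assumptions: each keyword is a plain non-empty string (so its DFA is just a chain of states), and `build_keyword_dfa`, `recognize`, and `naive_match` are hypothetical names.

```python
def build_keyword_dfa(keyword):
    """Transition table accepting exactly `keyword`: state i means i symbols matched."""
    delta = {(i, ch): i + 1 for i, ch in enumerate(keyword)}
    return delta, 0, {len(keyword)}


def recognize(text, pos, dfa):
    """Run the DFA from text[pos]; return the end index of the match, or -1."""
    delta, state, finals = dfa
    i = pos
    while state not in finals and i < len(text):
        state = delta.get((state, text[i]))
        if state is None:
            return -1
        i += 1
    return i if state in finals else -1


def naive_match(text, keywords):
    """Return (start, keyword) pairs found by restarting at every position."""
    dfas = {k: build_keyword_dfa(k) for k in keywords if k}   # skip empty keywords
    matches = []
    pos = 0
    while pos < len(text):
        for keyword, dfa in dfas.items():     # one matching attempt per keyword
            end = recognize(text, pos, dfa)
            if end != -1:
                matches.append((pos, keyword))
                pos = end                     # skip over the matched pattern
                break
        else:
            pos += 1                          # failure: restart one character later
    return matches


print(naive_match("he she his", ["he", "she", "his"]))
```

Because a successful match skips over the matched text, the "he" inside "she" is not reported, and every failed attempt throws away whatever prefix had already been matched.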

7. R.E. and Pattern Matching

RE vs. pattern matching
  - R.E. (FA): recognizes one of a set of keywords/patterns as the input string
      - says yes if the input string is in Lang(R.E.) (the regular language for the expression)
  - Pattern matching (PM): recognizes the occurrence of any keyword/pattern specified by a regular expression within a text document
      - specify the patterns/keywords with an RE
      - output all occurrences, in addition to saying yes/no
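The distinction corresponds roughly to two different calls in Python's `re` module: `fullmatch` answers the yes/no membership question, while `finditer` reports occurrences inside a longer text. The keyword set and function names below are assumptions for illustration.

```python
import re

keywords = re.compile(r"he|she|his")   # RE = K1 | K2 | K3


def recognizes(text):
    """Recognition: is the whole input string in Lang(RE)?"""
    return keywords.fullmatch(text) is not None


def occurrences(text):
    """Pattern matching: list (start, keyword) for non-overlapping occurrences."""
    return [(m.start(), m.group()) for m in keywords.finditer(text)]


print(recognizes("she"), recognizes("she sells"))
print(occurrences("she sells his hat"))
```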

8. R.E. and Pattern Matching

Formal method for pattern matching (PM)
  - Constructing an FA for (single- or multi-keyword) PM is equivalent to constructing an FA that recognizes the regular expression PM = (.* | RE)*, and outputting a keyword upon visiting a final state of the original FA for recognizing RE.
      - RE = K1 | K2 | K3 | … | Kn: the regular expression for all specified keywords
      - . : any character not in K1 ~ Kn
      - .* : unspecified patterns (or unknown keywords)
  - Constructing FA1 for recognizing RE = K1 | K2 | … | Kn
      - equivalent to merging the prefixes of the keywords to avoid redundant forward matching
      - => TRIE lexicon tree = a DFA for RE
  - Constructing FA2 for recognizing PM = (.* | RE)*
      - extend FA1 by (a) including unknown keywords and (b) introducing epsilon-moves from the original final states to the original initial state
      - on matching failure, redundant backward matching can be avoided if a sub-string preceding the current input pointer is the prefix of another keyword
      - failure function: the state (in the TRIE) to back off to on failure (!= initial state if the above-mentioned sub-string exists and is non-null)
      - epsilon-moves and the failure function make FA2 an NFA, whose DFA counterpart can be simulated by backtracking
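A TRIE with a failure function can be sketched in the style of the Aho-Corasick automaton. This is one common realization of the idea, not necessarily the exact construction the slide has in mind. The list-of-dicts layout, the function names, and the assumption that all keywords are non-empty are choices made for the example.

```python
from collections import deque


def build_automaton(keywords):
    """Build the TRIE (goto), the failure function, and the outputs per state."""
    goto = [{}]       # goto[state][symbol] -> state; state 0 is the TRIE root
    output = [[]]     # keywords reported on entering each state
    for word in keywords:
        state = 0
        for ch in word:                      # shared prefixes share states
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    fail = [0] * len(goto)                   # depth-1 states back off to the root
    queue = deque(goto[0].values())
    while queue:                             # breadth-first over the TRIE
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, keywords):
    """Return (start, keyword) for every occurrence, scanning the text once."""
    goto, fail, output = build_automaton(keywords)
    state, found = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]              # back off instead of rewinding input
        state = goto[state].get(ch, 0)
        for word in output[state]:
            found.append((i - len(word) + 1, word))
    return found


print(search("ushers", ["he", "she", "his", "hers"]))
```

The input pointer only moves forward: on a mismatch the automaton falls back to the state for the longest keyword prefix that is also a suffix of what was just read.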

9. R.E. and Fast Methods for Pattern Matching

Fast single-keyword matching [KMP: Knuth, Morris & Pratt 1977]
  - Reference: [Aho et al. 1986, Ex ]
  - keyword => state_transition_table
  - reduce repeated matching suggested by the keyword pattern
  - failure function: where to back off to on failure
Fast multiple-keyword matching [AC, Cherry 1982]
  - Reference: [Aho, Ex ]
  - keywords => TRIE (state_transition_table)
  - reduce repeated matching suggested by the TRIE of the keywords
  - TRIE failure function
Boyer & Moore [1977]
Harrison [1971]: hashing method
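For the single-keyword case, the failure function and the search loop of the Knuth-Morris-Pratt method can be sketched as below. The function names and the 0-based indexing convention (`fail[i]` is the length of the longest proper prefix of the keyword that is also a suffix of `keyword[:i + 1]`) are assumptions of this sketch.

```python
def failure_function(keyword):
    """fail[i]: length of the longest proper prefix that is also a suffix of keyword[:i+1]."""
    fail = [0] * len(keyword)
    k = 0                                   # length of the current matched prefix
    for i in range(1, len(keyword)):
        while k and keyword[i] != keyword[k]:
            k = fail[k - 1]                 # back off to a shorter prefix
        if keyword[i] == keyword[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, keyword):
    """Return the start index of every occurrence of `keyword` in `text`."""
    if not keyword:
        return []
    fail = failure_function(keyword)
    matches = []
    k = 0                                   # number of keyword symbols matched so far
    for i, ch in enumerate(text):
        while k and ch != keyword[k]:
            k = fail[k - 1]                 # where to back off on failure
        if ch == keyword[k]:
            k += 1
        if k == len(keyword):
            matches.append(i - k + 1)
            k = fail[k - 1]                 # keep going to find overlapping matches
    return matches


print(failure_function("ababaa"))
print(kmp_search("abababa", "aba"))
```

The text index `i` never decreases, so the search is linear in the length of the text once the failure function has been built.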
