Presentation is loading. Please wait.

Presentation is loading. Please wait.

CSC NLP - Regex, Finite State Automata

Similar presentations


Presentation on theme: "CSC NLP - Regex, Finite State Automata"— Presentation transcript:

1 CSC 9010- NLP - Regex, Finite State Automata
CSC 9010 Natural Language Processing Lecture 2: Regular Expressions, Finite State Automata Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted from Jim Martin’s course: 12/3/2018 CSC NLP - Regex, Finite State Automata

2 Regular Expressions and Text Searching
Everybody does it Emacs, vi, perl, grep, etc.. 12/3/2018 CSC NLP - Regex, Finite State Automata

3 CSC 9010- NLP - Regex, Finite State Automata
Example Find me all instances of the word “the” in a text. /the/ /[tT]he/ /\b[tT]he\b/ 12/3/2018 CSC NLP - Regex, Finite State Automata

4 CSC 9010- NLP - Regex, Finite State Automata
Two kinds of Errors Matching strings that we should not have matched (there, then, other) False positives Not matching things that we should have matched (The) False negatives 12/3/2018 CSC NLP - Regex, Finite State Automata

5 Two Antagonistic Goals
Accuracy (minimize false positives) Coverage (minimize false negatives). 12/3/2018 CSC NLP - Regex, Finite State Automata

6 CSC 9010- NLP - Regex, Finite State Automata
Idealized machines for processing regular expressions Example: /baa+!/ 12/3/2018 CSC NLP - Regex, Finite State Automata

7 CSC 9010- NLP - Regex, Finite State Automata
Idealized machines for processing regular expressions Example: /baa+!/ 5 states 5 transitions alphabet? initial state accept state 12/3/2018 CSC NLP - Regex, Finite State Automata

8 CSC 9010- NLP - Regex, Finite State Automata
More examples: 12/3/2018 CSC NLP - Regex, Finite State Automata

9 Another FSA for the same language:
12/3/2018 CSC NLP - Regex, Finite State Automata

10 Formally Specifying a FSA
The set of states: Q A finite alphabet: Σ A start state A set of accept/final states A transition function that maps QxΣ to Q discuss alphabets = not too narrow! do example: STATE TRANSITION TABLE input State b a ! 12/3/2018 CSC NLP - Regex, Finite State Automata

11 CSC 9010- NLP - Regex, Finite State Automata
Dollars and Cents 12/3/2018 CSC NLP - Regex, Finite State Automata

12 CSC 9010- NLP - Regex, Finite State Automata
Recognition Recognition is the process of determining if a string should be accepted by a machine Or… it’s the process of determining if as string is in the language we’re defining with the machine Or… it’s the process of determining if a regular expression matches a string 12/3/2018 CSC NLP - Regex, Finite State Automata

13 Turing’s way of Visualizing Recognition
12/3/2018 CSC NLP - Regex, Finite State Automata

14 CSC 9010- NLP - Regex, Finite State Automata
Recognition Begin in the start state Examine current input Consult the table Go to a new state and update the tape pointer. When you run out of tape: if in accepting state, accept input else reject input 12/3/2018 CSC NLP - Regex, Finite State Automata

15 CSC 9010- NLP - Regex, Finite State Automata
D-Recognize 12/3/2018 CSC NLP - Regex, Finite State Automata

16 CSC 9010- NLP - Regex, Finite State Automata
Key Points Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter The algorithm is universal for all unambiguous languages. To change the machine, you change the table. 12/3/2018 CSC NLP - Regex, Finite State Automata

17 CSC 9010- NLP - Regex, Finite State Automata
Key Points Crudely therefore… matching strings with regular expressions is a matter of translating the expression into a machine (table) and passing the table to an interpreter 12/3/2018 CSC NLP - Regex, Finite State Automata

18 CSC 9010- NLP - Regex, Finite State Automata
Recognition as Search You can view this algorithm as a degenerate kind of state-space search. States are pairings of tape positions and state numbers. Operators are compiled into the table Goal state is a pairing with the end of tape position and a final accept state Its degenerate because? 12/3/2018 CSC NLP - Regex, Finite State Automata

19 Generative Formalisms
Formal Languages are sets of strings composed of symbols from a finite set of symbols. Finite-state automata define formal languages (without having to enumerate all the strings in the language) The term Generative is based on the view that you can run the machine as a generator to get strings from the language. 12/3/2018 CSC NLP - Regex, Finite State Automata

20 Generative Formalisms
FSAs can be viewed from two perspectives: Acceptors that can tell you if a string is in the language Generators to produce all and only the strings in the language 12/3/2018 CSC NLP - Regex, Finite State Automata

21 CSC 9010- NLP - Regex, Finite State Automata
Review Regular expressions are just a compact textual representation of FSAs Recognition is the process of determining if a string/input is in the language defined by some machine. Recognition is straightforward with deterministic machines. 12/3/2018 CSC NLP - Regex, Finite State Automata

22 CSC 9010- NLP - Regex, Finite State Automata
Three Views Three equivalent formal ways to look at what we’re up to (not including tables) Regular Expressions Mention machine (Turing) production systems (Post) Regular sets (Kleene) Finite State Automata Regular Languages 12/3/2018 CSC NLP - Regex, Finite State Automata

23 Defining Languages with Productions
S → b a a A A → a A A → ! S → NP VP NP → PrNoun NP → Det Noun Det → a | the Noun → cat | dog| book PrNoun → samantha |elmer | fido VP → IVerb | TVerb NP IVerb → ran |slept | ate TVerb → hit | kissed | ate Regular language Regular? 12/3/2018 CSC NLP - Regex, Finite State Automata

24 CSC 9010- NLP - Regex, Finite State Automata
Non-Determinism Compare: 12/3/2018 CSC NLP - Regex, Finite State Automata

25 CSC 9010- NLP - Regex, Finite State Automata
Non-Determinism cont. Epsilon transitions: Note: these transitions do not examine or advance the tape during recognition ε 12/3/2018 CSC NLP - Regex, Finite State Automata

26 Are Non-deterministic FSA more powerful?
Non-deterministic machines can be converted to deterministic ones with a fairly simple construction One way to do recognition with a non-deterministic machine is to turn it into a deterministic one. 12/3/2018 CSC NLP - Regex, Finite State Automata

27 Non-Deterministic Recognition
In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine. But not all paths directed through the machine for an accept string lead to an accept state. No paths through the machine lead to an accept state for a string not in the language. 12/3/2018 CSC NLP - Regex, Finite State Automata

28 Non-Deterministic Recognition
So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. Failure occurs when none of the possible paths lead to an accept state. 12/3/2018 CSC NLP - Regex, Finite State Automata

29 CSC 9010- NLP - Regex, Finite State Automata
Example b a a a ! \ q0 q1 q2 q2 q3 q4 12/3/2018 CSC NLP - Regex, Finite State Automata

30 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

31 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

32 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

33 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

34 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

35 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

36 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

37 CSC 9010- NLP - Regex, Finite State Automata
Example 12/3/2018 CSC NLP - Regex, Finite State Automata

38 CSC 9010- NLP - Regex, Finite State Automata
Key Points States in the search space are pairings of tape positions and states in the machine. By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input. 12/3/2018 CSC NLP - Regex, Finite State Automata

39 CSC 9010- NLP - Regex, Finite State Automata
ND-Recognize Code 12/3/2018 CSC NLP - Regex, Finite State Automata

40 CSC 9010- NLP - Regex, Finite State Automata
Infinite Search If you’re not careful such searches can go into an infinite loop. How? 12/3/2018 CSC NLP - Regex, Finite State Automata

41 CSC 9010- NLP - Regex, Finite State Automata
Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches so why bother? More natural solutions Machines based on construction are too big 12/3/2018 CSC NLP - Regex, Finite State Automata

42 Compositional Machines
Formal languages are just sets of strings Therefore, we can talk about various set operations (intersection, union, concatenation) This turns out to be a useful exercise 12/3/2018 CSC NLP - Regex, Finite State Automata

43 CSC 9010- NLP - Regex, Finite State Automata
Union Accept a string in either of two languages 12/3/2018 CSC NLP - Regex, Finite State Automata

44 CSC 9010- NLP - Regex, Finite State Automata
Concatenation Accept a string consisting of a string from language L1 followed by a string from language L2. 12/3/2018 CSC NLP - Regex, Finite State Automata

45 CSC 9010- NLP - Regex, Finite State Automata
Negation Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1 Invert all the accept and not accept states in M1 Does that work for non-deterministic machines? 12/3/2018 CSC NLP - Regex, Finite State Automata

46 CSC 9010- NLP - Regex, Finite State Automata
Intersection Accept a string that is in both of two specified languages An indirect construction… A^B = ~(~A or ~B) 12/3/2018 CSC NLP - Regex, Finite State Automata


Download ppt "CSC NLP - Regex, Finite State Automata"

Similar presentations


Ads by Google