Regular Expressions and Automata Chapter 2

Regular Expressions
- Standard notation for characterizing text sequences
- Used in all kinds of text processing and information extraction tasks
- As things have progressed, the RE languages used in various tools and languages (grep, Emacs, Python, Ruby, Java, …) have become very similar
7/3/2015 Speech and Language Processing - Jurafsky and Martin

Regular Expressions
We’ll look at a few examples [in lecture], make a note about types of errors, and then move toward automata.

Example
Find all the instances of the word “the” in a text.
- /the/
- /[tT]he/
- /\b[tT]he\b/
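The three patterns above can be tried out with Python's `re` module. This is a minimal sketch; the sample sentence is made up for illustration and is not from the lecture.

```python
import re

# Hypothetical sample text, chosen to contain "the" inside longer words.
text = "The other day the cat sat there; then the dog bathed."

# Pattern 1: misses the capitalized "The" and also fires inside
# "other", "there", "then" and "bathed".
plain = re.findall(r"the", text)

# Pattern 2: now catches "The", but still fires inside longer words.
either_case = re.findall(r"[tT]he", text)

# Pattern 3: the word boundaries (\b) keep only the standalone word.
whole_word = re.findall(r"\b[tT]he\b", text)

print(len(plain), len(either_case), whole_word)
```

Each refinement removes one kind of mistake: the character class recovers the missed capitalized form, and the word boundaries discard matches inside other words.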

Errors
We fixed two kinds of errors:
- Matching strings that we should not have matched: false positives (Type I)
- Not matching things that we should have matched: false negatives (Type II)

Errors
We’ll see the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts:
- Increasing precision (minimizing false positives)
- Increasing coverage, or recall (minimizing false negatives)
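Precision and recall follow directly from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions; the function name and the example counts are illustrative assumptions, not figures from the lecture.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Return (precision, recall) from raw match counts."""
    predicted = true_pos + false_pos   # everything the pattern matched
    relevant = true_pos + false_neg    # everything it should have matched
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Illustrative counts: 2 correct matches, 4 spurious matches, 1 missed match.
precision, recall = precision_recall(true_pos=2, false_pos=4, false_neg=1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Removing false positives raises precision and recovering missed matches raises recall; a change that helps one can hurt the other, which is why the two efforts are described as antagonistic.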

Formal Languages and Models
- Language: a (possibly infinite) set of strings made up of symbols from a finite alphabet
- Model of a language: can recognize and generate all and only the strings from the language; it serves as a definition of the formal language

Chomsky Hierarchy
- Regular language (model: regular expressions, finite state automata)
- Context free language
- Context sensitive language
- Unrestricted language (model: Turing machine)

Regular Expressions and Languages
- A regular expression pattern can be mapped to a set of strings
- A regular expression pattern defines a language (in the formal sense); languages of this type are called regular languages

Finite State Automata
- FSAs and their probabilistic relatives are at the core of much of what we’ll be doing all semester.
- They also capture significant aspects of what linguists say we need for morphology and parts of syntax.
- They are formally equivalent to regular expressions.

Formal Definition of a Finite Automaton
1. Finite set of states, typically Q.
2. Alphabet of input symbols, typically Σ.
3. One state is the start/initial state, typically q0 (q0 ∈ Q).
4. Zero or more final/accepting states; the set is typically F (F ⊆ Q).
5. A transition function, typically δ. This function takes a state and an input symbol as arguments and returns a state. One “rule” would be written δ(q, a) = p, where q and p are states and a is an input symbol. Intuitively: if the FA is in state q and input a is received, then the FA goes to state p (note: q = p is OK).
6. An FA is represented as the five-tuple A = (Q, Σ, δ, q0, F). Here, F is a set of accepting states.
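The five-tuple maps naturally onto a small data structure. The following is a sketch only: the class name `FiniteAutomaton`, its field names and the toy machine `EVEN_AS` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class FiniteAutomaton:
    states: FrozenSet[str]                 # Q
    alphabet: FrozenSet[str]               # Σ
    delta: Dict[Tuple[str, str], str]      # δ(q, a) = p
    start: str                             # q0
    accepting: FrozenSet[str]              # F

    def __post_init__(self) -> None:
        # Enforce the side conditions q0 ∈ Q and F ⊆ Q, and check every rule.
        if self.start not in self.states:
            raise ValueError("q0 must be a member of Q")
        if not self.accepting <= self.states:
            raise ValueError("F must be a subset of Q")
        for (q, a), p in self.delta.items():
            if q not in self.states or a not in self.alphabet or p not in self.states:
                raise ValueError(f"bad rule: delta({q}, {a}) = {p}")


# Hypothetical toy machine over {a}: accepts strings with an even number of a's.
EVEN_AS = FiniteAutomaton(
    states=frozenset({"q0", "q1"}),
    alphabet=frozenset({"a"}),
    delta={("q0", "a"): "q1", ("q1", "a"): "q0"},
    start="q0",
    accepting=frozenset({"q0"}),
)
```

Storing δ as a dictionary keyed by (state, symbol) keeps each rule δ(q, a) = p as one entry; a missing key simply means no transition is defined.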

A Simple Example
Language: “Sheepish”
Any string that starts with the letter b, followed by two or more a’s, and ending in !
{“baa!”, “baaa!”, “baaaa!”, “baaaaa!”, …}
Regular expression for this?
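One possible answer to the question above is the pattern `baa+!`: a b, one a, one or more further a's, then an exclamation mark. The sketch below checks it with Python's `re.fullmatch`; the helper name `is_sheepish` is an assumption for illustration.

```python
import re

# b, then "a" followed by one or more "a" (two or more in total), then "!".
SHEEPISH = re.compile(r"baa+!")


def is_sheepish(string: str) -> bool:
    # fullmatch requires the whole string to match, not just a substring.
    return SHEEPISH.fullmatch(string) is not None


for candidate in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(candidate, is_sheepish(candidate))
```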

One Possible Sheepish FSA
Formal definition of this FSA?

FSA as a Recognizer
Does a string belong to its language?
1. Place the input string on a tape, point at start
2. Initialize current state to q0
3. Iteratively check the next letter on tape.
   1. From the current state, if an outgoing arc label matches the new letter, move to the new state
   2. If stuck, REJECT
4. If you reach the end of the tape in a final state, then ACCEPT; else, REJECT
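The recognition steps above can be sketched in a few lines. The machine diagram is not part of this transcript, so the five-state Sheepish machine below (states 0 to 4, with 4 as the only final state) is an assumption, as are the names `TRANSITIONS` and `recognize`.

```python
# Assumed deterministic Sheepish machine: 0 -b-> 1 -a-> 2 -a-> 3, 3 -a-> 3, 3 -!-> 4.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START = 0
ACCEPTING = {4}


def recognize(tape: str) -> bool:
    state = START                              # step 2: start in q0
    for symbol in tape:                        # step 3: read the tape left to right
        if (state, symbol) not in TRANSITIONS:
            return False                       # step 3.2: stuck, REJECT
        state = TRANSITIONS[(state, symbol)]   # step 3.1: follow the matching arc
    return state in ACCEPTING                  # step 4: ACCEPT only in a final state


print(recognize("baaa!"), recognize("ba!"))
```

Because the machine is deterministic, the loop never has to back up: each symbol is examined once, so recognition takes time linear in the length of the tape.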

Recognition
Traditionally (Turing’s notion), this process is depicted with a tape.

FSA as Generator
An FSA can also produce strings in the language it represents:
1. Start from q0
2. Pick an outgoing arc to a new state (for now, assume picking randomly) and print the symbol on the arc
3. Follow the arc to the new state
4. Repeat from step 2 until reaching a final state
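The generation loop above can be sketched as a random walk over the arcs. The machine is again an assumed five-state Sheepish FSA (the diagram is not in this transcript), and `ARCS` and `generate` are hypothetical names.

```python
import random

# Assumed Sheepish machine as an arc list: state -> [(symbol, next_state), ...].
ARCS = {
    0: [("b", 1)],
    1: [("a", 2)],
    2: [("a", 3)],
    3: [("a", 3), ("!", 4)],
    4: [],
}
START = 0
ACCEPTING = {4}


def generate(rng: random.Random) -> str:
    state, output = START, []
    while state not in ACCEPTING:                 # step 4: stop at a final state
        symbol, state = rng.choice(ARCS[state])   # steps 2-3: pick an arc, follow it
        output.append(symbol)                     # step 2: "print" the arc's symbol
    return "".join(output)


print([generate(random.Random(seed)) for seed in range(5)])
```

Stopping at the first final state works here because the assumed final state has no outgoing arcs. For machines whose final states can be left again, a generator would also need a rule for deciding when to stop.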

FSAs and Regular Expressions
- These are formally equivalent. Both of these classes of models recognize/generate exactly the class of regular languages.
- Interesting proofs: constructive! Given any regular expression, create an equivalent FSA; given any FSA, create an equivalent regular expression.

Note on Practical Regular Expression Utilities
NOTE: additional features in practical regular expression tools can make them more powerful than formal regular expressions; think of memory (for example, backreferences to earlier matches).
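Backreferences are one such memory feature. As a sketch, the Python pattern below matches strings of the form a^n b a^n, a language that is not regular, so no finite automaton recognizes it. The helper name `same_count_around_b` is an assumption for illustration.

```python
import re

# (a+) remembers a run of a's; \1 then demands exactly the same run again.
PATTERN = re.compile(r"(a+)b\1")


def same_count_around_b(string: str) -> bool:
    return PATTERN.fullmatch(string) is not None


for candidate in ["aba", "aabaa", "aabaaa", "aaba"]:
    print(candidate, same_count_around_b(candidate))
```

Matching the second run requires remembering how long the first run was, which a machine with a fixed number of states cannot do for arbitrary n.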

About Alphabets
Don’t take the term alphabet too narrowly; it just means we need a finite set of symbols in the input. These symbols can and will stand for bigger objects that can have internal structure.

Often there is more than one FSA for a given language
E.g., here is another FSA for “Sheepish”

Yet Another View
The guts of FSAs can ultimately be represented as tables.
[State-transition table with columns b, a, !, e]
If you’re in state 1 and you’re looking at an a, go to state 2.
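A table view can be sketched as a list of rows, one per state, with one column per input symbol. The entries below are assumptions (the slide's table did not survive in this transcript); they were chosen to agree with the one rule quoted above, state 1 on a goes to state 2.

```python
from typing import Optional

COLUMNS = ["b", "a", "!"]

# Row index = current state; None means "no transition".
TABLE = [
    [1,    None, None],   # state 0
    [None, 2,    None],   # state 1
    [None, 3,    None],   # state 2
    [None, 3,    4],      # state 3
    [None, None, None],   # state 4 (final)
]


def next_state(state: int, symbol: str) -> Optional[int]:
    if symbol not in COLUMNS:
        return None        # symbol outside the alphabet
    return TABLE[state][COLUMNS.index(symbol)]


print(next_state(1, "a"))
```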

Deterministic versus Non-Deterministic FSAs
- Deterministic means that at each point in processing there is always one unique thing to do (no choices).
- Non-deterministic means there are choices.
- Go back and look at the previous DFA.
- How do deterministic and non-deterministic FSAs compare?

Non-Deterministic FSAs
May include:
- Epsilon transitions
- Key point: these transitions do not examine or advance the tape during recognition
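Because epsilon transitions consume no input, a machine in one state is effectively also in every state reachable through epsilon arcs alone. The sketch below computes that set, often called the epsilon closure; the small machine and the use of the empty string as the epsilon label are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

EPS = ""  # assumed label for epsilon arcs; never equal to a real input symbol

# Hypothetical machine: arcs map (state, label) to a set of target states.
DELTA: Dict[Tuple[int, str], Set[int]] = {
    (0, EPS): {1, 2},
    (1, "a"): {3},
    (2, EPS): {4},
    (4, "b"): {3},
}


def epsilon_closure(states: Iterable[int], delta: Dict[Tuple[int, str], Set[int]]) -> Set[int]:
    closure = set(states)
    stack = list(closure)
    while stack:
        state = stack.pop()
        for target in delta.get((state, EPS), ()):  # follow arcs without reading the tape
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


print(sorted(epsilon_closure({0}, DELTA)))
```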

ND Recognition
Two basic approaches:
1. Either take a ND machine and convert it to a D machine and then do recognition with that.
2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).
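The first approach can be sketched with the subset construction, a standard conversion technique that the slide does not name: each state of the deterministic machine is a set of states of the non-deterministic one. The non-deterministic Sheepish machine below (state 2 has two arcs labeled a) and all identifiers are assumptions, and epsilon arcs are left out to keep the sketch short.

```python
from collections import deque

# Assumed ND Sheepish machine: on "a", state 2 may stay in 2 or move to 3.
NFA_DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}


def determinize(delta, start, accepting, alphabet):
    """Subset construction for an NFA without epsilon arcs."""
    start_set = frozenset([start])
    dfa_delta, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:               # contains at least one NFA final state
            dfa_accepting.add(current)
        for symbol in alphabet:
            target = frozenset(p for q in current for p in delta.get((q, symbol), ()))
            if not target:
                continue                      # no arc: the DFA gets stuck here
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_accepting


DFA_DELTA, DFA_START, DFA_ACCEPTING = determinize(NFA_DELTA, 0, {4}, "ba!")


def accepts(tape: str) -> bool:
    state = DFA_START
    for symbol in tape:
        state = DFA_DELTA.get((state, symbol))
        if state is None:
            return False
    return state in DFA_ACCEPTING


print(accepts("baaa!"), accepts("ba!"))
```

The conversion is paid for once, after which recognition needs no search. In the worst case the deterministic machine can have exponentially more states than the original.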

Non-Deterministic Recognition: Search
- In a ND FSA there exists at least one path through the machine, ending in an accept state, for a string that is in the language defined by the machine.
- But not all paths through the machine for an accepted string lead to an accept state.
- If a string is not in the language, there are no paths through the machine that lead to an accept state.

Non-Deterministic Recognition
So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state. Failure occurs when all of the possible paths for a given string lead to failure.
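Recognition as search can be sketched with an agenda of (state, tape position) pairs still to be explored. The machine below is an assumed non-deterministic Sheepish FSA with one epsilon arc from state 3 back to state 2; the names `DELTA`, `EPS` and `nd_recognize` are hypothetical.

```python
EPS = ""  # assumed label for epsilon arcs

# Assumed machine: 0 -b-> 1 -a-> 2 -a-> 3 -!-> 4, plus an epsilon arc 3 -> 2.
DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPS): {2},
    (3, "!"): {4},
}
START = 0
ACCEPTING = {4}


def nd_recognize(tape: str) -> bool:
    agenda = [(START, 0)]      # search states still to explore (depth-first)
    visited = set()            # avoids re-exploring, e.g. around epsilon loops
    while agenda:
        state, pos = agenda.pop()
        if (state, pos) in visited:
            continue
        visited.add((state, pos))
        if pos == len(tape) and state in ACCEPTING:
            return True        # one successful path is enough
        for target in DELTA.get((state, EPS), ()):
            agenda.append((target, pos))          # epsilon: tape does not advance
        if pos < len(tape):
            for target in DELTA.get((state, tape[pos]), ()):
                agenda.append((target, pos + 1))  # ordinary arc: advance the tape
    return False               # every path failed


print(nd_recognize("baaa!"), nd_recognize("ba!"))
```

Popping from the end of the list gives depth-first search; taking items from the front instead would give breadth-first search over the same space.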

Example

Why Non-Determinism?
Non-determinism doesn’t get us more formal power and it causes headaches, so why bother?
- More natural (understandable) solutions

Compositional Machines
- Formal languages are just sets of strings
- Therefore, we can talk about various set operations (intersection, union, concatenation)
- We’ll just do a couple
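On small finite languages these operations can be tried directly with Python sets. The two sample languages below are made up for illustration; real regular languages may be infinite, in which case the operations are carried out on the machines rather than on enumerated strings.

```python
# Two hypothetical finite languages.
LANG_A = {"baa!", "baaa!"}
LANG_B = {"baaa!", "moo"}

union = LANG_A | LANG_B            # strings in either language
intersection = LANG_A & LANG_B     # strings in both languages

# Concatenation: every string of A followed by every string of B.
concatenation = {x + y for x in LANG_A for y in LANG_B}

print(sorted(union))
print(sorted(intersection))
print(sorted(concatenation))
```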

Union

Concatenation