1 Regular Expressions and Automata CPE 641 Natural Language Processing from Kathy McCoy’s slides, CISC 882 Introduction to NLP

Slides:



Advertisements
Similar presentations
4b Lexical analysis Finite Automata
Advertisements

1 Regular Expressions and Automata September Lecture #2.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
Finite-state automata 2 Day 13 LING Computational Linguistics Harry Howard Tulane University.
Regular Expressions (RE) Used for specifying text search strings. Standarized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
Natural Language and Speech Processing Creation of computational models of the understanding and the generation of natural language. Different fields coming.
1 Regular Expressions and Automata September Lecture #2-2.
Finite-State Automata Shallow Processing Techniques for NLP Ling570 October 5, 2011.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
CS 4705 Lecture 2 Regular Expressions and Automata in Language Analysis.
Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Fall 2008.
Computational Language Finite State Machines and Regular Expressions.
CMSC 723 / LING 645: Intro to Computational Linguistics September 8, 2004: Monz Regular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. Dorr.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705.
CS 4705 Regular Expressions and Automata in Natural Language Analysis CS 4705 Julia Hirschberg.
Finite Automata Chapter 5. Formal Language Definitions Why need formal definitions of language –Define a precise, unambiguous and uniform interpretation.
Topics Automata Theory Grammars and Languages Complexities
CSC 361Finite Automata1. CSC 361Finite Automata2 Formal Specification of Languages Generators Grammars Context-free Regular Regular Expressions Recognizers.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Grammars, Languages and Finite-state automata Languages are described by grammars We need an algorithm that takes as input grammar sentence And gives a.
CMSC 723: Intro to Computational Linguistics Lecture 2: February 4, 2004 Regular Expressions and Finite State Automata Professor Bonnie J. Dorr Dr. Nizar.
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
1 Unit 1: Automata Theory and Formal Languages Readings 1, 2.2, 2.3.
March 1, 2009 Dr. Muhammed Al-mulhem 1 ICS 482 Natural Language Processing Regular Expression and Finite Automata Muhammed Al-Mulhem March 1, 2009.
October 2004CSA3050 NL Algorithms1 CSA3050: Natural Language Algorithms Words, Strings and Regular Expressions Finite State Automota.
Lexical Analysis - An Introduction. The Front End The purpose of the front end is to deal with the input language Perform a membership test: code  source.
Lexical Analysis - An Introduction Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at.
Lecture 1 Computation and Languages CS311 Fall 2012.
1 Computability Five lectures. Slides available from my web page There is some formality, but it is gentle,
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
Natural Language Processing Lecture 2—1/15/2015 Susan W. Brown.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
Copyright © Curt Hill Finite State Automata Again This Time No Output.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
1 LING 6932 Spring 2007 LING 6932 Topics in Computational Linguistics Hana Filip Lecture 2: Regular Expressions, Finite State Automata.
1 Regular Expressions and Automata August Lecture #2.
CS 4705 Some slides adapted from Hirschberg, Dorr/Monz, Jurafsky.
Natural Language Processing Lecture 4 : Regular Expressions and Automata.
CS 4705 Lecture 2 Regular Expressions and Automata.
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
Finite State Machines 1.Finite state machines with output 2.Finite state machines with no output 3.DFA 4.NDFA.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
using Deterministic Finite Automata & Nondeterministic Finite Automata
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
Conversions Regular Expression to FA FA to Regular Expression.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
Deterministic Finite Automata Nondeterministic Finite Automata.
Theory of Computation Automata Theory Dr. Ayman Srour.
Theory of Computation Automata Theory Dr. Ayman Srour.
Department of Software & Media Technology
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
Lexical analysis Finite Automata
Compilers Welcome to a journey to CS419 Lecture5: Lexical Analysis:
Chapter 2 FINITE AUTOMATA.
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
CSC NLP - Regex, Finite State Automata
Finite Automata.
4b Lexical analysis Finite Automata
Regular Expressions and Automata in Language Analysis
4b Lexical analysis Finite Automata
CPSC 503 Computational Linguistics
Automata theory and formal languages COS 3112 – AUTOMATA THEORY PRELIM PERIOD WEEK 1 AND 2.
Natural Language Processing (NLP)
Presentation transcript:

1 Regular Expressions and Automata CPE 641 Natural Language Processing from Kathy McCoy’s slides, CISC 882 Introduction to NLP

2 Natural Language Understanding Associate each input (acoustic signal/character string) with a meaning representation. Carried out by a series of components: –Each component acts as a translator from one representation to another –In general, each component adds successively ‘richer’ information to the output

3 Phonological / morphological analyser SYNTACTIC COMPONENT SEMANTIC INTERPRETER CONTEXTUAL REASONER Sequence of words Syntactic structure (parse tree) Logical form Meaning Representation Spoken input For speech understanding Phonological & morphological rules Grammatical Knowledge Semantic rules, Lexical semantics Pragmatic & World Knowledge Indicating relns (e.g., mod) between words Thematic Roles Selectional restrictions Basic Process of NLU “He loves Mary.” Mary He loves  x loves(x, Mary) loves(John, Mary)

4 Representations and Algorithms for NLP Representations: formal models used to capture linguistic knowledge Algorithms manipulate representations to analyze or generate linguistic phenomena Simplest often produce best performance but….the 80/20 Rule and “low-hanging fruit”

5 NLP Representations State Machines –FSAs, FSTs, HMMs, ATNs, RTNs Rule Systems –CFGs, Unification Grammars, Probabilistic CFGs Logic-based Formalisms –1 st Order Predicate Calculus, Temporal and other Higher Order Logics Models of Uncertainty –Bayesian Probability Theory

6 NLP Algorithms Most are parsers or transducers: accept or reject input, and construct new structure from input –State space search Pair a partial structure with a part of the inputpartial structure Spaces too big and ‘best’ is hard to define –Dynamic programming Avoid recomputing structures that are common to multiple solutions

7 Today Review some of the simple representations and ask ourselves how we might use them to do interesting and useful things –Regular Expressions –Finite State Automata How much can you get out of a simple tool?

8 Regular Expressions Can be viewed as a way to specify: –Search patterns over text string –Design of a particular kind of machine, called a Finite State Automaton (FSA) These are really equivalent

9 Uses of Regular Expressions in NLP As grep, perl: Simple but powerful tools for large corpus analysis and ‘shallow’ processing –What word is most likely to begin a sentence? –What word is most likely to begin a question? –In your own , are you more or less polite than the people you correspond with? With other unix tools, allow us to –Obtain word frequency and co-occurrence statistics –Build simple interactive applications (e.g. Eliza) Regular expressions define regular languages or sets

10 Regular Expressions Regular Expression: Formula in algebraic notation for specifying a set of strings String: Any sequence of alphanumeric characters –Letters, numbers, spaces, tabs, punctuation marks Regular Expression Search –Pattern: specifying the set of strings we want to search for –Corpus: the texts we want to search through

11 Some Examples

12 REDescriptionUses? /a*/Zero or more a’sOptional doubled modifiers (words) /a+/One or more a’sNon-optional... /a?/Zero or one a’sOptional... /cat|dog/‘cat’ or ‘dog’Words modifying pets /^cat\.$/A line that contains only cat. ^anchors beginning, $ anchors end of line. ?? /\bun\B/Beginnings of longer strings Words prefixed by ‘un’

13 happier and happier, fuzzier and fuzzier/ (.+)ier and \1ier / Morphological variants of ‘puppy’/pupp(y|ies)/ E.G.RE

14 Optionality and Repetition /[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck /colou?r/ matches color or colour /he{3}/ matches heee /(he){3}/ matches hehehe /(he){3,} matches a sequence of at least 3 he’s

15 Operator Precedence Hierarchy 1. Parentheses () 2. Counters* + ? {} 3. Sequence of Anchorsthe ^my end$ 4. Disjunction| Examples /moo+/ /try|ies/ /and|or/

16 A Simple Exercise Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

17 A Simple Exercise Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with southern factions.

18 A Simple Exercise Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

19 A Simple Exercise Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

20 A Simple Exercise Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

21 A Simple Exercise Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

22 The Two Kinds of Errors The process we just went through was based on fixing errors in the regular expression –Errors where some of the instances were missed (judged to not be instances when they should have been) – False negatives –Errors where the instances were included (when they should not have been) – False positives

23 Finite State Automata Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata). FSAs and their close variants are a theoretical foundation of much of the field of NLP.

24 Finite State Automata FSAs recognize the regular languages represented by regular expressions –SheepTalk: /baa+!/ q0 q4 q1q2q3 ba a a! Directed graph with labeled nodes and arc transitions Five states: q0 the start state, q4 the final state, 5 transitions

25 Formally FSA is a 5-tuple consisting of –Q: set of states {q0,q1,q2,q3,q4} –  : an alphabet of symbols {a,b,!} –q0: a start state –F: a set of final states in Q {q4} –  (q,i): a transition function mapping Q x  to Q q0 q4 q1q2q3 ba a a!

26 State Transition Table for SheepTalk SheepTalk State Input ba! 01ØØ 1Ø2Ø 2Ø3Ø 3Ø34 4ØØØ

27 Recognition Recognition (or acceptance) is the process of determining whether or not a given input should be accepted by a given machine. In terms of REs, it’s the process of determining whether or not a given input matches a particular regular expression. Traditionally, recognition is viewed as processing an input written on a tape consisting of cells containing elements from the alphabet.

28 FSA recognizes (accepts) strings of a regular language –baa! –baaa! –… Tape metaphor: a rejected input b!aba q0q0

29 baaaa! q0q0 State Input ba! 01ØØ 1Ø2Ø 2Ø3Ø 3Ø34 4ØØØ q3q3 q3q3 q4q4 q1q1 q2q2 q3q3

30 Key Points About D-Recognize D-Regognize (D = Deterministic) means that the code always knows what to do at each point in the process Recognition code is universal for all FSAs. To change to a new formal language: –change the alphabet –change the transition table Searching for a string using a RE involves compiling the RE into a table and passing the table to the interpreter

31 Determinism and Non-Determinism Deterministic: There is at most one transition that can be taken given a current state and input symbol. Non-deterministic: There is a choice of several transitions that can be taken given a current state and input symbol. (The machine doesn’t specify how to make the choice.)

32 Non-Deterministic FSAs for SheepTalk q0 q4 q1q2q3 ba a a! q0 q4 q1q2q3 baa! 

33 FSAs as Grammars for Natural Language q2q4q5q0q3q1q6 therev mr dr hon patl.robinson ms mrs 

34 Problems of Non-Determinism ‘Natural’….but at any choice point, we may follow the wrong arc Potential solutions: –Save backup states at each choice point –Look-ahead in the input before making choice –Pursue alternatives in parallel –Determinize our NFSAs (and then minimize) FSAs can be useful tools for recognizing – and generating – subsets of natural language –But they cannot represent all NL phenomena (Center Embedding: The mouse the cat... chased died.)

35 Summing Up Regular expressions and FSAs can represent subsets of natural language as well as regular languages –Both representations may be impossible for humans to understand for any real subset of a language –But they are very easy to use for smaller subsets For fun: –Think of ways you might characterize features of your using only regular expressions

36