Natural Language Processing Lecture 3—9/3/2013 Jim Martin.


Speech and Language Processing - Jurafsky and Martin

Today
- Review/finish up FSA material
- English morphology
- Morphological processing and FSAs

ND Recognition

Two basic approaches (used in all major implementations of regular expressions; see Friedl 2006):
1. Take a ND machine, convert it to a D machine, and then do recognition with that.
2. Explicitly manage the process of recognition as a state-space search (leaving the machine as is).

Non-Deterministic Recognition: Search

In a ND FSA there exists at least one path through the machine for any string that is in the language defined by the machine. But not all paths through the machine for an accepted string necessarily lead to an accept state. And no path through the machine leads to an accept state for a string not in the language.

Non-Deterministic Recognition

So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state. Failure occurs when all of the possible paths for a given string lead to failure.

Example

(Diagram: the non-deterministic "baa+!" machine, with transitions labeled b, a, a, ! over states q0-q4, and a choice of two a-transitions out of q2.)

ND-Recognize

(Figure: the ND-RECOGNIZE algorithm.)

Example

Trace of the search on "baaa!":

Current     Agenda
            {(q0,1)}
(q0,1)      {(q1,2)}
(q1,2)      {(q2,3)}
(q2,3)      {(q3,4), (q2,4)}
(q3,4)      {(q2,4)}
(q2,4)      {(q3,5), (q2,5)}
(q3,5)      {(q4,6), (q2,5)}
(q4,6)      accept
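The agenda-driven search traced above can be reproduced with a few lines of code. This is a sketch in Python rather than the slide's pseudocode: the transition table is my encoding of the baa+! machine, and tape positions are 0-based here where the trace uses 1-based positions.

```python
# Agenda-based non-deterministic recognition (a sketch; the transition
# table encodes the "baa+!" machine from the example slides).
NFA = {
    ("q0", "b"): ["q1"],
    ("q1", "a"): ["q2"],
    ("q2", "a"): ["q2", "q3"],  # the non-deterministic choice
    ("q3", "!"): ["q4"],
}
ACCEPT = {"q4"}

def nd_recognize(tape, start="q0"):
    # Search states are (node, next tape index).
    agenda = [(start, 0)]
    while agenda:
        node, i = agenda.pop()
        if node in ACCEPT and i == len(tape):
            return True   # found a path ending in an accept state
        if i < len(tape):
            # Add every state reachable on the current symbol.
            for nxt in NFA.get((node, tape[i]), []):
                agenda.append((nxt, i + 1))
    return False          # every path led to failure

print(nd_recognize("baaa!"))  # True
print(nd_recognize("b!"))     # False
```

Using a stack as the agenda gives depth-first search; a queue would give breadth-first. Either way, success means some path reached an accept state at the end of the tape.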

Why Bother with ND?

Non-determinism doesn't get us more formal power, and it causes headaches, so why bother?
- More natural (understandable) solutions
- It's not always obvious to users whether the regex they've produced is non-deterministic
- Better not to make them worry about it

Converting NFAs to DFAs

The Subset Construction is the means by which we can convert an NFA to a DFA automatically. The intuition is to think about being in multiple states at the same time. Let's go back to our earlier example, where we're in state q2 looking at an "a".

Subset Construction

So the trick is to simulate going to both q2 and q3 at the same time. One way to do this is to imagine a new state of a new machine that represents the state of being in q2 and q3 at the same time.
- Let's call that new state {q2,q3}
- That's just the name of a new state, but it helps us remember where it came from
- It's a subset of the original set of states
The construction does this for all possible subsets of the original states (the powerset).
- Then we fill in the transition table for each such set

Subset Construction

Given an NFA with the usual parts: Q, Σ, transition function δ, start state q0, and designated accept states. We'll construct a new DFA that accepts the same language, where:
- States of the new machine are the powerset of states Q: call it QD, the set of all subsets of Q
- Start state is {q0}
- Alphabet is the same: Σ
- Accept states are the states in QD that contain any accept state from Q

Subset Construction

What about the transition function? For every new state, we'll create a transition on a symbol α from the alphabet as follows:
- δD({q1,…,qk}, α) = δN(q1, α) ∪ … ∪ δN(qk, α), for all α in the alphabet
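The definition above translates almost directly into code. A minimal sketch, with names of my own choosing: the NFA is a dict from (state, symbol) pairs to sets of states, and only reachable subset-states are built, anticipating the lazy method on the later slides.

```python
from itertools import chain

def nfa_to_dfa(nfa, start, accepts, alphabet):
    """Subset construction: nfa maps (state, symbol) -> set of states."""
    dstart = frozenset([start])
    dtrans, daccepts = {}, set()
    stack, seen = [dstart], {dstart}
    while stack:
        S = stack.pop()
        if S & accepts:            # accept if S contains any NFA accept state
            daccepts.add(S)
        for sym in alphabet:
            # delta_D(S, sym) = union of delta_N(q, sym) over q in S
            T = frozenset(chain.from_iterable(nfa.get((q, sym), ()) for q in S))
            dtrans[(S, sym)] = T
            if T and T not in seen:  # only pursue reachable, non-empty subsets
                seen.add(T)
                stack.append(T)
    return dstart, dtrans, daccepts

# The baa+! machine again:
nfa = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
       ("q2", "a"): {"q2", "q3"}, ("q3", "!"): {"q4"}}
dstart, dtrans, daccepts = nfa_to_dfa(nfa, "q0", {"q4"}, "ab!")
print(dtrans[(frozenset({"q2"}), "a")] == frozenset({"q2", "q3"}))  # True
```

On this machine only five subset-states turn out to be reachable, far fewer than the full powerset, which is the point made on the "Couple of Notes" slide.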

Baaa!

How does that work out for our example?
- Alphabet is still "a", "b" and "!"
- Start state is {q0}
- The rest of the states are {q1}, {q2}, ... {q4}, {q1,q2}, {q1,q3}, ... {q0,q1,q2,q3,q4} -- all subsets of states in Q
What's the transition table going to look like?

Lazy Method

The NFA's transition table:

        b     a        !
  q0    q1
  q1          q2
  q2          q2,q3
  q3                   q4

Filling in the DFA table lazily, adding subset-states only as they are reached from {q0}:

           b      a        !
  {q0}     {q1}
  {q1}            {q2}
  {q2}            {q2,q3}
  {q2,q3}         {q2,q3}  {q4}
  {q4}

Couple of Notes

We didn't come close to needing 2^|Q| new states; most of those were unreachable. So in theory there is the potential for an explosion in the number of states; in practice, it may be more manageable. Draw the new deterministic machine from the table on the previous slide... It should look familiar.

Compositional Machines

Recall that formal languages are just sets of strings. Therefore, we can talk about set operations (intersection, union, concatenation, negation) on languages. This turns out to be very useful:
- It allows us to decompose problems into smaller problems, solve those problems with machines for specific languages, and then compose those solutions to solve the big problems.

Example

Create a regex to match all the ways that people write down phone numbers. For just the U.S. that needs to cover formats like "(303) ..." and "(01) ..." You could write one big hairy regex to capture all that, or you could write individual regexes for each type and then OR them together into a new regex/machine. How does that work?
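In a regex engine, the OR-ing step is just alternation. A small illustration using Python's re module; the concrete phone formats are my own stand-ins, since the slide's example numbers did not survive transcription.

```python
import re

# Hypothetical formats standing in for the slide's examples:
patterns = [
    r"\(\d{3}\) \d{3}-\d{4}",     # e.g. (303) 555-1212
    r"\d{3}-\d{3}-\d{4}",         # e.g. 303-555-1212
    r"\(01\) \d{2} \d{4} \d{4}",  # e.g. (01) 49 1234 5678
]

# Union: OR the individual machines together with "|".
phone = re.compile("|".join(f"(?:{p})" for p in patterns))

print(bool(phone.fullmatch("(303) 555-1212")))  # True
print(bool(phone.fullmatch("phone home")))      # False
```

Each small pattern compiles to its own machine; the alternation corresponds to the union construction on the next slide.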

Union (Or)

Negation

Construct a machine M2 that accepts all strings not accepted by machine M1 and rejects all the strings accepted by M1:
- Invert the accept and non-accept states in M1
Does that work for non-deterministic machines?
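The inversion trick can be sketched directly, with the caveat (answering the slide's closing question) that it is only sound for a complete deterministic machine: in an ND machine a string may have both accepting and non-accepting paths, so flipping the accept states does not complement the language. The example DFA below is my own.

```python
# Complement a *complete* DFA by inverting its accept states.

def complement(states, accepts):
    # New accept set: every state that was not accepting.
    return states - accepts

def run(trans, start, accepts, s):
    q = start
    for ch in s:
        q = trans[(q, ch)]   # deterministic: exactly one next state
    return q in accepts

# Complete DFA over {a, b} accepting strings that end in "a":
states = {"s0", "s1"}
trans = {("s0", "a"): "s1", ("s0", "b"): "s0",
         ("s1", "a"): "s1", ("s1", "b"): "s0"}
accepts = {"s1"}
neg = complement(states, accepts)

print(run(trans, "s0", accepts, "ba"), run(trans, "s0", neg, "ba"))  # True False
```

Completeness matters: if some state lacked a transition on some symbol, a rejected string could "fall off" the machine rather than land in a flipped state, so a sink state must be added first.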

Intersection (AND)

Accept a string that is in both of two specified languages. An indirect construction:
- A^B = ~(~A or ~B)

Break

If you were on the waitlist last week, you should be in now. The class is now at the room cap, so if you're on the waitlist now you'll have to wait for someone to drop. The first homework will be posted soon...
- Given a newspaper article, segment it into sentences and words (tokens) and tell me how many there are of each.
- I will post development examples of the kind of texts I'm talking about.

Changing Gears

We're switching to talking about why this stuff is relevant to NLP. In particular, we'll be talking about words and morphology.

Words

Finite-state methods are particularly useful in dealing with large lexicons:
- That is, big bunches of words
- Often infinite-sized bunches
Many devices, some with limited memory resources, need access to large lists of words, and they need to perform fairly sophisticated tasks with those lists. So we'll first talk about some facts about words and then come back to computational methods.

English Morphology

Morphology is the study of the ways that words are built up from smaller units called morphemes, the minimal meaning-bearing units in a language. We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions

English Morphology

We can further divide morphology into two broad classes:
- Inflectional
- Derivational

Word Classes

By word class, we have in mind familiar notions like noun and verb, also referred to as parts of speech and lexical categories. We'll go into the gory details in Chapter 5. Right now we're concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem.

Inflectional Morphology

Inflectional morphology concerns the combination of stems and affixes where the resulting word...
- Has the same word class as the original
- And serves a grammatical/semantic purpose that is different from the original, but is nevertheless transparently related to the original
"walk" + "s" = "walks"

Inflection in English

Nouns are simple:
- Markers for plural and possessive
Verbs are only slightly more complex:
- Markers appropriate to the tense of the verb
That's pretty much it. Other languages can be quite a bit more complex. An implication of this is that hacks (approaches) that work in English will not work for many other languages.

Regulars and Irregulars

Things are complicated by the fact that some words misbehave (refuse to follow the rules):
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew, catch/caught
The terms regular and irregular are used to refer to words that follow the rules and those that don't.

Regular and Irregular Verbs

Regulars...
- Walk, walks, walking, walked, walked
Irregulars...
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut

Inflectional Morphology

So inflectional morphology in English is fairly straightforward, but it is somewhat complicated by the fact that there are irregularities.

Derivational Morphology

Derivational morphology is the messy stuff that no one ever taught you. In English it is characterized by:
- Quasi-systematicity
- Irregular meaning change
- Changes of word class

Derivational Examples

Verbs and adjectives to nouns:
-ation   computerize   computerization
-ee      appoint       appointee
-er      kill          killer
-ness    fuzzy         fuzziness

Derivational Examples

Nouns and verbs to adjectives:
-al      computation   computational
-able    embrace       embraceable
-less    clue          clueless

Example: Compute

Many paths are possible... Start with compute:
- computer -> computerize -> computerization
- computer -> computerize -> computerizable
But not all paths/operations are equally good (allowable?):
- clue -> clueless
- clue -> ?clueful
- clue -> *clueable

Morphology and FSAs

We would like to use the machinery provided by FSAs to capture these facts about morphology:
- Accept strings that are in the language
- Reject strings that are not
- And do so in a way that doesn't require us to, in effect, list all the forms of all the words in the language: even in English this is inefficient, and in other languages it is impossible

Start Simple

Regular singular nouns are OK as is; they are in the language. Regular plural nouns have an -s on the end, so they're also in the language. Irregulars are OK as is.

Simple Rules

Now Plug in the Words Spelled Out

Replace the class names like "reg-noun" with FSAs that recognize all the words in that class.
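That plugging-in step can be simulated with word lists standing in for the sub-lexicons. The entries below are my own examples, not the slides', and orthographic complications (fox -> foxes) are deliberately ignored here, as they call for the spelling rules handled with transducers.

```python
# Tiny stand-ins for the sub-lexicon word classes (my own examples):
reg_noun = {"fox", "cat", "dog", "aardvark"}
irreg_sg_noun = {"goose", "mouse", "sheep"}
irreg_pl_noun = {"geese", "mice", "sheep"}

def accept_noun(word):
    # Irregulars and regular singulars are in the language as-is;
    # regular plurals are a regular stem plus -s.  (Spelling rules
    # like fox -> foxes are ignored in this simple machine.)
    if word in reg_noun or word in irreg_sg_noun or word in irreg_pl_noun:
        return True
    return word.endswith("s") and word[:-1] in reg_noun

print(accept_noun("cats"), accept_noun("geese"), accept_noun("gooses"))  # True True False
```

The function mirrors the FSA: the membership tests are the expanded sub-lexicon arcs, and the -s check is the plural arc for regular nouns.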