CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted.

Presentation transcript:

CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ and Jim Martin: http://www.cs.colorado.edu/~martin/csci5832.html CSC 9010- NLP - 3: Morphology, Finite State Transducers

Today
  Elementary Morphology
  Computational morphology
  Finite State Transducers
  Lexicon-only schemes
  Rule-only schemes
  Lab: Introduction to NLTK

Morphology
  Morphology: The study of the way words are built up from smaller meaning units.
  Morphemes: The smallest meaningful units in the grammar of a language.
  Contrasts:
    Derivational vs. Inflectional
    Regular vs. Irregular
    Concatenative vs. Templatic (root-and-pattern)
  A useful resource: Glossary of linguistic terms by Eugene Loos http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

Examples (English)
  “unladylike”: 3 morphemes, 4 syllables
    un- ‘not’
    lady ‘(well behaved) female adult human’
    -like ‘having the characteristics of’
    Can’t break any of these down further without distorting the meaning of the units
  “technique”: 1 morpheme, 2 syllables
  “dogs”: 2 morphemes, 1 syllable
    -s, a plural marker on nouns

Morpheme Definitions
  Root: The portion of the word that:
    is common to a set of derived or inflected forms, if any, when all affixes are removed
    is not further analyzable into meaningful elements
    carries the principal portion of meaning of the words
  Stem: The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
  Affix: A bound morpheme that is joined before, after, or within a root or stem.
  Clitic: A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
    Spanish: un beso, las aguas
    English: Hal’s (genitive marker)
    Proto-Indo-European: *kwe -> -que (Latin), te (Greek), and -ca (Sanskrit)

Inflectional vs. Derivational
  Word Classes
    Parts of speech: noun, verb, adjectives, etc.
    Word class dictates how a word combines with morphemes to form new words
  Inflection: Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
    Doesn’t change the word class
    Usually produces a predictable, non-idiosyncratic change of meaning.
  Derivation: The formation of a new word or inflectable stem from another word or stem.

Inflectional Morphology
  Adds: tense, number, person, mood, aspect
  Word class doesn’t change
  Word serves new grammatical role
  Examples
    come is inflected for person and number: The pizza guy comes at noon.
    las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)

Derivational Morphology
  Nominalization (formation of nouns from other parts of speech, primarily verbs in English): computerization, appointee, killer, fuzziness
  Formation of adjectives (primarily from nouns): computational, clueless, embraceable
  Difficult cases: building -> from which sense of “build”?

Concatenative Morphology
  Morpheme+Morpheme+Morpheme+…
  Stems: also called lemma, base form, root, lexeme
    hope+ing -> hoping
    hop+ing -> hopping
  Affixes
    Prefixes: anti-, dis- in antidisestablishmentarianism
    Suffixes: -ment, -arian, -ism in antidisestablishmentarianism
    Infixes: hingi (borrow) - humingi (borrower) in Tagalog
    Circumfixes: sagen (say) - gesagt (said) in German
  Agglutinative Languages
    uygarlaştıramadıklarımızdanmışsınızcasına (Turkish)
    uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
    Behaving as if you are among those whom we could not cause to become civilized

Templatic Morphology
  Roots and Patterns
  Example: Hebrew verbs
    Root: Consists of 3 consonants CCC; carries basic meaning
    Template: Gives the ordering of consonants and vowels; specifies semantic information about the verb (active, passive, middle voice)
  Example: lmd (to learn or study)
    CaCaC -> lamad (he studied)
    CiCeC -> limed (he taught)
    CuCaC -> lumad (he was taught)
  Psycholinguistic reality: format -> فرمت (farmat)
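
The root-and-pattern idea on this slide can be sketched in a few lines of Python. This is only an illustration under the assumption that a template is a string in which each "C" is a slot for the next root consonant; the function name apply_template is hypothetical and not part of the lecture.

```python
def apply_template(root: str, template: str) -> str:
    """Interleave root consonants into the 'C' slots of a template."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Every 'C' consumes the next root consonant; vowels are copied as-is.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


# The three patterns listed on the slide, applied to the root lmd.
for pattern in ("CaCaC", "CiCeC", "CuCaC"):
    print(pattern, "->", apply_template("lmd", pattern))
```

The root contributes the consonants and the template contributes the vowels and their ordering, so neither piece is a contiguous substring of the result. That is what distinguishes this from concatenative morphology.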

Nouns and Verbs (in English)
  Nouns have simple inflectional morphology
    cat
    cat+s, cat+’s
  Verbs have more complex morphology

Nouns and Verbs (in English)
  Nouns: have simple inflectional morphology
    Cat/Cats
    Mouse/Mice, Ox/Oxen, Goose/Geese
  Verbs: more complex morphology
    Walk/Walked
    Go/Went, Fly/Flew

Regular (English) Verbs
  Morphological Form Classes     Regularly Inflected Verbs
  Stem                           walk     merge    try     map
  -s form                        walks    merges   tries   maps
  -ing form                      walking  merging  trying  mapping
  Past form or -ed participle    walked   merged   tried   mapped

Irregular (English) Verbs
  Morphological Form Classes     Irregularly Inflected Verbs
  Stem                           eat      catch     cut
  -s form                        eats     catches   cuts
  -ing form                      eating   catching  cutting
  Past form                      ate      caught    cut
  -ed participle                 eaten    caught    cut

“To love” in Spanish

Syntax and Morphology
  Phrase-level agreement
    Subject-Verb: John studies hard (STUDY+3SG)
    Noun-Adjective: Las vacas hermosas
  Sub-word phrasal structures
    שבספרינו
    ש+ב+ספר+ים+נו
    That+in+book+PL+Poss:1PL
    Which are in our books

Phonology and Morphology
  Script Limitations
    Spoken English has 14 vowels: heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
    English Alphabet has 5
      Use vowel combinations: far fair fare
      Consonantal doubling (hopping vs. hoping)

Computational Morphology
  Approaches
    Lexicon only
    Rules only
    Lexicon and Rules
      Finite-state Automata
      Finite-state Transducers
  Systems
    WordNet’s morphy
    PCKimmo
      Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay
      Accurate but complex
      http://www.sil.org/pckimmo/
    Two-level morphology
      Commercial version available from InXight Corp.
  Background
    Chapter 3 of Jurafsky and Martin
    A short history of Two-Level Morphology
      http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

Computational Morphology
  WORD       STEM (+FEATURES)*
  cats       cat +N +PL
  cat        cat +N +SG
  cities     city +N +PL
  geese      goose +N +PL
  ducks      (duck +N +PL) or (duck +V +3SG)
  merging    merge +V +PRES-PART
  caught     (catch +V +PAST-PART) or (catch +V +PAST)

FSAs and the Lexicon
  First we’ll capture the morphotactics: the rules governing the ordering of affixes in a language.
  Then we’ll add in the actual words

Simple Rules

Adding the Words

Derivational Rules

Parsing/Generation vs. Recognition
  Recognition is usually not quite what we need.
    Usually if we find some string in the language we need to find the structure in it (parsing)
    Or we have some structure and we want to produce a surface form (production/generation)
  Example: from “cats” to “cat +N +PL” and back
  Morphological analysis is either
    an important stand-alone component of an application (spelling correction, information retrieval), or
    a link in a chain of processing

Finite State Transducers
  The simple story
    Add another tape
    Add extra symbols to the transitions
    On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.

FSTs

Transitions
  c:c a:a t:t +N:ε +PL:s
  c:c means read a c on one tape and write a c on the other
  +N:ε means read a +N symbol on one tape and write nothing on the other
  +PL:s means read +PL and write an s
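
As a rough illustration of these five transitions, the sketch below stores each arc as (source state, lexical symbol, surface symbol, target state) and runs the same table in both directions. The state numbers, the FINAL set, and the function names generate and analyze are assumptions made for this example; they do not come from the slides.

```python
# One arc per transition on the slide; "" stands for ε (write nothing).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+PL", "s", 5),
]
FINAL = {5}


def generate(lexical):
    """Lexical -> surface: read lexical symbols, write surface symbols."""
    state, out = 0, []
    for sym in lexical:
        for src, upper, lower, dst in ARCS:
            if src == state and upper == sym:
                out.append(lower)
                state = dst
                break
        else:
            return None  # no arc accepts this symbol
    return "".join(out) if state in FINAL else None


def analyze(surface):
    """Surface -> lexical: search arcs; ε arcs consume no surface text."""
    def walk(state, pos, out):
        if state in FINAL and pos == len(surface):
            return out
        for src, upper, lower, dst in ARCS:
            if src == state and surface.startswith(lower, pos):
                found = walk(dst, pos + len(lower), out + [upper])
                if found is not None:
                    return found
        return None
    return walk(0, 0, [])


print(generate(["c", "a", "t", "+N", "+PL"]))  # lexical tape to surface tape
print(analyze("cats"))                         # surface tape to lexical tape
```

The same arc table serves both directions; only the choice of which side is read and which is written changes.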

Ambiguity
  Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
    Didn’t matter which path was actually traversed
  In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result

Ambiguity
  What’s the right parse for unionizable?
    Union-ize-able
    Un-ion-ize-able
  Each represents a valid path through the derivational morphology machine.

Ambiguity
  There are a number of ways to deal with this problem
    Simply take the first output found
    Find all the possible outputs (all paths) and return them all (without choosing)
    Bias the search so that only one or a few likely paths are explored
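
The first two strategies can be sketched with a toy segmenter for “unionizable”. Everything here is a simplifying assumption for illustration: the three small morpheme tables, the prefix-stem-suffixes ordering, and in particular the entry that maps the surface form "iz" to the suffix -ize, which sidesteps the e-deletion spelling change rather than modelling it.

```python
PREFIXES = {"un"}
STEMS = {"union", "ion"}
# Surface form -> lexical suffix; "iz" is -ize with its final e deleted.
SUFFIXES = {"iz": "ize", "ize": "ize", "able": "able"}


def parses(word):
    """Return every prefix? + stem + suffix* segmentation of word."""
    results = []

    def suffix_paths(rest, acc):
        if not rest:
            results.append("-".join(acc))
            return
        for surface in sorted(SUFFIXES):
            if rest.startswith(surface):
                suffix_paths(rest[len(surface):], acc + [SUFFIXES[surface]])

    def stem_paths(rest, acc):
        for stem in sorted(STEMS):
            if rest.startswith(stem):
                suffix_paths(rest[len(stem):], acc + [stem])

    stem_paths(word, [])                  # paths with no prefix
    for prefix in sorted(PREFIXES):       # paths that start with a prefix
        if word.startswith(prefix):
            stem_paths(word[len(prefix):], [prefix])
    return results


all_parses = parses("unionizable")
print(all_parses)      # strategy 2: return every path
print(all_parses[0])   # strategy 1: take the first output found
```

The third strategy, biasing the search, would need some preference over paths (for example weights on arcs), which this sketch does not attempt.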

The Gory Details
  Of course, it’s not as easy as “cat +N +PL” <-> “cats”
  As we saw earlier there are geese, mice and oxen
  But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
    Cats vs Dogs
  Multi-tape machines

Multi-Level Tape Machines
  We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

Lexical to Intermediate Level

Intermediate to Surface
  The “add an e” rule, as in fox^s# <-> foxes
  A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply, meaning that they are written out unchanged to the output tape.
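
A regular-expression rewrite can stand in for this machine in a small sketch. It is not a transducer, and two things are assumed rather than taken from the slide: that ^ marks a morpheme boundary and # the end of the word, and that the rule fires after x, s, or z (the slide itself only shows the fox case).

```python
import re


def intermediate_to_surface(form: str) -> str:
    """Apply e-insertion, then erase the boundary symbols."""
    # x^s#, s^s#, z^s#  ->  xes#, ses#, zes#
    form = re.sub(r"([xsz])\^(?=s#)", r"\1e", form)
    # Inputs the rule does not apply to pass through unchanged,
    # apart from dropping the ^ and # markers.
    return form.replace("^", "").replace("#", "")


for item in ("fox^s#", "cat^s#", "fox#"):
    print(item, "->", intermediate_to_surface(item))
```

As on the slide, forms the rule does not match come out with only their boundary markers removed.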

Foxes

Foxes

FST Review
  FSTs allow us to take an input and deliver a structure based on it
  Or… take a structure and create a surface form
  Or take a structure and create another structure
  In many applications it’s convenient to decompose the problem into a set of cascaded transducers, where the output of one feeds into the input of the next.
  We’ll see this scheme again for deeper semantic processing.

Overall Plan
  This leaves out the fact that many such transducers are needed. The easiest way to create the morphological analyzer is to create the transducers separately and then compose them to obtain the grand result, as described in the text.

Lexicon-only Morphology
  The lexicon lists all surface level and lexical level pairs
  No rules …
  Analysis/Generation is easy
  Very large for English
  What about Arabic or Turkish or Chinese?

  Surface level    Lexical level
  acclaim          acclaim $N$
  acclaim          acclaim $V+0$
  acclaimed        acclaim $V+ed$
  acclaimed        acclaim $V+en$
  acclaiming       acclaim $V+ing$
  acclaims         acclaim $N+s$
  acclaims         acclaim $V+s$
  acclamation      acclamation $N$
  acclamations     acclamation $N+s$
  acclimate        acclimate $V+0$
  acclimated       acclimate $V+ed$
  acclimated       acclimate $V+en$
  acclimates       acclimate $V+s$
  acclimating      acclimate $V+ing$
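
With no rules, analysis and generation reduce to table lookup. The sketch below loads a few of the pairs listed on this slide into two dictionaries; the names ENTRIES, analyze, and generate are illustrative assumptions.

```python
from collections import defaultdict

# (surface form, lemma, tag) triples taken from the slide's listing.
ENTRIES = [
    ("acclaim", "acclaim", "$N$"),
    ("acclaim", "acclaim", "$V+0$"),
    ("acclaimed", "acclaim", "$V+ed$"),
    ("acclaimed", "acclaim", "$V+en$"),
    ("acclaiming", "acclaim", "$V+ing$"),
    ("acclaims", "acclaim", "$N+s$"),
    ("acclaims", "acclaim", "$V+s$"),
]

analyses = defaultdict(list)   # surface -> every (lemma, tag) reading
surfaces = {}                  # (lemma, tag) -> surface
for surface, lemma, tag in ENTRIES:
    analyses[surface].append((lemma, tag))
    surfaces[(lemma, tag)] = surface


def analyze(word):
    """Look up all lexical readings of a surface form."""
    return analyses.get(word, [])


def generate(lemma, tag):
    """Look up the surface form for a lexical reading."""
    return surfaces.get((lemma, tag))


print(analyze("acclaims"))
print(generate("acclaim", "$V+ed$"))
print(analyze("acclaimer"))    # unlisted words get no analysis at all
```

The last call shows the cost of the approach: any form missing from the table is simply unknown, which is why the table grows so large.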

Stemming vs Morphology Sometimes you just need to know the stem of a word and you don’t care about the structure. In fact you may not even care if you get the right stem, as long as you get a consistent string. This is stemming… it most often shows up in IR applications

Stemming in IR
  Run a stemmer on the documents to be indexed
  Run a stemmer on users’ queries
  Match
    This is basically a form of hashing
  Example: computerization
    -ization -> -ize: computerize
    -ize -> ε: computer
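
The index-then-match loop can be sketched as below. The stemmer here is a deliberately tiny stand-in that knows only the two rewrites from the example on this slide; toy_stem, build_index, search, and the sample documents are all hypothetical.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Apply the slide's two rewrites: -ization -> -ize, then -ize -> ε."""
    word = word.lower()
    if word.endswith("ization"):
        word = word[: -len("ization")] + "ize"
    if word.endswith("ize"):
        word = word[: -len("ize")]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way and intersect the posting sets."""
    postings = [index.get(toy_stem(tok), set()) for tok in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "computerization of records",
    2: "a new computer",
    3: "paper records",
}
index = build_index(docs)
print(search(index, "computerize"))
```

Because documents and queries pass through the same function, the stem acts as a hash key. It only has to be consistent, not linguistically correct.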

Porter Stemmer
  Step 4: Derivational Morphology I: Multiple Suffixes

  Condition  Rule                 Example
  (m>0)      ATIONAL -> ATE       relational -> relate
  (m>0)      TIONAL -> TION       conditional -> condition; rational -> rational
  (m>0)      ENCI -> ENCE         valenci -> valence
  (m>0)      ANCI -> ANCE         hesitanci -> hesitance
  (m>0)      IZER -> IZE          digitizer -> digitize
  (m>0)      ABLI -> ABLE         conformabli -> conformable
  (m>0)      ALLI -> AL           radicalli -> radical
  (m>0)      ENTLI -> ENT         differentli -> different
  (m>0)      ELI -> E             vileli -> vile
  (m>0)      OUSLI -> OUS         analogousli -> analogous
  (m>0)      IZATION -> IZE       vietnamization -> vietnamize
  (m>0)      ATION -> ATE         predication -> predicate
  (m>0)      ATOR -> ATE          operator -> operate
  (m>0)      ALISM -> AL          feudalism -> feudal
  (m>0)      IVENESS -> IVE       decisiveness -> decisive
  (m>0)      FULNESS -> FUL       hopefulness -> hopeful
  (m>0)      OUSNESS -> OUS       callousness -> callous
  (m>0)      ALITI -> AL          formaliti -> formal
  (m>0)      IVITI -> IVE         sensitiviti -> sensitive
  (m>0)      BILITI -> BLE        sensibiliti -> sensible
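
The (m>0) condition refers to Porter's measure: the number of vowel-consonant sequences in the stem that remains once the suffix is removed. The sketch below implements only this one step, with the rule table copied from the slide. The helper names are assumptions, and the first matching suffix in slide order is used, which for this table coincides with the longest match.

```python
# (suffix, replacement) pairs in the order they appear on the slide.
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
    ("anci", "ance"), ("izer", "ize"), ("abli", "able"), ("alli", "al"),
    ("entli", "ent"), ("eli", "e"), ("ousli", "ous"), ("ization", "ize"),
    ("ation", "ate"), ("ator", "ate"), ("alism", "al"), ("iveness", "ive"),
    ("fulness", "ful"), ("ousness", "ous"), ("aliti", "al"),
    ("iviti", "ive"), ("biliti", "ble"),
]


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: y counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    m, prev_was_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


def apply_step(word: str) -> str:
    """Rewrite the first matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", apply_step(w))
```

The word “rational” shows why the condition is there: stripping ATIONAL leaves the stem “r”, whose measure is 0, so the word is left alone.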

Porter
  No lexicon needed
  Basically a set of staged sets of rewrite rules that strip suffixes
  Handles both inflectional and derivational suffixes
  Doesn’t guarantee that the resulting stem is really a stem (see first bullet)
  Lack of guarantee doesn’t matter for IR

Porter Stemmer
  Errors of Omission           Errors of Commission
  European    Europe           organization    organ
  analysis    analyzes         doing           doe
  matrices    matrix           generalization  generic
  noise       noisy            numerical       numerous
  explain     explanation      university      universe

Soundex
  You work as the Villanova telephone operator. Someone calls looking for: Dr Papalarsky or Dr Matuzka ????????
  What do you type as your query string?

Soundex
  Keep the first letter
  Drop non-initial occurrences of vowels, h, w and y
  Replace the remaining letters with numbers according to group (e.g., b, f, p, and v -> 1)
  Replace strings of identical numbers with a single number (333 -> 3)
  Drop any numbers beyond a third one
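
The five steps above can be sketched directly. Two things go beyond the slide and are assumptions: the letter groups other than b, f, p, v -> 1 follow the usual Soundex table, and short codes are padded with zeros to three digits. The sketch follows the slide's simplified order of steps, so it can differ from archival Soundex implementations on some names.

```python
# Usual Soundex letter groups; only the first group appears on the slide.
GROUPS = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
          ("l", "4"), ("mn", "5"), ("r", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}


def soundex(name: str) -> str:
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    # Step 1: keep the first letter.
    first, rest = letters[0], letters[1:]
    # Step 2: drop non-initial vowels, h, w and y.
    rest = [c for c in rest if c not in "aeiouhwy"]
    # Step 3: replace the remaining letters with their group digit.
    digits = [CODES[c] for c in rest if c in CODES]
    # Step 4: collapse runs of identical digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: keep at most three digits (zero padding is an assumption).
    return first.upper() + "".join(collapsed[:3]).ljust(3, "0")


for name in ("Matuzka", "Matuszek", "Papalarsky", "Papalaskari"):
    print(name, soundex(name))
```

Under this sketch the misheard “Matuzka” and the actual “Matuszek” share the code M320, so a directory keyed by code would find the right entry. “Papalarsky” (P146) and “Papalaskari” (P142) do not collide, because the misplaced r falls inside the first three digits. The hash is forgiving, but not unlimited.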

Soundex
  Effect is to map (hash) all similar sounding transcriptions to the same code.
  Structure your directory so that it can be accessed by code as well as by correct spelling
  Used for census records, phone directories, author searches in libraries etc.