Words: Surface Variation and Automata CMSC 35100 Natural Language Processing April 3, 2003.


Roadmap
● The NLP Pipeline
● Words: Surface variation and automata
  – Motivation:
    ● Morphological and pronunciation variation
  – Mechanisms:
    ● Patterns: Regular expressions
    ● Finite State Automata and Regular Languages
      – Non-determinism, Transduction, and Weighting
    ● FSTs and Morphological/Phonological Rules

Real Language Understanding
● Requires more than just pattern matching
● But what?
● From 2001: A Space Odyssey:
  – Dave: Open the pod bay doors, HAL.
  – HAL: I'm sorry, Dave. I'm afraid I can't do that.

Language Processing Pipeline
● Speech input: Phonetic/Phonological Analysis; text input: OCR/Tokenization
● Morphological Analysis
● Syntactic Analysis
● Semantic Interpretation
● Discourse Processing

Phonetics and Phonology
● Convert an acoustic sequence to a word sequence
● Need to know:
  – Phonemes: sound inventory for a language
  – Vocabulary: word inventory – pronunciations
  – Pronunciation variation:
    ● Colloquial, fast, slow, accented, context

Morphology & Syntax
● Morphology: recognize and produce variations in word forms
  – E.g., inflectional morphology: singular vs. plural; verb person/tense
    ● Door + sg: door
    ● Door + plural: doors
    ● Be + 1st person, sg, present: am
● Syntax: order and group words together in a sentence
  – Open the pod bay doors
  – vs.
  – Pod the open doors bay

Semantics
● Understand word meanings and combine meanings in larger units
● Lexical semantics:
  – Bay: partially enclosed body of water; storage area
● Compositional semantics:
  – “pod bay doors”: doors allowing access to the bay where pods are kept

Discourse & Pragmatics
● Interpret utterances in context
  – Resolve references:
    ● “I'm afraid I can't do that”
      – “that” = “open the pod bay doors”
  – Speech act interpretation:
    ● “Open the pod bay doors” – a command

Surface Variation: Morphology
● Searching for documents about “televised sports”
● Many possible surface forms:
  – Televised, televise, television, ...
  – Sports, sport, sporting
● Convert to some common base form
  – Match all variations
  – Compact representation of language

Surface Variation: Morphology
● Inflectional morphology:
  – Verb: past, present; noun: singular, plural
  – e.g. televise: inf; televise + past -> televised
  – sport + sg: sport; sport + pl: sports
● Derivational morphology:
  – v -> n: televise -> television
● Lexicon: root form + morphological features
● Surface: apply rules for combination
● Identify patterns of transformation, roots, affixes, ...

Surface Variation: Pronunciation
● Regular English plural: +s
● English plural pronunciation:
  – cat+s -> cats, where s = [s], but
  – dog+s -> dogs, where s = [z], and
  – base+s -> bases, where s = [iz]
● Phonological rules govern morpheme combination
  – +s = [s], unless [voiced]+s = [z], [sibilant]+s = [iz]
● Common lexical representation
● Mechanism to convert to the appropriate surface form
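The plural pronunciation rule above can be sketched as a small Python function. The spelling-based endings used here to approximate sibilant and voiced finals are simplifying assumptions, not a real phonemic analysis.

```python
def plural_suffix_pronunciation(stem: str) -> str:
    """Return the pronunciation of the regular plural suffix: 'iz', 'z', or 's'."""
    # Sibilant-final stems (approximated by spelling) take the [iz] form.
    if stem.endswith(("s", "z", "x", "sh", "ch", "se", "ze", "ce", "ge")):
        return "iz"
    # Voiced-final stems (approximated: voiced consonant or vowel letter) take [z].
    if stem[-1] in "bdgvmnlrw" or stem[-1] in "aeiouy":
        return "z"
    # Voiceless-final stems take [s].
    return "s"
```

With this sketch, cat, dog, and base fall into the three classes from the slide.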

Representing Patterns
● Regular Expressions
  – Strings of 'letters' from an alphabet Sigma
  – Combined by concatenation, disjunction (union), and Kleene *
● Examples: a, aa, aabb, abab, baaa!, baaaaaa!
  – Concatenation: ab
  – Disjunction: a[abcd] -> aa, ab, ac, ad
    ● With precedence: gupp(y|ies) -> guppy, guppies
  – Kleene * (0 or more): baa*! -> ba!, baa!, baaaaa!
● Could implement ELIZA with RE + substitution
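The slide's patterns translate directly into Python's `re` module:

```python
import re

# The sheep-language pattern: b, a, zero or more further a's, "!".
sheep = re.compile(r"baa*!")
print(bool(sheep.fullmatch("ba!")), bool(sheep.fullmatch("baaaaa!")))  # True True
print(bool(sheep.fullmatch("b!")))                                     # False

# Precedence: the group limits the scope of the disjunction.
guppy = re.compile(r"gupp(y|ies)")
print([w for w in ("guppy", "guppies", "gupp") if guppy.fullmatch(w)])

# Character-class disjunction: a[abcd].
print([w for w in ("aa", "ab", "ac", "ad", "ae") if re.fullmatch(r"a[abcd]", w)])
```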

Expressions, Languages & Automata
● Regular expressions specify sets of strings (languages) that can be implemented with a finite-state automaton
● Regular Expressions ≡ Regular Languages ≡ Finite-State Automata

Finite-State Automata
● Formally:
  – Q: a finite set of N states q0, q1, ..., qN-1
    ● Designated start state q0; set of final states F
  – Sigma: alphabet of symbols
  – Delta(q, i): transition function specifying, in state q on input i, the next state(s)
● Accepts a string if in a final state at the end of the string
  – Otherwise, rejects

Finite-State Automata
● Regular expression: baaa*!
  – e.g. baaaa!
● Transitions: b: q0 -> q1; a: q1 -> q2; a: q2 -> q3; a: q3 -> q3 (self-loop); !: q3 -> q4 (final)
● Closed under concatenation, union (disjunction), and Kleene *
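The formal (Q, Sigma, Delta, q0, F) definition can be encoded directly as a Python dictionary; this is a minimal sketch of the baaa*! automaton, with state names chosen for illustration.

```python
# Transition function Delta as a dict: (state, symbol) -> next state.
DELTA = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",   # Kleene-star self-loop
    ("q3", "!"): "q4",
}
FINAL = {"q4"}           # the set F of final states

def accepts(s: str) -> bool:
    """Run the deterministic automaton; accept iff we end in a final state."""
    state = "q0"
    for ch in s:
        state = DELTA.get((state, ch))
        if state is None:        # no transition defined: reject
            return False
    return state in FINAL
```

For baaa*!, the shortest accepted string is "baa!"; "ba!" is rejected.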

Non-determinism & Search
● Non-determinism:
  – Same state, same input -> multiple next states
  – E.g.: Delta(q2, a) -> q2, q3
● To recognize a string, follow a state sequence
  – Question: which one? Answer: either!
● Provide a mechanism to back up to a choice point
  – Save on a stack: LIFO: depth-first search
  – Save in a queue: FIFO: breadth-first search
● NFSA equivalent to FSA
  – May require up to 2^n states, though
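The stack-vs-queue agenda idea above can be sketched with a non-deterministic automaton for baa*!, where the machine must guess in q1 whether to keep looping on "a" or move on. The state names and transition table are illustrative assumptions.

```python
from collections import deque

# Non-deterministic Delta: (state, symbol) -> *set* of next states.
NDELTA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q1", "q2"},   # the non-deterministic choice point
    ("q2", "!"): {"q3"},
}
NFINAL = {"q3"}

def nfsa_accepts(s: str, depth_first: bool = True) -> bool:
    """Search the space of state sequences with an agenda of choice points."""
    agenda = deque([("q0", s)])          # (state, remaining input)
    while agenda:
        # pop() = LIFO/stack/depth-first; popleft() = FIFO/queue/breadth-first
        state, rest = agenda.pop() if depth_first else agenda.popleft()
        if not rest:
            if state in NFINAL:
                return True
            continue
        for nxt in NDELTA.get((state, rest[0]), ()):
            agenda.append((nxt, rest[1:]))
    return False
```

Both search orders accept and reject the same strings; they differ only in which choice points are explored first.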

From Recognition to Transformation
● FSAs accept or reject strings as elements of a regular language: recognition
● Would like to extend to:
  – Parsing: take input and produce structure for it
  – Generation: take structure and produce an output form
  – E.g. morphological parsing: words -> morphemes
    ● Contrast with stemming
  – E.g. TTS: spelling/representation -> pronunciation

Morphology
● Study of minimal meaning units of language: morphemes
  – Stems: main units; affixes: additional units
    ● E.g. cats: stem = cat; affix = -s (plural)
  – Inflectional vs. derivational:
    ● Inflection: add a morpheme, same part of speech
      – E.g. plural -s of a noun; -ed: past tense of a verb
    ● Derivation: add a morpheme, change part of speech
      – E.g. verb + -ation -> noun: realize -> realization
● Huge cross-language variation:
  – English: relatively little: concatenative
  – Arabic: richer, templatic: root k-t-b + plural template -> kutub
  – Turkish: long affix strings, “agglutinative”

Morphology Issues
● Question 1: Which affixes go with which stems?
  – Tied to POS (e.g. possessive with nouns; tenses with verbs)
  – Regular vs. irregular cases
    ● Regular: the majority, productive – new words inherit the pattern
    ● Irregular: a small (closed) class – often very common words
● Question 2: How does the spelling change with the affix?
  – E.g. run + ing -> running; fury + s -> furies

Associating Stems and Affixes
● Lexicon
  – Simple idea: a list of words in a language
  – Too simple! Potentially HUGE: e.g. agglutinative languages
  – Better:
    ● List of stems, affixes, and a representation of morphotactics
    ● Split stems into equivalence classes w.r.t. morphology
      – E.g. regular nouns (reg-noun) vs. irregular-sg-noun, ...
● An FSA can accept the legal words of a language
  – Inputs: word classes, affixes

Automaton for English Nouns
● States q0 (start), q1, q2; arcs:
  – q0 –reg-noun→ q1; q1 –plural -s→ q2
  – q0 –irreg-sg-noun→ q2; q0 –irreg-pl-noun→ q2
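This noun automaton reads morpheme-class labels rather than letters. A minimal sketch, assuming q1 and q2 are both final (so that bare regular nouns like "cat" are accepted alongside "cats", "goose", and "geese"):

```python
# Morphotactic transitions over word classes, not characters.
NOUN_DELTA = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
NOUN_FINAL = {"q1", "q2"}    # assumption: both accept

def legal_noun(morphemes) -> bool:
    """Accept a sequence of morpheme-class labels iff it is a legal noun."""
    state = "q0"
    for m in morphemes:
        state = NOUN_DELTA.get((state, m))
        if state is None:
            return False
    return state in NOUN_FINAL
```

Note that the automaton correctly blocks *geeses: no plural-s arc leaves q2.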

Two-level Morphology
● Morphological parsing: two levels (Koskenniemi 1983)
  – Lexical level: concatenation of morphemes in the word
  – Surface level: spelling of the word's surface form
● Build rules mapping between surface and lexical levels
● Mechanism: finite-state transducer (FST)
  – Model: a two-tape automaton
  – Recognizes/generates pairs of strings

FSA -> FST
● Main change: the alphabet
  – Complex alphabet of pairs: input x output symbols
  – e.g. i:o, where i is in the input alphabet and o in the output alphabet
● Entails a change to the state transition function
  – Delta(q, i:o): now reads from the complex alphabet
● Closed under union, inversion, and composition
  – Inversion allows parser-as-generator
  – Composition allows operation in series

Simple FST for Plural Nouns
● Lexical-to-intermediate arcs (e = epsilon; ^ = morpheme boundary; # = word boundary):
  – reg-noun-stem, then +N:e, then +SG:# or +PL:^s#
  – irreg-noun-sg-form, then +N:e, then +SG:#
  – irreg-noun-pl-form, then +N:e, then +PL:#
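The regular-noun path of this transducer can be sketched as a simple mapping from lexical-tape feature symbols to intermediate-tape strings (the empty string plays the role of epsilon). This covers only the reg-noun-stem path; irregular forms would need their own +PL:# arc.

```python
# Output string for each lexical feature symbol on the regular-noun path.
FEATURE_OUTPUT = {
    "+N": "",        # +N:epsilon
    "+SG": "#",      # +SG:#
    "+PL": "^s#",    # +PL:^s#  (regular nouns only)
}

def transduce(stem: str, features) -> str:
    """Map stem + lexical features to the intermediate tape."""
    out = stem       # stems pass through unchanged (x:x pairs)
    for f in features:
        out += FEATURE_OUTPUT[f]
    return out
```

For example, fox +N +PL maps to the intermediate form fox^s#, which the spelling rules then turn into the surface form.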

Rules and Spelling Change
● Example: E-insertion in plurals
  – After x, z, s, ...: fox + -s -> foxes
● View as a two-step process
  – Lexical -> Intermediate (create morphemes)
  – Intermediate -> Surface (fix spelling)
● Rules (a la Chomsky & Halle 1968):
  – epsilon -> e / {x, z, s} ^ __ s #
    ● Rewrite epsilon (the empty string) as e when it occurs after x, s, or z at the end of one morpheme and the next morpheme is -s
    ● ^: morpheme boundary; #: word boundary
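The E-insertion rule and the boundary-symbol cleanup can both be sketched as string rewriting over the intermediate tape; a real two-level system would compile the rule into an FST, but the effect is the same on these examples.

```python
import re

def e_insertion(intermediate: str) -> str:
    """epsilon -> e / {x,z,s} ^ __ s#: insert e between a sibilant-final
    morpheme and a following -s morpheme."""
    return re.sub(r"([xzs])\^s#", r"\1^es#", intermediate)

def surface(intermediate: str) -> str:
    """Apply the spelling rule, then erase boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")
```

So the intermediate form fox^s# surfaces as foxes, while cat^s# surfaces as cats with no insertion.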

E-insertion FST
● Six-state transducer (q0–q5) implementing the E-insertion rule; arc labels include z, s, x, ^:e, e:e, s, #, and other [transition diagram not reproduced]

Implementing Parsing/Generation
● Two-layer cascade of transducers (in series)
  – Lexical -> Intermediate; Intermediate -> Surface
  – Intermediate -> Surface: all the different spelling rules in parallel
● Bidirectional, but parsing is more complex
  – Ambiguous! E.g., is "fox" a noun or a verb?

Shallow Morphological Analysis
● Motivation: information retrieval
  – Just enable matching – without full analysis
● Stemming:
  – Affix removal, often without a lexicon
  – Just returns stems – not structure
● Classic example: the Porter stemmer
  – A rule-based cascade of repeated suffix removal
  – Pattern-based
  – Produces non-words, errors, ...
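A toy, Porter-style suffix-stripping pass illustrates the idea: ordered rewrite rules, no lexicon, and non-word outputs as a side effect. This is not the real Porter stemmer, just a minimal sketch of rule-based affix removal; the rule list and the minimum-stem-length guard are illustrative assumptions.

```python
# Ordered suffix-rewrite rules: first match wins, as in a rule cascade.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word: str) -> str:
    """Strip one suffix by the first applicable rule; leave a stem of >= 2 letters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word
```

Note the characteristic behavior: stem("cats") gives "cat", but stem("ponies") gives the non-word "poni", exactly the kind of error the slide mentions.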

Automatic Acquisition of Morphology
● “Statistical stemming” (Cabezas, Levow, Oard)
  – Identify high-frequency short affix strings for removal
  – Fairly effective for Germanic and Romance languages
● Light stemming (Arabic)
  – Frequency-based identification of templates & affixes
● Minimum description length approach
  – (Brent and Cartwright 1996; de Marcken 1996; Goldsmith 2000)
  – Minimize cost of model + cost of lexicon given the model