Finite-State Transducers: Applications in Natural Language Processing Heli Uibo Institute of Computer Science University of Tartu


Outline
- FSA and FST: operations, properties
- Natural languages vs. Chomsky's hierarchy
- FSTs: application areas in NLP
- Finite-state computational morphology
- Author's contribution: Estonian finite-state morphology
- Different morphology-based applications
- Conclusion

FSAs and FSTs

Operations on FSTs
- concatenation
- union
- iteration (Kleene star and plus)
- *complementation
- composition
- reverse, inverse
- *subtraction
- *intersection
- containment
- substitution
- cross-product
- projection
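
The meaning of several of these operations can be illustrated on a toy model in which a finite transducer is represented only by its behavior: a finite set of (upper, lower) string pairs. This is a sketch for intuition; real FST toolkits (xfst, OpenFst, etc.) operate on the automata themselves, and all names here are illustrative.

```python
# Toy model: a transducer's behavior as a finite set of (upper, lower) pairs.

def union(r1, r2):
    return r1 | r2

def concat(r1, r2):
    # concatenate independently on both sides
    return {(u1 + u2, l1 + l2) for (u1, l1) in r1 for (u2, l2) in r2}

def compose(r1, r2):
    # (u, l) is in r1 .o. r2 iff some intermediate string links them
    return {(u, l) for (u, m1) in r1 for (m2, l) in r2 if m1 == m2}

def invert(r):
    # swap the upper and lower sides
    return {(l, u) for (u, l) in r}

def upper_projection(r):
    # keep only the upper-side language
    return {u for (u, _) in r}

morphotactics = {("cat+Pl", "cat^s")}   # lexical form -> intermediate form
orthography   = {("cat^s", "cats")}     # intermediate form -> surface form
print(compose(morphotactics, orthography))   # {('cat+Pl', 'cats')}
```

Composition is what lets a cascade of such relations collapse into a single mapping, which is the key point of the slides that follow.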

Algorithmic properties of FSTs
- epsilon-free
- deterministic
- minimized

Natural languages vs. Chomsky's hierarchy
- "English is not a finite state language." (Chomsky, Syntactic Structures, 1957)
- Chomsky's hierarchy: finite-state ⊂ context-free ⊂ context-sensitive ⊂ Turing machine

Natural languages vs. Chomsky's hierarchy
- Chomsky's claim was about syntax (sentence structure).
- It is proved by (theoretically unbounded) recursive processes in syntax:
  - embedded subclauses: I saw a dog, who chased a cat, who ate a rat, who …
  - adding of free adjuncts: S → NP (AdvP)* VP (AdvP)*

Natural languages vs. Chomsky's hierarchy
- Attempts to use more powerful formalisms:
  - Syntax: phrase structure grammars (PSG) and unification grammars (HPSG, LFG)
  - Morphology: context-sensitive rewrite rules (not reversible)

Natural languages vs. Chomsky's hierarchy
- Generative phonology by Chomsky & Halle (1968) used context-sensitive rewrite rules, applied in a certain order to convert the abstract phonological representation into the surface representation (wordform) through intermediate representations.
- General form of rules: x → y / z _ w, where x, y, z, w are arbitrarily complex feature structures

Natural languages vs. Chomsky's hierarchy
- BUT: writing large-scale, practically usable context-sensitive grammars even for well-studied languages such as English turned out to be a very hard task.
- Finite-state devices have been "rediscovered" and widely used in language technology during the last two decades.

Natural languages vs. Chomsky's hierarchy
- Finite-state methods have been especially successful for describing morphology.
- The usability of FSAs and FSTs in computational morphology relies on the following results:
  - D. Johnson, 1972: phonological rewrite rules are not context-sensitive in nature; they can be represented as FSTs.
  - Schützenberger, 1961: if we apply two FSTs sequentially, there exists a single FST which is the composition of the two.

Natural languages vs. Chomsky's hierarchy
- Generalization to n FSTs: we can manage without intermediate representations; the deep representation is converted to the surface representation by a single FST!
- In 1980 the result was rediscovered by R. Kaplan and M. Kay (Xerox PARC).

Natural languages vs. Chomsky's hierarchy
[Diagram: Rule 1, Rule 2, …, Rule n are composed into "one big rule": a single FST mapping the deep representation to the surface representation.]

Applications of FSAs and FSTs in NLP
- Lexicon (word list) as FSA: compression of data!
- Bilingual dictionary as lexical transducer
- Morphological transducer (may be combined with rule transducer(s), e.g. Koskenniemi's two-level rules or Karttunen's replace rules, via composition of transducers). Each path from the initial state to a final state represents a mapping between a surface form and its lemma (lexical form).

Finite-state computational morphology
[Diagram: a morphological analyzer/generator maps between wordforms and morphological readings.]

Morphological analysis by lexical transducer
Morphological analysis = lookup. The paths in the lexical transducer are traversed until one finds a path where the concatenation of the lower labels of the arcs is equal to the given wordform. The output is the concatenation of the upper labels of the same path (lemma + grammatical information). If no path succeeds (the transducer rejects the wordform), then the wordform does not belong to the language described by the lexical transducer.

Morphological synthesis by lexical transducer
Morphological synthesis = lookdown. The paths in the lexical transducer are traversed until one finds a path where the concatenation of the upper labels of the arcs is equal to the given lemma + grammatical information. The output is the concatenation of the lower labels of the same path (a wordform). If no path succeeds (the transducer rejects the given lemma + grammatical information), then either the lexicon does not contain the lemma or the grammatical information is not correct.
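
The lookup and lookdown procedures described in the last two slides can be sketched on a toy arc-encoded transducer. The arc list, state numbering and tag names below are illustrative assumptions (a tiny fragment for Estonian tuba 'room'), not the actual lexicon; "0" is used as the epsilon symbol, as in lexc.

```python
# Arcs: (source_state, upper_label, lower_label, target_state)
ARCS = [
    (0, "tuba", "tuba", 1),    # strong-grade stem (nominative)
    (0, "tuba", "toa",  2),    # weak-grade stem (genitive)
    (1, "+Sg+Nom", "0", 3),    # ending arcs realize as nothing on the surface
    (2, "+Sg+Gen", "0", 3),
]
FINALS = {3}

def _walk(state, rest, out, pick_in, pick_out):
    """Follow all paths whose input side spells out `rest`; collect output sides."""
    results = []
    if state in FINALS and rest == "":
        results.append(out)
    for (src, up, low, dst) in ARCS:
        if src != state:
            continue
        inp = "" if pick_in(up, low) == "0" else pick_in(up, low)
        outp = "" if pick_out(up, low) == "0" else pick_out(up, low)
        if rest.startswith(inp):
            results += _walk(dst, rest[len(inp):], out + outp, pick_in, pick_out)
    return results

def lookup(wordform):    # analysis: match the lower side, emit the upper side
    return _walk(0, wordform, "", lambda u, l: l, lambda u, l: u)

def lookdown(lexical):   # synthesis: match the upper side, emit the lower side
    return _walk(0, lexical, "", lambda u, l: u, lambda u, l: l)

print(lookup("toa"))              # ['tuba+Sg+Gen']
print(lookdown("tuba+Sg+Nom"))    # ['tuba']
```

Note that lookup and lookdown are the same traversal with the two sides swapped, which is exactly why a lexical transducer serves as both analyzer and generator.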

Finite-state computational morphology
In morphology, one usually has to model two principally different processes:
1. Morphotactics (how to combine morphemes into wordforms)
   - prefixation, suffixation and compounding: concatenation
   - reduplication, infixation and interdigitation: non-concatenative processes

Finite-state computational morphology
2. Phonological/orthographical alternations
   - assimilation (hind : hinna)
   - insertion (jooksma : jooksev)
   - deletion (number : numbri)
   - gemination (tuba : tuppa)
All the listed morphological phenomena can be described by regular expressions.
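
As a small illustration of the last point, a rewrite rule of the general form x → y / z _ w can be encoded as a regular-expression substitution with a lookbehind (z) and a lookahead (w). The rule below and the '^' grade marker are hypothetical simplifications for the gemination example, not the actual Estonian rule set.

```python
import re

# Hypothetical rule: b -> pp / u _ a^   (geminate before a marked strong-grade ending)
gemination = re.compile(r"(?<=u)b(?=a\^)")

def apply_gemination(form):
    # apply the rewrite, then drop the diacritic marker
    return gemination.sub("pp", form).replace("^", "")

print(apply_gemination("tuba^"))   # tuppa   (cf. tuba : tuppa)
print(apply_gemination("tuba"))    # tuba    (no marker, rule does not fire)
```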

Estonian finite-state morphology
In Estonian, different grammatical wordforms are built using:
- stem flexion
  tuba - singular nominative (room)
  toa - singular genitive (of the room)
- suffixes (e.g. plural features and case endings)
  tubadest - plural elative (from the rooms)

Estonian finite-state morphology
- productive derivation, using suffixes
  kiire (quick) → kiiresti (quickly)
- compounding, using concatenation
  piiri + valve + väe + osa = piirivalveväeosa
  border(Gen) + guarding(Gen) + force(Gen) + part = a troop of border guards

Estonian finite-state morphology
- Two-level model by K. Koskenniemi
- LexiconFST .o. RuleFST
- Three types of two-level rules: =>, <= and <=> (formally regular expressions)
- e.g. the two-level rule a:b => L _ R is equivalent to the regular expression
  ~[ [ ~[ ?* L ] a:b ?* ] | [ ?* a:b ~[ R ?* ] ] ]
- Linguists are used to rules of the type a -> b || L _ R
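
The semantics of a context-restriction rule a:b => L _ R can be made concrete with a minimal checker over a "pair string", i.e. a sequence of (lexical, surface) symbol pairs. This sketch allows only single-pair contexts; real two-level rules allow arbitrary regular expressions as contexts, and the k:g rule below is a hypothetical example.

```python
def satisfies(pairs, target, left, right):
    """True iff every occurrence of `target` has `left` immediately before it
    and `right` immediately after it (the => reading: context restriction)."""
    for i, p in enumerate(pairs):
        if p == target:
            if i == 0 or pairs[i - 1] != left:
                return False
            if i == len(pairs) - 1 or pairs[i + 1] != right:
                return False
    return True

# Hypothetical rule k:g => V _ V, with V approximated by the identity pair ('a', 'a'):
ok  = [("l", "l"), ("a", "a"), ("k", "g"), ("a", "a")]
bad = [("k", "g"), ("a", "a")]
print(satisfies(ok,  ("k", "g"), ("a", "a"), ("a", "a")))   # True
print(satisfies(bad, ("k", "g"), ("a", "a"), ("a", "a")))   # False
```

A pair string with no occurrence of the target pair satisfies the rule vacuously, which matches the logic of the regular expression on the slide above.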

Estonian finite-state morphology
- Phenomena handled by lexicons:
  - noun declension
  - verb conjugation
  - comparison of adjectives
  - derivation
  - compounding
  - stem end alternations: ne-se, 0-da, 0-me etc.
  - choice of stem end vowel: a, e, i, u
- Appropriate suffixes are added to a stem according to its inflection type

Estonian finite-state morphology
- Handled by rules:
  - stem flexion: kägu : käo, hüpata : hüppan
  - phonotactics: lumi : lumd* → lund
  - morphophonological distribution: seis + da → seista
  - orthography: kirj* → kiri, kristall + ne → kristalne

Estonian finite-state morphology
Problem with derivation from verbs with weakening stems: every stem occurs twice at the upper side of the lexicon → waste of space!

LEXICON Verb
  lõika:lõiKa V2;
  ...

LEXICON Verb-Deriv
  lõiga VD0;
  ...

LEXICON VD0
  tud+A:tud #;
  tu+S:tu S1;
  nud+A:nud #;
  nu+S:nu S1;

Estonian finite-state morphology
- My own scientific contribution: a solution to the problem of weak-grade verb derivatives. The primary form, belonging to the level of morphological information, also has a lexical (or deep) representation. That is, two-levelness has been extended to the upper side of the lexical transducer (for verbs only).

LEXICON Verb
  lõiKa:lõiKa V2;
  ...

No stem doubling for productively derived forms!

Estonian finite-state morphology
Result: the morphological transducer for Estonian is composed as follows:
  ((LexiconFST)⁻¹ ∘ RulesFST₁)⁻¹ ∘ RulesFST,
where RulesFST₁ ⊂ RulesFST (the subset of the whole rule set containing only the grade alternation rules).
Operations used: composition, inversion
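
The invert-compose-invert pattern in this formula applies a rule relation to the upper side of the lexicon. It can be mimicked on toy relations (sets of (upper, lower) string pairs); the word pairs below are simplified illustrations of the lõikama example, not the actual transducer contents.

```python
def compose(r1, r2):
    # relation composition: link pairs through a shared middle string
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def invert(r):
    # swap the upper and lower sides
    return {(y, x) for (x, y) in r}

lexicon     = {("lõiKa+nud", "lõiganud")}   # deep grade symbol K still on the upper side
grade_rules = {("lõiKa+nud", "lõika+nud")}  # grade alternation, to be applied to the upper side

# invert -> rules now see the upper side as their input -> invert back
fixed = invert(compose(invert(lexicon), grade_rules))
print(fixed)   # {('lõika+nud', 'lõiganud')}
```

After this step the upper side carries ordinary surface-like analyses, and the whole result can be composed with the full rule set as usual.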

Estonian finite-state morphology
- The experimental two-level morphology for Estonian has been implemented using the Xerox finite-state tools lexc and twolc.
- 45 two-level rules
- The root lexicons include about 2000 word roots.
- Over 200 small lexicons describe the stem end alternations, conjugation, declension, derivation and compounding.

Estonian finite-state morphology
To-do list:
- avoid overgeneration of compound words
  solution: compose the transducer with other transducers which constrain the generation process
- guess the analysis of unknown words (words not in the lexicon)
  solution: use a regular expression in the lexicon which stands for any root, e.g. [Alpha*]
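
A guesser of this kind can be sketched as suffix stripping: a lazy "any root" pattern plays the role of [Alpha*], followed by known endings. The ending list below is a tiny hypothetical sample, not the real Estonian inventory.

```python
import re

# root: any non-empty run of (lowercase) letters, as short as possible;
# ending: one of a few sample case endings (hypothetical list)
GUESS = re.compile(r"^(?P<root>[a-zõäöüšž]+?)(?P<ending>dest|st|d)$")

def guess(wordform):
    """Return (root, ending) for an unknown word, or None if no ending matches."""
    m = GUESS.match(wordform)
    return (m.group("root"), m.group("ending")) if m else None

print(guess("tubadest"))   # ('tuba', 'dest')
```

Because the root is matched lazily, the longest listed ending compatible with the word is found first; a real guesser would also rank the candidate analyses.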

Language technological applications: requirements
Different approaches to building the morphological transducer may suit different language technological applications.
- Speller: is the given wordform correct? (= accepted by the morphological transducer) It is important to avoid overgeneration!
- Improved information retrieval: find all the documents where the given keyword occurs in arbitrary form and sort the documents by relevance. Weighted FSTs may be useful; morphological disambiguation is also recommended; overgeneration is not such a big problem.

Full NLP with FSTs?
[Diagram: Description of a natural language = one big transducer, chaining a Speech-Text FST, Morph-FST, Syntax-FST and Semantics-FST; run in one direction for analysis and in the other for generation.]