Writing Lexical Transducers Using xfst


Outline
- Overview of Transduction
- Review of xfst Rules
- Creating Two-Level Lexicons
- Putting it All Together

Theory-Neutral Morphological Analysis
[Figure: a Black-Box Morphological Analyzer sits between Words and Analyses.]

Finite-State Transducers (FSTs)
- An FST encodes a Regular Relation, i.e. a relation between two regular languages.
- FSTs can be used for morphological analysis, if
  - the set of surface words (strings) to be analyzed is a regular language, and
  - the “analyses” are also defined to be a regular language, i.e. just another set of strings.
[Figure: an FST sits between the Analysis String Language and the Surface String Language.]

What Do the Two Languages Look Like?
- In commercial natural-language processing, the surface language (e.g. French words written in the standard French orthography) is usually a given.
  - Periodic official spelling reforms may require fixes to your analyzer.
  - You may have to worry about national variations.
- In contrast, the analysis-language strings must be designed by the linguist.
- In the most common Xerox convention, each analysis string consists of the traditional dictionary-citation baseform followed by multicharacter-symbol “tags”:
    cantar+Verb+PInd+1P+Sg
    canto+Noun+Masc+Sg
    alto+Adj+Fem+Pl
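To make the baseform-plus-tags convention concrete, here is a minimal Python sketch (Python, not xfst) that splits the three analysis strings above into a baseform and a list of tags. The function name `split_analysis` is made up for illustration, and the sketch assumes that every tag begins with `+` and that the baseform itself contains no `+`.

```python
def split_analysis(analysis: str):
    """Split an analysis string into (baseform, [tags]).

    Assumption: tags are introduced by '+', and the baseform has no '+'.
    """
    baseform, *tags = analysis.split("+")
    # Restore the leading '+' so each tag looks like the multicharacter symbol.
    return baseform, ["+" + t for t in tags]


for a in ["cantar+Verb+PInd+1P+Sg", "canto+Noun+Masc+Sg", "alto+Adj+Fem+Pl"]:
    print(split_analysis(a))
```

In a real xfst network each tag such as +Verb is a single multicharacter symbol on an arc, so no splitting step of this kind is needed; the sketch only shows the shape of the strings.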

Non-Commercial (Lesser-Studied) Languages
1. All normal human beings speak a natural language, but there is nothing necessary or natural about reading and writing.
2. An orthography is a set of symbols, and conventions for using them, for “making language visible”.
3. Orthographies are technologies, like agriculture or metalworking.
4. Most languages have never been written, i.e. there is no standard orthography; or linguists and governments may have proposed several competing orthographies.
5. When working with lesser-studied languages, you may have to choose (or devise) a surface orthography for use in your morphological analyzer.

Two Main Tasks to Morphology
- Morphotactics
  - Describe the structure/grammar of words
  - Classic finite-state operations required
    - Concatenation of one morpheme to the next
    - Union of morphemes within classes
  - Some languages require other finite-state operations
    - Arabic stems require intersection
    - Malay requires special algorithms for reduplication
- Phonological/Orthographical Alternation
  - Union and concatenation by themselves tend to build abstract morphophonemic strings
  - Use finite-state rules to map from underlying (or “lexical”) morphophonemic strings to surface strings

Describing Morphotactics Using Regular Expressions
Some very simple morphotactics can be described using just union, concatenation and perhaps optionality.

Simple Esperanto Verbs

    Opt. Prefix   Req. Root   Opt. Aspect   Req. Verb Ending
    ne            don         ad            as
    mal           dir                       is
                  pens                      os
                  ir                        us
                  ...                       u
                                            i

Esperanto Verb Morphotactics
    xfst[]: read regex ( n e | m a l ) [ d o n | d i r | p e n s | i r ] ( a d ) [ a s | i s | o s | u s | u | i ] ;
- Each morpheme class is a unioned list of morphemes.
- Optional classes are surrounded with parentheses.
- Then morpheme classes are concatenated together, in the right order.

Esperanto Verb Morphotactics, Version 2 (xfst script)
    xfst[]: define Prefix n e | m a l ;
    xfst[]: define Root d o n | d i r | p e n s | i r ;
    xfst[]: define Aspect a d ;
    xfst[]: define VSuff a s | i s | o s | u s | u | i ;
    xfst[]: read regex (Prefix) Root (Aspect) VSuff ;
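For readers without xfst at hand, the same one-level morphotactics can be approximated with Python's `re` module. This is only a sketch: it recognizes the same set of surface strings as the script above, but it builds a backtracking regex rather than a finite-state network. The names `PREFIX`, `ROOT`, `ASPECT`, `VSUFF` and `is_verb` are chosen here to mirror the `define` commands.

```python
import re

# Each morpheme class is a union (|) of morphemes, as in the xfst defines.
PREFIX = r"(?:ne|mal)"
ROOT = r"(?:don|dir|pens|ir)"
ASPECT = r"(?:ad)"
VSUFF = r"(?:as|is|os|us|u|i)"

# Optional classes get '?', mirroring the xfst parentheses; the classes
# are then concatenated in order.
VERB = re.compile(rf"{PREFIX}?{ROOT}{ASPECT}?{VSUFF}")


def is_verb(word: str) -> bool:
    """Return True if the whole word is accepted by the verb pattern."""
    return VERB.fullmatch(word) is not None


print(is_verb("malpensadus"))  # prefix + root + aspect + ending
print(is_verb("don"))          # bare root, no required verb ending
```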

Morphophonological/Orthographical Alternations
- If simple concatenation doesn’t produce valid words, then we need to handle alternations.
- In today’s exercises, we will use Replace Rules, e.g. if Spanish pluralization is done by concatenating [ %+ s ] to a noun, we will need to fix cases like the following:
    pez+s
    .o.
    z %+ -> c e || _ s .#.
[Figure: the resulting FST maps pez+s to peces.]
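The effect of this single rule on one string can be imitated in Python as a rough sketch. The function name `z_rule` is hypothetical; the lookahead `(?=s\Z)` plays the role of the right context `_ s .#.`, so the `s` is checked but not consumed.

```python
import re


def z_rule(lexical: str) -> str:
    """Approximate  z %+ -> c e || _ s .#.  on a single string."""
    # 'z+' becomes 'ce' only when followed by a word-final 's'.
    return re.sub(r"z\+(?=s\Z)", "ce", lexical)


print(z_rule("pez+s"))   # peces
print(z_rule("casa+s"))  # casa+s (this rule alone does not touch the '+')
```

The second call shows that the boundary symbol survives wherever the rule does not apply, so a full grammar would still need some rule to deal with the remaining `+`.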

The Simplest Xerox Replace Rules
Schema:
    upper -> lower || left _ right
where upper, lower, left and right are regular expressions denoting regular languages (not relations!)
- Remember to use regular-expression syntax. Replace Rules are regular expressions!
- The overall Replace Rule denotes a relation. E.g.
    s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
- A context can be left empty, which is equivalent to a context of ?* E.g.
    s -> z || _ m
    p -> m || m _

The Simplest Replace Rules II
- Referring to the beginning or the end of a word:
    z -> s || _ .#.
    e -> i || _ (s) .#.
    e -> i || .#. p _ r
- A rule may be unconditioned, with no context at all:
    c h -> %$
    s s -> s
- Do not write “ss” or “ch” in regular expressions unless you want them to be treated as single symbols.
- Remember to “unspecialize” special symbols when you want a literal dollar sign, etc.
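The rules on the last two slides can be imitated one at a time in Python, as a sketch of what "apply down" would return for a single input. This is an approximation with ordinary regex substitution, not the xfst compiler: lookbehind and lookahead stand in for the left and right contexts, and `\A` and `\Z` stand in for `.#.`. All function names are made up for illustration.

```python
import re


def intervocalic_s(word: str) -> str:
    # s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
    # Lookarounds test the contexts on the input string without consuming them.
    return re.sub(r"(?<=[aeiou])s(?=[aeiou])", "z", word)


def final_z(word: str) -> str:
    # z -> s || _ .#.
    return re.sub(r"z\Z", "s", word)


def final_e(word: str) -> str:
    # e -> i || _ (s) .#.   (the optional s is part of the right context)
    return re.sub(r"e(?=s?\Z)", "i", word)


def initial_per(word: str) -> str:
    # e -> i || .#. p _ r   (the captured p is copied back unchanged)
    return re.sub(r"\A(p)e(?=r)", r"\1i", word)


def ch_to_dollar(word: str) -> str:
    # c h -> %$   (unconditioned; '$' here is a literal dollar sign)
    return word.replace("ch", "$")


print(intervocalic_s("rosas"))  # rozas
print(final_e("partes"))        # partis
print(initial_per("perro"))     # pirro
```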

Rule Abbreviations
- Instead of two rules:
    e -> i || _ (s) .#.
    o -> u || _ (s) .#.
  you can write:
    e -> i , o -> u || _ (s) .#.
  A comma separates the “left-hand sides” of the rule.
- Instead of two rules:
    e -> i || _ (s) .#.
    e -> i || .#. p _ r
  you can write:
    e -> i || _ (s) .#. , .#. p _ r
  A comma separates the “right-hand sides” of the rule.
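Both abbreviations have a natural counterpart in a regex-based imitation, sketched below in Python under the same caveats as before (string substitution, not a compiled transducer). Several left-hand sides become one pattern with a lookup table; several contexts become alternatives in one pattern. The names `RAISE`, `raise_final` and `raise_e` are hypothetical.

```python
import re

# e -> i , o -> u || _ (s) .#.
RAISE = {"e": "i", "o": "u"}


def raise_final(word: str) -> str:
    """Two replacements sharing one context, applied in a single pass."""
    return re.sub(r"[eo](?=s?\Z)", lambda m: RAISE[m.group()], word)


def raise_e(word: str) -> str:
    """One replacement with two alternative contexts:
    e -> i || _ (s) .#. , .#. p _ r
    """
    # Group 1 holds the word-initial p when the second alternative matched;
    # it is None when the first alternative matched.
    return re.sub(
        r"e(?=s?\Z)|\A(p)e(?=r)",
        lambda m: (m.group(1) or "") + "i",
        word,
    )


print(raise_final("libros"))  # librus
print(raise_e("perde"))       # pirdi
```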

Simple Replace-Rule Semantics
    upper -> lower || leftcontext _ rightcontext
- The overall rule denotes a finite-state relation (not an algorithm).
- The upper-side language of a -> relation is the universal language (?*).
- By default, all symbols on the upper side are mapped to the same symbol on the lower side.
- But IF a string on the upper side contains a designated “upper” string, in the designated context, then it is mapped to a string (or strings) on the lower side where the matched substring is replaced by the designated “lower” string.
- The context must “match” on the upper side string.
- A right-arrow -> rule has a downward orientation.

Understanding Replace Rules
    xfst> read regex a -> b ;
    xfst> apply down a
    xfst> apply down aaa
    xfst> apply down dog
    xfst> apply up b
    xfst> apply up bbb
    xfst> apply up dog
    xfst> read regex a:b ;
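To predict what these commands return, it helps to treat a -> b as a relation and enumerate both directions. The Python sketch below does this for the single rule a -> b only; the function names are chosen to echo the xfst commands but are otherwise hypothetical. Going down, every a must become b. Going up, each lower-side b could have come from either an upper a or an upper b, and a lower-side string containing a has no upper-side partner at all.

```python
from itertools import product


def apply_down(word: str) -> set:
    """Lower-side strings related to `word` by  a -> b."""
    # The replacement is obligatory, so there is exactly one result.
    return {word.replace("a", "b")}


def apply_up(word: str) -> set:
    """Upper-side strings related to `word` by  a -> b."""
    if "a" in word:
        # No upper string surfaces with an 'a': every a was rewritten to b.
        return set()
    # A lower b is ambiguous between upper a and upper b; other symbols
    # map to themselves.
    choices = [("a", "b") if ch == "b" else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


print(apply_down("aaa"))
print(sorted(apply_up("bbb")))
print(apply_up("dog"))
```

By contrast, the last command on the slide, a:b, denotes a relation containing just one pair of strings, so it would not accept dog in either direction.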

Review of Notations for Transducers
- The cross-product operator: [ u p p e r .x. l o w e r ]
- In general, for any two regular expressions A and B denoting languages: A .x. B
- For convenience, we can also write:
    a:b                equivalent to   [ a .x. b ]
    %+Tag:{ing}        equivalent to   [ %+Tag .x. i n g ]
    {upper}:{lower}    equivalent to   [ u p p e r .x. l o w e r ]

Esperanto Verb Morphotactics, Version 3; A Lexicon with Two Levels
    xfst[]: define Prefix Neg%+:{ne} | Op%+:{mal} ;
    xfst[]: define Root d o n | d i r | p e n s | i r ;
    xfst[]: define Aspect %+Cont:{ad} ;
    xfst[]: define VSuff %+Pres:{as} | %+Past:{is} | %+Fut:{os} | %+Cond:{us} | %+Subj:u | %+Inf:i ;
    xfst[]: read regex (Prefix) Root (Aspect) VSuff ;
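Because this lexicon is finite, the relation it denotes can be listed exhaustively, which gives a simple way to check expectations before running xfst. The Python sketch below builds every (upper, lower) pair by concatenating one choice from each class; an empty pair stands for skipping an optional class. It is a brute-force illustration, not how xfst represents the network, and the names are chosen only to mirror the script.

```python
from itertools import product

# (upper, lower) pairs per morpheme class; ("", "") models optionality.
PREFIX = [("", ""), ("Neg+", "ne"), ("Op+", "mal")]
ROOT = [(r, r) for r in ("don", "dir", "pens", "ir")]
ASPECT = [("", ""), ("+Cont", "ad")]
VSUFF = [("+Pres", "as"), ("+Past", "is"), ("+Fut", "os"),
         ("+Cond", "us"), ("+Subj", "u"), ("+Inf", "i")]

# Concatenate the classes in order, keeping upper and lower sides in step.
PAIRS = [
    ("".join(u for u, _ in combo), "".join(l for _, l in combo))
    for combo in product(PREFIX, ROOT, ASPECT, VSUFF)
]


def apply_up(surface: str) -> list:
    """Analyses (upper strings) paired with a surface word."""
    return sorted(u for u, l in PAIRS if l == surface)


def apply_down(analysis: str) -> list:
    """Surface words (lower strings) paired with an analysis."""
    return sorted(l for u, l in PAIRS if u == analysis)


print(apply_up("malpensadus"))         # ['Op+pens+Cont+Cond']
print(apply_down("Op+don+Cont+Past"))  # ['maldonadis']
```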

Esperanto Verb Transducer
[Figure: network diagram of the Esperanto verb transducer, with arcs labelled by the tags Neg+, Op+, Cont+, Pres+, Past+, Fut+, Cond+, Subj+, Inf+ and by the surface letters of the morphemes.]
Apply up: malpensadus

The Usual Strategy: Define a dictionary and alternation rules
[Figure: a Dictionary Transducer (Upper: Op+don+Cont+Past, Lower: maldonadis) is composed (.o.) with the Alternation Rules to yield the Final FST.]
As necessary, apply alternation rules via composition.
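The dictionary-then-rules strategy can be sketched in Python by treating the dictionary as a mapping from analyses to intermediate strings and each alternation rule as a function on strings. Everything specific here is an assumption for illustration: the tiny Spanish lexicon, its tags, and the cleanup rule that deletes leftover `+` boundaries are hypothetical and do not come from the slides; only the z rule is taken from the earlier Spanish example.

```python
import re

# Hypothetical dictionary: upper-side analysis -> lower-side intermediate form.
LEXICON = {
    "pez+Noun+Masc+Sg": "pez",
    "pez+Noun+Masc+Pl": "pez+s",
    "canto+Noun+Masc+Sg": "canto",
    "canto+Noun+Masc+Pl": "canto+s",
}


def z_rule(s: str) -> str:
    # z %+ -> c e || _ s .#.
    return re.sub(r"z\+(?=s\Z)", "ce", s)


def boundary_rule(s: str) -> str:
    # Hypothetical cleanup rule: delete any remaining '+' boundary.
    return s.replace("+", "")


def compose(lexicon: dict, *rules) -> dict:
    """Feed each lower-side string through the rules, in order.

    The result pairs each analysis directly with its surface form; the
    intermediate strings are no longer visible, as with .o. in xfst.
    """
    final = {}
    for upper, lower in lexicon.items():
        for rule in rules:
            lower = rule(lower)
        final[upper] = lower
    return final


FINAL = compose(LEXICON, z_rule, boundary_rule)
print(FINAL["pez+Noun+Masc+Pl"])    # peces
print(FINAL["canto+Noun+Masc+Pl"])  # cantos
```

The order of the rules matters in this sketch: the z rule has to see the `+` before the cleanup rule removes it.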

The Bambona Language
- Review the Xerox regular-expression syntax.
- Review the difference between
  - a regular expression file: contains a single regular expression, ends with a semicolon and newline
      xfst[]: read regex < myfile.regex
  - a script file: contains a list of commands to xfst (including perhaps “define” and “read regex” commands)
      xfst[]: source myfile.script
- Read the description carefully (not just the final test data).
- Describe the morphotactics using union and concatenation.
- Handle the variations using replace rules.