Beesley 2001 The lexc Language Prepare to partition your brain to learn a whole new formalism.


The lexc language
“lexc” stands for “LEXicon Compiler”.
lexc is a high-level, declarative programming language.
lexc is different from regular expressions and from xfst:
    the syntax is different
    the assumptions are different
    the special characters are different
    the interfaces are different
BUT the lexc compiler produces STANDARD Xerox networks:
    these networks are fully compatible with networks from xfst
    you can sometimes choose to use lexc or xfst for building a network
This is all fertile ground for confusion!

Why a Separate lexc Language?
Lexc is intended for use by lexicographers.
    Regular expressions in xfst are often hard to read, especially big ones.
    Typing spaces between all the letters to be concatenated in xfst, e.g. e l e p h a n t, is a nuisance, especially if you need to type 40,000 words.
    You can also write {elephant} in xfst regular expressions, but that’s a nuisance too.
Lexc is more efficient for compiling large natural-language lexicons (it optimizes the union operation).
Lexc has better error messages.
But remember:
    lexc is just another formalism for defining finite-state languages and relations
    you can (and will) use lexc and xfst together in building significant applications

The lexc Source File: Multichar_Symbols
The lexc compiler and the xfst regular-expression compiler make completely opposite assumptions about multicharacter symbols.
In xfst regular expressions, the default is to treat a string of symbols written together, e.g. %+Noun or cat, as a single symbol. Concatenation of separate symbols is indicated by manually separating the symbols with white space, e.g. [ c a t ], or by using the curly-brace notation, e.g. {cat}.
In lexc, in contrast, the default is to treat a string, e.g. cat, as a concatenation of three symbols. Any multicharacter symbols must be explicitly declared at the top of the source file.
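The lexc default described above can be illustrated with a small Python sketch. This is not the lexc compiler: the function name `tokenize` is hypothetical, and the longest-match strategy is an assumption made for the illustration. It only shows how a declaration changes the way a string is divided into symbols.

```python
def tokenize(entry, multichar_symbols):
    """Split a lexc-style string into symbols, preferring declared multichar symbols."""
    # Assumption: try longer declared symbols first, so +Noun wins over a bare "+".
    declared = sorted(multichar_symbols, key=len, reverse=True)
    symbols = []
    i = 0
    while i < len(entry):
        for sym in declared:
            if entry.startswith(sym, i):
                symbols.append(sym)
                i += len(sym)
                break
        else:
            # No declared symbol starts here: a single character is one symbol.
            symbols.append(entry[i])
            i += 1
    return symbols


print(tokenize("cat+Noun+Pl", ["+Noun", "+Pl"]))
print(tokenize("cat+Noun+Pl", []))
```

With the declaration, the string is read as five symbols (c, a, t, +Noun, +Pl); without it, every character, including each letter of the tags, becomes a separate symbol.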

Multichar_Symbols declaration
    Multichar_Symbols +Noun +Verb +Adj +Adv +Sg +Pl
                      +1P +2P +3P ^FEAT1 ^FEAT2
The Multichar_Symbols statement is formally optional and is placed at the top of your lexc source file.
You can declare as many multicharacter symbols as you find necessary or useful.
The compiler uses this declaration to separate the strings of your lexc program into symbols.
You are strongly encouraged to include a non-alphabetic character like the plus sign or the circumflex to help the multicharacter symbol stand out visually.

The Body of your lexc Program
The body of a lexc program is composed of LEXICONs.
There should be one LEXICON named Root. It corresponds to the Start State in the resulting network.
If you don’t define a LEXICON Root, lexc will try to use the first LEXICON in the file as the Start State.
    LEXICON Root
    dog   N ;
    cat   N ;
    bird  N ;

Entries in a LEXICON
Each defined LEXICON must have at least one entry.
An entry consists of two parts and is terminated with a semicolon:
    data continuation-class ;
The data part has to fit one of four formats:
    string                  e.g. dog
    upper:lower             e.g. swim:swam
    <regular expression>    a regular expression in angle brackets
    empty                   no data at all, just the continuation class

upper:lower Entries
The upper:lower entries are the simplest way to specify portions of the network where the upper side and the lower side differ. They are especially useful for irregularities and suppletions.
    Multichar_Symbols +Verb +Past +Noun +Sg +Pl
    LEXICON Root
    swim+Verb+Past:swam    # ;
    go+Verb+Past:went      # ;
    child+Noun+Pl:children # ;
    ox+Noun+Pl:oxen        # ;

upper:lower Entries
In upper:lower entries, you can overtly indicate where the epsilons should go.
    Multichar_Symbols +Verb +Past +Noun +Sg +Pl +Nom
    LEXICON Root
    poder+Verb:pod0r  FutCond ;
Danger: the lexc upper:lower notation is not quite the same as the regular-expression colon notation.

Regular Expressions in lexc
Any data written as a regular expression must be surrounded with angle brackets, in an entry of the form
    < regular expression >  CC ;
where CC is the continuation class.
Inside angle brackets, you revert to all the assumptions suitable for xfst regular expressions, including the treatment of multicharacter symbols vs. concatenation of symbols.
This is fertile ground for confusion and errors.

Continuation Classes
The Continuation Class is just the name of a defined LEXICON, or #, indicating end-of-word (a final state).
    Multichar_Symbols +Noun +Sg +Pl
    LEXICON Root
    dog  N ;
    cat  N ;
    LEXICON N
    +Noun+Sg:0  # ;
    +Noun+Pl:s  # ;
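To make the role of continuation classes concrete, here is a minimal Python sketch that walks the noun example above and lists the upper:lower pairs it defines. The dictionary layout and the function name `expand` are assumptions made for this illustration, not part of lexc, and the sketch only terminates for lexicons without loops.

```python
# Each LEXICON maps to a list of (upper, lower, continuation_class) entries.
# The lexc epsilon 0 is represented here as the empty string.
LEXICONS = {
    "Root": [("dog", "dog", "N"), ("cat", "cat", "N")],
    "N": [("+Noun+Sg", "", "#"), ("+Noun+Pl", "s", "#")],
}


def expand(lexicons, name="Root"):
    """Yield every (upper, lower) string pair reachable from the named LEXICON."""
    if name == "#":
        # End-of-word: nothing more is added on either side.
        yield ("", "")
        return
    for upper, lower, continuation in lexicons[name]:
        for rest_upper, rest_lower in expand(lexicons, continuation):
            yield (upper + rest_upper, lower + rest_lower)


for upper, lower in expand(LEXICONS):
    print(f"{upper}:{lower}")
```

Each entry contributes its own data and then hands over to whatever its continuation class defines, which is why the two stems and two endings yield four pairs.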

Thinking About lexc LEXICONs
A LEXICON should hold a coherent class of morphemes.
The entries in a lexc LEXICON are unioned together by the compiler; the order of the entries in a LEXICON is not significant.
Think of LEXICONs as potential “targets”:
    entries “point at” a LEXICON via the continuation class
    but each entry in a LEXICON could itself point to a different continuation class
During development, you may have to subdivide lexicons.
Avoid having copies of the same material (if possible): you may change an entry in one place and forget to change the copy.

Formally Speaking
Lexc syntax is a kind of right-recursive phrase-structure grammar.
Phrase-structure grammars can in general describe languages beyond finite-state power, including languages with balanced parentheses.
But with the right-recursive limitation, a phrase-structure grammar can define only finite-state languages.
Lexc can describe only finite-state languages.
Lexc descriptions compile into finite-state networks.

Lexc Idiom: Optional Morphemes via By-Pass
    LEXICON Vroot
    kant  V ;
    dir   V ;
    don   V ;
    pens  V ;
    LEXICON V
    AdLex ;
    Vend ;
    LEXICON AdLex
    ad  Vend ;
    LEXICON Vend
    as  # ;
    is  # ;
    os  # ;
    us  # ;
    u   # ;
    i   # ;

Lexc Idiom: Optional Morphemes via “Escape” Entries
    LEXICON Vroot
    kant  AdLex ;
    dir   AdLex ;
    don   AdLex ;
    pens  AdLex ;
    LEXICON AdLex
    ad  Vend ;
    Vend ;        ! escape
    LEXICON Vend
    as  # ;
    is  # ;
    os  # ;
    us  # ;
    u   # ;
    i   # ;
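The by-pass and “escape” idioms are two ways of writing the same language. The Python sketch below checks this for the Esperanto verb fragment; the dictionaries transcribe the two lexc programs, with empty data written as the empty string, and the helper name `words` is hypothetical.

```python
def words(lexicons, name):
    """Return the set of strings reachable from the named LEXICON (no loops assumed)."""
    if name == "#":
        return {""}
    result = set()
    for data, continuation in lexicons[name]:
        for rest in words(lexicons, continuation):
            result.add(data + rest)
    return result


STEMS = ("kant", "dir", "don", "pens")
ENDINGS = [(ending, "#") for ending in ("as", "is", "os", "us", "u", "i")]

# By-pass: an intermediate LEXICON V offers both the long and the short route.
BYPASS = {
    "Vroot": [(stem, "V") for stem in STEMS],
    "V": [("", "AdLex"), ("", "Vend")],
    "AdLex": [("ad", "Vend")],
    "Vend": ENDINGS,
}

# Escape: the optional LEXICON itself contains an empty entry that skips it.
ESCAPE = {
    "Vroot": [(stem, "AdLex") for stem in STEMS],
    "AdLex": [("ad", "Vend"), ("", "Vend")],
    "Vend": ENDINGS,
}

bypass_words = words(BYPASS, "Vroot")
escape_words = words(ESCAPE, "Vroot")
print(bypass_words == escape_words)
print(sorted(w for w in bypass_words if w.startswith("kant")))
```

Both versions contain forms such as kantas and kantadas; the choice between the idioms is a matter of how you prefer to organize the source file.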

Lexc Idiom: Loops
    LEXICON Nroot
    kat      N ;
    hund     N ;
    elefant  N ;
    LEXICON N
    eg  N ;       ! loop
    et  N ;       ! loop
    in  N ;       ! loop
    Nend ;
    LEXICON Nend
    o  Plur ;
    LEXICON Plur
    j  Case ;     ! Opt. plural ending
    Case ;
    LEXICON Case
    n  # ;        ! Opt. case ending
    # ;
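Because the entries eg, et and in point back at their own LEXICON, the language defined by the loop example is infinite and cannot simply be listed. A recognizer is a more natural illustration. The following Python sketch is an assumption-laden stand-in for the compiled network: the function name `accepts` and the dictionary encoding are invented for this example.

```python
LOOPS = {
    "Nroot": [("kat", "N"), ("hund", "N"), ("elefant", "N")],
    "N": [("eg", "N"), ("et", "N"), ("in", "N"), ("", "Nend")],
    "Nend": [("o", "Plur")],
    "Plur": [("j", "Case"), ("", "Case")],
    "Case": [("n", "#"), ("", "#")],
}


def accepts(lexicons, word, name="Nroot", pos=0, seen=None):
    """Return True if word can be spelled out by following continuation classes."""
    seen = set() if seen is None else seen
    if (name, pos) in seen:
        # Already explored from this point without success; also guards against
        # endless recursion through entries with empty data.
        return False
    seen.add((name, pos))
    if name == "#":
        return pos == len(word)
    for data, continuation in lexicons[name]:
        if word.startswith(data, pos):
            if accepts(lexicons, word, continuation, pos + len(data), seen):
                return True
    return False


for candidate in ("kato", "hundinetojn", "elefantego", "kat", "katoo"):
    print(candidate, accepts(LOOPS, candidate))
```

The suffixes can repeat in any order any number of times before the obligatory o, which is exactly what the looping continuation class expresses.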

Stem compounding (loops) in lexc
    LEXICON Nroot
    kat      N ;
    hund     N ;
    elefant  N ;
    LEXICON N
    Nroot ;
    Nend ;
    LEXICON Nend
    o  Plur ;
    LEXICON Plur
    j  Case ;     ! Opt. plural ending
    Case ;
    LEXICON Case
    n  # ;
    # ;

Special Characters in lexc
Overall, there are far fewer special characters in lexc than in regular expressions. In lexc, the following are special:
    Special   Meaning                                 Literalized
    :         used in upper:lower notation            %:
    ;         terminates an entry                     %;
    <         begins a regular expression             %<
    >         ends a regular expression               %>
    0         denotes the empty string (epsilon)      %0
    !         introduces a comment line               %!
    #         continuation class for end-of-word      %#
    %         literalizing prefix                     %%
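When a word list is converted into lexc entries by a script, the special characters listed on this slide have to be literalized. A minimal Python sketch follows, assuming that only those eight characters need the % prefix; the function name `literalize` is hypothetical.

```python
# The eight characters the slide lists as special in lexc.
SPECIAL = set(":;<>0!#%")


def literalize(text):
    """Prefix each lexc special character with % so that it is read literally."""
    return "".join("%" + ch if ch in SPECIAL else ch for ch in text)


print(literalize("10:30"))
print(literalize("50%"))
print(literalize("cat"))
```

Note that the digit 0 is among the special characters, so any entry containing a literal zero needs this treatment.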

Lexc source files
Lexc source files are ASCII, typically edited with xemacs.
Lexc programs for natural language can get very large:
    typically 8000 or more entries for verbs
    tens of thousands of entries for nouns and proper nouns

The lexc interface
Invoke the lexc interface by simply entering ‘lexc’ at the UNIX prompt.
    unixprompt% lexc
You communicate with the interface using lexc commands.
Type ‘?’ to see all the possible commands.
Invoke ‘help commandname’ to see some terse online documentation.
Enter ‘quit’ to leave lexc and return to the operating system.
    lexc: quit

The Three lexc Registers
To understand lexc commands, you must understand that they refer to and operate on networks held in three registers:
    SOURCE   Typically used to store a lexicon FST.
    RULES    Typically used to store a rule FST or FSTs.
    RESULT   Typically used to store the result of composing the rule FST(s) under the source FST.

Basic lexc commands
    compile-source filename   compiles the lexc source code in filename and stores the resulting network in SOURCE
    read-rules filename       reads the binary file filename and stores the network(s) in RULES; the binary file may be from xfst or twolc
    compose-result            composes the RULES FST(s) under the SOURCE FST and stores the resulting FST in RESULT
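What “composing the rules under the source” means can be shown on finite relations written as sets of string pairs. This Python sketch is only an analogy for what compose-result does with networks: the function name `compose`, the register-like variable names, and the sample strings with a ^ boundary marker are all hypothetical.

```python
def compose(upper_relation, lower_relation):
    """Relate a to c whenever upper_relation has (a, b) and lower_relation has (b, c)."""
    return {
        (a, c)
        for (a, b) in upper_relation
        for (b2, c) in lower_relation
        if b == b2
    }


# Hypothetical lexicon output: lexical strings paired with abstract surfacy forms.
source = {("dog+Noun+Pl", "dog^s"), ("fox+Noun+Pl", "fox^s")}
# Hypothetical rule relation: surfacy forms paired with real surface strings.
rules = {("dog^s", "dogs"), ("fox^s", "foxes")}

result = compose(source, rules)
for upper, lower in sorted(result):
    print(f"{upper}:{lower}")
```

The intermediate strings disappear from the result: only the upper side of the source and the lower side of the rules remain, which is why the composed network maps directly between lexical and surface forms.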

Some Other lexc Commands
    read-source            load a pre-compiled binary network into the SOURCE register
    save-source filename   store network in SOURCE to binary file
    save-result filename   store network in RESULT to binary file
    lookup word            (equivalent to the xfst ‘apply up’)
    lookdown word          (equivalent to the xfst ‘apply down’)
    result-to-source       move the network from the RESULT register to SOURCE
Enter ‘?’ to see all the lexc interface commands.

Using lexc and xfst together
Write a lexc source file (e.g. mysrc.lex) using xemacs or a similar editor.
Write suitable alternation rules (in xfst or even twolc). Compile them and save the network(s) to file, e.g. to myrul.fst.
Then, from the lexc interface:
    lexc: compile-source mysrc.lex
    lexc: read-rules myrul.fst
    lexc: compose-result
    lexc: save-result mylang.fst

Using lexc and xfst together
Lexc lexicons build “words” (strings) using union and concatenation:
    entries within a LEXICON are unioned (the order of entries is not significant)
    the LEXICON Root corresponds to the start state
    the special # continuation class corresponds to final states
    other continuation classes translate into concatenation
By Xerox convention, upper-side strings consist of a baseform and “tags”.
By convention, a surface (or more surfacy) form appears on the lower side.
The surfacy forms generated by lexc may still be rather abstract, hyper-regular, or “morphophonemic”. They may sometimes contain multicharacter symbols.
Replace Rules (perhaps a whole cascade of them) map from the surfacy strings produced by lexc to real surface strings; rules are applied using composition.
Composition can also be used to “filter” out various kinds of overgeneration.

A Typical Finite-State System
    Filters (xfst)
        .o.
    Core Lexicon (lexc)
        .o.
    Orthographical or Phonological Alternation Rules (xfst)

A System may be a Union of Subsystems
    Nouns (lexc)    Verbs (lexc)    Adjs (lexc)    Numbers (lexc)
        .o.             .o.
    Noun Rules      Verb Rules
Then, in xfst:
    define final NounFST | VerbFST | AdjFST | NumberFST ;

Review: Outputs and Inputs
Unix pipe:
    cat wordlist.in | sort | uniq -c | sort -rnb > myfile.out
    The output of one routine is the input to the next.
    NOT reversible.
Cascade of Replace Rules:
    read regex [ N -> m || _ p ] .o. [ p -> m || m _ ] ;
    Reversible/bidirectional relation.
    apply down: the output of the first rule is the input to the second.
    The lower side of the top rule is the upper side of the bottom rule.
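The downward direction of the two-rule cascade can be imitated with ordinary string substitution. The Python sketch below uses the standard re module and is, like the Unix pipe, one-directional; it does not reproduce the reversibility of the composed transducer. The function names and the test word kaNpat are chosen for this illustration.

```python
import re


def rule_n_to_m(s):
    """Imitate  N -> m || _ p : rewrite N as m when a p follows."""
    return re.sub(r"N(?=p)", "m", s)


def rule_p_to_m(s):
    """Imitate  p -> m || m _ : rewrite p as m when an m precedes."""
    return re.sub(r"(?<=m)p", "m", s)


def apply_down(s):
    """Feed the output of the first rule to the second, as in the cascade."""
    return rule_p_to_m(rule_n_to_m(s))


print(rule_n_to_m("kaNpat"))   # intermediate string: kampat
print(apply_down("kaNpat"))    # final string: kammat
print(apply_down("kampat"))    # also kammat
```

Both kaNpat and kampat come out as kammat, so the relation maps two upper strings to one lower string; a real transducer applied upward would return both, which the substitution functions above cannot do.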

Review: Up and Down
    In xfst (regular expressions)    In lexc
    a:b                              swim+Verb+Past:swam
    %+Pl:s                           upper:lower
    [ a .x. b ]
    a -> b
    a <- b

Review: Xerox Conventions
    Upper (lexical) language:   baseform+Tag+Tag+Tag
        (mapping via rules)
    Lower (surface) language:   orthographical-string
The surface language is usually determined for you by the standard orthography.
The lexical side language, and all intermediate languages, have to be defined by the linguist writing the grammar.

Review: Up and Down with Composition
    A rule that refers to tags on its lower side
        .o.
    An FST defined using lexc (upper side: baseform+Tag+Tag+Tag, lower side: surfacy-form)
        .o.
    A rule that refers to a surfacy form on its upper side

Review: Think in terms of Languages and Relations
    Lexical Language
        Core Lexicon FST
    Surfacy Language
        Rule 1
    Intermediate Language
        Rule 2
    Intermediate Language
        Rule n
    Final Surface Language

Other Important Topics in The Book
Composition is Our Friend
    Modify a common “core” network to handle
        – Multiple orthographies
        – Multiple dialects
        – Multiple registers
Testing with the Finite-State Calculus
    Bulk testing against corpora
    Regression testing/comparison
    Testing against wordlists
    Testing the well-formedness of the upper-side strings

Advanced Features
“Flag Diacritic” features and feature unification
    Simplify lexc descriptions
    Help keep transducers small
The compile-replace Algorithm
    Useful for non-concatenative morphology
        – Reduplication
        – Semitic Interdigitation