Presentation on theme: "Grammar Development Platform Miriam Butt October 2002."— Presentation transcript:
Grammar Development Platform Miriam Butt October 2002
Grammar Development What is a Grammar Development Platform good for? Applications: Information Retrieval/Extraction, Machine Translation (MT). [Diagram: English: Anna sees the man. → Parser (XLE) → English c-str and f-str → MT → German f-str → Generator (XLE) → German: Anna sieht den Mann.]
A Sample Development Platform XLE (Xerox Linguistic Environment) Main Developer: John Maxwell (PARC). Platforms: Unix (Solaris), Linux, Mac OS X. Software (Shareware): Emacs, Tcl/Tk.
A Sample Development Platform XLE (Xerox Linguistic Environment) Parser: Bottom-Up, Left-to-Right. Performance: Worst-case exponential, polynomial in practice (makes broad-coverage grammars feasible). Linguistic Theory: LFG (Lexical-Functional Grammar), originally developed by Ronald M. Kaplan (PARC) and Joan Bresnan (Stanford).
The ParGram Project Palo Alto Research Center (PARC): English Grammar; IMS, University of Stuttgart: German Grammar; Fuji Xerox: Japanese Grammar; University of Bergen: Norwegian Grammar (Bokmål and Nynorsk); UMIST: Urdu Grammar; XRCE Grenoble: French Grammar.
ParGram Possible Applications: Machine Translation (French, English) Tree Banking (English, German) Smart Text Annotation (German) Robust Parsing (English, German, French) Information Extraction (English) Teaching Tools (Urdu)
Grammar Components Each Grammar Contains: Phrase Structure Rules (S → NP VP), Lexicon (verb stems and functional elements), Finite-State Morphological Analyzer. No Semantics.
Phrase Structure Rules The formulation used today goes back to Chomsky 1957. Sample Set for English: S → NP VP; VP → V NP; NP → D (ADJ) N. Why these kinds of rules? Natural Language is recursive and potentially infinite. Constituency, X-bar Theory.
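The sample rule set can be tried out with a very small recognizer. The following is only an illustrative Python sketch, not XLE's parser: the optional (ADJ) is expanded into two NP rules, and the mini-lexicon and the function names `ends` and `recognize` are assumptions made for this example.

```python
# Sample rule set from the slide; NP -> D (ADJ) N is split into two rules.
RULES = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["D", "N"], ["D", "ADJ", "N"]],
}

# Hypothetical mini-lexicon (not part of the original slides).
LEXICON = {
    "the": "D", "a": "D",
    "girl": "N", "man": "N", "monkey": "N",
    "sees": "V",
    "small": "ADJ",
}


def ends(cat, words, start):
    """Return every position at which a phrase of category `cat`
    that begins at `start` can end."""
    if cat not in RULES:
        # Preterminal category: it must match exactly one word.
        if start < len(words) and LEXICON.get(words[start]) == cat:
            return {start + 1}
        return set()
    result = set()
    for rhs in RULES[cat]:
        positions = {start}
        for child in rhs:
            positions = {e for p in positions for e in ends(child, words, p)}
        result |= positions
    return result


def recognize(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.lower().rstrip(".").split()
    return len(words) in ends("S", words, 0)


print(recognize("The girl sees the small monkey."))
print(recognize("The girl sees."))
```

The recursion works here because the sample grammar has no left-recursive rule; a rule such as NP → NP PP would need a chart-based method instead.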
Phrase Structure Rules The syntax of natural languages is context-free. Colorless green ideas sleep furiously. However, we must also deal with context-sensitive information. The monkey sleeps. *The monkey sleep. *The monkeys sleeps.
Features and Unifications Context-sensitivity can be achieved in many ways. XLE and LFG (like many other theories/platforms) use phrase-structure annotation via attribute-value pairs. S → NP VP, annotated with (↑ SUBJ) = ↓ and (↑ SUBJ NUM) = (↓ NUM). Features are checked via Unification.
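How unification enforces subject-verb agreement can be sketched in a few lines of Python. This is a simplified illustration, not XLE's implementation: feature structures are plain nested dictionaries without structure sharing, and the names `unify`, `sentence_fstructure`, and the lexical entries are assumptions made for this example.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic string values).
    Returns the merged structure, or None if two values clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, val in b.items():
            if key in out:
                merged = unify(out[key], val)
                if merged is None:
                    return None  # clash somewhere below this attribute
                out[key] = merged
            else:
                out[key] = val
        return out
    return a if a == b else None


def sentence_fstructure(subj, verb):
    """Place the NP's features under SUBJ, then unify with what the verb
    says about its subject (e.g. SUBJ NUM = sg)."""
    return unify({"SUBJ": subj}, verb)


# Hypothetical lexical entries for the agreement examples.
monkey = {"PRED": "monkey", "NUM": "sg"}
monkeys = {"PRED": "monkey", "NUM": "pl"}
sleeps = {"PRED": "sleep", "SUBJ": {"NUM": "sg"}}
sleep = {"PRED": "sleep", "SUBJ": {"NUM": "pl"}}

print(sentence_fstructure(monkey, sleeps))   # agreement succeeds
print(sentence_fstructure(monkeys, sleeps))  # None: NUM pl clashes with sg
```

The phrase-structure rule itself stays context-free; the number clash is caught only when the feature values are merged.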
The Ambiguity Problem PP-Attachment: The girl saw the monkey with the telescope. Categorial Ambiguity: Flying planes can be dangerous. Time flies like an arrow.
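The PP-attachment ambiguity can be made concrete by counting parse trees with a CKY-style chart. The sketch below is an illustration only: the binary grammar (including the rules VP → VP PP and NP → NP PP), the lexicon, and the function name `count_parses` are assumptions for this example and are not taken from any ParGram grammar.

```python
from collections import defaultdict

# Hypothetical binary grammar: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("NP", "D", "N"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon.
LEXICON = {
    "the": "D", "girl": "N", "monkey": "N", "telescope": "N", "park": "N",
    "saw": "V", "with": "P", "in": "P",
}


def count_parses(sentence, root="S"):
    """Count the distinct parse trees for `sentence` rooted in `root`."""
    words = sentence.lower().rstrip(".").split()
    n = len(words)
    chart = defaultdict(int)  # (start, end, category) -> number of trees
    for i, word in enumerate(words):
        cat = LEXICON.get(word)
        if cat is not None:
            chart[(i, i + 1, cat)] += 1
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, root)]


print(count_parses("The girl saw the monkey with the telescope."))
```

Under this toy grammar the sentence receives two trees, one per attachment site, and each further PP multiplies the possibilities, which is why large grammars need ways to contain the number of parses.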
Lexicons Typically Contain: Category Information (Terminal Node in Tree), Context-Sensitive Featural Information, Subcategorization Information, Semantics (sometimes).
Ambiguity in Large Grammars Ambiguity: a serious problem even in simple sentences, e.g. PP-attachment (English) and Subject/Object Ambiguities (German). Within XLE, various techniques have been invented to cut down on the explosion of parses: Optimality Marking, Packed Representations.
Morphologies and Tokenizers Beyond the Word: Writing and adding in Morphological Analysis and Tokenization
Parallel Analyses English: Yassin was seen. German: Yassin wurde gesehen. Urdu: yassin dekha gaya. Languages Differ on the Surface (c-structure). ParGram Goal: The same underlying f-structures for all languages (modulo lexical semantics).
The “Parallel” in ParGram Analyses at the level of f-structure are held as parallel as possible across languages (crosslinguistic invariance). Theoretical Advantage: This models the idea of UG. Applicational Advantage: machine translation is made easier. Analyses at the level of c-structure are allowed to differ much more (variance across languages).
FST Morphological Analyzers Kaplan and Butt (2002): this LFG morphology-syntax interface is natural. Surface form: calana ‘to drive’ (M.Sg). Sequence Relation (Seq): surface form → drive+Verb+Inf+M+Sg. Lexical Relation (L): tags → [VFORM inf], [GEND masc], [NUM sg]. Satisfaction Relation (Sat): descriptions → f-structure (m-structure) [PRED ‘drive’, VFORM inf, GEND masc, NUM sg].
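The three relations on this slide can be mimicked with simple lookups. The following Python sketch is only an approximation under stated assumptions: a dictionary stands in for the finite-state analyzer, and the names `ANALYSES`, `TAG_FEATURES`, and `fstructure` are hypothetical, not part of XLE or of the Kaplan and Butt proposal.

```python
# Stand-in for the Sequence relation: surface form -> stem plus tag sequence.
# A real system would use a finite-state transducer here.
ANALYSES = {
    "calana": ["drive", "+Verb", "+Inf", "+M", "+Sg"],
}

# Stand-in for the Lexical relation: each tag contributes a feature description.
TAG_FEATURES = {
    "+Verb": {},
    "+Inf": {"VFORM": "inf"},
    "+M": {"GEND": "masc"},
    "+Sg": {"NUM": "sg"},
}


def fstructure(surface):
    """Build the smallest attribute-value structure that satisfies all
    descriptions contributed by the tags of `surface`."""
    stem, *tags = ANALYSES[surface]
    fs = {"PRED": stem}
    for tag in tags:
        for attr, val in TAG_FEATURES[tag].items():
            if fs.get(attr, val) != val:
                raise ValueError(f"conflicting values for {attr}")
            fs[attr] = val
    return fs


print(fstructure("calana"))
```

For calana this yields the attribute-value pairs listed on the slide: PRED, VFORM, GEND, and NUM.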