Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information
Ondřej Bojar, Institute of Formal and Applied Linguistics (ÚFAL), MFF, Charles University, Prague


Conclusion

We designed and implemented a new scripting language, AX, for selecting sentences according to linguistically motivated criteria. To demonstrate its utility, we prepared an AX script (15 filters and 21 rules) that selects sentences suitable for extraction of Czech verb frames. The extraction of frames itself has not been performed yet, but the utility of the selection can be illustrated by the improved accuracy of the Czech Collins parser, measured by the number of verb occurrences with all daughters assigned correctly. When the linguistically motivated selection is combined with a filter keeping only short sentences, an accuracy of 73.2 % can be achieved; approximately 10 % of the input sentences pass both filters.

Observed verbs with all daughters recognized correctly:

                                  In all sentences   In the selected sentences only
  Sentences of any length               55 %                 65 %
  Sentences of up to 10 words           68 %                 73 %

AX Scripts

An AX script is an arbitrary sequence of filters and rulesets. The input to a script is a sentence represented as a sequence of feature structures corresponding one to one to the input word forms. Word forms with ambiguous morphological information are still represented as single feature structures. For example, the Czech word form "má" can serve either as a personal pronoun or as a finite verb; a single feature structure holds both variants:

  [cat-pron, lemma-"můj", case-nom, gend-fem, num-sg
  |cat-verb, lemma-"mít", tense-pres, person-third, gend-masc, num-sg]

Filters are used to strike out sentences not suitable for further analysis or for extracting the lexico-syntactic information:

  - Early filters decide according to simple criteria, such as too many punctuation marks.
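The ambiguous feature-structure representation described above can be sketched in Python. This is a hypothetical illustration, not the actual AX implementation; the names `could_be` and `variants` are invented for the example.

```python
# Hypothetical sketch: a word form with ambiguous morphology is kept as ONE
# unit holding all analysis variants, mirroring the poster's example for the
# Czech form "má" (possessive pronoun or finite verb).

ma = {
    "form": "má",
    "variants": [
        {"cat": "pron", "lemma": "můj", "case": "nom", "gend": "fem", "num": "sg"},
        {"cat": "verb", "lemma": "mít", "tense": "pres", "person": "third",
         "gend": "masc", "num": "sg"},
    ],
}

def could_be(fs, **features):
    """True if at least one analysis variant is compatible with the features."""
    return any(all(v.get(k) == val for k, val in features.items())
               for v in fs["variants"])

print(could_be(ma, cat="verb"))   # True
print(could_be(ma, cat="noun"))   # False
```

Keeping both variants in one structure lets later filters and rules decide which reading survives, instead of forcing disambiguation up front.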
  - Later filters make use of the results of partial syntactic analysis and reject sentences after a more sophisticated decision, such as discovering noun phrases ordered in a manner where syntactic ambiguity is very common and would spoil the observed verb frame.

Filters are expressed as regular expressions over feature structures.

Rulesets are used to perform partial syntactic analysis of the sentence. A ruleset may produce several possible "readings" of the sentence (formally, a reading is a sequence of feature structures); some of the readings may be rejected by the filters that follow.

AX Overall Scheme (example run)

  - A filter rejects sentences with strange symbols: Sentence 1 is rejected by this first filter.
  - A ruleset combines auxiliary and main verb parts. This might be ambiguous, so several readings can be generated.
  - Sentence 2 is accepted: one of its readings passes the last filter, and its final sequence of feature structures is printed out.
  - Sentence 3 is rejected because none of its readings passed the second filter (one that rejects sentences with two main verbs).

[Diagram: a big corpus (with morphological information only) is narrowed by AX into a subcorpus of nice examples, from which lexico-syntactic data are extracted automatically; a dependency treebank feeds a lexicon of syntactic behavior.]

Syntactic Analysis Needs Lexicons

Treebanks contain the required syntactic information but cover too few lexemes in too few situations. For instance, the Prague Dependency Treebank (1.5 million tokens in 98,263 sentences) covers only 5,400 Czech verbs out of an estimated total of 40,000, and only 500 verbs occur in the PDT more than 50 times. The PDT is therefore not sufficient as a source of valencies of Czech verbs. Corpora without syntactic annotation contain enough examples, but many of the available sentences are too complex to extract the syntactic information from automatically.
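The overall scheme above, a sequence of filters and rulesets where rulesets fan a sentence out into readings and filters discard them, can be sketched as a small pipeline. All names here (`run_script`, the toy stages) are hypothetical, not the real AX API.

```python
# Illustrative sketch of the AX overall scheme: a script is a sequence of
# filters and rulesets; a sentence survives only if at least one reading
# passes every stage.

def run_script(stages, sentence):
    readings = [sentence]                  # a reading = a sequence of tokens
    for kind, fn in stages:
        if kind == "filter":
            readings = [r for r in readings if fn(r)]
        elif kind == "ruleset":
            readings = [new for r in readings for new in fn(r)]
        if not readings:
            return None                    # sentence rejected
    return readings

# Toy stages: reject strange symbols, then fan each reading out into variants.
no_strange = ("filter", lambda r: all(t.isalpha() for t in r))
fan_out = ("ruleset", lambda r: [r, r + ["trace"]])

print(run_script([no_strange, fan_out], ["má", "rád"]))
print(run_script([no_strange, fan_out], ["@@@"]))   # None: rejected
```

The real AX stages match regular expressions over feature structures rather than testing plain strings, but the control flow is the same: filters prune readings, rulesets multiply them.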
Proposed solution: first "pick nice examples", then extract the lexico-syntactic information traditionally.

AX (automatic extraction) is a new scripting language designed to make the following tasks easy:

  - Dealing with (ambiguous) morphological information.
  - Partial parsing and grammatically consistent simplification of sentences (if needed to check for more complex phenomena).
  - Selection of sentences to keep, based on linguistic criteria (both morphological and syntactic).
  - Print-out of the simplified version of the selected sentences. If the script was prepared carefully, this can already be the desired lexico-syntactic information.

The input is arbitrary text augmented with morphological information (not necessarily disambiguated).

A Sample AX Rule

The AX rule "combine main and aux. verb parts":

  combined \gap [cat-trace] --> aux {gap: ![cat-verb]*} main
                              | main {gap: ![cat-verb]*} aux
  :: # unification requirements follow
     aux = [cat-verb, lemma-"být"],
     main = [cat-verb_participle],
     aux.person = main.person,
     aux.number = main.number,
     combined = [cat-complex_verb],
     combined.lemma = main.lemma,
     combined.person = main.person,
     combined.number = main.number
  end

The rule works as follows:

  - Find a subsequence of the input reading that matches the given regular expression.
  - Restrict which words can be assigned to the input variables aux and main; assign words to them and mark the region between them with the label gap.
  - Ensure grammatical agreement between the words assigned to aux and main.
  - Fill (restrict by unifying) the output variable combined with the features relevant for further analysis.
  - Replace the matching region with the content of the variable combined, the region labelled gap, and an extra feature structure trace marking the former location of the second part of the complex verb (if useful in further analysis).

A filter then rejects sentences with two main verbs.
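A rough Python rendering of the aux+main combination rule may clarify the replace-with-trace behavior. This is a simplified sketch under strong assumptions (unambiguous tokens as plain dicts, no check that the gap is free of other verbs); the function and field names are illustrative, not the AX implementation.

```python
# Sketch of "combine main and aux. verb parts": find an auxiliary "být" and a
# verb participle that agree in person and number, replace the first of the
# pair with a combined complex-verb structure, keep the gap material, and put
# a trace at the former location of the second part.

def combine_aux_main(reading):
    for i, a in enumerate(reading):
        if a.get("cat") != "verb" or a.get("lemma") != "být":
            continue
        for j, m in enumerate(reading):
            if i == j or m.get("cat") != "verb_participle":
                continue
            # unification requirement: grammatical agreement of aux and main
            if a.get("person") != m.get("person") or a.get("num") != m.get("num"):
                continue
            combined = {"cat": "complex_verb", "lemma": m["lemma"],
                        "person": m["person"], "num": m["num"]}
            lo, hi = sorted((i, j))
            return (reading[:lo] + [combined] + reading[lo + 1:hi]
                    + [{"cat": "trace"}] + reading[hi + 1:])
    return reading  # no match: reading unchanged

tokens = [{"cat": "verb", "lemma": "být", "person": "third", "num": "sg"},
          {"cat": "noun", "lemma": "pes"},
          {"cat": "verb_participle", "lemma": "mít", "person": "third", "num": "sg"}]
result = combine_aux_main(tokens)
print([t["cat"] for t in result])   # ['complex_verb', 'noun', 'trace']
```

In real AX the two orders (aux before main and main before aux) are the two alternatives of the rule's regular expression, and the agreement constraints are stated declaratively as unifications rather than as explicit comparisons.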
Full text, acknowledgement and the list of references appear in the proceedings of the ESSLLI Student Session, Vienna, 2003.