Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Albert Gatt Corpora and Statistical Methods Lecture 11.
Chapter 4 Syntax.
Introduction to Syntax Owen Rambow September 30.
June 6, 20073rd PIRE Meeting1 Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank Lucie Mladová Silvie Cinková, Kristýna.
Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011.
The SALSA experience: semantic role annotation Katrin Erk University of Texas at Austin.
Language Data Resources Treebanks. A treebank is a … database of syntactic trees corpus annotated with morphological and syntactic information segmented,
Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance July 27 EMNLP 2011 Shay B. Cohen Dipanjan Das Noah A. Smith Carnegie Mellon University.
Introduction to treebanks Session 1: 7/08/
PCFG Parsing, Evaluation, & Improvements Ling 571 Deep Processing Techniques for NLP January 24, 2011.
Albert Gatt LIN3022 Natural Language Processing Lecture 8.
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.
Introduction to Syntax Owen Rambow October
Recovering empty categories. Penn Treebank The Penn Treebank Project annotates naturally occurring text for linguistic structure. It produces skeletal.
Tasks Talk: ULA08 Workshop March 18, 2007 A Talk about Tasks Unified Linguistic Annotation Workshop Adam Meyers New York University March 18, 2008.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
1/13 Parsing III Probabilistic Parsing and Conclusions.
1/17 Probabilistic Parsing … and some other approaches.
April 26, 2007Workshop on Treebanking, NAACL-HTL 2007 Rochester1 Treebanks and Parsing Jan Hajič Institute of Formal and Applied Linguistics School of.
Parsing the NEGRA corpus Greg Donaker June 14, 2006.
Breaking the Resource Bottleneck for Multilingual Parsing Rebecca Hwa, Philip Resnik and Amy Weinberg University of Maryland.
Probabilistic Parsing Ling 571 Fei Xia Week 5: 10/25-10/27/05.
ELN – Natural Language Processing Giuseppe Attardi
PDT 2.0 Prague Dependency Treebank 2.0 Zdeněk Žabokrtský Dept. of Formal and Applied Linguistics Charles University, Prague.
UAM CorpusTool: An Overview Debopam Das Discourse Research Group Department of Linguistics Simon Fraser University Feb 5, 2014.
LING/C SC/PSYC 438/538 Lecture 27 Sandiway Fong. Administrivia 2 nd Reminder – 538 Presentations – Send me your choices if you haven’t already.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
Spring /22/071 Beyond PCFGs Chris Brew Ohio State University.
Learner corpus analysis and error annotation Xiaofei Lu CALPER 2010 Summer Workshop July 13, 2010.
Syntactically annotated corpora of Estonian Heli Uibo Institute of Computer Science University of Tartu
10/12/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 10 Giuseppe Carenini.
The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.
Inductive Dependency Parsing Joakim Nivre
SYNTAX Lecture -1 SMRITI SINGH.
Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International.
An ICALL writing support system tunable to varying levels of learner initiative Karin Harbusch 1 & Gerard Kempen 2,3 1 University of Koblenz-Landau, Koblenz,
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
Albert Gatt Corpora and Statistical Methods Lecture 11.
Resemblances between Meaning-Text Theory and Functional Generative Description Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University,
What you have learned and how you can use it : Grammars and Lexicons Parts I-III.
CPE 480 Natural Language Processing Lecture 4: Syntax Adapted from Owen Rambow’s slides for CSc Fall 2006.
Rules, Movement, Ambiguity
CSA2050 Introduction to Computational Linguistics Parsing I.
1 Context Free Grammars October Syntactic Grammaticality Doesn’t depend on Having heard the sentence before The sentence being true –I saw a unicorn.
CPSC 503 Computational Linguistics
Iceland 5/30-6/1/07 1 Parsing with Morphological Information for Treebank Construction Seth Kulick University of Pennsylvania.
nd PIRE project workshop1 Tectogrammatical Representation of English Silvie Cinková Lucie Mladová, Anja Nedoluzhko, Jiří Semecký, Jana Šindlerová,
Supertagging CMSC Natural Language Processing January 31, 2006.
Syntactic Annotation of Slovene Corpora (SDT, JOS) Nina Ledinek ISJ ZRC SAZU
Automatic Grammar Induction and Parsing Free Text - Eric Brill Thur. POSTECH Dept. of Computer Science 심 준 혁.
Exploiting Reducibility in Unsupervised Dependency Parsing David Mareček and Zdeněk Žabokrtský Institute of Formal and Applied Linguistics Charles University.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
◦ Process of describing the structure of phrases and sentences Chapter 8 - Phrases and sentences: grammar1.
Arabic Syntactic Trees Zdeněk Žabokrtský Otakar Smrž Center for Computational Linguistics Faculty of Mathematics and Physics Charles University in Prague.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 3 rd.
Semantic annotation of a dialog corpus Silvie Cinková Institute of Formal and Applied Linguistics Charles University in Prague, Czech Republic COMPANIONS.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
An Introduction to the Government and Binding Theory
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
Prague Dependency Treebank 2. 0 Zdeněk Žabokrtský Dept
Topics in Linguistics ENG 331
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 26
CS224N Section 3: Corpora, etc.
Presentation transcript:

Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec Lecture 3: Treebanks

Overview 1. syntactic annotation and treebanks lab work: TIGERSearch lab work: TIGERSearch 2. lexical semantics lab work: WordNet lab work: WordNet 3. Projects

Treebanks A linguistically annotated corpus that includes some grammatical analysis beyond word- level syntactic annotation (part-of-speech) “treebank” vs. “annotated corpus” “treebank” vs. “annotated corpus”  the first has to be manually annotated or post-edited two syntactic frameworks: two syntactic frameworks:  constituent structure  dependency structure

Constituent structure American structuralism, e.g. Zelig Harris (1951) “Bracketing”: sentences consist of hierarchically embedded subparts → constituents   strings of words that belong together   constituency tests: substitution, movement, stand- alone test, … Part-whole relations   e.g. a NP consists of a determiner, adjective and noun [ NP [ DET [ ADJ ] [ N ]]

Dependency structure First comprehensive theory: Lucien Tesniere (1959) First comprehensive theory: Lucien Tesniere (1959) Sentence consists of hierarchically structured asymmetric binary relations between word forms → dependency relations (connections) Sentence consists of hierarchically structured asymmetric binary relations between word forms → dependency relations (connections)  governor, dependent(s)  closely related to functional analysis Relations Relations  e.g. determiner and adjective are subordinated to the noun

Dependencies in SDT

Hybrid models Combine constituent and functional (dependency) information e.g. function added as additional sub-label to daughter category: [S [NP-SB …]] in Penn Treebank II e.g. function added as additional sub-label to daughter category: [S [NP-SB …]] in Penn Treebank II

Treebanks and linguistic theory Constituent structure, e.g. Constituent structure, e.g.  Penn Treebank I (AE) Dependency structure, e.g. Dependency structure, e.g.  Prague Dependency Treebank / analytical level (Czech) Constituent / Dependency Hybrid approaches, e.g. Constituent / Dependency Hybrid approaches, e.g.  Penn Treebank II, SUSANNE (AE)  NEGRA/TIGER, TüBa (German) Theory specific annotation, e.g. Theory specific annotation, e.g.  Prague Dependency Treebank / tectogrammatical level - Functional Generative Grammar  CCG-bank - Combinatory Categorial Grammar

Penn Treebank English treebank built at the University of Pennsylvania, distributed by LDC English treebank built at the University of Pennsylvania, distributed by LDC Phase 1 ( ) Phase 1 ( )  skeletal parse  2.6. mill words PoS tagged from Wall Street Journal, also other components, e.g. Brown Corpus Phase II ( ) Phase II ( )  enriching part of the material with grammatical functions and semantic relations  null-elements, coreference Phase III ( ) Phase III ( )  additional material: corpus of telephone conversations annotated for disfluencies

Penn Treebank: PoS annotation uses modified BROWN tagset uses modified BROWN tagset allows multiple tags on word when annotator is unsure (avoid arbitrary decisions) allows multiple tags on word when annotator is unsure (avoid arbitrary decisions) 36 PoS tags, 12 other tags (punctuation, currency symbols) 36 PoS tags, 12 other tags (punctuation, currency symbols) etc.

Penn Treebank: syntactic annotation

Penn Treebank: Skeletal parsing

Penn Treebank: Functional tagset Text categories Text categories  -HNL headlines and datelines  -LST list markers  -TTL titles Grammatical functions Grammatical functions  -NOM non NPs that function as NPs  -ADV clausal and NP adverbials  -SBJ surface subject  … Semantic roles Semantic roles  -DIR direction and trajectory  -LOC location  -MCR manner  … Pseudo-attachment Pseudo-attachment  *EXP* expletive  *RNR* right node raising  …

TIGER Treebank “LinguisTIc Interpretation of a GERman Corpus” sentences sentences follow-up of NEGRA corpus ( sentences) follow-up of NEGRA corpus ( sentences) German newspaper texts (Frankfurter Rundshau) German newspaper texts (Frankfurter Rundshau) free licence free licence hybrid annotation hybrid annotation crossing branches for discontinuous constituents crossing branches for discontinuous constituents

TIGER treebank example: discontinuous constituents

Creating treebanks Manual annotation Manual annotation  TrEd, CLaRK, Word freak Automatic annotation with human post- editing Automatic annotation with human post- editing  Collins’ Parser, Stanford Parser,… very labour intensive! very labour intensive!

Exploiting treebanks: Parser training

Exploiting treebanks: Charniak 1996 inducing a treebank-based PCFG inducing a treebank-based PCFG preliminary version of Penn Treebank preliminary version of Penn Treebank training corpus: ~30,000 words training corpus: ~30,000 words test corpus: ~30,000 words test corpus: ~30,000 words

CoNLL-X shared task on multilingual dependency parsing 2006, , open task: common format of treebanks, all systems must compete on all languages open task: common format of treebanks, all systems must compete on all languages 13 treebanks: Arabic, Chinese, Czech, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish, Bulgarian 13 treebanks: Arabic, Chinese, Czech, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish, Bulgarian 20 systems 20 systems Best average labelled attachment score ~80% Best average labelled attachment score ~80%