Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics, Charles University, Prague
zabokrtsky@ufal.mff.cuni.cz
http://ufal.mff.cuni.cz/pdt2.0

Outline of the talk
- Introduction
- Layers of annotation
- Data
- Software tools
- Documentation
- Tour through the CD-ROM
- Final remarks

Introduction
- treebank: a syntactically annotated corpus (a “bank” of syntactic trees)
- Prague Dependency Treebank: a collection of linguistically annotated Czech texts (2 MW, i.e. about 2 million words), software tools, and documentation
- the annotation consists of morphological, surface-syntactic, and deep-syntactic dependency-oriented sentence analyses

About Czech
- belongs to the western group of Slavic languages
- rich inflectional morphology
- (relatively) free word order
- Latin alphabet extended with accents (příliš žluťoučký kůň)
- spoken in the Czech Republic
- 10+ million speakers

Historical background and development of PDT
- 1920s – Prague Linguistic Circle founded
- 1930s-50s – influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer
- mid-1960s – Petr Sgall’s Functional Generative Description
- 1992 – Penn Treebank
- 1994 – Czech National Corpus
- 1995 – PDT started
- 1998 – PDT 0.5 pre-release
- 2001 – PDT 1.0 released by LDC
- 2006 – PDT 2.0 to be released by LDC

Layered annotation scheme
- tectogrammatical layer (t-layer): deep-syntactic dependency tree
- analytical layer (a-layer): surface-syntactic dependency tree
- morphological layer (m-layer): morphological lemma and tag associated with each token
- word layer: original text, segmented on word boundaries
- example sentence on the slide (English gloss): He would have gone intoforest.

M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes lemma and tag)
- 15-character positional morphological tag:
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
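
The positional tag can be read one character at a time. The sketch below is only an illustration: it decodes the five positions named on the slide and leaves the remaining ten undecoded, since the talk does not enumerate them. The field names and the sample tag string are assumptions chosen for the example, not values quoted from the talk.

```python
# Names for the first five tag positions, following the slide's list.
# Positions 6-15 are not enumerated in the talk, so they stay undecoded.
POSITION_NAMES = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_tag(tag: str) -> dict:
    """Split a 15-character positional tag into named positions."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    decoded = dict(zip(POSITION_NAMES, tag))
    # Keep the undecoded remainder so that no information is lost.
    decoded["rest"] = tag[len(POSITION_NAMES):]
    return decoded


# Illustrative tag string (an assumption for this example).
example = decode_tag("NNMS1-----A----")
print(example)
```

Because every position has a fixed meaning, a single character can be compared directly (for instance `example["case"]`) without parsing the whole tag.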

A-layer (1) - nodes and edges
- sentence represented as a rooted ordered tree with labeled nodes and edges
- edges labeled with analytical functions:
  - dependency relations (Sb, Obj, Adv, Atr)
  - non-dependency relations (Coord)
  - auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions ...)
- special treatment of coordination constructions

A-layer (2) - coordination
- intricate interplay between dependency and coordination relations
- PDT solution: both the conjuncts (members of the coordination) and the shared modifiers are attached below the coordinating conjunction; the two are distinguished from each other by a special attribute, is_member
- consequence: a node's direct parent (its parent in the tree) may differ from its effective parent (its governor in the linguistic sense)
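
The difference between direct and effective parents can be made concrete with a small sketch. It is a simplified illustration, not the PDT or TrEd implementation: the `Node` class and both functions are hypothetical, and only coordination is handled (auxiliary nodes such as AuxP are ignored).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    form: str
    afun: str
    is_member: bool = False
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach child below this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def expand(node: Node) -> List[Node]:
    """Replace a coordination node by its conjuncts, recursively."""
    if node.afun == "Coord":
        members: List[Node] = []
        for child in node.children:
            if child.is_member:
                members.extend(expand(child))
        return members
    return [node]


def effective_parents(node: Node) -> List[Node]:
    """Governors of node once coordination nodes are seen through."""
    current = node
    # A conjunct inherits its governor from the coordination it belongs to.
    while current.is_member and current.parent is not None:
        current = current.parent
    if current.parent is None:
        return []
    # A shared modifier hangs below the conjunction and modifies every conjunct.
    return expand(current.parent)


# Toy tree for "young men and women arrived".
verb = Node("arrived", "Pred")
conj = verb.add(Node("and", "Coord"))
men = conj.add(Node("men", "Sb", is_member=True))
women = conj.add(Node("women", "Sb", is_member=True))
young = conj.add(Node("young", "Atr"))

print(men.parent.form, [n.form for n in effective_parents(men)])
print(young.parent.form, [n.form for n in effective_parents(young)])
```

In the toy tree the direct parent of both "men" and "young" is the conjunction, while their effective parents are the verb and the two conjuncts, respectively.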

T-layer (1) - nodes
- t-nodes
  - nodes represent autosemantic words
  - function words do not have nodes of their own
  - artificially added nodes (e.g. for pro-drops)
- node attributes (complex typed feature structures)
  - tectogrammatical lemma
  - dependency relation: functor and subfunctor
  - grammateme attributes (representing morphological meanings)
  - attributes for topic-focus articulation
  - attributes for coreference relations

T-layer (2) - dependency relations
- according to FGD, two types of functors:
  - actants (arguments): ACT – actor, PAT – patient, ADDR – addressee, EFF – effect, ORIG – origin
  - free modifiers (adjuncts):
    - various types of temporal modifiers: TWHEN, TTILL, TSIN ...
    - spatial and directional modifiers: LOC, DIR1, DIR2, DIR3
    - MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition ...
- additional functors for representing non-dependency relations:
  - coordinations: CONJ, DISJ, ADVS ...
  - appositions: APPS
  - parenthetical constructions: PAR
  - expressions in a foreign language: FPHR
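
As a rough illustration of this classification, the functors named on the slide can be grouped in a lookup. The sets below contain only the labels mentioned in the talk (the slide's lists end in "...", so they are incomplete), and the function name is hypothetical.

```python
# Only the functors named on the slide; the real inventory is larger.
ACTANTS = {"ACT", "PAT", "ADDR", "EFF", "ORIG"}
FREE_MODIFIERS = {
    "TWHEN", "TTILL", "TSIN",          # temporal
    "LOC", "DIR1", "DIR2", "DIR3",     # spatial and directional
    "MEANS", "BEN", "CAUS", "REG", "EXT", "MAT", "COND",
}
NON_DEPENDENCY = {"CONJ", "DISJ", "ADVS", "APPS", "PAR", "FPHR"}


def functor_class(functor: str) -> str:
    """Return the slide's category for a functor, or 'unknown'."""
    if functor in ACTANTS:
        return "actant"
    if functor in FREE_MODIFIERS:
        return "free modifier"
    if functor in NON_DEPENDENCY:
        return "non-dependency"
    return "unknown"


for label in ["ACT", "DIR3", "CONJ", "XYZ"]:
    print(label, "->", functor_class(label))
```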

T-layer (3) - valency
- all occurrences of all verbs in t-trees are interlinked with the valency lexicon PDT-VALLEX
- individual valency frames roughly correspond to individual senses of the given verb
- valency frame ~ a sequence of frame slots; for each slot, its functor, its obligatoriness, and its possible surface realizations are specified
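
The description of a valency frame maps naturally onto a small data structure. The sketch below is an assumption-laden illustration: the class names, the realization notation, and the sample frame for a verb meaning "give" are all hypothetical and are not taken from PDT-VALLEX.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FrameSlot:
    functor: str                   # e.g. ACT, PAT, ADDR
    obligatory: bool               # obligatoriness of the slot
    realizations: Tuple[str, ...]  # possible surface forms (hypothetical notation)


@dataclass(frozen=True)
class ValencyFrame:
    lemma: str
    slots: Tuple[FrameSlot, ...]

    def obligatory_functors(self) -> List[str]:
        return [slot.functor for slot in self.slots if slot.obligatory]


def missing_obligatory(frame: ValencyFrame, observed: List[str]) -> List[str]:
    """Obligatory functors of the frame that are absent among observed functors."""
    return [f for f in frame.obligatory_functors() if f not in observed]


# Hypothetical frame for a verb meaning "give": somebody gives something to somebody.
give = ValencyFrame(
    lemma="give",
    slots=(
        FrameSlot("ACT", True, ("nominative",)),
        FrameSlot("PAT", True, ("accusative",)),
        FrameSlot("ADDR", True, ("dative",)),
        FrameSlot("ORIG", False, ("from+genitive",)),
    ),
)

print(give.obligatory_functors())
print(missing_obligatory(give, ["ACT", "PAT"]))
```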

T-layer (3) - coreference
- two types of coreference according to FGD:
  - grammatical (verbs of control, relative clauses, reflexive pronouns ...)
  - textual (personal pronouns, incl. elided ones)
- coreference in PDT:
  - binary relation between t-nodes
  - depicted as a “non-tree” arc (arrow)

T-layer (4) - grammatemes
- t-node attributes representing morphological meanings
- motivation: number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...

T-layer (5) - node typing
- which attributes are present or absent on a given node? Hence the need for node typing
- a two-level hierarchy of t-layer node types is used in PDT 2.0 (shown as a diagram on the original slide)

Interlinking the layers
- any unit at any layer has a PDT-wide unique ID
- neighboring layers are connected by top-down pointers
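
A minimal sketch of how ID-based top-down pointers can be followed, assuming each unit stores the ID of the unit it refers to on the layer below. The IDs, attribute names, and the tiny data set are invented for illustration and do not reproduce the PDT 2.0 file format.

```python
from typing import List

# Each layer maps a unique ID to a unit; pointers always lead one layer down.
w_layer = {
    "w-1": {"token": "into"},
    "w-2": {"token": "forest"},
}
m_layer = {
    "m-1": {"lemma": "into", "w": "w-1"},
    "m-2": {"lemma": "forest", "w": "w-2"},
}
a_layer = {
    "a-1": {"afun": "AuxP", "m": "m-1"},
    "a-2": {"afun": "Adv", "m": "m-2"},
}
# One t-node covers both the noun and its preposition, because function
# words have no t-nodes of their own.
t_layer = {
    "t-1": {"t_lemma": "forest", "functor": "DIR3", "a": ["a-1", "a-2"]},
}


def surface_tokens(t_id: str) -> List[str]:
    """Follow the pointers t -> a -> m -> w and collect the original tokens."""
    tokens = []
    for a_id in t_layer[t_id]["a"]:
        m_id = a_layer[a_id]["m"]
        w_id = m_layer[m_id]["w"]
        tokens.append(w_layer[w_id]["token"])
    return tokens


print(surface_tokens("t-1"))
```

Because pointers only lead downward, a lower layer can be read and processed without loading any of the layers above it.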

Sources of text
- texts provided by the Czech National Corpus
- 7000 articles (or article fragments) from Czech newspapers and journals:
  - Lidové noviny (daily newspaper)
  - Mladá fronta Dnes (daily newspaper)
  - Českomoravský profit (business weekly)
  - Vesmír (scientific journal)

Amount of annotated data
- m-layer data: 1.96 MW in 116 kS
- a-layer data (75 % of m-layer): 1.5 MW in 88 kS
- t-layer data (59 % of a-layer): 0.8 MW in 49 kS
(MW = million words, kS = thousand sentences)

Division into files
- 1 XML file per document and annotation layer

Train/test data
- train : devtest : evaltest = 8 : 1 : 1
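
An 8 : 1 : 1 ratio can be obtained, for example, by assigning documents to buckets in rotation. The sketch below only illustrates the ratio; it is not the procedure by which the PDT 2.0 data were actually divided, and the document names are made up.

```python
def assign_split(index: int) -> str:
    """Map a document index to a split so that the ratio is 8 : 1 : 1."""
    bucket = index % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "devtest"
    return "evaltest"


# Hypothetical document identifiers.
documents = [f"doc{i:03d}" for i in range(100)]

splits = {"train": [], "devtest": [], "evaltest": []}
for i, doc in enumerate(documents):
    splits[assign_split(i)].append(doc)

print({name: len(docs) for name, docs in splits.items()})
```

Splitting by whole documents rather than by individual sentences keeps sentences from the same article out of both training and evaluation data.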

Full vs. sample data
- sample data
  - 500 sentences
  - a freely available subset of the full data
  - also converted to HTML (can be viewed in any web browser, no tree editor needed)
- everything in PDT 2.0 except the full data (that is, the sample data, all tools, and the documentation) is available on the web
- the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium

Tree editor TrEd
- general customizable tree editor
- implemented in Perl
- the main editing and browsing tool in the PDT project

Batch processing of the data
- btred – batch processing version of tred
- ntred – networked (parallelized) version of btred
- example: print the t_lemma of every node that hangs directly below the tree root and has at least one child whose functor starts with DIR:

  $ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep{$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q
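
For readers who do not use Perl, the selection logic of the one-liner above can be restated in Python over a toy in-memory tree. This is not the btred API: the `TNode` class, the helper functions, and the sample tree are hypothetical and only mirror the condition in the command (node directly below the root, at least one child with a DIR* functor).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class TNode:
    t_lemma: str
    functor: str
    parent: Optional["TNode"] = field(default=None, repr=False)
    children: List["TNode"] = field(default_factory=list, repr=False)

    def add(self, child: "TNode") -> "TNode":
        """Attach child below this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def descendants(node: TNode) -> Iterator[TNode]:
    """Yield all nodes below node in depth-first order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def lemmas_with_directional_child(root: TNode) -> List[str]:
    """t_lemmas of root's children that govern a node with a DIR* functor."""
    return [
        node.t_lemma
        for node in descendants(root)
        if node.parent is root
        and any(child.functor.startswith("DIR") for child in node.children)
    ]


# Toy t-tree: technical root -> "go" -> {"he" (ACT), "forest" (DIR3)}.
root = TNode("#root", "ROOT")
go = root.add(TNode("go", "PRED"))
go.add(TNode("he", "ACT"))
go.add(TNode("forest", "DIR3"))

print(lemmas_with_directional_child(root))
```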

Netgraph
- client-server application for on-line PDT search
- implemented in Java

Tools for post-annotation consistency checking
- hundreds of btred scripts of various types:
  - technical tests, e.g.
    - each sentence contains at least one token
    - all identifiers are unique, all referred identifiers exist ...
  - m-layer tests
    - locative (6th case) cannot occur without a preposition
    - improbable word forms (e.g. imperatives haš, tel)
  - a-layer tests
    - not more than one subject in a clause
    - attributes (afun Atr) should not appear directly below verbs
  - t-layer tests
    - surface forms of verb arguments match the specifications in the valency lexicon
    - relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
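
The technical tests listed above are simple to state in code. The following sketch is written in Python rather than as a btred script, and its data layout (units as dictionaries with an `id` and an optional `ref` to a unit on another layer) is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def check_document(sentences: List[List[Dict]], known_targets: Set[str]) -> List[str]:
    """Run three technical tests and return a list of error messages."""
    errors = []

    # Test 1: each sentence contains at least one token.
    for index, sentence in enumerate(sentences):
        if not sentence:
            errors.append(f"sentence {index}: no tokens")

    # Test 2: all identifiers are unique.
    id_counts = Counter(unit["id"] for sentence in sentences for unit in sentence)
    for unit_id, count in id_counts.items():
        if count > 1:
            errors.append(f"duplicate id: {unit_id}")

    # Test 3: all referred identifiers exist.
    for sentence in sentences:
        for unit in sentence:
            ref = unit.get("ref")
            if ref is not None and ref not in known_targets:
                errors.append(f"{unit['id']}: dangling reference to {ref}")

    return errors


lower_layer_ids = {"w-1", "w-2"}
good = [[{"id": "m-1", "ref": "w-1"}, {"id": "m-2", "ref": "w-2"}]]
bad = [
    [{"id": "m-1", "ref": "w-1"}, {"id": "m-1", "ref": "w-9"}],
    [],
]

print(check_document(good, lower_layer_ids))
print(check_document(bad, lower_layer_ids))
```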

Tools for automatic annotation
- chain of tools for automatic text processing (from raw text to a-layer trees):
  1. sentence segmentation and tokenization
  2. morphological analysis
  3. morphological disambiguation
  4. dependency parsing (adapted Collins)
  5. analytical function assignment

Tools for format conversions
- conversion not only between PDT data formats, but also from other treebanks’ formats
- example shown on the original slide: constituency trees from Negra in TrEd

PDT 2.0 Documentation
- PDT Guide
  - overview of all parts of PDT 2.0
  - mirrors the directory structure of the PDT 2.0 CD-ROM
- Annotation guidelines
  - m-layer (~100 pages)
  - a-layer (~250 pages)
  - t-layer (~800 pages)
- Publications
  - conference and journal papers, technical reports, theses ...
- Technical documentation (software tools and data formats)

Want to experiment with...
- tagging?
- dependency parsing?
- semantic-role labeling?
- frame semantics?
- word-sense disambiguation?
- anaphora resolution?
- information structure?
- ...
Use PDT 2.0, it’s all there!

Annotation scheme not limited to Czech
- T-layer in English
- T-layer in German
- A-layer in German
- A-layer in Arabic
- A-layer in Slovene
- A-layer in Romanian

Those involved (some of them)

Thank you!
BTW, anyone interested in beta-testing?