Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics
Charles University, Prague

Outline of the talk
Introduction
Layers of annotation
Data
Software tools
Documentation
Tour through the CD-ROM
Final remarks

Introduction
treebank
  syntactically annotated corpus (“bank” of syntactic trees)
Prague Dependency Treebank
  collection of linguistically annotated Czech texts (2 MW), software tools and documentation
  morphological and surface- and deep-syntactic dependency-oriented sentence analyses

About Czech
western group of Slavic languages
rich inflectional morphology
(relatively) free word order language
Latin alphabet extended with accents (příliš žluťoučký kůň)
spoken in the Czech Republic
10+ million speakers

Historical background and development of PDT
1920’s - Prague Linguistic Circle founded
’s - influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer
mid 1960’s - Petr Sgall’s Functional Generative Description
1992 - Penn Treebank
1994 - Czech National Corpus
1995 - PDT started
1998 - PDT 0.5 pre-release
2001 - PDT 1.0 released by LDC
2006 - PDT 2.0 to be released by LDC

Layered annotation scheme
tectogrammatical layer
  deep-syntactic dependency tree
analytical layer
  surface-syntactic dependency tree
morphological layer
  morphological lemma and tag associated with each token
word layer
  original text, segmented on word boundaries
example sentence: He would have gone intoforest.

M-layer
sentence represented as a sequence of tokens
each token lemmatized and tagged (attributes lemma and tag)
15-character long positional morphological tag
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
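
The positional tag described above can be read by character index. The following is a minimal sketch, assuming only what the slide states: the tag has 15 characters and the first five positions are POS, detailed POS, gender, number and case. The function name `decode_tag`, the dictionary keys, and the sample tag value are illustrative assumptions, not part of the PDT tools.

```python
# The five position names come from the list above; the other ten positions
# are not named there, so this sketch keeps them as an undecoded remainder.
NAMED_POSITIONS = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def decode_tag(tag: str) -> dict:
    """Split a 15-character positional tag into its named leading positions."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    decoded = {name: tag[i] for i, name in enumerate(NAMED_POSITIONS)}
    decoded["rest"] = tag[len(NAMED_POSITIONS):]
    return decoded


# Illustrative tag value (an assumption, not taken from the slides).
print(decode_tag("NNMS1-----A----"))
```

Because every category has a fixed position, a tool can test a single feature (for example the case) without parsing the whole tag.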

A-layer (1) - nodes and edges
sentence represented as a rooted ordered tree with labeled nodes and edges
edges labeled with analytical functions:
  dependency relations (Sb, Obj, Adv, Atr)
  non-dependency relations (Coord)
  auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions...)
special treatment of coordination constructions

A-layer (2) - coordination
intricate interplay between dependency and coordination relations
PDT solution: both conjuncts (members of coordination) and shared modifiers are attached below the coordination conjunction (but distinguished from each other by a special attribute is_member)
direct parent vs. effective parent
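
The difference between a direct parent and an effective parent can be sketched in a few lines. This is a simplified illustration, assuming only the scheme stated above: conjuncts carry is_member, shared modifiers hang below the coordination node without it, and the coordination node has afun Coord. The class `ANode`, the helper names and the sample phrase are hypothetical; apposition and auxiliary nodes such as AuxP are ignored here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class ANode:
    form: str
    afun: str
    is_member: bool = False
    parent: Optional["ANode"] = None
    children: List["ANode"] = field(default_factory=list)

    def add(self, child: "ANode") -> "ANode":
        """Attach child below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(coord: ANode) -> List[ANode]:
    """Collect the members of a coordination, descending into nested ones."""
    result = []
    for child in coord.children:
        if child.is_member:
            if child.afun == "Coord":
                result.extend(conjuncts(child))
            else:
                result.append(child)
    return result


def effective_parents(node: ANode) -> List[ANode]:
    """Return the nodes that the given node linguistically depends on."""
    current = node
    # A conjunct inherits its governor from the coordination it belongs to.
    while current.is_member and current.parent is not None:
        current = current.parent
    parent = current.parent
    if parent is None:
        return []
    if parent.afun == "Coord":
        # A shared modifier depends on every conjunct of the coordination.
        return conjuncts(parent)
    return [parent]


# Hypothetical tree for "old men and women sleep".
sleep = ANode("sleep", "Pred")
conj = sleep.add(ANode("and", "Coord"))
men = conj.add(ANode("men", "Sb", is_member=True))
women = conj.add(ANode("women", "Sb", is_member=True))
old = conj.add(ANode("old", "Atr"))

print(men.parent.form)                           # direct parent
print([p.form for p in effective_parents(men)])  # effective parent
print([p.form for p in effective_parents(old)])  # shared modifier
```

In this toy tree the direct parent of both conjuncts and of the shared modifier is the conjunction, while the effective parents differ, which is the distinction that the is_member attribute makes recoverable.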

T-layer (1) - nodes
t-nodes
  complex typed feature structures
  nodes represent autosemantic words
  functional words do not have nodes of their own
  artificially added nodes (e.g. for pro-drops)
node attributes
  tectogrammatical lemma
  dependency relation - functor and subfunctor
  grammateme attributes (representing morphological meanings)
  attributes for topic-focus articulation
  attributes for coreference relations

T-layer (2) - dependency relations
according to FGD, two types of functors
  actants (arguments)
    ACT - actor
    PAT - patient
    ADDR - addressee
    EFF - effect
    ORIG - origin
  free modifiers (adjuncts)
    various types of temporal modifiers - TWHEN, TTIL, TSIN...
    spatial and directional modifiers - LOC, DIR1, DIR2, DIR3
    MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition...
additional functors for representing non-dependency relations
  coordinations - CONJ, DISJ, ADVS ...
  appositions - APPS
  parenthetical constructions - PAR
  expressions in foreign language - FPHR

T-layer (3) - valency
all occurrences of all verbs in t-trees interlinked with the valency lexicon PDT-VALLEX
individual valency frames roughly correspond to individual senses of the given verb
valency frame ~ a sequence of frame slots, for each of which its functor, obligatoriness and possible surface realizations are specified
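
A valency frame as characterized above (a sequence of slots, each with a functor, an obligatoriness flag and a set of possible surface realizations) maps naturally onto a small record type. The sketch below is only an illustration: the class names, the realization labels and the sample entry are made up and do not reproduce the PDT-VALLEX format or any real lexicon entry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameSlot:
    functor: str             # e.g. ACT, PAT, ADDR
    obligatory: bool         # must the slot be filled?
    realizations: List[str]  # permitted surface forms (labels are hypothetical)


@dataclass
class ValencyFrame:
    lemma: str
    slots: List[FrameSlot]

    def obligatory_functors(self) -> List[str]:
        """Functors of the slots that must be filled."""
        return [slot.functor for slot in self.slots if slot.obligatory]

    def allows(self, functor: str, realization: str) -> bool:
        """Check whether some slot licenses this functor/realization pair."""
        return any(
            slot.functor == functor and realization in slot.realizations
            for slot in self.slots
        )


# Made-up entry for one sense of a verb meaning "give".
give = ValencyFrame(
    lemma="give",
    slots=[
        FrameSlot("ACT", True, ["nominative"]),
        FrameSlot("PAT", True, ["accusative"]),
        FrameSlot("ADDR", True, ["dative"]),
        FrameSlot("ORIG", False, ["from+genitive"]),
    ],
)

print(give.obligatory_functors())
print(give.allows("PAT", "accusative"), give.allows("PAT", "dative"))
```

A lookup such as `allows` is the kind of test needed to compare the observed surface form of a verb argument with what the lexicon specifies.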

T-layer (3) - coreference
two types of coreference according to FGD
  grammatical (verbs of control, relative clauses, reflexive pronouns...)
  textual (personal pronouns, incl. elided ones)
coreference in PDT
  binary relation between t-nodes
  depicted as a “non-tree” arc (arrow)

T-layer (4) - grammatemes
t-node attributes representing morphological meanings
motivation
number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality ...

T-layer (5) - node typing
presence/absence of a given attribute: the need for node typing
two-level hierarchy of t-layer node types used in PDT 2.0

Interlinking the layers
any unit at any layer has a PDT-unique ID
neighboring layers connected by top-down pointers
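
The linking scheme can be pictured as lookups that walk from a higher layer down to the original text. The sketch below assumes only what is stated above (unique IDs, pointers from each layer to the one beneath it); all identifiers, field names and the sample words are hypothetical and do not reproduce the PDT data format.

```python
# Each layer maps a unique ID to a unit; every unit above the word layer
# points down to the neighboring lower layer.
w_layer = {"w1": "would", "w2": "have", "w3": "gone"}

m_layer = {
    "m1": {"lemma": "would", "w_ref": "w1"},
    "m2": {"lemma": "have", "w_ref": "w2"},
    "m3": {"lemma": "go", "w_ref": "w3"},
}

a_layer = {
    "a1": {"m_ref": "m1"},
    "a2": {"m_ref": "m2"},
    "a3": {"m_ref": "m3"},
}

# Function words have no t-node of their own, so in this sketch a single
# t-node points to several a-nodes.
t_layer = {"t1": {"t_lemma": "go", "a_refs": ["a1", "a2", "a3"]}}


def surface_forms(t_id: str) -> list:
    """Follow the top-down pointers from a t-node to the original word forms."""
    forms = []
    for a_id in t_layer[t_id]["a_refs"]:
        m_id = a_layer[a_id]["m_ref"]
        w_id = m_layer[m_id]["w_ref"]
        forms.append(w_layer[w_id])
    return forms


print(surface_forms("t1"))
```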

Sources of text
texts provided by the Czech National Corpus
7000 articles (or article fragments) from Czech newspapers and journals:
  Lidové noviny (daily newspaper)
  Mladá fronta Dnes (daily newspaper)
  Českomoravský profit (business weekly)
  Vesmír (scientific journal)

Amount of annotated data
m-layer data: 1.96 MW in 116 kS
a-layer data (75 % of m-layer): 1.5 MW in 88 kS
t-layer data (59 % of a-layer): 0.8 MW in 49 kS
(MW = million words, kS = thousand sentences)

Division into files
1 XML file per document and annotation layer

Train/test data
train : devtest : evaltest = 8 : 1 : 1
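
An 8 : 1 : 1 division can be produced by a deterministic rule. The sketch below is one hypothetical way to do it, cycling through ten buckets by document index; the slides do not say how PDT 2.0 actually assigned documents to the three parts, and the function name is an assumption.

```python
from collections import Counter


def assign_split(doc_index: int) -> str:
    """Map a document index to a data part in the ratio 8 : 1 : 1."""
    bucket = doc_index % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "devtest"
    return "evaltest"


# Illustration with 7000 documents, the number of articles mentioned earlier.
counts = Counter(assign_split(i) for i in range(7000))
print(counts["train"], counts["devtest"], counts["evaltest"])
```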

Full vs. sample data
sample data
  500 sentences
  a freely available subset of the full data
  converted also to HTML (can be viewed in any WWW browser, no tree editor needed)
the whole PDT 2.0 except for the full data (but including the sample data, all tools, and docs) is available on the web
the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium

Tree editor TrEd
general customizable tree editor implemented in Perl
the main editing and browsing tool in the PDT project

Batch processing of the data
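btred - batch processing version of tred
ntred - networked (parallelized) version of btred
example: print the t_lemma of every node that hangs directly below the tree root and has at least one child whose functor starts with DIR
  $ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep{$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q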

Netgraph
client-server application for on-line PDT search
implemented in Java

Tools for post-annotation consistency checking
hundreds of btred scripts of various types:
technical tests
  e.g. each sentence contains at least one token
  all identifiers are unique, all referred identifiers exist...
m-layer tests
  locative (6th case) cannot occur without a preposition
  improbable word forms (e.g. imperatives haš, tel)
a-layer tests
  not more than one subject in a clause
  attributes (afun Atr) should not appear directly below verbs
t-layer tests
  surface forms of verb arguments match the specifications in the valency lexicon
  relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
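
The two technical tests on identifiers are easy to state precisely. The sketch below is written in Python purely for illustration (the actual checks are btred scripts, which are not shown in the slides); the data representation and function names are assumptions.

```python
from collections import Counter
from typing import List, Tuple

# A unit is modeled as (its own identifier, identifiers it refers to).
Unit = Tuple[str, List[str]]


def duplicate_ids(units: List[Unit]) -> List[str]:
    """Identifiers that are declared more than once."""
    counts = Counter(uid for uid, _ in units)
    return sorted(uid for uid, n in counts.items() if n > 1)


def dangling_refs(units: List[Unit]) -> List[str]:
    """Referred identifiers that no unit declares."""
    known = {uid for uid, _ in units}
    return sorted({ref for _, refs in units for ref in refs if ref not in known})


# Toy data with one repeated identifier and one reference to a missing unit.
units = [("w1", []), ("m1", ["w1"]), ("a1", ["m1"]), ("a1", ["m2"])]

print(duplicate_ids(units))
print(dangling_refs(units))
```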

Tools for automatic annotation
chain of tools for automatic text processing (from a raw text to a-layer trees):
  1. sentence segmentation and tokenization
  2. morphological analysis
  3. morphological disambiguation
  4. dependency parsing (adapted Collins)
  5. analytical function assignment

Tools for format conversions
conversion not only between PDT data formats, but also from other treebanks’ formats
example: constituency trees from Negra in TrEd

PDT 2.0 Documentation
PDT Guide
  overview of all parts of PDT 2.0
  mirrors the directory structure of the PDT 2.0 CD-ROM
Annotation guidelines
  m-layer (~100 pages)
  a-layer (~250 pages)
  t-layer (~800 pages)
Publications
  conference and journal papers, technical reports, theses ...
Technical documentation (software tools and data formats)

Want to experiment with...
tagging?
dependency parsing?
semantic-role labeling?
frame semantics?
word-sense disambiguation?
anaphora resolution?
information structure?
...
Use PDT 2.0, it’s all there!!!

Annotation scheme not limited to Czech
T-layer in English
T-layer in German
A-layer in German
A-layer in Arabic
A-layer in Slovene
A-layer in Romanian

Those involved (some of)

Thank you! BTW, anyone interested in beta-testing?
