The PAULA framework: Automatic and Manual Annotation of Linguistic Data Christian Chiarcos and Manfred Stede Universität Potsdam


1 The PAULA framework: Automatic and Manual Annotation of Linguistic Data Christian Chiarcos and Manfred Stede Universität Potsdam Workshop Processing Pipelines Darmstadt 2008/07/10

2 Overview
Motivation
–Multi-level annotation for discourse structure research
–Multi-level annotation for information structure research
The ANNIS linguistic information system
–multi-level querying and visualization
Example pipelines
–Corpus annotation and exploitation
–PAULA for text summarization
The PAULA format
–Current state, future plans

3 Multi-level annotation (1): Discourse structure(s) Thesis: Coherence of a text is not adequately characterized by the discourse structure (a single tree or graph) but by the interplay of different levels of description, each reflecting a separate dimension of textuality. (In Textlinguistik, this idea is not new (e.g., Motsch 96) but the programme has not been carried through yet.)

4 Impfpflicht gegen Kinderkrankheiten? [1] Kein Kind weiß heute noch, was Pocken sind. [2] So ein Glück. [3] Als die Pockenimpfung 1854 eingeführt wurde, [4] glaubten manche Menschen, [5] dass sich ihr Kopf in einen Kuhkopf verwandelt, [6] wenn sie sich impfen lassen. [7] Denn der Impfstoff wurde damals aus der Haut von Rindern hergestellt. [8] Heute ist diese furchtbare Krankheit ausgerottet. [9] Dank einer entschlossenen, weltweiten Impfkampagne. [10] Aber es gibt noch: Masern, Kinderlähmung, Diphtherie, Mumps, Röteln, Hepatitis B, Tuberkulose, Keuchhusten. [11] Daran sterben, vor allem in den Entwicklungsländern, jährlich immer noch Millionen Kinder. [12] In Deutschland werden diese Krankheiten von vielen Eltern offenbar nicht ernst genommen. [13] Weil sie sie gar nicht mehr kennen! [14] Denn mit Impfstoffen wurde erreicht, [15] dass diese Infektionen nur noch sporadisch auftreten. [16] Doch wer aus eigenem Erleben weiß, [17] wie schrecklich Kinder leiden, [18] wenn sie nur Masern oder Keuchhusten haben, [19] sollte ihnen dies ersparen. [20] Und auch die gesundheitlichen Folgewirkungen. [21] Nur wer impfen lässt, hilft mit, dass Impfungen eines Tages überflüssig werden. [22] Stattdessen wird über Nebenwirkungen von Impfstoffen schwadroniert, [23] die höchst selten auftreten und die man erst Recht nur aus Büchern kennt. [24] Dann gibt es noch das schöne Argument: Das ist mein Kind, das darf der Staat nicht pieken. [25] Gegen solche Eltern hilft auch keine Impfung.

5 Mandatory vaccination against children's diseases? [1] Today, children no longer know what smallpox is. [2] What a joy. [3] When smallpox vaccination was introduced in 1854, [4] quite a few people believed [5] that their head would turn into a cow's head [6] if they got themselves vaccinated. [7] For the vaccine was made from cattle skin at the time. [8] Nowadays this dreadful disease is eradicated. [9] Thanks to a determined, world-wide vaccination campaign. [10] But there still are other diseases: measles, polio, diphtheria, mumps, rubella, hepatitis B, tuberculosis, pertussis. [11] Millions of children still die of these every year, especially in less developed countries. [12] In Germany, many parents apparently don't take these diseases seriously. [13] Because they don't know them anymore! [14] For it has been achieved with vaccines [15] that these infections occur only rarely today. [16] But those who have experienced [17] how terribly children suffer [18] when they come down with just measles or pertussis, [19] should spare them the agony. [20] As well as the long-term consequences. [21] Only those who have their children vaccinated will contribute to vaccines becoming superfluous some day. [22] Instead, people rant about side effects [23] that occur very rarely and are known merely from books. [24] Then there is the great argument: This is my child, the government must not prick her. [25] No vaccine can help against such parents.

6 Mandatory vaccination against children's diseases? [text of slide 5 repeated]

7 Referential Structure

8 Mandatory vaccination against children's diseases? [text of slide 5 repeated]

9 Thematic Structure

10 Mandatory vaccination against children's diseases? [text of slide 5 repeated]

11 Conjunctive Relations (Martin 1992)
temporal
–simultaneous, succession
consequential
–manner, consequence, condition, purpose, concession
comparative
–similarity, contrast, reformulation
additive
–addition, alternation
Relations can be directed but not weighted: there is no nuclearity.

12 Conjunctive Relations (Martin 1992)

13 Intentional structure
Illocutions (inspired by Schmitt 00, Searle 76)
–Reportivum: writer describes a state of affairs
–Identifikativum: writer characterizes own state of mind, health, etc.
–Estimativum: writer presents proposition as probably true
–Evaluativum: writer presents a personal opinion
–Appellativum: writer orders or suggests an action
Support Relations (subset of RST)
–Ease-understanding (Background)
–Encourage-acting (Motivation)
–Ease-acting (Enablement)
–Encourage-believing (Evidence)
–Encourage-appreciating (Antithesis, Concession)
Compare types of argument (e.g., Eggs 00):
–deontic
–epistemic
–ethic/aesthetic

14 Mandatory vaccination against children's diseases? [text of slide 5 repeated]

15 Argument structure (inspired by Freeman 1993)

16 Text understanding: Relating levels of analysis

17 Text understanding: Relations to sentence syntax [German text of slide 4 repeated]

18 Multi-level annotation: syntax tree (Annotate, Synpathy)
[Figure: syntax tree for the NP "Die einstige Fußball-Weltmacht"; tokens tagged ART, ADJA, NN; edges labeled NK]

19 Multi-level annotation: coreference MMAX

20 Multi-level annotation: text tree RST Tool

21 Multi-level annotation: layers Exmaralda

24 Multi-level annotation (2): Information structure - SFB632
B1 (Gur/Kwa languages)
B2 (Tchadic languages)
B4 (Diachronic Germanic / Latin translation)
B6 (Spoken Kiezdeutsch)
C1 (Newspaper German) - see below
C6 (Hindi)
D1 (Newspaper German) - see above
D2 (Questionnaire - 13 different languages)

25 Multi-level annotation (2): Information structure - SFB632
B1 (Gur/Kwa languages) - Shoe/Toolbox, Exmaralda
B2 (Tchadic languages) - Shoe/Toolbox, Exmaralda
B4 (Diachronic Germanic / Latin) - Exmaralda
B6 (Spoken Kiezdeutsch) - Exmaralda
C1 (Newspaper German) - Synpathy, MMAX
C6 (Hindi) - XML
D1 (Newspaper German) - Synpathy, MMAX, RST Tool, Exmaralda
D2 (Questionnaire) - Exmaralda

26 Multi-level annotation (2): Information structure - SFB632
B4 (Diachronic Germanic / Latin) - Exmaralda
–1800 sentences: info structure, syntax, coherence relations
C1 (Newspaper German) - Synpathy, MMAX
–Large text collection with only selected sentences being annotated - see below
D1 (Newspaper German) - Synpathy, MMAX, RST Tool, Exmaralda
–200 texts/2500 sentences, in part with coherence relations, coreference, syntax, info structure
D2 (Questionnaire) - Exmaralda
–GB of audio data / 50K transcribed tokens, in part with phrase structure, info structure

27 SFB632: From annotation tool to database
Database reads PAULA
Conversion scripts map from tool output to PAULA
–Add metadata to documents
–Fix some inconsistent tokenization
Challenges
–Enforce common tokenization across layers (and thus across tools)
–Enforce syntactically correct annotation (Exmaralda)
Manual work
–Check for typos and other errors (wrong type of annotation layer, etc.)
–Repair some inconsistent tokenization

28 ANNIS Database
ANNIS V1: Data resides in main memory
–In use since 2005
ANNIS V2: System with relational DB backend (PostgreSQL)
–To be launched this summer

29 ANNIS query language
Issue queries across annotation layers
–...to combine different realms of information
  givenness=giv & syncat=pp & rhetrel=contrast
–...to check for conflicting annotations within the same realm
  ann1::givenness=new & ann2::givenness=giv & #1 _=_ #2
–...to check for completeness of annotations
  aboutness=ref & !givenness=* & #1 _=_ #2
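The conflict and completeness checks above can be mimicked on a toy in-memory model. The following is a minimal sketch, not the ANNIS implementation: the `Ann` tuple, `same_span`, `conflicts`, and `incomplete` are hypothetical names introduced for illustration, and `same_span` only approximates what the `_=_` operator expresses (two annotations covering the same span).

```python
from collections import namedtuple

# Hypothetical toy model (not the ANNIS data model): an annotation is a
# token span plus a layer name, a value, and the annotator who assigned it.
Ann = namedtuple("Ann", "start end layer value annotator")

def same_span(a, b):
    """Toy counterpart of the `_=_` operator: both cover the same span."""
    return (a.start, a.end) == (b.start, b.end)

def conflicts(anns, layer):
    """Pairs of annotations by different annotators on one span that disagree."""
    hits = [a for a in anns if a.layer == layer]
    return [(a, b) for a in hits for b in hits
            if a.annotator < b.annotator and same_span(a, b) and a.value != b.value]

def incomplete(anns, layer, value, required_layer):
    """Annotations layer=value with no required_layer annotation on the same span."""
    return [a for a in anns
            if a.layer == layer and a.value == value
            and not any(b.layer == required_layer and same_span(a, b) for b in anns)]

anns = [
    Ann(0, 2, "givenness", "new", "ann1"),
    Ann(0, 2, "givenness", "giv", "ann2"),   # disagrees with ann1 on span 0-2
    Ann(3, 5, "givenness", "giv", "ann1"),
    Ann(3, 5, "givenness", "giv", "ann2"),
    Ann(6, 7, "aboutness", "ref", "ann1"),   # no givenness on span 6-7
]

print(conflicts(anns, "givenness"))
print(incomplete(anns, "aboutness", "ref", "givenness"))
```

Unlike the slide's query, which looks for the specific pair new/giv, `conflicts` reports any disagreement between annotators on the same span.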

30 ANNIS V1 Text view and annotation layers

31 ANNIS V2 Search for multiple constituents in the Vorfeld (pre-field)

32 ANNIS V2 Hit list

33 ANNIS V2 Tree view

34 ANNIS V2 Coreference view

35 Availability
ANNIS database V1
ANNIS database V2: later this year
PAULA documentation
Conversion scripts
–AnnotTools to PAULA: Exmaralda, MMAX2, TigerXML, RSTTool, URML, Palinka, generic inline XML
Ontology & Tools*
–extensions for ontology-based corpus querying
–HTML-export for ontologies
* Developed in the DFG project Nachhaltigkeit linguistischer Daten (Sustainability of Linguistic Data; SFB 441, SFB 538, SFB 632)

36 Example pipelines
–Corpus annotation and exploitation
–PAULA for text summarization

37 Corpus annotation pipeline
information structure and word order in German*
–Which contextual conditions license pre-field occupation by non-subject constituents?
annotations
–grammatical annotation: syntax, morphology
–pragmatic annotation: anaphora, bridging, information status
efficient, goal-specific annotation
–partial annotation: selected examples + immediate context
–semiautomatic annotation
* Chiarcos, C., J. Ritz, M. Stede (2008), Investigating non-canonical constructions in context: efficient corpus annotation and retrieval. To be presented at KONVENS 2008, Berlin, October 2008.

38 Corpus annotation pipeline
sample selection
–collect a number of texts
–mark target sentences
automated pre-processing
–tokenization
–parsing
manual annotation
–annotate anaphora and verify syntax
–use standard annotation tools for both tasks
–synchronization (anaphora, grammar)
integration
–conversion to PAULA

39 Corpus annotation pipeline: automated pre-processing
sample selection
–collect a number of texts
–mark target sentences
–convert to plain text with markup
tokenization
–use standard tokenizer
–mark sentence boundaries
–preserve markup
parsing
–BitPar (German version of TracePar)*: POS, morph, TIGER-style syntax
–in case of failure, use TreeTagger/Chunker**: POS, NP/PP-chunks
conversion to TIGER XML
–conversion from bracket format
* Helmut Schmid, Efficient parsing of highly ambiguous context-free grammars with bit vectors. In Proceedings of COLING, Geneva, Switzerland.
** Helmut Schmid, Improvements in Part-of-Speech Tagging with an application to German. In S. Armstrong et al. (eds.), Natural Language Processing using very large corpora, Kluwer, Dordrecht.

40 Corpus annotation pipeline: manual annotation
pre-processing
synchronization
–identify relevant context sentences
–produce TIGER XML
anaphoric annotation
–MMAX*
–converted from TIGER XML
–preserve TIGER ids as annotations
grammatical annotation
–Synpathy**
–correct selected sentences
synchronization
–verify MMAX references to TIGER XML
integration
* Christoph Müller, Michael Strube (2006): Multi-Level Annotation of Linguistic Data with MMAX2. In: Sabine Braun, Kurt Kohn, Joybrato Mukherjee (Eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. Frankfurt: Peter Lang, pp (English Corpus Linguistics, Vol. 3)

41 Corpus annotation pipeline: integration
anaphora
–MMAX format, TIGER ids as annotation values
grammar
–TIGER XML
loss-less conversion to PAULA
–isomorphic to source format annotation
merging
–references to the same token file
–result: merged PAULA project
integration
–replace TIGER ids from with pointing relations to elements
–result: integrated PAULA project

42 Corpus exploitation pipeline
What is the relation between different levels of description?
–information status vs. morphosyntax
–discourse structure vs. anaphora
Qualitative analysis
–Query the corpus for corresponding annotations and analyse these examples manually (cf. ANNIS slides).
Quantitative analysis
–Assess statistical correlations between different annotations.

43 Corpus exploitation pipeline: Quantitative analysis
corpus of PAULA projects
–TIGER XML: POS, morph, syntax
–Exmaralda: information structure
–RST Tool: discourse structure
–MMAX: coreference
conversion to PAULA
integration of multiple annotations of the same set of documents
extraction of feature vectors*
conversion to ARFF
WEKA
–workbench for statistical analyses
–statistical, neural, symbolic classifiers
* So far, no generic ARFF exporter has been developed. ANNIS 2.0 will be augmented with a number of example converters.
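The conversion step from feature vectors to ARFF could look roughly like the sketch below. It is not one of the converters mentioned on the slide: `to_arff`, the relation name, and the feature names are assumptions made for illustration, and only nominal attributes with simple values (no spaces or commas, which would need quoting) are handled.

```python
def to_arff(relation, rows, attributes):
    """Serialize nominal feature vectors (a list of dicts) as an ARFF string."""
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        # A nominal attribute lists every value observed for that feature.
        values = sorted({row[name] for row in rows if row.get(name) is not None})
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with "?".
        lines.append(",".join(row.get(name) or "?" for name in attributes))
    return "\n".join(lines) + "\n"

# Hypothetical feature vectors: one row per referring expression.
rows = [
    {"pos": "PPER", "givenness": "giv"},
    {"pos": "NN", "givenness": "new"},
    {"pos": "NN", "givenness": None},
]
arff = to_arff("infstat", rows, ["pos", "givenness"])
print(arff)
```

The resulting text can be saved with an `.arff` extension and loaded into WEKA for feature selection and classification.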

44 Corpus exploitation pipeline Quantitative analysis with WEKA Preprocessing selecting relevant features from an ARFF feature list

45 Corpus exploitation pipeline Quantitative analysis with WEKA example analysis (decision tree) information status and referring expressions in German (Potsdam Commentary Corpus)

46 NLP pipeline
Summarization project*
–high-quality summarization
–syntax, coreference, text structure, causal markers
–PAULA as exchange format between different NLP modules
–output of different modules is to be combined; these may also run in parallel
–specific requirements for the exchange format
* Stede, M., H. Bieler, S. Dipper, and A. Suriyawongkul (2006). SUMMaR: Combining Linguistics and Statistics for Text Summarization. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI-06)

47 Summarization, architecture
Preprocessing Modules
–Layout Structure and Metadata Extraction
–Text Structure Extraction
–Tokenization and Sentence Boundary Detection
Flexible Modules
–Syntactical Analysis (Connexor)
–Structure Weight Calculation
–Discourse Marker Annotation
–Term Weight Calculation
–Treetagger
–Topic Segmentation
–Number and Time Annotation
–Coreference Analysis (Rosana)
Final Modules
–Merging
–Summary Calculation
–Graphical Representation
flexible modules can be arranged in any order in the pipeline or be processed non-sequentially
PAULA as common interchange format

48 Summarization pipeline
Preprocessing Modules
–Layout Structure and Metadata Extraction
–Text Structure Extraction
–Tokenization and Sentence Boundary Detection
Flexible Modules (selection)
–Syntactic Analysis (Connexor)
–Term Weight Calculation
–Coreference Analysis (Rosana)
–Topic Segmentation
–Robust Morphosyntactic Analysis (TreeTagger)
Final Modules
–Merging
–Summary Calculation
–Graphical Representation

49 Summarization pipeline: A fragment
[Figure: module overview as on slide 48, with one flexible module shown as ???]
Fragment, in processing order
–PAULA, coming from a preprocessing module
–Transforming relevant PAULA annotations to Connexor input format
–Syntactic Analysis (Connexor)
–Coreference Analysis (Rosana)*
–Transforming Rosana output to PAULA
–PAULA, to be processed by other components in the summarization pipeline
components in the pipeline are wrapped to become consumers and generators of PAULA
* Rosana requires Connexor as input format, hence, the mapping to PAULA is skipped at this point

50 Summarization pipeline: A fragment
[Figure: module overview and Connexor/Rosana fragment as on slide 49]
Merging
–multiple annotation layers in one PAULA project
–one single PAULA project comprising annotations from different modules

51 Requirements for an interchange format for summarization
advantages
–scalability
–modularization
requirements
–supporting merge and split operations for annotations of the same document
–clear conceptual separation of annotations

52 The PAULA format
–Current state, future plans

53 PAULA format: desiderata I
PAULA: Potsdamer Austauschformat Linguistischer Annotationen (Potsdam Interchange Format for Linguistic Annotation)
designed with the following premises
–very general, annotation-specific format supporting
  multi-layer annotations for information structural (and other) phenomena
  conflicting hierarchies (RST vs. syntax)
  pointing references (e.g., anaphora)

54 PAULA format: desiderata II
designed with the following premises
–high coverage: loss-less representation of information from a multitude of input formats and tools
  TIGER XML, Exmaralda, MMAX, RSTTool
  Connexor, Rosana, Brill Tagger

55 PAULA format: desiderata III
designed with the following premises
–merging and splitting operations
  self-contained annotation layers
  extraction/addition of new annotation layers with minimal effects on other annotation layers
–XML

56 PAULA format: An interlingua for tools
Radical standoff
–each annotation layer stored in a separate file
systematic application of xlinks
–for non-tree fragments: crossing branches for discontinuous constituents, anaphoric annotation
Make as few structural commitments as possible
–a wide variety of data formats can be represented, as opposed to earlier, task-specific formats
design inspired by early drafts for LAF
–conceptually related to GrAF (Ide & Suderman (2007))

57 PAULA format: Basic elements
mark (markable): span of text which is subject to annotation, e.g. a token
struct (structure): node in a hierarchical (tree or tree-like) structure
rel (relation): relation between struct or mark elements
feat (feature): annotation attached to a mark, struct, or rel element
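The four basic elements can be pictured as a small in-memory object model. This is a hedged sketch only: the class layout, the ids (`tok_1`, `const_1`, `rel_1`), and the use of character offsets are assumptions for illustration and do not reproduce the PAULA Object Model. The data is the noun phrase used as the running example on the following slides.

```python
from dataclasses import dataclass, field

TEXT = "Die einstige Fußball-Weltmacht"  # primary data

@dataclass
class Mark:                 # span of primary data, here a token
    id: str
    start: int              # character offsets into TEXT (end exclusive)
    end: int
    type: str = "tok"

@dataclass
class Rel:                  # edge from a struct to a dominated struct or mark
    id: str
    target: str

@dataclass
class Struct:               # node in a hierarchical structure
    id: str
    rels: list = field(default_factory=list)

@dataclass
class Feat:                 # annotation attached to a mark, struct, or rel
    target: str
    name: str
    value: str

marks = {m.id: m for m in (Mark("tok_1", 0, 3), Mark("tok_2", 4, 12), Mark("tok_3", 13, 30))}
np_node = Struct("const_1", [Rel("rel_1", "tok_1"), Rel("rel_2", "tok_2"), Rel("rel_3", "tok_3")])
structs = {np_node.id: np_node}
feats = [
    Feat("const_1", "cat", "NP"),
    Feat("rel_1", "func", "NK"), Feat("rel_2", "func", "NK"), Feat("rel_3", "func", "NK"),
    Feat("tok_1", "pos", "ART"), Feat("tok_2", "pos", "ADJA"), Feat("tok_3", "pos", "NN"),
]

def covered_text(node_id):
    """Collect the token strings dominated by a struct (or covered by a mark)."""
    if node_id in marks:
        m = marks[node_id]
        return [TEXT[m.start:m.end]]
    return [w for rel in structs[node_id].rels for w in covered_text(rel.target)]

print(covered_text("const_1"))
```

Keeping `Feat` separate from the elements it annotates mirrors the idea that annotations are attached from the outside rather than stored inside the structure.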

58 PAULA format: Basic elements of syntax annotation (Annotate, TIGERSearch, Synpathy)
[Figure: syntax tree for the NP "Die einstige Fußball-Weltmacht"; tokens tagged ART, ADJA, NN; edges labeled NK]

59 PAULA format: Basic elements of syntax annotation
PAULA representation of structure elements (struct, mark)
[Figure: the same tree decomposed into primary data, mark elements (tokens, type tok), struct elements, and rel elements]

60 PAULA format: Basic elements of syntax annotation
PAULA representation of annotation elements (feat)
[Figure: feat annotations added to the decomposed tree: cat=NP, func=NK, POS=ART, POS=ADJA, POS=NN]

61 PAULA format: Physical representation
[Figure: the NP "Die einstige Fußball-Weltmacht" split into primary data (text.xml), mark elements of type tok (tok.xml), and struct/rel elements (syntax.xml), with feats cat=NP, func=NK, POS=ART, POS=ADJA, POS=NN]
Every type of structure (primary data, mark, struct) is represented in an individual file.
struct and rel together encode hierarchical structures.

62 PAULA format: Physical representation
Dominance relations are represented by the XML hierarchy between struct and rel (inline XML fragment).

63 PAULA format: Physical representation
Dominance relations are represented by the XML hierarchy between struct and rel (inline XML fragment) and by xlinks/xpointers between rel and the dominated struct/mark.

64 PAULA format: Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file.
marks refer to token sequences.

65 PAULA format: Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file.
marks of type tok refer to spans of primary data.

66 PAULA format: Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file.
marks of type tok refer to spans of primary data by means of xlink/xpointer.

67 PAULA format: Physical representation
For every annotation layer, every type of feat is also represented in a separate file (cat_func.xml, pos.xml).

68 PAULA format: Physical representation
Feats are attached to mark/struct elements by means of xlink/xpointer expressions (cat_func.xml, pos.xml).
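How the separate files refer to each other can be sketched with three tiny XML documents held as Python strings. The element and attribute names below are modeled only loosely on PAULA and have not been checked against its schema; in particular the `string-range` pointer form, the ids, and the helper names `token_spans` and `attach_feats` are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
NS = 'xmlns:xlink="http://www.w3.org/1999/xlink"'

# Illustrative stand-ins for text.xml, tok.xml and pos.xml (assumed layout).
TEXT_XML = "<paula><body>Die einstige Fußball-Weltmacht</body></paula>"
TOK_XML = f"""<paula {NS}><markList type="tok">
  <mark id="tok_1" xlink:href="#xpointer(string-range(//body,'',1,3))"/>
  <mark id="tok_2" xlink:href="#xpointer(string-range(//body,'',5,8))"/>
  <mark id="tok_3" xlink:href="#xpointer(string-range(//body,'',14,17))"/>
</markList></paula>"""
POS_XML = f"""<paula {NS}><featList type="pos">
  <feat xlink:href="#tok_1" value="ART"/>
  <feat xlink:href="#tok_2" value="ADJA"/>
  <feat xlink:href="#tok_3" value="NN"/>
</featList></paula>"""

# 1-based start position and length, as assumed for the pointers above.
RANGE = re.compile(r"string-range\(//body,'',(\d+),(\d+)\)")

def token_spans(text_xml, tok_xml):
    """Resolve every tok mark to the substring of primary data it points to."""
    body = ET.fromstring(text_xml).find("body").text
    spans = {}
    for mark in ET.fromstring(tok_xml).iter("mark"):
        match = RANGE.search(mark.get(XLINK_HREF))
        if match is None:
            raise ValueError(f"unsupported pointer on {mark.get('id')}")
        start, length = int(match.group(1)), int(match.group(2))
        spans[mark.get("id")] = body[start - 1:start - 1 + length]
    return spans

def attach_feats(spans, feat_xml):
    """Attach feats to tokens, rejecting references to unknown targets."""
    root = ET.fromstring(feat_xml)
    name = root.find("featList").get("type")
    result = []
    for feat in root.iter("feat"):
        target = feat.get(XLINK_HREF).lstrip("#")
        if target not in spans:
            raise KeyError(f"dangling reference: {target}")
        result.append((spans[target], name, feat.get("value")))
    return result

print(attach_feats(token_spans(TEXT_XML, TOK_XML), POS_XML))
```

The explicit check for dangling targets reflects that, in a standoff layout, reference validity is not guaranteed by XML well-formedness alone and has to be verified by the consumer.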

69 PAULA format: Achievements
Generic format
–capable of representing hierarchical structures
  struct elements correspond to nodes
  struct/rel elements correspond to dominance relations
–capable of representing flat, layer-based annotations*
  mark elements correspond to spans of text without hierarchical structure
–capable of representing pointing relations*
  rel elements without a dominating struct element represent non-dominance relations
–capable of representing any annotation assigned to these
  feat elements may point to any struct, mark, or rel element
* not shown here

70 PAULA format: Achievements
Hierarchies are modelled by means of xlinks
–may represent any kind of dominance relation using the same mechanism, including discontinuous segments and crossing edges
Represents every annotation layer on its own
–structures from different annotation layers do not interfere with each other, e.g. conflicting hierarchies
–addition or removal of another annotation layer does not affect the representation of the remaining layers

71 PAULA format: Achievements
Addition of annotation layers and merging annotation projects is easy
–if two annotation projects exist for one piece of primary data:*
  redirect all references to the token layer to the common token layer
  register the new annotation layer
Removal of annotation layers is trivial
–if an annotation layer is to be removed:
  remove the registration of the annotation layer in the current annotation project
* Merging of two annotation projects requires identical tokenization, more in a minute.
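The merge and removal steps can be illustrated on a simplified project structure. This is a sketch under stated assumptions: a project is modeled as a plain dictionary with a token list and a registry of layers, which is not how PAULA stores projects on disk, and `merge_projects` and `remove_layer` are hypothetical helper names.

```python
def merge_projects(base, other):
    """Merge the layers of `other` into `base`; both must share one tokenization."""
    if [t["text"] for t in base["tokens"]] != [t["text"] for t in other["tokens"]]:
        raise ValueError("tokenizations differ; align them before merging")
    # Redirect references from other's token layer to the common token layer.
    redirect = {o["id"]: b["id"] for o, b in zip(other["tokens"], base["tokens"])}
    merged = {"tokens": base["tokens"], "layers": dict(base["layers"])}
    for name, annotations in other["layers"].items():
        if name in merged["layers"]:
            raise ValueError(f"layer already registered: {name}")
        # Register the new layer with its references rewritten.
        merged["layers"][name] = [([redirect[t] for t in toks], value)
                                  for toks, value in annotations]
    return merged

def remove_layer(project, name):
    """Removing a layer only removes its registration; other layers are untouched."""
    layers = {k: v for k, v in project["layers"].items() if k != name}
    return {"tokens": project["tokens"], "layers": layers}

syntax = {
    "tokens": [{"id": "tok_1", "text": "Die"}, {"id": "tok_2", "text": "einstige"}],
    "layers": {"pos": [(["tok_1"], "ART"), (["tok_2"], "ADJA")]},
}
coref = {
    "tokens": [{"id": "t1", "text": "Die"}, {"id": "t2", "text": "einstige"}],
    "layers": {"coref": [(["t1", "t2"], "chain_1")]},
}

merged = merge_projects(syntax, coref)
print(merged["layers"]["coref"])
print(sorted(remove_layer(merged, "pos")["layers"]))
```

Because every layer refers only to the token layer, neither operation needs to inspect or rewrite the content of the other layers.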

72 PAULA format: Some minor disadvantages
Overhead
–for a project with n annotation layers with different annotations, at least 2n+2 files are created
Only partially human readable
–information distributed across multiple files

73 PAULA format: More serious problems
Hard to process using scripting languages
–validity of xlink references must be verified
Maintenance
–there are a number of (quite elaborate) converters from and to PAULA
–any extension of the original format requires all these converters to be updated
Merging annotation projects with different tokenization
–correction of tokenization is regularly required, e.g., in the output of tools that are insensitive to tokenization (RSTTool) or re-tokenize (Connexor)

74 PAULA: Recent developments
Currently, the PAULA Java API is under development, including
–an implementation of the PAULA Object Model
–a parser for PAULA (downward-compatible)
–serialization facilities (downward-compatible)
–routines for standard operations, e.g. aligning divergent tokenizations
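Aligning divergent tokenizations can be approached by grouping tokens until both sides cover the same stretch of text. The sketch below is not the routine of the PAULA Java API; `align` is a hypothetical function that assumes both tokenizations consist of non-empty tokens and spell out exactly the same characters once whitespace is ignored.

```python
def align(tokens_a, tokens_b):
    """Group token indices of two tokenizations into minimal matching chunks."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("tokenizations cover different text")
    groups, i, j = [], 0, 0
    while i < len(tokens_a):
        start_i, start_j = i, j
        len_a, len_b = len(tokens_a[i]), len(tokens_b[j])
        i, j = i + 1, j + 1
        # Extend the shorter side until both chunks end at the same character.
        while len_a != len_b:
            if len_a < len_b:
                len_a += len(tokens_a[i])
                i += 1
            else:
                len_b += len(tokens_b[j])
                j += 1
        groups.append((list(range(start_i, i)), list(range(start_j, j))))
    return groups

coarse = ["Die", "einstige", "Fußball-Weltmacht"]
fine = ["Die", "einstige", "Fußball", "-", "Weltmacht"]
print(align(coarse, fine))
```

Each resulting pair lists the token indices on both sides that cover the same characters, which is the information needed to redirect annotations from one token layer to the other.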

75 PAULA: Forthcoming
Intended extensions of PAULA concern
–sub-token annotations: morphemes, tones, etc.
–parallel corpora: multiple streams of primary data
–integration of media files

76 What we've shown
need for multi-level annotation (MLA) for annotation and processing of pragmatic (and other linguistic) phenomena
ANNIS, a tool for the querying and visualization of MLA
example pipelines involving MLA
–typical problems: synchronization, adding/removal operations, expressivity of existing formats

77 What we've shown
typical problems when processing MLA
–synchronization
–adding/removal operations
–expressivity of existing formats
premises for the development of PAULA
–generic format specifically designed for linguistic annotations
–"radical standoff"

78 What we've shown
Problems of radical standoff formats
–excessive use of xlinks
–hard to read
–validation facilities
... and a solution to these
–PAULA API

79 Thank you

80 ... and thanks to the team: Anke Lüdeling (HUB), Ulf Leser (HUB) Heike Bieler (UP), Michael Götze (UP), Julia Ritz (UP), Amir Zeldes (HUB), Uwe Küssner (ext), {Stefanie Dipper, Tillmann Wegst} Karsten Hütter (HUB), Christian Lemke (UP), Viktor Rosenfeld (HUB), Florian Zipser (UP)

