Christian Chiarcos and Manfred Stede Universität Potsdam


1 The PAULA framework: Automatic and Manual Annotation of Linguistic Data
Christian Chiarcos and Manfred Stede, Universität Potsdam
Workshop "Processing Pipelines", Darmstadt, 2008/07/10

2 Overview
Motivation
  Multi-level annotation for discourse structure research
  Multi-level annotation for information structure research
The ANNIS linguistic information system
  multi-level querying and visualization
Example pipelines
  Corpus annotation and exploitation
  PAULA for text summarization
The PAULA format
  Current state, future plans

3 Multi-level annotation (1): Discourse structure(s)
Thesis: The coherence of a text is not adequately characterized by "the" discourse structure (a single tree or graph) but by the interplay of different levels of description, each reflecting a separate dimension of textuality. In text linguistics (Textlinguistik), this idea is not new (e.g., Motsch 96), but the programme has not yet been carried through.

4 Impfpflicht gegen Kinderkrankheiten?
[1] Kein Kind weiß heute noch, was Pocken sind. [2] So ein Glück. [3] Als die Pockenimpfung 1854 eingeführt wurde, [4] glaubten manche Menschen, [5] dass sich ihr Kopf in einen Kuhkopf verwandelt, [6] wenn sie sich impfen lassen. [7] Denn der Impfstoff wurde damals aus der Haut von Rindern hergestellt. [8] Heute ist diese furchtbare Krankheit ausgerottet. [9] Dank einer entschlossenen, weltweiten Impfkampagne. [10] Aber es gibt noch: Masern, Kinderlähmung, Diphtherie, Mumps, Röteln, Hepatitis B, Tuberkulose, Keuchhusten. [11] Daran sterben, vor allem in den Entwicklungsländern, jährlich immer noch Millionen Kinder. [12] In Deutschland werden diese Krankheiten von vielen Eltern offenbar nicht ernst genommen. [13] Weil sie sie gar nicht mehr kennen! [14] Denn mit Impfstoffen wurde erreicht, [15] dass diese Infektionen nur noch sporadisch auftreten. [16] Doch wer aus eigenem Erleben weiß, [17] wie schrecklich Kinder leiden, [18] wenn sie ‚nur‘ Masern oder Keuchhusten haben, [19] sollte ihnen dies ersparen. [20] Und auch die gesundheitlichen Folgewirkungen. [21] Nur wer impfen lässt, hilft mit, dass Impfungen eines Tages überflüssig werden. [22] Stattdessen wird über Nebenwirkungen von Impfstoffen schwadroniert, [23] die höchst selten auftreten und die man erst Recht nur aus Büchern kennt. [24] Dann gibt es noch das schöne Argument: Das ist mein Kind, das darf der Staat nicht pieken. [25] Gegen solche Eltern hilft auch keine Impfung.

5 Mandatory vaccination against children's diseases?
[1] Today, children no longer know what smallpox is. [2] What a blessing. [3] When smallpox vaccination was introduced in 1854, [4] quite a few people believed [5] that their head would turn into a cow's head [6] if they got themselves vaccinated. [7] For the vaccine was made from the skin of cattle at the time. [8] Nowadays this dreadful disease is eradicated. [9] Thanks to a determined, world-wide vaccination campaign. [10] But there still are other diseases: measles, polio, diphtheria, mumps, rubella, hepatitis B, tuberculosis, pertussis. [11] Millions of children still die of these every year, especially in developing countries. [12] In Germany, many parents apparently don't take these diseases seriously. [13] Because they don't know them anymore! [14] For it has been achieved with vaccines [15] that these infections occur only sporadically today. [16] But those who know from their own experience [17] how terribly children suffer [18] when they come down with 'just' measles or pertussis [19] should spare them the agony. [20] As well as the long-term consequences. [21] Only those who have their children vaccinated help to make vaccinations superfluous some day. [22] Instead, people rant about side effects of vaccines [23] that occur very rarely and are known merely from books. [24] Then there is the great argument: This is my child, the government must not prick her. [25] No vaccine can help against such parents.

6 Mandatory vaccination against children's diseases?

7 Referential Structure

8 Mandatory vaccination against children's diseases?

9 Thematic Structure

10 Mandatory vaccination against children's diseases?

11 Conjunctive Relations
temporal: simultaneous, succession
consequential: manner, consequence, condition, purpose, concession
comparative: similarity, contrast, reformulation
additive: addition, alternation
Relations can be directed but not weighted: there is no nuclearity (Martin 1992)

12 Conjunctive Relations
(Martin 1992)

13 Intentional structure
Illocutions (inspired by Schmitt 00, Searle 76)
  Reportivum: writer describes a state of affairs
  Identifikativum: writer characterizes own state of mind, health, etc.
  Estimativum: writer presents proposition as probably true
  Evaluativum: writer presents a personal opinion
  Appellativum: writer orders or suggests an action
Support relations (subset of RST)
  Ease-understanding (Background)
  Encourage-acting (Motivation)
  Ease-acting (Enablement)
  Encourage-believing (Evidence)
  Encourage-appreciating (Antithesis, Concession)
Compare "types of argument" (e.g., Eggs 00): deontic, epistemic, ethical/aesthetic

14 Mandatory vaccination against children's diseases?

15 Argument structure (inspired by Freeman 1993)

16 Text understanding: Relating levels of analysis

17 Text understanding: Relations to sentence syntax
Impfpflicht gegen Kinderkrankheiten? [1] Kein Kind weiß heute noch, was Pocken sind. [2] So ein Glück. [3] Als die Pockenimpfung 1854 eingeführt wurde, [4] glaubten manche Menschen, [5] dass sich ihr Kopf in einen Kuhkopf verwandelt, [6] wenn sie sich impfen lassen. [7] Denn der Impfstoff wurde damals aus der Haut von Rindern hergestellt. [8] Heute ist diese furchtbare Krankheit ausgerottet. [9] Dank einer entschlossenen, weltweiten Impfkampagne. [10] Aber es gibt noch: Masern, Kinderlähmung, Diphtherie, Mumps, Röteln, Hepatitis B, Tuberkulose, Keuchhusten. [11] Daran sterben, vor allem in den Entwicklungsländern, jährlich immer noch Millionen Kinder. [12] In Deutschland werden diese Krankheiten von vielen Eltern offenbar nicht ernst genommen. [13] Weil sie sie gar nicht mehr kennen! [14] Denn mit

18 Multi-level annotation: syntax tree
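[Figure: syntax tree of the NP "Die einstige Fußball-Weltmacht": Die (ART), einstige (ADJA), Fußball-Weltmacht (NN), each attached to NP by an edge labelled NK]
Tools: Annotate, Synpathy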

19 Multi-level annotation: coreference
MMAX

20 Multi-level annotation: text tree
RST Tool

21 Multi-level annotation: layers
Exmaralda

24 Multi-level annotation (2): Information structure - SFB632
B1 (Gur/Kwa languages)
B2 (Chadic languages)
B4 (Diachronic Germanic / Latin translation)
B6 (Spoken "Kiezdeutsch")
C1 (Newspaper German), same data as D1
C6 (Hindi)
D1 (Newspaper German), same data as C1
D2 (Questionnaire, 13 different languages)

25 Multi-level annotation (2): Information structure - SFB632
B1 (Gur/Kwa languages): Shoebox/Toolbox, Exmaralda
B2 (Chadic languages): Shoebox/Toolbox, Exmaralda
B4 (Diachronic Germanic / Latin): Exmaralda
B6 (Spoken "Kiezdeutsch"): Exmaralda
C1 (Newspaper German): Synpathy, MMAX
C6 (Hindi): XML
D1 (Newspaper German): Synpathy, MMAX, RST Tool, Exmaralda
D2 (Questionnaire): Exmaralda

26 Multi-level annotation (2): Information structure - SFB632
B4 (Diachronic Germanic / Latin) - Exmaralda 1800 sentences: info structure, syntax, coherence relations C1 (Newspaper German) - Synpathy, MMax Large text collection with only selected sentences being annotated - see below D1 (Newspaper German) - Syn, MMax, RST, Exm 200 texts/2500 sentences, in part with coherence relations, coreference, syntax, info structure D2 (Questionnaire) - Exmaralda GB of audio data / 50K transcribed tokens, in part with phrase structure, info structure

27 SFB632: From annotation tool to database
Database reads PAULA
Conversion scripts map from tool output to PAULA
  Add metadata to documents
  Fix some inconsistent tokenization
Challenges
  Enforce common tokenization across layers (and thus across tools)
  Enforce syntactically correct annotation (Exmaralda)
Manual work
  Check for typos and other errors (wrong type of annotation layer, etc.)
  Repair some inconsistent tokenization

28 ANNIS Database
ANNIS V1: Data resides in main memory
  In use since 2005
ANNIS V2: System with relational DB backend (PostgreSQL)
  To be launched this summer

29 ANNIS query language
Issue queries across annotation layers
...to combine different realms of information
  givenness=giv & syncat=pp & rhetrel=contrast
...to check for conflicting annotations within the same realm
  ann1::givenness=new & ann2::givenness=giv & #1 _=_ #2
...to check for completeness of annotations
  aboutness=ref & !givenness=* & #1 _=_ #2
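
To make the second query concrete, here is a minimal Python sketch of the same check outside ANNIS: find pairs of annotations from two annotators that cover the identical token span but disagree on givenness. It is not ANNIS code; the `Annotation` record, the function names and the toy data are assumptions made for illustration, and `same_span` only mimics the reading of `#1 _=_ #2` suggested by the slide (both nodes cover the same tokens).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    annotator: str   # e.g. "ann1"
    name: str        # annotation name, e.g. "givenness"
    value: str       # annotation value, e.g. "giv"
    start: int       # index of the first covered token
    end: int         # index of the last covered token (inclusive)

def same_span(a, b):
    """Toy counterpart of `#1 _=_ #2`: both cover exactly the same tokens."""
    return (a.start, a.end) == (b.start, b.end)

def find_conflicts(annos, name, value1, value2):
    """Pairs (a, b) where ann1 says value1 and ann2 says value2 on one span."""
    firsts = [a for a in annos
              if a.annotator == "ann1" and a.name == name and a.value == value1]
    seconds = [b for b in annos
               if b.annotator == "ann2" and b.name == name and b.value == value2]
    return [(a, b) for a in firsts for b in seconds if same_span(a, b)]

annos = [
    Annotation("ann1", "givenness", "new", 0, 2),
    Annotation("ann2", "givenness", "giv", 0, 2),  # disagrees with the line above
    Annotation("ann1", "givenness", "new", 3, 4),
    Annotation("ann2", "givenness", "new", 3, 4),  # agreement, not a conflict
]
conflicts = find_conflicts(annos, "givenness", "new", "giv")
```

Only the first pair is reported, because the second pair agrees on the value.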

30 ANNIS V1 Text view and annotation layers

31 ANNIS V2 Search for multiple constituents in the Vorfeld

32 ANNIS V2 Hit list

33 ANNIS V2 Tree view

34 ANNIS V2 Coreference view

35 Availability
ANNIS database V1
ANNIS database V2: later this year
PAULA documentation
Conversion scripts from annotation tools to PAULA
  Exmaralda, MMAX2, TigerXML, RSTTool, URML, Palinka, generic inline XML
Ontology & Tools*
  extensions for ontology-based corpus querying
  HTML export for ontologies
* Developed in the DFG project "Nachhaltigkeit linguistischer Daten" (Sustainability of Linguistic Data; SFB 441, SFB 538, SFB 632)

36 Example pipelines
Corpus annotation and exploitation
PAULA for text summarization

37 Corpus annotation pipeline
Information structure and word order in German*
  What contextual conditions license pre-field occupation of non-subject constituents?
Annotations
  grammatical annotation: syntax, morphology
  pragmatic annotation: anaphora, bridging, information status
Efficient, goal-specific annotation
  partial annotation: selected examples + immediate context
  semiautomatic annotation
* Chiarcos, C., J. Ritz, M. Stede (2008), Investigating non-canonical constructions in context: efficient corpus annotation and retrieval. To be presented at KONVENS 2008, Berlin, October 2008.

38 Corpus annotation pipeline
sample selection: collect a number of texts, mark target sentences
automated pre-processing: tokenization, parsing
manual annotation: annotate anaphora and verify syntax; use standard annotation tools for both tasks
synchronization: anaphora, grammar
integration: conversion to PAULA

39 Corpus annotation pipeline
sample selection
  collect a number of texts
  mark target sentences
  convert to plain text with markup
automated pre-processing
  tokenization: use standard tokenizer, mark sentence boundaries, preserve markup
  parsing: BitPar (German version of TracePar)*: POS, morph, TIGER-style syntax
    in case of failure, use TreeTagger/Chunker**: POS, NP/PP-chunks
  conversion to TIGER XML: conversion from bracket format
* Helmut Schmid, Efficient parsing of highly ambiguous context-free grammars with bit vectors. In Proceedings of COLING, Geneva, Switzerland.
** Helmut Schmid, Improvements in Part-of-Speech Tagging with an application to German. In S. Armstrong et al. (eds.), Natural Language Processing using very large corpora, Kluwer, Dordrecht.

40 Corpus annotation pipeline
pre-processing: produce TIGER XML
manual annotation
  anaphoric annotation: MMAX*
    converted from TIGER XML
    preserve TIGER ids as annotations
  synchronization: identify relevant context sentences
  grammatical annotation: Synpathy
    correct selected sentences
  synchronization: verify MMAX references to TIGER XML
integration
* Christoph Müller, Michael Strube (2006): Multi-Level Annotation of Linguistic Data with MMAX2. In: Sabine Braun, Kurt Kohn, Joybrato Mukherjee (Eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. Frankfurt: Peter Lang, pp (English Corpus Linguistics, Vol. 3)

41 Corpus annotation pipeline
anaphora: MMAX format, TIGER ids as annotation values
grammar: TIGER XML
integration (this is where PAULA finally comes into play)
  conversion to PAULA: loss-less conversion, isomorphic to source format
  merging: references to the same token file, yielding a merged PAULA project
  integration: replace TIGER ids from with pointing relations to elements, yielding an integrated PAULA project

42 Corpus exploitation pipeline
What is the relation between different levels of description?
  information status vs. morphosyntax
  discourse structure vs. anaphora
Qualitative analysis
  Query the corpus for corresponding annotations and analyse these examples manually (cf. ANNIS slides).
Quantitative analysis
  Assess statistical correlations between different annotations.

43 Corpus exploitation pipeline: Quantitative analysis
Source formats and annotations
  TIGER XML: POS, morph, syntax
  Exmaralda: information structure
  RST Tool: discourse structure
  MMAX: coreference
conversion to PAULA: corpus of PAULA projects
  integration of multiple annotations of the same set of documents
conversion to ARFF: extraction of feature vectors
  So far, no generic ARFF exporter has been developed. ANNIS 2.0 will be augmented with a number of example converters.
WEKA: workbench for statistical analyses
  statistical, neural, symbolic classifiers
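
Since the slide notes that no generic ARFF exporter exists, the following Python sketch only illustrates what a hand-written, task-specific converter might look like once feature vectors have been extracted from the merged annotations. The function `to_arff`, the relation name `infostat` and the feature names are hypothetical; the output follows the basic ARFF layout (`@relation`, nominal `@attribute` declarations, `@data`, `?` for missing values) and assumes values without spaces, commas or quotes.

```python
def to_arff(relation, rows, attributes):
    """Serialize feature dicts as ARFF text with nominal attributes.

    Assumes every value is a simple token (no spaces, commas or quotes).
    """
    lines = ["@relation " + relation, ""]
    for attr in attributes:
        # A nominal attribute lists every value observed in the data.
        values = sorted({row[attr] for row in rows if attr in row})
        lines.append("@attribute " + attr + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    for row in rows:
        # "?" marks a missing value in ARFF.
        lines.append(",".join(row.get(attr, "?") for attr in attributes))
    return "\n".join(lines) + "\n"

# Toy feature vectors: one row per referring expression (invented data).
rows = [
    {"pos": "PPER", "givenness": "giv"},
    {"pos": "NN", "givenness": "new"},
    {"pos": "NN"},  # givenness not annotated for this instance
]
arff = to_arff("infostat", rows, ["pos", "givenness"])
```

The resulting text can be written to a `.arff` file and loaded into WEKA for the kind of analysis shown on the following slides.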

44 Corpus exploitation pipeline Quantitative analysis with WEKA
Preprocessing selecting relevant features from an ARFF feature list

45 Corpus exploitation pipeline Quantitative analysis with WEKA
example analysis (decision tree) information status and referring expressions in German (Potsdam Commentary Corpus)

46 NLP pipeline
Summarization project*
  high-quality summarization
  syntax, coreference, text structure, causal markers
PAULA as exchange format between different NLP modules
  output of different modules is to be combined
  these may also run in parallel
  specific requirements for the exchange format
* Stede, M., H. Bieler, S. Dipper, and A. Suriyawongkul (2006). SUMMaR: Combining Linguistics and Statistics for Text Summarization. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI-06)

47 Summarization, architecture
[Architecture diagram with three module groups: Preprocessing Modules, Flexible Modules, Final Modules]
Modules shown: Layout Structure and Metadata Extraction; Tokenization and Sentence Boundary Detection; Text Structure Extraction; Syntactical Analysis (Connexor); Treetagger; Coreference Analysis (Rosana); Topic Segmentation; Discourse Marker Annotation; Number and Time Annotation; Term Weight Calculation; Structure Weight Calculation; Merging; Summary Calculation; Graphical Representation
Flexible modules can be arranged in any order in the pipeline or be processed non-sequentially.
PAULA as common interchange format

48 Summarization pipeline
[Pipeline diagram with three module groups: Preprocessing Modules, Flexible Modules (selection), Final Modules]
Modules shown: Layout Structure and Metadata Extraction; Tokenization and Sentence Boundary Detection; Text Structure Extraction; Syntactic Analysis (Connexor); Robust Morphosyntactic Analysis (TreeTagger); Coreference Analysis (Rosana); Topic Segmentation; Term Weight Calculation; Merging; Summary Calculation; Graphical Representation

49 Summarization pipeline A fragment
[Pipeline diagram as on the previous slide, zooming in on the fragment around Connexor and Rosana]
  PAULA coming from a preprocessing module
  Transforming relevant PAULA annotations to Connexor input format
  Syntactic Analysis (Connexor)
  Coreference Analysis (Rosana)*
  Transforming Rosana output to PAULA
  PAULA to be processed by other components in the summarization pipeline
Components in the pipeline are "wrapped" to become consumers and generators of PAULA.
* Rosana requires Connexor as input format; hence, the mapping to PAULA is skipped at this point.
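
The "wrapping" idea can be sketched in a few lines of Python. This is an illustration under strong simplifications, not the SUMMaR implementation: a project is modelled as a plain dict from layer names to annotations rather than as PAULA files, and `wrap`, `tokenize` and `tag_capitalized` are hypothetical stand-ins for real modules and their format converters.

```python
def wrap(module, needs, produces):
    """Turn a plain function into a consumer and generator of a layered project.

    The wrapped module reads only the layers named in `needs`; its output is
    stored as one new layer, and all existing layers are passed on unchanged.
    """
    def step(project):
        inputs = {name: project[name] for name in needs}
        result = dict(project)              # leave the incoming project untouched
        result[produces] = module(**inputs)
        return result
    return step

def tokenize(text):
    # Stand-in for a real tokenizer.
    return text.split()

def tag_capitalized(tok):
    # Stand-in for a real tagger: flags tokens that start with an upper-case letter.
    return [t[0].isupper() for t in tok]

pipeline = [
    wrap(tokenize, needs=["text"], produces="tok"),
    wrap(tag_capitalized, needs=["tok"], produces="cap"),
]

project = {"text": "Die einstige Fußball-Weltmacht"}
for step in pipeline:
    project = step(project)
```

Because each step only adds a layer, steps that do not depend on each other could be run in any order, or in parallel, with their outputs merged afterwards.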

50 Summarization pipeline A fragment
[Pipeline diagram as on the previous slide]
  Transforming relevant PAULA annotations to Connexor input format
  Syntactic Analysis (Connexor)
  Coreference Analysis (Rosana)
  Transforming Rosana output to PAULA
  Merging multiple annotation layers in one PAULA project
Result: one single PAULA project comprising annotations from different modules

51 Requirements for an interchange format for summarization
advantages
  scalability
  modularization
requirements
  supporting merge and split operations for annotations of the same document
  clear conceptual separation of annotations

52 PAULA
The PAULA format
Current state, future plans

53 PAULA format desiderata I
Potsdamer Austauschformat Linguistischer Annotationen (Potsdam Interchange Format for Linguistic Annotations), designed with the following premises:
very general, annotation-specific format supporting
  multi-layer annotations for information structural (and other) phenomena
  conflicting hierarchies (RST vs. syntax)
  pointing references (e.g., anaphora)

54 PAULA format desiderata II
Potsdamer Austauschformat Linguistischer Annotationen, designed with the following premises:
high coverage
  loss-less representation of information from a multitude of input formats and tools
  TIGER XML, Exmaralda, MMAX, RSTTool
  Connexor, Rosana, Brill Tagger

55 PAULA format desiderata III
Potsdamer Austauschformat Linguistischer Annotationen, designed with the following premises:
merging and splitting operations
  self-contained annotation layers
  extraction/addition of new annotation layers with minimal effects on other annotation layers
XML

56 PAULA format An “interlingua” for tools
Radical standoff
  each annotation layer stored in a separate file
  systematic application of xlinks for non-tree fragments: crossing branches for discontinuous constituents, anaphoric annotation
Make as few structural commitments as possible
  a wide variety of data formats can be represented, as opposed to earlier, task-specific formats
  design inspired by early drafts for LAF
  conceptually related to GrAF (Ide & Suderman 2007)

57 PAULA format Basic elements
<mark> (markable): span of text which is subject to annotation, e.g. a token
<struct> (structure): node in a hierarchical (tree or tree-like) structure
<rel> (relation): relation between struct or mark elements
<feat> (feature): annotation attached to a mark, struct, or rel element
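
As a reading aid, the four element types can be modelled as small Python records. This is a sketch of the data model only, not the PAULA Object Model or its API: the class and field names, the id scheme and the use of character offsets are assumptions; the example data is the NP "Die einstige Fußball-Weltmacht" used on the following slides.

```python
from dataclasses import dataclass

@dataclass
class Mark:
    id: str
    start: int   # character offset into the primary data
    end: int     # exclusive end offset

@dataclass
class Struct:
    id: str      # a node in a tree or tree-like structure

@dataclass
class Rel:
    id: str
    source: str  # id of the struct the relation starts from
    target: str  # id of the struct or mark it points to

@dataclass
class Feat:
    target: str  # id of the mark, struct or rel being annotated
    name: str
    value: str

text = "Die einstige Fußball-Weltmacht"
marks = [Mark("tok_1", 0, 3), Mark("tok_2", 4, 12), Mark("tok_3", 13, 30)]
np_node = Struct("np_1")
rels = [Rel(f"rel_{i}", np_node.id, m.id) for i, m in enumerate(marks, 1)]
feats = (
    [Feat("np_1", "cat", "NP")]
    + [Feat(r.id, "func", "NK") for r in rels]
    + [Feat("tok_1", "pos", "ART"), Feat("tok_2", "pos", "ADJA"), Feat("tok_3", "pos", "NN")]
)

def yield_of(struct_id):
    """Text spans of the marks directly dominated by the given struct."""
    by_id = {m.id: m for m in marks}
    return [text[by_id[r.target].start:by_id[r.target].end]
            for r in rels if r.source == struct_id]
```

Note that annotations never live on the nodes themselves: every label, including the edge labels, is a separate `Feat` pointing at its target by id.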

58 PAULA format Basic elements of syntax annotation
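[Figure: syntax tree of the NP "Die einstige Fußball-Weltmacht": Die (ART), einstige (ADJA), Fußball-Weltmacht (NN), each attached to NP by an edge labelled NK]
Tools: Annotate, TIGERSearch, Synpathy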

59 PAULA format Basic elements of syntax annotation
PAULA representation of structure elements (struct, mark)
[Figure: the NP example in four tiers]
  struct elements: one <struct> for the NP node
  rel elements: three <rel>, one per edge
  mark elements (token): three <mark> of type "tok"
  primary data: Die einstige Fußball-Weltmacht

60 PAULA format Basic elements of syntax annotation
PAULA representation of annotation elements (feat)
[Figure: the NP example in four tiers, with feats attached]
  struct elements: <struct> with cat=NP
  rel elements: three <rel>, each with func=NK
  mark elements (token): three <mark> of type "tok" with POS=ART, POS=ADJA, POS=NN
  primary data: Die einstige Fußball-Weltmacht

61 PAULA format Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file; struct and rel together encode hierarchical structures.
[Figure: the NP example distributed over three files]
  syntax.xml: one <struct> (cat=NP) and three <rel> (func=NK)
  tok.xml: three <mark> of type "tok" (POS=ART, POS=ADJA, POS=NN)
  text.xml: primary data "Die einstige Fußball-Weltmacht"

62 PAULA format Physical representation
Dominance relations are represented by the XML hierarchy between struct and rel.
[Figure: file layout of the NP example (syntax.xml, tok.xml, text.xml); in syntax.xml, the <rel> elements are nested inside their <struct> as an inline XML fragment]

63 PAULA format Physical representation
Dominance relations are represented by the XML hierarchy between struct and rel, and by xlinks/xpointers between rel and the dominated struct/mark.
[Figure: file layout of the NP example (syntax.xml, tok.xml, text.xml); the <rel> elements in syntax.xml are nested inside their <struct> as an inline XML fragment and point to the <mark> elements in tok.xml via xlink/xpointer]

64 PAULA format Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file; marks refer to token sequences.
[Figure: file layout of the NP example (syntax.xml, tok.xml, text.xml)]

65 PAULA format Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file; marks of type "tok" refer to spans of primary data.
[Figure: file layout of the NP example (syntax.xml, tok.xml, text.xml)]

66 PAULA format Physical representation
Every type of structure (primary data, mark, struct) is represented in an individual file; marks of type "tok" refer to spans of primary data.
[Figure: file layout of the NP example (syntax.xml, tok.xml, text.xml); the <mark> elements in tok.xml point to spans of text.xml via xlink/xpointer]

67 PAULA format Physical representation
For every annotation layer, every type of feat is also represented in a separate file.
[Figure: file layout of the NP example with feat files added]
  cat_func.xml: feats for syntax.xml (cat=NP on the <struct>, func=NK on each <rel>)
  pos.xml: feats for tok.xml (POS=ART, POS=ADJA, POS=NN)
  text.xml: primary data "Die einstige Fußball-Weltmacht"

68 PAULA format Physical representation
Feats are attached to mark/struct elements by means of xlink/xpointer expressions.
[Figure: file layout of the NP example; the feats in cat_func.xml point into syntax.xml, the feats in pos.xml point into tok.xml]
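
The standoff layout can be tried out with a small Python sketch that resolves POS feats via marks down to the primary data. Only the element names `mark` and `feat` and the file names come from the slides; the root elements, the `value` attribute and the `file#start,end` fragment syntax are simplified, hypothetical stand-ins for the real PAULA schema and its xpointer expressions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes under their expanded name.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# text.xml: primary data (kept as a plain string here)
TEXT = "Die einstige Fußball-Weltmacht"

# tok.xml: marks point to character spans of the primary data
TOK_XML = """<markList xmlns:xlink="http://www.w3.org/1999/xlink">
  <mark id="tok_1" xlink:href="text.xml#0,3"/>
  <mark id="tok_2" xlink:href="text.xml#4,12"/>
  <mark id="tok_3" xlink:href="text.xml#13,30"/>
</markList>"""

# pos.xml: feats point to marks in tok.xml
POS_XML = """<featList xmlns:xlink="http://www.w3.org/1999/xlink">
  <feat xlink:href="tok.xml#tok_1" value="ART"/>
  <feat xlink:href="tok.xml#tok_2" value="ADJA"/>
  <feat xlink:href="tok.xml#tok_3" value="NN"/>
</featList>"""

def resolve_pos():
    """Follow feat -> mark -> primary data and return (token, POS) pairs."""
    spans = {}
    for mark in ET.fromstring(TOK_XML).iter("mark"):
        start, end = mark.get(XLINK_HREF).split("#")[1].split(",")
        spans[mark.get("id")] = TEXT[int(start):int(end)]
    pairs = []
    for feat in ET.fromstring(POS_XML).iter("feat"):
        mark_id = feat.get(XLINK_HREF).split("#")[1]
        pairs.append((spans[mark_id], feat.get("value")))
    return pairs

pairs = resolve_pos()
```

Even this toy version needs two lookups across three "files" to print a tagged token, which illustrates both the flexibility and the processing overhead discussed on the following slides.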

69 PAULA format Achievements
Generic format
capable of representing hierarchical structures
  struct elements correspond to nodes
  struct/rel elements correspond to dominance relations
capable of representing flat, layer-based annotations*
  mark elements correspond to spans of text without hierarchical structure
capable of representing pointing relations*
  rel elements without a dominating struct element represent non-dominance relations
capable of representing any annotation assigned to these
  feat elements may point to any struct, mark, or rel element
* not shown here

70 PAULA format Achievements
Hierarchies are modelled by means of xlinks
  may represent any kind of dominance relation using the same mechanism, including discontinuous segments and crossing edges
Represents every annotation layer on its own
  structures from different annotation layers do not interfere with each other, e.g. conflicting hierarchies
  addition or removal of another annotation layer does not affect the representation of the remaining layers

71 PAULA format Achievements
Addition of annotation layers and merging annotation projects is easy
  if two annotation projects exist for one piece of primary data:*
    redirect all references to the token layer to the common token layer
    register the new annotation layer
Removal of annotation layers is trivial
  if an annotation layer is to be removed:
    remove the registration of the annotation layer in the current annotation project
* Merging of two annotation projects requires identical tokenization, more in a minute.
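
The two merge steps and the removal step can be sketched as follows. This is a toy model, not the PAULA tooling: a project is a dict that "registers" each layer under its name, references are plain `file#id` strings, and all file and function names are hypothetical. As the footnote says, the sketch assumes both projects use identical tokenization, here also with identical token ids.

```python
def redirect(refs, old_tok_file, common_tok_file):
    """Point every reference to `old_tok_file` at `common_tok_file` instead."""
    prefix = old_tok_file + "#"
    return [common_tok_file + "#" + ref[len(prefix):] if ref.startswith(prefix) else ref
            for ref in refs]

def add_layer(project, layer, refs, old_tok_file, common_tok_file):
    """Merge step: redirect token references, then register the new layer."""
    merged = dict(project)
    merged[layer] = redirect(refs, old_tok_file, common_tok_file)
    return merged

def remove_layer(project, layer):
    """Removal step: drop the registration; other layers are untouched."""
    return {name: refs for name, refs in project.items() if name != layer}

project_a = {"syntax": ["common.tok.xml#tok_1", "common.tok.xml#tok_2"]}
coref_refs = ["projB.tok.xml#tok_1", "projB.tok.xml#tok_2"]

merged = add_layer(project_a, "coref", coref_refs, "projB.tok.xml", "common.tok.xml")
restored = remove_layer(merged, "coref")
```

Neither operation touches the `syntax` layer, which is the point of keeping layers self-contained.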

72 PAULA format Some minor disadvantages
Overhead
  for a project with n annotation layers with different annotations, at least 2n+2 files are created
Only partially human readable
  information distributed across multiple files

73 PAULA format More serious problems
Hard to process using scripting languages
  validity of xlink references must be verified
Maintenance
  there are a number of (quite elaborate) converters from and to PAULA
  any extension of the original format requires all these converters to be updated
Merging annotation projects with different tokenization
  Correction of tokenization is regularly required, e.g., in the output of tools that are insensitive to tokenization (RSTTool) or that re-tokenize (Connexor)
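
A minimal sketch of the reference check mentioned in the first point, assuming the ids defined in each file and the references it contains have already been extracted. The dictionary layout, the `file#id` reference syntax and the function name are assumptions for illustration; real PAULA references are xlink/xpointer expressions and would need a proper parser.

```python
def dangling_references(layers):
    """Return (file, reference) pairs whose target id does not exist.

    `layers` maps a file name to {"ids": set of ids defined in that file,
    "refs": list of "file#id" references made from that file}.
    """
    problems = []
    for name, layer in layers.items():
        for ref in layer["refs"]:
            target_file, _, target_id = ref.partition("#")
            target = layers.get(target_file)
            if target is None or target_id not in target["ids"]:
                problems.append((name, ref))
    return problems

layers = {
    "tok.xml": {"ids": {"tok_1", "tok_2"}, "refs": []},
    "pos.xml": {"ids": set(), "refs": ["tok.xml#tok_1", "tok.xml#tok_3"]},
}
problems = dangling_references(layers)
```

Here `tok_3` is reported because no such mark exists in `tok.xml`; an XML schema validator alone would not catch this, since the broken link is still well-formed.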

74 PAULA Recent developments
Currently, the PAULA Java API is under development, including
  an implementation of the PAULA Object Model
  a parser for PAULA
  downward-compatible serialization facilities
  routines for standard operations
  aligning divergent tokenizations
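
How the API aligns divergent tokenizations is not described in the slides. One common approach, sketched here in Python purely as an assumption, is to anchor both tokenizations in the shared primary data and relate tokens whose character spans overlap; the function names and the example tokenizations are invented for illustration.

```python
def spans(text, tokens):
    """Character spans of the tokens, located left to right in the primary text."""
    result, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)   # raises ValueError if a token is not found
        result.append((start, start + len(tok)))
        pos = start + len(tok)
    return result

def align(text, toks_a, toks_b):
    """Map each token index of tokenization A to the overlapping token indices of B."""
    spans_a, spans_b = spans(text, toks_a), spans(text, toks_b)
    return {
        i: [j for j, (s2, e2) in enumerate(spans_b) if s2 < e1 and s1 < e2]
        for i, (s1, e1) in enumerate(spans_a)
    }

text = "Die einstige Fußball-Weltmacht"
toks_a = ["Die", "einstige", "Fußball-Weltmacht"]             # e.g. a tagger's tokens
toks_b = ["Die", "einstige", "Fußball", "-", "Weltmacht"]     # e.g. a parser's re-tokenization
alignment = align(text, toks_a, toks_b)
```

With such a mapping, annotations made on one tokenization can be re-attached to the other, which is the precondition for merging the two projects.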

75 PAULA Forthcoming
Intended extensions of PAULA concern
  sub-token annotations: morphemes, tones, etc.
  parallel corpora: multiple streams of primary data
  integration of media files

76 What we've shown
need for multi-level annotation (MLA)
  for annotation and processing of pragmatic (and other linguistic) phenomena
ANNIS, a tool for the querying and visualization of MLA
example pipelines involving MLA
typical problems
  synchronization
  adding/removal operations
  expressivity of existing formats

77 What we've shown
typical problems when processing MLA
  synchronization
  adding/removal operations
  expressivity of existing formats
premises for the development of PAULA
  generic format
  specifically designed for linguistic annotations
  "radical" standoff

78 What we've shown
Problems of radical standoff formats
  excessive use of xlinks
  hard to read
  validation facilities
... and a solution to these
  PAULA API

79 Thank you

80 Thank you ... and thanks to the team:
Anke Lüdeling (HUB), Ulf Leser (HUB) Heike Bieler (UP), Michael Götze (UP), Julia Ritz (UP), Amir Zeldes (HUB), Uwe Küssner (ext), {Stefanie Dipper, Tillmann Wegst} Karsten Hütter (HUB), Christian Lemke (UP), Viktor Rosenfeld (HUB), Florian Zipser (UP)

