Enabling xComForTable Mapping to the Linguistic Annotation Framework Marion Freese Sony International (Europe) Gmbh; IMS, Universität Stuttgart; hmb Datentechnik.


1 Enabling xComForTable Mapping to the Linguistic Annotation Framework
Marion Freese
Sony International (Europe) GmbH; IMS, Universität Stuttgart; hmb Datentechnik GmbH

2 2/24 LREC /29/2004 Marion Freese
Overview
xComForT – Outline
Relevance for richly annotated corpora
xComForT Features
–Adaptation to new text formats
–Integration of annotation tools
Proposal for integration into LAF
Summary

3 xComForT – What is it?
extensible Common Format for Text
based on
–XML
–Text Encoding Initiative (TEI)
–Corpus Encoding Standard (CES / XCES)
provides extensibility and reusability

4 xComForT – What's it for?
NOT
–Standard for linguistic annotation
BUT
–Standards proposal for structural annotation of primary data
–Common anchor for linguistic annotations (LA)
–Set of guidelines for LA architecture (company-internal standard)

5 Example: Newspaper (plain text)
[Figure: newspaper page with labeled regions: byline, copyright, meta information, headline, quotation, byline, dateline, paragraph]

6 xComForT – Primary Document Example
[XML example; element markup not preserved in this transcript. Text content:]
zu erhalten. mgd Montag, 4. Januar 1999 BAYERN Kafkaeskes Augsburg Der nächste Akzent Von Peter Richter Augsburg – Auch wenn nicht

7 xComForT – Data Architecture
[Diagram: xComForT storage format]
level 1: base document
level 2: token level: token stream; e.g. morpheme, syllable streams; e.g. sentence, chunk, mw streams
level 3: 1st linguistic level: e.g. PoS, lemma, pronunciation streams
level 4: 2nd linguistic level: e.g. parse tree stream; e.g. intonation stream
link labels between levels: substring / 1:1, range-to / 1:1, 1:1 (#id), segInfo
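A minimal sketch of how such a layered stand-off architecture might look in code. It assumes that tokens are substrings of the base document addressed by character offsets, that higher streams such as sentences point to a first and a last token (range-to), and that linguistic annotations refer 1:1 to a token id. The class names, the id scheme and the tokenizing regular expression are illustrative assumptions, not part of xComForT.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Level 2: a substring of the base document (hypothetical structure)."""
    id: str
    start: int  # character offset into the base document, inclusive
    end: int    # character offset into the base document, exclusive


@dataclass(frozen=True)
class Range:
    """A span such as a sentence or chunk, given by its first and last token."""
    id: str
    first: str  # id of the first token in the span
    last: str   # id of the last token in the span


def tokenize(base):
    """Split into words and single punctuation marks, keeping offsets only."""
    return [Token(f"t{i}", m.start(), m.end())
            for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", base), start=1)]


def token_text(base, token):
    """Resolve a token back to the characters of the base document."""
    return base[token.start:token.end]


def range_text(base, tokens, rng):
    """Resolve a range by following its token references into the base text."""
    by_id = {t.id: t for t in tokens}
    return base[by_id[rng.first].start:by_id[rng.last].end]


# Level 1: the base document is never modified by any annotation.
base = "Jones followed him."
# Level 2: token stream plus a sentence stream built on top of it.
tokens = tokenize(base)
sentence = Range("s1", tokens[0].id, tokens[-1].id)
# Level 3: a PoS stream that refers 1:1 to token ids (tags are sample values).
pos = {"t1": "NNP", "t2": "VBD", "t3": "PRP", "t4": "."}

print(range_text(base, tokens, sentence))
```

Because each stream holds only references, a further stream (a lemma stream, for instance) can be added without touching the base document or the token stream.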

8 Relevance for richly annotated Corpora
Standoff-Markup
–supports huge amount of annotation data
» alternative / concurrent / ambiguous annotations
» partial / underspecified results
» flexible merging
» various annotation types (multimodal, multimedia, metadata, …)
media independence
–reduces annotation dependencies
Support for integration of external tools for annotation and exploitation
common standards-based starting point for rich annotation

9 Comparison with CES
Structural markup and linguistic annotation are strictly separated in xComForT
provides common base format for arbitrary linguistic annotation
allows for using consistent annotation schema
Primary document DTD is easily extensible while retaining TEI conformance
xComForT provides more flexibility than CES wrt. resource formats (e.g. integration of different modalities possible)

10 Creation of an extended DTD for storage
[Diagram]
core markup definition: xComForT.ent, xComForT.dtd
class.mod, class.new, class.comments, elem.mod, elem.new, elem.comments
extension definition: xcomfort_new.ent, xcomfort_new.dtd
TEI conformant storage format template: xComForT_store.dtd
TEI conformant extension storage format: xComForT_store_new.dtd

11 Extension Definition Support
core markup definition contains extension entity for each element and entity, e.g. »

12 Integration of Annotation Tools
Toolbox support for converting annotation tool output to xComForT annotation stream
[Diagram: annotate.perl, with inputs labeled xComForT document, type of annotation and element names (e.g. sentence p), producing text nodes for annotation tool input: With the...]
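A rough sketch of the input side of such a toolbox: selecting the text nodes of named elements from a primary document so that they can be handed to an annotation tool. This is not the annotate.perl script from the slide. The function name, the `id` attribute and the sample document structure are assumptions made for illustration; only the sample text comes from the primary document example.

```python
import xml.etree.ElementTree as ET


def text_nodes(document_xml, element_names):
    """Return (id, text) pairs for every element whose tag is in element_names.

    Elements without an id attribute are skipped, because the resulting
    annotation stream would have nothing to point back to.
    """
    root = ET.fromstring(document_xml)
    selected = []
    for elem in root.iter():  # document order
        if elem.tag in element_names and elem.get("id") is not None:
            selected.append((elem.get("id"), "".join(elem.itertext()).strip()))
    return selected


# Hypothetical primary document; the element names follow common TEI usage.
doc = """<text>
  <head id="h1">Kafkaeskes Augsburg</head>
  <byline id="b1">Von Peter Richter</byline>
  <p id="p1">Auch wenn nicht</p>
</text>"""

# Only paragraph content is passed on, e.g. to a sentence splitter.
for node_id, text in text_nodes(doc, {"p"}):
    print(node_id, text)
```

Keeping the element id next to each text node is what would let the tool output be written back as a stand-off stream that references the primary document.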

13 Linguistic Annotation Tools – implemented examples
input and output formats of
–Tokenizer (from IMS, University of Stuttgart)
» tokens
» sentences
–IMS TreeTagger
» lemma
» part-of-speech
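As an illustration of the output side, the sketch below turns tagger output into two stand-off streams, one for part-of-speech and one for lemma. It assumes one token per line with tab-separated word, tag and lemma, which is the layout TreeTagger prints when token and lemma output are enabled. The element and attribute names (`stream`, `ann`, `type`, `ref`) and the token ids are hypothetical and do not reproduce the actual xComForT stream format.

```python
import xml.etree.ElementTree as ET


def treetagger_to_streams(tagger_output, token_ids):
    """Build a PoS stream and a lemma stream that reference existing token ids."""
    rows = [line.split("\t") for line in tagger_output.splitlines() if line.strip()]
    if len(rows) != len(token_ids):
        raise ValueError("tagger output and token stream differ in length")

    pos_stream = ET.Element("stream", type="pos")
    lemma_stream = ET.Element("stream", type="lemma")
    for token_id, row in zip(token_ids, rows):
        if len(row) != 3:
            raise ValueError(f"expected word, tag and lemma, got: {row!r}")
        _word, pos, lemma = row
        # Each annotation points 1:1 at a token via its id.
        ET.SubElement(pos_stream, "ann", ref=f"#{token_id}").text = pos
        ET.SubElement(lemma_stream, "ann", ref=f"#{token_id}").text = lemma
    return pos_stream, lemma_stream


# Sample tagger output (illustrative values).
sample = "The\tDT\tthe\ndoor\tNN\tdoor\nclosed\tVBD\tclose\n"
pos_stream, lemma_stream = treetagger_to_streams(sample, ["t1", "t2", "t3"])
print(ET.tostring(pos_stream, encoding="unicode"))
```

Writing tag and lemma into separate streams keeps each annotation type independently replaceable, in line with the stand-off design described in the talk.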

14 Relation to current LAF standardization issues (1)
General requirements for the standard for a Linguistic Annotation Framework (LAF) (cf. Ide & Romary 2003)
xComForT conforms to these requirements, i.e. to
–Media independence
–Human readability
–Processability

15 Relation to current LAF standardization issues (2)
Remaining requirements are xComForT's main features, i.e.
–Consistency
–Uniformity
–Incrementality
–Expressiveness
Two proposals for integration into the LAF
Mapping between proprietary resource formats and the LAF annotation data model
Resource reusability

16 Proposal to the LAF (1-1)
[Diagram: LAF architecture (Ide & Romary); label: Dump format]

17 Proposal to the LAF (1-2)
Dump Format conforming to xComForT guidelines
Advantages
–Direct mapping from/to user-defined formats
–Support for annotation tool integration
–Easy conversion into proprietary formats
Disadvantages
–xComForT is possibly not the most adequate/efficient processing format
–Different requirements of processing format vs. exchange format

18 Proposal to the LAF (2-1)
[Diagram: LAF architecture (Ide & Romary); label: Intermediate Format between resource and LAF dump format]

19 Proposal to the LAF (2-2)
Intermediate Format (Common Document Format)
Disadvantages
–One more mapping step
Advantages
–Standards-based adaptation to proprietary formats
–Mapping to dump format tightly defined and targeted
–Common mapping tool, e.g. provided by the LAF

20 Example: Potential LAF dump format
Jones followed him into the front room, closing the door behind him
(Ide & Romary 2001)

21 Example: Possible xComForT Representation (1)
[Diagram: xComForT storage format with level 1 (PTBraw.xml), level 2 token level (token.xml), level 3 1st linguistic level and level 4 2nd linguistic level; further files shown: sentence.xml, chunk.xml, chunk_relation.xml; link labels: segments, substring, range-to, segInfo, 1:1 (#id)]

22 Example: Possible xComForT Representation (2)
chunk.xml
chunk_relation.xml

23 Summary
standards-based
common tools available and usable
stand-off annotation
easy plugging-in of linguistic annotation schema
easily extensible markup of primary document
easy adaptation to arbitrary resource
Standard base format, e.g. to simplify support for mapping into the Linguistic Annotation Framework

24 xComForTable Mapping to the LAF
Thanks for your attention! … Any questions?

25 Structural Markup improves Analysis
e.g. sentence boundary detection
Then things would get even worse. (see also pages 4 and 11) SHADOWS By Leena Dhingra I couldn't possibly do that.
tokenizer input: -elements (without -elements)
correct sentence markup
[..]Then things would get even worse. (see also pages 4 and 11) SHADOWS By Leena Dhingra I couldn't possibly do that.
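The effect can be sketched with a deliberately naive sentence splitter. The element names attached to each segment (`p`, `note`, `head`, `byline`) are assumptions, since the transcript does not preserve the original markup, and the splitter is a stand-in rather than the IMS tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# The example text, segmented by hypothetical structural elements.
segments = [
    ("p", "Then things would get even worse."),
    ("note", "(see also pages 4 and 11)"),
    ("head", "SHADOWS"),
    ("byline", "By Leena Dhingra"),
    ("p", "I couldn't possibly do that."),
]

# Without structural markup, everything is one undifferentiated string, so
# the note, headline and byline run into the following sentence.
flat = " ".join(text for _, text in segments)
flat_sentences = split_sentences(flat)

# With structural markup, only paragraph content reaches the splitter.
structured_sentences = [
    sentence
    for name, text in segments
    if name == "p"
    for sentence in split_sentences(text)
]

print(flat_sentences)
print(structured_sentences)
```

In the flat case the second "sentence" swallows the note, the headline and the byline; restricting the input to paragraph elements yields the two intended sentences.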

26 Example – Discontinuous Material
CES: Die Gewinnzahlen Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
xComForT: Die Gewinnzahlen Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor

27 Example – Meta Information
CES: Montag, 7. Juni 1999 NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7
xComForT: Montag, 7. Juni 1999 NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7
reference to taxonomy

