Enabling xComForTable Mapping to the Linguistic Annotation Framework

1 Enabling xComForTable Mapping to the Linguistic Annotation Framework
Marion Freese, Sony International (Europe) GmbH; IMS, Universität Stuttgart; hmb Datentechnik GmbH

2 Overview
xComForT – Outline
Relevance for richly annotated corpora
xComForT Features
Adaptation to new text formats
Integration of annotation tools
Proposal for integration into LAF
Summary
Marion Freese LREC /29/2004

3 xComForT – What is it?
extensible Common Format for Text
based on XML, the Text Encoding Initiative (TEI), and the Corpus Encoding Standard (CES / XCES)
provides extensibility and reusability
Notes: 1+4. Self-descriptiveness

4 xComForT – What’s it for?
NOT a standard for linguistic annotation
BUT
a standards proposal for structural annotation of primary data
a common anchor for linguistic annotations (LA)
a set of guidelines for LA architecture (company-internal standard)

5 Example: Newspaper (plain text)
[Figure: plain-text article from a German newspaper corpus, with callouts for meta information, copyright, headline, byline, dateline, paragraph, and quotation]
Notes: The plain text comes from a German newspaper corpus and is not annotated with text structure. The structure is plausible even without understanding the German text. Bylines appear in full form and in shorthand.

6 xComForT – Primary Document Example
<xcomfortDoc type="text" extension="SZ" version="v0.6" TEIform="TEI.2">
  <cesHeader ...>
    <!-- ... -->
  </cesHeader>
  <text xml:lang="de">
    <!-- ... -->
    zu erhalten.</p>
    <byline type="signer">
      <docAuthor type="short">mgd</docAuthor>
    </byline>
    </div>
    <div type="article" id="d _a12">
      <opener id="d _a12o">
        <divMeta>
          <publDate>Montag, 4. Januar 1999</publDate>
          <cat target="ns8"><hi>BAYERN</hi></cat>
          <!-- ... -->
        </divMeta>
        <head id="d _a12hl1">Kafkaeskes Augsburg</head>
        <head id="d _a12hl2" type="sub">Der nächste Akzent <!-- ... --></head>
        <byline type="main">Von <docAuthor type="full">Peter Richter</docAuthor>
        <dateline><location>Augsburg</location> – </dateline>
        <p id="d _a12p1">Auch wenn nicht <!-- ... --></p>
</xcomfortDoc>
Notes: All elements would carry an id; most ids are omitted here to save space. ns8 is a newspaper section in the category registry (in the cesHeader).

7 xComForT – Data Architecture
[Diagram: layered stand-off architecture]
level 1, xComForT storage format: base document
level 2, token level: token stream (e.g. morpheme, syllable streams), linked via substring or 1:1
level 3, 1st linguistic level: e.g. sentence, chunk, mw streams, linked via range-to or 1:1
level 4, 2nd linguistic level: e.g. parse tree stream, intonation stream, linked via range-to or 1:1
segInfo: e.g. PoS, lemma, pronunciation streams, linked 1:1 (#id)
Notes: The xComForT guidelines ask for a separate document for each annotation type, so that e.g. chunk and lexical annotation are not mixed. This allows each annotation to be produced independently.
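The layering on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the xComForT toolbox: the sample text, the ids, the tag values, and the helper names `token_text`, `tokens_in_range`, and `chunk_text` are all hypothetical. Only the three link types (substring, range-to, 1:1 by id) are taken from the slide.

```python
# Level 1: base document text (hypothetical sample).
base_text = "With the new format"

# Level 2: token stream. Each token points into the base text with a
# 1-based (start, length) pair, mirroring a substring link.
tokens = {
    "tok1": (1, 4),
    "tok2": (6, 3),
    "tok3": (10, 3),
    "tok4": (14, 6),
}

# segInfo: a 1:1 stream that attaches one value to one token id.
pos_tags = {"tok1": "IN", "tok2": "DT", "tok3": "JJ", "tok4": "NN"}

# Level 3: chunk stream. Each chunk names its first and last token,
# mirroring a range-to link.
chunks = {"chunk1": ("tok2", "tok4")}


def token_text(token_id):
    """Resolve a substring link against the base document."""
    start, length = tokens[token_id]
    return base_text[start - 1:start - 1 + length]


def tokens_in_range(first_id, last_id):
    """Resolve a range-to link to the list of covered token ids."""
    ordered = list(tokens)  # dicts keep insertion order
    return ordered[ordered.index(first_id):ordered.index(last_id) + 1]


def chunk_text(chunk_id):
    """Follow the chain chunk -> tokens -> base text."""
    first_id, last_id = chunks[chunk_id]
    return " ".join(token_text(t) for t in tokens_in_range(first_id, last_id))
```

Because every stream refers only to ids of the level below it, replacing the chunk stream in this sketch leaves `tokens` and `pos_tags` untouched.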

8 Relevance for richly annotated Corpora
Standoff markup supports
huge amounts of annotation data
alternative / concurrent / ambiguous annotations
partial / underspecified results
flexible merging
various annotation types (multimodal, multimedia, metadata, …) → media independence
and reduces annotation dependencies
Support for integration of external tools for annotation and exploitation
→ common standards-based starting point for rich annotation
Notes: 1a. Creation, storage, manipulation, and exploitation of all well-known standoff annotation possibilities, such as alternative/concurrent annotation and various annotation types (different modalities). 1b. (When strictly applied.) Changes affect only the documents that link to the changed ones (higher-level documents, but not all of them). For example, a change in the sentence stream would, in the example streams given above, only affect the intonation stream.

9 Comparison with CES
Structural markup and linguistic annotation are strictly separated in xComForT
→ provides a common base format for arbitrary linguistic annotation
allows for using a consistent annotation schema
Primary document DTD is easily extensible while retaining TEI conformance
→ xComForT provides more flexibility than CES with respect to resource formats (e.g. integration of different modalities is possible)

10 Creation of an extended DTD for storage
[Diagram: files involved in creating an extended storage DTD]
extension definition: class.mod, class.comments, elem.mod, elem.comments, xcomfort_new.ent, xcomfort_new.dtd
core markup definition: xComForT.ent, xComForT.dtd
TEI conformant storage format template: xComForT_store.dtd
TEI conformant extension storage format: xComForT_store_new.dtd
Notes: xComForT consists of a FIXED core definition, based on CES (with only a few modifications, motivated later on), and a MODULAR extension possibility: files with modified and new definitions. The base document is an instance of the storage format. The core definition and the template already exist; only the files shown in red on the slide have to be written manually. New and modified elements and entities have to be defined in DTD syntax, with support from the formalism on the next slide.

11 Extension Definition Support
The core markup definition contains an extension entity for each element and entity, e.g.
  <!ENTITY % x.byline ''>
  <!ELEMENT byline (#PCDATA | author %x.byline;)>
Extension definition (in class.mod):
  <!ENTITY % x.byline '| interviewer'>
Result:
  <!ELEMENT byline (#PCDATA | author | interviewer)>
Notes: Instead of copying and pasting the definition, we simply add the extension by redefining an extension entity. On the slide, black marks the core definition, the upper red text the extension definition, and the lower red text the result.
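The effect of redefining `x.byline` can be imitated with plain string substitution. The following Python sketch is only an illustration of parameter-entity expansion, not a DTD parser; the helper names `declare` and `expand` are hypothetical. It relies on one XML rule: when an entity is declared more than once, the first declaration is binding, so the extension has to be read before the core default. That the xComForT files are loaded in this order is an assumption.

```python
import re


def declare(entities, name, value):
    """Record a parameter entity; as in XML, the first declaration is binding."""
    entities.setdefault(name, value)


def expand(declaration, entities):
    """Replace every %name; reference with its replacement text."""
    expanded = re.sub(r"%([\w.]+);", lambda m: entities[m.group(1)], declaration)
    # Tidy the whitespace left behind by an empty replacement text.
    return re.sub(r"\s+\)", ")", expanded)


CORE_ELEMENT = "<!ELEMENT byline (#PCDATA | author %x.byline;)>"

# Core definition only: the extension entity defaults to the empty string.
core_only = {}
declare(core_only, "x.byline", "")

# Extension read first, then the core default, which is now ignored.
extended = {}
declare(extended, "x.byline", "| interviewer")
declare(extended, "x.byline", "")

core_result = expand(CORE_ELEMENT, core_only)
extended_result = expand(CORE_ELEMENT, extended)
```

Here `core_result` keeps the original content model, while `extended_result` matches the result line on the slide.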

12 Integration of Annotation Tools
Toolbox support for converting annotation tool output to an xComForT annotation stream
[Diagram: annotate.perl]
Inputs: xComForT document; element names, e.g. <elem>p</elem>; type of annotation, e.g. sentence
Text nodes for annotation tool input:
  <tn ancestors="div p" parentID="div1.p1">With</tn>
  <tn ancestors="div p" parentID="div1.p1">the</tn>
  ...
Annotation stream, e.g. <s xlink:href=".."/>
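How text nodes of this shape might be produced can be sketched as follows. This is not annotate.perl: it is a hypothetical Python stand-in that assumes one `<tn>` per whitespace-separated word, an `ancestors` value listing element names from the root down to the containing element, and a `parentID` taken from that element's `id`. The function name `text_nodes` and the sample document are made up for illustration, and text that follows inline child elements is ignored.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def text_nodes(root, element_names):
    """Return <tn> lines for the words directly inside the selected elements."""
    lines = []

    def walk(elem, ancestors):
        path = ancestors + [elem.tag]
        if elem.tag in element_names and elem.text:
            for word in elem.text.split():
                lines.append(
                    "<tn ancestors={} parentID={}>{}</tn>".format(
                        quoteattr(" ".join(path)),
                        quoteattr(elem.get("id", "")),
                        escape(word),
                    )
                )
        for child in elem:
            walk(child, path)

    walk(root, [])
    return lines


# Hypothetical input document; only <p> is selected for annotation.
document = ET.fromstring('<div id="div1"><p id="div1.p1">With the</p></div>')
tn_lines = text_nodes(document, {"p"})
```

Keeping the parent id on every text node is what would let a converter turn the annotation tool's output back into links against the xComForT document.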

13 Linguistic Annotation Tools – implemented examples
Input and output formats of
Tokenizer (from IMS, University of Stuttgart): tokens, sentences
IMS TreeTagger: lemma, part-of-speech
Notes: Explain the acronym IMS.

14 Relation to current LAF standardization issues (1)
General requirements for the standard for a Linguistic Annotation Framework (LAF) (cf. Ide & Romary 2003)
xComForT conforms to these requirements, i.e. to
Media independence
Human readability
Processability

15 Relation to current LAF standardization issues (2)
Remaining requirements are xComForT's main features, i.e.
Consistency
Uniformity
Incrementality
Expressiveness
Two proposals for integration into the LAF
Mapping between proprietary resource formats and the LAF annotation data model
Resource reusability
Notes: The proposals lead to easier mapping and add value regarding reusability.

16 Proposal to the LAF (1-1)
[Diagram: LAF architecture (Ide & Romary); label: Dump format]

17 Proposal to the LAF (1-2)
Dump Format conforming to xComForT guidelines
Advantages
Direct mapping from/to user-defined formats
Support for annotation tool integration
Easy conversion into proprietary formats
Disadvantages
xComForT is possibly not the most adequate/efficient processing format
Different requirements of processing format vs. exchange format

18 Proposal to the LAF (2-1)
[Diagram: LAF architecture (Ide & Romary)]
Intermediate Format between resource and LAF dump format

19 Proposal to the LAF (2-2) Intermediate Format (Common Document Format)
Advantages
Standards-based adaptation to proprietary formats
Mapping to dump format tightly defined and targeted
Common mapping tool, e.g. provided by the LAF
Disadvantages
One more mapping step

20 Example: Potential LAF dump format
"Jones followed him into the front room, closing the door behind him" (Ide & Romary 2001)
<struct id="s0" type="S">
  <struct id="s1" type="NP" xlink:href="xptr(substring(p/s[1]/text(),1,5))" rel="SBJ"/>
  <struct id="s2" type="VP" xlink:href="xptr(substring(p/s[1]/text(),7,8))"/>
  <struct id="s3" type="NP" xlink:href="xptr(substring(p/s[1]/text(),16,3))"/>
  <struct id="s4" type="PP" xlink:href="xptr(substring(p/s[1]/text(),20,4))" rel="DIR">
    <struct id="s5" type="NP" xlink:href="xptr(substring(p/s[1]/text(),25,14))"/>
  </struct>
  <struct id="s6" type="S" rel="ADV">
    <!-- ... -->
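The `substring(path,start,length)` links in this example can be checked against the sentence with a short Python sketch. It is a minimal illustration, not an XPointer implementation: the path `p/s[1]/text()` is not evaluated, and the sentence is simply assumed to be the text it selects. The names `SUBSTRING`, `resolve`, and `spans` are hypothetical.

```python
import re

SENTENCE = "Jones followed him into the front room, closing the door behind him"

# Link targets copied from the <struct> elements s1 to s5.
HREFS = {
    "s1": "xptr(substring(p/s[1]/text(),1,5))",
    "s2": "xptr(substring(p/s[1]/text(),7,8))",
    "s3": "xptr(substring(p/s[1]/text(),16,3))",
    "s4": "xptr(substring(p/s[1]/text(),20,4))",
    "s5": "xptr(substring(p/s[1]/text(),25,14))",
}

SUBSTRING = re.compile(r"substring\((?P<path>.+),(?P<start>\d+),(?P<length>\d+)\)")


def resolve(href, text):
    """Return the span that an xptr(substring(path,start,length)) link selects."""
    match = SUBSTRING.search(href)
    if match is None:
        raise ValueError(f"unsupported link: {href}")
    start = int(match.group("start"))    # 1-based, as in XPath substring()
    length = int(match.group("length"))
    return text[start - 1:start - 1 + length]


spans = {struct_id: resolve(href, SENTENCE) for struct_id, href in HREFS.items()}
```

In this format every constituent addresses character offsets in the primary text directly, so each link can be resolved without consulting any other annotation document.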

21 Example: Possible xComForT Representation (1)
[Diagram: files in the example representation]
level 1, xComForT storage format: PTBraw.xml
level 2, token level: token.xml, linked via substring
level 3, 1st linguistic level: sentence.xml, chunk.xml (segments), linked via range-to
level 4, 2nd linguistic level: chunk_relation.xml (segInfo), linked 1:1 (#id)

22 Example: Possible xComForT Representation (2)
chunk.xml:
<segments level="ling1" type="chunk" xml:base="token.xml">
  <chunk id="div1.p1.chunk1" type="NP" xlink:href="#div1.p1.tok1"/>
  <chunk id="div1.p1.chunk2" type="VP" xlink:href="#div1.p1.tok2"/>
  <chunk id="div1.p1.chunk3" type="NP" xlink:href="#div1.p1.tok3"/>
  <chunk id="div1.p1.chunk4" type="PP" xlink:href="#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))"/>
  <chunk id="div1.p1.chunk5" type="NP" xlink:href="#xpointer(id('div1.p1.tok5')/
</segments>
chunk_relation.xml:
<segInfo level="ling2" type="rel" xml:base="chunk.xml">
  <rel id="div1.p1.chunk1.rel" xlink:href="#div1.p1.chunk1">SBJ</rel>
  <rel id="div1.p1.chunk4.rel" xlink:href="#div1.p1.chunk4">DIR</rel>
</segInfo>
Notes: Chunks are not hierarchical in the XML, although they are semantically. They can be converted to a hierarchical representation (e.g. for graphics). Why is this better than hierarchical XML??
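The conversion to a hierarchical representation mentioned in the notes can be sketched by nesting chunks according to the token ranges they cover. The Python below is one possible approach, not the author's tool. The names `Chunk`, `nest`, and `bracket` are hypothetical, token ids are reduced to their numeric suffix, and the end of chunk5's range, which is cut off on the slide, is assumed to be tok7 purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    label: str
    first: int  # numeric suffix of the first covered token
    last: int   # numeric suffix of the last covered token
    children: list = field(default_factory=list)


def nest(chunks):
    """Turn a flat chunk list into a forest using span containment."""
    roots, stack = [], []
    # Outer spans sort before the spans they contain.
    for chunk in sorted(chunks, key=lambda c: (c.first, -c.last)):
        while stack and not (stack[-1].first <= chunk.first
                             and chunk.last <= stack[-1].last):
            stack.pop()
        (stack[-1].children if stack else roots).append(chunk)
        stack.append(chunk)
    return roots


def bracket(chunk):
    """Render a chunk and its descendants in bracket notation."""
    inner = " ".join(bracket(child) for child in chunk.children)
    return f"[{chunk.label} {inner}]" if inner else f"[{chunk.label}]"


flat = [
    Chunk("div1.p1.chunk1", "NP", 1, 1),
    Chunk("div1.p1.chunk2", "VP", 2, 2),
    Chunk("div1.p1.chunk3", "NP", 3, 3),
    Chunk("div1.p1.chunk4", "PP", 4, 7),
    Chunk("div1.p1.chunk5", "NP", 5, 7),  # end of range assumed, see above
]
forest = [bracket(root) for root in nest(flat)]
```

The hierarchy is derived on demand here, so the stored chunk stream can stay flat and other streams can keep pointing at individual chunk ids.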

23 Summary
standards-based → common tools available and usable
stand-off annotation → easy plugging-in of linguistic annotation schema
easily extensible markup of primary document
easy adaptation to arbitrary resource
→ Standard base format, e.g. to simplify support for mapping into the Linguistic Annotation Framework

24 xComForTable Mapping to the LAF
Thanks for your attention! … Any questions?

25 Structural Markup improves Analysis
e.g. sentence boundary detection
Plain text:
  Then things would get even worse. (see also pages 4 and 11)
  SHADOWS
  By Leena Dhingra
  I couldn’t possibly do that.
With structural markup:
  <p>[..]Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p>
  </div>
  <div>
  <head>SHADOWS</head>
  <byline>By Leena Dhingra</byline>
  <p>I couldn’t possibly do that.</p>
Tokenizer input: <p> elements (without <rs> elements) → correct sentence markup
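Selecting the tokenizer input from the marked-up version can be sketched as follows. This is a hypothetical Python illustration, not the toolbox code: the function name `tokenizer_input` is made up, and the `<body>` wrapper is added only so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<div><p>Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p></div>
<div><head>SHADOWS</head><byline>By Leena Dhingra</byline><p>I couldn't possibly do that.</p></div>
</body>"""


def tokenizer_input(root, keep="p", skip=("rs",)):
    """Return the text of each kept element, leaving out skipped descendants."""

    def visible_text(elem):
        parts = [elem.text or ""]
        for child in elem:
            if child.tag not in skip:
                parts.append(visible_text(child))
            # Text after a child still belongs to the enclosing element.
            parts.append(child.tail or "")
        return "".join(parts)

    return [visible_text(p).strip() for p in root.iter(keep)]


paragraphs = tokenizer_input(ET.fromstring(SAMPLE))
```

Each returned paragraph ends at a real sentence boundary, and neither the cross-reference nor the headline and byline reach the tokenizer.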

26 Example – Discontinuous Material
Plain text:
  Die Gewinnzahlen Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
[Slide labels: CES, xComForT]
xComForT:
  <div id="d _a1" type="article">
    <opener><!-- ... --></opener>
    <discontinuous id="d _a1. discontinuous" type="rubbish">
      Die Gewinnzahlen Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
    </discontinuous>
    <closer><!-- ... --></closer>
  </div>
DTD:
  <!ELEMENT discontinuous (#PCDATA)>
  <!ATTLIST discontinuous
      id   ID                          #REQUIRED
      type (rubbish | editorial | ..)  #IMPLIED>
Notes: Discontinuous material is material that we do not want to include, e.g. in linguistic analyses where we need continuous text. In the DTD it is a simple element; the type attribute further specifies the kind of discontinuity.

27 Example – Meta Information
Plain text:
  Montag, 7. Juni NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7
CES:
  <opener><date>Montag, 7. Juni 1999</date> NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7 </opener>
xComForT:
  <opener>
    <divMeta>
      <publDate>Montag, 7. Juni 1999</publDate>
      <cat target="ns1">NACHRICHTEN</cat>
      <distribution>M / F</distribution>
      <publBy>Süddeutsche Zeitung</publBy>
      <volNr>Nr. 127</volNr> / <pageNr>Seite 7</pageNr>
    </divMeta>
  </opener>
Notes: reference to taxonomy. TODO: taxonomy excerpt from the header. Except for publDate, all elements in divMeta are SZ-specific → extension.
