SynAF: Provo ISO Meeting, 14.08.2007. Thierry Declerck, DFKI GmbH.


The Background: Linguistic Annotation Framework (slides from Nancy Ide) Under development within ISO TC37/SC4 (Language Resource Management) Intended to provide standardized means to represent linguistic data and annotations

LAF Approach Develop a common, abstract model that can capture all types of annotation information, regardless of the physical encoding Develop a generic, XML instantiation of the model, to and from which specific formats can be mapped Define a common set of data categories, for reference and use by annotators

General Principles Separation of data and annotations Separation of user annotation formats and the exchange (“pivot”) format Separation of annotation structure and content in the pivot format

Abstract Model Annotations represented as a graph of feature structures –Nodes are locations in primary data or other annotations Any format instantiating the model can be trivially mapped to another format via the pivot format

Overview [Figure: Format A and Format B are each mapped to and from the Pivot Format; annotations A1, A2 and A3 feed a combined Pivot Format.]

Pivot Format Need never be seen/used by user –In principle, user defines “mapping rules” and pivot is automatically generated (and vice versa) –Exchange format –Model used to enable mapping, also to inform design of new annotation schemes

Pivot: Stand-off Annotation Language data is regarded as “read-only” and contains no annotations Annotations are stand-off linked to the primary data or other annotation documents
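
The stand-off principle can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the identifiers a1 to a3, the tag names, and the helper covered_text are assumptions made for the example and are not taken from LAF.

```python
# Primary data is treated as read-only; annotations live outside it.
PRIMARY = "The clock struck"

# Each annotation points either into the primary data (character offsets)
# or at other annotations (by id). The layout and tags are hypothetical.
annotations = {
    "a1": {"layer": "pos", "target": (0, 3), "label": "DET"},
    "a2": {"layer": "pos", "target": (4, 9), "label": "NOUN"},
    "a3": {"layer": "syntax", "target": ["a1", "a2"], "label": "NP"},
}


def covered_text(ann_id):
    """Resolve an annotation to the stretch of primary data it covers."""
    target = annotations[ann_id]["target"]
    if isinstance(target, tuple):
        start, end = target
        return PRIMARY[start:end]
    # Annotation over other annotations: join the pieces it refers to.
    return " ".join(covered_text(t) for t in target)
```

Because the syntactic annotation a3 refers to a1 and a2 rather than to the text itself, further layers can be added or replaced without touching the primary data.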

[Figure: primary data with stand-off annotation layers: POS annotation, syntactic annotation, alternative POS annotation, sense annotation, co-reference annotation, semantic roles, event annotation.]

Data Category Registry Addresses issue of standardization of annotation content Provides a set of reference categories onto which scheme-specific names can be mapped Provides a precise semantics for annotation categories Provides a point of departure for definition of variant, more precise, or new data categories

Exchange Specification Annotator provides a Data Category Specification (DCS) –mapping between scheme-specific instantiations and concepts in the DCR Including differences, departures, new categories –provides documentation for the user’s annotation scheme DCS included or referenced in data exchange –provides receiver with information to interpret annotation content or map to another instantiation –semantic integrity guaranteed by mutual reference to DCR concepts or definition of new categories in DCS
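
The role of a Data Category Specification can be illustrated with a small lookup. Everything in this sketch is hypothetical: the concept names in DCR, their placeholder definitions, the layout of DCS, and the function interpret are invented for the example and do not reflect the real registry.

```python
# Hypothetical reference concepts with placeholder definitions.
DCR = {
    "nounPhrase": "phrase headed by a noun",
    "verbPhrase": "phrase headed by a verb",
    "subject": "subject relation",
}

# A Data Category Specification for one annotation scheme:
# scheme-specific tags mapped to DCR concepts, plus documented new categories.
DCS = {
    "mapping": {"NP": "nounPhrase", "VP": "verbPhrase", "SBJ": "subject"},
    "new_categories": {"VPto": "to-infinitive verb phrase (scheme-specific)"},
}


def interpret(tag):
    """Return (kind, concept, definition) for a scheme tag.

    Raises KeyError if the tag is documented neither as a DCR mapping
    nor as a new category.
    """
    if tag in DCS["mapping"]:
        concept = DCS["mapping"][tag]
        return ("dcr", concept, DCR[concept])
    if tag in DCS["new_categories"]:
        return ("new", tag, DCS["new_categories"][tag])
    raise KeyError(f"tag {tag!r} is not documented in the DCS")
```

A receiver that holds the DCS can interpret every tag in the exchanged data, either through the shared DCR concept or through the sender's own definition.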

Pivot Format Design Primary concerns Maximize processing efficiency and consistency Ensure that processing is unambiguous Instantiate with a simple, minimal set of elements Fulfillment of these requirements has repercussions for users –Information must be explicitly provided in their representations or made explicit via the mapping N.B.: specifications for pivot format ≠ user’s format –Only requirement is that user format can be mapped to the spec

Segmentation Minimal unit of granularity Points to virtual nodes between characters in primary data May have multiple segmentations over the same data No associated annotation content (at this level) Set of linearly ordered edges
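
A minimal sketch of segmentation as edges over virtual nodes, assuming that node i sits immediately before character i of the primary data. The function names and the whitespace-based tokenisation are assumptions for illustration only.

```python
text = "The clock struck"

# Virtual nodes sit between characters: node i lies before text[i],
# so a text of n characters has n + 1 nodes (0 .. n).


def word_segmentation(s):
    """Return linearly ordered (start, end) edges, one per whitespace-delimited token."""
    edges, start = [], None
    for i, ch in enumerate(s):
        if ch.isspace():
            if start is not None:
                edges.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        edges.append((start, len(s)))
    return edges


def char_segmentation(s):
    """A second, finer segmentation over the same data: one edge per character."""
    return [(i, i + 1) for i in range(len(s))]
```

Both functions describe the same primary data, which illustrates how several segmentations can coexist; neither carries any annotation content.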

Annotations –Category Annotation label May be data category in DCR –Relation-object pair(s) Link label pointing to object(s) of the annotation (idref) Link label may be data category in DCR –Feature Structure Feature structure content providing annotation information Attribute-value pairs Recursive Can specify alternatives etc.
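
The three parts of an annotation (category, relation-object pairs, feature structure) can be pictured as a nested Python dictionary. The key names, the identifiers s1, np1 and vp1, and the grouping of number and person under "agreement" are assumptions for this sketch; the feature values are borrowed from the example slides.

```python
annotation = {
    "id": "s1",
    "category": "Sentence",
    # Relation-object pairs: link label plus the id of the object.
    "relations": [("SUBJ", "np1"), ("MainVerb", "vp1")],
    # Recursive feature structure: values may themselves be feature structures.
    "features": {
        "type": "declarative",
        "agreement": {"number": "sing", "person": 3},
    },
}


def feature_paths(fs, prefix=()):
    """Flatten a recursive feature structure into (path, value) pairs."""
    for attr, value in fs.items():
        if isinstance(value, dict):
            yield from feature_paths(value, prefix + (attr,))
        else:
            yield prefix + (attr,), value
```

Flattening into attribute paths is one simple way to compare or index feature structures; it is a design choice of this sketch, not something the framework prescribes.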

Annotation Layers Conceptual layers of annotation –E.g. morpho-syntax, syntax, co-reference… –ISO TC37/SC4 defining a set of layers Each layer has a schema defining the relevant categories and relations –E.g. syntax Category: Sentence Relations: SUBJ (Object: NP), MainVerb (Object: VP), “Constituent” (Object: NP | VP | PP) Inter-layer and cross-layer relations
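
The syntax-layer schema given as an example above can be written down as data and used to check annotations. The dictionary layout and the function violations are hypothetical; only the category and relation names come from the slide.

```python
# Schema for the syntax layer: category -> relation -> allowed object categories.
SYNTAX_SCHEMA = {
    "Sentence": {
        "SUBJ": {"NP"},
        "MainVerb": {"VP"},
        "Constituent": {"NP", "VP", "PP"},
    }
}


def violations(category, relations, schema=SYNTAX_SCHEMA):
    """List the (relation, object_category) pairs that the schema does not allow."""
    allowed = schema.get(category)
    if allowed is None:
        # The category itself is unknown to this layer.
        return [("<category>", category)]
    return [
        (rel, obj) for rel, obj in relations if obj not in allowed.get(rel, set())
    ]
```

For instance, a Sentence whose SUBJ points to a PP would be reported, while SUBJ pointing to an NP passes.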

Example |T|h|e| |c|l|o|c|k| |s|t|r|u|c|k| | … [Figure labels: segmentation, annotation]

Mapping to the Pivot Format: a problematic case… Penn Treebank: ((S (NP-SBJ-1 Paul) (VP intends) (S (NP-SBJ *-1) (VPto (VP leave) (NP IBM ) ) ) ).)) Segmentation: |P|a|u|l| |i|n|t|e|n|d|s| |t|o| |l|e|a|v|e| |I|B|M|.| Desired result? [Figure: graph over the segmentation with nodes S, S, NP, VP, NP, VP and edge labels SUBJ, CONSTITUENT and "?".]

Ideal Result? Segmentation: |P|a|u|l| |i|n|t|e|n|d|s| |t|o| |l|e|a|v|e| |I|B|M|.| Morpho-syntactic layer (tokens): base: Paul, msd: PN; base: intend, msd: V3S; base: to, msd: PREP; base: leave, msd: VINF; base: IBM, msd: PN; base: ., msd: PUNC. Syntactic layer: [Figure: nodes S, S, NP, VP, NP, VP linked by SUBJ, MAINVERB, OBJ, TO-INF, RELCLAUSE and HEAD relations, with features tense: pres, tense: inf, number: sing, person: 3, type: declarative.]

Goals Reference categories in DCR rather than give cats Reference FS fragments and schema layer definitions in on-line libraries Annotation schemes designed/modified to conform to the model

Summary (end of slides by Nancy Ide) Model still evolving –Precise pivot XML not fixed Basic principles/ideas already appearing in applications/schemes Mapping to pivot will be simple, straightforward

The SynAF Working Draft SynAF (Syntactic Annotation Framework) has been adopted by ISO as a New Work Item (NWI) and, as a Working Draft (WD), is now close to being submitted as a Committee Draft (CD). For reference: Project number: Project abbreviation: SynAF Project leader: Thierry Declerck, DIN WG: ISO/TC 37/SC 4/WG 2 Representation schemes SynAF is partly based on MAF (Morpho-Syntactic Annotation Framework) and will propose a base for future standardisation of (linguistic) semantic annotation.

Topic of SynAF SynAF deals with the description of a meta-model for syntactic annotation: it will describe elementary linguistic (in fact syntactic) abstractions that support the construction and the interoperability of (syntactic) annotations and resources, as well as the procedure for creating data categories for syntactic annotation. SynAF thus does not propose a tagset for syntactic annotation. It proposes a (possibly hierarchical) list of data categories, which is much easier to update and extend, and which will serve as a point of reference for the particular tagsets used for the syntactic annotation of various languages, also in the context of various application scenarios.

Basis for SynAF Corpus (Linguistic) Annotation Frameworks that combine syntactic constituency and syntactic dependency –Tiger for German –ISST for Italian –Similar resources for other languages (see D3.1) Grammar Resources –Parsing output syntactic structures for various languages (HPSG, LS-GRAM Project, LFG parallel grammars, shallow grammars etc.)

The SynAF Proposal Syntactic annotation has 2 functions in NLP: 1) To represent linguistic constituencies, like Noun Phrases (NP), describing a structured sequence of morpho-syntactically annotated items, where we also consider constituents built from non-contiguous elements, and 2) To represent dependency relations. Dependency information can exist between morpho-syntactically annotated items within a phrase (an adjective is the modifier of the head noun within an NP) or describe a specific relation between syntactic constituents at the clausal and sentential level (e.g. an NP being the "subject" of the main verb of a clause or sentence). In the first case we speak of an internal dependency and in the second case of an external dependency. A dependency relation can also involve empty elements (as with the pro-drop property in Romance languages).

The SynAF Proposal (2) SynAF is thus concerned with a meta-model that covers both dimensions, syntactic constituency and dependency, and proposes a multi-layered annotation framework that allows the combined and interrelated annotation of language data along both lines. The data categories to be proposed for ISO standardization will likewise cover the basic annotation for both dimensions.

The SynAF Model: A first Draft [Diagram: Raw Text, Morpho-Syntactic Annotation, MAF Objects, Syntactic Annotation, T Nodes, NT Nodes and Edges, connected by "applies to", "generates" and "points to" relations.]

Some Remarks SynAF is about graphs. Nodes in a syntax graph can be both terminals (word-forms) and non-terminals (constituents). Nodes can be interrelated by edges (between source and target nodes). This definition supports discontinuity (a constituent can be described by multiple edges). At the representation level: should the list of edges be kept within the corresponding nodes or separated? If separated, then the edge information explicitly mentions the source and target of each edge (cleaner from the point of view of algebra).

Comments... (2) Encoding the initial and ending points of a node also supports underspecification of annotation and the description of empty elements (start point = end point, naming the point to which the empty element belongs). Need to define a coherence condition on the paths one can build with edges.

Some Definitions Span: a pair of points identifying a segment of the document submitted to syntactic annotation. The first point ≤ the second point. Multiple span: a sequence of spans where the ending point of each span ≤ the starting point of the subsequent span. Category: a feature value providing the content of a node. Node: a pair consisting of a (possibly multiple) span and a category –(Alternative: Node: a set of edges from the previous annotation levels) Edge: a triplet with a source node, a target node, and a label. Non-terminal nodes have outgoing constituency edges (to be defined). Label: a feature value providing the content of an edge. Terminal node: refers to a single wordForm/lexical unit or a span with length = 0, and the node and the wordForm/lexical unit have identical span. (really needed???) Tackling NT overlapping over a T or a text part
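
These definitions translate fairly directly into data types. The sketch below assumes that the ordering relation between points is "less than or equal", so that a span with start = end can stand for an empty element. The class and attribute names are illustrative choices, not part of the SynAF draft.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    start: int
    end: int

    def __post_init__(self):
        # Assumption: first point <= second point (equality = empty element).
        if self.start > self.end:
            raise ValueError("span start must not exceed span end")


@dataclass(frozen=True)
class Node:
    spans: Tuple[Span, ...]  # one span, or several for a discontinuous node
    category: str

    def __post_init__(self):
        if not self.spans:
            raise ValueError("a node needs at least one span")
        # Multiple span: each span must end before (or where) the next begins.
        for left, right in zip(self.spans, self.spans[1:]):
            if left.end > right.start:
                raise ValueError("spans must be ordered and non-overlapping")

    @property
    def is_empty(self):
        """True for empty elements, i.e. every span has length 0."""
        return all(s.start == s.end for s in self.spans)


@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    label: str


# A discontinuous NP, an empty element, and an edge between two nodes.
discontinuous_np = Node((Span(0, 4), Span(10, 15)), "NP")
empty_subject = Node((Span(5, 5),), "NP")
subject_edge = Edge(Node((Span(0, 15),), "S"), discontinuous_np, "SB")
```

Keeping edges as separate triplets, rather than as lists inside nodes, corresponds to the "separated" representation option raised in the remarks.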

Putting SynAF in XML A Terminal node: refers to a single wordForm/lexical unit, and the node and the wordForm have identical span T <terminal <category name="$CATEGORY_DatCat" span="$DIGIT - $DIGIT" <edge label="$LABEL_DatCat" sourcenode="$SourceNode" targetnode="$TargetNode"
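
One possible reading of this fragment is a terminal element containing a category element and zero or more edge elements. The nesting is an assumption, since the slide shows only the element and attribute names; the function terminal_element and the sample values (PN, SB, n1, t1) are likewise hypothetical. The sketch uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def terminal_element(category, start, end, edges=()):
    """Build a <terminal> element; the nesting of children is assumed, not given."""
    term = ET.Element("terminal")
    # Attribute names follow the slide: name and span on <category>.
    ET.SubElement(term, "category", name=category, span=f"{start} - {end}")
    for label, source, target in edges:
        # Attribute names follow the slide: label, sourcenode, targetnode.
        ET.SubElement(
            term, "edge", label=label, sourcenode=source, targetnode=target
        )
    return term


# A proper-noun terminal spanning points 0 to 4, with one incoming edge.
xml_text = ET.tostring(
    terminal_element("PN", 0, 4, [("SB", "n1", "t1")]), encoding="unicode"
)
```

In a real instance the category and label values would be references to data categories rather than free strings, as the "$..._DatCat" variables on the slide suggest.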

Putting SynAF in XML (2) Non-Terminal nodes have at least one outgoing constituency edge (to be defined) <nonterminal

Comments to the XML proposal Still have to include recursion of "constituencies" in the nonterminal. Still to consider the DatCats as values of some features (all encoded as variables for now, starting with the "$" sign).

Data Categories for SynAF We need 2 types of data categories in SynAF –Naming the nodes (constituents), for example: noun phrase (NP), proper noun (PN), adpositional phrase (PP), etc. –Naming the labels (dependencies), for example: head (HD), modifier (MOD), accusative object (AO) or subject (SB), etc.

Data Categories for SynAF (Constituency) Naming the nodes (constituents). Part of TDG4!!!

Data Categories for SynAF (Dependency) Naming the labels (dependencies).

Issues for SynAF Level of complexity: deal only with the intersection of syntactic phenomena that are present in all (or most) languages vs. an almost complete list describing language-dependent phenomena in detail. Closely related: monolingual description vs. multilingual descriptions. Cross-lingual aspects: for example, including in the annotation information that supports translation? Surface syntactic phenomena vs. "deep" linguistic phenomena (including transformation, movement, lexical rules) Etc...

Summary and Future Work Discussions today raised a number of points to be accommodated (TD: end of the week) Eric: Treatment of ambiguities Relation to DatCats: in the informal part. Progress on TDG4 to be achieved. Mid-September: deliver the WD for CD ballot.