IRCS Workshop on Linguistic Databases 11-13 December 2001 Philadelphia Standards for Language Resources Nancy IDE Department of Computer Science Vassar.

Presentation transcript:

IRCS Workshop on Linguistic Databases, December 2001, Philadelphia
Standards for Language Resources
Nancy IDE, Department of Computer Science, Vassar College
Laurent ROMARY, Equipe Langue et Dialogue, LORIA/INRIA

Goals
Present an abstract data model for linguistic annotations and its implementation using XML, RDF and related standards
Outline the work of the newly formed ISO committee TC 37/SC 4, Language Resource Management
– Using the work described as its starting point
– Solicit the participation of members of the research community

Goals of ISO TC 37/SC 4
Prepare international standards and guidelines for effective language resource (LR) management in mono- and multilingual applications
Develop principles and methods for creating, coding, processing and managing LR
– written corpora, lexical corpora, speech corpora, dictionary compiling and classification schemes
Focus:
– data modeling
– data exchange, evaluation

Standardization Process
Two phases:
1. Develop a basic architecture to support a wide range of applications
2. Use it as the basis for building more precise standards for LR management
Liaison with ISLE
– Incorporate existing standards where possible
– Broaden by including additional languages (e.g. Asian)

Standardization is Tricky
Skepticism within the community
Arguments against LR standardization:
1. Diversity of theoretical approaches makes standardization impractical or impossible
2. Vast amounts of existing data and processing software will be rendered obsolete by the acceptance of new standards

SC4 Approach
Efforts geared toward defining abstract models and general frameworks for creation and representation of language resources
– In principle, abstract enough to accommodate diverse theoretical approaches
Situate development squarely in the framework of XML and related standards
– Ensure compatibility with established and widely accepted web-based technologies
– Ensure feasibility of transduction from legacy formats into newly defined formats

Call for Participation
Success of the committee depends on the community's awareness of its activity, in order to ensure widespread adoption
Involve a broad range of potential users of the standards from the outset

The General Framework
Model for linguistic annotation that can
– be instantiated in a standard representational format
– serve as a pivot format into and out of which proprietary formats may be transduced, to enable comparison, merging, and manipulation via common tools
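
The pivot idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source formats ("slashed" and "columns"), the pivot representation (a list of token/tag pairs), and every function name are hypothetical and do not come from the talk.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical pivot representation: a list of (token, tag) pairs.
Pivot = List[Tuple[str, str]]


def slashed_to_pivot(text: str) -> Pivot:
    """Read the made-up 'token/TAG token/TAG' format."""
    pairs = []
    for item in text.split():
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


def pivot_to_slashed(pivot: Pivot) -> str:
    """Write the made-up 'token/TAG token/TAG' format."""
    return " ".join(f"{token}/{tag}" for token, tag in pivot)


def columns_to_pivot(text: str) -> Pivot:
    """Read the made-up one-token-per-line 'token<TAB>TAG' format."""
    pairs = []
    for line in text.splitlines():
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs


def pivot_to_columns(pivot: Pivot) -> str:
    """Write the made-up one-token-per-line 'token<TAB>TAG' format."""
    return "\n".join(f"{token}\t{tag}" for token, tag in pivot)


# One reader and one writer per format; no format knows about any other.
READERS: Dict[str, Callable[[str], Pivot]] = {
    "slashed": slashed_to_pivot,
    "columns": columns_to_pivot,
}
WRITERS: Dict[str, Callable[[Pivot], str]] = {
    "slashed": pivot_to_slashed,
    "columns": pivot_to_columns,
}


def convert(text: str, source: str, target: str) -> str:
    """Convert between two formats by going through the pivot."""
    return WRITERS[target](READERS[source](text))


print(convert("Paul/PNOUN aime/VERB", "slashed", "columns"))
```

With N formats, this design needs N readers and N writers rather than one converter for each of the N(N-1) ordered pairs of formats, and any tool written against the pivot works for all of them.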

Overall Plan
[Figure: diagram with labels Format A, Format B, Format C, Abstract Format, "Operation via common tools, merging, etc.", Annotation Format, Tower of Babel]

Overall Architecture
[Figure: architecture diagram with labels Data Category Registry, Structural Skeleton, Data Category Specification, Dialect Specification, Virtual AML, Concrete AML, Abstract XML encoding, Concrete XML encoding, Non-XML Encoding, XSLT Script, Universal Resources, Project Specific Resources]

N.B.
We do not expect XML to necessarily serve as the internal format used by tools etc.
We do not care about creating yet another standard format
We do not care (for this work) about designing specific annotation formats

Data Model
Identify a consistent underlying data model for data and its annotations
– Formalized description of data objects: composition, attributes, class membership, applicable procedures, etc.
– Formalized description of relations among data objects
– Independent of instantiation in any particular form

(Most) Abstract Model
An annotation is a set of data or information associated with some other data
More precisely: an annotation is a one- or two-way link between
– an annotation object, and
– a point or span (or a list/set of points or spans) within a base data set
Links may or may not have a semantics
Points and spans may be objects, or sets/lists of objects
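
A minimal sketch of this abstract model, assuming character offsets as the way to address spans and a flat dictionary as the annotation object. The class and function names (`Span`, `Annotation`, `covered_text`) are illustrative choices, not part of the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """A span of the base data, as character offsets [start, end)."""
    start: int
    end: int


@dataclass
class Annotation:
    """A link between an annotation object and one or more spans."""
    obj: Dict[str, str]                       # the annotation object
    targets: List[Span] = field(default_factory=list)


def covered_text(base: str, annotation: Annotation) -> List[str]:
    """Return the pieces of the base data that the annotation points to."""
    return [base[span.start:span.end] for span in annotation.targets]


base = "Paul aime les croissants"
ann = Annotation({"pos": "PNOUN"}, [Span(0, 4)])
print(covered_text(base, ann))
```

Keeping `targets` as a list allows one annotation object to be linked to several spans, which covers the "list/set of points or spans" case; the base data itself is never modified.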

[Figure: PRIMARY DATA with bracketed spans linked to ANNOTATION OBJECT boxes]

Observations
Granularity of the data representation and encoding is critical
Must be possible to represent objects and relations in some form that prevents information loss

Representing Annotation Objects
Annotation objects may be relatively complex
Abstract representation
– graph of elementary structural nodes to which one or more information units are attached
– distinction between structure and information units is critical to the design of a truly general model
Annotations may be structured in several ways
– Most common: hierarchical (phrase structure analyses of syntax, lexical and terminological information, etc.)

Relations Among Annotations
1. Parallelism
– two or more annotations refer to the same data object
2. Alternatives
– two or more annotations comprise a set of mutually exclusive alternatives
3. Aggregation
– two or more annotations comprise a list or set that should be taken as a unit

Information Units
Also called data categories
– provide the semantics of the annotation
– the most theory- and application-specific part of an annotation scheme
No attempt to define data categories
– Proposal: development of a Data Category Registry
– Define data categories with RDF schemas
– Formalize properties and relations
– Templates that describe how objects are instantiated
– Inheritance of appropriate properties

Data Category Registry
Several functions
1. provide a precise semantics for annotation categories
– can be used off the shelf or modified
2. provide a set of reference categories onto which scheme-specific names can be mapped
3. provide a point of departure for definition of variant or more precise categories
Overall goal
– Ensure that semantics of data categories are well-defined and understood

Generic Mapping Tool (GMT)
Instantiation of abstract format in XML
Why XML?
– Supported standard
– Built-in representation for hierarchies (nested tags)
– Sophisticated linking mechanisms: can link to points and spans, using explicit locations or tags
– XSLT for transduction, XML Schemas for validation, etc.

A Few Simple Tags
[...] represents a structural node in the annotation
– may be recursively nested at any level
[...] provides information attached to the node represented by the enclosing [...]
– type attribute identifies data category
– Contents: a string providing a value for the data category; recursively nested elements (for complex structures); or empty, pointing via a target attribute to an object in another document
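
The two kinds of element described on this slide can be sketched with Python's standard `xml.etree.ElementTree`. The element names are missing from this transcript, so `struct` and `feat` below are placeholder names chosen for the sketch, as are the helper `make_node` and the data category names `lemma`, `pos` and `cat`; only the `type` attribute is taken from the slide.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def make_node(features: Dict[str, str],
              children: Iterable[ET.Element] = ()) -> ET.Element:
    """Build one structural node carrying information units.

    "struct" and "feat" are placeholder element names (assumption).
    """
    node = ET.Element("struct")
    for category, value in features.items():
        # The type attribute identifies the data category;
        # the element content is the value for that category.
        feat = ET.SubElement(node, "feat", {"type": category})
        feat.text = value
    # Structural nodes may be nested recursively.
    node.extend(list(children))
    return node


word = make_node({"lemma": "Paul", "pos": "PNOUN"})
phrase = make_node({"cat": "NP"}, [word])

print(ET.tostring(word, encoding="unicode"))
print(ET.tostring(phrase, encoding="unicode"))
```

Because structure (the nesting of nodes) and information (the typed units) are built by separate steps, the same helper serves for flat word-level annotation and for nested, phrase-level annotation.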

Other Tags
– [...] brackets alternative annotations
– [...] points to a non-contiguous related element
– [...] points to the data to which the annotation applies
– assumes the use of stand-off annotation
– target attribute uses XML Pointers
– [...] groups information to be regarded as a unit

Tag names etc. unimportant
– It is the underlying data model that counts
– Essentially uses feature structures
GMT sufficiently powerful to represent information across annotation types
Demonstrated applicability to
– terminological and lexical information (Ide, et al., 2000)
– syntactic annotation (Ide and Romary, 2001)
Existing formats (XML or other) are mapped to the GMT for merging, manipulation via common tools, etc., then re-mapped to the original formats for use in in-house tools and applications

Examples
Morpho-syntactic annotation
– involves the identification of word classes over a continuous stream of word tokens
– may refer to the segmentation of the input stream into word tokens
– may also involve grouping together sequences of tokens or identifying sub-token units (or morphemes)
– description of word classes may include one or several features: syntactic category, lemma, gender, number, …

Representation in GMT
Single type of structural node
– represents a word-level structure unit
One or several information units associated with each structural node

Simple Case
Example: Paul aime les croissants
Paul: PNOUN
aimer: VERB, present, 3
le: DET, plural
croissant: NOUN, plural
Pointers to data in primary document

Representing More Complex Cases
Example: du = de + le in French
de: PREP
le: DET
Points to du in text
Gives the structure of the word underlying the word
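
The du = de + le case can be sketched as one structural node that points to the surface word and contains two nested nodes for the underlying components. The XML below is a hypothetical rendering: the element names `struct`, `feat` and `seg`, the category names `lemma` and `pos`, and the pointer value `#w1` are assumptions made for this sketch, since the original markup is not preserved in the transcript.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical markup: the outer node points to "du" in the primary
# document; the inner nodes give the structure underlying the word.
DU = """
<struct>
  <seg target="#w1"/>
  <struct><feat type="lemma">de</feat><feat type="pos">PREP</feat></struct>
  <struct><feat type="lemma">le</feat><feat type="pos">DET</feat></struct>
</struct>
""".strip()


def components(node: ET.Element) -> List[Dict[str, str]]:
    """Collect the information units of each directly nested node."""
    return [
        {feat.get("type"): feat.text for feat in child.findall("feat")}
        for child in node.findall("struct")
    ]


root = ET.fromstring(DU)
print(root.find("seg").get("target"))
print(components(root))
```

The surface token is addressed once, at the outer node, while the analysis below it can be as fine-grained as the annotation scheme requires.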

GMT as a Tree Structure
[Figure: tree linked by seg to "du" in the Primary Document, with nodes (Lemma: de, Pos: prep) and (Lemma: le, Pos: det)]

Compound Words
Example: pomme de terre
Primary lemma: pomme_de_terre, NOUN
Component lemmas: pomme, NOUN; de, PREP; terre, NOUN

Tree
[Figure: tree over "Pomme de terre" in the Primary Document, with lemma: pomme_de_terre and three seg-anchored nodes: (Lemma: pomme, Pos: noun), (Lemma: de, Pos: prep), (Lemma: terre, Pos: noun)]

Advantages
Enables specification of the required level of granularity
– granularity of the segmentation in (or associated with) primary data may not correspond to that required for the annotation
Can define relations over the tree independently
– Compositional for morpho-syntax, syntax, etc.
– Partitions in lexical data
– …

Orth: overdress
– Pron: [jdciw]; Pos: verb; Def: To dress (oneself or another) too elaborately or finely
– Pron: [masliw]; Pos: noun; Def: A dress that may be worn over a jumper, blouse, etc.
overdress: verb [jdciw] To dress (oneself or another) too elaborately or finely; noun [masliw] A dress that may be worn over a jumper, blouse, etc.

Alternatives
boucher: VERB, present, 0.4
bouche: NOUN, 0.6
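
A small sketch of how a tool might consume such a set of alternatives. It assumes that the numbers 0.4 and 0.6 are confidence weights attached to each mutually exclusive reading, which the slide does not state; the key names (`lemma`, `pos`, `tense`) and the function `best_reading` are likewise illustrative.

```python
from typing import Dict, List, Tuple

# Each alternative: (information units, weight). Reading the slide's
# numbers as weights is an assumption of this sketch.
Reading = Tuple[Dict[str, str], float]

alternatives: List[Reading] = [
    ({"lemma": "boucher", "pos": "VERB", "tense": "present"}, 0.4),
    ({"lemma": "bouche", "pos": "NOUN"}, 0.6),
]


def best_reading(readings: List[Reading]) -> Dict[str, str]:
    """Pick the highest-weighted reading from a set of alternatives."""
    units, _weight = max(readings, key=lambda reading: reading[1])
    return units


print(best_reading(alternatives))
```

Keeping all alternatives in the representation, rather than only the winner, lets a later processing stage revise the choice.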

Relating Annotation Levels
Three ways:
1. Temporal anchoring associates positional information with each structural level
2. Event-based anchoring introduces a structural node to represent a location in the text to which all annotations can refer
3. Object-based anchoring enables pointing from a given level to one or several structural nodes at another level

Temporal Anchoring
Positional information
– Usually, a pair of numbers expressing the starting and ending points of a segment
Attributes for [...]:
– /startPosition/: the temporal or offset position of the beginning of the current structural node
– /endPosition/: the temporal or offset position of the end of the current structural node
Example: iy
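
A minimal sketch of how start and end positions let two annotation levels be related without either pointing at the other. The `Node` class, the level names, and all position values (read here as milliseconds) are invented for illustration; only the two attribute names and the phone label "iy" come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """A structural node anchored by start and end positions."""
    level: str
    label: str
    start_position: int   # hypothetical values, in milliseconds
    end_position: int


def overlaps(a: Node, b: Node) -> bool:
    """True when the two anchored spans share some stretch of the signal."""
    return a.start_position < b.end_position and b.start_position < a.end_position


def aligned(node: Node, others: List[Node]) -> List[Node]:
    """Nodes from another level whose span overlaps the given node."""
    return [other for other in others if overlaps(node, other)]


word = Node("word", "she", 0, 300)
phones = [
    Node("phone", "sh", 0, 120),
    Node("phone", "iy", 120, 300),
    Node("phone", "hh", 300, 380),
]

print([phone.label for phone in aligned(word, phones)])
```

The comparison uses half-open spans, so a node that starts exactly where another ends is not counted as overlapping.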

Event-based Anchoring
Useful when:
– It is not possible or desirable to modify the primary data by inserting markup to identify specific objects or points in the data
– Primary data is marked with milestones (e.g., time stamps in speech data), where spans across the various milestones must be identified
Here, [...] elements represent markup for segmentation (e.g., segmentation into words, sentences, etc.).

GMT Rendering
Structural node (landmark) referred to by annotations for the defined span
Annotation graph formalism explicitly designed for this

GMT Advantages
AG formalism reifies the arc vs. identification via XML tags
GMT: the two methods are analogous
– annotator can use either method
AG not well-suited to hierarchically organized annotations
– requires special mechanisms
GMT: exploits the hierarchical structure built in to XML
– flat and hierarchical annotations treated using the same mechanisms

Object-based Anchoring
Useful to make dependencies between two or more annotation levels explicit
– Example: syntactic annotation can refer directly to the relevant nodes in a morpho-syntactically annotated corpus

Representation for du chat
de: PREP
le: DET, masc
chat: NOUN
NP

GMT as a Modeling Tool
Rendering various formats into GMT representation has revealed some problems and inconsistencies in existing formats
– Penn Treebank: inconsistent indication of relations (see Ide and Romary, ACL 2001 or Abeillé Treebank book, forthcoming)
– NOMLEX lexicon: no (automatically perceivable) distinction between lists and alternatives
The abstract format serves the unexpected purpose of providing a template for fundamental annotation properties

Jumping Ahead…
Is XML distracting us from our real work?
– YES, because focus on details of using XML and related standards can obscure the real work of data modeling
– BUT
Data models are of no use in the abstract alone: we need means to implement them
XML, schemas, RDF, etc. are powerful data modeling tools based on years of research in this area
Need to know how to best exploit them for our purposes
Need a synergy between modeling efforts and implementation in XML, RDF, etc.
Need to remember that using XML is just a vehicle to ensure flexibility, convertibility, and compatibility with evolving technologies

Conclusion
ISO committee
– Work is continually evolving
Try to stay at the leading edge of data representation
– We are only at the assembly language level
– We need to do this right to enable a web of databases
Call for participation!!!

Thank You
Contacts
US Expert, ISO TC37 SC4: Nancy Ide
Chairman, ISO TC37 SC4: Laurent Romary