ISO TC 37 / SC4 Language Resources An overview (Amended 2-5 February 2002) Laurent Romary

Standards for language processing
Primary resources (text, dialogues): structural mark-up, basic annotations [TEI, MPEG7, TMX (XHTML…), etc.]
NLP structures (annotations): POS tagging, chunks (cf. Named Entities), deep syntactic structures, co-references etc. [Eagles/ISLE, CES, MATE,…]
Knowledge structures: hierarchies of types, relations between concepts (subjects/topics etc.), links to primary resources [Topic Maps, OIL, RDF]
Lexical structures (language models): terminologies, transfer lexica, LTAG/HPSG/LFG lexica [TBX, OLIF, Eagles/ISLE (Genelex)]
Links
Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]
Access protocols [Corba, SOAP]

Context ISO TC37 - Terminology and other language resources SC3 - Computer applications in terminology ISO Martif Latest version of TEI Terminology chapter ISO Data categories ISO CD (DIS: under ballot) TMF (Terminological Markup Framework) SC4 - Language resources

TC37/SC4 details Scope: Platform for designing and implementing linguistic resource formats and processes Multi-layer annotation of linguistic resources Exchange of information between NLP modules General strategy Involve a wide community from academia and industry Identification of experts in the various work items Involvement through national standardizing bodies Agenda Current: identification of possible work items and working groups Constituency meeting and technical workshop at LREC (May 2002)

Organization Secretary: Prof. Key-Sun Choi, Korea Chair: Laurent Romary, France International Advisory Committee Permanent Chair: Prof. Antonio Zampolli, Italy

SC4 and other standardizing bodies W3C - basic protocols and formats XML (Schemas) XPath XPointer + RDF, SVG, SMIL, SOAP MPEG - Multimedia, XML based e.g. MPEG7-4 Word and phone lattices ISO TC37/SC4 - language resources, NLP perspective e.g. linguistic annotations, lexical formats TEI - text representation Reference for primary sources e.g.: text archives Text Audio/Speech Technical background What about gestures? Kinetic in the TEI SMIL? Oscar Contributing organizations

Working groups
WG1: Basic descriptors and mechanisms for language resources. Convener: Laurent Romary
WG2: Representation schemes. Convener: Kiyong Lee
WG3: Multilingual text representation. Convener: Alan K. Melby
WG4: Lexical databases. Convener: ??
WG5: Workflow of language resource management. Convener: Christian Galinski

TC37/SC4 Work Items
WG1/WI-0: Terminology of Language Resources
WG1/WI-1: Linguistic annotation framework
WG1/WI-2: Meta-data for multimodal and multilingual information
WG2/WI-3: Structural content representation scheme
WG2/WI-4: Multimodal content representation scheme
WG2/WI-5: Discourse level representation scheme

TC37/SC4 Work Items - cont.
WG3/WI-6a: Translation Memory, Alignment of parallel corpora
WG3/WI-6b: Segmentation and counting algorithms (characters, words, sentences etc.)
WG3/WI-6c: Meta-markup for GIL (Globalization, Internationalization and Localization)
WG4/WI-7: NLP Lexica
WG5/WI-8: Validation of language resources
WG5/WI-9: Net-based distributed cooperative work for the creation of LRs

WI-0 Terminology of Language Resources Basic terminology of the various sub-fields of language resources and general methodology Project leader: Klaus-Dirk Schmitz Sources: ISO 1087 LREC proceedings + KAIST English dictionaries in Linguistics? Support from GTW

WI-1 Linguistic annotation framework Basic mechanisms and data structures for linguistic annotation and representation [data architecture] Methods and principles for the design of an annotation scheme Structural nodes and information units, Data category specification Linking and pointing mechanisms, Feature Structures, Meta-Markup « Stand-off » and « in-line » views - equivalences, combining levels. Administrative data categories
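The equivalence between « stand-off » and « in-line » views mentioned on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Annotation` class, the `to_inline` helper, the `<w>` element name and the `pos` category are hypothetical and are not taken from any SC4, TEI or MATE specification. The sketch only handles non-overlapping annotations over character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points into the text instead of wrapping it."""
    start: int      # character offset where the annotated span begins
    end: int        # character offset just past the end of the span
    category: str   # hypothetical data category name, e.g. "pos"
    value: str      # value of that data category, e.g. "NOUN"


def to_inline(text: str, annotations: list[Annotation]) -> str:
    """Render non-overlapping stand-off annotations as in-line markup."""
    pieces = []
    position = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        # Copy the unannotated text between the previous span and this one.
        pieces.append(text[position:ann.start])
        span = text[ann.start:ann.end]
        pieces.append(f'<w {ann.category}="{ann.value}">{span}</w>')
        position = ann.end
    pieces.append(text[position:])
    return "".join(pieces)


# The primary resource stays untouched; annotations live beside it.
text = "The cat sleeps"
standoff = [
    Annotation(0, 3, "pos", "DET"),
    Annotation(4, 7, "pos", "NOUN"),
    Annotation(8, 14, "pos", "VERB"),
]
inline = to_inline(text, standoff)
print(inline)
```

The stand-off list and the in-line string carry the same information here. The stand-off form has the practical advantage that several annotation levels can point at the same primary text without modifying it.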

WI-1 - cont. Project leader: Nancy Ide (TBC) Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary… Possible sources: TMF, iso12620-revised, Mate (general methodology) TEI (Linking mechanisms, feature structures) Link with Linguistic DS

WI-2 Meta-data for multimodal and multilingual information Description of a meta-data representation scheme to document linguistic information structures and processes General content description Local content description Project leader: Peter Wittenburg, MPI (Nijmegen, NL) Participants: Steven Bird, TEI aware person Possible sources: OLAC, Mile, TEI Header Liaison: TC46 (SC9), MPEG7/MDS, SCORM

WI-3 Structural content representation scheme Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes Meta-model for morpho-syntactic annotation Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures) + corresponding Data category registries

WI-3 - cont. Project leader: John Carroll ?? Participants: Nuria Bell, … representatives from existing TreeBanks initiatives Possible sources: Eagles, TAGML, Linguistic DS SIGPARSE

WI-4 Multimodal meaning representation scheme Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural) Meta-model for content representation (Events, participants, etc.) Data category registry for multimodal content Project leader: Harry Bunt (id="1") Possible sources: SIGSEM working group on semantic content Chair: #1 « Liaison » Semantic web activities

WI-5 Discourse level representation scheme Meta-model for discourse and dialogue representation Meta-model for discourse level annotation (e.g. reference annotation) + corresponding DatCat registry Possible sources: SIGDIAL DRI - Discourse Resource Initiative Mate

WI 6a Translation Memory, Alignment of parallel corpora Provides formats for the representation of multilingual textual data as produced in translation activities or constructed from existing primary sources Sources: OSCAR/TMX for translation memories TEI based linking mechanism (or see WI-1) for Parallel texts
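As a rough illustration of what a representation of multilingual textual data can look like, the sketch below builds one translation unit using the `tu` / `tuv` / `seg` element names and the `xml:lang` attribute used by TMX. It is a fragment only: the `make_tu` helper and the sample strings are hypothetical, and a complete TMX file would also need a root element, a header and a body, which are omitted here.

```python
import xml.etree.ElementTree as ET

# Qualified name of the standard xml:lang attribute in ElementTree notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tu(variants: dict[str, str]) -> ET.Element:
    """Build one translation unit with one variant per language (illustrative helper)."""
    tu = ET.Element("tu")
    for lang, segment_text in variants.items():
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = segment_text
    return tu


# Two aligned segments, as might come from a translation memory or a parallel corpus.
tu = make_tu({"en": "Language resources", "fr": "Ressources linguistiques"})
tu_xml = ET.tostring(tu, encoding="unicode")
print(tu_xml)
```

Aligned parallel texts could instead be linked with pointers into two separate documents, which is the TEI-style alternative this slide mentions.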

WI 6b Segmentation and counting algorithms (characters, words, sentences etc.) Provide methods for segmenting streams of text with markup and means for counting the corresponding segments Possible sources: OSCAR
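To make the idea of counting segments in marked-up text concrete, here is a deliberately naive sketch. It is not the OSCAR algorithm or any proposed standard: the tag-stripping regular expression, the sentence-boundary rule and the `count_segments` function are all assumptions chosen for brevity, and real text (abbreviations, scripts without spaces, inline tags inside words) would need far more careful rules.

```python
import re

# Naive assumption: anything between "<" and ">" is markup and is not counted.
TAG_PATTERN = re.compile(r"<[^>]+>")
# Naive assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def count_segments(marked_up_text: str) -> dict[str, int]:
    """Strip markup, then count characters, words and sentences in the remaining text."""
    plain = TAG_PATTERN.sub("", marked_up_text).strip()
    sentences = [s for s in SENTENCE_BOUNDARY.split(plain) if s]
    words = re.findall(r"\w+", plain)
    return {
        "characters": len(plain),   # includes spaces and punctuation
        "words": len(words),
        "sentences": len(sentences),
    }


counts = count_segments("<p>Hello <b>world</b>. How are you?</p>")
print(counts)
```

Even this toy version shows why a shared definition matters: whether spaces count as characters, or whether markup splits a word, changes the totals that two tools would report for the same text.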

WI 6c Meta-markup for GIL (Globalization, Internationalization and Localization) Identification of the specific markup modules needed to perform GIL activities Possible sources: OSCAR/OpenTag

WI-7 NLP lexica Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica) Define a set of meta-models (classes of applications) Specific data categories (derivation, phonology, etc.) Based on the work done in other work items Possible sources Eagles Multext ISLE Computational lexicon Working group OLIF

WI-8 Validation of language resources Defines guidelines and requirements for producing and distributing high quality language resources Contacts: ELRA, TEI Possible sources: To be defined

WI-9 Net-based distributed cooperative work for the creation of LRs Principles and methods for designing collaborative and cooperative compilation of LRs Define what is specific to LRs with regard to: Traceability of resources, version control, validation, quality management Protocols (Corba, SOAP), Workflow standards, Data management Contacts: Christian Galinski, Remi Zajac, … Sources: To be defined

Liaison - OSCAR (AKM) Brief history of LR exchange standards Parallel events since 1997 Open Tag - meta-markup (XML vs. Others) Major current OSCAR activities TMX - Translation Memory eXchange Counting and segmentation algorithms TBX (Terminologies) and OLIF (MT lexica) XLIFF and CGS - Annotation of source code and localisation of web sites xml:lang etc.: J. DeCamp and S.-E. Wright

Liaison - TEI (LR) General architecture and data modeling WI-1 Annotations (paragraph level, external annotations) WI-1 TEI Header WI-2 NLP lexica (with regard to terminologies and dictionaries) WI-7 Feature structures WI-1