LEXUS: A flexible web-based Lexicon Tool Interacting with ISO Data Category Registry. Peter Wittenburg, Marc Kemps-Snijders, MPI for Psycholinguistics.

Presentation transcript:

1 LEXUS: A flexible web-based Lexicon Tool Interacting with ISO Data Category Registry. Peter Wittenburg, Marc Kemps-Snijders, MPI for Psycholinguistics

2 Outline: background and problem; MPI motivation = NLP motivation; playing LEGO; ISO TC37/SC4; Data Categories; Lexical Markup Framework; LEXUS tool (Marc); demo (Marc); outlook

3 Background: 7 DOBES teams and 12 different lexica (structures, purposes). Three example structures are shown. (a) Simple spreadsheet: Tuvan orthography, Tuvan appendix, German orthography, Russian orthography, Russian appendix, Xakas orthography, Tofa orthography. (b) A little more complex, including 1:N relations: stem orthography, sense*, lexical sub-entry*, sense nr, sense, gram cat, gram subcat, Engl. transl., example*, orthography, Engl. transl. [T|pr], nr. (c) Small part of a complex lexicon structure: entry-type = [stem|idiom|lexical word], head, outer-body-L*, headword, citation form, homograph no., phonetic form, inner-body-L, grammar, sense number, variety, meaning, etymology table, example*, comment*, picture/photo*, housekeeping*, gloss, word-level-gloss, reversal, definition, encyclopedic info, scientific name, semantic domain, semantic index, thesaurus, semantic relation*, cross-ref*. At top level there are 4 different entry types (only one is shown).

4 Problem: We have to use one archival lexicon representation format based on XML, and we have to build one archival exploitation framework. However, we receive lexica with various character encodings, in all sorts of formats (various XML, SBX, CHAT, even Word), in various structures, and with different terminologies (lexical attributes, values). How to do cross-lexical searches? How to do lexical merging, linking and comparison? How to solve lexicon-corpus interaction? Etc. NLP has the same problems: lack of standards, lack of re-usability, lack of interoperability. You knew this already, didn't you?
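One way to picture the cross-lexical search problem is the sketch below. It assumes each lexicon's local field names can be mapped onto a shared set of data category names; the lexicon names, field names and entries are all invented for illustration and do not come from the DOBES lexica.

```python
# Hypothetical mapping from each lexicon's local field names to shared
# data category names; all names and entries here are invented.
FIELD_TO_DATCAT = {
    "lexicon_a": {"lx": "lemma", "ps": "partOfSpeech", "ge": "gloss"},
    "lexicon_b": {"headword": "lemma", "pos": "partOfSpeech", "english": "gloss"},
}

LEXICA = {
    "lexicon_a": [{"lx": "bank", "ps": "noun", "ge": "bench"}],
    "lexicon_b": [{"headword": "schmal", "pos": "adjective", "english": "narrow"}],
}


def normalize(lexicon_name, entry):
    """Rename an entry's local fields to the shared data category names."""
    mapping = FIELD_TO_DATCAT[lexicon_name]
    return {mapping[f]: v for f, v in entry.items() if f in mapping}


def search(datcat, value):
    """Return (lexicon, normalized entry) pairs whose data category equals value."""
    hits = []
    for name, entries in LEXICA.items():
        for entry in entries:
            normalized = normalize(name, entry)
            if normalized.get(datcat) == value:
                hits.append((name, normalized))
    return hits
```

Once both lexica are expressed in the same terminology, a single query such as search("partOfSpeech", "noun") can run over all of them.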

5 Why not play LEGO? A concrete lexicon schema is basically seen as lexical attributes grouped together with others and embedded in a tree structure. Diagram labels: sense nr, sense, gram cat, engl trans, examples, ortho, engl trans, gloss; cardinalities 1:N and 1:1. Building blocks: data categories (lexical attributes, linguistic concepts) and components (sub-schemas).
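The LEGO idea can be sketched as a small tree of components and data categories. This is only an illustration: the class names are hypothetical, and the assignment of the 1:N cardinality to the sense and example components is an assumption, since the slide diagram does not survive in the transcript.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataCategory:
    """A leaf of the schema: one lexical attribute."""
    name: str


@dataclass
class Component:
    """A sub-schema grouping data categories and nested components."""
    name: str
    cardinality: str = "1:1"
    children: List[Union["Component", DataCategory]] = field(default_factory=list)


def outline(node, depth=0):
    """Return the schema tree as a list of indented lines."""
    indent = "  " * depth
    if isinstance(node, DataCategory):
        return [f"{indent}{node.name}"]
    lines = [f"{indent}{node.name} [{node.cardinality}]"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Assumed arrangement of the labels that appear on the slide.
example = Component("example", "1:N", [
    DataCategory("ortho"), DataCategory("engl trans"), DataCategory("gloss"),
])
sense = Component("sense", "1:N", [
    DataCategory("sense nr"), DataCategory("gram cat"),
    DataCategory("engl trans"), example,
])
```

Calling print("\n".join(outline(sense))) shows the nesting; a different lexicon schema is built by recombining the same two kinds of block.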

6 What else: Relations. Actually, component association is a relation of a special type. We need various types of relations between attributes and units in value strings. Each relation can be associated with features, i.e. relations can be seen as components in their own right. Example entries in the diagram: bank: "breite Sitzgelegenheit" (something broad to sit on); sitzgelegenheit: "etwas um zu sitzen" (something to sit on); schmal: "gegenteil zu breit" (contrary to broad).

7 What else: Inheritance. Diagram: three entries, b’ang, boeb’ang and goeb’ang, each consisting of common attributes and particular attributes. Just one example, to reduce typing.
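A minimal sketch of how inheritance could reduce typing, assuming that boeb’ang and goeb’ang inherit their common attributes from b’ang. That parent relation, the "inherits" key and the attribute values are assumptions made for illustration; the slide only shows that the three entries share common attributes.

```python
def resolve(entry, entries):
    """Merge inherited attributes with the entry's own; own values win."""
    parent = entry.get("inherits")
    inherited = resolve(entries[parent], entries) if parent else {}
    return {**inherited, **entry["attributes"]}


# Placeholder values: only the headwords come from the slide.
entries = {
    "b'ang": {
        "inherits": None,
        "attributes": {"gram cat": "shared-category", "orthography": "b'ang"},
    },
    "boeb'ang": {"inherits": "b'ang", "attributes": {"orthography": "boeb'ang"}},
    "goeb'ang": {"inherits": "b'ang", "attributes": {"orthography": "goeb'ang"}},
}
```

Each derived entry stores only its particular attributes; resolve(entries["boeb'ang"], entries) fills in the common ones.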

8 What else: conditions (operations). General form: if value(X) then modify constraints(Y), etc. Diagram labels: head, outer-body-L, lexemtype; meaning, sense nr; meaning effect, categorial effect, sense nr. Conditions shown: if lexemtype = “stem | idiom | lexical word”; if lexemtype = “auxil | inflect affix”. Just one example from DOBES; there are probably better examples around.
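The "if value(X) then modify constraints(Y)" pattern might look like the sketch below. Which fields belong to which lexemtype branch is an assumption read off the order of the labels in the transcript, and the function names are hypothetical.

```python
# Assumed pairing of lexemtype values with the fields they permit.
CONSTRAINTS = [
    ({"stem", "idiom", "lexical word"}, {"meaning", "sense nr"}),
    ({"auxil", "inflect affix"}, {"meaning effect", "categorial effect", "sense nr"}),
]


def allowed_fields(lexemtype):
    """Return the set of fields permitted for the given lexemtype value."""
    for values, fields in CONSTRAINTS:
        if lexemtype in values:
            return fields
    raise ValueError(f"unknown lexemtype: {lexemtype!r}")


def violations(entry):
    """List the fields of an entry that its lexemtype does not permit."""
    allowed = allowed_fields(entry["lexemtype"]) | {"lexemtype"}
    return sorted(set(entry) - allowed)
```

The value of one attribute (lexemtype) thus changes which other attributes the schema accepts for that entry.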

9 ISO TC37/SC4: the solution? ISO TC37/SC4 is about standardization in language resource (LR) management. Central is the data category registry (DCR): basically a flat list of linguistic concepts, which will contain is_a relations that are part of the concept definition (“transitive_verb” is_a “verb”), with proper definitions and conceptual space (value range). There is a request for filling the DCR (metadata, morphology, syntax, …). SC4 is looking for abstract models (frameworks) for lexica, for annotation structures, for semantic annotations, for syntactic annotations, …
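A flat list with is_a links can still answer "is this concept a kind of that one?" by following the links upward. The sketch below assumes a single parent per concept; only the transitive_verb/verb pair comes from the slide, and the dictionary layout is hypothetical.

```python
# Flat registry: each concept optionally names the concept it is_a.
IS_A = {"transitive_verb": "verb"}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or reaches it via is_a links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False
```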

10 Underlying Model. Terms: data element concept, conceptual domain, data element, value domain. Example: the complex datcat /Gender/ has as its conceptual domain a set of simple datcats (/masculine/, /feminine/, /neuter/). As a data element it is implemented as an XML attribute named ‘gen’ (XML schema declaration), with a list of values m, f, n as its value domain. Example XML object: verte. The Dutch system is different.
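The split between the concept level (/Gender/ and its simple datcats) and one concrete implementation (an XML attribute 'gen' with values m, f, n) can be sketched as follows. The element name form and the dictionary layout are assumptions; only the attribute name, the value codes and the datcat names come from the slide.

```python
import xml.etree.ElementTree as ET

# Concept level plus one possible implementation of the complex datcat.
GENDER = {
    "datcat": "/Gender/",
    "xml_attribute": "gen",
    "values": {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"},
}


def read_gender(xml_text):
    """Map the 'gen' attribute of an XML object back to a simple datcat."""
    element = ET.fromstring(xml_text)
    code = element.get(GENDER["xml_attribute"])
    if code not in GENDER["values"]:
        raise ValueError(f"value {code!r} is outside the value domain")
    return GENDER["values"][code]
```

A lexicon with a different gender system would keep the same /Gender/ datcat but select a different set of simple datcats and value codes.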

11 Lexical Markup Framework. General Model: metamodel, data category selection, lexical model

12 Core Model. The metamodel is made of lexical layers; lexical layers are made of lexical components (or components). Diagram components: Lexical DB, Global Info, Lexical Entry, Form, Sense (cardinality labels on the associations: 1..1, 1..1, 0..n, n, n, 1..1). The basis for modeling purposes is UML; there will be an XML-schema based instantiation.
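The core model can be approximated with a few plain classes. The cardinalities used here (one Global Info per Lexical DB, any number of Lexical Entries, at least one Form and any number of Senses per entry) are an assumption, because the transcript does not preserve which label belongs to which association; the attribute names are hypothetical as well.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    orthography: str


@dataclass
class Sense:
    gloss: str


@dataclass
class LexicalEntry:
    forms: List[Form]
    senses: List[Sense] = field(default_factory=list)

    def __post_init__(self):
        # Assumed constraint: every entry has at least one form.
        if not self.forms:
            raise ValueError("a lexical entry needs at least one form")


@dataclass
class GlobalInfo:
    label: str


@dataclass
class LexicalDB:
    global_info: GlobalInfo  # assumed 1..1
    entries: List[LexicalEntry] = field(default_factory=list)  # assumed 0..n
```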

13 Extended Model. Diagram components: Lexical DB, Global Info, Lexical Entry, Form, Sense, Morphology, Inflexion, Paradigm (cardinality labels on the associations: 1..1, 1..1, 0..n, n, n, 1..1, n, 1..1, 0..n). Data categories shown: /lemma/, /POS/, /gender/, /key form/, /orthography/, /gender/, /number/, /tense/, /person/, /mood/, /orthography/, /variant for/, /identifier/.

14 Proposed Extensions. Diagram components: Lexical Entry, Form, Sense, Syntactic family, Semantic formula, Semantic argument, Construct set, Syntactic construct, Syntactic position, Semantic frame (cardinality labels on the associations: 1..1, n, 1..1, 0..n, n, n, n). Still ongoing discussions.

15 What will LMF be? Descriptions of the general model (metamodel + DCS); DCs have to be ISO 11179/12620/… compliant. Core model including component building, relations, conditions, inheritance. Extension mechanism. Proposed but not normative extensions (morphology, syntax, …). XML-schema based instantiation. Currently version 5 of the Draft Proposal: ISO/TC 37/SC 4 N130 Rev.5, Date: Working draft of ISO WD 24613:2005. web-site:

16 Goal of LEXUS: To provide a framework capable of handling diverse lexicon structures and formats. LEXUS is based upon the Lexical Markup Framework (LMF) within ISO TC37/SC4, which defines a blueprint for such a flexible framework. LEXUS is the first test and reference implementation of LMF. It increases interoperability by offering well-accepted data categories (ISO, GOLD, Shoebox MDF). Workshop ‘Lexical Databases and digital tools’, Nijmegen, April 2004

17 Current Status: supports the full LMF core model; allows for flexible creation of structures and content; supports use of well-accepted Data Category Registries (ISO 12620, Shoebox MDF); allows for dynamic editing of structures and content; supports use of multimedia content; import of existing lexica (Shoebox, CHAT); export (Shoebox/LMF XML); customizable layout.

18 Current Status: user authentication; personal workspace for creating and editing lexica; merging facilities; simple and advanced search.

19 Current Status (Technical): implemented in Java using Open Source components; uses Spring to ‘wire’ the application (modular approach avoiding ‘hard’ links); uses Hibernate as the persistence framework, which allows use of multiple databases (Postgres, MySQL, …); uses Tomcat as Servlet Container.

20 Logging on to the application: Users must authenticate before logging on to the application.

21 User workspace: Each user has his/her own personal workspace where private lexica are stored.

22 Lexicon creation: New lexica may be created…

23 Lexicon import: New lexica may be imported from a lexical resource…

24 Lexicon structure: The LMF core model can be identified in this simple structure. Components and data categories can be identified using different icons. All may be dynamically created or modified.

25 Lexicon structure: Representation of a more complex structure. By selecting a node in the tree, the content of a component or data category is shown and may be modified.

26 Data category selection: Data categories can easily be selected from data category registries.

27 Lexical entry overview: Overview of lexical entries. Selecting a lexical entry reveals its details.

28 Lexical entry details: Details of a lexical entry. Entry structure modifications are bound to the schema definition, e.g. cardinality.
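Binding entry modifications to the schema could work roughly as in this sketch, which checks the schema's cardinality before a component instance is added or removed. It is not the LEXUS implementation; the schema layout, function names and sample entry are all hypothetical.

```python
# Hypothetical schema: component name -> (minimum, maximum) occurrences
# per entry; None means "no upper bound".
SCHEMA = {"headword": (1, 1), "form": (1, None), "sense": (0, None)}


def can_add(entry, component):
    """True if one more instance of `component` stays within the maximum."""
    _, maximum = SCHEMA[component]
    return maximum is None or len(entry.get(component, [])) < maximum


def can_remove(entry, component):
    """True if removing one instance still leaves at least the minimum."""
    minimum, _ = SCHEMA[component]
    return len(entry.get(component, [])) > minimum


entry = {"headword": ["bank"], "form": ["bank"], "sense": []}
```

With this entry, a second headword is refused because the maximum is 1, while another sense can always be added.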

29 Lexical entry details: Attribute values can be easily modified. Various value types are supported (text, video, audio, image or file).

30 Lexical entry details: Example of uploading a video file.

31 Lexical entry details: Viewing multimedia content.

32 Alternative entry view: Alternative views are provided which may be customized in look and feel.

33 Synchronization of lexica (Personal Workspace, Main Lexicon): Lexica may be copied to and modified in the personal workspace.

34 Synchronization of lexica (Personal Workspace, Main Lexicon): Lexica may be synchronized with the main lexicon.

35 Synchronization of lexica: When synchronizing lexica, the user is notified of structural changes and is in total control of the synchronization process.
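Detecting the structural changes that the user is notified about could be sketched as a comparison of two schema descriptions. This is an illustration under assumptions: schemas are represented here as dictionaries from a component path to its cardinality, which is a hypothetical layout and not taken from LEXUS.

```python
def structural_changes(personal, main):
    """Compare two schemas given as {component path: cardinality} dicts."""
    personal_keys, main_keys = set(personal), set(main)
    return {
        "added": sorted(personal_keys - main_keys),
        "removed": sorted(main_keys - personal_keys),
        "changed": sorted(
            key for key in personal_keys & main_keys
            if personal[key] != main[key]
        ),
    }


# Invented example schemas for the main lexicon and a personal copy.
main_schema = {"entry/form": "1:N", "entry/sense": "1:N"}
personal_schema = {"entry/form": "1:1", "entry/sense": "1:N", "entry/sense/example": "1:N"}
```

The resulting report can be shown to the user, who then decides which of the listed changes to carry over to the main lexicon.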

36 Future directions: support for various types of relations; import of data from other sources; support for other Data Category Registries, e.g. GOLD; integration with the MPI archive; integration with exploitation tools (ELAN, ANNEX); miscellaneous user requests.


38 Example lexical structure, used in the TEOP project within DOBES: stem orthography, sense*, lexical subentry, sense nr, sense, gram cat, gram subcat, Engl. transl., example*, orthography, Engl. transl. [T/pr], nr. The * sign stands for 1:n relations of sub-structures.