1 LEXUS: A flexible web-based Lexicon Tool Interacting with ISO Data Category Registry. Peter Wittenburg, Marc Kemps-Snijders, MPI for Psycholinguistics

2 Outline
- Background: the problem
- MPI motivation = NLP motivation
- Playing LEGO
- ISO TC37/SC4
- Data Categories
- Lexical Markup Framework
- LEXUS tool (Mark)
- Demo (Mark)
- Outlook

3 Background
- 7 DOBES teams and 12 different lexica (differing in structure and purpose).
- Simple spreadsheet: Tuvan orthography, Tuvan appendix, German orthography, Russian orthography, Russian appendix, Xakas orthography, Tofa orthography.
- A little more complex, including 1:N relations (marked *): stem orthography, sense*, lexical sub-entry*, sense nr, sense, gram cat, gram subcat, Engl. transl., example*, orthography, Engl. transl., [T|pr], nr.
- Small part of a complex lexicon structure, which has 4 different entry types at top level (only one is shown): entry-type = [stem|idiom|lexical word], head, outer-body-L*, headword, citation form, homograph no, phonetic form, inner-body-L, grammar, sense number, variety, meaning, etymology, table, example*, comment*, picture/photo*, housekeeping*, gloss, word-level-gloss, reversal, definition, encyclopedic info, scientific name, semantic domain, semantic index, thesaurus, semantic relation*, cross-ref*.

4 Problem
- We have to use one archival lexicon representation format, based on XML.
- We have to build one archival exploitation framework.
- However, we receive lexica
  - with all sorts of character encodings,
  - in all sorts of formats (various XML, SBX, CHAT, even Word),
  - in various structures,
  - with different terminologies (lexical attributes, values).
- How to do cross-lexical searches?
- How to do lexical merging, linking and comparison?
- How to solve lexicon-corpus interaction? Etc.
- NLP has the same problems: lack of standards, lack of re-usability, lack of interoperability.
- You knew this already, didn't you?

5 Why not play LEGO?
- A concrete lexicon schema is basically seen as lexical attributes grouped together with others and embedded in a tree structure.
- Data categories: lexical attributes, linguistic concepts.
- Components: sub-schemas.
- Diagram labels: sense nr, sense, gram cat, engl trans, examples, ortho, engl trans, gloss; relations 1:N and 1:1.
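
The LEGO idea can be sketched as a small tree of components and data categories. This is only an illustration, not the LEXUS data model: the class names, the `repeatable` flag (standing in for a 1:N relation) and the example tree are assumptions built from the labels on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataCategory:
    """A leaf: one lexical attribute (linguistic concept)."""
    name: str


@dataclass
class Component:
    """A sub-schema: a named group of data categories and nested components."""
    name: str
    repeatable: bool = False  # assumption: True models a 1:N relation to the parent
    children: List[Union["Component", DataCategory]] = field(default_factory=list)


def leaf_paths(node, prefix=""):
    """Return the path of every data category below node, depth-first."""
    path = f"{prefix}/{node.name}"
    if isinstance(node, DataCategory):
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Hypothetical schema assembled from the slide's labels.
example = Component("example", repeatable=True,
                    children=[DataCategory("ortho"), DataCategory("engl trans")])
sense = Component("sense", repeatable=True,
                  children=[DataCategory("sense nr"), DataCategory("gram cat"),
                            DataCategory("engl trans"), example])
entry = Component("entry",
                  children=[DataCategory("ortho"), DataCategory("gloss"), sense])

for p in leaf_paths(entry):
    print(p)
```

Because a component is just a reusable block, the same `example` sub-schema could be plugged into another lexicon schema without change.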

6 What else: Relations
- Component association is actually a relation of a special type.
- Various types of relations are needed between attributes and between units in value strings.
- Each relation can be associated with features, i.e. relations can be seen as components in their own right.
- Example (German): "bank" is defined as "breite Sitzgelegenheit" (something broad to sit on); "sitzgelegenheit" as "etwas um zu sitzen" (something to sit on); "schmal" as "gegenteil zu breit" (contrary to broad).

7 What else: Inheritance
- The entries b’ang, boeb’ang and goeb’ang each consist of common attributes plus particular attributes.
- Just one example; the aim is to reduce typing.
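
A minimal sketch of such inheritance, assuming entries are plain dictionaries: the common attributes are typed once and every related entry adds only its particular ones. The attribute names and values are placeholders, not data from the lexicon.

```python
# Hypothetical common attributes shared by the three related entries.
COMMON = {"gram cat": "placeholder category", "semantic domain": "placeholder domain"}


def make_entry(headword, common, **particular):
    """Build an entry from shared attributes plus entry-specific ones.

    Particular attributes override common ones with the same name.
    """
    entry = dict(common)
    entry.update(particular)
    entry["headword"] = headword
    return entry


entries = [
    make_entry("b’ang", COMMON, gloss="placeholder gloss 1"),
    make_entry("boeb’ang", COMMON, gloss="placeholder gloss 2"),
    make_entry("goeb’ang", COMMON, gloss="placeholder gloss 3"),
]

for e in entries:
    print(e["headword"], e["gram cat"], e["gloss"])
```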

8 What else: Conditions (operations)
- General form: if value(X) then modify constraints(Y), etc.
- Just one example from DOBES (there are probably better examples around):
  - if lexemtype = “stem | idiom | lexical word”
  - if lexemtype = “auxil | inflect affix”
- Diagram labels: head, outer-body-L, lexemtype; meaning, sense nr; meaning effect, categorial effect, sense nr.
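
One way to read "if value(X) then modify constraints(Y)" is as a rule table keyed on the value of `lexemtype`. The sketch below assumes that the first condition permits the attributes "meaning" and "sense nr" and the second permits "meaning effect", "categorial effect" and "sense nr"; that pairing is inferred from the order of the diagram labels and may not match the original figure.

```python
# Each rule: (set of lexemtype values, attributes allowed in the entry body).
# The pairing of conditions and attributes is an assumption.
RULES = [
    ({"stem", "idiom", "lexical word"}, ["meaning", "sense nr"]),
    ({"auxil", "inflect affix"}, ["meaning effect", "categorial effect", "sense nr"]),
]


def allowed_attributes(lexemtype):
    """Return the attributes the matching rule allows for this lexemtype."""
    for values, attributes in RULES:
        if lexemtype in values:
            return attributes
    raise ValueError(f"no rule for lexemtype {lexemtype!r}")


def violations(entry):
    """Return body attribute names that the rule for the entry's lexemtype forbids."""
    allowed = set(allowed_attributes(entry["lexemtype"]))
    return sorted(name for name in entry["body"] if name not in allowed)


entry = {"lexemtype": "stem", "body": {"meaning": "...", "categorial effect": "..."}}
print(violations(entry))
```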

9 ISO TC37/SC4: the solution?
- ISO TC37/SC4 is about standardization in language resource (LR) management.
- Central is the data category registry (DCR):
  - basically a flat list of linguistic concepts;
  - will contain is_a relations that are part of the concept definition, e.g. “transitive_verb” is_a “verb”;
  - with proper definitions and conceptual space (value range).
- Request for filling the DCR (metadata, morphology, syntax, ...).
- Looking for abstract models (frameworks) for lexica, for annotation structures, for semantic annotations, for syntactic annotations, ...
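
A flat list of concepts with is_a relations can be sketched as a simple parent map. Only the pair "transitive_verb" is_a "verb" comes from the slide; the second entry and the function names are hypothetical, and the sketch assumes the relations contain no cycles.

```python
# Flat registry: concept -> broader concept.
IS_A = {
    "transitive_verb": "verb",
    "intransitive_verb": "verb",  # hypothetical extra entry for illustration
}


def ancestors(concept, is_a=IS_A):
    """Follow is_a links upward and return all broader concepts, nearest first."""
    chain = []
    while concept in is_a:
        concept = is_a[concept]
        chain.append(concept)
    return chain


def subsumed_by(concept, broader, is_a=IS_A):
    """True if concept equals broader or has it among its ancestors."""
    return concept == broader or broader in ancestors(concept, is_a)


print(ancestors("transitive_verb"))
print(subsumed_by("transitive_verb", "verb"))
```

With such a map, a search for "verb" across lexica could also match entries tagged with the narrower category.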

10 Underlying Model
- Concept level: a data element concept has a conceptual domain.
- Implementation level: a data element has a value domain.
- Complex datcat /Gender/; its conceptual domain is a set of simple datcats: /masculine/, /feminine/, /neuter/.
- Data element: implemented as an XML attribute named ‘gen’ (XML schema declaration, XML object); its value domain is the list of values m, f, n.
- The Dutch system is different.
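
The split between concept and implementation can be sketched as two small classes. The names `ComplexDatCat` and `DataElement` and the consistency check are assumptions for illustration; only /Gender/, its three simple datcats, the attribute name 'gen' and the values m, f, n come from the slide.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ComplexDatCat:
    """Concept level: a data category and the simple datcats it may take."""
    name: str
    conceptual_domain: frozenset


@dataclass
class DataElement:
    """Implementation level: how one lexicon encodes a complex datcat."""
    datcat: ComplexDatCat
    xml_attribute: str
    value_domain: Dict[str, str]  # concrete value -> simple datcat

    def __post_init__(self):
        # Every concrete value must map into the conceptual domain.
        unknown = set(self.value_domain.values()) - self.datcat.conceptual_domain
        if unknown:
            raise ValueError(f"values outside conceptual domain: {sorted(unknown)}")

    def interpret(self, value):
        """Translate a concrete value such as 'f' into its simple datcat."""
        return self.value_domain[value]


GENDER = ComplexDatCat("/Gender/",
                       frozenset({"/masculine/", "/feminine/", "/neuter/"}))
gen = DataElement(GENDER, "gen",
                  {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"})

print(gen.xml_attribute, gen.interpret("f"))
```

Two lexica that encode gender with different concrete values could then be compared at the level of the shared simple datcats.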

11 Lexical Markup Framework
- General model: metamodel, data category selection (DCS), lexical model.

12 Core Model
- The metamodel is made of lexical layers; lexical layers are made of lexical components (or components).
- Diagram (UML classes): Lexical DB, Global Info, Lexical Entry, Form, Sense, linked with cardinalities 1..1 and 0..n.
- The basis for modeling purposes is UML; there will be an XML-schema based instantiation.
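
The core classes can be sketched as follows. The diagram's cardinalities did not survive extraction unambiguously, so this sketch assumes one Global Info per Lexical DB, any number of Lexical Entries per DB, exactly one Form per entry and any number of Senses per entry; the `data` dictionaries holding data category values are also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sense:
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class LexicalEntry:
    form: Form                                          # assumed 1..1
    senses: List[Sense] = field(default_factory=list)   # assumed 0..n


@dataclass
class GlobalInfo:
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class LexicalDB:
    global_info: GlobalInfo                                     # assumed 1..1
    entries: List[LexicalEntry] = field(default_factory=list)   # assumed 0..n


db = LexicalDB(GlobalInfo({"name": "demo lexicon"}))
db.entries.append(
    LexicalEntry(Form({"orthography": "bank"}),
                 [Sense({"gloss": "something broad to sit on"})])
)
print(len(db.entries), db.entries[0].form.data["orthography"])
```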

13 Extended Model
- Diagram (UML classes): Lexical DB, Global Info, Lexical Entry, Form, Sense, Morphology, Inflexion, Paradigm, linked with cardinalities 1..1, 0..n, 1..n and 0..1.
- Data categories shown: /lemma/, /POS/, /gender/, /key form/, /orthography/, /gender/, /number/, /tense/, /person/, /mood/, /orthography/, /variant for/, /identifier/.

14 Proposed Extensions
- Diagram (UML classes): Lexical Entry, Form, Sense, Syntactic family, Semantic formula, Semantic argument, Construct set, Syntactic construct, Syntactic position, Semantic frame, linked with cardinalities 1..1 and 0..n.
- Discussions are still ongoing.

15 What will LMF be?
- Descriptions of the general model (metamodel + DCS).
- DCs have to be ISO 11179/12620/... compliant.
- Core model including component building, relations, conditions, inheritance.
- Extension mechanism.
- Proposed but not normative extensions (morphology, syntax, ...).
- XML-schema based instantiation.
- Currently version 5 of the Draft Proposal: ISO/TC 37/SC 4 N130 Rev.5, date 2005-03-19, working draft of ISO WD 24613:2005.
- Web site: http://www.tc37sc4.org/

16 Goal of LEXUS
- To provide a framework capable of handling diverse lexicon structures and formats.
- LEXUS is based on the Lexical Markup Framework (LMF) within ISO TC37/SC4, which defines a blueprint for such a flexible framework.
- LEXUS is the first test and reference implementation of LMF.
- Increase interoperability by offering well-accepted data categories (ISO, GOLD, Shoebox MDF).
Workshop ‘Lexical Databases and digital tools’, Nijmegen, April 2004

17 Current Status
- Supports the full LMF core model.
- Allows flexible creation of structures and content.
- Supports use of well-accepted Data Category Registries (ISO 12620, Shoebox MDF).
- Allows dynamic editing of structures and content.
- Supports use of multimedia content.
- Import of existing lexica (Shoebox, CHAT).
- Export (Shoebox, LMF XML).
- Customizable layout.

18 Current Status
- User authentication.
- Personal workspace for creating and editing lexica.
- Merging facilities.
- Simple and advanced search.

19 Current Status (Technical)
- Implemented in Java, using open source components.
- Uses Spring to ‘wire’ the application: a modular approach avoiding ‘hard’ links.
- Uses Hibernate as the persistence framework; allows use of multiple databases (Postgres, MySQL, ...).
- Uses Tomcat as the servlet container.

20 Logging on to the application: Users must authenticate before logging on to the application.

21 User workspace: Each user has his/her own personal workspace where private lexica are stored.

22 Lexicon creation: New lexica may be created.

23 Lexicon import: New lexica may be imported from a lexical resource.

24 Lexicon structure: The LMF core model can be identified in this simple structure. Components and data categories are distinguished by different icons. All may be dynamically created or modified.

25 Lexicon structure: Representation of a more complex structure. When a node in the tree is selected, the content of the component or data category is shown and may be modified.

26 Data category selection: Data categories can easily be selected from data category registries.

27 Lexical entry overview: Overview of lexical entries. Selecting a lexical entry reveals its details.

28 Lexical entry details: Details of a lexical entry. Modifications to the entry structure are bound by the schema definition, e.g. cardinality.
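
Binding entry modifications to schema cardinality can be sketched as a check run before a change is accepted. This is an illustration only, not the LEXUS implementation: the dictionary-based schema, the function names and the example cardinalities are assumptions; the notation 1..1 and 0..n follows the model slides.

```python
def parse_cardinality(spec):
    """Turn '0..n' into (0, None) and '1..1' into (1, 1); None means unbounded."""
    low, high = spec.split("..")
    return int(low), (None if high == "n" else int(high))


def check(schema, entry):
    """Return one message per component whose instance count breaks the schema.

    schema: component name -> cardinality string
    entry:  component name -> list of instances
    """
    errors = []
    for name, spec in schema.items():
        low, high = parse_cardinality(spec)
        count = len(entry.get(name, []))
        if count < low or (high is not None and count > high):
            errors.append(f"{name}: found {count}, expected {spec}")
    return errors


schema = {"Form": "1..1", "Sense": "0..n"}   # assumed cardinalities
entry = {"Form": [], "Sense": [{}, {}]}      # the user deleted the only Form
print(check(schema, entry))
```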

29 Lexical entry details: Attribute values can be easily modified. Various value types are supported (text, video, audio, image or file).

30 Lexical entry details: Example of uploading a video file.

31 Lexical entry details: Viewing multimedia content.

32 Alternative entry view: Alternative views are provided, which may be customized in look and feel.

33 Synchronization of lexica (Personal Workspace, Main Lexicon): Lexica may be copied to and modified in the personal workspace.

34 Synchronization of lexica (Personal Workspace, Main Lexicon): Lexica may be synchronized with the main lexicon.

35 Synchronization of lexica: When synchronizing lexica, the user is notified of structural changes and is in total control of the synchronization process.
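
A user-controlled synchronization step could look roughly like the sketch below. It assumes that a lexicon structure can be summarized as a set of attribute paths and that the user's decision arrives through a `confirm` callback; both are simplifications for illustration and say nothing about how LEXUS does it internally.

```python
def structural_changes(main_schema, workspace_schema):
    """Compare two schemas given as collections of attribute paths."""
    main, work = set(main_schema), set(workspace_schema)
    return {"added": sorted(work - main), "removed": sorted(main - work)}


def synchronize(main_schema, workspace_schema, confirm):
    """Apply to the main schema only those changes the user confirms.

    confirm(kind, path) is called once per change and returns True or False.
    """
    changes = structural_changes(main_schema, workspace_schema)
    result = set(main_schema)
    for path in changes["added"]:
        if confirm("added", path):
            result.add(path)
    for path in changes["removed"]:
        if confirm("removed", path):
            result.discard(path)
    return sorted(result)


main = ["/entry/ortho", "/entry/gloss"]
workspace = ["/entry/ortho", "/entry/sense"]

print(structural_changes(main, workspace))
# Accept additions but reject removals.
print(synchronize(main, workspace, lambda kind, path: kind == "added"))
```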

36 Future directions
- Support for various types of relations.
- Import of data from other sources.
- Support for other Data Category Registries, e.g. GOLD.
- Integration with the MPI archive.
- Integration with exploitation tools (ELAN, ANNEX).
- Miscellaneous user requests.

38 Example lexical structure
- Example lexical structure used in the TEOP project within DOBES.
- Diagram labels: Stem orthography, Sense*, Lexical subentry, Sense nr, sense, Gram cat, Gram subcat, Engl. Transl., Example*, orthography, Engl. Transl., [T/pr], nr.
- The * sign stands for 1:n relations of sub-structures.

