Relish Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization Marc Kemps-Snijders Max Planck.

Relish: Rendering Endangered Languages Lexicons Interoperable through Standards Harmonization
Marc Kemps-Snijders (Marc.kemps-snijders@mpi.nl), Max Planck Institute for Psycholinguistics
SaLTMIL Workshop: Speech and Language Technology for Minority Languages, May 23rd 2010, LREC, Malta

Increase interoperability between endangered language lexica created on both sides of the Atlantic

Background
Lexica constitute an important record of endangered languages.
Diverging European and American standards for data formatting and markup: LIFT/LLIFT vs. LMF; GOLD vs. ISOcat.
Significant effort in tool support by all parties.
Structural differences; differences in terms and abbreviations; differences in interchange formats.

European and American Projects and Standards
[Diagram labels. Projects and organizations: MPI, ILIT, Dobes, Intera, DAM-LR, ECHO, CLARIN, LEGO, EMELD, SIL, UF; Data Driven Ontology; GOLD Community; Lexicons of endangered languages. Standards for Terminology: DCR (ISO IS 12620:2009), GOLD. Standards for Lexicons: LMF (ISO FDIS 24613:2008), LIFT.]

Methodology: Bottom-up approach
Analyze existing lexica to identify commonalities and differences in lexical structure and content.
Lexica: Tofa, Udi, Archi, Iwaidja, Mocovi, Salar, Kayardild.

LLIFT example
[XML markup lost in extraction. Surviving comment fragments: "... we don't want that. ... has no appropriate attributes we could use, either."; "... we're trying to map individuals to their different possible combinations of dialects, though."; "... as variant can't have ... under it." Surviving element text: cow; dabere; dabbere.]

Shoebox example
\_sh v3.0 400 Iwaidja
\_DateStampHasFourDigitYear
\lx a
\lc Lexical citation ((R) => root)
\ps Part of speech
\de Definition
\ge Gloss-English
\re Reversal
\xv Example vernacular
\xe Example English
\rf Reference for example
\dt 11/Jul/2007
\lx a-
\lc a-
\a a-
\ps v. prefix
\de third person plural intransitive subject prefix
\ge 3pl
\re they
\ng This is the neutral form; the 'towards' form is |fv{ayuwu-}, 'away' form is |fv{ijb-} ~ |fv{ijuwu-}
\sd verb prefix
\sd inflectional prefix
\rf PL93
\xv Amalkban.
\xe They move outside.
\dt 15/Jul/2007
\lx a-
\lc a-
\a a-
\ps n. pref.
\de their (with possessed body parts)
\ge 3pl
\re their (with possessed body parts)
\sd noun prefix
\sd inflectional prefix
\dt 29/Nov/2006

Lexus example
<lexicalEntry>
<headword_x0020_group>
11/Jul/2007
<headword>a</headword>
Lexical citation ((R) => root)
<part_x0020_of_x0020_speech_x0020_group>
<part_x0020_of_x0020_speech/>
<sense_x0020_number_x0020_group>
<contextualized_x0020_example_x0020_group><example_x0020__x0028_free_x0020_translation_x0029_/><contextualized_x0020_example/></contextualized_x0020_example_x0020_group>
<definition_x0020_group><English_x0020_reversal/><English_x0020_gloss/><definition/></definition_x0020_group>
<reference_x0020_group><reference/></reference_x0020_group>
</sense_x0020_number_x0020_group>
</part_x0020_of_x0020_speech_x0020_group>
</headword_x0020_group>
</lexicalEntry>
<lexicalEntry>
<headword_x0020_group>
12/Jul/2007
<headword>^(d)angkarranaka</headword>
<citation_x0020_form>angkarranaka</citation_x0020_form>
<part_x0020_of_x0020_speech_x0020_group>
<part_x0020_of_x0020_speech>?</part_x0020_of_x0020_speech>
<sense_x0020_number_x0020_group>
<reference_x0020_group><reference>IwNo05:19Ap</reference></reference_x0020_group>
<contextualized_x0020_example_x0020_group>
[slide text truncated here]
<_x0032_D_x0020_group> The d-initial form is found after prefixes ending in K- ; elsewhere the root begins with |fv{a}. The citation form is |fv{dangkarranaka}. </_x0032_D_x0020_group>
[the second entry is cut off on the slide; its closing tags are not shown]
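The Shoebox example above stores each lexical entry as a run of backslash-coded fields, with \lx opening a new record. A minimal sketch of how such text could be split into records is given below; it is an illustration only, not the parser used by Shoebox, Lexus or the Relish project. The function name `parse_sfm` and the shortened sample are assumptions made for this example.

```python
import re
from typing import List, Tuple

# A field is a backslash marker followed by an optional value, e.g. "\ge 3pl".
FIELD = re.compile(r"^\\(\S+)\s*(.*)$")

Record = List[Tuple[str, str]]


def parse_sfm(text: str, record_marker: str = "lx") -> List[Record]:
    """Split Standard Format text into records (hypothetical helper).

    Each record is an ordered list of (marker, value) pairs, so repeated
    markers such as \\sd are preserved in their original order.
    """
    records: List[Record] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD.match(line)
        if match is None:
            # Continuation line: append it to the previous field's value.
            if current:
                marker, value = current[-1]
                current[-1] = (marker, (value + " " + line).strip())
            continue
        marker, value = match.group(1), match.group(2).strip()
        if marker == record_marker:
            current = [(marker, value)]
            records.append(current)
        elif current is not None:
            current.append((marker, value))
        # Fields before the first record marker (the \_sh header) are skipped.
    return records


# Shortened sample taken from the two "a-" records on the slide.
SAMPLE = r"""\_sh v3.0 400 Iwaidja
\lx a-
\ps v. prefix
\ge 3pl
\sd verb prefix
\sd inflectional prefix
\dt 15/Jul/2007
\lx a-
\ps n. pref.
\ge 3pl
\dt 29/Nov/2006
"""

records = parse_sfm(SAMPLE)
for record in records:
    print(dict(record).get("ps"), len(record))
```

Keeping each record as an ordered list of pairs rather than a dictionary is a deliberate choice here: the format relies on field order and allows repeated markers, and both would be lost in a plain mapping.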

Methodology: Top-down approach
Analyze existing standards for lexical resources (GOLD/LIFT and LMF/DCR) to identify commonalities and differences at the conceptual level.
Harmonize concepts using the ISO 12620 Data Category Registry.
Harmonize model approaches.
Harmonize interchange formats.

Harmonizing 12620 data categories
All linguistic concepts will be registered in the ISO 12620 Data Category Registry (ISOcat).
Analysis of existing ISOcat data categories vs. GOLD vs. MDF.
[Diagram labels: ISOcat 12620 Data Category Registry; GOLD Community; MDF type file]

MDF type file
\+DatabaseType MDF 4.0
\ver 5.0
\desc Standard Format markers defined in _Making Dictionaries: A guide to lexicography and the Multi-Dictionary Formatter_. David F. Coward, Charles E. Grimes, and Mark R. Pedrotti. Waxhaw, NC: SIL, 1998. (2nd edition)
\+mkrset
\lngDefault English
\mkrRecord lx
\+mkr an
\nam Antonym
\desc Used to reference an antonym of the lexeme, but using the \lf (lexical function) field for this is better practice.
\lng vernacular
\mkrOverThis sn
\CharStyle
\-mkr
\+mkr bw
\nam Borrowed word (loan)
\desc Used for denoting the source language of a borrowed word.
\lng English
\mkrOverThis se
\CharStyle
\-mkr
\+mkr ce
\nam Cross-ref. gloss (E)
\desc Gives the English gloss(es) for the vernacular lexeme referenced by the preceding \cf field.
\lng English
\mkrOverThis cf
\CharStyle
\-mkr
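The MDF type file above defines each marker in a block that opens with \+mkr and closes with \-mkr. As a rough illustration of how those definitions could be read before comparing them with ISOcat or GOLD, the sketch below collects each block into a dictionary and derives the marker hierarchy from \mkrOverThis. The function name `parse_marker_defs` is hypothetical, and the \desc lines are left out of the sample for brevity.

```python
from typing import Dict, Optional


def parse_marker_defs(text: str) -> Dict[str, Dict[str, str]]:
    """Collect \\+mkr ... \\-mkr blocks into {marker: {property: value}}."""
    defs: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue
        key, _, value = line[1:].partition(" ")
        if key == "+mkr":
            # Start of a marker definition; the value is the marker itself.
            current = {}
            defs[value.strip()] = current
        elif key == "-mkr":
            current = None
        elif current is not None:
            current[key] = value.strip()
        # Lines outside a marker block (file header, \+mkrset) are ignored.
    return defs


# Sample shortened from the slide: three marker blocks without \desc.
TYPE_FILE = r"""\+DatabaseType MDF 4.0
\ver 5.0
\+mkrset
\lngDefault English
\mkrRecord lx
\+mkr an
\nam Antonym
\lng vernacular
\mkrOverThis sn
\CharStyle
\-mkr
\+mkr bw
\nam Borrowed word (loan)
\lng English
\mkrOverThis se
\CharStyle
\-mkr
\+mkr ce
\nam Cross-ref. gloss (E)
\lng English
\mkrOverThis cf
\CharStyle
\-mkr
"""

defs = parse_marker_defs(TYPE_FILE)
# Parent marker of each marker, as declared by \mkrOverThis.
parents = {m: d["mkrOverThis"] for m, d in defs.items() if "mkrOverThis" in d}
for marker, props in defs.items():
    print(marker, props["nam"], "under", parents.get(marker))
```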

Harmonizing 12620 data categories
Example: part of speech
[Diagram labels: PartOfSpeech, Determiner, article, Definite article, Indefinite article; "Is a..." relations; data category types Complex, Closed, Simple; sources: ISOcat MorphoSyntax Profile, GOLD ontology, MDF (Multi-Dictionary Formatter)]

MDF definition of the part-of-speech marker
\+mkr ps
\nam Part of speech
\desc Classifies the part of speech. This must reflect the part of speech of the vernacular lexeme (not the national or English gloss). Consistent labeling is important; use the Range Set feature. Sense numbers are beneath \ps in this hierarchy; don't mark different \ps fields with sense numbers.
\lng English
\rngset adj adv …… n num pn post prtcl v
\mkrOverThis se
\mkrFollowingThis va
\CharStyle
\-mkr

Harmonizing 12620 data categories
GOLD example 2
In some cases GOLD contains additional information: additional extensions to the conceptual domain, and isA relations between GOLD concepts.
[Figure: GOLD ontology]

Harmonizing 12620 data categories: Relation Registries
Relation Registries describe relations not handled through the ISO 12620 model.
Simple relations, e.g. MDF /PartOfSpeech/ 'equals' MorphoSyntax /PartOfSpeech/.
GOLD relations (the GOLD ontology is a Relation Registry).
Compositional relations (a DC is composed of multiple more granular DCs), e.g. UDI MDF \1d (First dual): person:firstPerson, grammaticalNumber: dual, value:…
Model-specific relations, e.g. the TBX model.
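The simple and compositional relations listed above can be pictured as a small lookup structure. The sketch below is only an illustration of that idea, not the ISOcat Relation Registry or its API: the class `RelationRegistry`, its method names and the identifier strings (for instance "MDF:/PartOfSpeech/") are assumptions made for this example. The elided "value:…" part of the \1d example is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Relation:
    source: str
    kind: str
    target: str


class RelationRegistry:
    """Toy registry holding 'equals' and compositional relations."""

    def __init__(self) -> None:
        self._relations: List[Relation] = []
        self._compositions: Dict[str, Dict[str, str]] = {}

    def add_equals(self, a: str, b: str) -> None:
        self._relations.append(Relation(a, "equals", b))

    def add_composition(self, category: str, parts: Dict[str, str]) -> None:
        # A coarse category is described by several more granular ones.
        self._compositions[category] = dict(parts)

    def equivalents(self, category: str) -> Set[str]:
        """Follow 'equals' links in both directions, transitively."""
        seen = {category}
        stack = [category]
        while stack:
            node = stack.pop()
            for rel in self._relations:
                if rel.kind != "equals":
                    continue
                if rel.source == node:
                    other = rel.target
                elif rel.target == node:
                    other = rel.source
                else:
                    continue
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        seen.discard(category)
        return seen

    def decompose(self, category: str) -> Dict[str, str]:
        return dict(self._compositions.get(category, {}))


registry = RelationRegistry()
# Simple relation from the slide; the identifier syntax is hypothetical.
registry.add_equals("MDF:/PartOfSpeech/", "MorphoSyntax:/PartOfSpeech/")
# Compositional relation for the Udi marker \1d (first dual).
registry.add_composition(
    r"MDF:\1d",
    {"person": "firstPerson", "grammaticalNumber": "dual"},
)

print(registry.equivalents("MorphoSyntax:/PartOfSpeech/"))
print(registry.decompose(r"MDF:\1d"))
```

Treating 'equals' as symmetric and transitive is a simplifying assumption of this sketch; a real registry may distinguish several relation types with different semantics.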

Harmonizing 12620 data categories: Relation Registries
[Diagram labels: Relation registries; Data Category registries; resource registries]

Harmonizing interchange formats
Possibility to use TEI? Can TEI serve as an interchange format for LMF and be accepted by the CLARIN community? A decision needs to be made before the end of 2010 to be useful for RELISH.
ODD (One Document Does it all): documentation; schema information; schema documents validate the XML data structure.
In August a workshop is being organized to discuss the possibility of using TEI as an interchange format, with representatives from ISO, CLARIN, TEI and the endangered languages community.

Adapting the tools
The Relish project will result in tool adaptations that support the interoperability aspects and interchange formats.

Conclusions and remarks
Minority and less-resourced languages and tools: are starting to actively participate in the standards discussions; are becoming part of the e-infrastructure landscape; have the opportunity to play a mature role in the area of language resources.
We need organizations and individuals who are actively involved and represent the position of less-resourced languages in these discussions.
Results from the Relish project may be useful for other less-resourced language resources as well.

Thank you for your attention Relish was made possible through the DFG/NEH Bilateral Digital Humanities Program
