Language Identification and IT Peter Constable and Gary Simons SIL International


1 Language Identification and IT
Peter Constable and Gary Simons, SIL International

2 Language identification (17th International Unicode Conference, San Jose, CA, September 2000)
The use of identificational codes for tagging information objects to indicate the language in which the information is expressed

3 Language identification
- Not considering automated language detection
- Considering only language identifiers, not identifiers for paralinguistic notions such as writing system or locale

4 About the Ethnologue
SIL Ethnologue
- catalogue of all modern languages in the world
- lists over 6,800 living languages
- result of decades of research
- system of three-letter codes

5 About the Ethnologue

6 About the Ethnologue

7 About the Ethnologue
Existing user base for Ethnologue codes:
- SIL
- UNESCO
- Linguistic Data Consortium (850+ agencies)
- The Linguist List (12,500 individual linguists)
- The Endangered Language Fund
- others

8 Linguistic diversity
Number of languages:
- Africa: 2062
- Americas: 1020
- Europe: 237
- Asia: 2202
- Pacific: 1312

9 Motivation for this paper
Languages covered by standards
- ISO 639-x covers approx. 400 languages
- existing needs to go much further: over 6,800 languages
- immediate need among linguists and other researchers for use in XML

10 Five issues
- Change
- Categorization
- Inadequate definition
- Scale
- Documentation

11 The need for language identifiers
Language-specific processing
- spell-checking
- sorting
- morphological parsing
- speech recognition/synthesis
- language-specific typographic behaviour
- etc.

12 The need for language identifiers
Language-specific processing
- choosing appropriate resources
Example: Los eventos deportivos pra la juventud

13 The need for language identifiers
Two distinct issues:
- identify the language
- apply the specific processing for that language
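The two issues can be kept separate in code: a tag carried by the information object answers the first, and a lookup keyed by that tag answers the second. The following is a minimal sketch only. The `WORD_LISTS` table, the `misspelled` function, and the toy word lists are hypothetical illustrations and not part of the presentation; the Spanish sentence is the example from slide 12.

```python
# Hypothetical sketch: identifying the language (the tag) is kept separate
# from applying language-specific processing (the resource chosen by the tag).

# Toy word lists standing in for real spell-checking resources (assumption).
WORD_LISTS = {
    "es": {"los", "eventos", "deportivos", "para", "la", "juventud"},
    "en": {"the", "sporting", "events", "for", "youth"},
}


def misspelled(text: str, lang: str) -> list:
    """Return the words of `text` that are not in the word list for `lang`."""
    try:
        words = WORD_LISTS[lang]
    except KeyError:
        raise LookupError(f"no spell-checking resource for language {lang!r}")
    return [w for w in text.lower().split() if w not in words]


# Issue 1: the object carries its language tag.
# Issue 2: the tag selects the resource used to process the text.
tagged = {"lang": "es", "text": "Los eventos deportivos pra la juventud"}
print(misspelled(tagged["text"], tagged["lang"]))
```

With the tag `es`, only the word absent from the Spanish list is reported. A tag with no matching resource raises an error instead of silently applying the wrong language's rules.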

14 The need for language identifiers
Language detection
- identify language by inspection of data itself
- available only for a few languages
- not practical for searching large corpora (e.g. the Internet)
- doesn't work on short text segments
Example: She said, "chat."
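The short-segment problem can be illustrated with a toy detector that checks words against per-language word lists. This is a sketch under stated assumptions: the `LEXICONS` table and the `candidate_languages` function are hypothetical, and real detectors use much larger statistical models than word lists.

```python
# Hypothetical sketch: a word-list "detector" cannot decide the language
# of a short segment whose words are valid in more than one language.

# Toy lexicons (assumption); "chat" is a word in both English and French.
LEXICONS = {
    "en": {"she", "said", "chat", "the", "cat"},
    "fr": {"elle", "a", "dit", "chat", "le"},
}


def candidate_languages(segment: str) -> set:
    """Return every language whose lexicon contains all words of `segment`."""
    words = [w.strip('.,"').lower() for w in segment.split()]
    return {
        lang
        for lang, lexicon in LEXICONS.items()
        if all(w in lexicon for w in words)
    }


print(candidate_languages("chat"))      # both "en" and "fr" remain candidates
print(candidate_languages("She said"))  # only "en" remains
```

The one-word segment leaves two candidates, which is why the quoted word in the slide's example cannot be classified from the data alone and needs an explicit tag.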

15 The need for language identifiers
Language-specific processing
- in general, must tag information objects to indicate language
- identifiers are needed to distinguish every language

16 Issue #1: change
Languages are constantly changing
Implications:
- systems of language tags cannot be static
- the speech variety (varieties) denoted by a tag is time-bound
English c A.D. English c A.D.

17 Issue #2: categorization
Typical question: Are Serbian and Croatian the same language, or different languages?
Operational definitions of language
- many different ways to formulate a definition
- different definitions create different categorizations
- different categorizations serve different purposes

18 Issue #3: inadequate definition
Existing systems do not consistently employ a single operational definition
- ISO 639-2: codes for languages and for groups of languages
  nav = Navajo
  ath = Athapascan languages
- ISO 639-2: some languages are groups of languages
  que = Quechua (47 distinct languages)

19 Issue #3: inadequate definition
Consistent use of a single definition in a given namespace is beneficial
Requiring a single definition imposes too much constraint on users
- users may legitimately have different requirements
- but no control results in confusion, especially when thousands of identifiers are added

20 Issue #4: Scale
The number of languages exceeds the coverage of existing systems by an order of magnitude (400 vs. 6,800)
Existing systems do not scale well

21 Issue #4: Scale
ISO 639-x
- slow process unable to cope with large volume of requests
- minimal attestation (50 documents) not appropriate for lesser-known languages
- mnemonic codes (impossible for thousands of languages)
- confusion due to inconsistent definition

22 Issue #4: Scale
RFC 1766
- process unable to cope with large volume of requests
- confusion due to inconsistent definition
- unclear how to create tags

23 Issue #5: documentation
Existing systems: can't tell what codes denote
- ISO 639-x: language, or group of languages?
  ara, Arabic: Standard only? all variants?
- ISO 639-x: which of several alternate possibilities?
  bin, Bini
  = dial. of Yoruba (Nigeria; 20,000,000)
  = dial. of Anyin (Côte d'Ivoire; 810,000)
  = alt. name for Edo (Nigeria; 1,000,000)
  = alt. name for Pini (Australia; dying)

24 Issue #5: documentation
- ISO 639-x: 2- vs. 3-letter codes
  st, Sesotho
  = nso, Sotho, Northern?
  = sot, Sotho, Southern?
  = both?
  to, Tonga
  = tog, Tonga (Nyasa)?
  = ton, Tonga (Tonga Islands)?
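The documentation gap can be restated as a lookup problem: without a documented mapping, a program given a 2-letter code cannot choose among the 3-letter candidates. A minimal sketch follows. The `CANDIDATES` table records only the open questions as posed on the slide, not an authoritative mapping, and the `candidates` and `resolve` functions are hypothetical.

```python
# Hypothetical sketch: candidate 3-letter codes for a 2-letter code,
# recorded exactly as the open questions posed on the slide.
CANDIDATES = {
    "st": ["nso", "sot"],  # Sotho, Northern? / Sotho, Southern?
    "to": ["tog", "ton"],  # Tonga (Nyasa)? / Tonga (Tonga Islands)?
}


def candidates(two_letter: str) -> list:
    """Return the 3-letter codes that a 2-letter code might correspond to."""
    return list(CANDIDATES.get(two_letter, []))


def resolve(two_letter: str) -> str:
    """Return the single matching 3-letter code, or fail if it is unclear."""
    options = candidates(two_letter)
    if len(options) != 1:
        raise ValueError(
            f"cannot resolve {two_letter!r}: candidates are {options}"
        )
    return options[0]


print(candidates("st"))
```

Calling `resolve("st")` raises an error in this sketch, which is the point of the slide: the choice has to come from documentation, because it cannot be derived from the codes themselves.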

25 Solving these problems
Requirements of an adequate system:
- able to scale
- able to deal with change, track history of change
- use a single operational definition for a given namespace
- apply definition consistently within a namespace
- complete, maintained, online documentation

26 What the Ethnologue offers
Scale: already there
- enumeration of languages
- set of three-letter codes
Change: careful management
- no re-use of codes
- have begun recording revision history
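The two change-management rules (no re-use of codes, recorded revision history) can be sketched as a small registry. This is an illustration under assumptions, not a description of how the Ethnologue is actually maintained: the `CodeRegistry` class, its method names, and the placeholder code `abc` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRegistry:
    """Hypothetical registry: codes are never re-used, every change is logged."""

    active: dict = field(default_factory=dict)   # code -> language name
    retired: set = field(default_factory=set)    # codes withdrawn from use
    history: list = field(default_factory=list)  # (action, code, detail)

    def assign(self, code: str, name: str) -> None:
        """Assign a code that has never been used before."""
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append(("assign", code, name))

    def retire(self, code: str, reason: str) -> None:
        """Withdraw an active code; it stays reserved forever."""
        del self.active[code]  # raises KeyError if the code is not active
        self.retired.add(code)
        self.history.append(("retire", code, reason))


registry = CodeRegistry()
registry.assign("abc", "Example language")   # placeholder code and name
registry.retire("abc", "merged with another entry")
print(registry.history)
```

After retirement, a second `registry.assign("abc", ...)` raises `ValueError`, so data tagged with the old code can never be silently reinterpreted.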

27 What the Ethnologue offers
Definition: single definition, applied quite consistently
- definition: primary criterion of mutual non-intelligibility as a basis for identifying candidates for separate literacy, literature
- all categories are of the same type; no language families, groups, writing systems

28 What the Ethnologue offers
Documentation
- extensive information maintained for every language
- new site will provide various reports
  - alternate names, location, population, etc.
  - related ISO codes, relationship
  - return Ethnologue data given an ISO code
- evaluating possibilities for returning results as XML

29 Integration with RFC 1766, XML
Ethnologue codes immediately available using x-
- private-use tags not ultimately satisfactory
Hopi:

30 Integration with RFC 1766, XML
Register thousands of new tags with IANA
- process would not be able to cope
- problems devising that many tags
- create considerable confusion in the single namespace

31 Integration with RFC 1766, XML
Register i-sil- to specify a namespace maintained by a particular agency
- deals with scale
- creates a namespace with a particular definition that is consistently applied
- avoids confusion of having a single namespace for all needs
- allow alternate namespaces
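A namespaced tag of this shape can be assembled mechanically from an agency prefix and a code. The sketch below assumes the RFC 1766 syntax, in which the primary tag and each subtag consist of 1 to 8 ASCII letters and tags are case-insensitive. The `namespaced_tag` function and the placeholder code `abc` are hypothetical and not taken from the presentation.

```python
import re

# RFC 1766 syntax: the primary tag and each subtag are 1 to 8 ASCII letters.
SUBTAG = re.compile(r"[A-Za-z]{1,8}")


def namespaced_tag(namespace: str, code: str, primary: str = "i") -> str:
    """Build `<primary>-<namespace>-<code>`, checking each part's syntax."""
    for part in (primary, namespace, code):
        if not SUBTAG.fullmatch(part):
            raise ValueError(f"{part!r} is not 1 to 8 ASCII letters")
    # Tags are case-insensitive, so normalise to lower case.
    return "-".join((primary, namespace, code)).lower()


print(namespaced_tag("sil", "ABC"))               # registered-namespace form
print(namespaced_tag("sil", "ABC", primary="x"))  # private-use form
```

The second call shows the private-use form discussed on slide 29; only the primary tag differs, so the same codes could move from `x-` to a registered prefix without being reassigned.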

32 Integration with RFC 1766, XML
Possible refinement: define primary tag n-
- first sub-tag identifies a registered namespace of identifiers
- each namespace provides its own operational definition(s)
- i- usage more consistent (languages only)
- i- specifies a privileged namespace (doesn't require n-)
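Under this refinement, a consumer would read the namespace from the first sub-tag after `n-`, while `i-` tags would need no namespace sub-tag. The following sketch shows one possible reading of the proposal. The `split_namespace` function, the choice to return `None` for tags without a namespace, and the placeholder tag `n-sil-abc` are assumptions, not something the presentation specifies.

```python
def split_namespace(tag: str) -> tuple:
    """Split a language tag into (namespace, identifier).

    n-<namespace>-<id>  -> (namespace, id)
    i-<id> or x-<id>    -> ("i", id) or ("x", id)
    anything else       -> (None, tag), i.e. no explicit namespace
    """
    normalised = tag.lower()  # tags are case-insensitive
    primary, *rest = normalised.split("-")
    if primary == "n":
        if len(rest) < 2:
            raise ValueError(f"{tag!r} lacks a namespace or an identifier")
        return rest[0], "-".join(rest[1:])
    if primary in ("i", "x"):
        if not rest:
            raise ValueError(f"{tag!r} lacks an identifier")
        return primary, "-".join(rest)
    return None, normalised


print(split_namespace("n-sil-abc"))  # placeholder namespaced tag
print(split_namespace("i-navajo"))   # privileged namespace, no n- needed
print(split_namespace("en-US"))      # ordinary tag, no explicit namespace
```

Each namespace returned by such a function would then be looked up in its own documentation, since each provides its own operational definition.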

33 Conclusions
- Language identifiers required for language-specific processing
- Immediate need for thousands of new language identifiers; in particular, for use in XML
- Five problem areas need to be considered in any system
- SIL Ethnologue codes address all five problems
- Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits

