EACL 2003, Budapest: April 12 – 17, 2003. Computational Linguistics for South Asian Languages: Expanding Synergies with Europe. CORPORA IN MINOR LANGUAGES.

1 EACL 2003, Budapest: April 12 – 17, 2003. Computational Linguistics for South Asian Languages: Expanding Synergies with Europe. CORPORA IN MINOR LANGUAGES OF INDIA: SOME ISSUES. Dr. B. Mallikarjun, Central Institute of Indian Languages, Mysore, INDIA

2
1. Current status of corpora – major Indian languages
2. Current status of corpora – minor Indian languages
3. Importance of minor languages corpora
4. Objectives
5. Categorization of minor languages for corpora building
6. Minor languages: A sample
7. Issues in corpora building
8. Corpus processing tools – a. Basic b. Advanced
9. Conclusion and a mission
EACL 2003, CLSAL: Budapest – April 12 – 17, 2003

3 India has 1652 mother tongues of 4 families. The Constitution of India, in its 8th Schedule, has recognized 18 languages spoken by 96.29% of the population.
Assamese: 2,622,836
Bengali: 3,535,863
Gujarati:
Hindi: 3,003,004
Kannada: 2,239,537
Kashmiri: 2,266,588
Konkani:
Malayalam: 2,349,526
Manipuri:
Marathi: 2,213,241
Nepali:
Oriya: 2,727,670
Punjabi: 1,966,260
Sanskrit:
Sindhi:
Tamil: 3,381,525
Telugu: 3,967,926
Urdu: 1,64,125

4
* Different quantum.
* Comparable quality.
* Quantum and coverage are inadequate for wider NLP activities.
* The corpora need to be augmented with wider coverage.
* Attempts at enhancement have some problems needing immediate solution.

5
* 1634 are minor languages, spoken by 3.71% of the population.
* The Indo-Aryan and Dravidian language families have both major and minor languages.
* Almost all the languages of the other two families, Munda and Tibeto-Burman, are "minor" languages.
* Text corpora building has not taken place in these languages.

6
* Minor languages hardly attract the attention of policy makers anywhere in the world.
* They are endangered in Indian social, educational and linguistic contexts.
* Linguists take great interest in studying the richness of languages and try to save endangered languages from extinction.

7
* Minor languages hardly attract, or become a source for, technological research.
* Technology has made it possible to empower all languages, whether they are major or minor.
* Creating corpora in minor languages, especially those that have little or no written literature, has certain critical advantages for linguistic computing.
* Experimentation with corpora designs and standards is more easily done in these languages because of the manageable quantum of data.

8
* Archival and cross-linguistic comparison within a language family and across language families.
* Utilize language technology for their preservation and continued use.
* Fine-tune language analysis where grammatical analysis is available.
* Use the machine-readable form of the texts to produce as precise an analysis of the language as possible wherever such analysis is not available.
* Also use some of the minor language corpora for machine translation purposes.
* Speech corpora, too, have more significance in minor languages, since most of them exist in spoken form and many are yet to be rendered into written form.
* Indigenous knowledge systems: most of the minor languages are resources of cultural heritage and a treasure house of indigenous knowledge systems. Once this knowledge is available in machine-readable form, it can be made available to the universal knowledge base by using UNL.

9 Minor languages can be classified into 3 groups on the basis of the issues to be tackled while building corpora.
First category: Languages other than the 18 major languages that have a good amount of literary and other texts and are also used in wider domains, such as Bodo, Kurukh, Maithili, Santhali, Tripuri, etc.
Second category: Languages with a limited quantity of written texts that are not widely used in different domains such as education, administration, etc., such as Kodava, Tulu, etc.
Third category: Languages available only in spoken form and yet to be rendered into written form, such as Toda, Kota, Yerava, etc.

10 These languages are representative of the ground linguistic reality in India.
Name | Lg. family | Text | Script | No. of speakers
Maithili | Indo-Aryan | Yes | Devanagari | 77,66,597
Kodava or Coorgi | Dravidian | Very little | Kannada | 97,011
Yerava | Dravidian | Indigenous Knowledge System | No script | 13,689

11
Issue | Major language | Minor language
Text | Period; Sampling – domains | Sampling; All available text / All transcribed speech (Maithili, Kodava and Yerava)
Technical: keyboard, input and storage | Standard software based on the grammar of the concerned script and UNICODE for Kannada: 1, 2, 3 | Incompatibility: adopted software does not accommodate all the features of Maithili, Kodava and Yerava

12 Frequency count of words and syllables: the facilities created for languages like Hindi and Kannada are available, and wherever necessary, language-specific modifications are made and used.
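
A frequency count of words and syllables can be sketched in Python as follows. This is a minimal illustration, not the CIIL facility itself: the function names are hypothetical, and a "syllable" is approximated here as an orthographic unit (a base letter plus its following combining marks, with consonants joined across a virama). That approximation is an assumption that suits Devanagari and Kannada script text; it is not a phonological analysis.

```python
from collections import Counter
import unicodedata

# Virama (halant) signs that join consonants into one cluster.
VIRAMAS = {"\u094D", "\u0CCD"}  # Devanagari, Kannada
# Punctuation stripped from word edges; includes danda and double danda.
PUNCTUATION = ".,;:!?\"'()\u0964\u0965"


def split_syllables(word):
    """Split a word into approximate orthographic syllables."""
    units = []
    for ch in word:
        # Vowel signs, anusvara, virama etc. are combining marks (Mn/Mc).
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        # A letter right after a virama continues the consonant cluster.
        joins_cluster = bool(units) and units[-1][-1] in VIRAMAS
        if units and (is_mark or joins_cluster):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def frequency_counts(text):
    """Return (word frequencies, syllable frequencies) for a text."""
    words = [w.strip(PUNCTUATION) for w in text.split()]
    words = [w for w in words if w]
    word_counts = Counter(words)
    syllable_counts = Counter(s for w in words for s in split_syllables(w))
    return word_counts, syllable_counts
```

Calling `frequency_counts(text)[1].most_common(1)` would give the most frequent syllable under this approximation. A language-specific modification of the kind the slide mentions would typically replace `split_syllables` or extend `VIRAMAS` for another script.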

13 Comparison of Maithili, Kodava and Yerava Corpora
[Table: statistical distribution (corpus size, word types, most frequent syllable, average word length %) for Maithili, Kodava and Yerava; most frequent syllable values shown: ka, ra, ru.]

14 [Table: statistical distribution (corpus size, word types, most frequent syllable, average word length %) for Maithili, Hindi (CIIL), Hindi (Naiduniya), Hindi (India Today) and Hindi (Premchand); most frequent syllable value shown: ka.]

15 [Table: statistical distribution (corpus size, word types, average word length %, average sentence length %, most frequent syllable) for Kodagu, Yerava, Kannada and Malayalam; the most frequent syllable row reads "r ar u r a".]
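
The quantities compared in these tables (corpus size, word types, average word length, average sentence length) can be computed along the following lines. This is a hedged sketch: the slides do not say how each figure was measured, so the definitions here are assumptions. Corpus size is counted in whitespace-separated tokens, word length in Unicode code points (so combining vowel signs count as characters), and sentences are split on ".", "!", "?" and the danda. The function name is hypothetical.

```python
import re


def corpus_statistics(text):
    """Compute simple corpus statistics under the assumptions stated above."""
    # Split into sentences on ., !, ? and the danda (U+0964), then into tokens.
    sentences = [s.split() for s in re.split(r"[.!?\u0964]+", text)]
    sentences = [s for s in sentences if s]
    tokens = [w for s in sentences for w in s]
    if not tokens:
        return {"corpus_size": 0, "word_types": 0,
                "avg_word_length": 0.0, "avg_sentence_length": 0.0}
    return {
        "corpus_size": len(tokens),            # running words (tokens)
        "word_types": len(set(tokens)),        # distinct word forms
        "avg_word_length": sum(len(w) for w in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
    }
```

Because minor-language corpora are small, such statistics are cheap to recompute whenever the segmentation conventions are revised.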

16
1. Key Word in Context
2. Search by required word
3. Sorting and indexing
The facilities created for languages like Hindi and Kannada are available, and wherever necessary, language-specific modifications can be made and used.
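
The three basic tools listed here can be sketched in a few lines of Python. This is an illustrative sketch only, with hypothetical function names; it assumes the corpus is already tokenized into a list of words, and it sorts by Unicode code point, which is not necessarily the traditional alphabetical order of a given script.

```python
from collections import defaultdict


def kwic(tokens, keyword, width=3):
    """Key Word in Context: (left context, keyword, right context) per hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def build_index(tokens):
    """Sorted index mapping each word to the positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    # Sorting is by code point; a script-specific collation is an
    # example of the language-specific modification mentioned above.
    return dict(sorted(index.items()))
```

Searching for a required word then amounts to a lookup such as `build_index(tokens).get(word, [])`, and `kwic(tokens, word)` shows each occurrence with its neighbours.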

17
1. Part-of-speech tagging
2. Morphological analyzer

18
1. The non-availability of a standard basic tag set is one of the major drawbacks.
2. Each institution or group of scholars uses its own notation: CLAWS, Research institution in IT, CIIL (major languages), CIIL (minor languages).
3. The tagging tools being developed, even for major languages, are at different stages of development.
4. The POS tagging tool developed for Hindi can be tried out in the first instance on Maithili to see its viability. Hindi, too, does not yet have a fully working POS tagging tool.
5. Due to limited data in Kodava and Yerava, manual tagging is preferred.

19 The morphological analyzers designed for the minor languages of India should be sensitive enough to take care of their specific features. They need:
1. A tagged lexicon
2. Rules to cover the processes of:
Inflection: suffixing is normally based on the word ending.
Derivation: both prefixing and suffixing are possible, depending on the lexical item.
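
The two components named here, a tagged lexicon and affix rules, can be combined in a small rule-based analyzer. The sketch below is an assumption-laden illustration, not the analyzer described in the talk: the lexicon and rules are hypothetical toy English entries chosen only to show the mechanism, and real analyzers for Kodava or Yerava would need language-specific entries and sandhi handling.

```python
# Hypothetical toy data: stem -> part-of-speech tag.
LEXICON = {"walk": "VERB", "book": "NOUN", "kind": "ADJ"}

# (affix, part of speech the stem must have, feature contributed).
SUFFIX_RULES = [("ed", "VERB", "PAST"), ("s", "NOUN", "PLURAL"), ("s", "VERB", "3SG")]
PREFIX_RULES = [("un", "ADJ", "NEG")]


def analyze(word):
    """Return all (stem, tag, features) analyses licensed by lexicon and rules."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, LEXICON[word], []))
    # Inflection: strip a suffix chosen by the word ending, check the stem.
    for suffix, pos, feature in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if LEXICON.get(stem) == pos:
                analyses.append((stem, pos, [feature]))
    # Derivation: strip a prefix, check the stem against the lexicon.
    for prefix, pos, feature in PREFIX_RULES:
        if word.startswith(prefix):
            stem = word[len(prefix):]
            if LEXICON.get(stem) == pos:
                analyses.append((stem, pos, [feature]))
    return analyses
```

Returning a list rather than a single analysis keeps ambiguous words ambiguous, so that a later step (manual or automatic) can choose among the candidates.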

20 The Yerava word ‘-ati’ has three meanings, ‘to sweep’, ‘wind blow’ and ‘bottom’, and the appropriate meaning has to be chosen depending on the context. In such cases the morphological analyzer demands a semantic tool. The Kodava word bappe means ‘I am coming’, but when it is used in the context of leave-taking it means ‘I am leaving’. Cultural nuances in the context of leave-taking do not allow one to use the word poope ‘going or leaving’, because it would only mean that the person is saying the ultimate good-bye to this world. It is possible to judge the meaning of such words only with knowledge of the culture represented by a language.
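
One simple form such a semantic tool could take is to score each sense by how many of its cue words occur in the surrounding context, in the spirit of simplified Lesk-style disambiguation. The sketch below uses the three glosses given for the Yerava word; the cue words are hypothetical English placeholders invented for illustration, not Yerava data, and the function name is an assumption.

```python
# Senses taken from the slide; cue words are hypothetical placeholders.
SENSE_CUES = {
    "to sweep": {"broom", "floor", "dust"},
    "wind blow": {"wind", "storm", "air"},
    "bottom": {"pot", "under", "base"},
}


def pick_sense(context_words, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the context.

    Returns None when no cue word appears, signalling that the
    decision must be made manually.
    """
    context = set(context_words)
    scores = {sense: len(cues & context) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A cue-word approach cannot capture cases like bappe, where the reading depends on the social situation rather than on neighbouring words; those still call for cultural knowledge supplied by a human annotator.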

21 Ambiguities are seen in three senses: word sense, pronoun sense and structural sense. Word sense ambiguities arise from words having multiple meanings and are found in all languages. With regard to the second, pronominal and adjectival anaphora are also ambiguities. For English, disambiguation tools have been developed. Since the inception of lexical databases such as WordNet, EuroWordNet, etc., researchers seem to have overcome the ambiguity problem to a certain extent. In the case of Indian languages, however, in the absence of such a sensitive tool, one has to work manually in order to disambiguate, even in the case of major languages. Minor languages need better linguistic analysis to arrive at tangible and usable disambiguation procedures.

22
* India abounds in endangered languages. Technology can actually help maintain a language.
* Technology should immediately take into account the concerns of minority languages. In particular, major language technologies of the region should accommodate the needs of the minor languages too.
* Corpora building in minor languages poses new challenges and calls for novel ways to accommodate and adequately describe the distinctive features of these languages.
* Comparison of corpora studies within a family of languages, across families of languages and at the international level will be helpful in bringing out a standard module for developing corpora.

23 Thank You

24 Kannada Code Chart

28
Demography, Astrology, Criminology, Physical Education / Sports, Health and Family Welfare, Forestry, Sexology, Culture & Anthropology
Commerce: Banking, Accountancy, Industry & Handicrafts, Finance, Textile Technology
Official and Media Languages: Mass Media, Legislative, Administrative
Translated Material: Literature, Scientific, Legal, Administration, Translated, Psychology
Aesthetics: Literature, Novel, Short Story, Essays, Criticism, Humour, Children's Literature, Biographies & Autobiographies, Travelogues, Letters/Diaries/Speeches, Plays, Science Fiction, Folk Tales, Text Books (School), Social Sciences, Fine Arts, Music, Dance/Impersonations, Drawing, Sculpture, Musical Instruments, Hobbies
Natural, Physical and Professional Sciences: Botany, Zoology, Geology, Geography, Bio Chemistry, Micro Biology, Physics, Chemistry, Mathematics, Statistics, Computer Sciences, Astronomy, Text Book (Science), Medicine, Ayurveda, Homeopathy, Yoga, Naturopathy, Engineering, Architecture, Oceanology, Agriculture, Veterinary, Film Technology, Photography, Marine Biology, Fisheries, Textile Technology
Social Sciences: Sociology, Linguistics, Psychology, Anthropology, History, Archeology, Epigraphy, Political Science, Home Science, Library Science, Religion, Philosophy, Economics, Logic, Journalism, Folklore/Mythology, Public Administration, Law, Business Management, Education, Text Books – Social Science
