



2 2 Indian Languages - 2001
Adi, Afghani / Kabuli / Pashto, Anal, Angami, Ao, Arabic / Arbi, Balti, Bhili / Bhilodi, Bhotia, Bhumij, Bishnupuriya, Chakhesang, Chakru / Chokri, Chang, Coorgi / Kodagu, Deori, Dimasa, English, Gadaba, Gangte,
Garo, Gondi, Halabi, Halam, Hmar, Ho, Jatapu, Juang, Kabui, Karbi / Mikir, Khandeshi, Kharia, Khasi, Khezha, Khiemnungan, Khond / Kondh, Kinnauri, Kisan, Koch, Koda / Kora,
Kolami, Kom, Konda, Konyak, Korku, Korwa, Koya, Kui, Kuki, Kurukh / Oraon, Ladakhi, Lahauli, Lahnda, Lakher, Lalung, Lepcha, Liangmei, Limbu, Lotha, Lushai / Mizo,
Malto, Maram, Maring, Miri / Mishing, Mishmi, Mogh, Monpa, Munda, Mundari, Nicobarese, Nissi / Dafla, Nocte, Paite, Parji, Pawi, Persian, Phom, Pochury, Rabha, Rai,
Rengma, Sangtam, Savara, Sema, Sherpa, Shina, Simte, Tamang, Tangkhul, Tangsa, Thado, Tibetan, Tripuri, Tulu, Vaiphei, Wancho, Yimchungre, Zeliang, Zemi, Zou

3 3 Mission Statement
Annotated, quality language data (both text and speech) and tools in Indian languages for individuals, institutions, industry etc., for research and development, created in-house, through outsourcing, and through acquisition.

4 4 Objectives
- A repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
- Facilitating creation of such databases by different organizations.
- Setting standards for data collection and storage of corpora for different research and development activities.
- Supporting development and sharing of tools for data collection and management.
- Facilitating training through workshops, seminars etc. in technical as well as process-related issues.
- Creating and maintaining the LDC-IL website that would be the primary gateway for accessing LDC-IL resources.
- Designing or providing help in creation of appropriate language technology for mass use.
- Providing the necessary linkages between academic institutions, individual researchers and the masses.

5 5 Participating Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
- IISc Bangalore
- All Indian Institutes of Technology
- IIITs at Hyderabad and elsewhere
- ISI Calcutta/Hyderabad/Bangalore
- C-DAC, Pune
- TIFR Mumbai
- Universities like HCU, DU, JNU, NEHU
- HP Labs India
- IBM
- Language institutions like KHS, NCPUL & RSKS
- and, of course, the MCIT-TDIL

6 6 Funding & Management
- The core funding will come from the Government of India.
- All activities will be in a project mode.
- An attempt will be made to leverage expertise already available to cut avoidable cost and delay.
- All staff will be on contract.
- All receipts and payments, whether through internet gateways or through conventional means, will go to the Consolidated Fund. However, the Government will release grants to the Consortium as required.
- If need be, the support will be extended beyond the initial six-year period.
- As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.
- An annual progress report will be submitted to the Government.

7 7 Arrangements
1. LDC-IL will be open to all institutions, research organizations, and the corporate sector from all over the world.
2. Members will be encouraged to contribute databases and share revenues from sale of the data they contribute.
3. The databases will be available for R&D purposes to all members and non-members on payment of the appropriate fee, with a license for use only.
4. The organization will be asked to sign a License Agreement that the databases will not be distributed by it to others either free or for a fee.
5. The IP and the copyright of any product developed as a result of such an R&D activity shall lie with the organization that has created the product.

8 8 Tasks
- Establishing standards
- Creating language resources
- Annotating language data
- Building systems/helping system building
- Creating human resources
- Co-ordinating language resource developing activities

9 9 Major Areas
- Linguistic Resource Development
- Creation of different kinds of corpora, including pathological speech and historical/inscriptional databases
- Natural Language Processing
- Speech Recognition and Synthesis
- Character Recognition
- By-products like word finders, lexicons of different kinds, thesauri, usage compilations etc.

10 10 Text Corpora - Monolingual / Parallel Corpora (SL)
Sl. No.  Language    1st Year  2nd Year  3rd Year  4th Year  5th Year  Total
1        Assamese    2         2         2         2         2         10
2        Bengali     2         2         2         2         2         10
3        Bodo        0.6       0.6       0.6       0.6       0.6       3
4        Dogri       0.6       0.6       0.6       0.6       0.6       3
5        Gujarati    2         2         2         2         2         10
6        Hindi       2         2         2         2         2         10
7        Kannada     2         2         2         2         2         10
8        Kashmiri    1         1         1         1         1         5
9        Konkani     1         1         1         1         1         5
10       Maithili    1         1         1         1         1         5
11       Malayalam   2         2         2         2         2         10
12       Manipuri    1         1         1         1         1         5
13       Marathi     2         2         2         2         2         10
14       Nepali      2         2         2         2         2         10
15       Oriya       2         2         2         2         2         10
16       Punjabi     2         2         2         2         2         10
17       Sanskrit    0.4       0.4       0.4       0.4       0.4       2
18       Santali     0.6       0.6       0.6       0.6       0.6       3
19       Sindhi      0.6       0.6       0.6       0.6       0.6       3
20       Tamil       2         2         2         2         2         10
21       Telugu      2         2         2         2         2         10
22       Urdu        2         2         2         2         2         10

11 11 Tools for Corpora Management & Analysis
- Frequency analyzers for character, word, sentence.
- KWIC and KWOC retrievers.
- Tool for automatic transliteration from Indian language scripts to Roman and vice versa: Kannada, Tamil, Telugu, Assamese, Bengali, Manipuri, Malayalam, Punjabi, Oriya, Gujarati.
- Parallel corpora tools for text alignment, including a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora.
- Tools for morphological analysis, POS tagging, semantic tagging, and syntactic tree banking.
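As a rough illustration of what a frequency analyzer and a KWIC (keyword in context) retriever do, here is a minimal Python sketch. It is not LDC-IL's implementation: the function names are hypothetical, tokens are assumed to be whitespace-separated, and the sample sentence is invented. Real Indian-language text would need script-aware tokenization and normalization.

```python
from collections import Counter


def word_frequencies(text):
    """Count tokens, assuming whitespace-separated words (a simplification)."""
    return Counter(text.split())


def char_frequencies(text):
    """Count every non-whitespace character."""
    return Counter(ch for ch in text if not ch.isspace())


def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence.

    `width` is the number of tokens kept on each side of the keyword.
    """
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented sample text, for illustration only.
sample = "the corpus holds text and the corpus holds speech"

for left, key, right in kwic(sample, "corpus", width=2):
    print(f"{left:>12} [{key}] {right}")
```

A KWOC (keyword out of context) listing could reuse the same hits and print the keyword as a heading followed by the full lines it occurs in.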

12 12 Computational Grammars for Indian Languages
Task 1: Hierarchical POS tag set
Task 2: Dictionary - (a) closed class words and (b) open class words
Task 3: Morphological analyzer and generator
Task 4: Manual POS annotation and development of an automatic tagger
Task 5: Semantic tagging
Task 6: Chunker
Task 7: Tree banking
Task 8: Shallow parser, which will eventually turn into a deep parser

13 13 Linguistic Research
- Lexical studies
- Semantics
- Pragmatics & Discourse analysis
- Sociolinguistics
- Dialectology & Variation studies
- Stylistics
- Language teaching
- Historical linguistics
- Psycholinguistics
- Social psychology
- Cultural studies

14 14 Speech Corpora
- Develop tools that facilitate collection of high-quality speech data.
- Collect data that can be used for building speech recognition, speech synthesis and speech-to-speech translation from one language to another language spoken in India (including Indian English).
- As in the area of text corpora, the main efforts in speech corpora are on the engineering side. Apart from these applications, efforts shall therefore also be made to collect:
  - Child language corpora
  - Pathological speech/language data
  - Speech error data

15 15 Applications
- Speech Recognition and Speech Synthesis
- Speech-to-speech translation for a pair of Indian languages
- Command and control applications
- Multimodal interfaces to the computer in Indian languages
- E-mail readers over the telephone
- Readers for the visually disadvantaged
- Speech-enabled Office Suite etc.

16 16 Speech Dataset
1. Phonetically Balanced Vocabulary
2. Phonetically Balanced Sentences
3. Connected Text created using phonetically balanced vocabulary
4. Date Format
5. Command and Control Words
6. Proper Nouns: 500 place and 500 person names
7. Most Frequent Words: 1000
8. Form and Function Words
9. News domain: news, editorial, essay - each text not less than 500 words

17 17 Number of Speakers
Data will be collected from a minimum of 300 speakers (150 male and 150 female) of each language. In addition, natural discourse data from various domains shall be collected for Indian languages for research into spoken language. Data for speech synthesis shall be collected from a limited number of speakers, 3 male and 3 female, in a studio environment. They shall invariably have very good voice quality and be professional voice givers/media announcers.

18 18 Annotation of data:
1. Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels.
2. Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word, and phrase levels.
Annotation tools: Tools will be developed for semiautomatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.
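One way to picture multi-level annotation is as a set of named tiers, each holding time-aligned labelled segments over the same recording. The Python sketch below is only an illustration under that assumption: the class names, field names, file name and sample labels are hypothetical and do not describe the format LDC-IL uses. Only the two lists of levels come from the slide.

```python
from dataclasses import dataclass, field

# Annotation levels as listed on the slide.
RECOGNITION_TIERS = ("phoneme", "syllable", "word", "sentence")
SYNTHESIS_TIERS = ("phone", "phoneme", "syllable", "word", "phrase")


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str


@dataclass
class AnnotatedUtterance:
    audio_file: str
    allowed_tiers: tuple = RECOGNITION_TIERS
    tiers: dict = field(default_factory=dict)  # tier name -> list of Segment

    def add(self, tier, start, end, label):
        """Attach one labelled time span to the given tier."""
        if tier not in self.allowed_tiers:
            raise ValueError(f"unknown tier: {tier}")
        if end <= start:
            raise ValueError("segment must have positive duration")
        self.tiers.setdefault(tier, []).append(Segment(start, end, label))

    def labels(self, tier):
        """Labels of one tier, ordered by start time."""
        segments = sorted(self.tiers.get(tier, []), key=lambda s: s.start)
        return [s.label for s in segments]


# Hypothetical recording and labels, for illustration only.
utt = AnnotatedUtterance("sample.wav")
utt.add("word", 0.40, 0.90, "world")
utt.add("word", 0.00, 0.40, "hello")
utt.add("sentence", 0.00, 0.90, "hello world")
print(utt.labels("word"))
```

Keeping the tiers separate means a semiautomatic tool can fill one level (say, words) first and leave finer levels for later manual correction.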

19 19 Coverage of languages
I Year, II Year, III Year
1. Bengali    7. Manipuri    13. Maithili   19. Sindhi
2. Hindi      8. Malayalam   14. Dogri      20. Oriya
3. Tamil      9. Punjabi     15. Bodo       21. Marathi
4. Telugu     10. Urdu       16. Konkani    22. Khasi
5. Assamese   11. Kannada    17. Santali    23. Tulu
6. Nepali     12. Gujarati   18. Kashmiri   24. Kodava

20 20 Indian Sign Language corpora
- Northern India: Delhi, 1st year
- Southern India: Mysore, 2nd year
- North-eastern India: Shillong, 3rd year
- Western India: Ichalkaranji, 4th year
- Eastern India: Kolkata, 5th year
Lexical items: 15000
Sentences: 2500
Production data: 50

21 21 Character Recognition Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR. Promotion of development of these technologies. Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.

22 22 By-products like lexicon, thesauri, WordNet etc.
Creation of frequency dictionaries - five per year
- First year: Bengali, Hindi, Kannada, Manipuri, Urdu.
- Second year: Bodo, Dogri, Maithili, Nepali, Konkani.
- Third year: Assamese, Gujarati, Oriya, Punjabi, Tamil.
- Fourth year: Kashmiri, Malayalam, Marathi, Sanskrit, Santali.
- Fifth year: other languages
Multilingual multi-directional dictionary - an ongoing process
Aiding wordnet creation and collaborating with others for the same - an ongoing process

23 23 Licensing Policy
Licensing is an important issue for LDC-IL. The draft policy for licensing shall be evolved through discussions within one year. It shall be finalized within a further year, by which time the annotated data will be available for delivery.

24 24 Evaluation
The data that LDC-IL creates and obtains has to be evaluated. For each kind of data, tool etc., metrics have to be evolved. Benchmarking, good standards etc. have to be developed. Within a one-year time frame, this shall be accomplished for the first set of tools. In the following year(s), the same shall be developed for other data and tools.

25 25 Beyond Roadmap
Above all, and in addition to what it has projected in the roadmap, LDC-IL will respond positively to the specific language data needs of individuals, institutions and industry by taking up their requests on a priority basis for licensing purposes. In the beginning, the derivatives of the databases shall be licensed; after all the licensing issues are resolved, the databases themselves shall also be licensed.

26 26 Monolingual Text Corpora
Sl. No.  Language   Word Count
1.       Bengali    50,42,724
2.       Bodo       6,37,801
3.       Dogri      8,24,443
4.       English    21,15,461
5.       Hindi      3,45,85,882
6.       Kannada    71,84,702
7.       Kodava     1,83,322
8.       Konkani    I5,69,906
9.       Maithili   83,92,505
10.      Manipuri   16,37,104
11.      Nepali     21,58,324
12.      Tamil      4,67,096
13.      Urdu       22,80,782
14.      Yarava     13,904
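The word counts above use the Indian digit-grouping convention: the last three digits form one group and the remaining digits are grouped in twos, so 3,45,85,882 is 34,585,882 in Western grouping. The short Python sketch below shows one way to read and write such figures; the function names are hypothetical and only non-negative whole numbers are handled.

```python
def parse_indian_number(text):
    """Convert a string such as '3,45,85,882' to an int.

    Raises ValueError if anything other than ASCII digits and commas is present.
    """
    digits = text.replace(",", "")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a plain number: {text!r}")
    return int(digits)


def format_indian_number(n):
    """Format a non-negative int with Indian grouping (3 digits, then 2s)."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


hindi_words = parse_indian_number("3,45,85,882")
print(hindi_words)                        # 34585882
print(format_indian_number(hindi_words))  # 3,45,85,882
```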

27 27 Parallel Text Corpora
Sl. No.  Language   Texts  Word Count
1        English    05     1,26,828
         Bengali           93,952
2        English    04     88,025
         Dogri             93,293
3        English    73     17,57,736
         Hindi             17,53,235
4        English    32     7,79,258
         Kannada           4,76,855
5        English    07     1,59,419
         Maithili          1,36,421
6        English    11     2,63,256
         Nepali            2,02,157

28 28 Speech Data Set Details
Languages (in column order): Assamese, Bengali, Gujarati, Hindi, Kannada
Phon. Bal. Vocabulary: 439, 561, 689, 800, 390
Phon. Bal. Sentences: 200, 500, 150
Connected Texts: 6, 6, 6, 6, 6
Command & Control Words: 250, 238, 296, 250, 82
Proper Nouns: 841, 823, 902, 824, 1018
Most Frequent Words: 1000
Form & Function Words: 265, 178, 232, 200, 432
News Domain texts: 150

29 29 Speech Data Set Details
Languages (in column order): Maithili, Manipuri, Nepali, Tamil, Urdu
Phon. Bal. Vocabulary: 509, 374, 421, 565, 775
Phon. Bal. Sentences: 208, 200, 228, 195
Connected Texts: 6, 6, 6, 6, 6
Command & Control Words: 18724374369141
Proper Nouns: 824, 825, 834, 908, 500
Most Frequent Words: 1000
Form & Function Words: 243, 189, 190, 598, 380
News Domain texts: 150
Other languages to be completed before March 31, 2009: Malayalam, Punjabi

30 30 Speech Corpora
Language   Informants (Male, Female)  Duration (Minutes)  Duration (Hours)
Assamese   81                         1985                33.05
Bengali    238, 234                   14850               247.50
Gujarati   77, 83                     3769                62.49
Hindi      314, 316                   11483               191.23
Kannada    82                         2940                49.00
Maithili   82                         3340                55.40
Manipuri   82                         2602                43.22
Nepali     48, 60                     3307                55.07
Tamil      78, 71                     5127                85.45
Other languages to be completed before March 31, 2009: Malayalam, Punjabi, Urdu

31 31 Frequency Dictionaries: Most frequent 5000 words
Published:
1. Bengali
2. Hindi
3. Kannada
4. Manipuri
To be published by March 31, 2009:
1. Nepali
2. Urdu

32 32 Development of Tools
The following packages will be developed:
1. KWIC and KWOC Retriever
2. Tool for Semi-Automatic Annotation of Speech Data
Corpora management packages developed:
1. Word Frequency Analyser
2. N-Gram (Bi-Gram, Tri-Gram) for word and character
3. Speech Annotation Manual prepared and published
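For readers unfamiliar with the N-gram package mentioned above, the following Python sketch shows the basic idea of counting bi-grams and tri-grams over words and over characters. It is an illustration only, not the LDC-IL tool: the function names are hypothetical, words are assumed to be whitespace-separated, and characters are taken as Python code points, which for Indic scripts is not the same as a written syllable.

```python
from collections import Counter


def ngrams(items, n):
    """All contiguous runs of n items, as tuples (empty if too few items)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def word_ngram_counts(text, n):
    """Count word n-grams, assuming whitespace-separated words."""
    return Counter(ngrams(text.split(), n))


def char_ngram_counts(text, n):
    """Count character n-grams over code points, spaces included."""
    return Counter(ngrams(list(text), n))


# Invented sample text, for illustration only.
sample = "to be or not to be"

bigrams = word_ngram_counts(sample, 2)
trigrams = word_ngram_counts(sample, 3)
print(bigrams.most_common(1))
print(len(trigrams))
```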

33 33 Interns LDC-


