Indian Languages - 2001

Presentation transcript:

1

2 Indian Languages
Adi, Afghani / Kabuli / Pashto, Anal, Angami, Ao, Arabic / Arbi, Balti, Bhili / Bhilodi, Bhotia, Bhumij, Bishnupuriya, Chakhesang, Chakru / Chokri, Chang, Coorgi / Kodagu, Deori, Dimasa, English, Gadaba, Gangte,
Garo, Gondi, Halabi, Halam, Hmar, Ho, Jatapu, Juang, Kabui, Karbi / Mikir, Khandeshi, Kharia, Khasi, Khezha, Khiemnungan, Khond / Kondh, Kinnauri, Kisan, Koch, Koda / Kora,
Kolami, Kom, Konda, Konyak, Korku, Korwa, Koya, Kui, Kuki, Kurukh / Oraon, Ladakhi, Lahauli, Lahnda, Lakher, Lalung, Lepcha, Liangmei, Limbu, Lotha, Lushai / Mizo,
Malto, Maram, Maring, Miri / Mishing, Mishmi, Mogh, Monpa, Munda, Mundari, Nicobarese, Nissi / Dafla, Nocte, Paite, Parji, Pawi, Persian, Phom, Pochury, Rabha, Rai,
Rengma, Sangtam, Savara, Sema, Sherpa, Shina, Simte, Tamang, Tangkhul, Tangsa, Thado, Tibetan, Tripuri, Tulu, Vaiphei, Wancho, Yimchungre, Zeliang, Zemi, Zou

3 Mission Statement
Annotated, quality language data (both text and speech) and tools in Indian languages for individuals, institutions, industry etc., for research and development. Created in house, through outsourcing, and through acquisition.

4 Objectives
- A repository of linguistic resources in all Indian languages in the form of text, speech and lexical corpora.
- Facilitating creation of such databases by different organizations.
- Setting standards for data collection and storage of corpora for different research and development activities.
- Supporting development and sharing of tools for data collection and management.
- Facilitating training through workshops, seminars etc. in technical as well as process-related issues.
- Creating and maintaining the LDC-IL website that would be the primary gateway for accessing LDC-IL resources.
- Designing or providing help in creation of appropriate language technology for mass use.
- Providing the necessary linkages between academic institutions, individual researchers and the masses.

5 Participating Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
- IISc Bangalore
- All Indian Institutes of Technology
- IIITs at Hyderabad and elsewhere
- ISI Calcutta/Hyderabad/Bangalore
- C-DAC, Pune
- TIFR Mumbai
- Universities like HCU, DU, JNU and NEHU
- HP Labs India; IBM
- Language institutions like KHS, NCPUL & RSKS
- and, of course, the MCIT-TDIL

6 Funding & Management
- The core funding will come from the Government of India.
- All activities will be in a project mode.
- LDC-IL will attempt to leverage expertise already available to cut avoidable cost and delay.
- All staff will be on contract.
- All receipts and payments, whether through internet gateways or through conventional means, will go to the Consolidated Fund. However, the Government will release grants to the Consortium as required.
- If need be, the support will be extended beyond the initial six-year period.
- As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.
- An annual progress report will be submitted to the Government.

7 Arrangements
1. LDC-IL will be open to all institutions, research organizations, and the corporate sector from all over the world.
2. Members will be encouraged to contribute databases and share revenues from sale of the data they contribute.
3. The databases will be available for R&D purposes to all members and non-members on payment of the appropriate fee, with a license for use only.
4. The organization will be asked to sign a License Agreement that the databases will not be distributed by it to others either free or for a fee.
5. The IP and the copyright of any product developed as a result of such an R&D activity shall lie with the organization that has created the product.

8 Tasks
- Establishing standards
- Creating language resources
- Annotating language data
- Building systems/helping system building
- Creating human resources
- Co-ordinating language resource development activities

9 Major Areas
- Linguistic Resource Development
- Creation of different kinds of corpora, including pathological speech and historical/inscriptional databases
- Natural Language Processing
- Speech Recognition and Synthesis
- Character Recognition
- By-products like word finders, lexicons of different kinds, thesauri, usage compilations etc.

10 Text Corpora - Monolingual / Parallel Corpora (SL)
Columns: Sl. No., Languages, 1st Year, 2nd Year, 3rd Year, 4th Year, 5th Year, Total
Languages: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu

11 Tools for Corpora Management & Analysis
- Frequency analyzers for character, word and sentence.
- KWIC and KWOC retrievers.
- Tool for automatic transliteration from Indian language scripts to Roman and vice versa: Kannada, Tamil, Telugu, Assamese, Bengali, Manipuri, Malayalam, Punjabi, Oriya, Gujarati.
- Parallel corpora tools for text alignment, including a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora.
- Tools for:
  - Morphological analysis
  - POS tagging
  - Semantic tagging
  - Syntactic tree bank
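
As an illustration of what a KWIC (keyword in context) retriever does, here is a minimal Python sketch. It is not the LDC-IL tool; the function name `kwic`, the `width` parameter and the punctuation set are assumptions made for this example. Tokens are split on whitespace rather than with a `\w+` pattern, because Python's `\w` does not match the combining vowel signs used in many Indic scripts and would break words apart.

```python
# Punctuation stripped from token edges; the danda characters are included
# because they end sentences in several Indian scripts (assumed set).
PUNCT = ".,;:!?\"'()[]।॥"

def kwic(text, keyword, width=3):
    """Return one line per occurrence of `keyword`, with up to `width`
    tokens of context on each side and the keyword in square brackets."""
    tokens = [t.strip(PUNCT) for t in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

for line in kwic("The cat sat on the mat.", "the", width=2):
    print(line)
```

A KWOC (keyword out of context) listing would differ only in presentation: the keyword is printed as a heading and the full source lines are listed under it.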

12 Computational Grammars for Indian Languages
Task 1: Hierarchical POS tag set
Task 2: Dictionary - (a) closed class words and (b) open class words
Task 3: Morphological analyzer and generator
Task 4: Manual POS annotation and development of an automatic tagger
Task 5: Semantic tagging
Task 6: Chunker
Task 7: Tree banking
Task 8: Shallow parser, which will eventually turn into a deep parser

13 Linguistic Research
- Lexical studies
- Semantics
- Pragmatics & discourse analysis
- Sociolinguistics
- Dialectology & variation studies
- Stylistics
- Language teaching
- Historical linguistics
- Psycholinguistics
- Social psychology
- Cultural studies

14 Speech Corpora
- Develop tools that facilitate collection of high-quality speech data.
- Collect data that can be used for building speech recognition, speech synthesis and speech-to-speech translation from one language to another language spoken in India (including Indian English).
- Apart from these applications, in speech corpora, as in text corpora, the main efforts are on the engineering side. So efforts shall also be made to collect:
  - Child language corpora
  - Pathological speech/language data
  - Speech error data

15 Applications
- Speech recognition and speech synthesis
- Speech-to-speech translation for a pair of Indian languages
- Command and control applications
- Multimodal interfaces to the computer in Indian languages
- readers over the telephone
- Readers for the visually disadvantaged
- Speech-enabled office suite etc.

16 Speech Dataset
1. Phonetically Balanced Vocabulary
2. Phonetically Balanced Sentences
3. Connected Text created using phonetically balanced vocabulary
4. Date Format
5. Command and Control Words
6. Proper Nouns: 500 place names and 500 person names
7. Most Frequent Words
8. Form and Function Words
9. News domain: news, editorial, essay; each text not less than 500 words

17 Number of Speakers
Data will be collected from a minimum of 300 speakers (150 male and 150 female) of each language. In addition, natural discourse data from various domains shall be collected for Indian languages for research into spoken language. Data for speech synthesis shall be collected in a studio environment from a limited number of speakers: 3 male and 3 female. They shall invariably have very good voice quality and be professional voice artists/media announcers.

18 Annotation of data:
1. Data to be used for speech recognition shall be annotated at phoneme, syllable, word and sentence levels.
2. Data to be used for speech synthesis shall be annotated at phone, phoneme, syllable, word and phrase levels.
Annotation tools: Tools will be developed for semi-automatic annotation of speech data. These tools will also be useful for annotating speech synthesis databases.
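
One way to picture multi-level annotation is as a set of time-aligned tiers, one per level. The Python sketch below is only an assumption about how such data could be held in memory; the class names `Segment` and `Utterance`, the timings and the example labels are hypothetical and do not describe the LDC-IL annotation format. The level names are the ones listed on the slide.

```python
from dataclasses import dataclass, field

# Annotation levels named on the slide, from smallest to largest unit.
LEVELS = {
    "recognition": ("phoneme", "syllable", "word", "sentence"),
    "synthesis": ("phone", "phoneme", "syllable", "word", "phrase"),
}

@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    label: str

@dataclass
class Utterance:
    purpose: str  # "recognition" or "synthesis"
    tiers: dict = field(default_factory=dict)  # level name -> list of Segment

    def add(self, level, start, end, label):
        """Append a labelled time span to the tier for `level`."""
        if level not in LEVELS[self.purpose]:
            raise ValueError(f"{level!r} is not a {self.purpose} level")
        self.tiers.setdefault(level, []).append(Segment(start, end, label))

    def within(self, parent_level, index, child_level):
        """Labels on `child_level` that fall inside one parent segment."""
        parent = self.tiers[parent_level][index]
        return [s.label for s in self.tiers.get(child_level, [])
                if s.start >= parent.start and s.end <= parent.end]

# Hypothetical example: one word with three syllables.
utt = Utterance("recognition")
utt.add("word", 0.0, 0.6, "namaste")
utt.add("syllable", 0.0, 0.2, "na")
utt.add("syllable", 0.2, 0.4, "ma")
utt.add("syllable", 0.4, 0.6, "ste")
print(utt.within("word", 0, "syllable"))
```

Keeping each level on its own tier lets a semi-automatic tool propose boundaries at one level while a human corrects another.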

19 Coverage of languages
I Year, II Year, III Year
1. Bengali    7. Manipuri    13. Maithili   19. Sindhi
2. Hindi      8. Malayalam   14. Dogri      20. Oriya
3. Tamil      9. Punjabi     15. Bodo       21. Marathi
4. Telugu     10. Urdu       16. Konkani    22. Khasi
5. Assamese   11. Kannada    17. Santali    23. Tulu
6. Nepali     12. Gujarati   18. Kashmiri   24. Kodava

20 Indian Sign Language corpora
Northern India: Delhi, 1st year
Southern India: Mysore, 2nd year
North-eastern India: Shillong, 3rd year
Western India: Ichalkaranji, 4th year
Eastern India: Kolkata, 5th year
Lexical items Sentences 2500 Production data 50

21 Character Recognition Development of standards, tools and linguistic resources (datasets) for the fields of Online HWR, Offline HWR and OCR. Promotion of development of these technologies. Promotion of development of important and challenging applications of these technologies in the context of Indic languages and scripts.

22 By-products like lexicon, thesauri, WordNet etc.
Creation of frequency dictionaries - five per year
First year: Bengali, Hindi, Kannada, Manipuri, Urdu.
Second year: Bodo, Dogri, Maithili, Nepali, Konkani.
Third year: Assamese, Gujarati, Oriya, Punjabi, Tamil.
Fourth year: Kashmiri, Malayalam, Marathi, Sanskrit, Santali.
Fifth year: other languages.
Multilingual multi-directional dictionary - an ongoing process
Aiding wordnet creation and collaborating with others for the same - an ongoing process

23 Licensing Policy
Licensing is an important issue for LDC-IL. The draft licensing policy shall be evolved through discussions within one year. It shall be finalized within a further year, by which time the annotated data will be available for delivery.

24 Evaluation
The data that LDC-IL creates and obtains has to be evaluated. For each kind of data, tool etc., metrics have to be evolved. Benchmarks, gold standards etc. have to be developed. This shall be accomplished for the first set of tools within one year. In the following year(s) the same shall be developed for other data and tools.

25 Beyond Roadmap
Above all, and in addition to what it has projected in the roadmap, LDC-IL will respond positively to the specific language data needs of individuals, institutions and industry by taking up their requests on a priority basis for licensing purposes. Initially, derivatives of the databases shall be licensed; once all the licensing issues are resolved, the databases themselves shall also be licensed.

26 Monolingual Text Corpora
Sl. No.  Language  Word Count
1.   Bengali   50,42,
2.   Bodo      6,37,
3.   Dogri     8,24,
4.   English   21,15,
5.   Hindi     3,45,85,
6.   Kannada   71,84,
7.   Kodava    1,83,322
8.   Konkani   I5,69,906
9.   Maithili  83,92,
10.  Manipuri  16,37,
11.  Nepali    21,58,
12.  Tamil     4,67,
13.  Urdu      22,80,
14.  Yarava    13,904

27 Parallel Text Corpora
Sl. No.  Language  Texts  Word Count
1   English   05   1,26,828
    Bengali        93,952
2   English   04   88,025
    Dogri          93,293
3   English   73   17,57,736
    Hindi          17,53,235
4   English   32   7,79,258
    Kannada        4,76,855
5   English   07   1,59,419
    Maithili       1,36,421
6   English   11   2,63,256
    Nepali         2,02,157
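
The word counts above use Indian digit grouping (the last three digits, then groups of two, so 17,57,736 is 1,757,736). For readers who want to work with these figures programmatically, here is a small Python sketch; the function names `parse_indian_count` and `format_indian` are made up for this example, and the figures are the English-side word counts from the slide.

```python
def parse_indian_count(text):
    """Convert a count written with Indian digit grouping to an int."""
    return int(text.replace(",", ""))

def format_indian(n):
    """Format a non-negative int with Indian grouping: 3 digits, then 2s."""
    s = str(n)
    if len(s) <= 3:
        return s
    head, tail = s[:-3], s[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # peel off two digits at a time
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])

# English-side word counts of the six parallel corpora listed on the slide.
english_counts = ["1,26,828", "88,025", "17,57,736",
                  "7,79,258", "1,59,419", "2,63,256"]
total = sum(parse_indian_count(c) for c in english_counts)
print(format_indian(total))
```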

28 Speech Data Set Details
Languages: Assamese, Bengali, Gujarati, Hindi, Kannada
Phon. Bal. Vocabulary
Phon. Bal. Sentences
Connected Texts: 6, 6, 6, 6, 6
Command & Control Words
Proper Nouns
Most Frequent Words
Form & Function Words
News Domain texts: 150

29 Speech Data Set Details
Languages: Maithili, Manipuri, Nepali, Tamil, Urdu
Phon. Bal. Vocabulary
Phon. Bal. Sentences
Connected Texts: 6, 6, 6, 6, 6
Command & Control Words
Proper Nouns
Most Frequent Words: 1000
Form & Function Words
News Domain texts: 150
Other languages to be completed before March 31, 2009: Malayalam, Punjabi

30 Speech Corpora
Columns: Language; Informants (Male, Female); Duration (Minutes, Hours)
Languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Maithili, Manipuri, Nepali, Tamil
Other languages to be completed before March 31, 2009: Malayalam, Punjabi, Urdu

31 Frequency Dictionaries: Most frequent 5000 words
Published:
1. Bengali
2. Hindi
3. Kannada
4. Manipuri
To be published by March 31, 2009:
1. Nepali
2. Urdu

32 Development of Tools
The following packages will be developed:
1. KWIC and KWOC Retriever
2. Tool for Semi-Automatic Annotation of Speech Data
Corpora management packages developed:
1. Word Frequency Analyser
2. N-Gram (Bi-Gram, Tri-Gram) for word and character
3. Speech Annotation Manual prepared and published
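
To make the n-gram item concrete, the following Python sketch counts word and character bi-grams or tri-grams. It is an illustration only, not the LDC-IL package; the function names are assumptions. Note one simplification: character n-grams here are taken over Unicode code points, so in Indic scripts a consonant and its vowel sign count as separate characters. A production tool would more likely segment by grapheme cluster or akshara.

```python
from collections import Counter

def ngrams(items, n):
    """All contiguous runs of n items, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def word_ngram_counts(text, n):
    """Frequency of word n-grams; words are split on whitespace."""
    return Counter(ngrams(text.split(), n))

def char_ngram_counts(text, n):
    """Frequency of character n-grams over code points (see caveat above)."""
    return Counter("".join(g) for g in ngrams(list(text), n))

sample = "a b a b c"
print(word_ngram_counts(sample, 2).most_common())
print(char_ngram_counts("abab", 2).most_common())
```

With `n=1` the same functions give a plain word or character frequency list, which is the basis of a frequency dictionary.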

»Interns LDC-Interns LDC- 33

34