Language Technologies for Multilingual Societies META-FORUM 2011, June 27/28, 2011, Budapest, Hungary Swaran Lata Director & Head, Technology Development.


1 Language Technologies for Multilingual Societies META-FORUM 2011, June 27/28, 2011, Budapest, Hungary Swaran Lata Director & Head, Technology Development for Indian Languages Programme & Country Manager, W3C India Govt. of India 6 CGO Complex, Lodi Road, New Delhi 110 003 Meta forum 2011 1

2 Diverse Multilinguality in India and its Complexity

3 Organization of my talk
Why and How TDIL Programme got initiated
Important Milestones
  • Technology Development
  • Multilingual Standards
  • Proliferations
Lessons Learnt
Problems Arising out of Multilingualism
Funding vs. Long-term Goals
Potential for Collaboration

4 Multilingual and Multicultural India
Constitution of India (8th Schedule covers 22 Indian languages)
  • Emphasis on planned development of Indian languages for use in all spheres of life
  • Development and use of Indian languages in all domains of national life to maintain linguistic and cultural diversity
  • Development of sustainable technologies to break linguistic barriers across diverse speech communities
  • Provide equal opportunities to citizens through the use of Information Technology
Official Languages Act 1963
  • Hindi as Official Language of the Republic of India
  • 15 Indian Languages (ILs) in 8th Schedule
  • 3 ILs added in 1992 (Konkani, Manipuri and Nepali)
  • 4 ILs added in 2003 (Bodo, Maithili, Dogri and Santali)

5 Why and How TDIL Programme got initiated
DoE (Department of Electronics), 1976
MIT (Ministry of Information Technology), year 2000
MCIT (Ministry of Communication & Information Technology), year 2002 = DIT (Department of Information Technology) + DoT (Department of Telecommunications)
Technology Development Council (TDC), 1988 – 1991
  • Funded project for development of Devanagari Graphics and Intelligence based Script Technology (GIST) UNIX terminal at IIT Kanpur
  • Exploiting phonetic correspondence of Indian languages – GIST extended to other Indian languages
  • GIST Card (PC add-on card) developed at CDAC Pune (Society set up in 1988)
  • Indian Standard Code for Information Interchange (ISCII) – BIS: 13194 (1991) – 8-bit encoding and keyboard layout standard covering 15 languages

6 Technology Development for Indian Languages (TDIL) Programme – Milestones
Timeline markers: 1995, 2000, 2005, 2009, 2011
Increase in funding and participation, evolving vision and focus on standards
Phases: Seeding Phase; Capacity Building Phase; Multilingual Technology Development; Future Roadmap
PoC research in Hindi and monolingual corpora building
Set-up of Resource Centres in each state, mentoring through existing projects
  • Consortium mode – multiple institutional projects in MT, OCR, OHWR, CLIA & Speech
  • Multilingual resources development based on standards
  • Free BIPKs for 22 ILs
  • Major thrust on research in speech and mobile area
  • Productization efforts
  • Standards for Multilingual Web
  • Addressing language-specific bottlenecks
  • Localization initiatives

7 Growth of Language Technology Research Institutions
[Chart; data labels: 8, 85, 13, 31]

8 Machine Translation System [1995-2010 – Consolidation]
Beta Deployments:
English to Hindi Machine Translation System has been deployed in Parliament for machine translation of the Parliament proceedings.
  • Matching efforts in integrating the MT system into the organizational workflow and training of the staff
  • Improvement in quality and speed of translation service
English to Indian Languages Machine Translation System in 3 Indian languages (Hindi, Bengali, Malayalam) to translate the voluminous course material of the Vocational Training Programme:
  • Reduces cost of translation by 30%
  • Saves human effort by more than 50%

9 Machine Translation System [1995-2010 – Consolidation]
Machine Translation Systems: English to Indian Languages – 8 language pairs
The Machine Translation Systems have been made available through the TDIL Data Centre (http://www.tdil-dc.in) for feedback and improvement through crowdsourcing.

10 Machine Translation System [1995-2010 – Consolidation]
Machine Translation Systems: Indian Languages to Indian Languages – 6 language pairs

11 Cross-lingual Information Access [since 2006]
Across six Indian languages: Hindi, Marathi, Bengali, Punjabi, Tamil and Telugu; Tourism domain
Index-based searching with pre-processing of the Indian language query [precision@5 = 0.4 to 0.5]
UNL-based search tried in Tamil to compare efficacy [precision with index-based search = 0.42; with UNL-based search = 0.59]
Next 3 years target:
  • Enhance precision to 0.7
  • Addition of 3 languages [Assamese, Odia, Gujarati]
Beta trial proposed on existing search engine.
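
The precision@5 figures quoted for cross-lingual search can be read as the fraction of the first five returned documents that are relevant to the query. The following is a minimal Python sketch of that metric, not the evaluation code used by the programme; the function name, document IDs and relevance judgments are hypothetical and chosen only for illustration.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=5):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Dividing by k (not by len(top_k)) penalizes result lists shorter than k.
    return hits / k


# Hypothetical ranking for one tourism-domain query.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d8"}
print(precision_at_k(ranked, relevant, k=5))  # 2 relevant in top 5 -> 0.4
```

A system-level figure such as 0.4 to 0.5 would normally be the average of this value over a set of test queries.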

12 Optical Character Recognition [since 2006]
11 Indian scripts
Accuracy: character-level 97%; word-level 80-85%
Working on printed documents from between 1960 and 2000
Response time: 3-4 minutes
Next 3 years target:
  • Word-level accuracy > 90%
  • Handling bilingual documents [IL + English]
  • Multi-column layout support
  • Post-correction tools
  • Braille interface development and deployment for Indian language book publishing
  • On-line OCR service through TDIL Data Centre
  • Deployment at a historical library
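
The gap between 97% character-level and 80-85% word-level accuracy is roughly what one would expect when a word counts as correct only if every character in it is correct. The sketch below works through that arithmetic in Python under a simplifying assumption that is not from the slides: character errors are independent and equally likely at every position. The function name and the word lengths tried are illustrative.

```python
def expected_word_accuracy(char_accuracy, word_length):
    """Probability that all characters of a word are recognized correctly,
    assuming independent, identically distributed character errors."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


# Illustrative word lengths (in recognized character units).
for length in (5, 6, 7):
    print(length, round(expected_word_accuracy(0.97, length), 3))
```

Under this assumption, words of five to seven units land between about 0.81 and 0.86, which is in the same range as the reported word-level figures. Real OCR errors are correlated, so this is only a rough consistency check.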

13 On-line Handwriting Recognition System [OHWR] – since 2006
Across six Indian languages: Devanagari, Kannada, Malayalam, Bengali, Tamil and Telugu; SDK developed
  • Stroke level – 95%
  • Character level – 84%
Census data collection stored as Unicode database
Next 3 years target:
  • Achieve complete coverage of conjuncts and complex characters, Nukta characters
  • Integration with TTS and deployment for the speech impaired
  • Addition of new languages [Assamese, Urdu, Marathi, Manipuri, Bodo]
Beta trial proposed on existing search engine.

14 Text-to-Speech in Indian Languages [since 2006]
Based on the Festvox framework
TTS engine integrated with NVDA (Windows) and ORCA (Linux) screen readers
Mean Opinion Score: Hindi 3.2; Bengali, Marathi, Telugu, Tamil, Malayalam: ~3.0
Training of visually challenged persons on screen readers.
Next 3 years target:
  • Improvement of MOS of TTS engine up to 3.8 – 4.0
  • TTS engine for Indian languages for mobile Android platforms
  • Addition of 5 new Indian languages – Odia, Gujarati, Assamese, Bodo
  • Proof of concept for adaptation for one Hindi dialect.
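
A Mean Opinion Score is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The Python sketch below shows that calculation. The ratings are hypothetical values chosen to produce a mean of 3.2; they are not TDIL evaluation data, and the function name is illustrative.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings, each expected on the 1-5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Hypothetical ratings from five listeners for one synthesized utterance.
listener_ratings = [3, 4, 3, 3, 3]
print(mean_opinion_score(listener_ratings))  # 16 / 5 -> 3.2
```

Moving from about 3.0 to the 3.8 – 4.0 target therefore means most listeners rating the speech 4 rather than 3.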

15 ORCA Screen Reader integrated with IL TTS

16 Multilingual Standards – Multi stake holders
Encoding: UNICODE, ISO
Web content, architecture and web-based services: W3C
Language tag, reference glyph set, keyboard: ISO
Locale data: UNICODE
Linguistic resources, tools and evaluation: ELRA, NIST, LDC
Internet protocol and domain name: ICANN, IANA, IETF, ISOC

17 Standardization Activity for Indian Languages
[Chart; axes: number of languages/standards components vs. year]
W3C work initiated in 5 areas: Internationalization, CSS, Mobile Web, E-Gov and Speech

18 UNICODE
Completed for 22 Official Indian Languages and Vedic Sanskrit – Unicode 6.0
[Figure labels: Devanagari, Bengali, Malayalam]

19 Enabling of New Rupee Symbol in ICT environment [Govt. Notification in July 2010]
Encoding
  • Included in Unicode 6.0 – Code Point 20B9 [August 2010]
  • Included in ISO 10646-1 [Oct 2010]
  • Included in ISCII – Notification issued by BIS
Keyboard
  • Key combination – CTRL + ALT + 4 or AltGr + 4
  • Consensus by all stake-holders and major industry players
  • ISO-14442 – Notification issued by BIS [Dec 1, 2010]
  • Software patches released by Microsoft, Redhat, C-DAC [April 2011]
Fonts
  • Sakal-Bharti font for New Rupee Symbol
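
The code point 20B9 cited for the new Rupee symbol can be inspected with Python's standard `unicodedata` module. This is a small sketch for checking the character's name and its UTF-8 byte sequence; the variable name is illustrative, and the output depends on the Python build shipping Unicode 6.0 or later data.

```python
import unicodedata

# U+20B9, the code point cited for the Indian Rupee symbol.
RUPEE_SIGN = "\u20b9"

print(f"U+{ord(RUPEE_SIGN):04X}")      # code point in hexadecimal
print(unicodedata.name(RUPEE_SIGN))    # official Unicode character name
print(RUPEE_SIGN.encode("utf-8"))      # three-byte UTF-8 encoding
print(f"Price: {RUPEE_SIGN}500")       # usage in ordinary text
```

Whether the glyph displays correctly still depends on a font that covers U+20B9, which is why the slide lists fonts and software patches alongside the encoding.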

20 Common Locale Data Repository (CLDR)
Completed in 9 Indian languages – included in CLDR 2.0
Work for the rest of the Indian languages is in progress for their inclusion in the next version of CLDR
Most of the changes suggested by Govt. of India accepted by the Unicode Consortium.
[Screenshots: CLDR Hindi update; CLDR Bengali update]

21 Web Standards – W3C
Columns: Standards; Work Initiated; Progress So Far
Cascading Style Sheets (CSS)
  • Hindi listing submitted to W3C
  • Akshara definition for Indic languages: requirements of text segmentation of the CSS specification
  • Detailed testing of CSS 2.1 underway
Pronunciation Lexicon Specification (PLS) and Speech Synthesis Markup Language (SSML)
  Work initiated:
  • Reference phoneme set development
  • IPA verification in Indic languages
  • Acoustic-phonetic analysis
  Progress so far:
  • Initiated for Hindi, Bengali, Punjabi
  • IPA verification for Bengali completed
Mobile Web
  Work initiated:
  • Gap analysis for Mobile Web in Indian languages
  • Mobile fonts and rasterization engine in Indic languages
  • Mobile OK Checker
  Progress so far: Proposed to work with Telecom Centres of Excellence in India and mobile industry associations.
E-Gov Best Practices
  Work initiated:
  • Internationalization best practices for Indic languages
  Progress so far: Draft developed and under finalization.
Web Accessibility
  Work initiated: Adoption of W3C WCAG 2.0 standard in India
  Progress so far: Incorporation of WCAG 2.0 into National Electronic Accessibility Policy.

22 Lessons Learnt
Language Resource Development
  • Copyright issues
  • Standardization of metadata and tag sets
  • Language specificities
  • Validation vs. time and cost investment
  • Investment in semantic and syntactic resources like WordNet and treebanks respectively
Language Independent Methodologies
  • Core technology development engine identification
  • Availability of researchers and scientific manpower
  • Domain selection
  • Limited technology institutions

23 Lessons Learnt
Leadership Issues
  • Computer science experts vs. linguistic experts
  • Multi-institutional consortia project leadership
  • Development plan vs. budget plan vs. national five year plan
  • Researchers in academics
Language Dependent Planning
  • Language selection criteria
  • Participation of State Language Departments
  • Availability of institutions
  • Availability of linguistic and language experts

24 Lessons Learnt
Standardization Issues
  • Development level standards
  • Third party testing
  • Software engineering practices
  • Use case scenario
  • Integration issues
Other Issues
  • User involvement
  • Limited deployments
  • Models for proliferation
  • Lab to pilot to commercial
  • Divergent requirements of GenX and non-ICT communities

25 Problems Arising from Multilingualism
Multiple language speakers (native language, Hindi and English)
English is the default language of official communication and higher education, and also a spoken language in urban and semi-urban areas
Orthographic complexity
  • Tamil has a smaller alphabet
  • Conjunct and agglutination problems
  • Reforms in orthography
Spoken language issues:
  • Phonetic variation among Indian languages
  • Variation of Hindi spoken in 7 to 8 states
  • Dialect variation (Awadhi, Bhojpuri, Khadi Boli, Braj Bhasha etc.)
The paradigm shift to statistical approaches:
  • Huge amount of speech corpora capturing dialect variation
  • Parallel text corpora and other language resources
  • Interfacing from multilingual language resources
  • Cross-lingual access

26 Funding vs. Long Term Goals
[Chart; phase labels: Expl seeding, Capacity Building, Multilingual consortia, Social impact]

27 Funding vs. Long Term Goals
The graphs suggest that optimal funding is available
Language activities have crossed the threshold
Next plan (12th): higher allocation of resources targeted
More language groups need to be funded in each state, with special focus on small language resources
Multiple script issues
Time frame – future challenges for five years
Replication of successful technology development for newer languages
Improvement of language technologies:
  • Improve accuracy to bring it to a usable level
  • Productization efforts
  • Porting efforts on mobile platforms
  • Providing cloud-based services
Strategies for social impact

28 Potential for Cooperation
Enhancement and adaptation of engines like Sphinx, Festival, HTS, Nutch, HarfBuzz, FreeType etc. to bring a paradigm shift in development from Latin-centric to multilingual-centric.
Pilot projects to try methodology applied for Indian languages on European languages and vice versa
  • Angla-Bharati English to Indian languages MT framework may be tried for English to other European languages
  • Replicating European localization models for taking localization technologies to users in India
  • Cross-lingual information retrieval between Indian languages and European languages
  • Collaborative effort on speech technology development in Indo-EU languages – new research frontiers in speech modeling, speech recognition grammar, phonetic search
  • Speech enabling of mobile devices in Indo-EU languages involving the mobile manufacturers, and innovative product development for mass market applications
Linguistic resource sharing for research purposes.
Language technology evaluation models for Indian language technology / products / solutions based on successful European models

29 Thanks & Questions slata@mit.gov.in 91-11-24363525 ক ક क ಕ കൂ क କ ਕ క గ ક ಕ କ ਕ ক क ક గ ಕ ಕ

