Status and Challenges of Local Language Computing and BRAC University’s Initiative Naushad UzZaman Research Programmer Center for Research on Bangla Language.

1 Status and Challenges of Local Language Computing and BRAC University’s Initiative
Naushad UzZaman, Research Programmer
Center for Research on Bangla Language Processing, BRAC University
D.Net’s 5th Anniversary Seminar Series: Youth and ICTs: ICT and Localization
29th January, 2006

2 N. UzZaman, BRAC University ICT and Localization, 29/1/06
Outline
Statistics of Bangla language speakers
Localization and local language computing
BRAC University’s Initiative
Local and Regional Initiatives

3 Statistics of Bangla language speakers
Spoken by 245 million people
7th most widely spoken language
Spoken mainly in Bangladesh and the Indian state of West Bengal
More than 144 million speakers are from Bangladesh

4 Why localization?
The masses can harness the power of information
National interest: digital divide, governance, language preservation, …

5 Localization
Internationalized software in local languages
Few groups are working actively
–Ankur, Ekushey, D.Net (content development)
Active projects
–Linux, Mozilla, Open Office

8 Larger picture
Good start, but a long way to go!
Local language computing: advanced applications
–Optical character recognition
–Machine translation
–Speech synthesis
–Speech recognition
–Dialog systems

9 Challenges
Language Resources
–Fonts
–Lexicon (word list)
–Corpus (collection of texts)
–Tag the lexicon and corpus

10 Challenges for next few years!
Language processing research
–Document authoring (desktop, web (blogs, forums, emails), etc.)
–Morphological analyzer
–Speech processing
–Information Retrieval (web searching, name searching, spelling checker)
–OCR (Optical Character Recognition)
–Syntactic analysis (can be used in MT)
–Machine Translation
–And many more…

11 Status of Bangla Computing
Scattered work done, very little unification
Scarcity of free and open-source software
Little or no attention paid to computational linguistics, the backbone
Many individuals are working, resulting in a few good publications at ICCIT, IUB’s ICCPB, and other conferences

12 BRAC University’s Initiative
Research Lab (Center for Research on Bangla Language Processing)
–9 full-time research staff (6 with CS background, 3 with linguistics background)
–Seed funding from the PAN Localization project of IDRC
–Students working part-time or doing internships
–Software/documents all OPEN SOURCE
Academics
–Course on Natural Language Processing
–Student projects and theses on NLP

13 Status of BU Research lab’s work
Publications
–ICCIT 2004: 3 (Morphology 2, spelling checker)
–BU Journal: 1 (Morphological parsing)
–IASTED CI: 1 (Name searching)
–IEEE NLP KE 05: 1 (Spelling checker)
–ICCIT 2005: 1 (Morphology)
–Undergraduate theses: 3 (Phonetic encoding, OCR, Bangla text input on mobile)
–Total: 10
4 more research papers submitted
Ongoing theses: 4

14 Status of BU Research lab’s work
Invited talks:
–University of Toronto CS Seminar
–Stanford University NLP group (May 2005)
–IDRC Partners Conference in Cambodia (June 2005)
–IJCNLP 2005, Jeju Island, Korea (October 2005)

15 Language Resources
Fonts: Good open-source fonts available
Lexicon:
–List of 80+ thousand words; expected to reach 110 thousand in the next release
–Tagging and annotation are underway; this is a large and significant project
Corpus:
–Yet to begin

16 Language processing research
Document authoring
–Editor, Banglapad: open-source, platform-independent rich text editor (supports Bangla spell checking and export to HTML). Status: Version 1, Release Candidate 1. http://sourceforge.net/projects/banglapad
–Transliteration, pata: type phonetically in English and get a similar-sounding dictionary word.
Desktop application: http://sourceforge.net/projects/pata; Status: Complete
Web-based transliteration: Status: Expected by June 2006
–Community network tools: a set of tools for community networking (blogs, forums, etc.) in Bangla, covering not only content authoring but also web services such as a spelling checker. Status: Expected by early 2007

17 Language processing research
Morphology:
–Verb morphology is reasonably complete
–Noun morphology is somewhat usable, but much more needs to be done
–Statistical methods for dealing with Bangla compound words and blends are being worked on
Grapheme To Phoneme (G2P):
–Digital pronunciation dictionary
–Useful step for speech processing
–Status: Expected by June 2006

18 Language processing research
Speech Processing
–Text-to-speech: Voice for Festival. Status: First demo expected by May 2006.
–Automatic Speech Recognition: Limited-vocabulary segmented speech recognition. Status: First demo expected by August 2006.

19 Language processing research
Information Retrieval:
–Spelling checker: gives phonetic suggestions and ranks them phonetically. http://sourceforge.net/projects/puspaspeller/
Integrated with Banglapad and other text editors. Status: Complete
–Searching: phonetic web searching for Bangla; input can be English or Bangla. Status: Expected by June 2006
–Name searching: can be used in hospitals, institutes, censuses, etc. Status: Expected by October 2006
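The idea of ranking spelling suggestions phonetically can be illustrated with a minimal sketch. The slides do not describe the lab's actual phonetic encoding, so everything below is an assumption made for illustration: the romanized input, the sound-equivalence rules in DIGRAPHS and SINGLE, and the function names phonetic_key and suggest are all hypothetical. Candidates are ranked first by edit distance between phonetic keys, then by edit distance between raw spellings.

```python
# Illustrative sketch only: the rules below are hypothetical, not the lab's encoding.
DIGRAPHS = [("sh", "s"), ("ph", "f"), ("kh", "k"), ("gh", "g"),
            ("th", "t"), ("dh", "d"), ("bh", "b"), ("ch", "c")]
SINGLE = {"z": "j", "v": "b", "q": "k", "y": "i", "w": "o"}
VOWELS = set("aeiou")


def phonetic_key(word):
    """Collapse a romanized word into a rough sound code."""
    w = word.lower()
    for src, dst in DIGRAPHS:
        w = w.replace(src, dst)
    w = "".join(SINGLE.get(ch, ch) for ch in w)
    key = []
    for i, ch in enumerate(w):
        if ch in VOWELS and i > 0:
            continue  # drop non-initial vowels
        if key and key[-1] == ch:
            continue  # collapse repeated code letters
        key.append(ch)
    return "".join(key)


def edit_distance(a, b):
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suggest(word, lexicon, limit=3):
    """Rank lexicon words by phonetic closeness, then by spelling closeness."""
    key = phonetic_key(word)
    ranked = sorted(
        lexicon,
        key=lambda w: (edit_distance(key, phonetic_key(w)),
                       edit_distance(word.lower(), w.lower()),
                       w),
    )
    return ranked[:limit]


if __name__ == "__main__":
    print(suggest("vasa", ["bhasha", "bangla", "basha", "desh"], limit=2))
```

The same key-based comparison could plausibly serve name searching, where differently spelled names that sound alike should match; a real system would need rules derived from Bangla phonology rather than the toy table used here.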

20 Language processing research
Pattern recognition/image processing/document processing:
–Document skew correction: Bangla document skew corrector based on the Radon transform. Complete.
–Segmentation:
Bangla line segmentation: Complete
Bangla word segmentation: Complete
Bangla character segmentation: Work in progress. The large number of combinations (consonant clusters and non-spacing marks) complicates this task. The system is omnifont, so it must work with any typeface.
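The slides do not say how line segmentation is implemented. One common approach, shown here purely as an assumed sketch, is a horizontal projection profile: count the ink pixels in each row of a binarized, deskewed page and treat each run of non-empty rows as a text line. The function name segment_lines, the min_ink parameter, and the list-of-lists image format (1 = ink, 0 = background) are hypothetical choices for this example.

```python
def segment_lines(image, min_ink=1):
    """Split a binary page image into text lines.

    image: list of rows, each a list of 0/1 pixels (1 = ink).
    Returns (start_row, end_row) pairs with end_row exclusive.
    """
    # Horizontal projection profile: amount of ink in every row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                    # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))     # a text line ends at a blank row
            start = None
    if start is not None:                # page ends while still inside a line
        lines.append((start, len(profile)))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [1, 0, 1],
        [0, 0, 0],
    ]
    print(segment_lines(page))
```

Applying the same idea to column sums within one line gives candidate word boundaries. Character segmentation is harder, which is consistent with the status reported on the slide.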

21 Language processing research
–Pattern recognition:
Neural net based recognizer: Fairly complete for the basic alphabet and a subset of the consonant clusters. The non-spacing marks pose a significant challenge.
Hidden Markov Model (HMM) based recognizer: Just started, first implementation expected in May 2006.
Syntax:
–Very preliminary work on Bangla syntax using the Lexical Functional Grammar (LFG) formalism
–Also a parallel effort using the Head-driven Phrase Structure Grammar (HPSG) formalism

22 Local and Regional Initiatives
IDRC Pan Localization Network (PanL10n)
Phase I 2004-2006: 7-country collaboration
–BRAC University, Bangladesh
–Department of IT, Bhutan
–National ICT Development Agency, Cambodia
–Science Tech and Environment Agency, Laos
–Madan Puraskar Pustakalaya, Nepal
–University of Colombo School of Computing, Sri Lanka
–Afghanistan
Phase II proposed for 2007-2010

23 Local and Regional Initiatives
IDRC Pan Localization Network Phase II (2007-2010):
Further development of user-end local language technology
Development of user-end training for using the local language technology
Conducting this training
Local language content development
Measuring effects of using local language technology

24 D.Net’s Initiative

25 Summary
Local language computing
Significant challenges, from language resources to human resources
30+ years of work for English and Western languages; just beginning for Bangla
Include students from CS and linguistics
OPEN SOURCE a must for knowledge sharing!
Other universities should also come forward
