Status and Challenges of Local Language Computing and BRAC University’s Initiative Naushad UzZaman Research Programmer Center for Research on Bangla Language.

Presentation transcript:

Status and Challenges of Local Language Computing and BRAC University’s Initiative
Naushad UzZaman, Research Programmer, Center for Research on Bangla Language Processing, BRAC University
D.Net’s 5th Anniversary Seminar Series: Youth and ICTs: ICT and Localization, 29th January, 2006

N. UzZaman, BRAC University ICT and Localization, 29/1/06

Outline
Statistics of Bangla language speakers
Localization and local language computing
BRAC University’s Initiative
Local and Regional Initiatives

Statistics of Bangla language speakers
Spoken by 245 million people
7th most widely spoken language
Spoken mainly in Bangladesh and the Indian state of West Bengal
More than 144 million people from Bangladesh

Why localization?
The masses can harness the power of information
National interest: digital divide, governance, language preservation, …

Localization
Internationalized software in local languages
Few groups are working actively
– Ankur, Ekushey, D.Net (content development)
Active projects
– Linux, Mozilla, Open Office


Larger picture
Good start, but a long way to go!
Local language computing: advanced applications
– Optical character recognition
– Machine translation
– Speech synthesis
– Speech recognition
– Dialog systems

Challenges
Language Resources
– Fonts
– Lexicon (word list)
– Corpus (collection of texts)
– Tagging of the lexicon and corpus

Challenges for the next few years!
Language processing research
– Document authoring (desktop, web (blogs, forums, emails), etc.)
– Morphological analyzer
– Speech processing
– Information Retrieval (web searching, name searching, spelling checker)
– OCR (Optical Character Recognition)
– Syntactic analysis (can be used in MT)
– Machine Translation
– And many more…

Status of Bangla Computing
Scattered work done, very little unification
Scarcity of free and open-source software
Little or no attention paid to computational linguistics, the backbone
Many individuals are working, resulting in a few good publications in ICCIT, IUB’s ICCPB and other conferences

BRAC University’s Initiative
Research Lab (Center for Research on Bangla Language Processing)
– 9 full-time research staff (6 with a CS background, 3 with a linguistics background)
– Seed funding from the PAN Localization project of IDRC
– Students working part-time or doing internships
– Software and documents are all OPEN SOURCE
Academics
– Course on Natural Language Processing
– Student projects and theses on NLP

Status of BU Research lab’s work
Publications
– ICCIT 2004: 3 (Morphology 2, spelling checker)
– BU Journal: 1 (Morphological parsing)
– IASTED CI: 1 (Name searching)
– IEEE NLP KE 05: 1 (Spelling checker)
– ICCIT 2005: 1 (Morphology)
– Undergraduate theses: 3 (Phonetic encoding, OCR, Bangla text input in mobile)
– Total: 10
4 more research papers submitted
Ongoing theses: 4

Status of BU Research lab’s work
Invited talks:
– University of Toronto CS Seminar
– Stanford University NLP group (May 2005)
– IDRC Partners Conference in Cambodia (June 2005)
– IJCNLP 2005, Jeju Island, Korea (October 2005)

Language Resources
Fonts: Good open-source fonts available
Lexicon:
– Word list of 80+ thousand entries; expected to reach 110 thousand in the next release
– Tagging and annotation are underway; this is a large and significant project
Corpus:
– Yet to begin

Language processing research
Document authoring
– Editor, Banglapad: open source, platform independent, rich text editor (supports Bangla spell checking, export to HTML). Status: Version 1, Release candidate 1
– Transliteration, pata: type phonetically in English and get a similar-sounding dictionary word
  Desktop application. Status: Complete, http://sourceforge.net/projects/pata
  Web-based transliteration. Status: Expected by June 2006
– Community network tools: a set of tools for community networking (blogs, forums, etc.) in Bangla, covering not only content authoring but also web services such as a spelling checker. Status: Expected by early 2007
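The transliteration idea above (type phonetically in English, get a similar-sounding dictionary word) can be illustrated with a minimal Python sketch. This is not pata's actual algorithm, which the slides do not describe. It assumes a hypothetical toy lexicon whose romanized spellings are written by hand, and it stands in plain string similarity from the standard library's `difflib` for a real phonetic comparison.

```python
import difflib

# Hypothetical toy lexicon (an assumption for illustration): hand-written
# romanizations mapped to Bangla dictionary words.
LEXICON = {
    "ami": "আমি",
    "tumi": "তুমি",
    "bangla": "বাংলা",
    "bhasha": "ভাষা",
}

def transliterate(typed, lexicon=LEXICON, cutoff=0.6):
    """Return the dictionary word whose romanization is closest to `typed`.

    Returns None when no romanization is similar enough (below `cutoff`).
    """
    matches = difflib.get_close_matches(
        typed.lower(), list(lexicon), n=1, cutoff=cutoff
    )
    return lexicon[matches[0]] if matches else None

if __name__ == "__main__":
    for attempt in ("basha", "Tomi", "xyz"):
        print(attempt, "->", transliterate(attempt))
```

A production system would need a proper phonetic encoding and a full lexicon; the lookup-by-nearest-spelling structure is the only part this sketch is meant to show.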

Language processing research
Morphology:
– Verb morphology is reasonably complete
– Noun morphology is somewhat usable, but much more needs to be done
– Statistical methods for dealing with Bangla compound words and blends are being worked on
Grapheme To Phoneme (G2P):
– Digital pronunciation dictionary
– Useful step for speech processing
– Status: Expected by June 2006

Language processing research
Speech Processing
– Text-to-speech: Voice for Festival. Status: First demo expected by May
– Automatic Speech Recognition: Limited vocabulary segmented speech recognition. Status: First demo expected by August 2006

Language processing research
Information Retrieval:
– Spelling checker: gives phonetic suggestions and ranks them phonetically. Integrated with Banglapad and other text editors. Status: Complete
– Searching: phonetic web searching for Bangla; input can be English or Bangla. Status: Expected by June 2006
– Name searching: can be used in hospitals, institutes, censuses, etc. Status: Expected by October 2006
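A rough Python sketch of what "phonetic suggestions ranked phonetically" could look like follows. It is an illustration under stated assumptions, not the lab's spelling checker: the merge table `SOUND_ALIKE`, the word list `WORDS`, and the functions `phonetic_key` and `suggest` are all hypothetical. The sketch collapses a few Bangla letters that are commonly pronounced alike into one symbol, then ranks candidates by edit distance between these phonetic keys, using surface edit distance as a tie-breaker.

```python
# Hypothetical merge table (an assumption): letters that are commonly
# pronounced alike in Bangla are mapped to a single representative.
SOUND_ALIKE = str.maketrans({
    "ণ": "ন",            # both pronounced as n
    "শ": "স", "ষ": "স",  # sibilants
    "ী": "ি",            # long and short i vowel signs
    "ূ": "ু",            # long and short u vowel signs
    "ঈ": "ই", "ঊ": "উ",  # long and short independent vowels
})

# Hypothetical toy word list standing in for a real lexicon.
WORDS = ["পানি", "বাকি", "বাণী", "ভাষা"]

def phonetic_key(word):
    """Collapse sound-alike letters so homophones share one key."""
    return word.translate(SOUND_ALIKE)

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon, limit=3):
    """Rank lexicon words by phonetic distance, then by surface distance."""
    key = phonetic_key(word)
    ranked = sorted(
        lexicon,
        key=lambda w: (edit_distance(key, phonetic_key(w)),
                       edit_distance(word, w),
                       w),
    )
    return ranked[:limit]

if __name__ == "__main__":
    print(suggest("বানি", WORDS))
```

Ranking on the phonetic key first means a misspelling that merely swaps sound-alike letters is matched to its intended word before words that only look similar.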

Language processing research
Pattern recognition/image processing/document processing:
– Document skew correction: Bangla document skew corrector based on the Radon transform. Status: Complete
– Segmentation:
  Bangla line segmentation: Complete
  Bangla word segmentation: Complete
  Bangla character segmentation: Work in progress. The large number of combinations (consonant clusters and the non-spacing marks) complicates this task. The segmenter is omnifont, so it must work with any typeface.
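To make the skew-correction step concrete, here is a minimal Python sketch of the general idea behind Radon-transform skew estimation. It is not the lab's implementation. The Radon transform sums ink along parallel lines at each angle; at the true skew angle the text lines collapse into a few sharp peaks. The sketch assumes the page is given as a list of `(x, y)` ink-pixel coordinates, searches a small hypothetical angle range, and scores each angle by the sum of squared bin counts. The names `sharpness`, `estimate_skew` and `synthetic_page` are illustrative.

```python
import math
from collections import Counter

def sharpness(points, angle_deg):
    """Project ink pixels along lines tilted by angle_deg; score peakiness."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    # Signed distance of each pixel from a line through the origin at angle theta.
    offsets = [y * c - x * s for x, y in points]
    lo = min(offsets)
    # One-pixel-wide bins: this is one angular slice of a Radon transform.
    profile = Counter(int(o - lo) for o in offsets)
    return sum(n * n for n in profile.values())

def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the sharpest projection."""
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda a: sharpness(points, a))

def synthetic_page(angle_deg, baselines=(0, 10, 20), width=200):
    """Build a toy page: one-pixel-thick text lines tilted by angle_deg."""
    t = math.tan(math.radians(angle_deg))
    return [(x, round(y0 + x * t)) for y0 in baselines for x in range(width)]

if __name__ == "__main__":
    page = synthetic_page(3.0)
    print("estimated skew (degrees):", estimate_skew(page))
```

Once the angle is known, rotating the page by its negative deskews it. The same projection profile, taken on the deskewed page, is also a common starting point for line segmentation, since gaps between text lines show up as empty bins.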

Language processing research
– Pattern recognition:
  Neural net based recognizer: Fairly complete for the basic alphabet and a subset of the consonant clusters. The non-spacing marks pose a significant challenge.
  Hidden Markov Model (HMM) based recognizer: Just started, first implementation expected in May
Syntax:
– Very preliminary work on Bangla syntax using the Lexical Functional Grammar (LFG) formalism
– Also a parallel effort using the Head-driven Phrase Structure Grammar (HPSG) formalism

Local and Regional Initiatives
IDRC Pan Localization Network (PanL10n)
Phase I: 7 country collaboration
– BRAC University, Bangladesh
– Department of IT, Bhutan
– National ICT Development Agency, Cambodia
– Science Tech and Environment Agency, Laos
– Madan Puraskar Pustakalaya, Nepal
– University of Colombo School of Computing, Sri Lanka
– Afghanistan
Phase II proposed for

Local and Regional Initiatives
IDRC Pan Localization Network Phase II ( ):
– Further development of user-end local language technology
– Development of user-end training for using the local language technology
– Conducting this training
– Local language content development
– Measuring the effects of using local language technology

D.Net’s Initiative

Summary
Local language computing
– Significant challenges, from language resources to human resources
– 30+ years of work for English and Western languages; just beginning for Bangla
– Include students from CS and linguistics
– OPEN SOURCE is a must for knowledge sharing!
– Other universities should also come forward