Universiti Malaysia Sarawak

1 Universiti Malaysia Sarawak
Sarawak Language Technology (SaLT) Research Group
SaLT Initiatives: Preservation and Maintenance of Sarawak Languages
Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak
Associate Professor Alvin W. Yeo

2 Overview
- Languages in Sarawak
- Maintenance and Revitalisation: Holistic Approach
- Sarawak Language Technology (SaLT) Research Group
- SaLT Projects
  - Borneo Corpus Management System (BCMS)
  - Iban-English Machine Translation: TRanslation IBan-English (TRIBE)
  - Multimodal-INTegration (MINT) of Sketch and Melanau Daro-Matu Speech in Spatial Queries
  - Speech Language Dialog Systems (SLaDS)
  - Development of Language Tools
- Current findings

3 Where are we?

4 East Malaysia > Sarawak > Kuching

5 Introduction (cont’d)
- Sarawak is a state rich in culture.
- There are 27 ethnic groups in Sarawak (STB, 2005), each with its own culture and language.
- According to the Ethnologue (Gordon, 2005), Sarawak has 46 living languages and 1 extinct language.
- Each ethnic group may have different languages.
- Sarawak Dewan Bahasa dan Pustaka: 63 known languages in Sarawak.

6 Rationale
Cumulative number of languages by number of speakers (Ethnologue, 2005)

Population           No. of languages   Cumulative no. of languages   Cumulative %
1 – 100                     4                       4                      9%
101 – 500                   8                      12                     27%
501 – 1,000                 4                      16                     36%
1,001 – 5,000              18                      34                     76%
5,001 – 10,000              4                      38                     84%
10,001 – 50,000             6                      44                     98%
50,001 – 100,000            0                      44                     98%
100,001 and above           1                      45                    100%
Extinct                     1
No data available           1
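The cumulative columns in the table above can be checked with a few lines of arithmetic. The sketch below assumes per-band counts of 4, 8, 4, 18, 4, 6, 0 and 1; several of these are not printed on the slide and are inferred here from the differences between the cumulative figures it does show. The names `BANDS` and `cumulative_table` are illustrative only.

```python
# Per-band language counts. Values not printed on the slide are inferred
# from the cumulative figures (an assumption, not source data).
BANDS = [
    ("1-100", 4),
    ("101-500", 8),
    ("501-1,000", 4),
    ("1,001-5,000", 18),
    ("5,001-10,000", 4),
    ("10,001-50,000", 6),
    ("50,001-100,000", 0),
    ("100,001+", 1),
]


def cumulative_table(bands):
    """Return (label, count, running total, rounded cumulative %) per band."""
    total = sum(count for _, count in bands)
    rows, running = [], 0
    for label, count in bands:
        running += count
        rows.append((label, count, running, round(100 * running / total)))
    return rows


if __name__ == "__main__":
    for label, count, running, pct in cumulative_table(BANDS):
        print(f"{label:>15}  {count:>2}  {running:>2}  {pct:>3}%")
```

Under these assumed counts, the running totals and rounded percentages agree with every cumulative figure printed on the slide, including the 76% of languages with 5,000 speakers or fewer.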

7 Problem
- World’s linguistic and cultural diversity is under threat.
- Many minority languages are on the brink of extinction.
- Minority language communities are further disadvantaged economically and socially.
- Dominant languages
- Exogamy
- Revitalizing minority languages can bring economic and social benefits as well as cultural benefits.

8 Holistic Approach: Framework for Language Revitalization and Maintenance
Goal: Preservation of Culture
- Stakeholders: community/civil society, ethnic group organisations, research institutions, government agencies, NGOs, industry
- People: researchers, computer scientists, social scientists, IT specialists, linguists, translators, trainers
- Applications
  - Internet: online presence
  - Software applications and operating systems
  - Hardware: input devices (keyboards, tablets/pen/stylus)
- Supporting Technologies
  - Community readiness: ICT literacy
  - Web technologies: Java, Flash
  - Methodologies: engaging communities; development lifecycles
  - Computing technologies: Natural Language Processing, Image Processing, Speech Recognition and Generation

9 Sarawak Language Technologies (SaLT) Research Group
Technology 4 All: 4As (Community-Centred)
- Awareness: understanding needs; creating awareness
- Access: training; computer literacy; hardware/software; telecommunications
- Adoption: as is (minor modification); use technology; address needs
- Adaptation: customisation by stakeholders; customisation by users

10 SaLT: Role of technology in language maintenance and revitalisation
Focus: revitalising and maintaining the existing conventional languages by building corpora, conducting research and developing tools for Sarawak ethnic languages.
The Sarawak Language Technologies (SaLT) Research Group covers:
- Codification of the ethnic languages
- Creation of corpora of the various languages in Sarawak
- Research in computational linguistics projects which involve the languages and peoples of Sarawak
- Development of tools: word processors, spell checkers

11 Language Technology
Understanding and explication of language phenomena in a computationally tractable form, resulting in techniques for interchanging various linguistic forms (speech, text, morphology, syntax, semantics/meaning, discourse, knowledge), thus leading to the creation and development of intelligent applications involving language.

12 Levels of Technology
- INPUT (corpus)
- PROCESSOR (tagger, parser, multimodal integration)
- APPLICATION (machine translation, multimodal spatial application)
Roles labelled on the diagram: lexicographer, linguist, computer scientist
Also labelled: general and conceptual dictionary

13 Specialists Needed
- Lexicographers
- Computer scientists: DBA, SE & N/W (data maintenance & grid)
- Linguists
- Information Scientists
- Psychologists
- Anthropologists
- Computational Linguistics
- Natural Language Processing

14 Current Projects

15 Current Projects (cont’d)

16 Roadmap for SaLT

17 Advisors and Organisations Involved
No. Name; Expertise; Organisation
1. Prof. Zaharin Yusoff; Computational Linguistics (CL) & Natural Language Processing (NLP); MMU
2. Prof. Ahmad Zaki Abu Bakar; CL & NLP; UTM
3. AP Dr Normaziah Abdul Aziz; NLP & Artificial Intelligence; UIAM
4. Prof. Dr. Tang Enya Kong
5. Dr. Bali Ranaivo; NLP & CL
6. Prof. Dr. Zuraidah Mohd. Don; Linguistics; UM
7. Dr. Gerry Knowles; Phonetics and Phonology; MIQUEST Worldwide Sdn Bhd
8. Professor Dr. Peter Songan; Community development; UNIMAS

18 Collaborators
Universities Involved
- UNIMAS (FCSIT, FCSHD, FSS, CLS)
- Multimedia University
- Universiti Teknologi Malaysia
- Universiti Islam Antarabangsa Malaysia
- Universiti Sains Malaysia
- Universiti Malaya
- Localisation Research Centre, University of Limerick, Ireland
- University of Waikato, New Zealand
Organisations Involved
- Tun Jugah Foundation
- Dewan Bahasa dan Pustaka (Sarawak Branch)
- Melanau Association
- Dayak Bidayuh National Association
- Sarawak Museum
- Pustaka Negeri Sarawak
- Majlis Adat Istiadat

19 Team members
a. Staff
- FCSIT: AP Dr Alvin Yeo Wee (Head), AP Dr. Narayanan K., Dr Edwin Mit, Suhaila Saee, Sarah Flora Samson, Nurfauza Jali, Suriati Khartini Jali, Sy. Fazlin Seyed Fadzir, Lee Jun Choi
- FCSHD: Dr. Ng Giap Weng, D’oria Islamiah, Wan Norizan
- CLS: Dr. Ting Su Hie, Salbia Hassan, Yvonne Michelle Campbell

20 Team members (cont’d)
b. Research Assistants
- Beatrice Chin (FCSIT)
- Teh Lee Na (FCSIT)
- Jennifer Wilfred (FCSIT)
- Lai Nyong Fock (FCSIT)
- Mohd. Hanafiah Semuni (FCSHD)
- Loh Chee Wyai (FCSIT)
- Ang Siaw Tiong (FCSIT)
c. Students

Level                                  No. of Students
Post-graduate: PhD                            2
Post-graduate: Master by Research             6
Post-graduate: Master by Coursework           5
Undergraduate                                22
Total                                        35

21 Borneo Corpus Management System (BCMS)
Problem/Background:
- There is currently no corpus management system for managing the corpora available in the minority languages of Sarawak.
Solution:
- Build a system that is able to manage and maintain the corpora.
Objectives:
- To design an easy-to-use Corpus Processing Toolkit for researchers
- To integrate the various tools in a single platform
Current Status:
- Working on the Morphological Analysers and Spell Checkers

22 Corpus Manager (After processing)
Screenshot labels:
- Original Content
- Processed Content
- File tree that displays the processed files; each file is stored in a folder based on its category
- Editable Content
- Highlighting, used to mark the extracted information in the content

23 Corpus Analyser: Sentence Splitter
The output lists each sentence of the current document.
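The slide gives only the splitter's output, not its method. As a minimal sketch of such a step, assuming sentences end in `.`, `!` or `?` followed by whitespace, a regular-expression splitter could look like the following. The function name `split_sentences` is hypothetical and is not taken from BCMS.

```python
import re

# Split at whitespace that follows a sentence-final mark (., ! or ?).
# Assumption: no handling of abbreviations such as "Dr." or "e.g.".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences of `text`, one list item per sentence."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    document = "Sarawak is rich in culture. It has many languages! Which ones?"
    for sentence in split_sentences(document):
        print(sentence)
```

A rule this simple will split wrongly after abbreviations and titles, so a real splitter for a given language would need its own exception list.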

24 Iban-Corpus Development
Problem/Background
- Indigenous languages in Sarawak are slowly dying out due to:
- One way to stem this “extinction” of languages: provide more local content. But how?
Solution
- Translate English documents into documents in minority languages.
- Machine translation (MT) is needed to facilitate and accelerate the translation process.
Objectives
- Identify a methodology that can be used to translate English into minority languages, taking Iban as a case study.
Current Status
- Built an Iban corpus of 23,833 words with 3,831 distinct words
- Constructed a bilingual lexicon of 1,688 words with 1,192 distinct words
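Figures such as "23,833 words with 3,831 distinct words" are token and type counts. The sketch below shows one common way to compute them, assuming lowercased, letter-only tokens with optional internal apostrophes or hyphens; the tokenisation rule and the name `corpus_stats` are assumptions, not the project's actual procedure.

```python
import re
from collections import Counter

# A token is a run of letters, optionally joined by ' or - (assumed rule).
_TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def corpus_stats(text):
    """Return (number of tokens, number of distinct types, frequency table)."""
    tokens = _TOKEN.findall(text.lower())
    counts = Counter(tokens)
    return len(tokens), len(counts), counts


if __name__ == "__main__":
    n_tokens, n_types, counts = corpus_stats("The cat saw the other cat.")
    print(n_tokens, n_types, counts.most_common(2))
```

For the reported Iban corpus, 3,831 distinct words out of 23,833 gives a type-token ratio of roughly 0.16.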

25 Iban-English Machine Translation
Problem/Background
- Traditional knowledge (TK) is tacit knowledge; it is generally not stored and is known only by the older generation, who speak little English.
- TK is very important. It needs to be preserved and protected.
- Machine Translation (MT) can help to preserve TK: translate available resources into English so that they are accessible to all, e.g. researchers (social scientists) and the younger generation.
- However, translation between closely related languages is easier.
Solution
- Translate TK documents into English through a closely related language as the pivot language.
- Case study: Iban as source language, Malay as pivot language and English as target language.
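The pivot idea can be illustrated at the lexicon level: an Iban-to-English word list is obtained by chaining an Iban-to-Malay list with a Malay-to-English list. This is only a word-lookup sketch under stated assumptions, not the project's translation method. The function names and the three dictionary entries are illustrative placeholders and are not drawn from the project's lexicon.

```python
# Toy lexicons; the entries are illustrative placeholders only.
IBAN_TO_MALAY = {"makai": ["makan"], "ai": ["air"], "rumah": ["rumah"]}
MALAY_TO_ENGLISH = {"makan": ["eat"], "air": ["water"], "rumah": ["house", "home"]}


def compose(src_to_pivot, pivot_to_tgt):
    """Chain two lexicons into a source-to-target lexicon via the pivot."""
    composed = {}
    for src_word, pivot_words in src_to_pivot.items():
        targets = []
        for pivot_word in pivot_words:
            for tgt_word in pivot_to_tgt.get(pivot_word, []):
                if tgt_word not in targets:  # keep order, drop duplicates
                    targets.append(tgt_word)
        if targets:
            composed[src_word] = targets
    return composed


def translate_words(sentence, lexicon):
    """Word-by-word lookup; unknown words are kept and marked with <>."""
    words = sentence.lower().split()
    return " ".join(lexicon[w][0] if w in lexicon else f"<{w}>" for w in words)


if __name__ == "__main__":
    iban_to_english = compose(IBAN_TO_MALAY, MALAY_TO_ENGLISH)
    print(iban_to_english)
    print(translate_words("makai ai", iban_to_english))
```

The appeal of the design is reuse: once a Malay-to-English resource exists, each additional closely related source language needs only its own mapping into Malay.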

26 Objectives and Current Status
Objectives
- To demonstrate that the performance of translation through a pivot language is comparable with the performance of direct translation
- To realise the benefits (efficiency) of translating multiple “similar” languages through a common pivot language
Current Status
- Building of Iban corpus and lexicon
- Linguistic comparison of the Iban and Malay languages


28 Multimodal Integration: Preamble
The user sketches on a Wacom tablet, using the CogSketch sketch interface, to describe a place. Dragon NaturallySpeaking software captures the speech through a microphone.

29 Multimodal Integration of Sketch and Melanau Daro-Matu Speech in Spatial Queries (MINT)
Problem/Background
- English is the main communication medium.
- Each language is unique and distinct.
- Individuals who use different languages may have different approaches to conceptualizing, communicating, reasoning and expressing their thoughts.
- Translation alone is not sufficient.
- Building an entire system for each targeted group of speakers is time consuming.
Solution
- Internationalisation (i18n)
- Localisation (l10n)

30 Objectives
- Integrate Melanau Daro-Matu speech and sketch (image) modalities
- Identify the interaction patterns of Melanau users
- Identify the similarities and differences of English, Malay and Melanau (extending to Iban as well)
- Localise the architecture and representation of multimodal integration in Melanau Daro-Matu and other languages

31 Multimodal integration architecture (diagram)
Stages:
- Input Capturing: Sketch, Speech
- Input Interpretation: Sketch Interpretation, Speech Interpretation
- Modalities Representation: Sketch Representation, Speech Representation
- Modalities Integration: Sketch and Speech Integration
- Database Searching; spatial information retrieval
Speech interpretation steps:
- Transcription
- Sentence Splitter
- Part-Of-Speech Tagging, producing annotated text
Language-Dependent Components:
- Tokenization (lexicon required)
- Tagging using trained corpus
- Tagging corrections acquired from templates (grammar rules required)
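The language-dependent steps named on the slide (tokenization, tagging using a trained corpus, tagging corrections from templates) can be sketched as a most-frequent-tag tagger followed by contextual correction rules. This is a generic illustration, not the MINT implementation: the function names, the tag set, the English training sentences and the template format `(previous_tag, from_tag, to_tag)` are all assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def train_unigram(tagged_sentences):
    """Learn each word's most frequent tag from a tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(tokens, model, default="NOUN"):
    """Tag tokens with the learned tag; unknown words get `default`."""
    return [(token, model.get(token, default)) for token in tokens]


def apply_templates(tagged, templates):
    """Retag a token when (previous tag, current tag) matches a template."""
    out = list(tagged)
    for i in range(1, len(out)):
        for prev_tag, from_tag, to_tag in templates:
            if out[i - 1][1] == prev_tag and out[i][1] == from_tag:
                out[i] = (out[i][0], to_tag)
                break
    return out


# Toy training data and one correction template (illustrative only).
TRAIN = [
    [("the", "DET"), ("river", "NOUN"), ("runs", "VERB")],
    [("we", "PRON"), ("walk", "VERB")],
    [("they", "PRON"), ("walk", "VERB")],
    [("a", "DET"), ("long", "ADJ"), ("walk", "NOUN")],
]
TEMPLATES = [("DET", "VERB", "NOUN")]  # a verb right after a determiner becomes a noun

if __name__ == "__main__":
    model = train_unigram(TRAIN)
    first_pass = tag(tokenize("The walk"), model)
    print(first_pass)
    print(apply_templates(first_pass, TEMPLATES))
```

The split mirrors the slide's note that a lexicon and grammar rules are required: the trained model plays the role of the lexicon, and the templates encode the grammar-based corrections.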

32 Spoken Language Dialogue System (SLaDS)
Problem/Background
- Spoken language systems (SLS) have become an increasingly common human-system interface.
- Many studies have been conducted by foreign researchers to address the challenges in the design of spoken language systems.
- This study focuses on the design and development of a spoken language dialogue system in the context of Malaysian users.
Solution
- The project conducts a simulation test of the real SLS with local users.
- The system is then evaluated using the Wizard of Oz method, with the objective of determining its efficacy.
- The results of this testing will be useful for the future development of Malaysian SLS.

33 Objectives and Current Status
Objectives
- To investigate spoken language and interaction design, and its use in the development of Spoken Language Dialogue Systems
- To determine the efficacy of imported usability evaluation techniques applied to Spoken Language Dialogue Systems
- To identify speech patterns in order to develop a predictive model for speech recognition
Current Status
- The study is in its testing stage, which captures the dialogue content.
- Respondents are prompted to interact with the system.
- The dialogue from the interaction will be taped, transcribed and analysed.

34 Wizard’s Control Panel
Screenshots: Wizard’s Control Panel; User’s view
Video: sample interaction

35 Research Projects: Fundamental Research Grant
- Minority Languages Online (MiLO): Preserving Cultures by Mobilising Minority Languages (of Sarawak) Online (completed 30 June 2007)
  - Continued with CLS, Univ. of Waikato
  - Wikipedia approach to development of Bidayuh lexicon
- Bario Lakuh Digital Library (completed)
  - Recordings of Kelabit songs
  - Transcribed, translated
  - With audio and video

36 e-Vocabulary for Sarawak Malay
Problems: Language endangerment
Vocabulary of Sarawak Malay (original source)
- Main source: vocabulary book written by W. S. B. Buck from Bau, published by the Sarawak Civil Service on 11th May, 1932
- Total word entries: 1026 words

37 AbiWord in Local Languages
Background
- One of the most widely used computer applications nowadays is the word processor.
- Open Source Software (OSS) can be used, studied, and redistributed in modified or unmodified form without restriction.
Solution/Objectives
- AbiWord (a comprehensive word processor) is to be localised.
- To identify the processes of translating computing terminology

38 Task Progress
Current status: Ongoing
- Data collection: Template (ongoing)
- Interface: Toolbar, Menu, Submenu, Icon, Tooltips
- Operation: Completed; Running

39 Screenshots: Interface; Example of Menu Panel

40 Current Findings: Challenges
- Resources
  - Resources are available for some languages
  - Generally lacking; data collection is very challenging
  - Writing systems and grammar rules do not exist
- Lack of human resources
  - People fluent in the (untainted) form are needed (translating, POS tagging)

41 Current Findings: Bright future
- Community awareness
  - Associations of ethnic groups are aware of the need
  - Older people are interested; the younger generation less so
  - Protocol followed
  - Upper management support is required to “open doors”
- Local researchers are interested
  - Colleagues and students
  - Machine translation, speech to text, text to speech
  - Development of speech corpus

42 Multi-ethnic Group

43 Concluding Remarks
- Decreasing number of speakers of languages in Sarawak
- Maintenance and Revitalisation: Holistic Approach
- Sarawak Language Technology (SaLT) Research Group
- SaLT Projects: machine translation, multimodal integration, speech language dialog system, corpus management systems, online dictionaries/repositories, digital libraries
- Challenges: community involvement, and data collection and analysis
- Silver lining: committed NGOs and researchers
- Internationalisation and localisation approach

44 Acknowledgements
Institutional support from
- Universiti Malaysia Sarawak
- Tun Jugah Foundation, Melanau Association, Dewan Bahasa dan Pustaka (Sarawak Branch), Majlis Adat Istiadat, Dayak Bidayuh National Association
Financial support grants
- UNIMAS Fundamental Research Grant Scheme
- Federal Ministry of Science, Technology and Innovation Science Fund Grant Scheme (01-09-SF0028, SF0029, SF0030)

45 Fifth International Cyberspace Conference on Ergonomics (CybErg 2008)
- Theme: Local Knowledge, Global Applications
- Special Discussion on Maintenance and Preservation of Languages
- On-going, 15 Sept – 15 Oct 2008
- Free registration

46 Sixth International Conference on IT In Asia (CITA’09)
- Theme: “Enabling technologies for Knowledge-driven Society: People-Powered Systems”
- Tracks on Computational Linguistics, Human Computer Interaction, Software Engineering
- Kuching, Malaysia, 6–9 July 2009; Rainforest Music Festival

47 Thank You Terima Kasih Jian Kenin
