Isalin Translate eWika: Towards the Digitalization of Philippine Languages Charibeth K. Cheng DLSU, College of Computer Studies Natural Language Processing Research Lab

Presentation transcript:

Isalin Translate eWika: Towards the Digitalization of Philippine Languages Charibeth K. Cheng DLSU, College of Computer Studies Natural Language Processing Research Lab

MT Research in RP: started in 1993 at UP-Los Baños, Dr. Rachel Roxas and Allan Borra –grammar-based; started in 2004 at DLSU –hybrid approach

ENG-FIL MT System Project: 3-year project started in 2005, funded by DOST-PCASTRD. Composition: –6 faculty members of the College of Computer Studies –15 computer science majors –assisted by the Filipino Department and the Department of English & Applied Linguistics of DLSU-M

Architectural Design of the Program (diagram). Language Resources: Lexicon (electronic dictionary), Morphological Analyzer & Generator, Part-of-Speech Tagger, Grammar, Corpus (Tagged). Translator Engine: MT: Example-based, MT: Rule-based. Other components: User Interface, Output Modeller, Source Text, Target Text.

Rule-Based Approach: Apply translation rules. The boy ate apples. → Kumain ng mga mansanas ang batang lalaki. Where do we get the translation rules?

Example-Based: Learn the rules from examples. The boy ate apples. (A B C D) → Kumain ng mga mansanas ang batang lalaki. (C ng D A B) Rule Learned: A B C D → C ng D A B

Using the rule A B C D → C ng D A B: The mother cooked fish. (A B C D) → Nagluto ng isda ang nanay. (C ng D A B)

Using the rule A B C D → C ng D A B: The mother went home. (A B C D) → Umuwi ng bahay ang nanay. (C ng D A B)

Limitation of a Rule: A B C D → C ng D A B. The boy ate the fish.
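The learn-then-apply idea on the example-based slides can be sketched roughly as follows. This is a minimal illustration, not the project's actual engine: the function names, the hand-written alignment, and the tiny lexicon are all assumptions made for the example. It derives the pattern "A B C D -> C ng D A B" from one aligned sentence pair, reuses it on new four-word sentences, and returns nothing when the input does not have four words, as with "The boy ate the fish."

```python
import string

def learn_rule(source_tokens, target_tokens, alignment):
    """Learn a reordering pattern from one aligned example.

    alignment maps a target position to the source position it translates.
    Target tokens with no aligned source word (here "ng") are kept as literals.
    """
    pattern = []
    for t_idx, word in enumerate(target_tokens):
        if t_idx in alignment:
            pattern.append(("slot", alignment[t_idx]))
        else:
            pattern.append(("lit", word))
    return len(source_tokens), pattern

def describe_rule(rule):
    """Render a rule in the slide notation, e.g. 'A B C D -> C ng D A B'."""
    n_slots, pattern = rule
    letters = string.ascii_uppercase
    lhs = " ".join(letters[:n_slots])
    rhs = " ".join(letters[v] if kind == "slot" else v for kind, v in pattern)
    return f"{lhs} -> {rhs}"

def apply_rule(rule, source_tokens, lexicon):
    """Translate by filling the learned pattern; None if the rule does not fit."""
    n_slots, pattern = rule
    if len(source_tokens) != n_slots:
        return None  # the rule only covers inputs with the same number of slots
    words = []
    for kind, value in pattern:
        if kind == "slot":
            words.append(lexicon[source_tokens[value].lower()])
        else:
            words.append(value)
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:]

# Hand-made alignment for the slide example (an assumption of this sketch):
# target position -> source position; "ng" at target position 1 is unaligned.
example_source = ["The", "boy", "ate", "apples"]
example_target = ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"]
example_alignment = {0: 2, 2: 3, 3: 0, 4: 1}
rule = learn_rule(example_source, example_target, example_alignment)

# Hypothetical mini-lexicon covering only the slide sentences.
lexicon = {"the": "ang", "mother": "nanay", "cooked": "nagluto",
           "fish": "isda", "went": "umuwi", "home": "bahay"}

print(describe_rule(rule))
print(apply_rule(rule, ["The", "mother", "cooked", "fish"], lexicon))
print(apply_rule(rule, ["The", "boy", "ate", "the", "fish"], lexicon))
```

Matching on token count alone is the simplest possible applicability test. A real system would presumably match on part-of-speech or phrase categories, which is one reason the tagger and grammar appear among the language resources.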

Results of the MT Engine Qualities of a Good Translation –Clarity – 3.3 –Accuracy – 3.2 –Naturalness highest score of respondents (5 linguists)

Challenge! Language resources –Quality of translation is dependent on it. –Built from almost non-existent digital forms –manual vs. automatic construction

Lexicon: Diksyunaryo ng Wikang Filipino. Automatic construction (AeFLEX): –accuracy rate: 57%. Currently contains over 30,000 entries. Challenge: Lexical resources –translation documents –part-of-speech tagger

Morphological Analyzer and Generator: The dictionary is incomplete. Create software that: –analyzes: determines the root word –generates: produces the inflected word. Given: eating -> eat -> kain -> kumakain. Challenge: Lexical resources –lexicon –part-of-speech tagger
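The chain eating -> eat -> kain -> kumakain can be illustrated with a small sketch. It assumes the common Tagalog pattern for -um- verbs in the imperfective aspect: reduplicate the first consonant-vowel syllable of the root, then place "um" after the initial consonant, or before the word if the root starts with a vowel. The function names and the two lookup tables are hypothetical. The sketch ignores roots that begin with consonant clusters and all other affix classes, so it is not the project's analyzer.

```python
VOWELS = "aeiou"

# Hypothetical lookup tables standing in for the English lemmatizer and the
# bilingual lexicon; only the slide example is covered.
ENGLISH_LEMMA = {"eating": "eat"}
BILINGUAL = {"eat": "kain"}

def first_syllable(root):
    """First syllable for reduplication: V for vowel-initial roots, else CV."""
    if root[0] in VOWELS:
        return root[0]
    return root[:2]  # assumes a single consonant followed by a vowel

def generate_um_imperfective(root):
    """Generate the inflected form: kain -> kakain -> kumakain."""
    reduplicated = first_syllable(root) + root
    if reduplicated[0] in VOWELS:
        return "um" + reduplicated
    return reduplicated[0] + "um" + reduplicated[1:]

def analyze_um_imperfective(word):
    """Recover the root from an -um- imperfective form, or None if no match."""
    if word.startswith("um"):
        rest = word[2:]
    elif word[1:3] == "um":
        rest = word[0] + word[3:]
    else:
        return None
    syllable = first_syllable(rest)
    candidate = rest[len(syllable):]
    # The reduplicated syllable must be repeated at the start of the root.
    if candidate.startswith(syllable):
        return candidate
    return None

def translate_progressive(english_word):
    """eating -> eat -> kain -> kumakain, using the hypothetical tables above."""
    lemma = ENGLISH_LEMMA[english_word]
    root = BILINGUAL[lemma]
    return generate_um_imperfective(root)

print(translate_progressive("eating"))
print(analyze_um_imperfective("kumakain"))
```

Because the generator works from the root, the lexicon only needs to store "kain" rather than every inflected form, which is the point of the slide's remark that the dictionary is incomplete.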

Part-Of-Speech Tagger: automatic association of parts-of-speech to words in a document –Can? – kaya vs. lata –Baba? – chin or go down. Challenge: Lexical resources –corpora –lexicon –morphological analyzer –grammar
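Why the tag matters for translation can be shown with a toy lookup keyed on both the word and its part of speech. The tag names, the sense table, and the one-line context heuristic for "can" are assumptions made for this illustration. A real tagger would be trained on the tagged corpus rather than hand-coded.

```python
# Hypothetical sense table: (word, part of speech) -> translation.
SENSES = {
    ("can", "MODAL"): "kaya",
    ("can", "NOUN"): "lata",
    ("baba", "NOUN"): "chin",
    ("baba", "VERB"): "go down",
}

DETERMINERS = {"a", "an", "the", "this", "that"}

def tag_can(tokens, index):
    """Toy heuristic: 'can' right after a determiner is a noun, else a modal."""
    if index > 0 and tokens[index - 1].lower() in DETERMINERS:
        return "NOUN"
    return "MODAL"

def choose_translation(word, tag):
    """Pick the translation for a word once its part of speech is known."""
    return SENSES.get((word.lower(), tag))

sentence_a = ["I", "can", "swim"]
sentence_b = ["Open", "the", "can"]
print(choose_translation("can", tag_can(sentence_a, 1)))
print(choose_translation("can", tag_can(sentence_b, 2)))
```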

Corpora: collection of translation-pair documents; used by the lexicon extractor, the part-of-speech tagger, and example-based MT; came from translation works of DLSU English majors, verified by linguists; consists of 207,000 words

Resource Dependency (diagram): Lexicon, Corpus

Bringing it home … 171 Philippine Languages (SIL) No Philippine Corpora Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc) “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)

eWika: Digitalization of Philippine Languages Build the Philippine Corpus Build software tools to study or use the corpus –Across Regions –Across Forms and Genres –Across Languages

Across Regions Web-based application: GLOBALIZATION –upload, download, tools Contributors (Main players) Verifiers Server: DLSU-M commits to host the server for the next three years. Terms of Use: Research purposes.

Across Languages: 171 Philippine Languages (SIL List). Start with 8 major languages –Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano. Filipino Sign Language

Across Forms and Genres In various forms: –Text –Speech –Video: Filipino sign language In various Genres: –Text – literary & creative, essays, news articles, religious, etc –Speech – scripted, conversations, etc –Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)

The dream of building electronic, online Philippine language resources and tools Many many many major hurdles to overcome NEEDED : Language Resources, Tools, & Peopleware