Isalin Translate eWika: Digitalization of Philippine Languages Charibeth K. Cheng March 19, 2008.


1 Isalin Translate eWika: Digitalization of Philippine Languages Charibeth K. Cheng March 19, 2008

2 Machine Translation: Automate translation. A study under Natural Language Processing. MT System: Sentence in SOURCE LANGUAGE → Sentence in TARGET LANGUAGE

3 ENG-FIL MT System Project: 3-year project started last year, funded by DOST-PCASTRD. Composition: –6 faculty members of the College of Computer Studies –15 computer science majors –assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M

4 Agenda Architecture of the MT System Linguistic resources Demo of the Translation Engine Results for English to Japanese translation

5 Architectural Design of the Program. Language Resources: Lexicon (electronic dictionary), Morphological Analyzer & Generator, Part-of-Speech Tagger, Grammar, Corpus (Tagged); MT: Example-based; MT: Rule-based; User Interface; Output Modeller; Source Text; Target Text; Translator Engine

6 Challenge! Language resources –Quality of translation depends on them. –Built from almost non-existent digital forms –manual vs. automatic construction

7 Lexicon Builder: Used the IsaWika! database as initial lexicon. Created a lexicon extraction program to automatically determine candidate translation pairs from corpora. Currently contains about 23,000 entries. Co-occurring words are likely translations. Challenge: Lexical resources –parallel corpora –part-of-speech tagger [Figure label: Database]
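
The idea that co-occurring words are likely translations can be illustrated with a small sketch. This is not the project's extraction program, whose method the slides do not describe; it assumes a sentence-aligned parallel corpus and scores word pairs with the Dice coefficient, a common co-occurrence measure. The function name `candidate_pairs`, the threshold, and the three-sentence toy corpus are made up for illustration.

```python
from collections import Counter
from itertools import product


def candidate_pairs(parallel, min_score=0.5):
    """Score (source, target) word pairs by Dice co-occurrence.

    `parallel` is a list of (source_sentence, target_sentence) strings
    that are assumed to be translations of each other.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)   # sentences containing each source word
        tgt_count.update(t_words)   # sentences containing each target word
        pair_count.update(product(s_words, t_words))  # joint occurrences

    scored = {}
    for (s, t), joint in pair_count.items():
        dice = 2 * joint / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            scored[(s, t)] = dice
    return scored


# Hypothetical toy corpus, invented for this sketch.
toy_corpus = [
    ("the man bought an umbrella", "bumili ang lalaki ng payong"),
    ("the man slept", "natulog ang lalaki"),
    ("the umbrella broke", "nasira ang payong"),
]
pairs = candidate_pairs(toy_corpus, min_score=1.0)
```

On so little data, words that appear only once (such as "bought" and "ng") also score perfectly against each other, which is one reason the slide lists parallel corpora and a part-of-speech tagger as prerequisites: more text and tag agreement help filter such spurious pairs.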

8 Morphological Analyzer & Generator: Initially collected morphological rules from grammar books. Developed an example-based morphological phenomenon learner –learn from –example: Challenge: Lexical resources –lexicon –part-of-speech tagger –morphological rules
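
To make "example-based" learning of morphology concrete, here is a minimal sketch, not the project's learner (the slide does not show its algorithm or examples). It assumes training data of (root, inflected form) pairs and handles only a single inserted prefix, infix, or suffix; reduplication and other phenomena are out of scope. The functions `learn_rule`, `learn_rules`, and `apply_rule` and the word pairs are hypothetical.

```python
from collections import Counter


def learn_rule(root, surface):
    """Infer a single-affix rule that turns `root` into `surface`.

    Returns (kind, position, affix), or None if no single insertion
    explains the pair. `position` is only set for infixes.
    """
    extra = len(surface) - len(root)
    if extra <= 0:
        return None
    for i in range(len(root) + 1):
        # Does inserting `extra` characters at position i produce surface?
        if surface[:i] == root[:i] and surface[i + extra:] == root[i:]:
            affix = surface[i:i + extra]
            if i == 0:
                return ("prefix", None, affix)
            if i == len(root):
                return ("suffix", None, affix)
            return ("infix", i, affix)
    return None


def learn_rules(pairs):
    """Count how often each inferred rule is seen in the examples."""
    rules = Counter()
    for root, surface in pairs:
        rule = learn_rule(root, surface)
        if rule is not None:
            rules[rule] += 1
    return rules


def apply_rule(rule, root):
    """Generate a surface form from a root using a learned rule."""
    kind, pos, affix = rule
    if kind == "prefix":
        return affix + root
    if kind == "suffix":
        return root + affix
    return root[:pos] + affix + root[pos:]


# Hypothetical training examples, invented for this sketch.
examples = [("bili", "bumili"), ("sulat", "sumulat"), ("luto", "magluto")]
rules = learn_rules(examples)
generated = apply_rule(("infix", 1, "um"), "kain")
```

The same rule table serves both directions: generation applies a rule to a root, and analysis would try to undo each rule on a surface form and check the result against the lexicon, which is why the slide lists the lexicon as a dependency.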

9 Part-Of-Speech Tagger: automatic association of parts-of-speech to words in a document. Existing Filipino tagger achieves < 80% accuracy. Challenge: Lexical resources –tagged parallel corpora –lexicon –morphological analyzer –grammar
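
As a rough illustration of what a tagger learns from a tagged corpus, the sketch below implements a most-frequent-tag baseline. It is not the existing Filipino tagger mentioned on the slide; the tag names (`VERB`, `MARKER`, `NOUN`), the fallback tag for unknown words, and the tiny training set are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="NOUN"):
    """Tag words with the learned tag; unknown words get `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(model, gold_sentences, default="NOUN"):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [w for w, _ in sentence]
        for (_, predicted), (_, gold) in zip(tag(model, words, default), sentence):
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0


# Hypothetical tagged data with a made-up three-tag tagset.
training = [
    [("Bumili", "VERB"), ("ang", "MARKER"), ("lalaki", "NOUN"),
     ("ng", "MARKER"), ("payong", "NOUN"),
     ("sa", "MARKER"), ("tindahan", "NOUN")],
]
model = train(training)
tagged = tag(model, ["Bumili", "ang", "aso"])
```

A baseline like this has no way to resolve words that change category with context, and its coverage is bounded by the size of the tagged corpus, which is consistent with the slide's list of missing resources.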

10 Grammar Derived manually Challenge: Free word order in sentence formation. The man bought an umbrella from the store. Bumili ang lalaki ng payong sa tindahan. Bumili sa tindahan ng payong ang lalaki. Ang lalaki ay bumili ng payong sa tindahan.
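
The effect of free word order on a rule-based grammar can be sketched by enumerating the verb-initial orderings of the example sentence. This assumes, purely for illustration, that every ordering of the three post-verbal phrases is acceptable; it does not cover the "ay" inversion shown in the third Filipino sentence. The function name `verb_initial_orders` is hypothetical.

```python
from itertools import permutations


def verb_initial_orders(verb, phrases):
    """Return every sentence made of the verb followed by the phrases in any order."""
    return [" ".join((verb,) + order) + "." for order in permutations(phrases)]


verb = "Bumili"
phrases = ["ang lalaki", "ng payong", "sa tindahan"]
sentences = verb_initial_orders(verb, phrases)
```

Three phrases already give six surface forms for one English sentence, so a grammar that lists fixed phrase orders grows quickly; treating the marked phrases as an unordered set is one way to keep the rule count manageable.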

11 Corpora: used by the lexicon extractor, part-of-speech tagger, and example-based MT; came from translation works of DLSU English majors, verified by linguists; consists of 207,000 words, 5,000 of which are tagged

12 Translation Rules: currently learned from the corpora. Disadvantages: –garbage-in-garbage-out –comprehensiveness. Need for linguist-verified rules.

13

14 Bringing it home … 171 Philippine Languages (SIL). No Philippine Corpora. Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc). “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)

15 eWika: Digitalization of Philippine Languages Build the Philippine Corpus Build software tools to study or use the corpus –Across Languages –Across Regions –Across Forms and Genres –Across Land and Sea

16 Across Languages: 171 Philippine Languages (SIL List; SIL = Summer Institute of Linguistics). Major languages. Near-extinction languages. How about the languages in-between?

17 Filipino Sign Language. The History of Sign Language in the Philippines: Piecing Together the Puzzle (Abat & Martinez, 9th Phil Linguistics Congress, 2006). Deaf individuals: handicapped vs members of a linguistic minority. Sign languages as true languages.

18 Across Boundaries Across Languages Across Regions Across Forms and Genres Across Land and Sea

19 Across Regions. e-Wika: Connecting the Philippine Islands through Language. 17 Regions: Ilocos Region (Region I), Cagayan Valley (Region II), Central Luzon (Region III), CALABARZON (Region IV-A), MIMAROPA (Region IV-B), Bicol Region (Region V), Western Visayas (Region VI), Central Visayas (Region VII), Eastern Visayas (Region VIII), Zamboanga Peninsula (Region IX), Northern Mindanao (Region X), Davao Region (Region XI), SOCCSKSARGEN (Region XII), Caraga (Region XIII), Autonomous Region in Muslim Mindanao (ARMM), Cordillera Administrative Region (CAR), National Capital Region (NCR, Metro Manila)

20

21 Across Boundaries Across Time: historical, contemporary Across Languages Across Regions Across Forms and Genres Across Land and Sea

22 Across Forms and Genres. In various forms: Text; Speech: speech to text system (ongoing project); Video: Filipino sign language. In various genres: categories of entries in the corpus.

23 Across Boundaries Across Time: historical, contemporary Across Languages Across Regions Across Forms and Genres Across Land and Sea

24 Web-based application: c/o Solomon See (upload, download, tools). Main players: Contributors, Verifiers, Facilitators. Server: DLSU-M commits to host the server for the next three years. Terms of Use: Research purposes.

25 The dream of building Philippine language resources and tools Many many many major hurdles to overcome Language Resources, Tools, & Peopleware: Needed

