
LANGUAGE RESOURCES IN MALAYSIA Zaharin Yusoff Computer-Aided Translation Unit School of Computer Sciences Universiti Sains Malaysia 11800 Penang, Malaysia.


1 LANGUAGE RESOURCES IN MALAYSIA
Zaharin Yusoff
Computer-Aided Translation Unit
School of Computer Sciences
Universiti Sains Malaysia
11800 Penang, Malaysia
Symposium on Language Resources in Asia

2 HISTORICAL PERSPECTIVE
[Timeline figure; axis: 1977, 1980, 1990, 2000]
USM: GETA, UTMK; MT; MT, MAHT; CL TOOLS; NLP APPLICATIONS
UTM: CICC; MT
ITNM: TRANSLATION
UKM: NLP
UM, UiTM: MT, CALL
Key:
-UNIVERSITI SAINS MALAYSIA (USM), Unit Terjemahan Melalui Komputer (UTMK)
-UNIV. TEKNOLOGI MALAYSIA (UTM)
-INSTITUT TERJEMAHAN NEGARA (ITNM)
-UNIV. KEBANGSAAN MALAYSIA (UKM)
-UNIVERSITI MALAYA (UM)
-UNIV./INSTITUT TEKNOLOGI MARA (UiTM)
-DEWAN BAHASA DAN PUSTAKA (DBP)

3 LANGUAGE RESOURCES: MAIN POINTS
COMP. LING. TOOLS
-GENERIC TOOLS: NOT TOO MANY; MOSTLY NOT UPDATED; SOME ARE REUSABLE
-APPLICATION BASED: THE MORE RECENT ONES; DEPENDENT ON DEMAND BUT MODULAR & REPROGRAMMABLE
LINGUISTIC DATA
-LINGWARE DATA: VERY LITTLE; NOT REUSABLE; METHODOLOGIES OK
-LANGUAGE DATA: REASONABLE; SOME INCOMPLETE; DIFFICULT TO ACQUIRE BUT REUSABLE
RECALL:
-Too Few Researchers (60 at peak in 1991, now 15)
-Lacking in Formal Linguistic Studies for Malay
-Lack of Culture of Data Accumulation

4 LINGUISTIC RESOURCES
GENERIC TOOLS (NOT TOO MANY; MOSTLY NOT UPDATED; SOME ARE REUSABLE)
MT software:
-JEMAH
-Automatic Generator of Lingware (Analysis, Synthesis)
-User-Driven MT System
Language Tools:
-Spellchecker
-Desktop Accessories (Dicts)
-Text Analysis
-etc.
Linguistic Tools:
-Corpus System
-Dictionary System
-Grammar Editor (STCG)
-Bilingual Corpus Bank
-etc.
APPLICATION BASED TOOLS (THE MORE RECENT ONES; DEPENDENT ON DEMAND BUT MODULAR & REPROGRAMMABLE)
-MAHT system: SISKEP
-Example Based MT
-EDI (parsing/generation msg. types)
-Semantic Driven Search Engine
-WEB Crawler
-Internet Portal (??)
LINGWARE DATA (VERY LITTLE; NOT REUSABLE; METHODOLOGIES OK)
-Ariane/Jemah MT English->Malay (all phases)
-STCG Malay Grammar

5 LANGUAGE DATA (REASONABLE; SOME INCOMPLETE; DIFFICULT TO ACQUIRE BUT REUSABLE)
DICTIONARIES (WINHELP)
Dictionary                  Size        Pages
ENGLISH-MALAY DBP (KIMD)    10.16 MB    1945 pages
MALAY DBP (KD)              6.63 MB     1566 pages
TERMINOLOGIES (MABBIM)      8.13 MB     1069 pages
COMPUTER (Malay)            1.15 MB     …
FRENCH-ENGLISH-MALAY        3.57 MB     …
DICTIONARIES (Databases: attribute format)
-KIMD (as above): missing data B,O,R,S,T,U,V,X,Y,Z
-KD (as above): alphabet A only (1,544 words)
-MALAY THESAURUS
CORPUS
-Malay Books, Letters to Editor (System): 2.2 million words
-Translations (Malay only in MS Word): 23 titles (average 1.5 MB, 350 pages)
-English-Malay (Parallel Text): 3 titles (1 with sentence alignment)

6 LANGUAGE DATA (cont.)
KIMD-WordNet Link (A->F only)
Sources are KIMD and WordNet, linked by sense entry in WordNet and KIMD, e.g. abacus:
KIMD(abacus,n,1 [device, for, calculating, ’,’, a, square, or, rectangular, frame, ….]).
***(entry and definition taken from KIMD – some redefined to fit)
WORDNET(102155519, 1, ‘abacus’, n, 2, 0, [performs, arithmetic, functions, by, ….]).
***(entry and definition taken from WordNet)
===sepua, sempoa, dekak-dekak
***(Malay equivalent taken from KIMD)
KD Sense Processing (A->Z)
Source is KAMUS DEWAN (KD)
Steps of process:
-Extract word senses (ws) from KD (result: approx. 30K ws with definition)
-Extract primitive words (ps) from KD based on frequency (result: approx. 5K ps with definition)
-Extract synonyms from KD (result: approx. 6K synonyms)
-Use KD sense numbering to tag synonyms.
Example of result:
syn_kd(adem1, sejuk1)
syn_kd(adem3, tenang2)
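The sense-tagged synonym facts in the example of result (e.g. syn_kd(adem1, sejuk1)) can be read back into (word, KD sense number) pairs. The following is a minimal Python sketch, not the UTMK implementation: the function names, the regular expressions, and the assumption that synonymy is symmetric are all illustrative choices that do not come from the slides.

```python
import re
from collections import defaultdict

# Assumed shape of a sense-tagged word: a lemma followed by its KD sense number, e.g. "adem1".
SENSE_TAG = re.compile(r"^(?P<lemma>[^\d\s]+)(?P<sense>\d+)$")
# Assumed shape of one fact: syn_kd(word1, word2).
FACT = re.compile(r"syn_kd\(\s*([^,\s]+)\s*,\s*([^)\s]+)\s*\)")


def split_sense(tagged):
    """Split a tagged word such as 'adem1' into ('adem', 1)."""
    match = SENSE_TAG.match(tagged)
    if match is None:
        raise ValueError(f"not a sense-tagged word: {tagged!r}")
    return match.group("lemma"), int(match.group("sense"))


def parse_syn_kd(text):
    """Return a list of ((lemma, sense), (lemma, sense)) pairs found in text."""
    return [(split_sense(a), split_sense(b)) for a, b in FACT.findall(text)]


def synonyms_by_sense(pairs):
    """Index synonyms by (lemma, sense); treats synonymy as symmetric (an assumption)."""
    index = defaultdict(set)
    for left, right in pairs:
        index[left].add(right)
        index[right].add(left)
    return dict(index)


# The two facts shown on the slide.
facts = "syn_kd(adem1, sejuk1) syn_kd(adem3, tenang2)"
pairs = parse_syn_kd(facts)
index = synonyms_by_sense(pairs)
print(index[("adem", 3)])
```

Keeping the sense number as a separate integer, rather than leaving it glued to the word, lets different senses of the same word (here adem1 and adem3) point to different synonyms.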

7 LANGUAGE DATA (cont.)
OTHER POSSIBLE SOURCES OF DATA
DEWAN BAHASA DAN PUSTAKA (LANGUAGE ACADEMY)
-Copies of all types in UTMK (perhaps more volume)
-Corpus: more recent publications (books, novels, journals, etc.)
NEWSPAPERS
-Corpus: more recent years, i.e. since publishing on internet
-STAR, NEW STRAITS TIMES, etc.
OTHER R&D CENTRES
-UNIV. TEKNOLOGI MALAYSIA (UTM)
-INSTITUT TERJEMAHAN NEGARA (ITNM)
-UNIV. KEBANGSAAN MALAYSIA (UKM)
-UNIVERSITI MALAYA (UM)
-UNIV./INSTITUT TEKNOLOGI MARA (UiTM)

8 THANK YOU, ARIGATO, MERCI, SHUKRIYA, GRAZIE, XIE-XIE NI, TERIMA KASIH

