Published by Curtis Fleming. Modified over 7 years ago.
1
IDENTIFYING AND CLASSIFYING UNKNOWN WORDS IN MALAY
Ranaivo-Malançon Bali, Chua Chong Chai, Ng Pek Kuan
Computer Aided Translation Unit (UTMK), School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
2
OUTLINE
- Introduction
- What are “unknown words”?
- Why do we need to deal with unknown words?
- Our objectives
- Related works
- Method for the identification and classification of unknown words (ICUW)
- Experiment & evaluation of the results
- Conclusion & discussion
- Future works
SNLP2007, Thailand
3
WHAT ARE “UNKNOWN WORDS”?
Words that are not listed in the reference lexicon, because:
- They are misspelled
- They do not belong to the vocabulary of the language: foreign words, loanwords
- The reference lexicon does not consider these words as lexemes: abbreviations, proper nouns
- The reference lexicon has not been updated yet: neologisms
4
WHY DO WE NEED TO DEAL WITH UNKNOWN WORDS?
- A text analyser needs to know what a “word” is
- A text analyser is a common component of most Natural Language Processing applications
- To be robust, a text analyser must be able to process all words
5
THREE OBJECTIVES
- To reduce the initial set of unknown words step by step
- To determine the classes of the words that remain unknown at the end of the whole process
- To identify the weaknesses of each identifier in order to improve its accuracy
6
RELATED WORKS
Three types of methods to process unknown words, based on their objective:
- Identification of unknown words, as in POS tagging
- Distinction between unknown and known words, as in word segmentation for languages like Thai, Chinese, Japanese, etc.
- Identification and classification of unknown words (ICUW): few works
  - Toole (2000): used decision trees; identification and classification of misspellings and names; 86.6% precision
  - Mikheev (2002): applied a document-centered approach; identification and classification of proper names and abbreviations; best results: 95-97% precision on proper name disambiguation and 98-99% precision on abbreviation recognition
  - Goh et al. (2005): used a hierarchical model with multi-classifiers; identification and classification of numbers, time nouns, and person names; higher precision with multi-classifiers (89%) than with a single classifier for all types of unknown words (86%)
7
METHOD (our ICUW)
A chain of filters to identify and classify unknown words in Malay texts
- Malay language
  - Official language of Malaysia
  - Written with the Latin alphabet (Rumi) or the Arabic alphabet (Jawi)
- Unknown words studied in this work
  - Proper names
  - Abbreviations
  - Loan words
  - Affixed words
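The idea of a chain of filters can be sketched as follows. This is only an illustration under stated assumptions: the order of the filters and the two stand-in predicates (`str.isupper`, `str.istitle`) are hypothetical, and the identifiers described in the slides also use sentence context, which a word-level predicate cannot capture.

```python
def run_chain(unknown_words, filters):
    """Pass the unknown words through each filter in turn.

    `filters` is a list of (label, predicate) pairs. Every word accepted by a
    predicate is classified under its label and removed from the remaining
    set, so the set of unknown words shrinks step by step.
    """
    remaining = list(unknown_words)
    classified = {}
    for label, accepts in filters:
        classified[label] = [w for w in remaining if accepts(w)]
        remaining = [w for w in remaining if not accepts(w)]
    return classified, remaining


# Hypothetical stand-in predicates, not the rules used by the authors.
TOY_FILTERS = [
    ("abbreviation", str.isupper),
    ("proper name", str.istitle),
]

classified, remaining = run_chain(["KPPK", "Setiu", "kipas"], TOY_FILTERS)
print(classified)
print(remaining)
```

Whatever is left in `remaining` after the last filter corresponds to the words that stay unknown at the end of the process.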
8
FLOWCHART OF THE ICUW
[Figure: flowchart of the ICUW; image not included in this transcript]
9
ABBREVIATION IDENTIFICATION
Identification by parentheses
  Kesatuan Perkhidmatan Perguruan Kebangsaan (KPPK) hari ini mencadangkan agar faedah "Pemberian Wang Tunai Gantian Cuti Rehat" (GCR) diperluaskan kepada semua guru biasa di negara ini.
Identification by common formats
- Any sequence of letters, each followed by a full stop, e.g. A.N.M.
- Any sequence of two, three, or four capital letters, e.g. AO, CGR, AEN, KPPK
- Any sequence of consonants in upper case, e.g. PBPKNM
- Any sequence of vowels in upper case, e.g. UIA, IAEA
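The two abbreviation rules can be approximated with regular expressions. This is a minimal sketch, not the authors' implementation: the pattern details (for example a minimum length of two for the consonant-only and vowel-only formats) are assumptions, and the function names are hypothetical.

```python
import re

# Rule 1: an upper-case token in parentheses, e.g. "... Kebangsaan (KPPK)".
PAREN_ABBR = re.compile(r"\(([A-Z][A-Z.]+)\)")

# Rule 2: the four "common formats" listed on the slide.
COMMON_FORMATS = [
    re.compile(r"(?:[A-Za-z]\.){2,}"),      # letters each followed by a full stop: A.N.M.
    re.compile(r"[A-Z]{2,4}"),              # two to four capital letters: AO, KPPK
    re.compile(r"[B-DF-HJ-NP-TV-Z]{2,}"),   # upper-case consonants only: PBPKNM
    re.compile(r"[AEIOU]{2,}"),             # upper-case vowels only: UIA, IAEA
]


def abbreviations_in_parentheses(text):
    """Return every parenthesised upper-case token found in the text."""
    return PAREN_ABBR.findall(text)


def looks_like_abbreviation(token):
    """True if the whole token matches one of the common formats."""
    return any(p.fullmatch(token) for p in COMMON_FORMATS)


sentence = ('Kesatuan Perkhidmatan Perguruan Kebangsaan (KPPK) hari ini '
            'mencadangkan agar faedah "Pemberian Wang Tunai Gantian Cuti '
            'Rehat" (GCR) diperluaskan kepada semua guru biasa di negara ini.')
print(abbreviations_in_parentheses(sentence))
print([t for t in ["A.N.M.", "PBPKNM", "YORK", "KUSTEM"] if looks_like_abbreviation(t)])
```

Under these patterns a four-letter capitalised word such as YORK is accepted and a longer mixed abbreviation such as KUSTEM is rejected, which mirrors the weaknesses of the ‘length’ parameter reported later in the deck.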
10
PROPER NAME RECOGNITION
By the definition of abbreviations
- Words that precede an identified abbreviation
- Words that start with each letter of the abbreviation
- Words with a capital initial
  Kesatuan Perkhidmatan Perguruan Kebangsaan (KPPK) hari ini mencadangkan agar faedah "Pemberian Wang Tunai Gantian Cuti Rehat" (GCR) diperluaskan kepada semua guru biasa di negara ini.
By specific titles
- Words starting with a capital letter and placed after titles
  - Tan Sri Abdul Samad Idris
  - Datuk Mohd Khalid Yunus
  - Dr Mahathir Mohamad
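Both proper name rules can be sketched in a few lines. The code below is an illustration only: the title list contains just the three titles that appear in the examples, and the helper names `expansion_of` and `names_after_titles` are hypothetical.

```python
import re

# Only the titles that appear in the slide examples; a real list would be longer.
TITLES = ["Tan Sri", "Datuk", "Dr"]

TITLE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TITLES) + r")\.?\s+"
    r"((?:[A-Z][\w'-]*\s*)+)"   # one or more capitalised words after the title
)


def names_after_titles(text):
    """Return the runs of capitalised words that directly follow a title."""
    return [m.group(1).strip() for m in TITLE_PATTERN.finditer(text)]


def expansion_of(text, abbr):
    """Return the words before "(abbr)" whose initials spell the abbreviation."""
    marker = "(" + abbr + ")"
    if marker not in text:
        return None
    head = text.split(marker)[0]
    words = re.findall(r"[A-Za-z]+", head)[-len(abbr):]
    if len(words) == len(abbr) and all(w[0] == c for w, c in zip(words, abbr)):
        return " ".join(words)
    return None


sentence = ('Kesatuan Perkhidmatan Perguruan Kebangsaan (KPPK) hari ini '
            'mencadangkan agar faedah "Pemberian Wang Tunai Gantian Cuti '
            'Rehat" (GCR) diperluaskan kepada semua guru biasa di negara ini.')
print(expansion_of(sentence, "KPPK"))
print(expansion_of(sentence, "GCR"))
print(names_after_titles("Datuk Mohd Khalid Yunus dan Dr Mahathir Mohamad hadir."))
```

Matching the initials strictly, letter by letter, keeps the sketch simple; it would miss expansions that skip function words.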
11
LOAN WORD IDENTIFICATION (1)
- Specific subset of letters: {f, q, v, x, z}
- Position of letters or sequences of letters. The set is obtained by “reversing” the Malay orthographic rules proposed by Mabbim (1992):
  - Initial: ae, kh, gh, sy, abs, eks, auto, heks, hipo, homo, hiper, inter, intro, proto, super, hetero, C1C1 (the same consonant twice)
  - Medial: ae, sh, th
  - Final: e, o, c, j, w, y, ks, ans, oid, asma, isme, logi, grafi
  - Anywhere: ee, oo, uu, ie, bb, cc, dd, hh, jj, ll, mm, pp, qq, rr, ss, tt, vv, ww, xx, yy, zz, ph, any sequence of three consonants (not necessarily the same)
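The letter-based and position-based cues can be checked mechanically. The sketch below transcribes the lists from the slide; the function name `loan_word_cues` is hypothetical, and counting consonants letter by letter (so that a digraph such as ng counts as two consonants) is an assumption that may over-trigger the three-consonant rule.

```python
import re

FOREIGN_LETTERS = set("fqvxz")
INITIAL = ("ae", "kh", "gh", "sy", "abs", "eks", "auto", "heks", "hipo", "homo",
           "hiper", "inter", "intro", "proto", "super", "hetero")
MEDIAL = ("ae", "sh", "th")
FINAL = ("e", "o", "c", "j", "w", "y", "ks", "ans", "oid", "asma", "isme",
         "logi", "grafi")
ANYWHERE = ("ee", "oo", "uu", "ie", "bb", "cc", "dd", "hh", "jj", "ll", "mm",
            "pp", "qq", "rr", "ss", "tt", "vv", "ww", "xx", "yy", "zz", "ph")
CONSONANT = "[b-df-hj-np-tv-z]"


def loan_word_cues(word):
    """Return the names of the cue groups that fire for this word."""
    w = word.lower()
    cues = []
    if FOREIGN_LETTERS & set(w):
        cues.append("letter")
    # Listed initial sequences, or the same consonant twice at the start.
    if w.startswith(INITIAL) or re.match("(" + CONSONANT + r")\1", w):
        cues.append("initial")
    # Medial sequences must lie strictly inside the word.
    if any(s in w[1:-1] for s in MEDIAL):
        cues.append("medial")
    if w.endswith(FINAL):
        cues.append("final")
    # Listed sequences anywhere, or three consonant letters in a row.
    if any(s in w for s in ANYWHERE) or re.search(CONSONANT + "{3}", w):
        cues.append("anywhere")
    return cues


for word in ["fotografi", "struktur", "batu", "Mahathir"]:
    print(word, loan_word_cues(word))
```

A proper name such as Mahathir triggers the medial cue, which illustrates how rules of this kind can also fire on words that are not loan words.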
12
LOAN WORD IDENTIFICATION (2)
Specific morphographemic rules

  Native word                           Loan word
  mengipas (<kipas) ‘to fan’            mengkritik (<kritik) ‘to criticise’
  memacu (<pacu) ‘to spur’              memproses (<proses) ‘to process’
  menyeduh (<seduh) ‘to infuse’         mensabotaj (<sabotaj) ‘to sabotage’
  menimbang (<timbang) ‘to measure’     mentradisi (<tradisi) ‘to make something a tradition’

Consonant-vowel structures
If the syllabic structure of the word does not belong to one of these structures, then it is a loan word:
- 1 syllable: CV, VC, CVC
- 2 syllables: V.V, V.VC, V.CV, V.CVC, VC.CV, VC.CVC, CV.V, CV.CV, CVC.CV, CVC.CVC
- 3 syllables: CV.CV.CV
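Both cues on this slide lend themselves to a short sketch. Everything below is an approximation under stated assumptions: the prefix test generalises the four example pairs (a root-initial k, p, s or t kept after meN- suggests a loan word), the digraphs ng, ny, sy, kh and gh are treated as single consonants, and the structure list is transcribed from the slide exactly as it appears, so it may be incomplete. The function names are hypothetical.

```python
import re

# Loan roots keep their initial k/p/s/t after the prefix meN-
# (mengkritik, memproses, mensabotaj, mentradisi); native roots drop it.
RETAINED_INITIAL = re.compile(r"^(?:mengk|memp|mens|ment)")

VOWELS = "aeiou"
DIGRAPHS = ("ng", "ny", "sy", "kh", "gh")   # assumption: each counts as one consonant

# Structures exactly as listed on the slide; dots mark syllable boundaries.
LISTED = ("CV VC CVC "
          "V.V V.VC V.CV V.CVC VC.CV VC.CVC CV.V CV.CV CVC.CV CVC.CVC "
          "CV.CV.CV")
LISTED_SKELETONS = {s.replace(".", "") for s in LISTED.split()}


def keeps_root_initial(word):
    """True if a meN- form retains the root-initial k, p, s or t."""
    return bool(RETAINED_INITIAL.match(word.lower()))


def cv_skeleton(word):
    """Map a word to its consonant-vowel skeleton, e.g. batu -> CVCV."""
    w = word.lower()
    for digraph in DIGRAPHS:
        w = w.replace(digraph, "C")
    return "".join("V" if ch in VOWELS else "C" for ch in w)


def has_listed_structure(word):
    """True if the skeleton is one of the structures listed on the slide."""
    return cv_skeleton(word) in LISTED_SKELETONS


for word in ["mengipas", "mengkritik", "memacu", "memproses"]:
    print(word, keeps_root_initial(word))
for word in ["batu", "ibu", "kritik", "struktur"]:
    print(word, cv_skeleton(word), has_listed_structure(word))
```

Comparing skeletons without syllable boundaries avoids writing a syllabifier, at the cost of ignoring where the boundaries actually fall.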
13
AFFIXED WORD ANALYSIS
Analysis of affixed words only
- anak-anak ‘children’ is not affixed but reduplicated => not analysed
- beranak-anak ‘to have children’ is affixed (and reduplicated) => analysed
Context-independent analyser
- Returns all possible morphological analyses: segmentations and possible roots
- word = beribu
  - ber+ibu ‘to be a mother’ or ‘to have a mother’
  - ber+ribu ‘thousands’
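A context-independent analyser returns every segmentation that yields a known root. The sketch below handles only the prefix ber- and uses a three-word toy lexicon; `ROOTS` and `analyse_ber` are hypothetical names, and the real analyser covers far more affixes.

```python
# Hypothetical toy root lexicon, just large enough for the slide examples.
ROOTS = {"ibu", "ribu", "anak"}


def analyse_ber(word, roots=ROOTS):
    """Return all (prefix, root, reduplicated) analyses of a ber- form.

    Words without the prefix, such as the purely reduplicated anak-anak,
    are outside the scope of this sketch and yield no analysis.
    """
    w = word.lower()
    if not w.startswith("ber"):
        return []
    stem = w[3:]
    base, _, copy = stem.partition("-")
    reduplicated = bool(copy) and copy == base
    analyses = []
    # ber + base, and ber + r-initial root where only one r is written.
    for root in (base, "r" + base):
        if root in roots:
            analyses.append(("ber", root, reduplicated))
    return analyses


print(analyse_ber("beribu"))
print(analyse_ber("beranak-anak"))
print(analyse_ber("anak-anak"))
```

Because no context is consulted, beribu keeps both readings, and choosing between them is left to a later component.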
14
EXPERIMENT
15
WHAT ARE THE 184 WORDS?
- 83 proper names
  - Names of persons: Rosmawati, Ameran, …
  - Names of countries, towns, etc.: Perancis, Somali, Setiu, …
  - Names of institutions, companies, etc.: Tetuan, Linkaran, …
- 50 morphologically complex words and missing roots, e.g. disalahtadbirkan (prefix + compound + suffix); matapelajaran (compound + circumfix + ); Portugisnya (proper noun + suffix); sekretariatnya (loan word + suffix); mengakses (prefix + loan word)
- 32 misspelled words, e.g. *mamastikan instead of memastikan; *tuanpunya instead of tuan punya, …
- 11 reduplicated words, e.g. ekonomi-ekonomi, isteri-isteri, …
- 6 loanwords, e.g. antikuiti, demokratisasi, …
- 1 neologism or misspelling? pengwujudan (peN+wujud+an): pewujudan (peN+wujud+an) exists and means ‘creation or establishment’
- 1 abbreviation: KUSTEM
16
ERROR ANALYSIS
- With abbreviation rules
  - 337 abbreviations identified = 20 errors + 317 genuine abbreviations
  - Errors due to the parameter ‘length’ ≤ 4: e.g. NEW, YEW, YORK, ASIA
- With proper name rules
  - 203 proper names identified = 2 errors + 201 genuine proper names
  - Errors due to tokenisation: Ir.Anton Sebastian, Datuk M.Kayveas
- With loan word rules
  - 1098 loan words identified = 954 errors + 144 genuine loan words
  - Errors due to the fact that many rules for the identification of loan words also apply to other types of words: 747 proper names + foreign words + 29 abbreviations + 28 spelling errors
- With affixed word rules
  - 1529 affixed words recognised
  - No error
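The share of correct identifications for each filter follows from these counts. The short computation below assumes that each "identified" total splits into errors plus genuine items, so the genuine count is obtained by subtraction; `COUNTS` and `precision` are names introduced for illustration.

```python
# (identified, errors) per filter, taken from the error analysis counts.
COUNTS = {
    "abbreviation": (337, 20),
    "proper name": (203, 2),
    "loan word": (1098, 954),
    "affixed word": (1529, 0),
}


def precision(identified, errors):
    """Share of identified items that are genuine."""
    return (identified - errors) / identified


for name, (identified, errors) in COUNTS.items():
    genuine = identified - errors
    print(f"{name}: {genuine} genuine, {precision(identified, errors):.1%} correct")
```

The abbreviation and proper name values agree with the 94% and ~99% quoted in the conclusion, while the loan word filter is by far the least precise.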
17
CONCLUSION & DISCUSSION (1)
- First objective attained: the number of unknown words dropped spectacularly, from 3351 to 184
- But 184 is not the actual value: 184 + 976 (errors = 20 + 2 + 954) = 1160 real unknown words (not correctly identified and classified)
- These 1160 words represent about one-third of the total number of unknown words
- About two-thirds of the unknown words were recognised and classified
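The figures on this slide can be checked against the counts given in the error analysis. Reading the 976 errors as the sum of the error counts of the abbreviation, proper name and loan word filters is an inference from the numbers rather than something the slide states explicitly.

```latex
\begin{align*}
3351 - (337 + 203 + 1098 + 1529) &= 3351 - 3167 = 184 \\
20 + 2 + 954 &= 976 \\
184 + 976 &= 1160 \\
\frac{1160}{3351} &\approx 0.346 \quad \text{(about one-third)}
\end{align*}
```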
18
CONCLUSION & DISCUSSION (2)
- Second objective attained: the classes of the remaining unknown words have been identified
- Most of them are proper names and words generated by one or more morphological processes
19
CONCLUSION & DISCUSSION (3)
Third objective attained: the weaknesses of each identifier have been identified
- Abbreviation identifier
  - The two rules (by parentheses and by common formats) are satisfactory: 94% correct identification
  - BUT they are not sufficient
  - The parameter ‘length’ causes false identifications and misses long abbreviations
- Proper name identifier
  - The two rules (by abbreviation definition and by the presence of titles) are satisfactory: ~99% correct identification
  - BUT they are not sufficient: many proper names remain unclassified or are classified incorrectly
- Affixed word analyser
  - The affixed word analyser works very well: 100% correct analysis
  - BUT many complex words remain unanalysed
20
FUTURE WORKS
- Find other parameters (besides ‘length’) for abbreviation identification
- Increase the number of proper names (current list = 1369) and find an accurate approach for proper name identification
- Further constrain the loan word rules, e.g. at the level of consonant-vowel structures
- Create a complete Malay morphological analyser: affixed word analyser + reduplicated word analyser + compound word analyser
- Integrate other identifiers into the program, e.g. neologism identifier, foreign word identifier
21
References
- C.-L. Goh, M. Asahara and Y. Matsumoto (2005). Training multi-classifiers for Chinese unknown word detection. Journal of Chinese Language and Computing, 15(1): 1-12.
- A. Mikheev (2002). Periods, Capitalized Words, etc. Computational Linguistics, 28(3).
- Mabbim (Majlis Bahasa Brunei Darussalam-Indonesia-Malaysia) (1992). General guidelines for the formation of terms in Malay. DBP, Malaysia.
- J. Toole (2000). Categorizing unknown words: using decision trees to identify names and misspellings. In Proc. of the 6th Conference on Applied Natural Language Processing, Seattle, Washington.
22
THANK YOU! Questions?
Ranaivo-Malançon Bali, ranaivo@cs.usm.my
Chua Chong Chai, Ng Pek Kuan