Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Tao-Hsing Chang Chia-Hoang Lee 國立雲林科技大學 National Yunlin University.


1 Intelligent Database Systems Lab. Advisor: Dr. Hsu. Graduate: Chien-Shing Chen. Authors: Tao-Hsing Chang, Chia-Hoang Lee. 國立雲林科技大學 National Yunlin University of Science and Technology. Automatic Chinese unknown word extraction using small-corpus-based method. Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on, IEEE.

2 Outline: Motivation; Objective; Introduction; Extracting possible unknown words (SPLR); Modification (prefixed/suffixed words, compound word selection); Experiment; Conclusion; Opinion. N.Y.U.S.T. I.M.

3 Motivation: Any Chinese character can either represent a word or be part of other words. There are no blanks between Chinese words to mark their boundaries. Statistics-based and rule-based methods both have drawbacks. Examples: “拍打皮卡丘” (hit Pikachu); “觀光協會” (tourism association), “神奇寶貝” (Pokémon).

4 Objective: Extract Chinese unknown words efficiently and accurately, including words that occur rarely, using only a small set of documents for training.

5 1-1. Introduction: Unknown words are words that do not exist in the dictionary or vocabulary. Identifying the boundaries: “拍打皮卡丘” (hit Pikachu), “資料探勘非常有意思” (data mining is very interesting). Semantic ambiguity: “觀光協會” (tourism association), “神奇寶貝” (Pokémon).

6 1-2. Introduction: Restricting the scope to particular types of unknown words: using prefixes/suffixes, identifying proper names, or using a hybrid method to estimate the probability. General unknown words are difficult to identify: “熱鬧非凡” (extraordinarily lively), “回味無窮” (endlessly memorable), “神奇寶貝” (Pokémon); “發生什麼” (what happened), “老師問問題” (the teacher asks a question).

7 1-3. Introduction: With statistics-based methods, small documents cause low accuracy. Goal: develop a method that keeps the efficiency of statistics-based methods and still identifies unknown words accurately when the documents are small.

8 2. Previous Works: A proper name that is a compound word cannot be identified: “中國國際商業銀行” (International Commercial Bank of China) is split into “中國” (China), “國際” (international), “商業” (commercial), “銀行” (bank). Statistics-based methods rely on occurrence frequency. The PLU-based likelihood ratio (PLR) is not only efficient but also fast, but words that occur rarely cannot be extracted.

9 3-1. Extracting Possible Unknown Words: Preprocessing: retrieve the possible character sequences; limit the maximum length of the character sequences; eliminate stop words from the character sequences. The frequently occurring character sequences are then regarded as possible unknown words.
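The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `candidate_sequences`, the parameters `max_len` and `min_count`, and the choice to treat stop words as single characters that end a sequence are all assumptions made for the example.

```python
from collections import Counter

def candidate_sequences(sentences, max_len=4, min_count=2, stop_chars=frozenset()):
    """Count character sequences of length 2..max_len and keep the frequent ones."""
    counts = Counter()
    for sentence in sentences:
        for start in range(len(sentence)):
            for length in range(2, max_len + 1):
                seq = sentence[start:start + length]
                # Stop when the sentence ends or a stop character is reached;
                # longer sequences from this start would hit the same problem.
                if len(seq) < length or any(ch in stop_chars for ch in seq):
                    break
                counts[seq] += 1
    # Assumed rule: a sequence is "frequent" if it occurs at least min_count times.
    return {seq: n for seq, n in counts.items() if n >= min_count}

# Toy input, invented for illustration only.
sentences = ["我去福利社", "他去福利社了"]
candidates = candidate_sequences(sentences, stop_chars={"了"})
print(sorted(candidates))
```

Note that both “福利社” and “去福利社” survive this step; separating them is the job of the later SPLR filtering.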

10 3-2. Extracting Possible Unknown Words: If the occurrences of a sequence follow those of one of its subsequences, the sequence should not be an unknown word. “去福利社” (go to the co-op store) occurs following “福利社” (co-op store), so “去福利社” is not a possible unknown word.

11 3-3. Extracting Possible Unknown Words: Defined:

12 3-4. Extracting Possible Unknown Words: “去福利社” occurs 200 times; “福利社” occurs 1000 times. SPLR(tp) = = . Error-tolerance coefficients.

13 4. Modification: 1. One-character prefix (前綴) or suffix (字尾): “導師” (homeroom teacher) results in a low SPLR for “導師室” (homeroom teachers' office). 2. Familiar sequences: “從教室裡衝出來” (rush out of the classroom) is not an unknown word but would be identified as one by the simple SPLR method.

14 4-1-1. Prefixed/Suffixed Word Revising: Some words that contain prefixes or suffixes have already been collected in available dictionaries. For example, the unknown word “總領隊” (chief team leader) includes a prefix, “ocw + mcw”, and “導師室” includes a suffix, “mcw + ocw”.

15 4-1-2. Prefixed/Suffixed Word Revising: The one-character prefixes/suffixes can be extracted in advance from available dictionaries.
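One way the extraction from a dictionary might look is sketched below. It reads “ocw” and “mcw” as one-character word and multi-character word, which is an interpretation of the slide's abbreviations rather than something the transcript states. The function name `one_char_affixes`, the `min_count` cutoff, and the toy dictionary are hypothetical.

```python
from collections import Counter

def one_char_affixes(dictionary, min_count=2):
    """Collect characters that act as a one-character prefix or suffix."""
    words = set(dictionary)
    prefixes, suffixes = Counter(), Counter()
    for word in words:
        if len(word) < 3:
            continue  # need one character plus a multi-character word
        if word[1:] in words:
            prefixes[word[0]] += 1   # pattern: ocw + mcw
        if word[:-1] in words:
            suffixes[word[-1]] += 1  # pattern: mcw + ocw
    # Assumed rule: keep a character only if it recurs across dictionary words.
    frequent = lambda counter: {ch for ch, n in counter.items() if n >= min_count}
    return frequent(prefixes), frequent(suffixes)

# Toy dictionary, invented for illustration only.
dictionary = ["領隊", "總領隊", "經理", "總經理", "導師", "導師室", "辦公", "辦公室"]
prefixes, suffixes = one_char_affixes(dictionary)
print(prefixes, suffixes)
```

The collected characters could then be used to revise the score of a candidate such as “導師室” whose stem is itself a frequent word.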

17 4-2-1. Compound Word Selection: A familiar sequence in the document includes one or more common words, while a compound word consists of particular words. “從教室裡衝出來” consists of the common words “教室” (classroom) and “出來” (come out). “文具用品” (stationery supplies) occurs 100 times; “文具” (stationery) 100 times; “用品” (supplies) 100 times.

18 4-2-2. Compound Word Selection: ts is a word included in tp that is not a one-character word; a threshold is applied to the comparison. A sequence that consists of common words should not be a possible unknown word.

19 4-2-3. Compound Word Selection: Familiar sequences and compound words can be differentiated efficiently. “神奇寶貝” 200 times, “神奇” 230 times, “寶貝” 250 times: 200/230. “發生什麼” 200 times, “發生” 2000 times, “什麼” 4000 times: 200/2000.
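The ratios on this slide (200/230 versus 200/2000) suggest comparing the frequency of the whole sequence tp with the frequency of each multi-character word ts inside it. The sketch below follows that reading. The exact decision rule is not preserved in the transcript, so requiring every ratio to reach the threshold, the threshold value 0.5, and the function names are assumptions.

```python
def component_ratios(tp, parts, freq):
    """Ratio freq(tp) / freq(ts) for each multi-character component ts of tp."""
    return {ts: freq[tp] / freq[ts] for ts in parts if len(ts) > 1}

def looks_like_compound(tp, parts, freq, threshold=0.5):
    """Assumed rule: tp is a compound if no component is much more common than tp."""
    ratios = component_ratios(tp, parts, freq)
    return bool(ratios) and min(ratios.values()) >= threshold

# Counts taken from the slide.
freq = {
    "神奇寶貝": 200, "神奇": 230, "寶貝": 250,
    "發生什麼": 200, "發生": 2000, "什麼": 4000,
}
print(looks_like_compound("神奇寶貝", ["神奇", "寶貝"], freq))  # True
print(looks_like_compound("發生什麼", ["發生", "什麼"], freq))  # False
```

With these counts the components of “神奇寶貝” occur almost only inside the sequence, while “發生” and “什麼” occur far more often on their own, which is what marks “發生什麼” as a familiar sequence.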

20 5. Experiments: Data set: 1,285 student essays. Theme: “Recess at School”. Characters: 470,665.

21 5-1. Experiments: SPLR

22 5-2. Experiments: Familiar sequences

23 5-3. Experiments: Prefixed/suffixed words. Prefixed or suffixed patterns in the CKIP lexicon (中央研究院資訊科學研究所 - 中文知識庫小組; Institute of Information Science, Academia Sinica, Chinese Knowledge and Information Processing group).

24 6. Conclusion: efficiency; accuracy; words that occur rarely; small training corpus.

25 Opinion: information retrieval; unknown words; compound words; Semantic Web.

