A preliminary study on unknown word problem in Chinese word segmentation
Authors: Ming-Yu Lin, Tung-Hui Chiang, Keh-Yih Su
Speaker: Jbc


1 A preliminary study on unknown word problem in Chinese word segmentation. Authors: Ming-Yu Lin, Tung-Hui Chiang, Keh-Yih Su. Speaker: Jbc

2 Abstract
Unknown words are the main factor affecting the performance of word segmentation (WS). To handle unknown words, the paper proposes two approaches:
- Morphological rules: for regular unknown words.
- Statistical model: for irregular unknown words.

3 Outline
- Introduction
- System architecture
- Overview of the baseline model
- The morphological analysis
- Tagging part of speech
- Unknown word modeling


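5 Introduction-(1)
- Word: the basic unit of many Chinese language processing tasks.
- Chinese text has no explicit word boundaries, which makes segmentation difficult.
- Unknown words affect WS considerably.
- Categories of unknown words:
  - Regular: e.g., time and date expressions (11:50, 11/12), reduplication.
  - Irregular: e.g., proper names, compound nouns.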


7 Introduction-(2)
Strategies for the different types of unknown words:
- Regular: identified with morphological rules.
- Irregular: identified with a statistical model.
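To make the idea of rule-based recognition of regular unknown words concrete, here is a minimal Python sketch. The three patterns (time, date, and a simple AA reduplication) are hypothetical stand-ins chosen to match the examples on the previous slide; they are not the paper's 17 morphological rules, which this transcript does not reproduce. The function name `find_regular_unknown_words` is likewise an assumption.

```python
import re

# Hypothetical patterns for illustration only; not the paper's actual rules.
REGULAR_PATTERNS = [
    ("time", re.compile(r"\d{1,2}:\d{2}")),                 # e.g. 11:50
    ("date", re.compile(r"\d{1,2}/\d{1,2}")),               # e.g. 11/12
    ("reduplication", re.compile(r"([\u4e00-\u9fff])\1")),  # same Chinese character twice (AA)
]


def find_regular_unknown_words(text):
    """Return (label, matched_string, start, end) tuples, sorted by start offset."""
    hits = []
    for label, pattern in REGULAR_PATTERNS:
        for m in pattern.finditer(text):
            hits.append((label, m.group(0), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])
```

Calling `find_regular_unknown_words("會議在11/12的11:50開始，大家看看")` would flag the date, the time, and the reduplicated verb as candidate words before any dictionary lookup.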

8 System Architecture-(1)

9 System Architecture-(2)
Lexicon: 89,590 entries, 49 tags.

10 System Architecture-(2)
Lexicon: 89,590 entries, 49 tags.

Characters per word    Number of entries
1                       1,734
2                      35,492
3                      19,650
4                      24,054
5                       6,140
6                       2,020
>=7                       500
Total                  89,590

11 System Architecture-(3)
- Morphological rules: 17 rules (listed in Appendix A at the end).
- Corpus:

12 Morphological Rules

13 Statistics of Corpora

14 Overview of the Baseline Model-(1)
The baseline model:

15 Overview of the Baseline Model-(2)
Baseline vs. Max match:
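The comparison itself did not survive in this transcript. For readers unfamiliar with the "Max match" side, the following is a minimal sketch of forward maximum matching, a common dictionary-based segmentation heuristic. Which variant the paper actually compared against is not stated here, so the function `max_match`, its `max_len` parameter, and the toy lexicon in the usage note are all assumptions for illustration.

```python
def max_match(text, lexicon, max_len=7):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; if none matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```

With the toy lexicon `{"一", "個", "人", "個人"}`, `max_match("一個人", lexicon)` yields `["一", "個人"]`, because the greedy longest match at the second character prefers 個人 over 個.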


17 Overview of the Baseline Model-(3)
Two error patterns:
- s_ns (mis-combined error): Ex. | 一 | 個 | 人 | → | 一 | 個人 |
- ns_s (over-segmentation error): Ex. | 轉換器 | → | 轉換 | 器 |
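One way to count the two error patterns is to compare word-boundary positions between a reference segmentation and the system output. The sketch below assumes that s_ns means a boundary present in the reference but missing in the output, and that ns_s means the reverse; this reading fits the two examples on the slide, but the paper's exact counting scheme is not given in this transcript. The helper names `boundaries` and `count_error_patterns` are hypothetical.

```python
def boundaries(words):
    """Return the set of character offsets at which an internal word boundary falls."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def count_error_patterns(reference, system):
    """Count boundary-level errors; both segmentations must cover the same text."""
    ref, out = boundaries(reference), boundaries(system)
    return {
        "s_ns": len(ref - out),  # reference boundary lost: words mis-combined
        "ns_s": len(out - ref),  # spurious boundary added: word over-segmented
    }
```

Applied to the slide's examples, `count_error_patterns(["一", "個", "人"], ["一", "個人"])` reports one s_ns error, and `count_error_patterns(["轉換器"], ["轉換", "器"])` reports one ns_s error.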

18 Statistics of Error Patterns

19 The Morphological Analysis-(1)
- The paper proposes using morphological rules to find regular unknown words.
- Rule ordering: uses the SFS (sequential forward selection) procedure.
- Cost = w_r * (1 - P_r) + w_p * (1 - P_p)
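The rule-ordering step can be sketched as a greedy loop driven by the cost function above. The slide does not define P_r and P_p; the sketch reads them as recall and precision rates in [0, 1], which is an assumption. The default weights of 0.5, the `evaluate` callback (which must return a `(P_r, P_p)` pair for a given ordered rule list), and the function names are hypothetical as well. This version orders every rule rather than stopping early, which may differ from the procedure used in the paper.

```python
def cost(p_r, p_p, w_r=0.5, w_p=0.5):
    """Cost = w_r * (1 - P_r) + w_p * (1 - P_p); lower is better."""
    return w_r * (1.0 - p_r) + w_p * (1.0 - p_p)


def order_rules_sfs(rules, evaluate, w_r=0.5, w_p=0.5):
    """Greedy sequential forward selection over rules.

    At each step, append the remaining rule whose addition to the current
    ordered list gives the lowest cost, until every rule has been placed.
    """
    ordered, remaining = [], list(rules)
    while remaining:
        best = min(
            remaining,
            key=lambda r: cost(*evaluate(ordered + [r]), w_r, w_p),
        )
        ordered.append(best)
        remaining.remove(best)
    return ordered
```

In practice `evaluate` would run the segmenter with the candidate rule list on a held-out corpus; any callable with that return shape works for experimentation.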

20 The Morphological Analysis-(2)

21 The Morphological Analysis-(3)
Baseline model + morphological rules:

22 The Morphological Analysis-(4)
Improvement in s_ns and ns_s errors after applying the morphological rules:

23 Tagging part of speech-(1)

24 Tagging part of speech-(2)

25 Tagging part of speech-(3)

26 Tagging part of speech-(4)

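27 Unknown word modeling-(1)
Five unknown word categories:
- Words that should be added to the dictionary. Ex: 爭議
- Words that should be handled by morphological rules. Ex: 牛肝, 牛心
- Abbreviations. Ex: 國大
- Proper names. Ex: 胡適
- Others (e.g., misprinted words, Ex: 吩付; words not in the dictionary).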

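28 Unknown word modeling-(2)
Use the unknown word model to find irregular unknown words:
- Determine whether an unknown word exists in the predicted region.
- If so, identify which segment is the unknown word.

29 Unknown word modeling-(3)
Determining whether an unknown word exists:

30 Unknown word modeling-(4)
Determining which segment it is: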

31 Result-(1)

32 Result-(2)

