IRF1 What’s different with Chinese in cross-language IR? Jian-Yun Nie University of Montreal, Canada.


IRF2 Outline
- General characteristics of Chinese
- Monolingual IR in Chinese
- CLIR with Chinese
- OOV: an important problem in Chinese IR
- Solutions?

IRF3 1. General characteristics of Chinese
- A sentence is a string of ideograms with no separation between words:
  它是一种适于在拖拉机使用的转向球接头， … (It is a steering ball joint suitable for use on tractors, …)
- Words?
  它 / 是 / 一种 / 适于 / 在 / 拖拉机 / 使用 / 的 / 转向 / 球 / 接头 / ， …

IRF4 Word formation
- Each character can be a word (人: person)
- Most words are composed of two or more characters (人群: mass)
- However:
  - There is no clear definition of the notion of word: 办公楼 (office building) → / 办公楼 / or / 办公 / 楼 /?
  - Manual segmentation is inconsistent
  - Many new words are created (abbreviations), e.g. 网络 (network) 管理员 (administrator) → 网管 (webmaster)

IRF5 2. IR using word segmentation
- Segmentation uses rules, dictionaries and/or statistics
- Problems for information retrieval:
  - Segmentation ambiguity: more than one possible segmentation, e.g. “发展中国家” →
    - 发展中 (developing) / 国家 (country)
    - 发展 (development) / 中 (middle) / 国家 (country)
    - 发展 (development) / 中国 (China) / 家 (family)
  - Different words have similar meanings: 接头 (connector, plug) ↔ 插头 (plug) ↔ 插座 (socket)
  - New words can be formed quite freely:
    - 接 (reception) 桶 (bucket): not a common word, but can be used
    - 网 (network) 店 (store): increasingly used…
    - 的 (of, taxi) 车 (car): taxi car (?), car of (someone)…
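The segmentation ambiguity on this slide can be reproduced with a few lines of code. The following is a minimal sketch, assuming a tiny hand-made lexicon (`LEXICON` is hypothetical, not a real segmentation dictionary) and an exhaustive dictionary lookup; real segmenters add rules or statistics to choose among the alternatives.

```python
def segmentations(text, lexicon, max_len=4):
    """Return every way to split `text` into words that appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, min(max_len, len(text)) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon, assumed for illustration only.
LEXICON = {"发展", "发展中", "中", "中国", "国家", "家"}

for seg in segmentations("发展中国家", LEXICON):
    print(" / ".join(seg))
```

With this toy lexicon the three segmentations listed on the slide are all returned, which is why an index built on a single segmentation can miss matches.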

IRF6 Alternative: n-grams
- Usually unigrams and bigrams
- As effective as using word segmentation
- Account for some flexibility
- However:
  - Noise: non-meaningful combinations
  - Wrong combinations, e.g. 非酿造型啤酒 (non-brewed beer):
    - Words: 非 / 酿造 / 型 / 啤酒
    - Bigrams: 非酿 / 酿造 / 造型 / 型啤 / 啤酒
    - 造型 (style, appearance, …) is a wrong combination; 非酿 and 型啤 are non-meaningful
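Overlapping character n-grams are simple to generate, which is part of their appeal. A minimal sketch follows; the function name `char_ngrams` is chosen here for illustration.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "非酿造型啤酒"
print(" / ".join(char_ngrams(phrase, 1)))  # unigrams
print(" / ".join(char_ngrams(phrase, 2)))  # bigrams
```

The bigram list contains 造型, a real word that has nothing to do with the phrase, which is the kind of noise the slide warns about.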

IRF7 Possible approach: combining words and n-grams (Chinese monolingual IR)
- Example: 前年收入有所下降
  - Word: 前年 / 收入 / 有所 / 下降 or: 前 / 年收入 / 有所 / 下降
  - Unigram: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
  - Bigram: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降
- Score function in language modeling similar to other languages
- Previous results: Word ~ bigram > unigram

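IRF8 Our recent tests: Chinese monolingual IR (Query: Title)

| Collection | W | B | U | WU | BU | 0.3W+0.7U | 0.3B+0.7U | W+B+U |
|---|---|---|---|---|---|---|---|---|
| TREC5 | .2585 | .2698 | .3012 | .3298 | .3074 | .3123 | .3262 | .3273 |
| TREC6 | .3861 | .3628 | .3580 | .4220 | .3897 | .4090 | .3880 | .4068 |
| NTCIR3 | .2609 | .2492 | .2496 | .2606 | .2820 | .2754 | .2840 | .2862 |
| NTCIR4 | .1996 | .2164 | .2371 | .2254 | .2350 | .2431 | .2429 | .2387 |
| NTCIR5 | .2974 | .3151 | .3390 | .3118 | .3246 | .3452 | .3508 | .3470 |
| Average | .2805 | .2827 | .2970 | .3099 | .3077 | .3170 | .3184 | .3212 |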

IRF9 Why is this useful?
- NTCIR 5 Topic 18: 烟草商诉讼赔偿 (tobacco company, suit, compensation)
  - Word: 烟草商 (tobacco company) / 诉讼 (suit) / 赔偿 (compensation)
  - Unigram (0.7659) > Word (0.1625)
  - The relevant documents include the words 烟草 (tobacco), 公司 (company), 业者 (industry operators), 香烟 (cigarettes), 烟商 (tobacco dealers), but none of these can match “烟草商”.
- NTCIR 5 Topic 24: 经济舱综合症候群航班 (economy class, syndrome, flight)
  - Word: 经济 (economy) / 综合症 (syndrome) / 候 (wait) / 航班 (flight)
  - Bigram (0.7607) > Word (0.0002)
  - “..综合症候..” is segmented into “../ 综合症 / 候 /..”, so it cannot match “症候” (syndrome).
- Combining words with unigrams or bigrams helps
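The run labels such as 0.3W+0.7U suggest a weighted sum of a word-based score and a unigram-based score. The following is a minimal sketch of that idea, assuming a Dirichlet-smoothed query-likelihood score and score-level linear interpolation; neither detail is confirmed by the slides. The two documents, their segmentation, and the `mu` value are made up for illustration.

```python
import math
from collections import Counter


def lm_score(query_terms, doc_terms, coll_counts, coll_len, mu=10.0):
    """Dirichlet-smoothed log P(query | document). The smoothing is an assumption."""
    doc_counts = Counter(doc_terms)
    vocab = len(coll_counts) + 1  # +1 reserves probability mass for unseen terms
    score = 0.0
    for term in query_terms:
        p_coll = (coll_counts[term] + 1) / (coll_len + vocab)
        score += math.log((doc_counts[term] + mu * p_coll) / (len(doc_terms) + mu))
    return score


def rank_scores(query_terms, docs):
    """Score every document in `docs` ({doc_id: [terms]}) for one index type."""
    coll_counts = Counter(t for terms in docs.values() for t in terms)
    coll_len = sum(coll_counts.values())
    return {doc_id: lm_score(query_terms, terms, coll_counts, coll_len)
            for doc_id, terms in docs.items()}


def interpolate(scores_a, scores_b, weight_a=0.3):
    """Linear combination weight_a * A + (1 - weight_a) * B over the same documents."""
    return {d: weight_a * scores_a[d] + (1 - weight_a) * scores_b[d] for d in scores_a}


# Hypothetical pre-segmented documents: d1 is on topic but never uses the word 烟草商.
word_docs = {"d1": ["烟商", "败诉", "索赔"], "d2": ["天气", "预报", "晴朗"]}
char_docs = {d: list("".join(words)) for d, words in word_docs.items()}

query_words = ["烟草商", "诉讼", "赔偿"]
query_chars = list("".join(query_words))

word_scores = rank_scores(query_words, word_docs)  # W index
char_scores = rank_scores(query_chars, char_docs)  # U index
combined = interpolate(word_scores, char_scores, weight_a=0.3)

for doc_id in sorted(combined, key=combined.get, reverse=True):
    print(doc_id, round(combined[doc_id], 3))
```

In this toy setting the word index cannot tell the two documents apart, because no query word occurs in either, while the unigram index rewards the shared characters; the interpolated score inherits that preference.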

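IRF10 Also works for Korean and Japanese?
Mean Average Precision (MAP) per run, under rigid and relaxed relevance judgments:

| Run | U Rigid | U Relax | B Rigid | B Relax | W Rigid | W Relax | BU Rigid | BU Relax | WU Rigid | WU Relax | 0.3B+0.7U Rigid | 0.3B+0.7U Relax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C-C-T-N4 | .1929 | .2370 | .1670 | .2065 | .1679 | .2131 | .1928 | .2363 | .1817 | .2269 | .1979 | .2455 |
| C-C-T-N5 | .3302 | .3589 | .2713 | .3300 | .2676 | .3315 | .2974 | .3554 | .3017 | .3537 | .3300 | .3766 |
| J-J-T-N4 | .2377 | .2899 | .2768 | .3670 | - | - | .2807 | .3722 | - | - | .2873 | .3664 |
| J-J-T-N5 | .2376 | .2730 | .2471 | .3273 | - | - | .2705 | .3458 | - | - | .2900 | .3495 |
| K-K-T-N4 | .2004 | .2147 | .3873 | .4195 | - | - | .4084 | .4396 | - | - | .3608 | .3889 |
| K-K-T-N5 | .2603 | .2777 | .3699 | .3996 | - | - | .3865 | .4178 | - | - | .3800 | .4001 |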

IRF11 3. CLIR: query translation
- Machine translation: rules + dictionaries
- Statistical translation model (TM): trained on parallel texts, automatically extracts possible translations
- Comparison:
  - A statistical TM does not produce human-readable translations
  - But it can include related words
  - Usually word-based translation

IRF12 Our recent tests: also translate into n-grams
- Translation units: English word → Chinese word, Chinese unigram, Chinese bigram, bigram & unigram
- Parallel sentence: “history and civilization” || “历史文明” …
- Word-to-bigram: history / and / civilization || 历史 / 史文 / 文明 … → GIZA++ training → TM (word-to-bigram): p(历史|history), p(史文|history), p(文明|history), …
- Word-to-unigram: history / and / civilization || 历 / 史 / 文 / 明 … → GIZA++ training → TM (word-to-unigram): p(历|history), p(史|history), p(文|history), …
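Preparing the training data for such a word-to-n-gram model mostly means re-tokenizing the Chinese side of each sentence pair. A minimal sketch follows, assuming whitespace-tokenized output lines of the kind word aligners are usually trained on; the helper names are hypothetical, and the preprocessing actually used before GIZA++ training is not described on the slide.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def training_pair(english, chinese, n):
    """Build one (source, target) line pair: English words vs. Chinese n-grams."""
    source = " ".join(english.lower().split())
    target = " ".join(char_ngrams(chinese, n))
    return source, target


# The sentence pair shown on the slide.
print(training_pair("history and civilization", "历史文明", 2))  # word-to-bigram
print(training_pair("history and civilization", "历史文明", 1))  # word-to-unigram
```

Because the Chinese side never passes through a segmenter, segmentation errors cannot propagate into the translation model; the cost is that the model must also assign probability to noisy units such as 史文.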

IRF13 Combining different translations
(Diagram: English query → Chinese documents)

IRF14 Bilingual linguistic resources for CLIR
- An English-Chinese parallel corpus mined from the Web: about 281,000 parallel sentence pairs
- LDC English-Chinese bilingual dictionaries: 42,000 entries
- Translation model: combination of the 2 translation models

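IRF15 CLIR results: English → Chinese CLIR

| Collection | W | B | U | WU | BU | 0.3W+0.7U | 0.3B+0.7U |
|---|---|---|---|---|---|---|---|
| TREC5 | .1904 | .2003 | .1922 | .2448 | .2277 | .2158 | .2251 |
| TREC6 | .2047 | .2293 | .2602 | .2670 | .2772 | .2672 | .2822 |
| NTCIR3 | .1288 | .1017 | .1536 | .1628 | .1504 | .1619 | .1495 |
| NTCIR4 | .0956 | .0953 | .1382 | .1410 | .1308 | .1337 | .1286 |
| NTCIR5 | .1158 | .1323 | .1762 | .1532 | .1462 | .1682 | .1602 |
| Average | .1470 | .1518 | .1841 | .1938 | .1865 | .1894 | .1891 |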

IRF16 General observations for Chinese IR
- Use both words and n-grams for Chinese IR and Chinese query translation
- N-grams can account for flexibility in Chinese words
- CLIR with Chinese can also benefit from translations into Chinese n-grams

IRF17 4. OOV problem in Chinese
- OOV (out-of-vocabulary) problem:
  - TREC queries: 63% of named entities are OOV
  - Even more on the Web
  - Specialized terms (abbreviations)
  - New words
  - Impossible to collect all terms manually
- Solutions:
  - Parallel texts (translations by n-grams)
  - Monolingual corpus

IRF18 Translation of named entities
- Statistical transliteration
- Frances Taylor → 弗朗西斯泰勒, 茀琅希思泰勒, 弗郎西丝泰勒, …

IRF20 Candidate extraction: templates
Four templates to extract candidates (c = Chinese character, En = English term):
1. c1 c2 .. cn (En)
2. c1 c2 .. cn, En, c'1 c'2 .. c'm
3. c1 c2 .. cn : En
4. c1 c2 .. cn 是 / 即 En (是 = "is", 即 = "namely")

Comparing the four templates:

| Template | Percentage | Precision |
|---|---|---|
| 1 | 17.65% | 54% |
| 2 | 68.35% | 6.5% |
| 3 | 9.05% | 2.5% |
| 4 | 4.94% | 1% |

Table 2: Comparing Precision of the Four Templates
Template 1 is used in the following experiments.
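Template 1 (a Chinese string immediately followed by an English term in parentheses) maps naturally onto a regular expression. The sketch below is one possible reading, not the extraction code behind these results: the pattern, the `max_len` cutoff, and the sample sentence are assumptions. Since the template does not say where the Chinese term starts, every suffix of the preceding Chinese run is kept as a candidate for a later model to score.

```python
import re

# Chinese run, optional spaces, ASCII or full-width "(", English term, ")".
TEMPLATE_1 = re.compile(
    r"([\u4e00-\u9fff]+)\s*[（(]\s*([A-Za-z][A-Za-z .'\-]*?)\s*[）)]"
)


def extract_candidates(text, max_len=6):
    """Return (english_term, [chinese_candidates]) for each template-1 match."""
    results = []
    for match in TEMPLATE_1.finditer(text):
        run, english = match.group(1), match.group(2)
        longest = min(max_len, len(run))
        # The true left boundary is unknown, so keep every suffix up to max_len.
        candidates = [run[-n:] for n in range(1, longest + 1)]
        results.append((english, candidates))
    return results


# Made-up sentence: "American actress Frances Taylor attended the event".
sample = "美国演员弗朗西斯泰勒 (Frances Taylor) 出席了活动"
for english, candidates in extract_candidates(sample):
    print(english, candidates)
```

Only one of the suffixes is the intended transliteration, which is consistent with the need for a translation or transliteration model to rank the candidate list.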

IRF21 Translation model
- Train a translation model
- Candidate list

IRF22 Dictionary mining results
- Processed more than 300 GB of Chinese web pages
- 161,117 translation pairs mined

| Translation % | Transliteration % | Accuracy % |
|---|---|---|
| 53.55 | 46.45 | 90.15 |

Table 4: Accuracy of Mined Dictionary

IRF23 Coverage of the dictionary on query log data
- 9,065 popular English terms from the MSN Chinese search engine

IRF24 CLIR experiment

IRF25 Conclusions
- In addition to the general approaches, Chinese IR should also consider the characteristics of the language (this also applies to other Asian languages: Japanese and Korean)
- Difficulty in translating new (technical) words and proper names
  - Exploit parallel/comparable or monolingual texts
- Additional problem: making the retrieved documents readable
  - Full-text translation
    - Running sentences in patents: relatively easy
    - Technical terms: may be difficult with Chinese
  - Gisting: a translation assistance tool, useful for a user with some knowledge of the document language
