
1 A Search-Based Chinese Word Segmentation Method (WWW 2007). Xin-Jing Wang, IBM China; Wen Liu, Huazhong Univ., China; Yong Qin, IBM China

2 Introduction. Challenges in Chinese word segmentation (CWS): ambiguity and unknown (out-of-vocabulary, OOV) words. By using the web and search technology, the proposed method is free from the OOV problem, adapts to different segmentation standards, and is entirely unsupervised.

3 The proposed approach: segment collecting. Split the query sentence into sub-sentences by punctuation, submit each sub-sentence to a search engine, and collect the highlighted terms from the returned snippets as candidate segments. Example query: "我明天要去止锚湾玩" ("I am going to Zhimaowan tomorrow to have fun").
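
A minimal sketch of this collecting step, assuming a hypothetical search(sub_sentence) helper that returns snippet strings in which the matched terms are wrapped in <b>...</b> tags, as many web search results are:

    import re
    from collections import Counter

    def collect_segments(query, search):
        """Split a query by punctuation, search each sub-sentence, and
        count the highlighted terms found in the returned snippets."""
        # Split on common Chinese and ASCII punctuation marks.
        sub_sentences = [s for s in re.split(r"[，。！？、；,.!?;]", query) if s]
        segment_counts = Counter()
        for sub in sub_sentences:
            for snippet in search(sub):  # `search` is a hypothetical placeholder
                # Highlighted terms are assumed to appear as <b>...</b>.
                segment_counts.update(t for t in re.findall(r"<b>(.*?)</b>", snippet) if t)
        return segment_counts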

4 The proposed approach: segment scoring. Select a subset of segments as the final segmentation. Frequency-based score: a segment's term frequency, i.e. its number of occurrences divided by the total number of segment occurrences. SVM-based score: an SVM classifier with an RBF kernel whose outputs are mapped into probabilities that serve as scores. The query is then reconstructed using the segmentation with the highest score.
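
A rough illustration of the frequency-based score, following the slide's description (a segment's occurrences divided by the total number of segment occurrences), not necessarily the paper's exact formula:

    def frequency_scores(segment_counts):
        """Score each collected segment by its share of all occurrences."""
        total = sum(segment_counts.values())
        return {seg: count / total for seg, count in segment_counts.items()} if total else {}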

5 The proposed approach: segment selecting. A subset of segments is valid if its members can reconstruct the query exactly; the score of a valid subset is the average score of its member segments. For efficiency, a greedy search is used to find valid subsets, and the valid subset with the highest score is selected as the final segmentation.
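
A sketch of the selection step over the scored segments; this left-to-right greedy variant is one plausible reading of the slide, not necessarily the authors' exact search procedure:

    def greedy_segment(query, scores):
        """Greedily rebuild the query from scored segments, preferring the
        highest-scoring segment that matches at each position and falling
        back to a single character when nothing matches."""
        result, i = [], 0
        while i < len(query):
            candidates = [s for s in scores if s and query.startswith(s, i)]
            best = max(candidates, key=scores.get) if candidates else query[i]
            result.append(best)
            i += len(best)
        return result

Chaining the three sketches gives, for example, greedy_segment(query, frequency_scores(collect_segments(query, search))).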

6 Evaluations: experiment setting. The SVM-based score uses a training set of 3000 randomly selected sentences and a three-dimensional feature space (TF, DF, LEN): TF is the term frequency, DF is the number of documents indexed by a segment, and LEN is the number of characters in a segment. The frequency-based score needs no training set.
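
A minimal sketch of the SVM-based scorer using scikit-learn, with hypothetical toy (TF, DF, LEN) rows and labels standing in for features derived from the 3000 labeled sentences:

    from sklearn.svm import SVC

    # Hypothetical toy data: each row is [TF, DF, LEN]; label 1 = good segment, 0 = not.
    X_train = [[120, 45, 2], [300, 150, 2], [80, 30, 3], [500, 260, 2], [60, 25, 4],
               [3, 1, 5], [1, 1, 6], [2, 2, 7], [4, 1, 1], [2, 1, 8]]
    y_train = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

    clf = SVC(kernel="rbf", probability=True)   # RBF kernel, probability outputs
    clf.fit(X_train, y_train)

    # The positive-class probability serves as the segment's score.
    segment_score = clf.predict_proba([[50, 20, 3]])[0][1]

In practice the features would likely be scaled before training an RBF-kernel SVM.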

7 Evaluations: comparison results on SIGHAN’05.

8 Evaluations. The results are worse than those reported in the literature. Why is the SVM-based score worse? The feature space is too simple. Advantages: only 3000 training sentences (or none at all) are needed, and the OOV problem is avoided. Better performance can be achieved when more search results are provided (Google + Yahoo!).

9 Evaluations: comparison to the IBM full parser.

10 Conclusion. The method is good at discovering new words (no OOV problem) and at adapting to different segmentation standards, and it is entirely unsupervised, which saves the labor of labeling training data. Future work: finding more effective scoring methods, and combining the current approach with other types of segmentation methods for better performance.

11 My ongoing work: Discriminative Reranking (ACL 07 & 03). 1. Michael Collins and Terry Koo; 2. Zhongqiang Huang, Purdue Univ.

12 Background. Discriminative reranking has been applied to many NLP applications (NER, parsing, sentence boundary detection), but it has not yet been tried on POS tagging. Motivation: (1) rerank the output of an existing probabilistic tagger; (2) the base tagger produces a set of candidate tag sequences for each sentence; (3) a second model attempts to improve upon this initial ranking using additional features.
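
A sketch of the two-stage setup, assuming the base tagger returns (tag_sequence, log_probability) pairs and that extract_features and weights are placeholders for the reranker's feature extractor and learned weights:

    def rerank(candidates, extract_features, weights, w0=1.0):
        """candidates: list of (tag_sequence, log_prob) pairs from the base tagger.
        Returns the candidate with the highest combined score."""
        def score(candidate):
            tags, log_prob = candidate
            feats = extract_features(tags)  # binary indicator features that fire
            return w0 * log_prob + sum(weights.get(f, 0.0) for f in feats)
        return max(candidates, key=score)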

13 Collins' Reranking Algorithm: training the reranker. There are n training sentences, each with n_i candidates, along with the log-probability produced by the HMM tagger. A "goodness" score measures the similarity between each candidate and the gold reference.

14 Collins' Reranking Algorithm. The training data consists of a set of examples, each with a "goodness" score and a log-probability.

15 Collins' Reranking Algorithm. A set of indicator functions extracts binary features from each example, and each indicator function is associated with a real-valued weight parameter.
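
For concreteness, one such indicator function might be a tag n-gram feature (a hypothetical example, not taken from the slides):

    h_s(x_{i,j}) = 1 if the candidate tag sequence x_{i,j} contains the tag bigram "NN VV", and 0 otherwise,

with an associated real-valued weight \alpha_s.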

16 Collins' Reranking Algorithm. The ranking function combines the base tagger's log-probability with the weighted indicator features, and training sets the weights to minimize a loss over the training examples (see the reconstruction below).
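
The formulas on this slide were not preserved in the transcript; a plausible reconstruction, following Collins and Koo's exponential-loss formulation, is:

    F(x_{i,j}) = \alpha_0 L(x_{i,j}) + \sum_{s=1}^{m} \alpha_s h_s(x_{i,j})

where L(x_{i,j}) is the log-probability the base HMM tagger assigns to the j-th candidate of the i-th sentence, and the weights \bar{\alpha} are set to minimize the exponential loss

    \mathrm{ExpLoss}(\bar{\alpha}) = \sum_{i=1}^{n} \sum_{j=2}^{n_i} \exp\!\big( -(F(x_{i,1}) - F(x_{i,j})) \big)

where x_{i,1} denotes the candidate with the highest "goodness" score for sentence i.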

17 Experiments. The base model is an HMM tagger. Data set: the most recently released Penn Chinese Treebank 5.2 (denoted CTB, released by LDC), with 33 POS tags, 500K words, 800K characters, and 18K sentences.

18 Experiments. The data is divided into 20 chunks, and each chunk is N-best tagged by the HMM model trained on the combination of the other 19 chunks.
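
A sketch of this jackknife setup, with hypothetical train_hmm and nbest_tag helpers standing in for the actual HMM tagger:

    def jackknife_nbest(sentences, train_hmm, nbest_tag, k=20, n_best=100):
        """Split the corpus into k chunks; tag each chunk with an HMM trained on the
        other k-1 chunks, yielding N-best candidate lists for reranker training."""
        chunk_size = (len(sentences) + k - 1) // k
        chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
        candidates = []
        for held_out_index, held_out in enumerate(chunks):
            train = [s for j, chunk in enumerate(chunks) if j != held_out_index for s in chunk]
            model = train_hmm(train)                          # hypothetical trainer
            candidates += [nbest_tag(model, s, n_best) for s in held_out]
        return candidates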

19 Experiments: results of the reranking models, with N-gram features and with N-gram + morphological features.

20 Conclusion. The reranking method is effective on the POS tagging task. Future work: extract additional reranking features that utilize more explicitly the characteristics of Mandarin, and explore semi-supervised training methods for reranking.

