Re-organization of IR/CSC team

Presentation transcript:

1 Re-organization of IR/CSC team
- Hongchao He: Conf. follow up TREC-10, NTCIR; Paper follow up ICCLP, SIGIR paper
- Guihong Cao: MSKK-III – Clustering for technique transfer
- Yang Wen: MSKK-III – Distance word dependency
- Min Zhang: MSKK/CSC – Entropy based pruning for applications of (Pinyin/Hiragana) input system

2 Chinese Spelling Checking (or, the Big CSC)
Jianfeng Gao, NLC Group, MSRCN

3 Outline
- Introduction
- Chinese spelling checking
- Our approach
- Key techniques and experiments
- Milestones

4 Introduction
- Chinese spelling errors using MS-Pinyin input system
- Chinese spelling error patterns
- English spelling checking
- Why is CSC difficult?
- Goal: Automatically correct Chinese spelling errors made when using the MS-Pinyin (MSPY) input system

5 Chinese spelling errors using MSPY
- Typing pipeline: Text in the brain -> Syllable -> Key stroke (Typing) -> Converted text
- Pinyin (phonetic) errors
- Typographic errors
- System errors

6 Chinese spelling error patterns
- Substitution errors
  - Pinyin error
  - System error (includes Pinyin error in some systems)
- Non-substitution errors: word segmentation errors
  - Typographic errors – insertion/deletion/transposition

7 English spelling checking
- Non-word error detection (the hte)
  - N-gram (letter) analysis
  - Dictionary lookup
- Real-word error detection (from form)
  - NLP – parser driven
  - Statistical approach – data/error driven
    - Local – n-gram language model, depends on a pre-defined confusion set
    - Global – Winnow, Bayesian, TBL, etc.
- Problem – lack of error detection
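The difference between the two error types on this slide can be illustrated with a minimal dictionary-lookup sketch. The lexicon and the sentence below are toy assumptions chosen to reuse the slide's own examples ("the/hte" and "from/form"); `find_non_words` is a hypothetical helper, not part of any system described in the deck.

```python
def find_non_words(tokens, lexicon):
    """Return the tokens that are missing from the lexicon (non-word errors)."""
    return [t for t in tokens if t.lower() not in lexicon]

# Toy lexicon, purely illustrative.
LEXICON = {"the", "letter", "came", "from", "form", "bank"}

# "hte" is a non-word error; "form" (typed instead of "from") is a real-word error.
sentence = "hte letter came form the bank".split()
flagged = find_non_words(sentence, LEXICON)
print(flagged)  # ['hte']
```

Dictionary lookup flags "hte" but passes "form", because "form" is itself a valid word. Catching it needs context, which is what the real-word methods listed on the slide address.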

8 Why is CSC difficult?
- Word segmentation
  - Ambiguous
  - OOV – Proper noun detection (personal name, location, organization, etc.)
  - Segmentation error propagation
- Non-word errors (in the English sense) do not exist
- MSPY makes good use of a word trigram language model

9 Chinese spelling checking
- CSC – related work
  - Template matching – long distance, e.g.
  - Pattern matching – long words (n>=3), e.g.,
  - N-gram models – substitution errors
- CSC – challenges
  - Long distance, coverage issue of template/pattern set
  - High-frequency confusion sets, e.g. { } { }
  - OOV, especially proper nouns
  - N-gram has been fully used by MSPY

10 Chinese spelling error patterns in MSPY
- Proper noun
  - Personal name
  - Location
  - Organization
- Non-word errors: context independent
  - Insertion/deletion/transposition/substitution
  - E.g.,
- Real-word errors: context sensitive
  - E.g.,

11 Flowchart of our approach
- Text with errors
- Word segmentation
- Non-word error correction
- Real-word error correction
- Proper noun detection
- Word fuzzy matching
- Trigger: single char string, low prob
- Context sensitive disambiguation
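One way to read the "single char string" trigger is that a misspelled word tends to break into a run of single-character tokens after segmentation. The sketch below is only that reading: `single_char_runs` and the `min_run=2` threshold are assumptions for illustration, and the low-probability trigger is not covered.

```python
def single_char_runs(tokens, min_run=2):
    """Return (start, end) index pairs for runs of at least min_run
    consecutive single-character tokens (end is exclusive)."""
    spans, start = [], None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a trailing run
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                spans.append((start, i))
            start = None
    return spans

# Hypothetical segmenter output where an intended three-character word
# was mistyped and fell apart into single characters.
tokens = ["我们", "使用", "计", "算", "鸡", "工作"]
print(single_char_runs(tokens))  # [(2, 5)]
```

Each returned span marks a region that could be passed on to the fuzzy matching step.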

12 Word segmentation and proper noun detection
- Language model based word segmentation
- Class-based language model
  - $P(W) = P_{\text{outside}}(W)\, P_{\text{inside}}^{a}(W \mid \ )$, $a = ?$
- Outside probability – PN tagged training data
  - Using NLPWIN to tag the corpus
  - Filtering, rule base
  - EM?
- Inside probability – PN list training data
- Using cache (or, dynamic dictionary)
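Language model based word segmentation can be sketched as a Viterbi search over all ways of splitting the character string. The example below deliberately uses a plain unigram model with made-up probabilities, not the class-based model or the trigram model mentioned in the slides; `segment`, `max_len` and `oov_logprob` are hypothetical names introduced for illustration.

```python
import math

def segment(text, unigram_logprob, max_len=4, oov_logprob=-20.0):
    """Viterbi segmentation maximizing the sum of unigram log-probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in unigram_logprob:
                score = unigram_logprob[word]
            elif len(word) == 1:
                score = oov_logprob  # fallback for unknown single characters
            else:
                continue             # unknown multi-character strings are skipped
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    words, i = [], n
    while i > 0:                     # follow back-pointers from the end
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy probabilities, assumed purely for illustration.
model = {w: math.log(p) for w, p in {
    "研究": 0.02, "研究生": 0.005, "生命": 0.01,
    "命": 0.001, "的": 0.05, "起源": 0.01,
}.items()}

print(segment("研究生命的起源", model))  # ['研究', '生命', '的', '起源']
```

The input is ambiguous between "研究 / 生命" and "研究生 / 命", and the model score resolves it. A proper-noun class would add candidate spans scored by the outside and inside probabilities instead of a fixed lexicon entry.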

13 Experiments and Findings
- Measure: precision/recall – definition
- Training data – People's Daily
- Tag tool – NLPWIN
- Test data – spec.
- Results and Findings
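The slide leaves the precision/recall definition open. The sketch below uses the standard definitions (correct predictions divided by all predictions, and correct predictions divided by all gold items), applied to proper-noun spans. The `(start, end, label)` span format and the sample data are assumptions for illustration.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) over two collections of hashable items,
    e.g. (start, end, label) proper-noun spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical spans: the system finds 3 proper nouns, 2 of which are correct,
# while the reference annotation contains 4.
gold = {(0, 2, "PER"), (5, 7, "LOC"), (9, 12, "ORG"), (14, 16, "PER")}
predicted = {(0, 2, "PER"), (5, 7, "LOC"), (20, 22, "LOC")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```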

14 Long word fuzzy matching
- Definition of Distance(s1, s2)
  - Long word, n>=3
  - Sum of delete/insert/substitute operations on a character
- Fast fuzzy matching
  - Global – Lei Zhang's ACL paper
  - Local – trigger (single char, or low n-gram probability)
- Search – error detection/correction
  - Viterbi
  - Simplified version
  - Long word + Local matching
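Distance(s1, s2), read as the number of character deletions, insertions and substitutions, corresponds to the classic Levenshtein distance. The sketch below pairs it with a brute-force scan of a toy lexicon restricted to long words (n>=3). The scan is not the fast matching the slide refers to, and `fuzzy_match`, `max_dist` and the lexicon are assumptions for illustration.

```python
def distance(s1, s2):
    """Levenshtein distance: minimum number of single-character deletions,
    insertions and substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                # delete c1
                           cur[j - 1] + 1,             # insert c2
                           prev[j - 1] + (c1 != c2)))  # substitute (or match)
        prev = cur
    return prev[-1]

def fuzzy_match(candidate, lexicon, max_dist=1, min_len=3):
    """Return long lexicon words within max_dist edits of candidate,
    excluding exact matches, nearest first."""
    hits = [(distance(candidate, w), w) for w in lexicon if len(w) >= min_len]
    return [w for d, w in sorted(hits) if 0 < d <= max_dist]

# Toy lexicon; "计算鸡" stands in for a mistyped three-character word.
lexicon = ["计算机", "计算器", "研究生", "北京"]
print(distance("计算鸡", "计算机"))              # 1
print(set(fuzzy_match("计算鸡", lexicon)))
```

Both "计算机" and "计算器" are one substitution away, so ranking the candidates would need the language model or context, which is where the search step on the slide comes in.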

15 Experiments and Findings
- Contact: 100 people, 3000 to 5000 characters per person
- Error analysis
- Algorithm …
- Measure: precision/recall
- Large lexicon, acquisition.
- Trigger/threshold?
- Results and Findings

16 Context sensitive disambiguation
- Building confusion set – specific to MSPY
- Feature selection – Context vector
  - Collocation – contiguous POS or words/characters
  - Context words – words/characters within a K-size window
  - Triple?
- Weighting scheme and Classifier
  - Context Vector, TFIDF
  - Winnow, Bayesian, TBL, etc.
- Scaling up
  - Enlarge confusion set
  - Feature pruning
  - Adaptation
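A minimal sketch of the context-word idea, using a naive Bayes classifier (one of the "Bayesian" options named on the slide) with add-one smoothing. It uses the English confusion set {from, form} from the earlier slide for readability; the class name, the training sentences and the window size `k=2` are assumptions for illustration, and collocation features are not included.

```python
import math
from collections import Counter, defaultdict

def context_words(tokens, index, k=2):
    """Words within a k-size window around tokens[index], excluding the target."""
    return tokens[max(0, index - k):index] + tokens[index + 1:index + 1 + k]

class NaiveBayesDisambiguator:
    def __init__(self, confusion_set):
        self.confusion_set = list(confusion_set)
        self.word_counts = Counter()                 # occurrences of each member
        self.feature_counts = defaultdict(Counter)   # member -> context word counts
        self.vocab = set()

    def train(self, sentences, k=2):
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.word_counts[tok] += 1
                    for feat in context_words(tokens, i, k):
                        self.feature_counts[tok][feat] += 1
                        self.vocab.add(feat)

    def predict(self, tokens, index, k=2):
        """Pick the confusion-set member that best fits the context."""
        total = sum(self.word_counts.values())

        def score(word):
            n = sum(self.feature_counts[word].values())
            s = math.log((self.word_counts[word] + 1) / (total + len(self.confusion_set)))
            for feat in context_words(tokens, index, k):
                # Add-one smoothing, with one extra slot for unseen context words.
                s += math.log((self.feature_counts[word][feat] + 1) / (n + len(self.vocab) + 1))
            return s

        return max(self.confusion_set, key=score)

clf = NaiveBayesDisambiguator(["from", "form"])
clf.train([s.split() for s in [
    "a letter from the bank", "he came from the city",
    "fill in the form today", "sign the form today",
]])
print(clf.predict("a note form the bank".split(), 2))  # from
```

If the predicted word differs from the word actually typed, that position is a candidate real-word error. Scaling up, as the slide notes, means larger confusion sets and pruning the feature space.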

17 Experiments and Findings
- Measure: precision/recall
- Training data
- Test data (XXX confusion set)
- Results and Findings

18 Experiments and Findings
- Current Work
  - Pseudo-training set based on MSPY IME
  - Preliminary data processing (400M PD)
  - Unigram error model (10,000 words useful)
  - /69484 /10289 /2394 ……
  - Trigram error pattern (980,000 useful)
  - [ ] => / [ ] =>
  - Experiments based on basic approaches
  - Pseudo-test set from
  - Continuous pair (Recall = 50%, Precision = 25%)
  - Pattern Matching (??)
- Future Work
  - Hybrid approaches
  - Pattern Clustering + Continuous pair
  - Functional words error detection
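One plausible way to collect a unigram error model from a pseudo-training set is to compare the original text with what the input method produces for the same text and count the mismatches. The sketch below assumes the two word sequences are already aligned one-to-one, which the slides do not state; `unigram_error_counts` and the sample words are hypothetical.

```python
from collections import Counter, defaultdict

def unigram_error_counts(original_words, converted_words):
    """Count, for each wrongly produced word, which intended words it replaced.

    Assumes a one-to-one alignment between the two sequences (a simplification).
    Returns {produced_word: Counter({intended_word: count})}.
    """
    if len(original_words) != len(converted_words):
        raise ValueError("this sketch assumes one-to-one word alignment")
    counts = defaultdict(Counter)
    for intended, produced in zip(original_words, converted_words):
        if intended != produced:
            counts[produced][intended] += 1
    return counts

# Hypothetical homophone confusions in the converted text.
original = ["他", "在", "北京", "工作"]
converted = ["她", "在", "背景", "工作"]

errors = unigram_error_counts(original, converted)
print(dict(errors["背景"]))  # {'北京': 1}
```

Indexing by the produced word gives, for any word seen in typed text, the intended words it most often replaced, which is the lookup a corrector needs.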

19 System evaluation – put it all together
- Evaluation toolset
- Measure: precision/recall
- Training data
- Test data
- Results and Findings

20 Prototype
- Demo …
- Online & offline CSC
- Right click
- Spelling error detection/correction
- Proper noun detection/correction

21 Assignment
- Jianfeng Gao – overall, fuzzy matching
- Mu Li – context sensitive disambiguation
- Jian Sun – PN detection
- Yang Wen – system evaluation
- Yulin Kang – demo
- Lei Zhang – senior consultant

22 Milestones
- Oct. 2001, Ming says Yes (TAB demo)
- Dec. 2001, Dong says Yes (Transfer)
- Aug. 2002, HJ says Yes (Party)

23 Information
- Access at \\msrcn4p3\rootD\gaojf\spell
- Contact me if any problems
- Jianfeng Gao, Tel: 86-10-62617711-5778, Email: jfgao@microsoft.com

