
1 Jing-Shin Chang — Word Segmentation Models
 Overview
 Chinese Words, Morphemes and Compounds
 Word Segmentation Problems
 Heuristic Approaches
 Unknown Word Problems
 Probabilistic Model: supervised mode / unsupervised mode
 Dynamic Programming
 Unsupervised Model for Identifying New Words

2 English Text with Well-Delimited Word Boundaries
 (Computer Manual) For information about installation, see Microsoft Word Getting Started. To choose a command from a menu, point to a menu name and click the left mouse button ( 滑鼠左鍵 ). For example, point to the File menu and click to display the File commands. If a command name is followed by an ellipsis, a dialog box ( 對話框 ) appears so you can set the options you want. You can also change the shortcut keys ( 快捷鍵 ) assigned to commands. (Microsoft Word User Guide)
 (1996/10/29 CNN) Microsoft Corp. announced a major restructuring Tuesday that creates two worldwide product groups and shuffles the top ranks of senior management. Under the fourth realignment..., the company will separate its consumer products from its business applications, creating a Platforms and Applications group and an Interactive Media group.... Nathan Myhrvold, who also co-managed the Applications and Content group, was named to the newly created position of chief technology officer.

3 Chinese Text without Well-Delimited Word Boundaries
 China Times 1997/7/26:
 台經院指出,隨著股市活絡與景氣回溫,第一季車輛及零件營業額成長十六.八一%,顯示民間需求回升。再加上為加入WTO,開放進口已是時勢所趨,也將帶動消費成長。台經院預測今年民間消費全年成長率可提昇至六.七四%。
 在投資方面,第一季國內投資出現回升走勢,固定資本形成實質增加六.五六%,其中民間投資實質增加八.九五%。在持續有民間大型投資計畫進行、國內房市回溫、與政府開放投資、加速執行公共工程等多項因素下,預測今年全年民間投資將成長十一.八%。
 台經院表示,口蹄疫連鎖效應在第二季顯現,使第二季出口貿易成長率比預期低,出口年增率二.一%,比去年低。而進口年增率為七.三八%,因此第二季貿易出超僅十七.一四億美元,比去年第二季減少四十三.六五%。不過,由於第三、四季為出口旺季,加上國際組織均預測今年世界貿易量擴大,台經院認為我國商品出口應可轉趨順暢。

4 Example: Word Segmentation [Chang 97]
 Input: 移送台中少年法庭審理
 Seg1*: 移送 / 台中少年 / 法庭審理
 Seg2*: 移送 / 台中 / 少年 / 法庭審理
 Seg3 : 移送 / 台中 / 少年法庭 / 審理
 Successively better segmentations with an unsupervised approach ([Chang 97])
 Input: 土地公有政策
 Seg1 : 土地 / 公有 / 政策
 Seg2*: 土地公 / 有 / 政策
 Longest-match problem + unknown-word problem

5 Example: Word Segmentation
 Input: 修憲亂成一團結果什麼也沒得到
 Output: 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到
 mis-merge problem

6 Why Word Segmentation
 Word is the natural unit for natural language analysis and NLP applications.
 Tricky output may result if tokenization is not carefully conducted.
 Tokenization is the first step in most NLP applications.
 e.g., using character bi-grams as the indexing keys (i.e., representatives of documents) in search-engine design and other similarity-based information retrieval tasks

7 Word Segmentation Problems in a Basic IR System
 Information Sources & Acquisition
 Web Pages: web robots copy all pages of interested or registered sites to local storage
 News Groups: a news server accepts postings to the news groups
 BBS Articles: a BBS server administers posting of BBS articles
 IntraNet documents: shared through local LANs
 Document Conversion & Normalization
 HTML to text, etc.
 Indexing System
 identify features of documents & keep a representative signature for each document
 Searching System
 convert a query into a representative signature
 compare the signature of the input query to the signatures of archived documents
 rank the relevant documents by similarity

8 Basic Indexing Techniques & WS Problems
 Vector Space Approach
 document (or query) as a vector of term frequencies (or variants of frequencies)
 compare the query vector against document vectors for similarity & relevance
 Problems (quick but dirty)
 depends on word frequencies only (not even compound words)
 independent of word order (no structural or syntactic information)
 simple-minded query functions (user requirements not satisfied): keyword matching (exact or fuzzy); logical operators (AND, OR, NOT); near/quasi natural-language query
 Chinese-specific problems: weird output due to unsegmented input
 indexed with character 2-grams (not by words)
 資訊月 => 資訊月刊
 島內頻寬升級為 1GB vs. 黨內頻喊換閣揆
 錄音帶內容 … 尹清峰頻頻說 : “…”
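The character-2-gram pitfall above can be demonstrated with a few lines of Python. This is a minimal sketch: `char_bigrams` is a hypothetical helper, and the two strings are the slide's own 資訊月 / 資訊月刊 example, where every bigram of the query also occurs in an unrelated document.

```python
def char_bigrams(text):
    """Character bigrams used as indexing keys when no word segmentation is done."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

query = "資訊月"    # "Information Month" (an event)
doc = "資訊月刊"    # "information monthly" (a magazine) -- unrelated meaning

overlap = char_bigrams(query) & char_bigrams(doc)
# Both query bigrams ('資訊', '訊月') also occur in the unrelated document,
# so a bigram index scores it as a perfect match for the query.
print(overlap)
```

Indexing by properly segmented words ( 資訊月 vs. 資訊 / 月刊 ) would avoid this spurious match, which is why segmentation quality matters for IR.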

9 Heuristic Approaches
 Matching Against a Lexicon
 scan left-to-right or right-to-left
 Heuristic Matching Criteria
 (1) Longest (Maximal) Match: select the longest sub-string on multiple matches
 (2) Minimum Number of Matches: select the segmentation pattern with the smallest number of words
 Greedy Method, Hard Rejection
 skip over the matched lexicon entry and repeat matching, regardless of whether there are embedded or overlapping word candidates in the current matched word
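The greedy longest-match criterion can be sketched in a few lines of Python. The function name, the toy lexicon, and `max_len` are illustrative assumptions; the input is the slide 4 example, which reproduces the 搶詞 (word-grabbing) failure described there.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy left-to-right maximum matching: at each position take the
    longest dictionary word starting there; fall back to one character."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if length == 1 or text[i:i + length] in lexicon:
                segments.append(text[i:i + length])
                i += length
                break
    return segments

lexicon = {"土地", "土地公", "公有", "政策", "有"}
print(longest_match_segment("土地公有政策", lexicon))
# ['土地公', '有', '政策'] -- the longest match swallows '公',
# missing the intended 土地 / 公有 / 政策
```

The hard decision at position 0 (choosing 土地公 over 土地) can never be revisited, illustrating why the next slide calls this "hard rejection".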

10 Heuristic Approaches
 Problems
 hard decision: skips over a possible match if it was covered by a previous match (impossible to recover based on more evidence)
 i.e., p(w) = 1 or 0 for any word ‘w’, depending unconditionally on whether it was covered by a previously matched word
 less contextual constraint: depends on the local match, not on the full context; not jointly optimized
 cannot handle the unknown-word problem: words not registered in the dictionary will not be handled gracefully (e.g., new compound words, proper names, numbers)
 Advantages
 simple and easy to implement
 only needs a large dictionary
 needs no training corpora for estimating probabilities

11 Problems with Segmentation Using Known Words
 Incomplete Error Recovery Capability
 Two types of segmentation errors due to unknown-word problems:
 Over-segmentation: split unknown words into short segments
 e.g., single-character regions: ‘修憲’ => ‘修 憲’; 分析家 對 馬來西亞 的 預測
 Under-segmentation: prefer long segments when combining segments ( 搶詞問題 )
 e.g., ‘土地 公有 政策’ =WS error (‘公有’ unknown)=> ‘土地公 有 政策’ =Merge=> ‘土地公有’, ‘有政策’ (NOT: ‘土地’, ‘公有’, ‘政策’)
 團結 : mis-merge => 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到
 A MERGE operation can ONLY recover over-split candidates, NOT over-merged (under-segmented) candidates

12 Problems with Segmentation Using Known Words
 Using known words for segmentation without considering potential unknown words (zero word probability for unknown words)
 cannot take advantage of contextual constraints over unknown words to get the desired segmentation
 millions of randomly merged unknown-word candidates to filter
 (- 省都委會 :) 獲省都委會同意 => 獲 省 都 委 會同 意
 => 省都 | 省都委 | 省都委會 | 都委 | 都委會同 | 委會同 | 委會同意
 (+ 省都委會 :) 獲省都委會同意 => 獲 省都委會 同意
 an extra disambiguation step for resolving overlapping candidates
 e.g., 省都 vs. 省都委會 (etc.)
 e.g., 彰化 縣 警 刑警隊 少年組

13 Probabilistic Models
 Find all possible segmentation patterns and select the best one according to a scoring function.
 Advantages of Probabilistic Models:
 Soft decision: retain all segmentation possibilities, without pre-excluding any, and select the best by a scoring function that maximizes the joint likelihood of the segmentation
 Take contextual constraints into account to maximize the likelihood of the whole segmentation pattern
 all words in a segmentation pattern impose constraints on neighboring words
 the segmentation pattern that best fits such constraints (or criteria) is selected
 Unsupervised training is possible even when there is no dictionary, or only a small seed dictionary or seed segmentation corpus
 many probabilistic optimization criteria can be maximized by iteratively trying possible segmentations and re-estimating in known ways
 e.g., EM & Viterbi training

14 Probabilistic Model
 Basic Model:
 Word Uni-gram Model [Chang 1991]: jointly optimize the likelihood of the segmentation as the product of the probabilities of the constituent words in the segmentation pattern
 Dynamic Programming: fast search for the best segmentation even though there is a vast number of possible segmentation patterns
 Other Models: [Chiang et al., 1992]
 take parts-of-speech ( 詞類 ) and morphological ( 詞素 ) features into account
 take simple, yet probably useful, features such as length distribution into account
 take unknown words into consideration

15 Word Uni-gram Model for Identifying Words
 Segmentation Stage: find the best segmentation pattern S* which maximizes the likelihood function of the input corpus
 c_1^n : input characters c_1, c_2, ..., c_n
 S_j : j-th segmentation pattern, consisting of { w_j,1, w_j,2, ..., w_j,mj }
 V(t) : vocabulary (n-grams in the augmented dictionary used for segmentation)
 S*(V) : the best segmentation (a function of V)
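The likelihood function on this slide was an image not preserved in the transcript. Under the word uni-gram model and the notation defined above, it plausibly takes the standard form (a reconstruction, not the original formula):

```latex
S^{*}(V) \;=\; \arg\max_{S_j} \, P(S_j \mid c_1^{n})
        \;\approx\; \arg\max_{S_j} \prod_{i=1}^{m_j} P(w_{j,i}),
\qquad w_{j,i} \in V(t)
```

That is, the best segmentation is the one whose constituent words, taken independently, have the largest product of probabilities.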

16 Dynamic Programming (DP)
 A methodology for finding the best solution without explicitly enumerating all candidates in the solution space
 Solve the optimization problem of the whole task by first solving much simpler sub-problems, whose solutions do not depend on the large number of combinations of the remaining parts of the whole problem
 this virtually reduces the large solution space to a very small one
 Solve successively larger sub-problems after the simpler ones are solved, and finally solve the optimization problem of the whole task
 Requirement
 the optimum solution of a sub-problem must not depend on the remaining parts of the whole problem

17 Dynamic Programming (DP) Steps
 Initialization
 initialize known path scores
 Recursion
 find the best previous local path, assuming the current node is one of the nodes on the best path, by comparing sums of local and accumulated scores
 keep the trace of the best previous path, and the accumulated score of this best path
 Termination
 Path Backtracking
 trace back the best path

18 Dynamic Programming (DP) Examples
 shortest-path problem
 speech recognition: DTW (Dynamic Time Warping)
 minimum alignment cost between an input speech feature vector and the feature vector of the typical utterance of a word
 speech-to-speech distance measure
 speech-text alignment
 align words in a speech waveform with the written transcription
 an extension of isolated-word recognition using DTW
 speech-to-phonetic transcription
 spelling correction
 minimum editing cost between an input string and a reference pattern (e.g., a dictionary word)
 editing operations: insertion, deletion, substitution (including matching)
 advanced operations: swapping
 post-editing cost
 the cost required to modify machine-translated text into a fluent translation
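The spelling-correction example above (minimum editing cost with insertion, deletion, and substitution) is the classic Levenshtein-distance DP, sketched here in Python; the function name and unit costs are illustrative assumptions.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions
    turning s into t (Levenshtein distance), computed bottom-up:
    d[i][j] = cost of editing s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                        # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

Each cell depends only on its three already-computed neighbors, which is exactly the DP requirement stated on slide 16: the sub-problem's optimum does not depend on the remaining parts of the whole problem.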

19 Dynamic Programming (DP) Examples
 Bilingual Text Alignment
 find corresponding sentences in a parallel bilingual corpus
 based on sentence-length distributions (in words or in characters)
 Word Correspondence, Translation Equivalents, Bilingual Collocations ( 連語 )
 find corresponding words in aligned sentences of bilingual corpora
 word-association metrics as the distance between matching words: anything that indicates the degree of (in-)dependency between word pairs can be used for this purpose (to be addressed in later chapters)
 Machine Translation

20 Application of DP: Feedback Control via a Parameterized System

21 Application of DP: Feedback-Controlled Parameterized MT Architecture
 Metrics for error distance
 (i) Levenshtein distance
 (ii) e(t-1, i, j) = log P( T(i)* | S(i) ) - log P( T(i,j) | S(i) )

22 Dynamic Programming (DP) for Finding the Best Word Segmentation
 Ex. 國民大會代表人民行使職權 (c1, c2, ..., cN)
 Scan all character boundaries left-to-right.
 For each word boundary with index ‘idx’, assuming idx is one of the best segmentation boundaries, the best previous segmentation boundary idx_best can be found by:
 idx_best = argmax { accumulative_score(0, j) x d(j, idx) }
 over all j = idx-1 down to max(idx - k, 0) (k: maximum word length, in characters)
 d(j, idx) = Prob(c[j+1...idx]) (the probability that c[j+1...idx] forms a word)
 initialization: accumulative_score(0, 0) = 1.0
 update: accumulative_score(0, idx) = accumulative_score(0, idx_best) x d(idx_best, idx)
 After scanning all word boundaries and finding all (assumed) best previous word boundaries, trace back from the end (which is surely one of the best word boundaries) to get the real best segments.
 right-to-left scanning is virtually identical to the above steps
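The recursion above can be sketched directly in Python. One deviation from the slide: scores are kept in the log domain (sums instead of products) to avoid numerical underflow on long inputs. The function name, the toy lexicon, and its probabilities are illustrative assumptions, applied to the slide's own example sentence.

```python
import math

def dp_segment(text, word_prob, max_len=4):
    """Best segmentation under a word uni-gram model, by DP.
    score[idx] = log-probability of the best segmentation of text[:idx];
    back[idx] = the best previous boundary (the slide's idx_best)."""
    n = len(text)
    score = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    score[0] = 0.0                       # empty prefix: probability 1
    for idx in range(1, n + 1):
        for j in range(max(idx - max_len, 0), idx):
            w = text[j:idx]
            if w in word_prob:           # d(j, idx) = Prob(c[j+1...idx])
                s = score[j] + math.log(word_prob[w])
                if s > score[idx]:
                    score[idx], back[idx] = s, j
    # trace back the best path from the end of the sentence
    words, idx = [], n
    while idx > 0:
        words.append(text[back[idx]:idx])
        idx = back[idx]
    return words[::-1]

# Hypothetical word probabilities for the slide's example sentence
word_prob = {"國民": .02, "大會": .02, "國民大會": .01, "代表": .03,
             "人民": .03, "行使": .02, "職權": .02}
print(dp_segment("國民大會代表人民行使職權", word_prob))
# ['國民大會', '代表', '人民', '行使', '職權']
```

Note that 國民大會 (p = .01) beats 國民 / 大會 (p = .02 x .02 = .0004), so the joint optimization picks the compound. A real system would also need a smoothing fallback (e.g., a small probability for single characters) so that unknown characters do not leave score[idx] at minus infinity.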

23 Unsupervised Word Segmentation: Viterbi Training for Identifying New Words
 Criteria:
 1. produce words that maximize the likelihood of the input corpus
 2. avoid producing over-segmented entries due to unknown words
 Viterbi Training Approach:
 Iteratively re-estimate the parameters of the segmentation model to improve system performance, where the word candidates in the augmented dictionary contain known words and potential words in the input corpus. Potential unknown words are assigned non-zero probabilities automatically in this process.

24 Viterbi Training for Identifying Words (cont.)
 Segmentation Stage: find the best segmentation pattern S* which maximizes the likelihood function of the input corpus
 c_1^n : input characters c_1, c_2, ..., c_n
 S_j : j-th segmentation pattern, consisting of { w_j,1, w_j,2, ..., w_j,mj }
 V(t) : vocabulary (n-grams in the augmented dictionary used for segmentation)
 S*(V) : the best segmentation (a function of V)

25 Viterbi Training for Identifying Words (cont.)
 Reestimation Stage: estimate the word probabilities which maximize the likelihood of the input text
 Initial Estimation
 Reestimation
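The initial-estimation and reestimation formulas on this slide were images not preserved in the transcript. For Viterbi training of a word uni-gram model they plausibly take the usual relative-frequency form over the current best segmentation (a reconstruction, using \#(\cdot) for counts and the S*(V) notation of the previous slide):

```latex
P^{(0)}(w) \;=\; \frac{\#\bigl(w \text{ in the initial segmentation}\bigr)}
                      {\sum_{w' \in V(0)} \#(w')},
\qquad
P^{(t+1)}(w) \;=\; \frac{\#\bigl(w \in S^{*}(V(t))\bigr)}
                        {\sum_{w' \in V(t)} \#\bigl(w' \in S^{*}(V(t))\bigr)}
```

Each iteration re-segments the corpus with the current probabilities and then re-counts word occurrences in the best segmentation, so frequent unknown-word candidates acquire non-zero probability automatically, as the previous slide states.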

