Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Validating Transliteration Hypotheses Using the Web: Web Counts vs. Web Mining. Presenter: You Lin Chen. Authors: Hikaridai, Seika-cho, Soraku-gun, Kyoto. 2007.WI.7
Outline: Motivation, Objective, Methodology, Experiments, Conclusion, Comments
Motivation: Web counts (hit counts) only approximate Web frequency. First, some Web search engines disregard punctuation and capitalization when matching a search term. Second, it is not easy to take the contexts of transliteration hypotheses into account with Web counts.
Objectives: To address these problems, we propose a novel method for validating transliteration hypotheses based on Web mining.
Methodology: [Figure: system overview. English terms are passed to a machine transliteration system, which generates transliteration hypotheses (e.g. Clinton, 克林頓). Each pair (Clinton, 克林頓) is sent as a Web query; from the retrieved Web pages (the data set), contextual information is extracted as features. The transliteration hypotheses are then ranked with a trained SVM or a trained MEM.]
Methodology
freq(tc_i, W_l): the number of occurrences of tc_i in W_l. Ex: freq(tc_i, W_1) = 6
freq_d(SW, tc_i, W_l, d): co-occurrence of SW and tc_i within distance d. Ex: freq_d(SW, tc_i, W_1, d=10) = 5
freq_p(SW, tc_i, W_l, d): co-occurrence of SW and tc_i as parenthetical expressions within distance d. Ex: freq_p(SW, tc_i, W_1, d=10) = 5
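The three counts defined on this slide can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes that W_l is a list of page texts, that SW is the source-language term, that distance d is measured in characters between the two terms, and that a "parenthetical expression" means one term followed by the other in ASCII or full-width parentheses. The helper names (`_spans`, `pattern`) are hypothetical.

```python
import re


def _spans(term, text):
    """Return (start, end) character spans of every occurrence of term in text."""
    return [(m.start(), m.end()) for m in re.finditer(re.escape(term), text)]


def freq(tc, pages):
    """Number of occurrences of the hypothesis tc over all pages."""
    return sum(len(_spans(tc, page)) for page in pages)


def freq_d(sw, tc, pages, d):
    """Number of occurrences of tc that have sw within d characters (assumed unit)."""
    total = 0
    for page in pages:
        sw_spans = _spans(sw, page)
        for tc_start, tc_end in _spans(tc, page):
            # Gap between the two terms, whichever one comes first in the text.
            if any(max(tc_start - sw_end, sw_start - tc_end) <= d
                   for sw_start, sw_end in sw_spans):
                total += 1
    return total


def freq_p(sw, tc, pages, d):
    """Number of 'outer (inner)' patterns, with at most d characters before the bracket."""
    def pattern(outer, inner):
        # Accept both ASCII and full-width parentheses (an assumption of this sketch).
        return re.compile(
            re.escape(outer) + r"[^()（）]{0," + str(d) + r"}[(（]\s*"
            + re.escape(inner) + r"\s*[)）]"
        )

    patterns = [pattern(sw, tc), pattern(tc, sw)]
    return sum(len(p.findall(page)) for page in pages for p in patterns)
```

Character distance is chosen here because Chinese text has no whitespace word boundaries; a token-based distance would need a word segmenter, which this sketch leaves out.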
Methodology: Let x_i ∈ X be the feature vector of tc_i ∈ TC. g_SVM(x_i) = w · x_i + b, where x_cor is a positive sample and the others are negative samples.
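As a rough illustration of how the linear score g_SVM(x_i) = w · x_i + b could be used for ranking, the sketch below scores each hypothesis and sorts by score. It assumes that w and b have already been learned by an SVM trainer (training is not shown), and the weights, feature values, and function names are hypothetical placeholders, not values from the paper.

```python
def g_svm(w, b, x):
    """Linear decision value w . x + b for one feature vector x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def rank_hypotheses(features, w, b):
    """Return hypothesis names sorted from highest to lowest g_svm score.

    features maps each hypothesis name to its feature vector x_i.
    """
    return sorted(features, key=lambda tc: g_svm(w, b, features[tc]), reverse=True)


# Hypothetical example: three features per hypothesis (e.g. freq, freq_d, freq_p).
w = [0.5, 1.0, 2.0]
b = -1.0
features = {"tc_1": [6, 5, 5], "tc_2": [2, 0, 0]}
ranking = rank_hypotheses(features, w, b)
```

The hypothesis ranked first is the one the model treats as most likely to be the correct transliteration.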
Methodology: g_MEM(x_i) = Pr(tc_cor | x_i). The maximum entropy model (MEM) is a widely used probability model that can incorporate heterogeneous information effectively. An event (ev) is usually composed of a target event (te) and a history event (he); say ev = <te, he>.
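The slide does not give the model's functional form, so the following is only a sketch of the standard conditional maximum entropy form, Pr(te | he) proportional to exp(sum_j lambda_j * f_j(te, he)). It assumes two target events ("correct" and "incorrect"), real-valued features taken directly from x_i, and already-trained weights; all names and values are hypothetical.

```python
import math


def mem_prob(weights, x, target):
    """Pr(target | x) under a conditional maximum entropy (softmax) model.

    weights maps each target event to its list of lambda weights,
    and x is the feature vector describing the history event.
    """
    scores = {
        te: sum(l_j * x_j for l_j, x_j in zip(lambdas, x))
        for te, lambdas in weights.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[target] - top) / z


def g_mem(weights, x):
    """Score of a hypothesis: probability that it is the correct transliteration."""
    return mem_prob(weights, x, "correct")
```

Hypotheses can then be ranked by g_mem in descending order, in the same way as with the SVM score.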
Experiments
Conclusion: Experiments showed that our Web mining-based transliteration validation method was consistently better than systems based on Web counts.
Comments: Advantage … Drawback … Application …