Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology: Phonetic String Matching: Lessons from Information Retrieval

Presentation transcript:

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology. Phonetic String Matching: Lessons from Information Retrieval. Advisor: Dr. Hsu. Graduate: Chih-Ling Wang. Authors: Justin Zobel, Philip Dart. 2003 IEEE.

Outline
- Motivation
- Objective
- Introduction
- Phonetic matching versus information retrieval
- Phonetic matching techniques
- Performance assessment
- Combination of evidence
- Conclusions
- Personal opinion

Motivation: We explore the accuracy of phonetic string matching.

Objective: In this paper we propose new phonetic matching techniques and describe the results of a new comparative investigation of phonetic matching.

Introduction: Phonetic matching is used to identify strings that may be of similar pronunciation, regardless of their actual spelling.

Introduction (cont.): There are two pragmatic issues that must be addressed in such a phonetic matching system. One is speed: answers should be found reasonably quickly. The other is accuracy. The parallels between information retrieval and phonetic matching mean that:
- both can be measured by the same kinds of techniques;
- methods for improving information retrieval performance may also apply to phonetic matching.

Phonetic matching versus information retrieval: In information retrieval, ranking is the process of identifying which of a set of documents are most likely to be similar in content to a given query. Phonetic matching is the process of identifying which of a set of strings are most likely to be similar in sound to a given query string.

Phonetic matching versus information retrieval (cont.): In both cases:
- The matching process is fundamentally inexact, since human judgment is required to tell whether the process's guess is correct.
- Similarity is relative: it is not possible to determine in isolation whether a query and a potential answer are matches.
- It is difficult to give an accurate definition of relevance.

Phonetic matching versus information retrieval (cont.): We consider phonetic matching to be the process of identifying strings that, after elimination of possible transmission or cognition errors, may sound the same. Transmission errors include sound-alike mistakes in data entry and mishearing of a spoken name over an imperfect transmission medium. Cognition errors include mistaking a pronunciation for an expected word.

Phonetic matching techniques: Soundex
- Soundex uses codes based on the sound of each letter to translate a string into a canonical form of at most four characters, preserving the first letter.
- Soundex makes the error of transforming dissimilar-sounding strings to the same code, and of transforming similar-sounding strings to different codes.
- There is no ranking of matches: strings are either similar or not similar.

Phonetic matching techniques (cont.): Example: reynold (r005043) => r543; renauld (r050043) => r543.
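The coding in this example can be sketched in Python. This is a minimal sketch rather than the authors' implementation: the letter-to-code table is the usual Soundex table and is an assumption (the slides do not list it), and the function names `letter_codes` and `soundex` are hypothetical. It follows the slide's description (first letter kept, at most four characters) and omits refinements found in other Soundex variants, such as zero padding.

```python
# Assumed Soundex code table (not given on the slides): vowels and h, w, y map to 0.
CODE_GROUPS = {
    "0": "aehiouwy",
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
CODES = {letter: code for code, letters in CODE_GROUPS.items() for letter in letters}


def letter_codes(name):
    """Keep the first letter and replace every later letter by its code."""
    name = name.lower()
    if not name:
        return ""
    return name[0] + "".join(CODES[c] for c in name[1:] if c in CODES)


def soundex(name):
    """Collapse repeated codes, drop zeros, and truncate to four characters."""
    coded = letter_codes(name)
    if not coded:
        return ""
    collapsed = coded[0]
    for ch in coded[1:]:
        if ch != collapsed[-1]:  # remove adjacent repetitions
            collapsed += ch
    return collapsed.replace("0", "")[:4]


print(letter_codes("reynold"), soundex("reynold"))  # r005043 r543
print(letter_codes("renauld"), soundex("renauld"))  # r050043 r543
```

Because the result is a single code per string, two names either share a code or do not, which is why Soundex cannot rank its matches.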

Phonetic matching techniques (cont.): Phonix. Phonix is a Soundex variant. Letters are mapped to a set of codes using the same algorithm, but a slightly different set of codes is used, and prior to mapping about 160 letter-group transformations are applied to standardise the string. For example, the sequence tjv is mapped to chv if it occurs at the start of a string, and x is transformed to ecs. These transformations provide context for the phonetic coding and allow c and s to be distinguished.

Phonetic matching techniques (cont.): Example: reynold (r005043) => r543; renauld (r050043) => r543.

Phonetic matching techniques (cont.): In our experiments we consider a variant of Phonix, here called Phonix+, in which truncation is not applied and a minimal edit distance is used to compare the resulting strings.

Phonetic matching techniques (cont.): Q-gram methods. A q-gram of string s is any substring of s of some fixed length q. Simply counting q-grams does not allow for length differences.

Phonetic matching techniques (cont.): Example: rhodes and rod, compared using the q-gram method with q = 2.
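The q-grams for this example can be listed with a few lines of Python. This is a sketch under stated assumptions: the slides do not give the distance formula, so the code uses a commonly cited q-gram distance, |G_s| + |G_t| - 2|G_s ∩ G_t|, which takes length differences into account. The names `qgrams` and `qgram_distance` are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2):
    """Return every substring of s with length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_distance(s, t, q=2):
    """Assumed distance: q-grams in s plus q-grams in t, minus twice the shared ones."""
    gs, gt = Counter(qgrams(s, q)), Counter(qgrams(t, q))
    shared = sum((gs & gt).values())  # multiset intersection
    return sum(gs.values()) + sum(gt.values()) - 2 * shared


print(qgrams("rhodes"))                 # ['rh', 'ho', 'od', 'de', 'es']
print(qgrams("rod"))                    # ['ro', 'od']
print(qgram_distance("rhodes", "rod"))  # 5
```

The two strings share only the 2-gram "od", so under this assumed formula the distance is 5 + 2 - 2 = 5.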

Phonetic matching techniques (cont.): Agrep. Agrep is a utility that embodies a fast algorithm for identifying strings that contain a substring which is identical to a query but for at most k insertions, deletions, or replacements, where k is a predefined constant. Agrep was not designed for the task of phonetic matching, but rather for fast searching of large files.

Phonetic matching techniques (cont.): Edit distances. A simple edit distance, which counts the minimal number of single-character insertions, deletions, and replacements needed to transform one string into another, could be used for phonetic matching since similar-sounding words are often spelled similarly.

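Phonetic matching techniques (cont.): For two strings s and t of length i and j respectively, this edit distance can be computed with the recurrence relation edit(i, j):

\[
\begin{aligned}
\mathrm{edit}(0,0) &= 0 \\
\mathrm{edit}(i,0) &= i \\
\mathrm{edit}(0,j) &= j \\
\mathrm{edit}(i,j) &= \min[\,\mathrm{edit}(i-1,j)+1,\ \mathrm{edit}(i,j-1)+1,\ \mathrm{edit}(i-1,j-1)+r(s_i,t_j)\,]
\end{aligned}
\]

The function r(a, b) returns 0 if a and b are identical, and 1 otherwise.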

Phonetic matching techniques (cont.): Example: rhodes; rod.
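The simple edit distance for this example can be worked through with a short dynamic-programming sketch. It implements the standard unit-cost recurrence (insertion, deletion, and replacement each cost 1); the function name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Minimal number of single-character insertions, deletions, and replacements."""
    m, n = len(s), len(t)
    # dist[i][j] is the distance between the first i characters of s
    # and the first j characters of t.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,            # delete from s
                dist[i][j - 1] + 1,            # insert into s
                dist[i - 1][j - 1] + replace,  # match or replace
            )
    return dist[m][n]


print(edit_distance("rhodes", "rod"))  # 3
```

Here rhodes becomes rod by deleting h, e, and s, giving a distance of 3.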

Phonetic matching techniques (cont.): Editex. Editex is a phonetic distance measure that combines the properties of edit distances with the letter-grouping strategy used by Soundex and Phonix. Editex also groups letters that can result in similar pronunciations, but does not require that the groups be disjoint, and can thus reflect the correspondences between letters and possible similar pronunciations more accurately.

Phonetic matching techniques (cont.): Editex is defined by the edit distance recurrence relation with a redefined function r(a, b), which returns 0 if a and b are identical, 1 if a and b both occur in the same letter group, and 2 otherwise. The function d(a, b) is identical to r(a, b), except that if a is h or w and a ≠ b then d(a, b) is 1.
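A minimal Python sketch of an Editex-style distance follows. Several details are assumptions, since the slide text does not show them: the letter groups are the ones commonly cited for Editex, the function names `r`, `d`, and `editex` are illustrative labels for the two cost functions described above, deletions and insertions are charged d(previous letter, current letter), and a leading space stands in for the missing letter before the first character.

```python
# Assumed letter groups (not listed on the slides); groups may overlap.
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz"]


def r(a, b):
    """Replacement cost: 0 if identical, 1 if in a shared group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in g and b in g for g in GROUPS):
        return 1
    return 2


def d(a, b):
    """Deletion/insertion cost: like r, but 1 when a is h or w and a differs from b."""
    if a != b and a in "hw":
        return 1
    return r(a, b)


def editex(s, t):
    # A leading space is an assumed sentinel so that s[i - 1] exists for i = 1.
    s, t = " " + s.lower(), " " + t.lower()
    m, n = len(s) - 1, len(t) - 1
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + d(s[i - 1], s[i])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + d(t[j - 1], t[j])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + d(s[i - 1], s[i]),    # delete s[i]
                dist[i][j - 1] + d(t[j - 1], t[j]),    # insert t[j]
                dist[i - 1][j - 1] + r(s[i], t[j]),    # match or replace
            )
    return dist[m][n]


print(editex("bat", "pat"))  # 1: b and p share a group
print(editex("bat", "rat"))  # 2: b and r share no group
```

Because the groups overlap (c appears with k and q, and also with s and z), one letter can be close to several others, which disjoint Soundex-style codes cannot express.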

Phonetic matching techniques (cont.): Phonometric methods. Our algorithms for phonometric matching consist of two stages:
- First, the string of letters is converted into a string of phonemes by a string-to-pronunciation conversion algorithm.
- The second stage is comparison of strings of phonemes.
The distance between pronunciations as represented by strings of phonemes can be measured more precisely than the distance between strings of letters.

Phonetic matching techniques (cont.): Tapering. Tapering is a refinement to the edit distance techniques based on a human-factors property: differences at the start of a pronunciation can be more significant than differences at the end. A tapered edit distance of particular interest is one in which the maximum penalty for replacement or deletion at the start of the string just exceeds twice the minimum penalty for replacement or deletion at the end of the string.
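One way to picture tapering is an edit distance whose penalties shrink along the string. The sketch below is an illustration only: the linear taper, the endpoint values 2.1 and 1.0 (chosen so that the start penalty just exceeds twice the end penalty), and the names `taper_weight` and `tapered_edit_distance` are all assumptions, since the slides do not give the exact weighting.

```python
def taper_weight(pos, length, start=2.1, end=1.0):
    """Penalty for an edit at 0-based position pos; falls linearly from start to end."""
    if length <= 1:
        return start
    return start - (start - end) * pos / (length - 1)


def tapered_edit_distance(s, t):
    m, n = len(s), len(t)
    length = max(m, n)  # taper over the longer of the two strings
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + taper_weight(i - 1, length)
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + taper_weight(j - 1, length)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                replace = 0.0
            else:
                replace = taper_weight(max(i, j) - 1, length)
            dist[i][j] = min(
                dist[i - 1][j] + taper_weight(i - 1, length),  # delete
                dist[i][j - 1] + taper_weight(j - 1, length),  # insert
                dist[i - 1][j - 1] + replace,                  # match or replace
            )
    return dist[m][n]


# A difference in the first letter costs more than one in the last letter.
print(tapered_edit_distance("rod", "cod") > tapered_edit_distance("rod", "roc"))  # True
```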

Performance assessment: We can now compare the various approaches to phonetic matching. Results are shown in Table 1, which reports 11-point recall-precision. For many of the techniques tested, only a few distinct ranks are possible, and some techniques return only two ranks, match and not-match.
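For readers unfamiliar with the measure, 11-point recall-precision can be sketched as follows. This is the standard interpolated form (precision at recall levels 0.0, 0.1, ..., 1.0, averaged); whether the paper computes it in exactly this way is an assumption, and the function name `eleven_point_average` is hypothetical.

```python
def eleven_point_average(ranked, relevant):
    """Average interpolated precision over recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    points = []  # (recall, precision) measured at each relevant answer
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for step in range(11):
        level = step / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for rec, p in points if rec >= level - 1e-12), default=0.0)
    return total / 11


# Two correct answers, found at ranks 1 and 3.
print(round(eleven_point_average(["a", "x", "b"], {"a", "b"}), 3))  # 0.848
```

A method that returns only match and not-match gives every matched string the same rank, so a measure like this has very few distinct rank positions to work with.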

Performance assessment (cont.): The least effective methods, such as Phonix and Soundex, return only a small number of answers for most queries. Phonix and Soundex are not only finding many wrong answers but also failing to find many right ones. The "baseline" results are for a trivial phonetic matching method: find all strings that differ from the query by at most one character (an insertion, deletion, or replacement).
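The baseline method is simple enough to sketch directly. The function names `within_one_edit` and `baseline_matches` and the sample name list are hypothetical; the check itself follows the slide's description of at most one insertion, deletion, or replacement.

```python
def within_one_edit(query, candidate):
    """True if candidate differs from query by at most one single-character edit."""
    if query == candidate:
        return True
    lq, lc = len(query), len(candidate)
    if abs(lq - lc) > 1:
        return False
    # Skip the common prefix, then compare what follows the first difference.
    i = 0
    while i < min(lq, lc) and query[i] == candidate[i]:
        i += 1
    if lq == lc:
        return query[i + 1:] == candidate[i + 1:]  # one replacement
    if lq > lc:
        return query[i + 1:] == candidate[i:]      # one character missing in candidate
    return query[i:] == candidate[i + 1:]          # one extra character in candidate


def baseline_matches(query, names):
    """Return every name within one edit of the query, in input order."""
    return [name for name in names if within_one_edit(query, name)]


names = ["rod", "rode", "cod", "rhodes", "od", "road"]
print(baseline_matches("rod", names))  # ['rod', 'rode', 'cod', 'od', 'road']
```

Like Soundex, this baseline yields only two ranks, match and not-match.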

Performance assessment (cont.): A particular problem of best agrep is the tiny number of correct answers returned (less than one per query), but we stress that agrep was not designed for phonetic matching.

Performance assessment (cont.): An interesting discovery is that even the most successful of the methods fetch rather different sets of answers, sometimes almost without overlap. As in information retrieval, it seems, two methods can perform well without finding the same answers.

Combination of evidence: Phonetic matching has strong parallels with information retrieval. Matching techniques fetch a ranked list of matches in which each entry has a weight attached to it; this weight is the likelihood that the entry is a good match. Combining the ranked lists produced by different retrieval mechanisms can improve performance.
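A small sketch shows one way two ranked lists might be merged. The slides do not say how the lists were combined, so the rank-sum rule below is an assumption chosen for simplicity, and the function name `combine_by_rank` and the example lists are hypothetical.

```python
def combine_by_rank(*rankings):
    """Merge ranked lists by summing each candidate's rank (lower total is better).

    A candidate missing from a list is given that list's length as its rank,
    which places it just after the last entry of that list.
    """
    candidates = set().union(*rankings)

    def total_rank(name):
        return sum(
            ranking.index(name) if name in ranking else len(ranking)
            for ranking in rankings
        )

    # Ties are broken alphabetically so the output is deterministic.
    return sorted(candidates, key=lambda name: (total_rank(name), name))


method_a = ["rhodes", "rod", "road"]  # hypothetical output of one matcher
method_b = ["rod", "rode", "rhodes"]  # hypothetical output of another
print(combine_by_rank(method_a, method_b))  # ['rod', 'rhodes', 'rode', 'road']
```

Candidates ranked well by both methods rise to the top, while answers found by only one method are still retained lower in the list.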

Combination of evidence (cont.): The "(none)" lines are the results of running the methods individually. The best performance of all is given by the combination of Phonix+ and the q-gram method, neither of which works particularly well alone.

Combination of evidence (cont.): More sophisticated techniques for combination could be used:
- weighting the ranks from the different techniques;
- combining more than two methods.
These results show that combination of evidence is successful in this context.

Conclusions: Two of our proposals, the Ipadist and Editex methods, do indeed lead to improved performance, whereas the third, tapering, was not successful. We showed that combination of evidence, which has been successfully applied to information retrieval, consistently improves performance. Overall, our new methods are substantially more effective than existing methods such as edit distances, and combination of evidence is as valuable in this domain as it is in information retrieval.

Personal Opinion: The concept in this paper may be useful in our research, but I do not yet have a clear idea of how to implement it. I need more time to think.