Do-Gil Lee1*, Ilhwan Kim1 and Seok Kee Lee2

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Suleyman Cetintas 1, Monica Rogati 2, Luo Si 1, Yi Fang 1 Identifying Similar People in Professional Social Networks with Discriminative Probabilistic.
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
ASSESSING SEARCH TERM STRENGTH IN SPOKEN TERM DETECTION Amir Harati and Joseph Picone Institute for Signal and Information Processing, Temple University.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Investigation of Web Query Refinement via Topic Analysis and Learning with Personalization Department of Systems Engineering & Engineering Management The.
Gobalisation Week 8 Text processes part 2 Spelling dictionaries Noisy channel model Candidate strings Prior probability and likelihood Lab session: practising.
Symmetric Probabilistic Alignment Jae Dong Kim Committee: Jaime G. Carbonell Ralf D. Brown Peter J. Jansen.
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 30, (2014) BERLIN CHEN, YI-WEN CHEN, KUAN-YU CHEN, HSIN-MIN WANG2 AND KUEN-TYNG YU Department of Computer.
Text Search and Fuzzy Matching
To quantitatively test the quality of the spell checker, the program was executed on predefined “test beds” of words for numerous trials, ranging from.
An Automatic Segmentation Method Combined with Length Descending and String Frequency Statistics for Chinese Shaohua Jiang, Yanzhong Dang Institute of.
Semantic and phonetic automatic reconstruction of medical dictations STEFAN PETRIK, CHRISTINA DREXEL, LEO FESSLER, JEREMY JANCSARY, ALEXANDRA KLEIN,GERNOT.
L. Padmasree Vamshi Ambati J. Anand Chandulal J. Anand Chandulal M. Sreenivasa Rao M. Sreenivasa Rao Signature Based Duplicate Detection in Digital Libraries.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Wavelet-Based Multiresolution Matching for Content-Based Image Retrieval Presented by Tienwei Tsai Department of Computer Science and Engineering Tatung.
Machine Learning Approach for Ontology Mapping using Multiple Concept Similarity Measures IEEE/ACIS International Conference on Computer and Information.
Concept Unification of Terms in Different Languages for IR Qing Li, Sung-Hyon Myaeng (1), Yun Jin (2),Bo-yeong Kang (3) (1) Information & Communications.
Finding Similar Questions in Large Question and Answer Archives Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee Retrieval Models for Question and Answer Archives.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
1 CSA4050: Advanced Topics in NLP Spelling Models.
Towards an Intelligent Multilingual Keyboard System Tanapong Potipiti, Virach Sornlertlamvanich, Kanokwut Thanadkran Information Research and Development.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Interoperable Visualization Framework towards enhancing mapping and integration of official statistics Haitham Zeidan Palestinian Central.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
2005/12/021 Content-Based Image Retrieval Using Grey Relational Analysis Dept. of Computer Engineering Tatung University Presenter: Tienwei Tsai ( 蔡殿偉.
A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V.Khmelev Department of Mathematics, University of Toronto.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien Shing Chen Author: Wei-Hao.
D-skyline and T-skyline Methods for Similarity Search Query in Streaming Environment Ling Wang 1, Tie Hua Zhou 1, Kyung Ah Kim 2, Eun Jong Cha 2, and Keun.
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
Bloom Cookies: Web Search Personalization without User Tracking Authors: Nitesh Mor, Oriana Riva, Suman Nath, and John Kubiatowicz Presented by Ben Summers.
Generating Query Substitutions Alicia Wood. What is the problem to be solved?
On using context for automatic correction of non-word misspellings in student essays Michael Flor Yoko Futagi Educational Testing Service 2012 ACL.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Autumn Web Information retrieval (Web IR) Handout #3:Dictionaries and tolerant retrieval Mohammad Sadegh Taherzadeh ECE Department, Yazd University.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
Content-Based Image Retrieval Using Color Space Transformation and Wavelet Transform Presented by Tienwei Tsai Department of Information Management Chihlee.
January 2012Spelling Models1 Human Language Technology Spelling Models.
Author :K. Thambiratnam and S. Sridharan DYNAMIC MATCH PHONE-LATTICE SEARCHES FOR VERY FAST AND ACCURATE UNRESTRICTED VOCABULARY KEYWORD SPOTTING Reporter.
Evaluating Translation Memory Software Francie Gow MA Translation, University of Ottawa Translator, Translation Bureau, Government of Canada
University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia 전자전기컴퓨터공학 부 USN 연구실 G
An Effective Statistical Approach to Blog Post Opinion Retrieval Ben He, Craig Macdonald, Jiyin He, Iadh Ounis (CIKM 2008)
Tolerant Retrieval Review Questions
Korean version of GloVe Applying GloVe & word2vec model to Korean corpus speaker : 양희정 date :
Applying Deep Neural Network to Enhance EMPI Searching
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Speaker : chia hua Authors : Long Qin, Ming Sun, Alexander Rudnicky
Wei Wei, PhD, Zhanglong Ji, PhD, Lucila Ohno-Machado, MD, PhD
Query in Streaming Environment
Yunsik Son1, Seman Oh1, Yangsun Lee2
COMBINED UNSUPERVISED AND SEMI-SUPERVISED LEARNING FOR DATA CLASSIFICATION Fabricio Aparecido Breve, Daniel Carlos Guimarães Pedronette State University.
Lecture 12: Data Wrangling
Test Case Purification for Improving Fault Localization
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
University of Illinois System in HOO Text Correction Shared Task
Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.
Presentation transcript:

A String Similarity Measure Based on Orthographic and Phonetic Similarity for Spelling Correction Do-Gil Lee1*, Ilhwan Kim1 and Seok Kee Lee2 1 Research Institute of Korean Studies, Korea University, Anam-dong 5-1, Seoul, 136-701, Korea 2 Department of Industrial & Management Engineering, Hansung University, Samseongyoro-16gil, Seoul, 136-792, Korea {motdg, haigh}@korea.ac.kr and seelee@hansung.ac.kr Abstract. The most commonly used string similarity measure for spelling correction is minimum edit distance (MED), which is based solely on the orthographic similarity between two strings. In order to overcome this shortcoming, this paper presents a more sophisticated similarity measure that considers both the orthographic and phonetic similarity between two strings. To demonstrate the effectiveness of the proposed measure, we implement and test a spelling correction system and apply it to two languages, namely English and Korean. We investigate the useful features for each language through various experiments and achieve 10-best accuracies of 95.2 for English and 97.4 for Korean. Keywords: String similarity, Spelling correction, Edit distance, Phonetic algorithm, n-gram indexing 1 Introduction Spelling correction is a function which automatically corrects a string containing spell errors. More generally, if a string not in a dictionary is found, similar words with the string listed in the dictionary will be suggested automatically. Spelling correction can be a very convenient function when users do not know the exact spelling of a word or make typing mistakes. The core technique for spelling correction is an approximate string matching algorithm. The most commonly used string similarity measure for approximate string matching is minimum edit distance (MED) [1]. MED between two strings is defined as the minimum number of editing operations, e.g., insertion, deletion and substitution, required to transform one string into another. MED is effective for differentiating orthographic similarity, but not for phonetic similarity. Therefore, MED is not suitable for catching phonetic or cognitive mistakes. This paper presents a string similarity measure that combines orthographic and phonetic features. Recent approaches to spelling correction based on probability models [2], [3] require a large training set of spelling errors paired with the correct spelling of the * Do-Gil Lee is a corresponding author. 129

2 String Similarity Measure word. Such data should be collected from real texts and manually corrected, which is time-consuming. The method proposed herein, with an ‘unsupervised’ manner, does not require massive data. 2 String Similarity Measure The devised scoring function considers the following four features for string similarity: common letter feature, n-gram overlap feature, edit distance feature, and phonetic feature. The following four assumptions for the scoring function are made. First, two strings with more letters in common are more similar. Second, two strings with a higher value of ranking score function are more similar. Third, two strings with shorter edit distance are more similar. Fourth, two strings with shorter edit distance in phonetic code are more similar. Various types of function can reflect these assumptions. Each assumption can be expressed in more than one function. Among the various combinations of these functions, the optimal can be determined experimentally, and the final score calculated by multiplication of the functions for the features. 3 Experiments To evaluate the system performance, k-best accuracy is used, which is the ratio of the number of candidates containing the correct answer to the total k candidates suggested by the system (in this study, 10-best accuracy). In this work, we tested our spelling correction system for both English and Korean. An English dictionary containing 1,300,000 headwords was used to generate candidates of strings for the query and an error/correct answer data set of 998 pairs was used. A Korean dictionary containing 500,000 headwords was used and error/correct answer data set of 2,579 was used. Metaphone is used as a phonetic algorithm for English. Two cases were tested with 1) the front 4 bytes of the generated code, and 2) the entire phonetic code. For Korean, Kodex [4] was used. To investigate the effect of the parameters used in each feature on the performance of the system, various combinations of parameters were tested. Table 1. Top 5 results for English C N E S n-gram size Phonetic code 10-best acc. 1 2-gram Metaphone(4byte) 95.19 2 94.79 94.69 3 130

Table 2. Top 5 results for Korean C N E S n-gram size 10-best acc. 1 2 2-gram 97.40 3 97.29 3-gram 4-gram 97.21 Tables 1 and 2 exhibit 10-best accuracies of top 5 results with combining features. 4 Conclusion We have presented a more sophisticated similarity measure than MED by considering both orthographic and phonetic similarity, and implemented a spelling correction system based on the proposed similarity measure. The method is language independent and was demonstrated for both English and Korean. The experimental results demonstrate that the proposed similarity measure outperforms MED in both languages for spelling correction problems. Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MEST) (KRF-2007-361- AL0013). References Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady. 707--710 (1966) Brill, E. and Moore, R. C.: An improved error model for noisy channel spelling correction. In: the 38th Annual Meeting on Association for Computational Linguistics. Annual Meeting of the ACL. (2000) Kristina Toutanova and Robert C. Moore: Pronunciation modeling for improved spelling correction. In: the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). Association for Computational Linguistics, Stroudsburg, PA, USA, 144--151 (2002) Kang, B. and Choi, K.: Two approaches for the resolution of word mismatch problem caused by English words and foreign words in Korean information retrieval. In: the Fifth international Workshop on information Retrieval with Asian Languages, (2000) 131