Spelling Correction for Search Engine Queries. Bruno Martins and Mario J. Silva. Proceedings of EsTAL-04, España for Natural Language Processing (2004)


1 Spelling Correction for Search Engine Queries
Bruno Martins and Mario J. Silva. Proceedings of EsTAL-04, España for Natural Language Processing (2004)
Presented by Swapnil Chhajer, schhajer@usc.edu, http://schhajer.co.nr

2 Topics Covered in Class
- Peter Norvig’s Spelling Corrector: Query Processing [33-35]
- Levenshtein Algorithm: Query Processing [36-41]
- Evaluation Metrics, Precision & Recall: Introduction to Information Retrieval [16]
- Soundex Algorithm: Query Processing [18]
April 16, 2013, Spelling Correction for Search Engine Queries

3 Motivation & Abstract
- Misspelled queries retrieve pages that contain the same misspelled words, so the most appropriate pages are left out.
- 10-12% of queries are misspelled.
- Goal: provide the user with the single best match instead of making the user choose one of the possible corrections from a correction list.

4 Google: Spelling Correction

5 Spelling Correction
- Uses
  - Correcting documents being indexed
  - Retrieving matching documents when the query contains a spelling error
- Flavors
  - Isolated words: check each word on its own. Unable to catch typos that are themselves correctly spelled words, e.g., "from" vs. "form".
  - Context-sensitive: look at the surrounding words, e.g., "I flew form Heathrow to Narita."
- Example that an isolated-word checker accepts: “a paragraph cud half mini flaws but wood bee past by the isolated spill checker”

6 General Issues in Spelling Correction
- UI
  - "Did you mean" works for one suggestion. What about multiple possible corrections?
- Computational cost
  - Spelling correction is potentially expensive.
  - Avoid running it on every query; maybe run it only on queries that match few documents.
  - Guess: spelling correction in major search engines is efficient enough to be run on every query.

7 Kinds of Spelling Mistakes: Typos
- Wrong characters typed by mistake
- Four main categories:
  - Insertions (extra letter): “prejudice” as “prejudsice”
  - Deletions (missing letter or space): “plaintiff” as “paintiff”, “judgment” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
  - Substitutions (wrong letter): “habeas” as “haceas”, “appellate” as “appellare”
  - Transpositions (adjacent letters swapped): “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”, “subpoena” as “subpeona”, “plaintiff” as “plaitniff”
- 80-95% of misspellings differ from the correct spelling in just one of these four ways.
- Keyboard layout is important in such cases.
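The four typo categories above can be enumerated mechanically. The following is a minimal Python sketch in the style of Peter Norvig's spelling corrector, not code from the paper; the function name `single_edit_variants` and the lowercase ASCII alphabet are assumptions made for illustration.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Return every string that is exactly one typo away from `word`.

    Assumption: the alphabet is lowercase ASCII; a Portuguese dictionary
    would also need accented letters.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Deletion: one letter is missing.
    deletions = {a + b[1:] for a, b in splits if b}
    # Transposition: two adjacent letters are swapped.
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Substitution: one letter is replaced by another.
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    # Insertion: one extra letter is added.
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}
```

Each example from the slide falls into this set, for instance `"paintiff" in single_edit_variants("plaintiff")` holds because of a single deletion.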

8 Kinds of Spelling Mistakes: Brainos
- Wrong characters typed on purpose
- Most common type of mistake in general web queries
- Mistakes derive from pronunciation, spelling, or semantic confusions
- Brainos, soundalike (phonetic errors): “subpoena” as “supena”, “voir” as “voire”, “latter” as “ladder”, “withholding” as “witholding”, “foreclosure” as “forclosure”
- Brainos, confusions: “preclusion” as “perclusion”, “men” as “mans”, “juries” as “jurys” or “jureys”, “dramshop” as “dram shop”

9 Dictionary Storage: Ternary Search Trees (TST)
- Data structure: ternary search tree (TST), a type of trie limited to 3 children per node.
- A trie is a tree for storing strings in which there is one node for every common prefix; the strings themselves are stored in extra leaf nodes.
- Searching: O(log(n) + k)
  - n: number of strings in the tree
  - k: length of the string being searched for

10 TST Continued…
Figure: A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”, all with an associated frequency of 1
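To make the data structure concrete, here is a minimal Python sketch of a ternary search tree holding the five words from the figure. It is an illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, and the frequency is kept on the node of a word's last character rather than in a separate leaf node.

```python
class Node:
    """One character of the tree with three children: low, equal, high."""
    __slots__ = ("char", "low", "equal", "high", "frequency")

    def __init__(self, char):
        self.char = char
        self.low = self.equal = self.high = None
        self.frequency = 0  # 0 means that no word ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, count=1):
        """Add `word`, or raise its frequency by `count` if already stored."""
        if not word:
            raise ValueError("cannot store the empty string")
        self.root = self._insert(self.root, word, 0, count)

    def _insert(self, node, word, i, count):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.low = self._insert(node.low, word, i, count)
        elif ch > node.char:
            node.high = self._insert(node.high, word, i, count)
        elif i + 1 < len(word):
            node.equal = self._insert(node.equal, word, i + 1, count)
        else:
            node.frequency += count
        return node

    def frequency(self, word):
        """Return the stored frequency of `word`, or 0 if it is absent."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.char:
                node = node.low
            elif ch > node.char:
                node = node.high
            elif i + 1 < len(word):
                node, i = node.equal, i + 1
            else:
                return node.frequency
        return 0


tree = TernarySearchTree()
for w in ("to", "too", "toot", "tab", "so"):
    tree.insert(w)
```

After this setup, `tree.frequency("too")` is 1, while a prefix that was never inserted as a word, such as `"t"`, has frequency 0.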

11 Spelling Correction Algorithm
- Spelling correctors are implemented using edit distance, rule-based techniques, n-grams, probabilistic techniques, neural nets, similarity-key techniques, or combinations of these.
- Goal: find the edit distance under different strategies. A shorter distance implies a better correction.
- Soundex system: indexing based on sound, devised to help with phonetic errors.
- Metaphone systems
  - Specific to the English language
  - Transform words into codes based on phonetic properties
  - Based on consonants and diphthongs
- Spelling correction for the web: context-dependent correction is of little use, because users rarely type more than three terms in a query.
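As an illustration of "shorter distance implies a better correction", the following is a minimal sketch of the classic Levenshtein edit distance. It is the textbook dynamic-programming formulation, not the specific distance used in the paper; note that plain Levenshtein counts a transposition as two edits.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] is the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("haceas", "habeas")` is 1 (one substitution), while `levenshtein("fruad", "fraud")` is 2, because the swapped letters are counted as two substitutions.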

12 Spelling Correction Algorithm Continued…
1. The user's query is tokenized, ignoring non-word characters.
2. All words are converted to lower case, and each word is checked for correct spelling.
3. The frequencies of correctly spelled words are updated. This acts as feedback to the system and can help the spell checker predict patterns in users' searches.
4. Misspelled words are replaced by correctly spelled words.
5. Finally, the new query is presented to the user as a suggestion, together with the results page for the original query.
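The query-processing steps above could be sketched roughly as follows. This is a hedged illustration only: `suggest_query`, its parameters, and the use of a `Counter` as the dictionary are assumptions, and the `correct` callback stands in for the candidate generation and selection phases.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")  # assumption: a token is a run of word characters


def suggest_query(query, frequencies, correct):
    """Return a corrected query string, or None if nothing was changed.

    frequencies: Counter of known words, updated in place (the feedback step).
    correct: hypothetical callable mapping a misspelled word to its best
             correction, or to None when it has no suggestion.
    """
    tokens = WORD_RE.findall(query.lower())
    suggestion, changed = [], False
    for token in tokens:
        if token in frequencies:
            frequencies[token] += 1  # feedback for correctly spelled words
            suggestion.append(token)
            continue
        replacement = correct(token)
        if replacement is None:
            suggestion.append(token)  # keep the word when nothing is found
        else:
            suggestion.append(replacement)
            changed = True
    return " ".join(suggestion) if changed else None


frequencies = Counter({"spelling": 5, "correction": 3})
fixes = {"speling": "spelling"}
suggested = suggest_query("Speling, correction!", frequencies, fixes.get)
```

Here `suggested` is `"spelling correction"`, and the count for `"correction"` rises from 3 to 4 because that word was already spelled correctly.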

13 Spelling Correction Algorithm Continued…
The algorithm is divided into 2 phases:
- Phase 1: generate a set of candidate suggestions
- Phase 2: select the best choice among those candidates
Phase 1 has 9 steps. Each step looks up the dictionary for words that relate to the original misspelling in one of these ways:
1. Differ in one character from the original word.
2. Differ in two characters from the original word.
3. Differ in one letter removed or added.
4. Differ in one letter removed or added, plus one letter different.
5. Differ in repeated characters removed.
6. Correspond to 2 concatenated words (the space between the words was eliminated).
7. Differ in having two consecutive letters exchanged and 1 character different.
8. Have the original word as a prefix.
9. Differ in repeated characters removed and 1 character different.
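A few of the Phase 1 steps can be sketched as simple predicates over a dictionary. This is a partial, hypothetical illustration: it covers only the one-character, one-letter, repeated-character, concatenation and prefix steps, uses a plain Python set instead of a ternary search tree, and reads "repeated characters removed" as "equal once runs of the same letter are collapsed", which is an assumption.

```python
import re


def one_char_different(a, b):
    """Same length, exactly one position differs."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def one_letter_added_or_removed(a, b):
    """One string equals the other with a single letter dropped."""
    short, longer = sorted((a, b), key=len)
    if len(longer) - len(short) != 1:
        return False
    return any(longer[:i] + longer[i + 1:] == short for i in range(len(longer)))


def collapse_repeats(word):
    """Collapse every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def candidates(misspelling, dictionary):
    """Collect dictionary entries related to `misspelling` (partial Phase 1)."""
    found = set()
    for word in dictionary:
        if word == misspelling:
            continue
        if (one_char_different(misspelling, word)
                or one_letter_added_or_removed(misspelling, word)
                or collapse_repeats(misspelling) == collapse_repeats(word)
                or word.startswith(misspelling)):
            found.add(word)
    # Two dictionary words typed without the space between them.
    for i in range(1, len(misspelling)):
        left, right = misspelling[:i], misspelling[i:]
        if left in dictionary and right in dictionary:
            found.add(left + " " + right)
    return found


dictionary = {"plaintiff", "fourth", "amendment", "liability", "liable", "hello"}
```

With this toy dictionary, `candidates("fourthamendment", dictionary)` contains `"fourth amendment"`, and `candidates("liab", dictionary)` returns the two words that have `"liab"` as a prefix. A real implementation would walk the tree instead of scanning every dictionary entry.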

14 Spelling Correction Algorithm Continued…
Phase 2: heuristics used
1. Return the word if it differs only in accented characters.
2. Return the word if it differs in only one character, with the error corresponding to an adjacent letter in the same row of the keyboard.
3. Return the smallest word, if there are solutions having the same metaphone key as the original string.
4. Return the word if it differs in only one character, with the error corresponding to an adjacent letter in an adjacent row of the keyboard.
5. Finally, return the last word.
The heuristics are applied sequentially, moving to the next one only if no matching words are found. If more than one word matches, return the one whose first character matches the original. If there is still more than one, choose the word with the highest frequency.
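The same-row keyboard heuristic and the tie-breaking rule can be sketched as follows. This is an illustrative sketch, not the paper's code: the QWERTY row layout, the function names, and the dictionary of frequencies are all assumptions.

```python
# Assumption: a QWERTY layout, letters only.
KEYBOARD_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def same_row_neighbours(a, b):
    """True if the two letters sit next to each other in one keyboard row."""
    for row in KEYBOARD_ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False


def differs_by_same_row_slip(typed, word):
    """Exactly one differing character, and it is a same-row neighbour."""
    if len(typed) != len(word):
        return False
    diffs = [(x, y) for x, y in zip(typed, word) if x != y]
    return len(diffs) == 1 and same_row_neighbours(*diffs[0])


def break_ties(typed, matches, frequencies):
    """Prefer words sharing the first character, then the highest frequency."""
    same_initial = [w for w in matches if w[:1] == typed[:1]]
    pool = same_initial or matches
    return max(pool, key=lambda w: frequencies.get(w, 0))


def pick_by_keyboard_row(typed, candidate_words, frequencies):
    """Apply the same-row heuristic; None means 'try the next heuristic'."""
    matches = [w for w in candidate_words if differs_by_same_row_slip(typed, w)]
    return break_ties(typed, matches, frequencies) if matches else None
```

For instance, `pick_by_keyboard_row("bat", ["vat", "bar"], {"vat": 9, "bar": 1})` returns `"bar"`: both words are one same-row slip away, but only `"bar"` keeps the first character of the typed word, and that rule is applied before frequency.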

15 Results Comparison
- Aspell spell checker: http://aspell.sourceforge.net/
- Aspell uses the Metaphone algorithm with a near-miss strategy.
- 48.33% of the correct forms were correctly guessed.
- Outperformed Aspell by 1.66%.
Table legend: "*" means the misspelling was not detected; "-" means no suggestion was returned.

16 Results Comparison Continued…
Tumba!: search engine for the Portuguese web
Table: Results from the spelling checker with Tumba!

17 Conclusion & Future Work
- The spelling checker uses a ternary search tree data structure for storing the dictionary.
- The data source was two popular Portuguese newspapers.
- Queries in a search engine may contain company or person names. In such cases, keeping two dictionaries, one in the TST used for correction and another in a hash table used only for checking valid words, could yield good results.

18 Pros & Cons
- Pros
  - Considered various factors affecting edit distance, including probabilistic estimations.
  - Used a feedback system to improve the quality of the results for user queries.
- Cons
  - Did not consider context-sensitive spell checking.
  - The system is not language independent; it focuses mainly on Portuguese words.
  - No discussion of spell-corrected completion suggestions as a query is incrementally entered.

19 References
- Bob Carpenter. Contemporary Spelling Correction: Decoding the Noisy Channel.
- Whitelaw, Hutchinson, Chung and Ellis. Using the Web for Language Independent Spellchecking and Autocorrection.
- Choudhury, Thomas, Mukherjee, Basu and Ganguly. How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach.

