Spelling correction as an iterative process that exploits the collective knowledge of web users
Silviu Cucerzan and Eric Brill, July 2004
Speaker: Mengzhe Li


1 Spelling correction as an iterative process that exploits the collective knowledge of web users
Silviu Cucerzan and Eric Brill, July 2004
Speaker: Mengzhe Li

2 Spell Checking of Search Engine Queries
Traditional word-processing spell checkers resolve typographical errors by computing a small set of in-lexicon alternatives, relying on:
- In-lexicon word frequencies
- The most common keyboard mistakes
- Phonetic/cognitive mistakes
- Word substitution errors (very few): words used in inappropriate contexts, caused by typographical or cognitive mistakes
Web queries differ:
- Very short: fewer than three words on average
- Error frequency and severity are significantly greater
- Validity cannot be decided by a lexicon or by grammaticality
- Consist of one or more concepts
- Contain legitimate words not found in traditional lexicons

3 Spell Checking of Search Engine Queries
Difficulties in applying a traditional spell checker to web queries:
- Defining a valid web query is difficult
- Maintaining a high-coverage lexicon is impossible
- Detecting word substitutions in a very large lexicon is difficult
Alternative method: exploit the evolving expertise of web-search users, collected in search query logs.
- The validity of a word is reflected in how frequently people query for it ("the meaning of a word is its use in the language")
- Use query logs to learn validity and build a model of valid-query probabilities
- This holds despite the fact that a large percentage of logged queries are themselves misspelled, and there is no trivial way to separate valid queries from invalid ones

4 Traditional Lexicon-Based Spelling Correction Approaches
For any out-of-lexicon word, find the closest word form in the available lexicon and hypothesize it as the correct spelling alternative, based on an edit-distance function. The problem is then iteratively redefined to diminish the role of the trusted lexicon:
- Include a threshold so that all lexicon words within that distance are candidate corrections
- Choose among candidates using prior probability instead of the raw distance, taking into account word frequencies in the language
- Combine the likelihood of a particular misspelling with the prior probability of each word, yielding a probabilistic edit distance
- Condition the probability of the correction on context
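The thresholded, probability-weighted selection described above can be sketched as follows. This is a toy illustration, not the paper's trained model: the lexicon counts, the distance threshold, and the 0.1-per-edit channel model are all illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    # Plain dynamic-programming edit distance (insert/delete/substitute).
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def candidates(word, lexicon_counts, delta=2):
    # Keep every lexicon word within distance delta, then rank by
    # prior * likelihood: the prior is the word's relative frequency,
    # and the toy channel model charges a factor of 0.1 per edit.
    total = sum(lexicon_counts.values())
    scored = []
    for cand, count in lexicon_counts.items():
        dist = levenshtein(word, cand)
        if dist <= delta:
            scored.append((count / total * 0.1 ** dist, cand))
    return [c for _, c in sorted(scored, reverse=True)]
```

With equal edit distances, the word frequencies (the prior) break the tie; with a strong enough frequency imbalance, even a more distant word can win.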

5 Contd.
- Tokenize the text so that context can be taken into account; a misspelled word should be corrected depending on its context. Ex: power crd -> power cord, video crd -> video card
- Consider word substitution errors. Ex: golf war -> gulf war, sap opera -> soap opera
- Consider concatenation and splitting. Ex: power point slides -> powerpoint slides, chat inspanish -> chat in spanish
- Consider terms specific to web queries. Ex: gun dam planet -> gundam planet, limp biz kit -> limp bizkit
- Generalize the problem: out-of-lexicon words can be valid corrections in web queries, and in-lexicon words may need to be changed to out-of-lexicon words
  - No longer any explicit use of a lexicon
  - Query data becomes more important in estimating string probability
  - Query-log frequency substitutes for a measure of the meaningfulness of strings as web queries

6 Distance Function and Threshold
Modified context-dependent weighted Damerau-Levenshtein edit function: the minimum number of point changes required to transform one string into another, where a point change is an insertion, deletion, substitution, immediate transposition, or long-distance movement of letters. Statistics from query logs are used to refine the weights.
Importance of the distance function d and the threshold δ:
- Too restrictive: the right correction might not be reachable
- Too permissive: unlikely corrections might be suggested
- Desired: allow large-distance corrections for a diversity of situations
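An unweighted, restricted version of the Damerau-Levenshtein function can be sketched as below. It handles insertion, deletion, substitution, and immediate transposition; the paper's context-dependent weights and long-distance letter movement are not modeled here.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    # Restricted Damerau-Levenshtein: edit distance with adjacent
    # transpositions counted as a single point change.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

In the weighted version the paper uses, each point change would carry a context-dependent cost learned from query-log statistics instead of a uniform cost of 1.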

7 Exploiting Large Web Query Logs
Any string that appears in the query log used for training can be considered a valid correction, and can be suggested as an alternative to the current query based on the relative frequencies of the query and the alternative spelling.
Three essential properties of the query logs:
- Words in the query logs are misspelled in various ways, from relatively easy-to-correct misspellings to very difficult ones that make the user's intent almost impossible to recognize
- The less malign (difficult to correct) a misspelling is, the more frequent it is
- Correct spellings tend to be more frequent than misspellings
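The third property is what makes the logs usable directly: among spelling variants of the same query, the most frequent logged form is a reasonable correction. A minimal sketch, with an invented toy log (the variant strings and counts are illustrative, not from the paper's data):

```python
from collections import Counter

def pick_by_log_frequency(variants, query_log):
    # Among spelling variants of one intended query, prefer the form
    # that occurs most often in the logs: correct spellings tend to
    # dominate their misspellings.
    counts = Counter(query_log)
    return max(variants, key=lambda v: counts[v])
```

This ignores the character error model entirely; the full system weighs relative frequency against edit distance rather than using frequency alone.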

8 Exploiting Large Web Query Logs
Example of the iterative correction: anol scwartegger -> arnold schwarzenegger
- Misspelled query: anol scwartegger
- First iteration: arnold schwartnegger
- Second iteration: arnold schwarznegger
- Third iteration: arnold schwarzenegger
- Fourth iteration: no further correction
Shortcomings of treating the query as a full string to be corrected:
- Depends on the agreement between the relative frequencies and the character error model
- Requires identifying all queries in the query log that are misspellings of other queries
- Requires finding a correction sequence of logged queries for any new query
- Covers only exact matches of the queries that appear in these logs
- Provides low coverage of infrequent queries
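The iterative process above amounts to re-running a one-step corrector until it reaches a fixpoint. A minimal sketch, where `correct_once` stands in for the paper's single correction pass (here supplied by the caller):

```python
def iterative_correct(query, correct_once, max_iters=10):
    # Re-apply the one-step corrector until the suggestion stops
    # changing (a fixpoint) or a previously seen query recurs (a cycle).
    seen = {query}
    for _ in range(max_iters):
        suggestion = correct_once(query)
        if suggestion == query or suggestion in seen:
            break
        seen.add(suggestion)
        query = suggestion
    return query
```

Each pass only needs to bridge a small edit distance, so a chain of cheap corrections can reach a final form that would be too far from the original for a single pass.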

9 Example of Query Correction Using Substrings
The tokenization process uses space and punctuation delimiters, in addition to the information about multi-word compounds provided by a trusted lexicon, to break up the input query. Word unigram and bigram statistics extracted from query logs serve as the system's language model.

10 Query Correction Procedure
1. The input query is tokenized using space and word-delimiter information, in addition to the available lexical information.
2. A set of alternatives is computed for each token using the weighted Levenshtein distance function described before, with two thresholds for in-lexicon and out-of-lexicon tokens.
3. Matches are searched in the space of word unigrams and bigrams extracted from query logs and the trusted lexicon.
4. A modified Viterbi search is employed to find the best possible alternative string for the input query.
- Constraint: no two adjacent words change simultaneously.
- Restriction: in-lexicon words are not allowed to change in the first iteration.
- Fringe: the set of search paths kept during the search.
- Assumption based on the constraint: the list of alternatives for each word is randomly ordered, except that an input word found in the trusted lexicon occupies the first position of its list.

11 Modified Viterbi Search Method
Figure 1. Example trellis of the modified Viterbi search
When word-bigram statistics are used, stop words may interfere negatively with the best-path search. To avoid this, a special strategy is applied:
1. Ignore the stop words, as in Figure 1
2. Compute the best alternatives for the skipped stop words in a second Viterbi search, as in Figure 2
Figure 2. Stop-word treatment

12 Conclusion
- Success in using the collective knowledge stored in search query logs for the spelling correction task
- An effective and efficient search method with good space complexity
- Appropriate suggestions produced by iterative spell checking with the adjacency restriction and the modified edit-distance function
- A technique that exploits an extremely informative but noisy resource, turning the errors people make into a means of effective query spelling correction
- Larger and more realistic evaluation data are still needed for a more convincing result
- The technique could be adapted to general-purpose spelling correction by using statistics from both query logs and large office-document collections
