
Autumn 2011. Web Information Retrieval (Web IR). Handout #3: Dictionaries and tolerant retrieval. Mohammad Sadegh Taherzadeh, ECE Department, Yazd University.




1 Autumn 2011. Web Information Retrieval (Web IR). Handout #3: Dictionaries and tolerant retrieval. Mohammad Sadegh Taherzadeh, ECE Department, Yazd University. s.taherzade@stu.yaduni.ac.ir

2 Architecture of Search Engines
[Architecture diagram; components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Query Engine, Ranking, Client, and Indexes (Text, Structure, Utility), with Queries flowing from the Client to the Query Engine.]

3 Introduction
10%-12% of search engine queries are misspelled.
Spelling correction affects information retrieval effectiveness.
A good spelling corrector should act only when it is clear that the user made an error.

4 Spelling Errors
Typographic errors
–These occur when the correct spelling of the word is known but the word is mistyped by mistake (example: taht --> that).
–Word-boundary errors (example: home page --> homepage).
Cognitive errors
–These occur when the correct spelling of the word is not known (example: seprate --> separate).

5 Spelling Error Correction
The problem of spelling error correction entails three sub-problems:
–Detection of an error
–Generation of candidate corrections
–Ranking of candidate corrections

6 Spelling Error Correction (cont.)
An example (a Persian query for social security records, with تامین misspelled as تعمین):
–Misspelled input query: استعلام سوابق تعمین اجتماعی
–Error detection: استعلام سوابق تعمین اجتماعی
–Candidate generation: { تخمین، تامین، تعمیر، تضمین، تعمیم، تعیین }
–Candidate ranking: { تامین، تعمیم، تعیین، تضمین، تعمیر، تخمین }
–Correction: استعلام سوابق تامین اجتماعی

7 Implementing Spelling Correction
There are two basic principles underlying most spelling correction algorithms:
–1. Of the various alternative correct spellings for a misspelled query, choose the "nearest" one. This demands that we have a notion of nearness or proximity between a pair of queries.

8 –2. When two correctly spelled queries are tied (or nearly tied), select the one that is more common. The simplest notion of "more common" is the number of occurrences of the term in the collection. A different notion, employed by many search engines, especially on the web, is to prefer the correction that is most common among queries typed in by other users.

9 Error Detection
N-gram based techniques
–Spell checkers without dictionaries
–Non-positional vs. positional N-grams
–The method begins by going right through the dictionary and tabulating all the trigrams (three-letter sequences). For instance, "abs" occurs quite often ("absent", "crabs"), whereas "pkx" does not occur at all. It would therefore flag "pkxie", which might have been mistyped for "pixie".
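The trigram tabulation described above can be sketched as follows (the tiny word list and the "-" boundary-padding convention are illustrative assumptions):

```python
# Dictionary-free trigram error detection: tabulate all trigrams seen
# in a word list, then flag words containing a trigram that never occurs.
def trigrams(word):
    padded = f"-{word}-"           # mark word boundaries
    return {padded[i:i+3] for i in range(len(padded) - 2)}

def build_trigram_table(dictionary):
    table = set()
    for word in dictionary:
        table |= trigrams(word)
    return table

def looks_misspelled(word, table):
    # A word is suspicious if any of its trigrams was never seen.
    return any(t not in table for t in trigrams(word))

dictionary = ["pixie", "absent", "crabs", "that"]
table = build_trigram_table(dictionary)
print(looks_misspelled("pkxie", table))  # "pkx" never occurs -> True
print(looks_misspelled("pixie", table))  # False
```

Note that detection here needs no dictionary lookup at query time, only the trigram table.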

10 Dictionary based techniques
–Given a word, look it up in the dictionary for validation.
–Dictionary construction issues
–Effective search lookup: hash table; trie (a prefix tree, named after "retrieval")
–For example: استعلام سوابق تعمین اجتماعی ✓ / معنی واژه تعمین ╳
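A minimal trie for dictionary validation might look like the following sketch (the slide only names the structure; the word list here is illustrative):

```python
# Trie (prefix tree) for dictionary lookup: each node maps a character
# to a child node, and marks whether a complete word ends there.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie(["home", "page", "homepage"])
print(trie.lookup("homepage"))  # True
print(trie.lookup("homepag"))   # False (only a prefix)
```

A hash table gives O(1) exact lookup; the trie's advantage is shared prefixes and cheap prefix traversal.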

11 Types of errors:
–Non-word errors
–Real-word errors
Most errors in web queries are real-word errors. Context-based error detection is used for real-word errors.

12 Generate Candidates
Techniques:
–Minimum edit distance techniques
–Similarity key techniques
–Rule-based techniques
–N-gram-based techniques
–Probabilistic techniques
–Neural networks

13 Minimum Edit Distance Techniques
Edit distance
–Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2.
–Edit operations (Damerau-Levenshtein distance):
Insertion, e.g. typing acress for cress
Deletion, e.g. typing acress for actress
Substitution, e.g. typing acress for across
Transposition, e.g. typing acress for caress

14 The literature on spelling correction claims that 80% to 95% of spelling errors are within an edit distance of 1 from the target. Compute the edit distance between the erroneous word and all dictionary words, then select those dictionary words whose edit distance is within a pre-specified threshold.
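A sketch of this selection, using the optimal-string-alignment variant of Damerau-Levenshtein distance (the small candidate dictionary is illustrative):

```python
def edit_distance(s1, s2):
    # Damerau-Levenshtein (optimal string alignment): insertions,
    # deletions, substitutions, and adjacent transpositions.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution
            if (i > 1 and j > 1 and s1[i-1] == s2[j-2]
                    and s1[i-2] == s2[j-1]):
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[m][n]

def candidates(misspelling, dictionary, threshold=1):
    # Keep dictionary words within the pre-specified threshold.
    return [w for w in dictionary
            if edit_distance(misspelling, w) <= threshold]

words = ["cress", "actress", "across", "caress", "access", "acres"]
print(candidates("acress", words, threshold=1))  # all six are distance 1
```

With threshold 1 this recovers exactly the "acress" alternatives in the slide's examples, one per edit operation.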


16 Similarity Key Techniques
–Aim: assign a common code to similar words and strings.
Coding schemes:
–Sound similarity (recieve ➡ receive): Soundex algorithm
–Shape similarity ( انتخاب ➡ انتحاب ): Shapex algorithm

17 Soundex
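The slide shows the Soundex code table; a compact implementation of the classic algorithm might look like this sketch:

```python
def soundex(word):
    # Classic Soundex: keep the first letter, map remaining letters to
    # digit classes, skip repeats of the same class and vowels, pad or
    # truncate the result to four characters.
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":          # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"))   # R163
print(soundex("Rupert"))   # R163 -- same code, so they "sound alike"
```

Words that map to the same four-character code are treated as candidate corrections for each other.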

18 N-Gram Based Techniques
N-grams
–An N-gram is a sequence of N adjacent letters in a word.
–The more N-grams two strings share, the more similar they are.
Similarity coefficient δ
–δ = |common N-grams| / |total N-grams| (the Jaccard coefficient)

19 N-gram similarity example: fact vs. fract
–Bigrams in fact: -f fa ac ct t- (5 bigrams)
–Bigrams in fract: -f fr ra ac ct t- (6 bigrams)
–Union: -f fa fr ra ac ct t- (7 bigrams)
–Common: -f ac ct t- (4 bigrams)
–δ = 4/7 ≈ 0.57
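This computation is easy to reproduce in code; a sketch, using "-" as the boundary-padding character exactly as in the example:

```python
def bigrams(word):
    # Pad with "-" so word boundaries contribute bigrams (-f, t-, ...).
    padded = f"-{word}-"
    return {padded[i:i+2] for i in range(len(padded) - 1)}

def similarity(s1, s2):
    # Jaccard coefficient over padded bigrams:
    # delta = |common N-grams| / |total (union) N-grams|
    b1, b2 = bigrams(s1), bigrams(s2)
    return len(b1 & b2) / len(b1 | b2)

print(round(similarity("fact", "fract"), 2))  # 0.57
```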

20 Candidate generation with an N-gram inverted index
–For example, the misspelling "bord" ➡ bo, or, rd
–We would enumerate "aboard", "boardroom" and "border".
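A sketch of such a bigram inverted index (the three-word dictionary is taken from the slide's example): each bigram maps to the dictionary words containing it, so a misspelling retrieves every word sharing at least one of its bigrams.

```python
from collections import defaultdict

def bigrams(word):
    return {word[i:i+2] for i in range(len(word) - 1)}

def build_index(dictionary):
    # Inverted index: bigram -> set of dictionary words containing it.
    index = defaultdict(set)
    for word in dictionary:
        for bg in bigrams(word):
            index[bg].add(word)
    return index

def candidates(misspelling, index):
    # Union of the postings of the misspelling's bigrams.
    found = set()
    for bg in bigrams(misspelling):
        found |= index.get(bg, set())
    return found

index = build_index(["aboard", "boardroom", "border"])
print(sorted(candidates("bord", index)))  # ['aboard', 'boardroom', 'border']
```

In practice the retrieved set is then filtered, e.g. by the Jaccard coefficient or edit distance, before ranking.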

21 Probabilistic Techniques
Find the most probable transmitted word (correct dictionary word) for a received erroneous string (misspelling).
Generic algorithm
–The model assigns a probability to each correct dictionary word for being a possible correction of the misspelling. The word with the highest probability is considered the closest match (i.e. the actual intended word).
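One common instantiation of this idea is a noisy-channel model that scores each candidate word w for misspelling s by P(s|w) · P(w). The probability values below are made-up numbers purely for illustration; a real system estimates them from error data and a corpus.

```python
# Hypothetical error-model probabilities P(s|w): chance of typing the
# misspelling given the intended word.
error_model = {("acress", "actress"): 0.00002,
               ("acress", "across"):  0.00001,
               ("acress", "cress"):   0.000005}
# Hypothetical language-model probabilities P(w): word frequency.
language_model = {"actress": 0.00005, "across": 0.0004, "cress": 0.000001}

def best_correction(misspelling, candidates):
    # argmax over candidates of P(s|w) * P(w)
    def score(w):
        return (error_model.get((misspelling, w), 0.0)
                * language_model.get(w, 0.0))
    return max(candidates, key=score)

print(best_correction("acress", ["actress", "across", "cress"]))  # across
```

Here "across" wins despite a lower error-model score because its language-model probability dominates, which is exactly the trade-off the generic algorithm describes.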

22 Probabilistic Techniques (cont.)


25 Error Model
Letter-to-letter confusion probabilities [Kernighan 1990]:
–keyboard adjacencies
–a probability matrix
–rule-based
String-to-string confusion probabilities [Brill 2000]:
–we need a training set of (s_i, w_i) string pairs, where s_i represents a spelling error and w_i is the corresponding corrected word.

26 For each training pair (s_i, w_i), we count the frequencies of edit operations α → β. These frequencies are then used to compute P(α → β), the probability that when users intended to type the string α they typed β instead.
–As an example, we extract the following edit operations from the training pair (satellite, satillite):
–Window size 1: e → i
–Window size 2: te → ti, el → il
–Window size 3: tel → til, ate → ati, ell → ill
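The windowing in this example can be sketched as follows. It assumes, as in the (satellite, satillite) pair, a single same-length substitution error; the function name and structure are ours, not from the original paper:

```python
def edit_operations(intended, typed, max_window=3):
    # Locate the single mismatching position, then, for each window
    # size, emit every substring of the intended word covering that
    # position, paired with its typed counterpart.
    pos = next(i for i, (a, b) in enumerate(zip(intended, typed)) if a != b)
    ops = {}
    for size in range(1, max_window + 1):
        pairs = []
        for start in range(pos - size + 1, pos + 1):
            if start >= 0 and start + size <= len(intended):
                pairs.append((intended[start:start+size],
                              typed[start:start+size]))
        ops[size] = pairs
    return ops

for size, pairs in edit_operations("satellite", "satillite").items():
    print(size, ["%s -> %s" % p for p in pairs])
```

Running this on (satellite, satillite) reproduces the operations listed on the slide: e → i; te → ti, el → il; and the three window-3 pairs.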

27 Language Model
سازمان بیمه تامین ... (a Persian next-word prompt: "the Ta'min [social security] insurance organization ...")
Guessing the next word, i.e. word prediction.
Definition
–A statistical language model is a probability distribution over sequences of words.
–Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications.

28 Language Model (cont.)

29 We might represent this probability as P(w_1, w_2, ..., w_{n-1}, w_n). We can use the chain rule of probability to decompose it:
P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 ... w_{n-1})

30 But how can we compute a probability like P(w_n | w_1 ... w_{n-1})? By counting N-grams of words in corpora. The general equation for the N-gram approximation to the conditional probability of the next word in a sequence is:
P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-N+1} ... w_{n-1})

31 For the bigram model:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
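This maximum-likelihood estimate can be computed directly from counts; a sketch on a toy corpus (the sentences and the `<s>`/`</s>` boundary markers are illustrative):

```python
from collections import Counter

# Bigram language model with maximum-likelihood estimates:
# P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
corpus = [["<s>", "i", "am", "sam", "</s>"],
          ["<s>", "sam", "i", "am", "</s>"],
          ["<s>", "i", "do", "not", "like", "green", "eggs", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i+1])
                        for sent in corpus for i in range(len(sent) - 1))

def p(word, prev):
    # Conditional probability of `word` following `prev`.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("i", "<s>"))   # 2/3: "i" starts 2 of the 3 sentences
print(p("am", "i"))    # 2/3: "am" follows 2 of the 3 occurrences of "i"
```

Unsmoothed counts like these assign zero probability to unseen bigrams, which is why the following slide lists ways to improve the language model.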

32 To improve the language model:
–Co-occurrence frequencies + confusion sets
–N-gram POS probabilities
–...

33 Forms of spelling correction
–Isolated-term
–Context-sensitive

34 End. Questions?

