Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 3: Dictionaries and tolerant retrieval


1 Title slide: Lecture 3: Dictionaries and tolerant retrieval (Hinrich Schütze and Christina Lioma)

2 Outline: ❶ Recap ❷ Dictionaries ❸ Wildcard queries ❹ Edit distance ❺ Spelling correction ❻ Soundex

3 Dictionary data structures for inverted indexes
The dictionary stores the term vocabulary, the document frequency of each term, and a pointer to each term's postings list ... in what data structure? (Sec. 3.1)

4 A naïve dictionary
An array of struct, one entry per term:
– term: char[20] (20 bytes)
– document frequency: int (4/8 bytes)
– pointer to postings list: Postings * (4/8 bytes)
How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time? (Sec. 3.1)
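To make the layout concrete, here is a minimal sketch of such an entry using Python's ctypes (the field names are illustrative; the real dictionary would be an array of these):

    import ctypes

    class DictEntry(ctypes.Structure):
        _fields_ = [
            ("term", ctypes.c_char * 20),    # fixed-width term: 20 bytes
            ("doc_freq", ctypes.c_int),      # document frequency: 4 bytes
            ("postings", ctypes.c_void_p),   # pointer to postings list: 4/8 bytes
        ]

    # an "array of struct" dictionary with room for 1000 terms
    Dictionary = DictEntry * 1000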

5 Dictionary data structures
Two main choices:
– Hash table
– Tree
Some IR systems use hash tables, some use trees. (Sec. 3.1)

6 Hash tables
Each vocabulary term is hashed to an integer. (We assume you've seen hash tables before.)
Pros:
– Lookup is faster than for a tree: O(1)
Cons:
– No easy way to find minor variants: judgment/judgement
– No prefix search [tolerant retrieval]
– If the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything (Sec. 3.1)

7 Trees
Simplest: binary tree. More usual: B-tree.
Trees require a standard ordering of characters and hence of strings ... but we standardly have one.
Pros:
– Solves the prefix problem (e.g., all terms starting with hyp)
Cons:
– Slower lookup: O(log M) [and this requires a balanced tree]
– Rebalancing binary trees is expensive, but B-trees mitigate the rebalancing problem (Sec. 3.1)

8 Binary tree (figure)

9 Tree: B-tree
Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4]. (Figure: the root splits the key range into a-hu, hy-m, n-z.) (Sec. 3.1)

10 Outline: ❶ Recap ❷ Dictionaries ❸ Wildcard queries ❹ Edit distance ❺ Spelling correction ❻ Soundex

11 Wild-card queries: *
mon*: find all docs containing any word beginning with "mon". Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo.
*mon: find words ending in "mon": harder.
– Maintain an additional B-tree on the terms written backwards. Then retrieve all reversed words w in the range nom ≤ w < non.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent? (Sec. 3.2)
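A minimal sketch of the range retrieval, with a sorted in-memory list standing in for the B-tree (the vocabulary and function names are illustrative):

    import bisect

    def prefix_range(sorted_vocab, prefix):
        """All terms w with prefix <= w < next-string-after-prefix."""
        lo = bisect.bisect_left(sorted_vocab, prefix)
        # bump the last character: "mon" -> upper bound "moo"
        upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
        hi = bisect.bisect_left(sorted_vocab, upper)
        return sorted_vocab[lo:hi]

    vocab = sorted(["moat", "mon", "monday", "money", "month", "moon"])
    print(prefix_range(vocab, "mon"))    # ['mon', 'monday', 'money', 'month']

    # *mon: maintain a second sorted list of the terms written backwards
    rev = sorted(t[::-1] for t in vocab)
    print([t[::-1] for t in prefix_range(rev, "nom")])   # ['mon']

For pro*cent, the idea (made explicit on the next two slides) is to intersect the terms matching pro* from the forward list with those matching *cent from the reversed list, then check the survivors against the full pattern.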

12 How to handle * in the middle of a term
– Example: m*nchen
– We could look up m* and *nchen in the B-tree and intersect the two term sets. Expensive!
– Alternative: the permuterm index
– Basic idea: rotate every wildcard query so that the * occurs at the end.
– Store each of these rotations in the dictionary, say, in a B-tree.

13 B-trees handle *'s at the end of a query term
How can we handle *'s in the middle of a query term?
– co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the *'s occur at the end. This gives rise to the permuterm index. (Sec. 3.2)

14 Permuterm index
– For the term HELLO: add hello$, ello$h, llo$he, lo$hel, o$hell, and $hello to the B-tree, where $ is a special end-of-term symbol.

15 Permuterm → term mapping (figure)

16 Permuterm index
– For HELLO, we've stored: hello$, ello$h, llo$he, lo$hel, o$hell, and $hello
– Queries:
  – For X, look up X$
  – For X*, look up $X*
  – For *X, look up X$*
  – For *X*, look up X*
  – For X*Y, look up Y$X*
– Example: for hel*o, look up o$hel*
– The permuterm index would better be called a permuterm tree, but permuterm index is the more common name.

17 Processing a lookup in the permuterm index
– Rotate the query so that the wildcard is at the right end
– Use B-tree (prefix) lookup as before
– Problem: the permuterm index more than quadruples the size of the dictionary compared to a regular B-tree (an empirical number).
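A minimal in-memory sketch of the permuterm idea, with a sorted list of rotations standing in for the B-tree (all names are illustrative; only single-* queries are handled):

    import bisect

    def rotations(term):
        """All rotations of term + '$': hello -> hello$, ello$h, ..., $hello."""
        s = term + "$"
        return [s[i:] + s[:i] for i in range(len(s))]

    def build_permuterm(vocab):
        index = []                       # sorted list of (rotation, term) pairs
        for t in vocab:
            index.extend((r, t) for r in rotations(t))
        index.sort()
        return index

    def permuterm_lookup(index, query):
        """Single-* wildcard query, e.g. 'hel*o'. Rotate so * is at the end."""
        s = query + "$"
        star = s.index("*")
        key = s[star + 1:] + s[:star]    # hel*o$ -> o$hel
        lo = bisect.bisect_left(index, (key,))
        hits = set()
        while lo < len(index) and index[lo][0].startswith(key):
            hits.add(index[lo][1])
            lo += 1
        return hits

    idx = build_permuterm(["hello", "halo", "help"])
    print(permuterm_lookup(idx, "hel*o"))   # {'hello'}
    print(permuterm_lookup(idx, "h*o"))     # {'hello', 'halo'}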

18 k-gram indexes
– More space-efficient than the permuterm index.
– Enumerate all character k-grams (sequences of k characters) occurring in a term.
– 2-grams are called bigrams.
– Example: from "April is the cruelest month" we get the bigrams: $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
– $ is a special word-boundary symbol, as before.
– Maintain an inverted index from bigrams to the terms that contain the bigram.
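Enumerating the k-grams is a one-liner; a small sketch with the $ boundary symbol, reproducing the bigrams of month from the example:

    def kgrams(term, k=2):
        """Character k-grams of a term, with $ marking the word boundary."""
        s = "$" + term + "$"
        return [s[i:i + k] for i in range(len(s) - k + 1)]

    print(kgrams("month"))   # ['$m', 'mo', 'on', 'nt', 'th', 'h$']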

19 Postings list in a 3-gram inverted index (figure)

20 k-gram (bigram, trigram, ...) indexes
– Note that we now have two different types of inverted indexes:
– The term-document inverted index, for finding documents based on a query consisting of terms
– The k-gram index, for finding terms based on a query consisting of k-grams

21 Processing wildcarded terms in a bigram index
– The query mon* can now be run as: $m AND mo AND on
– This gets us all terms with the prefix mon ...
– ... but also many "false positives" like MOON.
– We must postfilter these terms against the query.
– Surviving terms are then looked up in the term-document inverted index.
– k-gram index vs. permuterm index: the k-gram index is more space-efficient; the permuterm index doesn't require postfiltering.
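Putting it together: a sketch of wildcard lookup over a bigram index with postfiltering (Python's standard fnmatchcase serves as the postfilter; the vocabulary is illustrative):

    from collections import defaultdict
    from fnmatch import fnmatchcase

    def kgrams(term, k=2):               # same as the earlier sketch
        s = "$" + term + "$"
        return [s[i:i + k] for i in range(len(s) - k + 1)]

    vocab = ["mon", "money", "month", "moon", "lemon"]
    bigram_index = defaultdict(set)      # bigram -> terms containing it
    for t in vocab:
        for g in kgrams(t):
            bigram_index[g].add(t)

    def wildcard_lookup(query):          # e.g. "mon*"
        grams = [g for g in kgrams(query) if "*" not in g]   # $m, mo, on
        candidates = set(vocab)
        for g in grams:
            candidates &= bigram_index[g]                    # AND of postings
        # postfilter: "moon" matches $m AND mo AND on but not mon*
        return {t for t in candidates if fnmatchcase(t, query)}

    print(wildcard_lookup("mon*"))       # {'mon', 'money', 'month'}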

22 Processing wild-card queries
As before, we must execute a Boolean query for each enumerated, filtered term.
Wild-cards can result in expensive query execution (very large disjunctions ...)
– pyth* AND prog*
If you encourage "laziness", people will respond!
Which web search engines allow wildcard queries?
(Mock search box on the slide: "Search: Type your search terms, use '*' if you need to. E.g., Alex* will match Alexander.") (Sec. 3.2.2)

23 Outline: ❶ Recap ❷ Dictionaries ❸ Wildcard queries ❹ Edit distance ❺ Spelling correction ❻ Soundex

24 Distance between a misspelled word and a "correct" word
– We will study several alternatives:
– Weighted edit distance
– Edit distance and Levenshtein distance
– k-gram overlap

25 Weighted edit distance
– As edit distance (defined on the next slide), but the weight of an operation depends on the characters involved.
– Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
– Therefore, replacing m by n is a smaller edit distance than replacing m by q.

26 Edit distance
– The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
– Levenshtein distance: the admissible basic operations are insert, delete, and replace.
– Levenshtein distance dog-do: 1
– Levenshtein distance cat-cart: 1
– Levenshtein distance cat-cut: 1
– Levenshtein distance cat-act: 2

27-31 Levenshtein distance: Algorithm
(The dynamic-programming algorithm is built up step by step across these five slides; the pseudocode figures are not captured in this transcript. See the code sketch after slide 32 below.)

32 Each cell of the Levenshtein matrix
– Cost of getting here from my upper left neighbor (copy or replace)
– Cost of getting here from my upper neighbor (delete)
– Cost of getting here from my left neighbor (insert)
– The cell value is the minimum of the three possible "movements": the cheapest way of getting here.
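That recurrence in code: a minimal sketch of the standard dynamic-programming algorithm (not the deck's exact pseudocode):

    def levenshtein(s1, s2):
        m, n = len(s1), len(s2)
        # d[i][j] = edit distance between s1[:i] and s2[:j]
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                  # delete all of s1[:i]
        for j in range(n + 1):
            d[0][j] = j                  # insert all of s2[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # copy/replace
                    d[i - 1][j] + 1,                             # delete
                    d[i][j - 1] + 1,                             # insert
                )
        return d[m][n]

    print(levenshtein("oslo", "snow"))   # 3
    print(levenshtein("cat", "catcat"))  # 3  (three insertions)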

33 Levenshtein distance: Example (figure)

34 Exercise
❶ Compute the Levenshtein distance matrix for OSLO - SNOW.
❷ What are the Levenshtein editing operations that transform cat into catcat?

35-41 Levenshtein distance: exercise solution
(The distance matrix for OSLO - SNOW is filled in cell by cell across these slides; the figures are not captured in this transcript. The resulting distance is 3.)

42 How do I read out the editing operations that transform OSLO into SNOW?

43-52 (These slides trace the editing operations back through the matrix, from the bottom-right cell to the top-left; the figures are not captured in this transcript.)

53 Outline: ❶ Recap ❷ Dictionaries ❸ Wildcard queries ❹ Edit distance ❺ Spelling correction ❻ Soundex

54 Spelling correction
Two main flavors:
– Isolated word: check each word on its own for misspelling. Will not catch typos resulting in correctly spelled words, e.g., from → form.
– Context-sensitive: look at surrounding words, e.g., I flew form Heathrow to Narita. (Sec. 3.3)

55 Correcting queries
– First: isolated-word spelling correction.
– Premise 1: there is a list of "correct words" from which the correct spellings come.
– Premise 2: we have a way of computing the distance between a misspelled word and a correct word.
– Simple spelling correction algorithm: return the "correct" word that has the smallest distance to the misspelled word.
– Example: informaton → information
– For the list of correct words, we can use the vocabulary of all words that occur in our collection.
– Why is this problematic?
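The simple algorithm in one function, reusing levenshtein() from the earlier sketch (the vocabulary here is illustrative):

    def correct(word, vocab):
        # levenshtein() as defined in the sketch after slide 32
        return min(vocab, key=lambda w: levenshtein(word, w))

    vocab = ["information", "informal", "informant", "retrieval"]
    print(correct("informaton", vocab))   # information

Scanning the whole vocabulary this way is expensive; the k-gram index two slides below narrows the candidate set before any edit distances are computed.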

56 Alternatives to using the term vocabulary
– A standard dictionary (Webster's, OED, etc.)
– An industry-specific dictionary (for specialized IR systems)

57 k-gram indexes for spelling correction
– Enumerate all k-grams in the query term.
– Example: bigram index, misspelled word bordroom
– Bigrams: bo, or, rd, dr, ro, oo, om
– Use the k-gram index to retrieve "correct" words that match query-term k-grams.
– Threshold by the number of matching k-grams.

58 k-gram indexes for spelling correction: bordroom (figure)

59 Jaccard coefficient
A commonly used measure of overlap.
Let X and Y be two sets; then the Jaccard coefficient is J(X, Y) = |X ∩ Y| / |X ∪ Y|.
It equals 1 when X and Y have the same elements and 0 when they are disjoint.
X and Y don't have to be of the same size. (Sec. 3.3.4)

60 Example: bigram matches
Query A = bordroom reaches term B = boardroom
A = bo or rd dr ro oo om (7 bigrams)
B = bo oa ar rd dr ro oo om (8 bigrams)
|A ∩ B| = 6, so J(A, B) = 6 / (7 + 8 − 6) = 6/9 = 2/3
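The computation from this example as a sketch (boundaryless bigrams, to match the slide):

    def jaccard(x, y):
        x, y = set(x), set(y)
        return len(x & y) / len(x | y)      # |X ∩ Y| / |X ∪ Y|

    def bigrams(term):                      # no $ boundary, as in the example
        return {term[i:i + 2] for i in range(len(term) - 1)}

    print(jaccard(bigrams("bordroom"), bigrams("boardroom")))   # 6/9 ≈ 0.67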

61 Context-sensitive spelling correction
Text: I flew from Heathrow to Narita.
Consider the phrase query "flew form Heathrow".
We'd like to respond with Did you mean "flew from Heathrow"?, because no docs matched the query phrase. (Sec. 3.3.5)

62 Context-sensitive correction
We need the surrounding context to catch this.
First idea: retrieve dictionary terms close (in weighted edit distance) to each query term.
Now try all possible resulting phrases with one word "fixed" at a time:
– flew from heathrow
– fled form heathrow
– flea form heathrow
Hit-based spelling correction: suggest the alternative that has lots of hits. (Sec. 3.3.5)
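A sketch of the enumeration step; the close() helper stands in for a weighted-edit-distance lookup and is purely illustrative, and the hit counting is not shown:

    def candidate_phrases(words, close):
        """Vary one word at a time, keeping the others fixed."""
        for i, w in enumerate(words):
            for alt in close(w):
                if alt != w:
                    yield words[:i] + [alt] + words[i + 1:]

    close = lambda w: {"flew": ["flew", "fled", "flea"],
                       "form": ["form", "from", "fort"]}.get(w, [w])
    for p in candidate_phrases(["flew", "form", "heathrow"], close):
        print(" ".join(p))    # the one-word variants, including "flew from heathrow"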

63 Outline: ❶ Recap ❷ Dictionaries ❸ Wildcard queries ❹ Edit distance ❺ Spelling correction ❻ Soundex

64 Soundex
– Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
– Example: chebyshev / tchebyscheff
– Algorithm:
– Turn every token to be indexed into a 4-character reduced form
– Do the same with query terms
– Build and search an index on the reduced forms

65 Soundex algorithm
❶ Retain the first letter of the term.
❷ Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
❸ Change letters to digits as follows:
– B, F, P, V to 1
– C, G, J, K, Q, S, X, Z to 2
– D, T to 3
– L to 4
– M, N to 5
– R to 6
❹ Repeatedly remove one out of each pair of consecutive identical digits.
❺ Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
(A code sketch follows the worked example on the next slide.)

66 Example: Soundex of HERMAN
– Retain H
– ERMAN → 0RM0N
– 0RM0N → 06505
– 06505 → 06505 (no pairs of consecutive identical digits)
– 06505 → 655
– Return H655
– Note: HERMANN will generate the same code.
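A sketch implementing exactly the five steps above (the simplified Soundex as given on the slide; edge cases of the full standard are ignored):

    def soundex(term):
        term = term.upper()
        code = {**dict.fromkeys("AEIOUHWY", "0"),
                **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                **dict.fromkeys("DT", "3"), "L": "4",
                **dict.fromkeys("MN", "5"), "R": "6"}
        digits = [code[c] for c in term[1:] if c in code]
        # step 4: remove one of each pair of consecutive identical digits
        collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
        # step 5: drop zeros, pad with trailing zeros, keep letter + three digits
        return (term[0] + "".join(d for d in collapsed if d != "0") + "000")[:4]

    print(soundex("HERMAN"))    # H655
    print(soundex("HERMANN"))   # H655 (same code, as the slide notes)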

67 How useful is Soundex?
– Not very, for information retrieval.
– OK for similar-sounding terms. The idea owes its origins to work in international police departments, which sought to match names of wanted criminals despite those names being spelled differently in different countries.

