Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture 8, Thursday December 10 th, 2009 (Error-Tolerant Search)

Goal of Today's Lecture

Learn about error-tolerant search
– it's about typing errors in the query and in the documents
– a natural word similarity measure: Levenshtein distance
  – definition
  – intuition
  – how to compute it (dynamic programming)
– given a query word and a large set of words, find the most similar word(s) in the set
– context-sensitive query correction (Did you mean …?)

Last half hour:
– open discussion about the exercise sheets
– how much work they are, how hard, etc.

What is Error-Tolerant Search?

We query a search engine and don't get results
– … or only few results
– for example: algoritm
– Reason 1: we misspelled the query (should be: algorithm)
– Reason 2: in the document(s) we are looking for, the words are misspelled [show examples]

Solution for the problem due to Reason 1:
– find words that are "similar" to the (misspelled) query words

Solution for the problem due to Reason 2:
– find words that are "similar" to the (correctly spelled) query words

In any case, we need to find "similar" words.

Similarity Measures

Levenshtein distance, a.k.a. edit distance
– given two strings / words x and y
– consider the following transformations:
  – replace a single character: alkorithm → algorithm
  – insert a character: algoritm → algorithm
  – delete a character: allgorithm → algorithm
– the Levenshtein or edit distance ED(x, y) is then defined as the minimal number of transformations needed to get from x to y
  – for example: x = allgorithm, y = aigorytm; then ED(x, y) = 4
– more about other measures later

Efficient Computation

Terminology
– x(i) = the prefix of x of length i; in particular, x(|x|) = x
– y(j) = the prefix of y of length j; in particular, y(|y|) = y
– ε denotes the empty word

Recursion
– it seems that we can compute ED(x, y) recursively from the ED of prefixes of x and y, namely:
– ED(x(i), ε) = i and ED(ε, y(j)) = j
– for i and j both > 0, ED(x(i), y(j)) is the minimum of
  – ED(x(i), y(j-1)) + 1 ~ insert y[j]
  – ED(x(i-1), y(j)) + 1 ~ delete x[i]
  – ED(x(i-1), y(j-1)) + 1 ~ replace x[i] by y[j]
  – ED(x(i-1), y(j-1)) if x[i] = y[j]
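The recursion above can be transcribed almost literally into code. A minimal sketch, assuming Python; the function name `ed` and the memoization via `lru_cache` are my additions, not from the slides (the slides use 1-based indices x[i], the code maps them to Python's 0-based strings):

```python
from functools import lru_cache

def ed(x: str, y: str) -> int:
    """Edit distance via the recursion from the slide, memoized."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j              # ED(ε, y(j)) = j
        if j == 0:
            return i              # ED(x(i), ε) = i
        best = min(d(i, j - 1) + 1,       # insert y[j]
                   d(i - 1, j) + 1,       # delete x[i]
                   d(i - 1, j - 1) + 1)   # replace x[i] by y[j]
        if x[i - 1] == y[j - 1]:          # case x[i] = y[j]
            best = min(best, d(i - 1, j - 1))
        return best
    return d(len(x), len(y))

print(ed("allgorithm", "aigorytm"))  # 4, as in the slide's example
```

Without memoization the same recursion would take exponential time; with it, each pair (i, j) is computed once, which already hints at the dynamic-programming table on the next slides.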

An Example

Let's compute ED("bread", "board")
– assuming that the recursion from the previous slide is correct (which we will prove on one of the next slides)
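The table worked through on the original slide did not survive the transcript; the following sketch recomputes it. D[i][j] holds ED(x(i), y(j)), filled row by row with the recursion from the previous slide (the helper name `edit_distance_table` is my own):

```python
def edit_distance_table(x, y):
    # D[i][j] = ED(x(i), y(j)); base cases: ED(x(i), ε) = i, ED(ε, y(j)) = j
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i][j - 1] + 1,                        # insert y[j]
                          D[i - 1][j] + 1,                        # delete x[i]
                          D[i - 1][j - 1] + (x[i-1] != y[j-1]))   # replace / match
    return D

D = edit_distance_table("bread", "board")
for row in D:
    print(row)
print("ED =", D[-1][-1])  # ED("bread", "board") = 3
```

The bottom-right entry D[5][5] = 3 is the edit distance; one optimal sequence is bread → board via replacing r by o, e by a, and a by r.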

Time and Space Complexity

Let |x| = n and |y| = m
– then the time complexity is O(n ∙ m)
  – O(1) time per table entry
  – and this seems hard to improve in the general case
– the space complexity is also O(n ∙ m)
  – again, O(1) space per table entry
  – but this can easily be improved:
    – we can go column by column and store only the last column
    – or we can go row by row and store only the last row
  – this gives a space complexity of O(min(n, m))
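The row-by-row variant can be sketched as follows; the initial swap makes the stored rows as short as possible, giving the O(min(n, m)) space bound from the slide (function and variable names are my own):

```python
def edit_distance(x, y):
    # Space-efficient DP: keep only the previous row of the table.
    if len(x) < len(y):
        x, y = y, x                     # let y be the shorter string
    prev = list(range(len(y) + 1))      # row for the empty prefix of x
    for i, cx in enumerate(x, 1):
        curr = [i]                      # ED(x(i), ε) = i
        for j, cy in enumerate(y, 1):
            curr.append(min(curr[j - 1] + 1,              # insert
                            prev[j] + 1,                  # delete
                            prev[j - 1] + (cx != cy)))    # replace / match
        prev = curr
    return prev[-1]

print(edit_distance("bread", "board"))  # 3
```

Only two rows of length min(n, m) + 1 are alive at any time; the running time stays O(n ∙ m).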

Correctness Proof

The recursion looks correct
– but it's actually not easy to prove that it is correct
– it is easy to prove that the recursion gives a possible sequence of transformations …
– … and thus an upper bound on the edit distance
– but it's not clear that it gives an optimal sequence of transformations
– I will give you the proof idea
– and a sketch of some parts
– one important part you will do as an exercise

Proof Outline

Lemma 1:
– in an optimal sequence of transformations, if a character is inserted, it is not later deleted again or replaced [next slide]

Lemma 2:
– a sequence of transformations for x → y is called monotone if the transformations on x occur at strictly increasing positions (except that the next operation after a delete may be at the same position)
– the recursion computes an optimal monotone sequence of transformations [slide after the next]

Lemma 3:
– for each optimal sequence of transformations, there is a monotone one with the same length [Exercise!]

Proof Sketch of Lemma 1
– if a character gets inserted and later deleted again, we can remove both operations and get a shorter sequence
– if a character gets inserted and later replaced, we can remove the replace and insert the replaced character right away, and thus get a shorter sequence

Proof Sketch of Lemma 2
– the proof is by induction on |x| + |y| = n + m
– Case 1: the last transformation occurs at position |y|
  – then the previous transformations do one of
    – x(n) → y(m-1) if the last transformation was an insert
    – x(n-1) → y(m) if the last transformation was a delete
    – x(n-1) → y(m-1) if the last transformation was a replace
  – and by way of induction these are optimal
– Case 2: the last transformation occurs at position < |y|
  – then x[n] = y[m], and these transformations do x(n-1) → y(m-1)
  – and by way of induction this is optimal

Finding Similar Words in a Large Set

Given:
– a query word q
– a large set of words S
– a threshold δ

Find:
– all words in S with edit distance ≤ δ to the query word q

Naïve algorithm:
– for each word in S, compute the edit distance to q
– one edit distance computation takes around 1 µs
– so for 10 million words in S, this takes about 10 seconds
– that is unacceptable as response time for a search engine
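The naïve algorithm is just a linear scan over S. A small sketch for concreteness; the word list and helper names are illustrative only, not from the slides:

```python
def edit_distance(x, y):
    # standard DP from the earlier slides, keeping only the previous row
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(curr[j - 1] + 1,           # insert
                            prev[j] + 1,               # delete
                            prev[j - 1] + (cx != cy))) # replace / match
        prev = curr
    return prev[-1]

def similar_words(q, S, delta):
    # naive algorithm: one edit distance computation per word in S
    return [s for s in S if edit_distance(q, s) <= delta]

words = ["algorithm", "logarithm", "rhythm", "allegory"]
print(similar_words("algoritm", words, 1))  # ['algorithm']
```

For a dictionary of 10 million words this scan is exactly what the slide deems too slow, which motivates the filtering techniques on the following slides.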

Filtering with a Permuterm Index

Given x and y with ED(x, y) ≤ δ
– then, if x and y are not too short, they have at least a certain substring in common
– in fact, one can prove that there is a rotation x' of x and a rotation y' of y such that x' and y' have a common prefix of length at least L = ceil(max(|x|, |y|) / δ) − 1 [where ceil(…) means rounded upwards]
– it's one of the exercises to prove this!
– here is an intuitive illustration of why this holds true …

Filtering with a Permuterm Index

This suggests the following algorithm:
– build a permuterm index for S, that is, compute all possible rotations of all words in S and sort them
– let L be the size of a common prefix from the previous slide
– then, for each rotation of the query word q:
  – let q' be the prefix of size L of that rotation
  – find all matches of q'* in the permuterm index for S
– for all matches thus found, compute the actual edit distance
– this will find all words s with ED(q, s) ≤ δ
– let's see by an example how effective the filtering is …
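A minimal sketch of this filter, under two simplifying assumptions of mine: the sorted-rotations index is stored as a plain sorted list searched with binary search (rather than a real prefix data structure), and L is taken as ceil(|q| / δ) − 1, which is at most the L from the lemma (since max(|x|, |y|) ≥ |q|) and therefore conservative, so no true match is filtered out. All names are my own:

```python
from bisect import bisect_left
from math import ceil

def rotations(w):
    return [w[i:] + w[:i] for i in range(len(w))]

def build_permuterm(S):
    # "permuterm index": all rotations of all words in S, sorted
    return sorted((rot, w) for w in S for rot in rotations(w))

def candidates(q, index, L):
    # words with some rotation sharing a length-L prefix with a rotation of q
    cands = set()
    for rot in rotations(q):
        prefix = rot[:L]
        i = bisect_left(index, (prefix, ""))
        while i < len(index) and index[i][0].startswith(prefix):
            cands.add(index[i][1])
            i += 1
    return cands

S = ["algorithm", "logarithm", "rhythm"]
index = build_permuterm(S)
delta = 1
L = ceil(len("algoritm") / delta) - 1   # conservative choice of L
print(candidates("algoritm", index, L))  # {'algorithm'}
```

As on the slide, the candidates are only a filter: the actual edit distance is still computed for each of them afterwards.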

Filtering with a K-Gram Index

Let's build a k-gram index for S
– that is, for each string of length k, have a list of all words in S containing that string as a substring, for example (k = 2):
  – bo: aboard, about, boardroom, border, …
  – or: border, lord, morbid, sordid, …
  – rd: aboard, ardent, boardroom, border, …
– take q = bord as an example query string
– then, using the lists above, we can easily compute for each word s in S the number of k-grams that q and s have in common
  – for example: bord and boardroom have exactly two 2-grams in common (bo and rd)
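A small sketch of such an index and of the merge over the inverted lists of the query's k-grams; the sample word list mirrors the slide's examples, and the function names are my own:

```python
from collections import defaultdict

def kgrams(w, k=2):
    # set of all substrings of length k of w
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def build_kgram_index(S, k=2):
    # for each k-gram, the sorted list of words containing it
    index = defaultdict(list)
    for w in sorted(S):
        for g in kgrams(w, k):
            index[g].append(w)
    return index

def common_kgram_counts(q, index, k=2):
    # merge the inverted lists of q's k-grams, counting matches per word
    counts = defaultdict(int)
    for g in kgrams(q, k):
        for w in index.get(g, []):
            counts[w] += 1
    return counts

S = ["aboard", "about", "boardroom", "border", "lord", "morbid", "sordid", "ardent"]
idx = build_kgram_index(S)
counts = common_kgram_counts("bord", idx)
print(counts["boardroom"])  # 2  (bo and rd)
```

The counts are exactly the intersection sizes |A ∩ B| needed for the Jaccard computation on the next slide.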

Filtering with a K-Gram Index

Jaccard distance between two words x and y
– the Jaccard coefficient of two sets A and B is defined as J(A, B) = |A ∩ B| / |A ∪ B|
– for example, for A = {1, 2, 3} and B = {2, 3, 4}: A ∩ B = {2, 3}, A ∪ B = {1, 2, 3, 4} → J(A, B) = 1/2
– given two words x and y:
  – let A be the set of k-grams of x
  – let B be the set of k-grams of y
  – then the k-gram Jaccard distance J(x, y) = J(A, B)
– for example, for x = bord and y = boardroom:
  – A = {bo, or, rd}, |A ∩ B| = 2 (last slide)
  – |A ∪ B| = 3 + 8 − 2 = 9 → J(x, y) = 2/9 ≈ 0.22
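The definition translates directly into a few lines of code; a minimal sketch (using exact fractions so the results match the slide's values):

```python
from fractions import Fraction

def kgrams(w, k=2):
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def jaccard(x, y, k=2):
    # k-gram Jaccard coefficient J(x, y) = |A ∩ B| / |A ∪ B|
    A, B = kgrams(x, k), kgrams(y, k)
    return Fraction(len(A & B), len(A | B))

print(jaccard("bord", "boardroom"))  # 2/9
```

The same function reproduces the two problematic examples on the next slide: J("weigh", "weihg") = 1/3 and J("aster", "terase") = 1/2.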

Filtering with a K-Gram Index

So scanning the inverted lists of the k-gram index …
– … quickly gives us all words whose Jaccard coefficient with the query is above a given threshold
– unfortunately, the Jaccard distance between two words does not always correspond well to their "intuitive" similarity
  – Example 1: J("weigh", "weihg") = 2/6 = 1/3 (too low)
  – Example 2: J("aster", "terase") = 3/6 = 1/2 (too high)
– the k-gram index can also be used to filter out words with too large an edit distance:
  – if the edit distance between x and y is ≤ δ, then x and y must have at least max(|x|, |y|) − 1 − (δ − 1)∙k k-grams in common, where the k-grams are taken from x' and y', i.e. x and y padded with k−1 # characters on the left and right

Context-Sensitive Query Correction

The "Did you mean …?" you know from Google & Co.
– the simplest way to realize this is as follows
– for each query word, find the most similar words as before
  – example query: infomatik fribourg
  – infomatik: informatik, information, informatics
  – fribourg: freiburg, freiburger, friedburg
– now try out all combinations and suggest the "best" one to the user, where "best" can mean:
  – retrieves the largest number of hits
  – is most frequent in the query log
  – a combination of the two
– trying out all combinations is too expensive in practice
– simple trick: precompute statistics about the co-occurrence of words
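The try-all-combinations step can be sketched as follows. The co-occurrence table `pair_freq` is entirely hypothetical (made-up frequencies standing in for precomputed query-log statistics), and so are the function names; "best" here means "most frequent pair in the query log", one of the options listed above:

```python
from itertools import product

# Hypothetical precomputed co-occurrence statistics from a query log;
# in practice these would be mined from real queries.
pair_freq = {
    ("informatik", "freiburg"): 520,
    ("information", "freiburg"): 40,
    ("informatik", "freiburger"): 15,
}

def best_suggestion(candidates_per_word):
    # score every combination of candidate corrections by query-log frequency
    best, best_score = None, -1
    for combo in product(*candidates_per_word):
        score = pair_freq.get(tuple(combo), 0)
        if score > best_score:
            best, best_score = combo, score
    return " ".join(best)

cands = [["informatik", "information", "informatics"],
         ["freiburg", "freiburger", "friedburg"]]
print(best_suggestion(cands))  # informatik freiburg
```

With c candidates per word and w query words this enumerates c^w combinations, which is exactly why the slide calls trying out all combinations too expensive and suggests precomputed co-occurrence statistics instead.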