Advanced Algorithms Piyush Kumar (Lecture 12: String Matching/Searching) Welcome to COT5405 Source: S. Šaltenis.

Computing Edit Distance
Two text strings are given: X and Y.
We want to quantify how similar they are, for example when:
–Comparing DNA sequences in studies of the evolution of different species
–Building spell checkers
One measure of similarity is the edit distance between X and Y (a small distance means high similarity).

Edit Distance: Definition
We want to convert X into Y by performing one of three operations: delete a letter, insert a letter, or substitute one letter for another.
E.g. X = "ACGGTTA" can be converted to Y = "CGTAT" by deleting the 1st A and the 2nd G, and substituting in the last two positions (T → A and A → T):
ACGGTTA
_CG_TAT

The minimum number of these operations that converts X into Y is called the edit distance between X and Y.

Edit Distance: Optimal Substructure
Denote by E(i,j) the edit distance between the i-th prefix of X (x_1 x_2 … x_i) and the j-th prefix of Y (y_1 y_2 … y_j).
–If x_i = y_j, then E(i,j) = E(i-1,j-1)
–If x_i ≠ y_j, either substitute x_i → y_j (cost is 1 + E(i-1,j-1)), or delete x_i (cost is 1 + E(i-1,j)), or insert y_j (cost is 1 + E(i,j-1)). Compare the three values and take the minimum.
–Optimality of the sub-solutions follows from a "cut-and-paste" argument

Edit Distance: Computing
Let n be the length of the word X, and let m be the length of Y.
To compute E(i,j) (the edit distance between the i-th prefix of X and the j-th prefix of Y) we construct a two-dimensional array (in Scheme, a vector of vectors) of size (n+1)×(m+1).
We initialize the leftmost column and the topmost row: E(i,0) = i, E(0,j) = j (the edit distance to an empty word).

To fill entry E(i,j), we need the three "former values" E(i-1,j-1), E(i,j-1) and E(i-1,j). Having these, we use the recurrence we saw to fill E(i,j).
The desired value is E(n,m).
Observe: the conditions in the problem restrict the subproblems. (What is the total number of subproblems?)
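A minimal sketch of the table-filling procedure described above, assuming Python lists of lists in place of the Scheme vector of vectors. The function name `edit_distance` is illustrative and not from the lecture.

```python
def edit_distance(x: str, y: str) -> int:
    """Fill the (n+1) x (m+1) table E row by row and return E[n][m]."""
    n, m = len(x), len(y)
    # E[i][j] = edit distance between the first i letters of x
    # and the first j letters of y.
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i  # delete all i letters to reach the empty word
    for j in range(m + 1):
        E[0][j] = j  # insert all j letters starting from the empty word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Last letters match: no operation needed for them.
                E[i][j] = E[i - 1][j - 1]
            else:
                E[i][j] = 1 + min(
                    E[i - 1][j - 1],  # substitute x_i -> y_j
                    E[i - 1][j],      # delete x_i
                    E[i][j - 1],      # insert y_j
                )
    return E[n][m]


print(edit_distance("ACGGTTA", "CGTAT"))  # prints 4
```

The table has (n+1)(m+1) entries and each is filled in constant time, so the sketch runs in O(nm) time and space.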

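Edit Distance: Example
Let's do a dry run with X = "ACGGTTA", Y = "CGTAT". Rows are indexed by i (letters of X), columns by j (letters of Y). Rows 0 and 1 are filled in; the remaining entries are left to fill, and the desired value is E(7,5).

        j:  0  1  2  3  4  5
            -  C  G  T  A  T
i: 0  -     0  1  2  3  4  5
   1  A     1  1  2  3  3  4
   2  C     2
   3  G     3
   4  G     4
   5  T     5
   6  T     6
   7  A     7              ?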

Short Introduction to Search Engines

Applications?

Typical Web Search Engine Architecture
[Figure labels: crawl the web; check for duplicates, store the documents; create an inverted index; inverted index; search engine servers; user query; DocIds; show results to user. Courtesy R. Ramakrishnan]

Goals
Speed
Space efficiency
Accuracy: "The first item should be what I want to see?"
Updates: Periodic? Dynamic?

Typical Methods
Full text scanning (egrep?)
Inverted file indexing (most common)
Signature files
Vector space model

Types of Queries
Boolean
Proximity? (Edit distance?)
In relation to other documents
FileType + keywords
Allow for:
–Prefix matches?
–Wildcards?
–Edit distance bounds (egrep)

Common Tricks
Case folding: Tallahassee = tallahassee.
Stemming: compress = compressed = compression (off-the-shelf stemmers are available for English).
Ignore words: a, the, it, be, …
Thesaurus: fast = rapid (typically uses available clustering).
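A small sketch of two of these tricks, case folding and ignoring common words. The `STOP_WORDS` set holds only the four words listed on the slide, and the function name `normalize` and the punctuation handling are assumptions for illustration. Stemming and thesaurus lookup are left out because they need external resources.

```python
# Hypothetical stop list: just the words named on the slide.
STOP_WORDS = {"a", "the", "it", "be"}


def normalize(text: str) -> list[str]:
    """Split text into tokens, fold case, and drop stop words."""
    tokens = []
    for raw in text.split():
        # Strip surrounding punctuation, then fold to lower case.
        word = raw.strip(".,;:!?\"'()").lower()
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


print(normalize("It was a dark and stormy night in the Tallahassee manor."))
```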

Inverted File Index
Periodically rebuilt, static otherwise.
Documents are parsed to extract tokens. These are saved with the document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically.
Multiple term entries for a single document are merged, and within-document term frequency information is compiled.
Finally, the file can be split into
–A dictionary (lexicon) file and
–A postings file
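The construction steps can be sketched as follows, using the two example documents from the inverted file index slide. The layout chosen here (a dictionary mapping each term to a document count and an offset into a flat postings list) is one plausible arrangement assumed for illustration. The names `tokenize`, `build_inverted_index` and `lookup` are hypothetical.

```python
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(text):
    """Lower-case the words and strip trailing periods and commas."""
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]


def build_inverted_index(docs):
    # Step 1: parse every document into (term, doc_id) pairs.
    pairs = [(t, doc_id) for doc_id, text in docs.items() for t in tokenize(text)]
    # Step 2: sort alphabetically by term (ties broken by doc_id).
    pairs.sort()
    # Step 3: merge repeated (term, doc_id) pairs into term frequencies.
    merged = Counter(pairs)
    # Step 4: split into a dictionary and a postings list.
    dictionary = {}  # term -> [number of documents, offset into postings]
    postings = []    # flat list of (doc_id, within-document frequency)
    for (term, doc_id), freq in sorted(merged.items()):
        if term not in dictionary:
            dictionary[term] = [0, len(postings)]
        dictionary[term][0] += 1
        postings.append((doc_id, freq))
    return dictionary, postings


def lookup(term, dictionary, postings):
    """Return the (doc_id, frequency) postings for one term."""
    if term not in dictionary:
        return []
    n_docs, start = dictionary[term]
    return postings[start:start + n_docs]


dictionary, postings = build_inverted_index(docs)
print(lookup("country", dictionary, postings))  # [(1, 1), (2, 1)]
print(lookup("the", dictionary, postings))      # [(1, 2), (2, 2)]
```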

Why use Inverted Files?
Permits fast search for individual terms
–For Boolean queries
–For statistical ranking algorithms

Issues with Inverted Files?
How to minimize the space taken by the postings list?
How to access the lexicon?
How to do union and intersection of postings?

Minimizing Space
Store postings with deltas
–Original posting list: 3, 5, 20, 21, 23
–Delta encoding: 3, 2, 15, 1, 2
Use compression on the delta encoding
–Huffman, arithmetic
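A short sketch of delta encoding and decoding for a sorted posting list, reproducing the numbers on the slide. The function names are illustrative. The follow-up compression step (Huffman or arithmetic coding of the deltas) is not shown.

```python
def delta_encode(postings):
    """Replace each document ID by its gap from the previous one."""
    deltas, prev = [], 0
    for doc_id in postings:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def delta_decode(deltas):
    """Recover the original IDs by taking running sums of the gaps."""
    postings, total = [], 0
    for gap in deltas:
        total += gap
        postings.append(total)
    return postings


print(delta_encode([3, 5, 20, 21, 23]))  # [3, 2, 15, 1, 2]
print(delta_decode([3, 2, 15, 1, 2]))    # [3, 5, 20, 21, 23]
```

The gaps are small when a term occurs in many documents, which is what makes the encoded list compress well.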

Access to Lexicon?
Static:
–Sorted arrays
–Perfect hashing
Dynamic:
–Tries
–B-trees
Prefix matching?

Tries
Courtesy Tamassia & Goodrich.
Useful for retrieval (the name comes from "reTRIEval")
First appearance: 1959
Radix search?

Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
–After preprocessing the pattern, the KMP algorithm performs pattern matching in time proportional to the text size
If the text is large, immutable and searched often (e.g., the works of Shakespeare), we may want to preprocess the text instead of the pattern
A trie is a compact data structure for representing a set of strings, such as all the words in a text
–A trie supports pattern matching queries in time proportional to the pattern size
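For reference, a compact sketch of pattern preprocessing in the KMP style: a failure table is computed from the pattern alone, and the text is then scanned once without backing up. The function names and the choice to return the first match index (or -1) are assumptions for illustration.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = failure_function(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back using the preprocessed table
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


print(failure_function("abacab"))                    # [0, 0, 1, 0, 1, 2]
print(kmp_search("abacaabaccabacabaabb", "abacab"))  # 10
```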

Standard Tries (§ 11.3.1)
The standard trie for a set of strings S is an ordered tree such that:
–Each node but the root is labeled with a character
–The children of a node are alphabetically ordered
–The paths from the root to the external nodes yield the strings of S
Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }
[Figure: the standard trie for S]

Analysis of Standard Tries
A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
–n = total size of the strings in S
–m = size of the string parameter of the operation
–d = size of the alphabet

Applications of Tries
A standard trie supports the following operations on a preprocessed text in time O(m), where m is the size of word X:
–Word matching: find the first occurrence of the word X in the text.
–Prefix matching: find the first occurrence of the longest prefix of word X in the text.

Word Matching with a Trie We insert the words of the text into a trie Each leaf stores the occurrences of the associated word in the text
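A minimal sketch of word matching with a standard trie. It assumes words are separated by single spaces and stores children in a Python dict rather than an alphabetically ordered list, so a child lookup is expected constant time. Occurrences are stored at the node where a word ends, which is a leaf only if the word is not a prefix of another word. All names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.occurrences = []  # start positions of the word ending here


def build_trie(text):
    """Insert every word of text, recording where each one starts."""
    root = TrieNode()
    pos = 0
    for word in text.split(" "):
        if word:
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.occurrences.append(pos)
        pos += len(word) + 1  # skip the word and the following space
    return root


def word_match(root, word):
    """Return all start positions of word in the text ([] if absent)."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.occurrences


def longest_prefix(root, word):
    """Return the longest prefix of word that is a path in the trie."""
    node, length = root, 0
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            break
        length += 1
    return word[:length]


root = build_trie("see a bear sell stock see a bull")
print(word_match(root, "see"))        # [0, 22]
print(longest_prefix(root, "beard"))  # bear
```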

Compressed Tries
Solves the following problem in the standard trie:
–Creation of extra nodes in the trie (fixed by path compression)
Just a different representation of the standard trie.
First appearance: 1968

Compressed Tries
A compressed trie has internal nodes of degree at least two.
It is obtained from the standard trie by compressing chains of "redundant" nodes.
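A sketch of the chain compression step, using nested dicts as a stand-in for trie nodes. It assumes no string in the set is a prefix of another, so every string ends at a leaf. The names `build_standard` and `compress` are illustrative.

```python
def build_standard(words):
    """Standard trie as nested dicts: one character per edge.
    Assumes no word is a prefix of another word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one edge label."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one child.
        while len(child) == 1:
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        out[label] = compress(child)
    return out


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
compressed = compress(build_standard(S))
print(sorted(compressed))           # ['b', 's']
print(sorted(compressed["s"]))      # ['ell', 'to']
print(sorted(compressed["b"]["e"])) # ['ar', 'll']
```

After compression every internal node has at least two children, matching the degree condition stated above.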

Compact Representation
Compact representation of a compressed trie for an array of strings:
–Stores at the nodes ranges of indices instead of substrings
–Uses O(s) space, where s is the number of strings in the array
–Serves as an auxiliary index structure

Suffix Trie (§ 11.3.3) The suffix trie of a string X is the compressed trie of all the suffixes of X

Analysis of Suffix Tries
Compact representation of the suffix trie for a string X of size n from an alphabet of size d:
–Uses O(n) space
–Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern
–Can be constructed in O(n) time
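To illustrate the query side only, here is a deliberately naive sketch: it inserts every suffix into an uncompressed trie of nested dicts, which takes O(n²) time and space. This is not the compact O(n) representation or the linear-time construction referred to above. The function names are illustrative.

```python
def build_suffix_trie(x):
    """Naive, uncompressed suffix trie: insert each suffix of x.
    Quadratic in len(x); for illustration only."""
    root = {}
    for i in range(len(x)):
        node = root
        for ch in x[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(root, pattern):
    """True if pattern occurs in x: every substring of x is a prefix
    of some suffix, so it is a path starting at the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(contains(trie, "nimi"))  # True
print(contains(trie, "zim"))   # False
```

The query walks one edge per pattern character, so its cost depends on the pattern length m and not on the text length n.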

Tries and Web Search Engines The index of a search engine (collection of all searchable words) is stored in a compressed trie. Each leaf of the trie is associated with a word and a list of pages (URLs) containing that word (called the occurrence list). The trie is kept in internal memory. The occurrence lists are kept in external memory and are ranked by relevance.

Tries and Web Search Engines
Boolean queries for sets of words (e.g. Java and coffee) correspond to set operations (e.g. intersection) on the occurrence lists.
Additional information retrieval techniques are used, such as:
–Stopword elimination (as done in the standard tries example).
–Stemming (e.g. identify "add", "adding" and "added" as the same word).
–Link analysis (recognise authoritative pages).
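A sketch of the set operations behind Boolean queries, assuming each occurrence list is a sorted list of integer page IDs with no duplicates. That representation and the function names are assumptions, and the IDs in the example are made up. Both functions make a single merge-style pass over the two lists.

```python
def intersect(a, b):
    """Pages containing both words (AND): merge-style intersection."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Pages containing either word (OR): merge without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


java = [3, 5, 20, 21, 23]   # hypothetical page IDs for "Java"
coffee = [5, 8, 21, 40]     # hypothetical page IDs for "coffee"
print(intersect(java, coffee))  # [5, 21]
print(union(java, coffee))      # [3, 5, 8, 20, 21, 23, 40]
```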
