String Matching Module-5.

Text-Search Data Structures
Goals of the lecture:
Dictionary ADT for strings: to understand the principles of tries, compact tries, and Patricia tries.
Text-searching data structures: to understand and be able to analyze text-searching algorithms that use the suffix tree and the Pat tree.
Full-text indices in external memory: to understand the main principles of String B-trees.
Suffix arrays, the LCP array, and related transformations.

Dictionary ADT for Strings
Dictionary ADT for strings: stores a set of text strings.
search(x): checks whether string x is in the set
insert(x): inserts a new string x into the set
delete(x): deletes the string equal to x from the set
Assumptions and notation:
n strings, N characters in total
m: length of x
d = |Σ|: size of the alphabet Σ

Predecessor Problem
Given k strings T1, T2, ..., Tk and a pattern P, return the position where P fits among the k strings in lexicographical order.
E.g., given T = {abc, ana, almond, akien}, the strings in sorted order are To = {abc, akien, almond, ana}. For the pattern P = "almighty", the predecessor is "akien", because akien < almighty < almond.
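A minimal sketch of the predecessor query on the example above, assuming the strings are simply kept in a sorted Python list and located by binary search. The function name `predecessor` is chosen for this illustration and is not part of the lecture.

```python
from bisect import bisect_left

def predecessor(sorted_strings, pattern):
    """Return the largest stored string strictly smaller than pattern, or None."""
    i = bisect_left(sorted_strings, pattern)  # first index whose string is >= pattern
    return sorted_strings[i - 1] if i > 0 else None

strings = sorted(["abc", "ana", "almond", "akien"])
print(strings)                           # ['abc', 'akien', 'almond', 'ana']
print(predecessor(strings, "almighty"))  # akien
```

Each binary-search step compares whole strings, so a query on k strings of length up to m costs O(m lg k) character comparisons. This is the cost that the trie-based structures in the rest of the lecture try to improve on.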

BST of Strings
We can, of course, use binary search trees. Some issues:
Keys are of varying length.
Many strings share prefixes (beginnings), so there is potential for saving space.
Let’s count comparisons of characters. What is the worst-case running time of searching for a string of length m?

Tries
Trie: a data structure for storing a set of strings (the name comes from the word “retrieval”).
Let’s assume all strings end with “$” (a character not in Σ).
Set of strings: {bear, bid, bulk, bull, sun, sunday}
[Figure: trie for this set, one character per edge, with each stored string terminated by $]

Properties of Tries
A multi-way tree.
Each node stores characters of keys, not whole keys.
Each internal node has from 1 to d children, one for each element of Σ that continues some stored string.
Each edge of the tree is labeled with a single character.
Each leaf node corresponds to a stored string, which is the concatenation of the characters on the path from the root to that leaf.
All descendants of a node share the prefix associated with that node.

Search and Insertion in Tries
Trie-Search(t, P[k..m]) // searches for string P in t
  if t is leaf then return true
  else if t.child(P[k]) = nil then return false
  else return Trie-Search(t.child(P[k]), P[k+1..m])

The search algorithm just follows the path down the tree, starting with Trie-Search(root, P[0..m]).

Trie-Insert(t, P[k..m]) // inserts string P into t
  if t is not leaf then // otherwise P is already present
    if t.child(P[k]) = nil then
      create a new child of t and a “branch” starting with that child and storing P[k..m]
    else Trie-Insert(t.child(P[k]), P[k+1..m])

How would the delete work?
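One possible runnable rendering of the search and insert procedures, together with one answer to the delete question. It assumes each node is a Python dict mapping a character to a child node (the hash-table variant of the node structure) and that "$" never occurs inside a stored string. The function names are illustrative.

```python
END = "$"  # terminator, assumed not to occur in the stored strings

def trie_insert(root, word):
    """Follow existing edges and create the missing branch for word + '$'."""
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})

def trie_search(root, word):
    """Follow the path for word + '$'; the word is stored only if the whole path exists."""
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True

def trie_delete(root, word):
    """Remove word, pruning nodes that no longer lead to any stored string."""
    path = [root]
    for ch in word + END:
        if ch not in path[-1]:
            return False            # word is not stored, nothing to delete
        path.append(path[-1][ch])
    # Walk back up from the leaf, removing edges until a branching node is met.
    for ch, parent in zip(reversed(word + END), reversed(path[:-1])):
        del parent[ch]
        if parent:                  # parent still has other children: stop pruning
            break
    return True

root = {}
for w in ["bear", "bid", "bulk", "bull", "sun", "sunday"]:
    trie_insert(root, w)
print(trie_search(root, "sun"), trie_search(root, "sund"))  # True False
```

Deletion mirrors insertion: it removes exactly the unshared tail of the path, so deleting "sun" leaves "sunday" in place.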

Trie Node Structure- Implementation
What is the node structure? Equivalently, what is the complexity of the t.child(c) operation?
An array of child pointers of size d: wastes space, but child(c) is O(1).
A hash table of child pointers: wastes less space, and child(c) is expected O(1).
A list of child pointers: compact, but child(c) is O(d) in the worst case.
A (balanced) binary search tree of child pointers: compact, and child(c) is O(lg d) in the worst case.

Trie Node Structure- Example
For the words BE, BED, BACCALAUREATE, BUN, BROWN, and CROWN, show how the node for the prefix ‘B’ is represented under each of these implementations: array, list, hash table, BST, and balanced BST (AVL).
Also compute the memory required if each letter needs 1 byte, each pointer needs 4 bytes, and the alphabet is Σ = {A…Z, $}.

Analysis of the Trie
Let d be the size of the alphabet. Search, insertion, and deletion of a string of length m cost, depending on the node structure:
O(dm) with a list
O(m lg d) with a BST
O(m) with an array or a hash table (expected time for the hash table)
Observation: having chains of one-child nodes is wasteful.

Compact Tries
Replace each chain of one-child nodes with a single edge labeled with a string, i.e., non-branching nodes are compressed into one edge.
Each non-leaf node (except the root) has at least two children.
[Figure: the trie for {bear, bid, bulk, bull, sun, sunday} and its compact form with edge labels b, sun, ear$, id$, ul, k$, l$, $, day$]
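The compression step can be sketched as follows, assuming a dict-of-dicts trie in which each key is an edge label. The helpers `build_trie` and `compact` are names made up for this illustration, and labels are stored as plain strings rather than as index pairs.

```python
def build_trie(words):
    """Plain trie: one character per edge, every word terminated by '$'."""
    root = {}
    for w in words:
        node = root
        for ch in w + "$":
            node = node.setdefault(ch, {})
    return root

def compact(node):
    """Return a copy in which each chain of one-child nodes is a single string-labeled edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:                  # non-branching node: merge it into the edge
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = compact(child)
    return out

trie = build_trie(["bear", "bid", "bulk", "bull", "sun", "sunday"])
print(compact(trie))
# {'b': {'ear$': {}, 'id$': {}, 'ul': {'k$': {}, 'l$': {}}}, 'sun': {'$': {}, 'day$': {}}}
```

The output has the same edge labels as the compact trie on the slide. A production version would avoid building the uncompressed trie first.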

Compact Tries for words in document
Implementation: the strings are stored outside the structure in one array, and edges are labeled with index pairs (from, to) into that array.
Usage, word matching: use the compact trie to “store” all words in the text. Each leaf in the compact trie has a list of indices in the text where the corresponding word appears.
For k distinct words the tree has O(k) nodes, because every internal node except possibly the root is branching.

Word Matching with Tries
T: they think that we were there and there
[Figure: compact trie of the words in T; edges carry (from, to) index pairs into T, and leaves carry the starting positions 1, 6, 12, 17, 20, 31, and the list 25, 35]
To find a word P: at each node, follow the edge (i, j) whose label T[i..j] matches the next j - i + 1 characters of P. If there is no such edge, P does not occur in T; otherwise, when a leaf is reached, its list gives all starting indices of P.
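A small sketch of word matching with position lists at the leaves. For brevity it uses an uncompressed dict-based trie rather than the compact trie with (from, to) labels, and it assumes words are separated by whitespace; `build_word_index` and `find_word` are illustrative names.

```python
import re

def build_word_index(text):
    """Insert every word of text into a trie; the '$' entry holds its 1-based start positions."""
    root = {}
    for match in re.finditer(r"\S+", text):
        node = root
        for ch in match.group():
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(match.start() + 1)  # 1-based, as on the slide
    return root

def find_word(root, word):
    """Return all start positions of word, or an empty list if it does not occur."""
    node = root
    for ch in word:
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])

index = build_word_index("they think that we were there and there")
print(find_word(index, "there"))  # [25, 35]
print(find_word(index, "we"))     # [17]
print(find_word(index, "the"))    # []
```

Note that "the" is a prefix of several words but not a word of T itself, so the exact-word query returns nothing.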

Word Matching with Tries
Building a compact trie for a given text: how do you do that? Describe the compact trie insertion procedure.
Running time: O(N)
Complexity of word matching: O(m)
What if the text is in external memory? In the worst case we do O(m) I/O operations just to access single characters in the text, which is not efficient.

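Patricia trie
A compact trie where each edge’s label (from, to) is replaced by (T[from], to - from + 1), i.e., the first character of the label and its length.
T: they think that we were there and there
[Figure: Patricia trie of the words in T, with edge labels such as (t,2), (w,2), (y,2), (i,4), and leaves carrying the starting positions 1, 6, 12, 17, 20, 31, and the list 25, 35]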

Querying Patricia Trie
Word prefix query: find all words in T that start with P[0..m-1].

Patricia-Search(t, P, k) // searches for P in t
01 if t is leaf then
02   j ← the first index in t.list
03   if T[j..j+m-1] = P[0..m-1] then
04     return t.list // exact match
05 else if there is a child-edge (P[k], s) then
06   if k + s < m then
07     return Patricia-Search(t.child(P[k]), P, k+s)
08   else go to any descendant leaf of t and do the check of line 03; if it is true, return the lists of all descendant leaves of t, otherwise return nil
09 else return nil // nothing is found

Application of Tries
Auto Complete
Spell Checkers
Longest Prefix Matching
Automatic Command completion
Network browser history
Implement a Phone Directory

Auto Complete
Auto-complete functionality is widely used in mobile apps and text editors, and a trie is an efficient data structure for implementing it. A trie makes it easy to find the dictionary words that complete a prefix, for the following reasons:
Looking up a string in a trie takes O(n) time in the worst case, where n is the length of the string involved in the operation, regardless of how many strings are stored.
A trie can provide an alphabetical ordering of the entries by key.

Auto Complete
Searching in a trie lets us follow pointers to the node that represents the string the user has entered. By traversing the subtree below that node, we can enumerate all strings that complete the user input.
This is used by editors such as Notepad++ and Sublime Text, and by mobile apps such as WhatsApp, Hike, and messaging apps.
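The two steps described here (walk to the prefix node, then enumerate the subtree) can be sketched as below. The dict-based node layout, the "$" end-of-word marker, and the names `build_trie` and `complete` are assumptions made for this example.

```python
def build_trie(words):
    """Dict-based trie; a '$' key marks the end of a stored word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def complete(root, prefix):
    """Return every stored word that starts with prefix, in alphabetical order."""
    node = root
    for ch in prefix:               # step 1: follow pointers to the prefix node
        if ch not in node:
            return []
        node = node[ch]
    results = []

    def walk(n, suffix):            # step 2: depth-first enumeration of the subtree
        for ch in sorted(n):        # '$' sorts before letters, so short words come first
            if ch == "$":
                results.append(prefix + suffix)
            else:
                walk(n[ch], suffix + ch)

    walk(node, "")
    return results

trie = build_trie(["bear", "bid", "bulk", "bull", "sun", "sunday"])
print(complete(trie, "bu"))   # ['bulk', 'bull']
print(complete(trie, "sun"))  # ['sun', 'sunday']
```

Visiting children in sorted order is what gives the alphabetical ordering of the suggestions.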

Spell Checkers
Spell checking is a three-step process: check whether a word is in a dictionary, generate potential suggestions, and then sort the suggestions, ideally with the intended word on top. A trie can store the dictionary, so checking a word is a single trie search. The same structure also supports generating the list of valid words to suggest.

Longest Prefix Matching
Longest prefix matching, also called maximum prefix length match, is the algorithm routers in Internet Protocol (IP) networking use to select an entry from a routing table. One of the first IP lookup techniques to employ tries is the radix trie implementation in the BSD kernel. Optimizations requiring contiguous masks bound the worst-case lookup time to O(W), where W is the length of the address in bits. To speed up the lookup, multi-bit trie schemes were developed, which search using multiple bits of the address at a time.
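A minimal sketch of longest prefix matching on a one-bit-at-a-time binary trie. It assumes prefixes and addresses are given as strings of "0" and "1", and the routing entries are invented for the example. It is not the BSD radix trie, only an illustration of the O(W) walk.

```python
def lpm_insert(root, prefix_bits, next_hop):
    """Store next_hop at the node reached by the prefix bits."""
    node = root
    for b in prefix_bits:
        node = node.setdefault(b, {})
    node["hop"] = next_hop          # "hop" cannot clash with the edge keys "0" and "1"

def lpm_lookup(root, addr_bits):
    """Walk at most W bits, remembering the last (longest) prefix that had an entry."""
    node, best = root, root.get("hop")
    for b in addr_bits:
        if b not in node:
            break
        node = node[b]
        best = node.get("hop", best)
    return best

table = {}
lpm_insert(table, "", "default")    # hypothetical routing entries
lpm_insert(table, "10", "A")
lpm_insert(table, "1011", "B")
print(lpm_lookup(table, "10110000"))  # B
print(lpm_lookup(table, "10010000"))  # A
print(lpm_lookup(table, "01000000"))  # default
```

A multi-bit trie would consume several bits per step (for example 4 or 8), trading memory for fewer node visits.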

Automatic Command completion
When using an operating system such as Unix, we type system commands to accomplish tasks. For example, the commands with prefix ps (ls /usr/bin/ps*) can be suggested just by pressing Tab. We can simplify the task of typing commands by providing a command-completion facility that automatically types the rest of the command once the user has typed a prefix long enough to identify the command uniquely.

Network browser history
A network browser keeps a history of the URLs of sites that you have visited. By organizing this history as a trie, the user need only type the prefix of a previously used URL and the browser can complete the URL.

Implement a Phone Directory
A phone directory can be implemented efficiently with a trie. We insert all the contacts into the trie. A search query on a trie usually determines whether a string is present, but here we are asked to find all the contacts that start with each prefix of the query string ‘str’. This is equivalent to a DFS traversal starting from the trie node reached by that prefix.

Text-Search Problem
Input: text T = “carrara”, pattern P = “ar”
Output: all occurrences of P in T
Reformulate the problem: find all suffixes of T that have P as a prefix! We already saw how to do a word prefix query.
Suffixes of T: carrara, arrara, rrara, rara, ara, ra, a
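The reformulation can be checked directly with a brute-force sketch that lists every suffix and keeps those starting with P. Positions are 1-based, and the function names are illustrative.

```python
def suffixes(text):
    """All suffixes of text, paired with their 1-based starting positions."""
    return [(i + 1, text[i:]) for i in range(len(text))]

def occurrences(text, pattern):
    """Start positions of the suffixes that have pattern as a prefix."""
    return [pos for pos, suf in suffixes(text) if suf.startswith(pattern)]

print([suf for _, suf in suffixes("carrara")])
# ['carrara', 'arrara', 'rrara', 'rara', 'ara', 'ra', 'a']
print(occurrences("carrara", "ar"))  # [2, 5]
```

This takes O(N m) time per query. The point of the suffix tree is to answer the same prefix query without scanning all suffixes.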

Suffix Trees
Suffix tree: a compact trie (or similar structure) of all suffixes of the text T.
A Patricia trie of suffixes is sometimes called a Pat tree.
[Figure: suffix tree and Pat tree for T = carrara$, with leaves labeled by the starting positions of the suffixes]

Applications of Suffix Trees
String matching: walk down the tree along P to a node v and report all leaves beneath v. The running time depends on the node structure.
First k occurrences: maintain a pointer from each node to its leftmost descendant leaf and connect the leaves in a linked list; O(k) additional time is needed.
Counting occurrences: store the size of the subtree at each node.

Applications of Suffix Trees
Longest repeating substrings: the longest common prefix of the suffixes starting at i and j is given by an LCA query.
Multiple documents: concatenate the documents with distinct terminators $1, $2, …, $k, and trim each path below its first terminator, making it a leaf.
Document retrieval: find the d distinct documents Ti containing pattern P in O(|P| + d) time.
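To make the longest-repeated-substring idea concrete without building a suffix tree, the sketch below swaps in a simpler technique: it sorts the suffixes and takes the longest common prefix of adjacent pairs. The deepest branching node of the suffix tree corresponds to the largest such value. This is a quadratic illustration, not the LCA-based method named above.

```python
def longest_repeated_substring(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    sufs = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(sufs, sufs[1:]):        # neighbours in sorted order share the longest prefixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeated_substring("banana"))   # ana
print(longest_repeated_substring("carrara"))  # ar
```

For "carrara" both "ar" and "ra" repeat with length 2; the sketch reports the first one found in sorted order.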

Constructing Suffix Trees
The naïve algorithm: insert all suffixes one after another, O(N^2).
Clever algorithms (McCreight, Ukkonen): O(N). They scan the text from left to right and use additional suffix links in the tree.
Question: what does the Pat tree look like after inserting the first five suffixes of Honolulu$ using the naïve algorithm?
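A sketch of the naïve construction, assuming an uncompressed suffix trie built from dicts (so it uses O(N^2) space as well as time, unlike a real suffix tree). Leaves are labeled with 1-based suffix positions under the key "pos"; all names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' one after another: O(N^2) time."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1         # leaf label; "pos" cannot clash with one-character edges
    return root

def find_all(root, pattern):
    """Walk down along pattern, then report every leaf label beneath that node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "pos":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("carrara")
print(find_all(trie, "ar"))  # [2, 5]
print(find_all(trie, "ra"))  # [4, 6]
```

Counting occurrences is then `len(find_all(trie, P))`; storing subtree sizes at the nodes would avoid the traversal.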

Suffix Array
A suffix array is a sorted array of all suffixes of a string X. More precisely, it is an integer array that gives the starting positions of the suffixes of X in lexicographical order.
It can be constructed by a depth-first traversal of a suffix tree that visits children in lexicographical order: the leaf labels in visiting order form the suffix array.
The reverse is not true: to build the suffix tree from the suffix array, the LCP array is also needed.

Suffix Array
text = banana$
[Figure: all suffixes of the text with their starting indices, the same suffixes in sorted order, and the resulting suffix array]
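The banana$ example can be reproduced with a short sketch. It assumes 1-based positions and builds the array by directly sorting the suffixes, which is simple but not the fastest construction. The helper `find` is an illustrative addition showing pattern search by binary search over the sorted suffixes.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """1-based starting positions of the suffixes of text in lexicographical order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def find(text, sa, pattern):
    """All 1-based occurrences of pattern, found as a contiguous range of the suffix array."""
    m = len(pattern)
    heads = [text[i - 1:i - 1 + m] for i in sa]   # first m characters of each sorted suffix
    lo, hi = bisect_left(heads, pattern), bisect_right(heads, pattern)
    return sorted(sa[lo:hi])

text = "banana$"
sa = suffix_array(text)
for i in sa:
    print(i, text[i - 1:])
# 7 $
# 6 a$
# 4 ana$
# 2 anana$
# 1 banana$
# 5 na$
# 3 nana$
print(find(text, sa, "ana"))  # [2, 4]
```

Here "$" sorts before the letters because of its character code. Materializing `heads` is done for clarity; a real implementation would compare against the text in place during the binary search.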

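LCP Array
LCP stands for Longest Common Prefix. The LCP array is an auxiliary data structure to the suffix array: it stores the lengths of the longest common prefixes between pairs of consecutive suffixes in the suffix array. It is simple to build from an already built suffix array.
For a suffix array A[1:n], the LCP array H[1:n] is given by H[i] = lcp(A[i-1], A[i]) for 2 ≤ i ≤ n, where lcp is the length of the longest common prefix of the two suffixes; H[1] is undefined.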

LCP Array
E.g., if A := [aab, ab, abaab, b, baab] is a suffix array:
H[1] is undefined.
The LCP of A[1] = aab and A[2] = ab is a, so H[2] = 1.
The LCP of A[2] = ab and A[3] = abaab is ab, so H[3] = 2.
The LCP array combined with the suffix array speeds up suffix-tree computations.
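The example can be completed with a direct computation. This sketch compares each pair of neighbours character by character, which is the simple quadratic approach rather than a linear-time algorithm, and it uses None for the undefined first entry. Python lists are 0-based, so H[1] of the slide is `H[0]` here.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def lcp_array(sorted_suffixes):
    """First entry undefined (None); each later entry compares a suffix with its predecessor."""
    pairs = zip(sorted_suffixes, sorted_suffixes[1:])
    return [None] + [lcp(a, b) for a, b in pairs]

A = ["aab", "ab", "abaab", "b", "baab"]
H = lcp_array(A)
print(H)  # [None, 1, 2, 0, 1]
```

The remaining entries follow the same rule: abaab and b share nothing (0), while b and baab share "b" (1).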
