Alon Efrat Computer Science Department University of Arizona Suffix Trees.

1 Alon Efrat Computer Science Department University of Arizona Suffix Trees

2 Purpose: Given a (very long) text R, preprocess it so that, once a query text P is given, we can efficiently decide whether P appears in R. (Later: also where P appears in R.) Example: R = "HelloWorldWhatANiceDay"; IsIn("World") = YES, IsIn("Word") = NO, IsIn("l") = YES (note: "l" appears more than once).

3 Definition: A suffix. For a word R, a suffix is what is left of R after deleting its first k characters, for some 0 ≤ k < |R| (so R itself is a suffix). All the suffixes of R = "Hello": Hello, ello, llo, lo, o.
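As a small illustration of the definition, here is a sketch in C. The helper names suffix_at and print_suffixes are made up for this example and are not from the lecture. It relies on the fact that, in C, the suffix starting at index k is just a pointer k characters into the same buffer, so no copying is needed.

```c
#include <stdio.h>
#include <string.h>

/* Suffix k of R is what remains after deleting the first k characters.
   Hypothetical helper: it returns a pointer into R, not a copy.
   The caller must ensure k <= strlen(R). */
const char *suffix_at(const char *R, size_t k) {
    return R + k;
}

/* Print every non-empty suffix of R, longest first, with its start index. */
void print_suffixes(const char *R) {
    size_t n = strlen(R);
    for (size_t k = 0; k < n; k++)
        printf("%zu: %s\n", k, suffix_at(R, k));
}
```

Calling print_suffixes("Hello") lists the five suffixes from the slide, one per line.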

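4 Algorithm for answering IsIn. Preprocessing: Create an empty trie T. Given R = "HelloWorldWhatANiceDay", insert into T all suffixes of R. Answering IsIn(P): Just check whether P is in T, that is, return find(P). (Here, find is as studied in the lecture on tries.) This works because P appears in R exactly when P is a prefix of some suffix of R, so find(P) should succeed whenever the path spelled by P exists in T, even if it ends at an internal node.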
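The preprocessing and query steps above can be sketched in C as follows. This is only a minimal sketch under assumptions that are not in the slides: a 256-slot child array ar per node (one slot per byte value), the names SIGMA, new_node, insert, build_suffix_trie and is_in, and no freeing of memory. The lecture's own find may differ in detail.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 256  /* assumed alphabet: one child slot per byte value */

typedef struct NODE {
    struct NODE *ar[SIGMA];   /* ar[c] is the child reached by character c */
} NODE;

/* Allocate a node with no children (calloc zero-fills, which gives
   NULL pointers on mainstream platforms). */
static NODE *new_node(void) {
    NODE *v = calloc(1, sizeof *v);
    if (!v) abort();
    return v;
}

/* Insert the string s into the trie rooted at root. */
static void insert(NODE *root, const char *s) {
    NODE *v = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (!v->ar[c]) v->ar[c] = new_node();
        v = v->ar[c];
    }
}

/* Preprocessing: insert every suffix of R into an empty trie. */
NODE *build_suffix_trie(const char *R) {
    NODE *root = new_node();
    size_t n = strlen(R);
    for (size_t k = 0; k < n; k++)
        insert(root, R + k);
    return root;
}

/* IsIn(P): 1 if the path spelled by P exists in the trie, 0 otherwise.
   The search may stop at an internal node; no end-of-word mark is needed. */
int is_in(const NODE *root, const char *P) {
    const NODE *v = root;
    for (; *P; P++) {
        v = v->ar[(unsigned char)*P];
        if (!v) return 0;
    }
    return 1;
}
```

With t = build_suffix_trie("HelloWorldWhatANiceDay"), the calls is_in(t, "World") and is_in(t, "Word") reproduce the YES and NO answers from the Purpose slide. The query takes time proportional to |P|, independent of |R|.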

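5 Example: R = "hello". Suffixes: "hello", "ello", "llo", "lo", "o". [Figure: the trie of these suffixes. The root has children h, e, l and o, and the paths from the root spell h-e-l-l-o, e-l-l-o, l-l-o, l-o and o. Example query: P = "ll" follows the path l, l.]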

6 Let's get greedy. Given a (very long) text R, preprocess it so that, once a query text P is given, we can efficiently find the location of P in R (if it appears at all). More specifically, report the index at which P starts in R. (If there is more than one answer, report the last one.) Example: R = "HelloWorldWhatANiceDay"; Where("World") = 5, since "World" starts at index 5 in R (counting from 0); Where("Word") = NoWhere; Where("l") = 8 ("l" also appears at indices 2 and 3).

7 Algorithm for answering Where. Modify the trie so that each node also contains a field b_inx (written b_idx on later slides). When inserting a word s into the trie, whose first character is at index k of R, set b_inx = k in every node along the insertion path. Inserting the suffixes in order of increasing k means a later suffix overwrites an earlier one, so each node ends up holding the last starting index of its string. Preprocessing: Create an empty trie T. Given R = "HelloWorldWhatANiceDay", insert into T all suffixes of R. Answering Where(P): Search for P in T as in find(P), and return the value of b_inx at the node where the search terminates (or NoWhere if the search fails). (Here, find is as studied in the lecture on tries.) The resulting data structure is called an uncompressed suffix tree.
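A possible C sketch of this modification is shown below. The field name b_inx comes from the slide. Everything else is an assumption made for illustration: the 256-slot child array, the NOWHERE value of -1, and the function names. Suffixes are inserted in order of increasing k, so the value left in a node is the last index at which its string starts.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 256        /* assumed alphabet: one child slot per byte value */
#define NOWHERE (-1)     /* assumed encoding of the answer "NoWhere" */

typedef struct NODE {
    struct NODE *ar[SIGMA];  /* ar[c] is the child reached by character c */
    int b_inx;               /* start index in R of the last suffix through this node */
} NODE;

static NODE *new_node(void) {
    NODE *v = calloc(1, sizeof *v);
    if (!v) abort();
    v->b_inx = NOWHERE;
    return v;
}

/* Insert the suffix s, which starts at index k of R, and stamp
   every node along the insertion path with k. */
static void insert(NODE *root, const char *s, int k) {
    NODE *v = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (!v->ar[c]) v->ar[c] = new_node();
        v = v->ar[c];
        v->b_inx = k;
    }
}

/* Preprocessing: insert all suffixes of R, in order of increasing k. */
NODE *build_uncompressed_suffix_tree(const char *R) {
    NODE *root = new_node();
    int n = (int)strlen(R);
    for (int k = 0; k < n; k++)
        insert(root, R + k, k);
    return root;
}

/* Where(P): b_inx of the node where the search ends, or NOWHERE. */
int where(const NODE *root, const char *P) {
    const NODE *v = root;
    for (; *P; P++) {
        v = v->ar[(unsigned char)*P];
        if (!v) return NOWHERE;
    }
    return v->b_inx;
}
```

Inserting the suffixes in decreasing order of k instead would make where report the first occurrence rather than the last.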

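8 Example: R = "hello". Suffixes: "hello", "ello", "llo", "lo", "o". [Figure: the same trie with a b_inx value stored at each node, e.g. 0 along h-e-l-l-o and 1 along e-l-l-o. Example query: P = "ll" ends at a node with b_inx = 2, the index at which "ll" starts in R.]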

9 So much memory? In the worst case the uncompressed trie has about |R|²/2 nodes. The problem with this data structure comes from long paths: a path is a sequence of nodes in which every node but the last has a single child, and all nodes have the same value of b_inx. [Figure: the trie for "hello" with such paths marked, e.g. the path h-e-l-l-o, whose nodes all have b_inx = 0.]

10 More examples of paths. [Figure: further examples of paths, with nodes labeled by the b_inx values 0 and 1.]

11 Solution. Recall that all strings in the tree are suffixes of the same text R. Add two new fields to each node, called c_idx and lng, such that if lng > 0 then, when computing the string of the node, we need to concatenate lng characters from R starting at position c_idx. [Figure: for R = "hello" (indices 0 to 4), the path h-e-l-l-o with b_inx = 0 is replaced by the node h followed by a single node with c_idx = 1, lng = 4, which stands for "ello".]

12 Compressing the tree. Assume we are visiting a node v of the tree whose distance (number of edges) from the root in the uncompressed trie is k. Also assume that v is the first node on a path. Then c_idx = b_idx + k, because the string of v is R[b_idx .. b_idx + k - 1], so the characters below v start at R[b_idx + k]. So the function compress_tree should "know" the distance from the root (in the uncompressed tree) of the visited node.
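To make the roles of c_idx and lng concrete, here is a small C sketch. The helper names path_c_idx and path_label are hypothetical, chosen only for this example. The first computes c_idx from b_idx and the depth k. The second recovers the characters that a compressed node stands for by copying lng characters of R starting at c_idx.

```c
#include <string.h>

/* c_idx for a path whose first node has begin index b_idx and lies
   k edges below the root in the uncompressed trie. */
int path_c_idx(int b_idx, int k) {
    return b_idx + k;
}

/* Copy the lng characters R[c_idx .. c_idx + lng - 1] into out and
   NUL-terminate. The caller must provide room for lng + 1 bytes. */
void path_label(const char *R, int c_idx, int lng, char *out) {
    memcpy(out, R + c_idx, (size_t)lng);
    out[lng] = '\0';
}
```

For R = "hello", b_idx = 0 and k = 1, path_c_idx gives 1, and path_label with lng = 4 yields "ello", matching the node with c_idx = 1, lng = 4 on the Solution slide.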

13 We need a function compress_tree that accepts a node v of the tree and the depth of v in the uncompressed tree. We also need the function check_path(NODE *p), which returns the length (in number of edges) of the path starting at *p. So, for example, if *p has two children, it returns 0.
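One way check_path might look in C is sketched below. It is an interpretation rather than the lecture's implementation. It assumes a 256-slot child array, reads "path" as a chain of single-child nodes that share one b_inx value, and adds an extra output parameter end (not in the slide's signature) through which the last node of the path is returned. The helpers new_node and only_child are hypothetical.

```c
#include <stdlib.h>

#define SIGMA 256  /* assumed alphabet: one child slot per byte value */

typedef struct NODE {
    struct NODE *ar[SIGMA];  /* ar[c] is the child reached by character c */
    int b_inx;               /* begin index, as on the earlier slides */
} NODE;

/* Hypothetical helper: allocate a childless node with the given b_inx. */
NODE *new_node(int b_inx) {
    NODE *v = calloc(1, sizeof *v);
    if (!v) abort();
    v->b_inx = b_inx;
    return v;
}

/* Return the only child of v, or NULL if v has zero or several children. */
NODE *only_child(const NODE *v) {
    NODE *found = NULL;
    for (int c = 0; c < SIGMA; c++) {
        if (v->ar[c]) {
            if (found) return NULL;   /* second child seen */
            found = v->ar[c];
        }
    }
    return found;
}

/* Length, in edges, of the path starting at *p. The walk continues while
   the current node has exactly one child with the same b_inx.
   If end is not NULL, *end receives the last node of the path. */
int check_path(NODE *p, NODE **end) {
    int h = 0;
    NODE *next;
    while ((next = only_child(p)) != NULL && next->b_inx == p->b_inx) {
        p = next;
        h++;
    }
    if (end) *end = p;
    return h;
}
```

Returning the end node alongside the length is one possible answer to how compress_tree can obtain both the end of the path and its length from a single call. The depth of the end node then follows by adding the length to the depth of the path's first node.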

14 Compressing the tree (cont.)
compress_tree(NODE *p, int depth) {
    for each non-empty cell ar[i] of *p {
        if ((h = check_path(p->ar[i])) > 0) {
            Let q be a pointer to the node at the end of the path, let h be
            the length of the path, and let d be the depth of q (in the
            uncompressed tree), so d = depth + 1 + h. All of q, d and h
            should be obtained from check_path (think how).
            Set p->ar[i] = q
            Free unused nodes
            q->c_idx = q->b_idx + depth + 1
            q->lng = h
            compress_tree(q, d)
        } else {
            /* no path starts here: continue into the child unchanged */
            compress_tree(p->ar[i], depth + 1)
        }
    }
}

15 How large is the tree now? Lemma: If T is a tree in which no node has exactly one child, then the number of nodes is O(number of leaves). In our scenario, number of leaves ≤ |R|, so the size of the trie is O(|R|).
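One standard way to see the lemma is a counting argument on edges. The symbols L (number of leaves), I (number of internal nodes) and N (total number of nodes) are introduced here for the sketch and are not from the slides. The argument assumes every internal node, including the root, has at least two children.

```latex
\begin{align*}
\text{edges} &= N - 1 = L + I - 1
  && \text{(every node except the root has exactly one parent)} \\
\text{edges} &\ge 2I
  && \text{(every internal node has at least two children)} \\
\Rightarrow\quad 2I &\le L + I - 1
  \;\Longrightarrow\; I \le L - 1 \\
\Rightarrow\quad N &= L + I \le 2L - 1 = O(L).
\end{align*}
```

With L ≤ |R| this gives at most 2|R| - 1 nodes under the stated assumption.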

