# On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda.

## Presentation transcript

On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda (Kyushu University & JST)

Pattern Searching Problem Given: text T in Σ* and pattern P in Σ*. Find: an occurrence of P in T. Here Σ is the alphabet and Σ* is the set of strings over Σ. Using an indexing structure for T, we can solve the above problem in O(|P|) time.

Suffix Trie A trie representing all suffixes of T. For T = aacb\$, the suffixes are aacb\$, acb\$, cb\$, b\$ and \$. [Figure: suffix trie of T = aacb\$.]
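To make the O(|P|) search concrete, here is a minimal Python sketch, assuming a nested-dict representation of the trie. The names `build_suffix_trie` and `occurs` are hypothetical, and the quadratic-time construction is a plain naive insertion of every suffix, not Ukkonen's on-line algorithm mentioned later in the slides.

```python
def build_suffix_trie(text):
    """Naively insert every suffix of text into a nested-dict trie."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            # Follow the edge labelled ch, creating it if it is missing.
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """Walk the pattern from the root: one edge per symbol, so O(|P|) steps."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aacb$")
print(occurs(trie, "acb"))  # a substring of the text
print(occurs(trie, "ca"))   # not a substring of the text
```

Every substring of T is a prefix of some suffix, which is why a single root-to-node walk is enough to answer the query.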

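Unwanted Matching Pattern: "pace runner". Text: "space runner". The pattern matches inside the text even though "pace" is not a whole word there.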


Introducing Word Separator # The word separator # is a special symbol not in Σ. D = Σ*# is the dictionary of words. A text T is an element of D+ (T is a sequence T_1 T_2 … T_k of k words in D). E.g., T = This#is#a#pen#, Σ = {A,…,z}, D = {..., This#, ..., a#, ..., is#, ..., pen#, ...}.

Word-level Pattern Searching Problem Given: text T in D+ and pattern P in D+. Find: an occurrence of P in T which immediately follows #. E.g. T = The#space#runner#is#not#your#good#pace#runner#
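The difference between plain and word-level matching can be illustrated with a small Python sketch. The function name `word_level_occurrences` and the pattern `pace#runner#` are assumptions chosen for illustration, and position 0 is treated as a valid word start on the assumption that the whole text counts as a word suffix. This is a direct scan, not an index-based search.

```python
def word_level_occurrences(text, pattern, sep="#"):
    """Return start positions of pattern that sit at a word boundary.

    A word boundary is position 0 or any position right after a separator.
    """
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    return [s for s in starts if text.startswith(pattern, s)]


text = "The#space#runner#is#not#your#good#pace#runner#"
pattern = "pace#runner#"  # hypothetical query for this example

# Plain substring search reports the unwanted match inside "space".
print(text.find(pattern))
# Word-level search only reports the match that follows a separator.
print(word_level_occurrences(text, pattern))
```

The plain search stops at the occurrence hidden inside `space#runner#`, while the word-level search skips it because that position does not follow `#`.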


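Word Suffix Trie A trie representing the suffixes of T which immediately follow # (and T itself). For T = aa#b#, the suffixes are aa#b#, a#b#, #b#, b# and #; of these, the trie represents only aa#b# and b#. [Figure: word suffix trie of T = aa#b#.]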

Comparison [Figure: the suffix trie and the word suffix trie of T = aa#b#, side by side.]
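The size difference in this comparison can be reproduced with a short Python sketch. All function names are hypothetical, the tries are nested dicts built by naive insertion, and the start of the text is assumed to count as a word boundary.

```python
def build_trie(strings):
    """Insert each string into a nested-dict trie."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def all_suffixes(text):
    return [text[i:] for i in range(len(text))]


def word_suffixes(text, sep="#"):
    """Suffixes starting at position 0 or right after a separator."""
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    return [text[s:] for s in starts]


def count_nodes(trie):
    """Count the nodes of a nested-dict trie, root included."""
    return 1 + sum(count_nodes(child) for child in trie.values())


T = "aa#b#"
suffix_trie = build_trie(all_suffixes(T))
word_suffix_trie = build_trie(word_suffixes(T))
print(count_nodes(suffix_trie), count_nodes(word_suffix_trie))  # 14 8
```

For T = aa#b# the suffix trie has 14 nodes and the word suffix trie has 8, because only the paths aa#b# and b# are kept.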

Construction Suffix Trie: Ukkonen's on-line algorithm (1995). Word Suffix Trie: we modify Ukkonen's algorithm by (1) using the minimum DFA accepting the dictionary D, and (2) redefining suffix links.

Minimum DFA The minimum DFA accepting D = Σ*# clearly requires constant space (for fixed Σ). We replace the root node of the suffix trie with the final state of the DFA. [Figure: DFA with a loop labeled Σ on the initial state and a transition labeled # to the final state.]
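A two-state automaton for D = Σ*# can be sketched in Python as follows. This is an illustrative sketch only: the state names, the `step` and `accepts` functions, and the choice to model the automaton as a partial DFA (no explicit dead state after the final state) are all assumptions.

```python
SEP = "#"
INITIAL, FINAL = 0, 1  # hypothetical state names


def step(state, ch, alphabet):
    """Return the next state, or None if no transition is defined."""
    if state == INITIAL:
        if ch == SEP:
            return FINAL      # reading the separator ends the word
        if ch in alphabet:
            return INITIAL    # loop on every symbol of the alphabet
    return None               # nothing may follow the separator


def accepts(word, alphabet):
    """Check whether word belongs to D, i.e. has the form w# with w over the alphabet."""
    state = INITIAL
    for ch in word:
        state = step(state, ch, alphabet)
        if state is None:
            return False
    return state == FINAL


print(accepts("aa#", "ab"))   # a word of D
print(accepts("aa", "ab"))    # missing the separator
print(accepts("a#b#", "ab"))  # two words, so in D+ but not in D
```

The number of states does not depend on the text, which matches the constant-space remark on the slide.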

Suffix Links [Figure: word suffix trie of T = aa#b# with its suffix links; the DFA transitions are labeled a,b and #.]

Suffix Links For any node s with path(s) = u, we can write u = xy such that x ∈ D*, and y is a proper prefix of some string in D. Note that the choice of x and y is unique for each u = path(s).

Suffix Links Recall that x ∈ D* and y is a proper prefix of some string in D. For the node with path aa: x = ε, y = aa. For the node with path aa#b#: x = aa#b#, y = ε. [Figure: word suffix trie of T = aa#b# with these two nodes marked.]

Suffix Links [cont.] If x ∈ D+, suflink(s) = r such that path(r) = x'y, where x = hx' for some h ∈ D. Otherwise (if x = ε), suflink(s) = initial state of the DFA.

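Suffix Links Recall that x ∈ D* and y is a proper prefix of some string in D. For the node with path aa: x = ε, y = aa, so its suffix link points to the initial state of the DFA. For the node with path aa#b: x = aa#, y = b, so its suffix link points to the node with path b. [Figure: word suffix trie of T = aa#b# with these suffix links drawn.]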
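The decomposition u = xy and the suffix link rule can be checked at the level of path strings with a small Python sketch. The names `split_path` and `suflink_path` are hypothetical, and returning `None` to stand for the initial state of the DFA is a convention of this sketch, not of the slides.

```python
def split_path(u, sep="#"):
    """Split u into (x, y): x ends at the last separator, y is the rest.

    x is then a (possibly empty) sequence of whole words, and y contains
    no separator, so it is a proper prefix of the word y + sep.
    """
    cut = u.rfind(sep) + 1  # 0 when u contains no separator
    return u[:cut], u[cut:]


def suflink_path(u, sep="#"):
    """Return the path of suflink's target, or None for the DFA's initial state."""
    x, y = split_path(u, sep)
    if not x:
        return None                 # x is empty: link to the DFA's initial state
    first_word_end = x.index(sep) + 1
    x_rest = x[first_word_end:]     # x = h + x_rest, with h the first word of x
    return x_rest + y


print(split_path("aa"), suflink_path("aa"))
print(split_path("aa#b"), suflink_path("aa#b"))
print(split_path("aa#b#"), suflink_path("aa#b#"))
```

Because the split point is fixed by the last separator in u, the choice of x and y is unique, as noted on the slide.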

Suffix Links [cont.] [Figure: suffix links in the suffix trie and in the word suffix trie of T = aa#b#, side by side.]

On-line Construction [Figures: the suffix trie and the word suffix trie of T = aa#b# built side by side, one frame per input symbol: a, aa, aa#, aa#b, aa#b#.]

Pseudo Code Just change here!! [Figure: pseudo code with the modified part marked, next to the suffix trie and the word suffix trie of T = aa#b#.]
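The original pseudo code is not preserved in this transcript. As a stand-in, the following Python sketch follows the general shape of an Ukkonen-style on-line suffix trie construction and swaps the auxiliary bottom state for the two-state DFA, with the trie root acting as the DFA's final state. It is an assumption-laden sketch rather than the authors' code: `Node`, `DFA_INITIAL`, `goto`, `build_word_suffix_trie_online` and `to_dict` are hypothetical names, and every symbol other than `#` is assumed to belong to Σ.

```python
SEP = "#"
DFA_INITIAL = object()  # hypothetical sentinel for the DFA's initial state


class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.suflink = None  # a Node, or DFA_INITIAL


def build_word_suffix_trie_online(text):
    root = Node()               # the root doubles as the DFA's final state
    root.suflink = DFA_INITIAL  # path(root) is empty, so x is empty
    top = root                  # node whose path is the text read so far

    def goto(state, ch):
        """Transition for trie nodes and for the DFA's initial state."""
        if state is DFA_INITIAL:
            # Loop on symbols of the alphabet, move to the root on '#'.
            return root if ch == SEP else DFA_INITIAL
        return state.children.get(ch)

    for ch in text:
        r, prev_new = top, None
        # Walk the suffix-link chain, adding a ch-child wherever it is missing.
        while goto(r, ch) is None:
            new = Node()
            r.children[ch] = new
            if prev_new is not None:
                prev_new.suflink = new
            prev_new = new
            r = r.suflink
        if prev_new is not None:
            prev_new.suflink = goto(r, ch)
        top = top.children[ch]
    return root


def to_dict(node):
    """Convert the trie to nested dicts for easy inspection."""
    return {ch: to_dict(child) for ch, child in node.children.items()}


root = build_word_suffix_trie_online("aa#b#")
print(to_dict(root))
```

In this sketch the only departure from a plain suffix trie construction is the `goto` behaviour at `DFA_INITIAL`: the walk along suffix links always stops there, and only a `#` leads back to the root.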

Like Dress-up Doll

Like Dress-up Doll [cont.] Hair is different. Body is the same!! Different looking!!!! [Figure: the suffix trie and the word suffix trie of T = aa#b#, which differ only in the part above the root.]

Drawback of Word Suffix Trie Word suffix tries require O(k|T|) space. Andersson et al. introduced word suffix trees which can be implemented in O(k) space.

Construction of Word Suffix Trees The algorithm by Andersson et al. (1996), for text T = T_1 T_2 … T_k, constructs word suffix trees in O(|T|) expected time with O(k) space. Our algorithm simulates the on-line word suffix trie algorithm on word suffix trees, and runs in O(|T|) time in the worst case, with O(k) space.

Suffix Tree [Figure: suffix tree of T = aa#b#.]

Word Suffix Tree [Figure: word suffix tree of T = aa#b#, with edges labeled aa#b# and b#.]
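The relation between the word suffix trie and the word suffix tree can be sketched in Python by collapsing chains of single-child nodes. This is only an illustration under several assumptions: the function names are hypothetical, edge labels are stored as strings (an O(k)-space implementation would presumably store positions in T instead), nodes where a word suffix ends in the middle of a chain are not kept explicit, and the intermediate trie is built naively rather than with the authors' on-line algorithm.

```python
def build_word_suffix_trie(text, sep="#"):
    """Naively insert T and every suffix that follows a separator."""
    root = {}
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    for s in starts:
        node = root
        for ch in text[s:]:
            node = node.setdefault(ch, {})
    return root


def compress(trie):
    """Collapse unary chains of a nested-dict trie into string-labelled edges."""
    tree = {}
    for ch, child in trie.items():
        label = ch
        # Keep absorbing the next edge while the node has exactly one child.
        while len(child) == 1:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        tree[label] = compress(child)
    return tree


trie = build_word_suffix_trie("aa#b#")
print(compress(trie))
```

For T = aa#b# the result has just two edges, labeled `aa#b#` and `b#`, which agrees with the figure described above.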

Normal and Word Suffix Trees [Figure: the suffix tree and the word suffix tree of T = aa#b#, side by side.]

Construction Algorithm Just change here!!

Conclusions We first proposed an on-line word suffix trie construction algorithm. The keys to the algorithm are the minimal DFA accepting D and the redefined suffix links. Further, we introduced an on-line algorithm to build word suffix trees that works with O(k) space and in O(|T|) time in the worst case.

Further Work We like this “dress-up doll” strategy :) Hair is different. Body is the same!! Different looking!!!!

Further Work “Sparse Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda, accepted to SPIRE’06. “Sparse Compact Directed Acyclic Word Graphs” by Shunsuke Inenaga and Masayuki Takeda, accepted to PSC’06.
