New Indices for Text: PAT Trees and PAT Arrays. Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider. Presenter: 吳彥欽
Outline: Authors; Introduction; PAT Tree; Searching Algorithms on the PAT Tree; PAT Array; Summary
About the Authors: Gaston H. Gonnet. Professor, ETH Zürich, Switzerland, Informatik, Institute for Scientific Computation. http://www.inf.ethz.ch/personal/gonnet/ Symbolic and algebraic computation; heuristic algorithms; computational biochemistry algorithms; development of the Darwin system; text searching and sorting algorithms.
Text Searching Methods: lexicographical indices; clustering techniques; indices based on hashing.
Traditional Model: keywords. Problems: a basic structure of the text is assumed; keywords must be extracted; the number of keywords is variable; queries are restricted to keywords.
PAT Tree: How should the index be built? On keywords, or on the full text? Why use a PAT tree: no restriction on the structure of the text; no keywords are used.
PAT Tree Structure: A PAT tree is a Patricia tree constructed over all the possible sistrings of a text. Two building blocks: the Patricia tree and the sistring.
Patricia Tree: A binary digital tree. Each internal node stores a skip number; each external node links to the data. Example: 011001, 110010, 100100, 001000, 010001, 100010, 000101, 010111, 001011
Sistring: Treat the text as one long string. Each position in the text corresponds to a semi-infinite string (sistring), which starts at that position and extends to the right.
Sistring Example
Text:        Today is Thursday,I want to..
sistring 1:  Today is Thursday,I want to..
sistring 2:  oday is Thursday,I want to..
sistring 7:  is Thursday,I want to..
sistring 10: Thursday,I want to..
...
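The sistring definition can be sketched in a few lines of Python. This is only an illustration: the helper name `sistring` is hypothetical, positions are 1-based as on the slide, and a finite Python slice stands in for the semi-infinite string.

```python
def sistring(text: str, position: int) -> str:
    """Return the sistring that starts at the given 1-based position.

    A real sistring is semi-infinite; here it simply runs to the end
    of the text, which is enough for comparing sistrings of one text.
    """
    if position < 1 or position > len(text):
        raise ValueError("position must be between 1 and len(text)")
    return text[position - 1:]


text = "Today is Thursday,I want to.."
for pos in (1, 2, 7, 10):
    print(f"sistring {pos}: {sistring(text, pos)}")
```

Every position yields exactly one sistring, so a text of n characters has n sistrings and no two of them are equal.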
PAT Tree: A PAT tree is a Patricia tree constructed over all the possible sistrings of a text, that is, PAT tree = Patricia tree + all sistrings of the text. Example text: abbaabaaababa (positions 1, 2, 3, ...).
Indexing Points: word searching; phrase searching. The choice of indexing points is application dependent.
Searching Algorithms on the PAT Tree: prefix searching; range searching; longest repetition searching; proximity searching; most significant or most frequent searching; regular expression searching.
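Prefix Searching: Every node in the same subtree shares the same prefix, so the answer to a prefix query is a whole subtree, a single node, or a miss. Keeping the size of each subtree in its internal node gives the number of matches without traversing the subtree.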
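The idea of reading the answer size from the node where the prefix ends can be sketched as follows. This is not a Patricia tree: to stay short it uses an uncompressed character trie over all sistrings (quadratic space, fine only for tiny texts), and the names `TrieNode`, `build_sistring_trie` and `count_prefix` are assumptions made for the illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> child node
        self.size = 0       # number of sistrings passing through this node


def build_sistring_trie(text: str) -> TrieNode:
    """Insert every sistring of text, counting sistrings per node."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        node.size += 1
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.size += 1
    return root


def count_prefix(root: TrieNode, prefix: str) -> int:
    """Number of sistrings starting with prefix (0 means a miss)."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return 0          # miss: no sistring has this prefix
    return node.size          # size of the matched subtree


root = build_sistring_trie("abbaabaaababa")
print(count_prefix(root, "aba"))  # whole subtree
print(count_prefix(root, "abb"))  # single sistring
print(count_prefix(root, "bbb"))  # miss
```

The query cost depends on the length of the prefix, not on how many sistrings match, because the count is stored in the node rather than computed by walking the subtree.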
Proximity Searching: Look up S1 and S2 in the PAT tree. Find the tallest subtree that contains S1 and S2. Sort the answers for S1 and S2 by position. Check the proximity condition.
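The sort-and-check step can be sketched as below. Several things are assumptions of this sketch rather than part of the slide: a plain `str.find` scan stands in for the two PAT tree searches, only the smaller answer set is sorted, the proximity condition is taken to be "start positions at most b apart", and the helper names are hypothetical.

```python
from bisect import bisect_left


def find_positions(text: str, pattern: str) -> list[int]:
    """1-based start positions of pattern (stand-in for a PAT tree search)."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i + 1)
        i = text.find(pattern, i + 1)
    return positions


def proximity_search(text: str, s1: str, s2: str, b: int) -> list[tuple[int, int]]:
    """Pairs (p1, p2) with s1 at p1, s2 at p2 and |p1 - p2| <= b."""
    a1 = find_positions(text, s1)
    a2 = find_positions(text, s2)
    swapped = len(a1) > len(a2)
    small, large = (a2, a1) if swapped else (a1, a2)
    # A PAT tree returns answers in lexicographic order, so the smaller
    # answer set has to be sorted by position before it can be searched.
    small = sorted(small)
    pairs = []
    for p in large:
        j = bisect_left(small, p - b)           # first candidate >= p - b
        while j < len(small) and small[j] <= p + b:
            pairs.append((p, small[j]) if swapped else (small[j], p))
            j += 1
    return sorted(pairs)


print(proximity_search("abbaabaaababa", "bb", "aaa", 5))
```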
Most Significant or Most Frequent Searching: Search for the biggest subtree, for example to find the most common word.
Regular Expression Searching: Convert the regular expression into a deterministic finite automaton (DFA). Convert the character DFA into a binary DFA. Run the binary DFA on the PAT tree.
Improvement: Efficiency is important. Drawbacks of the PAT tree: external nodes use a large amount of physical space; the number of internal nodes can be very large.
Solution: Map the tree onto the disk using supernodes: allocate as much of the tree as possible in a disk page. Bucketing of external nodes: every subtree with size less than b is stored in a bucket.
But: In actual experiments, disk page fullness was close to 80% (using a greedy algorithm). Each tree page has a path of 10 steps.
PAT Array: What about the size of the bucket? Use a suffix array within the bucket.
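A PAT array can be pictured as the list of sistring start positions, sorted by the sistrings they point to. The sketch below assumes the example text abbaabaaababa from the PAT tree slide, 1-based positions, and ordinary Python string comparison as the sistring order; `build_pat_array` is a hypothetical helper, and the naive sort is meant for illustration only, not for large texts.

```python
def build_pat_array(text: str) -> list[int]:
    """1-based sistring positions in lexicographic order of the sistrings."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


text = "abbaabaaababa"
pat_array = build_pat_array(text)
for pos in pat_array:
    print(f"{pos:2d}  {text[pos - 1:]}")
# Positions in order: 13, 7, 4, 8, 11, 5, 9, 1, 12, 6, 3, 10, 2
```

The array stores one position per indexing point and no internal nodes, which is where the space saving over the tree comes from.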
New Discovery: With a PAT array alone, only longest repetition searching is lost. Prefix searching and range searching can be done using only the PAT array.
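Prefix and range searching on the array alone come down to two binary searches over the sorted positions. The sketch below is one possible reading: a range query returns every sistring s with low <= s whose first len(high) characters are <= high, and a prefix query is the special case low == high. All function names are assumptions, and the text is kept in memory for simplicity.

```python
def build_pat_array(text: str) -> list[int]:
    """1-based sistring positions in lexicographic order of the sistrings."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def first_index(n: int, pred) -> int:
    """Smallest i in [0, n] with pred(i) true, for a monotone predicate."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def range_search(text: str, pat_array: list[int], low: str, high: str) -> list[int]:
    """Positions whose sistring lies between low and high (high as a prefix bound)."""
    def sis(i: int) -> str:
        return text[pat_array[i] - 1:]

    start = first_index(len(pat_array), lambda i: sis(i) >= low)
    end = first_index(len(pat_array), lambda i: sis(i)[:len(high)] > high)
    return pat_array[start:end]


def prefix_search(text: str, pat_array: list[int], prefix: str) -> list[int]:
    """Positions of all sistrings that start with prefix."""
    return range_search(text, pat_array, prefix, prefix)


text = "abbaabaaababa"
pat_array = build_pat_array(text)
print(prefix_search(text, pat_array, "aba"))
print(range_search(text, pat_array, "aab", "aba"))
```

Each query costs a logarithmic number of sistring comparisons, and the answer is a contiguous slice of the array, mirroring the subtree returned by the tree version.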
PAT Array Operations: Build a PAT array in memory. Merge two PAT arrays, using paging to avoid memory thrashing. Cost of merging two PAT arrays: O(n2 log n1) + O(n2). For a large text: split first, then merge.
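One way to read the merge cost is as two phases: a binary search of each of the n2 entries of the smaller array into the larger array of n1 entries, followed by a sequential pass that interleaves them. The sketch below follows that reading under simplifying assumptions: the whole text is in memory, both arrays index sistrings of the same text, and the function names are hypothetical. It does not model paging or disk access.

```python
def build_pat_array(text: str, positions) -> list[int]:
    """Sort the given 1-based positions by the sistrings they start."""
    return sorted(positions, key=lambda p: text[p - 1:])


def merge_pat_arrays(text: str, large: list[int], small: list[int]) -> list[int]:
    """Merge two PAT arrays over the same text into one sorted array."""
    # Phase 1: for each entry of the small array, binary search the slot
    # in the large array where it belongs (n2 searches of log n1 steps).
    slots = []
    for p in small:
        key = text[p - 1:]
        lo, hi = 0, len(large)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[large[mid] - 1:] < key:
                lo = mid + 1
            else:
                hi = mid
        slots.append(lo)
    # Phase 2: one sequential pass. The small array is sorted, so its
    # slots are non-decreasing and can be consumed in order.
    merged, j = [], 0
    for i, q in enumerate(large):
        while j < len(small) and slots[j] == i:
            merged.append(small[j])
            j += 1
        merged.append(q)
    merged.extend(small[j:])
    return merged


text = "abbaabaaababa"
first = build_pat_array(text, range(1, 8))    # positions 1..7
second = build_pat_array(text, range(8, 14))  # positions 8..13
print(merge_pat_arrays(text, first, second))
```

Splitting a large text into blocks, building one array per block, and merging the results in this way gives the "split first, then merge" scheme.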
Delayed Reading Paradigm: Reading a sistring requires a random disk access. Store each read request in a pool and wait before serving it. Use served requests to generate more requests.
Summary: Signature file: storage is small, but searching time is linear and filtering is needed. Inverted file: performance is good, but storage is huge. PAT tree: …………………