New Indices for Text: PAT Trees and PAT Arrays

1 New Indices for Text: PAT Trees and PAT Arrays
Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider
Presenter: 吳彥欽

2 Outline
Author introduction
Introduction
PAT tree
Searching algorithms on the PAT tree
PAT array
Summary

3 Author Introduction: Gaston H. Gonnet
Professor, ETH Zürich, Switzerland, Informatik, Institute for Scientific Computation
Symbolic and algebraic computation, heuristic algorithms
Computational biochemistry algorithms
Development of the Darwin system
Text searching and sorting algorithms

4 Text Searching Methods
Lexicographical indices
Clustering techniques
Indices based on hashing

5 Traditional Model
The traditional model is built on keywords.
Problems:
A basic structure of the text is assumed.
Keywords have to be extracted.
The number of keywords is variable.
Queries are restricted to keywords.

6 PAT Tree
How to build the indices? From keywords? No: from the full text.
Why use a PAT tree?
No restriction on structure.
No keywords are used.

7 PAT Tree Structure
A PAT tree is a Patricia tree constructed over all the possible sistrings of a text.
Two building blocks: the Patricia tree and the sistring.

8 Patricia Tree
A binary digital tree.
Internal node: stores a skip number, the number of bits to skip before the next bit that is tested.
External node: links to the data.
Example: [figure]

9 Sistring
Treat the text as one long string.
Each position in the text corresponds to a semi-infinite string (sistring): the string that starts at that position and extends to the right as far as needed.

10 Sistring Example
Text: Today is Thursday,I want to..
sistring 1: Today is Thursday,I want to..
sistring 2: oday is Thursday,I want to..
sistring 7: is Thursday,I want to..
sistring 10: Thursday,I want to..
...
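The numbering on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `sistring` is an assumption, positions are taken to be 1-based as on the slide, and the "semi-infinite" string is simply cut off at the end of the text.

```python
def sistring(text: str, position: int) -> str:
    """Return the sistring that starts at a 1-based position of the text."""
    if not 1 <= position <= len(text):
        raise ValueError("position must lie inside the text")
    # Python slices are 0-based, so shift the slide's 1-based position.
    return text[position - 1:]


text = "Today is Thursday,I want to.."
for pos in (1, 2, 7, 10):
    print(f"sistring {pos}: {sistring(text, pos)}")
```

An index never needs to store these strings: a sistring is fully identified by its starting position, so one integer per sistring is enough.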

11 PAT Tree
A PAT tree is a Patricia tree constructed over all the possible sistrings of a text.
PAT tree = Patricia tree + all sistrings of the text
Example text: abbaabaaababa [figure: the text with its character positions]

12 Indexing Points
Word searching
Phrase searching
The choice of indexing points is application dependent.

13 Searching Algorithms on the PAT Tree
Prefix searching
Range searching
Longest repetition searching
Proximity searching
Most significant or most frequent searching
Regular expression searching

14 Prefix Searching
Every node in the same subtree shares the same prefix.
A search ends at a subtree, at a single node, or with a miss.
Keep the size of each subtree in its internal node, so the number of matches is known without traversing the subtree.
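The idea of storing subtree sizes can be sketched as follows. Note the simplifications: this is an uncompressed character trie, not a binary Patricia tree with skip numbers, and the names `Node`, `build_trie`, and `count_prefix` are hypothetical. It inserts every sistring in full, so it is only suitable for tiny texts such as the example string abbaabaaababa.

```python
class Node:
    """Trie node that records how many sistrings pass through it."""

    def __init__(self):
        self.children = {}
        self.size = 0  # number of sistrings in the subtree rooted here


def build_trie(text):
    """Insert every sistring of the text into a character trie."""
    root = Node()
    for start in range(len(text)):
        node = root
        node.size += 1
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.size += 1
    return root


def count_prefix(root, prefix):
    """Follow the prefix; the size of the node reached is the answer count."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return 0  # a miss: no sistring starts with this prefix
    return node.size


root = build_trie("abbaabaaababa")
print(count_prefix(root, "ab"))
print(count_prefix(root, "aba"))
print(count_prefix(root, "zz"))
```

The cost of `count_prefix` depends only on the length of the prefix, not on the number of matches, which is the point of keeping the sizes in the nodes.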

15 Proximity Searching
Locate S1 and S2 in the PAT tree.
Find the tallest subtree that contains S1 and S2.
Sort the answers for S1 and S2 by position.
Check the proximity condition.
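The last two steps (sort by position, then check the proximity condition) can be sketched as below. This is an assumption-laden illustration: the helper `occurrences` uses a naive scan as a stand-in for the PAT tree lookup, the names `proximity` and `max_distance` are hypothetical, and distance is measured between 0-based starting positions.

```python
from bisect import bisect_left


def occurrences(text, word):
    """Naive stand-in for a PAT tree search: all 0-based start positions."""
    return [i for i in range(len(text)) if text.startswith(word, i)]


def proximity(text, s1, s2, max_distance):
    """Return pairs (p1, p2) of start positions at most max_distance apart."""
    first = occurrences(text, s1)
    second = sorted(occurrences(text, s2))  # sorted answer set for s2
    pairs = []
    for p in first:
        # Binary search for the first s2 position inside the window.
        i = bisect_left(second, p - max_distance)
        while i < len(second) and second[i] <= p + max_distance:
            pairs.append((p, second[i]))
            i += 1
    return pairs


text = "Today is Thursday,I want to.."
print(proximity(text, "is", "want", 20))
print(proximity(text, "is", "want", 10))
```

Only one answer set has to be sorted; each position of the other set is then resolved with a binary search.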

16 Most Significant or Most Frequent Searching
Search for the biggest subtree.
Example: finding the most common word.

17 Regular Expression Searching
Convert the regular expression into a deterministic finite automaton (DFA).
Convert the character DFA into a binary DFA.
Run the binary DFA on the PAT tree.

18 Improvement
Efficiency is important.
PAT tree drawbacks:
External nodes use a large amount of physical space.
The number of internal nodes can be very large.

19 Solution
Map the tree onto the disk using supernodes: allocate as much of the tree as possible in each disk page.
Bucketing of external nodes: every subtree with size less than b is stored in a bucket.

20 But...
Disk page fullness in the actual experiments was close to 80% (using a greedy algorithm).
Each tree page covers a path of 10 steps.

21 PAT Array
What about the size of the bucket?
Use a suffix array inside the bucket.
PAT array example: [figure]
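A PAT array can be sketched as the list of sistring start positions, sorted by the sistrings they point to. The code below is a minimal sketch rather than the authors' construction: it sorts whole suffixes in memory, uses 0-based positions, and the names `build_pat_array` and `prefix_range` are hypothetical. Prefix searching becomes two binary searches over the array.

```python
def build_pat_array(text):
    """Start positions of all sistrings, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def prefix_range(text, pat, prefix):
    """Return (start, end) so that pat[start:end] holds all matches."""
    m = len(prefix)

    def key(k):
        # The first m characters of the k-th sistring in sorted order.
        return text[pat[k]:pat[k] + m]

    lo, hi = 0, len(pat)
    while lo < hi:  # first entry whose truncated sistring is >= prefix
        mid = (lo + hi) // 2
        if key(mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(pat)
    while lo < hi:  # first entry whose truncated sistring is > prefix
        mid = (lo + hi) // 2
        if key(mid) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "abbaabaaababa"
pat = build_pat_array(text)
start, end = prefix_range(text, pat, "ab")
print(sorted(pat[start:end]))
```

The number of matches is `end - start`, so counting needs no scan of the answer set. A range search works the same way, with the two binary searches using the lower and upper bounds of the range instead of a single prefix.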

22 New Discovery
Of the searches above, the PAT array loses only longest repetition searching.
Prefix searching and range searching can be done with the PAT array alone.

23 PAT Array Operations
Build the PAT array in memory: use paging and avoid memory thrashing.
Merge two PAT arrays: O(n2 log n1) + O(n2).
For a large text: split first, then merge.

24 Delayed Reading Paradigm
Problem: reading sistrings causes random disk accesses.
Store each read request in a pool and delay serving it.
Use the served requests to generate more requests.

25 Summary
Signature file: storage is small, but searching time is linear and filtering is needed.
Inverted file: performance is good, but storage is huge.
PAT tree: ...

