New Indices for Text: PAT Trees and PAT Arrays


New Indices for Text: PAT Trees and PAT Arrays. Gaston H. Gonnet, Ricardo A. Baeza-Yates, Tim Snider. Presenter: 吳彥欽

Outline: Author introduction; Introduction; PAT tree; Searching algorithms on the PAT tree; PAT array; Summary

Author Introduction: Gaston H. Gonnet, Professor, ETH Zürich, Switzerland, Informatik, Institute for Scientific Computation (http://www.inf.ethz.ch/personal/gonnet/). Research: symbolic and algebraic computation; heuristic algorithms; computational biochemistry algorithms; development of the Darwin system; text searching and sorting algorithms.

Text Searching Methods: lexicographical indices; clustering techniques; indices based on hashing.

Traditional Model: keywords. Problems: a basic structure of the text is assumed; keywords must be extracted; the number of keywords is variable; queries are restricted to keywords.

PAT tree: how should the index be built? From keywords, or from the full text? The PAT tree indexes the full text. Why use a PAT tree: no restriction on the structure of the text; no keywords are used.

PAT-tree Structure: a PAT tree is a Patricia tree constructed over all the possible sistrings of a text. Two ingredients: the Patricia tree and the sistring.

Patricia tree: a binary digital tree. Each internal node stores a skip number (how many bits to skip before the next bit test); each external node links to the data. Example keys: 011001, 110010, 100100, 001000, 010001, 100010, 000101, 010111, 001011.
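A minimal sketch of the binary Patricia tree described on this slide, built over the nine example keys. The names `Leaf`, `Node`, `insert`, `contains` and `count` are hypothetical, and the sketch assumes all keys are distinct bit strings of equal length. For simplicity each internal node stores the absolute index of the bit it tests; the skip number on the slide would be the difference between that index and the parent's.

```python
class Leaf:
    """External node: links to the data (here, just the key itself)."""
    def __init__(self, key):
        self.key = key


class Node:
    """Internal node: tests a single bit of the key."""
    def __init__(self, bit, left, right):
        self.bit = bit      # absolute index of the tested bit (assumption, see lead-in)
        self.left = left    # subtree whose keys have '0' at that bit
        self.right = right  # subtree whose keys have '1' at that bit


def _closest_leaf(node, key):
    # Follow only the tested bits; all other bits are skipped.
    while isinstance(node, Node):
        node = node.right if key[node.bit] == "1" else node.left
    return node


def insert(root, key):
    """Insert an equal-length bit string and return the (possibly new) root."""
    if root is None:
        return Leaf(key)
    other = _closest_leaf(root, key).key
    if other == key:
        return root  # already present
    # First bit where the new key differs from the closest stored key.
    diff = next(i for i in range(len(key)) if key[i] != other[i])
    # Descend again and stop where a test on bit `diff` belongs.
    parent, node = None, root
    while isinstance(node, Node) and node.bit < diff:
        parent, node = node, (node.right if key[node.bit] == "1" else node.left)
    leaf = Leaf(key)
    branch = Node(diff, node, leaf) if key[diff] == "1" else Node(diff, leaf, node)
    if parent is None:
        return branch
    if key[parent.bit] == "1":
        parent.right = branch
    else:
        parent.left = branch
    return root


def contains(root, key):
    # The skipped bits are verified by one full comparison at the leaf.
    return root is not None and _closest_leaf(root, key).key == key


def count(node):
    """Return (number of internal nodes, number of external nodes)."""
    if node is None:
        return (0, 0)
    if isinstance(node, Leaf):
        return (0, 1)
    li, le = count(node.left)
    ri, re_ = count(node.right)
    return (1 + li + ri, le + re_)


KEYS = ["011001", "110010", "100100", "001000", "010001",
        "100010", "000101", "010111", "001011"]
root = None
for k in KEYS:
    root = insert(root, k)
```

With n external nodes the tree always has n - 1 internal nodes, which is why the index size depends on the number of indexed positions and not on the key length.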

Sistring: treat the text as one long string. Each position in the text corresponds to a semi-infinite string (sistring), which starts at that position and extends to the right as far as needed. Example:

Sistring Example. Text: "Today is Thursday,I want to.."; sistring 1: "Today is Thursday,I want to.."; sistring 2: "oday is Thursday,I want to.."; sistring 7: "is Thursday,I want to.."; sistring 10: "Thursday,I want to.."; and so on.
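The sistring numbering above can be sketched in a few lines. The function name `sistring` is hypothetical; positions are 1-based, as on the slide, and the "semi-infinite" string is simply cut off at the end of the available text.

```python
def sistring(text, position):
    """Return the sistring starting at a 1-based position of the text."""
    if not 1 <= position <= len(text):
        raise ValueError("position outside the text")
    return text[position - 1:]


text = "Today is Thursday,I want to.."
for pos in (1, 2, 7, 10):
    print(pos, repr(sistring(text, pos)))
```

No sistring is ever copied into the index; only its starting position is stored, and the characters are read from the text when a comparison needs them.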

PAT Tree: a PAT tree is a Patricia tree constructed over all the possible sistrings of a text. PAT tree = Patricia tree + all sistrings of the text. Example text: abbaabaaababa, with positions numbered 1, 2, 3, ... from the left.

Indexing Point: word searching; phrase searching. The choice of indexing points is application dependent.

Searching Algorithms on the PAT tree: prefix searching; range searching; longest repetition searching; proximity searching; most significant or most frequent searching; regular expression searching.

Prefix Searching: every node in the same subtree has the same prefix. The search ends at a subtree, at a single node, or with no match. Keep the size of each subtree in its internal node, so the number of matches is known without traversing the subtree.

Proximity Searching: search for S1 and S2 in the PAT tree. Find the tallest subtree which contains S1 and S2. Sort the answers for S1 and S2 by position. Check the proximity condition.
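The sort-then-check part of proximity searching can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the match positions come from a plain `str.find` scan standing in for the PAT tree lookup, and the names `find_all` and `proximity_search` are hypothetical. The smaller answer set is sorted by position and each answer of the larger set is checked against it by binary search.

```python
from bisect import bisect_left


def find_all(text, pattern):
    """Stand-in for the PAT tree search: all start positions of pattern."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def proximity_search(text, s1, s2, max_dist):
    """Return (pos_s1, pos_s2) pairs whose start positions differ by <= max_dist."""
    a, b = find_all(text, s1), find_all(text, s2)
    small_is_s1 = len(a) <= len(b)
    small, large = (a, b) if small_is_s1 else (b, a)
    small = sorted(small)  # sort the smaller answer set by position
    pairs = []
    for p in large:
        # Binary search for the first position that is not too far to the left.
        j = bisect_left(small, p - max_dist)
        while j < len(small) and small[j] <= p + max_dist:
            pairs.append((small[j], p) if small_is_s1 else (p, small[j]))
            j += 1
    return pairs


text = "cat sat on the mat; the cat ran"
print(proximity_search(text, "cat", "the", 5))  # → [(24, 20)]
```

Sorting only the smaller set keeps the cost dominated by the larger set times a logarithmic factor, rather than by a comparison of every pair.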

Most Significant or Most Frequent Searching: search for the biggest subtree; for example, the most common word.
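A rough sketch of the "biggest subtree" idea, using a sorted list of sistring positions in place of the tree (an assumption made to keep the example short). All sistrings that share a prefix of length k sit next to each other in sorted order, so the largest such run corresponds to the largest subtree at that depth. The name `most_frequent_prefix` is hypothetical.

```python
from itertools import groupby


def most_frequent_prefix(text, k):
    """Return (prefix, count) for the most frequent length-k substring."""
    # Sorted sistring positions; only sistrings with at least k characters count.
    positions = sorted(range(len(text) - k + 1), key=lambda i: text[i:])
    best, best_count = None, 0
    # Equal k-prefixes are adjacent in sorted order, so groupby sees each once.
    for prefix, group in groupby(positions, key=lambda i: text[i:i + k]):
        size = sum(1 for _ in group)
        if size > best_count:
            best, best_count = prefix, size
    return best, best_count


print(most_frequent_prefix("abbaabaaababa", 3))  # → ('aba', 3)
```

In the tree itself no scan is needed when subtree sizes are kept in the internal nodes; the sketch recomputes those sizes by counting runs.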

Regular Expression Searching: convert the regular expression into a deterministic finite automaton (DFA); convert the character DFA into a binary DFA; run the binary DFA on the PAT tree.

Improvement: efficiency is important. PAT tree drawbacks: external nodes use a large amount of physical space; the number of internal nodes can be very large.

Solution: (1) Map the tree onto the disk using supernodes: allocate as much of the tree as possible in one disk page. (2) Bucketing of external nodes: every subtree with size less than b is stored in a bucket.

But: disk page fullness in the actual experiments is close to 80% (using a greedy algorithm). Each tree page covers a path of 10 steps.

PAT Array: what about the size of the bucket? Use a suffix array inside the bucket. PAT array example:
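As a small illustration of a PAT array, the sketch below sorts the sistring positions of a text lexicographically, which is the same thing as a suffix array. The function name `build_pat_array` is hypothetical, positions are 0-based here, and the comparison-based sort is chosen for brevity, not efficiency.

```python
def build_pat_array(text):
    """Return the 0-based sistring positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana"
pat_array = build_pat_array(text)
print(pat_array)  # → [5, 3, 1, 0, 4, 2]
for pos in pat_array:
    print(pos, text[pos:])
```

The array stores one integer per indexed position and no tree pointers, which is where the space saving over the PAT tree comes from.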

New Discovery: the only search the PAT array misses is longest repetition. Prefix searching and range searching can be done using only the PAT array.
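Prefix and range searching on a PAT array both reduce to binary search, as the following sketch suggests. The helper names (`build_pat_array`, `_bound`, `prefix_search`, `range_search`) are hypothetical, positions are 0-based, and the range search is assumed to be inclusive at both ends, comparing each sistring truncated to the length of the bound.

```python
def build_pat_array(text):
    """Sistring positions (0-based) in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, pat_array, pattern, strict):
    """First index whose truncated sistring is >= pattern (or > if strict)."""
    m = len(pattern)
    lo, hi = 0, len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        head = text[pat_array[mid]:pat_array[mid] + m]
        if head < pattern or (strict and head == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def prefix_search(text, pat_array, prefix):
    """Positions of all sistrings that start with prefix."""
    start = _bound(text, pat_array, prefix, strict=False)
    end = _bound(text, pat_array, prefix, strict=True)
    return sorted(pat_array[start:end])


def range_search(text, pat_array, low, high):
    """Positions of sistrings between low and high, in lexicographic order."""
    start = _bound(text, pat_array, low, strict=False)
    end = _bound(text, pat_array, high, strict=True)
    return pat_array[start:end]


text = "banana"
pat_array = build_pat_array(text)
print(prefix_search(text, pat_array, "an"))      # → [1, 3]
print(range_search(text, pat_array, "b", "na"))  # → [0, 4, 2]
```

Each query needs two binary searches, so the number of string comparisons grows with the logarithm of the array size, and the answer is a contiguous slice of the array.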

PAT Array Operation: build the PAT array in memory; merge two PAT arrays, using paging to avoid memory thrashing. Merging two PAT arrays costs O(n2 · log n1) + O(n2). Split first, then merge.
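One way to picture the merge is sketched below: each entry of the small array (n2 entries) is located in the large array (n1 entries) by binary search, which accounts for the n2 · log n1 comparisons, and the two arrays are then combined in a sequential pass. This is an illustration under assumptions: both arrays are taken to be sorted with respect to sistrings of the same full text, everything is held in memory, and the names `build_pat_array` and `merge_pat_arrays` are hypothetical.

```python
def build_pat_array(text, positions):
    """Sort the given sistring positions (0-based) lexicographically."""
    return sorted(positions, key=lambda i: text[i:])


def merge_pat_arrays(text, large, small):
    """Merge a small PAT array into a large one over the same text."""
    # Step 1: binary search each small entry in the large array and count
    # how many small entries fall into each slot of the large array.
    counts = [0] * (len(large) + 1)
    for p in small:
        lo, hi = 0, len(large)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[large[mid]:] < text[p:]:
                lo = mid + 1
            else:
                hi = mid
        counts[lo] += 1
    # Step 2: sequential pass. The small array is already sorted, so its
    # entries are consumed in order while the large array is copied.
    merged = []
    pending = iter(small)
    for slot, q in enumerate(large):
        for _ in range(counts[slot]):
            merged.append(next(pending))
        merged.append(q)
    for _ in range(counts[len(large)]):
        merged.append(next(pending))
    return merged


text = "abbaabaaababa"
large = build_pat_array(text, range(0, 8))
small = build_pat_array(text, range(8, 13))
merged = merge_pat_arrays(text, large, small)
```

Separating the counting step from the copying step means the large array is read sequentially only once, which suits an array that lives on disk.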

Delayed Reading Paradigm: reading sistrings causes random disk access. Store each read request in a pool and wait before serving it. Use answered requests to generate more requests.

Summary: comparison of signature file, inverted file, and PAT tree. Signature file: storage is small but searching time is linear; filtering is needed. Inverted file: performance is good but storage is huge. PAT tree: …