Updating SF-Tree Speaker: Ho Wai Shing.

Presentation transcript:

Updating SF-Tree Speaker: Ho Wai Shing

Contents
- Introduction and summary of SF-Tree
- Possible SF-Tree update techniques and their limitations
- Future work

Introduction
- SF-Tree stands for Signature File Tree
- It was originally designed for storing the selectivity of XML simple path expressions (SPEs)
- It can be generalized to store an "object-to-count" mapping

SF-Tree Basics
- Divide objects into groups of the same (or similar) count(s)
- Finding the count of an object is equivalent to finding the group containing the object
- Signature files (SFs) are used to summarize groups
- SFs are organized in a tree form

Signature Files
- Are bit vectors
- Computed by hashing objects into bit positions and setting those bits
- Existence of an object can be checked by checking the hashed bit positions

Signature Files
- e.g., F has 10 bits, m = 3
- "//name" is hashed to bits 2, 3, 8
- "//buyer" is hashed to bits 2, 4, 9
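As a rough illustration of the signature file described above, the sketch below keeps a 10-bit vector and sets m = 3 hashed positions per object. The class name `SignatureFile`, the method names, and the salted SHA-256 hashing are assumptions made for this example; the slides do not say which hash functions are used, so the positions it produces will not match the 2, 3, 8 / 2, 4, 9 example.

```python
import hashlib


class SignatureFile:
    """Bit-vector signature: each object sets up to m hashed bit positions."""

    def __init__(self, num_bits: int = 10, m: int = 3):
        self.num_bits = num_bits
        self.m = m
        self.bits = [0] * num_bits

    def positions(self, obj: str) -> set:
        # Hypothetical hashing scheme: m salted SHA-256 digests, reduced
        # modulo the signature length. Salts may collide on the same bit,
        # so an object can map to fewer than m distinct positions.
        return {
            int.from_bytes(
                hashlib.sha256(f"{i}:{obj}".encode()).digest()[:8], "big"
            ) % self.num_bits
            for i in range(self.m)
        }

    def insert(self, obj: str) -> None:
        for p in self.positions(obj):
            self.bits[p] = 1

    def might_contain(self, obj: str) -> bool:
        # True means "possibly present" (false drops can occur);
        # False means "definitely never inserted".
        return all(self.bits[p] for p in self.positions(obj))


sf = SignatureFile(num_bits=10, m=3)
sf.insert("//name")
sf.insert("//buyer")
print(sf.might_contain("//name"))  # True: inserted objects always test positive
```

A positive answer for an object that was never inserted is a false drop; it becomes more likely as more bits are set.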

SF-Tree

SF-Tree

Introduction
- Advantages:
  - Time efficient: independent of database size
  - Space efficient: independent of original object size
  - Accurate: has a statistical accuracy guarantee on the returned counts
  - Flexible: parameters can be tuned to make trade-offs among space, accuracy, and speed

Introduction
- Disadvantages:
  - Not 100% accurate, since the original objects are discarded
  - All information must be available before building a Shannon-Fano SF-Tree
  - Updates may reduce its accuracy

Introduction
- Possible applications:
  - support count storage in stream data mining
  - data cube storage
- Periodic reconstructions are not always feasible (especially for data streams), so we need a good update strategy for SF-Tree

Updating SF-Tree
- Basic idea: when the count of an object changes, the object moves from one group to another
- i.e., the object is deleted from one group and inserted into another group
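The bookkeeping behind this idea can be sketched with exact Python sets standing in for the group signatures. This is only an illustration of "delete from one group, insert into another"; the names `groups`, `find_count`, and `update_count` are hypothetical, and a real SF-Tree stores lossy signatures rather than the objects themselves.

```python
from collections import defaultdict

# count -> set of objects with that count (exact stand-in for a signature)
groups = defaultdict(set)


def find_count(obj):
    """Return the count of the group that contains obj, or None if absent."""
    for count, members in groups.items():
        if obj in members:
            return count
    return None


def update_count(obj, new_count):
    """Move obj to the group for new_count."""
    old_count = find_count(obj)
    if old_count is not None:
        groups[old_count].discard(obj)  # deletion from the old group
    groups[new_count].add(obj)          # insertion into the new group


update_count("//name", 3)
update_count("//name", 5)
print(find_count("//name"))  # 5
```

With exact sets both steps are trivial; the difficulty in an SF-Tree comes from performing the same two steps on signatures.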

Updating SF-Tree
- Problems:
  - No deletion algorithm for deleting objects in signature files
  - No algorithm for creating a new group in an SF-Tree
  - No way to compute new signatures for the existing objects

Updating SF-Tree
- Solutions to the problems
- For deletion:
  - Counter-based signature files
  - "No-deletion" scheme
  - Look-ahead retrieval
- For insertion:
  - Dynamically expanding signature files
  - Precomputed signature files
  - Negative signature files

Counter-Based Signature Files
- Motivation:
  - deletion of an object involves resetting some bits
  - resetting bits may cause "false negatives" (some objects that are present in the group can no longer be retrieved)

Counter-Based Signature Files
- e.g., "//name" is hashed to bits 2, 3, 8 and "//buyer" is hashed to bits 2, 4, 9
- When we remove "//buyer" by resetting its bits, "//name" can no longer be retrieved (since bit 2 = 0)
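The failure in this example can be reproduced in a few lines. The sketch below hard-codes the bit positions from the slide in a lookup table instead of using a real hash function; the helper names are made up for the illustration.

```python
F = 10  # signature length in bits
POSITIONS = {"//name": (2, 3, 8), "//buyer": (2, 4, 9)}  # from the slide example

bits = [0] * F


def insert(obj):
    for p in POSITIONS[obj]:
        bits[p] = 1


def delete_naively(obj):
    # Naive deletion: reset every bit the object hashes to.
    for p in POSITIONS[obj]:
        bits[p] = 0


def might_contain(obj):
    return all(bits[p] for p in POSITIONS[obj])


insert("//name")
insert("//buyer")
before = might_contain("//name")  # True
delete_naively("//buyer")         # clears bits 2, 4, 9
after = might_contain("//name")   # False: the shared bit 2 was cleared
print(before, after)              # True False
```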

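Counter-Based Signature Files
- Use counters instead of bits in the signature file
- e.g., "//name" is hashed to bits 2, 3, 8 and "//buyer" is hashed to bits 2, 4, 9, so the counter at position 2 becomes 2
- Removing "//buyer" decrements that counter to 1, so "//name" can still be retrieved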
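A minimal sketch of a counter-based signature file follows, using the bit positions from the example. The class name, the `max_count` parameter, and the saturation policy (a counter that reaches its maximum is never changed again) are assumptions for illustration; the slides do not specify how counters are sized or how overflow is handled.

```python
class CounterSignatureFile:
    """Signature file that keeps a small counter per position instead of a bit."""

    def __init__(self, size=10, max_count=15):
        self.counters = [0] * size
        self.max_count = max_count  # e.g. 15 for hypothetical 4-bit counters

    def insert(self, positions):
        for p in positions:
            if self.counters[p] < self.max_count:
                self.counters[p] += 1  # saturate instead of wrapping around

    def delete(self, positions):
        for p in positions:
            # Assumed policy: a saturated counter has lost track of its true
            # value, so it is left alone; zero counters are never decremented.
            if 0 < self.counters[p] < self.max_count:
                self.counters[p] -= 1

    def might_contain(self, positions):
        return all(self.counters[p] > 0 for p in positions)


NAME_BITS = (2, 3, 8)   # "//name" in the slide example
BUYER_BITS = (2, 4, 9)  # "//buyer" in the slide example

csf = CounterSignatureFile()
csf.insert(NAME_BITS)
csf.insert(BUYER_BITS)
csf.delete(BUYER_BITS)
print(csf.might_contain(NAME_BITS))   # True: position 2 still holds a count of 1
print(csf.might_contain(BUYER_BITS))  # False
```

The deletion is only safe when the object really was inserted; deleting an object that merely tests positive decrements counters that belong to other objects.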

Counter-Based Signature Files
- Advantages:
  - deleting an object that is in F won't cause "false negatives"
- Disadvantages:
  - space requirement increases
  - deleting an object that only appears to be in F (a false drop) still causes trouble
  - counters may overflow

"No-Deletion" Scheme
- Motivation:
  - deletion causes trouble (false negatives)
  - there is no way to avoid it completely
  - false negatives may cause big errors in counts

"No-Deletion" Scheme
- Don't reset the bits when deleting an object (i.e., no deletion)
- For retrieval, return the count of the group with the largest count (among the groups that match the query)
- Advantages: completely removes the trouble caused by deletion (no more false negatives)

"No-Deletion" Scheme
- Disadvantages:
  - may reduce the accuracy of the signature (if the signature size is unchanged)
  - may increase the space requirement significantly (if we maintain the accuracy)
  - applies only to applications whose counts are monotonically increasing (or decreasing)

"No-Deletion" Scheme
- Still useful for updates between periodic reconstructions, since:
  - retrieval time is unchanged (for positive queries)
  - it is more space efficient than counters if updates are infrequent (since bit-based signature files can still be used)
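The "no-deletion" idea can be sketched for monotonically increasing counts as follows. For brevity the sketch scans a flat dictionary of per-count signatures instead of walking a tree, and it reuses the bit positions from the earlier "//name" / "//buyer" example; the function names are hypothetical.

```python
F = 10
POSITIONS = {"//name": (2, 3, 8), "//buyer": (2, 4, 9)}  # from the slide example

signatures = {}  # count -> bit list summarizing the group with that count


def record_count(obj, count):
    """Insert obj into the group for `count`; older groups are never touched."""
    bits = signatures.setdefault(count, [0] * F)
    for p in POSITIONS[obj]:
        bits[p] = 1


def find_count(obj):
    """Return the largest count among the groups whose signature matches obj."""
    matches = [
        count
        for count, bits in signatures.items()
        if all(bits[p] for p in POSITIONS[obj])
    ]
    return max(matches) if matches else None


record_count("//name", 1)
record_count("//buyer", 1)
record_count("//name", 2)     # count grows; group 1 still contains "//name"
print(find_count("//name"))   # 2: the stale entry in group 1 is outvoted
print(find_count("//buyer"))  # 1
```

Because stale entries are never cleared, the signatures fill up over time, which is where the accuracy or space cost comes from.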

Look-Ahead Retrieval
- Motivation: reduce errors due to false negatives
- Retrieval stops only when two consecutive levels of signature files do not contain the query
- Re-insert the object if we think that it is a false negative
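One way to picture look-ahead retrieval is along a single root-to-leaf path of signatures. That single-path model is a simplifying assumption for this sketch, as are the function names; the bit positions are again taken from the "//name" / "//buyer" example. A miss at one level is forgiven, and healed, when the next level still matches.

```python
POSITIONS = {"//name": (2, 3, 8), "//buyer": (2, 4, 9)}  # from the slide example


def contains(bits, obj):
    return all(bits[p] for p in POSITIONS[obj])


def set_bits(bits, obj):
    for p in POSITIONS[obj]:
        bits[p] = 1


def plain_depth(path, obj):
    """Ordinary retrieval: stop at the first level that misses."""
    depth = 0
    for bits in path:
        if not contains(bits, obj):
            break
        depth += 1
    return depth


def look_ahead_depth(path, obj):
    """Stop only when two consecutive levels miss; heal single misses."""
    level = 0
    while level < len(path):
        if not contains(path[level], obj):
            has_next = level + 1 < len(path)
            if not (has_next and contains(path[level + 1], obj)):
                break                       # two misses in a row (or no next level)
            set_bits(path[level], obj)      # suspected false negative: re-insert
        level += 1
    return level


# Three nested levels that all contain "//name" ...
path = [[0] * 10 for _ in range(3)]
for bits in path:
    set_bits(bits, "//name")
path[1][2] = 0  # ... until a deletion wrongly clears bit 2 at the middle level

plain = plain_depth(path, "//name")        # 1: stops at the damaged level
healed = look_ahead_depth(path, "//name")  # 3: skips over it and repairs it
print(plain, healed, path[1][2])           # 1 3 1
```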

Look-Ahead Retrieval e.g.,

Look-Ahead Retrieval
- Advantages:
  - Reduces the probability that the SF-Tree is affected by false negatives
  - The self-healing property removes some false negatives

Look-Ahead Retrieval
- Disadvantages:
  - slower (more signatures to be examined)
  - false drop rate increases:
    - more 1s are created in the self-healing process
    - one less level acts as a safeguard against false drops

Insertion
- Involves 2 parts:
  - insertion within a signature file
  - inserting new signature files in the SF-Tree
- A signature file has a capacity under an error bound
  - more objects -> more 1s -> higher false drop rate
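The link between the number of stored objects and the false drop rate can be estimated with the standard Bloom-filter approximation p ≈ (1 − e^(−mn/F))^m for a signature of F bits, m hashed positions per object, and n stored objects. The slides do not state a formula, so the sketch below is an assumption that treats the hash positions as independent and uniform; the function names are hypothetical.

```python
import math


def false_drop_rate(num_bits, m, n):
    """Approximate false drop probability after inserting n objects."""
    return (1.0 - math.exp(-m * n / num_bits)) ** m


def capacity(num_bits, m, max_rate):
    """Largest n whose estimated false drop rate stays within max_rate."""
    if not 0.0 < max_rate < 1.0:
        raise ValueError("max_rate must be strictly between 0 and 1")
    n = 0
    while false_drop_rate(num_bits, m, n + 1) <= max_rate:
        n += 1
    return n


# How many objects a 1024-bit signature with m = 3 holds under a 1% bound.
cap = capacity(1024, 3, 0.01)
print(cap, false_drop_rate(1024, 3, cap))
```

Once a signature reaches this capacity, further insertions push it past the error bound unless more bits are made available.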

Dynamically Expanding Signature Files
- To maintain the error (false drop probability) bound, the size of the SFs must be able to grow
- However, the objects are dropped, so we can't recompute the signatures of previous objects
- Add new signatures to represent new objects

Dynamically Expanding Signature Files
- e.g., create a new signature when the first one is full, doubling the size at each creation
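A sketch of this expansion strategy: new objects always go into the newest signature, and when that one is full a signature of twice the size is appended. What counts as "full" is an assumption here (a fixed number of objects per bit, `objects_per_bit`), as are the class name and the salted SHA-256 hashing.

```python
import hashlib


def bit_positions(obj, size, m=3):
    # Hypothetical hash: m salted SHA-256 digests reduced modulo the size.
    return {
        int.from_bytes(hashlib.sha256(f"{i}:{obj}".encode()).digest()[:8], "big") % size
        for i in range(m)
    }


class ExpandingSignature:
    """A chain of signatures; each new one is twice as large as the last."""

    def __init__(self, first_size=8, objects_per_bit=0.25):
        self.objects_per_bit = objects_per_bit
        self.parts = [[0] * first_size]  # bit vectors, oldest first
        self.counts = [0]                # objects inserted into each part

    def _capacity(self, size):
        return max(1, int(size * self.objects_per_bit))

    def insert(self, obj):
        newest = self.parts[-1]
        if self.counts[-1] >= self._capacity(len(newest)):
            # Old signatures cannot be rebuilt (their objects are gone),
            # so leave them as they are and start a larger one.
            newest = [0] * (2 * len(newest))
            self.parts.append(newest)
            self.counts.append(0)
        for p in bit_positions(obj, len(newest)):
            newest[p] = 1
        self.counts[-1] += 1

    def might_contain(self, obj):
        # An object may live in any part, so every part is checked.
        return any(
            all(bits[p] for p in bit_positions(obj, len(bits)))
            for bits in self.parts
        )


sig = ExpandingSignature()
for i in range(20):
    sig.insert(f"//path{i}")
print([len(bits) for bits in sig.parts])  # [8, 16, 32, 64]
```

Lookups have to check every part, and the older, smaller parts keep their bits forever, which reflects the space cost of the approach.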

Dynamically Expanding Signature Files
- Advantages: the signature files can now store an arbitrary number of objects
- Disadvantages: less space efficient

Precomputed Signature Files
- The previous method solves insertion within a signature file
- The next two methods concentrate on inserting new count groups

Precomputed Signature Files Problem: adding a new count group (i.e., leaf node) may involve adding new internal nodes e.g.,

Precomputed Signature Files Solution: the signature file is precomputed, i.e.,

Negative Signature Files
- Alternative solution to the previous problem
- A "negative signature file" stores the deleted objects
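The slide gives only the one-line definition, so the sketch below is a guess at how a negative signature file might be consulted: an object counts as present when the ordinary (positive) signature matches and the negative one does not. The helper names are hypothetical and the bit positions come from the earlier "//name" / "//buyer" example.

```python
F = 10
POSITIONS = {"//name": (2, 3, 8), "//buyer": (2, 4, 9)}  # from the slide example

positive = [0] * F  # summarizes every object ever inserted
negative = [0] * F  # summarizes every object deleted afterwards


def mark(bits, obj):
    for p in POSITIONS[obj]:
        bits[p] = 1


def matches(bits, obj):
    return all(bits[p] for p in POSITIONS[obj])


def insert(obj):
    mark(positive, obj)


def delete(obj):
    mark(negative, obj)  # the positive signature is left untouched


def might_contain(obj):
    # Assumed lookup rule: present if inserted and not recorded as deleted.
    return matches(positive, obj) and not matches(negative, obj)


insert("//name")
insert("//buyer")
delete("//buyer")
print(might_contain("//name"))   # True: bit 3 is not set in the negative file
print(might_contain("//buyer"))  # False
```

Under this rule a false drop in the negative signature would hide an object that is still present, and re-inserting a deleted object is not handled.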

Summary
- For deletion:
  - Counter-based signature files
  - "No-deletion" scheme
  - Look-ahead retrieval
- For insertion:
  - Dynamically expanding signature files
  - Precomputed signature files
  - Negative signature files

Future Work
- Implement and check the performance of the different update strategies
- Identify the requirements of various applications (e.g., support counting, data cube storage) and choose a suitable SF-Tree strategy