Deferred Maintenance of Disk-Based Random Samples Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University of Technology Dresden)

Presentation transcript:

Deferred Maintenance of Disk-Based Random Samples Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University of Technology Dresden) Faculty of Computer Science, Institute System Architecture, Database Technology Group

Slide 2: Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook

Slide 3: Random Sampling
Analytical databases
– huge data sets
– complex algorithms
Requirements
– Performance, performance, performance!
Random sampling
– approximate query answering
– data mining
– data stream processing
– query optimization
– data integration
Turnover in Europe (TPCH)
– 1%: 8.46 Mil. ± 0.15 Mil., 4s
– 10%: 8.51 Mil. ± 0.05 Mil., 52s
– 100%: 8.54 Mil., 200s

Slide 4: Offline Sampling
Precomputed samples
– pros
  – avoid access to base data
  – used multiple times
  – arbitrary base data
  – versatile
– cons
  – maintenance!!!
Disk-based samples
– many, large samples → stored on disk
– crash safe
– typically space-restricted
– challenges
  – sequential access is faster
  – blocking of data

Slide 5: Basics: Reservoir Sampling
Sampling with space constraints
– maintain a sample (reservoir) of M tuples
  – add the first M tuples
  – afterwards, roll a die for each new tuple
    a) ignore the tuple (reject)
    b) replace a random tuple in the sample (accept)
– accept probability controls sampling scheme
– building block for many sophisticated sampling schemes
Example
– dataset with 50 tuples (M=5)
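The accept/reject step described on this slide can be sketched as follows. This is a minimal in-memory illustration, assuming the classic uniform scheme in which the i-th tuple is accepted with probability M/i; the slide only says that the accept probability controls the sampling scheme, so that choice, and the name `reservoir_sample`, are assumptions.

```python
import random

def reservoir_sample(stream, M, rng):
    """Keep a uniform random sample of at most M tuples from `stream`."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            # The first M tuples fill the reservoir.
            sample.append(item)
        elif rng.random() < M / i:
            # Accept: replace a random tuple in the sample.
            sample[rng.randrange(M)] = item
        # Otherwise reject: the tuple is ignored.
    return sample

# Example from the slide: a dataset with 50 tuples and M = 5.
print(reservoir_sample(range(50), 5, random.Random(1)))
```

On disk, every accept overwrites a random slot of the sample file, which is the source of the random I/O that the rest of the talk tries to avoid.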

Slide 6: Evolution of the Sample
Random I/O!!!

Slide 8: Full Logging
Full log
– track all changes
– log is written sequentially
– log contains more information than needed

Slide 9: Candidate Logging
Candidate log
– track only changes which affect the sample
– log is written sequentially
– smaller logs
How to implement Candidate Refresh?
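The difference between the two logging schemes can be illustrated with a short sketch. It assumes insert-only operations, a reservoir that is already full, and the uniform acceptance probability M/n; the function name `log_inserts` and the parameter `n_seen` are hypothetical, and both logs are plain lists here rather than sequentially written files.

```python
import random

def log_inserts(inserts, M, n_seen, rng):
    """Return (full_log, candidate_log) for a batch of inserted tuples.

    n_seen is the number of tuples processed before this batch
    (assumed >= M, so the reservoir is already full).
    """
    full_log, candidate_log = [], []
    for tup in inserts:
        n_seen += 1
        # Full logging: every change is appended.
        full_log.append(tup)
        # Candidate logging: append only tuples the sample would accept.
        if rng.random() < M / n_seen:
            candidate_log.append(tup)
    return full_log, candidate_log

full, cand = log_inserts(list(range(100, 200)), M=5, n_seen=50, rng=random.Random(7))
print(len(full), len(cand))
```

Both logs are append-only, so writing them is sequential; the candidate log simply moves the accept/reject decision to logging time.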

Slide 11: Naive Refresh
Naive refresh
– scan log file sequentially
– write each element of the log to a random position in the sample
– No improvement at all!
  – random access to sample
  – some elements are written more than once
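A minimal sketch of the naive refresh, with the sample and the candidate log held in Python lists instead of disk files (an assumption made for brevity; `naive_refresh` is a hypothetical name):

```python
import random

def naive_refresh(sample, log, rng):
    """Apply every candidate in log order; return the number of writes."""
    writes = 0
    for tup in log:
        # Each candidate overwrites a random position: random access.
        sample[rng.randrange(len(sample))] = tup
        writes += 1
    return writes

sample = ["s%d" % i for i in range(5)]
log = ["c%d" % i for i in range(1, 12)]
writes = naive_refresh(sample, log, random.Random(3))
survivors = [t for t in sample if t.startswith("c")]
print(writes, "writes,", len(survivors), "candidates survive")
```

With 11 candidates and only 5 sample positions, at most 5 candidates can survive, so most of the writes are wasted.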

Slide 12: Avoiding Multiple Writes
Observation
– each candidate can be overwritten by subsequent candidates only
– last candidate is never overwritten
Approach
– scan log in reverse order
– write only tuples which have not been written before
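The reverse scan can be checked against the forward scan when both use the same target positions. The sketch below is an in-memory illustration; the function names are hypothetical, and the set of already-written positions is kept in memory here.

```python
import random

def refresh_forward(sample, log, positions):
    """Naive order: later candidates overwrite earlier ones."""
    for tup, pos in zip(log, positions):
        sample[pos] = tup
    return len(log)  # number of writes

def refresh_reverse(sample, log, positions):
    """Reverse order: write a position only the first time it is seen."""
    written = set()
    for tup, pos in zip(reversed(log), reversed(positions)):
        if pos not in written:
            sample[pos] = tup
            written.add(pos)
    return len(written)  # number of writes

rng = random.Random(11)
log = ["c%d" % i for i in range(1, 12)]
positions = [rng.randrange(5) for _ in log]  # target slot of each candidate
a, b = list("abcde"), list("abcde")
print(refresh_forward(a, log, positions), refresh_reverse(b, log, positions), a == b)
```

Both scans leave the sample in the same state, because each position ends up holding the last candidate assigned to it, but the reverse scan writes each position at most once.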

Slide 13: Avoiding Multiple Writes
Probability of overwrites
In general
– $k$ tuples written to sample ($k = 0, \dots, 5$)
– probability of overwrite: $p_k = (M-k)/M$
– number of skipped tuples: $P(X_k = x) = (1 - p_k)^x \, p_k$ ($k > 0$)
– $X_5 = \infty$ (since $p_5 = 0$, every remaining candidate is skipped)
– here: $X_1 = 0$, $X_2 = 1$, $X_3 = 1$, $X_4 = 6$
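The skip counts follow a geometric distribution, so they can be drawn directly instead of testing every candidate. The sketch below draws X_k by repeated trials with success probability p_k = (M-k)/M; the name `draw_skip` is hypothetical, and a production version would more likely use a closed-form inversion than a loop.

```python
import random

def draw_skip(k, M, rng):
    """Number of candidates skipped before the next write, given that
    k of the M sample positions have already been written."""
    p = (M - k) / M
    if p == 0:
        raise ValueError("all M positions are written; nothing more is written")
    x = 0
    # Each scanned candidate hits an unwritten position with probability p.
    while rng.random() >= p:
        x += 1
    return x

rng = random.Random(5)
print([draw_skip(k, 5, rng) for k in range(1, 5)])  # one draw each of X_1..X_4
```

For k = 0 the function always returns 0, which matches the observation that the last candidate is never overwritten. The mean of X_k is k/(M-k), so skips grow as the sample fills up.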

Slide 14: Nomem Refresh
Nomem Refresh (Phase 1)
– dry run: generate $X_4, \dots, X_1$ in advance
– reset pseudo-random number generator and generate same sequence again
– start at: |C|-X
indexes of log file are generated
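One possible reading of Phase 1 is sketched below. It assumes the skips are drawn in the order X_{M-1}, ..., X_1, that the dry run only accumulates their total, and that re-seeding the generator stands in for the reset. The starting-point expression on the slide is truncated, so the arithmetic used here (`log_len - remaining`) is an assumption, as are the names `draw_skip`, `seeded_skips` and `surviving_log_indexes`.

```python
import random

def draw_skip(k, M, rng):
    """Geometric skip count with success probability (M - k) / M."""
    p = (M - k) / M
    x = 0
    while rng.random() >= p:
        x += 1
    return x

def seeded_skips(M, seed):
    """Return a factory; each call replays X_{M-1}, ..., X_1 from the seed."""
    def fresh():
        rng = random.Random(seed)  # "reset" the generator by re-seeding
        return (draw_skip(k, M, rng) for k in range(M - 1, 0, -1))
    return fresh

def surviving_log_indexes(log_len, skip_source):
    """Yield the 1-based log indexes that get written, in ascending order."""
    if log_len < 1:
        return
    # Dry run: distance from the end of the log if all skips were used.
    remaining = sum(x + 1 for x in skip_source())
    # Replay: emit an index as soon as it falls inside the log.
    for x in skip_source():
        if remaining <= log_len - 1:
            yield log_len - remaining
        remaining -= x + 1
    yield log_len  # the last candidate is always written

# Skips from the slide: X_4 = 6, X_3 = 1, X_2 = 1, X_1 = 0, with |C| = 11.
print(list(surviving_log_indexes(11, lambda: iter([6, 1, 1, 0]))))  # [6, 8, 10, 11]
print(list(surviving_log_indexes(20, seeded_skips(5, 42))))
```

Only a running total is kept between the two passes, so the indexes come out in ascending order without being stored.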

Slide 15: Nomem Refresh
Naive update of sample
– read the tuples at the generated indexes of the log
– write each one to a random (free) position in the sample
– drawbacks
  – free positions have to be maintained
  – random access to the sample

Slide 16: Nomem Refresh
Nomem Refresh (Phase 2)
– general idea: order of the tuples in sample is unimportant
– algorithm
  – (re-)generate next position in the log (6, 8, 10, 11)
  – generate next position in the sample (1, 2, 3, 5)
  – read from log, write to sample
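Phase 2 needs sample positions in ascending order. The slide does not say how they are generated; the sketch below swaps in Knuth's selection sampling (Algorithm S) as one standard way to draw a random K-subset of positions sequentially with constant memory. The function names are hypothetical, sample positions are 0-based here (the slide counts from 1), and lists stand in for the log and sample files.

```python
import random

def ascending_positions(K, M, rng):
    """Yield K distinct positions from 0..M-1 in ascending order."""
    needed = K
    for pos in range(M):
        if needed == 0:
            return
        # Select pos with probability needed / (positions left).
        if rng.random() * (M - pos) < needed:
            yield pos
            needed -= 1

def nomem_phase2(sample, log, log_indexes, rng):
    """Write the surviving candidates into ascending sample positions."""
    positions = ascending_positions(len(log_indexes), len(sample), rng)
    for log_idx, pos in zip(log_indexes, positions):
        sample[pos] = log[log_idx - 1]  # log indexes are 1-based

sample = ["s1", "s2", "s3", "s4", "s5"]
log = ["c%d" % i for i in range(1, 12)]
nomem_phase2(sample, log, [6, 8, 10, 11], random.Random(2))
print(sample)
```

Because the sample is treated as an unordered set, it does not matter which surviving candidate lands in which selected position, so both files can be traversed front to back.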

Slide 17: Nomem Refresh
Properties
– log file is read sequentially
– sample is written sequentially
– no overwrites
– no memory consumption
– works on full logs as well (DBMS!)

Slide 19: Experiments
Number of operations & execution time
– sample size: 1 million tuples
– refresh period: 1 million operations

Slide 20: Experiments
Refresh period & execution time
– sample size: 1 million tuples
– number of operations: 100 million

Slide 22: Summary & Outlook
Logging schemes
– full logs: often found in database systems
– candidate logs: reduce log file size
Nomem Refresh
– fast incremental refresh
– sequential disk access only
– no memory consumption
– works with full and candidate logs
Future work
– more detailed discussion of updates & deletions

Slide 23: Thank you! Questions?

Slide 24: Extensions
– nomem refresh for full logs (DBMS!)
  – dry run: compute candidates, count their number
  – reset random number generator
  – add skips of Nomem Refresh and Reservoir Sampling
– deletions and updates
  – store deletions and updates separately
  – process delete and update log first
  – run Nomem Refresh on the insert log
  – requires disjoint logs

Slide 25: Experiments
Comparison with the Geometric File
– sample size: 1 million tuples
– number of operations: 100 million

Slide 26: Experiments
Computational overhead
– sample size: 1 million tuples