Deferred Maintenance of Disk-Based Random Samples Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University of Technology Dresden)


1 Deferred Maintenance of Disk-Based Random Samples Rainer Gemulla (University of Technology Dresden) Wolfgang Lehner (University of Technology Dresden) Faculty of Computer Science, Institute System Architecture, Database Technology Group

2 Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook

3 Random Sampling
Analytical databases
– huge data sets
– complex algorithms
Requirements
– Performance, performance, performance!
Random sampling
– approximate query answering
– data mining
– data stream processing
– query optimization
– data integration
Turnover in Europe (TPCH)
Sample size   Estimate                Time
1%            8.46 Mil. ± 0.15 Mil.   4s
10%           8.51 Mil. ± 0.05 Mil.   52s
100%          8.54 Mil.               200s

4 Offline Sampling
Precomputed samples
– pros
  – avoid access to base data
  – used multiple times
  – arbitrary base data
  – versatile
– cons
  – maintenance!!!
Disk-based samples
– many, large samples → stored on disk
– crash safe
– typically space-restricted
– challenges
  – sequential access is faster
  – blocking of data

5 Basics: Reservoir Sampling
Sampling with space constraints
– maintain a sample (reservoir) of M tuples
  – add the first M tuples
  – afterwards, throw a die
    a) ignore the tuple (reject)
    b) replace a random tuple in the sample (accept)
– accept probability controls sampling scheme
– building block for many sophisticated sampling schemes
Example
– dataset with 50 tuples (M=5)
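The reservoir scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the classic uniform variant, in which the t-th tuple is accepted with probability M/t, and the name `reservoir_sample` and the optional `rng` parameter are chosen here for the example.

```python
import random


def reservoir_sample(stream, M, rng=None):
    """Keep a uniform random sample of at most M tuples from `stream`."""
    rng = rng or random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= M:
            # The first M tuples fill the reservoir.
            sample.append(item)
        elif rng.random() < M / t:
            # Accept: replace a random tuple in the sample.
            sample[rng.randrange(M)] = item
        # Otherwise reject: the tuple is ignored.
    return sample


if __name__ == "__main__":
    # The slide's example: a dataset with 50 tuples and M = 5.
    print(reservoir_sample(range(50), 5))
```

Each accepted tuple lands at a random position, which is what turns into random I/O once the sample lives on disk.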

6 Evolution of the Sample
→ Random I/O!!!

7 Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook

8 Full Logging
Full log
– track all changes
– log is written sequentially
– log contains more information than needed

9 Candidate Logging
Candidate log
– track only changes which affect the sample
– log is written sequentially
– smaller logs
How to implement Candidate Refresh?
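One way to picture a candidate log: the acceptance test of reservoir sampling runs at insertion time, but an accepted tuple is appended to a log instead of being written into the sample. The sketch below assumes the uniform acceptance probability M/t and uses an in-memory list in place of a log file. The names `candidate_log` and `seen` are hypothetical.

```python
import random


def candidate_log(inserts, M, seen, rng):
    """Return the tuples from `inserts` that pass the acceptance test.

    `seen` is the number of tuples processed before this batch; M is the
    sample size. Only accepted tuples (candidates) are logged.
    """
    log = []
    for t, item in enumerate(inserts, start=seen + 1):
        if rng.random() < M / t:
            log.append(item)  # sequential append, the sample is not touched
    return log


if __name__ == "__main__":
    rng = random.Random(42)
    log = candidate_log(range(1000, 2000), M=5, seen=1000, rng=rng)
    print(len(log), "of 1000 inserts became candidates")
```

A full log would append every insert instead, so it grows with the number of operations rather than with the number of accepted tuples.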

10 Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook

11 Naive Refresh
Naive refresh
– scan log file sequentially
– write each element of the log to a random position in the sample
– No improvement at all!
  – random access to sample
  – some elements are written more than once
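A minimal sketch of the naive refresh, with Python lists standing in for the on-disk sample and candidate log (the function name `naive_refresh` is made up for this example). It returns the number of writes to make the drawback visible: every candidate costs one random write, even if a later candidate overwrites it.

```python
import random


def naive_refresh(sample, log, rng):
    """Apply every logged candidate to a random sample position, in log order."""
    writes = 0
    for item in log:                                  # sequential scan of the log
        sample[rng.randrange(len(sample))] = item     # random access to the sample
        writes += 1
    return writes


if __name__ == "__main__":
    sample = ["s%d" % i for i in range(5)]
    log = ["c%d" % i for i in range(11)]
    print(naive_refresh(sample, log, random.Random(0)), "writes:", sample)
```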

12 Avoiding Multiple Writes
Observation
– each candidate can be overwritten by subsequent candidates only
– last candidate is never overwritten
Approach
– scan log in reverse order
– write only tuples which have not been written before
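The reverse scan can be sketched as follows, again with lists in place of files and a hypothetical function name. Note that this version keeps a set of already written positions in memory, so it only illustrates the observation; it is not the memory-free algorithm the talk goes on to describe.

```python
import random


def reverse_refresh(sample, log, rng):
    """Scan the log backwards and write each sample position at most once."""
    written = set()                      # positions already holding their final tuple
    for item in reversed(log):
        pos = rng.randrange(len(sample))
        if pos not in written:           # a later candidate has not claimed this slot
            sample[pos] = item
            written.add(pos)
        if len(written) == len(sample):  # every slot is final, older candidates are dead
            break
    return len(written)


if __name__ == "__main__":
    sample = ["s%d" % i for i in range(5)]
    log = ["c%d" % i for i in range(11)]
    print(reverse_refresh(sample, log, random.Random(0)), "writes:", sample)
```

The last candidate of the log is always written, and the number of writes never exceeds the sample size.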

13 Avoiding Multiple Writes
Probability of overwrites
In general
– k tuples written to sample (k = 0…5)
– probability of overwrite: p_k = (M-k)/M
– number of skipped tuples: P(X_k = x) = (1-p_k)^x p_k (k > 0)
– X_5 = ∞ (because p_5 = 0)
– here: X_1 = 0, X_2 = 1, X_3 = 1, X_4 = 6
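Because X_k follows a geometric distribution, a skip can be drawn directly instead of testing candidates one by one. The sketch below uses inversion of the distribution function, which is a standard technique and an assumption here, since the slides do not say how the skips are generated. The helper names are illustrative. The mean of this distribution is (1-p_k)/p_k = k/(M-k), so for M = 5 the expected skips are 0.25, 0.67, 1.5 and 4 for k = 1…4.

```python
import math
import random


def write_probability(k, M):
    """p_k: chance that the next candidate (in reverse order) hits an unwritten slot."""
    return (M - k) / M


def expected_skip(k, M):
    """Mean of X_k, i.e. (1 - p_k) / p_k, valid for k < M."""
    return k / (M - k)


def draw_skip(k, M, rng):
    """Draw X_k with P(X_k = x) = (1 - p_k)^x * p_k, valid for k < M."""
    p = write_probability(k, M)
    if p >= 1.0:
        return 0                      # k = 0: the last candidate is always written
    u = 1.0 - rng.random()            # uniform on (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(1.0 - p))


if __name__ == "__main__":
    rng = random.Random(0)
    print([draw_skip(k, 5, rng) for k in range(1, 5)])
```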

14 Nomem Refresh
Nomem Refresh (Phase 1)
– dry run: generate X_4, …, X_1 in advance
– reset pseudo-random number generator and generate same sequence again
– start at: |C|-X
→ indexes of log file are generated

15 Nomem Refresh
Naive update of sample
– read the tuples at the generated log indexes
– write each tuple to a random (free) position in the sample
– drawbacks
  – free positions have to be maintained
  – random access to the sample

16 Nomem Refresh
Nomem Refresh (Phase 2)
– general idea: order of the tuples in sample is unimportant
– algorithm
  – (re-)generate next position in the log (6, 8, 10, 11)
  – generate next position in the sample (1, 2, 3, 5)
  – read from log, write to sample
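Both phases can be put together in a short sketch. It is written under several assumptions that are not in the slides: lists stand in for the on-disk sample and candidate log, skips are drawn by inverting the geometric distribution, ascending sample positions come from sequential selection sampling, and two seeds (`skip_seed`, `pos_seed`) drive two independent generators. The function names are hypothetical.

```python
import math
import random


def draw_skip(k, M, rng):
    """Draw X_k: candidates skipped once k tuples are written (1 <= k < M)."""
    q = k / M                        # 1 - p_k: candidate hits an already written slot
    u = 1.0 - rng.random()           # uniform on (0, 1], keeps log(u) finite
    return int(math.log(u) / math.log(q))


def nomem_refresh(sample, log, skip_seed, pos_seed):
    """Refresh `sample` in place from the candidate `log`; return the number of writes."""
    M, C = len(sample), len(log)
    if M == 0 or C == 0:
        return 0

    # Phase 1, dry run: draw X_{M-1}, ..., X_1 and keep only the total log
    # distance they span (each further written tuple advances by X_k + 1).
    rng = random.Random(skip_seed)
    span = sum(draw_skip(k, M, rng) + 1 for k in range(M - 1, 0, -1))

    # Reset the generator and replay the same sequence, dropping the leading
    # skips until the remaining ones fit into the log.
    rng = random.Random(skip_seed)
    k = M                            # number of tuples that will be written
    while span > C - 1:
        k -= 1
        span -= draw_skip(k, M, rng) + 1
    log_idx = C - 1 - span           # 0-based index of the first tuple to copy

    # Phase 2: move forward through log and sample in lockstep.
    pos_rng = random.Random(pos_seed)
    sample_pos = 0
    for todo in range(k, 0, -1):     # todo = tuples still to be written
        # Selection sampling: take this slot with probability todo / slots left.
        while pos_rng.random() * (M - sample_pos) >= todo:
            sample_pos += 1
        sample[sample_pos] = log[log_idx]
        sample_pos += 1
        if todo > 1:
            log_idx += draw_skip(todo - 1, M, rng) + 1
    return k


if __name__ == "__main__":
    sample = ["s%d" % i for i in range(5)]
    log = ["c%d" % i for i in range(11)]
    print(nomem_refresh(sample, log, skip_seed=1, pos_seed=2), "writes:", sample)
```

With the skips from the example (X_1 = 0, X_2 = 1, X_3 = 1, X_4 = 6) and a log of 11 candidates, the replay drops X_4 and the copy starts at log position 6 (1-based), giving the positions 6, 8, 10, 11 listed on the slide. Apart from the two generator states and a few counters, nothing is kept in memory.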

17 Nomem Refresh
Properties
– log file is read sequentially
– sample is written sequentially
– no overwrites
– no memory consumption
– works on full logs as well (DBMS!)

18 Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook

19 Experiments
Number of operations & execution time
– sample size: 1 million tuples
– refresh period: 1 million operations

20 Experiments
Refresh period & execution time
– sample size: 1 million tuples
– number of operations: 100 million

21 Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook

22 Summary & Outlook
Logging schemes
– full logs: often found in database systems
– candidate logs: reduce log file size
Nomem Refresh
– fast incremental refresh
– sequential disk access only
– no memory consumption
– works with full and candidate logs
Future work
– more detailed discussion of updates & deletions

23 Thank you! Questions?

24 Extensions
– nomem refresh for full logs (DBMS!)
  – dry run: compute candidates, count their number
  – reset random number generator
  – add skips of Nomem Refresh and Reservoir Sampling
– deletions and updates
  – store deletions and updates separately
  – process delete and update log first
  – run Nomem Refresh on the insert log
  – requires disjoint logs

25 Experiments
Comparison with the Geometric File
– sample size: 1 million tuples
– number of operations: 100 million

26 Experiments
Computational overhead
– sample size: 1 million tuples

