Published by Nia Bonwell. Modified over 2 years ago.

1
A Paper on Random Sampling over Joins
by Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya
Presented by Jeevan Kumar Gogineni and Saranya Gottipati

2
Outline
Introduction
Semantics of Sample
Algorithms for Sampling
Join Sampling Problem
New Strategies for Join Sampling
Extensions and Negative Results
Experimental Evaluations
Conclusions

3
Terms Used
SAMPLE(R, f) is an SQL operation that returns a random sample of relation R.
f is the fraction of the tuples of R to be sampled.
R is the relation produced when a query Q is evaluated.

4
Introduction
Evaluating a query in full and then sampling its output is inefficient.
OLAP and data mining applications often need only a sample of the result of a query.
Sampling must therefore be supported on the result of an arbitrary SQL query.

5
Continued…
Support random sampling as a primitive relational operation in relational databases: the SAMPLE(R, f) operation.
Partially evaluate Q to generate a sample of R, rather than computing all of R first.
The sample operation may appear anywhere in the query tree T.
Focus: commuting the sample operation down the tree past a single join operation.

6
Semantics of Sample
1. Sampling with Replacement (WR)
2. Sampling without Replacement (WoR)
3. Independent Coin Flips (CF): each tuple is included in the sample with probability f, independently of the other tuples.
f - fraction of tuples in R to sample
n - number of tuples in R
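The three semantics can be compared side by side in a few lines. The following is a minimal sketch, assuming the relation fits in memory as a Python list and that WR and WoR draw a sample of size f·n; the function names are illustrative and not taken from the paper.

```python
import random

def sample_wr(rel, f, rng):
    """With replacement: f*n independent uniform draws, duplicates possible."""
    r = round(f * len(rel))
    return [rng.choice(rel) for _ in range(r)]

def sample_wor(rel, f, rng):
    """Without replacement: f*n distinct positions of the relation."""
    r = round(f * len(rel))
    return rng.sample(rel, r)

def sample_cf(rel, f, rng):
    """Coin flips: keep each tuple independently with probability f."""
    return [t for t in rel if rng.random() < f]

rng = random.Random(0)
R = list(range(100))          # stand-in relation with n = 100 tuples
wr = sample_wr(R, 0.1, rng)   # exactly 10 tuples, may repeat
wor = sample_wor(R, 0.1, rng) # exactly 10 distinct tuples
cf = sample_cf(R, 0.1, rng)   # about 10 tuples; the size itself is random
```

Note that only CF has a random sample size; WR and WoR fix the size in advance.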

7
Algorithms for Unweighted Sequential WR Sampling
Criteria for comparing the algorithms:
The size of the relation being sampled.
How the algorithm scans the relation.
Whether it needs any significant auxiliary memory.
Black-Box U1: Given relation R with n tuples, generate an unweighted WR sample of size r.
Black-Box U2: Given relation R with n tuples, generate an unweighted WR sample of size r.
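The slide does not show how the two black boxes work internally. As a hedged illustration of the trade-off the three questions point at, here are two standard one-pass techniques for unweighted WR sampling: one that needs n in advance but almost no memory, and one that does not need n but keeps r slots in memory. The function names are hypothetical, and this is a sketch rather than the paper's exact algorithms.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` Bernoulli(p) experiments."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def wr_sample_known_n(stream, n, r, rng):
    """One pass, n known, no buffer: emit each tuple a binomial number of times."""
    remaining = r
    for i, t in enumerate(stream):
        if remaining == 0:
            break
        # Each of the `remaining` draws lands on this tuple with
        # probability 1 / (number of tuples not yet passed).
        copies = binomial(remaining, 1.0 / (n - i), rng)
        for _ in range(copies):
            yield t
        remaining -= copies

def wr_sample_unknown_n(stream, r, rng):
    """One pass, n unknown, O(r) memory: r independent size-1 reservoirs."""
    slots = [None] * r
    for i, t in enumerate(stream, start=1):
        for j in range(r):
            if rng.random() < 1.0 / i:  # i-th tuple replaces a slot w.p. 1/i
                slots[j] = t
    return slots

known = list(wr_sample_known_n(iter(range(50)), 50, 8, random.Random(1)))
unknown = wr_sample_unknown_n(iter(range(50)), 8, random.Random(2))
```

The first function streams its output in scan order; the second can only return a sample once the scan has finished.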

8
Algorithms for Weighted Sequential WR Sampling
Criteria for comparing the algorithms:
The size of the relation being sampled.
How the algorithm scans the relation.
Whether it needs any significant auxiliary memory.
Black-Box WR1: Given relation R with n tuples, generate a weighted WR sample of size r.
Black-Box WR2: Given relation R with n tuples, generate a weighted WR sample of size r.
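In weighted sampling each tuple is drawn with probability proportional to a weight rather than uniformly. The sketch below shows one way to do this in a single sequential pass, assuming the total weight of the relation is known up front; the `weight` callback and the function names are assumptions made for illustration, not the paper's interface.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` Bernoulli(p) experiments."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def weighted_wr_sample(stream, weight, total_weight, r, rng):
    """One-pass weighted WR sample of size r; total_weight must be exact."""
    remaining = r
    weight_left = total_weight  # weight of tuples not yet passed
    out = []
    for t in stream:
        if remaining == 0:
            break
        w = weight(t)
        if w > 0:
            # Each outstanding draw picks this tuple with probability
            # w / (weight of this tuple and everything after it).
            copies = binomial(remaining, min(1.0, w / weight_left), rng)
            out.extend([t] * copies)
            remaining -= copies
        weight_left -= w
    return out

rel = [("a", 3), ("b", 0), ("c", 1)]  # (value, integer weight)
ws = weighted_wr_sample(iter(rel), lambda t: t[1], 4, 20, random.Random(3))
```

In this example "a" should appear about three times as often as "c", and "b" never appears because its weight is zero.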

9
The Difficulty of Join Sampling ?

10
Classification of the Problem
Case A: No information is available for either R1 or R2.
Case B: No information is available for R1, but indexes and/or statistics are available for R2.
Case C: Indexes and/or statistics are available for both R1 and R2.

11
Previous Sampling Strategies
Strategy Naive-Sample
Strategy Olken-Sample
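The slide names the two earlier strategies without spelling them out. As they are usually described, Naive-Sample computes the full join and then samples it, while Olken-Sample picks a random tuple from R1, picks a random matching tuple from R2 through an index, and accepts the pair with probability proportional to the number of matches. The sketch below assumes relations are lists of tuples joined on their first field and uses a dictionary as a stand-in for an index on R2; all names are hypothetical.

```python
import random
from collections import defaultdict

def build_index(r2):
    """Dictionary index on the join attribute (first field) of R2."""
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    return index

def naive_sample(r1, r2, r, rng):
    """Materialize the whole join, then draw r tuples with replacement."""
    index = build_index(r2)
    full_join = [(t1, t2) for t1 in r1 for t2 in index.get(t1[0], [])]
    return [rng.choice(full_join) for _ in range(r)] if full_join else []

def olken_sample(r1, r2, r, rng):
    """Rejection sampling: accept (t1, t2) with probability m2(t1.A) / M."""
    index = build_index(r2)
    if not any(t1[0] in index for t1 in r1):
        return []  # empty join; guard added so the loop below terminates
    max_freq = max(len(v) for v in index.values())  # M
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)
        matches = index.get(t1[0], [])
        if rng.random() < len(matches) / max_freq:
            out.append((t1, rng.choice(matches)))
    return out

R1 = [(1, "x"), (2, "y"), (3, "z")]
R2 = [(1, "p"), (1, "q"), (2, "r")]
naive = naive_sample(R1, R2, 5, random.Random(4))
olken = olken_sample(R1, R2, 5, random.Random(5))
```

The rejection step is what keeps the Olken-style sample uniform over the join: without it, tuples of R1 with few matches in R2 would be over-represented. The cost is that some iterations are wasted.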

12
New Strategies for Join Sampling
Three new sampling strategies:
Strategy Stream-Sample
Strategy Group-Sample
Strategy Frequency-Partition-Sample

13
Table showing the information about R1 and R2

14
Strategy Stream-Sample
Performs only a sequential sample from R1.
Does not generate excess tuples.
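One way to read these two properties: R1 is scanned once while a weighted sample is maintained, with each tuple weighted by its number of matches in R2, and every sampled tuple is then paired with one random match, so nothing is rejected. The sketch below follows that reading. It assumes a dictionary index on R2 and uses a simple weighted reservoir for the scan; the names are hypothetical and the paper's own operator may differ in detail.

```python
import random
from collections import defaultdict

def build_index(r2):
    """Dictionary index on the join attribute (first field) of R2."""
    index = defaultdict(list)
    for t2 in r2:
        index[t2[0]].append(t2)
    return index

def stream_sample(r1_stream, r2_index, r, rng):
    """WR sample of size r from the join, with one sequential pass over R1."""
    slots = [None] * r
    seen_weight = 0
    for t1 in r1_stream:
        w = len(r2_index.get(t1[0], ()))  # m2(t1.A): matches in R2
        if w == 0:
            continue  # tuple cannot contribute to the join
        seen_weight += w
        for j in range(r):
            # Weighted reservoir: replace a slot w.p. w / weight seen so far.
            if rng.random() < w / seen_weight:
                slots[j] = t1
    if seen_weight == 0:
        return []  # empty join
    # Pair each sampled R1 tuple with one uniformly chosen match from R2.
    return [(t1, rng.choice(r2_index[t1[0]])) for t1 in slots]

R1 = [(1, "x"), (2, "y"), (3, "z")]
R2 = [(1, "p"), (1, "q"), (2, "r")]
pairs = stream_sample(iter(R1), build_index(R2), 5, random.Random(6))
```

A pair (t1, t2) is produced with probability m2/W for t1 times 1/m2 for t2, where W is the join size, which gives the uniform 1/W without any rejection step.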

15
Strategy Group-Sample

16
Strategy Frequency-Partition-Sample
Assumes that full statistics are available for R2.
Uses Strategy Group-Sample for high-frequency values.
Uses Strategy Naive-Sample for low-frequency values.
Join attribute values need not be of high frequency in both operand relations.
Determines how the sample is distributed between the high-frequency and low-frequency subdomains.
Advantage: it needs only summary statistics, in the form of histograms, for R2.

17
Continued…

18
Extensions and Negative Results
The inherent difficulty of join sampling: even with large samples from R1 and R2 and detailed statistics, it is not possible to generate any non-empty random sample of R1 join R2.
Dealing with join trees: pushing the sample operation down to the operands.

19
Experimental Evaluations
Naive-Sample: add a U1 operator as the root of the tree.
Olken-Sample: create a uniform random sample T from the key values of R1.
Stream-Sample: insert a WR1 operator as a child of the join operator.
Frequency-Partition-Sample: implement a modified version of the WR1 operator for producing a random sample from R1.

20
Experimental results

21
Continued…

23
Conclusions
Studied the issues involved in implementing sampling as a primitive operation.
Presented a series of sampling strategies.
Provided new schemes for sequential random sampling for uniform and weighted sampling distributions.
Even more efficient strategies can be developed.

24
Questions?
Thank you
