A Paper on Random Sampling over Joins by Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Presented by Jeevan Kumar Gogineni and Saranya Gottipati.
Outline: Introduction; Semantics of Sample; Algorithms for Sampling; Join Sampling Problem; New Strategies for Join Sampling; Extensions and Negative Results; Experimental Evaluations; Conclusions.
Terms Used: SAMPLE(R, f) is an SQL operation that samples relation R. f is the fraction of the tuples of R to include in the sample. R is the relation produced when a query Q is evaluated.
Introduction: Evaluating a query in full and then sampling its output is inefficient. OLAP and data mining applications use a sample of the result of the query posed. Sampling must therefore be supported on the result of an arbitrary SQL query.
Continued: The goal is to support random sampling as a primitive relational operation, SAMPLE(R, f), in relational databases. Rather than computing R in full, partially evaluate Q to generate a sample of R. The sample operation may appear anywhere in the query tree T. The focus is on commuting the sample operation down the tree past a single join operation.
Semantics of Sample: Let f be the fraction of tuples to sample and n the number of tuples in R. 1. Sampling with Replacement (WR). 2. Sampling without Replacement (WoR). 3. Independent Coin Flips (CF): each tuple is included in the sample with probability f, independently of the other tuples.
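The three semantics can be contrasted with a minimal in-memory sketch. The function names are hypothetical, and rounding f * n to the nearest integer for the WR and WoR sample sizes is an assumption made here for illustration.

```python
import random

def sample_wr(relation, f, rng=random):
    """WR: draw round(f * n) tuples uniformly and independently; repeats allowed."""
    k = round(f * len(relation))
    return [rng.choice(relation) for _ in range(k)]

def sample_wor(relation, f, rng=random):
    """WoR: draw round(f * n) distinct tuples; every subset of that size is equally likely."""
    k = round(f * len(relation))
    return rng.sample(relation, k)

def sample_cf(relation, f, rng=random):
    """CF: keep each tuple independently with probability f; the sample size varies."""
    return [t for t in relation if rng.random() < f]
```

WR and WoR return a fixed number of tuples, while CF returns a random number of tuples whose expectation is f * n.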
Algorithms for Unweighted Sequential WR Sampling: Properties that distinguish the algorithms: the size of the relation being sampled; how the algorithm scans the relation; whether it needs significant auxiliary memory. Black-Box U1: given relation R with n tuples, generate an unweighted WR sample of size r. Black-Box U2: given relation R with n tuples, generate an unweighted WR sample of size r.
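The slide does not say how the two black boxes differ internally. As a hedged illustration, here are two standard one-pass techniques for unweighted WR sampling: one that needs n in advance and emits tuples while scanning, and one that does not need n but holds r slots in memory until the scan ends. Mapping these to U1 and U2 is an assumption, and the function names are hypothetical.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` independent coin flips with bias p."""
    return sum(rng.random() < p for _ in range(trials))

def wr_sample_known_n(stream, n, r, rng=random):
    """One pass, n known exactly: emit copies of each tuple as it streams by."""
    remaining = r
    for i, t in enumerate(stream):
        if remaining == 0:
            break
        # Each remaining draw lands on this tuple with probability 1 / (tuples left).
        copies = binomial(remaining, 1.0 / (n - i), rng)
        remaining -= copies
        yield from [t] * copies

def wr_sample_unknown_n(stream, r, rng=random):
    """One pass, n unknown: keep r independent one-slot reservoirs."""
    slots = [None] * r
    for seen, t in enumerate(stream, start=1):
        for j in range(r):
            # Slot j switches to the new tuple with probability 1 / (tuples seen so far).
            if rng.random() < 1.0 / seen:
                slots[j] = t
    return slots
```

The first variant needs no buffer but must be told n; the second needs memory proportional to r and can only report its sample after the last tuple has been seen.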
Algorithms for Weighted Sequential WR Sampling: Properties that distinguish the algorithms: the size of the relation being sampled; how the algorithm scans the relation; whether it needs significant auxiliary memory. Black-Box WR1: given relation R with n tuples, generate a weighted WR sample of size r. Black-Box WR2: given relation R with n tuples, generate a weighted WR sample of size r.
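In the weighted case each draw should pick tuple t with probability proportional to a weight w(t). A minimal one-pass sketch follows, assuming the total weight is known in advance and weights are non-negative integers (so the arithmetic on the last tuple is exact). The function name and signature are hypothetical, not taken from the paper.

```python
import random

def weighted_wr_sample(stream, weight, total_weight, r, rng=random):
    """One pass over `stream`; each of the r draws picks t with probability weight(t) / total_weight."""
    remaining, seen_weight = r, 0
    out = []
    for t in stream:
        if remaining == 0:
            break
        w = weight(t)
        # Probability that a still-unassigned draw lands on this tuple,
        # given that it did not land on any earlier tuple.
        p = w / (total_weight - seen_weight)
        copies = sum(rng.random() < p for _ in range(remaining))
        out.extend([t] * copies)
        remaining -= copies
        seen_weight += w
    return out
```

For join sampling the natural weight for a tuple of R1 is the number of tuples in R2 that it joins with, which is why the strategies below rely on frequency statistics for R2.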
Classification of the Problem: Case A: no information is available for either R1 or R2. Case B: no information is available for R1, but indexes and/or statistics are available for R2. Case C: indexes and/or statistics are available for both R1 and R2.
Strategy Frequency-Partition-Sample: Assumes that frequency statistics are available for R2. Uses strategy Group-Sample for high-frequency join attribute values and strategy Naive-Sample for low-frequency values. A join attribute value need not be of high frequency in both operand relations. The strategy must determine how the sample is distributed between the high-frequency and low-frequency sub-domains. Advantage: it needs only summary statistics, in the form of histograms, for R2.
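The partitioning idea can be sketched in memory as follows. This is a simplified illustration, not the paper's operator: it builds exact counts and an index on R2 itself instead of reading a histogram, and the names `frequency_partition_sample`, `key1`, `key2`, and `threshold` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def frequency_partition_sample(r1, r2, key1, key2, r, threshold, rng=random):
    """Return r pairs sampled uniformly with replacement from r1 JOIN r2."""
    # Frequency statistics and an index on R2.
    m2 = Counter(key2(t) for t in r2)
    by_value = defaultdict(list)
    for t in r2:
        by_value[key2(t)].append(t)
    high = {v for v, c in m2.items() if c >= threshold}

    # Low-frequency sub-domain: compute the join in full, then sample from it.
    j_lo = [(t1, t2) for t1 in r1 if key1(t1) not in high
            for t2 in by_value.get(key1(t1), [])]

    # High-frequency sub-domain: weight each R1 tuple by its number of R2 matches.
    r1_hi = [t for t in r1 if key1(t) in high]
    weights = [m2[key1(t)] for t in r1_hi]
    n_hi, n_lo = sum(weights), len(j_lo)
    if n_hi + n_lo == 0:
        return []  # empty join, nothing to sample

    # Distribute the r draws between the sub-domains in proportion to join size.
    r_hi = sum(rng.random() < n_hi / (n_hi + n_lo) for _ in range(r))
    r_lo = r - r_hi

    sample = []
    if r_hi:
        for t1 in rng.choices(r1_hi, weights=weights, k=r_hi):
            sample.append((t1, rng.choice(by_value[key1(t1)])))
    if r_lo:
        sample.extend(rng.choices(j_lo, k=r_lo))
    return sample
```

Only the low-frequency part of the join is materialized, which is cheap by construction; the expensive high-frequency part is never computed in full.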
Extensions and Negative Results: The inherent difficulty of join sampling: even with large samples of R1 and R2 and detailed statistics, it is in general not possible to generate a non-empty random sample of R1 join R2 from them. Dealing with join trees: push the sample operation down to the operands.
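The intuition behind the negative result can be seen on a small skewed instance. The sketch below is only an illustration of that intuition, not the paper's proof; the relation contents, the values of `k` and `f`, and the helper names are made up for the example.

```python
import random

def join(r1, r2):
    """Equi-join of two lists of (key, payload) tuples on the key."""
    return [(a, b) for a in r1 for b in r2 if a[0] == b[0]]

def coin_flip_sample(relation, f, rng):
    """Keep each tuple independently with probability f."""
    return [t for t in relation if rng.random() < f]

k, f = 1000, 0.01
# Skewed instance: key 1 is rare in R1 but frequent in R2; key 2 is the reverse.
r1 = [(1, "r1-0")] + [(2, f"r1-{i}") for i in range(1, k + 1)]
r2 = [(2, "r2-0")] + [(1, f"r2-{i}") for i in range(1, k + 1)]

full = join(r1, r2)  # k pairs on key 1 plus k pairs on key 2
rng = random.Random(0)
joined_samples = join(coin_flip_sample(r1, f, rng), coin_flip_sample(r2, f, rng))

print(len(full), "tuples in the full join")
print(round(f * len(full)), "tuples wanted in an f-fraction sample")
print(len(joined_samples), "tuples obtained by joining the two samples")
```

A pair survives only if the single rare tuple on one side happens to be sampled, so joining the two samples usually yields nothing even though the full join is large.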
Experimental Evaluations: Naive-Sample: add a U1 operator as the root of the tree. Olken-Sample: create a uniform random sample T from the key values of R1. Stream-Sample: insert a WR1 operator as a child of the join operator. Frequency-Partition-Sample: implement a modified version of the WR1 operator for producing a random sample from R1.
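Of the strategies compared, Olken-Sample is the one based on rejection: pick a random tuple of R1, pick a random matching tuple of R2, and accept the pair with probability proportional to the number of matches. A minimal in-memory sketch follows. The dictionary stands in for the index on R2, and `max_tries` is a guard added here so the loop terminates on an empty join; neither the function name nor that parameter comes from the slides.

```python
import random
from collections import defaultdict

def olken_sample(r1, r2, key1, key2, r, rng=random, max_tries=1_000_000):
    """Rejection-sample up to r pairs uniformly with replacement from r1 JOIN r2."""
    index = defaultdict(list)  # stand-in for an index on R2's join attribute
    for t in r2:
        index[key2(t)].append(t)
    if not r1 or not index:
        return []
    m_max = max(len(v) for v in index.values())  # upper bound on matches per value

    out = []
    for _ in range(max_tries):
        if len(out) == r:
            break
        t1 = rng.choice(r1)                  # uniform tuple from R1
        matches = index.get(key1(t1), [])
        # Accept with probability (number of matches) / m_max, so every joined
        # pair is produced with the same probability 1 / (len(r1) * m_max).
        if matches and rng.random() < len(matches) / m_max:
            out.append((t1, rng.choice(matches)))
    return out
```

The acceptance rate drops when a few values of R2 are much more frequent than the rest, which is the situation the weighted strategies are meant to handle.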
Conclusions: The paper studies the issues involved in implementing sampling as a primitive operation. It presents a series of sampling strategies. It provides new schemes for sequential random sampling under uniform and weighted sampling distributions. Even more efficient strategies may yet be developed.