SURAJIT CHAUDHURI RAJEEV MOTWANI VIVEK NARASAYYA On random sampling over Joins Presented by : Srikantha Nema.

Slides:



Advertisements
Similar presentations
Simulation Examples in EXCEL Montana Going Green 2010.
Advertisements

Junzhou Huang, Shaoting Zhang, Dimitris Metaxas CBIM, Dept. Computer Science, Rutgers University Efficient MR Image Reconstruction for Compressed MR Imaging.
Significance Testing.  A statistical method that uses sample data to evaluate a hypothesis about a population  1. State a hypothesis  2. Use the hypothesis.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Copyright © 2004 Pearson Education, Inc.. Chapter 15 Algorithms for Query Processing and Optimization.
A Paper on RANDOM SAMPLING OVER JOINS by SURAJIT CHAUDHARI RAJEEV MOTWANI VIVEK NARASAYYA PRESENTED BY, JEEVAN KUMAR GOGINENI SARANYA GOTTIPATI.
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Distributed DBMS© M. T. Özsu & P. Valduriez Ch.6/1 Outline Introduction Background Distributed Database Design Database Integration Semantic Data Control.
5.1 and 5.2 Review AP Statistics.
Modeling and Analysis of Random Walk Search Algorithms in P2P Networks Nabhendra Bisnik, Alhussein Abouzeid ECSE, Rensselaer Polytechnic Institute.
Optimal Workload-Based Weighted Wavelet Synopsis
Outline SQL Server Optimizer  Enumeration architecture  Search space: flexibility/extensibility  Cost and statistics Automatic Physical Tuning  Database.
An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit ChaudhuriVivek Narasayya Indian Institute of Technology Bombay CS632 Course seminar.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid Chaudhuri Gautam Das Vivek Narasayya Presented by :Sayed.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
1 Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan* Stanford University Surajit Chaudhuri Vivek Narasayya Microsoft Research.
Chapter 6: Database Evolution Title: AutoAdmin “What-if” Index Analysis Utility Authors: Surajit Chaudhuri, Vivek Narasayya ACM SIGMOD 1998.
A gentle introduction to Gaussian distribution. Review Random variable Coin flip experiment X = 0X = 1 X: Random variable.
1 Sampling Lower Bounds via Information Theory Ziv Bar-Yossef IBM Almaden.
On Random Sampling over Joins Surajit Chaudhuri Rajeeve Motwani Vivek Narasayya Microsoft Research Stanford University Microsoft Research.
Probability and Statistics Review
Efficient Join Processing over Uncertain Data - By Reynold Cheng, et all. Presented By Lydia & Usha.
Path Planning in Expansive C-Spaces D. HsuJ.-C. LatombeR. Motwani CS Dept., Stanford University, 1997.
Privacy Preserving OLAP Rakesh Agrawal, IBM Almaden Ramakrishnan Srikant, IBM Almaden Dilys Thomas, Stanford University.
Database k-Nearest Neighbors in Uncertain Graphs Lin Yincheng VLDB10.
Selective Sampling on Probabilistic Labels Peng Peng, Raymond Chi-Wing Wong CSE, HKUST 1.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh.
Access Path Selection in a Relational Database Management System Selinger et al.
Rate-based Data Propagation in Sensor Networks Gurdip Singh and Sandeep Pujar Computing and Information Sciences Sanjoy Das Electrical and Computer Engineering.
DBXplorer: A System for Keyword- Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das Cathy Wang
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Privacy Preservation of Aggregates in Hidden Databases: Why and How? Arjun Dasgupta, Nan Zhang, Gautam Das, Surajit Chaudhuri Presented by PENG Yu.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
Efficient Common Items Extraction from Multiple Sorted Lists Wei Lu, Cuitian Rong, Jinchuan Chen, Xiaoyong Du, Gabriel Fung, Xiaofang Zhou Renmin University.
1 A New Method for Composite System Annualized Reliability Indices Based on Genetic Algorithms Nader Samaan, Student,IEEE Dr. C. Singh, Fellow, IEEE Department.
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
Load Shedding Techniques for Data Stream Systems Brian Babcock Mayur Datar Rajeev Motwani Stanford University.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
To Tune or not to Tune? A Lightweight Physical Design Alerter Nico Bruno, Surajit Chaudhuri DMX Group, Microsoft Research VLDB’06.
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.
Presented By Anirban Maiti Chandrashekar Vijayarenu
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru.
Is Top-k Sufficient for Ranking? Yanyan Lan, Shuzi Niu, Jiafeng Guo, Xueqi Cheng Institute of Computing Technology, Chinese Academy of Sciences.
Check Yo Pockets Qituwrah King Kourtney Howard. Question  Does the weight of a coin affect the amount of times it will land on heads if it is flipped.
Using Classification Trees to Decide News Popularity
Huffman Coding The most for the least. Design Goals Encode messages parsimoniously No character code can be the prefix for another.
Automatic Categorization of Query Results Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang Sushruth Puttaswamy.
Rainer Gemulla, Wolfgang Lehner and Peter J. Haas VLDB 2006 A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets 2008/8/27 1.
Lecture 15: Query Optimization. Very Big Picture Usually, there are many possible query execution plans. The optimizer is trying to chose a good one.
Query Sampling in DB2. Motivation Data volume is growing fast Many algorithms do not scale up with data volume For exploratory analysis users often want.
An Efficient, Cost-Driven Index Selection Tool for MS-SQL Server
A paper on Join Synopses for Approximate Query Answering
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Query Sampling in DB2.
Load Shedding Techniques for Data Stream Systems
StreamApprox Approximate Stream Analytics in Apache Spark
RDF Stores S. Sakr and G. A. Naymat.
PROBABILITY.
Query Sampling in DB2.
Outline Introduction Background Distributed DBMS Architecture
Chapter 10 Inferences on Two Samples
Introduction to Hypothesis Testing
A Framework for Testing Query Transformation Rules
Query Optimization.
一種兼顧影像壓縮與資訊隱藏之技術 張 真 誠 國立中正大學資訊工程學系 講座教授
Probabilistic Ranking of Database Query Results
Informatics Practices Study Guide for Class 11 on the Extramarks App
Presentation transcript:

SURAJIT CHAUDHURI RAJEEV MOTWANI VIVEK NARASAYYA On random sampling over Joins Presented by : Srikantha Nema

Outline Semantics of Sample Difficulty of join Sampling Algorithms for Sampling Sampling strategies New strategies for join Sampling Experimental evaluation Conclusions

Terminologies SAMPLE(R, f) is an SQL operation When a query Q is evaluated, we obtain relation R f is a fraction of a relation R

Semantics of Sample Sampling with Replacement (WR) Sampling without Replacement (WoR) Independent Coin Flips (CF)

Difficulty of Join Sampling

Classification of Join Sampling problem Case A  No information is available for either or Case B  No information is available for but indexes and /or statistics are available for Case C  Indexes/statistics are available for and

Algorithms for Sampling Unweighted Sequential WR Sampling  Black-Box U1  Black-Box U2 Weighted Sequential WR Sampling  Black-Box WR1  Black-Box WR2

Unweighted Sequential WR Sampling Black-Box U2 Black-Box U1

Weighted Sequential Sampling Black-Box WR1 Black-Box WR2

Sampling Strategies (old) Strategy Naïve-Sample Strategy Olken-Sample

New strategies for join Sampling Strategy Stream-Sample Strategy Group-Sample Strategy Frequency-Partition-Sample

Experimental Evaluation 1

Experimental Evaluation 2

Experimental Evaluation 3

Conclusions Difficulty of join sampling Classification of the problem into 3 cases Strategies for join sampling New schemes for sequential random sampling for uniform and weighted sampling More efficient strategies can be developed for the case of single join More work needed to understand the problem of sampling the result of join trees

Thank You