A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees. Shimin Chen*, Phillip B. Gibbons*, Suman Nath+ (*Intel Labs Pittsburgh, +Microsoft Research)

Presentation transcript:

PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees
Shimin Chen*, Phillip B. Gibbons*, Suman Nath+
*Intel Labs Pittsburgh
+Microsoft Research

Online Aggregation
Data warehouse and business intelligence
– Fast growing multi-billion dollar market
Interactive ad-hoc queries on big data
– Important for detecting new trends
– Fast response time hard to achieve
One promising approach: Online Aggregation (OLA) [Hellerstein et al. 97]
– Provides early representative results for aggregate queries (sum, count, avg, etc.)
– For example, “average is ± 5.6 with 95% confidence”
Essential to OLA: non-blocking join algorithm
PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate. Shimin Chen, Phillip B. Gibbons, Suman Nath

Non-Blocking Join for OLA
OLA assumption: relations are in random order
[Diagram: Relation A and Relation B are read into main memory; records are spilled to temporary storage and read back; estimates are based on current results]

Design Goals of Non-Blocking Joins
Design goals:
– Fast, representative early results
– Good end-to-end performance
User may find:
– Wrong query: stop early
– Accurate enough: stop early
– Slow convergence: wait longer (high variance, high selectivity, high group counts, data skews …)
– Need the full, accurate result: finish query

Two Metrics in Algorithm Analysis
Good end-to-end performance: Total I/Os
Fast early results: Result Rate = (newly covered area × selectivity) / (I/Os for covering the new area)
Join: check all pairs of records from A and B
Early: before completely reading A and B
[Diagram: records from A on one axis, records from B on the other; the newly covered area is marked “new”]
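
A minimal sketch of the result-rate metric, to make the formula concrete. The function name, the units (record pairs for area, pages for I/Os), and the numbers in the example are illustrative assumptions, not values from the talk.

```python
def result_rate(new_area: float, selectivity: float, ios: float) -> float:
    """Expected new join results per I/O for one step of a join.

    new_area    -- number of newly checked (A record, B record) pairs
    selectivity -- fraction of checked pairs that match
    ios         -- I/Os spent to cover the new area
    """
    if ios <= 0:
        raise ValueError("ios must be positive")
    return new_area * selectivity / ios


# Hypothetical step: coverage grows from 100 x 100 to 200 x 200 record pairs.
new_area = 200 * 200 - 100 * 100  # 30000 newly covered pairs
rate = result_rate(new_area, selectivity=0.001, ios=40)
```

The example only illustrates the shape of the metric: a step that covers more new area for the same number of I/Os scores higher.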

Design Space
[Diagram: algorithms placed on two axes, early representative result rate (low to high) and total I/O cost (low to high). Labeled points: Hash Ripple [Luo et al. ’02], SMS [Jermaine et al. ’05], GRACE [Kitsuregawa et al. ’83], Ripple [Haas & Hellerstein ’99], DBO [Jermaine et al. ’07], the ideal point, and the region PR-Join targets]

Performance Result Preview
Near-optimal total I/O cost
Higher early result rate

Outline
Introduction
PR-Join (Partitioned expanding Ripple Join) Algorithm
Evaluation
Conclusion

Background: Ripple Join [Haas & Hellerstein’99]
For each ripple:
– Read new records from A and B; check for matches
– Read spilled records; check for matches with new records
– Spill new to disk
[Diagram: records from A on one axis, records from B on the other, with regions marked “spilled” and “new”]
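
The three steps of a ripple can be sketched as follows. This is an in-memory illustration under stated assumptions, not the authors' implementation: plain Python lists stand in for the spilled data on disk, the join is an equi-join on the first tuple field, and the names `ripple_join`, `Record`, and `width` are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[int, str]  # hypothetical layout: (join key, payload)


def ripple_join(a: Sequence[Record], b: Sequence[Record],
                width: int) -> Iterator[Tuple[Record, Record]]:
    """Yield matching (A record, B record) pairs, one ripple at a time."""
    if width <= 0:
        raise ValueError("width must be positive")
    spilled_a: List[Record] = []  # stands in for A records spilled to disk
    spilled_b: List[Record] = []  # stands in for B records spilled to disk
    pos = 0
    while pos < max(len(a), len(b)):
        # Step 1: read new records from A and B; check new-vs-new matches.
        new_a, new_b = a[pos:pos + width], b[pos:pos + width]
        for ra in new_a:
            for rb in new_b:
                if ra[0] == rb[0]:
                    yield ra, rb
        # Step 2: read spilled records; check them against the new records.
        for ra in new_a:
            for rb in spilled_b:
                if ra[0] == rb[0]:
                    yield ra, rb
        for ra in spilled_a:
            for rb in new_b:
                if ra[0] == rb[0]:
                    yield ra, rb
        # Step 3: spill the new records.
        spilled_a.extend(new_a)
        spilled_b.extend(new_b)
        pos += width


a = [(1, "a1"), (2, "a2"), (1, "a3"), (3, "a4")]
b = [(1, "b1"), (3, "b2"), (2, "b3")]
pairs = list(ripple_join(a, b, width=2))
```

Each pair of records is compared exactly once: in the ripple where the later of the two records arrives. Results therefore stream out before either input has been read completely.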

Observations of Ripple Join
Total I/Os: O(N^2)
– N = total # of input pages in A and B
– I/Os of ripples form an arithmetic series
Result rate of a ripple is higher if wider ripple
– Increase ripple width
But ripple width limited by the memory size
Result Rate = (newly covered area × selectivity) / (I/Os for covering the new area)
[Slide annotations on the formula: “Super linear growth”, “Grows linearly”]
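
As a rough check on the O(N^2) claim, the arithmetic series can be written out under a simplified cost model. The model is an assumption for illustration, not taken from the slides: every ripple has a fixed width of w pages, and ripple k reads its w new pages, re-reads the (k-1)w pages spilled so far, and spills the w new pages.

```latex
\begin{align*}
\text{I/Os of ripple } k
  &= \underbrace{w}_{\text{read new}}
   + \underbrace{(k-1)\,w}_{\text{re-read spilled}}
   + \underbrace{w}_{\text{spill new}}
   = (k+1)\,w, \\
\text{Total I/Os}
  &= \sum_{k=1}^{N/w} (k+1)\,w
   = \frac{N^2}{2w} + \frac{3N}{2}
   = O(N^2) \quad \text{for fixed } w.
\end{align*}
```

The per-ripple cost grows by w each time, which is the arithmetic series; its sum is quadratic in N because the number of ripples, N/w, is itself linear in N.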

PR-Join Idea 1: Multiplicatively Expanding Ripples
Total I/Os: O(N), linear
– I/Os of ripples form a geometric series
Higher result rate:
– Wider ripple leads to higher result rate
→ But must overcome memory size limitation
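
The effect of multiplicative expansion on total I/Os can be illustrated by counting pages under a simplified cost model. The model is an assumption for illustration, not the paper's analysis: each ripple reads its new pages, re-reads every previously spilled page, and spills the new pages. The helper names and the expansion factor of 2 are hypothetical choices.

```python
from itertools import repeat
from typing import Iterable, Iterator


def total_ios(total_pages: int, widths: Iterable[int]) -> int:
    """Count page I/Os for a sequence of ripple widths (simplified model)."""
    spilled = 0  # pages already processed and spilled
    ios = 0
    for width in widths:
        new = min(width, total_pages - spilled)
        if new <= 0:
            break
        ios += new + spilled + new  # read new, re-read spilled, spill new
        spilled += new
    return ios


def expanding_widths(first: int, factor: int) -> Iterator[int]:
    """Ripple widths that grow multiplicatively: first, first*factor, ..."""
    width = first
    while True:
        yield width
        width *= factor


n = 1024
fixed = total_ios(n, repeat(1))                  # fixed-width ripples
expanding = total_ios(n, expanding_widths(1, 2))  # widths 1, 2, 4, 8, ...
```

Under this model the fixed-width count grows roughly as N^2/2, while the expanding count stays within a small constant multiple of N, because the re-read cost of each ripple is bounded by the size of the ripple itself.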

PR-Join Idea 2: Hash Partitioning
Each partition < memory
Every join invocation performs a ripple on a partition
– Estimation is updated after every join invocation
– Much faster user responses
→ Statistically sound
[Diagram: the join space partitioned on the join key; off-diagonal partition pairs are empty]
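
A minimal sketch of why partitioning on the join key helps: records with equal keys always hash to the same partition, so only same-numbered partition pairs can produce matches and every other pair is empty. The function names and record layout are hypothetical, and in-memory lists stand in for on-disk partitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # hypothetical layout: (join key, payload)


def hash_partition(records: Iterable[Record],
                   num_partitions: int) -> Dict[int, List[Record]]:
    """Group records by the hash of their join key."""
    parts: Dict[int, List[Record]] = defaultdict(list)
    for rec in records:
        parts[hash(rec[0]) % num_partitions].append(rec)
    return parts


def partitioned_join(a: Iterable[Record], b: Iterable[Record],
                     num_partitions: int
                     ) -> Iterator[Tuple[int, Record, Record]]:
    """Join partition p of A only with partition p of B."""
    parts_a = hash_partition(a, num_partitions)
    parts_b = hash_partition(b, num_partitions)
    for p in range(num_partitions):
        # One "join invocation": a caller could refresh its estimate here.
        for ra in parts_a.get(p, []):
            for rb in parts_b.get(p, []):
                if ra[0] == rb[0]:
                    yield p, ra, rb


a = [(1, "a1"), (2, "a2"), (5, "a3"), (7, "a4")]
b = [(5, "b1"), (1, "b2"), (2, "b3"), (9, "b4")]
matches = [(ra, rb) for _, ra, rb in partitioned_join(a, b, num_partitions=4)]
```

Because each partition pair is much smaller than the full inputs, one invocation can fit in memory and finish quickly, which is what allows the estimate to be updated often.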

Statistical Guarantees
Idea: hash partitioning → disjoint sub-spaces
– Stratified sampling in statistics
Statistical estimate:
1) Ripple join formula for every partition
2) Stratified sampling formula to combine estimates from partitioned ripples
[Diagram: the join space partitioned on the join key; off-diagonal partition pairs are empty]
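
Step 2 can be illustrated with the textbook stratified-sampling rule for totals: with independent strata, the estimates add and so do their variances. This is a sketch under assumptions, not the paper's exact estimator. The per-partition estimates and variances are taken as given (the ripple join formula from step 1 is not reproduced), the aggregate is assumed to be a SUM- or COUNT-style total, and a normal approximation is used for the interval.

```python
from math import sqrt
from statistics import NormalDist
from typing import Iterable, Tuple


def combine_strata(estimates: Iterable[Tuple[float, float]],
                   confidence: float = 0.95) -> Tuple[float, float]:
    """Combine per-partition (estimate, variance) pairs.

    Returns (total estimate, half-width of the confidence interval),
    assuming independent partitions and a normal approximation.
    """
    total = 0.0
    variance = 0.0
    for estimate, var in estimates:
        total += estimate   # stratum totals add up
        variance += var     # variances add for independent strata
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return total, z * sqrt(variance)


# Hypothetical per-partition (estimate, variance) values.
total, half_width = combine_strata([(10.0, 4.0), (20.0, 9.0), (30.0, 3.0)])
```

The result would be reported as total ± half_width at the chosen confidence level, in the same style as the online aggregation example earlier in the talk.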

Comparing Analytical Performance
Algorithm        Early Result Rate
Symmetric Hash   1 (when data fit in memory)
Hash Ripple      0.5
SMS              0.6
Two-Way DBO      1.2
Ripple           1, 1.25, 1.40, 1.50, …, → 2
PR-Join          1, 1.7, 3.2, 6.2, 12.2, …
(Parameter setting details in paper)

Outline
Introduction
PR-Join Algorithm
Evaluation
Conclusion

Non-Blocking Join for OLA
[Diagram: Relation A and Relation B (on hard disks) are read into main memory; records are spilled to temporary storage (hard disk or SSD) and read back; estimates are based on current results]

Disk as Temp Storage
10GB joins 10GB, 500MB memory
PR-Join achieves much better end-to-end performance than Ripple Join

Marginal Result Rate
Disk as temp storage
PR-Join achieves an order of magnitude higher result rate than Ripple Join

SSD as Temp Storage
10GB joins 10GB, 500MB memory
Using SSD, PR-Join achieves near optimal I/O costs
Temp I/Os are almost completely overlapped with I/Os to read input

More Details in Paper
Joining finite data streams:
– PR-Join can be easily used for joining finite data streams
– Compared with state-of-the-art algorithm (RPJ [Tao et al.’05])
– PR-Join achieves better performance
Analysis of non-blocking join algorithms for OLA
PR-Join parameter choices
Handling skews
More experimental results
(see us at the plenary session)

Conclusions
In this paper, we propose a new non-blocking join algorithm: PR-Join (Partitioned expanding Ripple Join)
PR-Join for Online Aggregation:
– Provides statistical guarantee
– An order of magnitude higher result rate than prior approach
– Near optimal total I/O cost
PR-Join for finite data streams:
– Better performance than state-of-the-art algorithm

Thank you!