Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams Paper By: Lukasz Golab M. Tamer Ozsu CS 561 Presentation WPI 11 th March,

Slides:

Advertisements

Similar presentations

Evaluating Window Joins over Unbounded Streams Author: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas University of Wisconsin-Madison CS Dept. Presenter:

Advertisements

MATH 224 – Discrete Mathematics

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.

Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April.

Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

Bio Michel Hanna M.S. in E.E., Cairo University, Egypt B.S. in E.E., Cairo University at Fayoum, Egypt Currently is a Ph.D. Student in Computer Engineering.

Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.

Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.

Data Structures Using C++ 2E

©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.

Mining Data Streams.

CMPT 225 Sorting Algorithms Algorithm Analysis: Big O Notation.

Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.

CPSC 335 Computer Science University of Calgary Canada.

Evaluating Window Joins Over Unbounded Streams By Nishant Mehta and Abhishek Kumar.

Sets and Maps Chapter 9. Chapter 9: Sets and Maps2 Chapter Objectives To understand the Java Map and Set interfaces and how to use them To learn about.

Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.

What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.

Hashing General idea: Get a large array

1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.

Data Structures Using C++ 2E Chapter 9 Searching and Hashing Algorithms.

Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.

Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.

1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.

Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.

Database Management 9. course. Execution of queries.

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.

©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.

C++ Programming: Program Design Including Data Structures, Fourth Edition Chapter 19: Searching and Sorting Algorithms.

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.

Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.

1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.

Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.

Chapter 18: Searching and Sorting Algorithms. Objectives In this chapter, you will: Learn the various search algorithms Implement sequential and binary.

Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:

Eddies: Continuously Adaptive Query Processing Ross Rosemark.

16.7 Completing the Physical- Query-Plan By Aniket Mulye CS257 Prof: Dr. T. Y. Lin.

CS4432: Database Systems II Query Processing- Part 2.

Radix Sort and Hash-Join for Vector Computers Ripal Nathuji 6.893: Advanced VLSI Computer Architecture 10/12/00.

Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.

Chapter 15 A External Methods. © 2004 Pearson Addison-Wesley. All rights reserved 15 A-2 A Look At External Storage External storage –Exists beyond the.

Session 1 Module 1: Introduction to Data Integrity

Advance Database Systems Query Optimization Ch 15 Department of Computer Science The University of Lahore.

Evaluating Window Joins over Unbounded Streams Jaewoo Kang Jeffrey F. Naughton Stratis D. Viglas {jaewoo, naughton, Univ. of Wisconsin-Madison.

Mining of Massive Datasets Ch4. Mining Data Streams

CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)

Query Processing CS 405G Introduction to Database Systems.

1. Searching The basic characteristics of any searching algorithm is that searching should be efficient, it should have less number of computations involved.

CS 440 Database Management Systems Lecture 5: Query Processing 1.

File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.

Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.

Chapter 9: Web Services and Databases Title: NiagaraCQ: A Scalable Continuous Query System for Internet Databases Authors: Jianjun Chen, David J. DeWitt,

CS 540 Database Management Systems

Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)

© 2006 Pearson Addison-Wesley. All rights reserved15 A-1 Chapter 15 External Methods.

Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.

Mining Data Streams (Part 1)

CS 540 Database Management Systems

The Stream Model Sliding Windows Counting 1’s

CS 440 Database Management Systems

Chapter 15 QUERY EXECUTION.

واشوقاه إلى رمضان مرحباً رمضان

Lecture 2- Query Processing (continued)

Implementation of Relational Operations

Heavy Hitters in Streams and Sliding Windows

PSoup: A System for streaming queries over streaming data

Presentation transcript:

Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams Paper By: Lukasz Golab M. Tamer Ozsu CS 561 Presentation WPI 11 th March, 2004 Students: Malav shah Professor: Elke Rundenstainer VLDB 2003, Berlin, Germany. This presentation will probably involve audience discussion, which will create action items. Use PowerPoint to keep track of these action items during your presentation In Slide Show, click on the right mouse button Select “Meeting Minder” Select the “Action Items” tab Type in action items as they come up Click OK to dismiss this box This will automatically create an Action Item slide at the end of your presentation with your points entered.

INDEX  Introduction  Problem Description  Sliding Window Join algorithms  Cost Analysis  Join Ordering Heuristics  Experimental Results  Related Work  Conclusion and Future Work

Introduction  What is Data Streams? A real-time, continuous, ordered (explicitly by timestamps or implicitly by arrival time) sequence of items.  How can you query such type of streams? running a query continually over a period of time and generating new results. continuous, standing, or persistent queries.

Applications  Sensor Data Processing  Internet Traffic analysis  Financial Ticker  Analysis of various transaction logs such as Web server logs and telephone records

Issues  Unbounded streams may not wholly stored in bounded memory.  New items are often more accurate or more relevant than older items.  Blocking operators may not be useful as they must consume entire input before any results produced.

Common Solution  Define Sliding-Window Restrict the range of continuous queries to a sliding- window that contains the last T items or those items that contains last t time units.  Count Based Window (Sequence Based)  Time Based Window (Timestamps Based)

Issues: using sliding window  Re-Execution Strategies  Eager re-execution strategy  Lazy re-execution strategy  Tuple Invalidation Procedures  Eager expiration  Lazy expiration

Example of Eager Re-execution and Expiration Insert tupleInserted tuple Invalidate tuple

Example of Eager Re-execution and Expiration Invalidated tuple

Example of Lazy Expiration Insert tupleInserted tuple

Example of Lazy Expiration Insert tuple

Example of Lazy Expiration Invalidate tuple Inserted tuple

Example of Lazy Expiration Invalidated tuple

INDEX Introduction  Problem Description  Sliding Window Join algorithms  Cost Analysis  Join Ordering Heuristics  Experimental Results  Related Work  Conclusion and Future Work

Problem Description  N Data Streams  N corresponding sliding window  Continuously evaluate exact join of all N window

Assumption  Each stream is consist of relational tuple with schema  All windows fit into main memory  All query plans use extreme right- deep join trees that do not materialize any intermediate results  Do not permit time-lagged windows

Explanation of symbols

Convention for Join Ordering TOP-DOWN Approach

INDEX Introduction Problem Description  Sliding Window Join algorithms  Cost Analysis  Join Ordering Heuristics  Experimental Results  Related Work  Conclusion and Future Work

Binary Incremental NLJ  Proposed by Kang  Strategy Let S1 and S2 be two sliding windows to be joined. For each newly arrived S1-tuple, we scan S2 and return all matching tuples. We then insert the new tuple into S1 and in-validate expired tuples. We follow the same procedure for each newly arrived S2-tuple.

Example AB Insert tuple Probe Output tuple

Example AB Inserted tupleInvalidate tuple

Example AB Invalidated tuple

Naïve Multi-way NLJ  Extension to Binary Incremental NLJ  Strategy For each newly arrived tuple k, we execute the join sequence in the order prescribed by the query plan, but we only include k in the join process (not the entire window that contain k).  In this algorithm we invalidate the expired tuples first  Extending Naïve Multi-way NLJ to support lazy re-evaluation is easy. Re-execute the join every T time units, first joining new s1-tuples with other sliding windows, then new s2-tuples and so on. (must ensure not to include expired tuples in result)

Example S2S3 Insert tuple S1 Invalidate tuple

Example AB Insert tuple C Output tuple A B Probe Output tuple Join Output tuple

Example AB Inserted tuple C A B Output tuple Invalidate tuple

Example AB C A B

Improved eager Multi-way NLJ  Problems with Naïve eager Multi- way Join  When a new tuple arrives at the stream which is not first in the order list then we compute the join for both second and third stream for all tuples in first ordered stream. This results in unnecessary work when a new tuple arrives at stream which is not first in the join tree.  Why not select only those tuples from s1 which joins with s3-tuple, and make scan of s2 for only those tuples.  In worst case when all tuples in s1 joins with newly arrived tuple, we’ve to scan s2 for every tuple in s1, otherwise it’ll be less.

Algorithm for eager Multi-way Join

Lazy Multi-Way Join  Straightforward adaptation to eager multi-way join  Process in the outer most for-loop all the new tuples which have been arrived since last re-execution  Algorithm

General Lazy Multi-Way Join  We can make the lazy multi-way join more general if newly arrived tuples are not restricted to the outer-most for loop.  Accepts arbitrary join order.  Algorithm

Multi-Way Hash Join  We scan only one hash bucket instead of the entire window at each for loop.  Algorithm  Notation: B(i,k) = hi(k.attr) for Ith window

Extension to Count-Based Windows  Eager expiration is straightforward:  Implement window(or hash bucket) as circular arrays  We can perform insertion and invalidation in one step by overwriting oldest tuple  Lazy expiration is interesting:  Implement circular counter and assign positions to each element in sliding window(call them cnt)  When probing for tuples to join with a new tuple k, instead of comparing timestamps, we ensure that each tuples counter cnt has not expired at time k.ts.  To do this, for each sliding window we find counter with the largest timestamps not exceeding k.ts and subtract window length from this counter (call it tmp) and ensure that we join only those tuples with counter greater than tmp.

Algorithm

INDEX Introduction Problem Description Sliding Window Join algorithms  Cost Analysis  Join Ordering Heuristics  Experimental Results  Related Work  Conclusion and Future Work

Cost Analysis  Insertion and Expiration cost  All NLJ based algorithms incur a constant insertion cost per tuple: a new tuple is simply appended to its window  In hash based algorithm requires more work: need to compute hash function and add tuple in hash table (insertion cost slightly higher)  Actual insertion and expiration costs are implementation- dependent  If invalidation is too frequent, some sliding window may not contain any state tuples, but we’ll still pay the cost to access it (same case with hash joins)  Very frequent expiration is too costly, especially in hash joins.

Join Processing Cost  Used per-unit-time cost model, developed by kang  When estimating join sizes, standard assumptions regarding containment of value sets and uniform distribution of attribute values are considered

Comparison between Naïve and Proposed Multi-way Joins. Given equivalent ordering All window has same window size, all streams has same arrival rate, each window has same distinct value. Proposed multi-way scales better than Naïve Multi-way Joins.

Comparison between different lazy multi-way join algorithms Lazy does not perform well some time

INDEX Introduction Problem Description Sliding Window Join algorithms Cost Analysis  Join Ordering Heuristics  Experimental Results  Related Work  Conclusion and Future Work

Effect of Join Ordering  Eager re-execution  If each window has same number of distinct values, then it is sensible to globally order the joins in ascending order of the window sizes (in tuples), or average hash bucket sizes  In general, it is sensible (but not optimal always) heuristic is to assemble the joins in descending order of the binary join selectivities, leaving as little work as possible for inner for-loops  We define a join predicate p1 to be more selective then p2, if p1 produces small result set then p1.

Example  For the example given, the results are as follows  For order s1,s2,s3,s4 processing time is  S2,s1,s3,s4 has cost of  Worst cost plan is

Example when two streams faster  For the example given, the results are as follows  For order s1,s2,s3,s4 processing time68200  S2,s1,s3,s4 has cost of  S3,s1,s4,s2 has cost of (optimal)  So it’s not the always case that moving all faster streams upward is optimal

Ordering heuristics for Lazy Re- evaluation  Recall that lazy multi-way join is as efficient as general multi-way join for small T.  If this is the case then we may use same ordering heuristics as algorithm Lazy Multi-way Join is a straightforward extension of its eager version.  General Multi-way Join is more efficient if a good join-ordering is chosen  General Multi-way join chooses join ordering arbitrarily depending on the origin of the new tuples that are being processed

Ordering heuristics for Multi- way Hash Join  If each hash table has same number of buckets, the ordering problem is same as NLJ. Why?  Hash join so configured operates in nested-loop fashion like NLJ, except in each loop only one hash bucket is scanned instead of entire window.

Join ordering in other scenarios  Hybrid Hash-NLJ: a simple heuristic is to place all the windows that contain hash indices in the inner for-loop.  Expensive Predicates: Those may be ordered near the top of the index tree.  Joins on different attributes: we cannot arbitrarily re-order the join tree. It may still be efficient to place the window from which new tuple arrived at the outer-most for-loop.  Fluctuating Stream arrival rates: If feasible, we re-execute the ordering heuristic whenever stream rates changes beyond some threshold, or we can place the streams which expected to change widely near the top.

INDEX Introduction Problem Description Sliding Window Join algorithms Cost Analysis Join Ordering Heuristics  Experimental Results  Related Work  Conclusion and Future Work

Experimental Setting  Build a simple prototype of algorithms using SUN Microsystems JDK  Windows PC with 1.2 AMD Processor and 256 MB RAM  Implemented sliding windows and hash buckets as singly linked list  All hash functions are simple modular division by the number of hash buckets  Tuple schema  Expiration does not delete the tuple, instead java garbage collector do that task  Tuple generation is simple continuous for-loop which generates tuples randomly from specified distinct values

Validation of cost model and Join ordering heuristics

Effect of query Re-Evaluation and Expiration Frequencies on Processing Time  Eager expirations incurs cost of updating linked list on every arrival of tuple, while lazy expiration performs fewer operations, but allows the window to grow between updates, causing long Join evaluation time.  For both NLJ and hash join short expiration intervals are preferred as cost of advancing pointer is lower than processing larger windows.  Very frequent expiration and re-evaluation are inefficient.

Varying Hash Table Sizes

INDEX Introduction Problem Description Sliding Window Join algorithms Cost Analysis Join Ordering Heuristics Experimental Results  Related Work  Conclusion and Future Work

Related Work  Couger: Distributed sensor processing inside the sensor network  Aurora: Allows user to create query plans by visually arranging query operators using boxes  TelegraphCQ: For adaptive query processing  STREAM: Addresses all aspects of data stream management, also proposed (CQL)  Detar: uses combination of window and stream summary  Babu and Widom uses stream constraints  Some related work towards Join processing: XJoin, Hash-Join, Ripple Join, Multi-way XJoin called MJoin.

INDEX Introduction Problem Description Sliding Window Join algorithms Cost Analysis Join Ordering Heuristics Experimental Results Related Work  Conclusion and Future Work

Conclusion  Presented and analyzed incremental, multi-way join algorithms for sliding window over data streams.  Using per-unit-time based model, developed a join heuristic that finds a good join order without iterating over entire search space  With experiments showed that hash-based joins performs better than NLJs and also discovered allocating more hash buckets to larger windows is a promising strategy

Future Work  Goal is to develop a sliding window query processor that is functional, efficient, and scalable  Functionality: intended to design efficient algorithms for other query operators as well.  Efficiency: low-overhead indices for indexing window contents and also exploit constraints to minimize state  Scalability: indexing query predicates, storing materialized views, and returning approximate answers if exact answers are too expensive to compute

INDEX Introduction Problem Description Sliding Window Join algorithms Cost Analysis Join Ordering Heuristics Experimental Results Related Work Conclusion and Future Work

Thank You Malav Shah