Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker

Presentation transcript:

1 Semantics and Evaluation Techniques for Window Aggregates in Data Streams Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker This work was supported by NSF grant IIS

2 Motivation
Traffic sensor stream, schema (sid, speed, ts (hh:mm:ss)):
  t1 (s5, 40, 01:06:30)
  t2 (s6, 42, 01:07:45)
  t3 (s5, 45, 01:08:15)
  t4 (s5, 47, 01:10:10)
  t5 (s6, 48, 01:10:30)
  t6 (s6, 46, 01:11:02)
Q1: “For every minute, find the max speed of the past 5 minutes for each sensor.”
Windows:
  01:06:00 – 01:11:00
  01:07:00 – 01:12:00
  01:08:00 – 01:13:00
Output (sid, max): (s5, 47), (s6, 48)

3 Limitations
Window semantics definition and implementation
  Assumptions on data arrival order
  Data arrival affects query answer and result production
Query evaluation performance
  Space: Internal buffer space to hold a window
  Time: Tuple access – each tuple is accessed multiple times
  Latency: Window aggregate computation is tied with window completion

4 Outline
  WID overview
  Window semantics definition and its implementation in WID
  Disorder
  Sharing panes – an optimization technique using sub-windows (panes)
  Conclusion

5 WID Overview
Q1: SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP-BY sid
Input (sid, speed, ts):
  t1 ( s5, 40, 01:06:30)
  t2 ( s6, 42, 01:07:45)
  t3 ( s5, 45, 01:08:15)
  t4 ( s5, 47, 01:10:10)
  t5 ( s6, 48, 01:10:30)
  p1 ( s6, *, 01:11:00)
  t6 ( s6, 46, 01:11:02)
Tagged with window-id (sid, speed, ts, window-id): each of t1 … t6 gains a window-id attribute; the punctuation p1 ( s6, *, 01:11:00) becomes p1 ( s6, *, *, 70 )
Aggregate state (wid, sid, max) as tuples arrive:
  t1: 70 s5 max: 40 … 74 s5 max: 40
  t2: 70 s6 max: 42 … 74 s6 max: 42
  t3: 70 s5 max: 45 … 74 s5 max: 45
  t4: 70 s5 max: 47 … 74 s5 max: 47
  t5: 70 s6 max: 48 … 74 s6 max: 48
Output on p1 (sid, window-id, max): ( s6, 70, 48 )
A punctuation is a message embedded in the data to indicate the end of a sub-stream

6 Window Semantics Framework
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
windows: (T, S) → W
  Defines the set of window ids to be used
extent: (T, S, w) → U ⊆ T, where w ∈ W
  Specifies which tuples belong to a given window
wids: (T, S, t) → V ⊆ W, where t ∈ T
  Determines the set of window-ids to which a tuple belongs
  Is the dual of extent

7 Defining Window Semantics - sliding window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
windows (T, S [RANGE, SLIDE, WATTR]) = {0, 1, 2, …}
extent (w, T, S [RANGE, SLIDE, WATTR]) = { t ∈ T | (w+1) * SLIDE - RANGE ≤ t.WATTR < (w+1) * SLIDE }
wids (t, T, S [RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE - 1 < w ≤ (t.WATTR + RANGE) / SLIDE - 1 }
Q1: SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP-BY sid
windows (T, S [5, 1, ts]) = {0, 1, 2, …}
extent (w, T, S [5, 1, ts]) = { t ∈ T | (w+1) * 1 - 5 ≤ t.ts < (w+1) * 1 }
wids (t, T, S [5, 1, ts]) = { w ∈ W | t.ts / 1 - 1 < w ≤ (t.ts + 5) / 1 - 1 }
For t4 (s5, 47, 01:10:10), where t4.ts is 01:10:10 ≈ minute 70.17:
  wids (t4, T, S [5, 1, ts]) = { w ∈ W | t4.ts / 1 - 1 < w ≤ (t4.ts + 5) / 1 - 1 }
                             = { w ∈ W | 69.17 < w ≤ 74.17 }
                             = { w ∈ W | 70 ≤ w ≤ 74 }
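
The sliding-window definitions above can be checked with a small sketch. It assumes timestamps are converted to integer seconds, and the names `to_seconds`, `wids` and `in_extent` are illustrative helpers, not part of WID itself. With integer arithmetic, the smallest window-id strictly above t/SLIDE - 1 is floor(t/SLIDE), and the largest one not above (t + RANGE)/SLIDE - 1 is floor((t + RANGE)/SLIDE) - 1.

```python
def to_seconds(hhmmss: str) -> int:
    """Convert an hh:mm:ss timestamp to seconds since 00:00:00."""
    h, m, s = (int(x) for x in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def wids(t_wattr: int, range_: int, slide: int) -> range:
    """Window-ids w with t/SLIDE - 1 < w <= (t + RANGE)/SLIDE - 1."""
    lo = t_wattr // slide                 # smallest integer above t/SLIDE - 1
    hi = (t_wattr + range_) // slide - 1  # largest integer within the bound
    return range(max(lo, 0), hi + 1)      # window-ids start at 0


def in_extent(w: int, t_wattr: int, range_: int, slide: int) -> bool:
    """True if (w+1)*SLIDE - RANGE <= t < (w+1)*SLIDE."""
    return (w + 1) * slide - range_ <= t_wattr < (w + 1) * slide


RANGE, SLIDE = 5 * 60, 60  # Q1, expressed in seconds
t4 = to_seconds("01:10:10")
print(list(wids(t4, RANGE, SLIDE)))  # [70, 71, 72, 73, 74]
```

Because `wids` is the dual of `extent`, a window-id is returned by `wids` exactly when `in_extent` holds for that window and tuple.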

8 Window Semantics Implementation in WID – sliding window
SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
Plan: streamscan → bucket (RANGE 5 minutes, SLIDE 1 minute, WATTR ts) → max (group on window-id, sid)
Into bucket (sid, speed, ts):
  t1 ( s5, 40, 01:06:30 )
  p1 ( s5, *, 01:11:00 )
Out of bucket (sid, speed, ts, window-id):
  t1 ( s5, 40, 01:06:30, ) with its window-id attribute added
  p1 ( s5, *, *, 70 )
Out of max (sid, window-id, max):
  ( s5, 70, 40 )
1. Bucket implements wids function
2. Bucket for sliding windows is stateless
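
A minimal sketch of this plan, under stated assumptions: the stream is a list of `("tuple" | "punct", sid, speed, ts)` items with `ts` in seconds, and a punctuation for a sensor is taken to mean that no more tuples with a smaller timestamp will arrive for that sensor. The function names `bucket` and `wid_max` are hypothetical. The input is the Q1 example stream from the WID overview slide.

```python
RANGE, SLIDE = 300, 60  # Q1: RANGE 5 minutes, SLIDE 1 minute, in seconds


def secs(hhmmss):
    h, m, s = map(int, hhmmss.split(":"))
    return 3600 * h + 60 * m + s


def bucket(item):
    """Stateless: tag a tuple with its window-ids, or translate a punctuation."""
    kind, sid, speed, ts = item
    if kind == "punct":
        # Assumption: every window ending at or before ts is now complete.
        return ("punct", sid, ts // SLIDE - 1)
    return ("tuple", sid, speed, range(ts // SLIDE, (ts + RANGE) // SLIDE))


def wid_max(stream):
    """One pass: one running max per (window-id, sid); emit on punctuation."""
    state = {}
    for item in map(bucket, stream):
        if item[0] == "tuple":
            _, sid, speed, wids = item
            for w in wids:
                key = (w, sid)
                state[key] = max(state.get(key, speed), speed)
        else:
            _, sid, last_wid = item
            done = sorted(k for k in state if k[1] == sid and k[0] <= last_wid)
            for w, s in done:
                yield (s, w, state.pop((w, s)))


stream = [
    ("tuple", "s5", 40, secs("01:06:30")),
    ("tuple", "s6", 42, secs("01:07:45")),
    ("tuple", "s5", 45, secs("01:08:15")),
    ("tuple", "s5", 47, secs("01:10:10")),
    ("tuple", "s6", 48, secs("01:10:30")),
    ("punct", "s6", None, secs("01:11:00")),
    ("tuple", "s6", 46, secs("01:11:02")),
]
results = list(wid_max(stream))
print(results[-1])  # ('s6', 70, 48)
```

No input tuple is buffered: each one updates the partial aggregates of its windows and is then discarded. This sketch also emits the earlier s6 windows that the punctuation closes, and it does not guard against tuples that arrive after their window has been emitted.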

9 Defining Window Semantics - partitioned window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
Q2: SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
windows (T, S [RANGE, SLIDE, row-num, PATTR]) = { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent ((i, p), T, S [RANGE, SLIDE, row-num, PATTR]) = { t ∈ T | t.PATTR = p, (i+1) * SLIDE - RANGE ≤ rank(t.row-num, PATTR, T) < (i+1) * SLIDE }

10 Defining Window Semantics - partitioned window (cont.)
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
Q2: SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
wids (t, T, S [RANGE, SLIDE, row-num, PATTR]) = { (i, p) ∈ W | t.PATTR = p, r / SLIDE - 1 < i ≤ (r + RANGE) / SLIDE - 1 }
  where r = rank (t, row-num, PATTR, T)

11 Window Semantics Implementation in WID – partitioned window
SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
Plan: streamscan → bucket (RANGE 1000 rows, SLIDE 100 rows, WATTR row-num, PATTR sid) → max (speed) (group on window-id, sid)
Into bucket (sid, speed, row-num):
  t1 ( s5, 47, 507 )
Out of bucket (sid, window-id, speed, row-num):
  t1 ( s5, 3-12, 47, 507 )
  p1 ( s5, 3, *, * )
Out of max (sid, window-id, max):
  ( s5, 3, 47 )
1. Bucket generates punctuations
2. Bucket for partitioned windows maintains states (count for each partition)
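
A sketch of a stateful bucket for partitioned, row-based windows. The class name `PartitionedBucket` and method `tag` are hypothetical, and two assumptions are made that the slides do not spell out: the rank of a tuple is its 0-based arrival position within its partition, and window i of a partition is treated as complete once the tuple with rank (i+1) * SLIDE - 1 has been seen.

```python
from collections import defaultdict


class PartitionedBucket:
    """Tags tuples with (first, last) window-ids using a per-partition count."""

    def __init__(self, range_rows, slide_rows):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        self.count = defaultdict(int)  # tuples seen so far, per partition

    def tag(self, partition):
        r = self.count[partition]      # assumed 0-based rank within partition
        self.count[partition] += 1
        lo = r // self.slide_rows
        hi = (r + self.range_rows) // self.slide_rows - 1
        out = [("tuple", partition, (lo, hi))]
        if (r + 1) % self.slide_rows == 0:
            # Ranks below (i+1)*SLIDE have all arrived, so window i is complete.
            out.append(("punct", partition, (r + 1) // self.slide_rows - 1))
        return out


b = PartitionedBucket(range_rows=1000, slide_rows=100)
for _ in range(350):
    b.tag("s5")          # ranks 0..349 of partition s5
b.tag("s6")              # other partitions do not disturb s5's count
print(b.tag("s5"))       # [('tuple', 's5', (3, 12))]
```

A tuple whose rank within s5 falls between 300 and 399 is tagged with windows 3 to 12, which matches the window-id range shown for t1 on the slide.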

12 WID Advantages
Window semantics definition
  Separated from physical implementation and data arrival order
  Flexible – covers varieties of windows, e.g., sliding, tumbling, landmark, time-based, tuple-based; allows a user-specified windowing attribute
Implementation of query evaluation
  Window semantics localized in Bucket
  Insensitive to data arrival order
  Punctuation can guarantee progress
  Gaps in tuple arrival need not affect result production
  Performance gains in space, execution time and latency

13 WID vs. Buffering – execution time comparison (overview)

14 WID vs. Buffering – execution time comparison (zoom-in)

15 Outline
  WID overview
  Window semantics definition and its implementation in WID
  Disorder
  Sharing panes – an optimization technique using sub-windows
  Conclusion

16 Sources of Disorder
  Merging different data sources
  Varying network transmission delays
  Data prioritization
  Query processing algorithms, e.g., shared window joins [Hammad, et al.]
  Multiple possible windowing attributes, e.g., two timestamps

17 Handling Disorder
Generally dealt with by buffering
  Slack – BSort in Aurora
  Output buffering in a shared-window join
Punctuation + Window-id
Heartbeat

18 Disorder Handling - WID
Q1: SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP-BY sid
Into bucket (sid, speed, ts), in arrival order:
  p1 ( s6, *, 01:11:00)
  t6 ( s6, 46, 01:11:02)
  t7 ( s5, 52, 01:10:15)
Out of bucket (sid, speed, ts, window-id):
  p1 ( s6, *, *, 70 )
  t6 ( s6, 46, 01:11:02, 71-75)
  t7 ( s5, 52, 01:10:15, 70-74)
Output on p1 (sid, window-id, max): ( s6, 70, 48 )
Aggregate state (wid, sid, max):
  before t7: 70 s5 max: 47 … 74 s5 max: 47; 70 s6 max: 48 … 74 s6 max: 48
  after t7: 70 s5 max: 52 … 74 s5 max: 52; 70 s6 max: 48 … 74 s6 max: 48
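
The reason late tuples such as t7 need no special handling can be illustrated with a short sketch: each tuple updates the partial aggregates of its own window-ids, so the state reached before a window is closed does not depend on arrival order. The function name `aggregate` and the in-memory tuple lists are assumptions made for illustration; punctuation handling is left out to keep the example small.

```python
import random

RANGE, SLIDE = 300, 60  # Q1 in seconds


def secs(hhmmss):
    h, m, s = map(int, hhmmss.split(":"))
    return 3600 * h + 60 * m + s


def aggregate(tuples):
    """Running max per (window-id, sid); no tuple is buffered or sorted."""
    state = {}
    for sid, speed, ts in tuples:
        for w in range(ts // SLIDE, (ts + RANGE) // SLIDE):
            key = (w, sid)
            state[key] = max(state.get(key, speed), speed)
    return state


t1 = ("s5", 40, secs("01:06:30"))
t2 = ("s6", 42, secs("01:07:45"))
t3 = ("s5", 45, secs("01:08:15"))
t4 = ("s5", 47, secs("01:10:10"))
t5 = ("s6", 48, secs("01:10:30"))
t6 = ("s6", 46, secs("01:11:02"))
t7 = ("s5", 52, secs("01:10:15"))

in_order = [t1, t2, t3, t4, t7, t5, t6]  # sorted by timestamp
late = [t1, t2, t3, t4, t5, t6, t7]      # t7 arrives after t6, as on the slide
shuffled = in_order[:]
random.Random(0).shuffle(shuffled)       # an arbitrary arrival order

print(aggregate(late)[(70, "s5")])       # 52
print(aggregate(in_order) == aggregate(late) == aggregate(shuffled))  # True
```

Order independence holds here because max is commutative and associative; a punctuation is still needed to decide when a window's result is final.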

19 Outline
  WID overview
  Window semantics definition and its implementation in WID
  Disorder
  Sharing panes – an optimization technique using sub-windows
  Conclusion

20 Sharing Panes
Q3: SELECT sid, count(*) FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
[Figure: windows W1 … W5 laid over panes P1 … P8]

21 Pane Implementation
SELECT sid, count(*) FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
Plan: streamscan → bucket B1 as pane-id (RANGE 1 min, SLIDE 1 min, WATTR ts) → count (*) (group on pane-id, sid) → bucket B2 as window-id (RANGE 4, SLIDE 1, WATTR pane-id) → sum(*) (group on window-id, sid)
Into B1 (sid, speed, ts):
  t1 ( s5, 47, 01:10:10 )
  t2 ( s6, 48, 01:10:30 )
  p1 ( s5, *, 01:11:00 )
Out of B1 (sid, speed, ts, pane-id):
  t1 and t2 with their pane-id attribute added
  p1 ( s5, *, *, 70 )
Out of count (sid, ts, pane-id, count):
  m0 ( s5, 01:10:10, 70, 8 )
Out of B2 (sid, ts, pane-id, count, window-id):
  m0 with its window-id attribute added

22 When are panes better than windows?
SELECT sid, max(*) FROM Traffic [RANGE X rows SLIDE Y rows WATTR row-num] GROUP BY sid
1. Panes are better when cost ratio is less than 1
2. The number of tuples per pane affects whether using panes is better

23 Conclusion and Future Work
Conclusion
  A framework for defining window semantics
  A one pass, non-buffering, disorder-tolerant query evaluation technique
  Initial investigation on disorder
  Sharing panes
Future work
  Disorder-tolerant window join
  Sharing panes among multiple aggregate queries

24 Related Work
  Heartbeat, Sub-aggregation
  Sliding window aggregates
  Slack
  Ordering Update Token, Sub-aggregation