
1 Semantics and Evaluation Techniques for Window Aggregates in Data Streams
Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, Peter Tucker
This work was supported by NSF grant IIS 0086002

2 Motivation
Q1: “For every minute, find the max speed of the past 5 minutes for each sensor.”
Input stream Traffic (sid, speed, ts (hh:mm:ss)), produced by traffic sensors:
t1 (s5, 40, 01:06:30)
t2 (s6, 42, 01:07:45)
t3 (s5, 45, 01:08:15)
t4 (s5, 47, 01:10:10)
t5 (s6, 48, 01:10:30)
t6 (s6, 46, 01:11:02)
Windows: 01:06:00 – 01:11:00, 01:07:00 – 01:12:00, 01:08:00 – 01:13:00
Output (sid, max): (s5, 47), (s6, 48)

3 Limitations
- Window semantics definition and implementation
  - Assumptions on data arrival order
  - Data arrival affects query answer and result production
- Query evaluation performance
  - Space: Internal buffer space to hold a window
  - Time: Tuple access – each tuple is accessed multiple times
  - Latency: Window aggregate computation is tied to window completion

4 Outline
- WID overview
- Window semantics definition and its implementation in WID
- Disorder
- Sharing panes – an optimization technique using sub-windows (panes)
- Conclusion

5 WID Overview
Q1: SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
A punctuation is a message embedded in the data to indicate the end of a sub-stream.
Input (sid, speed, ts) and the same items after tagging with window-id (sid, speed, ts, window-id):
t1 (s5, 40, 01:06:30) → t1 (s5, 40, 01:06:30, 70-74)
t2 (s6, 42, 01:07:45) → t2 (s6, 42, 01:07:45, 70-74)
t3 (s5, 45, 01:08:15) → t3 (s5, 45, 01:08:15, 70-74)
t4 (s5, 47, 01:10:10) → t4 (s5, 47, 01:10:10, 70-74)
t5 (s6, 48, 01:10:30) → t5 (s6, 48, 01:10:30, 70-74)
p1 (s6, *, 01:11:00) → p1 (s6, *, *, 70)
t6 (s6, 46, 01:11:02) → t6 (s6, 46, 01:11:02, 71-75)
Aggregate state (wid, sid, max) as items arrive:
after t1: wids 70 … 74, s5, max 40
after t2: wids 70 … 74, s6, max 42
after t3: wids 70 … 74, s5, max 45
after t4: wids 70 … 74, s5, max 47
after t5: wids 70 … 74, s6, max 48
after p1 and t6: wids 71 … 74, s6, max 48; wid 75, s6, max 46
Output (sid, window-id, max): (s6, 70, 48)

6 Window Semantics Framework
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
- windows: (T, S) → W
  Defines the set of window-ids to be used
- extent: (T, S, w) → U ⊆ T, where w ∈ W
  Specifies which tuples belong to a given window
- wids: (T, S, t) → V ⊆ W, where t ∈ T
  Determines the set of window-ids to which a tuple belongs
  Is the dual of extent

7 Defining Window Semantics - sliding window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
windows(T, S[RANGE, SLIDE, WATTR]) = {0, 1, 2, …}
extent(w, T, S[RANGE, SLIDE, WATTR]) = { t ∈ T | (w+1) * SLIDE − RANGE ≤ t.WATTR < (w+1) * SLIDE }
wids(t, T, S[RANGE, SLIDE, WATTR]) = { w ∈ W | t.WATTR / SLIDE − 1 < w ≤ (t.WATTR + RANGE) / SLIDE − 1 }
Q1: SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
For Q1, with RANGE and SLIDE in minutes:
windows(T, S[5, 1, ts]) = {0, 1, 2, …}
extent(w, T, S[5, 1, ts]) = { t ∈ T | (w+1) * 1 − 5 ≤ t.ts < (w+1) * 1 }
wids(t, T, S[5, 1, ts]) = { w ∈ W | t.ts / 1 − 1 < w ≤ (t.ts + 5) / 1 − 1 }
For t4 (s5, 47, 01:10:10), where t4.ts is 01:10:10 ≈ 70.17 minutes:
wids(t4, T, S[5, 1, ts]) = { w ∈ W | t4.ts / 1 − 1 < w ≤ (t4.ts + 5) / 1 − 1 } = { w ∈ W | 69.17 < w ≤ 74.17 } = { w ∈ W | 70 ≤ w ≤ 74 }
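
The sliding-window definitions above can be checked with a small sketch. This is an illustration, not the authors' code: it assumes timestamps are integer seconds since midnight (so window-ids count minutes, matching 01:10:10 ≈ 70.17), and the function names `wids` and `in_extent` are chosen here to mirror the slide. With integers, the strict lower bound ts / SLIDE − 1 < w reduces to w ≥ ts // SLIDE.

```python
def wids(ts, range_, slide):
    """Window-ids w with ts/slide - 1 < w <= (ts + range_)/slide - 1.

    All arguments are non-negative integers in the same unit (an assumption
    of this sketch), so floor division gives the exact integer bounds.
    """
    lo = ts // slide                      # smallest w strictly above ts/slide - 1
    hi = (ts + range_) // slide - 1       # largest w allowed by the upper bound
    return list(range(lo, hi + 1))


def in_extent(w, ts, range_, slide):
    """True when (w+1)*slide - range_ <= ts < (w+1)*slide."""
    return (w + 1) * slide - range_ <= ts < (w + 1) * slide


# t4 = (s5, 47, 01:10:10): 1 h 10 min 10 s = 4210 s; RANGE 5 min, SLIDE 1 min.
t4_ts = 1 * 3600 + 10 * 60 + 10
print(wids(t4_ts, 300, 60))               # [70, 71, 72, 73, 74]

# wids is the dual of extent: w is in wids(t) exactly when t is in extent(w).
assert all(
    (w in wids(t4_ts, 300, 60)) == in_extent(w, t4_ts, 300, 60)
    for w in range(60, 90)
)
```

Computing `wids` needs only the tuple's own windowing attribute, which is why a tuple can be tagged as soon as it arrives.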

8 Window Semantics Implementation in WID – sliding window
SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
Plan: streamscan → bucket (RANGE 5 minutes, SLIDE 1 minute, WATTR ts) → max (group on window-id, sid)
streamscan output (sid, speed, ts): t1 (s5, 40, 01:06:30); p1 (s5, *, 01:11:00)
bucket output (sid, speed, ts, window-id): t1 (s5, 40, 01:06:30, 70-74); p1 (s5, *, *, 70)
max output (sid, window-id, max): (s5, 70, 40)
1. Bucket implements the wids function
2. Bucket for sliding windows is stateless
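
A minimal sketch of the bucket-then-aggregate plan follows. It is not the authors' implementation: the class name `WidMax`, the integer-second timestamps, and the reading of a punctuation (sid, *, ts) as "no more tuples for this sid with a timestamp before ts" are assumptions made for illustration. The bucket step is a pure function of the tuple, and the aggregate keeps one running max per (window-id, sid).

```python
def bucket(ts, range_, slide):
    """Stateless bucket: window-ids for a timestamp (the wids function)."""
    return range(ts // slide, (ts + range_) // slide)


class WidMax:
    """One-pass max grouped on (window-id, sid); no tuples are buffered."""

    def __init__(self, range_, slide):
        self.range_, self.slide = range_, slide
        self.state = {}                       # (window_id, sid) -> running max

    def on_tuple(self, sid, speed, ts):
        for w in bucket(ts, self.range_, self.slide):
            key = (w, sid)
            self.state[key] = max(speed, self.state.get(key, speed))

    def on_punctuation(self, sid, ts):
        """Emit and drop every window of `sid` that ends at or before `ts`."""
        last_closed = ts // self.slide - 1    # window w ends at (w+1)*slide
        done = sorted(k for k in self.state if k[1] == sid and k[0] <= last_closed)
        return [(sid, w, self.state.pop((w, sid))) for (w, _) in done]


agg = WidMax(range_=300, slide=60)
agg.on_tuple("s5", 47, 4210)                  # t4, 01:10:10
agg.on_tuple("s6", 48, 4230)                  # t5, 01:10:30
print(agg.on_punctuation("s6", 4260))         # p1, 01:11:00 -> [('s6', 70, 48)]
agg.on_tuple("s6", 46, 4262)                  # t6, 01:11:02, tagged 71-75
```

Results are produced when a punctuation arrives rather than when a later tuple happens to show up, so a gap in the stream does not have to stall output.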

9 Defining Window Semantics - partitioned window
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
Q2: SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
windows(T, S[RANGE, SLIDE, row-num, PATTR]) = { (i, p) | i ∈ {0, 1, 2, …}, p ∈ T.PATTR }
extent((i, p), T, S[RANGE, SLIDE, row-num, PATTR]) = { t ∈ T | t.PATTR = p, (i+1) * SLIDE − RANGE ≤ rank(t.row-num, PATTR, T) < (i+1) * SLIDE }

10 Defining Window Semantics - partitioned window (cont.)
T: the set of all tuples in the input stream
S: a window specification
W: a set of window-ids
Q2: SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
wids(t, T, S[RANGE, SLIDE, row-num, PATTR]) = { (i, p) ∈ W | t.PATTR = p, r / SLIDE − 1 < i ≤ (r + RANGE) / SLIDE − 1 }
where r = rank(t, row-num, PATTR, T)

11 Window Semantics Implementation in WID – partitioned window
SELECT sid, max(speed) FROM Traffic [RANGE 1000 rows SLIDE 100 rows WATTR row-num PATTR sid]
Plan: streamscan → bucket (RANGE 1000 rows, SLIDE 100 rows, WATTR row-num, PATTR sid) → max(speed) (group on window-id, sid)
streamscan output (sid, speed, row-num): t1 (s5, 47, 507)
bucket output (sid, window-id, speed, row-num): t1 (s5, 3-12, 47, 507); p1 (s5, 3, *, *)
max output (sid, window-id, max): (s5, 3, 47)
1. Bucket generates punctuations
2. Bucket for partitioned windows maintains state (a count for each partition)
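
The stateful bucket for partitioned, row-based windows can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the class name `PartitionedBucket` is hypothetical, rank is taken to be the 0-based position of a tuple within its partition, and a window is reported as closed once its last row has been seen. Under the wids formula, the 3-12 range shown for t1 corresponds to a rank between 300 and 399 within partition s5.

```python
from collections import defaultdict


class PartitionedBucket:
    """Stateful bucket for a row-based window partitioned on one attribute."""

    def __init__(self, range_rows, slide_rows):
        self.range_rows = range_rows
        self.slide_rows = slide_rows
        self.seen = defaultdict(int)          # partition value -> tuples seen so far

    def tag(self, partition):
        """Return (window-ids for this tuple, window closed by it or None)."""
        r = self.seen[partition]              # assumed 0-based rank within the partition
        self.seen[partition] += 1
        lo = r // self.slide_rows             # r/SLIDE - 1 < i
        hi = (r + self.range_rows) // self.slide_rows - 1   # i <= (r+RANGE)/SLIDE - 1
        ids = [(i, partition) for i in range(lo, hi + 1)]
        closed = None
        if self.seen[partition] % self.slide_rows == 0:
            # Window i holds ranks below (i+1)*SLIDE, so it is now complete;
            # this is where the bucket would generate a punctuation.
            closed = (self.seen[partition] // self.slide_rows - 1, partition)
        return ids, closed


bucket = PartitionedBucket(range_rows=1000, slide_rows=100)
for _ in range(99):
    bucket.tag("s5")
ids, closed = bucket.tag("s5")                # the 100th s5 tuple, rank 99
print(ids[0], ids[-1], closed)                # (0, 's5') (9, 's5') (0, 's5')
```

The only state is one counter per partition value, which is what distinguishes this bucket from the stateless time-based one.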

12 WID Advantages
- Window semantics definition
  - Separated from physical implementation and data arrival order
  - Flexible – covers varieties of windows, e.g., sliding, tumbling, landmark, time-based, tuple-based; allows a user-specified windowing attribute
- Implementation of query evaluation
  - Window semantics localized in Bucket
  - Insensitive to data arrival order
  - Punctuation can guarantee progress
  - Gaps in tuple arrival need not affect result production
  - Performance gains in space, execution time and latency

13 WID vs. Buffering – execution time comparison (overview)

14 WID vs. Buffering – execution time comparison (zoom-in)

15 Outline
- WID overview
- Window semantics definition and its implementation in WID
- Disorder
- Sharing panes – an optimization technique using sub-windows
- Conclusion

16 Sources of Disorder
- Merging different data sources
- Various network transmission delays
- Data prioritization
- Query processing algorithms, e.g., shared window joins [Hammad, et al.]
- Multiple possible windowing attributes, e.g., two timestamps

17 Handling Disorder
- Generally dealt with by buffering
  - Slack – BSort in Aurora
  - Output buffering in a shared-window join
- Punctuation + Window-id
- Heartbeat

18 Disorder Handling - WID
Q1: SELECT sid, max(speed) FROM Traffic [RANGE 5 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
Input (sid, speed, ts) and bucket output (sid, speed, ts, window-id); t7 is out of order, carrying an earlier timestamp than t6:
t6 (s6, 46, 01:11:02) → t6 (s6, 46, 01:11:02, 71-75)
p1 (s6, *, 01:11:00) → p1 (s6, *, *, 70)
t7 (s5, 52, 01:10:15) → t7 (s5, 52, 01:10:15, 70-74)
Aggregate state (wid, sid, max):
before p1: wids 70 … 74, s5, max 47; wids 70 … 74, s6, max 48
after p1 and t6: wids 71 … 74, s6, max 48; wid 75, s6, max 46
after t7: wids 70 … 74, s5, max 52
Output (sid, window-id, max): (s6, 70, 48)
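
The effect of a late tuple can be illustrated with a short sketch. It assumes integer-second timestamps and a hypothetical helper `window_max`, and it leaves punctuations out: the punctuation shown on the slide concerns s6 only, so the s5 windows are still open when t7 arrives. Because every tuple is folded into the state of its own window-ids, the final state does not depend on arrival order.

```python
RANGE, SLIDE = 300, 60                        # Q1: 5 minutes and 1 minute, in seconds


def wids(ts):
    """Window-ids for a timestamp under Q1's window specification."""
    return range(ts // SLIDE, (ts + RANGE) // SLIDE)


def window_max(tuples):
    """Fold (sid, speed, ts) tuples into a (window-id, sid) -> max table."""
    state = {}
    for sid, speed, ts in tuples:
        for w in wids(ts):
            key = (w, sid)
            state[key] = max(speed, state.get(key, speed))
    return state


t4 = ("s5", 47, 4210)                         # 01:10:10
t5 = ("s6", 48, 4230)                         # 01:10:30
t6 = ("s6", 46, 4262)                         # 01:11:02
t7 = ("s5", 52, 4215)                         # 01:10:15, arrives late

in_order = window_max([t4, t7, t5, t6])       # sorted by timestamp
late = window_max([t4, t5, t6, t7])           # t7 arrives after t6

assert in_order == late                       # arrival order does not change the answer
print(late[(70, "s5")])                       # 52
```

No sorting buffer is involved: the late tuple simply updates the windows it is tagged with, as long as no punctuation has closed them for its sensor.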

19 Outline
- WID overview
- Window semantics definition and its implementation in WID
- Disorder
- Sharing panes – an optimization technique using sub-windows
- Conclusion

20 Sharing Panes
[Figure: windows W1 … W5 drawn over panes P1 … P8]
Q3: SELECT sid, count(*) FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts] GROUP BY sid

21 Pane Implementation
SELECT sid, count(*) FROM Traffic [RANGE 4 minutes SLIDE 1 minute WATTR ts] GROUP BY sid
Plan: streamscan → bucket B1 as pane-id (RANGE 1 min, SLIDE 1 min, WATTR ts) → count(*) (group on pane-id, sid) → bucket B2 as window-id (RANGE 4, SLIDE 1, WATTR pane-id) → sum(*) (group on window-id, sid)
streamscan output (sid, speed, ts): t1 (s5, 47, 01:10:10); t2 (s6, 48, 01:10:30); p1 (s5, *, 01:11:00)
B1 output (sid, speed, ts, pane-id): t1 (s5, 47, 01:10:10, 70-70); t2 (s6, 48, 01:10:30, 70-70); p1 (s5, *, *, 70)
count(*) output (sid, ts, pane-id, count): m0 (s5, 01:10:10, 70, 8)
B2 output (sid, ts, pane-id, count, window-id): m0 (s5, 01:10:10, 70, 8, 70-74)
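
The two-level plan can be sketched as a pane-level count followed by a window-level sum. This is an illustrative sketch, not the authors' code: the function names are hypothetical, timestamps are integer seconds, and pane-ids are taken to be minute numbers. Applying the sliding-window wids formula with RANGE 4 and SLIDE 1 to a pane-id p gives the four window-ids p, p+1, p+2 and p+3.

```python
from collections import Counter

PANE = 60                                     # pane length in seconds (RANGE = SLIDE = 1 min)
PANES_PER_WINDOW = 4                          # window RANGE / pane length


def pane_counts(tuples):
    """First level: count(*) grouped on (pane-id, sid)."""
    counts = Counter()
    for sid, ts in tuples:
        counts[(ts // PANE, sid)] += 1
    return counts


def window_counts(per_pane):
    """Second level: sum the pane counts grouped on (window-id, sid)."""
    totals = Counter()
    for (pane, sid), n in per_pane.items():
        for w in range(pane, pane + PANES_PER_WINDOW):
            totals[(w, sid)] += n
    return totals


def direct_counts(tuples):
    """Reference: tag every tuple with its window-ids and count directly."""
    totals = Counter()
    for sid, ts in tuples:
        for w in range(ts // PANE, (ts + PANES_PER_WINDOW * PANE) // PANE):
            totals[(w, sid)] += 1
    return totals


stream = [("s5", 4210), ("s5", 4215), ("s6", 4230), ("s5", 4262)]
via_panes = window_counts(pane_counts(stream))
assert via_panes == direct_counts(stream)     # same answer either way
print(via_panes[(71, "s5")])                  # 3
```

With panes, each input tuple updates a single pane entry, and the per-window work is done once per pane aggregate instead of once per tuple. This works for count because pane counts can be summed; aggregates that cannot be combined from partial results would need a different approach.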

22 When are panes better than windows?
SELECT sid, max(*) FROM Traffic [RANGE X rows SLIDE Y rows WATTR row-num] GROUP BY sid
1. Panes are better when the cost ratio is less than 1
2. The number of tuples per pane affects whether using panes is better

23 Conclusion and Future Work
- Conclusion
  - A framework for defining window semantics
  - A one-pass, non-buffering, disorder-tolerant query evaluation technique
  - Initial investigation on disorder
  - Sharing panes
- Future work
  - Disorder-tolerant window join
  - Sharing panes among multiple aggregate queries

24 Related Work
- STREAM@Stanford: Heartbeat, Sub-aggregation
- TelegraphCQ@Berkeley: Sliding window aggregates
- Aurora&Borealis@Brown&MIT&Brandeis: Slack
- Gigascope@AT&T: Ordering Update Token, Sub-aggregation

