Feifei Li, Ching Chang, George Kollios, Azer Bestavros


1 Characterizing and Exploiting Reference Locality in Data Stream Applications
Feifei Li, Ching Chang, George Kollios, Azer Bestavros Computer Science Department Boston University

2 Data Stream Management System
[Diagram: an application submits a query (e.g. joins over two streams) to the Data Stream Management System (DSMS) and receives the result. With limited memory, the query processor selects the tuples that maximize the query metrics; unselected tuples are dropped.]

3 Observations
Storage / computation limitation: the full contents of the tuples of interest cannot be stored in memory. Query processing with a memory constraint can be cast as a “caching” problem.

4 “Caching” Problem in DSMS
Sliding window joins with a fixed window size and a fixed total memory size. Which tuples should be stored to maximize the size of the join result? Locality of reference properties (Denning & Schwartz).

5 Locality-Aware Algorithms
[Figure: our locality-aware algorithms compared with previous algorithms.]

6 Our Contributions
Cast query processing with a memory constraint in DSMS as a “caching” problem and analyze the two causes of reference locality. Provide a mathematical model, and a simple method to infer it, to characterize the reference locality in data streams. Show how to improve the performance of data stream applications with locality-aware algorithms.

7 Reference Locality - Definition
In a data stream, recently appearing tuples have a high probability of appearing again in the near future.

8 Inter Arrival Distance (IAD)
A random variable that corresponds to the number of tuples separating consecutive appearances of the same tuple. [Figure: example stream 2 4 10 7 3 1 with an IAD marked.]

9 Calculate distribution of IAD
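Example: in the stream i a b e c a i, the two appearances of i are $x_n$ and $x_{n+k}$, so the distance is k. The distribution is expressed in terms of $p_i$, the frequency of value i in this data stream.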
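The empirical IAD distribution of a finite trace can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the distance is taken as the index gap between consecutive appearances (so $x_n$ and $x_{n+k}$ are at distance k, as on this slide).

```python
from collections import Counter


def inter_arrival_distances(stream):
    """Return one distance per repeated tuple: the index gap to its previous appearance."""
    last_seen = {}   # value -> index of its most recent appearance
    distances = []
    for index, value in enumerate(stream):
        if value in last_seen:
            distances.append(index - last_seen[value])
        last_seen[value] = index
    return distances


def iad_distribution(stream):
    """Return the empirical distribution {k: fraction of IADs equal to k}."""
    distances = inter_arrival_distances(stream)
    if not distances:
        return {}
    counts = Counter(distances)
    total = len(distances)
    return {k: counts[k] / total for k in sorted(counts)}


if __name__ == "__main__":
    # The example stream from the slide: "a" repeats at gap 4, "i" at gap 6.
    print(iad_distribution(list("iabecai")))
```

First appearances contribute no distance, so a trace of N tuples over D distinct values yields N - D samples.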

10 Sources of Reference Locality
Long-term popularity vs. short-term correlation (web traces, Bestavros and Crovella). For example, in stock traces: MS, IBM, and GG keep recurring because they are popular in the long term (reference locality due to long-term popularity), whereas A recurs in a burst because George’s company A was listed today (reference locality due to short-term correlation).

11 Independent Reference Model
With the independent, identically distributed (IID) assumption, the IAD distribution is determined by the popularity profile alone. Problem: this only captures reference locality due to a skewed popularity profile.
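The formula on this slide did not survive in the transcript. As a hedged reconstruction, the standard derivation under the IID assumption is: a randomly chosen tuple has value i with probability $p_i$, and its next appearance is exactly k positions later when the following k-1 tuples differ from i and the k-th equals i. Whether the slide used exactly this notation is an assumption.

```latex
\Pr[\mathrm{IAD} = k]
  = \sum_{i} p_i \cdot (1 - p_i)^{k-1}\, p_i
  = \sum_{i} p_i^{2}\,(1 - p_i)^{k-1},
  \qquad k \ge 1 .
```

Summing over k gives $\sum_i p_i = 1$, so this is a proper distribution. It depends only on the $p_i$, which is why the IID model cannot see short-term correlation.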

12 Metrics of Reference Locality
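How to distinguish the two causes of reference locality? Take the original data stream S (e.g. A MS A A MS GG GG MS IBM) and a random permutation of S, then compare the IAD distributions of the two. The permutation keeps the popularity profile but destroys any short-term correlation.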
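The comparison can be sketched as below: compute the IAD CDF at some distance k for the original trace and for a shuffled copy. The function names and the synthetic trace are illustrative assumptions, not the authors' code or data.

```python
import random


def inter_arrival_distances(stream):
    """Index gaps between consecutive appearances of the same value."""
    last_seen = {}
    distances = []
    for index, value in enumerate(stream):
        if value in last_seen:
            distances.append(index - last_seen[value])
        last_seen[value] = index
    return distances


def iad_cdf(stream, k):
    """Fraction of inter-arrival distances that are <= k."""
    distances = inter_arrival_distances(stream)
    if not distances:
        return 0.0
    return sum(1 for d in distances if d <= k) / len(distances)


def compare_with_permutation(stream, k, seed=0):
    """Return (CDF at k for the original stream, CDF at k for a random permutation)."""
    rng = random.Random(seed)
    shuffled = list(stream)
    rng.shuffle(shuffled)  # same popularity profile, temporal order destroyed
    return iad_cdf(stream, k), iad_cdf(shuffled, k)


if __name__ == "__main__":
    # Synthetic trace with uniform popularity but strongly bursty arrivals.
    bursty = [v for v in "ABCDEFGH" for _ in range(50)]
    print(compare_with_permutation(bursty, k=1))
```

A large gap between the two CDF values points to short-term correlation. If the two curves nearly coincide, the locality comes mostly from the popularity profile.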

13 Stock Transaction Traces
Daily stock transaction data from INET ATS, Inc. Zipf-like Popularity Profile (log-log scale)

14 Stock Transaction Traces
The randomly permuted trace still has strong reference locality, due to the skewed popularity distribution. [Figure: CDF of IAD for original and randomly permuted traces.]

15 Network OD Flow Traces
Network traces of Origin-Destination (OD) flows in two major networks: US Abilene and Sprint-Europe. Zipf-like popularity profile (log-log scale).

16 Network OD Flow Traces
CDF of IAD for Original and Randomly Permuted Traces

17 Outline
Motivation
Reference Locality: source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model: Max-subset Join, Approximate count estimation, Data summarization
Performance Study
Conclusion

18 Locality-Aware Stream Model
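[Figure: stream S with its recent h tuples $x_{n-h}, \dots, x_{n-1}$ (example values 2 2 4 10 5 10 7 7) and the popularity distribution P of S.] The new tuple $x_n$ may repeat one of the recent h tuples, e.g. $P(x_n = x_{n-4}) = a_4$.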

19 Locality-Aware Stream Model
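[Figure: stream S with its recent h tuples $x_{n-h}, \dots, x_{n-1}$ (example values 2 2 4 10 5 10 7 7) and the popularity distribution P of S.] Alternatively, the new tuple $x_n$ is drawn from the popularity profile, e.g. $P(x_n = 2 \text{ from popularity profile}) = b \cdot p(2)$.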

20 Locality-Aware Stream Model
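$X_n = X_{n-i}$ with probability $a_i$, or $X_n = Y$ with probability $b$, where $1 \le i \le h$, Y is an IID random variable w.r.t. P, and $\sum_{i=1}^{h} a_i + b = 1$. Equivalently, $P(x_n = c \mid x_{n-1}, \dots, x_{n-h}) = b\,p(c) + \sum_{i=1}^{h} a_i\,\delta(x_{n-i}, c)$, where $\delta(x_k, c) = 1$ if $x_k = c$, and 0 otherwise. A similar model appears for caching of web traces, for example Konstantinos Psounis et al.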
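The model can be read as a generator: at each step, either copy one of the h most recent tuples or draw a fresh tuple from the popularity distribution P. The following is a minimal sketch under stated assumptions: `generate_stream` is a hypothetical name, and the warm-up rule (falling back to P when fewer than i tuples exist yet) is an assumption of this sketch, not part of the slides.

```python
import random


def generate_stream(n, a, b, popularity, seed=0):
    """Generate n tuples from the locality-aware model.

    a          -- list [a_1, ..., a_h]; a_i is the probability that x_n = x_{n-i}
    b          -- probability of drawing x_n from the popularity distribution P
    popularity -- dict mapping value -> probability (the distribution P)
    """
    if abs(sum(a) + b - 1.0) > 1e-9:
        raise ValueError("a_1 + ... + a_h + b must equal 1")
    rng = random.Random(seed)
    values = list(popularity)
    weights = [popularity[v] for v in values]
    h = len(a)
    stream = []
    for _ in range(n):
        # Option 0 means "draw Y from P"; option i >= 1 means "copy x_{n-i}".
        option = rng.choices(range(h + 1), weights=[b] + list(a))[0]
        if option == 0 or option > len(stream):
            # Warm-up assumption: not enough history yet, so fall back to P.
            stream.append(rng.choices(values, weights=weights)[0])
        else:
            stream.append(stream[-option])
    return stream


if __name__ == "__main__":
    profile = {"MS": 0.5, "IBM": 0.3, "GG": 0.2}
    print(generate_stream(20, a=[0.4, 0.2], b=0.4, popularity=profile))
```

Setting b = 1 (with all $a_i = 0$) reduces the generator to the independent reference model. A smaller b produces burstier streams.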

21 Infer the Model
Expected value for $x_n$: $E[x_n] = \sum_{i=1}^{h} a_i\,x_{n-i} + b\,E[Y]$. Least-squares method: minimize over $a_1, \dots, a_h, b$ the sum $\sum_{n} \bigl(x_n - \sum_{i=1}^{h} a_i\,x_{n-i} - b\,E[Y]\bigr)^2$. Make N observations and infer the h+1 parameters $a_i$ and $b$.
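A least-squares fit of this kind can be sketched in plain Python by solving the normal equations. This is an illustration under assumptions, not the authors' procedure: tuple values are treated as numbers, the mean of the popularity distribution (`popularity_mean`, standing for E[Y]) is assumed known, the fit is unconstrained, and both function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough variation in the data")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_locality_model(stream, h, popularity_mean):
    """Least-squares estimate of (a_1..a_h, b) from a numeric stream."""
    # One observation per position n >= h: features are x_{n-1}..x_{n-h} and E[Y].
    rows = [[stream[n - i] for i in range(1, h + 1)] + [popularity_mean]
            for n in range(h, len(stream))]
    targets = [stream[n] for n in range(h, len(stream))]
    dim = h + 1
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    moment = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(dim)]
    params = solve_linear_system(gram, moment)
    return params[:h], params[h]


if __name__ == "__main__":
    # Noise-free toy sequence obeying x_n = 0.5 * x_{n-1} + 0.5 * 4.
    toy = [20.0, 12.0, 8.0, 6.0, 5.0, 4.5, 4.25, 4.125]
    print(fit_locality_model(toy, h=1, popularity_mean=4.0))
```

On real traces the estimates are only approximate and are not forced to sum to 1. A constrained solver would be needed to enforce that.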

22 Model on Real Traces- Stock
b: degree of reference locality due to long-term popularity; 1-b: … due to short-term correlation

23 Model on Real Traces- OD Flow

24 Utilizing Model for Prediction
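[Timeline: stream S with $x_{n-h}, \dots, x_{n-1}, x_n, x_{n+1}, x_{n+2}, \dots, x_{n+T}$; the future period has length T.] The expected number of occurrences of a tuple with value e in a future period of length T, $E_T(e)$, can be computed using only T+1 constants calculated from the locality model of S.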

25 Outline
Motivation
Reference Locality: source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model: Max-subset Join, Approximate count estimation, Data summarization
Performance Study
Conclusion

26 Approximate Sliding Window Join
Sliding window joins with a fixed window size and a fixed total memory size. Which tuples should be stored to maximize the size of the join result?

27 Existing Approach
Metric: max-subset. Previous approaches:
Random load shedding: poor performance (J. Kang et al., A. Das et al.)
Frequency model: IID assumption (A. Das et al.)
Age-based model: too strict an assumption (U. Srivastava et al.)
Stochastic model: not universal (J. Xie et al.)

28 Marginal Utility
[Figure: streams S and R with example tuple values 6, 5, 10, 8, 12, positions n-1 and n, and T=5.]

29 Calculate Marginal Utility
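[Figure: streams S and R around tuple index n; future tuples are unknown (marked x) and match with probabilities P1, P2.] Based on the locality model, we can show that the marginal utility is expressed through a function F, where F depends on the characteristic equation of $P_i$, which is a linear recursive sequence.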

30 ELBA Exact Locality-Based Algorithm (ELBA)
Based on the previous analysis, calculate the marginal utility of the tuples in the buffer and evict the victim with the smallest value. Drawback: expensive.
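The eviction step itself is simple once a utility estimate is available. The sketch below shows only that step. The `utility` callable is a hypothetical stand-in for the marginal-utility computation, which the transcript does not preserve, and the scores in the usage example are made up for illustration.

```python
def admit(buffer, capacity, new_tuple, utility):
    """Add new_tuple to buffer; if the buffer overflows, evict the lowest-utility tuple.

    utility -- hypothetical callable mapping a tuple to its estimated marginal utility.
    Returns the evicted tuple, or None if nothing was evicted.
    """
    buffer.append(new_tuple)
    if len(buffer) <= capacity:
        return None
    # Scan every buffered tuple; re-scoring the whole buffer on each arrival
    # is the cost the slide labels as expensive.
    victim_index = min(range(len(buffer)), key=lambda i: utility(buffer[i]))
    return buffer.pop(victim_index)


if __name__ == "__main__":
    scores = {10: 0.9, 8: 0.1, 13: 0.5}   # made-up utilities for illustration
    window = [10, 8]
    print(admit(window, 2, 13, scores.get), window)
```

The newly arrived tuple competes with the buffered ones, so it can itself be the victim when its estimated utility is the smallest.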

31 LBA Locality-Based Algorithm (LBA)
Assume T is fixed and approximate the marginal utility using the predictive power of the locality model. This depends on only T+1 constants that can be pre-computed.

32 Space Complexity
A histogram stores both P over a domain of size D and the T+1 constants. Histogram space usage is polylogarithmic: O(poly[log N]) space for N values (A. Gilbert et al.).

33 Sliding window join: varying buffer size – OD Flow

34 Sliding window join: varying buffer size - Stock

35 Sliding window join: varying window size - stock

36 Conclusion
The reference locality property is important for query processing with a memory constraint in data stream applications. Most real data streams have strong temporal locality, i.e. short-term correlations. How about spatial locality, i.e. correlation among different attributes of the tuple?

37 Thanks!

38 Approximate Count Estimation
Derive a much tighter space bound for the lossy-counting algorithm (G. Manku et al.) using locality-aware techniques. A tight space bound is important, as it tells us how much memory to allocate.
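For context, the baseline lossy-counting algorithm can be sketched as follows. This is the standard textbook form of the algorithm, not the locality-aware analysis from the talk, and the function name is an assumption.

```python
import math


def lossy_count(stream, epsilon):
    """Approximate frequency counts; each count is low by at most epsilon * N."""
    width = math.ceil(1 / epsilon)     # bucket width
    entries = {}                       # value -> [count, maximum possible undercount]
    for n, value in enumerate(stream, start=1):
        bucket = math.ceil(n / width)  # id of the current bucket
        if value in entries:
            entries[value][0] += 1
        else:
            entries[value] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent enough.
            stale = [k for k, (count, delta) in entries.items()
                     if count + delta <= bucket]
            for key in stale:
                del entries[key]
    return {k: count for k, (count, delta) in entries.items()}


if __name__ == "__main__":
    print(lossy_count(list("aabaacaa"), epsilon=0.25))
```

The number of entries kept at any time is what the space bound measures. A bursty stream tends to keep fewer distinct entries alive than a permuted one, which is the intuition behind a tighter bound.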

39 Data Summarization
Define entropy over a window in a data stream using locality-aware techniques, instead of the usual definition of entropy. Important for data summarization, change detection, etc. [Figure: examples labeled 1, 2, 3.]

40 Data Stream Entropy
Data Stream: Locality-Aware Entropy
Uniform IID: 6.19
Permuted Stock Stream: 5.48
Original Stock Stream: 3.32
A higher degree of reference locality implies lower entropy.

