1 CMU SCS Data Mining Meets Systems: Tools and Case Studies Christos Faloutsos SCS CMU

2 CMU SCS PDL 2008C. Faloutsos#2 Thanks Spiros Papadimitriou (CMU->IBM) Mengzhi Wang (CMU->Google) Jimeng Sun (CMU -> IBM)

3 Outline. Problem 1: workload characterization; Problem 2: self-* monitoring; Problem 3: BGP mining; (Problem 4: sensor mining); (Problem 5: large graphs & Hadoop). Tools: fractals, SVD, wavelets, tensors, PageRank.

4 Problem #1. Goal: given a signal (e.g., #bytes over time), find patterns and periodicities, and/or compress it. [Figure: #bytes vs. time, bytes per 30’; similar signals: packets per day, earthquakes per year]

5 Problem #1: model bursty traffic; generate realistic traces (Poisson does not work). [Figure: #bytes vs. time, real trace compared with Poisson]

6 Motivation: predict queue length distributions (e.g., to give probabilistic guarantees); “learn” traffic, for buffering, prefetching, ‘active disks’, web servers.

7 Q: any ‘pattern’? Not Poisson: spike; silence; more spikes; more silence… Any rules? [Figure: #bytes vs. time]

8 Solution: self-similarity. [Figure: #bytes vs. time]

9 But: Q1: How to generate realistic traces, and extrapolate? Q2: How to estimate the model parameters?

10 Approach. Q1: How to generate a sequence that is bursty, is self-similar, and has similar queue length distributions?

11 Approach. A: ‘binomial multifractal’ [Wang+02] ~ 80-20 ‘law’: 80% of bytes/queries etc. fall in the first half; repeat recursively. b: bias factor (e.g., 80%).
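The recursive 80-20 split can be sketched in a few lines of Python. This is a minimal illustration, not the code from [Wang+02]: the function name `b_model_trace`, its parameters, and the random choice of which half receives the heavy share are assumptions made here (pass `shuffle=False` to always put the heavy share in the first half, as the slide describes).

```python
import random


def b_model_trace(total, levels, b=0.8, shuffle=True, seed=0):
    """Split `total` recursively into 2**levels intervals.

    At every split, a fraction b of the volume goes to one half and
    1 - b to the other. Choosing the heavy half at random is an
    assumption of this sketch; shuffle=False always favors the first half.
    """
    rng = random.Random(seed)
    trace = [float(total)]
    for _ in range(levels):
        next_trace = []
        for volume in trace:
            heavy, light = b * volume, (1.0 - b) * volume
            if shuffle and rng.random() < 0.5:
                next_trace.extend([light, heavy])
            else:
                next_trace.extend([heavy, light])
        trace = next_trace
    return trace


if __name__ == "__main__":
    trace = b_model_trace(total=1_000_000, levels=10, b=0.8)
    print(len(trace), "intervals, busiest holds", max(trace), "bytes")
```

Each split conserves the total volume, so the trace sums to `total` while the busiest interval holds a fraction b**levels of it, which is what makes the result bursty at every scale.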

12 Binary multifractals. [Figure: recursive 20/80 split]

15 Parameter estimation. Q2: How to estimate the bias factor b? A: MANY ways [Crovella+96]: Hurst exponent; variance plot; even the DFT amplitude spectrum (‘periodogram’). More robust: ‘entropy plot’ [Wang+02]. Reference: Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos. Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic. ICDE 2002.

16 Entropy plot. Rationale: burstiness is the inverse of uniformity; entropy measures the uniformity of a distribution; so find the entropy at several granularities, to see whether/how our distribution is close to uniform.

17 Entropy plot. Entropy E(n) after n levels of splits. n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$, where $p_1$ and $p_2$ are the fractions of bytes in the first and second half.

18 Entropy plot. Entropy E(n) after n levels of splits. n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$, where $p_{2,1}, \dots, p_{2,4}$ are the fractions of bytes in the four quarters.
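The entropy at each level of splits can be computed directly from a trace whose length is a power of two. The following is a minimal sketch under that assumption; the name `entropy_plot` is chosen here for illustration and is not taken from the slides.

```python
import math


def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(n)] for a trace of length 2**n.

    E(k) is the base-2 entropy of the fractions of total volume that
    fall into each of the 2**k equal-width intervals.
    """
    size = len(trace)
    if size == 0 or size & (size - 1):
        raise ValueError("trace length must be a power of two")
    total = sum(trace)
    levels = size.bit_length() - 1
    entropies = []
    for level in range(levels + 1):
        width = size >> level  # samples per interval at this level
        entropy = 0.0
        for start in range(0, size, width):
            p = sum(trace[start:start + width]) / total
            if p > 0:  # empty intervals contribute nothing
                entropy -= p * math.log2(p)
        entropies.append(entropy)
    return entropies


if __name__ == "__main__":
    print(entropy_plot([1] * 8))                    # uniform trace
    print(entropy_plot([8, 0, 0, 0, 0, 0, 0, 0]))   # all volume in one slot
    print(entropy_plot([64, 16, 16, 4]))            # two levels of 80-20 splits
```

A uniform trace gains one bit of entropy per level, a trace with all its volume in a single slot gains none, and an 80-20 trace gains the same fractional amount at every level.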

19 Real traffic has a linear entropy plot (-> self-similar). [Figure: entropy E(n) vs. # of levels n; slope 0.73]

20 Observation - intuition: slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit. Uniform dataset: slope = 1; multi-point (all points at the same position): slope = 0. [Figure: entropy E(n) vs. # of levels n; slope 0.73]

21 Entropy plot - intuition. Slope ~ intrinsic dimensionality (in fact, the ‘information fractal dimension’) = info bits per coordinate bit. E.g., Dim = 1: pick a point and reveal its coordinate bit by bit. How much info is each bit worth to me?

22 Entropy plot. Dim = 1: “Is the MSB 0?” The ‘info’ value of the answer is E(1) = 1 bit.

23 Entropy plot. Dim = 1: “Is the MSB 0?” Then: “Is the next MSB 0?”

24 Entropy plot. Dim = 1: the info value of the second answer is again 1 bit = E(2) - E(1) = slope!

26 Entropy plot. Repeat, for all points at the same position (Dim = 0): we need 0 bits of info to determine the position -> slope = 0 = intrinsic dimensionality.

27 Entropy plot. Real (and 80-20) datasets can be in between: bursts, gaps, smaller bursts, smaller gaps, at every scale. [Figure: Dim = 1; Dim = 0; 0 < Dim < 1]

28 (Fractals, again) What set of points could have behavior between a point and a line?

29 Cantor dust: eliminate the middle third, recursively!

34 Cantor dust. Dimensionality? (No length; infinite # of points!) Answer: $\log 2 / \log 3 \approx 0.63$.
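The answer follows from counting how many pieces survive at each scale. A short worked derivation, using the standard box-counting definition of dimension (the symbols $N$, $\varepsilon$ and $k$ are introduced here and do not appear on the slides):

```latex
% After k rounds of removing middle thirds, the set consists of
% N = 2^k intervals, each of length \varepsilon = 3^{-k}.
\[
  D \;=\; \lim_{k \to \infty} \frac{\log N}{\log (1/\varepsilon)}
    \;=\; \lim_{k \to \infty} \frac{\log 2^{k}}{\log 3^{k}}
    \;=\; \frac{\log 2}{\log 3} \;\approx\; 0.631 .
\]
```

The value lies strictly between 0 (a point) and 1 (a line), which is the in-between behavior asked for.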

35 Some more entropy plots: Poisson vs. real. Poisson: slope ~ 1 -> uniformly distributed. [Figure: entropy plots with slopes 1 and 0.73]

36 B-model. b-model traffic gives a perfectly linear entropy plot. Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$. Fitting: do the entropy plot; get the slope; solve for b. [Figure: E(n) vs. n]
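The fitting step (get the slope, then solve the lemma for b) can be sketched as below. This is one possible way to do it, not necessarily the procedure used in [Wang+02]: the least-squares fit and the bisection solver are choices made here, and the function names are hypothetical. The solver returns the root with b >= 0.5, since b and 1 - b give the same slope.

```python
import math


def binary_entropy(b):
    """Slope of the b-model entropy plot: -b log2 b - (1-b) log2 (1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def fit_slope(entropies):
    """Least-squares slope of E(n) against n = 0, 1, 2, ..."""
    count = len(entropies)
    mean_x = (count - 1) / 2.0
    mean_y = sum(entropies) / count
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(entropies))
    den = sum((x - mean_x) ** 2 for x in range(count))
    return num / den


def solve_bias(slope, tol=1e-12):
    """Find b in [0.5, 1] whose binary entropy equals `slope` (0 <= slope <= 1).

    Binary entropy decreases monotonically on [0.5, 1], so bisection works.
    """
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid  # entropy still too high: b must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(solve_bias(0.73))  # bias factor for the slope quoted for real traffic
```

For the slope of 0.73 quoted for real traffic, the recovered bias factor falls just below 0.8, consistent with the 80-20 picture.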

37 Experimental setup: disk traces (from HP [Wilkes 93]); web traces from LBL, http://repository.cs.vt.edu/ lbl-conn-7.tar.Z

38 Model validation: linear entropy plots; bias factors b: 0.6-0.8; smallest b / smoothest: nntp traffic.

39 Web traffic - results. LBL, NCDF of queue lengths (log-log scales). [Figure: Prob(> l) vs. queue length l]

40 Conclusions: multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic.

41 Books. Fractals: Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman and Company, 1991. (Probably the BEST book on fractals!)

42 Outline. Problem 1: workload characterization; Problem 2: self-* monitoring; Problem 3: BGP mining; (Problem 4: sensor mining); (Problem 5: large graphs & Hadoop).

43 Clusters/data center monitoring: monitor correlations of multiple measurements; automatically flag anomalous behavior. InteMon: intelligent monitoring system (warsteiner.db.cs.cmu.edu/demo/intemon.jsp).

44 Publication: Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006.

45 Under the hood: SVD (Singular Value Decomposition), done incrementally. Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos. Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005, Trondheim, Norway.

46 Singular Value Decomposition (SVD). SVD (~ LSI ~ KL ~ PCA ~ spectral analysis...). LSI: S. Dumais; M. Berry. KL: e.g., Duda+Hart. PCA: e.g., Jolliffe. Details: [Press+]. [Figure: scatter plot, u of CPU1 vs. u of CPU2, one point per time tick (t=1, t=2, ...)]
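To make the picture concrete, the sketch below finds the dominant direction of a small two-column dataset and measures how far a new reading falls from it. Several things here are assumptions: reading "u" as utilization, the made-up sample values, and the function names. It is also a batch power iteration, not the incremental algorithm of the VLDB 2005 paper.

```python
import math


def first_right_singular_vector(rows, iterations=100):
    """Power iteration on A^T A for the dominant right singular vector of A."""
    dims = len(rows[0])
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        # proj = A v, then w = A^T proj
        proj = [sum(r[j] * v[j] for j in range(dims)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_norm(row, v):
    """Distance of `row` from the line spanned by the unit vector v."""
    proj = sum(x * y for x, y in zip(row, v))
    return math.sqrt(sum((x - proj * y) ** 2 for x, y in zip(row, v)))


if __name__ == "__main__":
    # Hypothetical readings: CPU2 utilization is always twice CPU1's.
    readings = [(0.1, 0.2), (0.2, 0.4), (0.3, 0.6), (0.4, 0.8)]
    v = first_right_singular_vector(readings)
    print("dominant direction:", v)
    print("normal reading residual:", residual_norm((0.25, 0.5), v))
    print("odd reading residual:", residual_norm((0.9, 0.1), v))
```

A reading that follows the usual correlation has a residual near zero, while one that breaks it has a large residual, which is one simple way correlated measurements can be used to flag anomalous behavior.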

50 Outline. Problem 1: workload characterization; Problem 2: self-* monitoring; Problem 3: BGP mining; (Problem 4: sensor mining); (Problem 5: large graphs & Hadoop).

51 BGP updates. With Aditya Prakash (CMU), Michalis Faloutsos (UC Riverside), Nicholas Valler (UC Riverside), Dave Andersen (CMU).

52 Tool #0: Time plot. Time series: #updates per 600 s, Washington router, 09/2004-09/2006.

53 Tool #0: Time plot. Observation #1: missing values. Observation #2: bursty.

54 Tool #1: Wavelets

55 Wavelets - DWT. Short window Fourier transform (SWFT). But: how short should the window be? [Figure: value vs. time; frequency vs. time]

56 Wavelets - DWT. Answer: multiple window sizes! -> DWT. [Figure: frequency vs. time for the time domain, DFT, SWFT and DWT]

57 Haar wavelets: subtract the sum of the left half from the sum of the right half; repeat recursively for quarters, eighths, ...
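The recursive half/quarter/eighth differences can be computed bottom-up with pairwise sums and differences. The sketch below uses an unnormalized Haar transform to match the wording of the slide; the function name and the return layout (overall sum, then detail coefficients from finest to coarsest scale) are choices made here.

```python
def haar_transform(signal):
    """Unnormalized Haar transform of a signal of length 2**k.

    Returns (total, details), where details[0] holds the finest-scale
    differences and details[-1] holds the single coarsest coefficient:
    sum of the right half minus sum of the left half.
    """
    size = len(signal)
    if size == 0 or size & (size - 1):
        raise ValueError("signal length must be a power of two")
    current = list(signal)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        # right neighbor minus left neighbor, at the current scale
        details.append([current[i + 1] - current[i] for i in pairs])
        # pairwise sums become the signal for the next, coarser scale
        current = [current[i] + current[i + 1] for i in pairs]
    return current[0], details


if __name__ == "__main__":
    total, details = haar_transform([0, 0, 0, 9, 0, 0, 0, 0])
    print(total, details)
```

A single spike produces a nonzero coefficient at every scale, while a smooth ramp produces small fine-scale coefficients and larger coarse ones.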

58 ‘Tornado plot’ for the Washington router: dark areas correspond to high energy. [Figure: frequency (low to high) vs. time]

59 Tornado plot: wavelet transform for the Washington router, 09/2004-09/2006, all coefficients and detail levels 1-12. Observations: 1. Obvious spikes (E1): tornados that “touch down”. 2. Prolonged spikes (E2 and E3): coarser scales have high values but finer scales do not. 3. Intermittent waves (E4 and E5): high-energy entries at nearby scales correspond to local periodic motion.

60 E2: Prolonged spike. Sustained period of relatively high activity. Magnification of updates on 28th Aug. 2005. [Figure: # updates vs. time]

61 Tool #2: logarithms

62 Tool #2: logarithms. Prominent ‘clothesline’ at ~50 updates per 600 secs. Culprit IP addresses: 192.211.42.0/24, 216.109.38.0/24, 207.157.115.0/24. All from Alabama (Supercomputing Center)!

63 Outline. Problem 1: workload characterization; Problem 2: self-* monitoring; Problem 3: BGP mining; (Problem 4: sensor mining); (Problem 5: large graphs & Hadoop). Tools: fractals, SVD, wavelets, tensors, PageRank.

64 Main point: two-way street. <- DM can use such infrastructures to find patterns. -> DM can help such systems/networks etc. to become self-healing, self-adjusting, ‘self-*’. Hot topic in data mining: finding patterns in tera- and peta-bytes.

65 Additional resources: machine learning classes at SCS/MLD; Tom Mitchell’s book on Machine Learning. Topics: classification; clustering/anomaly detection; support vector machines; graphical models; Bayesian networks.

66 www.cs.cmu.edu/~christos for code, papers, etc. WeH 7107, christos cs

