CMU SCS Data Mining Meets Systems: Tools and Case Studies Christos Faloutsos SCS CMU.


CMU SCS PDL 2008 C. Faloutsos #2 Thanks
– Spiros Papadimitriou (CMU -> IBM)
– Mengzhi Wang (CMU -> Google)
– Jimeng Sun (CMU -> IBM)

#3 Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: Large graphs & hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank

#4 Problem #1
Goal: given a signal (e.g., #bytes over time)
Find: patterns, periodicities, and/or compress
[Figure: #bytes vs. time; bytes per 30’ (packets per day; earthquakes per year)]

#5 Problem #1
– model bursty traffic
– generate realistic traces (Poisson does not work)
[Figure: # bytes vs. time, Poisson trace]

#6 Motivation
– predict queue length distributions (e.g., to give probabilistic guarantees)
– “learn” traffic, for buffering, prefetching, ‘active disks’, web servers

#7 Q: any ‘pattern’?
Not Poisson: spike; silence; more spikes; more silence… any rules?
[Figure: # bytes vs. time]

#8 Solution: self-similarity
[Figures: # bytes vs. time]

#9 But:
Q1: How to generate realistic traces; extrapolate?
Q2: How to estimate the model parameters?

#10 Approach
Q1: How to generate a sequence that is
– bursty
– self-similar
– and has similar queue length distributions

#11 Approach
A: ‘binomial multifractal’ [Wang+02]
~ 80-20 ‘law’:
– 80% of bytes/queries etc. on first half
– repeat recursively
b: bias factor (e.g., 80%)

#12-13 binary multifractals
[Figure: 20 / 80 split of the interval, applied recursively]
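The recursive 80/20 split can be sketched in a few lines. This is a minimal illustration rather than the generator from [Wang+02]: the function name `b_model`, its parameters, and the fair-coin choice of which half receives the larger share are assumptions made for the example.

```python
import random

def b_model(total, levels, b=0.8, randomize=True, seed=0):
    """Split `total` into 2**levels time slots using recursive b / (1 - b) splits."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in series:
            big, small = volume * b, volume * (1.0 - b)
            # Assumption: a fair coin decides which half gets the larger share.
            # With randomize=False the larger share always goes to the first half.
            if randomize and rng.random() < 0.5:
                finer.extend((small, big))
            else:
                finer.extend((big, small))
        series = finer
    return series

# Hypothetical usage: 1000 bytes spread over 2**10 time slots with bias b = 0.8.
trace = b_model(total=1000.0, levels=10, b=0.8)
print(len(trace), round(sum(trace), 6), round(max(trace), 3))
```

Every split preserves the total volume, so the trace sums to `total`, while the volume concentrates in a few slots, which is what makes the result bursty at every scale.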

#14-15 Parameter estimation
Q2: How to estimate the bias factor b?
A: MANY ways [Crovella+96]
– Hurst exponent
– variance plot
– even DFT amplitude spectrum! (‘periodogram’)
– More robust: ‘entropy plot’ [Wang+02]
[Wang+02] Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002

#16 Entropy plot
Rationale:
– burstiness: inverse of uniformity
– entropy measures uniformity of a distribution
– find entropy at several granularities, to see whether/how our distribution is close to uniform

#17-18 Entropy plot
Entropy E(n) after n levels of splits:
– n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$, where $p_1$, $p_2$ are the fractions of bytes in the two halves
– n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$, where $p_{2,1}, \dots, p_{2,4}$ are the fractions of bytes in the four quarters
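The definition of E(n) above translates directly into code: compute the entropy at the finest granularity, then merge neighbouring slots pairwise and repeat. The sketch below assumes the series length is a power of two; the name `entropy_plot` is chosen for this example only.

```python
import math

def entropy_plot(series):
    """Return [E(0), E(1), ..., E(n)] for a non-negative series of length 2**n."""
    size = len(series)
    if size == 0 or size & (size - 1) != 0:
        raise ValueError("series length must be a power of two")
    total = float(sum(series))
    entropies = []
    current = list(series)
    while True:
        # Entropy (in bits) of the fraction of volume that falls in each slot.
        fractions = [value / total for value in current]
        entropies.append(-sum(p * math.log2(p) for p in fractions if p > 0))
        if len(current) == 1:
            break
        # Merge neighbouring slots to move one level coarser.
        current = [current[i] + current[i + 1] for i in range(0, len(current), 2)]
    entropies.reverse()  # index k now holds E(k), the entropy after k levels of splits
    return entropies

# Two extreme cases: a uniform series and a single spike.
print(entropy_plot([1, 1, 1, 1, 1, 1, 1, 1]))
print(entropy_plot([8, 0, 0, 0, 0, 0, 0, 0]))
```

For the uniform series E(n) grows by one bit per level, and for the single spike it stays at zero, which correspond to slopes 1 and 0 of the entropy plot.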

#19 Real traffic
Has linear entropy plot (-> self-similar)
[Figure: entropy E(n) vs. # of levels (n); annotated 0.73]

#20 Observation - intuition:
slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit
– uniform dataset: slope = 1
– multi-point: slope = 0
[Figure: entropy E(n) vs. # of levels (n); annotated 0.73]

#21-24 Entropy plot - Intuition
Slope ~ intrinsic dimensionality (in fact, ‘Information fractal dimension’) = info bits per coordinate bit
E.g., Dim = 1: pick a point; reveal its coordinate bit-by-bit. How much info is each bit worth to me?
– Is MSB 0? ‘info’ value = E(1): 1 bit
– Is next MSB = 0? Info value = 1 bit = E(2) - E(1) = slope!

#25-26 Entropy plot
Repeat, for all points at same position (Dim = 0): we need 0 bits of info to determine position -> slope = 0 = intrinsic dimensionality

#27 Entropy plot
Real (and 80-20) datasets can be in-between: bursts, gaps, smaller bursts, smaller gaps, at every scale
[Figure: Dim = 1; Dim = 0; 0 < Dim < 1]

#28 (Fractals, again)
What set of points could have behavior between point and line?

#29-33 Cantor dust
Eliminate the middle third. Recursively!

#34 Cantor dust
Dimensionality? (no length; infinite # points!)
Answer: log 2 / log 3 ≈ 0.63 (each step leaves 2 copies, each scaled down by 1/3)

#35 Some more entropy plots: Poisson vs real
Poisson: slope ~ 1 -> uniformly distributed

#36 B-model
b-model traffic gives a perfectly linear entropy plot
Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$
Fitting: do entropy plot; get slope; solve for b
[Figure: E(n) vs. n]
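The fitting step ("get slope; solve for b") has no closed form, but the slope formula of the lemma decreases monotonically from 1 to 0 as b goes from 0.5 to 1, so a numeric search works. The sketch below uses bisection, which is a choice made for this example; the function names are likewise illustrative.

```python
import math

def b_model_slope(b):
    """Entropy-plot slope of b-model traffic: -b log2 b - (1 - b) log2 (1 - b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)

def fit_bias(slope, tol=1e-12):
    """Find b in [0.5, 1] whose b-model slope matches the measured slope (bisection)."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The slope decreases as b grows, so a too-large slope means b is too small.
        if b_model_slope(mid) > slope:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Round trip: the slope produced by b = 0.8 should map back to b = 0.8.
print(round(b_model_slope(0.8), 4), round(fit_bias(b_model_slope(0.8)), 6))
```

By symmetry, b and 1 - b give the same slope, so the search is restricted to b >= 0.5.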

#37 Experimental setup
– Disk traces (from HP [Wilkes 93])
– web traces from LBL: lbl-conn-7.tar.Z

#38 Model validation
– Linear entropy plots
– Bias factors b:
– smallest b / smoothest: nntp traffic

#39 Web traffic - results
LBL, NCDF of queue lengths (log-log scales)
[Figure: Prob(> l) vs. queue length l]

#40 Conclusions
Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic

#41 Books
Fractals: Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)

#42 Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: Large graphs & hadoop)

#43 Clusters/data center monitoring
– Monitor correlations of multiple measurements
– Automatically flag anomalous behavior
– InteMon: intelligent monitoring system
– warsteiner.db.cs.cmu.edu/demo/intemon.jsp

#44 Publication
Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3): ACM Press, July 2006

#45 Under the hood: SVD
– Singular Value Decomposition
– Done incrementally
Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005, Trondheim, Norway.

#46 Singular Value Decomposition (SVD)
SVD (~LSI ~ KL ~ PCA ~ spectral analysis...)
– LSI: S. Dumais; M. Berry
– KL: e.g., Duda+Hart
– PCA: e.g., Jolliffe
– Details: [Press+]
[Figure: u of CPU1 vs. u of CPU2, with points for t=1, t=2]

#47-49 Singular Value Decomposition (SVD)
SVD (~LSI ~ KL ~ PCA ~ spectral analysis...)
[Figure: u of CPU1 vs. u of CPU2, with points for t=1, t=2]
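To make the CPU1/CPU2 picture concrete, the sketch below finds the direction along which two utilization readings co-vary, i.e. the first right singular vector of the (time x 2) matrix of readings. It is a batch computation using power iteration on the 2x2 matrix A^T A, not the incremental method of the VLDB 2005 paper; the function name and the sample readings are made up for illustration.

```python
import math

def first_singular_direction(rows, iters=100):
    """Dominant right singular vector and singular value of an m x 2 matrix."""
    # Entries of the symmetric 2x2 matrix A^T A = [[a, b], [b, d]].
    a = sum(x * x for x, _ in rows)
    b = sum(x * y for x, y in rows)
    d = sum(y * y for _, y in rows)
    # Assumption: readings are non-negative (utilizations), so the start vector
    # (1, 1) is never orthogonal to the dominant direction.
    v = (1.0, 1.0)
    for _ in range(iters):
        w = (a * v[0] + b * v[1], b * v[0] + d * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    sigma = math.sqrt(a * v[0] ** 2 + 2 * b * v[0] * v[1] + d * v[1] ** 2)
    return v, sigma

# Hypothetical readings (u of CPU1, u of CPU2) at t = 1, 2, 3; CPU2 runs at twice CPU1.
readings = [(0.2, 0.4), (0.5, 1.0), (0.1, 0.2)]
v, sigma = first_singular_direction(readings)
# One "hidden variable" per time tick: the projection of each reading on v.
hidden = [x * v[0] + y * v[1] for x, y in readings]
print(v, sigma, hidden)
```

When the two streams are strongly correlated, a single hidden variable per time tick summarizes both, and a reading that falls far from the direction `v` is a candidate anomaly.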

#50 Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: Large graphs & hadoop)

#51 BGP updates
With
– Aditya Prakash (CMU)
– Michalis Faloutsos (UC Riverside)
– Nicholas Valler (UC Riverside)
– Dave Andersen (CMU)

#52 Tool #0: Time plot
Time Series: #Updates per 600s, Washington Router 09/ /2006

#53 Tool #0: Time plot
– Observation #1: Missing values
– Observation #2: Bursty

#54 Tool #1: Wavelets

#55 Wavelets - DWT
Short window Fourier transform (SWFT)
But: how short should the window be?
[Figure: value vs. time; freq vs. time]

#56 Wavelets - DWT
Answer: multiple window sizes! -> DWT
[Figure: freq vs. time for time domain, DFT, SWFT, DWT]

#57 Haar Wavelets
– subtract sum of left half from right half
– repeat recursively for quarters, eighths, ...
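The Haar recipe above can be followed literally: at each scale, cut the signal into blocks and record (sum of right half) minus (sum of left half). The sketch below is unnormalized and recomputes the sums at every scale for clarity; a production DWT would use the usual pyramid of pairwise averages and differences. The function name and the sample signal are assumptions for this example.

```python
def haar_details(signal):
    """Unnormalized Haar detail coefficients, coarsest scale first."""
    n = len(signal)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("signal length must be a power of two")
    levels = []
    block = n
    while block >= 2:
        half = block // 2
        coeffs = []
        for start in range(0, n, block):
            left = sum(signal[start:start + half])
            right = sum(signal[start + half:start + block])
            coeffs.append(right - left)  # right half minus left half
        levels.append(coeffs)
        block = half  # move on to halves, quarters, eighths, ...
    return levels

# Hypothetical update counts with a single spike at position 4.
for scale in haar_details([1, 1, 1, 1, 9, 1, 1, 1]):
    print(scale)
```

The spike produces a large coefficient at every scale, at the position that covers it, while a flat stretch produces zeros.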

#58 ‘Tornado Plot’ for Washington Router
Dark areas correspond to high energy
[Figure: low freq. to high freq. vs. time]

#59 Tornado Plot: Wavelet Transform for Washington Router 09/ /2006, All coefficients and Detail levels 1-12
Observations:
1. Obvious Spikes (E1): tornados that “touch down”
2. Prolonged Spikes (E2 and E3): when coarser scales have high values but finer scales do not
3. Intermittent Waves (E4 and E5): high-energy entries at nearby scales correspond to local periodic motion

#60 E2: Prolonged Spike
Sustained period of relatively high activity
Magnification of updates on 28th Aug
[Figure: # updates vs. time]

#61 Tool #2: logarithms

#62 Tool #2: logarithms
Prominent ‘clothesline’ at ~ 50 updates per 600 secs.
Culprit IP addresses: / / /24
All from Alabama (Supercomputing Center)!

#63 Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: Large graphs & hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank

#64 Main point
Two-way street:
<- DM can use such infrastructures to find patterns
-> DM can help such systems/networks etc. to become self-healing, self-adjusting, ‘self-*’
Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes

#65 Additional resources
Machine learning classes at SCS/MLD
Tom Mitchell’s book on Machine Learning
– Classification
– Clustering/Anomaly detection
– Support vector machines
– Graphical models
– Bayesian networks

#66 For code, papers etc
WeH 7107 christos cs