Fast Approximate Correlation for Massive Time-Series Data (SIGMOD'10)
Abdullah Mueen (UC Riverside), Suman Nath (Microsoft Research), Jie Liu (Microsoft Research)

Context
Data center monitoring apps collect a large number of performance counters (e.g., CPU utilization).
– MS DataGarage: counters/machine/15 sec
– 100K machines → millions of signals, 1 TB/day
– Signals are archived on disk, over years
General goal: mine the data to discover trends and anomalies over many months.
Challenge: the large number of signals.

Our Focus: Correlation Queries
All-pairs correlation / correlation matrix
– Is CPU utilization correlated with the number of connected users?
– Do two machines in the same rack have similar power usage?
– The correlation matrix can reveal clusters of servers.
– Successive matrices can reveal sudden changes in behavior.
Given n signals of length m, find the n x n correlation matrix (Pearson correlation coefficient).
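For concreteness, here is a minimal numpy sketch (illustrative only, not the paper's implementation) of the exact query: given an n x m array X with one signal per row, compute the n x n Pearson correlation matrix. The z-normalized formulation is the one the later slides build on.

```python
import numpy as np

def correlation_matrix(X: np.ndarray) -> np.ndarray:
    """X: (n, m) array, one signal per row; returns the n x n Pearson matrix."""
    return np.corrcoef(X)

def correlation_matrix_znorm(X: np.ndarray) -> np.ndarray:
    """Equivalent formulation via z-normalization:
    corr(x, y) = (1/m) * sum(x_hat * y_hat), with x_hat = (x - mean(x)) / std(x)."""
    m = X.shape[1]
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return (Z @ Z.T) / m
```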

Challenges
Naïve approach:
– Cache signals in batches that fit in available memory.
– Correlate intra-batch signals.
– Correlate inter-batch signals.
Problems:
– Quadratic I/O cost: 40 minutes (for n = 10K signals and a cache of n/32 signals)
– Quadratic CPU cost: 93 minutes (for the above setup)
– Not easily parallelizable
(Figure: batches on disk, batches in memory, and a swap slot.)
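The quadratic costs come from the batch structure itself. A hedged sketch of the naïve loop (the helpers load_batch and correlate are hypothetical) makes the quadratic number of batch loads explicit.

```python
def naive_all_pairs(num_batches, load_batch, correlate):
    """Cache signals in batches; correlate intra-batch, then inter-batch pairs.
    load_batch(i) reads batch i from disk; correlate(A, B) returns correlations
    for all pairs drawn from A and B (both names are hypothetical)."""
    results = {}
    for i in range(num_batches):
        batch_i = load_batch(i)                          # disk read
        results.update(correlate(batch_i, batch_i))      # intra-batch pairs
        for j in range(i + 1, num_batches):
            batch_j = load_batch(j)                      # O(B^2) loads in total
            results.update(correlate(batch_i, batch_j))  # inter-batch pairs
    return results
```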

Our Approaches: Approximations
Threshold Correlation Matrix
– Discard correlations lower than a threshold.
Approximate Threshold Correlation Matrix
– Output within ϵ of the true correlation.
Boolean Threshold Correlation Matrix
– Output 0/1 values indicating correlated/uncorrelated pairs.
(Figure: exact matrix vs. Threshold Correlation Matrix (T = 0.5), Approximate Threshold Correlation Matrix (ϵ = 0.04), and Boolean Threshold Correlation Matrix (T = 0.5).)
Equally useful in practice. Up to 18x faster.

Outline
Reducing I/O Cost
– Threshold Correlation Matrix
Reducing CPU Cost
– ϵ-Approximate Correlation Matrix
– Boolean Correlation Matrix
Extensions
Evaluation

Reducing I/O Cost
Challenge: the order in which signals are brought into memory matters.
– Correlating signals across batches is expensive.
Observation: uncorrelated pairs can often be detected (and pruned) cheaply using bounds.
(Figure: batches on disk and in memory; intra-batch vs. inter-batch comparisons.)

Our Approach

Step 1: Pruning Using DFT
Correlation is related to Euclidean distance.
– If x̂ and ŷ are the z-normalized signals, then corr(x, y) = 1 - d²(x̂, ŷ) / (2m).
DFT preserves Euclidean distance.
Any prefix of the DFT gives a lower bound on the distance.
– Use a small prefix to decide whether a signal pair is likely correlated or definitely uncorrelated.
– Smooth signals need very few DFT coefficients.
(Figure: pruning matrix.)
(Agrawal et al., 1993; Zhu et al., 2002)
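A sketch of the pruning test under the conventions assumed here: z-normalized signals and a DFT scaled by 1/sqrt(m) so that Parseval preserves Euclidean distance. A tighter bound is possible by exploiting the conjugate symmetry of real signals; this sketch uses the plain prefix sum, which is still a valid lower bound on the distance and hence an upper bound on the correlation.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / x.std()

def dft_prefix(x, k):
    """First k DFT coefficients of the z-normalized signal, scaled so the
    full spectrum has the same Euclidean norm as the signal (Parseval)."""
    return np.fft.fft(znorm(x))[:k] / np.sqrt(len(x))

def corr_upper_bound(fx, fy, m):
    """Prefix distance lower-bounds the true distance, so this upper-bounds corr."""
    d2_lower = np.sum(np.abs(fx - fy) ** 2)
    return 1.0 - d2_lower / (2.0 * m)

def definitely_uncorrelated(fx, fy, m, T):
    """Safe to prune: the pair cannot reach the correlation threshold T."""
    return corr_upper_bound(fx, fy, m) < T
```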

Step 2: Min-cut Partitioning
From the pruning matrix, build a "likely correlated" graph, recursively bi-partition it along small cuts, and merge smaller batches.
The node-capacitated graph partitioning problem is NP-complete (Ferreira et al., 1998), so heuristics are used (Fiduccia and Mattheyses, 1982; Wang et al., 2000).
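An illustrative sketch of the batching step (not the authors' code): nodes are signals, edges are pairs the DFT bound could not prune, and Kernighan-Lin bisection from networkx stands in for the Fiduccia-Mattheyses heuristic cited above. Keeping most surviving edges inside a batch means few inter-batch comparisons, and therefore little extra I/O.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def make_batches(likely_correlated_pairs, n, capacity):
    """likely_correlated_pairs: (i, j) pairs that survived DFT pruning.
    capacity: maximum number of signals per memory batch."""
    G = nx.Graph()
    G.add_nodes_from(range(n))
    G.add_edges_from(likely_correlated_pairs)

    batches = []

    def split(nodes):
        if len(nodes) <= capacity:
            batches.append(sorted(nodes))    # small enough to load at once
            return
        # bisect while trying to minimize the number of cut (inter-batch) edges
        part_a, part_b = kernighan_lin_bisection(G.subgraph(nodes))
        split(part_a)
        split(part_b)

    split(set(G.nodes))
    return batches
```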

Example
8 signals; 5 signals fit in memory (maximum batch size 4, plus 1 swap slot).
(Figure: batch loads / I/O with DFT pruning alone vs. with min-cut partitioning, and the order in which cached batches are compared.)

Outline
Reducing I/O Cost
– Threshold Correlation Matrix
Reducing CPU Cost
– ϵ-Approximate Correlation Matrix (1)
– Boolean Correlation Matrix (2)
Extensions
Evaluation

ϵ-Approximate Correlation (1)
Result: given ϵ, compute a prefix length k and correlate DFT prefixes of length k instead of signals of length m.
– Smooth signal → k is small for small ϵ.
– Compute the correlation in the frequency domain.
– Reduces the scanning cost from O(m) to O(k).
– Exact correlation c_i,j becomes approximate correlation c_i,j ± ϵ.
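A hedged sketch of this step: the scaling and coefficient bookkeeping are my own conventions, while the (1 - ϵ/2) energy criterion for choosing k follows the backup slide at the end of the deck. For a pair, use k = max(k_x, k_y) as on the example slide that follows.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / x.std()

def choose_k(x, eps):
    """Smallest k such that the k lowest non-DC frequencies (and their conjugate
    mirrors) retain at least a (1 - eps/2) fraction of the signal's energy."""
    m = len(x)
    F = np.fft.fft(znorm(x)) / np.sqrt(m)     # total energy == m under this scaling
    half = (m + 1) // 2
    energy = 2.0 * np.abs(F[1:half]) ** 2     # each low frequency plus its mirror
    cumulative = np.cumsum(energy)
    k = int(np.searchsorted(cumulative, (1.0 - eps / 2.0) * m)) + 1
    return k, F

def approx_corr(Fx, Fy, m, k):
    """Approximate Pearson correlation from k-coefficient (non-DC) prefixes."""
    s = np.sum(Fx[1:k + 1] * np.conj(Fy[1:k + 1]))
    return 2.0 * s.real / m                   # factor 2 covers the mirrored half
```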

Example (ϵ = 0.1)
Two signals with k_x = 14 and k_y = 18 give k = max(k_x, k_y) = 18.
For most signals, k << m.
(Figure: DFT prefix lengths compared with the time-domain prefix length.)

Boolean Correlation (2)
Idea: use distance bounds instead of the exact distance.
Correlation threshold T → distance threshold θ (for z-normalized signals, θ² = 2m(1 - T)).
– If UB(x, y) ≤ θ then corr(x, y) ≥ T.
– If LB(x, y) > θ then corr(x, y) < T.
– Otherwise, compute the correlation exactly.
Result: compute the bounds using the triangle inequality and dynamic programming: O(1) point arithmetic instead of O(m) vector arithmetic.
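A minimal sketch of the triangle-inequality shortcut, assuming z-normalized signals and a single reference signal; the dynamic-programming refinement of the bounds mentioned on the slide is omitted here.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / x.std()

def boolean_correlation(signals, T, ref_index=0):
    """Return an n x n 0/1 matrix: 1 where corr >= T. For z-normalized signals,
    d^2(x_hat, y_hat) = 2m(1 - corr), so T maps to theta = sqrt(2m(1 - T))."""
    Z = np.array([znorm(s) for s in signals])
    n, m = Z.shape
    theta = np.sqrt(2.0 * m * (1.0 - T))
    d_ref = np.linalg.norm(Z - Z[ref_index], axis=1)   # one O(m) pass per signal

    out = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i, n):
            lb = abs(d_ref[i] - d_ref[j])              # triangle inequality, O(1)
            ub = d_ref[i] + d_ref[j]
            if ub <= theta:                            # surely correlated
                correlated = True
            elif lb > theta:                           # surely uncorrelated
                correlated = False
            else:                                      # undecided: exact O(m) check
                correlated = np.linalg.norm(Z[i] - Z[j]) <= theta
            out[i, j] = out[j, i] = correlated
    return out
```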

Outline
Reducing I/O Cost
– Threshold Correlation Matrix
Reducing CPU Cost
– ϵ-Approximate Correlation Matrix
– Boolean Correlation Matrix
Extensions
Evaluation

Extensions
Negative correlation
– For every pair (x, y), also consider (x, -y).
– No need to compute the DFT of -y: DFT(-y) = -DFT(y).
Lagged correlation
– Maximum correlation over a set of possible lags [Sakurai '05].
– Result: the DFT of a prefix/suffix of x can be obtained from a prefix of DFT(x).
(Figure: Prefix(DFT(signal)) vs. DFT(Prefix(signal)) for lag = 4.)
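A tiny sketch of how the negative-correlation check reuses existing DFT prefixes; corr_upper_bound is the same bound as in the pruning sketch above, and the lagged-correlation extension is not sketched here.

```python
import numpy as np

def corr_upper_bound(fx, fy, m):
    """Same DFT-prefix bound as in the pruning sketch above."""
    return 1.0 - np.sum(np.abs(fx - fy) ** 2) / (2.0 * m)

def screen_both_signs(fx, fy, m, T):
    """Because DFT(-y) = -DFT(y), the prefix computed for y also screens (x, -y)."""
    positive_possible = corr_upper_bound(fx, fy, m) >= T    # corr(x, y) may reach T
    negative_possible = corr_upper_bound(fx, -fy, m) >= T   # corr(x, -y) may reach T
    return positive_possible, negative_possible
```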

I/O Speedup
We used four datasets:
– DC1: number of TCP connections established on a server.
– DC2: processor utilization of a server.
– Chlorine: chlorine level in a water distribution network.
– RW: synthetic random walks modeling stock-price behavior.

CPU Speedup
(Figures: Speedup (A/A_T), pruning in the frequency domain with correlation in the time domain [Zhu, VLDB '02]; Speedup (A_T/A_ϵ), approximate correlation in the frequency domain.)

Related Work
Exact correlations
– StatStream (Zhu et al., 2002) uses DFT to prune uncorrelated pairs; exact correlation is computed in the time domain.
Approximate correlations
– HierarchyScan (Li et al., 1996): false negatives without an error bound.
Neither line of work considers I/O optimization.

Conclusion
Bounded approximation techniques for fast correlation queries:
– Threshold Correlation Matrix: up to 3.5x faster.
– ϵ-Approximate Correlation Matrix: up to 18x faster.
– Boolean Correlation Matrix: up to 11x faster.
Based on computational shortcuts in the DFT.
Can be extended to negative and lagged correlation with very low error.

ϵ-Approximate Correlation (1)
Result: given ϵ, compute a prefix length k and correlate DFT prefixes of length k instead of signals of length m.
– Smooth signal → k is small for small ϵ.
– Choose k so that the selected energy of the prefix is at least (1 - ϵ/2) of the total energy (normalized to 1.0).
– Compute the correlation in the frequency domain, reducing the scanning cost from O(m) to O(k).
– Exact correlation c_i,j becomes approximate correlation c_i,j ± ϵ.