Published by Connor Mahoney. Modified over 3 years ago.

1
GAMPS: COMPRESSING MULTI SENSOR DATA BY GROUPING & AMPLITUDE SCALING
Sorabh Gandhi, UC Santa Barbara
Suman Nath, Microsoft Research
Subhash Suri, UC Santa Barbara
Jie Liu, Microsoft Research

2
Fine Grained Sensing & Data Glut
- Advances in sensing technology enable fine-grained, ubiquitous sensing of the environment
- Many applications, but the issue is data glut
- Automated data center cooling [MSFT DCGenome project]
  - Physical parameters, e.g. humidity, temperature
  - 1000s of sensors at 10 bytes/sensor/sec gives 10s of GBs/day
- Server performance monitoring [MSFT server farm monitoring]
  - Performance counters, e.g. CPU utilization, memory usage
  - 100s of counters on 1000s of servers at a few bytes/counter/sec gives TBs/day

3
Focus and Objectives
- Data archival + (reliable and fast) query processing
- Centralized setting
  - Point query: report the value for sensor x at time t
  - Similarity query: report sensors similar to sensor x in a time range
- Obvious solution: compression; the data is a set of time series
- Initial idea: approximate every time series individually
  - Many approximation techniques are known, e.g. DFT, DCT, piecewise linear
  - Focus: L∞ error [guarantee on point queries]; example techniques: wavelets, piecewise constant/linear approximations
- Compression alone is not enough! It gives up to an order of magnitude improvement; we want more

4
Signals are Correlated!
- Server dataset: 40 signals, 1 day, sampled once every 30 seconds; counter: # of connected users
- [Figure: # connected users vs. time]
- Similar signals in a group
- Shifted/scaled groups
- Dynamic groups

5
Contributions
- We propose GAMPS, which exploits linear correlations among multiple signals while compressing them together, and gives L∞ guarantees
  - Compression both along time and across signals
- We propose an index structure for compressed data that gives fast responses to many relevant queries
- Through simulations on real data, we show that on large datasets GAMPS can achieve up to an order of magnitude improvement over state-of-the-art compression techniques

6
State of the art: Single Signal Optimal L∞ approximations
- Problem: given a time series S and input parameter ε, approximate S with piecewise constant segments such that the L∞ error is <= ε
- Greedy algorithm (PCGreedy(S, ε))

7
State of the art: Single Signal Optimal L∞ approximations
- Problem: given a time series S and input parameter ε, approximate S with piecewise constant segments such that the L∞ error is <= ε
- Greedy algorithm (PCGreedy(S, ε)) [Lazaridis et al., ICDE 2003]
- [Figure: original time series and its approximation, with bands of width 2ε marked]
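The greedy idea can be sketched as follows. This is a minimal illustration, not the authors' code: the names `pc_greedy` and `reconstruct` are hypothetical, the error bound is taken as an absolute maximum error, and each segment is assumed to store the midpoint of the values it covers. A segment is extended as long as the spread (max minus min) of its values stays within 2ε, so the midpoint is never more than ε away from any covered value.

```python
def pc_greedy(series, eps):
    """Greedy piecewise-constant approximation with max error <= eps.

    Returns a list of (start_index, end_index_inclusive, value) segments.
    Hypothetical helper written for illustration only.
    """
    segments = []
    if not series:
        return segments
    start = 0
    lo = hi = series[0]
    for i in range(1, len(series)):
        new_lo = min(lo, series[i])
        new_hi = max(hi, series[i])
        if new_hi - new_lo > 2 * eps:
            # The new point does not fit: close the segment at its midpoint.
            segments.append((start, i - 1, (lo + hi) / 2))
            start = i
            lo = hi = series[i]
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(series) - 1, (lo + hi) / 2))
    return segments


def reconstruct(segments):
    """Expand segments back into one value per time step."""
    out = []
    for start, end, value in segments:
        out.extend([value] * (end - start + 1))
    return out


series = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0]
segs = pc_greedy(series, eps=0.2)
approx = reconstruct(segs)
max_err = max(abs(a - b) for a, b in zip(series, approx))
print(len(segs), max_err)
```

On this toy series the seven samples collapse into three segments while every reconstructed value stays within ε of the original.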

8
GAMPS Overview
- GAMPS takes as input the set of time series and the approximation parameter ε
- Compression
  - Partition phase: partitions the data into contiguous time intervals
  - Group phase: divides a given partition into groups of similar signals
  - Amplitude scaling phase: compression happens with sharing of representations
- [Diagram labels: Data; COMPRESSION (Partition Phase, Grouping Phase, Amplitude Scaling Phase); Compressed Data; INDEXING (Index Structure)]

9
Compression by Amplitude Scaling
- Given a group of k similar signals
  - Let the signals be denoted by the set X = {X_1, X_2, …, X_k}
- Key idea: express every signal X_i as a scaled function of some signal X_j: X_i = A_i X_j
  - A_i is the ratio (amplitude) signal and X_j is the base signal
- If signal X_i is a perfectly scaled version of X_j, then A_i is constant
  - To reconstruct X_i, we only need to store the constant and X_j
- In reality, there is no perfect correlation
- However, we found that if there are enough linearly correlated signals, smartly approximating the A_i's and X_j can give very good compression factors!
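A small sketch of the ratio-signal idea, with made-up numbers (the values below are illustrative and not taken from any of the datasets). The helper `ratio_signal` is a hypothetical name and assumes the base signal never takes the value zero.

```python
def ratio_signal(x_i, x_j):
    """Pointwise ratio A_i[t] = X_i[t] / X_j[t]; assumes X_j has no zeros."""
    return [a / b for a, b in zip(x_i, x_j)]


base = [40.0, 42.0, 45.0, 43.0]        # X_j, the base signal
scaled = [1.5 * v for v in base]       # a perfectly scaled copy of X_j
noisy = [60.3, 62.8, 67.9, 64.2]       # roughly 1.5 * X_j, not exactly

a_perfect = ratio_signal(scaled, base)  # constant ratio signal
a_noisy = ratio_signal(noisy, base)     # nearly constant ratio signal
spread = max(a_noisy) - min(a_noisy)

# X_i is recovered by multiplying the ratio signal back onto the base.
rebuilt = [a * b for a, b in zip(a_noisy, base)]
print(a_perfect, spread)
```

The perfectly scaled signal yields a constant ratio, so one number plus the base signal reconstructs it. The noisy signal yields a ratio that varies only slightly, which is why a coarse piecewise-constant approximation of the ratio can need very few segments.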

10
Illustration: Amplitude Scaling on Real Dataset
- DataCenter dataset: 6 signals shown for ~3 days each; parameter: relative humidity
- Input: X = {X_1, X_2, …, X_6}, ε = 1%
- Need to choose the base signal and divide ε between the base signal approximation (ε_b) and the ratio signal approximations (ε_r)
- Oracle: X_4 is the base signal; the oracle also provides the values ε_b and ε_r
- Run PCGreedy(X_4, ε_b), and PCGreedy(A_i, ε_r) for every signal other than the base signal
- [Figure: DataCenter dataset]

11
Illustration: Amplitude Scaling on Real Dataset
- Leftmost figure: all signals use PCGreedy() with ε = 1.0%
- Middle figure: higher-fidelity base signal, ε_b = 0.4%
- Rightmost figure: ratio signals
  - Very sparse (a small number of segments is enough to represent them)
- [Figure panels: individual approximations (y-axis: relative humidity); base signal approximation (y-axis: relative humidity); ratio signal approximations (y-axis: ratio)]

12
Quantitative Comparison for Amplitude Scaling
- Comparison with optimal individual approximations
- Compression factor = M_1 / M_2
  - M_1 = number of segments in the individual signal approximations
  - M_2 = number of segments in the (base signal + ratio signal) approximations
- For this illustrative dataset, the compression factor (1% error) is 1.9
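The compression factor can be computed on a toy group as sketched below. Everything here is an assumption for illustration: `count_segments` and `compression_factor` are hypothetical helpers, the signals are synthetic exact multiples of one another, the error bounds are absolute rather than relative, and the three ε values are arbitrary (the slides do not state how ε_b and ε_r combine into the overall bound).

```python
def count_segments(series, eps):
    """Number of greedy piecewise-constant segments with max error <= eps."""
    count = 1
    lo = hi = series[0]
    for v in series[1:]:
        new_lo, new_hi = min(lo, v), max(hi, v)
        if new_hi - new_lo > 2 * eps:
            count += 1          # start a new segment at this sample
            lo = hi = v
        else:
            lo, hi = new_lo, new_hi
    return count


def compression_factor(signals, base_idx, eps_single, eps_base, eps_ratio):
    """Return (M1, M2, M1 / M2) for one group with a chosen base signal."""
    m1 = sum(count_segments(s, eps_single) for s in signals)
    base = signals[base_idx]
    m2 = count_segments(base, eps_base)
    for k, s in enumerate(signals):
        if k != base_idx:
            ratio = [a / b for a, b in zip(s, base)]
            m2 += count_segments(ratio, eps_ratio)
    return m1, m2, m1 / m2


base = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
signals = [base, [2 * v for v in base], [3 * v for v in base]]
m1, m2, factor = compression_factor(
    signals, base_idx=0, eps_single=0.5, eps_base=0.2, eps_ratio=0.01
)
print(m1, m2, factor)
```

Here each signal needs 6 segments on its own (M_1 = 18), while the shared representation needs 6 segments for the base plus 1 per ratio signal (M_2 = 8).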

13
Grouping and Amplitude Scaling by Facility Location
- Facility location problem
  - The problem is modeled as a graph G(V, E)
  - Opening a facility at node i costs c(i)
  - Serving a demand point j using facility i costs w(i, j)
  - Objective: choose F ⊆ V to minimize Σ_{i∈F} c(i) + Σ_{j∈V} min_{i∈F} w(i, j)
- Grouping & amplitude scaling is modeled as facility location
  - Complete graph; every signal is represented by a node
  - Cost of opening a facility: # segments needed to represent the base signal
  - Cost of serving a demand point: # segments needed to represent the ratio signal
- [Figure: graph]
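For a handful of signals the facility location objective can be minimized by brute force, which makes the mapping to grouping concrete. This sketch is not the integer linear program used for the reported results; it simply enumerates every non-empty set of base signals. The function name `facility_location` and all cost values are made up, and a base signal is assumed to serve itself at zero extra cost.

```python
from itertools import combinations


def facility_location(open_cost, serve_cost):
    """Exhaustive search over facility sets (exponential; toy sizes only).

    open_cost[i]     : cost of making signal i a base signal
    serve_cost[i][j] : cost of representing signal j as a ratio of base i
    Returns (best_cost, chosen_bases, assignment), where assignment[j] is
    the base signal that serves signal j.
    """
    n = len(open_cost)
    best_cost, best_open = float("inf"), None
    for r in range(1, n + 1):
        for facilities in combinations(range(n), r):
            cost = sum(open_cost[i] for i in facilities)
            cost += sum(min(serve_cost[i][j] for i in facilities)
                        for j in range(n))
            if cost < best_cost:
                best_cost, best_open = cost, facilities
    assignment = [min(best_open, key=lambda i: serve_cost[i][j])
                  for j in range(n)]
    return best_cost, best_open, assignment


# Signals 0 and 1 resemble each other, as do signals 2 and 3.
open_cost = [10, 12, 8, 9]
serve_cost = [
    [0, 2, 9, 9],
    [2, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
cost, bases, assignment = facility_location(open_cost, serve_cost)
print(cost, bases, assignment)
```

The chosen facilities are the base signals, and the assignment gives the groups: each signal joins the group of the base that represents it most cheaply.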

14
Implementation Setup
- We set ε_b = 0.4 ε [error allocation for base signal]
- Facility location: NP-hard
  - We show results with the exact solution (integer linear program)
  - Approximate solutions are within 90% of the results shown
  - Time taken to solve the linear program is at most a few seconds
- We use three different datasets
  - Server dataset: 240 signals, 1 day of data [CPU utilization counter]
  - DataCenter dataset: 24 signals, 3 days of data [humidity sensors]
  - IBT dataset: 45 signals, 1 day of data [temperature sensors in a building in Berkeley]

15
Quantitative Evaluation: GAMPS
- Figure on the left: compression factor over raw data
  - For 1.5% error: 300 for the Server data, 50 for the other two
- Figure on the right: compression factor over individual approximations
  - For 1.5% error: between a factor of 2 and 10
- The compression factor is high for the Server dataset
  - Its average group size is the highest (60, compared to 4.5 and 6)

16
Scaling versus Group Size
- We extracted 60 signals in the same group from the Server dataset
- Compression factor (versus individual approximations) increases as group size increases

17
Advantage of Grouping
- Demonstrates the advantage of having multiple groups
- Datasets: IBT and Server
- Hybrid: an algorithm that allows only 1 group
  - Every signal is either in the group or approximated individually
- For both datasets, for all errors, grouping gives a great advantage
  - Compression factor: 1.5 (IBT) to 9 (Server) [error 1.5%]

18
Grouping: Geographical Locality
- IBT dataset, 1 day, error = 1.5%
- GAMPS runs the grouping on the entire day's data
- Picture on the left shows the sensor layout in the Intel Berkeley lab
  - Hexagons are sensor positions, crosses are sensors without data for that day, rectangles are outliers (individual approximations)
- Simple region boundaries confirm our intuition
- [Figures: sensor layout; group layout]

19
Indexing Compressed Data
- We propose a skip-list-based index structure
  - Point query: log(n)
  - Range query: log(n) + range
  - Similarity query: log(n) + # groups in range
- [Diagram labels: partitions 1 to 5; skip list of groups; pointer to base signal; skip list of approximation lines for ratio signal]
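A point query over the compressed form can be sketched as two logarithmic lookups followed by a multiplication. Note the substitution: this example uses sorted arrays with binary search (`bisect`) in place of the skip lists proposed on the slide, which gives the same logarithmic lookup for static data but not the skip list's cheap insertions. The class `SegmentIndex`, the function `point_query`, the sensor names, and all values are hypothetical.

```python
from bisect import bisect_right


class SegmentIndex:
    """Piecewise-constant signal stored as sorted (start_time, value) pairs."""

    def __init__(self, segments):
        self.starts = [start for start, _ in segments]
        self.values = [value for _, value in segments]

    def lookup(self, t):
        # Find the last segment whose start time is <= t.
        pos = bisect_right(self.starts, t) - 1
        if pos < 0:
            raise ValueError("t precedes the first segment")
        return self.values[pos]


def point_query(base_index, ratio_indexes, sensor, t):
    """Value of `sensor` at time t; a sensor with no ratio entry is the base."""
    base_value = base_index.lookup(t)
    if sensor not in ratio_indexes:
        return base_value
    return ratio_indexes[sensor].lookup(t) * base_value


base = SegmentIndex([(0, 40.0), (10, 44.0), (25, 41.0)])
ratios = {
    "s2": SegmentIndex([(0, 1.5)]),
    "s3": SegmentIndex([(0, 0.5), (20, 0.75)]),
}
print(point_query(base, ratios, "s2", 12))
print(point_query(base, ratios, "s3", 22))
print(point_query(base, ratios, "s1", 30))
```

In a full index the first step would be locating the sensor's group for the partition containing t, which this sketch leaves out.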

20
Future Work
- How to distribute the error among base and ratio signals?
- How about generic linear transformations?
  - We use only a ratio signal (scaling): X_i = A_i X_j
  - Maybe we can get much better compression by using X_i = A_i X_j + B_i
- How about piecewise linear signals?
  - The underlying algorithm is not so trivial (convex hulls)
- Can we apply this technique to 2D signals?
  - Consider a video: every pixel's value over time is a time series
  - Every pixel time series is correlated with neighboring pixel time series
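The motivation for the affine form X_i = A_i X_j + B_i can be seen on a toy example: a signal that is scaled and shifted has a ratio that drifts, while a scale plus an offset fits it exactly. The sketch below is only an illustration of that point. It treats A and B as single constants and fits them by ordinary least squares, which is an assumption of this example and not a method described in the slides; `fit_affine` is a hypothetical name.

```python
def fit_affine(x_i, x_j):
    """Least-squares constants (a, b) such that x_i ~= a * x_j + b."""
    n = len(x_j)
    mean_j = sum(x_j) / n
    mean_i = sum(x_i) / n
    cov = sum((b - mean_j) * (a - mean_i) for a, b in zip(x_i, x_j))
    var = sum((b - mean_j) ** 2 for b in x_j)
    a_coef = cov / var
    return a_coef, mean_i - a_coef * mean_j


base = [10.0, 20.0, 40.0]                  # X_j
shifted = [2.0 * v + 5.0 for v in base]    # X_i = 2 * X_j + 5

ratios = [x / b for x, b in zip(shifted, base)]   # drifts: 2.5, 2.25, 2.125
a_coef, b_coef = fit_affine(shifted, base)        # close to (2.0, 5.0)
print(ratios, a_coef, b_coef)
```

With pure scaling the ratio signal for this pair is not constant and would need several segments at a tight ε, whereas the affine form needs just two constants.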

21
Thanks for your attention

22
Example Query: Similarity Query
- Based on grouping, we can define a similarity coefficient for a given time range (t_1, t_2)
  - … = 1, if signals S_i and S_j are in the same group at time t
- [Figure: similarity query on part of the IBT dataset]

23
Compression by Interval Sharing
- Key idea: if two sensors have nearly overlapping time series, they can share a part of the approximation
- Let the number of signals be k and the desired error be ε
- (α, β) approximation algorithm
  - For a given error ε, say the optimal algorithm takes OPT segments
  - An (α, β) algorithm has error no more than αε and uses no more than β·OPT segments
- We propose a polynomial-time (5, log k + log OPT) approximation algorithm for approximation with PC segments using interval sharing
- [Figure: Signal 1 and Signal 2; the representation can be shared]

24
Multiple Correlated Signals: Example 1
- Instant messaging service: Server dataset
  - 240 servers, 2 weeks, >= 100 performance counters
  - 40 signals shown (normalized) for one day; counter: # connected users; sampling rate: once every 30 seconds
- Signals are correlated (almost overlapping) with each other; can we exploit this in compression?
- [Figure: Server dataset]

25
Multiple Correlated Signals: Example 2
- Data center monitoring
  - 24 sensors, 2 years, 2 parameters: humidity, temperature
  - 6 signals shown for ~3 days each; parameter: relative humidity; sampling rate: once every 30 seconds
- Signals are not overlapping, but still correlated
  - Shifting or scaling may help
- Question: can we exploit this correlation?
  - We propose a technique to compress multiple signals both along time and across signals
- [Figure: DataCenter dataset]

26
Partition Determination
- Use the double-half-same size heuristic
  - Start with some initial batch size (say 100 data points)
  - For the next batch, run group and compress with 200, 100 & 50 data points
  - For 200, compare with two batches of size 100; whichever takes less memory is chosen
  - Similarly for 50, compare two batches of size 50 with one batch of size 100
- Memory taken = # segments + cluster delta
  - Cluster delta: every time the clusters change, we need to update the base signals and the base-ratio signal relationships
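One possible reading of the double-half-same heuristic is sketched below. The order of the comparisons (try doubling first, then halving, otherwise keep the size) is an interpretation of the bullets above, not something the slide spells out. `next_batch_size` and `make_toy_cost` are hypothetical, and the toy cost function merely stands in for "run group and compress, then measure # segments + cluster delta": it charges a fixed overhead per batch and makes a batch expensive when it spans a point where the correlations change.

```python
def next_batch_size(cost, start, size):
    """Pick 2*size, size, or size//2 for the batch beginning at `start`.

    cost(lo, hi) is assumed to return the memory needed to compress the
    samples in [lo, hi) as a single batch.
    """
    half = size // 2
    one_double = cost(start, start + 2 * size)
    two_same = cost(start, start + size) + cost(start + size, start + 2 * size)
    if one_double < two_same:
        return 2 * size
    two_halves = cost(start, start + half) + cost(start + half, start + size)
    if two_halves < cost(start, start + size):
        return half
    return size


def make_toy_cost(change_points):
    """Toy memory model: overhead of 5 per batch plus a length-based term."""
    def cost(lo, hi):
        spans_change = any(lo < c < hi for c in change_points)
        length = hi - lo
        # A batch spanning a change in correlations compresses poorly.
        return 5 + (length // 2 if spans_change else length // 10)
    return cost


print(next_batch_size(make_toy_cost([]), 0, 100))      # stable data
print(next_batch_size(make_toy_cost([100]), 0, 100))   # change at sample 100
print(next_batch_size(make_toy_cost([50]), 0, 100))    # change at sample 50
```

Under this toy model, stable data favors a larger batch because the per-batch overhead is paid less often, while a change inside the current batch favors a smaller one.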

27
GAMPS Illustration
- [Diagram labels: signals 1 to 5; Partition; Grouping (similar signals together); Select Base and Ratio Signals; base signals; ratio signals]

28
GAMPS Compression Illustration
- [Diagram labels: signals 1 to 5; Partition; Grouping (similar signals together); Compress by Amplitude Scaling; "to overcome varying correlations"]
