Published by Kjeld Clausen. Modified over 6 years ago.
1
Deterministic Error Guarantees for Queries on Compressed Time Series
Chunbin Lin. Joint work with Etienne Boursier, Jacque Brito, Korhan Demirkaya, Joshua Lapacik, Yannis Papakonstantinou
2
Motivation
Fast analytic query processing over historical time series is necessary for future prediction, anomaly detection, and similarity matching.
Example: compute the correlation of the foreign-exchange rates CAD/JPY and AUD/JPY.
The data is big and online query processing is required. There are two directions: distributed computation and approximate computation. We follow the second one.
3
Challenge: historical time series is big
1 billion data points for each forex pair*1
8 TB of operational data per day for each oil drilling rig*2
4
Solutions
Distributed query processing on many machines.
Approximate query processing on a single machine:
Sampling methods give probabilistic error guarantees, e.g., the actual answer is within a given range with 95% confidence.
Our goal is deterministic error guarantees, i.e., the actual answer is always within the given range.
5
Data
Time series: a sequence of (timestamp, value) pairs.
We assume that queries involve time series with the same resolution, so the timestamps can be omitted.
Example (Apple stock price): [ 115.80, 115.90, 116.25, 116.30, 116.11, 116.15, 116.16, 116.06, 115.72, ...... ]
How to normalize data to the same resolution is not a topic of this work. There are several prior works on it.
6
Query: time series operators
Time subseries operators.
Arithmetic operators (+, −, ×, ÷, √), e.g., 100*20, 100/20…
7
Query: statistic queries
Covariance, correlation, cross-correlation, ……
8
Query: statistic queries
Covariance, correlation, cross-correlation, ……
The input of a statistic query is a base time series or a time series produced by time series operators.
9
Offline precomputation phase – building indexes
FL and SW
10
Segment list index
Index: a list of compressed time series segments.
Example: Forex CAD/JPY (the Canadian dollar against the Japanese yen), with a segment estimated by f(x) = a x + b.
For each segment, we store:
An estimation function, chosen to minimize the Euclidean distance to the data. Note that we do not actually store (a, b) but the coefficients of the function in an orthogonal basis.
Error measures: the L2-norm of errors, the reconstruction error, and the L2-norm of estimated values.
11
Segment list index: estimation function families
There is no limitation on estimation functions. Examples are the polynomial, logarithmic, exponential, logistic, Gaussian, and sin/cos function families.
The estimation functions minimize the Euclidean distance.
12
Segment list index: error guarantees
L2-norm of errors.
Reconstruction error.
L2-norm of estimated values: depends on the data values and is usually big.
Example: the values 3.0, 4.8, 5.4 are estimated by f(x) = 1.2x + 2.0.
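The example on this slide can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes the three values sit at positions x = 1, 2, 3, fits a least-squares line, and computes the three error measures. The formulas for the two L2-norms follow from their names; the reconstruction error is assumed here to be the absolute difference between the sum of the data and the sum of the estimates, since its formula did not survive in the transcript. The names `fit_line` and `error_measures` are hypothetical.

```python
import math

def fit_line(values):
    """Least-squares line f(x) = a*x + b over positions x = 1..n."""
    n = len(values)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def error_measures(values, a, b):
    """Return (L2-norm of errors, reconstruction error, L2-norm of estimates)."""
    estimates = [a * x + b for x in range(1, len(values) + 1)]
    errors = [v - e for v, e in zip(values, estimates)]
    l2_errors = math.sqrt(sum(e * e for e in errors))
    # Assumption: reconstruction error = |sum of data - sum of estimates|.
    reconstruction = abs(sum(errors))
    l2_estimates = math.sqrt(sum(e * e for e in estimates))
    return l2_errors, reconstruction, l2_estimates

values = [3.0, 4.8, 5.4]
a, b = fit_line(values)
l2_err, recon, l2_est = error_measures(values, a, b)
print(f"f(x) = {a:.1f}x + {b:.1f}")
print(f"L2-norm of errors    = {l2_err:.4f}")
print(f"L2-norm of estimates = {l2_est:.4f}")
```

Under these assumptions the fitted line is the slide's f(x) = 1.2x + 2.0, and the L2-norm of estimated values is more than ten times the L2-norm of errors, which matches the remark that it is usually big.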
13
Segment list index: existing index building algorithms
Fixed-length segmentation (FL): controls the segment size.
Sliding-window segmentation (SW): controls the reconstruction error.
……
Figure: segmentations of CAD/JPY and AUD/JPY.
E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289–296, 2001.
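The difference between the two strategies can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithms as implemented in the work: segments are half-open index ranges, and the per-segment error used by SW is a stand-in (maximum deviation from the segment mean, i.e., a constant estimation function), because the slide does not give the exact criterion. All function names are hypothetical.

```python
def fixed_length_segments(n, size):
    """FL: cut positions 0..n-1 into consecutive segments of at most `size` points."""
    return [(start, min(start + size, n)) for start in range(0, n, size)]

def max_deviation_from_mean(window):
    """Stand-in segment error: largest distance of a value from the window mean."""
    mean = sum(window) / len(window)
    return max(abs(v - mean) for v in window)

def sliding_window_segments(values, max_error, error_of=max_deviation_from_mean):
    """SW: grow a segment until adding the next point would exceed max_error."""
    segments = []
    start, n = 0, len(values)
    while start < n:
        end = start + 1  # a segment always holds at least one point
        while end < n and error_of(values[start:end + 1]) <= max_error:
            end += 1
        segments.append((start, end))
        start = end
    return segments

values = [1.0, 1.1, 0.9, 5.0, 5.2, 5.1, 9.0]
fl = fixed_length_segments(len(values), 4)
sw = sliding_window_segments(values, 0.5)
print("FL:", fl)
print("SW:", sw)
```

FL fixes the boundaries regardless of the data, so two series segmented with the same size are aligned. SW places boundaries where the data changes, so two series generally end up misaligned.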
14
Offline precomputation phase – building indexes
We build a segment list index for each time series.
We store an estimation function and error measures for each segment.
Building algorithms: FL and SW.
15
Online query processing – providing deterministic error guarantees
FL and SW
16
Error guarantees
Actual error: the absolute difference between the true answer $R$ and the estimated answer $\hat{R}$, i.e., $\varepsilon = |R - \hat{R}|$.
Error guarantee: an upper bound of the actual error, i.e., a value $\hat{\varepsilon}$ with $\varepsilon \le \hat{\varepsilon}$.
17
Error guarantees
Providing the error guarantee for each Sum(T) is the key, where T is a base time series or a time series produced by time series operators.
If we can provide an error guarantee for each Sum(Ti), then we are able to give the error guarantee for general queries.
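To see why Sum is the key building block, note that a statistic such as correlation can be written using only sums of time series produced by arithmetic operators. The sketch below uses the standard textbook identity for the Pearson correlation; it is an illustration of the reduction, not the query plan used by the system, and the function name is hypothetical.

```python
import math

def correlation_from_sums(n, s1, s2, s11, s22, s12):
    """Pearson correlation from Sum(T1), Sum(T2), Sum(T1 x T1), Sum(T2 x T2), Sum(T1 x T2)."""
    cov = s12 / n - (s1 / n) * (s2 / n)
    var1 = s11 / n - (s1 / n) ** 2
    var2 = s22 / n - (s2 / n) ** 2
    return cov / math.sqrt(var1 * var2)

t1 = [1.0, 2.0, 3.0, 4.0]
t2 = [2.0, 4.0, 6.0, 9.0]
r = correlation_from_sums(
    len(t1),
    sum(t1),
    sum(t2),
    sum(x * x for x in t1),
    sum(y * y for y in t2),
    sum(x * y for x, y in zip(t1, t2)),
)
print(f"correlation = {r:.4f}")
```

Each of the five inputs is a Sum over a base series or over a product of series, so an error guarantee for every such Sum can be propagated to the final answer.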
18
Query over single segment
Error guarantees for time series operators over T1 and T2.
19
Query over single segment
Error guarantee of Sum(T1 x T2), obtained with the Hölder inequality. Show how to get error guarantees for VS.
20
Query over single segment
Error guarantee of Sum(T1 x T2): part of the bound is = 0 if the estimation function family forms a vector space (VS).
Vector space: a set that is closed under finite vector addition and scalar multiplication.
The polynomial function family is a vector space.
21
Orthogonal projection property in VS
Show how to get error guarantees for ANY
22
Orthogonal projection property in VS
Show how to get error guarantees for ANY
23
Query over single segment
Orthogonal projection property in VS: part of the bound is = 0, and the rest is bounded with the Hölder inequality. Show how to get error guarantees for VS.
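The orthogonal projection property can be verified numerically on a small case. The sketch below assumes both series are estimated by least-squares lines over the same positions, so the residual of each series is orthogonal to every line, including the other series' estimate. The data values and all names are made up for illustration; the two bounds are the ones that follow from the Cauchy-Schwarz case of the Hölder inequality and may differ in form from the formulas on the original slide.

```python
import math

def line_estimates(values):
    """Values of the least-squares line at positions x = 1..n."""
    n = len(values)
    xs = list(range(1, n + 1))
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sxx
    b = mean_y - a * mean_x
    return [a * x + b for x in xs]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

t1 = [3.0, 4.8, 5.4, 7.1, 7.9]    # illustrative data
t2 = [10.0, 9.1, 8.5, 7.2, 6.9]   # illustrative data
f1, f2 = line_estimates(t1), line_estimates(t2)
e1 = [t - f for t, f in zip(t1, f1)]
e2 = [t - f for t, f in zip(t2, f2)]

# Terms mixing estimates and errors vanish because lines form a vector space.
cross_terms = dot(f1, e2) + dot(f2, e1)
actual_error = abs(dot(t1, t2) - dot(f1, f2))
bound_vs = norm(e1) * norm(e2)
bound_any = norm(f1) * norm(e2) + norm(f2) * norm(e1) + bound_vs

print(f"cross terms  = {cross_terms:.2e}")
print(f"actual error = {actual_error:.4f}")
print(f"VS bound     = {bound_vs:.4f}")
print(f"ANY bound    = {bound_any:.4f}")
```

The cross terms are zero up to rounding, so only the product of the two error norms is needed, and the bound no longer involves the norms of the estimated values.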
24
Query over single segment
Error guarantee of Sum(T1 x T2), in two cases: the estimation function family is not VS, and the estimation function family is VS. Show how to get error guarantees for ANY.
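The formulas for the two cases did not survive in this transcript. The derivation below is a reconstruction under assumptions: $f_i$ denotes the estimation function of $T_i$ on the segment, $e_i = T_i - f_i$ its error, and $\|\cdot\|_2$ the L2-norm over the segment's positions. This notation is introduced here for illustration and may not match the original slide.

```latex
\begin{align*}
\sum_x T_1(x)\,T_2(x) - \sum_x f_1(x)\,f_2(x)
  &= \sum_x f_1(x)\,e_2(x) + \sum_x f_2(x)\,e_1(x) + \sum_x e_1(x)\,e_2(x) \\[4pt]
\text{family is not VS:}\quad
\Bigl|\sum_x T_1 T_2 - \sum_x f_1 f_2\Bigr|
  &\le \|f_1\|_2\,\|e_2\|_2 + \|f_2\|_2\,\|e_1\|_2 + \|e_1\|_2\,\|e_2\|_2 \\[4pt]
\text{family is VS:}\quad
\sum_x f_1 e_2 = \sum_x f_2 e_1 = 0
  \;&\Longrightarrow\;
\Bigl|\sum_x T_1 T_2 - \sum_x f_1 f_2\Bigr| \le \|e_1\|_2\,\|e_2\|_2
\end{align*}
```

Each inequality applies the Cauchy-Schwarz case of the Hölder inequality term by term. In the VS case the first two terms vanish because each error is orthogonal to every function of the family on the same segment, which is why the L2-norm of estimated values is needed only when the family is not a vector space.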
25
Query over aligned segments
All the segments of CAD/JPY and AUD/JPY are perfectly aligned.
Error guarantee: the sum of the error guarantees of each segment pair.
Show how to get error guarantees for LSF.
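Summing per-pair guarantees is straightforward once the segments line up. The sketch below assumes each segment is a dict with a hypothetical `span` and a stored L2-norm of errors `l2_err`, and it uses the product of the two error norms as the per-pair guarantee, which is the form that applies when the estimation function family is a vector space. Both the record layout and the numbers are made up for illustration.

```python
def vs_pair_guarantee(seg1, seg2):
    """Per-pair guarantee for Sum(T1 x T2) when the family is a vector space."""
    return seg1["l2_err"] * seg2["l2_err"]

def aligned_guarantee(segs1, segs2, pair_guarantee=vs_pair_guarantee):
    """Total guarantee over aligned segment lists: sum of the per-pair guarantees."""
    if [s["span"] for s in segs1] != [s["span"] for s in segs2]:
        raise ValueError("segment lists are not aligned")
    return sum(pair_guarantee(a, b) for a, b in zip(segs1, segs2))

cad_jpy = [{"span": (0, 100), "l2_err": 0.3}, {"span": (100, 200), "l2_err": 0.4}]
aud_jpy = [{"span": (0, 100), "l2_err": 0.5}, {"span": (100, 200), "l2_err": 0.2}]
total = aligned_guarantee(cad_jpy, aud_jpy)
print(f"error guarantee = {total:.2f}")
```

The alignment check matters: the per-pair guarantee is only valid when both segments cover exactly the same positions.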
26
Query over aligned segments
Figure: aligned segments of CAD/JPY and AUD/JPY. Show how to get error guarantees for LSF.
27
Query over misaligned segments
One segment of CAD/JPY may overlap with more than one segment of AUD/JPY.
Segment combination selection algorithms.
28
Query over misaligned segments
Sum(T1 x T2): segment combination selection becomes an optimization problem, which is to minimize the error guarantee over the candidate segment combinations of CAD/JPY and AUD/JPY.
29
Query over misaligned segments
Segment combination selection for CAD/JPY and AUD/JPY:
Intersection Strategy (IS): maximal number of segments.
Optimal Strategy (OS): minimal error combination.
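One way to read the Intersection Strategy is that the two segment lists are cut at every boundary that appears in either of them, which yields the largest number of aligned pieces. The sketch below implements that reading only; it is an assumption about IS, and it does not attempt the Optimal Strategy, whose objective is not recoverable from the transcript. Segment lists are represented by their sorted boundaries, a hypothetical encoding.

```python
def intersection_segments(bounds1, bounds2):
    """Cut at every boundary of either list.

    A list such as [0, 4, 10] stands for the segments [0, 4) and [4, 10).
    """
    if bounds1[0] != bounds2[0] or bounds1[-1] != bounds2[-1]:
        raise ValueError("both series must cover the same range")
    cuts = sorted(set(bounds1) | set(bounds2))
    return list(zip(cuts, cuts[1:]))

cad_jpy_bounds = [0, 4, 10]
aud_jpy_bounds = [0, 6, 10]
pieces = intersection_segments(cad_jpy_bounds, aud_jpy_bounds)
print(pieces)
```

Every resulting piece lies inside exactly one segment of each series, so each piece can be handled as an aligned pair, provided the estimation function restricted to the piece is still usable.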
30
Query over misaligned segments
The orthogonal projection property cannot be applied to CAD/JPY and AUD/JPY here, because the segments are not aligned.
The estimation function for a subsegment may not be in the family.
Linear scalable family (LSF): the restriction of any function in LSF to a smaller domain is still a function in LSF.
LSF is a superset of the polynomial function family (PF).
Containment of the families: PF ⊆ LSF ⊆ VS ⊆ ANY.
31
Query over misaligned segments
Sum(T1 x T2) over CAD/JPY and AUD/JPY if the estimation functions are in LSF.
32
Error guarantee properties
Tightness: with the same error measures, no other error guarantee that holds for queries on all the data is smaller.
Amplitude-independence (AI): the error guarantee does not use the amplitudes. E.g., changing from Celsius to Kelvin will not change the error guarantees.
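The Celsius-to-Kelvin example can be reproduced with a line fit. The sketch below uses made-up temperature readings and hypothetical function names; it shows that shifting every value by a constant leaves the L2-norm of errors unchanged while the L2-norm of estimated values grows, so a guarantee built only from error norms is amplitude-independent.

```python
import math

def line_fit_norms(values):
    """Return (L2-norm of errors, L2-norm of estimates) of the least-squares line."""
    n = len(values)
    xs = list(range(1, n + 1))
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sxx
    b = mean_y - a * mean_x
    estimates = [a * x + b for x in xs]
    errors = [v - e for v, e in zip(values, estimates)]
    return (math.sqrt(sum(e * e for e in errors)),
            math.sqrt(sum(e * e for e in estimates)))

celsius = [20.1, 20.9, 22.3, 22.8, 24.2]   # illustrative readings
kelvin = [c + 273.15 for c in celsius]

err_c, est_c = line_fit_norms(celsius)
err_k, est_k = line_fit_norms(kelvin)
print(f"L2-norm of errors:    {err_c:.4f} (C) vs {err_k:.4f} (K)")
print(f"L2-norm of estimates: {est_c:.1f} (C) vs {est_k:.1f} (K)")
```

The shift is absorbed by the intercept of the fitted line, which is why the errors do not change.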
33
Error guarantee properties
Table: the AI and Tight properties of the error guarantees for Sum(T1 x T2), Sum(T1 + T2), and Sum(T1 - T2), by function family (ANY\VS, VS\LSF, LSF, ANY), for queries on aligned segments and queries on misaligned segments.
Show dichotomies.
34
Error guarantee properties: dichotomies of function families
Queries on aligned segments: VS is AI, ANY\VS is non-AI.
Queries on misaligned segments: LSF is AI, ANY\LSF is non-AI.
35
Experiments: datasets
Dataset                                   Avg # of data points in each time series   # of time series   Resolution
Historical Forex Data (HF)                126,059,817                                15                 millisecond
Historical IoT Data (HI)                  2,676,311                                  14                 second
Historical Bitcoin Exchanges Data (HB)    1,669,835                                  16                 minute
Historical Air Quality Data (HA)          1,587,258                                  11
36
Experiments: estimation functions [1] [2] [3]
[1] E. Keogh. Fast similarity search in the presence of longitudinal scaling in time series databases. In ICTAI, pages 578–584, 1997.
[2] M. Tobita. Combined logarithmic and exponential function model for fitting postseismic GNSS time series after 2011 Tohoku-Oki earthquake. Earth, Planets and Space, 68(1):41, 2016.
[3] Z. Pan, Y. Hu, and B. Cao. Construction of smooth daily remote sensing time series data: a higher spatiotemporal resolution perspective. Open Geospatial Data, Software and Standards, 2(1):25, 2017.
37
Experiments: setup
Segment list building algorithms: fixed-length segmentation (FL) and sliding-window segmentation (SW).
Queries: correlation query and cross-correlation query.
38
Experiments: error guarantees for queries on aligned time series
20 correlation queries, with segment lists built by FL.
Power of the orthogonal property.
VS uses less space than ANY: VS uses 0.035% while ANY uses 0.06%.
39
Experiments: error guarantees for queries on misaligned time series
20 correlation queries, with segment lists built by SW.
1. Effect of LSF (~100x).
2. Effect of optimal segment combination selection (~10x).
LSF uses 0.02% while ANY uses 0.023%. ANY has fewer segments, but more parameters.
40
Experiments: aligned vs. misaligned
Fix the space and compare the error guarantees for aligned and misaligned segments: with K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K.
1. Misaligned produces smaller true errors.
2. Misaligned produces smaller error guarantees (~3x for ANY).
41
Experiments: aligned vs. misaligned
Fix the space and compare the error guarantees for aligned and misaligned segments: with K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K.
1. Misaligned produces smaller error guarantees (~8.2x for LSF).
42
Experiments: index building time and query processing time
43
Experiments: comparison with the sampling method
Uniform random sampling scheme with a global seed.
Sampling size needed to provide the same error guarantees as those of VS.
Sampling size needed to provide the same error guarantees as those of ANY.
Figure axis: confidence.
44
Conclusion
Provide deterministic error guarantees for statistic queries over aligned segments and misaligned segments.
Provide optimizations to reduce the error guarantees in both scenarios.
Study the properties, AI and tightness, of the proposed error guarantees.
Conduct experiments to evaluate the error guarantees.
45
Future work: deterministic error guarantees for interactive analytic queries over compressed time series
46
Architecture
Offline: build a segment tree index for each time series.
A node refers to a compressed segment.
For each segment, we store an estimation function and error measures.
The tree may not be balanced.
Online: navigate the trees to access the minimal number of nodes needed to get answers with error guarantees less than a given threshold value.
In the offline processing, we partition each time series into segments, and for each time series segment we precompute some error parameters. The segments are organized as a hierarchy structure. In the online processing, a user gives a query and an error budget. Instead of accessing the original data, ApproPlato accesses the hierarchy structure and returns an approximate answer. The approximate answer has an error guarantee, which means the error is no greater than the error budget. In the following slides, I will introduce the error parameters, as shown in 1, then describe the hierarchy structure, as shown in 2, and finally introduce the hierarchy structure, as shown in 3. I will also keep this architecture graph in the following slides, made smaller, to indicate where we are.
47
Segment tree index
One tree structure for each time series.
A node refers to a compressed segment.
For each segment, we store an estimation function and error measures.
The tree may not be balanced.
Segment tree building algorithms*: top-down algorithm, bottom-up method, sliding-window approach.
*Fu, Tak-chung. "A review on time series data mining." Engineering Applications of Artificial Intelligence 24.1 (2011):
48
Query processing algorithm
Given a query and an error budget, access the minimal number of nodes to get an approximate answer with an error guarantee less than the error budget.
Consider query = (Agg(Times(T1, T2)), 10%) over time series T1 and time series T2.
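The idea of trading accessed nodes against the error budget can be sketched for a single tree. This is a simplified stand-in, not the algorithm of the talk: it handles a Sum over one time series with an absolute budget, each node is assumed to store an estimate and an error guarantee for its segment, and it refines greedily by expanding the node with the largest guarantee, which is not guaranteed to access the minimal number of nodes. The `Node` class and all numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)  # identity comparison, so list.remove targets the exact node
class Node:
    estimate: float                 # estimated Sum over the node's segment
    guarantee: float                # error guarantee of that estimate
    children: List["Node"] = field(default_factory=list)

def answer_within_budget(root, budget):
    """Greedy refinement: expand the worst node until the total guarantee fits."""
    frontier = [root]
    accessed = 1
    while sum(n.guarantee for n in frontier) > budget:
        expandable = [n for n in frontier if n.children]
        if not expandable:
            break  # only leaves left: the budget cannot be met from the index
        worst = max(expandable, key=lambda n: n.guarantee)
        frontier.remove(worst)
        frontier.extend(worst.children)
        accessed += len(worst.children)
    estimate = sum(n.estimate for n in frontier)
    guarantee = sum(n.guarantee for n in frontier)
    return estimate, guarantee, accessed

leaf_a = Node(10.0, 0.5)
leaf_b = Node(12.0, 0.4)
leaf_c = Node(30.0, 1.0)
left = Node(22.0, 3.0, [leaf_a, leaf_b])
root = Node(50.0, 8.0, [left, leaf_c])

estimate, guarantee, accessed = answer_within_budget(root, budget=2.0)
print(estimate, round(guarantee, 2), accessed)
```

A looser budget stops higher in the tree and touches fewer nodes, which is the trade-off the slide describes.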
49
Query processing algorithm: optimizations
Performance-wise optimization: an incremental update segmentation algorithm that gives a 1+ 2 ratio compared with the optimal one.
Space-wise optimization: avoid storing the estimation functions for the right nodes. Only the red nodes store estimation functions. The estimation function of a right node can be deduced from the parent node and the left sibling node via an inverse basis matrix.
50
Thank you Q&A
51
Backup slides
52
Application: new compression consideration
Consider not only the reconstruction error but also the query error guarantees.
Reconstruction error vs. query error.