# Time Series II.

## Presentation on theme: "Time Series II."— Presentation transcript:

Time Series II

Syllabus
- Nov 4: Introduction to data mining
- Nov 5: Association Rules
- Clustering and Data Representation
- Nov 17: Exercise session 1 (Homework 1 due)
- Nov 19: Classification
- Nov 24, 26: Similarity Matching and Model Evaluation
- Dec 1: Exercise session 2 (Homework 2 due)
- Dec 3: Combining Models
- Dec 8, 10: Time Series Analysis
- Dec 15: Exercise session 3 (Homework 3 due)
- Dec 17: Ranking
- Jan 13: Review
- Jan 14: EXAM
- Feb 23: Re-EXAM

Last time… What is a time series? How do we compare time series data?

Today… What is the structure of time series data?
Can we represent this structure compactly and accurately? How can we search streaming time series?

Time series summarization
[Figure: one time series (time axis up to 120) approximated by DFT, DWT, APCA, PAA, PLA, and SAX; the SAX panel shows the string aabbbccb over the alphabet a, b, c.]
References cited on the slide:
- Agrawal, Faloutsos & Swami, FODO 1993
- Faloutsos, Ranganathan & Manolopoulos, SIGMOD 1994
- Chan & Fu, ICDE 1999
- Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS 2000
- Yi & Faloutsos, VLDB 2000
- Keogh, Chakrabarti, Pazzani & Mehrotra, SIGMOD 2001
- Morinaka, Yoshikawa, Amagasa & Uemura, PAKDD 2001

Why Summarization?
- We can reduce the length of time series
- We should not lose any information
- We can process it faster

Discrete Fourier Transform (DFT)
Basic Idea:
- Represent the time series as a linear combination of sines and cosines
- Transform the data from the time domain to the frequency domain
- Highlight the periodicities, but keep only the first n/2 coefficients
Why n/2 coefficients? Because they are symmetric.
[Figure: a series X and its DFT approximation X' (time axis up to 140), with the individual numbered wave components shown below; portrait of Jean Fourier.]
Excellent free Fourier primer: Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS , Department of Computer Science, Brown University, 1995.

Why DFT?
A: several real sequences are periodic
Q: Such as?
A:
- sales patterns follow seasons
- economy follows 50-year cycle (or 10?)
- temperature follows daily and yearly cycles
Many real signals follow (multiple) cycles

How does it work?
- Decomposes the signal into a sum of sine and cosine waves
- How to assess the 'similarity' of x with a (discrete) wave?
- Signal x = {x0, x1, ... xn-1}, wave s = {s0, s1, ... sn-1}
[Figure: value versus time (up to n-1) for x and for s.]

How does it work?
- Consider the waves with frequency 0, 1, …
- Use the inner product (~cosine similarity)
- Freq = 1/period
[Figure: value versus time (up to n-1) for the waves with freq. f = 0 and freq. f = 1, the latter being sin(t * 2π/n).]

How does it work?
- Consider the waves with frequency 0, 1, …
- Use the inner product (~cosine similarity)
[Figure: value versus time (up to n-1) for the wave with freq. f = 2.]

How does it work?
'basis' functions
[Figure: the basis functions plotted over time (up to n-1): cosine and sine with f = 1, cosine and sine with f = 2.]

How does it work?
- Basis functions are actually n-dim vectors, orthogonal to each other
- 'similarity' of x with each of them: inner product
- DFT: ~ all the similarities of x with the basis functions

How does it work?
Since e^{jφ} = cos(φ) + j sin(φ), with j = sqrt(-1), we finally have the DFT and the inverse DFT.
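The transform pair itself is not legible in this transcript. One common way to write it, assuming the symmetric 1/√n normalization (many texts instead put a single 1/n factor on the inverse), is sketched below; the symbols x_t, X_f, and n follow the earlier slides.

```latex
\begin{align}
X_f &= \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-j 2\pi f t / n}, \qquad f = 0, 1, \dots, n-1 \\
x_t &= \frac{1}{\sqrt{n}} \sum_{f=0}^{n-1} X_f \, e^{+j 2\pi f t / n}, \qquad t = 0, 1, \dots, n-1
\end{align}
```

Each X_f is the inner product of x with the complex wave of frequency f, which matches the "similarity with each basis function" reading above.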

How does it work?
Each Xf is a complex number: Xf = a + b j
- a is the real part
- b is the imaginary part
Examples: 10 + 5j, 4.5 - 4j

How does it work?
SYMMETRY property of the DFT of a real-valued series:
Xf = (Xn-f)* ("*": complex conjugate: (a + b j)* = a - b j)
Thus: we use only the first n/2 numbers
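The symmetry can be checked numerically. The following is a minimal sketch using a naive O(n²) DFT written with the standard library only; the 1/√n normalization and the sample values are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) DFT with an assumed 1/sqrt(n) normalization."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


x = [0.5, 1.0, 0.2, -0.7, -1.1, 0.3, 0.9, 0.1]  # made-up real-valued series
n = len(x)
X = dft(x)

# For real input, X[f] is the complex conjugate of X[n - f].
symmetric = all(abs(X[f] - X[n - f].conjugate()) < 1e-9 for f in range(1, n))
print(symmetric)
```

Because the second half mirrors the first, storing the coefficients for f = 0 … n/2 is enough to recover the rest.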

DFT: Amplitude spectrum
Intuition: strength of frequency 'f'
[Figure: a count series over time next to its amplitude spectrum (amplitude Af versus freq. f), with a peak marked at freq: 12.]

Example: reconstruction using 1 coefficient
[Figure: the series (time axis up to 250, values between -5 and 5) and its reconstruction using 1 coefficient.]

Example: reconstruction using 2 coefficients
[Figure: same axes, reconstruction using 2 coefficients.]

Example: reconstruction using 7 coefficients
[Figure: same axes, reconstruction using 7 coefficients.]

Example: reconstruction using 20 coefficients
[Figure: same axes, reconstruction using 20 coefficients.]

DFT: Amplitude spectrum
Can achieve excellent approximations with only very few frequencies! So what?
- We can reduce the dimensionality of each time series by representing it with the k most dominant frequencies
- Each frequency needs two numbers (real part and imaginary part)
- Hence, a time series of length n can be represented using 2*k real numbers, where k << n
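To make the "k most dominant frequencies" idea concrete, here is a small sketch that keeps the k largest-amplitude coefficients among f = 0 … n/2, restores their conjugate mirrors, and inverts the transform. The function names, the 1/√n normalization, and the synthetic two-wave signal are assumptions for illustration, not taken from the slides.

```python
import cmath
import math


def dft(x):
    """Naive DFT with an assumed 1/sqrt(n) normalization."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def idft(X):
    """Inverse of dft() under the same normalization."""
    n = len(X)
    return [
        sum(X[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / math.sqrt(n)
        for t in range(n)
    ]


def reconstruct_top_k(x, k):
    """Rebuild x from its k strongest frequencies (chosen among f = 0..n/2)."""
    n = len(x)
    X = dft(x)
    keep = sorted(range(n // 2 + 1), key=lambda f: abs(X[f]), reverse=True)[:k]
    Y = [0j] * n
    for f in keep:
        Y[f] = X[f]
        Y[(n - f) % n] = X[f].conjugate()  # mirror keeps the result real-valued
    return [v.real for v in idft(Y)]


n = 64
# Synthetic signal made of exactly two waves (frequencies 2 and 5).
x = [3 * math.sin(2 * math.pi * 2 * t / n) + math.cos(2 * math.pi * 5 * t / n) for t in range(n)]

err1 = max(abs(a - b) for a, b in zip(x, reconstruct_top_k(x, 1)))
err2 = max(abs(a - b) for a, b in zip(x, reconstruct_top_k(x, 2)))
print(err1, err2)  # the error collapses once both waves are kept
```

Note that when the best coefficients are kept rather than the first ones, the frequency index of each kept coefficient has to be stored as well.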

Raw Data (n = 128)
The graphic shows a time series C with 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown):
0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387

Fourier Coefficients
We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown):
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

Truncated Fourier Coefficients
Keeping only the first N = 8 Fourier coefficients of C gives the approximation C':
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928
With n = 128 and N = 8, Cratio = 1/16, so we have discarded 15/16 of the data.

Sorted Truncated Fourier Coefficients
Instead of taking the first few coefficients, we could take the best coefficients.
Fourier coefficients as listed on this slide:
1.5698 1.0485 0.7160 0.8406 0.3709 0.1670 0.4667 0.1928 0.1635 0.1302 0.0992 0.1282 0.2438 0.2316 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...
C':
1.5698 1.0485 0.7160 0.8406 0.2667 0.1928 0.1438 0.1416

Discrete Fourier Transform…recap
Pros and Cons of DFT as a time series representation
Pros:
- Good ability to compress most natural signals
- Fast, off-the-shelf DFT algorithms exist, O(n log(n))
Cons:
- Difficult to deal with sequences of different lengths

Piecewise Aggregate Approximation (PAA)
Basic Idea: Represent the time series as a sequence of box basis functions, each box being of the same length
Computation:
- X: time series of length n
- Can be represented in the N-dimensional space as the vector of its N segment means
[Figure: a series X and its PAA approximation X' (time axis up to 140) with segments x1 to x8.]
Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000); Byoung-Kee Yi, Christos Faloutsos, VLDB (2000)
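The computation formula is missing from this transcript. The usual PAA definition, written here under the assumption that N divides n evenly, is:

```latex
\bar{x}_i = \frac{N}{n} \sum_{j = \frac{n}{N}(i-1) + 1}^{\frac{n}{N} i} x_j , \qquad i = 1, \dots, N
```

That is, each of the N values is simply the mean of one block of n/N consecutive points.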

Piecewise Aggregate Approximation (PAA)
Example: Let X = [ ]
X can be mapped from its original dimension n = 9 to a lower dimension, e.g., N = 3, as follows: [ ] [ ]
[Figure: a series X and its PAA approximation X' with segments x1 to x8.]
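The slide's example values are not legible here, so the sketch below uses made-up numbers with the same dimensions (n = 9 reduced to N = 3). The helper name `paa` is hypothetical, and the code assumes N divides n.

```python
def paa(x, N):
    """Piecewise Aggregate Approximation: mean of each of N equal-length segments."""
    n = len(x)
    if n % N != 0:
        raise ValueError("this sketch assumes N divides n")
    w = n // N  # points per segment
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(N)]


X = [-2, 0, 2, 1, 3, 5, 4, 4, 7]  # hypothetical values, not the slide's
print(paa(X, 3))  # [0.0, 3.0, 5.0]
```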

Piecewise Aggregate Approximation (PAA)
Pros and Cons of PAA as a time series representation.
Pros:
- Extremely fast to calculate
- As efficient as other approaches (empirically)
- Supports queries of arbitrary lengths
- Can support any Minkowski metric
- Supports non-Euclidean measures
- Simple! Intuitive!
Cons:
- If visualized directly, looks aesthetically unpleasing

Symbolic Aggregate approXimation (SAX)
- similar in principle to PAA
- uses segments to represent data series
- represents segments with symbols (rather than real numbers)
- small memory footprint

Creating SAX
- Input: a time series (blue curve)
- Output: SAX representation of the input time series (red string), e.g., baabccbc
[Figure: the input series, its PAA, and the resulting SAX string baabccbc.]

The Process (STEP 1)
Represent time series T of length n with w segments using Piecewise Aggregate Approximation (PAA): PAA(T, w) is the vector of the w segment means of T.
[Figure: a time series T of length 16 (values between -3 and 3) and PAA(T, 4).]

The Process (STEP 2)
Discretize into a vector of symbols: use breakpoints to map to a small alphabet α of symbols
[Figure: PAA(T, 4) and the resulting iSAX(T, 4, 4), with the four value regions labeled 00, 01, 10, 11.]

Symbol Mapping
- Each average value from the PAA vector is replaced by a symbol from an alphabet
- An alphabet size a of 5 to 8 is recommended: {a,b,c,d,e}, {a,b,c,d,e,f}, {a,b,c,d,e,f,g}, {a,b,c,d,e,f,g,h}
- Given an average value, we need a symbol

Symbol Mapping
This is achieved by using the normal distribution from statistics:
- Assuming our input series is normalized, we can use the normal distribution as the data model
- We divide the area under the normal distribution into 'a' equal-sized areas, where a is the alphabet size
- Each such area is bounded by breakpoints

SAX Computation – in pictures
[Figure: a time series (time axis up to 120), its PAA segments, and the regions a, b, c that yield the SAX string baabccbc.]
This slide taken from Eamonn's Tutorial on SAX -

Finding the BreakPoints
Breakpoints for different alphabet sizes can be structured as a lookup table:

| Breakpoint | a = 3 | a = 4 | a = 5 |
|---|---|---|---|
| b1 | -0.43 | -0.67 | -0.84 |
| b2 | 0.43 | 0 | -0.25 |
| b3 | | 0.67 | 0.25 |
| b4 | | | 0.84 |

When a = 3:
- Average values below -0.43 are replaced by 'A'
- Average values between -0.43 and 0.43 are replaced by 'B'
- Average values above 0.43 are replaced by 'C'
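The lookup table can also be derived rather than stored. The sketch below computes the breakpoints as quantiles of the standard normal distribution with Python's `statistics.NormalDist` and maps PAA values to letters; the function names and the sample PAA vector are illustrative assumptions, and lowercase letters are used to match the earlier SAX strings.

```python
from bisect import bisect_right
from statistics import NormalDist


def breakpoints(a):
    """The a - 1 cut points that split the standard normal into a equal-probability areas."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]


def sax(paa_values, a):
    """Replace each (z-normalized) PAA value by the letter of the area it falls in."""
    cuts = breakpoints(a)
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in paa_values)


print([round(b, 2) for b in breakpoints(3)])  # [-0.43, 0.43]
print([round(b, 2) for b in breakpoints(5)])  # [-0.84, -0.25, 0.25, 0.84]
print(sax([-1.0, 0.1, 0.9, 0.2], 3))          # abcb
```

A value that falls exactly on a breakpoint is assigned to the higher symbol here; that tie-breaking rule is a choice of this sketch.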

The GEMINI Framework
- Raw data: original full-dimensional space
- Summarization: reduced-dimensionality space
- Searching in the original space is costly
- Searching in the reduced space is faster: less data, indexing techniques available, lower bounding
- Lower bounding enables us to prune the search space: throw away data series based on the reduced-dimensionality representation, while guaranteeing correctness of the answer
  - no false negatives
  - false positives: filtered out based on raw data

GEMINI Solution: Quick filter-and-refine:
- extract m features (numbers, e.g., average)
- map each series into a point in the m-dimensional feature space
- organize the points
- retrieve the answer using a NN query
- discard false alarms

Generic Search using Lower Bounding
[Figure: the query is simplified and run against the simplified DB, producing an answer superset (no false negatives!!); the superset is verified against the original DB to remove false positives!! and obtain the final answer set.]

GEMINI: contractiveness
GEMINI works when: Dfeature(F(x), F(y)) <= D(x, y) Note that, the closer the feature distance to the actual one, the better
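As an illustration of the contractiveness condition, the sketch below uses PAA as the feature map F and the scaled PAA distance sqrt(n/N) · ED as Dfeature, which is known to lower-bound the Euclidean distance. It runs a range query (the slides mention NN queries; a range query is used here only because it is shorter). All names and data are assumptions.

```python
import math


def paa(x, N):
    """Segment means; assumes N divides len(x)."""
    w = len(x) // N
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(N)]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa_lower_bound(fx, fy, n):
    """Feature-space distance that never exceeds the true Euclidean distance."""
    return math.sqrt(n / len(fx)) * euclidean(fx, fy)


def range_query(db, query, eps, N):
    """Filter with the cheap lower bound, then refine on the raw series."""
    n = len(query)
    fq = paa(query, N)
    features = [paa(s, N) for s in db]  # would be precomputed and indexed in practice
    candidates = [i for i, f in enumerate(features) if paa_lower_bound(f, fq, n) <= eps]
    return [i for i in candidates if euclidean(db[i], query) <= eps]  # drop false alarms


db = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 2, 3, 5, 5, 6, 8],
    [7, 6, 5, 4, 3, 2, 1, 0],
]
query = [0, 1, 2, 3, 4, 5, 6, 8]
hits = range_query(db, query, eps=2.0, N=4)
print(hits)  # [0, 2]
```

Because the feature distance never exceeds the true distance, the filter step cannot discard a true answer; it can only let through false positives, which the refine step removes.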

Streaming Algorithms
- Similarity search is the bottleneck for most time series data mining algorithms, including streaming algorithms
- Scaling such algorithms can be tedious when the target time series length becomes very large!
- Fast similarity search will allow us to solve higher-level time series data mining problems, e.g., similarity search in data streams and motif discovery, at scales that would otherwise be untenable

Fast Serial Scan
A streaming algorithm for fast and exact search in very large data streams
[Figure: a query compared against a data stream.]

Z-normalization
- Needed when interested in detecting trends and not absolute values
- For streaming data: each subsequence of interest should be z-normalized before being compared to the z-normalized query, otherwise the trends are lost
- Z-normalization guarantees: offset invariance, scale/amplitude invariance

Pre-Processing z-Normalization
- data series encode trends
- usually interested in identifying similar trends
- but absolute values may mask this similarity

Pre-Processing z-Normalization
two data series, v1 and v2, with similar trends but large distance…

Pre-Processing z-Normalization
zero mean:
- compute the mean of the sequence
- subtract the mean from every value of the sequence

Pre-Processing z-Normalization
standard deviation one:
- compute the standard deviation of the sequence
- divide every value of the sequence by the stddev

Pre-Processing z-Normalization
- when to z-normalize: interested in trends
- when not to z-normalize: interested in absolute values
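The two steps (subtract the mean, divide by the standard deviation) fit in a few lines. This is a minimal sketch; the handling of constant series and the example values for v1 and v2 are assumptions, not taken from the slides.

```python
import math


def z_normalize(x):
    """Shift to zero mean, then scale to standard deviation one."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        return [0.0] * n  # constant series: a choice made for this sketch
    return [(v - mean) / std for v in x]


v1 = [1.0, 2.0, 3.0, 2.0, 1.0]
v2 = [10.0, 30.0, 50.0, 30.0, 10.0]  # same trend, different offset and scale

z1, z2 = z_normalize(v1), z_normalize(v2)
same_trend = all(abs(a - b) < 1e-9 for a, b in zip(z1, z2))
print(same_trend)
```

After normalization the two series coincide, which is the offset and amplitude invariance mentioned earlier.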

Proposed Method: UCR Suite
- An algorithm for similarity search in large data streams
- Supports both ED and DTW search
- Works for both z-normalized and un-normalized data series
- Combination of various optimizations

Squared Distance + LB
- Using the Squared Distance
- Lower Bounding: LB_Yi, LB_Kim, LB_Keogh
[Figure: LB_Keogh between a candidate C and the envelope (U, L) built around the query Q.]

Lower Bounds
Lower Bounding: LB_Yi, LB_Kim, LB_Keogh
[Figure: illustrations of the bounds, with labels max(Q), min(Q) and points A, B, C.]
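Of the three bounds, LB_Keogh is the one the later slides build on, so a small sketch may help. It builds the envelope (U, L) of the query for a warping window r and sums the squared amounts by which the candidate leaves the envelope. Function names and data are illustrative assumptions; equal-length series are assumed.

```python
def envelope(q, r):
    """Upper and lower envelope of q: running max/min over a window of +/- r points."""
    n = len(q)
    upper = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    lower = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return upper, lower


def lb_keogh_sq(c, upper, lower):
    """Squared LB_Keogh: only points of c outside the envelope contribute."""
    total = 0.0
    for ci, u, l in zip(c, upper, lower):
        if ci > u:
            total += (ci - u) ** 2
        elif ci < l:
            total += (l - ci) ** 2
    return total


q = [0, 1, 2, 1, 0, -1, -2, -1]
c = [3, 0, 1, 2, 1, 0, -1, -4]  # leaves the envelope at both ends
U, L = envelope(q, r=1)
lb = lb_keogh_sq(c, U, L)
print(lb)  # 8.0
```

The value is a lower bound on the squared DTW distance computed with the same warping window, so a candidate whose bound already exceeds the best-so-far can be skipped without running DTW.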

Early Abandoning
- Early Abandoning of ED
- Early Abandoning of LB_Keogh
U, L is an envelope of Q
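Early abandoning of ED can be sketched as follows: accumulate the squared differences and stop as soon as the running sum reaches the best-so-far (bsf). The same pattern applies to LB_Keogh, where the accumulated terms are the distances to the envelope. Names and data below are assumptions for illustration.

```python
def early_abandon_sq_ed(q, c, bsf):
    """Squared ED, or infinity if the running sum reaches bsf before the end."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total >= bsf:
            return float("inf")  # cannot beat the best match found so far
    return total


def nearest(q, candidates):
    """1-NN scan that tightens bsf as better candidates are found."""
    bsf, best = float("inf"), -1
    for idx, c in enumerate(candidates):
        d = early_abandon_sq_ed(q, c, bsf)
        if d < bsf:
            bsf, best = d, idx
    return best, bsf


q = [1, 2, 3, 4]
candidates = [[4, 3, 2, 1], [1, 2, 3, 5], [9, 9, 9, 9], [1, 2, 3, 4.5]]
print(nearest(q, candidates))  # (3, 0.25)
```

Working with squared distances avoids the square root and does not change which candidate is nearest.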

Early Abandoning
- Early Abandoning of DTW
- Earlier Early Abandoning of DTW using LB_Keogh
Stop if dtw_dist ≥ bsf
[Figure: candidate C, query Q, and warping window R, with the partial dtw_dist marked.]

Early Abandoning
- Early Abandoning of DTW
- Earlier Early Abandoning of DTW using LB_Keogh
Stop if dtw_dist + lb_keogh ≥ bsf
[Figure: candidate C, query Q, and warping window R; the (partial) dtw_dist covers the first part of the series and lb_keogh the remainder.]
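The stopping rule can be sketched with a banded DTW that, after each row, adds the LB_Keogh contributions of the candidate points the warping path has not yet reached. This is an illustrative sketch under stated assumptions (equal-length series, squared point costs, Sakoe-Chiba band of half-width r); the function names are hypothetical and this is not the UCR Suite source code.

```python
INF = float("inf")


def envelope(q, r):
    n = len(q)
    upper = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    lower = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return upper, lower


def lb_suffix_sums(c, upper, lower):
    """cb[k] = sum of the per-point LB_Keogh terms of c[k:], with cb[n] = 0."""
    terms = [(ci - u) ** 2 if ci > u else (l - ci) ** 2 if ci < l else 0.0
             for ci, u, l in zip(c, upper, lower)]
    cb = [0.0] * (len(terms) + 1)
    for k in range(len(terms) - 1, -1, -1):
        cb[k] = cb[k + 1] + terms[k]
    return cb


def dtw_early_abandon(q, c, r, bsf, cb):
    """Banded squared DTW; returns INF once partial DTW + remaining LB reaches bsf."""
    n = len(q)
    prev = [INF] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [INF] * (n + 1)
        row_min = INF
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
            row_min = min(row_min, cur[j])
        # Candidate points beyond column i + r are still unmatched.
        if row_min + cb[min(i + r, n)] >= bsf:
            return INF
        prev = cur
    return prev[n]


q = [0, 1, 2, 1, 0, -1, -2, -1]
c = [3, 0, 1, 2, 1, 0, -1, -4]
r = 1
cb = lb_suffix_sums(c, *envelope(q, r))
full = dtw_early_abandon(q, c, r, INF, cb)    # never abandons
pruned = dtw_early_abandon(q, c, r, 5.0, cb)  # abandons in the first row
print(full, pruned)
```

Only the LB terms for columns beyond i + r are added because cells in row i can already have consumed columns up to i + r; adding more would risk overestimating the true distance.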

Early Abandoning Z-Normalization
- Do normalization only when needed (just in time)
- Every subsequence needs to be normalized before it is compared to the query
- Online mean and std calculation is needed
- Keep a buffer of size m and compute a running mean and standard deviation
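A minimal sketch of these points: a buffer of size m with running sums gives the mean and standard deviation of every window in constant time per new point, and each point is normalized only at the moment it enters the distance computation, so an early abandon also skips the remaining normalizations. Names and data are assumptions; a production version would also need to guard against floating-point drift in the running sums.

```python
import math
from collections import deque


def sliding_stats(stream, m):
    """Yield (window, mean, std) for every length-m window, using running sums."""
    window = deque()
    s = s2 = 0.0
    for x in stream:
        window.append(x)
        s += x
        s2 += x * x
        if len(window) > m:
            old = window.popleft()
            s -= old
            s2 -= old * old
        if len(window) == m:
            mean = s / m
            std = math.sqrt(max(s2 / m - mean * mean, 0.0))
            yield window, mean, std


def best_match(stream, query):
    """Position and squared ED of the best z-normalized match; query is already z-normalized."""
    bsf, best_pos = float("inf"), -1
    for pos, (window, mean, std) in enumerate(sliding_stats(stream, len(query))):
        if std < 1e-12:
            continue  # constant window: skipped in this sketch
        total = 0.0
        for qi, x in zip(query, window):
            total += (qi - (x - mean) / std) ** 2  # normalize just in time
            if total >= bsf:
                break  # early abandon: the rest of the window is never normalized
        else:
            bsf, best_pos = total, pos
    return best_pos, bsf


pattern = [0.0, 1.0, 2.0, 1.0, 0.0]
mu = sum(pattern) / len(pattern)
sd = math.sqrt(sum((v - mu) ** 2 for v in pattern) / len(pattern))
query = [(v - mu) / sd for v in pattern]

stream = [5, 5, 5, 6, 10, 30, 50, 30, 10, 7, 7, 8]  # scaled pattern starts at index 4
pos, dist = best_match(stream, query)
print(pos)  # 4
```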

The Pseudocode

Reordering Early Abandoning
- We don't have to compute ED or LB from left to right
- Order points by expected contribution

Idea
- Order by the absolute height of the query point
- This step is performed only once for the query and can save about 30%-50% of calculations

We conjecture that the universal optimal ordering is to sort the indices based on the absolute values of the Z-normalized Q. The intuition behind this idea is that the value at Qi will be compared to many Ci's during a search. However, for subsequence search, with Z-normalized candidates, the distribution of many Ci's will be Gaussian, with a mean of zero. Thus, the sections of the query that are farthest from the mean, zero, will on average have the largest contributions to the distance measure.

Intuition
- The query will be compared to many data stream points during a search
- Candidates are z-normalized: the distribution of many candidates will be Gaussian, with a mean of zero
- The sections of the query that are farthest from the mean (zero) will on average have the largest contributions to the distance measure
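The ordering idea can be sketched by sorting the query indices once by absolute value and then accumulating the distance in that order. The helper names and the toy data (which are not z-normalized, to keep the arithmetic obvious) are assumptions for illustration.

```python
def query_order(q):
    """Indices of q sorted by decreasing absolute value; computed once per query."""
    return sorted(range(len(q)), key=lambda i: abs(q[i]), reverse=True)


def ordered_early_abandon(q, c, order, bsf):
    """Squared ED accumulated in the given index order; returns (distance, points examined)."""
    total = 0.0
    for steps, i in enumerate(order, start=1):
        total += (q[i] - c[i]) ** 2
        if total >= bsf:
            return float("inf"), steps
    return total, len(order)


q = [0.1, -0.2, 0.0, 0.1, 3.0]  # the largest value sits at the very end
c = [0.0, 0.0, 0.0, 0.0, 0.0]

_, steps_left_to_right = ordered_early_abandon(q, c, list(range(len(q))), bsf=1.0)
_, steps_reordered = ordered_early_abandon(q, c, query_order(q), bsf=1.0)
print(steps_left_to_right, steps_reordered)  # 5 1
```

Both orders reach the same decision (abandon), but the reordered scan gets there after one point instead of five.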

Different Envelopes: Reversing the Query/Data Role in LB_Keogh
- Make LB_Keogh tighter
- Much cheaper than DTW
- Online envelope calculation
[Figure: envelope on Q versus envelope on C.]

At least 18 lower bounds of DTW have been proposed. Use only the lower bounds that lie on the skyline.
[Figure: the lower bounds plotted by tightness of LB (LB/DTW).]

Experimental Result: Random Walk
Random Walk: varying size of the data

| | Million (Seconds) | Billion (Minutes) | Trillion (Hours) |
|---|---|---|---|
| UCR-ED | 0.034 | 0.22 | 3.16 |
| SOTA-ED | 0.243 | 2.40 | 39.80 |
| UCR-DTW | 0.159 | 1.83 | 34.09 |
| SOTA-DTW | 2.447 | 38.14 | 472.80 |

Code and data are available at:

Experimental Result: ECG
- Data: one year of electrocardiograms, 8.5 billion data points
- Query: idealized Premature Ventricular Contraction (PVC, aka skipped beat) of length 421 (R = 21 = 5%)

| | UCR-ED | SOTA-ED | UCR-DTW | SOTA-DTW |
|---|---|---|---|---|
| ECG | 4.1 minutes | 66.6 minutes | 18.0 minutes | 49.2 hours |

~30,000X faster than real time!

Up next…
- Nov 4: Introduction to data mining
- Nov 5: Association Rules
- Clustering and Data Representation
- Nov 17: Exercise session 1 (Homework 1 due)
- Nov 19: Classification
- Nov 24, 26: Similarity Matching and Model Evaluation
- Dec 1: Exercise session 2 (Homework 2 due)
- Dec 3: Combining Models
- Dec 8, 10: Time Series Analysis
- Dec 15: Exercise session 3 (Homework 3 due)
- Dec 17: Ranking
- Jan 13: No Lecture
- Jan 14: EXAM
- Feb 23: Re-EXAM