Presentation on theme: "Time Series II 1. Syllabus Nov 4Introduction to data mining Nov 5Association Rules Nov 10, 14Clustering and Data Representation Nov 17Exercise session."— Presentation transcript:

1 Time Series II

2 Syllabus
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

3 Last time… What is a time series? How do we compare time series data?

4 Today… What is the structure of time series data? Can we represent this structure compactly and accurately? How can we search streaming time series?

5 Time series summarization

6 Why Summarization?
– We can reduce the length of the time series
– We should not lose any information
– We can process it faster

7 Discrete Fourier Transform (DFT). Basic idea: represent the time series as a linear combination of sines and cosines, i.e., transform the data from the time domain to the frequency domain. Highlight the periodicities, but keep only the first n/2 coefficients. Why n/2 coefficients? Because the remaining ones are symmetric (complex conjugates of the first half). Excellent free Fourier primer: Hagit Shatkay, ‘‘The Fourier Transform: A Primer’’, Technical Report, Department of Computer Science, Brown University.
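The transform can be sketched in a few lines of plain Python; this is a naive O(n²) DFT for illustration only (real code would use an FFT library), and the 1/n normalization is folded entirely into the inverse (other conventions split it as 1/sqrt(n) on each side):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT: X_f = sum_t x_t * exp(-2*pi*j*f*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(X):
    """Inverse DFT; the 1/n normalization lives here."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)).real / n
            for t in range(n)]

# For a real-valued series, X_f and X_{n-f} are complex conjugates,
# which is why only the first n/2 coefficients need to be stored.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 3.0]
X = dft(x)
```

Round-tripping through `idft` recovers the original series, and the conjugate symmetry can be checked directly on `X`.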

8 Why DFT? Many real signals follow (multiple) cycles.
Q: Such as?
A: Several real sequences are periodic: sales patterns follow seasons; the economy follows a 50-year cycle (or 10?); temperature follows daily and yearly cycles.

9 How does it work? It decomposes the signal into a sum of sine and cosine waves. How do we assess the ‘similarity’ of x = {x_0, x_1, …, x_{n-1}} with a (discrete) wave s = {s_0, s_1, …, s_{n-1}}?

10 How does it work? Consider the waves with frequency 0, 1, …: the wave with frequency f is sin(2πft/n), where frequency = 1/period. Use the inner product (~cosine similarity) to compare x with each wave.

11 How does it work? The same applies to the waves with frequency 2, 3, …: use the inner product (~cosine similarity) of x with each of them.

12 How does it work? The ‘basis’ functions are the sine and cosine waves with frequencies f = 1, 2, …, n.

13 How does it work? The basis functions are actually n-dimensional vectors, orthogonal to each other. The ‘similarity’ of x with each of them is an inner product. The DFT is, roughly, the collection of all the similarities of x with the basis functions.

14 How does it work? Since e^{jf} = cos(f) + j·sin(f), with j = sqrt(−1), we finally have the DFT pair:
X_f = (1/sqrt(n)) · Σ_{t=0}^{n−1} x_t · e^{−j2πft/n}
and the inverse DFT:
x_t = (1/sqrt(n)) · Σ_{f=0}^{n−1} X_f · e^{j2πft/n}

15 How does it work? Each X_f is a complex number: X_f = a + b·j, where a is the real part and b is the imaginary part.

16 How does it work? SYMMETRY property of the DFT of a real-valued signal: X_f = (X_{n−f})*, where “*” denotes the complex conjugate: (a + b·j)* = a − b·j. Thus we use only the first n/2 numbers.

17 DFT: Amplitude spectrum. The amplitude A_f = sqrt(a² + b²). Intuition: the strength of frequency ‘f’ in the signal. (The example figure shows a signal with a strong peak at frequency 12.)

18 Example: reconstruction using 1 coefficient.

19 Example: reconstruction using 2 coefficients.

20 Example: reconstruction using 7 coefficients.

21 Example: reconstruction using 20 coefficients.

22 So what? Using the DFT amplitude spectrum, we can achieve excellent approximations with only very few frequencies!

23 We can achieve excellent approximations with only very few frequencies! Hence, we can reduce the dimensionality of each time series by representing it with its k most dominant frequencies. Each frequency needs two numbers (real part and imaginary part), so a time series of length n can be represented using 2·k real numbers, where k << n.
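A minimal sketch of this compression, keeping only the first k frequencies (the conjugate mirrors X_{n−f} are restored for free by symmetry, so only about 2k real numbers need to be stored); the naive O(n²) DFT is for illustration only:

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def reconstruct_from_first_k(x, k):
    """Inverse DFT keeping only frequencies 0..k-1 plus their
    conjugate mirrors X_{n-f}, zeroing everything else."""
    X = dft(x)
    n = len(x)
    kept = [0j] * n
    for f in range(k):
        kept[f] = X[f]
        if f:                       # mirror keeps the reconstruction real
            kept[n - f] = X[f].conjugate()
    return [sum(kept[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n for t in range(n)]

def sq_err(x, y):
    """Squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

By Parseval, the error equals the energy of the discarded frequencies, so it can only shrink as k grows.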

24 Raw Data. The graphic shows a time series C with 128 points (n = 128). The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).

25 Fourier Coefficients. We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).

26 Truncated Fourier Coefficients. Keeping N = 8 of the n = 128 numbers (compression ratio 1/16) yields the approximation C′. We have discarded 15/16 of the data.

27 Sorted Truncated Fourier Coefficients. Instead of taking the first few coefficients, we could take the best (largest-magnitude) coefficients, yielding a better approximation C′.

28 Discrete Fourier Transform… recap. Pros and cons of DFT as a time series representation. Pros: good ability to compress most natural signals; fast, off-the-shelf DFT algorithms exist, with O(n·log(n)) running time. Cons: difficult to deal with sequences of different lengths.

29 Piecewise Aggregate Approximation (PAA). Basic idea: represent the time series as a sequence of box basis functions, each box being of the same length. Computation: a time series X of length n can be represented in the N-dimensional space as X̄ = (x̄_1, …, x̄_N), where x̄_i = (N/n) · Σ x_j, the sum taken over the i-th segment, i.e., j = (n/N)(i−1)+1, …, (n/N)·i. References: Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000); Byoung-Kee Yi & Christos Faloutsos, VLDB (2000).

30 Piecewise Aggregate Approximation (PAA). Example: a time series X of length n = 9 can be mapped from its original dimension to a lower dimension, e.g., N = 3, by averaging each block of three consecutive values.
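Since the slide's concrete numbers did not survive, here is a minimal sketch with made-up values; PAA simply averages each of the N equal-length segments (this sketch assumes N divides n evenly):

```python
def paa(x, N):
    """Piecewise Aggregate Approximation: split x into N equal-length
    segments and replace each segment by its mean."""
    seg = len(x) // N
    return [sum(x[i * seg:(i + 1) * seg]) / seg for i in range(N)]
```

For example, a series of length n = 9 reduces to N = 3 segment averages.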

31 Pros and cons of PAA as a time series representation. Pros: extremely fast to calculate; as efficient as other approaches (empirically); supports queries of arbitrary lengths; can support any Minkowski metric; supports non-Euclidean measures; simple and intuitive! Cons: if visualized directly, it looks aesthetically unpleasing.

32 Symbolic ApproXimation (SAX). Similar in principle to PAA: it uses segments to represent data series, but represents segments with symbols (rather than real numbers), giving it a small memory footprint.

33 Creating SAX. Input: a time series (blue curve). Output: the SAX representation of the input time series (red string), e.g., baabccbc. The pipeline is: input series → PAA → SAX.

34 The Process (STEP 1). Represent a time series T of length n with w segments using Piecewise Aggregate Approximation: PAA(T, w) = (t̄_1, …, t̄_w), where t̄_i = (w/n) · Σ_{j=(n/w)(i−1)+1}^{(n/w)·i} t_j.

35 The Process (STEP 2). Discretize the PAA vector into a vector of symbols: use breakpoints to map the segment averages to a small alphabet of a symbols, e.g., iSAX(T,4,4).

36 Symbol Mapping. Each average value from the PAA vector is replaced by a symbol from an alphabet. An alphabet size a of 5 to 8 is recommended: {a,b,c,d,e}, {a,b,c,d,e,f}, {a,b,c,d,e,f,g}, or {a,b,c,d,e,f,g,h}. Given an average value, we need a symbol.

37 Symbol Mapping. This is achieved by using the normal distribution from statistics: assuming our input series is normalized, we can use the normal distribution as the data model; we divide the area under the normal distribution into a equal-sized (equiprobable) regions, where a is the alphabet size; each such region is bounded by breakpoints.

38 SAX Computation – in pictures: the series C is mapped to the word baabccbc. (This slide is taken from Eamonn’s tutorial on SAX.)

39 Finding the Breakpoints. Breakpoints for different alphabet sizes can be structured as a lookup table of standard-normal quantiles, e.g., a=3: −0.43, 0.43; a=4: −0.67, 0, 0.67; a=5: −0.84, −0.25, 0.25, 0.84. When a = 3: average values below −0.43 are replaced by ‘a’; average values between −0.43 and 0.43 are replaced by ‘b’; average values above 0.43 are replaced by ‘c’.
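The lookup and symbol mapping can be sketched as follows; the breakpoints are the standard-normal quantiles that make each region equiprobable (shown here for a = 3 and a = 4 only):

```python
from bisect import bisect

# Standard-normal quantiles splitting the curve into a equiprobable regions.
BREAKPOINTS = {
    3: [-0.43, 0.43],
    4: [-0.67, 0.0, 0.67],
}

def sax_word(paa_values, a=3):
    """Replace each PAA average by a letter: 'a' for the lowest region,
    'b' for the next, and so on, via binary search over the breakpoints."""
    bps = BREAKPOINTS[a]
    return "".join(chr(ord('a') + bisect(bps, v)) for v in paa_values)
```

For example, with a = 3 the averages (−1.0, 0.0, 1.0) map to the word "abc".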

40 The GEMINI Framework. Raw data live in the original full-dimensional space; summarization gives a reduced-dimensionality space. Searching in the original space is costly; searching in the reduced space is faster: less data, indexing techniques available, lower bounding. Lower bounding enables us to prune the search space (throw away data series based on the reduced-dimensionality representation) while guaranteeing correctness of the answer: no false negatives; false positives are filtered out based on the raw data.

41 GEMINI Solution: quick filter-and-refine: extract m features (numbers, e.g., averages); map each series to a point in the m-dimensional feature space; organize the points; retrieve the answer using a NN query; discard the false alarms.

42 Generic Search using Lower Bounding. The query is simplified and run against the simplified DB, producing an answer superset; this superset is verified against the original DB to obtain the final answer set. No false negatives!! False positives are removed!!

43 GEMINI: contractiveness. GEMINI works when D_feature(F(x), F(y)) <= D(x, y). Note that the closer the feature distance is to the actual one, the better (tighter bounds prune more candidates).
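As a concrete instance of this condition, the PAA feature distance, scaled by the square root of the segment length, provably lower-bounds the true Euclidean distance; a small sketch:

```python
import math

def paa(x, N):
    """Average each of the N equal-length segments (N must divide len(x))."""
    seg = len(x) // N
    return [sum(x[i * seg:(i + 1) * seg]) / seg for i in range(N)]

def d_true(x, y):
    """Euclidean distance in the original space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_feature(x, y, N):
    """PAA distance scaled by sqrt(n/N); always <= d_true, so GEMINI's
    contractiveness condition holds and pruning yields no false negatives."""
    seg = len(x) // N
    return math.sqrt(seg * sum((a - b) ** 2
                               for a, b in zip(paa(x, N), paa(y, N))))
```

Any pair of equal-length series can be used to check the inequality.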

44 Streaming Algorithms. Similarity search is the bottleneck for most time series data mining algorithms, including streaming algorithms, and scaling such algorithms can be tedious when the target time series length becomes very large. Fast similarity search allows us to solve higher-level time series data mining problems, e.g., similarity search in data streams and motif discovery, at scales that would otherwise be untenable.

45 Fast Serial Scan. A streaming algorithm for fast and exact search of a query in very large data streams.

46 Z-normalization. Needed when we are interested in detecting trends and not absolute values. For streaming data, each subsequence of interest should be z-normalized before being compared to the z-normalized query; otherwise the trends are lost. Z-normalization guarantees offset invariance and scale/amplitude invariance.

47 Pre-Processing: z-Normalization. Data series encode trends. We are usually interested in identifying similar trends, but absolute values may mask this similarity.

48 Pre-Processing: z-Normalization. Two data series with similar trends, but large distance…

49 Pre-Processing: z-Normalization. Zero mean: compute the mean of the sequence, then subtract the mean from every value of the sequence.


53 Pre-Processing: z-Normalization. Standard deviation one: compute the standard deviation of the sequence, then divide every value of the sequence by the stddev.


56 Pre-Processing: z-Normalization. Result: zero mean, standard deviation one.

57 Pre-Processing: z-Normalization. When to z-normalize: when interested in trends. When not to z-normalize: when interested in absolute values.
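The two steps above amount to the following sketch (the near-constant-series guard is an implementation choice added here, not part of the slides):

```python
import math

def znorm(x, eps=1e-8):
    """Z-normalize: zero mean, unit (population) standard deviation."""
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    if sd < eps:            # near-constant series: no trend to preserve
        return [0.0] * n
    return [(v - mu) / sd for v in x]
```

Offset and scale invariance follow directly: shifting or (positively) scaling the input leaves the z-normalized output unchanged.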

58 Proposed Method: UCR Suite. An algorithm for similarity search in large data streams. It supports both ED and DTW search, works for both z-normalized and un-normalized data series, and combines various optimizations.

59 Squared Distance + LB. Use the squared distance (avoiding the square root) together with lower bounding: LB_Yi, LB_Kim, LB_Keogh (used in its squared form, LB_Keogh²).

60 Lower Bounds: LB_Yi, LB_Kim, LB_Keogh. (The figure shows the query Q with its upper/lower envelope U, L, a candidate C, and the extrema max(Q) and min(Q).)

61 Early Abandoning. Early abandoning of ED and early abandoning of LB_Keogh. (U, L is an envelope of Q.)
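Early abandoning of ED exploits the fact that the running sum of squared differences can only grow; a minimal sketch:

```python
def ed_sq_early_abandon(q, c, bsf_sq):
    """Squared Euclidean distance that abandons as soon as the running sum
    exceeds the best-so-far squared distance."""
    total = 0.0
    for a, b in zip(q, c):
        total += (a - b) ** 2
        if total >= bsf_sq:
            return float('inf')   # candidate cannot beat the best so far
    return total
```

The same early-exit pattern applies to accumulating LB_Keogh term by term.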

62 Early Abandoning of DTW. Compute the DTW distance incrementally and stop if dtw_dist ≥ bsf. (R is the warping window.)

63 Earlier Early Abandoning of DTW using LB_Keogh. Combine the (partial) dtw_dist computed so far for the prefix with the (partial) lb_keogh of the remaining suffix: stop if dtw_dist + lb_keogh ≥ bsf. (R is the warping window.)

64 Early Abandoning Z-Normalization. Do normalization only when needed (just in time). Every subsequence needs to be normalized before it is compared to the query, so online mean and std calculation is needed: keep a buffer of size m and compute a running mean and standard deviation.
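The buffer-based running statistics can be sketched with two running sums, one of x and one of x² (a sketch of the just-in-time normalization idea; in practice such running sums accumulate floating-point error on long streams, so implementations periodically recompute them from the buffer):

```python
import math
from collections import deque

class RunningStats:
    """Mean/std of the last m stream points in O(1) per update."""
    def __init__(self, m):
        self.m, self.buf = m, deque()
        self.s = self.s2 = 0.0

    def push(self, v):
        self.buf.append(v)
        self.s += v
        self.s2 += v * v
        if len(self.buf) > self.m:      # evict the oldest point
            old = self.buf.popleft()
            self.s -= old
            self.s2 -= old * old

    def mean(self):
        return self.s / len(self.buf)

    def std(self):
        mu = self.mean()
        return math.sqrt(max(self.s2 / len(self.buf) - mu * mu, 0.0))
```

With the window mean and std in hand, each subsequence point can be z-normalized on the fly as (v − mean) / std.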

65 The Pseudocode.

66 Reordering Early Abandoning. We don’t have to compute ED or LB from left to right; instead, order the points by expected contribution. Idea: order by the absolute height of the (z-normalized) query point. This step is performed only once per query and can save about 30%–50% of the calculations.

67 Reordering: intuition. The query will be compared to many data stream points during a search. Candidates are z-normalized, so the distribution of many candidates will be Gaussian with zero mean; the sections of the query that are farthest from the mean (zero) will, on average, have the largest contributions to the distance measure.
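The reordered early-abandoning distance can be sketched as follows; the ordering is computed once per query and reused against every candidate:

```python
def contribution_order(q):
    """Visit query positions in decreasing |q[i]|: against z-normalized
    (zero-mean) candidates these contribute the most distance on average,
    so early abandoning triggers sooner."""
    return sorted(range(len(q)), key=lambda i: abs(q[i]), reverse=True)

def ed_sq_reordered(q, c, order, bsf_sq):
    """Squared ED accumulated in the given order, with early abandoning."""
    total = 0.0
    for i in order:
        total += (q[i] - c[i]) ** 2
        if total >= bsf_sq:
            return float('inf')
    return total
```

When not abandoned, the result is identical to the left-to-right squared distance; only the abandonment point moves earlier.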

68 Different Envelopes. Reversing the query/data role in LB_Keogh: build the envelope on the candidate C instead of on the query Q. This can make LB_Keogh tighter, is much cheaper than computing DTW, and the envelope can be computed online.
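A sketch of the envelope construction and the LB_Keogh bound; the same functions work whether the envelope is built on the query (classic) or on the candidate (reversed role). This naive envelope is O(n·r); streaming implementations use a monotonic-queue trick to get O(n):

```python
def envelope(s, r):
    """Upper/lower envelope of series s for warping window r:
    U[i] = max(s[i-r : i+r+1]), L[i] = min(s[i-r : i+r+1])."""
    n = len(s)
    U = [max(s[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    L = [min(s[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    return U, L

def lb_keogh_sq(c, U, L):
    """Sum of squared excursions of c outside the envelope: a lower bound
    on the squared DTW distance within the same warping window."""
    total = 0.0
    for ci, ui, li in zip(c, U, L):
        if ci > ui:
            total += (ci - ui) ** 2
        elif ci < li:
            total += (ci - li) ** 2
    return total
```

A series always lies inside its own envelope, so the bound is zero there; points of the other series outside the band pay a squared penalty.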

69 Cascading Lower Bounds. At least 18 lower bounds of DTW have been proposed. Use only the lower bounds on the skyline of tightness (LB/DTW) versus computation cost, cascading from cheap and loose to expensive and tight.

70 Experimental Result: Random Walk. Varying the size of the data: millions (seconds), billions (minutes), trillions (hours), comparing UCR-ED, SOTA-ED, UCR-DTW, and SOTA-DTW. (The exact timings did not survive in this transcript.) Code and data are available at:

71 Experimental Result: ECG. Data: one year of electrocardiograms, 8.5 billion data points. Query: an idealized Premature Ventricular Contraction (PVC, aka skipped beat) of length 421 (R = 21 = 5%).
UCR-ED: 4.1 minutes; SOTA-ED: 66.6 minutes; UCR-DTW: 18.0 minutes; SOTA-DTW: 49.2 hours.
~30,000X faster than real time!

72 Up next…
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: No Lecture
Jan 14: EXAM
Feb 23: Re-EXAM

