SAX: a Novel Symbolic Representation of Time Series

Presentation transcript:

SAX: a Novel Symbolic Representation of Time Series
Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi
Presenter: Arif Bin Hossain
Slides incorporate materials kindly provided by Prof. Eamonn Keogh

Time Series
A time series is a sequence of data points, typically measured at successive times spaced at uniform intervals. [Wiki]
Examples: economic, sales, and stock market forecasting; EEG, ECG, and BCI analysis.

Problems
Join: Given two data collections, link items occurring in each.
Annotation: Obtain additional information from given data.
Query by content: Given a large data collection, find the k objects most similar to an object of interest.
Clustering: Given an unlabeled dataset, arrange its objects into groups by their mutual similarity.

Problems (Cont.)
Classification: Given a labeled training set, classify future unlabeled examples.
Anomaly Detection: Given a large collection of objects, find the one that is most different from all the rest.
Motif Finding: Given a large collection of objects, find the pair that is most similar.

Data Mining Constraints
For example, suppose you have one gig of main memory and want to do k-means clustering:
Clustering ¼ gig of data: 100 sec
Clustering ½ gig of data: 200 sec
Clustering 1 gig of data: 400 sec
Clustering 1.1 gigs of data: a few hours
Bradley, Fayyad, & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15

Generic Data Mining
Create an approximation of the data that fits in main memory, yet retains the essential features of interest.
Approximately solve the problem at hand in main memory.
Make (hopefully very few) accesses to the original data on disk to confirm the solution.
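The three steps above can be sketched as a filter-and-refine nearest-neighbor search. This is only an illustration under stated assumptions: the "approximation" here is simply the first k coordinates of each series (a hypothetical stand-in for a real representation), chosen because the distance on a prefix can never exceed the full Euclidean distance. The function names are made up for this example.

```python
import math


def euclidean(a, b):
    """Full Euclidean distance (stands in for an expensive disk access)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prefix_lower_bound(a, b, k=2):
    """Cheap distance on the first k coordinates only.

    Dropping non-negative squared terms can only shrink the sum, so this
    never exceeds the full Euclidean distance.
    """
    return euclidean(a[:k], b[:k])


def nearest_neighbor(query, data, k=2):
    """Return (index, distance, number_of_full_distance_computations)."""
    # Steps 1 and 2: rank all candidates using only the approximation.
    bounds = [prefix_lower_bound(query, series, k) for series in data]
    order = sorted(range(len(data)), key=lambda i: bounds[i])

    best_index, best_dist, full_computations = -1, math.inf, 0
    for i in order:
        # Every remaining candidate has a lower bound at least this large,
        # so none of them can beat the best match found so far.
        if bounds[i] >= best_dist:
            break
        # Step 3: confirm against the original data.
        d = euclidean(query, data[i])
        full_computations += 1
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist, full_computations
```

Because the cheap distance is a lower bound, the pruning step cannot discard the true nearest neighbor; it only reduces how often the original data has to be touched.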

Some Common Approximations

Why Symbolic Representation?
Reduce dimension
Numerosity reduction
Hashing
Suffix trees
Markov models
Stealing ideas from the text processing / bioinformatics community

Symbolic Aggregate ApproXimation (SAX)
Lower bounding of Euclidean distance
Lower bounding of the DTW distance
Dimensionality reduction
Numerosity reduction
Example SAX word: baabccbc

SAX
Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n).
Notation:
C: a time series C = c1, …, cn
Ć: a Piecewise Aggregate Approximation of a time series, Ć = ć1, …, ćw
Ĉ: a symbolic representation of a time series, Ĉ = ĉ1, …, ĉw
w: the number of PAA segments representing C
a: the alphabet size

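How to obtain SAX?
Step 1: Reduce dimension by PAA
A time series C of length n can be represented in a w-dimensional space by a vector Ć = ć1, …, ćw.
The ith element is calculated by
$ć_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j$
Example: to reduce the dimension from 20 to 5, the 2nd element will be
$ć_2 = \frac{5}{20} \sum_{j=5}^{8} c_j = \frac{1}{4}(c_5 + c_6 + c_7 + c_8)$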

How to obtain SAX?
The data is divided into w equal-sized frames.
The mean value of the data falling within each frame is calculated.
The vector of these values becomes the PAA representation.
[Figure: a time series C and its PAA approximation]
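A minimal sketch of the PAA step described above, assuming for simplicity that w divides n exactly (the function name `paa` is chosen for this example only):

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal frames.

    Assumption for this sketch: len(series) is an exact multiple of w.
    """
    n = len(series)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch assumes w > 0 and that w divides n")
    frame = n // w  # number of points per frame, i.e. n / w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


# Reducing a series of length 20 to 5 values: each value averages 4 points,
# so the 2nd element is the mean of the 5th through 8th points.
reduced = paa(list(range(1, 21)), 5)
```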

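How to obtain SAX?
Step 2: Discretization
Normalize so that the values follow an approximately Gaussian distribution.
Determine the breakpoints that produce a equal-sized areas under the Gaussian curve, where a is the alphabet size.
Map each PAA value to the symbol of the region it falls in.
[Figure: PAA segments mapped to the symbols a, b, and c, giving the word baabccbc. Word length: 8, alphabet size: 3]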
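The discretization step can be sketched as follows. This is an illustrative version, not the authors' code: it assumes the series is z-normalized (zero mean, unit standard deviation) before PAA, that w divides n, and that a value lying exactly on a breakpoint goes to the higher symbol. All function names are invented for the example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def breakpoints(a):
    """The a - 1 cut points splitting the standard normal into a equal areas."""
    nd = NormalDist()
    return [nd.inv_cdf(i / a) for i in range(1, a)]


def znorm(series):
    """Rescale to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # constant series: nothing to scale
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Mean of each of w equal frames (assumes w divides len(series))."""
    frame = len(series) // w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


def sax(series, w, a):
    """Convert a numeric series into a SAX word of length w over a letters."""
    cuts = breakpoints(a)
    reduced = paa(znorm(series), w)
    # bisect_right counts the breakpoints <= value, which is the 0-based
    # index of the region (and therefore the letter) the value falls in.
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in reduced)


word = sax([0, 0, 0, 0, 10, 10, 10, 10, 5, 5, 5, 5], w=3, a=3)
```

For a = 3 the two breakpoints are roughly -0.43 and 0.43, so low frames map to a, middle frames to b, and high frames to c.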

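Distance Measure
Given 2 time series Q and C of length n:
Euclidean distance: $D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
Distance after transforming the subsequences to PAA: $DR(Q́, Ć) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (q́_i - ć_i)^2}$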
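A small sketch comparing the two distances on this slide. It assumes w divides n, and the function names are chosen for the example. The PAA distance is scaled by sqrt(n / w) and should never exceed the Euclidean distance on the raw series.

```python
import math


def paa(series, w):
    """Mean of each of w equal frames (assumes w divides len(series))."""
    frame = len(series) // w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


def euclidean(q, c):
    """Euclidean distance between two raw series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, c)))


def paa_distance(q, c, w):
    """Distance between the PAA representations, scaled by sqrt(n / w)."""
    n = len(q)
    q_reduced, c_reduced = paa(q, w), paa(c, w)
    inner = sum((x - y) ** 2 for x, y in zip(q_reduced, c_reduced))
    return math.sqrt(n / w) * math.sqrt(inner)


q, c = [0, 2, 4, 6], [1, 1, 1, 1]
raw = euclidean(q, c)               # distance on the original series
reduced = paa_distance(q, c, w=2)   # lower bound computed from 2 values each
```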

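Distance Measure
Define MINDIST after transforming to the symbolic representation:
$MINDIST(Q̂, Ĉ) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} dist(q̂_i, ĉ_i)^2}$
where dist() is read from a lookup table built from the Gaussian breakpoints.
MINDIST lower bounds the true distance between the original time series.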
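A hedged sketch of MINDIST. It assumes the usual SAX lookup rule: the distance between two symbols is 0 when they are identical or adjacent, and otherwise it is the gap between the breakpoints that separate them. Letters are taken to be lowercase starting at "a", and the function names are illustrative only.

```python
import math
from statistics import NormalDist


def breakpoints(a):
    """The a - 1 cut points splitting the standard normal into a equal areas."""
    nd = NormalDist()
    return [nd.inv_cdf(i / a) for i in range(1, a)]


def symbol_dist(r, c, cuts):
    """Lookup-table entry for two 0-based symbol indices r and c."""
    if abs(r - c) <= 1:
        return 0.0  # identical or adjacent regions contribute nothing
    # Gap between the lower edge of the higher region and the upper edge
    # of the lower region.
    return cuts[max(r, c) - 1] - cuts[min(r, c)]


def mindist(q_word, c_word, n, a):
    """MINDIST between two SAX words built from series of original length n."""
    if len(q_word) != len(c_word):
        raise ValueError("words must have the same length")
    w = len(q_word)
    cuts = breakpoints(a)
    total = sum(
        symbol_dist(ord(x) - ord("a"), ord(y) - ord("a"), cuts) ** 2
        for x, y in zip(q_word, c_word)
    )
    return math.sqrt(n / w) * math.sqrt(total)
```

Note that adjacent symbols such as a and b contribute zero, so two quite different words can still have a MINDIST of 0; that is what keeps the measure a lower bound.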

Numerosity Reduction
Subsequences are extracted by a sliding window.
Consecutive subsequences are mostly repetitive.
Suppose the sliding window finds aabbcc. If the next subsequence is also aabbcc, just store the position.
This optimization depends on the data, but typically yields a reduction factor of 2 or 3.
[Figure: space shuttle telemetry with subsequence length 32]
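One way to sketch this idea: given the SAX words produced by successive sliding-window positions, keep a word only when it differs from the word immediately before it, together with the position where it first appeared. The function name and the (position, word) output format are assumptions made for this example.

```python
def numerosity_reduce(words):
    """Collapse runs of identical consecutive words.

    `words` is the list of SAX words from successive window positions.
    Returns (position, word) pairs, one per run of identical words.
    """
    kept = []
    previous = None
    for position, word in enumerate(words):
        if word != previous:
            kept.append((position, word))
            previous = word
    return kept


windows = ["aabbcc", "aabbcc", "aabbcc", "abbccc", "abbccc", "aabbcc"]
reduced = numerosity_reduce(windows)
```

Here six window positions collapse to three stored entries, a reduction factor of 2.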

Experimental Validation
Clustering: hierarchical, partitional
Classification: nearest neighbor, decision tree
Motif discovery

Hierarchical Clustering
The sample dataset consists of 3 series from each of the decreasing trend, upward shift, and normal classes.

Partitional Clustering (k-means)
Assign each point to the one of k clusters whose center is nearest.
Each iteration tries to minimize the sum of squared intra-cluster errors.
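The two statements above correspond to the assignment and update steps of plain k-means. The following is a generic sketch on raw numeric points, not the experimental setup from the slides; the caller supplies the initial centers, and the function names are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, centers, iterations=10):
    """Plain k-means starting from caller-supplied initial centers.

    Returns (labels, centers). The labels come from the last assignment
    step, which happens just before the final center update.
    """
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members, which
        # minimizes the within-cluster sum of squared errors for that assignment.
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


labels, centers = kmeans(
    [[0, 0], [0, 1], [10, 10], [10, 11]], centers=[[0, 0], [10, 10]]
)
```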

Nearest Neighbor Classification
SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction.

Decision Tree Classification
Since decision trees are expensive to use with high-dimensional datasets, the Regression Tree [Geurts.2001] is a better approach for data mining on time series.

Motif Discovery
Implemented the random projection algorithm of Tompa and Buhler [ICMB2001].
Subsequences are hashed into buckets using a random subset of their features as a key.
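The bucketing step can be illustrated as below. This shows only a single projection round, under the assumption that the "features" are the symbol positions of SAX words; the full algorithm repeats this with different random masks and counts collisions. The function names and the seed parameter are assumptions for the example.

```python
import random
from collections import defaultdict


def random_mask(word_length, k, seed=0):
    """Pick k distinct symbol positions at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(word_length), k))


def project_into_buckets(words, mask):
    """Hash each word by the symbols found at the masked positions.

    Returns a dict mapping each key to the indices of the words sharing it.
    Words that land in the same bucket are candidate motif pairs.
    """
    buckets = defaultdict(list)
    for index, word in enumerate(words):
        key = "".join(word[p] for p in mask)
        buckets[key].append(index)
    return dict(buckets)


words = ["abca", "abba", "cbca", "abcc"]
buckets = project_into_buckets(words, mask=[0, 1])
```

Words that agree on the masked positions collide even if they differ elsewhere, which is what lets the method tolerate small differences between motif occurrences.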

New Version: iSAX
Uses binary numbers for labeling the words.
Allows different alphabet sizes (cardinalities) within a word.
Supports comparison of words with different cardinalities.
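A small sketch of why binary labels make mixed cardinalities comparable. It assumes cardinalities are powers of two, equiprobable Gaussian breakpoints, and that the lowest region is labeled with all zeros. Under those assumptions the breakpoints for 4 symbols are a subset of those for 8, so a symbol at the lower cardinality is a bit prefix of the same value's symbol at the higher cardinality. The function name is hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def isax_symbol(value, bits):
    """Binary label of the region containing `value` at cardinality 2**bits."""
    cardinality = 2 ** bits
    nd = NormalDist()
    cuts = [nd.inv_cdf(i / cardinality) for i in range(1, cardinality)]
    # Number of breakpoints <= value is the 0-based region index.
    return format(bisect_right(cuts, value), f"0{bits}b")


coarse = isax_symbol(0.3, bits=2)  # cardinality 4
fine = isax_symbol(0.3, bits=3)    # cardinality 8
same_region = fine.startswith(coarse)
```

Dropping trailing bits from the finer symbol recovers the coarser one, so two symbols of different cardinality can be compared on their shared prefix.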

Thank you Questions?