Learning Time-Series Shapelets
Josif Grabocka, Nicolas Schilling, Martin Wistuba, Lars Schmidt-Thieme
Information Systems and Machine Learning Lab, University of Hildesheim
SIGKDD 2014

Outline
- Introduction
- Related Work
- Proposed Method
- Analysis of the Proposed Method
- Learning General Shapelets
- Experimental Results
- Conclusion and Comments

Shapelet

Figure 1: (left) Skulls of horned lizards and turtles. (right) The time series representing the images; the 2D shapes are converted to time series using the technique in [14].

Shapelet

Figure 2: (left) The shapelet that best distinguishes between skulls of horned lizards and turtles, shown as the purple/bold subsequence. (right) The shapelet projected back onto the original 2D shape space.

Shapelet Orderline

Figure 3: The orderline places each time series on an axis from 0 to ∞ according to its distance to the candidate subsequence; a split point divides the classes. The three objects on the left-hand side correspond to horned lizards and the three on the right correspond to turtles.

SOTA Shapelet Mining Methods
- State-of-the-art methods discover shapelets by evaluating a pool of candidate subsequences drawn from all possible series segments, then ranking the candidates by how well they predict the target.
- A method called shapelet transformation has recently improved prediction accuracy further.

The Proposed Method
- This work formulates shapelet discovery as the optimization of a classification objective function.
- It then proposes a method that learns (rather than searches for) the shapelets that optimize this objective.
- Concretely, the method learns shapelets whose minimum distances to the series linearly separate the time-series instances by their targets.
- In contrast to existing approaches, it can learn near-to-optimal shapelets and capture true top-K shapelet interactions.

Original Concept, Quality Metrics and Shapelet Transformation
- Shapelets were first proposed as time-series segments that maximally predict the target variable. All possible segments were considered as candidates; the minimum distance of a candidate to each training series served as a predictive feature, and candidates were ranked by the information gain of that feature with respect to the target variable.
- Other quality measures include the F-statistic, Kruskal-Wallis and Mood's median.
- Standard classifiers achieve high accuracy on the resulting shapelet-transformed representation.

Speed-up Techniques
- Early abandoning of distance computations.
- Entropy pruning of the information-gain metric.
- Reuse of computations.
- Pruning of the search space.
- Exploiting projections on the SAX representation.
- Exploiting infrequent shapelet candidates.
- Hardware-assisted shapelet discovery using GPUs.

Real-life Applications
- Clustering time series using unsupervised shapelets.
- Identifying humans by their gait data.
- Gesture recognition.
- Early classification of medical and health-informatics time series.

Key Techniques
- Shapelet transformation
- Logistic regression
- Stochastic gradient descent
- K-means clustering

Definitions and Notations
- Time-series dataset: A time-series dataset of I training instances, each containing Q ordered values, is denoted T ∈ ℝ^{I×Q}; the series targets form a nominal variable Y ∈ {1, ..., C}^I with C categories. (The proposed method can also operate on series of variable length.)
- Sliding-window segment: A sliding-window segment of length L is an ordered subsequence of a series. The segment starting at time j inside the i-th series is (T_{i,j}, ..., T_{i,j+L−1}). Incrementing the window's starting index by one yields J := Q − L + 1 segments per series.
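The per-segment distances and the shapelet-transformed features that the deck builds on did not survive the transcript. Below is a hedged LaTeX reconstruction following the paper's notation, assuming S ∈ ℝ^{K×L} denotes the K shapelets of length L:

```latex
% Distance between shapelet S_k and the j-th segment of series i
D_{i,k,j} = \frac{1}{L} \sum_{l=1}^{L} \left( T_{i,j+l-1} - S_{k,l} \right)^2
% Shapelet-transformed feature: minimum distance over all J segments
M_{i,k} = \min_{j = 1, \dots, J} D_{i,k,j}
```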

Learning Model
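The model equation on this slide is missing; a hedged reconstruction from the paper's formulation, treating the predicted target as a linear function of the K minimum-distance features:

```latex
\hat{Y}_i = W_0 + \sum_{k=1}^{K} M_{i,k} \, W_k, \qquad i = 1, \dots, I
```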

Loss Function
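The loss on this slide is, in the paper, the logistic (cross-entropy) loss; a hedged reconstruction for the binary case with targets Y_i ∈ {0, 1}:

```latex
\mathcal{L}(Y_i, \hat{Y}_i)
= -Y_i \ln \sigma(\hat{Y}_i) - (1 - Y_i) \ln\!\left(1 - \sigma(\hat{Y}_i)\right),
\qquad
\sigma(\hat{Y}_i) = \frac{1}{1 + e^{-\hat{Y}_i}}
```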

Regularized Objective Function
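A hedged reconstruction of the regularized objective, where λ_W controls an ℓ2 penalty on the classification weights:

```latex
\underset{S, W, W_0}{\arg\min} \; F(S, W)
= \sum_{i=1}^{I} \mathcal{L}(Y_i, \hat{Y}_i) + \lambda_W \lVert W \rVert^2
```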

Differentiable Soft-Minimum Function
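The hard minimum M_{i,k} is not differentiable with respect to the shapelets, so the paper replaces it with a soft minimum; a hedged reconstruction, with α the soft-min precision:

```latex
\hat{M}_{i,k}
= \frac{\sum_{j=1}^{J} D_{i,k,j} \, e^{\alpha D_{i,k,j}}}
       {\sum_{j'=1}^{J} e^{\alpha D_{i,k,j'}}}
```

For large negative α, the exponential weights concentrate on the smallest distance, so \hat{M}_{i,k} approaches the true minimum min_j D_{i,k,j}.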

Per-Instance Objective
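The overall objective decomposes into per-instance terms, which is what enables stochastic gradient descent. A hedged reconstruction, splitting the regularizer evenly across the I instances so that summing F_i over i recovers F:

```latex
F_i(S, W) = \mathcal{L}(Y_i, \hat{Y}_i) + \frac{\lambda_W}{I} \sum_{k=1}^{K} W_k^2
```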

Gradients for Shapelets
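The gradient slides did not survive the transcript. By the chain rule over the definitions above (a hedged derivation, not a verbatim copy of the slides), the per-instance gradient with respect to a shapelet value S_{k,l} decomposes as:

```latex
\frac{\partial F_i}{\partial S_{k,l}}
= \frac{\partial \mathcal{L}(Y_i,\hat{Y}_i)}{\partial \hat{Y}_i}
  \, W_k
  \sum_{j=1}^{J}
  \frac{\partial \hat{M}_{i,k}}{\partial D_{i,k,j}}
  \, \frac{\partial D_{i,k,j}}{\partial S_{k,l}}
```

with the individual factors:

```latex
\frac{\partial \mathcal{L}(Y_i,\hat{Y}_i)}{\partial \hat{Y}_i}
= -\left(Y_i - \sigma(\hat{Y}_i)\right),
\qquad
\frac{\partial \hat{M}_{i,k}}{\partial D_{i,k,j}}
= \frac{e^{\alpha D_{i,k,j}} \left( 1 + \alpha \left( D_{i,k,j} - \hat{M}_{i,k} \right) \right)}
       {\sum_{j'=1}^{J} e^{\alpha D_{i,k,j'}}},
\qquad
\frac{\partial D_{i,k,j}}{\partial S_{k,l}}
= \frac{2}{L}\left(S_{k,l} - T_{i,j+l-1}\right)
```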

Gradients for Weights
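The corresponding per-instance gradients for the classification weights, again reconstructed from the per-instance objective above:

```latex
\frac{\partial F_i}{\partial W_k}
= -\left(Y_i - \sigma(\hat{Y}_i)\right) \hat{M}_{i,k} + \frac{2 \lambda_W}{I} W_k,
\qquad
\frac{\partial F_i}{\partial W_0}
= -\left(Y_i - \sigma(\hat{Y}_i)\right)
```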

Optimization Algorithm
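The algorithm listing on this slide is missing. Below is a minimal Python sketch of a stochastic-gradient learning loop built from the reconstructed formulas above; it is an illustration under those assumptions, not the paper's code, and names such as `learn_shapelets`, `eta`, and `alpha` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def learn_shapelets(T, Y, K=4, L=20, eta=0.01, lam=0.01, alpha=-30.0,
                    max_iter=300, seed=0):
    """Minimal SGD sketch: jointly learn K shapelets of length L and a
    logistic model (W, W0) over soft-minimum distances.
    T is an (I, Q) array of z-normalized series, Y a binary {0,1} target."""
    rng = np.random.default_rng(seed)
    I, Q = T.shape
    J = Q - L + 1
    # Naive init with random training segments; the deck instead uses
    # k-means centroids of all segments (see the initialization slide).
    S = np.stack([T[rng.integers(I), j:j + L]
                  for j in rng.integers(0, J, size=K)])
    W = rng.normal(scale=0.01, size=K)
    W0 = 0.0
    for _ in range(max_iter):
        for i in rng.permutation(I):
            segs = np.stack([T[i, j:j + L] for j in range(J)])          # (J, L)
            D = ((segs[None, :, :] - S[:, None, :]) ** 2).mean(axis=2)  # (K, J)
            # Soft-min weights; shifting by the row minimum is numerically
            # stable and leaves the soft-min and its gradient unchanged.
            E = np.exp(alpha * (D - D.min(axis=1, keepdims=True)))      # (K, J)
            Psi = E.sum(axis=1)                                         # (K,)
            M = (D * E).sum(axis=1) / Psi                               # soft-min
            y_hat = W0 + M @ W
            err = sigmoid(y_hat) - Y[i]                                 # dL/dy_hat
            # Chain rule: dF/dS via dM/dD and dD/dS
            dM_dD = E * (1.0 + alpha * (D - M[:, None])) / Psi[:, None] # (K, J)
            dD_dS = (2.0 / L) * (S[:, None, :] - segs[None, :, :])      # (K, J, L)
            grad_S = err * W[:, None] * (dM_dD[:, :, None] * dD_dS).sum(axis=1)
            S -= eta * grad_S
            W -= eta * (err * M + (2.0 * lam / I) * W)
            W0 -= eta * err
    return S, W, W0
```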

Convergence
- Convergence of the optimization algorithm depends on two hyper-parameters: the learning rate η and the maximum number of iterations.
- Suitable values for both are selected via cross-validation.

Model Initialization

- If the initialization starts the learning in a region containing the global optimum, gradient descent can move the parameters to the exact location of that optimum.
- To robustify the initial guesses, this work uses the k-means centroids of all training segments as initial shapelet values. Since centroids represent typical patterns in the data, they offer a good variety of shapes for initializing shapelets and help the method achieve high prediction accuracy.
- The hyper-plane weights W are initialized randomly around zero.
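A minimal sketch of the centroid-based initialization described above; using scikit-learn's `KMeans` is an assumption (the deck does not name a library):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_shapelets_kmeans(T, K, L, seed=0):
    """Initialize K shapelets of length L as the k-means centroids of
    all sliding-window segments of the training series T (shape (I, Q))."""
    I, Q = T.shape
    segments = np.array([T[i, j:j + L]
                         for i in range(I)
                         for j in range(Q - L + 1)])
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(segments)
    return km.cluster_centers_          # shape (K, L)
```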

Illustrating The Mechanism

Algorithmic Complexity
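The slide's exact statement was lost; a hedged back-of-the-envelope derivation from the definitions above: computing all soft-minimum features, and their gradients, touches every (series, shapelet, segment, position) combination once, so one epoch costs

```latex
O\!\left( I \cdot K \cdot J \cdot L \right)
= O\!\left( I \cdot K \cdot (Q - L + 1) \cdot L \right)
```

and total training costs this times the number of iterations.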

VS. SOTA: Learning Near-to-Optimal Shapelets
- This work: given an appropriate initialization, the gradient-descent approach can find a near-to-optimal minimum.
- Baselines: no such guarantee, for two reasons. First, they are restricted to shapelet candidates drawn from the pool of series segments and cannot explore shapes that do not appear literally as segments. Second, minimizing the classification objective through candidate guessing carries no optimality guarantee.

VS. SOTA: Capturing Interactions Among Shapelets
- The baselines score each shapelet independently, ignoring interactions among patterns.
- In reality, two shapelets can each be individually sub-optimal yet improve results when combined.
- This problem is well known in data mining as variable subset selection.

- The baselines could address this problem by exhaustively searching over all combinations of candidate shapelets, but that is prohibitively costly in practice.
- The proposed method, by contrast, captures these interactions at a cost linear in K, because it learns the shapelets and their interaction weights jointly.

VS. SOTA: One Weaker Aspect
- The proposed method relies on more hyper-parameters than the baselines: the learning rate η, the number of iterations, the regularization weight λ_W and the soft-min precision α.
- Nonetheless, its very high accuracy outweighs the extra tuning effort.

Extending to Multi-class Cases
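The slide body is missing; the paper extends the binary model with a one-vs-all encoding (a hedged reconstruction): each class c gets its own weight vector while all classes share the learned shapelets, and the losses are summed over classes.

```latex
\hat{Y}_{i,c} = W_{0,c} + \sum_{k=1}^{K} \hat{M}_{i,k} \, W_{k,c},
\qquad c = 1, \dots, C
```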

Extending to Non-fixed Shapelet Length Cases
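Also missing from the transcript; the paper learns shapelets of several lengths simultaneously. A hedged sketch of that scheme: R scales with lengths that are multiples of a minimum length L_min, each scale contributing its own block of K soft-minimum features to the linear model.

```latex
\hat{Y}_i = W_0 + \sum_{r=1}^{R} \sum_{k=1}^{K} \hat{M}^{(r)}_{i,k} \, W^{(r)}_{k},
\qquad \text{shapelet length at scale } r: \; r \cdot L_{\min}
```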

Dataset & Hyper-parameter Search

Baselines
- Shapelet tree methods, built from shapelets whose quality is measured using:
  - Information gain quality criterion (IG)
  - Kruskal-Wallis quality criterion (KW)
  - F-Stat quality criterion (FST)
  - Mood's median criterion (MM)

Baselines
- Basic classifiers, learned over shapelet-transformed data:
  - Nearest neighbors (1NN)
  - Naïve Bayes (NB)
  - C4.5 decision tree (C4.5)

Baselines
- More complex classifiers, learned over shapelet-transformed data:
  - Bayesian networks (BN)
  - Random forest (RAF)
  - Rotation forest (ROF)
  - Support vector machines (SVM)

Baselines
- Other related methods:
  - Fast Shapelets (FSH)
  - Dynamic Time Warping (DTW)

Conclusion and Comments
- Learning, not searching for, shapelets.
- Built from classic machine-learning techniques.
- Pros:
  - Very high accuracy.
  - Competitive running time.
- Cons:
  - Painstaking hyper-parameter tuning.
  - Inadequate interpretability.