Automatic Segmentation of Data Sequences

Automatic Segmentation of Data Sequences Liangzhe Chen, Sorour E. Amiri, B. Aditya Prakash Department of Computer Science Virginia Tech

Outline: Motivation and Introduction; Our Framework and Solution; Goal 1: Summarizing Time Segments; Goal 2: Constructing the Segment-Graph; Goal 3: Finding the Best Segmentation; Experiment Results; Conclusions

Motivation: Find pattern changes in multi-dimensional value sequences. Epidemiology: how do we find pattern changes in disease propagation? Motion detection: how do we detect different motions in motion sequences?

Multi-Dimensional Data Sequences: We study sequences with real or categorical multi-dimensional values and arbitrary time stamps.

Informal Problem Definition. Given: a data sequence {(x_1, t_1), (x_2, t_2), …, (x_N, t_N)}, where (x_i, t_i) is an observation of a d-dimensional vector x_i at time t_i. Find: a time segmentation such that consecutive time segments are not similarly informative. Notation: x: a data value; X: the set of all x's; y_{i,j}: the time segment [t_i, t_j); Y: the set of all y's.

Limitations of Existing Work. Time series algorithms: assume data values are uniformly distributed over time (the number of data values is proportional to the length of the time period). Event sequence analysis: each event is a 1-dimensional categorical value. Both are too restrictive for real-world data sequences.

Our idea: Holistic segmenting of multi-dimensional value sequences using all possible time segments.

Main Idea: Segment Graph. Nodes: all time segments, plus a source (‘s’) and a target (‘t’), where the source is the start of time and the target is the end of time. Edges: there is a directed edge between adjacent time segments. The edge weight w measures the difference between the two time segments.

Convert to Path Optimization. Observation: for each segmentation, there is a path from ‘s’ to ‘t’, and for each path from ‘s’ to ‘t’, there is a segmentation. Therefore: best segmentation problem = path optimization problem.
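
The correspondence between segmentations and paths can be illustrated with a small sketch. It assumes a finite set of candidate boundaries and that ‘s’ links to every segment starting at the first boundary while every segment ending at the last boundary links to ‘t’; the function names `build_segment_graph`, `all_paths` and `path_to_cut_points` are hypothetical and not taken from the DASSA code.

```python
from itertools import combinations


def build_segment_graph(boundaries):
    """Nodes: every segment [b_i, b_j) with i < j, plus 's' and 't'.
    Edges: segment -> segment when the first ends where the second starts."""
    b = sorted(boundaries)
    segments = list(combinations(b, 2))  # (start, end) pairs with start < end
    edges = {"s": [seg for seg in segments if seg[0] == b[0]], "t": []}
    for seg in segments:
        successors = [other for other in segments if other[0] == seg[1]]
        if seg[1] == b[-1]:
            successors.append("t")  # assumption: last segments link to the target
        edges[seg] = successors
    return edges


def all_paths(edges, node="s"):
    """Enumerate every path from `node` to 't' (only feasible for tiny graphs)."""
    if node == "t":
        yield ["t"]
        return
    for nxt in edges[node]:
        for rest in all_paths(edges, nxt):
            yield [node] + rest


def path_to_cut_points(path):
    """The interior segment ends along an s-t path are the cut points."""
    segs = [n for n in path if n not in ("s", "t")]
    return [seg[1] for seg in segs[:-1]]


graph = build_segment_graph([0, 1, 2, 3])
paths = list(all_paths(graph))
```

With four boundaries there are two interior boundaries, so the graph has one path per subset of them; each path reads back as exactly one segmentation.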

Proposed Method: DASSA [submitted to PLoS’17] Goal 1: Cluster co-occurring data values to find a summary for each time segment

Proposed Method: DASSA [submitted to PLoS’17] Goal 2: Construct a segment graph to efficiently represent all possible time segments.

Proposed Method: DASSA [submitted to PLoS’17] Goal 3: Find the best segmentation

Goal 1: Summarize Time Segments Input data sequence Over-segmenting the sequence

Goal 1: Summarize Time Segments. Our idea: data values that co-occur in the sequence should be regarded as the same for the segmentation problem. More generally: compare temporal patterns rather than values. Example: red and yellow occur closely in time.

Data Clustering. Our idea: cluster data values with similar time segment distributions p(y|x), where p(y|x) is the probability of data value x occurring in time segment y. Combine IB (Information Bottleneck) and MDL (Minimum Description Length) to cluster data values with similar p(y|x).
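
As a rough illustration of the time segment distribution, the sketch below estimates p(y|x) for categorical values by counting. It assumes p(y|x) is the fraction of the occurrences of x that fall into segment y, and that segments are the half-open intervals between consecutive boundaries; `segment_distributions` is a hypothetical helper, not part of the DASSA code.

```python
from collections import Counter, defaultdict


def segment_distributions(observations, boundaries):
    """observations: list of (value, time) pairs.
    boundaries: sorted times; segment i is [boundaries[i], boundaries[i + 1]).
    Returns {value: [p(y_0 | value), p(y_1 | value), ...]}."""
    n_segments = len(boundaries) - 1
    counts = defaultdict(Counter)
    for value, t in observations:
        for i in range(n_segments):
            if boundaries[i] <= t < boundaries[i + 1]:
                counts[value][i] += 1
                break  # observations outside all segments are ignored
    return {
        value: [c[i] / sum(c.values()) for i in range(n_segments)]
        for value, c in counts.items()
    }


obs = [("red", 0.5), ("red", 1.5), ("yellow", 1.2), ("blue", 2.5)]
p = segment_distributions(obs, [0, 1, 2, 3])
```

Values whose rows of `p` look alike occur in the same parts of the sequence, which is the similarity the clustering step relies on.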

Information Bottleneck. Find a compact representation of X such that the information about Y is maximally kept. It optimizes an objective function in which I() is the mutual information function. It can be solved by iteratively merging the data values whose merge minimizes the loss of temporal information. A question remains: how many clusters to keep?
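
For orientation, the standard Information Bottleneck functional and the merge cost used by its agglomerative variant are given below. These are the textbook forms (Tishby et al.; Slonim and Tishby) and may differ in detail from the variant used in DASSA; the symbols \tilde{X} (the compressed representation) and \beta (the trade-off parameter) are not named in the slides.

```latex
% Standard IB functional: compress X into \tilde{X} while keeping information about Y
\min_{p(\tilde{x} \mid x)} \; \mathcal{L} = I(\tilde{X}; X) - \beta \, I(\tilde{X}; Y)

% Information about Y lost when x_i and x_j are merged (agglomerative IB)
\delta I(x_i, x_j) = \bigl( p(x_i) + p(x_j) \bigr) \,
  \mathrm{JS}_{\pi}\bigl[ \, p(y \mid x_i), \; p(y \mid x_j) \, \bigr],
\qquad
\pi = \left( \frac{p(x_i)}{p(x_i) + p(x_j)}, \; \frac{p(x_j)}{p(x_i) + p(x_j)} \right)
```

Here JS_\pi is the Jensen-Shannon divergence with weights \pi, so two values with identical p(y|x) can be merged at zero cost.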

Minimum Description Length. The best model is the one that can express the data losslessly with the smallest code length. Data: the input sequences. Model: roughly, the parameters from the IB algorithm. Total cost: the cost to describe the model plus the cost to describe the data using the model.
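
Written out, the generic two-part MDL criterion reads as follows. The symbols M (model) and D (data) are introduced here for illustration; the slides do not give the concrete encoding of either term.

```latex
\mathrm{Cost}_{\mathrm{total}}(M, D) = \mathrm{Cost}(M) + \mathrm{Cost}(D \mid M),
\qquad
M^{*} = \arg\min_{M} \; \mathrm{Cost}_{\mathrm{total}}(M, D)
```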

Combine IB & MDL. Continue merging data values until the total MDL cost increases. Use a priority queue to reduce the time complexity to O((|X|-l)|X|log|X|) (can be further parallelized).
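
The merge-until-the-cost-increases loop can be sketched with a binary heap. This is only an illustration under stated assumptions: the merge cost is the agglomerative IB cost (weighted Jensen-Shannon divergence), and `toy_cost` is a hypothetical stand-in for the MDL cost, which the slides do not spell out. It does not reproduce the paper's cost function or its complexity bound.

```python
import heapq
import math
from itertools import combinations, count


def merge_cost(w1, p1, w2, p2):
    """Information lost by merging two clusters: (w1 + w2) * weighted JS divergence."""
    total = w1 + w2
    mix = [(w1 * a + w2 * b) / total for a, b in zip(p1, p2)]

    def kl(p, q):
        return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

    return w1 * kl(p1, mix) + w2 * kl(p2, mix)


def toy_cost(weights, dists, per_cluster=0.05):
    """Hypothetical stand-in for MDL: a per-cluster charge minus I(clusters; Y)."""
    n_seg = len(next(iter(dists.values())))
    p_y = [sum(weights[k] * dists[k][i] for k in weights) for i in range(n_seg)]
    info = sum(
        weights[k] * p * math.log(p / p_y[i])
        for k in weights
        for i, p in enumerate(dists[k])
        if p > 0
    )
    return per_cluster * len(weights) - info


def greedy_merge(weights, dists, total_cost=toy_cost):
    """weights: {cluster: p(x)}, dists: {cluster: p(y|x)}; keys are tuples of values."""
    weights, dists = dict(weights), dict(dists)
    tie = count()  # tie-breaker so the heap never has to compare cluster keys
    heap = [
        (merge_cost(weights[a], dists[a], weights[b], dists[b]), next(tie), a, b)
        for a, b in combinations(list(weights), 2)
    ]
    heapq.heapify(heap)
    best = total_cost(weights, dists)
    while heap:
        _, _, a, b = heapq.heappop(heap)
        if a not in weights or b not in weights:
            continue  # stale entry: one side was merged earlier
        w = weights[a] + weights[b]
        merged = [(weights[a] * p + weights[b] * q) / w for p, q in zip(dists[a], dists[b])]
        trial_w = {k: v for k, v in weights.items() if k not in (a, b)}
        trial_d = {k: v for k, v in dists.items() if k not in (a, b)}
        new = a + b  # the merged cluster keeps the member values of both sides
        trial_w[new], trial_d[new] = w, merged
        cost = total_cost(trial_w, trial_d)
        if cost > best:
            break  # total cost went up: keep the current clustering
        weights, dists, best = trial_w, trial_d, cost
        for other in weights:
            if other != new:
                entry = (merge_cost(w, merged, weights[other], dists[other]), next(tie), new, other)
                heapq.heappush(heap, entry)
    return weights, dists


start_w = {("red",): 0.25, ("yellow",): 0.25, ("blue",): 0.5}
start_d = {("red",): [1.0, 0.0], ("yellow",): [1.0, 0.0], ("blue",): [0.0, 1.0]}
final_w, final_d = greedy_merge(start_w, start_d)
```

In the toy input, red and yellow have the same distribution over segments and are merged for free; merging the result with blue would destroy all temporal information, so the loop stops with two clusters.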

Goal 2: Construct the Segment Graph. How should the edge weight w be set? Calculate the distance between the cluster distributions in the segments.

Goal 2: Edge Weights. Euclidean distance between the cluster distributions in the segments. Penalize segments with a small number of data values. Satisfies three important axioms (see details in the paper).
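
A minimal sketch of the distance part of the edge weight follows. It assumes each segment is given as the list of cluster labels of its data values; the penalty for small segments is left out because the slides do not specify its form. `cluster_distribution` and `edge_weight` are hypothetical names.

```python
import math


def cluster_distribution(cluster_labels, n_clusters):
    """Fraction of a segment's data values that fall into each cluster."""
    if not cluster_labels:
        raise ValueError("segment contains no data values")
    counts = [0] * n_clusters
    for label in cluster_labels:
        counts[label] += 1
    return [c / len(cluster_labels) for c in counts]


def edge_weight(labels_a, labels_b, n_clusters):
    """Euclidean distance between the cluster distributions of two adjacent segments.
    The small-segment penalty mentioned in the slides is intentionally omitted."""
    dist_a = cluster_distribution(labels_a, n_clusters)
    dist_b = cluster_distribution(labels_b, n_clusters)
    return math.dist(dist_a, dist_b)


same = edge_weight([0, 1], [1, 0], n_clusters=2)
different = edge_weight([0, 0], [1, 1], n_clusters=2)
```

Two segments with the same mix of clusters get weight 0, and two segments with disjoint clusters get the largest possible weight, the square root of 2.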

Goal 3: Finding the Best Segmentation Recall: Best segmentation problem = Path optimization problem

Path Optimization. Our idea: average longest path (ALP). Find the average longest path from ‘s’ to ‘t’. Advantages: parameter free; naturally balances the weight of the path with the number of segments. The current state-of-the-art algorithm for ALP is O(Vs^2 Es) [Waggoner’13], which is not scalable.

DAG-ALP. We propose an efficient DAG-ALP algorithm. Time complexity: O(Es). See details in the paper.
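
To make the objective concrete, here is a simple dynamic program for the average longest path on a DAG. It is not the O(Es) DAG-ALP algorithm from the paper, whose details are not in the slides; it tracks the best total weight for every (node, number of edges) pair, so it can cost up to O(Vs * Es). It also assumes the average is the total path weight divided by the number of edges on the path. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def average_longest_path(edges, source, target, topo_order):
    """edges: {u: [(v, weight), ...]}; topo_order: all nodes in topological order.
    Returns (best average weight per edge, path as a list of nodes)."""
    # best[v][k] = (max total weight of a source->v path with k edges, predecessor)
    best = defaultdict(dict)
    best[source][0] = (0.0, None)
    for u in topo_order:
        for k, (total, _) in list(best[u].items()):
            for v, w in edges.get(u, []):
                cand = total + w
                if k + 1 not in best[v] or cand > best[v][k + 1][0]:
                    best[v][k + 1] = (cand, u)
    averages = {k: total / k for k, (total, _) in best[target].items() if k > 0}
    if not averages:
        raise ValueError("target is not reachable from source")
    k = max(averages, key=averages.get)
    average = averages[k]
    # Walk the predecessors back from the target to recover the path.
    path, node = [target], target
    while k > 0:
        node = best[node][k][1]
        path.append(node)
        k -= 1
    return average, path[::-1]


toy_edges = {"s": [("a", 2.5), ("t", 4.0)], "a": [("t", 2.5)]}
avg, path = average_longest_path(toy_edges, "s", "t", ["s", "a", "t"])
```

In the toy graph the plain longest path is s, a, t with total weight 5.0, but its average is only 2.5, so the average longest path prefers the single edge s, t with average 4.0. This is the sense in which the objective balances path weight against the number of segments.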

The Complete Algorithm: DASSA. Step 1: Cluster data values based on temporal occurrence. Step 2: Construct the segment graph. Step 3: Find the average longest path. Time complexity: O((|X|-l)|X|log|X| + |Es|) (IB and DAG-ALP can be further parallelized).

Datasets. We use data from different domains: Portland: disease propagation in a contact network; ChickenDance: a chicken dance motion sequence; Twitter: flu-related keyword trends; Ebola: real Ebola disease reports; PUC-Rio: sequences of human motions.

Baselines. Adopted time series algorithm: Dynammo: average the data points in a sliding window to construct a multi-dimensional time series, then use Dynammo to find the segmentation (requires the number of cut points as an input). Variations of DASSA: EMP: use the empirical cluster distribution to calculate the segment distance; TopicM: cluster values using topic modeling; LP: find the longest path instead of the ALP.

Performance Comparison. For datasets with ground truth segmentations, we calculate the F1 score of the detected cut points. DASSA outperforms all baselines.
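
One way to score detected cut points against a ground truth is sketched below. The matching rule is an assumption: each true cut point can be claimed by at most one detected point within `tolerance`, matched greedily in time order. The slides do not say how matches were counted, and `cut_point_f1` is a hypothetical helper.

```python
def cut_point_f1(detected, truth, tolerance=0):
    """F1 score of detected cut points against ground truth cut points."""
    unmatched = sorted(truth)
    true_positives = 0
    for d in sorted(detected):
        # Greedy matching: take the earliest unmatched true point within tolerance.
        match = next((t for t in unmatched if abs(t - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


exact = cut_point_f1([10, 20, 35], [10, 20, 30])
loose = cut_point_f1([10, 20, 35], [10, 20, 30], tolerance=5)
```

With exact matching, two of the three detected points are correct, so precision and recall are both 2/3; allowing a tolerance of 5 time units also accepts the third point.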

Detecting Meaningful Patterns. Patterns detected for the Portland and ChickenDance datasets. ChickenDance: DASSA precisely detects the time points at which a different motion in the chicken dance takes place. Portland: DASSA finds the disease propagation pattern: the disease first infects older people with higher income and a higher number of workers in the family; then, in the second segment, the disease spreads to younger people with lower income.

Case Studies. Detect interesting patterns for datasets without ground truth segmentations. Find different stages of a flu infection cycle: the words used in different segments show a transition from flu symptoms, to flu infection, and finally to recovery. Capture the time when caution about the disease increases (leading to a decrease in the number of newly confirmed cases).

Conclusions DASSA automatically detects the appropriate number of segments and the best segmentation for the data sequences. DASSA reveals meaningful patterns across different types of data sequences, such as epidemiology sequences, motion sequences, etc.

Future Work. Develop a parallelized or online version of DASSA. Apply DASSA to more complex data sequences, such as image sequences.

Thank you! Any questions? Code and papers are available at: http://people.cs.vt.edu/~liangzhe/