Automatic Segmentation of Data Sequences

Liangzhe Chen, Sorour E. Amiri, B. Aditya Prakash
Department of Computer Science, Virginia Tech

Outline
Motivation and Introduction
Our Framework and Solution
  Goal 1: Summarizing Time Segments
  Goal 2: Constructing the Segment-Graph
  Goal 3: Finding the Best Segmentation
Experiment Results
Conclusions

Motivation
Find pattern changes in multi-dimensional value sequences.
Epidemiology: how can we find the pattern changes in disease propagation?
Motion detection: how can we detect the different motions in a motion sequence?

Multi-Dimensional Data Sequences
We study sequences with real or categorical multi-dimensional values and arbitrary time stamps.

Informal Problem Definition
Given: a data sequence {(x1, t1), (x2, t2), …, (xN, tN)}, where (xi, ti) is an observation of a d-dimensional vector xi at time ti.
Find: a time segmentation such that consecutive time segments are not similarly informative.
Notation:
  x: a data value
  X: the set of all x’s
  yi,j: the time segment [ti, tj)
  Y: the set of all y’s

Limitations of Existing Work
Time series algorithms
  Assume data values are uniformly distributed over time (the number of data values is proportional to the length of the time period).
Event sequence analysis
  Each event is a 1-dimensional categorical value.
Both are too restrictive for real-world data sequences.

Our idea: Holistic segmenting of multi-dimensional value sequences using all possible time segments.

Main Idea: Segment Graph
Nodes: all time segments + source (‘s’) + target (‘t’)
  Source (‘s’) = start of time, target (‘t’) = end of time
Edges: there is a directed edge between each pair of adjacent time segments
  The edge weight w measures the difference between the two time segments
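
As a rough illustration of the node and edge definitions above, the sketch below enumerates every half-open segment over a small list of candidate boundaries and links adjacent segments. The function name `build_segment_graph` and the use of integer boundaries are assumptions made for this example; the slides do not say how candidate segments are enumerated in practice, and edge weights are left out here.

```python
from itertools import combinations

def build_segment_graph(boundaries):
    """Return (nodes, edges) of an unweighted segment graph.

    Hypothetical helper: each node is a half-open segment (start, end)
    taken from the sorted candidate boundaries, plus 's' and 't'.
    An edge links two adjacent segments, i.e. the first one ends exactly
    where the second one starts.
    """
    b = sorted(boundaries)
    segments = list(combinations(b, 2))  # every (start, end) with start < end
    edges = []
    for seg in segments:
        if seg[0] == b[0]:       # segment starts at the start of time
            edges.append(("s", seg))
        if seg[1] == b[-1]:      # segment ends at the end of time
            edges.append((seg, "t"))
    for left in segments:
        for right in segments:
            if left[1] == right[0]:
                edges.append((left, right))
    return ["s"] + segments + ["t"], edges

nodes, edges = build_segment_graph([0, 1, 2, 3])
print(len(nodes), len(edges))
```

The number of segments grows quadratically with the number of boundaries, which is why the cost of the later path search is stated in terms of the edge count.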

Convert to Path Optimization
Observation:
  For each segmentation, there is a path from ‘s’ to ‘t’.
  For each path from ‘s’ to ‘t’, there is a segmentation.
Therefore: best segmentation problem = path optimization problem
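
The correspondence between segmentations and paths can be made concrete with a small sketch. The helper names `cuts_to_path` and `path_to_cuts` are hypothetical, and a segmentation is assumed to be given as a list of cut points strictly between the start and end of time.

```python
def cuts_to_path(start, end, cuts):
    """Map a segmentation (list of cut points) to a path from 's' to 't'."""
    points = [start] + sorted(cuts) + [end]
    segments = list(zip(points[:-1], points[1:]))  # consecutive half-open segments
    return ["s"] + segments + ["t"]

def path_to_cuts(path):
    """Map a path from 's' to 't' back to its cut points."""
    segments = path[1:-1]                          # drop 's' and 't'
    return [seg[0] for seg in segments[1:]]        # each later segment starts at a cut

path = cuts_to_path(0, 10, [3, 7])
print(path)                # ['s', (0, 3), (3, 7), (7, 10), 't']
print(path_to_cuts(path))  # [3, 7]
```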

Proposed Method: DASSA [submitted to PLoS’17]
Goal 1: Cluster co-occurring data values to find a summary for each time segment

Goal 2: Construct a segment graph to efficiently represent all possible time segments.

Goal 3: Find the best segmentation

Goal 1: Summarize Time Segments
Input data sequence
Over-segmenting the sequence

Goal 1: Summarize Time Segments
Our idea: data values that co-occur in the sequence should be regarded as the same for the segmentation problem.
More generally: compare temporal patterns rather than values.
Example: red and yellow occur closely in time.

Data Clustering
Our idea: cluster data values with similar time segment distributions p(y|x), where p(y|x) is the probability that an occurrence of data value x falls in time segment y.
Combine IB (Information Bottleneck) and MDL (Minimum Description Length) to cluster data values with similar p(y|x).
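
A minimal sketch of how p(y|x) might be estimated by counting, assuming categorical values and a fixed list of boundaries for the over-segmented sequence. The function name `segment_distributions` and the toy observations are illustrative only, and every time stamp is assumed to fall before the last boundary.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def segment_distributions(observations, boundaries):
    """Estimate p(y | x) by counting.

    observations: list of (value, time) pairs.
    boundaries:   sorted list; segment y is [boundaries[y], boundaries[y + 1]).
    Returns {value: [p(y=0 | value), p(y=1 | value), ...]}.
    """
    counts = defaultdict(Counter)
    for value, t in observations:
        y = bisect_right(boundaries, t) - 1  # index of the segment containing t
        counts[value][y] += 1
    n_segments = len(boundaries) - 1
    return {
        x: [c[y] / sum(c.values()) for y in range(n_segments)]
        for x, c in counts.items()
    }

obs = [("a", 1), ("a", 2), ("b", 3), ("a", 6), ("c", 7), ("c", 8)]
p = segment_distributions(obs, [0, 5, 10])
print(p)
```

Values whose rows of p look alike (here none do, since "b" and "c" never share a segment) are the candidates for being clustered together.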

Information Bottleneck
Find a compact representation of X such that the information about Y is maximally kept.
It optimizes an objective defined in terms of I(), the mutual information function.
It can be solved by iteratively merging the data values whose merge minimizes the loss of temporal information.
A question remains: how many clusters to keep?
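
The slides do not spell out the merge criterion. In the standard agglomerative IB formulation, the information lost by merging two clusters equals their combined probability times the weighted Jensen-Shannon divergence between their distributions over Y. The sketch below assumes that formulation; `merge_cost` is a hypothetical name and the exact criterion used by DASSA may differ.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(p_i, p_j, py_i, py_j):
    """Loss of information about Y when clusters i and j are merged.

    p_i, p_j:   prior probabilities of the two clusters (both > 0).
    py_i, py_j: their distributions over time segments, p(y | cluster).
    Returns (cost, distribution of the merged cluster).
    """
    total = p_i + p_j
    w_i, w_j = p_i / total, p_j / total
    merged = [w_i * a + w_j * b for a, b in zip(py_i, py_j)]
    js = w_i * kl(py_i, merged) + w_j * kl(py_j, merged)  # weighted JS divergence
    return total * js, merged

same_cost, _ = merge_cost(0.3, 0.2, [0.5, 0.5], [0.5, 0.5])
diff_cost, merged = merge_cost(0.5, 0.5, [1.0, 0.0], [0.0, 1.0])
print(same_cost, diff_cost, merged)
```

Merging values with identical temporal distributions costs nothing, so a greedy procedure would merge those first.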

Minimum Description Length
The best model is the one that can express the data losslessly with the smallest code length.
Data: the input sequences
Model: roughly, the parameters from the IB algorithm
Cost to describe the model
Cost to describe the data using the model

Combine IB & MDL
Continue merging data values until the total MDL cost increases.
Use a priority queue to reduce the time complexity to O((|X|-l)|X|log|X|) (can be further parallelized).

Goal 2: Construct the Segment Graph
How to set the edge weight w? Calculate the distance between the cluster distributions in the two segments.

Goal 2: Edge Weights
Euclidean distance between the cluster distributions in the two segments
Penalize segments with a small number of data values
Satisfies three important axioms (see details in the paper)
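
A minimal sketch of the distance part of the edge weight, assuming each segment is given as the non-empty list of cluster labels of its data values. It uses plain empirical frequencies and omits the penalty for small segments, whose form is only given in the paper; the names `cluster_distribution` and `edge_weight` are illustrative.

```python
import math
from collections import Counter

def cluster_distribution(cluster_labels, clusters):
    """Empirical distribution of cluster labels within one (non-empty) segment."""
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return [counts[c] / n for c in clusters]

def edge_weight(left_labels, right_labels, clusters):
    """Euclidean distance between the cluster distributions of two adjacent segments.

    Assumption: no small-segment penalty is applied here.
    """
    p = cluster_distribution(left_labels, clusters)
    q = cluster_distribution(right_labels, clusters)
    return math.dist(p, q)

clusters = ["A", "B"]
w_same = edge_weight(["A", "B"], ["B", "A", "A", "B"], clusters)
w_diff = edge_weight(["A", "A"], ["B", "B"], clusters)
print(w_same, w_diff)
```

Without a penalty, a segment holding a single data value can look maximally different from its neighbour, which is presumably what the penalty on small segments guards against.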

Goal 3: Finding the Best Segmentation
Recall: Best segmentation problem = Path optimization problem

Path Optimization
Our idea: average longest path (ALP)
Find the average longest path from ‘s’ to ‘t’.
Advantages:
  Parameter free
  Naturally balances the weight of the path with the number of segments
The current state-of-the-art algorithm for ALP is O(|Vs|^2 |Es|) [Waggoner’13]. Not scalable!

DAG-ALP
We propose an efficient DAG-ALP algorithm.
Time complexity: O(|Es|)
See details in the paper.
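
The DAG-ALP algorithm itself is not reproduced in these slides. To illustrate only the objective, the sketch below finds an average longest path in a small DAG with a straightforward dynamic program over the number of edges on the path, which runs in roughly O(|Vs| |Es|) time rather than O(|Es|). It assumes that the average is the total path weight divided by the number of edges and that ‘t’ is reachable from ‘s’; the function name and the toy graph are illustrative.

```python
from collections import defaultdict

def average_longest_path(edges, source="s", target="t"):
    """edges: list of (u, v, weight) triples of a DAG. Returns (best average, path)."""
    nodes = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    out = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for u, v, w in edges:
        out[u].append((v, w))
        indeg[v] += 1
    # Topological order (Kahn's algorithm).
    order, stack = [], [n for n in nodes if indeg[n] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    # best[v][k] = (max total weight of a source->v path with k edges, predecessor)
    best = defaultdict(dict)
    best[source][0] = (0.0, None)
    for u in order:
        for k, (total, _) in list(best[u].items()):
            for v, w in out[u]:
                if k + 1 not in best[v] or total + w > best[v][k + 1][0]:
                    best[v][k + 1] = (total + w, u)
    # Pick the edge count that maximizes total weight / number of edges.
    k = max(best[target], key=lambda n_edges: best[target][n_edges][0] / n_edges)
    avg = best[target][k][0] / k
    path, node = [target], target
    while k > 0:  # walk predecessors back to the source
        node = best[node][k][1]
        k -= 1
        path.append(node)
    return avg, path[::-1]

toy = [("s", "a", 1), ("a", "b", 3), ("b", "t", 3), ("a", "t", 4), ("s", "t", 2)]
avg, path = average_longest_path(toy)
print(avg, path)  # 2.5 ['s', 'a', 't']
```

In this toy graph the plain longest path is s, a, b, t with total weight 7, but its average of 7/3 is lower than the 5/2 of s, a, t, which shows how averaging trades path weight against the number of segments.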

The Complete Algorithm
DASSA
Step 1: Cluster data values based on temporal occurrence
Step 2: Construct the segment graph
Step 3: Find the average longest path
Time complexity: O((|X|-l)|X|log|X| + |Es|) (IB and DAG-ALP can be further parallelized)

Datasets
Use data from different domains, such as:
  Portland: disease propagation in a contact network
  ChickenDance: a chicken dance motion sequence
  Twitter: flu-related keyword trends
  Ebola: real Ebola disease reports
  PUC-Rio: sequences of human motions

Baselines
Adopt a time series algorithm:
  Dynammo: average the data points in a sliding window to construct a multi-dimensional time series, then use Dynammo to find the segmentation (requires the number of cut points as an input)
Variations of DASSA:
  EMP: use the empirical cluster distribution to calculate segment distance
  TopicM: cluster values using topic modeling
  LP: find the longest path instead of the ALP

Performance Comparison
For datasets with ground truth segmentation, we calculate the F1 scores for the detected cut points.
DASSA outperforms all baselines.
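
A minimal sketch of an F1 score over cut points. The slides do not state how a detected cut point is matched to a ground truth one, so the `tolerance` parameter and the greedy one-to-one matching below are assumptions made for illustration.

```python
def cut_point_f1(detected, truth, tolerance=0):
    """F1 score of detected cut points against ground truth cut points.

    A detected cut counts as a true positive if an unmatched ground truth
    cut lies within `tolerance` of it (greedy one-to-one matching).
    """
    unmatched = list(truth)
    tp = 0
    for d in sorted(detected):
        match = next((g for g in unmatched if abs(g - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(detected)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

strict = cut_point_f1([10, 20, 35], [10, 20, 30])
loose = cut_point_f1([10, 20, 35], [10, 20, 30], tolerance=5)
print(strict, loose)
```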

Detecting Meaningful Patterns
Patterns detected for the Portland and Chicken Dance datasets
Chicken Dance: precisely detects the time points at which a different motion in the chicken dance takes place
Portland: finds the disease propagation pattern
  The disease first infects older people with higher income and a higher number of workers in the family; then, in the second segment, the disease spreads to younger people with lower income

Case Studies
Detect interesting patterns for datasets without ground truth segmentations
Find different stages of a flu infection cycle: the words used in different segments show a transition from flu symptoms, to flu infection, and to final recovery.
Capture the time when caution about the disease increases (leading to a decreased number of newly confirmed cases)

Conclusions
DASSA automatically detects the appropriate number of segments and the best segmentation for the data sequences.
DASSA reveals meaningful patterns across different types of data sequences, such as epidemiology sequences and motion sequences.

Future Work
Parallelized or online version of DASSA
Apply DASSA to more complex data sequences, such as image sequences

Thank you! Any Questions?
Code and papers are available at:
