Automatic Segmentation of Data Sequences


1 Automatic Segmentation of Data Sequences
Liangzhe Chen, Sorour E. Amiri, B. Aditya Prakash Department of Computer Science Virginia Tech

2 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

3 Motivation
Find pattern changes in multi-dimensional value sequences.
Epidemiology: how can we find pattern changes in disease propagation?
Motion detection: how can we detect different motions in motion sequences?

4 Multi-Dimensional Data Sequences
We study sequences with real-valued or categorical multi-dimensional values and arbitrary time stamps.

5 Informal Problem Definition
Given: a data sequence {(x1, t1), (x2, t2), …, (xN, tN)}, where (xi, ti) is an observation of a d-dimensional vector xi at time ti.
Find: a time segmentation such that consecutive time segments are not similarly informative.
Notation:
x: a data value
X: the set of all x’s
yi,j: the time segment [ti, tj)
Y: the set of all y’s

6 Limitations of Existing Work
Time series algorithms: assume data values are uniformly distributed over time (the number of data values is proportional to the length of the time period).
Event sequence analysis: each event is a 1-dimensional categorical value.
Both are too restrictive for real-world data sequences.

7 Our idea: Holistic segmenting of multi-dimensional value sequences using all possible time segments.

8 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

9 Main Idea: Segment Graph
Nodes: all time segments + source (‘s’) + target (‘t’), where ‘s’ is the start of time and ‘t’ is the end of time.
Edges: there is a directed edge between adjacent time segments.
The edge weight w measures the difference between the two time segments.

10 Convert to Path Optimization
Observation:
For each segmentation, there is a path from ‘s’ to ‘t’.
For each path from ‘s’ to ‘t’, there is a segmentation.
Therefore: best segmentation problem = path optimization problem.
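The correspondence between segmentations and ‘s’-to-‘t’ paths can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes a small sorted list of candidate boundaries, represents a segment as an index pair (i, j) standing for [boundaries[i], boundaries[j]), and the function names are hypothetical.

```python
from itertools import combinations


def build_segment_graph(boundaries):
    """Return (nodes, edges) for a segment graph over sorted candidate boundaries.

    Nodes are 's', 't' and every index pair (i, j) with i < j, which stands
    for the time segment [boundaries[i], boundaries[j]). A directed edge joins
    two segments when the first ends exactly where the second starts.
    """
    n = len(boundaries)
    segments = list(combinations(range(n), 2))
    nodes = ["s"] + segments + ["t"]
    edges = []
    for (i, j) in segments:
        if i == 0:                      # segment starts at the start of time
            edges.append(("s", (i, j)))
        if j == n - 1:                  # segment ends at the end of time
            edges.append(((i, j), "t"))
        for k in range(j + 1, n):       # adjacent segments share boundary j
            edges.append(((i, j), (j, k)))
    return nodes, edges


def segmentation_to_path(cut_indices, n):
    """Map interior cut indices to the matching s-to-t path."""
    points = [0] + sorted(cut_indices) + [n - 1]
    return ["s"] + list(zip(points, points[1:])) + ["t"]


def path_to_segmentation(path, boundaries):
    """Map an s-to-t path back to the list of time intervals it visits."""
    return [(boundaries[i], boundaries[j]) for (i, j) in path[1:-1]]
```

With boundaries [0, 10, 25, 40], the single cut at index 2 maps to the path s, (0, 2), (2, 3), t, which in turn maps back to the intervals [0, 25) and [25, 40).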

11 Proposed Method: DASSA [submitted to PLoS’17]
Goal 1: Cluster co-occurring data values to find a summary for each time segment

12 Proposed Method: DASSA [submitted to PLoS’17]
Goal 2: Construct a segment graph to efficiently represent all possible time segments.

13 Proposed Method: DASSA [submitted to PLoS’17]
Goal 3: Find the best segmentation

14 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

15 Goal 1: Summarize Time Segments
Input data sequence Over-segmenting the sequence

16 Goal 1: Summarize Time Segments
Our idea: data values that co-occur in the sequence should be regarded as the same for the segmentation problem.
More general: compare temporal patterns rather than values.
Example: red and yellow occur closely in time.

17 Data Clustering
Our idea: cluster data values with similar time segment distributions p(y|x).
Combine IB (Information Bottleneck) and MDL (Minimum Description Length) to cluster data with similar p(y|x).
p(y|x) is the probability that an occurrence of data value x falls in time segment y.
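As a rough illustration of the time segment distribution p(y|x), the sketch below estimates it by counting. It assumes hashable (categorical) data values and a set of non-overlapping base segments given by sorted boundaries; the slides define Y over all time segments, so this is a simplification, and the function name is hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def segment_distributions(observations, boundaries):
    """Estimate p(y|x) for each data value x by counting.

    observations: list of (x, t) pairs with hashable x.
    boundaries:   sorted time stamps; base segment k is
                  [boundaries[k], boundaries[k + 1]).
    Returns a dict mapping x to a list of probabilities, one per segment.
    """
    n_segments = len(boundaries) - 1
    counts = defaultdict(Counter)
    for x, t in observations:
        k = bisect_right(boundaries, t) - 1   # index of the segment holding t
        if 0 <= k < n_segments:               # ignore out-of-range time stamps
            counts[x][k] += 1
    dist = {}
    for x, c in counts.items():
        total = sum(c.values())
        dist[x] = [c[k] / total for k in range(n_segments)]
    return dist
```

Values whose lists look alike (for example two colors that always appear in the same segments) are the ones the clustering step would group together.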

18 Information Bottleneck
Find a compact representation of X such that the information about Y is maximally kept.
It optimizes an objective function in which I() is the mutual information function.
It can be solved by iteratively merging the data values whose merge minimizes the loss of temporal information.
A question remains: how many clusters to keep?
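The "loss of temporal information" for one merge can be illustrated with the standard agglomerative IB merge cost (Slonim and Tishby), which is the combined cluster mass times a weighted Jensen-Shannon divergence between the two clusters' distributions over Y. Whether DASSA uses exactly this form is an assumption; the slides do not show the formula, and the function names are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_k = 0 are 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def merge_cost(p_i, p_j, py_i, py_j):
    """Information about Y lost when clusters i and j are merged.

    p_i, p_j:   prior masses p(c_i), p(c_j), both assumed positive.
    py_i, py_j: distributions p(y|c_i), p(y|c_j) as equal-length lists.
    """
    total = p_i + p_j
    w_i, w_j = p_i / total, p_j / total
    # Distribution over Y of the merged cluster.
    mix = [w_i * a + w_j * b for a, b in zip(py_i, py_j)]
    js = w_i * kl(py_i, mix) + w_j * kl(py_j, mix)
    return total * js
```

Merging two clusters with identical distributions costs nothing, so a greedy procedure would merge those first and postpone merges between clusters that occupy different time segments.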

19 Minimum Description Length
The best model is the one that can express the data losslessly with the smallest code length.
Data: the input sequences.
Model: roughly, the parameters from the IB algorithm.
Total cost = cost to describe the model + cost to describe the data using the model.
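In the usual two-part MDL notation, the total cost described on this slide can be written as below. The symbols L, M and D are notation assumed here for illustration (M for the model, D for the data); the slides do not give the exact encoding.

```latex
\mathrm{Cost}(M, D) = L(M) + L(D \mid M),
\qquad
M^{*} = \arg\min_{M} \bigl[\, L(M) + L(D \mid M) \,\bigr]
```

Here L(M) is the number of bits needed to describe the model and L(D | M) is the number of bits needed to describe the data given the model.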

20 Combine IB & MDL
Continue merging data values until the total MDL cost increases.
Use a priority queue to reduce the time complexity to O((|X|-l)|X|log|X|) (can be further parallelized).

21 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

22 Goal 2: Construct the Segment Graph
How to set the edge weight w? Calculate the distance between the cluster distributions in the segments.

23 Goal 2: Edge Weights
Euclidean distance between the cluster distributions in the segments.
Penalize segments with a small number of data values.
Satisfies three important axioms (see details in paper).
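The distance part of the edge weight can be sketched as follows. This is a simplified illustration under stated assumptions: each segment is given as the list of cluster labels of its data values, the cluster distribution is the plain empirical frequency (closer to the EMP variant listed among the baselines than to the full DASSA weight), and the penalty for small segments is left out because the slides do not give its form. The function names are hypothetical.

```python
import math


def cluster_distribution(cluster_labels, n_clusters):
    """Empirical distribution of cluster labels (integers 0..n_clusters-1)."""
    if not cluster_labels:
        raise ValueError("segment contains no data values")
    counts = [0] * n_clusters
    for c in cluster_labels:
        counts[c] += 1
    total = len(cluster_labels)
    return [c / total for c in counts]


def euclidean_weight(labels_a, labels_b, n_clusters):
    """Euclidean distance between the cluster distributions of two segments."""
    pa = cluster_distribution(labels_a, n_clusters)
    pb = cluster_distribution(labels_b, n_clusters)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
```

Two segments with the same mix of clusters get weight 0, while two segments that each contain a single, different cluster get the maximum weight sqrt(2).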

24 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

25 Goal 3: Finding the Best Segmentation
Recall: Best segmentation problem = Path optimization problem

26 Path Optimization
Our idea: average longest path (ALP). Find the average longest path from ‘s’ to ‘t’.
Advantages: parameter free; naturally balances the weight of the path with the number of segments.
The current state-of-the-art algorithm for ALP is O(Vs^2 Es) [Waggoner’13], which is not scalable.

27 DAG-ALP
We propose an efficient DAG-ALP algorithm.
Time complexity: O(Es).
See details in paper.
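To make the average-longest-path objective concrete, here is a simple dynamic program over a weighted DAG. It is not the authors' DAG-ALP algorithm and does not reach O(Es): it tracks the best total weight for every (node, edge count) pair, which costs roughly O(|V||E|). It also assumes that "average" means total edge weight divided by the number of edges on the path; the function name and input format are hypothetical.

```python
def average_longest_path(order, edges, source="s", target="t"):
    """Find the source-to-target path with the largest average edge weight.

    order: all nodes in topological order.
    edges: dict mapping (u, v) to the weight of the directed edge u -> v.
    Returns (best_average, path_as_list_of_nodes).
    """
    out = {}
    for (u, v), w in edges.items():
        out.setdefault(u, []).append((v, w))

    # best[v][k] = (max total weight of a source->v path with k edges, predecessor)
    best = {v: {} for v in order}
    best[source][0] = (0.0, None)
    for u in order:                        # entries of best[u] are final here
        for k, (total, _) in list(best[u].items()):
            for v, w in out.get(u, []):
                cand = total + w
                if k + 1 not in best[v] or cand > best[v][k + 1][0]:
                    best[v][k + 1] = (cand, u)

    options = {k: tw[0] / k for k, tw in best[target].items() if k > 0}
    if not options:
        raise ValueError("target is not reachable from source")
    k = max(options, key=options.get)
    best_avg = options[k]

    # Walk the predecessor pointers back from the target.
    path, node = [target], target
    while node != source:
        node = best[node][k][1]
        path.append(node)
        k -= 1
    return best_avg, path[::-1]
```

On a graph with a direct edge s to t of weight 3.5 and a three-edge path of total weight 9, the plain longest path is the three-edge one, but the average longest path is the direct edge (3.5 versus 3.0 per edge). This is the sense in which the objective balances path weight against the number of segments.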

28 The Complete Algorithm
DASSA
Step 1: Cluster data values based on temporal occurrence.
Step 2: Construct the segment graph.
Step 3: Find the average longest path.
Time complexity: O((|X|-l)|X|log|X| + |Es|) (IB and DAG-ALP can be further parallelized).

29 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

30 Datasets
Use data from different domains, such as:
Portland: disease propagation in a contact network
ChickenDance: a chicken dance motion sequence
Twitter: flu-related keyword trends
Ebola: real Ebola disease reports
PUC-Rio: sequences of human motions

31 Baselines
Adopt a time series algorithm:
Dynammo: average the data points in a sliding window to construct a multi-dimensional time series, then use Dynammo to find the segmentation (requires the number of cut points as an input)
Variations of DASSA:
EMP: use the empirical cluster distribution to calculate segment distance
TopicM: cluster values using topic modeling
LP: find the longest path instead of the ALP

32 Performance Comparison
For datasets with ground truth segmentation, we calculate the F1 scores for the detected cut points.
DASSA outperforms all baselines.
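An F1 score over cut points can be computed as in the sketch below. The matching rule is an assumption, since the slides do not say how detected and true cut points are paired: here each true cut point can be matched at most once, and a detected cut counts as correct when it lies within a hypothetical `tolerance` of an unmatched true cut.

```python
def cut_point_f1(detected, truth, tolerance=0):
    """F1 score of detected cut points against ground-truth cut points.

    A detected cut is a true positive if an unmatched ground-truth cut lies
    within `tolerance` of it; each ground-truth cut is matched at most once.
    """
    unmatched = sorted(truth)
    true_positives = 0
    for d in sorted(detected):
        for g in unmatched:
            if abs(d - g) <= tolerance:
                unmatched.remove(g)    # safe: we leave the loop right away
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For detected cuts [10, 20, 35] and true cuts [10, 30], exact matching gives precision 1/3 and recall 1/2, while a tolerance of 5 time units also accepts the cut at 35 and raises both.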

33 Detecting Meaningful Patterns
Patterns detected for the Portland and Chicken Dance datasets.
Precisely detects the time point when a different motion in the chicken dance takes place.
Finds the disease propagation pattern: the disease first infects older people with higher income and a higher number of workers in the family; then, in the second segment, the disease spreads to younger people with lower income.

34 Case Studies
Detect interesting patterns for datasets without ground truth segmentations.
Finds different stages of a flu infection cycle: the words used in different segments show a transition from flu symptoms, to flu infection, and to final recovery.
Captures the time when caution about the disease increases (leading to a decreased number of newly confirmed cases).

35 Outline Motivation and Introduction Our Framework and Solution
Goal 1: Summarizing Time Segments Goal 2: Constructing the Segment-Graph Goal 3: Finding the Best Segmentation Experiment Results Conclusions

36 Conclusions DASSA automatically detects the appropriate number of segments and the best segmentation for the data sequences. DASSA reveals meaningful patterns across different types of data sequences, such as epidemiology sequences, motion sequences, etc.

37 Future Work
Parallelized or online version of DASSA
Apply DASSA to more complex data sequences, such as image sequences

38 Thank you! Any Questions?
Code and papers are available at:

