# Mining Event Periodicity from Incomplete Observations

Mining Event Periodicity from Incomplete Observations
Zhenhui (Jessie) Li*, Jingjing Wang, Jiawei Han. University of Illinois at Urbana-Champaign (*now at Penn State University). KDD 2012, Beijing, China.

Key points: (1) data is widely available; (2) lots of applications; (3) data complexity; (4) data scalability.

Methodologies: (1) real data; (2) real problem; (3) collaboration with biologists; (4) fundamental patterns that can be applied to many areas.

Before the presentation: motivate myself; important applications; biologists.

Some pronunciations: periodicity (perio’dicity); consecutive [kuhn-sek-yuh-tiv]; longitude [lon-ji-tood, -tyood]; parameter [puh-ram-i-ter]; hypothesis [hahy-poth-uh-sis, hi-]; hurricane [hur-i-keyn]. Phrasing: "these kinds of methods" versus "this kind of methods".

Comment: Application, challenge, solutions. What do you expect your audience to take away? The major difference from statistical methods, signal processing, stream processing, and dynamic systems. Not contradictory.

Zhenhui Jessie Li

Prologue: Detect Periodicity in Movements [Li et al., KDD’10]
Problem: What is the periodicity of the movement? Bee example: 8 hours in the hive, 16 hours flying nearby. (Speaker note: mention the “reference spot”.)

Prologue: Detect Periodicity in Movements [Li et al., KDD’10]
Observe the in-and-out movements relative to the reference spot (i.e., the hive); the periodicity is then easy to see. (Speaker note: mention the “reference spot”.)

[Figure: a two-dimensional movement reduced to a one-dimensional binary sequence over time, with states "in hive" and "outside hive".]
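The reduction from a two-dimensional movement to a binary in/out sequence can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the reference spot is a circle, and the function name `to_in_out_sequence` and the sample trajectory are hypothetical.

```python
import math

def to_in_out_sequence(points, spot_center, spot_radius):
    """Map (x, y) positions to 1 (inside the reference spot) or 0 (outside).

    Assumption: the reference spot is modeled as a circle.
    """
    cx, cy = spot_center
    return [
        1 if math.hypot(x - cx, y - cy) <= spot_radius else 0
        for x, y in points
    ]

# Hypothetical trajectory: the hive sits at the origin with radius 1.
trajectory = [(0.2, 0.1), (0.5, -0.3), (3.0, 4.0), (2.5, 1.0), (0.0, 0.4)]
sequence = to_in_out_sequence(trajectory, spot_center=(0.0, 0.0), spot_radius=1.0)
print(sequence)
```

Once the movement is in this form, period detection only needs the timestamps and the in/out labels.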

Challenge: Periodicity Detection for Incomplete Observations
[Figure: complete observations (in hive / outside hive) versus incomplete observations, e.g. ":03 in", ":30 out", ":12 in", ":03 in", ":14 out", ":15 in".]

Two factors result in incomplete observations: inconsistent sampling and a low sampling rate.

Movement data collection in real scenarios: human movement data collected from cellphones reports locations only when calls are made; animal movement data may contain only 2~3 locations in 3~5 days. See figure.

Please note that in this work we assume the observation spots are already detected, and we are only interested in detecting periods from the in-and-out sequence.

A Challenging Case of Detecting Periodicity for Incomplete Observations
[Figure: sparse raw data, e.g. ":03 in", ":30 out", ":12 in", ":03 in", ":14 out", ":15 in".]

Is there any periodicity in the above sequence? It is hard to tell, but we do believe we can find the period if the observations are generated from some periodic pattern. So now let’s introduce the idea of our approach.

Mining Periodicity in Incomplete Data
Example: the event has a period of 20, and occurrences of the event happen between 20k+5 and 20k+10.

Even though the observations are sparse, sparsity has little effect on the overall distribution when we segment and overlay the data using the correct period. So our high-level idea is a generate-and-test framework: we try all potential periods and see which one results in a skewed distribution of observations. [pause] There are many ways to measure the skewness. In our work, we propose to use the discrepancy between the ratios of in and out observations.

Comments: The high-level idea is enough; probability is not the major contribution. Put the observation graph first, pause a few seconds, and let people guess the period. (Zhai) Mixture probabilistic model; entropy? (Zhai) The baseline is too weak. Why not other probabilistic methods? (Zhai)
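The segment-and-overlay step can be illustrated with a small sketch. It is a simplification that looks only at positive observations; the helper `overlay_counts` and the sample timestamps are hypothetical, chosen so that every occurrence falls between 20k+5 and 20k+10.

```python
from collections import Counter

def overlay_counts(positive_times, period):
    """Count positive observations at each offset (t mod period)."""
    counts = Counter(t % period for t in positive_times)
    return [counts[offset] for offset in range(period)]

# Hypothetical sparse positive observations of an event with true period 20.
positive_times = [5, 27, 49, 66, 88, 110, 125, 147, 169, 186]

for period in (20, 19):
    counts = overlay_counts(positive_times, period)
    occupied = [offset for offset, c in enumerate(counts) if c > 0]
    print(period, occupied)
```

With period 20 the ten observations pile up on offsets 5 to 10, while with period 19 they spread over eight scattered offsets. This is the skew that the generate-and-test framework looks for.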

A Probabilistic Model for Periodic Event
Example: human daily periodicity of visiting the office. The period is 24 (hours), with office visits at 10:00-11:00 and 14:00-16:00. The periodic distribution vector assigns a high probability of visiting the office at these timestamps and a low probability at other times; each timestamp is modeled as a Bernoulli distribution.
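A periodic distribution vector for the office example might look like the sketch below. The probabilities 0.9 and 0.05 are made-up values, and reading "10-11" and "14-16" as the hours starting at 10, 14 and 15 is an assumption.

```python
import random

PERIOD = 24
HIGH, LOW = 0.9, 0.05          # hypothetical probabilities, not from the slides
OFFICE_HOURS = {10, 14, 15}    # assumed reading of 10:00-11:00 and 14:00-16:00

# One Bernoulli parameter per timestamp within the period.
p = [HIGH if hour in OFFICE_HOURS else LOW for hour in range(PERIOD)]

def sample_event(t, rng):
    """Draw the event at timestamp t from Bernoulli(p[t mod PERIOD])."""
    return 1 if rng.random() < p[t % PERIOD] else 0

rng = random.Random(0)
events = [sample_event(t, rng) for t in range(PERIOD * 100)]

# Compare how often the event occurs at 10:00 and at 03:00 over 100 days.
visits_at_10 = sum(events[t] for t in range(10, len(events), PERIOD))
visits_at_3 = sum(events[t] for t in range(3, len(events), PERIOD))
print(visits_at_10, visits_at_3)
```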

A Probabilistic Model for Periodic Event with Random Observation
Each observation x(t) takes a value in {1, 0, -1}. We formally use this generative model to understand and justify our overlay idea. Example of generated observations: x(5) = 1, x(62) = 0.
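The generative process can be sketched as below. The transcript does not spell out the three values, so reading 1 as "observed, event happened", 0 as "observed, event did not happen" and -1 as "not observed" is an assumption. The function name and the single `observe_prob` parameter are also hypothetical simplifications.

```python
import random

def generate_observations(p, length, observe_prob, rng):
    """Generate x(0), ..., x(length - 1) from a periodic distribution vector p.

    Assumed encoding: 1 = observed event, 0 = observed non-event,
    -1 = timestamp not observed.
    """
    period = len(p)
    xs = []
    for t in range(length):
        if rng.random() >= observe_prob:
            xs.append(-1)                       # no observation at t
        else:
            xs.append(1 if rng.random() < p[t % period] else 0)
    return xs

# Hypothetical example: period 24, event likely only at offset 10.
p = [0.9 if hour == 10 else 0.05 for hour in range(24)]
xs = generate_observations(p, length=24 * 30, observe_prob=0.1, rng=random.Random(1))
print(sum(1 for x in xs if x != -1), "observed timestamps out of", len(xs))
```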

Periodicity Detection by Overlaying Observations
[Figure: overlaying with the true period gives a skewed distribution; overlaying with a wrong period gives an even distribution.]

Speaker note: state the definition clearly. Suppose we have already segmented and overlaid the original sequence using length T. Then, for any set of timestamps from 1 to T, we can count the positive samples that fall into these timestamps. We define the ratio of positive observations as this count divided by the total number of positive samples.
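The ratio of positive observations described above can be computed as in this sketch. It uses 0-based offsets (t mod T) rather than timestamps 1 to T, and the function name and the -1 encoding for unobserved timestamps are assumptions.

```python
def positive_ratio(xs, period, timestamps):
    """Share of positive observations whose offset (t mod period) is in `timestamps`."""
    positives = [t for t, x in enumerate(xs) if x == 1]
    if not positives:
        return 0.0
    hits = sum(1 for t in positives if t % period in timestamps)
    return hits / len(positives)

# Hypothetical sequence: positives at t = 1, 5, 8; -1 marks unobserved timestamps.
example = [-1, 1, 0, -1, -1, 1, -1, 0, 1, -1, -1, -1]
print(positive_ratio(example, period=4, timestamps={1}))
```

The ratio of negative observations is defined the same way, with `x == 0` in place of `x == 1`.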

Relationship between Observation Ratio and Probabilistic Model
[Figure: the generative model relates the periodic distribution vector to the positive/negative observation ratios.]

Discrepancy Score to Measure Periodicity
If T (= 24) is the correct period, the discrepancy score should be large for a certain set of timestamps. If T (= 23) is a wrong period, the discrepancy scores are likely to be close to zero for any set of timestamps.
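One way to write the discrepancy score down is shown here. The notation is assumed for illustration and does not appear in the transcript: $\mu^{+}$ and $\mu^{-}$ denote the ratios of positive and negative observations, $\Delta$ the discrepancy score, and $I$ a set of offsets $t \bmod T$.

```latex
\[
\mu^{+}(I, T) = \frac{\lvert \{\, t : x(t) = 1,\ t \bmod T \in I \,\} \rvert}
                     {\lvert \{\, t : x(t) = 1 \,\} \rvert},
\qquad
\mu^{-}(I, T) = \frac{\lvert \{\, t : x(t) = 0,\ t \bmod T \in I \,\} \rvert}
                     {\lvert \{\, t : x(t) = 0 \,\} \rvert}
\]
\[
\Delta(I, T) = \mu^{+}(I, T) - \mu^{-}(I, T)
\]
```

Under a wrong period, positive and negative observations are spread over the offsets in roughly the same way, so the two ratios nearly cancel for every $I$.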

Periodicity Measure
The discrepancy score of a set of timestamps I and a potential period T is defined as the difference between the ratios of positive and negative observations. As we discussed, if T is the true period, it is very likely that the discrepancy score will be very high for some set of timestamps. But if T is not the true period, then for ANY set of timestamps we expect the score to be approximately zero. Therefore, we define the periodicity measure as the maximal discrepancy among all sets of timestamps.

Though such measures are simple, our main contribution is that we can formally prove that, if the observations have a true period T0, the periodicity score of T0 is no less than any other periodicity score. So we can treat the period with the highest periodicity score as the discovered period.

Speaker notes: motivate discrepancy scores, and explain the differences between the periodicity measures for T0 and T. Comment: (1) one of the main results is that we theoretically prove that.... (QQ)
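A minimal sketch of the periodicity measure and the resulting generate-and-test loop follows. It is an illustration under assumptions, not the authors' implementation: the function names, the -1 encoding for unobserved timestamps, and the noise-free test sequence are hypothetical. Because the discrepancy is additive over offsets, the maximum over all sets is reached by keeping exactly the offsets where the positive ratio exceeds the negative ratio.

```python
from collections import Counter

def offset_ratios(xs, period, label):
    """Per-offset share of the observations equal to `label` (1 or 0)."""
    times = [t for t, x in enumerate(xs) if x == label]
    counts = Counter(t % period for t in times)
    total = len(times)
    return [counts[i] / total if total else 0.0 for i in range(period)]

def periodicity_score(xs, period):
    """Maximal discrepancy over all offset sets for a candidate period."""
    pos = offset_ratios(xs, period, 1)
    neg = offset_ratios(xs, period, 0)
    # Summing only the offsets where pos > neg gives the maximum over all sets.
    return sum(max(a - b, 0.0) for a, b in zip(pos, neg))

def detect_period(xs, candidate_periods):
    """Return the candidate period with the highest periodicity score."""
    return max(candidate_periods, key=lambda T: periodicity_score(xs, T))

# Hypothetical noise-free data: true period 24, event at offsets 10, 14, 15,
# observed only at every 7th timestamp (-1 = not observed).
EVENT_HOURS = {10, 14, 15}
xs = [
    (1 if t % 24 in EVENT_HOURS else 0) if t % 7 == 0 else -1
    for t in range(24 * 70)
]
print(detect_period(xs, range(2, 31)))
```

The candidate range stops below 48 on purpose: in this simplified version a multiple of the true period can score as well as the true period itself, which is consistent with the "no less than" wording above.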

Performance Comparisons
Speaker notes: say what the data is and what the methods are. There is no need to explain each existing period detection method for time series.

Settings: T = 24, SEG = [9 : 10, 14 : 16], TN = 1000, Gamma = 0.1, alpha = 0.5, and beta = 0.2. Sampling rate: the ratio of observed points in the complete sequence.

Experiment on Real Human Data
[Figure: one person’s visits to a specific location; panels: sampling rate 20 min, sampling rate 1 hour.]

To evaluate our method on real human movements, we use the data from the Nokia Mobile Data Challenge. The data contains the movements of 80 persons across 200 to 500 days. The raw location data (based on GPS and WLAN) is first transformed into a set of symbolic places; each place corresponds to a circle with a radius of 100 meters. In this section, we select one person who has tracking records for 492 days as a case study.

Problems with Using Fourier Transform to Detect Periodicity

Summary: Mining Event Periodicity from Incomplete Observations
Motivation: real scenarios. The challenge of real data is incomplete observations (inconsistent sampling and a low sampling rate).

Method: overlay the segments and measure the “skewness” of the distribution; theoretically prove the correctness of the method.

Application: location prediction. 2nd place in the Nokia Mobile Data Challenge 2012, using periodicity-based features + SVM.

Thanks! Questions?