
1 Systems Engineering and Engineering Management The Chinese University of Hong Kong Parameter Free Bursty Events Detection in Text Streams Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S Yu VLDB 2005

2 Outline
Introduction
–Bursty events? Text streams? Etc.
A Possible Method
–Document pivot clustering
Proposed Work
–Feature pivot clustering
Results Highlight
Related Works
Summary & Future Work


4 Introduction (1 of 5)
Parameter Free Bursty Events Detection in Text Streams

5 Introduction (2 of 5)
Parameter Free Bursty Events Detection in Text Streams
–A text stream is a sequence of documents organized temporally
»E.g. news stories and e-mails
–Two kinds of stream: online vs. offline
»Online stream: open-ended
»Offline stream: has boundaries

6 Introduction (3 of 5)
Parameter Free Bursty Events Detection in Text Streams
–An event consists of a set of features that are useful for identifying (understanding) the event.
–A Bursty Event is an event that is hot in a specific period of time.
–We call the features that are used to identify a Bursty Event Bursty Features.
–E.g. the event “SARS” consists of the features “Outbreak, Atypic, Respire, …”
[Figure: number of news stories over time for an event, e.g. SARS]

7 Introduction (4 of 5)
Parameter Free Bursty Events Detection in Text Streams
–Given a text stream, find all of the bursty events
»In other words, find all of the bursty features (features that are “hot” in a specific period) and group them logically, such that the bursty features grouped together are useful for identifying an event.

8 Introduction (5 of 5)
Parameter Free Bursty Events Detection in Text Streams
–Parameter free: you do not need to tune the parameters yourself
»The framework is applicable to any corpus
»No fine-tuning is necessary
»No parameter needs to be estimated
–Why is being parameter free useful?
»Without any prior knowledge about the information in a database, it is rather difficult to make any initial estimate
»In our problem, we are trying to identify the bursty events in a text stream, and we do not have any prior knowledge about the information in the database. We do not know what it contains. We do not even know whether there is any burst. We do not know…

9 Problem Setting
Data archived
–Source: local news stories (South China Morning Post)
–Period: 2003-01-01 to 2004-12-31
Some major settings
–Offline detection
–News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch


11 Document Pivot Clustering Approach (1 of 3)
A possible method (not our approach)
–Step 1
»Objective: group similar events together
»Method: use clustering to group similar documents together (e.g. K-Means)
–Step 2
»Objective: extract the keywords of each event
»Method: use feature selection (e.g. information gain)
[Figure: Step 1 clusters all news stories into Group 1, Group 2, ...; Step 2 extracts the key features of each group]

12 Document Pivot Clustering Approach (2 of 3)
Some difficulties
1. The most similar documents may not report the same event
–From our experiments, we found that the two documents that are most similar in terms of their features may not necessarily report the same event
2. Clustering requires feature weightings (e.g. tf-idf)
–Feature weighting originated in IR. Its idea is that features appearing in fewer documents in the domain are more useful (obtain higher weights).
–For clustering, however, a feature that appears in many documents in a certain period should obtain a higher weight.

13 Document Pivot Clustering Approach (3 of 3)
Some difficulties (cont’d)
3. A long-running event may be broken down into several small pieces
–This phenomenon appears in many reported studies (esp. in TDT)
4. It is difficult to figure out the bursty features
–Assume clustering can determine bursty events. There can still be many clusters that are not “hot” (important). Determining which clusters are “hot” is difficult (it may require a ranking function, which is difficult to derive).
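Difficulty 2 (feature weighting) can be illustrated with a minimal sketch. It assumes the common idf form log(N / df), which the slides do not spell out, and the document counts below are hypothetical.

```python
import math

def idf(total_docs, docs_with_feature):
    """Inverse document frequency in its common log(N / df) form (assumed here)."""
    return math.log(total_docs / docs_with_feature)

# Hypothetical counts for one period of 1,000 news stories.
total_docs = 1000
idf_hot = idf(total_docs, 400)   # a feature that appears in many stories in the period
idf_rare = idf(total_docs, 5)    # a feature that appears in only a handful of stories

# The rare feature receives the larger weight, whereas burst detection
# would want the feature that is suddenly everywhere to stand out.
print(f"hot feature idf:  {idf_hot:.3f}")
print(f"rare feature idf: {idf_rare:.3f}")
```

Under this weighting the feature that dominates the period is down-weighted, which is the mismatch the slide points out.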


15 Feature Pivot Clustering Approach
Overview of the framework
–Step 1
»Identify the bursty features
–Step 2
»Group the bursty features into bursty events
–Step 3
»Determine the hot periods of the bursty events
[Figure: Step 1 extracts all features from all news stories and identifies the bursty features; Step 2 clusters them into Event 1, Event 2, ...; Step 3 determines the hot period of each event]


17 Identify the Bursty Features (1 of 7)
General idea
–Given a single feature, f, try to figure out whether it contains any bursty period.
–If so, then it is a bursty feature (in some specific periods)
[Figure: the distribution of a feature, f, among documents: number of documents containing f over time, with the bursty period marked]

18 Identify the Bursty Features (2 of 7)
Some more examples
[Figure: four plots of the number of documents containing the feature, f, over time: no burst; not a burst (stopword); burst without fading away; two bursts]

19 Identify the Bursty Features (3 of 7)
An obvious approach to discovering whether a feature is a bursty feature is to use a “threshold cut”
[Figure: the distribution of a feature, f, among documents: number of documents containing f over time, with a threshold line and the bursty period marked]
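A minimal sketch of the “threshold cut” idea, under the assumption that a bursty period is a maximal run of days whose document count exceeds a fixed threshold. The function name, the counts and the threshold value are hypothetical.

```python
def threshold_cut(daily_counts, threshold):
    """Return (start_day, end_day) pairs for maximal runs with count > threshold."""
    periods = []
    start = None
    for day, count in enumerate(daily_counts):
        if count > threshold and start is None:
            start = day                      # a run above the threshold begins
        elif count <= threshold and start is not None:
            periods.append((start, day - 1)) # the run ended on the previous day
            start = None
    if start is not None:                    # the run lasts until the end of the stream
        periods.append((start, len(daily_counts) - 1))
    return periods

# Hypothetical number of documents containing the feature f on each day.
counts = [2, 3, 2, 40, 55, 48, 3, 2]
bursty_periods = threshold_cut(counts, threshold=20)
print(bursty_periods)
```

The threshold is exactly the kind of hand-set parameter that the talk aims to avoid.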

20 Identify the Bursty Features (4 of 7)
Challenges
–Setting one single threshold for all features is impossible
Another attempt: set a “percentage cut”
–Figure out the relative difference between the max and min of the “No. of docs containing the feature”
[Figure: number of documents containing the feature, f, over time, for a stop-word and for a normal non-bursty feature, with the threshold marked]

21 Identify the Bursty Features (5 of 7)
Challenges
–Setting a percentage cut is also impossible
»Different features have different distributions
[Figure: two plots of the number of documents containing the feature, f, over time, with the values 500 and 300 marked]
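One way to see why a single percentage cut is hard to set, as a sketch only: here the “percentage cut” is read as (max - min) / max of the daily counts, which is an assumption, and both series are hypothetical.

```python
def relative_range(daily_counts):
    """(max - min) / max of the daily document counts (assumed reading of the cut)."""
    highest = max(daily_counts)
    lowest = min(daily_counts)
    return (highest - lowest) / highest if highest else 0.0

# Hypothetical series: a high-volume feature and a low-volume feature.
high_volume_counts = [500, 420, 300, 480, 350, 500]
low_volume_counts = [5, 4, 3, 5, 4, 5]

high_volume_ratio = relative_range(high_volume_counts)
low_volume_ratio = relative_range(low_volume_counts)

# Both series have the same relative swing, so any percentage cut treats
# them identically even though their absolute behavior is very different.
print(high_volume_ratio, low_volume_ratio)
```

Whatever percentage is chosen, features with very different distributions can land on the same side of it.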

22 Identify the Bursty Features (6 of 7)
Our solution
–Treat each feature in the text stream as a probability distribution
–For each day, we compute the probability of the number of documents that contain a particular feature, f_j
»What we have:
N’ – no. of news stories in the stream
n’ – no. of news stories in a time window (one day)
K’ – no. of news stories that contain the specific feature
n’ – K’ – no. of news stories that do not contain the specific feature
»We can model the distribution of a feature in a time window (i.e. in a day) by a binomial distribution (the above four elements are enough for computing the binomial distribution)
(Continued on next page)

23 Identify the Bursty Features (7 of 7)
–If, in any time window (day), the value of the binomial distribution (the probability of the number of documents that contain the feature) changes significantly, then it implies that the feature exhibits “abnormal” behavior
»The reason is that if the features are generated from an unknown probability distribution, then the value of the binomial distribution in each time window (in each day) should be more or less constant
–Two reasons that it drops significantly:
»Suddenly very few documents contain the specific feature
We are not interested in this kind of observation, as it only tells us that the specific feature is NOT a bursty feature in the corresponding time window (day). It gives no insight into whether it is a bursty feature NOW.
»Suddenly many documents contain the specific feature
We are interested in this kind of feature
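The binomial idea from the last two slides can be sketched as follows. This is not the authors' procedure: the success probability p is assumed to be the feature's overall share of stories, the cutoff alpha is purely illustrative (the slides do not say how “significantly” is decided, and the talk's method is described as parameter free), and all names and counts are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k of n stories contain the feature), each independently with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k, n, p):
    """P(at least k of n stories contain the feature)."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

def flag_bursty_days(feature_counts, story_counts, alpha=1e-3):
    """Flag days with improbably MANY stories containing the feature.

    Days with improbably few such stories are ignored, since they say
    nothing about the feature being bursty on that day.
    """
    p = sum(feature_counts) / sum(story_counts)  # assumed estimate of p
    flagged = []
    for day, (k, n) in enumerate(zip(feature_counts, story_counts)):
        more_than_expected = k > n * p
        if more_than_expected and upper_tail(k, n, p) < alpha:
            flagged.append(day)
    return flagged

# Hypothetical week: 100 stories per day; the feature spikes on days 3 and 4.
feature_counts = [2, 3, 1, 30, 35, 2, 0]
story_counts = [100] * 7
bursty_days = flag_bursty_days(feature_counts, story_counts)
print(bursty_days)
```

Only the upward deviations are flagged; the day with zero occurrences is also unlikely under the model, but it is skipped for the reason given on the slide.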



26 Group the Bursty Features (1 of 2)
General idea
–Group the features that always appear together
»If the features always appear together, they should be discussing the same event
–Cluster the features
Challenge
–Should we group these two features together?
»Situation: if Feature B appears, Feature A always appears also. Feature A appears in 1,000 stories. Feature B appears in 200 stories.
»We claim that they should not be grouped together, as Feature B is only a subset of Feature A.
We want to group features at the “same level”
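The challenge can be made concrete with a small sketch. It reads the situation as: the stories containing Feature B are a subset of the stories containing Feature A. The story ids are hypothetical, and the two conditional proportions are only an illustration of the asymmetry, not the grouping criterion used in the talk.

```python
def conditional_shares(docs_a, docs_b):
    """Return (P(B | A), P(A | B)) estimated from sets of story ids."""
    both = len(docs_a & docs_b)
    return both / len(docs_a), both / len(docs_b)

# Hypothetical story ids: A appears in 1,000 stories, B in 200 of those same stories.
docs_a = set(range(1000))
docs_b = set(range(200))

p_b_given_a, p_a_given_b = conditional_shares(docs_a, docs_b)

# B always comes with A, but A mostly appears without B, so the two
# features are not at the "same level".
print(p_b_given_a, p_a_given_b)
```

A symmetric co-occurrence count alone would hide this difference, which is why the asymmetry matters when deciding whether to group the two features.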

27 Group the Bursty Features (2 of 2)
Our solution
–We try to figure out the probability that the features are grouped together, given the observed document distribution of the text stream
»Find the grouping of features with the maximum probability (Expectation-Maximization, EM)
–Mathematically,
[Equation not preserved in this transcript]



30 Determine the Hot Periods
General idea
–The period with the highest average probability that the bursty features appear together
Graphically
[Figure: document distribution over time]


32 Problem Setting
Data archived
–Source: local news stories (South China Morning Post)
–Period: 2003-01-01 to 2004-12-31
Major settings
–Offline detection
–News stories that are released on the same day (i.e. news stories that appear in the same issue of the newspaper) are grouped together as a batch

33 Results Highlight
Some events (Bursty Event: Bursty Features)
–SARS: Sars, Outbreak, Atypic, Respire, …
–Legislation: Article, Yip, Law, Rally, …
–Bird Flu: Bird, Flu
–Taiwan Issue: Taiwan, Chen, Shu, Bian
–Iraq War: Iraq, War, Saddam, …
–Gas: Victim, Might, Accident, Gas

34 Outline
Introduction
–Bursty events? Text streams? Etc.
A Possible Method
–Document pivot clustering
Proposed Work
–Feature pivot clustering
Results Highlight
Related Works
Conclusion

35 Related Works (1 of 2)
TDT – automatic techniques for locating topically related material in streams of data (Wayne 2000, pp. 1487)
–Five major tasks: segmentation, tracking, detection, first story detection, linking
–Works well with the “document-pivot clustering” approach
»Tries to group similar documents to form an event (the event is not named, i.e. there is no need to extract or identify the main features of the event)
No need to figure out the “bursty features”
–Other interesting issue
»Our approach naturally combines the detection task and the linking task

36 Related Works (2 of 2)
Many other related works
–Vlachos et al., SIGMOD’04
»Burst for online query
–Smith, SIGIR’02
»Events detection
–Kleinberg, KDD’02
»Burst and hierarchical structure
–Swan & Allan, SIGIR’00
»Time-varying features
–…


38 Summary & Future Work
Document Pivot Clustering vs. Feature Pivot Clustering
–Document Pivot Clustering: clustering is based on the content of the documents
–Feature Pivot Clustering: clustering is based on the distribution of features
Future works
–Try to apply the framework to the TDT dataset
»However, TDT contains selected news stories from multiple sources. The distribution of features may be affected.
»Moreover, the time period of TDT is relatively short. We do not know whether the change in the distribution of features is significant enough for us to do analysis
–Try to assign the same feature to multiple events (more realistic)
»However, this may lead to many new issues, such as a “cycle” appearing, or some parameters needing to be introduced

39 Thank you very much – The End –

