Event Detection and Summarization in Weblogs with Temporal Collocations. Chun-Yuan Teng and Hsin-Hsi Chen, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
2 Outline Motivation Temporal collocation Event detection and summarization using temporal collocations Experiments –Datasets –Evaluation of event detection –Evaluation of event summarization Conclusion
3 Motivation Weblogs –containing abundant life experiences and public opinions toward different topics –highly sensitive to the events occurring in the real world –associated with the personal information of bloggers Problem –How to know what bloggers write and discuss over time? –Event detection is fundamental
4 Google Trends –Plots the frequency of a word and the frequency of news over time –E.g., select the news where the frequency of “president” is highest Ambiguous peak –We cannot tell which president caused the peak of “president”.
5 Collocations A combination of words gives a specific meaning. Measures such as mean and variance, hypothesis testing, and mutual information are used to model the relationship between terms. Can we model collocations over time?
6 Temporal Collocation Mutual Information Temporal Mutual Information –P(x,y|t) denotes the probability of co-occurrence of terms x and y in timestamp t. –P(x|t) and P(y|t) denote the probability of x and y in timestamp t.
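The formulas on this slide did not survive in the transcript. A minimal sketch, assuming temporal mutual information takes the usual pointwise form I(x,y|t) = log[ P(x,y|t) / (P(x|t) P(y|t)) ] and that probabilities are estimated as sentence-level relative frequencies within timestamp t. The function name, the data layout, and the estimation choice are illustrative assumptions, not the authors' implementation.

```python
import math

def temporal_mutual_information(sentences_by_t, x, y, t):
    """Estimate I(x,y|t) from the sentences observed at timestamp t.

    sentences_by_t maps a timestamp to a list of token collections
    (a hypothetical layout chosen for this sketch).
    """
    sentences = sentences_by_t[t]
    n = len(sentences)
    n_x = sum(1 for s in sentences if x in s)
    n_y = sum(1 for s in sentences if y in s)
    n_xy = sum(1 for s in sentences if x in s and y in s)
    if n == 0 or n_xy == 0:
        # No co-occurrence at this timestamp: the log ratio is undefined,
        # so report negative infinity.
        return float("-inf")
    p_xy = n_xy / n
    p_x = n_x / n
    p_y = n_y / n
    return math.log(p_xy / (p_x * p_y))
```

With this estimate, a pair that always appears together in a quarter of the sentences scores higher than a pair of frequent words that co-occur only by chance.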
7 Temporal Collocation Change of Temporal Mutual Information –C(x,y,t1,t2) is the change of temporal mutual information of terms x and y in time interval [t1, t2] –I(x,y| t1) and I(x,y| t2) are the temporal mutual information in time stamps t1 and t2, respectively
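The formula for C(x,y,t1,t2) is also missing from the transcript. The sketch below assumes the plain difference I(x,y|t2) - I(x,y|t1), which matches the "MI2-MI1" wording on the event detection slide; whether the authors also normalize by the interval length is not recoverable here. The function names and the sample series are hypothetical.

```python
def changes_over_time(tmi_series):
    """Return (t1, t2, C) for each pair of consecutive timestamps.

    tmi_series is a time-ordered list of (timestamp, I(x,y|t)) tuples.
    C is assumed to be I(x,y|t2) - I(x,y|t1).
    """
    return [
        (t1, t2, mi2 - mi1)
        for (t1, mi1), (t2, mi2) in zip(tmi_series, tmi_series[1:])
    ]

def largest_rise(tmi_series):
    """Return the consecutive interval with the largest increase."""
    return max(changes_over_time(tmi_series), key=lambda item: item[2])

# Hypothetical values for one collocation, not taken from the slides.
series = [("05-08", 1.0), ("05-09", 1.5), ("05-10", 4.0), ("05-11", 3.5)]
peak = largest_rise(series)
```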
8 Event Detection Identify the collocations resulting in events Retrieve the descriptions of events
9 System Architecture Pre-processing phase –parse the weblogs –retrieve the collocations Event detection phase –detect the unusual peak of the change of temporal mutual information –identify the set of collocations resulting in an event in a specific time duration Event summarization phase –extract the collocations related to the seed collocations found in a specific time duration
10 Pre-processing Phase Retrieve the collocations from the sentences in blog posts –Propose the candidates within a window size –Remove those candidates containing stop-words or with low change of temporal mutual information
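The two pre-processing steps can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, ordered word pairs, a tiny made-up stop-word list, and a caller-supplied table of change scores with a hypothetical `min_change` threshold. None of these details are given on the slide.

```python
from collections import Counter

# Illustrative stop-word list; the authors' list is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def candidate_collocations(tokens, window=2):
    """Yield ordered word pairs at most `window` positions apart,
    skipping any pair that contains a stop-word."""
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left in STOP_WORDS or right in STOP_WORDS:
                continue
            yield (left, right)

def count_candidates(sentences, window=2):
    """Count candidate pairs over a list of sentence strings."""
    counts = Counter()
    for sentence in sentences:
        counts.update(candidate_collocations(sentence.lower().split(), window))
    return counts

def drop_low_change(candidates, change_by_pair, min_change):
    """Keep candidates whose change of temporal MI reaches min_change.
    Pairs without a score are treated as having zero change."""
    return {
        pair for pair in candidates
        if change_by_pair.get(pair, 0.0) >= min_change
    }
```

A larger window proposes more candidates per sentence, so the change-based filter carries more of the pruning work as the window grows.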
11 Event Detection Phase Remove the regular pattern by seasonal index Measure the unusual peak of temporal mutual information to detect the plausible events –change of temporal mutual information (MI2-MI1) favors events with high frequency –relative change of temporal mutual information (MI2-MI1)/MI1 favors events with low mutual information –MI1 and MI2: temporal mutual information at timestamps t1 and t2
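To see why the two measures favor different collocations, here is a small sketch using the definitions on this slide, MI2-MI1 and (MI2-MI1)/MI1. The example values are invented for illustration, the sketch assumes MI1 is positive, and the seasonal-index step is not shown.

```python
def change(mi1, mi2):
    """Change of temporal mutual information: MI2 - MI1."""
    return mi2 - mi1

def relative_change(mi1, mi2):
    """Relative change: (MI2 - MI1) / MI1. Assumes MI1 > 0."""
    if mi1 <= 0:
        raise ValueError("this sketch assumes a positive MI1")
    return (mi2 - mi1) / mi1

# Hypothetical (MI1, MI2) pairs, not taken from the slides.
observations = {
    "saturday night": (40.0, 60.0),   # frequent, recurring pattern
    "tsunami warning": (0.5, 8.0),    # rare, sudden burst
}

top_by_change = max(observations, key=lambda c: change(*observations[c]))
top_by_relative = max(observations, key=lambda c: relative_change(*observations[c]))
```

In this toy data the recurring phrase wins on absolute change (20 versus 7.5), while the rare phrase wins on relative change (15 versus 0.5) because its small MI1 sits in the denominator.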
12 Event Summarization Phase Select the collocations with the highest mutual information with the word w in a seed collocation –Place the seed collocation into a collocation network –Add the collocation having the highest mutual information –Compute the mutual information of the multiword collocations when a new collocation is added –Stop and return the words in the collocation network if the multiword mutual information is lower than a threshold
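The greedy procedure above can be sketched as below. The slide does not define the mutual information of a multiword collocation, so this sketch assumes the mean pairwise mutual information over all word pairs in the network, with unseen pairs counted as 0. It also assumes that the word that pushes the score below the threshold is left out. The function names and data layout are hypothetical.

```python
from itertools import combinations

def multiword_mi(words, pair_mi):
    """Assumed definition: mean pairwise MI over all word pairs."""
    pairs = list(combinations(words, 2))
    return sum(pair_mi.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def summarize_event(seed, pair_mi, threshold):
    """Grow a collocation network from a seed collocation.

    seed: tuple of two words.
    pair_mi: dict mapping frozenset({a, b}) to a mutual information score.
    """
    network = list(seed)
    while True:
        best_word, best_score = None, float("-inf")
        for pair, score in pair_mi.items():
            inside = [w for w in pair if w in network]
            outside = [w for w in pair if w not in network]
            # Consider only collocations that link the network to a new word.
            if len(inside) == 1 and len(outside) == 1 and score > best_score:
                best_word, best_score = outside[0], score
        if best_word is None:
            return network
        candidate = network + [best_word]
        if multiword_mi(candidate, pair_mi) < threshold:
            return network
        network = candidate
```

The threshold controls summary length: a lower value lets weakly related words join the network before the loop stops.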
14 Data Sets ICWSM weblog data set –collected from May 1, 2006 through May 20, 2006 –about 20 GB –2,734,518 English weblog articles used for analysis Gold standard –http://en.wikipedia.org/wiki/May_2006 –The events listed in Wikipedia are not always complete, so we adopt recall rate as the measure –The events listed in Wikipedia are not always discussed in weblogs, so we remove the events that are listed in Wikipedia but not referenced in the weblogs
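The evaluation protocol described here can be sketched as follows: gold-standard events that the weblogs never reference are dropped first, and recall is computed over the remainder. The function name, the set-based representation, and the event identifiers are placeholders for illustration only.

```python
def filtered_recall(gold_events, detected_events, referenced_in_weblogs):
    """Recall over gold events that are actually referenced in the weblogs."""
    relevant = set(gold_events) & set(referenced_in_weblogs)
    if not relevant:
        raise ValueError("no gold events are referenced in the weblogs")
    return len(relevant & set(detected_events)) / len(relevant)

# Placeholder identifiers, not real events from the gold standard.
gold = {"e1", "e2", "e3", "e4"}
referenced = {"e1", "e2", "e3"}
detected = {"e1", "e3", "e9"}
score = filtered_recall(gold, detected, referenced)
```

Precision is not computed, which is consistent with the slide's remark that the Wikipedia list is incomplete: a detected event missing from the list is not necessarily wrong.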
15 Evaluation of Event Detection Phase recall rate: 75%
17 Discussion
Change of MI favors regular events and events with high frequency –Time: May 03 –Feeling: feel left
Collocations               Change of MI
May 03                     9276.08
Illegal immigrants         5833.17
Feel left                  5411.57
Saturday night             4155.29
Past weekend               2405.32
White house                2208.89
Red sox                    2208.43
Album tool                 2120.30
Sunday morning             2006.78
Sunday night               1992.37
Current music              1842.67
Hate studying              1722.32
Stephen Colbert            1709.59
Thursday night             1678.78
Can’t believe              1533.33
Feel asleep                1428.18
Ice cream                  1373.23
Oh god                     1369.52
Illegal immigration        1368.12
Pretty cool                1316.56
Illegal aliens             1217.89
Relative change favors persons or special events –Terrorists killed on May 3: zacarias moussaoui, pramod mahajan –Best actress award in the Golden Globe Awards on May 3: Geena Davis
Collocations               Relative change
casinos online             618.36
zacarias moussaoui         154.68
Tsunami warning            107.93
Conspirator zacarias       71.62
Artist formerly            57.04
Federal jury               41.78
Wed 3                      39.20
Pramod mahajan             35.41
BBC version                35.21
Geena davis                33.64
Diet sodas                 32.50
Ving rhames                31.63
Stock picks                29.09
Happy hump                 28.45
Wong kan                   28.34
Sixapartcom movabletype    28.13
Aaron echolls              27.48
Phnom Penh                 25.78
Livejournal sixapartcom    23.83
George yeo                 20.34
18 Evaluation of Event Summarization Method 1: Employ the highest temporal mutual information Method 2: Utilize the highest product of temporal mutual information and change of temporal mutual information
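The difference between the two selection rules can be sketched as two ranking functions. The scores below are invented for illustration and are not taken from the experiments; the function names and the `(tmi, change)` tuple layout are assumptions.

```python
def rank_method_1(candidates, top_k):
    """Method 1: rank by temporal mutual information alone."""
    return sorted(candidates, key=lambda c: candidates[c][0], reverse=True)[:top_k]

def rank_method_2(candidates, top_k):
    """Method 2: rank by temporal MI times change of temporal MI."""
    return sorted(
        candidates,
        key=lambda c: candidates[c][0] * candidates[c][1],
        reverse=True,
    )[:top_k]

# Hypothetical (temporal MI, change of temporal MI) scores.
candidates = {
    "typhoon named": (5.0, 3.0),
    "typhoon earthquake": (6.0, 0.2),
    "near china": (4.0, 2.5),
}

method_1 = rank_method_1(candidates, top_k=2)
method_2 = rank_method_2(candidates, top_k=2)
```

In this toy data a collocation with high mutual information but almost no change over time ranks first under Method 1 and drops out under Method 2, since the product penalizes associations that are strong but not new.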
19 An Example of Event Retrieval Typhoon Chanchu –Typhoon Chanchu appeared in the Pacific Ocean around 5/10, passed through the Philippines and China, and caused disasters in these areas.
20 Event Summarization for Typhoon Chanchu Using Method 1
21 Event Summarization for Typhoon Chanchu Using Method 2
22 Some Observations The appearance of Typhoon Chanchu cannot be found among the events listed in Wikipedia for May 10. We can identify it from descriptions of the typhoon's appearance such as “typhoon named” and “Typhoon eye.” The path of Typhoon Chanchu can also be inferred from retrieved collocations such as “Philippine China” and “near China”. Bloggers' responses such as “unexpected typhoon” and “8 typhoons” are also extracted.
23 Method 1 vs. Method 2 Method 1 shows more noise than Method 2. The term “typhoon earthquake” is extracted by Method 1 but not by Method 2, because Method 2 also considers the change of temporal mutual information.
24 Concluding Remarks What we have done –Introduced temporal mutual information to capture term-term association over time in weblogs –Selected the extracted collocations with an unusual peak in relative change of temporal mutual information to represent an event –Collected the collocations with the highest product of mutual information and change of temporal mutual information to summarize a specific event Future work –Model the collocations over time and location –Model the relationship between users' preferred usage of collocations and their profiles