A probabilistic model for retrospective news event detection


1 A probabilistic model for retrospective news event detection
Zhiwei Li, Bin Wang, Mingjing Li and Wei-Ying Ma. A probabilistic model for retrospective news event detection. In the 28th Annual International ACM SIGIR Conference (SIGIR'2005), 2005. Presenter: Suhan Yu

2 Introduction
Retrospective news event detection (RED) is defined as the discovery of previously unidentified events in a historical news corpus. News event definition: a specific thing that happens at a specific place and time, and that is reported consecutively by many news articles over a period.

3 Introduction
Observation: a news article contains two kinds of information: contents (the focus of most previous research) and timestamps (often ignored). This paper's contributions include: proposing a multi-modal RED algorithm that uses both content and time information; and proposing an approach to determine the approximate number of events from the article count-time distribution.

4 Characteristics of news articles and events
The Halloween topic contains many events: each year's Halloween is an event. The figure indicates the two most important characteristics. Events are peaks, but in some situations several events can overlap in time. The start and end times of reports on an event are very similar across different websites.

5 Multi-modal retrospective news event detection method
Representation of news articles and news events. News articles are represented by four kinds of information: who (persons), where (locations), what (keywords) and when (time). Time is defined as the period between the first article and the last article, so it consists of two values. A news article and an event are each defined by these four kinds of information, and the four kinds of information of a news article are assumed to be independent.
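The formulas on this slide did not survive in the transcript. A plausible reconstruction from the text above is sketched below; the notation is an assumption, not the slide's own.

```latex
\begin{align*}
\text{article} &= \{\text{persons},\ \text{locations},\ \text{keywords},\ \text{time}\} \\
\text{event}   &= \{\text{persons},\ \text{locations},\ \text{keywords},\ \text{time}\} \\
p(\text{article}) &= p(\text{persons})\, p(\text{locations})\, p(\text{keywords})\, p(\text{time})
\end{align*}
```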

6 The generative model of news articles
Contents: unigram models are used to model contents; persons, locations and keywords are each modeled by their own unigram model, three models in total. Timestamps: a Gaussian Mixture Model (GMM) is chosen to model timestamps. A peak is usually modeled by a Gaussian function, where the mean is the position of the peak and the variance reflects the duration of the event.
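To make the generative story concrete, here is a minimal sketch in Python of drawing one synthetic article: pick an event, draw words from that event's unigram models, and draw a timestamp from that event's Gaussian. The function name `sample_article`, the toy vocabularies of size two and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def sample_article(rng, priors, unigrams, means, stds, n_words=5):
    """Draw one synthetic article under the assumed generative process."""
    j = int(rng.choice(len(priors), p=priors))      # choose an event index
    content = {
        # one unigram model per kind of content: persons, locations, keywords
        kind: rng.choice(len(probs[j]), size=n_words, p=probs[j])
        for kind, probs in unigrams.items()
    }
    time = float(rng.normal(means[j], stds[j]))     # timestamp from the event's Gaussian
    return j, content, time

# Toy parameters (hypothetical): two events, vocabularies of two terms each.
rng = np.random.default_rng(0)
priors = np.array([0.5, 0.5])
unigrams = {
    "persons":   np.array([[0.9, 0.1], [0.1, 0.9]]),
    "locations": np.array([[0.8, 0.2], [0.3, 0.7]]),
    "keywords":  np.array([[0.6, 0.4], [0.4, 0.6]]),
}
means = np.array([10.0, 40.0])   # peak positions, e.g. in days
stds = np.array([2.0, 5.0])      # spread, related to event duration

j, content, time = sample_article(rng, priors, unigrams, means, stds)
```

Note that `rng.normal` takes a standard deviation, so `stds` holds square roots of the variances mentioned on the slide.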

7 The generative model of news articles
(Figure of the generative model; annotation: N = term space size.)

8 Learning model parameters
The model parameters can be estimated by the Maximum Likelihood method. X represents the corpus of news articles; M and k are the number of news articles and the number of events. Given an event j, the four kinds of information of the i-th article are conditionally independent. The EM algorithm is generally applied to maximize the log-likelihood.
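The slide's equations are missing from the transcript. Under the definitions given in the text, the mixture log-likelihood and the conditional-independence assumption would take roughly the following form; the symbols x_i (the i-th article), e_j (the j-th event) and θ (the model parameters) are assumed notation.

```latex
\begin{align*}
l(X;\theta) &= \log \prod_{i=1}^{M} p(x_i \mid \theta)
             = \sum_{i=1}^{M} \log \sum_{j=1}^{k} p(e_j)\, p(x_i \mid e_j, \theta) \\
p(x_i \mid e_j) &= p(\text{time}_i \mid e_j)\, p(\text{persons}_i \mid e_j)\,
                   p(\text{locations}_i \mid e_j)\, p(\text{keywords}_i \mid e_j)
\end{align*}
```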

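9 Maximize log-likelihood
E-step: compute the posterior probability of each event given each article. M-step (update parameters): re-estimate the event priors and the unigram word probabilities. Annotations on the formulas: word n (for example, person = Mary) and the vocabulary size.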
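The two steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the slide's exact formulas: it handles a single content modality for brevity (the full model would add one log-likelihood term each for persons, locations and keywords), and it assumes add-one (Laplace) smoothing in the word-probability update. The function names and toy counts are hypothetical.

```python
import numpy as np

def e_step(log_prior, log_word_probs, tf, log_time_lik):
    """Posterior p(e_j | x_i) for M articles and k events.

    tf: (M, N) term counts; log_word_probs: (k, N); log_time_lik: (M, k).
    """
    log_joint = log_prior + tf @ log_word_probs.T + log_time_lik   # (M, k)
    log_joint -= log_joint.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def m_step_content(post, tf):
    """Update p(e_j) and p(w_n | e_j), assuming add-one smoothing."""
    prior = post.sum(axis=0) / post.shape[0]                       # (k,)
    weighted = post.T @ tf                                         # (k, N) expected counts
    n_terms = tf.shape[1]                                          # vocabulary size
    word_probs = (1.0 + weighted) / (n_terms + weighted.sum(axis=1, keepdims=True))
    return prior, word_probs

# Toy data (hypothetical): three articles, two vocabulary terms, two events.
tf = np.array([[3, 0], [2, 1], [0, 4]])
log_prior = np.log(np.array([0.5, 0.5]))
log_word_probs = np.log(np.array([[0.7, 0.3], [0.2, 0.8]]))
log_time_lik = np.zeros((3, 2))       # timestamps ignored in this toy run

post = e_step(log_prior, log_word_probs, tf, log_time_lik)
prior, word_probs = m_step_content(post, tf)
```

Working in log space in the E-step avoids underflow when articles contain many words.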

10 Maximize log-likelihood
M-step, continued: update the parameters of the GMM, namely the mean and the variance of each Gaussian.
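For the time component, the standard EM update for a Gaussian mixture is a posterior-weighted mean and variance. The sketch below assumes this standard form; the slide's own formulas are not in the transcript, and the function name `m_step_time` is hypothetical.

```python
import numpy as np

def m_step_time(post, times):
    """Posterior-weighted mean and variance of timestamps, one pair per event.

    post: (M, k) posteriors p(e_j | x_i); times: (M,) article timestamps.
    """
    weights = post.sum(axis=0)                                   # (k,)
    means = (post * times[:, None]).sum(axis=0) / weights
    variances = (post * (times[:, None] - means) ** 2).sum(axis=0) / weights
    return means, variances

# Toy example: hard assignments of four articles to two events.
post = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
times = np.array([1.0, 3.0, 10.0, 14.0])
means, variances = m_step_time(post, times)   # means [2, 12], variances [1, 4]
```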

11 How many events?
We assume that only the salient peaks correspond to events. The initial estimate of the number of events can be set to the number of peaks: use a hill-climbing approach to detect all peaks, compute a salient score for each of them, and define the top 20% of peaks as salient peaks. The initial peaks are then split or merged to detect the salient peaks, using a salient score defined for each peak.
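The peak-selection procedure can be sketched as below. The salient-score formula is missing from the transcript, so the score used here (how far a peak extends on each side before a higher count appears) is an assumption made only for illustration, and a simple local-maximum scan stands in for the hill-climbing search. All function names are hypothetical.

```python
def find_peaks(counts):
    """Indices that are strict local maxima of the article-count series."""
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else float("-inf")
        right = counts[i + 1] if i < len(counts) - 1 else float("-inf")
        if c > left and c > right:
            peaks.append(i)
    return peaks

def salient_score(counts, peak):
    """Assumed score: steps to the nearest higher count (or the boundary) on both sides."""
    def reach(step):
        i = peak + step
        while 0 <= i < len(counts) and counts[i] <= counts[peak]:
            i += step
        return abs(i - peak)
    return reach(-1) + reach(+1)

def salient_peaks(counts, fraction=0.2):
    """Keep the top `fraction` of peaks by salient score (at least one)."""
    peaks = find_peaks(counts)
    ranked = sorted(peaks, key=lambda p: salient_score(counts, p), reverse=True)
    keep = max(1, round(fraction * len(ranked))) if ranked else 0
    return sorted(ranked[:keep])

# Toy count-time distribution: articles per day (hypothetical numbers).
counts = [0, 2, 1, 9, 3, 4, 1, 0, 5, 2]
all_peaks = find_peaks(counts)       # [1, 3, 5, 8]
top_peaks = salient_peaks(counts)    # [3]
```

The number of salient peaks returned would then serve as the initial value of k before splitting or merging.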

12 Splitting/merging initial salient peaks
An MDL (Minimum Description Length) penalty is applied when splitting or merging peaks. Annotation: Np = person vocabulary size.
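An MDL-style criterion trades log-likelihood against the number of free parameters. The sketch below counts parameters directly from the model described earlier (k event priors, k Gaussians, and three unigram models per event) and uses the common penalty of half the parameter count times the log of the number of articles. The slide's exact formula is not in the transcript, so both the counting and the penalty form are assumptions; the function names are hypothetical.

```python
import math

def mdl_param_count(k, n_persons, n_locations, n_keywords):
    """Free parameters of a k-event model under the assumed counting."""
    mixture = k - 1            # event priors sum to one
    gaussians = 2 * k          # one mean and one variance per event
    unigrams = k * ((n_persons - 1) + (n_locations - 1) + (n_keywords - 1))
    return mixture + gaussians + unigrams

def mdl_score(log_likelihood, k, n_articles, n_persons, n_locations, n_keywords):
    """Penalised log-likelihood; a higher value is preferred in this sketch."""
    m_k = mdl_param_count(k, n_persons, n_locations, n_keywords)
    return log_likelihood - 0.5 * m_k * math.log(n_articles)

# Toy comparison (hypothetical numbers): does splitting one event into two pay off?
score_k2 = mdl_score(-1000.0, 2, 100, n_persons=3, n_locations=4, n_keywords=5)
score_k3 = mdl_score(-990.0, 3, 100, n_persons=3, n_locations=4, n_keywords=5)
split_is_better = score_k3 > score_k2
```

Here `n_persons` plays the role of Np, the person vocabulary size; a split or merge would be accepted only if it raises the penalised score.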

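13 Event summarization
Each news article is assigned an event label by Maximum a Posteriori (MAP) estimation: the label of a news article is the event with the highest posterior probability given that article.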
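The MAP formula is missing from the transcript. In assumed notation, with y_i the label of article x_i and e_j the j-th event, it would read:

```latex
y_i = \arg\max_{j} \; p(e_j \mid x_i)
```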

14 Algorithm summary

15 Multi-modal RED algorithm application
The HISCOVERY (HIStory disCOVERY) system provides two useful functions: Photo Story and Chronicle. Its news articles come from 12 news sites (such as CNN, MSNBC, BBC…).

16 HISCOVERY system

17 Experimental methods
Data. TDT: benchmarks for event detection; the experiments are run on TDT4, which contains 80 events annotated from news articles. These articles were collected from the period 2000/10 to 2001/1. Each year's reports can be regarded as an event. Extracting named entities: entities are extracted by the BBN NLP tool, which can extract seven types of named entities.

18 Experimental design
The approach is compared with other algorithms: Group Average Clustering (GAC), a hierarchical clustering method that is the best algorithm in the TDT evaluations; and a baseline kNN algorithm.

19 Results
The probabilistic model gains the best results, but the improvement is not significant.

20 Results Named entities

21 result

22 result

23 result 39 events

24 result 46 events

25 Conclusion
Studied two characteristics of news articles and events. Proposed a multi-modal RED algorithm. Future work: use better-suited dynamic models to model news events, such as HMM or ICA (independent component analysis).

