
1 2005/09/13 A Probabilistic Model for Retrospective News Event Detection. Zhiwei Li, Bin Wang*, Mingjing Li, Wei-Ying Ma. University of Science and Technology of China / Microsoft Research Asia. SIGIR 2005

2 2005/09/13 Abstract Retrospective news event detection (RED): the discovery of previously unidentified events in a historical news corpus. Both the contents and the time information of news articles are helpful for RED, but most research focuses on utilizing the contents; few works have explored better uses of time information. Propose: a probabilistic model that incorporates both content and time information in a unified framework. Build an interactive RED system, HISCOVERY, which provides additional functions to present events: Photo Story and Chronicle.

3 2005/09/13 Introduction News event: a specific thing that happens at a specific time and place. RED: the discovery of previously unidentified events in a historical news corpus. Applications: e.g., detecting earthquakes that happened in the last ten years from historical news articles. Exploration: better representations of news articles and events, which should effectively model both the contents and the time information; model events in a probabilistic manner.

4 2005/09/13 Introduction (cont.) Main contributions: Proposing a multi-model RED algorithm in which both the contents and the time information of news articles are modeled explicitly and effectively. Proposing an approach to determine the approximate number of events from the article count-time distribution.

5 2005/09/13 Related Work RED: first proposed and defined by Yang et al. (SIGIR 1998), where an agglomerative clustering algorithm (Group Average Clustering, GAC) was proposed; few directly targeted research works have been reported since. New Event Detection (NED): a similar task that has been extensively studied. The most prevailing approach to NED was proposed by Allan et al. (SIGIR 1998) and Yang et al. (SIGIR 1998). Modifications: better representation of contents and better utilization of time information.

6 2005/09/13 Related Work (cont.) From the aspect of utilizing the contents: TF-IDF and cosine similarity. New distance metrics, such as the Hellinger distance metric (SIGIR 2003). Better representation of documents, i.e., feature selection, Yang et al. (SIGKDD 2002). The usage of named entities has been studied, e.g., in Allan et al. (1999), Yang et al. (2002) and Lam et al. (2001). Re-weighting of terms, first proposed by Allan et al. (1999). Kumaran et al. (SIGIR 2004) exploited both text classification and named entities to improve the performance of NED.

7 2005/09/13 Related Work (cont.) From the aspect of utilizing time information, there are two kinds of usages: some approaches only use the chronological order of documents; the others use decaying functions to modify the similarity metrics of the contents (Brants et al., SIGIR 2003).

8 2005/09/13 Characteristics of News Articles and Events “Halloween” is a topic; it includes many events.

9 2005/09/13 Characteristics of News Articles and Events (cont.) Two most important characteristics of news articles and events: (1) News articles are aroused by news events, and the article count of an event changes with time; events appear as peaks in the count-time distribution. However, in some situations the observed peaks and the events do not correspond exactly. (2) Both the contents and the times of the articles reporting the same event are similar across different news sites; the start and end times of reports on an event are very similar across websites. Method: the first characteristic leads the RED algorithm to be modeled as a latent variable model, where events are latent variables and articles are observations. The second characteristic makes it possible to gather many news stories on the same event by mixing articles coming from different sources.

10 2005/09/13 Multi-model Retrospective News Event Detection Method Multi-model approach: since contents and timestamps have different characteristics, a multi-model approach is proposed to incorporate them in a unified probabilistic framework. Representations: according to common knowledge about news, a news article can be represented by four kinds of information: who (persons), when (time), where (locations), and what (keywords).
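As a concrete illustration only (not from the paper), an article under this who/when/where/what representation could be held in a simple record; the field names below are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NewsArticle:
    """One news article represented by the four kinds of information
    used in the multi-model RED approach (field names are illustrative)."""
    persons: List[str] = field(default_factory=list)    # who
    locations: List[str] = field(default_factory=list)  # where
    keywords: List[str] = field(default_factory=list)   # what
    timestamp: float = 0.0                               # when, e.g. days since the start of the corpus
```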

11 2005/09/13 The Generative Model of News Articles Generative model. Contents: use mixtures of unigram models to model contents; since persons and locations are important, persons, locations and keywords are modeled by three separate mixtures of unigram models. Timestamps: the article count-time distribution is a mixture of many per-event distributions; a peak is usually modeled by a Gaussian function, so a Gaussian Mixture Model (GMM) is chosen to model timestamps. The whole model combines the four mixture models: three mixtures of unigram models and one GMM.

12 2005/09/13 The Generative Model of News Articles (cont.) The two-step generating process of a news article: first, choose an event according to the mixing proportions; second, generate the article's persons, locations, keywords and timestamp from that event's component models.
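A minimal sketch of that two-step process, assuming k events with per-event multinomial distributions over persons, locations and keywords (rows of person_dist, location_dist, keyword_dist) and a per-event Gaussian over timestamps; all parameter names are illustrative, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_article(pi, person_dist, location_dist, keyword_dist, mu, sigma,
                     lengths=(2, 2, 10)):
    """Step 1: choose an event j from the mixing proportions pi.
    Step 2: draw persons, locations, keywords and a timestamp from event j's models."""
    j = rng.choice(len(pi), p=pi)
    n_p, n_l, n_k = lengths
    persons = rng.choice(len(person_dist[j]), size=n_p, p=person_dist[j])
    locations = rng.choice(len(location_dist[j]), size=n_l, p=location_dist[j])
    keywords = rng.choice(len(keyword_dist[j]), size=n_k, p=keyword_dist[j])
    timestamp = rng.normal(mu[j], sigma[j])
    return j, persons, locations, keywords, timestamp
```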

13 2005/09/13 The Generative Model of News Articles (cont.) A graphical representation of this model. N: the term space sizes of the three kinds of entities (N_p, N_l and N_n).

14 2005/09/13 Learning Model Parameters Model parameters can be estimated by the Maximum Likelihood method. Given an event j, the four kinds of information of the i-th article are conditionally independent:
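A plausible reconstruction of that factorization (the exact notation on the original slide may differ):

```latex
p(x_i \mid e_j) \;=\;
  p(\mathrm{persons}_i \mid e_j)\,
  p(\mathrm{locations}_i \mid e_j)\,
  p(\mathrm{keywords}_i \mid e_j)\,
  p(\mathrm{time}_i \mid e_j)
```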

15 2005/09/13 Learning Model Parameters (cont.) Expectation Maximization algorithm EM is generally applied to maximize the log-likelihood. By using the independence assumptions, the parameters of the four mixture models can be estimated independently. In the E-step, compute the posterior probability:
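For a mixture model of this kind, the E-step posterior takes the standard form below (presumably what the slide's missing formula shows), where \pi_j is the mixing proportion of event j:

```latex
p(e_j \mid x_i) \;=\;
  \frac{\pi_j\, p(x_i \mid e_j)}{\sum_{j'} \pi_{j'}\, p(x_i \mid e_{j'})}
```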

16 2005/09/13 Learning Model Parameters (cont.) Expectation Maximization algorithm In the M-step, update the parameters of the four models. For the three mixtures of unigram models, the parameters are updated by:
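The usual M-step re-estimate for a mixture of unigrams is the posterior-weighted relative term frequency, shown here without any smoothing the paper may apply; \mathrm{tf}(w, x_i) denotes the count of term w in article i:

```latex
p(w \mid e_j) \;=\;
  \frac{\sum_i p(e_j \mid x_i)\, \mathrm{tf}(w, x_i)}
       {\sum_i \sum_{w'} p(e_j \mid x_i)\, \mathrm{tf}(w', x_i)}
```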

17 2005/09/13 Learning Model Parameters (cont.) Expectation Maximization algorithm In the M-step, the parameters of the GMM are updated by the re-estimates below. Since the means and variances of the GMM change consistently with the whole model, the Gaussian functions work like sliding windows on the time line.
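The standard GMM re-estimates, with t_i the timestamp of article i, are presumably what the missing formula shows:

```latex
\mu_j \;=\; \frac{\sum_i p(e_j \mid x_i)\, t_i}{\sum_i p(e_j \mid x_i)},
\qquad
\sigma_j^2 \;=\; \frac{\sum_i p(e_j \mid x_i)\,(t_i - \mu_j)^2}{\sum_i p(e_j \mid x_i)}
```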

18 2005/09/13 Learning Model Parameters (cont.) Expectation Maximization algorithm In the M-step, the mixture proportions are updated by the re-estimate below. The EM algorithm increases the log-likelihood monotonically, but it may stop at a local maximum.
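The usual update for the mixing proportions, with M the total number of articles:

```latex
\pi_j \;=\; \frac{1}{M} \sum_{i=1}^{M} p(e_j \mid x_i)
```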

19 2005/09/13 How Many Events? Basic idea: the initial estimate of the number of events can be set to the number of peaks in the count-time distribution, but noise damages the distribution. Salient peaks: define salient scores for peaks as:

20 2005/09/13 How Many Events? (cont.) Salient peaks: use hill-climbing to detect all peaks and calculate their salient scores; the number of the top 20% of peaks is the initial estimate of k. Alternatively, the user can specify the initial value of k and rely on split/merge. Model selection: apply the Minimum Description Length (MDL) principle to select among values of k.
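A rough sketch of the peak-based initialization, assuming a daily article-count histogram; the salient score used here (peak height above its neighbours) is a simplified stand-in for the definition on the slide:

```python
import numpy as np

def initial_events_from_peaks(daily_counts, top_fraction=0.2):
    """Detect peaks in the article count-time distribution and keep the most
    salient top_fraction of them as the initial events (simplified sketch)."""
    counts = np.asarray(daily_counts, dtype=float)
    peaks = []
    for t in range(1, len(counts) - 1):
        if counts[t] >= counts[t - 1] and counts[t] > counts[t + 1]:   # local maximum
            salience = counts[t] - 0.5 * (counts[t - 1] + counts[t + 1])
            peaks.append((salience, t))
    if not peaks:
        return 1, []
    peaks.sort(reverse=True)                           # most salient first
    k = max(1, int(round(top_fraction * len(peaks))))
    seeds = [t for _, t in peaks[:k]]                  # positions used to initialize events
    return k, seeds
```

The returned positions could seed the Gaussian means before running EM. For model selection, MDL criteria for mixture models typically penalize the log-likelihood by (m_k / 2) · log M, where m_k is the number of free parameters for k events and M the number of articles; the exact form used in the paper may differ.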

21 2005/09/13 Event Summarization Two ways to summarize news events: (1) Choose the features with the maximum probabilities to represent an event; for event j, the "protagonist" is the person with the maximum p(person_p | e_j). However, the readability of such feature lists is poor. (2) Choose one news article as the representative for each news event: the article with the maximum p(x_i | e_j). The first article of each event is also a good representative.
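A minimal sketch of both strategies, assuming the learned parameters are already available as plain arrays (all names are illustrative):

```python
import numpy as np

def protagonist(person_dist_j, person_vocab):
    """Feature-based summary: the person with the maximum p(person | e_j)."""
    return person_vocab[int(np.argmax(person_dist_j))]

def representative_article(article_likelihoods_j, articles):
    """Article-based summary: the article with the maximum p(x_i | e_j)."""
    return articles[int(np.argmax(article_likelihoods_j))]
```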

22 2005/09/13 Algorithm Summary 1. Multi-model RED algorithm: a. use the hill-climbing algorithm to find all peaks; b. use salient scores to determine the top 20% of peaks and initialize events correspondingly. 2. Learn model parameters: a. E-step: compute posteriors; b. M-step: update parameters. 3. Increase/decrease the initial number of events until the minimum/maximum number of events is reached: a. split/merge the current big/small peaks and re-initialize events correspondingly; b. go to step 2. 4. Perform model selection by MDL. 5. Summarize.
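An outline-level sketch of this summarized pipeline; every step is passed in as a callable placeholder for the procedures described on the previous slides, so this is a structural sketch rather than the authors' implementation:

```python
def multi_model_red(articles, daily_counts, k_min, k_max,
                    init_k, run_em, mdl_score, adjust_k, summarize):
    """Steps 1-5 of the algorithm summary, with all helpers supplied by the caller."""
    k = init_k(daily_counts)                              # step 1: peaks + salient scores
    candidates = []
    while k_min <= k <= k_max:
        params = run_em(articles, k)                      # step 2: E-step / M-step until convergence
        candidates.append((mdl_score(params, articles), params))
        k = adjust_k(k)                                   # step 3: split/merge and re-initialize
    best_score, best_params = min(candidates, key=lambda c: c[0])  # step 4: MDL model selection
    return summarize(best_params)                         # step 5: event summaries
```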

23 2005/09/13 Application: HISCOVERY System HISCOVERY (HIStory disCOVERY) provides Photo Story and Chronicle. News articles come from 12 news sites. Photo Story

24 2005/09/13 Application: HISCOVERY System (cont.) HISCOVERY Chronicle: the user enters a topic; HISCOVERY searches the news corpus to gather related articles, applies the proposed RED approach to detect events belonging to this topic, and then sorts the summaries of events in chronological order.

25 2005/09/13 Experimental Methods Data preparation: the first dataset is the TDT4 dataset; the second is built by choosing three representative topics from the TDT4 dataset and downloading related articles from several news websites.

26 2005/09/13 Experimental Methods (cont.) Experimental design: in the first two experiments, the cluster number is set to the true number of events, but in practice the event number must be determined automatically. For comparison, Yang et al.'s augmented Group Average Clustering (GAC) and the kNN algorithm are chosen as baselines. Evaluation measures: once the contingency tables are obtained, the corresponding measures (precision, recall, and F1) are calculated.

27 2005/09/13 Results Overall Performance on Dataset 1 The better performance of the full probabilistic model indicates the benefit of modeling named entities by separate models; named entities are very important for news articles.

28 2005/09/13 Results (cont.) Overall Performance on Dataset 2

29 2005/09/13 Results (cont.) How many events? Salient peaks: mutual information is used to measure the fitness of a partition against the ground truth.

30 2005/09/13 Conclusions and Future Work Contribution: a multi-model RED algorithm that models two characteristics of news articles and events. Future work: find better representations of the contents of news articles; study how to use dynamic models, such as the Hidden Markov Model (HMM) and Independent Component Analysis (ICA), to model news events.

