NTNU Speech Lab 1 Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu Language Computer Corporation Presented by Yi-Ting.


1 NTNU Speech Lab 1 Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu Language Computer Corporation Presented by Yi-Ting

2 NTNU Speech Lab 2 References
– Sanda Harabagiu and Finley Lacatusu. Topic Themes for Multi-Document Summarization. SIGIR 2005.
– S. Harabagiu. Incremental Topic Representations. In Proceedings of the 20th COLING Conference, Geneva, Switzerland, 2004.

3 NTNU Speech Lab 3 Outline
– Introduction
– Topic representation
– Theme representation
– Using Topic and Theme representations for MDS
– Evaluating MDS
– Conclusions

4 NTNU Speech Lab 4 Introduction
One of the data-overload problems we face today is that many documents cover the same topic.
Multi-document summaries (MDS) need to be both informative and coherent.
Much prior work in summarization dealt with these two problems separately.
This approach represents topics as a structure of themes, which dictates both (a) the information content to be included in an MDS and (b) the order of the selected themes.

5 NTNU Speech Lab 5 Topic representation (1/12)
Five different topic representations (TRs):
– (TR1) representing topics via topic signatures (TS1)
– (TR2) representing topics via enhanced topic signatures (TS2)
– (TR3) representing topics via thematic signatures (TS3)
– (TR4) representing topics by modeling the content structure of documents
– (TR5) representing topics as templates, implemented as frames with slots and fillers

6 NTNU Speech Lab 6 Topic representation (2/12)
TR1. Topic Representation 1:
– The topic signature is represented as $TS_1 = \{\text{topic}, \langle (t_1, w_1), \ldots, (t_n, w_n) \rangle\}$, where the terms $t_i$ are highly correlated with the topic, each with association weight $w_i$.
– Term selection and weight association are determined using the likelihood ratio.
– With the likelihood ratio method, the confidence level for a specific $-2\log\lambda$ value is found by (a) looking it up in the $\chi^2$ distribution table, (b) using the value c to select an appropriate cutoff for the association weight, and (c) determining the terms selected for the topic signature based on c.

7 NTNU Speech Lab 7 Topic representation (3/12)
TR1. Topic Representation 1:
– A set of documents is preclassified into (a) topic-relevant texts and (b) topic-nonrelevant texts.
– Two hypotheses are compared: Hypothesis 1 (H1) and Hypothesis 2 (H2).
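The slide text does not preserve the formulas for H1 and H2. In the usual likelihood-ratio formulation of topic signatures (Lin and Hovy's, which TR1 appears to follow), they are typically stated as below. The symbols $R$ (the topic-relevant set), $p$, $p_1$, $p_2$ and $\lambda$ are assumptions taken from that formulation, not from the slide itself.

```latex
\begin{align*}
H_1 &: P(R \mid t_i) = p = P(R \mid \neg t_i) \\
H_2 &: P(R \mid t_i) = p_1 \neq p_2 = P(R \mid \neg t_i), \qquad p_1 \gg p_2 \\
\lambda &= \frac{L(H_1)}{L(H_2)}, \qquad
-2\log\lambda \ \text{is asymptotically } \chi^2\text{-distributed}
\end{align*}
```

Under H1 the relevance of a document is independent of the term $t_i$; under H2 the presence of $t_i$ indicates relevance. A large $-2\log\lambda$ therefore favors H2, which is why it can serve as the association weight $w_i$.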

8 NTNU Speech Lab 8 Topic representation (4/12)
TR1. Topic Representation 1: [slide content (figure or table) not preserved in the transcript]
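The term selection described for TR1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binomial likelihood over token counts (Dunning-style log-likelihood ratio), whitespace-tokenized input, and a cutoff of 10.83, the $\chi^2$ critical value for one degree of freedom at the 0.001 level. The function names are hypothetical.

```python
import math
from collections import Counter


def _log_l(k, n, p):
    """Binomial log-likelihood of k successes in n trials (constant term dropped)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)


def neg2_log_lambda(k1, n1, k2, n2):
    """-2 log(lambda) for a term seen k1 times in n1 relevant tokens
    and k2 times in n2 nonrelevant tokens."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                - _log_l(k1, n1, p) - _log_l(k2, n2, p))


def topic_signature(relevant_tokens, nonrelevant_tokens, cutoff=10.83):
    """Return [(term, weight)] for terms over-represented in the relevant set,
    sorted by descending weight. The cutoff value is an assumption."""
    rel, non = Counter(relevant_tokens), Counter(nonrelevant_tokens)
    n1, n2 = sum(rel.values()), sum(non.values())
    if n1 == 0 or n2 == 0:
        return []
    signature = []
    for term, k1 in rel.items():
        k2 = non[term]
        if k1 / n1 <= k2 / n2:
            continue  # not more frequent in relevant texts: skip
        weight = neg2_log_lambda(k1, n1, k2, n2)
        if weight >= cutoff:
            signature.append((term, weight))
    return sorted(signature, key=lambda tw: -tw[1])
```

A term that is equally frequent in both sets gets a weight of zero, so only terms whose distribution differs sharply between relevant and nonrelevant texts survive the cutoff.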

9 NTNU Speech Lab 9 Topic representation (5/12)
TR2. Topic Representation 2:
– Topics can be represented by identifying the relevant relations that exist between topic signature terms: $TS_2 = \{\text{topic}, \langle r_1, \ldots, r_m \rangle\}$, where $r_i$ is a binary relation between two topic concepts.
– Two forms of topic relations are considered: (1) syntax-based relations between the VP and its Subject, Object, or Prepositional Attachments; and (2) C-relations between events and entities that cannot be identified by syntactic constraints but belong to the same context.
– The topic relations are discovered by starting with the topic terms uncovered in TS1 and selecting a seed syntactic relation between them.
– Only the nouns and verbs from TS1 are considered.

10 NTNU Speech Lab 10 Topic representation (6/12)
TR2. Topic Representation 2:
– The iterative process of discovering topic relations has four steps:
– Step 1: generate candidate relations.
– Step 2: rank the candidate topic relations by their Relevance-Rate and their Frequency, where Relevance-Rate = Frequency/Count.
– Step 3: select a new topic relation based on the ranking in Step 2.
– Step 4: restart the discovery, using the latest discovered relation to classify relevant documents.
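Step 2 can be illustrated with a short sketch. The slide does not define Frequency and Count, nor how the two criteria are combined, so the following are assumptions: Frequency is taken as the number of times a relation is found in relevant documents, Count as the number of times it is found in the whole collection, and ties in Relevance-Rate are broken by Frequency. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRelation:
    name: str
    frequency: int  # assumed: occurrences in topic-relevant documents
    count: int      # assumed: occurrences in the whole collection

    @property
    def relevance_rate(self) -> float:
        # Relevance-Rate = Frequency / Count, as stated on the slide
        return self.frequency / self.count if self.count else 0.0


def rank_candidates(candidates):
    """Sort candidates by Relevance-Rate, then Frequency, both descending
    (the lexicographic combination is an assumption)."""
    return sorted(candidates,
                  key=lambda c: (c.relevance_rate, c.frequency),
                  reverse=True)
```

Step 3 would then take the first element of the ranked list as the newly selected topic relation.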

11 NTNU Speech Lab 11 Topic representation (7/12)
TR2. Topic Representation 2: [slide content (figure or table) not preserved in the transcript]

12 NTNU Speech Lab 12 Topic representation (8/12)
TR3. Topic Representation 3:
– A third topic representation is based on the concept of themes: $TS_3 = \{\text{topic}, \langle (Th_1, r_1), \ldots, (Th_s, r_s) \rangle\}$, where $Th_i$ is one of the themes associated with the topic and $r_i$ is its rank.
– The discovery of themes is based on (1) a segmentation of documents produced by the TextTiling algorithm and (2) a method of (i) assigning labels to themes and (ii) ranking them.
– Four cases for theme labeling:
Case 1: a single topic-relevant relation is identified in the segment.
Case 2: several topic relations are recognized in the segment.
Case 3: multiple topic
Case 4: the theme contains topic-relevant terms, but no topic relation.

13 NTNU Speech Lab 13 Topic representation (9/12)
TR4. Topic Representation 4 (Topics Represented as Content Models):
– The content model is a Hidden Markov Model (HMM) in which states correspond to topic themes and state transitions capture either (1) orderings within that domain or (2) the probability of changing from one topic theme to another.
– Step 1: initial topic induction via complete-link clustering.
– Step 2: the model states and the emission/transition probabilities are determined.
– Step 3: Viterbi re-estimation.
– The resulting clusters constitute topic representation TR4.
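The decoding at the heart of Step 3 can be sketched as a standard log-space Viterbi pass. This is only the decoding half of re-estimation (the re-clustering and parameter update are omitted), and it assumes a toy setting in which each observation is a single symbol with a per-state emission probability; in a content model the observations would be whole sentences. All names are hypothetical.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a non-empty observation list."""
    def lg(x):
        return math.log(x) if x > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s].get(observations[0], 0.0)), None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + lg(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + lg(emit_p[s].get(obs, 0.0)), prev)
        best.append(row)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]
```

In a re-estimation loop, the decoded state of each sentence would be used to reassign it to a cluster before the emission and transition probabilities are recomputed.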

14 NTNU Speech Lab 14 Topic representation (10/12)
TR4. Topic Representation 4: [slide content (figure or table) not preserved in the transcript]

15 NTNU Speech Lab 15 Topic representation (11/12)
TR5. Topic Representation 5 (Topics Represented as Extraction Templates):
– Topics can be represented as a set of interrelated concepts, implemented as a frame with slots and fillers.
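A frame with slots and fillers can be pictured as a small data structure. The sketch below is purely illustrative: the class name, the method names, and any slot names used with it are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class TopicFrame:
    """A topic template: named slots, each holding a list of fillers."""
    topic: str
    slots: dict = field(default_factory=dict)  # slot name -> list of fillers

    def fill(self, slot, filler):
        """Add a filler to a slot, creating the slot if needed."""
        self.slots.setdefault(slot, []).append(filler)

    def unfilled(self, expected_slots):
        """Return the expected slots that have no filler yet."""
        return [s for s in expected_slots if not self.slots.get(s)]
```

For example, a frame for a natural-disaster topic might expect slots such as LOCATION and MAGNITUDE, and `unfilled` reports which of them the extracted text has not yet supplied.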

16 NTNU Speech Lab 16 Topic representation (12/12)
TR5. Topic Representation 5:
– It is important to be able to generate scripts automatically from corpora.
– The IS-A and GLOSS lexical relations found in the WordNet lexical database are used to mine topic relations for topic-relevant terms.
– The IS-A and GLOSS relations are combined to generate the topical relations.
– An ad hoc template generation algorithm (five steps) is used.

17 NTNU Speech Lab 17 Theme representation (1/4)
In order to produce exhaustive summaries, MDS systems must be able to identify information that is (1) common to multiple documents in the collection, (2) unique to a single document in the collection, and (3) contradictory to information presented in other documents in the collection.
Extracting all similar sentences would produce a verbose and repetitive summary.
Observing such similar sentences leads to the core of the method for representing themes.
Current semantic parsers are able to recognize all verbal predicates and their arguments.
The recognized predicates are underlined.

18 NTNU Speech Lab 18 Theme representation (2/4) [slide content (figure or table) not preserved in the transcript]

19 NTNU Speech Lab 19 Theme representation (3/4)
The theme representation is generated through the following six steps:
– For every sentence in each document of the collection, the predicate-argument structures are identified. (This involves recognizing paraphrases such as synonyms or idioms.)
– All sentences having at least one common predicate with a common argument are clustered together. The semantic consistency of the other arguments is also checked.
– Conceptual representations are generated for each cluster.
– Candidate themes are selected by considering the mapping of the clusters into (1) topic representation TR3 and (2) topic representation TR4.
– Meaningful relations between the themes are identified: cohesion relations and discourse relations (recognized by naïve Bayes classifiers).
– The themes are structured into a graph.
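The second step, clustering sentences that share a predicate with a common argument, can be sketched as below. Several things are assumed here: each sentence has already been parsed into a list of (predicate, set of arguments) pairs, arguments are compared as plain strings, clusters are formed transitively (single-link), and the semantic-consistency check on the remaining arguments is omitted. The function names are hypothetical.

```python
from itertools import combinations


def share_predicate_and_argument(a, b):
    """True if two sentences have a common predicate with a common argument.
    a, b: lists of (predicate, set_of_arguments) pairs."""
    return any(pred_a == pred_b and args_a & args_b
               for pred_a, args_a in a
               for pred_b, args_b in b)


def cluster_sentences(structures):
    """Group sentence indices into clusters using union-find."""
    parent = list(range(len(structures)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(structures)), 2):
        if share_predicate_and_argument(structures[i], structures[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(structures)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Because the grouping is transitive, two sentences can land in the same cluster without directly sharing a predicate, as long as a third sentence links them; a stricter variant would require pairwise overlap.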

20 NTNU Speech Lab 20 Theme representation (4/4) [slide content (figure or table) not preserved in the transcript]

21 NTNU Speech Lab 21 Using Topic and Theme representations for MDS
Multi-document summarization is performed by (1) extracting sentences that contain the most salient information; (2) compressing the sentences to retain the most important pieces of information; and (3) ordering the extracted sentences into the final summary.
Four extraction methods, two ordering methods, and a separate MDS method are implemented:
– Extraction methods: EM1 (TR1), EM2 (TR2), EM3 (TR3), EM4 (TR5)
– Ordering methods: OM1, OM2
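The extract-then-order pipeline can be illustrated with a signature-based extractor in the spirit of EM1 (TR1). The slides do not say how EM1 scores sentences or how OM1 and OM2 order them, so this sketch makes its own assumptions: a sentence's score is the sum of the weights of the signature terms it contains, the top k sentences are kept, and they are emitted in their original order. Compression is skipped. The function name is hypothetical.

```python
def extract_sentences(sentences, signature, k=2):
    """Pick the k sentences with the highest total signature weight and
    return them in their original order.

    sentences: list of strings
    signature: dict mapping lowercase term -> association weight
    """
    scored = []
    for pos, sentence in enumerate(sentences):
        tokens = (tok.strip(".,;:!?") for tok in sentence.lower().split())
        score = sum(signature.get(tok, 0.0) for tok in tokens)
        scored.append((score, pos))
    # Highest score first; earlier position wins ties.
    top = sorted(scored, key=lambda sp: (-sp[0], sp[1]))[:k]
    return [sentences[pos] for _, pos in sorted(top, key=lambda sp: sp[1])]
```

Keeping the original order is the simplest possible ordering policy; a theme-based ordering would instead follow the structure of the theme graph.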

22 NTNU Speech Lab 22 Evaluating MDS (1/2) [slide content (figure or table) not preserved in the transcript]

23 NTNU Speech Lab 23 Evaluating MDS (2/2) [slide content (figure or table) not preserved in the transcript]

24 NTNU Speech Lab 24 Conclusions
The paper investigated five topic representations that had been used before in MDS and proposed a new representation based on topic themes.
In addition, themes are represented in a graph-like structure that improves the quality of information ordering for MDS.

