
1 Toward Opinion Summarization: Linking the Sources
Veselin Stoyanov and Claire Cardie, Department of Computer Science, Cornell University, Ithaca, NY 14850, USA {ves,cardie}@cs.cornell.edu
ACL 2006 Workshop on Sentiment and Subjectivity in Text
Advisor: Hsin-Hsi Chen; Speaker: Yong-Sheng Lo; Date: 2006/10/23

2 Agenda
Introduction
Toward opinion summarization
–Source coreference resolution
Data set
The method
–Transformation to standard noun phrase coreference resolution
–Coreference resolution, by Ng and Cardie (2002)
Evaluation
Conclusion

3 Introduction 1/4
The problem of opinion summarization
Addressing the dearth of approaches for summarizing opinion information
–Source coreference resolution
»Deciding which source mentions (opinion holders) are associated with opinions that belong to the same real-world entity
»Example (see next page)
Coreference resolution
–Deciding which noun phrases in a text refer to the same real-world entities
–e.g., 阿扁 (A-bian), 陳總統 (President Chen), and 中華民國陳總統 (ROC President Chen) all refer to 陳水扁 (Chen Shui-bian)

4 Introduction 2/4
Example (from a corpus of manually annotated opinions):
“[Target Delaying of Bulgaria’s accession to the EU] would be a serious mistake,” [Source Bulgarian Prime Minister Sergey Stanishev] said in an interview for the German daily Süddeutsche Zeitung. “[Target Our country] serves as a model and encourages countries from the region to follow despite the difficulties,” [Source he] added. [Target Bulgaria] is criticized by [Source the EU] because of slow reforms in the judiciary branch, the newspaper notes. Stanishev was elected prime minister in 2005. Since then, [Source he] has been a prominent supporter of [Target his country’s accession to the EU].

5 Introduction 3/4
[Figure slide; content not captured in this transcript.]

6 Introduction 4/4
Example (source coreference resolution): in the passage on slide 4, the source mentions [Source Bulgarian Prime Minister Sergey Stanishev], [Source he], and [Source he] all refer to the same real-world entity and should be linked, while [Source the EU] is a distinct source.

7 Data set 1/2
MPQA corpus (Wilson and Wiebe, 2003)
–Multi-Perspective Question Answering
–Annotations developed using GATE (General Architecture for Text Engineering); example on the next page
–535 documents manually annotated with phrase-level opinion information
–Collected over an 11-month period, between June 2001 and May 2002
–Suited to the political, government, and commercial domains
–Source coreference chains can be derived from the annotations
–Contains no coreference information for general NPs (those that are not sources)

8 Data set 2/2
Example of annotations in GATE [screenshot not captured in this transcript]

9 The method 1/10
Goal: to solve source coreference resolution (SCR)
Transformation
–How can SCR be transformed into standard noun phrase coreference resolution (NPCR)?
Differences between SCR and NPCR:
1. The sources of opinions do not exactly correspond to the automatic extractors’ notion of noun phrases (NPs)
2. Coreference annotation is time-consuming, so the corpus contains coreference links only for sources rather than for all NPs

10 The method 2/10
The general approach to SCR:
1. Preprocessing
–Obtain an augmented set of NPs in the text
–As in Ng and Cardie (2002): run a tokenizer, sentence splitter, POS tagger, parser, base NP finder, and named entity finder
2. Source-to-noun-phrase mapping
–Three problems arise (next slide); a set of heuristics resolves them
3. Coreference resolution
–Apply a state-of-the-art coreference resolution approach to the transformed data: “Improving Machine Learning Approaches to Coreference Resolution” (Ng and Cardie, 2002)
A sketch of the full pipeline follows below.
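A minimal sketch of the three-step pipeline. The three callables are stand-ins for the components named above, not the authors' actual code:

```python
def source_coreference(document, annotated_sources,
                       extract_nps, map_source_to_np, resolve_coreference):
    """Hypothetical end-to-end SCR pipeline; the three callables are
    stand-ins for the components described on this slide."""
    # Step 1: preprocessing -- tokenizer, sentence splitter, POS tagger,
    # parser, base NP finder, and named entity finder (Ng & Cardie, 2002).
    nps = extract_nps(document)

    # Step 2: map each annotated opinion source to an extracted NP
    # via the heuristics of slides 12-15.
    source_nps = [map_source_to_np(src, nps) for src in annotated_sources]

    # Step 3: standard NP coreference resolution over the mapped NPs;
    # returns clusters of sources referring to the same entity.
    return resolve_coreference(source_nps)
```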

11 The method 3/10
Three problems in mapping sources to NPs:
–Inexact span match
»“Venezuelan people” vs. “the Venezuelan people”
»“Muslims rulers” was not recognized as an NP, while “Muslims” and “rulers” were recognized by the NP extractor
–Multiple matching NPs
»“the country’s new president, Eduardo Duhalde”
»“Latin American leaders at a summit meeting in Costa Rica”
»“Britain, Canada and Australia”
–No matching NP
»“Carmona named new ministers, including two military officers who rebelled against Chavez”
»“many”, “which”, and “domestically”
»“lash” and “taskforce”

12 The method 4/10
The heuristics:
Rule 1
–If a source matches any NP exactly in span, map the source to that NP; do this even if multiple NPs overlap the source
–Example 1 ([determiner] marks the annotated source span, [NP extractor] the automatically extracted NPs)
»[determiner] “the Venezuelan people”
»[NP extractor] “the Venezuelan people”
–Example 2
»[determiner] “the country’s new president, Eduardo Duhalde”
»[NP extractor] “the country’s new president”, “Eduardo Duhalde”

13 The method 5/10
Rule 2
–If no NP matches exactly in span, then:
»If a single NP overlaps the source, map the source to that NP
»If multiple NPs overlap the source, prefer three cases:
·The outermost NP, because longer NPs contain more information
·The last NP, because it is likely the head NP of the phrase
·The NP before a preposition, because a preposition signals an explanatory prepositional phrase

14 The method 6/10
Examples for Rule 2:
1. The outermost NP
–[determiner] “Prime Minister Sergey Stanishev”
–[NP extractor] “Bulgarian Prime Minister”, “Sergey Stanishev”, “Bulgarian Prime Minister Sergey Stanishev”
2. The last NP
–[determiner] “new president, Eduardo Duhalde”
–[NP extractor] “the country’s new president”, “Eduardo Duhalde”
3. The NP before a preposition
–[determiner] “Latin American leaders at a summit meeting in Costa Rica”
–[NP extractor] “Latin American leaders”, “summit meeting”, “Costa Rica”

15 The method 7/10
Rule 3
–If no NP overlaps the source, select the last NP before the source
–Example: Stanishev was elected prime minister in 2005. Since then, [Source he] has been a prominent supporter.
»[determiner] “he”
»[NP extractor] “Stanishev”, “prime minister”, “prominent supporter”
–In half of such cases the source is the word who, which typically refers to the last preceding NP
»“Carmona named new ministers, including two military officers who rebelled against Chavez”
A sketch of all three rules follows below.
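A minimal Python sketch of Rules 1–3, under assumed character-offset spans and a simplified preposition test; the tie-break order among the Rule 2 preferences is an assumption, and the paper's actual implementation may differ:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Span:
    start: int           # character offset where the span begins
    end: int             # character offset just past the span
    next_word: str = ""  # word following the span ("" if unknown)

def map_source_to_np(source: Span, nps: Sequence[Span],
                     prepositions=("of", "at", "in", "for", "by", "to")
                     ) -> Optional[Span]:
    # Rule 1: an exact span match wins, even if other NPs overlap.
    for np in nps:
        if (np.start, np.end) == (source.start, source.end):
            return np

    overlapping = [np for np in nps
                   if np.start < source.end and source.start < np.end]

    # Rule 2: a single overlapping NP is taken directly ...
    if len(overlapping) == 1:
        return overlapping[0]
    if overlapping:
        # ... otherwise prefer the outermost NP (one that contains
        # every other overlapping NP), if such an NP exists,
        for np in overlapping:
            if all(np.start <= o.start and o.end <= np.end
                   for o in overlapping):
                return np
        # then an NP directly before a preposition,
        for np in overlapping:
            if np.next_word.lower() in prepositions:
                return np
        # and finally the last overlapping NP (likely the head NP).
        return max(overlapping, key=lambda np: np.end)

    # Rule 3: no overlapping NP -> the last NP before the source.
    before = [np for np in nps if np.end <= source.start]
    return max(before, key=lambda np: np.end) if before else None
```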

16 The method 8/10
Coreference resolution
–Uses the standard combination of pairwise classification and single-link clustering, as in Soon et al. (2001) and Ng and Cardie (2002)
Machine learning approach
–Compute a vector of 57 features for every pair of source noun phrases from the preprocessed corpus
–Training
»A classifier learns to predict whether a source NP pair should be classified as positive (the NPs refer to the same entity) or negative
–Testing
»The classifier predicts whether each source NP pair is positive, and single-link clustering then groups together sources that belong to the same entity

17 The method 9/10
Example (single-link clustering), using variant references to 李登輝 (Lee Teng-hui):
Training (positive instances): (source, NP) pairs plus their feature vectors
–(李登輝, 李前總統) + 57 features
–(李登輝, 登輝先生) + 57 features
–(阿輝伯, 登輝先生) + 57 features
Testing
–(李前總統, 登輝先生) => positive
–(阿輝伯, 李前總統) => positive
»Resulting chain: 阿輝伯 – 李前總統 – 登輝先生
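Single-link clustering over the classifier's positive pairwise decisions amounts to taking connected components of the "coreferent" graph, which a small union-find computes. A sketch (not the paper's code) that reproduces the chain above:

```python
def single_link_clusters(mentions, positive_pairs):
    """Group mentions connected by positive pairwise decisions
    (single-link clustering = connected components, via union-find)."""
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in positive_pairs:   # each pair the classifier called positive
        parent[find(a)] = find(b)  # merge the two chains

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# The slide's two positive test decisions link all three mentions:
mentions = ["阿輝伯", "李前總統", "登輝先生"]
print(single_link_clusters(mentions, [("李前總統", "登輝先生"),
                                      ("阿輝伯", "李前總統")]))
# -> [['阿輝伯', '李前總統', '登輝先生']]
```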

18 The method 10/10
Machine learning techniques
–The reportedly best techniques for pairwise classification are tried:
»RIPPER (Cohen, 1995): Repeated Incremental Pruning to Produce Error Reduction, run with 24 different settings
»SVMlight: Support Vector Machines, run with 56 different settings
Feature set
–57 features: 12 from Soon et al. (2001), 41 from Ng and Cardie (ACL 2002), and the remainder left unspecified here (??)

19 Feature set (12 features), computed over NP pairs (NP_i, NP_j)
[Feature table not captured in this transcript.]
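The feature table itself is not captured in this transcript. As an illustration only, a few pairwise features in the spirit of Soon et al.'s (2001) set can be sketched; these are representative examples, not the paper's exact 12/41/57 features:

```python
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their"}

def pair_features(np_i: str, np_j: str, sent_i: int, sent_j: int) -> dict:
    """A few illustrative (NP_i, NP_j) features, Soon-et-al. style.

    Illustration only: the paper's 57-feature vector is larger and
    includes grammatical, semantic, and positional cues.
    """
    i, j = np_i.lower().split(), np_j.lower().split()
    determiners = {"a", "an", "the"}
    return {
        # distance in sentences between the two mentions
        "sent_dist": abs(sent_j - sent_i),
        # string match after (crudely) dropping determiners
        "str_match": [w for w in i if w not in determiners]
                     == [w for w in j if w not in determiners],
        # whether either mention is a pronoun
        "pron_i": np_i.lower() in PRONOUNS,
        "pron_j": np_j.lower() in PRONOUNS,
        # head-word (last token) match, a cheap alias signal
        "head_match": i[-1] == j[-1],
    }

print(pair_features("Bulgarian Prime Minister Sergey Stanishev", "he", 1, 2))
```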

20 Feature set (41 features)
[Feature table not captured in this transcript.]

21 Feature set (41 features), cont.
[Feature table not captured in this transcript.]

22 Evaluation
MPQA corpus (535 documents)
–400 documents for the training set (selected at random)
–135 documents for the test set (the remainder)
Purpose of the evaluation
–To create a strong baseline, using the best settings from NP coreference resolution

23 Evaluation: instance selection
Adopt the method of Soon et al. (2001)
–For each NP, select the pairs with its n preceding coreferent mentions as positive instances, and all intervening non-coreferent pairs as negative instances
–Soon 1 (n=1), as in Ng and Cardie (2002)
–Soon 2 (n=2), as in Ng and Cardie (2002)
–None (no instance selection)
A sketch of this selection scheme follows below.
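A sketch of the Soon-style selection scheme, assuming mentions are given in document order and gold coreference chains are known at training time (data layout hypothetical):

```python
def select_instances(mentions, chain_of, n=1):
    """Soon-style training-instance selection, generalized to n.

    mentions: mention ids in document order.
    chain_of: maps a mention id to its gold coreference-chain id.
    For each mention, its n closest preceding coreferent mentions
    become positives, and every mention intervening between it and
    the farthest selected antecedent becomes a negative.
    """
    instances = []  # (mention_i, mention_j, label)
    for j, m_j in enumerate(mentions):
        local, found = [], 0
        for i in range(j - 1, -1, -1):           # scan right to left
            m_i = mentions[i]
            if chain_of[m_i] == chain_of[m_j]:
                local.append((m_i, m_j, 1))      # positive pair
                found += 1
                if found == n:
                    break
            else:
                local.append((m_i, m_j, 0))      # intervening negative
        if found:   # mentions with no preceding antecedent yield nothing
            instances.extend(local)
    return instances
```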

24 Evaluation: performance measures for coreference resolution
B-CUBED (Bagga and Baldwin, 1998)
MUC score (Vilain et al., 1995)
Positive identification
–Precision, recall, and F1 on identifying the positive class
–Computed over the pairwise decisions exactly as the classifier outputs them
–Example (see next page)
Actual positive identification
–Precision, recall, and F1 on identifying the positive class
–Computed by first clustering the source NPs, then considering a pairwise decision positive if the two source NPs belong to the same cluster
–Example (see next page)

25 Sample answer set and classifier output
Sources and their NPs:
–(source) 陳水扁, (NP) 陳水扁總統
–(source) 馬英九, (NP) 市長馬英九
–(source) 陳總統, 阿扁, (NP) 陳總統, 阿扁總統
Positive identification
–The classifier’s output (positive): (陳水扁, 陳水扁總統), (陳水扁, 陳總統)**, (陳水扁, 阿扁總統)**, (馬英九, 市長馬英九), (陳總統, 陳水扁總統)**, (陳總統, 陳總統), (陳總統, 阿扁總統), (阿扁, 陳水扁總統)**, (阿扁, 陳總統), (阿扁, 阿扁總統)
–Answer set: (陳水扁, 陳水扁總統), (馬英九, 市長馬英九), (陳總統, 陳總統), (阿扁, 阿扁總統)
Actual positive identification
–After clustering: (陳水扁, 陳水扁總統), (馬英九, 市長馬英九), (陳總統, 陳總統), (陳總統, 阿扁總統), (阿扁, 陳總統), (阿扁, 阿扁總統)
–Answer set: (陳水扁, 陳水扁總統), (馬英九, 市長馬英九), (陳總統, 陳總統), (阿扁, 阿扁總統)
(陳水扁, 陳總統, and 阿扁 are variant references to Chen Shui-bian; 馬英九 is Ma Ying-jeou.)
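The positive-identification scores above are precision, recall, and F1 over positive pairs. A minimal sketch, treating pairs as unordered (function name hypothetical):

```python
def positive_identification(predicted_pairs, gold_pairs):
    """Precision/recall/F1 over positive (coreferent) pairs.

    Pairs are treated as unordered, so (a, b) == (b, a).
    """
    pred = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# E.g., if the classifier outputs six positive pairs of which four are
# in the answer set, precision = 4/6, recall = 4/4, and F1 = 0.8.
```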

26–30 Evaluation
[Slides 26–30 present the evaluation results; the tables and figures are not captured in this transcript.]

31 Conclusion
A first step toward opinion summarization:
–Targeted the problem of source coreference resolution
–Showed that this problem can be tackled effectively as noun phrase coreference resolution
–Created a strong baseline
Next step
–Develop a method that utilizes the unlabeled NPs in the corpus, using a structured rule learner

