

1 TAC Summarisation System WING Meeting 8 Jul 2011 Ziheng Lin, Praveen Bysani, Jun-Ping Ng

2 Outline Introduction Methodology Experimental Results Conclusion

3 INTRODUCTION

4 TAC 2011 Guided Summarization
Summarization: guided by "importance" of facts
– Highly subjective and content-dependent
Problems with generic summarization:
1. Sentence scoring based on term frequency
– Hindered by synonyms and paraphrases
– Redundancy
2. Extractive: low readability and coherence

5 TAC 2011 Guided Summarization
Guided summarization:
– Topics: template-like categories, highly predictable elements
– A specific, unified information model
– Encourages abstractive summaries
Task:
– Set A: input – 10 news articles and a topic; output – a 100-word summary
– Set B: input – the subsequent 10 news articles for the topic; output – a 100-word update summary

6 TAC 2011 Guided Summarization
Before TAC 2010, a topic used to be:
– Title: Southern Poverty Law Center
– Narrative: Describe the activities of Morris Dees and the Southern Poverty Law Center.
New topic format: category + aspect
5 topic categories:
– Accidents and Natural Disasters
– Attacks
– Health and Safety
– Endangered Resources
– Investigations and Trials

7 TAC 2011 Guided Summarization
Pre-defined aspects for each category, e.g. Health and Safety:
– WHAT: what the issue is
– WHO_AFFECTED: who is affected by the health/safety issue
– HOW: how they are affected
– WHY: why the health/safety issue occurs
– COUNTERMEASURES: countermeasures, prevention efforts

8 TAC 2011 Guided Summarization
Aim:
– Achieve high ROUGE scores
Direction:
– Utilize the category and aspect information

9 METHODOLOGY

10 Design Principles
Need for a testbed to develop and verify ideas and techniques:
– Simple to maintain
– Easy to use
– Quick-footed and flexible

11 Architecture
Pipeline of modules:
– Independent Ruby modules: each can concentrate on a specific part
– Linked up with Linux pipes: simple and stable; intermediate results improve robustness
– Information exchange via JSON: easy to program; human-readable (to a certain extent)
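The JSON-over-pipes design above can be sketched as a single stage. This is a minimal illustration, assuming each Ruby module reads one JSON document on STDIN and emits an augmented document on STDOUT; the field names ("sentences", "text", "length") are invented for the example, not the system's actual schema.

```ruby
require 'json'

# One illustrative pipeline stage: annotate each sentence with its
# word count, then pass the enriched document downstream.
def add_sentence_length(doc)
  doc["sentences"].each do |s|
    s["length"] = s["text"].split.size # word count as a simple feature
  end
  doc
end

# In the pipeline, the stage would be driven like this:
#   doc = JSON.parse($stdin.read)
#   puts JSON.generate(add_sentence_length(doc))
# so stages chain as: cat input.json | ruby stage1.rb | ruby stage2.rb
doc = { "sentences" => [{ "text" => "Menu Foods recalled pet food." }] }
puts JSON.generate(add_sentence_length(doc))
```

Because every stage speaks the same JSON shape on standard streams, any module can be swapped out or debugged in isolation with a saved intermediate file.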

12 Overall Flow
– Train: Support Vector Regression
– Test: generate summaries
– Verify: ROUGE evaluation

13 Summary Generation Pipeline
– Input: SentenceSplitter, GetStanfordNERParse
– Features: SentencePosition, SentenceLength, Bigrams-DocFrequency, KL-Divergence, Category relevance score, Category differential measure
– SentenceSelection: SupportVectorRegression, MMRWithRouge
– PostProcessing: SentenceReduction
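The sentence-selection step can be sketched as an MMR-style loop. This is a hedged illustration: the relevance scores would come from the SVR model, plain unigram overlap stands in for the ROUGE-based similarity of MMRWithRouge, and the lambda value and 100-word budget are assumed for the example.

```ruby
# Unigram-overlap similarity, a stand-in for the ROUGE-based measure.
def overlap(a, b)
  wa, wb = a.split.uniq, b.split.uniq
  return 0.0 if wa.empty? || wb.empty?
  (wa & wb).size.to_f / [wa.size, wb.size].min
end

# Greedily pick sentences that balance relevance against redundancy
# with already-selected sentences, under a word budget.
def mmr_select(scored, budget: 100, lambda_: 0.7)
  selected = []
  used = 0
  candidates = scored.dup
  until candidates.empty?
    sent, _score = candidates.max_by do |s, sc|
      redundancy = selected.map { |t| overlap(s, t) }.max || 0.0
      lambda_ * sc - (1 - lambda_) * redundancy
    end
    candidates.delete(sent)
    len = sent.split.size
    next if used + len > budget # drop sentences that overflow the budget
    selected << sent
    used += len
  end
  selected
end
```

With a lower lambda, a novel but lower-scored sentence can beat a near-duplicate of one already selected, which is exactly the redundancy control the pipeline needs.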

14 FEATURES

15 Generic Word Importance
Document frequency (DF): a successful feature in past summarization tasks
– word-level feature
– computed over all relevant documents in a cluster
– DF(w) = d / D
Extended version:
– from unigrams to bigrams
– smoothed with unigrams for better recall during sentence scoring:
  dfs = α · dfs_uni + (1 − α) · dfs_bi
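The two formulas above can be sketched directly. This is an illustration only: averaging the two unigram DFs to get dfs_uni for a bigram, and alpha = 0.3, are assumptions of the sketch, not the system's actual choices. Documents are represented as token arrays.

```ruby
# DF(w) = d / D: fraction of documents in the cluster containing the term.
def df(term, docs)
  docs.count { |d| d.include?(term) }.to_f / docs.size
end

# Smoothed bigram document frequency:
#   dfs = alpha * dfs_uni + (1 - alpha) * dfs_bi
def smoothed_bigram_df(bigram, docs, alpha: 0.3)
  w1, w2 = bigram
  dfs_uni = (df(w1, docs) + df(w2, docs)) / 2.0 # assumption: average the unigram DFs
  dfs_bi  = docs.count { |d| d.each_cons(2).include?(bigram) }.to_f / docs.size
  alpha * dfs_uni + (1 - alpha) * dfs_bi
end
```

The smoothing matters because exact bigrams are sparse: a bigram unseen in most documents still inherits credit from its frequent component unigrams.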

16 KL-Divergence
Step 1: gather statistics of words over a reference corpus
Step 2: collapse words with similar distributions into the same equivalence class
– similarity measured with KL-divergence
Step 3: repeat steps 1 and 2 for the target document set
Step 4: naïve Bayes formulation to compute the likelihood of a word appearing in the document set
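The similarity measure in step 2 can be sketched as a small helper. Distributions are hashes from outcome to probability; the epsilon guarding log(0) is an assumption of this sketch.

```ruby
# KL(p || q) = sum_x p(x) * log(p(x) / q(x)); zero only when p == q.
def kl_divergence(p, q, eps: 1e-10)
  p.sum do |outcome, prob|
    next 0.0 if prob.zero?
    prob * Math.log(prob / (q.fetch(outcome, 0.0) + eps))
  end
end
```

Words whose context distributions have near-zero divergence from each other would fall into the same equivalence class.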

17 Category Relevance Score
DF extended to the category level: frequency in terms of both topics and documents in a category
Weighted linear combination of both:
– crs = α · top_freq + (1 − α) · doc_freq
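The combination above is a one-liner; the default alpha and the input values are illustrative only.

```ruby
# crs = alpha * top_freq + (1 - alpha) * doc_freq
def category_relevance(top_freq, doc_freq, alpha: 0.5)
  alpha * top_freq + (1 - alpha) * doc_freq
end
```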

18 Category Differential Measures
KL-divergence computes the difference between probability distributions; here it is used to identify discriminative words for a category
– C-KLD of a word across the current category (c) and the rest of the categories (c^)
– the greater the divergence, the more discriminative the word for the category
Calculating importance:
– word lists with the highest divergence
– average word divergence per sentence
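The "average word divergence per sentence" score can be sketched as follows, assuming a precomputed table mapping each word to its C-KLD value; the table values in the test are made up for illustration.

```ruby
# Score a sentence by the mean C-KLD of its words; unknown words
# contribute zero divergence.
def avg_divergence(sentence, ckld)
  words = sentence.downcase.split
  return 0.0 if words.empty?
  words.sum { |w| ckld.fetch(w, 0.0) } / words.size
end
```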

19 Category Differential Measures (cont.)
Relevance Frequency (RF):
– a term-weighting scheme for text categorization (Lan, Tan et al.)
– unlike idf and other schemes set in an IR context, it captures the discriminative power of a word
– RF = log(2 + a/c), where 'a' is the word's frequency in C and 'c' its frequency in C^
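RF as defined above is direct to compute; the base-2 log and flooring c at 1 (so words absent from C^ don't divide by zero) follow the usual formulation but are assumptions of this sketch.

```ruby
# RF = log2(2 + a / max(c, 1)): a word frequent in the current category C
# but rare elsewhere (C^) gets a high weight; RF bottoms out at 1.0.
def relevance_frequency(a, c)
  Math.log2(2 + a.to_f / [c, 1].max)
end
```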

20 EXPLORATION

21 Named Entities
In many categories, "who" and "where" are important aspects of the summary
Named-entity recognition can identify person names and places
How do we use this to improve our summaries?
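One hypothetical answer to the question above, purely as a sketch: boost a sentence's relevance score for each named entity whose type matches an aspect of interest. The boost value and the type labels are assumptions, not anything the system implements.

```ruby
# Add a fixed bonus per aspect-relevant entity ("who"/"where") found
# in the sentence by the NER stage.
def ne_boost(score, entity_types, boost: 0.1)
  relevant = entity_types & ["PERSON", "LOCATION"]
  score + boost * relevant.size
end
```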

22 RESULTS

23 Baseline Experiments

Set A                 ROUGE-2   ROUGE-SU4
Top TAC System        0.09574   0.13014
DFSB+SP+SL            0.10983   0.13908
DFSB+SP+SL+KLD        0.10319   0.13472

Set B                 ROUGE-2   ROUGE-SU4
Top TAC System        0.08024   0.12006
DFSB+SP+SL            0.08063   0.11767
DFSB+SP+SL+KLD        0.07878   0.11570

Trained on TAC 2009 and tested on TAC 2010.

24 Guided Experiments

Set A                 ROUGE-2   ROUGE-SU4
DFSB+SP+SL (B)        0.10277   0.13318
B+CFS                 0.10410   0.13443
B+CKLD                0.10433   0.13449
B+RF                  0.10239   0.13121

Test data (TAC 2010) split into two parts to test the efficacy of the new features. The features suffer from the limited category information in the training set.

25 Sample Summary
Category: Health issues. Topic: pet food recall.

An unknown number of cats and dogs suffered kidney failure and about 10 died after eating the affected pet food. Menu Foods, the Ontario-based company that produced the pet food, said Saturday it was recalling dog food sold under 48 brands and cat food sold under 40 brands including Iams, Nutro and Eukanuba. The food was distributed throughout the United States, Canada and Mexico by major retailers such as Wal-Mart, Kroger and Safeway. However, the recalled products were made using wheat gluten purchased from a new supplier, since dropped for another source. The company said it manufactures for 17 of the top 20 North American retailers.

Aspects covered: WHAT, WHO_AFFECTED, HOW, WHY, COUNTERMEASURES

26 CONCLUSION

27 It’s Just The Beginning
– Preprocessing
– Scoring: more features; beyond SVR and MMR
– Postprocessing: sentence re-ordering; language generation

28 REFERENCES

29 Baker and McCallum, Distributional Clustering of Words for Text Classification, SIGIR 1998

