TAC Summarisation System WING Meeting 8 Jul 2011 Ziheng Lin, Praveen Bysani, Jun-Ping Ng.

Outline
– Introduction
– Methodology
– Experimental Results
– Conclusion

INTRODUCTION

TAC 2011 Guided Summarization
Summarization: guided by “importance” of facts
– Highly subjective and content-dependent
Problems with generic summarization
1. Sentence scoring based on term frequency
    Hindered by synonyms and paraphrases
    Redundancy
2. Extractive: low readability and coherence

TAC 2011 Guided Summarization
Guided summarization:
– Topics: template-like categories, highly predictable elements
– A specific, unified information model
– Encourage abstractive summaries
Task:
– Set A: input is 10 news articles and a topic; output is a 100-word summary
– Set B: input is the subsequent 10 news articles for the topic; output is a 100-word update summary

TAC 2011 Guided Summarization
Before TAC 2010, a topic used to be:
– Title: Southern Poverty Law Center
– Narrative: Describe the activities of Morris Dees and the Southern Poverty Law Center.
New topic format: category + aspect
5 topic categories:
– Accidents and Natural Disasters
– Attacks
– Health and Safety
– Endangered Resources
– Investigations and Trials

TAC 2011 Guided Summarization
Pre-defined aspects for each category:
– Health and Safety:
    WHAT: what is the issue
    WHO_AFFECTED: who is affected by the health/safety issue
    HOW: how they are affected
    WHY: why the health/safety issue occurs
    COUNTERMEASURES: countermeasures, prevention efforts

TAC 2011 Guided Summarization
Aim
– Achieve high ROUGE scores
Direction
– Utilize the category and aspect info

METHODOLOGY

Design Principles
Need for a testbed to develop and verify ideas and techniques
– Simple to maintain
– Easy to use
– Quick-footed and flexible

Architecture
Pipeline of modules
– Independent Ruby modules
    Can concentrate on specific parts
– Linked up with Linux pipes
    Simple and stable
    Intermediate results improve robustness
– Information exchange via JSON
    Easy to program
    Human readable (to a certain extent)
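As an illustration of the pipe-and-JSON design, here is a minimal sketch of two pipeline stages. It is written in Python rather than the Ruby used by the actual system, and the stage names, the `sentences`/`text`/`features` JSON layout, and the `run_stage` helper are all assumptions made for this example, not the system's real interface.

```python
import io
import json


def run_stage(instream, outstream, transform):
    """Read one JSON document, apply a stage, write one JSON document."""
    doc = json.load(instream)
    json.dump(transform(doc), outstream)
    outstream.write("\n")


def add_sentence_length(doc):
    """Hypothetical stage: annotate each sentence with its token count."""
    for sent in doc["sentences"]:
        sent.setdefault("features", {})["length"] = len(sent["text"].split())
    return doc


def add_sentence_position(doc):
    """Hypothetical stage: annotate each sentence with its relative position (0 = first)."""
    n = len(doc["sentences"])
    for i, sent in enumerate(doc["sentences"]):
        sent.setdefault("features", {})["position"] = i / max(1, n - 1)
    return doc
```

If each stage were saved as its own script that calls `run_stage(sys.stdin, sys.stdout, ...)`, the stages could be chained with ordinary shell pipes, and the JSON written by any stage could be saved and inspected when a later stage fails.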

Overall Flow
– Train: Support Vector Regression
– Test: Generate Summaries
– Verify: ROUGE evaluation

Summary Generation Pipeline
– Input: SentenceSplitter, GetStanfordNERParse
– Features: SentencePosition, SentenceLength, Bigrams-DocFrequency, KL-Divergence, Category Relevance score, Category Differential measure
– SentenceSelection: SupportVectorRegression, MMRWithRouge
– PostProcessing: SentenceReduction
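The sentence selection step can be sketched as a greedy Maximal Marginal Relevance (MMR) loop over regression scores. The slides do not define MMRWithRouge, so the details below are assumptions: redundancy is measured with a simple Jaccard word overlap rather than ROUGE, and `lam`, `max_words`, and the function names are hypothetical.

```python
def word_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def mmr_select(sentences, scores, max_words=100, lam=0.7):
    """Greedily pick sentences, trading predicted score against redundancy."""
    selected = []                      # indices of sentences already in the summary
    remaining = list(range(len(sentences)))
    used = 0                           # words used so far

    def mmr(i):
        # Redundancy is the highest overlap with any already selected sentence.
        redundancy = max(
            (word_overlap(sentences[i], sentences[j]) for j in selected),
            default=0.0,
        )
        return lam * scores[i] - (1 - lam) * redundancy

    while remaining:
        best = max(remaining, key=mmr)
        remaining.remove(best)
        length = len(sentences[best].split())
        if used + length <= max_words:  # skip sentences that would exceed the budget
            selected.append(best)
            used += length
    return [sentences[i] for i in selected]
```

Here `scores` stands in for the per-sentence output of the regression model, and the default 100-word budget matches the TAC summary length.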

FEATURES

Generic Word Importance
Document Frequency: successful feature in past summarization tasks
– Word-level feature
– Computed over all relevant documents in a cluster
– DF(w) = d/D, where d is the number of documents in the cluster that contain w and D is the total number of documents in the cluster
Extended version
– From unigrams to bigrams
– Smoothed with unigrams for better recall during sentence scoring
– dfs = α × dfs_uni + (1 − α) × dfs_bi
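A minimal sketch of the document frequency feature and its bigram extension follows. The slides do not say how n-gram scores are aggregated into a sentence score, so averaging over the sentence's n-grams is an assumption here, as are the function names and the default `alpha`.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequency(docs, n):
    """DF(g) = d / D: fraction of the D documents that contain n-gram g.

    `docs` is a list of documents; each document is a list of tokenised sentences.
    """
    counts = Counter()
    for doc in docs:
        seen = set()
        for sent in doc:
            seen.update(ngrams(sent, n))
        counts.update(seen)            # each document counts at most once per n-gram
    return {g: d / len(docs) for g, d in counts.items()}


def dfs_score(sentence, df_uni, df_bi, alpha=0.5):
    """dfs = alpha * dfs_uni + (1 - alpha) * dfs_bi (assumed: averaged per n-gram)."""
    def avg(grams, df):
        return sum(df.get(g, 0.0) for g in grams) / len(grams) if grams else 0.0

    return alpha * avg(ngrams(sentence, 1), df_uni) + (1 - alpha) * avg(ngrams(sentence, 2), df_bi)
```

The unigram term keeps a sentence from scoring zero when none of its bigrams recur across the cluster, which is the recall benefit the slide refers to.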

KL-Divergence
Step 1:
– Get statistics of words over reference corpus
Step 2:
– Collapse words with similar distribution into same equivalence class
– Similarity measured with KL-Divergence
Step 3:
– Repeat (1) and (2) for target document set
Step 4:
– Naïve Bayes formulation to compute likelihood of word appearing in document set
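Step 2 can be illustrated with a small sketch that groups words whose distributions are close under a KL-based measure. This is a simplification under stated assumptions: each word is represented by a distribution over some set of labels (here, hypothetical categories), similarity is the average KL divergence to the mean distribution, and the greedy threshold grouping stands in for whatever clustering procedure the system actually used. Steps 3 and 4 are not covered.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for distributions given as dicts; q must cover p's support."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def divergence_to_mean(p, q):
    """Average KL divergence of p and q to their mean; symmetric and always finite."""
    mean = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    return 0.5 * kl_divergence(p, mean) + 0.5 * kl_divergence(q, mean)


def group_words(word_dists, threshold=0.05):
    """Greedily place each word in the first class whose representative is close enough."""
    classes = []  # list of (representative distribution, member words)
    for word, dist in word_dists.items():
        for rep, members in classes:
            if divergence_to_mean(dist, rep) <= threshold:
                members.append(word)
                break
        else:
            classes.append((dist, [word]))
    return [members for _, members in classes]
```

Plain KL divergence is asymmetric and undefined when the second distribution assigns zero probability, which is why the sketch compares both distributions against their mean.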

Category Relevance Score
– DF extended to category level
– Frequency in terms of both topics and documents in categories
– Weighted linear combination of both: crs = α × top_freq + (1 − α) × doc_freq
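The category relevance score can be sketched as follows. The slide gives only the linear combination, so the definitions used here are assumptions: `top_freq` is taken as the fraction of a category's topics that mention the word, and `doc_freq` as the fraction of the category's documents that mention it.

```python
def category_relevance(word, topics, alpha=0.5):
    """crs = alpha * top_freq + (1 - alpha) * doc_freq for one category.

    `topics` is the list of topics in the category; each topic is a list of
    documents; each document is a set of words.
    """
    docs = [doc for topic in topics for doc in topic]
    # Fraction of topics in which at least one document contains the word.
    top_freq = sum(any(word in doc for doc in topic) for topic in topics) / len(topics)
    # Fraction of all documents in the category that contain the word.
    doc_freq = sum(word in doc for doc in docs) / len(docs)
    return alpha * top_freq + (1 - alpha) * doc_freq
```

A word that appears in every topic of a category but in few documents per topic still receives credit through `top_freq`, which plain document frequency would not give it.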

Category Differential Measures
KL Divergence
– Compute difference between probability distributions
– To identify discriminative words for a category
– C-KLD of a word across the current category (c) and the rest of the categories (c^)
– The greater the divergence, the more discriminative the word for the category
Calculating importance
– Word lists with highest divergence
– Average word divergence per sentence
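One way to realise the category KL divergence feature is to score each word by its pointwise contribution to KL(c || c^) and then average over a sentence. This is a sketch under assumptions: the slides do not give the exact C-KLD formula, and the add-k smoothing and function names are choices made for this example.

```python
import math
from collections import Counter


def word_divergence(cat_tokens, rest_tokens, smoothing=1.0):
    """Per-word KL contribution p_c(w) * log(p_c(w) / p_rest(w)) with add-k smoothing."""
    cat, rest = Counter(cat_tokens), Counter(rest_tokens)
    vocab = set(cat) | set(rest)
    cat_total = sum(cat.values()) + smoothing * len(vocab)
    rest_total = sum(rest.values()) + smoothing * len(vocab)
    scores = {}
    for w in vocab:
        p_c = (cat[w] + smoothing) / cat_total     # probability in current category
        p_r = (rest[w] + smoothing) / rest_total   # probability in all other categories
        scores[w] = p_c * math.log(p_c / p_r)
    return scores


def sentence_divergence(tokens, scores):
    """Average word divergence over the sentence; unseen words count as 0."""
    return sum(scores.get(t, 0.0) for t in tokens) / len(tokens) if tokens else 0.0
```

Sorting `scores` in descending order gives the word list with highest divergence, and `sentence_divergence` gives the per-sentence average mentioned on the slide.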

Category Differential Measures (cont.)
Relevance Frequency (RF)
– Term weighting scheme for text categorization (Lan, Tan et al.)
– Different from idf and other weights that are set in an IR context
– Discriminative power of a word: RF = log(2 + a/c)
    ‘a’: frequency in C
    ‘c’: frequency in C^
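The RF weight is small enough to show directly. The slide gives only RF = log(2 + a/c); the base-2 logarithm and the `max(1, c)` guard against division by zero used below are assumptions, chosen so that a word absent from the category gets a weight of exactly 1.

```python
import math


def relevance_frequency(a, c):
    """RF = log2(2 + a / max(1, c)).

    a: frequency of the word in the current category C
    c: frequency of the word in the remaining categories C^
    """
    return math.log2(2 + a / max(1, c))
```

For example, a word seen 6 times in C and once in C^ gets `relevance_frequency(6, 1)`, which is log2(8) = 3.0, while a word never seen in C gets 1.0 however often it occurs elsewhere.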

EXPLORATION

Named Entities
– In many categories, “who” and “where” are important aspects of the summary
– Named-entity recognition can identify the names of people and places
– How do we use this to improve our summaries?

RESULTS

Baseline Experiments
Trained on 2009 and tested on 2010
Metrics: ROUGE-2, ROUGE-SU4
Set A systems: Top TAC System; DFSB+SP+SL; DFSB+SP+SL+KLD
Set B systems: Top TAC System; DFSB+SP+SL; DFSB+SP+SL+KLD

Guided Experiments
Metrics (Set A): ROUGE-2, ROUGE-SU4
Systems: DFSB+SP+SL (B); B+CFS; B+CKLD; B+RF
– Test data (TAC 2010) split into two parts to test the efficiency of new features
– Features suffer from less category information in the training set

Sample Summary
Category: Health issues
Topic: Pet food recall
An unknown number of cats and dogs suffered kidney failure and about 10 died after eating the affected pet food. Menu Foods, the Ontario-based company that produced the pet food, said Saturday it was recalling dog food sold under 48 brands and cat food sold under 40 brands including Iams, Nutro and Eukanuba. The food was distributed throughout the United States, Canada and Mexico by major retailers such as Wal-Mart, Kroger and Safeway. However, the recalled products were made using wheat gluten purchased from a new supplier, since dropped for another source. The company said it manufacturers for 17 of the top 20 North American retailers.
Aspects: What, Who affected, How, Why, Countermeasures

CONCLUSION

It’s Just The Beginning
Preprocessing
Scoring
– Features
– Beyond SVR and MMR
Postprocessing
– Sentence re-ordering
– Language generation

REFERENCES

Baker and McCallum, Distributional Clustering of Words for Text Classification, SIGIR 1998