Text Summarization
Jagadish M (07305050), Annervaz K M (07305063), Joshi Prasad (07305047), Ajesh Kumar S (07305065), Shalini Gupta (07305R02)


Introduction
• Summary: a brief but accurate representation of the contents of a document.
• Goal: take an information source, extract its most important content, and present it to the user in a condensed form and in a manner sensitive to the user's needs.
• Compression: the amount of text to present, expressed as the ratio of the length of the summary to the length of the source.

MSWord AutoSummarize

Presentation Outline
• Motivation
• Different Genres
• Simple Statistical Techniques
• Degree Centrality
• LexRank
• Lexical/Co-reference Chains
• Rhetorical Structure Theory
• WordNet Based Methods
• DUC/TAC

Motivation
• Abstracts for scientific and other articles
• News summarization (mostly multiple-document summarization)
• Classification of articles and other written data
• Web pages for search engines
• Web access from PDAs and cell phones
• Question answering and data gathering

Genres
• Indicative vs. informative: used for quick categorization vs. content processing.
• Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
• Generic vs. query-oriented: provides author's view vs. reflects user's interest.
• Background vs. just-the-news: assumes reader's prior knowledge is poor vs. up-to-date.
• Single-document vs. multi-document source: based on one text vs. fuses together many texts.

Statistical scoring
• Scoring techniques
  - Word frequencies throughout the text (Luhn58)
  - Position in the text (Edmundson69)
  - Title method (Edmundson69)
  - Cue phrases in sentences (Edmundson69)
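The position, title and cue-phrase features can be combined into one sentence score. The sketch below is only an illustration of that idea: the cue list, the linear weighting and the "earlier is better" position score are assumptions, not values taken from Edmundson's work or from these slides.

```python
import re

# Hypothetical cue phrases, chosen only for illustration.
CUE_PHRASES = ("in conclusion", "significantly", "this paper")


def tokens(text):
    """Lower-case word tokens of a string."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edmundson_style_scores(sentences, title, w_pos=1.0, w_title=1.0, w_cue=1.0):
    """Score each sentence by position, title overlap and cue phrases."""
    title_words = set(tokens(title))
    n = len(sentences)
    scores = []
    for i, sentence in enumerate(sentences):
        position = 1.0 - i / n  # assumption: earlier sentences matter more
        title_overlap = sum(1 for w in tokens(sentence) if w in title_words)
        cue_hits = sum(1 for c in CUE_PHRASES if c in sentence.lower())
        scores.append(w_pos * position + w_title * title_overlap + w_cue * cue_hits)
    return scores
```

The weights `w_pos`, `w_title` and `w_cue` are free parameters here; in practice they would be tuned on a corpus.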

Luhn58
• Important words occur fairly frequently
• Earliest work in the field
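A minimal sketch of frequency-based scoring in the spirit of Luhn58: each sentence is scored by the average corpus frequency of its non-stop words. This is a simplification (it ignores Luhn's clustering of significant words), and the stop-word list is a tiny illustrative assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "was"}


def luhn_style_scores(sentences):
    """Average document-wide frequency of the content words in each sentence."""
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    freq = Counter(w for sent in tokenized for w in sent)
    return [
        sum(freq[w] for w in sent) / len(sent) if sent else 0.0
        for sent in tokenized
    ]
```

Sentences with the highest scores would be extracted first, up to the desired compression.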

Statistical Approaches (contd.)
• Degree Centrality
• LexRank
• Continuous LexRank

Degree Centrality
• Problem formulation
  - Represent each sentence by a vector
  - Denote each sentence as the node of a graph
  - Cosine similarity determines the edges between nodes

Degree Centrality
• Since we are interested in significant similarities, we can eliminate some low values in this matrix by defining a threshold.

Degree Centrality
• Compute the degree of each sentence
• Pick the nodes (sentences) with high degrees
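The three Degree Centrality slides can be sketched as follows: build a vector per sentence, connect sentences whose cosine similarity exceeds a threshold, and count neighbours. The sketch assumes raw term counts as the sentence vectors and an arbitrary default threshold of 0.1; both are illustrative choices.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Term-count vector of a sentence (assumption: raw counts, no weighting)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def degree_centrality(sentences, threshold=0.1):
    """Number of other sentences whose similarity exceeds the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    degrees = [0] * len(vectors)
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                degrees[i] += 1
                degrees[j] += 1
    return degrees
```

A higher threshold produces a sparser graph, so the choice of threshold directly changes which sentences look central.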

Degree Centrality  Disadvantage in Degree Centrality approach

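LexRank
• The centrality vector p gives the LexRank of each sentence (similar to PageRank). Following Erkan and Radev, it is defined by $p = B^{T} p$, where B is the similarity graph's adjacency matrix with each row normalized to sum to 1.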

What Should B Satisfy?
• Stochastic matrix and Markov chain property
• Irreducible
• Aperiodic

Perron-Frobenius Theorem
• An irreducible and aperiodic Markov chain is guaranteed to converge to a stationary distribution

Reducibility

Aperiodicity

LexRank
• B is a stochastic matrix
• Is it irreducible and aperiodic?
• Damping (Page et al. 1998)

Matrix Form of p for Damping
• Solve for p using the power method
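A minimal sketch of the power method with damping. It assumes the damped form published by Erkan and Radev, $p = [dU + (1-d)B]^{T} p$, where U is the matrix with every entry 1/N and B is the row-normalized adjacency matrix. The default damping value, tolerance and the handling of rows with no edges are illustrative assumptions.

```python
def lexrank(adjacency, damping=0.15, tol=1e-8, max_iter=1000):
    """Stationary vector of the damped random walk over the sentence graph."""
    n = len(adjacency)
    # Row-normalize the adjacency matrix to obtain the stochastic matrix B.
    B = []
    for row in adjacency:
        total = sum(row)
        # Assumption: a sentence with no edges jumps uniformly at random.
        B.append([v / total for v in row] if total else [1.0 / n] * n)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_p = [
            damping / n + (1 - damping) * sum(B[v][u] * p[v] for v in range(n))
            for u in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p
```

Damping makes the chain irreducible and aperiodic, which is what guarantees convergence. Passing cosine weights instead of 0/1 entries in `adjacency` gives a weighted variant of the same walk.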

Continuous LexRank

Linguistic/Semantic Methods
• Co-reference/Lexical Chains
• Rhetorical Analysis

Co-reference/Lexical Chains
• Assumption/observation: important parts of a text are more closely related under a semantic interpretation
• Co-reference/lexical chains (object-action, part-of relation, semantically related)
• Important sentences are traversed by a larger number of such chains

Co-reference/Lexical Chains
• Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
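The idea that important sentences are traversed by more chains can be sketched on the Mr. Kenny example. The chains below are written by hand purely for illustration; a real system would derive them from WordNet relations and co-reference resolution.

```python
import re

SENTENCES = [
    "Mr. Kenny is the person that invented the anesthetic machine which uses "
    "micro-computers to control the rate at which an anesthetic is pumped into the blood.",
    "Such machines are nothing new.",
    "But his device uses two micro-computers to achieve much closer monitoring "
    "of the pump feeding the anesthetic into the patient.",
]

# Hand-built chains (hypothetical): each set groups related or co-referring words.
CHAINS = [
    {"machine", "machines", "device"},
    {"micro-computers"},
    {"anesthetic"},
    {"pumped", "pump"},
    {"kenny", "person", "his"},
]


def chain_scores(sentences, chains):
    """Count how many chains have at least one member in each sentence."""
    scores = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z-]+", sentence.lower()))
        scores.append(sum(1 for chain in chains if words & chain))
    return scores
```

With these chains the first and third sentences are each traversed by all five chains, while the middle sentence touches only one.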

Rhetorical Structure Theory
• Mann & Thompson 88
• Rhetorical relation
  - Holds between two non-overlapping text snippets
  - Nucleus: the core idea, the writer's purpose
  - Satellite: refers to the nucleus in context, for justifying, evidencing, contradicting, etc.

Rhetorical Structure Theory
• The nucleus of a rhetorical relation is comprehensible independent of the satellite, but not vice versa
• Not all rhetorical relations are nucleus-satellite relations; Contrast, for example, is a multinuclear relation
• Example (Evidence): [The truth is that the pressure to smoke in 'junior high' is greater than it will be any other time of one's life:] [we know that 3,000 teens start smoking each day.]

Rhetorical Structure Theory
• Rhetorical parsing
  - Breaks the text into elementary units
  - Uses cue phrases (discourse markers) and a notion of semantic similarity to hypothesize rhetorical relations
• Rhetorical relations can be assembled into rhetorical structure trees (RS-trees) by recursively applying individual relations across the whole text

[Figure: RS-tree over text units (1)-(10), with relation labels Elaboration, Example, Background, Justification, Concession, Antithesis, Contrast, Evidence and Cause]
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions.
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to degrees C near the poles.
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure.
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) most Martian weather involves blowing dust and carbon dioxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.

RST Based Summarization
• Multiple RS-trees
• A built RS-tree captures relations in the text and can be used for high-quality summarization
• Picking the K nodes nearest to the root
• Disadvantages
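Picking the K units nearest to the root can be sketched as a breadth-first walk over an RS-tree. The `RSNode` structure is hypothetical, and the sketch assumes that each node carries the ids of the text units promoted to it (its nuclei), which is one common way of reading "nearest to the root".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RSNode:
    """Hypothetical RS-tree node: a relation plus the unit ids promoted to it."""
    relation: str
    promoted: list
    children: list = field(default_factory=list)


def top_k_units(root, k):
    """Collect the first k distinct promoted units in breadth-first order."""
    selected = []
    queue = deque([root])
    while queue and len(selected) < k:
        node = queue.popleft()
        for unit in node.promoted:
            if unit not in selected and len(selected) < k:
                selected.append(unit)
        queue.extend(node.children)
    return selected
```

Units promoted high in the tree are reached first, so a small K yields only the most nuclear units of the text.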

WordNet Based Approach for Summarization
• Preprocessing of text
• Constructing sub-graph from WordNet
• Synset Ranking
• Sentence Selection
• Principal Component Analysis

Preprocessing
• Break the text into sentences
• Apply POS tagging
• Identify collocations in the text
• Remove the stop words
• The order of these steps is important

Constructing sub-graph from WordNet
• Mark all the words and collocations in the WordNet graph which are present in the text
• Traverse the generalization edges up to a fixed depth, and mark the synsets you visit
• Construct a graph containing only the marked synsets
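The marking and traversal steps can be sketched on a toy generalization graph. The `hypernyms` dictionary used in the usage note is a made-up stand-in for WordNet, not real WordNet data, and words are used directly as node names instead of synsets.

```python
def build_subgraph(words, hypernyms, max_depth):
    """Mark text words, climb generalization edges up to max_depth, keep marked nodes."""
    all_nodes = set(hypernyms) | {p for parents in hypernyms.values() for p in parents}
    marked = {w for w in words if w in all_nodes}  # words present in the graph
    frontier = set(marked)
    for _ in range(max_depth):
        # Move one generalization step up from the current frontier.
        frontier = {p for node in frontier for p in hypernyms.get(node, [])} - marked
        marked |= frontier
    # Keep only edges whose endpoints are both marked.
    edges = {
        (child, parent)
        for child in marked
        for parent in hypernyms.get(child, [])
        if parent in marked
    }
    return marked, edges
```

For example, with `hypernyms = {"dog": ["canine"], "canine": ["carnivore"], "carnivore": ["mammal"]}` and `max_depth=2`, the text word "dog" pulls in "canine" and "carnivore" but not "mammal".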

Synset Ranking
• Rank synsets based on their relevance to the text
• Construct a rank vector R with one entry per node of the graph, each initialized to 1/√n, where n is the number of nodes in the graph
• Create an authority matrix A, with A(i,j) = 1/num_of_predecessors(j) if j is a child of i
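A minimal sketch of the initialization described above. It assumes that a "predecessor" of j is any node that has j as a child in the sub-graph, and that entries of A not covered by the rule are 0; the function and argument names are illustrative.

```python
import math


def init_rank_and_authority(nodes, children):
    """Return the initial rank vector R and the authority matrix A.

    nodes: list of node ids; children: dict mapping a node to its child nodes.
    """
    n = len(nodes)
    index = {node: i for i, node in enumerate(nodes)}
    rank = [1.0 / math.sqrt(n)] * n  # every entry starts at 1/sqrt(n)

    # Count predecessors (assumed here to mean parents) of every node.
    pred_count = {node: 0 for node in nodes}
    for parent in nodes:
        for child in children.get(parent, []):
            pred_count[child] += 1

    # A(i, j) = 1 / num_of_predecessors(j) if j is a child of i, else 0.
    A = [[0.0] * n for _ in range(n)]
    for parent in nodes:
        for child in children.get(parent, []):
            A[index[parent]][index[child]] = 1.0 / pred_count[child]
    return rank, A
```

Initializing every entry to 1/√n gives R unit Euclidean length, which is convenient if R is re-normalized after each update.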

Synset Ranking
• Update the R vector iteratively
• A higher value implies better rank and higher relevance

Sentence Selection
• Construct a matrix M with m rows and n columns
• m is the number of sentences and n is the number of nodes
• For each sentence S_i
  - Traverse graph G, starting with the words present in S_i and following generalization edges
  - Find the set of reachable synsets, SY_i
  - For each sy_ij ∈ SY_i, set M[S_i][sy_ij] to the rank of sy_ij calculated in the previous step
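The construction of M can be sketched as below. The graph is again a toy `hypernyms` dictionary with words standing in for synsets, and `rank` is assumed to be a dictionary of already computed synset ranks; all names are illustrative.

```python
def sentence_synset_matrix(sentence_words, hypernyms, synsets, rank):
    """Build M: one row per sentence, one column per synset in the sub-graph."""
    col = {s: j for j, s in enumerate(synsets)}
    M = [[0.0] * len(synsets) for _ in sentence_words]
    for i, words in enumerate(sentence_words):
        # Start from the sentence's words that are nodes of the sub-graph.
        stack = [w for w in words if w in col]
        reachable = set()
        while stack:
            node = stack.pop()
            if node in reachable:
                continue
            reachable.add(node)
            # Follow generalization edges, staying inside the sub-graph.
            stack.extend(p for p in hypernyms.get(node, []) if p in col)
        for synset in reachable:
            M[i][col[synset]] = rank[synset]
    return M
```

Each row is therefore a sentence described by the ranks of the concepts it reaches, and entries for unreachable synsets stay 0.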

Principal Component Analysis
• Apply PCA on matrix M to get the set of principal components (eigenvectors)
• The eigenvalue of each eigenvector is a measure of the relevance of that eigenvector to the meaning
• Sort the eigenvectors according to their eigenvalues
• For each eigenvector, find its projection on each sentence

Principal Component Analysis
• Select the top n_numselect sentences for each eigenvector
• n_numselect is proportional to the eigenvalue of the eigenvector
• $n_{numselect} = \lambda_i / \sum_j \lambda_j$, where $\lambda_i$ is the eigenvalue corresponding to eigenvector i
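A sketch of the selection step, assuming the eigenvalues and eigenvectors of M have already been computed elsewhere. Scaling the share λ_i / Σ_j λ_j by a target summary length `total`, rounding it to a whole number, using the absolute value of the projection, and skipping already chosen sentences are all assumptions made for illustration.

```python
def select_sentences(M, eigenvalues, eigenvectors, total):
    """Pick sentences per eigenvector, with quotas proportional to eigenvalues."""
    lam_sum = sum(eigenvalues)
    # Visit eigenvectors from largest to smallest eigenvalue.
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    chosen = []
    for i in order:
        quota = round(total * eigenvalues[i] / lam_sum)  # share of the summary
        # Projection of each sentence row onto eigenvector i (sign ignored).
        projection = [
            abs(sum(m * v for m, v in zip(row, eigenvectors[i]))) for row in M
        ]
        ranked = sorted(range(len(M)), key=lambda s: projection[s], reverse=True)
        picked = 0
        for s in ranked:
            if picked >= quota:
                break
            if s not in chosen:
                chosen.append(s)
                picked += 1
    return sorted(chosen)  # sentence indices in document order
```

Returning the indices in document order keeps the extracted sentences in their original sequence.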

Document Understanding Conference (DUC)
• Text Analysis Conference (TAC)
• Interest and activity aimed at building powerful multi-purpose information systems
• Evaluation results of various summarization techniques
• www-nlpir.nist.gov/projects/duc/data.html

Human Summary of Our Presentation :)
• What is Text Summarization?
• Why Text Summarization?
• Methods for Summarization
  - LexRank
  - Lexical Chains
  - Rhetorical Structure Theory
  - WordNet Based

Challenges ahead
• Ensuring text coherence
• Sentences may have dangling anaphors
• Summarizing non-textual data
• Handling multiple sources effectively
• High reduction rates are needed
• Achieving human-quality summarization!

References
• Erkan, G. and D. R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, Vol. 22, 457–479.
• Barzilay, R. and M. Elhadad. Using Lexical Chains for Text Summarization. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, 10–17. Madrid, Spain.
• Mann, W. C. and S. A. Thompson. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8(3), 243–281. Also available as USC/Information Sciences Institute Research Report RR

References
• Baldwin, B. and T. Morton. Coreference-Based Summarization. In T. Firmin Hand and B. Sundheim (eds). TIPSTER-SUMMAC Summarization Evaluation. Proceedings of the TIPSTER Text Phase III Workshop. Washington.
• Marcu, D. Improving Summarization Through Rhetorical Parsing Tuning. Proceedings of the Workshop on Very Large Corpora. Montreal, Canada.
• Ramakrishnan and Bhattacharya. Text representation with WordNet synsets. Eighth International Conference on Applications of Natural Language to Information Systems (NLDB2003).

References
• Bellare, Anish S., Atish S., Loiwal, Bhattacharya, Mehta, Ramakrishnan. Generic Text Summarization using WordNet.
• Inderjeet Mani and Mark T. Maybury (eds). Advances in Automatic Text Summarization. MIT Press, ISBN

Thank You