Text Summarization
Jagadish M (07305050), Annervaz K M (07305063), Joshi Prasad (07305047), Ajesh Kumar S (07305065), Shalini Gupta (07305R02)


1 Text Summarization
Jagadish M (07305050), Annervaz K M (07305063), Joshi Prasad (07305047), Ajesh Kumar S (07305065), Shalini Gupta (07305R02)

2 Introduction
• Summary: a brief but accurate representation of the contents of a document.
• Goal: take an information source, extract the most important content from it, and present it to the user in a condensed form and in a manner sensitive to the user's needs.
• Compression: the ratio of the length of the summary to the length of the source, i.e. how much text is presented.

3 MSWord AutoSummarize

4

5 Presentation Outline
• Motivation
• Different Genres
• Simple Statistical Techniques
• Degree Centrality
• LexRank
• Lexical/Co-reference Chains
• Rhetorical Structure Theory
• WordNet Based Methods
• DUC/TAC

6 Motivation
• Abstracts for scientific and other articles
• News summarization (mostly multi-document summarization)
• Classification of articles and other written data
• Web pages for search engines
• Web access from PDAs and cell phones
• Question answering and data gathering

7 Genres
• Indicative vs. informative: used for quick categorization vs. content processing.
• Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
• Generic vs. query-oriented: provides the author's view vs. reflects the user's interest.
• Background vs. just-the-news: assumes the reader's prior knowledge is poor vs. up-to-date.
• Single-document vs. multi-document source: based on one text vs. fuses together many texts.

8 Statistical scoring
• Scoring techniques
    • Word frequencies throughout the text (Luhn58)
    • Position in the text (Edmundson69)
    • Title method (Edmundson69)
    • Cue phrases in sentences (Edmundson69)

9 Luhn58
• Important words occur fairly frequently
• Earliest work in the field
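
The word-frequency idea can be illustrated with a minimal sketch. This is not Luhn's original procedure (which also considers how significant words cluster inside a sentence); it only scores each sentence by the average document frequency of its non-stop words. The stop-word list and the function names `frequency_scores` and `summarize` are assumptions made for this illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "it", "on"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def frequency_scores(text):
    """Score each sentence by the mean document frequency of its words."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        score = sum(freq[w] for w in words) / len(words) if words else 0.0
        scores.append((s, score))
    return scores

def summarize(text, k=1):
    """Return the k best-scoring sentences in their original order."""
    ranked = sorted(frequency_scores(text), key=lambda pair: pair[1], reverse=True)
    top = {s for s, _ in ranked[:k]}
    return [s for s in split_sentences(text) if s in top]
```

Averaging rather than summing keeps long sentences from winning on length alone; that choice is also an assumption of this sketch.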

10 Statistical Approaches (contd.)
• Degree Centrality
• LexRank
• Continuous LexRank

11 Degree Centrality
• Problem formulation
    • Represent each sentence by a vector
    • Treat each sentence as a node of a graph
    • Cosine similarity determines the edges between nodes

12 Degree Centrality
• Since we are interested only in significant similarities, we can eliminate low values in the similarity matrix by defining a threshold.

13 Degree Centrality
• Compute the degree of each sentence
• Pick the nodes (sentences) with high degrees
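
The steps on slides 11 to 13 can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are represented by raw term counts (LexRank itself uses tf-idf weights), the threshold value 0.3 is arbitrary, and self-similarity is not counted toward the degree.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(sentences):
    """Pairwise cosine similarities of whitespace-tokenized sentences."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    return [[cosine(u, v) for v in vectors] for u in vectors]

def degree_centrality(sentences, threshold=0.3):
    """Degree of each sentence: number of other sentences at least `threshold` similar."""
    sim = similarity_matrix(sentences)
    n = len(sentences)
    return [
        sum(1 for j in range(n) if j != i and sim[i][j] >= threshold)
        for i in range(n)
    ]
```

Sentences with the highest degree are the candidates for the summary.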

14 Degree Centrality
• Disadvantage of the degree centrality approach

15 LexRank
• The centrality vector p gives the LexRank of each sentence (similar to PageRank). It is defined by $p = B^{T} p$, where B is the sentence adjacency matrix with each row normalized to sum to 1.

16 What Should B Satisfy?
• Stochastic matrix and Markov chain property
• Irreducible
• Aperiodic

17 Perron-Frobenius Theorem
• An irreducible and aperiodic Markov chain is guaranteed to converge to a stationary distribution

18 Reducibility

19 Aperiodicity

20 LexRank
• B is a stochastic matrix
• Is it an irreducible and aperiodic matrix?
• Damping (Page et al. 1998)

21 Matrix Form of p for Damping
• Solve for p using the power method
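
A minimal power-method sketch for the damped case, assuming the formulation in Erkan and Radev (2004): $p = [dU + (1 - d)B]^{T} p$, where U is the N x N matrix with every entry 1/N and B is the row-normalized adjacency matrix. The damping value d = 0.15, the tolerance, and the function names are assumptions of this illustration.

```python
def row_normalize(adj):
    """Make each row sum to 1; an all-zero row becomes uniform so B stays stochastic."""
    normalized = []
    for row in adj:
        total = sum(row)
        n = len(row)
        normalized.append([x / total for x in row] if total else [1.0 / n] * n)
    return normalized

def lexrank(adj, d=0.15, tol=1e-8, max_iter=1000):
    """Power iteration for p = [dU + (1 - d)B]^T p, starting from the uniform vector."""
    n = len(adj)
    B = row_normalize(adj)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # Component u receives d/n from the uniform jump plus the damped
        # probability mass flowing in from every sentence v along B[v][u].
        new_p = [
            d / n + (1 - d) * sum(B[v][u] * p[v] for v in range(n))
            for u in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p
```

Because dU makes every entry of the transition matrix positive, the chain is irreducible and aperiodic, so the iteration converges. Passing the raw cosine weights instead of a thresholded 0/1 matrix as `adj` gives the variant the paper calls continuous LexRank.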

22 Continuous LexRank

23 Linguistic/Semantic Methods
• Co-reference/Lexical Chains
• Rhetorical Analysis

24 Co-reference/Lexical Chains
• Assumption/observation: the important parts of a text are more closely related to each other under a semantic interpretation
• Co-reference/lexical chains (object-action, part-of relation, semantically related)
• Important sentences are traversed by a larger number of such chains

25 Co-reference/Lexical Chains
• Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
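
To make the "traversed by more chains" criterion concrete, the sketch below counts how many chains touch each sentence of the example. The chains are written by hand for this illustration (a real system would derive them from WordNet relations or a co-reference resolver), and the chain names and `chain_scores` are hypothetical.

```python
import re

def tokens(sentence):
    """Lower-case word tokens; hyphenated words are kept whole."""
    return set(re.findall(r"[a-z][a-z\-]*", sentence.lower()))

def chain_scores(sentences, chains):
    """For each sentence, count the chains that have at least one member in it."""
    scores = []
    for sentence in sentences:
        words = tokens(sentence)
        scores.append(sum(1 for members in chains.values() if words & members))
    return scores

sentences = [
    "Mr. Kenny is the person that invented the anesthetic machine which uses "
    "micro-computers to control the rate at which an anesthetic is pumped into the blood.",
    "Such machines are nothing new.",
    "But his device uses two micro-computers to achieve much closer monitoring "
    "of the pump feeding the anesthetic into the patient.",
]

# Hand-built chains (assumption): each maps a label to the word forms it links.
chains = {
    "machine": {"machine", "machines", "device", "pump"},
    "person": {"kenny", "person", "his"},
    "computer": {"micro-computers"},
    "drug": {"anesthetic"},
}

scores = chain_scores(sentences, chains)
```

Under these hand-built chains the first and third sentences are each crossed by all four chains, while the second is crossed by only one, so it is the natural candidate to drop.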

26 Rhetorical Structure Theory
• Mann & Thompson 88
• Rhetorical relation
    • Holds between two non-overlapping text snippets
    • Nucleus: the core idea, the writer's purpose
    • Satellite: referred to in the context of the nucleus, for justifying, evidencing, contradicting, etc.

27 Rhetorical Structure Theory
• The nucleus of a rhetorical relation is comprehensible independently of the satellite, but not vice versa
• Not all rhetorical relations are nucleus-satellite relations; Contrast, for example, is a multinuclear relation
• Example (evidence): [The truth is that the pressure to smoke in 'junior high' is greater than it will be any other time of one's life:] [we know that 3,000 teens start smoking each day.]

28 Rhetorical Structure Theory
• Rhetorical parsing
    • Breaks the text into elementary units
    • Uses cue phrases (discourse markers) and a notion of semantic similarity to hypothesize rhetorical relations
• Rhetorical relations can be assembled into rhetorical structure trees (RS-trees) by recursively applying individual relations across the whole text

29 [Figure: RS-tree for the example text below. Relation labels shown in the tree: Background, Justification, Elaboration, Example, Concession, Antithesis, Contrast, Evidence, Cause.]
(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,
(2) Mars experiences frigid weather conditions.
(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles.
(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,
(5) but any liquid water formed in this way would evaporate almost instantly
(6) because of the low atmospheric pressure.
(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,
(8) most Martian weather involves blowing dust and carbon monoxide.
(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.
(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.

30 RST Based Summarization
• Multiple RS-trees
• A built RS-tree captures the relations in the text and can be used for high-quality summarization
• Pick the K nodes nearest to the root
• Disadvantages
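
One way to read "pick the K nodes nearest to the root" is sketched below. It assumes nucleus promotion in the style of Marcu (1998): a text unit counts as sitting at the shallowest tree level whose node promotes it, and an internal node promotes whatever its nuclear children promote. The `Node` class and the function names are hypothetical, and the tree is assumed to be built already.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    unit: Optional[int] = None   # leaf only: id of the elementary text unit
    nuclear: bool = True         # role of this node relative to its parent
    children: List["Node"] = field(default_factory=list)

def promotion(node):
    """Units promoted to this node: a leaf promotes itself, an inner node its nuclei's units."""
    if node.unit is not None:
        return {node.unit}
    promoted = set()
    for child in node.children:
        if child.nuclear:
            promoted |= promotion(child)
    return promoted

def unit_depths(root):
    """Shallowest depth at which each unit appears in some node's promotion set."""
    depths = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        for unit in promotion(node):
            depths[unit] = min(depths.get(unit, depth), depth)
        stack.extend((child, depth + 1) for child in node.children)
    return depths

def top_k_units(root, k):
    """The k units closest to the root, returned in text order."""
    depths = unit_depths(root)
    chosen = sorted(depths, key=lambda u: (depths[u], u))[:k]
    return sorted(chosen)
```

Satellites stay deep in the tree, so they are dropped first as K shrinks, which matches the observation that a nucleus is comprehensible without its satellite.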

31 WordNet Based Approach for Summarization
• Preprocessing of text
• Constructing a sub-graph from WordNet
• Synset ranking
• Sentence selection
• Principal Component Analysis

32 Preprocessing
• Break the text into sentences
• Apply POS tagging
• Identify collocations in the text
• Remove the stop words
• The order of these steps is important

33 Constructing sub-graph from WordNet
• Mark all the words and collocations in the WordNet graph that are present in the text
• Traverse the generalization edges up to a fixed depth, and mark the synsets you visit
• Construct a graph containing only the marked synsets
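
The marking and traversal steps can be sketched on a toy graph. The `hypernyms` dictionary below stands in for WordNet's generalization edges and is invented for this illustration; a real implementation would query WordNet and work with synsets rather than plain strings. `build_subgraph` and `max_depth` are hypothetical names.

```python
def build_subgraph(words, hypernyms, max_depth=2):
    """Mark text words found in the graph, climb generalization edges up to
    max_depth levels, and return the marked nodes plus (parent, child) edges."""
    marked, edges = set(), set()
    frontier = [(w, 0) for w in words if w in hypernyms]
    while frontier:
        node, depth = frontier.pop()
        marked.add(node)
        if depth == max_depth:
            continue  # fixed depth reached: do not generalize further
        for parent in hypernyms.get(node, []):
            edges.add((parent, node))
            frontier.append((parent, depth + 1))
    return marked, edges

# Toy stand-in for WordNet generalization (hypernym) edges: child -> parents.
hypernyms = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
}

marked, edges = build_subgraph(["dog", "cat", "run"], hypernyms, max_depth=2)
```

With a depth limit of 2, "animal" is never reached and "run" is ignored because it is not in the toy graph.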

34 Synset Ranking
• Rank synsets based on their relevance to the text
• Construct a rank vector R with one entry per node of the graph, each initialized to $1/\sqrt{n}$, where n is the number of nodes in the graph
• Create an authority matrix A with $A(i,j) = 1/\text{num\_of\_predecessors}(j)$ if j is a child of i
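
The initialization described above can be sketched as follows. The sketch assumes that "predecessors of j" means the parents of j in the sub-graph and that A is 0 wherever j is not a child of i; neither is stated on the slide. It covers only the setup of R and A. The function names are hypothetical.

```python
import math

def initial_rank(nodes):
    """Rank vector with every entry 1/sqrt(n), so the vector has unit length."""
    n = len(nodes)
    return {v: 1.0 / math.sqrt(n) for v in nodes}

def authority_matrix(nodes, edges):
    """A[i][j] = 1/num_predecessors(j) if j is a child of i, else 0.
    `edges` holds (parent, child) pairs; predecessors are taken to be parents (assumption)."""
    predecessors = {v: 0 for v in nodes}
    for _, child in edges:
        predecessors[child] += 1
    A = {i: {j: 0.0 for j in nodes} for i in nodes}
    for parent, child in edges:
        A[parent][child] = 1.0 / predecessors[child]
    return A
```

Dividing by the number of predecessors splits each child's contribution evenly among its parents.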

35 Synset Ranking
• Update the R vector iteratively
• A higher value implies a better rank and higher relevance

36 Sentence Selection
• Construct a matrix M with m rows and n columns
• m is the number of sentences and n is the number of nodes
• For each sentence $S_i$
    • Traverse graph G, starting with the words present in $S_i$ and following generalization edges
    • Find the set of reachable synsets, $SY_i$
    • For each $sy_{ij} \in SY_i$, set $M[S_i][sy_{ij}]$ to the rank of $sy_{ij}$ calculated in the previous step
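
A minimal sketch of building M, using a toy generalization graph and made-up rank values in place of real WordNet data and the ranks from the previous step. The names `reachable` and `sentence_matrix`, and whitespace tokenization, are assumptions of this illustration.

```python
def reachable(words, hypernyms):
    """All nodes reachable from the given words by following generalization edges."""
    seen = set()
    stack = [w for w in words if w in hypernyms]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(hypernyms.get(node, []))
    return seen

def sentence_matrix(sentences, nodes, hypernyms, rank):
    """M[i][j] = rank of node j if it is reachable from sentence i, else 0."""
    column = {v: k for k, v in enumerate(nodes)}
    M = [[0.0] * len(nodes) for _ in sentences]
    for i, sentence in enumerate(sentences):
        for synset in reachable(sentence.lower().split(), hypernyms):
            if synset in column:
                M[i][column[synset]] = rank[synset]
    return M

# Toy data (invented for illustration).
hypernyms = {"dog": ["animal"], "cat": ["animal"], "animal": []}
nodes = ["dog", "cat", "animal"]
rank = {"dog": 0.2, "cat": 0.3, "animal": 0.5}
M = sentence_matrix(["the dog barks", "the cat sleeps"], nodes, hypernyms, rank)
```

Both toy sentences share the "animal" column, which is how sentences about related concepts end up with overlapping rows.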

37 Principal Component Analysis
• Apply PCA to matrix M to get the set of principal components (eigenvectors)
• The eigenvalue of each eigenvector is a measure of the relevance of that eigenvector to the meaning
• Sort the eigenvectors according to their eigenvalues
• For each eigenvector, find its projection on each sentence

38 Principal Component Analysis
• Select the top $n_{numselect}$ sentences for each eigenvector
• $n_{numselect}$ is proportional to the eigenvalue of the eigenvector
• $n_{numselect} = \lambda_i / \sum_j \lambda_j$, where $\lambda_i$ is the eigenvalue corresponding to eigenvector i
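
The selection rule can be sketched as below, taking the eigenvalues and the per-sentence projections as given (computing them, for example with a linear algebra library, is outside this sketch). Scaling the ratio by a target summary size, rounding to whole sentences, and ranking by absolute projection are assumptions of this illustration, as are the function names.

```python
def selection_quotas(eigenvalues, summary_size):
    """Sentences to take per eigenvector: summary_size * lambda_i / sum_j lambda_j, rounded."""
    total = sum(eigenvalues)
    return [round(summary_size * lam / total) for lam in eigenvalues]

def select_sentences(projections, eigenvalues, summary_size):
    """projections[i][s] is the projection of eigenvector i on sentence s.
    Eigenvectors are visited in decreasing eigenvalue order; each contributes
    its quota of not-yet-chosen sentences with the largest absolute projection."""
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    quotas = selection_quotas(eigenvalues, summary_size)
    chosen = []
    for i in order:
        ranked = sorted(
            range(len(projections[i])),
            key=lambda s: abs(projections[i][s]),
            reverse=True,
        )
        picked = 0
        for s in ranked:
            if picked == quotas[i]:
                break
            if s not in chosen:
                chosen.append(s)
                picked += 1
    return sorted(chosen)
```

Returning the indices in sorted order keeps the selected sentences in their original document order.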

39 Document Understanding Conference (DUC)
• Text Analysis Conference (TAC)
• Interest and activity aimed at building powerful multi-purpose information systems
• Evaluation results of various summarization techniques
• www-nlpir.nist.gov/projects/duc/data.html

40 Human Summary of Our Presentation :)
• What is text summarization?
• Why text summarization?
• Methods of summarization
    • LexRank
    • Lexical Chains
    • Rhetorical Structure Theory
    • WordNet based

41 Challenges ahead
• Ensuring text coherence
• Sentences may have dangling anaphors
• Summarizing non-textual data
• Handling multiple sources effectively
• High reduction rates are needed
• Achieving human-quality summarization

42 References
• Erkan, G. and D. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479.
• Barzilay, R. and M. Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, 10–17. Madrid, Spain.
• Mann, W.C. and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8(3), 243–281. Also available as USC/Information Sciences Institute Research Report RR-87-190.

43 References
• Baldwin, B. and T. Morton. 1998. Coreference-Based Summarization. In T. Firmin Hand and B. Sundheim (eds). TIPSTER-SUMMAC Summarization Evaluation. Proceedings of the TIPSTER Text Phase III Workshop. Washington.
• Marcu, D. 1998. Improving Summarization Through Rhetorical Parsing Tuning. Proceedings of the Workshop on Very Large Corpora. Montreal, Canada.
• Ramakrishnan and Bhattacharya. 2003. Text representation with WordNet synsets. Eighth International Conference on Applications of Natural Language to Information Systems (NLDB2003).

44 References
• Bellare, Anish S., Atish S., Loiwal, Bhattacharya, Mehta, Ramakrishnan. 2004. Generic Text Summarization using WordNet.
• Mani, I. and M.T. Maybury (eds). 1999. Advances in Automatic Text Summarization. MIT Press. ISBN 0-262-13359-8.
• www.wikipedia.com

45 Thank You

