Vinayak Gagrani Neeraj Toshniwal Abhishek Kabra Guide Pushpak Bhattacharya.

1 Vinayak Gagrani, Neeraj Toshniwal, Abhishek Kabra. Guide: Pushpak Bhattacharya

2 Introduction; Single Document Summarization; Multiple Document Summarization; Application; Evaluation; Conclusion

3 What is a Summary?
- A summary is a text produced from one or more texts that conveys the important information in the original texts and is no longer than half of the original texts.
- Three important aspects of a summary:
  - Summaries should be short.
  - Summaries should preserve important information.
  - Summaries may be produced from a single document or from multiple documents.

4 Extraction: the procedure of identifying important sections of the text and reproducing them verbatim.
Abstraction: aims to produce the important material in a new way.
Fusion: combines extracted parts coherently.
Compression: aims at throwing out unimportant sections of the text.

5 Early Works; Machine Learning Methods (Naïve-Bayes Methods, Rich Features and Decision Trees); Deep Natural Language Analysis Methods (Lexical Chaining, Rhetorical Structure Theory (RST))

6 Luhn, 1958: summarization based on measuring the significance of words according to their frequency. A significance factor is derived for each sentence, based on the number of significant words in that sentence.
Edmundson, 1969: word frequency and positional importance were incorporated, together with the presence of cue words and the skeleton of the document.
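The frequency idea behind Luhn's approach can be sketched in a few lines. This is a simplified illustration rather than Luhn's exact procedure: the stopword list, the `min_freq` threshold, and the scoring formula (squared count of significant words divided by sentence length, applied to the whole sentence instead of to bracketed word clusters) are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "a", "an", "of", "is", "are", "in", "on", "and", "to", "it", "by", "was"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def significant_words(sentences, min_freq=2):
    """Words (excluding stopwords) that occur at least min_freq times in the document."""
    counts = Counter(w for s in sentences for w in tokenize(s) if w not in STOPWORDS)
    return {w for w, c in counts.items() if c >= min_freq}

def sentence_score(sentence, significant):
    """Simplified significance factor: (significant word count)^2 / sentence length."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in significant)
    return hits * hits / len(words)

def luhn_rank(sentences, min_freq=2):
    """Return sentence indices ordered from most to least significant."""
    sig = significant_words(sentences, min_freq)
    scored = [(sentence_score(s, sig), i) for i, s in enumerate(sentences)]
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))]

DOC = [
    "The weather was pleasant.",
    "Summarization systems rank sentences.",
    "Frequent words make sentences significant in summarization.",
]
ranking = luhn_rank(DOC)
```

An extractive summary would then keep the top few indices of `ranking` and print those sentences in their original order.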

9 Weighting sentences based on their position arises from the idea that texts generally follow a predictable discourse structure. The yield of each sentence position was calculated against the topic keywords. Sentence positions were then ranked by average yield to produce the Optimal Position Policy for topic positions for the genre. Later, the sentence extraction problem was modeled using decision trees, breaking away from the assumption that features are independent.
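The position-ranking step can be illustrated with a minimal sketch. It assumes a training corpus given as pairs of a sentence list and a set of topic keywords, and it defines the yield of a position as the fraction of keywords the sentence at that position contains; both the data layout and this particular yield definition are assumptions for illustration.

```python
from collections import defaultdict

def position_yield(sentence, keywords):
    """Fraction of the topic keywords that appear in the sentence."""
    words = set(sentence.lower().split())
    return len(words & keywords) / len(keywords) if keywords else 0.0

def optimal_position_policy(corpus):
    """corpus: list of (sentences, keywords) pairs.

    Returns sentence positions (0-based) ranked by average yield, best first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentences, keywords in corpus:
        for pos, sentence in enumerate(sentences):
            totals[pos] += position_yield(sentence, keywords)
            counts[pos] += 1
    average = {pos: totals[pos] / counts[pos] for pos in totals}
    return sorted(average, key=lambda p: (-average[p], p))

# Hypothetical two-document training corpus.
CORPUS = [
    (["markets fell sharply today", "analysts were surprised", "markets may recover"],
     {"markets", "fell"}),
    (["the court ruling was announced", "reporters gathered outside", "the ruling affects the court"],
     {"court", "ruling"}),
]
policy = optimal_position_policy(CORPUS)
```

For this toy corpus the first position has the highest average yield, followed by the last, which matches the intuition that news text tends to state its topic up front.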

10 Techniques aimed at modeling the text's discourse structure, using heuristics to create document extracts.
Lexical chaining:
- is independent of the grammatical structure of the text
- a lexical chain is a list of words that captures a portion of the cohesive structure of the text
- a chain is a sequence of related words in the text, spanning short or long distances
- is a technique used to identify the central theme of a document

11 Ellipsis: words are omitted where the phrase would otherwise need to be repeated. Example: A: Where are you going? B: To town.
Substitution: a word is not omitted but replaced by another. Example: A: Which ice-cream would you like? B: I would like the pink one.

12 Conjunction: a relationship between two clauses, signalled by words such as "and", "then", and "however".
Repetition: mentioning the same word again.
Reference:
- Anaphoric reference: refers to someone or something that has been previously identified.
- Cataphoric reference: forward referencing. Example: Here he comes... It's Brad Pitt.

13 Example: John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.
Steps involved in lexical chaining:
a) Select a set of candidate words.
b) For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chain.
c) If such a chain is found, insert the word into the chain and update the chain accordingly.
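The three steps above can be sketched as a greedy loop. To keep the example self-contained, the relatedness criterion is a small hand-written table (`RELATED_PAIRS`, hypothetical) rather than a real lexical resource, and the candidate words are picked by hand from the mud pie example; a real system would select candidates with a part-of-speech tagger and could start a new chain differently.

```python
# Hypothetical relatedness table standing in for a real lexical resource.
RELATED_PAIRS = {
    frozenset(pair)
    for pair in [("pie", "dessert"), ("pie", "chocolate"), ("dessert", "chocolate")]
}

def related(a, b):
    """Two words are related if identical or listed together in the table."""
    return a == b or frozenset((a, b)) in RELATED_PAIRS

def build_chains(candidates):
    """Greedily place each candidate word in the first chain holding a related word."""
    chains = []
    for word in candidates:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)   # step c: insert and update the chain
                break
        else:
            chains.append([word])    # no suitable chain found: start a new one
    return chains

# Step a: candidate words chosen by hand from the example sentences.
CANDIDATES = ["john", "pie", "dessert", "pie", "chocolate", "john"]
chains = build_chains(CANDIDATES)
```

Here the text yields one chain about John and one about the dessert, and the dessert chain is the longer of the two.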

14 The relatedness measure is WordNet distance. Weights are assigned to chains based on their length and homogeneity. The strength of a lexical chain is determined by taking into consideration the distribution of its elements throughout the text, and it corresponds to the significance of the textual context the chain embodies. This provides a basis for identifying the topical units in a document, which are of great importance in document summarization.
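The slide does not give a formula for chain weight. One formulation from the literature (Barzilay and Elhadad's lexical-chain summarizer, as best recalled here) multiplies chain length by a homogeneity index, where homogeneity is one minus the ratio of distinct members to total occurrences. The sketch below uses that formulation as an assumption, not as the method of this presentation.

```python
def chain_score(chain):
    """Score = length * homogeneity, with homogeneity = 1 - distinct / length.

    `chain` is a list of word occurrences; repeated words count once per occurrence.
    """
    length = len(chain)
    if length == 0:
        return 0.0
    homogeneity = 1 - len(set(chain)) / length
    return length * homogeneity

repetitive = chain_score(["pie", "pie", "pie", "dessert"])
scattered = chain_score(["john", "mary"])
```

Under this scoring, a chain that keeps returning to the same few words scores higher than a chain of the same length whose members are all different.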

15 Rhetorical Structure Theory (RST) relates two non-overlapping text spans: the nucleus and the satellite. The nucleus expresses what is more essential to the writer's purpose than the satellite. Example: a claim followed by evidence for the claim. RST posits an "Evidence" relation between the two spans. The claim is more essential to the text than the particular evidence, so the claim span is the nucleus and the evidence span is the satellite. The nucleus is independent of the satellite, but not vice versa.

17 Need and Encouragement
- Extraction of a single summary from multiple documents started in the mid-1990s.
- Most applications are in news articles:
  - Google News (news.google.com)
  - Columbia Newsblaster (newsblaster.cs.columbia.edu)
  - News in Essence (NewsInEssence.com)
- Multiple sources of information, which are:
  - supplementary to each other
  - overlapping in content
  - even contradictory at times

18 Early work extended template-driven message understanding systems.
Abstractive systems rely heavily on internal NLP tools; they were earlier considered to require knowledge of language interpretation and generation.
Extractive techniques have also been applied, using similarity measures between sentences:
- identify common themes through clustering
- select one sentence to represent each cluster, or generate a composite sentence from each cluster
Summarization differs depending on what the final goal is:
- MEAD: works with extraction techniques on general domains
- SUMMONS: builds a briefing highlighting differences and updates across news reports

19 SUMMONS is the first example of a multi-document summarization system.
- It considers events in a narrow domain: news articles about terrorism.
- It produces a briefing that merges relevant information about each event and its evolution over time.
- It reads a database built by a template-based message understanding system.
- It is a concatenation of two systems: a Content Planner and a Linguistic Generator.

20 Content Planner: selects the information to include in the summary by combining the input templates. It uses summary operators, a set of heuristics that perform operations such as change of perspective, contradiction, and refinement.
Linguistic Generator: selects the right words to express the information in grammatical and coherent text. It uses connective phrases to synthesize the summary, adapting language generation tools such as FUF/SURGE.

21 Themes: sets of similar text units (paragraphs); finding them is a clustering problem.
- Text is mapped to a vector of features, including single words weighted by their TF-IDF scores, nouns, pronouns, and semantic classes of verbs.
- For each pair of paragraphs, a vector is computed that represents matches on the different features.
- Decision rules learnt from data classify each pair as similar or dissimilar.
- An algorithm then places the most related paragraphs in the same theme.
Information Fusion: deciding which sentences of a theme should be included in the final summary.
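Theme construction can be approximated with a small sketch. It keeps only the TF-IDF word feature and swaps the learnt decision rules for a fixed cosine-similarity threshold, so it is a simplification of the approach described above rather than a reimplementation. The smoothed IDF formula, the `threshold` value, and the greedy grouping are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """Map each paragraph to a sparse {word: tf * idf} vector (smoothed IDF)."""
    counts = [Counter(p.lower().split()) for p in paragraphs]
    n = len(counts)
    df = Counter(word for c in counts for word in c)
    idf = {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}
    return [{w: tf * idf[w] for w, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group_into_themes(paragraphs, threshold=0.3):
    """Greedy grouping: a paragraph joins the first theme containing a similar member."""
    vectors = tfidf_vectors(paragraphs)
    themes = []
    for i, vec in enumerate(vectors):
        for theme in themes:
            if any(cosine(vec, vectors[j]) >= threshold for j in theme):
                theme.append(i)
                break
        else:
            themes.append([i])
    return themes

PARAGRAPHS = [
    "bomb exploded near federal building",
    "federal building damaged after bomb exploded",
    "stock prices rose on strong earnings",
]
themes = group_into_themes(PARAGRAPHS)
```

The two paragraphs about the explosion end up in one theme and the financial paragraph in another; each theme would then be passed to the information fusion step.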

22 The algorithm compares and intersects the predicate-argument structures of the phrases within each theme to find which are repeated often enough to be included in the summary.
- Sentences are parsed using Collins' statistical parser and converted into dependency trees, which capture predicate-argument structure and identify functional roles.
- The comparison algorithm traverses the trees recursively, adding identical nodes to the output tree.
- Once full phrases are found, they are marked for inclusion in the summary.
- Once the summary content is decided, grammatical text is generated using the FUF/SURGE language generation system.

23 McVeigh, 27, was charged with the bombing

24 MMR (Maximal Marginal Relevance), introduced by Carbonell and Goldstein, rewards relevant sentences and penalizes redundant ones by considering a linear combination of two similarity measures.
- Q: query or user profile; R: ranked list of documents; S: already selected documents.
- Documents are selected one at a time and added to S.
- For each document $D_i$ in $R \setminus S$: $\mathrm{MR}(D_i) = a \cdot \mathrm{Sim}_1(D_i, Q) - (1 - a) \cdot \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j)$, where $a$ lies in $[0, 1]$.
- The document with the maximum $\mathrm{MR}(D_i)$ is selected, and selection repeats until the maximum number of documents or a threshold is reached.
- $a$ controls the relative importance of relevance versus redundancy.
- $\mathrm{Sim}_1$ and $\mathrm{Sim}_2$ are similarity measures (for example, cosine similarity).
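The MMR selection loop can be sketched as follows. The sketch uses bag-of-words cosine similarity for both similarity measures and stops after `k` items; the function names, the whitespace tokenization, and the stopping rule are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query, candidates, a=0.7, k=2):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))

    def marginal_relevance(i):
        relevance = cosine(candidates[i], query)
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return a * relevance - (1 - a) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal_relevance)
        selected.append(best)
        remaining.remove(best)
    return selected

QUERY = "bombing trial"
CANDIDATES = [
    "bombing trial begins today",
    "bombing trial begins today again",
    "trial jury selected",
]
balanced = mmr_select(QUERY, CANDIDATES, a=0.5, k=2)
relevance_only = mmr_select(QUERY, CANDIDATES, a=1.0, k=2)
```

With `a=0.5` the near-duplicate second candidate is skipped in favour of the less similar third one, while `a=1.0` ignores redundancy and picks the two most query-like candidates.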

25 Content is denoted as entities and relations, represented as the nodes and edges of a graph. Rather than extracting sentences, the approach detects salient regions of the graph.
- Topic driven: the topic is denoted by entry nodes in the graph.
- Graph: each node is a single occurrence of a word.
- Different kinds of links: Adjacency links, Same links, Alpha links, Phrase links, Name links, and Coref links.

26 Topic nodes are identified through stem comparison and marked as entry nodes.
- Spreading activation: the search for semantically related text is propagated from the entry nodes to other nodes of the graph.
- The weight of a neighboring node depends on the node links traveled and is an exponentially decaying function of the distance.
- For a pair of document graphs: identify common nodes and difference nodes.
- Highlight sentences having higher common and difference scores.
- The user is able to specify the maximal number to control the output.
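Spreading activation with exponential decay can be sketched as a breadth-first traversal. The sketch treats all link types alike and assigns each reached node the weight `decay ** distance`; the adjacency-dict graph format, the `decay` value, and the `max_depth` cutoff are assumptions, and the original system weights different link types differently.

```python
from collections import deque

def spread_activation(graph, entry_nodes, decay=0.5, max_depth=3):
    """Propagate activation from entry nodes; weight decays exponentially with distance.

    graph: dict mapping each node to a list of neighboring nodes.
    Returns {node: weight} for every node reached within max_depth links.
    """
    weights = {node: 1.0 for node in entry_nodes}
    queue = deque((node, 0) for node in entry_nodes)
    while queue:
        node, distance = queue.popleft()
        if distance == max_depth:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in weights:  # breadth-first order gives the shortest distance
                weights[neighbor] = decay ** (distance + 1)
                queue.append((neighbor, distance + 1))
    return weights

# Hypothetical word graph for one document.
GRAPH = {
    "bombing": ["oklahoma", "trial"],
    "oklahoma": ["bombing", "city"],
    "trial": ["bombing", "jury"],
    "city": ["oklahoma"],
    "jury": ["trial"],
    "weather": [],
}
activation = spread_activation(GRAPH, ["bombing"])
```

Nodes that are never reached, such as `weather` here, receive no weight, so sentences made up of such words would not be highlighted.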

28 Second Stage: identify the sentences that are central to the topic of the entire cluster.
- Two metrics similar to MMR (but not query dependent) are defined by Radev et al., 2000:
  - Cluster-based relative utility (CBRU): how relevant a particular sentence is to the general topic of the cluster.
  - Cross-sentence informational subsumption (CSIS): a measure of redundancy among sentences.
- Given a cluster segmented into n sentences and a compression rate R, nR sentences are selected in order of appearance in the chronologically arranged documents.
- The final score for each sentence is the sum of three scores, minus a redundancy penalty (Rs) for a sentence that overlaps a highly ranked sentence:
  - Centroid value (Ci): the sum of the centroid values of all the words in the sentence.
  - Positional value (Pi): makes leading sentences more important.
  - First-sentence overlap (Fi): the inner product of the word occurrence vector of sentence i and that of the first sentence of the document.
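The scoring rule can be illustrated with a short sketch. The centroid is passed in as a ready-made word-to-value dictionary, the positional value is taken as `(n - i) / n` for the sentence at 0-based index `i`, the three scores are added with equal weight, and redundancy penalties are supplied by the caller; all of these choices are assumptions, since the slide does not spell out the formulas.

```python
from collections import Counter

def centroid_value(words, centroid):
    """Ci: sum of the centroid values of all the words in the sentence."""
    return sum(centroid.get(w, 0.0) for w in words)

def positional_value(index, n):
    """Pi: decreases linearly so that leading sentences score higher (assumed form)."""
    return (n - index) / n

def first_sentence_overlap(words, first_words):
    """Fi: inner product of the word occurrence vectors of the two sentences."""
    a, b = Counter(words), Counter(first_words)
    return sum(a[w] * b[w] for w in a)

def sentence_scores(sentences, centroid, penalties=None):
    """Final score per sentence: Ci + Pi + Fi minus its redundancy penalty."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    penalties = penalties or [0.0] * n
    return [
        centroid_value(words, centroid)
        + positional_value(i, n)
        + first_sentence_overlap(words, tokenized[0])
        - penalties[i]
        for i, words in enumerate(tokenized)
    ]

SENTENCES = ["bombing suspect charged", "suspect appears in court", "weather was mild"]
CENTROID = {"bombing": 2.0, "suspect": 1.5, "court": 1.0}  # hypothetical centroid values
scores = sentence_scores(SENTENCES, CENTROID)
penalized = sentence_scores(SENTENCES, CENTROID, penalties=[0.0, 1.0, 0.0])
```

Sorting sentences by these scores and keeping the top nR of them, then printing them in document order, gives the extract described above.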

29 Google News: a news aggregator that selects the most up-to-date information (within the past 30 days) from thousands of publications using an automatic aggregation algorithm. Different versions are available for more than 60 regions in 28 languages.
Ultimate Research Assistant: performs text mining on Internet search results to make it easier for the user to perform online research by organizing the output. The user types the name of a topic, and it searches the web for highly relevant resources and organizes the search results.

30 Shablast: a universal search engine that produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords.
iResearch Reporter: a commercial text extraction and text summarization system that produces categorized, easily readable natural-language summary reports covering multiple documents retrieved by entering a user query into the Google search engine.

32 Evaluation is a difficult task.
- The absence of a standard human or automatic evaluation metric makes it difficult to compare different systems and establish a baseline.
- Manual evaluation is not feasible.
- There is a need for an evaluation metric that correlates highly with human scores.
Human and automatic evaluation:
- Automatically generated summaries are compared with manually written "ideal" summaries.
- The text is decomposed into sentences.
- A rating between 1 and 4 is given to each system unit (SU) that shares content with a model unit (MU) corresponding to the ideal summaries.

33 ROUGE
- Is based only on content overlap.
- Can determine whether the same general concepts are discussed in an automatic summary and a reference summary.
- Cannot determine whether the result is coherent or whether the sentences flow together in a sensible manner.
- Works better for single-document summarization.
Information-theoretic Evaluation of Summaries
- The central idea is to use a divergence measure between a pair of probability distributions.
- The first distribution is derived from the automatic summary, the second from a set of reference summaries.
- Suits both single-document and multi-document summarization scenarios.
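Both evaluation ideas can be sketched briefly. The first function computes an n-gram recall in the spirit of ROUGE-N against a single reference. The second computes the Jensen-Shannon divergence between the unigram distributions of two texts, which is one possible choice of divergence measure. Whitespace tokenization, the single-reference setup, and the choice of Jensen-Shannon are assumptions; this is not the official ROUGE toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def js_divergence(text_a, text_b):
    """Jensen-Shannon divergence (base 2) between the unigram distributions of two texts."""
    count_a, count_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both texts must contain at least one token")
    divergence = 0.0
    for word in set(count_a) | set(count_b):
        p, q = count_a[word] / total_a, count_b[word] / total_b
        m = (p + q) / 2
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence

unigram_recall = rouge_n_recall("the cat sat", "the cat sat on the mat", n=1)
bigram_recall = rouge_n_recall("the cat sat", "the cat sat on the mat", n=2)
```

A higher recall means more of the reference content is covered, while a lower divergence means the word distribution of the automatic summary is closer to that of the reference; neither score says anything about coherence.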

34 Efficient and accurate summarization systems need to be developed because of the enormous rate of information growth.
A lot of research is still going on in this field, especially on evaluation techniques.
Multi-document summarization is in wider use than single-document summarization.
Extractive techniques are usually employed rather than abstractive techniques, as they are easy to employ and have produced satisfactory results.

35 A Survey on Automatic Summarization, Dipanjan Das and Andre F.T. Martins (http://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf)
Wikipedia
Relevance of Cluster Size in MMR Based Summarizer (http://www.cs.cmu.edu/~madhavi/publications/Ganapathiraju_11-742Report.pdf)

