
1 Document Summarization
Vinayak Gagrani, Neeraj Toshniwal, Abhishek Kabra
Guide: Pushpak Bhattacharya

2 Outline
Introduction
Single Document Summarization
Multiple Document Summarization
Applications
Evaluation
Conclusion

3 Introduction
What is a summary? A text produced from one or more texts that conveys the important information in the original texts and is no longer than half of the original texts.
Three important aspects of a summary:
Summaries should be short
Summaries should preserve important information
Summaries may be produced from single or multiple documents

4 Common terms in the summarization literature
Extraction: the procedure of identifying important sections of the text and reproducing them verbatim
Abstraction: aims to produce the material in a new way
Fusion: combining extracted parts coherently
Compression: aims at throwing out unimportant sections of the text

5 Single Document Summarization
Early Works
Machine Learning Methods
Naïve Bayes Methods
Rich Features and Decision Trees
Deep Natural Language Analysis Methods
Lexical Chaining
Rhetorical Structure Theory (RST)

6 Early Works
Luhn, 1958
Summarization based on measuring the significance of words depending on their frequency
The significance factor of a sentence is derived from the number of significant words in that sentence
Edmundson, 1969
Word frequency and positional importance were incorporated
Presence of cue words and the skeleton of the document were also incorporated

7 Naïve Bayes Method
Classifier based on applying Bayes' theorem with a strong independence assumption
s: a particular sentence; S: the set of sentences that make up the summary; F1, ..., Fk: the features
Assuming independence of the features:
P(s \in S \mid F_1, F_2, \dots, F_k) = \frac{P(s \in S)\,\prod_{i=1}^{k} P(F_i \mid s \in S)}{\prod_{i=1}^{k} P(F_i)}
Evaluation is done by analyzing the match with a human-extracted summary of the document
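A minimal sketch of this scoring rule in Python, assuming the probabilities P(Fi | s ∈ S), P(Fi) and the prior P(s ∈ S) have already been estimated from a labeled corpus (all feature names and numbers below are hypothetical):

```python
import math

def naive_bayes_score(features, p_f_given_s, p_f, p_s):
    """Log of P(s in S | F1..Fk) under the independence assumption:
    log P(s in S) + sum_i [log P(Fi | s in S) - log P(Fi)]."""
    score = math.log(p_s)
    for f in features:
        score += math.log(p_f_given_s[f]) - math.log(p_f[f])
    return score

# Toy example: two binary features observed for one sentence
p_f_given_s = {"has_cue_word": 0.6, "in_first_paragraph": 0.5}
p_f = {"has_cue_word": 0.2, "in_first_paragraph": 0.1}
print(naive_bayes_score(["has_cue_word", "in_first_paragraph"],
                        p_f_given_s, p_f, p_s=0.2))
```

Sentences are ranked by this score and the top-ranked ones are extracted as the summary.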

8 Naïve Bayes Method
Term frequency-inverse document frequency (TF-IDF)
Increases proportionally to the number of times a word appears in the document, offset by the frequency of the word in the corpus
Takes into account that certain words are more common than others, e.g. "the", "is"
\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}
|D|: total number of documents in the corpus
|{d ∈ D : t ∈ d}|: number of documents in which the term t appears, i.e. tf(t, d) ≠ 0
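A small sketch of this computation, using raw counts for tf and the log formula above (the corpus and tokenization are toy assumptions):

```python
import math

def tf_idf(term, doc, corpus):
    """tf(t, d) * idf(t, D) with idf(t, D) = log(|D| / |{d : t in d}|)."""
    tf = doc.count(term)                       # times the term appears in this document
    df = sum(1 for d in corpus if term in d)   # documents in which the term appears
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(tf_idf("cat", corpus[0], corpus))  # in 2 of 3 docs -> ~0.405
print(tf_idf("the", corpus[0], corpus))  # in every doc -> 0.0, as expected
```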

9 Rich Features and Decision Trees
Weighting sentences based on their position
Arises from the idea that texts generally follow a predictable discourse structure
The yield of each sentence position was calculated against the topic keywords
Sentence positions were then ranked by average yield to produce an Optimal Position Policy for topic positions for the genre
Later, the sentence-extraction problem was modeled using decision trees, breaking away from the assumption that features are independent
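As a sketch of the decision-tree formulation, here is a tiny classifier over hypothetical per-sentence features (relative position, length, topic-keyword count); the feature set and data are illustrative, not those of the original work:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per sentence: [relative position, length in words, topic-keyword count]
X = [[0.0, 25, 3], [0.1, 30, 4], [0.5, 12, 0], [0.9, 18, 1]]
y = [1, 1, 0, 0]  # 1 = sentence was included in the human-written extract

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[0.05, 22, 2]]))  # early, keyword-rich sentence -> likely extracted
```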

10 Deep Natural Language Analysis Methods
Techniques aimed at modeling the text's discourse structure
Use of heuristics to create document extracts
Lexical Chaining
Independent of the grammatical structure of the text
A list of words that captures a portion of the cohesive structure of the text
A sequence of related words in the text, spanning short or long distances
A technique used to identify the central theme of a document

11 Forms of Cohesion
Ellipsis
Words are omitted when the phrase would otherwise need to be repeated
Example: A: Where are you going? B: To town.
Substitution
A word is not omitted but replaced by another
A: Which ice-cream would you like? B: I would like the pink one.

12 Forms of Cohesion
Conjunction
Relationship between two clauses
A few of them are: "and", "then", "however", etc.
Repetition
Mentioning the same word again
Reference
Anaphoric reference: refers to someone/something that has been previously identified
Cataphoric reference: forward referencing. Example: Here he comes... It's Brad Pitt

13 Lexical Chaining
Example: John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.
Steps involved in lexical chaining:
a) Selecting a set of candidate words
b) For each candidate word, finding an appropriate chain, relying on a relatedness criterion among members of the chain
c) If one is found, inserting the word in the chain and updating it accordingly
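A rough sketch of these three steps, using NLTK's WordNet path similarity as the relatedness criterion (the threshold and the candidate list are assumptions):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def related(word, chain, threshold=0.2):
    """Step (b): a candidate is related to a chain if its best WordNet
    path similarity to any chain member reaches the threshold."""
    return any(
        (s1.path_similarity(s2) or 0.0) >= threshold
        for member in chain
        for s1 in wn.synsets(word)
        for s2 in wn.synsets(member)
    )

def build_chains(candidates):
    chains = []
    for word in candidates:          # step (a): candidates chosen upstream
        for chain in chains:
            if related(word, chain):
                chain.append(word)   # step (c): insert and update the chain
                break
        else:
            chains.append([word])    # no suitable chain: start a new one
    return chains

print(build_chains(["pie", "dessert", "chocolate", "john"]))
```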

14 Lexical Chaining
Relatedness measure: WordNet distance
Weights are assigned to chains based on their length and homogeneity
The strength of a lexical chain is determined by taking into consideration the distribution of the chain's elements throughout the text
The strength corresponds to the significance of the textual context the chain embodies
Provides a basis for identifying the topical units in a document, which are of great importance in document summarization
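One common way to turn length and homogeneity into a chain weight, stated here as an assumption (this score-by-repetition formulation appears in the lexical-chaining literature, not on the slide):

```python
def chain_score(chain):
    """Score = length * homogeneity, where homogeneity rewards chains
    in which the same word recurs (a signal of topicality)."""
    length = len(chain)
    homogeneity = 1.0 - len(set(chain)) / length
    return length * homogeneity

print(chain_score(["pie", "pie", "dessert", "chocolate"]))  # 4 * (1 - 3/4) = 1.0
```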

15 Rhetorical Structure Theory (RST)
Two non-overlapping text spans: the nucleus and the satellite
The nucleus expresses what is more essential to the writer's purpose than the satellite
Example: a claim followed by evidence for the claim
RST posits an "Evidence" relation between the two spans
The claim is more essential to the text than the particular evidence, so the claim span is a nucleus and the evidence span a satellite
The nucleus is independent of the satellite, but not vice versa

16 Rhetorical Structure Theory (RST)

17 Multiple Document Summarization
Need and Motivation
Extraction of a single summary from multiple documents started in the mid-1990s
Most applications are in news articles:
Google News (news.google.com)
Columbia Newsblaster (newsblaster.cs.columbia.edu)
News in Essence (NewsInEssence.com)
Multiple sources of information can be:
supplementary to each other
overlapping in content
even contradictory at times

18 Early Work
Extended template-driven message understanding systems
Abstractive systems rely heavily on internal NLP tools; summarization was earlier considered to require knowledge of language interpretation and generation
Extractive techniques have been applied:
Similarity measures between sentences identify common themes through clustering
Select one sentence to represent each cluster, or generate a composite sentence from each cluster
Summarization differs on what the final goal is:
MEAD: works based on extraction techniques over general domains
SUMMONS: builds a briefing highlighting differences and updates across news reports

19 Abstraction and Information Fusion
SUMMONS is the first example of a multi-document summarization system
It considers events in a narrow domain: news articles about terrorism
It produces a briefing merging relevant information about events and their evolution over time
It reads a database built by a template-based message understanding system
It is a concatenation of two systems: a content planner and a linguistic generator

20 SUMMONS: processing the text (Content Planner)
Content Planner: selects the information to include in the summary through a combination of the input templates
It uses summary operators: a set of heuristics that perform operations such as change of perspective, contradiction, and refinement
Linguistic Generator: selects the right words to express the information in grammatical and coherent text
Uses connective phrases to synthesize the summary, adapting language generation tools such as FUF/SURGE

21 Theme-based approach (McKeown et al., Barzilay et al.)
Themes: sets of similar text units (paragraphs); finding them is a clustering problem
Text is mapped to a vector of features including single words weighted by their TF-IDF scores, nouns, pronouns, and semantic classes of verbs
For each pair of paragraphs a vector is computed that represents the matches on the different features
Decision rules learnt from data classify each pair as similar or dissimilar; an algorithm then places the most related paragraphs in the same theme (see the clustering sketch below)
Information fusion decides which sentences of the theme should be included in the final summary
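A simplified sketch of the clustering step, swapping the learned decision rules for a plain cosine-similarity threshold over bag-of-words vectors (the threshold value is an assumption):

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def build_themes(paragraphs, threshold=0.3):
    """Place each paragraph in the first theme containing a similar one."""
    vectors = [Counter(p.lower().split()) for p in paragraphs]
    themes = []
    for i, vec in enumerate(vectors):
        for theme in themes:
            if any(cosine(vec, vectors[j]) >= threshold for j in theme):
                theme.append(i)
                break
        else:
            themes.append([i])
    return themes

paras = ["the bombing injured dozens", "dozens injured in the bombing", "markets rallied today"]
print(build_themes(paras))  # -> [[0, 1], [2]]
```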

22 Information Fusion
The algorithm compares and intersects the predicate-argument structures of the phrases within each theme to find which ones are repeated often enough to be included in the summary
Sentences are parsed using Collins' statistical parser and converted into dependency trees, which capture the predicate-argument structure and identify functional roles
The comparison algorithm traverses the trees recursively, adding identical nodes to an output tree
Once full phrases are found, they are marked to be included in the summary
Once the summary content is decided, a grammatical text is generated using the FUF/SURGE language generation system
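A toy version of the recursive comparison, intersecting two dependency trees by keeping nodes whose words match (the real fusion algorithm also handles paraphrases and functional roles; this only shows the traversal):

```python
class Node:
    def __init__(self, word, children=()):
        self.word, self.children = word, list(children)

def intersect(a, b):
    """Recursively add identical nodes to the output tree."""
    if a.word != b.word:
        return None
    kept = []
    for ca in a.children:
        for cb in b.children:
            sub = intersect(ca, cb)
            if sub is not None:
                kept.append(sub)
                break
    return Node(a.word, kept)

t1 = Node("charged", [Node("McVeigh"), Node("with", [Node("bombing")])])
t2 = Node("charged", [Node("McVeigh", [Node("27")]), Node("with", [Node("bombing")])])
common = intersect(t1, t2)
print(common.word, [c.word for c in common.children])  # charged ['McVeigh', 'with']
```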

23 "McVeigh, 27, was charged with the bombing"
(figure: tree for the example sentence)

24 Topic-Driven Summarization
MMR (Maximal Marginal Relevance), introduced by Carbonell and Goldstein
Rewards relevant sentences and penalizes redundant ones by considering a linear combination of two similarity measures
Q: query or user profile; R: ranked list of documents; S: already selected documents
Documents are selected one at a time and added to S. For each document Di in R\S:
\mathrm{MR}(D_i) = \lambda\,\mathrm{Sim}_1(D_i, Q) - (1-\lambda)\,\max_{D_j \in S}\mathrm{Sim}_2(D_i, D_j), \quad \lambda \in [0, 1]
The document with the maximum MR(Di) is selected, until the maximum number of documents or a threshold is reached
λ controls the relative importance of relevance versus redundancy; Sim1 and Sim2 are similarity measures (e.g. cosine similarity)
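A compact sketch of one MMR selection round, using cosine similarity over bag-of-words vectors for both Sim1 and Sim2 (λ = 0.7 and the toy documents are arbitrary choices):

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def mmr_step(candidates, query, selected, lam=0.7):
    """Return the candidate maximizing
    lam * Sim1(Di, Q) - (1 - lam) * max_{Dj in S} Sim2(Di, Dj)."""
    def mr(d):
        redundancy = max((cosine(d, s) for s in selected), default=0.0)
        return lam * cosine(d, query) - (1.0 - lam) * redundancy
    return max(candidates, key=mr)

query = Counter("budget cuts".split())
docs = [Counter(t.split()) for t in ["senate debates budget cuts",
                                     "budget cuts pass senate",
                                     "local team wins final"]]
first = mmr_step(docs, query, selected=[])
second = mmr_step([d for d in docs if d is not first], query, selected=[first])
```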

25 Graph Spreading Activation
Content is denoted as entities and relations: the nodes and edges of a graph
Rather than extracting sentences, salient regions of the graph are detected
Topic-driven: the topic is denoted by entry nodes in the graph
Graph: each node is a single occurrence of a word
Different kinds of links: adjacency links, SAME links, ALPHA links, phrase links, name and coreference links

26 Graph Spreading Activation
Topic nodes are identified through stem comparison and marked as entry nodes
Spreading activation: the search for semantically related text is propagated from these to the other nodes of the graph
The weight of a neighboring node depends on the links traveled and is an exponentially decaying function of the distance
For a pair of document graphs, common nodes and difference nodes are identified; sentences with higher common and difference scores are highlighted
The user can specify the maximal number of sentences to control the output
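A minimal sketch of the propagation, where activation decays by a constant factor per link traveled (the graph, decay rate, and entry nodes are toy assumptions):

```python
from collections import deque

def spread_activation(graph, entry_nodes, decay=0.5):
    """Weight of a node = decay ** distance from the nearest entry node,
    an exponentially decaying function of link distance."""
    weight = {n: 1.0 for n in entry_nodes}
    queue = deque(entry_nodes)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            w = weight[node] * decay
            if w > weight.get(neighbor, 0.0):
                weight[neighbor] = w
                queue.append(neighbor)
    return weight

graph = {"budget": ["cuts"], "cuts": ["budget", "senate"],
         "senate": ["cuts", "vote"], "vote": ["senate"]}
print(spread_activation(graph, ["budget"]))
# {'budget': 1.0, 'cuts': 0.5, 'senate': 0.25, 'vote': 0.125}
```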

27 Centroid-based Summarization
It does not use any language generation module; easily scalable and domain-independent
Topic detection: group together news articles that describe the same event
An agglomerative clustering algorithm is used, operating on TF-IDF vector representations, successively adding documents to clusters and recomputing the centroids according to
c_j = \frac{1}{|C_j|} \sum_{d \in C_j} d
cj is the centroid of the j-th cluster and Cj the set of documents that belong to that cluster
Centroids can thus be considered as pseudo-documents that include those words whose TF-IDF scores are above a threshold in their cluster
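A small sketch of the centroid computation and the thresholded pseudo-document (the vectors are toy TF-IDF dictionaries; the threshold value is an assumption):

```python
from collections import Counter

def centroid(cluster):
    """c_j = average of the TF-IDF vectors of the documents in cluster C_j."""
    total = Counter()
    for doc in cluster:
        total.update(doc)
    return {term: weight / len(cluster) for term, weight in total.items()}

def pseudo_document(centroid_vec, threshold=0.1):
    """Keep only the words whose averaged TF-IDF score is above the threshold."""
    return {t: w for t, w in centroid_vec.items() if w > threshold}

cluster = [{"bombing": 0.8, "charged": 0.5}, {"bombing": 0.6, "trial": 0.4}]
print(pseudo_document(centroid(cluster)))
# {'bombing': 0.7, 'charged': 0.25, 'trial': 0.2}
```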

28 Centroid-based Summarization
Second stage: identify the sentences that are central to the topic of the entire cluster
Two metrics similar to MMR (but not query-dependent) are defined by Radev et al., 2000:
Cluster-based relative utility (CBRU): how relevant a particular sentence is to the general topic of the cluster
Cross-sentence informational subsumption (CSIS): a measure of redundancy among sentences
Given a cluster segmented into n sentences and a compression rate R, we select nR sentences in order of appearance in the chronologically arranged documents
Three scores are computed per sentence:
Centroid value (Ci): the sum of the centroid values of all the words in the sentence
Positional value (Pi): makes leading sentences more important
First-sentence overlap (Fi): the inner product of the word-occurrence vector of sentence i and that of the first sentence of the document
The final score of each sentence is the sum of the three scores minus a redundancy penalty (Rs) for sentences that overlap highly ranked sentences (see the scoring sketch below)
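A simplified scoring sketch combining the three values (the published system weights each term and applies the redundancy penalty Rs during selection; here the combination is unweighted and Fi uses binary occurrence vectors):

```python
def score_sentences(sentences, centroid_vec):
    """Per-sentence score: Ci + Pi + Fi (redundancy penalty applied at selection time)."""
    n = len(sentences)
    first_words = set(sentences[0].lower().replace(".", "").split())
    scores = []
    for i, sent in enumerate(sentences):
        words = sent.lower().replace(".", "").split()
        ci = sum(centroid_vec.get(w, 0.0) for w in words)   # centroid value
        pi = (n - i) / n                                    # leading sentences score higher
        fi = len(set(words) & first_words)                  # binary inner product with sentence 1
        scores.append(ci + pi + fi)
    return scores

sents = ["McVeigh was charged with the bombing.",
         "The trial began in Denver.",
         "Weather was mild that day."]
print(score_sentences(sents, {"bombing": 0.7, "charged": 0.25, "trial": 0.2}))
```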

29 Applications
Google News:
News aggregator, selecting the most up-to-date (within the past 30 days) information from thousands of publications via an automatic aggregation algorithm
Different versions available for more than 60 regions in 28 languages
Ultimate Research Assistant:
Performs text mining on Internet search results to make it easier for the user to perform online research by organizing the output
Type the name of a topic and it will search the web for highly relevant resources and organize the search results

30 Applications
Shablast: universal search engine
Produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords
iResearch Reporter: commercial text extraction and text summarization system
Produces categorized, easily readable natural-language summary reports covering multiple documents retrieved by entering a user query into the Google search engine

31 Applications

32 Evaluation
A difficult task
The absence of a standard human or automatic evaluation metric makes it difficult to compare different systems and establish a baseline
Manual evaluation is not feasible at scale
Need for an evaluation metric with high correlation to human scores
Human and automatic evaluation: comparison of automatically generated summaries with manually written "ideal" summaries
The text is decomposed into sentences
A rating between 1 and 4 is given to each system unit (SU) that shares content with a model unit (MU) of the ideal summaries

33 Evaluation
ROUGE
Based only on content overlap
Can determine whether the same general concepts are discussed in an automatic summary and a reference summary
Cannot determine whether the result is coherent or the sentences flow together in a sensible manner
Works better for single-document summarization
Information-theoretic evaluation of summaries
The central idea is to use a divergence measure between a pair of probability distributions
The first distribution is derived from the automatic summary, the second from a set of reference summaries
Suits both the single-document and multi-document summarization scenarios
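A minimal ROUGE-N recall sketch showing the content-overlap idea (whitespace tokenization; the real metric adds stemming, stopword options, and multiple references):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap divided by reference n-gram count."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))  # 3/5 = 0.6
```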

34 Conclusion
Need to develop efficient and accurate summarization systems due to the enormous rate of information growth
A lot of research is still going on in this field, especially in evaluation techniques
Multi-document summarization is more widely used than single-document summarization
Extractive techniques are usually employed rather than abstractive ones, as they are easier to implement and have produced satisfactory results

35 References
A Survey on Automatic Text Summarization – Dipanjan Das and André F.T. Martins (ns_survey_summarization.pdf)
Wikipedia
Relevance of Cluster Size in MMR Based Summarizer (apathiraju_11-742Report.pdf)

