1
Text Summarization http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn
School of Information Engineering, Peking University 12/17/2013
2
Overview
3
What is summarization?
5
Columbia Newsblaster The academic version
6
What is the input? News, or clusters of news: a single article or several articles on a related topic. Email and email threads. Scientific articles. Health information: for patients and doctors. Meetings. Video.
7
What is the output? Keywords. Highlighted information in the input.
Chunks of text or speech taken directly from the input, or content that paraphrases and aggregates the input in novel ways. Modality: text, speech, video, graphics.
8
Ideal stages of summarization
Analysis: input representation and understanding. Transformation: selecting important content. Realization: generating novel text corresponding to the gist of the input.
9
Most current systems use shallow analysis methods
rather than full understanding. They work by sentence selection: identify important sentences and piece them together to form a summary.
10
Types of summaries: extracts and abstracts
Extracts: sentences from the original document are displayed together to form a summary. Abstracts: material is transformed: paraphrased, restructured, shortened.
11
Extractive summarization
Each sentence is assigned a score that reflects how important and contentful it is. Data-driven approaches: word statistics, cue phrases, section headers, sentence position. Knowledge-based systems: discourse information (resolve anaphora, text structure), external lexical resources (WordNet, adjective polarity lists, opinion). Using machine learning.
12
What are summaries useful for?
Relevance judgments: does this document contain information I am interested in? Is this document worth reading? Saving time: reduce the need to consult the full document.
13
Recent developments: in March 2013, Yahoo bought the news reading app Summly for $30 million! In April 2013, Google purchased Wavii for more than $30 million!
14
Multi-document summarization
Very useful for presenting and organizing search results. Many results are very similar, and grouping closely related documents helps cover more event facets. Summarize similarities and differences between documents.
15
How to deal with redundancy?
Author JK Rowling has won her legal battle in a New York court to get an unofficial Harry Potter encyclopaedia banned from publication. A U.S. federal judge in Manhattan has sided with author J.K. Rowling and ruled against the publication of a Harry Potter encyclopedia created by a fan of the book series. Shallow techniques not likely to work well
16
Global optimization for content selection
What is the best summary? vs. what is the best sentence? Form all possible summaries and choose the best one. What is the problem with this approach?
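To make the contrast concrete, here is a minimal, hypothetical sketch of "form all summaries and choose the best": it enumerates every subset of sentences that fits a length budget and keeps the highest-scoring one. The scorer and budget are illustrative placeholders, not a method from the slides; the point is that the number of candidate subsets grows exponentially with the number of sentences, which is exactly the problem with this approach.

```python
from itertools import combinations

def best_summary_exhaustive(sentences, score, budget_words):
    """Enumerate all sentence subsets within the budget and keep the best.

    sentences: list of sentence strings
    score: function mapping a tuple of sentences to a number
           (any summary-level quality measure; a placeholder here)
    budget_words: maximum total number of words in the summary
    """
    best, best_score = (), float("-inf")
    for k in range(1, len(sentences) + 1):
        for subset in combinations(sentences, k):   # O(2^n) subsets overall
            if sum(len(s.split()) for s in subset) > budget_words:
                continue
            sc = score(subset)
            if sc > best_score:
                best, best_score = subset, sc
    return best

def coverage(subset):
    # placeholder summary-level scorer: number of distinct words covered
    return len({w for s in subset for w in s.lower().split()})

# usage: best_summary_exhaustive(list_of_sentences, coverage, budget_words=100)
```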
17
Information ordering In what order to present the selected sentences?
An article with permuted sentences will not be easy to understand. This is very important for multi-document summarization, where sentences come from different documents.
18
Automatic summary edits
Some expressions might not be appropriate in the new context. References: “he”, “Putin”, “Russian Prime Minister Vladimir Putin”. Discourse connectives: “however”, “moreover”, “subsequently”. Requires more sophisticated NLP techniques.
19
Before Pinochet was placed under arrest in London Friday by
British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.
20
After Gen. Augusto Pinochet, the former Chilean dictator, was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.
21
Before Turkey has been trying to form a new government since a coalition government led by Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. Demirel consulted Turkey's party leaders immediately after Ecevit gave up.
22
After Turkey has been trying to form a new government since a coalition government led by Prime Minister Mesut Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Premier-designate Bulent Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. President Suleyman Demirel consulted Turkey's party leaders immediately after Ecevit gave up.
24
Traditional Approaches
25
1) Word-frequency-based method
Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts
26
Luhn’s method: basic idea
Target documents: technical literature. The method is based on the following assumptions: the frequency of word occurrence in an article is a useful measure of word significance, and the relative position of these significant words within a sentence is also a useful measure of word significance. Because of the limited capabilities of the machines of the time (IBM 704), no semantic information is used.
27
Why word frequency? Important words are repeated throughout the text:
examples are given in favor of a certain principle, and arguments are given for a certain principle. In technical literature there tends to be one word for one notion. The algorithm is simple and straightforward, and cheap to implement (processing time was costly). Note that different forms of the same word are counted as the same word.
28
When is a word significant? Words with too low a frequency are not significant.
Words with too high a frequency are also not significant (e.g. “the”, “and”). Removing low-frequency words is easy: set a minimum frequency threshold. Removing common (high-frequency) words: set a maximum frequency threshold (statistically obtained), or compare against a common-word list. Figure 1 from [Luhn, 1958]
29
Using relative position
Where the greatest number of high-frequency words are found closest together, the probability is very high that representative information is given. This is based on the characteristic that the explanation of a certain idea is expressed by words that occur close together (e.g. within sentences, paragraphs, chapters).
30
The significance factor
The “significance factor” of a sentence reflects the number of occurrences of significant words within the sentence and the linear distance between them due to non-significant words in between. Only the portion of the sentence bracketed by significant words with a maximum of 5 non-significant words in between is considered, e.g. “ (*) [ * - * * - - * - - * ] - - (*) ”. Significance factor = (number of significant words in the bracketed portion)² / (total number of words in the bracketed portion), which is 5² / 10 = 2.5 in the above example.
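A minimal sketch (not from Luhn's paper) of how the significance factor might be computed for one sentence, assuming the set of significant words has already been selected by the frequency thresholds above; function and parameter names are illustrative.

```python
def significance_factor(sentence_words, significant_words, max_gap=5):
    """Luhn-style significance factor for one sentence.

    sentence_words: list of tokens in the sentence
    significant_words: set of tokens judged significant by frequency
    max_gap: maximum number of non-significant words allowed between
             two significant words inside the bracketed portion
    """
    # positions of significant words in the sentence
    positions = [i for i, w in enumerate(sentence_words) if w in significant_words]
    if not positions:
        return 0.0

    best = 0.0
    start, prev, count = positions[0], positions[0], 1
    # grow clusters of significant words separated by at most max_gap fillers
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1
        else:
            span = prev - start + 1
            best = max(best, count * count / span)   # (significant words)^2 / span
            start, count = pos, 1
        prev = pos
    span = prev - start + 1
    return max(best, count * count / span)
```

For the example span above (5 significant words inside a bracket of 10 words), this returns 25 / 10 = 2.5.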
31
Generating the abstract
For every sentence the significance factor is calculated. The sentences with a significance factor higher than a certain cut-off value are returned (alternatively, the N highest-valued sentences can be returned). For large texts, the method can also be applied to subdivisions of the text. No evaluation of the results is present in the journal paper!
32
2) Position-based method
H.P. Edmundson: New Methods in Automatic Extracting
33
Lead method Claim: Important sentences occur at the beginning (and/or end) of texts. Lead method: just take first sentence(s)! Experiments: In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale, 58). Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).
34
Cue-Phrase method Claim 1: Important sentences contain ‘bonus phrases’, such as significantly, In this paper we show, and In conclusion, while non-important sentences contain ‘stigma phrases’ such as hardly and impossible. Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97). Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase.
35
Four methods for weighting
Weighting methods: Cue Method, Key Method, Title Method, Location Method. The weight of a sentence is a linear combination of the weights obtained with the above four methods. The highest-weighted sentences are included in the abstract. Target documents: technical literature.
36
Cue Method
Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “significant”, “greatest”, “impossible”, “hardly”). Three types of Cue words: Bonus words, positively affecting the relevance of a sentence (e.g. “significant”, “greatest”); Stigma words, negatively affecting the relevance of a sentence (e.g. “impossible”, “hardly”); Null words, irrelevant.
37
Obtaining Cue words
The lists were obtained by statistical analysis of 100 documents. Dispersion (λ): the number of documents in which the word occurred. Selection ratio (η): the ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all sentences. Bonus words: η > t_high(η). Stigma words: η < t_low(η). Null words: λ > t(λ) and t_low(η) < η < t_high(η), where t_high(η), t_low(η) and t(λ) are thresholds.
38
Resulting Cue lists
Bonus list (783 words): comparatives, superlatives, adverbs of conclusion, value terms, etc. Stigma list (73 words): anaphoric expressions, belittling expressions, etc. Null list (139 words): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.
39
Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, and all Null words with weight n = 0. Cue weight of a sentence: Σ (Cue weight of each word in the sentence).
40
Key Method Principle based on [Luhn], counting the frequency of words.
The algorithm differs: create a key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold. The weight of each key word in the key glossary is set to the frequency with which it occurs in the document. Assign a key weight to each word which can be found in the key glossary; if a word is not in the key glossary, its key weight is 0. No relative position is used (unlike [Luhn]). Key weight of a sentence: Σ (Key weight of each word in the sentence).
41
Title Method: based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs). Create a title glossary consisting of all non-Null words in the title, subtitle and headings of the document. Words are given a positive title weight if they appear in this glossary; title words are given a larger weight than heading words. Title weight of a sentence: Σ (Title weight of each word in the sentence).
42
Location Method
Based on the hypothesis that sentences occurring under certain headings are positively relevant, and that topic sentences tend to occur very early or very late in a document and in its paragraphs. Global idea: give each sentence below a heading the same weight as the heading itself (note that this is independent from the Title Method), the Heading weight; and give each sentence a certain weight based on its position, the Ordinal weight. Location weight of a sentence: Ordinal weight of the sentence + Heading weight of the sentence.
43
Location Method: Heading weight
Compare each word in a heading with the pre-stored Heading dictionary. If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary. Heading weight of a heading: Σ (heading weight of each word in the heading). Heading weight of a sentence = Heading weight of its heading.
44
Creating the Heading dictionary
The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word. Selection ratio (η): the ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all headings. Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.). Weights were given to the words in the Heading dictionary proportional to the selection ratio. The resulting Heading dictionary contained 90 words.
45
Location Method: Ordinal weight
Sentences of the first paragraph are tagged with weight O1. Sentences of the last paragraph are tagged with weight O2. The first sentence of a paragraph is tagged with weight O3. The last sentence of a paragraph is tagged with weight O4. Ordinal weight of a sentence: O1 + O2 + O3 + O4.
46
Generating the abstract
Calculate the weight of a sentence as aC + bK + cT + dL, with a, b, c, d constant positive integers, C: Cue weight, K: Key weight, T: Title weight, L: Location weight. The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human-made) abstracts. Return the highest-weighted N sentences under their proper headings as the abstract (including the title). N is calculated by taking a percentage of the size of the original document; in this journal paper 25% is used.
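A minimal sketch of how the four weights described above can be combined into aC + bK + cT + dL. The word lists, per-word weights and constants are illustrative placeholders (in the original work a, b, c, d were tuned against human abstracts), and the Heading-weight part of the Location Method is omitted for brevity.

```python
def cue_weight(sentence, bonus, stigma, b=1.0, s=-1.0):
    # Bonus words contribute b, Stigma words contribute s, all others 0
    return sum(b if w in bonus else s if w in stigma else 0.0 for w in sentence)

def key_weight(sentence, key_glossary):
    # key_glossary: word -> its document frequency (above a threshold)
    return sum(key_glossary.get(w, 0.0) for w in sentence)

def title_weight(sentence, title_words, t=1.0):
    return sum(t for w in sentence if w in title_words)

def location_weight(idx, n_sentences, o_first=1.0, o_last=1.0):
    # crude stand-in for the Ordinal weight: reward first and last sentences
    w = 0.0
    if idx == 0:
        w += o_first
    if idx == n_sentences - 1:
        w += o_last
    return w

def edmundson_score(sentences, bonus, stigma, key_glossary, title_words,
                    a=1, b=1, c=1, d=1):
    """Score each tokenized sentence as aC + bK + cT + dL."""
    n = len(sentences)
    return [a * cue_weight(s, bonus, stigma)
            + b * key_weight(s, key_glossary)
            + c * title_weight(s, title_words)
            + d * location_weight(i, n)
            for i, s in enumerate(sentences)]
```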
47
Which combination is best?
All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract. As can be seen in the figure (only the interesting results are shown), the Key method was omitted: only C, T and L are used to create the best abstract. A surprising result! (Luhn used only keywords to create the abstract.) Figure 4 from [Edmundson, 1969]
48
Evaluation
Evaluation was done on unseen data (40 technical documents), comparing with handmade abstracts. Result: 44% of the sentences were co-selected, and there was 66% similarity between abstracts (human judge). A random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts. Another evaluation criterion: ‘extract-worthiness’. Result: 84% of the selected sentences are extract-worthy. Therefore, for one document there are many possible abstracts (differing in length and content).
49
3) Machine-learning method
Ask people to select sentences and use these as training examples for machine learning. Each sentence is represented as a number of features. Based on the features, distinguish sentences that are appropriate for a summary from sentences that are not. Run on new inputs.
50
Scoring sentences
For each sentence s, the probability that it will be included in the summary S given the k features is calculated (Bayes’ rule): P(s ∈ S | F1, …, Fk) = P(F1, …, Fk | s ∈ S) · P(s ∈ S) / P(F1, …, Fk). Assuming statistical independence of the features: P(s ∈ S | F1, …, Fk) = ∏j P(Fj | s ∈ S) · P(s ∈ S) / ∏j P(Fj). P(s ∈ S) is constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the training set by counting occurrences. This function assigns to each s a score which can be used to select sentences for inclusion in the abstract.
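A minimal sketch of this naive-Bayes scoring, assuming binary sentence features and counts gathered from a labeled training set. The data layout, function names and smoothing constant are illustrative, not from Kupiec et al.

```python
from collections import Counter

def train_counts(training_data):
    """training_data: list of (binary_feature_tuple, in_summary_bool) pairs."""
    n = len(training_data)
    n_pos = sum(1 for _, y in training_data if y)
    feat_pos = Counter()   # feature j present and sentence in the summary
    feat_all = Counter()   # feature j present, over all sentences
    for feats, y in training_data:
        for j, f in enumerate(feats):
            if f:
                feat_all[j] += 1
                if y:
                    feat_pos[j] += 1
    return n, n_pos, feat_pos, feat_all

def nb_score(feats, n, n_pos, feat_pos, feat_all, eps=0.5):
    """Naive-Bayes score P(s in S | F_1..F_k) for one sentence, assuming
    independent features and smoothing counts with eps."""
    score = n_pos / n                                        # P(s in S)
    for j, f in enumerate(feats):
        p_f_pos = (feat_pos[j] + eps) / (n_pos + 2 * eps)    # P(F_j = 1 | s in S)
        p_f = (feat_all[j] + eps) / (n + 2 * eps)            # P(F_j = 1)
        if f:
            score *= p_f_pos / p_f
        else:
            score *= (1 - p_f_pos) / (1 - p_f)
    return score
```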
51
The training material
188 documents with professionally created abstracts from the scientific/technical domain; the average length of the abstracts is 3 sentences (3.5% of the total size of the document). Sentences from the abstract were matched to the original document: 79% direct sentence matches, 3% direct joins (2 sentences combined), 18% no direct match or join possible. Therefore the maximum performance of the automatic system is 82%.
52
Evaluation: too little material, so cross-validation was used to evaluate.
Two evaluation measures: the fraction of manually selected sentences which were reproduced correctly (average result: 35%), and the fraction of the matchable selected sentences which were reproduced correctly (average result: 42%). Performance of features (2nd measure):
Feature | Individual % sentences correct | Cumulative % sentences correct
Paragraph | 33 |
Fixed Phrases | 29 | 42
Length Cut-off | 24 | 44
Thematic Word | 20 |
Uppercase Word | |
53
4) Discourse-based method
Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’ of the textual units in this structure reflects their importance. Tree-like representation of texts in the style of Rhetorical Structure Theory (Mann and Thompson, 88). Use the discourse representation in order to determine the most important textual units. Attempts: (Ono et al., 1994) for Japanese; (Marcu, 1997, 2000) for English.
54
Rhetorical parsing (Marcu,97)
[With its distant orbit {– 50 percent farther from the sun than Earth –} and slim atmospheric blanket,1] [Mars experiences frigid weather conditions.2] [Surface temperatures typically average about –60 degrees Celsius (–76 degrees Fahrenheit) at the equator and can dip to –123 degrees C near the poles.3] [Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,4] [but any liquid water formed that way would evaporate almost instantly5] [because of the low atmospheric pressure.6] [Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,7] [most Martian weather involves blowing dust or carbon dioxide.8] [Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.9] [Yet even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.10]
55
Rhetorical parsing (2) Use discourse markers to hypothesize rhetorical relations: rhet_rel(CONTRAST, 4, 5), rhet_rel(CONTRAST, 4, 6), rhet_rel(EXAMPLE, 9, [7,8]), rhet_rel(EXAMPLE, 10, [7,8]). Use semantic similarity to hypothesize rhetorical relations: if similar(u1, u2) then rhet_rel(ELABORATION, u2, u1), rhet_rel(BACKGROUND, u1, u2), else rhet_rel(JOIN, u1, u2); e.g. rhet_rel(JOIN, 3, [1,2]), rhet_rel(ELABORATION, [4,6], [1,2]). Use the hypotheses in order to derive a valid discourse representation of the original text.
56
Rhetorical parsing (3)
[Discourse tree for the Mars text, built from relations such as Elaboration, Example, Background, Justification, Concession, Antithesis, Contrast, Evidence and Cause.]
Summarization = selection of the most important units: 2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6
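A rough sketch of the underlying idea: units are ranked by how high in the discourse tree they remain promoted through nucleus links, while satellite links demote them. The tree encoding below is illustrative, not Marcu's actual representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    unit: Optional[int] = None                  # leaf: elementary unit id
    nuclei: List["Node"] = field(default_factory=list)
    satellites: List["Node"] = field(default_factory=list)

def promotion_depth(node: Node, depth: int = 0,
                    scores: Optional[Dict[int, int]] = None) -> Dict[int, int]:
    """Depth at which each unit stops being promoted as a nucleus.

    Smaller depth = more central unit; the depth only grows when a unit is
    reached through a satellite link.
    """
    if scores is None:
        scores = {}
    if node.unit is not None:
        scores[node.unit] = min(scores.get(node.unit, depth), depth)
        return scores
    for child in node.nuclei:
        promotion_depth(child, depth, scores)        # nuclei keep the parent's depth
    for child in node.satellites:
        promotion_depth(child, depth + 1, scores)    # satellites are demoted
    return scores

# Summarization: sort units by increasing depth and keep the top ones.
```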
57
Discourse method: Evaluation
(using a combination of heuristics for rhetorical parsing disambiguation) TREC Corpus (fourfold cross-validation) Scientific American Corpus
58
5) Vector-space based methods: based on word probability, or based on word tf.idf.
Based on word probability: a sentence S of length n can be scored by the average probability of its words, Score(S) = (P1 + … + Pn) / n, where Pi is the probability of the i-th word in the sentence. Based on word tf.idf: the same scheme, with the word probability replaced by the word's tf.idf weight.
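A minimal sketch of both scoring variants, assuming simple whitespace tokenization; names and smoothing choices are illustrative.

```python
import math
from collections import Counter

def word_probabilities(document_tokens):
    """P(w) estimated as relative frequency in the input."""
    counts = Counter(document_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score_by_probability(sentence_tokens, p):
    """Average word probability of the sentence: (sum_i P(w_i)) / n."""
    return sum(p.get(w, 0.0) for w in sentence_tokens) / max(len(sentence_tokens), 1)

def score_by_tfidf(sentence_tokens, tf, df, n_docs):
    """Same shape of score, with P(w) replaced by a tf.idf weight.

    tf: term frequency in the input; df: document frequency; n_docs: corpus size.
    """
    weights = [tf.get(w, 0) * math.log(n_docs / (1 + df.get(w, 0)))
               for w in sentence_tokens]
    return sum(weights) / max(len(weights), 1)
```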
59
Centrality measures: How representative is a sentence of the overall content of a document? The more similar a sentence is to the document, the more representative it is.
60
Evaluation
61
Comparing Text Against Text
Which human summary makes a good gold standard? Many summaries are good At what granularity is the comparison made? When can we say that two pieces of text match?
62
Variation impacts evaluation
Comparing content is hard: all kinds of judgment calls. Paraphrases and VP vs. NP: “Ministers have been exchanged” vs. “Reciprocal ministerial visits”. Length and constituent type: “Robotics assists doctors in the medical operating theater” vs. “Surgeons started using robotic assistants”.
63
Nightmare: only one gold standard
The system may have chosen an equally good sentence that is not in the one gold standard: “Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.” vs. “Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.” In DUC 2001 (one gold standard), the human model had a significant impact on scores (McKeown et al.). Five human summaries are needed to avoid changes in rank (Nenkova and Passonneau), based on DUC 2003 data: 3 topic sets (1 highest scoring and 2 lowest scoring), 10 model summaries.
64
Scoring
Two main approaches used in DUC: ROUGE (Lin and Hovy) and Pyramids (Nenkova and Passonneau). Problems: are the results stable? How difficult is it to do the scoring?
65
DUC – Document Understanding Conference
Established and funded by DARPA TIDES. Run by the independent evaluator NIST. Open to the summarization community. Annual evaluations on common datasets, 2001-present. Tasks: single document summarization, headline summarization, multi-document summarization, multi-lingual summarization, focused summarization.
66
DUC Evaluation
Gold standard: human summaries written by NIST, from 2 to 9 summaries per input set. Multiple metrics. Manual: Coverage (early years), Pyramids (later years), Responsiveness (later years), quality questions. Automatic: ROUGE (-1, -2, skip-bigrams, LCS, BE). Granularity: manual evaluation uses sub-sentential elements, automatic evaluation uses sentences.
67
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE: n-gram co-occurrence metrics measuring content overlap. ROUGE-N = (count of n-gram overlaps between the candidate and the model summaries) / (total number of n-grams in the model summaries).
68
ROUGE: Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements. Automatic and thus easy to apply. It is important to consider confidence intervals when determining differences between systems: scores falling within the same interval are not significantly different. ROUGE scores place systems into large groups: it can be hard to definitively say one system is better than another. Sometimes the results are unintuitive: multilingual scores can be as high as English scores, and use in speech summarization shows no discrimination. Good for training regardless of intervals: trends can be seen.
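A minimal sketch of ROUGE-N as described above (recall-oriented n-gram overlap against one or more model summaries). Real implementations add stemming, stopword options, jackknifing and confidence intervals; this only shows the core count.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate_tokens, model_token_lists, n=1):
    """Clipped n-gram matches divided by total n-grams in the model summaries."""
    cand = ngrams(candidate_tokens, n)
    matched, total = 0, 0
    for model_tokens in model_token_lists:
        ref = ngrams(model_tokens, n)
        matched += sum(min(count, cand[g]) for g, count in ref.items())
        total += sum(ref.values())
    return matched / total if total else 0.0
```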
69
LexPageRank: Prestige in Multi-Document Text Summarization
Gunes Erkan and Dragomir R. Radev ACL 2004
70
Abstract: This paper considers an approach for computing sentence importance based on the concept of eigenvector centrality (prestige), called LexPageRank. In this model, a sentence connectivity matrix is constructed based on cosine similarity. The experimental results on DUC 2004 show that this approach outperforms centroid-based summarization and is quite successful compared to other summarization systems.
71
Introduction: Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user. The approach here is to assess the centrality of each sentence in a cluster and include the most important ones in the summary. The paper introduces two new measures of centrality, Degree and LexPageRank, inspired by the prestige concept in social networks.
72
Sentence centrality and centroid-based summarization
Extractive summarization produces summaries by choosing a subset of the sentences in the original documents. The centrality of a sentence is often defined in terms of the centrality of the words that it contains. The centroid of a cluster is a pseudo-document which consists of the words that have frequency*IDF scores above a predefined threshold. In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered central. Centroid-based summarization has given promising results in the past.
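A minimal sketch of centroid-based sentence scoring as described above: build a pseudo-document from words whose frequency*IDF exceeds a threshold, then score each sentence by how much centroid weight its words carry. Tokenization, the threshold and the function names are illustrative, not MEAD's actual implementation.

```python
from collections import Counter

def build_centroid(cluster_sentences, idf, threshold=1.0):
    """cluster_sentences: list of token lists; idf: word -> idf value."""
    tf = Counter(w for sent in cluster_sentences for w in sent)
    return {w: tf[w] * idf.get(w, 0.0)
            for w in tf
            if tf[w] * idf.get(w, 0.0) > threshold}

def centroid_score(sentence_tokens, centroid):
    # sentences containing more (and heavier) centroid words score higher
    return sum(centroid.get(w, 0.0) for w in set(sentence_tokens))
```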
73
Prestige-based sentence centrality
We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic. There are two issues: how to define similarity between two sentences (cosine), and how to compute the overall prestige of a sentence given its similarity to other sentences (degree centrality, or eigenvector centrality and LexPageRank).
74
Prestige-based sentence centrality
A cluster may be represented by a cosine similarity matrix
75
Prestige-based sentence centrality
Most of the similarity values in the matrix are nonzero.
76
Prestige-based sentence centrality
Degree centrality: since we are interested in significant similarities in the matrix, we can eliminate the low values by defining a threshold, so that the cluster can be viewed as an undirected graph. We define degree centrality as the degree of each node in the similarity graph.
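A minimal sketch of degree centrality over a thresholded cosine-similarity graph, using a plain bag-of-words cosine for brevity (the paper uses an idf-modified cosine); names and the default threshold are illustrative.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def degree_centrality(sentences, threshold=0.1):
    """sentences: list of token lists. Returns the degree of each node in the
    undirected graph whose edges are similarities above the threshold."""
    n = len(sentences)
    sim = [[cosine(sentences[i], sentences[j]) for j in range(n)] for i in range(n)]
    return [sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
            for i in range(n)]
```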
79
Prestige-based sentence centrality
Issue with degree centrality: several unwanted sentences may vote for each other and raise their prestige. This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account when weighting each node: eigenvector centrality and LexPageRank. PageRank (Page et al., 1998) is a method proposed for assigning a prestige score to each page in the web independent of a specific query, depending on the number of pages that link to that page as well as the individual scores of the linking pages.
80
Prestige-based sentence centrality
The PageRank of page A: PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1, …, Tn are the pages that link to page A, d is a damping factor, and C(Ti) is the number of outgoing links from page Ti. This recursively defined value can be computed by forming the binary adjacency matrix of the web, normalizing this matrix so that the row sums equal 1, and finding the principal eigenvector of the normalized matrix. The PageRank of the i-th page equals the i-th entry of the eigenvector.
81
Prestige-based sentence centrality
This method can be easily applied to the cosine similarity graph to find the most prestigious sentences in a document. We call this new measure of sentence similarity LexPageRank.
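A minimal sketch of applying PageRank-style power iteration to the thresholded similarity graph to get LexPageRank-like prestige scores. It uses the normalized (1 - d)/n damping term rather than the (1 - d) form shown above; the similarity matrix could come from the degree-centrality sketch, and all names and defaults are illustrative.

```python
def lexpagerank(sim, threshold=0.1, d=0.85, iters=50):
    """sim: n x n cosine similarity matrix. Returns a prestige score per sentence."""
    n = len(sim)
    # binary adjacency matrix of the thresholded graph, later row-normalized
    adj = [[1.0 if i != j and sim[i][j] > threshold else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) or 1.0 for row in adj]
    p = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            incoming = sum(p[j] * adj[j][i] / row_sums[j] for j in range(n))
            new.append((1.0 - d) / n + d * incoming)
        p = new
    return p
```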
82
Prestige-based sentence centrality
damping factor = 1
83
Prestige-based sentence centrality
Advantages over Centroid: it accounts for information subsumption among sentences, and it prevents unnaturally high IDF scores from boosting the score of a sentence that is unrelated to the topic.
84
Experiments on DUC 2004 data
DUC 2004 data was used in our experiments. Task 2 involves summarization of 50 TDT English clusters. Task 4 is to produce summaries of machine translation output (in English) of 24 Arabic TDT documents. The recall-based measure ROUGE is adopted, and 665-byte summaries are produced for each cluster.
85
Experiments on DUC 2004 data
The MEAD summarization toolkit: extractive multi-document summarization. It consists of three components: a feature extractor (document -> feature vector) using Centroid, Position and Length features; a combiner (feature vector -> scalar value); and a reranker (the scores are adjusted upward or downward) using MMR (Maximal Marginal Relevance) and CSIS (Cross-Sentence Information Subsumption), with a weight threshold.
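A minimal sketch of MMR-style reranking as used in a reranker stage: greedily pick the next sentence that balances its relevance score against its maximum similarity to sentences already selected. Parameter names are illustrative and MEAD's actual reranker differs in detail.

```python
def mmr_rerank(sentences, scores, similarity, lam=0.7, k=5):
    """sentences: list of items; scores: relevance score per sentence;
    similarity(a, b): similarity between two sentences; returns k chosen indices."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((similarity(sentences[i], sentences[j])
                              for j in selected), default=0.0)
            return lam * scores[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```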
86
Experiments on DUC 2004 data
[Results figure: comparison with the Centroid baseline]
87
Thank You! Q&A
88
HOMEWORK: Read one of the following papers and write a reading report.
SentTopic-MultiRank: a novel ranking model for multi-document summarization. In COLING’12
RelationListwise for query-focused multi-document summarization. In COLING’12
A supervised aggregation framework for multi-document summarization. In COLING’12
Query-Focused Multidocument Summarization Based on Query-Sensitive Feature Space. In CIKM’12
Optimized Event Storyline Generation based on Mixture-Event-Aspect Model. In EMNLP’13