Approaches to automatic summarization, Lecture 5


1 Approaches to automatic summarization Lecture 5

2 Types of summaries
Extracts – Sentences from the original document are displayed together to form a summary
Abstracts – Material is transformed: paraphrased, restructured, shortened

3 Extractive summarization
Each sentence is assigned a score that reflects how important and contentful it is
Data-driven approaches
– Do not use any domain knowledge or external resources
– Importance “emerges” from the data
– Probabilistic models of word occurrence and sentence similarity

4 Sentence ranking options
Based on word probability
– S is a sentence of length n
– p_i is the probability of the i-th word in the sentence
Based on word tf.idf
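
The scoring formula itself did not survive in the transcript. One common reading, assumed here and not confirmed by the slide, is to score a sentence by the average probability of its words, with word probabilities estimated from the document itself. A minimal sketch under that assumption (the function names and toy sentences are hypothetical):

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    # Crude whitespace tokenizer: lowercases and strips surrounding punctuation.
    stripped = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in stripped if w]

def word_probabilities(sentences):
    # p(w) = count of w in the document / total number of word tokens.
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def average_probability_score(sentence, probs):
    # Assumed scoring rule: mean of p_i over the n words of the sentence.
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(probs.get(w, 0.0) for w in words) / len(words)

# Toy document, invented for illustration only.
sentences = [
    "Summarization selects important sentences.",
    "Important sentences contain frequent words.",
    "The weather was pleasant.",
]
probs = word_probabilities(sentences)
ranked = sorted(
    sentences,
    key=lambda s: average_probability_score(s, probs),
    reverse=True,
)
```

The tf.idf option would plausibly follow the same pattern, with each word's tf.idf weight used in place of its probability.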

5 Centrality measures
How representative is a sentence of the overall content of a document?
– The more similar a sentence is to the document, the more representative it is
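
The slide does not name a similarity measure. As one possible sketch, assuming bag-of-words vectors and cosine similarity (both are assumptions, as are the function names and the toy document), centrality could be computed like this:

```python
import math
from collections import Counter

def bag_of_words(text):
    # Word-count vector for a piece of text (whitespace tokenization).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centrality_scores(sentences):
    # Compare each sentence with the vector of the whole document.
    doc_vector = bag_of_words(" ".join(sentences))
    return [cosine(bag_of_words(s), doc_vector) for s in sentences]

# Toy document, invented for illustration only.
doc = [
    "summaries select important sentences",
    "important sentences resemble the document",
    "bananas are yellow",
]
scores = centrality_scores(doc)
```

In this toy document the off-topic third sentence receives the lowest score, which matches the intuition on the slide.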

6 Data-driven approach
Unsupervised: no information about what constitutes a desirable choice
How can supervised approaches be used?
– For example, the scientific article summarization paper from last week

7

8 Rhetorical status
What is the purpose of the sentence? To communicate
– Background
– Aim
– Basis (related work)
How can we know which sentence serves each aim?

9 Rhetorical zones

10

11 Distribution of categories

12 Selecting important sentences (relevance)
How well can it be performed by people?
– Rather subjective; depends on prior knowledge and interests
Even the same person would select 50% different sentences if she performed the task at different times
Still, judgments can be solicited from several people to mitigate the problem
For each sentence in an article, say whether it is important and interesting enough to be included in a summary
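
The slide does not say how judgments from several people are combined. One simple possibility, assumed here purely for illustration, is a majority vote per sentence (the function name and the judgments below are hypothetical):

```python
def majority_vote(judgments):
    """Combine per-annotator include/exclude decisions.

    judgments: one list of booleans per annotator, one entry per sentence.
    Returns True for a sentence when more than half the annotators chose it.
    """
    n_annotators = len(judgments)
    votes_per_sentence = [sum(column) for column in zip(*judgments)]
    return [votes * 2 > n_annotators for votes in votes_per_sentence]

# Three hypothetical annotators judging four sentences.
annotators = [
    [True, False, True, False],
    [True, True, False, False],
    [True, False, False, False],
]
selected = majority_vote(annotators)
```

Only the first sentence is chosen by a majority here; the second and third each get a single vote, which illustrates how disagreement shrinks the agreed-upon set.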

13 Annotated data
80 computational linguistics articles
Can be used to train classifiers
– Given a sentence, which rhetorical class does it belong to?
– Given a sentence, should it be included in the summary or not?

14 Features
Location
– Absolute location of the sentence
– Section structure: first sentence, last sentence, other
– Paragraph structure
What section the sentence appeared in
– Introduction, implementation, example, conclusion, result, evaluation, experiment, etc.

15 Sentence length
– Very long and very short sentences are unusual
Title word overlap
Tf.idf word content
– Binary feature
– “yes” if the sentence contains one of the 18 most important words
– “no” otherwise

16 Presence and type of citation
Formulaic expressions
– “in traditional approaches”, “a novel method for”
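
The features listed on the last three slides can be collected into one record per sentence. The sketch below is a simplification under stated assumptions: position is measured within the whole document rather than within a section, the citation pattern covers only two assumed formats, and the set of important words is supplied by the caller rather than computed by tf.idf. All names are hypothetical.

```python
import re

# The two formulaic expressions quoted on the slide.
CUE_PHRASES = ("in traditional approaches", "a novel method for")

# Assumed citation formats: "(Name, 1999)" or "[12]". Real articles vary.
CITATION_PATTERN = re.compile(r"\(\s*[A-Z][^()]*\d{4}[a-z]?\s*\)|\[\d+\]")

def words_of(text):
    # Lowercased alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_features(sentence, index, n_sentences, title, important_words):
    tokens = words_of(sentence)
    lowered = sentence.lower()
    if index == 0:
        position = "first"
    elif index == n_sentences - 1:
        position = "last"
    else:
        position = "other"
    return {
        "position": position,
        "relative_location": index / max(n_sentences - 1, 1),
        "length": len(tokens),
        "title_overlap": len(set(tokens) & set(words_of(title))),
        "has_important_word": any(t in important_words for t in tokens),
        "has_cue_phrase": any(p in lowered for p in CUE_PHRASES),
        "has_citation": bool(CITATION_PATTERN.search(sentence)),
    }

# Invented example sentence; "(Author, 1999)" is a placeholder citation.
features = extract_features(
    "We present a novel method for summarization (Author, 1999).",
    index=0,
    n_sentences=10,
    title="Approaches to automatic summarization",
    important_words={"summarization", "rhetorical"},
)
```

Each sentence then becomes a small feature dictionary that a classifier can consume, instead of a vector of raw words.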

17

18

19 Important lessons for us
Vector representation of sentences
– Can be words
– But can also be other features!
The probability of a sentence belonging to a class can be computed
Complex distinctions can be accurately predicted using simple features
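
The slide does not say which model computes the class probability. As an illustration only, a Naive Bayes classifier with add-one smoothing over categorical features is one simple option; the choice of model, the function names, and the toy training data are all assumptions.

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, label) pairs with hashable values."""
    label_counts = defaultdict(int)
    value_counts = defaultdict(int)    # (label, feature, value) -> count
    feature_values = defaultdict(set)  # feature -> values seen in training
    for features, label in examples:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name, value)] += 1
            feature_values[name].add(value)
    return label_counts, value_counts, feature_values

def class_probabilities(features, model):
    # P(label | features) under the naive independence assumption,
    # with add-one (Laplace) smoothing of each feature likelihood.
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    joint = {}
    for label, count in label_counts.items():
        p = count / total
        for name, value in features.items():
            n_values = len(feature_values[name] | {value})
            p *= (value_counts[(label, name, value)] + 1) / (count + n_values)
        joint[label] = p
    norm = sum(joint.values())
    return {label: p / norm for label, p in joint.items()}

# Invented training data: two features per sentence, include/exclude labels.
training = [
    ({"position": "first", "has_cue_phrase": True}, "include"),
    ({"position": "first", "has_cue_phrase": False}, "include"),
    ({"position": "last", "has_cue_phrase": True}, "include"),
    ({"position": "other", "has_cue_phrase": False}, "exclude"),
    ({"position": "other", "has_cue_phrase": False}, "exclude"),
    ({"position": "other", "has_cue_phrase": True}, "exclude"),
]
model = train_naive_bayes(training)
probs = class_probabilities({"position": "first", "has_cue_phrase": True}, model)
```

The same code would apply to rhetorical classes (Background, Aim, Basis) by changing the labels, since nothing in it depends on there being exactly two classes.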

20 Problems with ML for summarization
Annotation is expensive
– Here: relevance and rhetorical status judgments
People don’t agree
– So more annotators are necessary
– And/or more training of the annotators

