
1 Reference
Julian Kupiec, Jan Pedersen, and Francine Chen, "A Trainable Document Summarizer", Proceedings of SIGIR '95, Seattle, WA, USA, 1995.
Xiaodan Zhu and Gerald Penn, "Evaluation of Sentence Selection for Speech Summarization", Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45, September 2005.
C. D. Paice, "Constructing literature abstracts by computer: Techniques and prospects", Information Processing and Management, 26:171-186, 1990.

2 A Trainable Document Summarizer
Julian Kupiec, Jan Pedersen, and Francine Chen
Xerox Palo Alto Research Center

3 Outline
Introduction
A Trainable Summarizer
Experiments and Evaluation
Discussion and Conclusions

4 Introduction
To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original.
This paper focuses on document extracts, a particular kind of computed document summary.
Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful as indicative summaries.
Titles, keywords, tables of contents, and abstracts might all be considered forms of summary.
The authors approach extract selection as a statistical classification problem.
This framework provides a natural evaluation criterion: the classification success rate, or precision.
It does, however, require a training corpus of documents with labelled extracts.

5 A Trainable Summarizer
Features
– Paice groups sentence-scoring heuristics into seven categories, including:
  Frequency-keyword heuristics
  The title-keyword heuristic
  Location heuristics
  Indicator phrases (e.g., "this report ...")
  A related heuristic involving cue words
  – Two sets of words that are positively and negatively correlated with summary sentences
  – Bonus words: e.g., "greatest" and "significant"
  – Stigma words: e.g., "hardly" and "impossible"
Ref.: Paice's seven categories are the frequency-keyword approach, the title-keyword method, the location method, syntactic criteria, the cue method, the indicator-phrase method, and relational criteria.

6 A Trainable Summarizer
Features
– Sentence Length Cut-off Feature
  Given a threshold (e.g., 5 words), the feature is true for all sentences longer than the threshold, and false otherwise.
– Fixed-Phrase Feature
  This feature is true for sentences that contain any of 26 indicator phrases, or that follow section heads containing specific keywords.
– Paragraph Feature
– Thematic Word Feature
  The most frequent content words are defined as thematic words.
  This feature is binary, depending on whether a sentence is present in the set of highest-scoring sentences.
– Uppercase Word Feature
(A short sketch of how such features might be computed follows below.)
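To make these binary features concrete, here is a minimal Python sketch of how a few of them might be computed. The 5-word threshold, the 26 indicator phrases, and the notion of thematic (most frequent content) words come from the slide; the helper names, the tiny stopword list, and the exact scoring of the thematic feature are illustrative assumptions rather than the paper's implementation.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption; the paper does not specify one)
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "that", "for"}

def sentence_length_feature(sentence, threshold=5):
    """True for sentences longer than the threshold (the slide's example uses 5 words)."""
    return len(sentence.split()) > threshold

def fixed_phrase_feature(sentence, indicator_phrases):
    """True if the sentence contains any indicator phrase (the paper uses a list of 26)."""
    lowered = sentence.lower()
    return any(phrase in lowered for phrase in indicator_phrases)

def thematic_words(sentences, k=10):
    """The k most frequent content (non-stopword) words in the document."""
    counts = Counter(
        w.lower()
        for s in sentences
        for w in s.split()
        if w.lower() not in STOPWORDS
    )
    return {w for w, _ in counts.most_common(k)}

def thematic_word_feature(sentences, top_n=5, k=10):
    """Binary per-sentence feature: is the sentence among the highest-scoring ones,
    where a sentence's score is its number of thematic-word occurrences?"""
    theme = thematic_words(sentences, k)
    scores = [sum(w.lower() in theme for w in s.split()) for s in sentences]
    cutoff = sorted(scores, reverse=True)[:top_n]
    threshold = cutoff[-1] if cutoff else 0
    return [score >= threshold and score > 0 for score in scores]
```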

7 A Trainable Summarizer
Classifier
– For each sentence s, compute the probability that it will be included in a summary S given its k features F_1, ..., F_k, which can be expressed using Bayes' rule as follows:
  P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}
– Assuming statistical independence of the features:
  P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}
– P(s \in S) is a constant, and P(F_j \mid s \in S) and P(F_j) can be estimated directly from the training set by counting occurrences.
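The scoring rule above can be sketched directly in code. Below is a minimal counting-based implementation, assuming binary features; the Laplace smoothing is an added assumption (the slide only says the probabilities are estimated by counting occurrences).

```python
from collections import defaultdict

def train(training_data, k, smoothing=1.0):
    """training_data: list of (features, in_summary) pairs, where features is a
    length-k tuple of 0/1 values and in_summary is True for labelled summary sentences.
    Estimates P(F_j = v | s in S), P(F_j = v), and P(s in S) by counting occurrences.
    The Laplace smoothing is an assumption added for robustness."""
    n = len(training_data)
    n_pos = sum(1 for _, in_summary in training_data if in_summary)
    cond = [defaultdict(float) for _ in range(k)]   # counts for P(F_j = v | s in S)
    marg = [defaultdict(float) for _ in range(k)]   # counts for P(F_j = v)
    for features, in_summary in training_data:
        for j, v in enumerate(features):
            marg[j][v] += 1
            if in_summary:
                cond[j][v] += 1
    for j in range(k):
        for v in (0, 1):
            cond[j][v] = (cond[j][v] + smoothing) / (n_pos + 2 * smoothing)
            marg[j][v] = (marg[j][v] + smoothing) / (n + 2 * smoothing)
    return n_pos / n, cond, marg   # P(s in S), conditionals, marginals

def summary_score(features, prior, cond, marg):
    """P(s in S | F_1..F_k) under the feature-independence assumption."""
    p = prior
    for j, v in enumerate(features):
        p *= cond[j][v] / marg[j][v]
    return p
```

Sentences would then be ranked by this score and the highest-scoring ones returned as the extract.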

8 Experiments and Evaluation
The corpus
– There are 188 document/summary pairs, sampled from 21 publications in the scientific/technical domain.
– The average number of sentences per document is 86.
– Each document was "normalized" so that the first line of each file contained the document title.

9 Experiments and Evaluation
The corpus
– Sentence matching categories:
  Direct sentence match (verbatim or with minor modification)
  Direct join (two or more original sentences)
  Unmatchable
  Incomplete (some overlap: includes part of an original sentence, but also contains other information)
– The correspondences were produced in two passes.
– 79% of the summary sentences have direct matches.

10 Experiments and Evaluation
The corpus

11 Experiments and Evaluation
Evaluation
– A cross-validation strategy was used for evaluation.
– Unmatchable and incomplete sentences were excluded from both training and testing, yielding a total of 498 unique sentences.
– Performance was measured in two ways:
  First way (the highest performance): a sentence produced by the summarizer is counted as correct if it is a direct sentence match or part of a direct join. Of the 568 summary sentences, 195 direct sentence matches and 6 direct joins were correctly identified, for a total of 201 correctly identified summary sentences (35%).
  Second way: against the 498 matchable sentences, 42% were correctly identified.

12 Experiments and Evaluation
Evaluation
– The best feature combination is (paragraph + fixed-phrase + sentence-length).
– Adding the frequency-keyword features (the thematic and uppercase word features) results in a slight decrease in overall performance.
– As a baseline, selecting sentences from the beginning of a document (considering the sentence-length cut-off feature alone) yields 24% (121 sentences correct).

13 Experiments and Evaluation
– Figure 3 shows the performance of the summarizer (using all features) as a function of summary size.
– Edmundson cites a sentence-level performance of 44%.
– By analogy, 25% of the average document length in this corpus (86 sentences) is about 20 sentences.
– Reference to the table indicates performance of 84% at that summary size.

14 Discussion and Conclusions
– The trends in these results agree with those of Edmundson, who used a subjectively weighted combination of features rather than training the feature weights on a corpus.
– The frequency-keyword features also gave the poorest individual performance in the evaluation.
– The authors nevertheless retained these features in the final system for several reasons:
  The first is robustness.
  Secondly, as the number of sentences in a summary grows, more dispersed informative material tends to be included.

15 Discussion and Conclusions
– The goal is to provide a summarization program that is of general utility.
  The first issue concerns robustness.
  The second issue concerns presentation and other forms of summary information.

16 Reference
Julian Kupiec, Jan Pedersen, and Francine Chen, "A Trainable Document Summarizer", Proceedings of SIGIR '95, Seattle, WA, USA, 1995.
Xiaodan Zhu and Gerald Penn, "Evaluation of Sentence Selection for Speech Summarization", Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45, September 2005.

17 Evaluation of Sentence Selection for Speech Summarization
Xiaodan Zhu and Gerald Penn
Department of Computer Science, University of Toronto

18 Outline
Introduction
Speech Summarization by Sentence Selection
Evaluation Metrics
Experiments
Conclusions

19 Introduction
– This paper considers whether ASR-inspired evaluation metrics produce different results than those taken from text summarization.
– The goal of speech summarization is to distill important information from speech data.
– The paper focuses on sentence-level extraction.

20 Speech Summarization by Sentence Selection
– "LEAD": select the first N% of sentences from the beginning of the transcript.
– "RAND": random selection.
– Knowledge-based approach, "SEM":
  To calculate semantic similarity between a given utterance and the dialogue, the noun portion of WordNet is used as a knowledge source, with semantic distance between senses computed using normalized path length.
  The performance of the system is reported as better than LEAD, RAND, and TF*IDF-based methods.
  Rather than using manually disambiguated senses, Brill's POS tagger is applied to identify the nouns.
  A semantic-similarity package is used.
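A rough sketch of the SEM idea using NLTK's WordNet interface is given below. The slide mentions Brill's POS tagger and a semantic-similarity package; here nltk.pos_tag and WordNet path similarity stand in as substitutes, and taking the maximum over noun senses (since senses are not manually disambiguated) is an assumption rather than the paper's exact procedure.

```python
import nltk
from nltk.corpus import wordnet as wn

# One-time data downloads (standard NLTK packages):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

def nouns(text):
    """Extract nouns with NLTK's default tagger (a stand-in for the Brill tagger on the slide)."""
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

def noun_similarity(w1, w2):
    """Best path-length-based similarity over all noun senses of the two words."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def sem_score(utterance, dialogue):
    """Average, over nouns in the utterance, of each noun's best similarity
    to any noun in the whole dialogue."""
    u_nouns, d_nouns = nouns(utterance), nouns(dialogue)
    if not u_nouns or not d_nouns:
        return 0.0
    return sum(max(noun_similarity(u, d) for d in d_nouns) for u in u_nouns) / len(u_nouns)
```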

21 Speech Summarization by Sentence Selection
– MMR-based approach, "MMR": each candidate sentence is scored by
  whether it is more similar to the whole dialogue, and
  whether it is less similar to the sentences that have been selected so far.
  (A greedy MMR sketch follows after this slide.)
– Classification-based approaches
  Sentence selection is formulated as a binary classification problem.
  The best two classifiers have consistently been SVM and logistic regression.
  SVM (OSU-SVM package): SVM seeks an optimal separating hyperplane, where the margin is maximal. The decision function is:
  f(x) = \operatorname{sign}\left( \sum_{i} \alpha_i y_i K(x_i, x) + b \right)
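The MMR scoring described in the first bullet can be sketched as a greedy selection loop. The cosine over raw word counts and the lambda weight of 0.7 below are illustrative assumptions, not the paper's exact settings.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_select(sentences, n, lam=0.7):
    """Greedy MMR: prefer sentences similar to the whole dialogue but
    dissimilar to the sentences already chosen."""
    bags = [Counter(s.lower().split()) for s in sentences]
    dialogue = Counter()
    for bag in bags:
        dialogue.update(bag)
    selected = []
    while len(selected) < min(n, len(sentences)):
        best_i, best_score = None, float("-inf")
        for i, bag in enumerate(bags):
            if i in selected:
                continue
            redundancy = max((cosine(bag, bags[j]) for j in selected), default=0.0)
            score = lam * cosine(bag, dialogue) - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return [sentences[i] for i in selected]
```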

22 Speech Summarization by Sentence Selection
Features

23 Speech Summarization by Sentence Selection
– Classification-based approaches
  Logistic regression, "LOG": models the posterior probabilities of the class label with linear functions:
  P(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^{T} x)}{1 + \exp(\beta_0 + \beta^{T} x)}
  where X are the feature sets and Y are the class labels.
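As a small numeric illustration of the logistic posterior above (the feature values and coefficients are made up purely for the example):

```python
import math

def logistic_posterior(features, beta0, beta):
    """P(Y = 1 | X = x) = exp(b0 + b.x) / (1 + exp(b0 + b.x))."""
    z = beta0 + sum(b * f for b, f in zip(beta, features))
    return 1.0 / (1.0 + math.exp(-z))

# Toy example with assumed coefficients (illustrative only):
features = [1.0, 0.0, 3.0]            # e.g., sentence-level feature values
beta0, beta = -1.0, [0.8, 1.2, 0.3]
print(logistic_posterior(features, beta0, beta))  # probability the sentence is in the summary
```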

24 Evaluation Metrics
– Precision/Recall
  When evaluated on binary annotations using precision/recall metrics, sys1 and sys2 achieve 50% and 0%, respectively.
– Relative Utility
  For the above example, using relative utility, sys1 gets 18/19 and sys2 gets 15/19.
  The values obtained are higher than with P/R, but they are higher for all of the systems evaluated.
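A simplified, single-judge sketch of the relative-utility idea described above: the sum of judge-assigned utility scores of the selected sentences, normalized by the best achievable sum for the same number of sentences. The multi-judge aggregation used in the actual metric is omitted here.

```python
def relative_utility(selected_indices, utility_scores):
    """Utility of the selected sentences over the maximum achievable utility
    for the same number of sentences (simplified to a single judge)."""
    k = len(selected_indices)
    achieved = sum(utility_scores[i] for i in selected_indices)
    best = sum(sorted(utility_scores, reverse=True)[:k])
    return achieved / best if best else 0.0

# Example: with judge scores [9, 8, 2, 1] and a system that picks sentences 0 and 2,
# the score is (9 + 2) / (9 + 8) = 11/17; picking sentences 0 and 1 would give 17/17.
```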

25 Evaluation Metrics
– Word Error Rate
  Computed at the sentence level and the word level.
  The sum of insertion, substitution, and deletion errors of words, divided by the number of all these errors plus the number of correct words. (A WER sketch follows after this slide.)
– Zechner's Summarization Accuracy
  The summarization accuracy is defined as the sum of the relevance scores of all the words in the automatic summary, divided by the maximum achievable relevance score with the same number of words.
– ROUGE
  Measures overlapping units such as n-grams, word sequences, and word pairs.
  ROUGE-N and ROUGE-L are used.
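Below is a sketch of word error rate as defined on this slide (errors divided by errors plus correct words), using a standard Levenshtein word alignment. Treating the human summary as the reference and the automatic summary as the hypothesis, and the tie-breaking rule among equal-cost alignments, are implementation assumptions.

```python
def word_error_rate(hypothesis, reference):
    """WER per the slide's definition: (insertions + substitutions + deletions)
    divided by (those errors + correctly matched words)."""
    hyp, ref = hypothesis.split(), reference.split()
    m, n = len(hyp), len(ref)
    # dp[i][j] holds (errors, correct) for the best alignment of hyp[:i] and ref[:j];
    # "best" minimizes errors and, among ties, maximizes correct matches.
    dp = [[(0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, 0)          # i insertions (extra hypothesis words)
    for j in range(1, n + 1):
        dp[0][j] = (j, 0)          # j deletions (missed reference words)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if hyp[i - 1] == ref[j - 1]:
                diag = (dp[i - 1][j - 1][0], dp[i - 1][j - 1][1] + 1)   # match
            else:
                diag = (dp[i - 1][j - 1][0] + 1, dp[i - 1][j - 1][1])   # substitution
            ins = (dp[i - 1][j][0] + 1, dp[i - 1][j][1])                # insertion
            dele = (dp[i][j - 1][0] + 1, dp[i][j - 1][1])               # deletion
            dp[i][j] = min(diag, ins, dele, key=lambda t: (t[0], -t[1]))
    errors, correct = dp[m][n]
    return errors / (errors + correct) if (errors + correct) else 0.0
```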

26 Experiments
– Corpus: the SWITCHBOARD dataset (a corpus of open-domain spoken dialogue).
– 27 spoken dialogues were randomly selected from SWITCHBOARD.
– Three annotators were asked to assign 0/1 labels to indicate whether each sentence is in the summary or not (they were required to select around 10% of the sentences into the summary).
– Each judge's annotation is evaluated relative to the others' (F-scores).

27 Experiments
– Precision/Recall
  One gold standard marks a sentence as in the summary only when all three annotators agree.
  LOG and SVM have similar performance and outperform the others, with MMR following, and then SEM and LEAD.
  A second standard marks a sentence as in the summary when at least two of the three judges include it.

28 Experiments
– Precision/Recall
  A third standard marks a sentence as in the summary when any of the three annotators includes it.
– Relative Utility
  From the three human judges, an assignment of a number between 0 and 9 to each sentence is obtained, indicating the confidence that the sentence should be included in the summary.

29 Experiments
– Relative Utility
  The performance ranks of the five summarizers are the same here as they are in the three P/R evaluations. Possible reasons:
  First, the P/R agreement among annotators is not low.
  Second, the redundancy in the data is much less than in multi-document summarization tasks.
  Third, the summarizers compared might tend to select the same sentences.

30 Experiments
Word Error Rate and Summarization Accuracy

31 Experiments
Word Error Rate and Summarization Accuracy

32 Experiments
ROUGE

33 Conclusion
– Five summarizers were evaluated on three text-summarization-inspired metrics: precision/recall (P/R), relative utility (RU), and ROUGE, as well as on two ASR-inspired evaluation metrics: word error rate (WER) and summarization accuracy (SA).
– The preliminary conclusion is that considerably greater caution must be exercised when using ASR-based measures than has been witnessed to date in the speech summarization literature.

