Extraction — Chapter 3 of Automatic Summarization. Han Kyoung-Soo (한경수), 2001-11-08, Korea University Natural Language Processing Lab (고려대학교 자연어처리연구실).


Slide 2 — Contents
- Introduction
- The Edmundsonian paradigm
- Corpus-based sentence extraction
  - General considerations
  - Aspects of learning approaches
- Coherence of extracts
- Conclusion

Slide 3 (Introduction)
- Extraction (as discussed here)
  - The analysis phase dominates, and the analysis is relatively shallow.
  - Discourse-level information, if used at all, serves mostly for:
    - establishing coreference between proper names
    - pronoun resolution
- Extraction is not appropriate for every summarization task.
  - At high compression rates, extraction seems less likely to be effective, unless some pre-existing, highly compressed summary material can be found.
  - In multi-document summarization, both differences and similarities between documents need to be characterized.
  - Human abstractors produce abstracts, not extracts.

Slide 4 (Introduction) — Extraction element
- The basic unit of extraction is the sentence.
- Practical reason for preferring the sentence to the paragraph: it offers better control over compression.
- Linguistic motivation:
  - The sentence has historically served as a prominent unit in syntactic and semantic analysis.
  - Logical accounts of meaning offer precise notions of sentential meaning: sentences can be represented in logical form and taken to denote propositions.
- Extracting elements below the sentence level yields extracts that are often fragmentary in nature.
- The sentence therefore seems a natural unit to consider in the general case.

Slide 5 (The Edmundsonian paradigm) — Classic work of Edmundson (1969)
- Used a corpus of 200 scientific papers on chemistry.
  - Each paper was between 100 and 3,900 words long.
  - The target extracts were prepared manually.
- Features:
  - Title words
    - Words from the title, subtitles, and headings, each given a hand-assigned weight.
  - Cue words
    - Extracted from the training corpus based on the selection ratio:
      selection ratio = # of occurrences in the extracts / # of occurrences in all sentences of the corpus
    - Bonus words: evidence for selection (selection ratio above an upper threshold), e.g. comparatives, superlatives, adverbs of conclusion, value terms, relative interrogatives, causality terms.
    - Stigma words: evidence for non-selection (selection ratio below a lower cutoff), e.g. anaphoric expressions, belittling expressions, insignificant-detail expressions, hedging expressions.
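The selection-ratio computation and the bonus/stigma split can be sketched as below. Sentences are represented as token lists; the threshold values are illustrative assumptions, since the chapter does not give Edmundson's actual cutoffs.

```python
def selection_ratio(word, extract_sents, corpus_sents):
    """Edmundson's selection ratio: occurrences of a word in the manual
    extracts divided by its occurrences in all sentences of the corpus."""
    in_extract = sum(s.count(word) for s in extract_sents)
    in_corpus = sum(s.count(word) for s in corpus_sents)
    return in_extract / in_corpus if in_corpus else 0.0

def classify_cue(word, extract_sents, corpus_sents, bonus_cut=0.4, stigma_cut=0.1):
    """Threshold the selection ratio into bonus / stigma / neutral cue status.
    The cutoff values here are made-up defaults, not Edmundson's."""
    r = selection_ratio(word, extract_sents, corpus_sents)
    if r > bonus_cut:
        return "bonus"
    if r < stigma_cut:
        return "stigma"
    return "null"
```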

Slide 6 (The Edmundsonian paradigm) — Classic work of Edmundson (1969), continued
- Features (continued):
  - Key words
    - Word frequencies were tabulated in descending order, until a given cutoff percentage of all the word occurrences in the document was reached.
    - Non-cue words above that threshold were extracted as key words.
    - Each word's weight is its frequency in the document.
  - Sentence location
    - Heading weight: a short list of particular section headings (like "Introduction" and "Conclusion") was constructed; sentences occurring under such headings were assigned a positive weight.
    - Ordinal weight: sentences were assigned weights based on their ordinal position; if they occurred in the first or last paragraph, or were the first or last sentences of paragraphs, they were assigned a positive weight.

Slide 7 (The Edmundsonian paradigm) — Classic work of Edmundson (1969), continued
- Sentence scoring
  - Based on a linear function of the weights of the features.
  - Edmundson adjusted the feature weights and tuning parameters by hand, using feedback from comparisons against the manually created training extracts.
- Evaluation
  - Key words performed worse than the other three features.
  - The combination cue-title-location was the best.
  - The best individual feature was location; the worst was key words.
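Edmundson's linear scoring function can be sketched as a weighted sum over the four feature values. The weight values below are illustrative placeholders, not his hand-tuned ones.

```python
def edmundson_score(features, weights):
    """Score a sentence as a linear combination of its feature values
    (cue, key, title, location), in the spirit of Edmundson's equation."""
    return sum(weights[name] * value for name, value in features.items())

# Illustrative weights only; Edmundson tuned his by hand against training extracts.
weights = {"cue": 1.0, "key": 0.5, "title": 1.0, "location": 1.5}
```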

Slide 8 (The Edmundsonian paradigm) — Feature reinterpretation: cue words
- Cue words → cue phrases
- Cue phrases
  - Expressions like "I conclude by", "this paper is concerned with", ...
  - Include bonus words and stigma words.
  - In-text summary cues (indicator phrases), e.g. phrases beginning with "in summary".
- Useful for specific technical domains.
- Indicator phrases can be extracted by a pattern-matching process: Black (1990), example on p. 49.

Slide 9 (The Edmundsonian paradigm) — Feature reinterpretation: key words
- Key words → presence of thematic term features
  - Selected based on term frequency; includes Edmundson's key words.
- Thematic Term Assumption: relatively more frequent terms are more salient.
  - Luhn (1958):
    - Find content words in a document by filtering against a stoplist of function words.
    - Arrange them by frequency.
    - Suitable high-frequency and low-frequency cutoffs were estimated from a collection of articles and their abstracts.
- A variant of the thematic term assumption: tf*idf.
  - Its use in automatic summarization is somewhat less well motivated.
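A minimal sketch of both variants of the thematic-term assumption: Luhn-style frequency filtering against a stoplist, and the tf*idf alternative. The stoplist and the simplification to a single low-frequency cutoff are assumptions for brevity.

```python
import math
from collections import Counter

STOPLIST = {"the", "a", "of", "in", "and", "is", "to"}  # toy stoplist

def luhn_terms(tokens, low_cutoff=1):
    """Luhn-style thematic terms: content words (not in the stoplist) whose
    in-document frequency exceeds a low-frequency cutoff. Luhn also used a
    high-frequency cutoff, omitted here."""
    counts = Counter(t for t in tokens if t not in STOPLIST)
    return {w: c for w, c in counts.items() if c > low_cutoff}

def tf_idf(term, doc_tokens, all_docs):
    """The tf*idf variant: in-document frequency scaled by inverse
    document frequency over a collection."""
    tf = doc_tokens.count(term)
    df = sum(term in d for d in all_docs)
    return tf * math.log(len(all_docs) / df) if df else 0.0
```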

Slide 10 (The Edmundsonian paradigm) — Feature reinterpretation: location
- Baxendale (1958)
  - Found that important sentences were located at the beginning or end of paragraphs.
  - Salient sentences occurred as the first sentence of the paragraph 85% of the time, and as the last sentence 7% of the time.
- Brandow et al. (1995)
  - Compared their thematic-term-based extraction system for news (ANES) against Searchable Lead, a system that simply outputs leading sentences in order.
  - Searchable Lead outperformed ANES: acceptable 87% to 96% of the time.
  - Unacceptable cases: anecdotal, human-interest-style lead-ins; documents containing multiple news stories; stories with unusual structural/stylistic features; ...

Slide 11 (The Edmundsonian paradigm) — Feature reinterpretation: location, continued
- Lin & Hovy (1997) defined the Optimal Position Policy (OPP).
  - OPP: a list of positions in the text in which salient sentences are likely to occur.
  - For 13,000 Ziff-Davis news articles: title, 1st sentence of 2nd paragraph, 1st sentence of 3rd paragraph, ...
  - For the Wall Street Journal: title, 1st sentence of 1st paragraph, 2nd sentence of 1st paragraph, ...
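An OPP is just an ordered list of positions, so scoring a sentence by its rank in that list can be sketched as follows. The encoding of positions as `'T'` or (paragraph, sentence) pairs, and the shortened policy list, are illustrative assumptions.

```python
def opp_rank(position, policy):
    """Rank of a sentence position under an Optimal Position Policy:
    the earlier a position appears in the learned list, the more likely
    a sentence there is to be salient. Unlisted positions rank last."""
    return policy.index(position) if position in policy else len(policy)

# An illustrative policy in the spirit of the Ziff-Davis list:
# title, then 1st sentence of 2nd paragraph, then 1st of 3rd, ...
ZIFF_POLICY = ["T", (2, 1), (3, 1)]
```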

Slide 12 (The Edmundsonian paradigm) — Feature reinterpretation: title
- Title words → Add-Term
  - A sentence's weight is assigned based on terms in it that are also present in the title, the article headline, or the user's profile or query.
- A user-focused summary
  - Gives relatively heavy weight to such query or topic terms.
  - Will favor the relevance of the summary to the query or topic.
  - Must be balanced against fidelity to the source document: the summary still needs to represent the information in the document.

Slide 13 (The Edmundsonian paradigm) — Criticism
- The Edmundsonian equation is inadequate for summarization for the following reasons:
  - It extracts only single elements in isolation, rather than sequences of elements.
    - This yields incoherent summaries.
    - Knowing that a particular sentence has been selected should affect the choice of subsequent sentences.
  - The compression rate is not directly referenced in the equation.
    - The compression rate should be part of the summarization process, not just an afterthought.
    - E.g., if the most salient concept A is expressed by sentences s1 and s2, and the next-most salient concept B by s3, then the one-sentence summary should be {s3} while the two-sentence summary should be {s1, s2}.

Slide 14 (The Edmundsonian paradigm) — Criticism, continued
- A linear equation may not be a powerful enough model for summarization.
  - Non-linear models are required for certain applications, e.g. spreading activation between words, or other probabilistic models.
- It uses only shallow, morphological-level features for words and phrases in the sentence, along with the sentence's location.
  - There has been a body of work exploring linear combinations of syntactic, semantic, and discourse-level features.
- It is rather ad hoc.
  - It doesn't tell us anything theoretically interesting about what makes a summary a summary.

Slide 15 (Corpus-based sentence extraction) — General considerations
- The most interesting empirical work in the Edmundsonian paradigm has used some variant of Edmundson's equation, leveraging a corpus to estimate the weights.
- Basic methodology for a corpus-based approach to sentence extraction: Figure 3.1 (p. 54).

Slide 16 (Corpus-based sentence extraction) — Labeling
- A training extract is preferred to a training abstract because it is somewhat less likely to vary across human summarizers.
- Producing an extract from an abstract — Mani & Bloedorn (1998):
  - Treat the abstract as a query and rank the source sentences by similarity to it.
  - Combined-match: each source sentence is matched against the entire abstract treated as a single sentence (Equation 3.2, p. 56).
  - Individual-match: each source sentence is compared against each sentence of the abstract.
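The combined-match idea can be sketched as below. Equation 3.2 is not reproduced in these notes, so plain cosine overlap of term-frequency vectors stands in as the similarity measure; that choice is an assumption, not necessarily the equation's exact form.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity over raw term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def combined_match(source_sents, abstract_sents):
    """Combined-match labeling: score each source sentence against the
    entire abstract treated as one long pseudo-sentence."""
    abstract = [t for s in abstract_sents for t in s]
    return [cosine(s, abstract) for s in source_sents]
```

Individual-match would instead take, for each source sentence, its similarity to each abstract sentence separately (e.g. the maximum over them).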

Slide 17 (Corpus-based sentence extraction) — Labeling, continued
- Producing an extract from an abstract (continued):
  - Marcu (1999): iteratively prunes away the clause of the source that is least similar to the abstract.
  - Jing & McKeown (1999): word-sequence alignment using an HMM; see section 3 of Kyoung-Soo's Technical Note KS-TN-200103.
- Labeling can result in a score for each sentence:
  - A yes/no label, or
  - A continuous-valued function.

Slide 18 (Corpus-based sentence extraction) — Learning representation
- The result of learning can be represented as rules or as mathematical functions.
- If a human is to trust a machine's summaries, the machine has to have some way of explaining why it produced the summary it did.
- For this reason, logical rules are usually preferred to mathematical functions.

Slide 19 (Corpus-based sentence extraction) — Compression & evaluation
- Compression
  - Typically applied at testing time.
  - It is possible to train a summarizer for a particular compression rate; different feature combinations may be used for different compression rates.
- Evaluation
  - Precision, recall, accuracy, F-measure; see Tables 3.1/3.2 (p. 59).

Slide 20 (Corpus-based sentence extraction) — Aspects of learning approaches
- Sentence extraction as Bayesian classification: Kupiec et al. (1995)
  - 188 full-text/summary pairs, drawn from 21 different collections of scientific articles.
    - Each summary was written by a professional abstractor and was 3 sentences long on average.
  - Features: sentence length, presence of fixed cue phrases, location, presence of thematic terms, presence of proper names.
  - Bayesian classifier (Equation 3.4, p. 60).
  - Producing an extract from the abstract:
    - Direct match (79%): identical, or considered to have the same content.
    - Direct join (3%): two or more document sentences appear to have the same content as a single summary sentence.
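Under the usual feature-independence assumption, a naive-Bayes sentence score in the style of Kupiec et al.'s Equation 3.4 can be sketched as follows. The probability tables would be estimated from the training corpus; the numbers used in testing below are made up for illustration.

```python
def kupiec_score(feature_values, prior, cond, marg):
    """Naive-Bayes sentence score: P(s in summary | F1..Fk) is taken as
    proportional to P(s in summary) * prod_j P(Fj | in summary) / prod_j P(Fj),
    assuming the features are independent.

    feature_values: {feature_name: observed_value}
    prior:          P(s in summary)
    cond:           {feature_name: {value: P(value | in summary)}}
    marg:           {feature_name: {value: P(value)}}
    """
    score = prior
    for f, v in feature_values.items():
        score *= cond[f][v] / marg[f][v]
    return score
```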

Slide 21 (Corpus-based sentence extraction) — Aspects of learning approaches
- Sentence extraction as Bayesian classification (continued)
  - Evaluation:
    - 43% recall.
    - As the summaries were lengthened, performance improved: 84% recall at 25% of the full text length.
    - Location was the best individual feature.
    - Location + cue phrase + sentence length was the best combination.

Slide 22 (Corpus-based sentence extraction) — Aspects of learning approaches
- Classifier combination: Myaeng & Jang (1999)
  - Tagged each sentence in the Introduction and Conclusion sections according to whether it represented background, the main theme, an explanation of the document structure, or a description of future work.
  - 96% of the summary sentences were main-theme sentences.
  - Training method:
    - Used a Bayesian classifier to determine whether a sentence belonged to the main theme.
    - Combined evidence from multiple Bayesian feature classifiers by voting.
    - Applied a filter to eliminate redundant sentences.
  - Evaluation:
    - Cue words + location + title words was the best combination.
    - Suggests that the Edmundsonian features are not language-specific.

Slide 23 (Corpus-based sentence extraction) — Aspects of learning approaches
- Term aggregation
  - In a document about a certain topic, there will be many references to that topic, but the references need not be verbatim repetitions: synonyms, more specialized words, related terms, ...
  - Aone et al. (1999):
    - Different methods of term aggregation can impact summarization performance, e.g. treating morphological variants, synonyms, and name aliases as instances of the same term.
    - Performance can be improved when place names and organization names are identified as terms and person names are filtered out, since document topics are generally not about people.

Slide 24 (Corpus-based sentence extraction) — Aspects of learning approaches
- Topic-focused summaries: Lin (1999)
  - Used a corpus called the Q&A corpus: 120 texts (4 topics × 30 relevant documents per topic), with human-created, topic-focused passage-extraction summaries.
  - Features:
    - Add-Term (query terms): sentences are weighted by the number of query terms they contain.
    - An additional relevance feature: relevance-feedback weights for terms occurring in the documents most relevant to the topic.
    - Presence of proper names; sentence length.
    - Cohesion features: the number of terms shared with other sentences.
    - Numerical expressions, pronouns, adjectives, references to specific weekdays or months, presence of quoted speech.
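Lin's Add-Term feature reduces to counting query-term occurrences per sentence, which can be sketched as below. Counting occurrences rather than distinct terms is an implementation choice here, not something the notes specify.

```python
def add_term_score(sentence_tokens, query_terms):
    """Add-Term feature: the number of query-term occurrences in a sentence.
    A higher count marks the sentence as more relevant to the topic."""
    return sum(1 for t in sentence_tokens if t in query_terms)
```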

Slide 25 (Corpus-based sentence extraction) — Aspects of learning approaches
- Topic-focused summaries: Lin (1999), continued
  - Feature combination: a naïve combination with each feature given equal weight, versus a decision-tree learner.
  - The naïve method outperformed the decision-tree learner on 3 of the 4 topics.
  - A baseline method based on sentence order also performed well on all topics.

Slide 26 (Corpus-based sentence extraction) — Aspects of learning approaches
- Topic-focused summaries: Mani & Bloedorn (1998)
  - Cmp-lg corpus: a set of 198 pairs of full-text documents and abstracts.
  - Labeling:
    - The overall information need of a user was defined by a set of documents.
    - A subject was told to pick a sample of 10 documents matching his interests.
    - Top content words were extracted from each document, and the words for the 10 documents were sorted by score.
    - All words more than 2.5 standard deviations above the mean of these scores were treated as a representation of the user's interest, or topic; there were 72 such words.
    - Relevance match: spreading activation over cohesion information was used to weight word occurrences in the document related to the topic; each sentence was weighted by the average of its word weights; the top C% of sentences were picked as positive examples.

Slide 27 (Corpus-based sentence extraction) — Aspects of learning approaches
- Topic-focused summaries: Mani & Bloedorn (1998), continued
  - Features:
    - Two additional user-interest-specific features:
      - The number of reweighted words (topic keywords) in the sentence.
      - The number of topic keywords divided by the number of content words in the sentence.
      - Specific topic keywords were not used as features, since it is preferable to learn rules that can transfer across user interests.
      - Topic keywords are similar to the "relevance feedback" terms in Lin's study.
    - Location and thematic features.
    - Cohesion features:
      - Synonymy, judged using WordNet.
      - Statistical cooccurrence: scores between content words i and j up to 40 words apart were computed using mutual information (Equation 3.5, p. 65).
      - The association table only stores scores for tf counts greater than 10 and association scores greater than 10.
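A cooccurrence score of this kind is commonly computed as pointwise mutual information over counts, sketched below. This is the standard PMI form; Equation 3.5 in the book may differ in normalization or windowing details.

```python
import math

def mutual_information(count_ij, count_i, count_j, total):
    """Pointwise mutual information for a cooccurrence link between
    content words i and j: log( P(i,j) / (P(i) * P(j)) ), with the
    probabilities estimated from raw counts over a corpus window."""
    if count_ij == 0:
        return float("-inf")  # never cooccur: maximally negative association
    p_ij = count_ij / total
    p_i = count_i / total
    p_j = count_j / total
    return math.log(p_ij / (p_i * p_j))
```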

Slide 28 (Corpus-based sentence extraction) — Aspects of learning approaches
- Topic-focused summaries: Mani & Bloedorn (1998), continued
  - Evaluation:
    - In user-focused summaries, the number of topic keywords in a sentence was the single most influential feature.
    - The cohesion features contributed the least, perhaps because the cohesion calculation was too imprecise.
  - Some sample rules (Table 3.4, p. 66):
    - The learned rules are highly intelligible, and can perhaps be edited in accordance with human intuitions.
    - Discretization of the features degraded performance by about 15%; there is a tradeoff between accuracy and transparency.

Slide 29 (Corpus-based sentence extraction) — Aspects of learning approaches
- Case study: noisy channel model
  - There has been a surge of interest in language-modeling approaches to summarization (Berger & Mittal 2000).
  - Automatic summarization is framed as a translation problem: translating between a verbose language (of source documents) and a succinct language (of summaries), with a decoder recovering the summary sent through the noisy channel.
  - This idea is related to the notion of the abstractor reconstructing the author's ideas in order to produce a summary.
  - The basic model addresses generic summarization.

Slide 30 (Corpus-based sentence extraction) — Aspects of learning approaches
- Case study: noisy channel model (continued)
  - For user-focused summarization, the model factors into a fidelity term (faithfulness to the source) and a relevance term (relevance to the query).

Slide 31 (Corpus-based sentence extraction) — Aspects of learning approaches
- Case study: noisy channel model (continued)
  - Training:
    - Used FAQ pages on the WWW, which list sequences of question-answer pairs (10,395 of them), culled from 201 Usenet FAQs and 4 call-center FAQs.
    - Each answer is viewed as the query-focused summary of the document.
  - Evaluation: the model assigns the correct summary, on average, a rank of 1.41 for the Usenet data and 4.3 for the call-center data.
  - Criticism:
    - The noisy channel model is appealing because it decomposes the summarization problem for generic and user-focused summarization in a theoretically interesting way.
    - However, the model tends to rely on large quantities of training data.

Slide 32 (Corpus-based sentence extraction) — Conclusion
- The corpus-based approach to sentence extraction is attractive because:
  - It allows one to tune the summarizer to the characteristics of the corpus or genre of text.
  - It is well established.
  - It can learn interesting and often quite intelligible rules.
- But there are many design choices and parameters involved in training.
- Open issues:
  - How is the training to be utilized in an end application?
  - Learning sequences of sentences to extract deserves more attention.
  - Evaluation.

Slide 33 — Coherence of extracts
- When extracting sentences from a source, an obvious problem is preserving context: picking sentences out of context can result in incoherent summaries.
- Coherence problems:
  - Dangling anaphors: if an anaphor is present in a summary extract, the extract may not be entirely intelligible if the referent isn't included as well.
  - Gaps: breaking the connection between the ideas in a text can cause problems.
  - Structured environments: itemized lists, tables, logical arguments, etc., cannot be arbitrarily divided.

Slide 34 — Conclusion
- Abstracts vs. extracts
  - The most important aspect of an abstract is not so much that it paraphrases the input in its own words, but that some level of abstraction of the input has been carried out.
    - This provides a degree of compression.
    - It requires knowledge of the meaning of the information talked about, and the ability to make inferences at the semantic level.
  - Extraction methods, while knowledge-poor, are not entirely knowledge-free.
    - Knowledge about a particular domain is represented in terms of features specific to that domain, and in the particular rules or functions learned for that domain.
    - The knowledge here is entirely internal.
  - There is a fundamental limitation to the capabilities of extraction systems.
    - Current attention is focused on exploiting compression more effectively by producing abstracts automatically.
