
1 UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16

2 Marti A. Hearst SIMS 202, Fall 1997
Summarization
- What is it for?
  - Reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. (Kupiec et al. 95)
  - Other definitions?
- What kinds of summaries are there?
  - Abstracts
  - Extracts
  - Highlights
- Difficult to evaluate quality

3 Abstracts
- Act as document surrogates
- Intermediate point between title and entire document
- Summarize main points
- Cohesive narratives
- Reporting vs. critical

4 How to Generate Abstracts Automatically?
- "Understand" the text and summarize?
  - Automated text understanding is still not possible.
  - Approximations to this require huge amounts of hand-coded knowledge.
  - Hard even for people (remember the SATs?)

5 Extracts
- Excerpt directly from source material
- Present an inventory of all major topics or points
- Easier to automate than abstracts

6 Automatic Extracting
- Define the goal:
  - Summarize the main point?
  - Create a survey of topics?
  - Show the relation of the document to the user's query terms?
  - Show the context of the document (for web page excerpts)?

7 Example web search results, with excerpts serving as document summaries:

2. lecture 18
   Lecture 18. Index language functions (Text Chapter 13) Objectives. The student should understand the principle of request-oriented (user-centered)...
   http://oriole.umd.edu:8000/courses/670/spring97/lect18.html - size 4K - 23-Apr-97 - English

5. Actions/change, Accepted articles
   Research area of Planning and Scheduling. Received Research Articles. The following articles have been received for the ETAI area "Planning and...
   http://www.ida.liu.se/ext/etai/planning/nj/received.html - size 2K - 8-Sep-97 - English

8. Wilson Readers' Guide Abstracts
   Wilson Readers' Guide Abstracts includes citations and abstracts for articles from over 250 of the popular English...
   http://www.ovid.com/db/databses/wrga.htm - size 3K - 29-May-97 - English

8 Automating Extracting
- Just about any simple algorithm can get "good" results for coarse tasks:
  - Pull out "important" phrases
  - Find "meaningfully" related words
  - Create an extract from the document
- Major problem: evaluation
  - Need to define the goal or purpose
  - Human extractors agree on only about 25% of sentences (Rath et al. 61)
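
To make "just about any simple algorithm" concrete, here is a minimal frequency-based sentence extractor in the spirit of Luhn's classic approach. This is an illustrative sketch, not the lecture's or Kupiec et al.'s method; the stopword list and scoring are made up for the example.

```python
# Minimal frequency-based sentence extractor (illustrative sketch).
# Scores each sentence by the summed corpus frequency of its content
# words, then returns the top-scoring sentences in document order.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "that"}

def extract(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)
    # Rank sentence indices by descending score (sum of word frequencies).
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w]
                           for w in re.findall(r"[a-z]+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

Even this crude scorer illustrates the slide's point: for coarse tasks (picking a few topical sentences), simple word statistics already produce plausible-looking extracts, which is exactly why evaluation is the hard part.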

9 Summary of Summary Paper
Kupiec, Pedersen, and Chen, SIGIR 95
- To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original.
- This paper focuses on document extracts, a particular kind of computed document summary.
- Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.
- The trends in our results are in agreement with those of Edmundson, who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus.
- We have developed a trainable summarizer program that is grounded in a solid statistical framework.

10 Text Pre-Processing
The following steps are typical:
- Tokenization
- Morphological analysis (stemming): inflectional, derivational, or crude IR methods
- Part-of-speech tagging: I/Pro see/VP Pathfinder/PN on/P Mars/PN ...
- Phrase boundary identification: [Subj I] [VP saw] [DO Pathfinder] [PP on Mars] [PP with a telescope].
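
The first two steps above can be sketched in a few lines. These are hypothetical helper functions written for this note, not the lecture's actual pipeline; the suffix list stands in for the "crude IR methods" the slide mentions (real morphological analysis is far more careful).

```python
# Sketch of tokenization and crude suffix-stripping stemming.
import re

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Strip a few common English suffixes -- a crude stand-in for real
    inflectional/derivational morphological analysis."""
    for suffix in ("ation", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token
```

Note the trade-off crude stemmers make: "summarization" and "summarizing" both reduce toward a shared stem, but unrelated words can collide, which is acceptable for IR-style term matching and unacceptable for tagging or parsing.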

11 Extracting
- Example: sentence extraction from a single document (Kupiec et al.)
- Start with a training set of manually-generated extracts (this allows for objective evaluation)
- Create heuristics
- Train a classification function to estimate the probability that a sentence is included in the extract
- 42% of assigned sentences actually belonged in the extracts

12 Heuristic Feature Selection
- Sentence length cut-off
- Key fixed phrases
  - "this letter", "in conclusion"
  - phrases appearing right after the conclusions section
- Position
  - of paragraph in document (first & last 10 paragraphs)
  - of sentence within paragraph (first, last, median)
- Thematic words
  - most frequent content words
  - see the Choueka article in the reader
- Uppercase words (proper names)
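
The features above can each be expressed as a boolean test on a sentence. The sketch below is illustrative only: the length threshold, phrase list, and thematic-word count are guesses for the example, not the paper's exact values, and `sentence_features` is a hypothetical helper name.

```python
# Illustrative boolean versions of the heuristic features.
def sentence_features(sentence, para_index, n_paras,
                      sent_index, n_sents_in_para, thematic_words):
    words = sentence.split()
    return {
        # Sentence length cut-off: longer than a short threshold.
        "length_cutoff": len(words) > 5,
        # Key fixed phrases anywhere in the sentence.
        "fixed_phrase": any(p in sentence.lower()
                            for p in ("this letter", "in conclusion")),
        # Paragraph in the first or last 10 paragraphs of the document.
        "paragraph_position": para_index < 10 or para_index >= n_paras - 10,
        # First or last sentence of its paragraph.
        "sentence_position": sent_index in (0, n_sents_in_para - 1),
        # Contains at least two of the most frequent content words.
        "thematic": sum(w.lower() in thematic_words for w in words) >= 2,
        # Capitalized word after sentence start: a rough proper-name test.
        "uppercase": any(w[0].isupper() and not w.isupper()
                         for w in words[1:]),
    }
```

Each feature is deliberately cheap to compute; the interesting part is not any single test but how their evidence is combined, which the next slides take up.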

13 Classifier Function
- For each sentence S, compute the probability that S will appear in extract E. In Kupiec et al.'s naive Bayes formulation, with features F1..Fk assumed independent:
  P(S in E | F1..Fk) ~ P(S in E) * [P(F1 | S in E) * ... * P(Fk | S in E)] / [P(F1) * ... * P(Fk)]
- If a feature appears in sentences chosen to be in extracts, and not in other sentences, that feature is useful.
- If a sentence contains many of the useful features, that sentence is likely to be chosen for the extract.

14 Classifier Function
Compute:
- How likely is each feature to occur anywhere in any document?
- How likely is each feature to occur in a sentence that ends up in an extract?
- Combine the feature scores for a sentence to compute the probability that the sentence is included in the extract.
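
The combination step above can be sketched directly: multiply, for each feature present, the ratio of its probability in extract sentences to its probability overall, scaled by the prior. The probability tables in the example are made-up numbers, not estimates from the paper's corpus.

```python
# Naive-Bayes-style combination of feature evidence for one sentence.
def extract_score(features_present, p_feat_given_extract, p_feat, prior):
    """Unnormalized P(S in E | F1..Fk) under the feature-independence
    assumption: prior * product of P(Fj | S in E) / P(Fj)."""
    score = prior
    for f in features_present:
        score *= p_feat_given_extract[f] / p_feat[f]
    return score
```

A feature whose ratio P(Fj | S in E) / P(Fj) is well above 1 (common in extract sentences, rare overall) pushes a sentence's score up, which is exactly the "useful feature" intuition on slide 13; to build an extract, one scores every sentence this way and keeps the top-scoring ones.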

15 (slide content not transcribed)

16 Evaluation
- Corpus of manual extracts:
  - engineering journal articles, average length 86 sentences
  - 188 document/extract pairs from 21 journals
- Statistics for manual extracts in the training set:

  Direct sentence matches       451   79%
  Direct joins                   19    3%
  Unmatchable sentences          50    9%
  Incomplete single sentences    21    4%
  Incomplete joins               27    5%
  Total extract sentences       568

  (Join: a sentence combined with other material)

17 Evaluation
- Training set vs. testing set
  - must keep them separate for legitimate results
  - danger to avoid: over-fitting

18 Evaluation
- Baseline: use only sentences from the beginning of the document, with the length cut-off
  - 121 of the original extracted sentences selected by the algorithm (24%)
- Results using the classifier (the algorithm assigns the same number of sentences as the manual abstractors did):
  - 195 direct matches + 6 direct joins = 35% correct
  - when extracts are larger (25% of the size of the original text), the algorithm selects 84% of the extracted sentences
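
The scoring used above (fraction of manually chosen sentences that the algorithm also chose) is simple to state precisely. A minimal sketch with a hypothetical `match_rate` helper and toy sentence IDs, counting only direct matches:

```python
# Fraction of manually extracted sentences also chosen by the algorithm.
def match_rate(auto_ids, manual_ids):
    return len(set(auto_ids) & set(manual_ids)) / len(manual_ids)
```

With this measure, the slide's numbers read as: the classifier recovered (195 + 6) / 568 of the manual extract sentences, about 35%, versus 24% for the leading-sentences baseline, so even that modest-looking score is a substantial gain over position alone.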

19 Evaluation
Performance for each feature:

  Feature           A           B
  Paragraph         163 (33%)   163 (33%)
  Fixed phrases     145 (29%)   209 (42%)
  Length cut-off    121 (24%)   217 (44%)
  Thematic word     101 (20%)   209 (42%)
  Uppercase word    211 (42%)   211 (42%)

A: Sentence-level performance for each individual feature alone. If there are many sentences with the same feature, they are put in the abstract in order of appearance within the document.
B: How performance varies as features are combined cumulatively from the top down.

20 Thought Points
- How well might this work on a different kind of collection?
- How to summarize a collection?

