Published by Isaac Barton. Modified over 9 years ago.
1
©2003 Paula Matuszek
CSC 9010: Text Mining Applications
Document Summarization
Dr. Paula Matuszek
Paula_A_Matuszek@glaxosmithkline.com
(610) 270-6851
2
Document Summarization
• Document Summarization
  – Provide a meaningful summary for each document
• Examples:
  – Search tool returns “context”
  – Monthly progress reports from multiple projects
  – Summaries of news articles on the human genome
• Often part of a document retrieval system, to enable the user to judge documents better
• Surprisingly hard to make sophisticated
• Surprisingly easy to make effective
3
Document Summarization -- How
Three general approaches:
• Extract predefined summary.
  – Useful in highly structured environments where you can specify format. Typically very good summaries.
• Capture in abstract representation, generate summary.
  – Useful in well-defined domains with clearcut information needs.
• Extract representative sentences/clauses.
  – Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets "general feel".
4
Extract Predefined Summary
• Documents have a well-defined format.
• Format includes a summary or abstract explicitly written by document author.
• Text mining may reorganize, regroup, restructure summaries.
• Example:
  – People working on multiple projects write monthly reports based on what they have done, one sentence/project.
  – Reporting system collects person-level reports and reorganizes into project-level reports.
5
Extract Predefined Summary: Methods
• Extraction using some or all of
  – NLP for document parsing/chunking (finding abstract)
  – standard computer science: database retrieval, string processing, etc.
• Reorganizing may be done using
  – explicit fields specified by author
  – keywords searched for in documents
  – business rules which capture knowledge about who is working on what tasks and projects
• Grouping can shade into document classification when summaries are long or the match to categories is ill-defined
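The reorganizing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report records, the field names ("author", "project", "text"), and the keyword table are all hypothetical, and the keyword fallback stands in for whatever business rules a real reporting system would apply.

```python
from collections import defaultdict

# Hypothetical person-level monthly reports: one sentence per project.
# The field names are assumptions made for this illustration.
reports = [
    {"author": "Ann", "project": "Alpha", "text": "Finished the data loader."},
    {"author": "Bob", "project": "Alpha", "text": "Reviewed the loader tests."},
    {"author": "Ann", "project": "Beta", "text": "Drafted the search UI."},
    {"author": "Cy", "project": None, "text": "Tuned beta search ranking."},
]

# Hypothetical keyword rules, used only when the explicit field is missing.
PROJECT_KEYWORDS = {"Alpha": ["loader"], "Beta": ["search", "beta"]}


def assign_project(report):
    """Prefer the author's explicit field; fall back to keyword search."""
    if report["project"]:
        return report["project"]
    text = report["text"].lower()
    for project, keywords in PROJECT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return project
    return "Unassigned"


def regroup_by_project(reports):
    """Reorganize person-level sentences into project-level summaries."""
    grouped = defaultdict(list)
    for report in reports:
        line = f'{report["author"]}: {report["text"]}'
        grouped[assign_project(report)].append(line)
    return dict(grouped)


project_reports = regroup_by_project(reports)
for project, lines in sorted(project_reports.items()):
    print(project)
    for line in lines:
        print("  " + line)
```

The explicit field is trusted first because it reflects the author's intent; keyword matching is the weaker signal and is where grouping starts to look like document classification.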
6
Extracting Predefined Summaries: Advantages and Disadvantages
• Advantages
  – Summaries reflect intent of author.
  – If part of an overall reporting system, can actually make it simpler for author.
  – Incremental effort for author not large.
• Disadvantages
  – Incremental effort for author not zero either.
  – Only feasible in structured situation where requirement can be defined ahead of time.
  – Can't be used to summarize a group of documents.
  – Not all authors write good summaries.
7
Capture and Generate
• Documents can have arbitrary format.
• Knowledge needed is well-defined.
• Often information need is for summarizations across multiple documents.
• Example:
  – Summarizing restaurant reviews. Take newspaper articles and produce price range, kind of food, atmosphere, quality, service.
8
Capture and Generate: Methods
• State of the art:
  – Create "template" or "frame"
      – Represent the knowledge you want to capture
  – Extract information to fill in frame
      – Standard information extraction problem
      – Typically relatively large frames with relatively few relations; mostly facts.
  – Generate based on template
      – Relatively simple "fill-in-the-blank"
      – More complex based on parse tree.
• Still basically research: parse entire document into parse tree tied to rich semantic net; apply rules to trim tree; generate continuous narrative.
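The frame-based pipeline (create a frame, extract information to fill it, generate from a template) can be illustrated with a toy Python sketch for the restaurant-review case. Everything here is an assumption made for illustration: the slot names, the regular expressions, and the sample review are hypothetical, and the regexes stand in for a real information extraction component.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestaurantFrame:
    """Frame holding the facts we want; the slots are illustrative."""
    name: Optional[str] = None
    cuisine: Optional[str] = None
    price_range: Optional[str] = None


# Toy extraction patterns, one per slot (assumptions, not a real IE system).
PATTERNS = {
    "name": re.compile(r"\bat ([A-Z][\w']*(?: [A-Z][\w']*)*)"),
    "cuisine": re.compile(r"\b(Italian|Thai|French|Mexican)\b"),
    "price_range": re.compile(r"(\$\d+\s*-\s*\$\d+)"),
}


def fill_frame(text):
    """Extract information from free text to fill in the frame."""
    frame = RestaurantFrame()
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            setattr(frame, slot, match.group(1))
    return frame


def generate_summary(frame):
    """Simple fill-in-the-blank generation from the filled frame."""
    name = frame.name or "This restaurant"
    cuisine = frame.cuisine or "unknown"
    price = frame.price_range or "unknown"
    return f"{name} serves {cuisine} food; prices: {price}."


review = "Dinner at Luigi's Place features Italian classics. Entrees run $15-$25."
frame = fill_frame(review)
print(generate_summary(frame))
```

Because the frame is separate from the source text, frames filled from several reviews could be merged before generation, which is how this approach accommodates multiple documents.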
9
Capture and Generate: Advantages and Disadvantages
• Advantages:
  – Produces very focused summaries.
  – Can readily incorporate multiple documents.
  – Not dependent on authors.
• Disadvantages
  – Assumes information need is clearly defined.
  – Information extraction component development time is significant.
  – Document parsing slow; probably not real-time.
• Comment:
  – Makes no attempt to capture author's intent.
11
Extract Representative Sentence
• Document format can be arbitrary.
• Document content can also be arbitrary; the information need is not clearcut.
• Summarization consists of text extracted directly from document.
• Examples:
  – Context returned by Google for each hit
  – Google News summaries.
12
Find Representative Sentences: Method
• Typically, choose representative individual terms, then broaden to capture the sentence containing the terms. The more terms contained, the more important the sentence.
  – If in response to a search or other information request, the search terms are representative.
  – If no prior query, use TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.
• May add a layer of rules using position, some specific phrases such as "In summary,".
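The method above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the smoothed IDF formula, the cue-phrase list, the naive sentence splitter, and the sample document and background corpus are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion")  # illustrative rule layer


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def representative_terms(document, corpus, top_n=5):
    """Rank the document's terms by TF*IDF against a background corpus."""
    tf = Counter(tokenize(document))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(term in s for s in doc_sets)
        # Smoothed IDF (an assumption; other variants are common).
        scores[term] = count * math.log((1 + n_docs) / (1 + df))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:top_n])


def score_sentence(sentence, terms):
    """Count distinct representative terms; add a bonus for cue phrases."""
    score = len(terms & set(tokenize(sentence)))
    if sentence.lower().startswith(CUE_PHRASES):
        score += 1
    return score


def summarize(document, corpus, query=None, max_sentences=1):
    """Use the query terms if given, otherwise TF*IDF terms."""
    if query:
        terms = set(tokenize(query))
    else:
        terms = representative_terms(document, corpus)
    sentences = split_sentences(document)
    ranked = sorted(sentences, key=lambda s: -score_sentence(s, terms))
    chosen = set(ranked[:max_sentences])
    return [s for s in sentences if s in chosen]  # keep document order


document = (
    "The genome project released new data. "
    "Funding was discussed at length. "
    "In summary, the genome data release doubled the number of mapped genes."
)
corpus = [
    document,
    "Funding was discussed for the new library.",
    "The budget meeting was held at length.",
]
print(summarize(document, corpus, query="genome data"))
print(summarize(document, corpus))
```

Extending `tokenize` to emit word pairs would give the "pairs or n-ary groups" variant; a position rule (for example, a bonus for the first sentence) could be added in `score_sentence` alongside the cue-phrase check.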
13
Find Representative Sentences: Advantages and Disadvantages
• Advantages
  – Can be applied anywhere.
  – Relatively fast (compared to full parse).
  – Provides a good general idea or feel for content.
  – Can do multiple-document summaries.
• Disadvantages
  – Often choppy or hard to read.
  – Does poorly when document doesn't contain good summary sentences.
  – Can miss major information.
14
Summary
• Appropriate approach depends on what is known about the documents, the domain, and the information need.
• All of the major approaches in use provide useful information in a reasonable time frame.
• None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.
15
Some Useful References
• This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:
• http://www.cs.unm.edu/~storm/TSPresent.html. Detailed overview of text summarization history, methods and current state.
• http://www.summarization.com/. Bibliography, tools, conferences, research. Some good resources.
• http://clg.wlv.ac.uk/help/summarisation.php. Relatively simple overview with some good links.
• http://citeseer.nj.nec.com/525002.html. Paper on summarization using GATE.