©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610) 270-6851.



Document Summarization
• Document summarization: provide a meaningful summary for each document.
• Examples:
  – Search tool returns "context"
  – Monthly progress reports from multiple projects
  – Summaries of news articles on the human genome
• Often part of a document retrieval system, to enable the user to judge documents better
• Surprisingly hard to make sophisticated
• Surprisingly easy to make effective

Document Summarization: How
Three general approaches:
• Extract predefined summary.
  – Useful in highly structured environments where you can specify the format. Typically very good summaries.
• Capture in abstract representation, generate summary.
  – Useful in well-defined domains with clear-cut information needs.
• Extract representative sentences/clauses.
  – Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets the "general feel".

Extract Predefined Summary
• Documents have a well-defined format.
• Format includes a summary or abstract explicitly written by the document author.
• Text mining may reorganize, regroup, or restructure summaries.
• Example:
  – People working on multiple projects write monthly reports based on what they have done, one sentence per project.
  – Reporting system collects person-level reports and reorganizes them into project-level reports.

Extract Predefined Summary: Methods
• Extraction uses some or all of:
  – NLP for document parsing/chunking (finding the abstract)
  – Standard computer science: database retrieval, string processing, etc.
• Reorganizing may be done using:
  – Explicit fields specified by the author
  – Keywords searched for in documents
  – Business rules which capture knowledge about who is working on what tasks and projects
• Grouping can shade into document classification when summaries are long or the match to categories is ill-defined.
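The reorganizing step can be sketched with ordinary data structures. The following is a minimal illustration, assuming each author supplies an explicit project field with one sentence per project; the names `person_reports` and `regroup_by_project` and the sample data are hypothetical, not part of any particular reporting system.

```python
from collections import defaultdict

# Hypothetical person-level monthly reports: one sentence per project,
# keyed by an explicit project field supplied by the author.
person_reports = {
    "Ana": {
        "Indexer": "Finished the stemming module.",
        "Portal": "Fixed login bugs.",
    },
    "Ben": {
        "Indexer": "Benchmarked query latency.",
    },
}


def regroup_by_project(reports):
    """Turn {person: {project: sentence}} into {project: [(person, sentence)]}."""
    by_project = defaultdict(list)
    for person in sorted(reports):  # stable, alphabetical author order
        for project, sentence in reports[person].items():
            by_project[project].append((person, sentence))
    return dict(by_project)


project_reports = regroup_by_project(person_reports)

# Print one project-level report per project.
for project in sorted(project_reports):
    print(project)
    for person, sentence in project_reports[project]:
        print(f"  {person}: {sentence}")
```

When authors do not supply explicit fields, the same regrouping could be driven by keyword matches or business rules instead, at the cost of occasional misassignment.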

Extracting Predefined Summaries: Advantages and Disadvantages
• Advantages
  – Summaries reflect the intent of the author.
  – If part of an overall reporting system, can actually make reporting simpler for the author.
  – Incremental effort for the author is not large.
• Disadvantages
  – Incremental effort for the author is not zero either.
  – Only feasible in a structured situation where the requirement can be defined ahead of time.
  – Can't be used to summarize a group of documents.
  – Not all authors write good summaries.

Capture and Generate
• Documents can have arbitrary format.
• Knowledge needed is well-defined.
• Often the information need is for summarizations across multiple documents.
• Example:
  – Summarizing restaurant reviews. Take newspaper articles and produce price range, kind of food, atmosphere, quality, service.

Capture and Generate: Methods
• State of the art:
  – Create a "template" or "frame"
    · Represent the knowledge you want to capture
  – Extract information to fill in the frame
    · Standard information extraction problem
    · Typically relatively large frames with relatively few relations; mostly facts
  – Generate based on the template
    · Relatively simple "fill-in-the-blank"
    · More complex based on parse tree
• Still basically research: parse the entire document into a parse tree tied to a rich semantic net; apply rules to trim the tree; generate continuous narrative.
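A toy version of the frame approach, using the restaurant-review example, might look like the sketch below. It is only an illustration under strong assumptions: the three slots, the regular expressions, the sample review, and the names `fill_frame`, `generate_summary`, and `TEMPLATE` are all hypothetical, and a real information extraction component would be far more robust than a few patterns.

```python
import re

# Hypothetical frame: the slots we want to capture, each with a toy pattern.
FRAME_PATTERNS = {
    "cuisine": re.compile(r"\b(Italian|Thai|Mexican|French)\b", re.IGNORECASE),
    "price": re.compile(r"\$(\d+)\s*(?:-|to)\s*\$?(\d+)"),
    "service": re.compile(r"\bservice (?:was|is) (\w+)", re.IGNORECASE),
}

# Simple "fill-in-the-blank" generation template.
TEMPLATE = "{name}: {cuisine} food, entrees {price}, {service} service."


def fill_frame(text):
    """Extract slot values from free text; a slot is None if nothing matches."""
    frame = {}
    m = FRAME_PATTERNS["cuisine"].search(text)
    frame["cuisine"] = m.group(1).capitalize() if m else None
    m = FRAME_PATTERNS["price"].search(text)
    frame["price"] = f"${m.group(1)}-${m.group(2)}" if m else None
    m = FRAME_PATTERNS["service"].search(text)
    frame["service"] = m.group(1).lower() if m else None
    return frame


def generate_summary(name, frame):
    """Fill the template, writing 'unknown' for any empty slot."""
    filled = {slot: value if value is not None else "unknown"
              for slot, value in frame.items()}
    return TEMPLATE.format(name=name, **filled)


review = ("Luigi's offers hearty Italian dishes. Entrees run $12 to $20, "
          "and the service was attentive.")
frame = fill_frame(review)
summary = generate_summary("Luigi's", frame)
print(summary)
```

Because the frame is independent of any one document, frames filled from several reviews of the same restaurant could be merged before generation, which is how this approach lends itself to multi-document summaries.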

Capture and Generate: Advantages and Disadvantages
• Advantages
  – Produces very focused summaries.
  – Can readily incorporate multiple documents.
  – Not dependent on authors.
• Disadvantages
  – Assumes the information need is clearly defined.
  – Development time for the information extraction component is significant.
  – Document parsing is slow; probably not real-time.
• Comment:
  – Makes no attempt to capture the author's intent.

Extract Representative Sentence
• Document format can be arbitrary.
• Document content can also be arbitrary; the information need is not clear-cut.
• Summarization consists of text extracted directly from the document.
• Examples:
  – Context returned by Google for each hit
  – Google News summaries

Find Representative Sentences: Method
• Typically, choose representative individual terms, then broaden to capture the sentences containing those terms. The more terms a sentence contains, the more important the sentence.
  – If summarizing in response to a search or other information request, the search terms are the representative terms.
  – If there is no prior query, use TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.
• May add a layer of rules using position and some specific phrases such as "In summary,".
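The term-then-sentence procedure can be sketched in a few lines. This is a minimal illustration, not a production summarizer: the tokenizer and sentence splitter are deliberately naive, the IDF is computed against a small hypothetical background collection, and the names `tfidf_terms` and `best_sentences` are invented for this example. Position rules and cue phrases such as "In summary," are not modeled.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_terms(document, background, k=3):
    """Pick the k terms with the highest TF*IDF; `background` supplies the IDF."""
    tf = Counter(tokenize(document))
    docs = [set(tokenize(d)) for d in background] + [set(tf)]
    n = len(docs)

    def idf(term):
        df = sum(term in d for d in docs)  # df >= 1: term occurs in document
        return math.log(n / df)

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(tf, key=lambda t: (-tf[t] * idf(t), t))[:k]


def best_sentences(document, terms, n=1):
    """Return the n sentences containing the most distinct terms, in document order."""
    wanted = {t.lower() for t in terms}
    scored = [(len(wanted & set(tokenize(s))), -i, s)
              for i, s in enumerate(split_sentences(document))]
    scored.sort(reverse=True)  # most terms first; earlier sentence wins ties
    top = sorted(scored[:n], key=lambda item: -item[1])
    return [s for _, _, s in top]


doc = ("The genome project released new data. "
       "Researchers say the genome data will speed gene mapping. "
       "Funding was also discussed.")

# Case 1: a prior query exists, so the search terms are the representative terms.
print(best_sentences(doc, ["genome", "mapping"]))

# Case 2: no query, so choose terms by TF*IDF against a background collection.
background = ["The weather was mild today.",
              "The council discussed funding for new roads."]
terms = tfidf_terms(doc, background, k=2)
print(terms, best_sentences(doc, terms))
```

Counting distinct terms, rather than total occurrences, keeps one heavily repeated word from dominating the ranking; either choice is defensible for a sketch like this.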

Find Representative Sentences: Advantages and Disadvantages
• Advantages
  – Can be applied anywhere.
  – Relatively fast (compared to a full parse).
  – Provides a good general idea or feel for content.
  – Can do multiple-document summaries.
• Disadvantages
  – Often choppy or hard to read.
  – Does poorly when the document doesn't contain good summary sentences.
  – Can miss major information.

Summary
• The appropriate approach depends on what is known about the documents, the domain, and the information need.
• All of the major approaches in use provide useful information in a reasonable time frame.
• None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.

Some Useful References
• This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:
• Detailed overview of text summarization history, methods and current state.
• Bibliography, tools, conferences, research. Some good resources.
• Relatively simple overview with some good links.
• Paper on summarization using GATE.