Untangling Text Data Mining. Stanford Digital Libraries Seminar, May 11, 1998. Marti Hearst, UC Berkeley SIMS. www.sims.berkeley.edu/~hearst

Presentation transcript:

Slide 1. Untangling Text Data Mining
Stanford Digital Libraries Seminar, May 11, 1998
Marti Hearst, UC Berkeley SIMS

Slide 2. Caveat Emptor
I do information access. I do not do text data mining (yet). This talk is an attempt to explore the relationship between the two.

Slide 3. Talk Outline
• Definitions
  – What is Data Mining?
  – What is Information Access?
  – What is Text Data Mining?
• Empirical Computational Linguistics
• Real text data mining tasks
• Conclusions and Future Directions

Slide 4. The Knowledge Discovery from Data Process (KDD)
KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)
Note: data mining is just one step in the process.

Slide 5. What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
• Fitting models to or determining patterns from very large datasets.
• A “regime” which enables people to interact effectively with massive data stores.
• Deriving new information from data.
  – finding patterns across large datasets
  – discovering heretofore unknown information

Slide 6. What is Data Mining?
• Potential point of confusion:
  – The extracting ore from rock metaphor does not really apply to the practice of data mining
  – If it did, then standard database queries would fit under the rubric of data mining
    · Find all employee records in which employee earns $300/month less than their managers
  – In practice, DM refers to:
    · finding patterns across large datasets
    · discovering heretofore unknown information

Slide 7. Another definition of DM
• What SQL currently cannot do.
  – A standard query does not infer new information
    · It retrieves a subset of what is already present and known.
    · SQL originally intended for business apps
  – DM requires sophisticated aggregate queries
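The employee query mentioned earlier in the talk can be written as an ordinary self-join, which illustrates the point that a standard query only retrieves rows already present. The sketch below is a minimal illustration, not something from the slides: the table layout, the sample rows, and the reading of "$300/month less" as "at least 300 less" are all assumptions.

```python
import sqlite3

# Hypothetical sample data: (id, name, monthly_salary, manager_id).
ROWS = [
    (1, "Ana", 5000, None),
    (2, "Ben", 4800, 1),
    (3, "Cho", 4500, 1),
    (4, "Dee", 4300, 3),
]

def underpaid_employees(rows, gap=300):
    """Return names of employees earning at least `gap` less than their manager."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, manager_id INTEGER)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    # A plain self-join: it selects a subset of stored facts, nothing is inferred.
    query = """
        SELECT e.name
        FROM employee AS e
        JOIN employee AS m ON e.manager_id = m.id
        WHERE m.salary - e.salary >= ?
        ORDER BY e.name
    """
    names = [name for (name,) in conn.execute(query, (gap,))]
    conn.close()
    return names

print(underpaid_employees(ROWS))
```

The result is a filtered view of what was inserted; discovering, say, which attributes predict a pay gap would need analysis beyond this kind of query.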

Slide 8. Why Data Mining?
• Because the data is there.
• Because current DBMS technology does not support data analysis.
• Because
  – larger disks
  – faster cpus
  – high-powered visualization
  – networked information
  are becoming widely available.

Slide 9. DM Touchstone Applications (CACM 39 (11) Special Issue)
• Finding patterns across data sets:
  – Reports on changes in retail sales to improve sales
  – Patterns of sizes of TV audiences for marketing
  – Patterns in NBA play to alter, and so improve, performance
  – Deviations in standard phone calling behavior to detect fraud for marketing

Slide 10. DM Touchstone Applications (CACM 39 (11) Special Issue)
• Separating signal from noise:
  – Classifying faint astronomical objects
  – Finding genes within DNA sequences
  – Discovering novel tectonic activity

Slide 11. Data Mining Methods (Fayyad & Uthurusamy 96, Fayyad 97)
• Major classes of DM methods:
  – Predictive modeling (classification)
  – Segmentation (clustering)
  – Finding associations (relations between attributes, link analysis)
  – Dependency Modeling (graphical models, density estimation)
  – Deviation detection/modeling
    · visualization

Slide 12. What’s new here?
• Sounds like statistical modeling or machine learning.
• Main Difference: scale and availability (Fayyad 97)
  – Datasets too large for classical analysis
  – Increased opportunity for access
    · end user is often not a statistician
  – New issues in sampling

Slide 13. Statistician’s Viewpoint (David Hand 97)
• What’s new about DM?
  – Returns statisticians to their empirical roots
    · exploration rather than modeling
  – Hypothesis testing may be irrelevant given the large data sizes
    · everything is significant
  – Data was collected for some other purpose than what it is being analyzed for now
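The remark that "everything is significant" at large data sizes can be made concrete with a small worked calculation. The sketch below assumes a one-sample z-test with known standard deviation, which is a choice made here for illustration and is not taken from the talk; the effect size and sample sizes are likewise hypothetical.

```python
import math

def two_sided_p_value(effect, sigma, n):
    """Two-sided p-value of a z-test for a mean shift `effect` with known `sigma`."""
    z = effect * math.sqrt(n) / sigma
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny shift (1% of a standard deviation) at two sample sizes.
small_sample = two_sided_p_value(effect=0.01, sigma=1.0, n=100)
large_sample = two_sided_p_value(effect=0.01, sigma=1.0, n=1_000_000)

print(f"n=100:       p = {small_sample:.3f}")
print(f"n=1,000,000: p = {large_sample:.3g}")
```

The shift is negligible in practical terms in both cases, yet only the sample size decides whether the test calls it significant, which is one way to read the slide's point.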

Slide 14. The Statistician’s Viewpoint (David Hand 97)
Statistics vs. Machine Learning
Statistics:
• conservative
• rigorous
• abstract
• idealized
Machine Learning:
• adventurous
• engineering
• practical
• real solutions

Slide 15. Talk Outline
• Definitions
  – What is Data Mining?
  – What is Information Access?
  – What is Text Data Mining?
• Empirical Computational Linguistics
• Real text data mining tasks
• Conclusions and Future Directions

Slide 16. What is Information Access?
• Goal: Build systems that help users in the discovery, creation, use, synthesis and understanding of information.

Slide 17. Information Access (Information Retrieval more broadly construed)
• Problem:
  – Huge amounts of online textual information
• Goal:
  – Build systems to help people discover, create, use, re-use, and understand information
• Approach:
  – Leverage off of users’ smarts
  – Combine stats, text analysis, user interfaces

Slide 18. Information Retrieval: A restricted form of Information Access
• The system has available only pre-existing, “canned” text passages.
• Its response is limited to selecting from these passages and presenting them to the user.
• It must select, say, 10 or 20 passages out of millions!

Slide 19. Needles in Haystacks
• The emphasis in IR (and standard DB) is in answering ad hoc queries.

Slide 20. IA vs. KDD Process

Slide 21. IA vs. KDD Process
Query/Information Need

Slide 22. IA vs. KDD Process
Query/Information Need
Match query against transformed data
Show results ranked in relevance order
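The two steps named on this slide, matching a query against transformed data and showing results in relevance order, can be sketched with a toy ranking function. The TF-IDF style weighting used here is an assumption chosen for illustration (the slides do not name a scoring scheme), and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-collection.
DOCS = [
    "magnesium and migraine",
    "stress and migraine headaches",
    "basketball scores",
]

def tokenize(text):
    """Lowercase whitespace tokenization; the 'transformed data' of this sketch."""
    return text.lower().split()

def rank(query, docs):
    """Return (doc_index, score) pairs with a nonzero score, best match first."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    scored = []
    for i, counts in enumerate(doc_counts):
        # Term frequency times a smoothed inverse document frequency.
        score = sum(
            counts[t] * math.log(1 + n / doc_freq[t])
            for t in set(tokenize(query))
            if t in counts
        )
        if score > 0:
            scored.append((i, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

for index, score in rank("magnesium migraine", DOCS):
    print(f"{score:.2f}  {DOCS[index]}")
```

Note that the output is still a selection from existing documents, ordered by a score; nothing new is derived from the collection.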

Slide 23. Structure of an IR System
[Diagram: Information Storage and Retrieval System. Labels: Query Line; Storage Line; Interest profiles & Queries; Documents & data; Formulating query in terms of descriptors; Indexing (Descriptive and Subject); Storage of profiles; Storage of Documents; Store1: Profiles/Search requests; Store2: Document representations; Comparison/Matching; Potentially Relevant Documents. Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]

Slide 24. Talk Outline
• Definitions
  – What is Data Mining?
  – What is Information Access?
  – What is Text Data Mining?
• Empirical Computational Linguistics
• Real text data mining tasks
• Conclusions and Future Directions

Slide 25. What is Text Data Mining?
• Peoples’ first thought:
  – Make it easier to find things on the Web.
  – But this is information retrieval!
• The metaphor of extracting ore from rock:
  – Does make sense for extracting documents of interest from a huge pile.
  – But does not reflect notions of DM in practice:
    · finding patterns across large collections
    · discovering heretofore unknown information

Slide 26. Real Text DM
• What would finding a pattern across a large text collection really look like?

Slide 27. From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life,
(William Gates, agitator, leader)
Bill Gates + MS-DOS in the Bible!

Slide 28. From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life,

Slide 29. Real Text DM
• The point:
  – Discovering heretofore unknown information is not what we usually do with text.
  – (If it weren’t known, it could not have been written by someone!)
• However:
  – There is a field whose goal is to learn about patterns in text for its own sake...

Slide 30. Observation
Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

Slide 31. Talk Outline
• Definitions
• Empirical Computational Linguistics
  – Special and important properties of text
  – Relationship to TDM
  – Examples of TDM as CL
• Real text data mining tasks
• Conclusions and Future Directions

Slide 32. Recent Trends in NLP (CL)
• Previously: AI, full understanding
• Current: Corpus-based, Statistical
  – ACL proceedings: from 3 corpus-based papers in 1991 to at least half in 1996
  – Stat NLP was tried long ago (Z. Harris)
• Simple Often Wins
  – Echoes results in IR
• Interesting direction: Statistics + Linguistics (Klavans & Resnik 96)

Slide 33. Text Analysis (CL) Tasks
• Word Sense Disambiguation
• Automatic Lexicon Augmentation
• Discourse Analysis
• Parsing
  – Phrase Identification
  – Phrase Attachments
  – Predicate/Argument Structure
  – Scope of Conjunctions...

Slide 34. Why Text is Tough
– Abstract concepts difficult to represent (AI-Complete)
– “Countless” combinations of subtle, abstract relationships among concepts
– Many ways to represent similar concepts
  · space ship, flying saucer, UFO, figment of imagination
– Concepts are difficult to visualize
– High dimensionality
  · Tens or hundreds of thousands of features

Slide 35. Why Text is Tough
• Language is:
  – ambiguous (many different meanings for the same words and phrases)
  – different combinations imply different meanings

Slide 36. Why Text is Tough
• I saw Pathfinder on Mars with a telescope.
• Pathfinder photographed Mars.
• The Pathfinder photograph mars our perception of a lifeless planet.
• The Pathfinder photograph from Ford has arrived.
• The Pathfinder forded the river without marring its paint job.

Slide 37. Why Text is Easy
• Highly redundant in bulk
• Just about any simple algorithm can get “good” results for coarse tasks
  – Pull out “important” phrases
  – Find “meaningfully” related words
  – Create summary from document
  – Major problem: Evaluation

Slide 38. Stupid Text Tricks
– Coarse IR, Clustering
  · Don’t need dimension reduction (except stopwords)
  · Don’t need morphological analysis
  · Don’t need word sense disambiguation
– Partial parsing:
  · Simple, greedy transformation rules
  · Cascading finite state machines
– Categorization
  · Assume independence
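"Categorization: assume independence" is commonly realized as a naive Bayes classifier, where each word is treated as independent of the others given the category. The sketch below is a minimal multinomial naive Bayes with add-one smoothing; the choice of this particular variant, the category names, and the training sentences are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples.
TRAINING = [
    ("magnesium migraine drug", "medicine"),
    ("patient drug trial", "medicine"),
    ("team wins game", "sports"),
    ("player scores game", "sports"),
]

def train(labeled_docs):
    """Collect per-category word counts, category counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Pick the category with the highest log prior plus summed log word likelihoods."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for token in text.lower().split():
            # Independence assumption: each word contributes its own factor.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

MODEL = train(TRAINING)
print(classify("drug trial for migraine", MODEL))
```

The independence assumption is plainly false for language, yet this kind of classifier often works acceptably on coarse tasks, in line with the slide's theme.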

Slide 39. Text “Data Cleaning”
Pre-process text as follows:
• Tokenization
• Morphological Analysis (Stemming)
  – inflectional, derivational, or crude IR methods
• Part-of-Speech Tagging
  – I/Pro see/VP Pathfinder/PN on/P Mars/PN...
• Phrase Boundary Identification
  – [Subj I] [VP saw] [DO Pathfinder] [PP on Mars] [PP with a telescope].
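The first two pre-processing steps can be sketched in a few lines. This is a deliberately crude suffix stripper of the "crude IR methods" kind, written for illustration; the suffix list and minimum stem length are assumptions, and it is not a real stemming algorithm such as Porter's.

```python
import re

# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token, min_stem=3):
    """Strip the first matching suffix, if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Pathfinder photographed Mars.")
print(tokens)
print([crude_stem(t) for t in tokens])
```

The stemmer turns "photographed" into "photograph" but also mangles the proper noun "Mars" into "mar", a small reminder of the ambiguity problems the earlier "Why Text is Tough" slides describe.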

Slide 40. CCL Methodology
• Describe here the standard methodology for corpus-based computational linguistics algorithms

Slide 41. CCL Examples
• Place here examples of the kinds of output generated for computational linguistics applications

Slide 42. Inducing MetaData for Documents
• Assigning bibliographic metadata
  – author, genre, time, region
• Subject/Topic assignments
  – category labels: MeSH, LoC, ACM keywords
• Information Extraction (MUC)
  – MUC: terrorist incidents
    · who did the bombing
    · where did the bombing take place
    · what weapon(s) were used
    · when did it happen

Slide 43. Inducing MetaData for Collections
• Indexes
• Hierarchical Categorization
• Overviews of Connectivity
  – hyperlinks
  – co-citation links
• Overviews of Subject Matter
  – 2D
  – 3D
  – dynamic

Slide 44. A Main Point:
• Empirical CL is usually not helpful for improving Information Access.
• However, it can produce
  – metadata
  – overviews
  – associations
  that are indirectly useful for IA.

Slide 45. Talk Outline
• Definitions
• Empirical Computational Linguistics
• Real text data mining tasks
  – TDM not using text
  – TDM using text
• Conclusions and Future Directions

Slide 46. TDM using Metadata (instead of Text) (Dagan, Feldman, and Hirsh, SDAIR ’96)
– Data: Reuters newswire (22,000 articles, late 1980s)
  · Categories: commodities, time, countries, people, and topic
– Goals:
  · distributions of categories across time (trends)
  · distributions of categories between collections
  · category co-occurrence (e.g., topic|country)
– Interactive Interface:
  · lists, pie charts, 2D line plots
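Two of the goals listed here, category trends over time and category co-occurrence, reduce to counting over article metadata. The sketch below shows that counting step only; the record layout and the sample articles are hypothetical, and it is not a reconstruction of the cited system.

```python
from collections import Counter
from itertools import combinations

# Hypothetical metadata records: one dict per article, no article text needed.
ARTICLES = [
    {"year": 1987, "categories": {"coffee", "brazil"}},
    {"year": 1987, "categories": {"coffee", "colombia"}},
    {"year": 1988, "categories": {"coffee", "brazil"}},
]

def category_trends(articles):
    """Count how often each category appears in each year."""
    trends = Counter()
    for article in articles:
        for category in article["categories"]:
            trends[(article["year"], category)] += 1
    return trends

def cooccurrence(articles):
    """Count how often each unordered pair of categories labels the same article."""
    pairs = Counter()
    for article in articles:
        for a, b in combinations(sorted(article["categories"]), 2):
            pairs[(a, b)] += 1
    return pairs

print(category_trends(ARTICLES)[(1987, "coffee")])
print(cooccurrence(ARTICLES).most_common(1))
```

Counts like these are the raw material for the lists, pie charts, and line plots mentioned under the interactive interface.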

Slide 47. Combining Text with Metadata (images, hyperlinks)
• Examples
  – Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  – Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
  – Images + Text to improve image search
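The idea of using links to find "authority pages" can be illustrated with a hub and authority iteration in the style associated with Kleinberg's work. The sketch below uses link structure only (no text), runs on a tiny hypothetical graph, and is an assumption-laden illustration; it makes no claim to match either system named on the slide.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["c"], "b": ["c", "d"], "c": [], "d": []}

def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(links, iterations=20):
    """Return (hub, authority) score dicts after a fixed number of iterations."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page's authority is the summed hub score of the pages linking to it.
        auth = _normalize({
            node: sum(hub[src] for src, targets in links.items() if node in targets)
            for node in nodes
        })
        # A page's hub score is the summed authority of the pages it links to.
        hub = _normalize({
            node: sum(auth[t] for t in links.get(node, ()))
            for node in nodes
        })
    return hub, auth

hub_scores, auth_scores = hits(LINKS)
print(max(auth_scores, key=auth_scores.get), max(hub_scores, key=hub_scores.get))
```

In this toy graph the page with the most incoming links from good hubs ends up as the top authority; combining such scores with text matching is the "Text + Links" step the slide refers to.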

Slide 48. Talk Outline
• Definitions
• The New Empirical Computational Linguistics
• Real text data mining tasks
  – TDM not using text
  – TDM using text
• Conclusions and Future Directions

Slide 49. Ore-Filled Text Collections
• Newspaper/Newswire
• Medical Articles
  – Patterns associated with symptoms, drugs
• Patent Law
  – Recent Study Justifying Scientific Funding
  – Hypotheses for New Inventions
• “Corporate Memory”

Slide 50. True Text Data Mining: Don Swanson’s Medical Work
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• find causal links among titles
  – symptoms
  – drugs
  – results

Slide 51. Swanson Example (1991)
• Problem: Migraine headaches (M)
  – stress associated with M
  – stress leads to loss of magnesium
  – calcium channel blockers prevent some M
  – magnesium is a natural calcium channel blocker
  – spreading cortical depression (SCD) implicated in M
  – high levels of magnesium inhibit SCD
  – M patients have high platelet aggregability
  – magnesium can suppress platelet aggregability
• All extracted from medical journal titles
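The chain of evidence in this example has a simple shape: some titles connect migraine to an intermediate concept, and other titles connect that concept to magnesium. The sketch below shows only that intermediate-term lookup over hand-tagged titles. The term sets are hypothetical stand-ins modeled on the bullets above, and the function is an illustration, not Swanson's actual procedure, which the talk describes as only partially automated and dependent on medical expertise.

```python
# Hypothetical titles, each reduced to a set of hand-assigned concept terms.
TITLES = [
    {"migraine", "stress"},
    {"stress", "magnesium"},
    {"migraine", "calcium channel blocker"},
    {"magnesium", "calcium channel blocker"},
    {"migraine", "platelet aggregability"},
    {"magnesium", "platelet aggregability"},
    {"migraine", "aura"},
]

def linking_terms(titles, a_term, c_term):
    """Return terms that co-occur with `a_term` in some title and `c_term` in another."""
    with_a = set().union(*(t for t in titles if a_term in t))
    with_c = set().union(*(t for t in titles if c_term in t))
    return sorted((with_a & with_c) - {a_term, c_term})

print(linking_terms(TITLES, "migraine", "magnesium"))
```

A lookup like this only proposes candidate links; judging which ones are causally plausible is the part that required expertise.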

Slide 52. Swanson’s TDM
• Two of his hypotheses have received some experimental verification.
• His technique
  – Only partially automated
  – Required medical expertise
• Few people are working on this.

Slide 53. Text Collection Overviews
• Clusters/Unsupervised Overviews
  – Chalmers: BEAD, Networks of Words
  – Lin, Chen: Kohonen Feature Maps
  – Xerox PARC: Local Clusters
  – Pacific Northwest: ThemeScapes
  – Rennison: Galaxy of News

Slide 54. Text Overviews
– Huge 2D maps may be inappropriate focus for information retrieval
  · can’t see what documents are about
  · documents forced into one position in semantic space
  · space difficult to browse for IR purposes
– Perhaps more suited for pattern discovery
  · problem: often only one view on the space

Slide 55. Talk Outline
• Definitions
• The New Empirical Computational Linguistics
• Real text data mining tasks
  – TDM not using text
  – TDM using text
• Conclusions and Future Directions

Slide 56. Conclusions
• Currently, what might be construed as Text Data Mining is really Computational Linguistics
  – Text is tricky to process, but rich and abundant (now)
  – There are many CL tools available
• Data Mining directly from text
  – tells us about language
  – produces meta-information that may be useful for information access

Slide 57. Conclusions, continued
• Information Access != Text Data Mining
  – IA = finding needle in haystack
  – TDM = finding patterns or discovering new information
• However, Information Access may potentially be served by Text Data Mining techniques:
  – automated metadata assignment
  – collection overviews
• The synthesis of ideas from TDM and IA:
  – Perhaps a new field of exploratory data analysis over text!

Slide 58. Promising Research Directions
• Text Data Mining Problems:
  – Patterns within sets of documents:
    · What is the latest in this field?
    · How is this field related to that field?
  – Chains of evidence embedded in text:
    · What drugs have been tested for this symptom?
    · What effects did this funding have on that field?
  – Human use of information over time:
    · How does information diffuse across the web?

Slide 59. Needed from Systems
• Support for linking chains of associations
• Support for combined structured and unstructured data
• Support for combining disparate collections