An Overview of Text Mining
Rebecca Hwa
4/25/2002

References:
M. Hearst, “Untangling Text Data Mining,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
E. Riloff and R. Jones, “Learning Dictionaries for Information Extraction Using Multi-Level Bootstrapping,” in Proceedings of AAAI-99, 1999.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text Classification from Labeled and Unlabeled Documents using EM,” in Machine Learning, 2000.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.

What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2000)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

Text Mining
How does it relate to data mining in general?
How does it relate to computational linguistics?
How does it relate to information retrieval?

                     Finding Patterns              Finding “Nuggets”
                                                   Novel                        Non-Novel
Non-textual data     General data mining           Exploratory data analysis    Database queries
Textual data         Computational linguistics     Text data mining             Information retrieval

Challenges in Text Mining
Data collection is “free text”
– Data is not well-organized: semi-structured or unstructured
– Natural language text contains ambiguities on many levels: lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text typically need annotated training examples: consider bootstrapping techniques

Text Mining Tasks
Exploratory Data Analysis
– Using text to form hypotheses about diseases (Swanson and Smalheiser, 1997)
Information Extraction
– (Semi-)automatically create (domain-specific) knowledge bases, and then use standard data-mining techniques
– Bootstrapping methods (Riloff and Jones, 1999)
Text Classification
– A useful intermediary step for information extraction
– Bootstrapping method using EM (Nigam et al., 2000)

Biomedical Data Exploration (Swanson and Smalheiser, 1997)
Extract pieces of evidence from article titles in the biomedical literature:
– “stress is associated with migraines”
– “stress can lead to loss of magnesium”
– “calcium channel blockers prevent some migraines”
– “magnesium is a natural calcium channel blocker”
Induce a new hypothesis not in the literature by combining the culled text fragments with human medical expertise:
– Magnesium deficiency may play a role in some kinds of migraine headache
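
The chaining idea can be sketched in a few lines: treat each culled fragment as a relation between two concepts, then propose links that are implied transitively but never stated directly. The relation tuples and helper below are illustrative stand-ins, not Swanson and Smalheiser's actual system:

```python
# A minimal sketch of Swanson-style hypothesis chaining: if A relates to B
# and B relates to C, propose the unseen link A -> C for expert review.
# The tuples are hand-written stand-ins for evidence extracted from titles.
evidence = [
    ("stress", "migraine"),                     # "stress is associated with migraines"
    ("stress", "magnesium loss"),               # "stress can lead to loss of magnesium"
    ("calcium channel blockers", "migraine"),   # "... prevent some migraines"
    ("magnesium", "calcium channel blockers"),  # "magnesium is a natural ..."
]

def propose_hypotheses(pairs):
    """Return (a, c) links implied transitively but never stated directly."""
    direct = set(pairs)
    proposals = set()
    for a, b in pairs:
        for b2, c in pairs:
            if b == b2 and a != c and (a, c) not in direct:
                proposals.add((a, c))
    return proposals

for a, c in sorted(propose_hypotheses(evidence)):
    print(f"candidate hypothesis: {a} -> {c}")
# -> candidate hypothesis: magnesium -> migraine
# Note that "magnesium loss" and "magnesium" do not unify here; bridging such
# lexical variants is exactly where human expertise (or better models) comes in.
```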

Challenges in Data Exploration
How can valid inference links be found without succumbing to combinatorial explosion of the possibilities?
– Need better models of lexical relationships and semantic constraints (very hard)
How should the information be presented to the human experts to facilitate their exploration?

Information Extraction (IE)
Extract domain-specific information from natural language text
– Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”, where <x> marks the slot to be filled)
  Constructed by hand
  Automatically learned from hand-annotated training data
– Need a semantic lexicon (a dictionary of words with semantic category labels)
  Typically constructed by hand
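
As a rough sketch of how these two resources interact, one can read an extraction pattern as a template with a single slot and the semantic lexicon as a word-to-category map; the regex patterns and lexicon entries below are toy assumptions, not a real IE system:

```python
import re

# Toy extraction patterns, each with one capture slot and an expected
# semantic category; fillers are checked against a small hand-built lexicon.
patterns = [
    (re.compile(r"traveled to (\w+)"), "location"),
    (re.compile(r"presidents? of (\w+)"), "country"),
]
semantic_lexicon = {"paris": "location", "france": "country"}

def extract(text):
    """Yield (slot filler, category) pairs that the lexicon agrees with."""
    for regex, expected in patterns:
        for match in regex.finditer(text.lower()):
            filler = match.group(1)
            if semantic_lexicon.get(filler) == expected:
                yield filler, expected

print(list(extract("She traveled to Paris to meet the president of France.")))
# -> [('paris', 'location'), ('france', 'country')]
```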

Challenges in IE
Automatic learning methods are typically supervised (i.e., they need labeled examples), but annotating training data is a time-consuming and expensive task.
– Can we develop better unsupervised algorithms?
– Can we make better use of a small set of labeled examples?

Learning Dictionaries for IE via Bootstrapping (Riloff and Jones, 1999)
Simultaneously learn extraction patterns and domain-specific semantic lexicons
Input requires only a small set of seed words (for the semantic categories) and a large collection of text
Mutual bootstrapping:
– Learn extraction patterns from the seed words
– Use the extraction patterns to identify new words to add to the semantic categories
– Meta-bootstrapping to reduce noise
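
A compressed sketch of the mutual bootstrapping loop, under simplifying assumptions: candidate patterns are plain context strings, scoring is a raw count of how many known category members a pattern extracts, and the corpus and seed set are toys (Riloff and Jones's actual pattern language and scoring heuristics are considerably richer):

```python
# Mutual bootstrapping sketch: alternately (1) score candidate patterns by
# how many known category words they extract, then (2) let the best pattern
# add its other extractions to the lexicon.
corpus = [
    ("traveled to", "rome"), ("traveled to", "tokyo"),
    ("flew to", "tokyo"), ("flew to", "osaka"),
    ("bought", "shares"),
]  # (context pattern, extracted word) pairs pre-extracted from text

lexicon = {"rome"}          # seed word for the LOCATION category
chosen_patterns = set()

for _ in range(3):          # a few bootstrapping iterations
    # score each unused pattern by its overlap with the current lexicon
    scores = {}
    for pattern, word in corpus:
        if pattern not in chosen_patterns and word in lexicon:
            scores[pattern] = scores.get(pattern, 0) + 1
    if not scores:
        break
    best = max(scores, key=scores.get)
    chosen_patterns.add(best)
    # add everything the best pattern extracts to the lexicon
    lexicon |= {word for pattern, word in corpus if pattern == best}

print(chosen_patterns, lexicon)
# iteration 1 picks "traveled to" (adding tokyo);
# iteration 2 picks "flew to" via tokyo (adding osaka); "bought" is never chosen
```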

Text Classification (TC)
Tag a document as belonging to one of a set of pre-defined classes
– “This does not lead to discovery of new information…” (Hearst, 1999)
– Many practical uses:
  Group documents into different domains (useful for domain-specific information extraction)
  Learn the reading interests of users
  Automatically sort documents
  On-line New Event Detection

Challenges in TC
Like IE, TC also needs many labeled examples as training data
– After a user had labeled 1000 UseNet news articles, the system was right only ~50% of the time at selecting articles interesting to the user
What other sources of information can reduce the need for labeled examples?

TC from Labeled and Unlabeled Documents using EM (Nigam et al., 2000)
Expectation-Maximization (EM)
– An iterative algorithm for maximum-likelihood estimation in parametric problems with missing data (here, the class labels of the unlabeled examples)
Nigam et al. combined the EM algorithm with a naïve Bayes classifier, using both labeled and unlabeled data as input
– Dynamically adjust the strength of the unlabeled data’s contribution to parameter estimation in EM
– Reduce the bias of naïve Bayes by modeling each class with multiple mixture components

Probabilistic Framework for TC
Assumption #1: Documents are produced by a mixture model
– Documents are generated according to a probability distribution defined by the model parameters θ
Assumption #2: Each class is modeled by one mixture component: C = {c_1, …, c_{|C|}}
The probability of the model generating document d_i is:

$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j;\, \theta)$$

Naïve Bayes Model
Assumes the words in a document are generated independently of one another (no context)
Assumes all texts have the same length
The class-conditional probability of a document is then a product over its word positions:

$$P(d_i \mid c_j;\, \theta) = \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j;\, \theta)$$

Model parameters: the class priors P(c_j | θ) and the per-class word probabilities P(w_t | c_j; θ) for every word w_t in the vocabulary V

Using a Trained Model
What class should a new document d be assigned to? Pick the class with the highest posterior probability, computed from the model by Bayes’ rule:

$$c^{*} = \arg\max_{j} P(c_j \mid d;\, \theta) = \arg\max_{j} \frac{P(c_j \mid \theta)\, P(d \mid c_j;\, \theta)}{P(d \mid \theta)}$$
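
A small numeric sketch of this decision rule under the naïve Bayes assumptions above, worked in log space for numerical stability; the two classes, vocabulary, and parameter values are made up for illustration:

```python
import math

# Hypothetical trained parameters for two classes over a tiny vocabulary.
class_prior = {"sports": 0.5, "politics": 0.5}
word_prob = {
    "sports":   {"game": 0.5, "vote": 0.1, "win": 0.4},
    "politics": {"game": 0.1, "vote": 0.6, "win": 0.3},
}

def classify(words):
    """Pick argmax_j of log P(c_j) + sum_k log P(w_k | c_j)."""
    best_class, best_score = None, -math.inf
    for c in class_prior:
        score = math.log(class_prior[c])
        score += sum(math.log(word_prob[c][w]) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["game", "win", "game"]))   # -> "sports"
print(classify(["vote", "vote"]))          # -> "politics"
```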

Parameter Estimation with Labeled Documents
Estimate the model parameters from smoothed word and class counts over the labeled documents D, where N(w_t, d_i) is the number of occurrences of word w_t in document d_i, and P(c_j | d_i) is 1 if d_i is labeled with class c_j and 0 otherwise:

$$\hat{\theta}_{w_t \mid c_j} = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(c_j \mid d_i)}
\qquad
\hat{\theta}_{c_j} = \frac{1 + \sum_{i=1}^{|D|} P(c_j \mid d_i)}{|C| + |D|}$$
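
A sketch of these labeled-data estimates over a toy corpus, matching the smoothed count formulas above (the documents and labels are assumptions):

```python
from collections import Counter

# Toy labeled corpus: (tokenized document, class label)
labeled = [
    (["game", "win"], "sports"),
    (["vote", "win"], "politics"),
]
classes = {"sports", "politics"}
vocab = {w for doc, _ in labeled for w in doc}

# Laplace-smoothed estimates:
#   P(c)   = (1 + #docs in c) / (|C| + #docs)
#   P(w|c) = (1 + count of w in c) / (|V| + total words in c)
class_prior = {c: (1 + sum(1 for _, y in labeled if y == c))
                  / (len(classes) + len(labeled)) for c in classes}

word_prob = {}
for c in classes:
    counts = Counter(w for doc, y in labeled if y == c for w in doc)
    total = sum(counts.values())
    word_prob[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}

print(class_prior)
print(word_prob["sports"])   # smoothing: unseen "vote" still gets small mass
```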

Parameter Estimation with Unlabeled Documents
EM is designed for such “incomplete data” problems: maximize the probability of the model generating the observed data
Build an initial classifier (initialize the parameters to “reasonable” starting values)
Repeat until convergence:
– E-Step: Use the current classifier parameters θ^t to estimate P(c | d; θ^t) for all d in the unlabeled set D_u
– M-Step: Re-estimate the classifier parameters θ^{t+1} using the expected counts from the E-Step
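
The loop can be sketched as follows, reusing the smoothed naïve Bayes estimates from the previous slide; a fixed iteration count stands in for a proper convergence test, and the corpus is a toy:

```python
import math
from collections import Counter

labeled   = [(["game", "win"], "sports"), (["vote", "win"], "politics")]
unlabeled = [["game", "game"], ["vote"]]
classes = ["sports", "politics"]
vocab = ({w for doc, _ in labeled for w in doc}
         | {w for doc in unlabeled for w in doc})

def m_step(weighted_docs):
    """Smoothed priors and word probs from (doc, {class: weight}) pairs."""
    n = sum(sum(w.values()) for _, w in weighted_docs)
    prior, word_prob = {}, {}
    for c in classes:
        mass = sum(w.get(c, 0.0) for _, w in weighted_docs)
        prior[c] = (1 + mass) / (len(classes) + n)
        counts = Counter()
        for doc, w in weighted_docs:
            for word in doc:
                counts[word] += w.get(c, 0.0)   # fractional (expected) counts
        total = sum(counts.values())
        word_prob[c] = {v: (1 + counts[v]) / (len(vocab) + total)
                        for v in vocab}
    return prior, word_prob

def e_step(doc, prior, word_prob):
    """Posterior P(c | doc) under the current parameters (log space)."""
    logs = {c: math.log(prior[c])
               + sum(math.log(word_prob[c][w]) for w in doc)
            for c in classes}
    top = max(logs.values())
    unnorm = {c: math.exp(v - top) for c, v in logs.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# initialize from the labeled data alone (hard labels = weight 1.0)
hard = [(doc, {y: 1.0}) for doc, y in labeled]
prior, word_prob = m_step(hard)

for _ in range(10):   # fixed iterations stand in for a convergence check
    soft = [(doc, e_step(doc, prior, word_prob)) for doc in unlabeled]
    prior, word_prob = m_step(hard + soft)

print({c: round(p, 2) for c, p in e_step(["game"], prior, word_prob).items()})
```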

Augmented EM
Weight the unlabeled data
– Otherwise, the unlabeled data overwhelms the small amount of labeled data
– Modify the M-step to multiply the expected counts by a weight factor (see the snippet below)
Relax the one-class-per-mixture-component assumption
– Allow labeled data to fall into “topics” within a class
– Modify the E-step to allow a labeled document to probabilistically belong to sub-topics
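
For the first modification, the weighting is a one-line change to the EM sketch above: scale each unlabeled document's posteriors by a factor λ before they enter the M-step counts (the value of λ and its placement here are illustrative; Nigam et al. select it empirically):

```python
# Continuing the EM sketch above: scale the unlabeled posteriors by a
# factor lam in (0, 1] so their expected counts enter the M-step with
# reduced weight. lam = 0.3 is an illustrative value, not Nigam et al.'s.
lam = 0.3
soft = [(doc, {c: lam * p for c, p in e_step(doc, prior, word_prob).items()})
        for doc in unlabeled]
prior, word_prob = m_step(hard + soft)
```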