Data Mining: Practical Machine Learning Tools and Techniques, by I. H. Witten, E. Frank, and M. A. Hall. Section 6.9: Semi-Supervised Learning. Rodney Nielsen. Many / most of these slides were adapted from: I. H. Witten, E. Frank, and M. A. Hall.

Implementation: Real Machine Learning Schemes
- Decision trees: from ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules: from PRISM to RIPPER and PART (pruning, numeric data, ...)
- Association rules: frequent-pattern trees
- Extending linear models: support vector machines and neural networks
- Instance-based learning: pruning examples, generalized exemplars, distance functions

Implementation: Real Machine Learning Schemes
- Numeric prediction: regression/model trees, locally weighted regression
- Bayesian networks: learning and prediction, fast data structures for learning
- Clustering: hierarchical, incremental, probabilistic, Bayesian
- Semisupervised learning: clustering for classification, co-training

Semisupervised Learning
- Semisupervised learning attempts to use unlabeled data as well as labeled data
- The aim is to improve classification performance
- Why try to do this? Unlabeled data is often plentiful, while labeling data can be expensive
  - Web mining: classifying web pages
  - Text mining: identifying names in text
  - Video mining: classifying people in the news
- Leveraging the large pool of unlabeled examples would be very attractive

Clustering for Classification
- Idea: use Naïve Bayes on the labeled examples and then apply EM
  - Build a Naïve Bayes model on the labeled data
  - Until convergence:
    - Label the unlabeled data based on class probabilities ("Expectation" step)
    - Train a new Naïve Bayes model based on all the data ("Maximization" step)
- Essentially the same as EM for clustering, with fixed cluster-membership probabilities for the labeled data and #clusters = #classes
- A code sketch of this loop follows
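The loop above maps almost line-for-line onto code. Below is a minimal sketch that uses hard (argmax) labels in the E-step; true EM would use fractional class counts. The names X_lab, y_lab, and X_unlab (term-count matrices, e.g., from CountVectorizer) and the fixed iteration count standing in for a convergence test are assumptions.

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.naive_bayes import MultinomialNB

    def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
        model = MultinomialNB().fit(X_lab, y_lab)      # Naive Bayes on labeled data only
        X_all = vstack([X_lab, X_unlab])
        for _ in range(n_iter):                        # stand-in for "until convergence"
            y_unlab = model.predict(X_unlab)           # E-step: label the unlabeled data
            y_all = np.concatenate([y_lab, y_unlab])
            model = MultinomialNB().fit(X_all, y_all)  # M-step: retrain on all the data
        return model

Refinement 1 below (reducing the weight of the unlabeled data) could be layered on via the sample_weight argument of fit.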

Comments
- Has been applied successfully to document classification
  - Certain phrases are indicative of classes
  - Some of these phrases occur only in the unlabeled data, some in both sets
  - EM can generalize the model by taking advantage of co-occurrence of these phrases
- Refinement 1: reduce the weight of the unlabeled data
- Refinement 2: allow multiple clusters per class

Co-training
- Method for learning from multiple views (multiple sets of attributes), e.g.:
  - First set of attributes describes the content of a web page
  - Second set of attributes describes the links pointing to the web page
- Until stopping criteria:
  - Step 1: build a model from each view
  - Step 2: use the models to assign labels to the unlabeled data
  - Step 3: select the unlabeled examples that were most confidently predicted (often preserving the ratio of classes)
  - Step 4: add those examples to the training set
- Assumption: the views are independent
- A sketch of this loop follows
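As a hedged illustration, the four steps might look like this with two feature views and logistic regression as the base learner; the learner choice, per-round quota k, and round count are assumptions, and Step 3's class-ratio preservation is omitted for brevity.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def co_train(X1, X2, y, labeled, unlabeled, rounds=10, k=5):
        # y: label array; entries for unlabeled rows are placeholders until set
        y, labeled, unlabeled = y.copy(), list(labeled), list(unlabeled)
        for _ in range(rounds):                                 # until stopping criteria
            h1 = LogisticRegression(max_iter=1000).fit(X1[labeled], y[labeled])
            h2 = LogisticRegression(max_iter=1000).fit(X2[labeled], y[labeled])
            for h, X in ((h1, X1), (h2, X2)):                   # Step 1: a model per view
                if len(unlabeled) < k:
                    break
                probs = h.predict_proba(X[unlabeled])           # Step 2: label unlabeled data
                best = set(np.argsort(probs.max(axis=1))[-k:])  # Step 3: most confident k
                for j in best:
                    y[unlabeled[j]] = h.classes_[probs[j].argmax()]
                labeled += [unlabeled[j] for j in best]         # Step 4: add to training set
                unlabeled = [u for i, u in enumerate(unlabeled) if i not in best]
        return h1, h2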

EM and Co-training
- Like EM for semisupervised learning, but the view is switched in each iteration of EM
  - Uses all the unlabeled data (probabilistically labeled) for training
- Has also been used successfully with support vector machines
  - Logistic models fit to the output of the SVMs are used to estimate a class probability distribution
- Co-training sometimes also seems to work when the views are chosen randomly!
  - Why? Perhaps the co-trained classifier is more robust

Self-Training

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h)
        L ← L0 + Ŷ(U*)

(A runnable sketch follows.)
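One plausible reading of this pseudocode in code, with select() implemented as a confidence threshold; the base-learner factory f and the parameter defaults are assumptions, and the arrays are assumed to be dense NumPy matrices. Matching the pseudocode, L is rebuilt from L0 each round, so the labels assigned to the unlabeled data can change.

    import numpy as np

    def self_train(f, X_lab, y_lab, X_unlab, theta=0.95, rounds=10):
        X, y = X_lab, y_lab                           # L <- L0
        for _ in range(rounds):                       # until stopping-criteria
            h = f().fit(X, y)                         # h(x) <- f(L)
            probs = h.predict_proba(X_unlab)
            keep = probs.max(axis=1) >= theta         # U* <- select(U, h)
            X = np.vstack([X_lab, X_unlab[keep]])     # L <- L0 + Yhat(U*)
            y = np.concatenate([y_lab, h.classes_[probs[keep].argmax(axis=1)]])
        return h

For example, f could be lambda: LogisticRegression(max_iter=1000) from scikit-learn.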

Example Selection
- Probability
- Probability ratio or probability margin
- Entropy
- Or several other possibilities (e.g., search for Burr Settles' Active Learning tutorial)
- The first three measures are sketched below
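A sketch of these measures over a matrix of predicted class probabilities (one row per instance, one column per class); self-training typically keeps high-confidence instances, while active learning (later in these slides) queries low-confidence ones.

    import numpy as np

    def least_confident(probs):
        return 1.0 - probs.max(axis=1)                       # 1 - P(most likely class)

    def prob_margin(probs):
        top2 = np.sort(probs, axis=1)[:, -2:]
        return top2[:, 1] - top2[:, 0]                       # small margin = uncertain

    def entropy(probs):
        return -(probs * np.log(probs + 1e-12)).sum(axis=1)  # high entropy = uncertain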

Stopping Criteria
- T rounds
- Repeat until convergence
- Use held-out validation data, or k-fold cross-validation

Seed Data vs. Seed Classifier
- Training on seed data does not necessarily result in a classifier that perfectly labels the seed data
- Training on data output by a seed classifier does not necessarily result in the same classifier

Indelibility

Indelible: once an example is labeled and added, it stays in L

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h)
        L ← L + Ŷ(U*)
        U ← U − U*

Original: Ŷ(U) can change

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h)
        L ← L0 + Ŷ(U*)

Persistence

Indelible:

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h)
        L ← L + Ŷ(U*)
        U ← U − U*

Persistent: Ŷ(L) can't change

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← U* + select(U, h)
        L ← L0 + Ŷ(U*)
        U ← U − U*

Throttling

Throttled: select the k examples from U with the greatest confidence

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h, k)
        L ← L0 + Ŷ(U*)

Original: threshold; select all examples from U with confidence > θ

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h, θ)
        L ← L0 + Ŷ(U*)

Balanced

Balanced (& throttled): select k+ positive and k− negative examples; often k+ = k−, or they are proportional to N+ and N− (a sketch of this select() follows)

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h, k)
        L ← L0 + Ŷ(U*)

Throttled: select the k examples from U with the greatest confidence

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h, k)
        L ← L0 + Ŷ(U*)
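A balanced, throttled select() might look like the following for a binary task; the convention that column 1 of probs holds the positive-class probability is an assumption, and possible overlap between the two picks is not handled.

    import numpy as np

    def select_balanced(probs, unlabeled, k_pos=5, k_neg=5):
        pos = np.argsort(probs[:, 1])[-k_pos:]      # k+ most confident positives
        neg = np.argsort(probs[:, 0])[-k_neg:]      # k- most confident negatives
        return [unlabeled[i] for i in np.concatenate([pos, neg])]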

Preselection

Preselect a subset of U: select examples from U′, a (typically random) subset of U

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U′ ← select(U, φ)
        U* ← select(U′, h, θ)
        L ← L0 + Ŷ(U*)

Original: test all of U; select examples from all of U

    L ← L0
    Until stopping-criteria:
        h(x) ← f(L)
        U* ← select(U, h, θ)
        L ← L0 + Ŷ(U*)

Co-training
- X = X1 × X2: two different views of the data
- x = (x1, x2); i.e., each instance is comprised of two distinct sets of features and values
- Assume each view is sufficient for correct classification

Co-Training Algorithm (Table 1 of Blum and Mitchell, 1998)

Companionbots
- Perceptive, emotive, conversational healthcare companion robots

Elderly and Depression
- Depression
  - Leading cause of disability (M/F, all ages, worldwide; WHO)
  - Doubles the cost of care for chronic diseases
- Stats for 65+
  - Double in number by
  - 20%
  - 50-58% of hospital patients
  - 36-50% of healthcare expenditures

Companionbots Architecture
[Architecture diagram: sensory inputs (audio, vision, location, force/touch, distance measurement, radar, IR, ...) feed recognition modules (speech, object, emotion), which feed understanding modules (scenario, emotion, environment, language) plus prediction and user modeling / history tracking; goal and behavior managers drive natural language generation, text-to-speech, visual displays, and mechatronic controls (expression, gesture, posture, locomotion, manipulation)]
- Instance selection for co-training in emotion recognition

Multimodal Emotion Recognition
- Vision
- Speech
- Language: "Why does this always have to happen to me"

Co-Training Emotion Recognition (adapted from Blum & Mitchell, 1998)
- Given: a set L of labeled training examples and a set U of unlabeled training examples
- Create a pool U′ of examples by choosing u examples at random from U
- Loop for k iterations:
  - Use L to train a classifier h1 that considers only x1 (the vision view)
  - Use L to train a classifier h2 that considers only x2 (the speech view)
  - Use L to train a classifier h3 that considers only x3 (the language view)
  - Allow h1 to label p1 positive and n1 negative examples from U′
  - Allow h2 to label p2 positive and n2 negative examples from U′
  - Allow h3 to label p3 positive and n3 negative examples from U′
  - Add these self-labeled examples to L
  - Randomly choose examples from U to replenish U′

Semisupervised & Active Learning
- Most common strategy for instance selection: based on class probability estimates
- Semisupervised learning: select the k instances with the highest class probabilities
- Active learning: select the k instances with the lowest class probabilities
- The contrast is sketched below
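In code, the two paradigms rank the unlabeled pool by the same confidence scores and simply read the ranking from opposite ends; the names here are illustrative.

    import numpy as np

    def pick(probs, k, mode="semi"):
        conf = probs.max(axis=1)     # confidence = highest class probability
        order = np.argsort(conf)     # ascending by confidence
        return order[-k:] if mode == "semi" else order[:k]  # "active": least confident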

Active Learning
- Usually an abundance of unlabeled data
- How much should you label? Which instances should you label?
- Does it matter? Can the learner benefit from selective labeling?
- Active learning: incrementally request labels for key instances

Learning Paradigms
[Figure: supervised learning (all instances labeled), unsupervised learning (no labels), and active learning, where the learner chooses between random and targeted queries for labels]

Active Learning Applications
- Speech recognition
  - 10 minutes to annotate the words in 1 minute of speech
  - 7 hours to annotate the phonemes of 1 minute of speech
- Named entity recognition
  - Half an hour for a simple newswire article
  - A PhD for a bioinformatics article
- Image annotation

Face/Pedestrian/Object Detection

Heuristic Active Learning Algorithm
- Start with unlabeled data
- Randomly pick a small number of examples to have labeled
- Repeat:
  - Train a classifier on the labeled data
  - Query the unlabeled example that:
    - is closest to the decision boundary,
    - has the least certainty, or
    - minimizes overall uncertainty
- A sketch of this loop follows
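An uncertainty-sampling version of this loop, as a hedged sketch: query_label() is a hypothetical stand-in for the human annotator, the pool is assumed to be a NumPy array, and logistic regression is just one possible classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learn(X_pool, query_label, n_seed=10, n_queries=50, seed=0):
        rng = np.random.default_rng(seed)
        labeled = [int(i) for i in rng.choice(len(X_pool), n_seed, replace=False)]
        y = {i: query_label(i) for i in labeled}          # small random seed set
        for _ in range(n_queries):
            model = LogisticRegression(max_iter=1000).fit(
                X_pool[labeled], [y[i] for i in labeled]) # train on labeled data
            conf = model.predict_proba(X_pool).max(axis=1)
            conf[labeled] = np.inf                        # never re-query an instance
            i = int(conf.argmin())                        # least certain instance
            y[i] = query_label(i)                         # ask the annotator
            labeled.append(i)
        return model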

Two Gaussians
(a) Two classes with Gaussian distributions
(b) Logistic regression on 30 random labeled examples: 70% accuracy
(c) Logistic regression on 30 examples chosen by active learning: 90% accuracy

Space of Active Learning
- Query types / sampling methods:
  - Membership query synthesis
  - Stream-based selective sampling
  - Pool-based active learning
- Informativeness measures:
  - Uncertainty sampling
  - Query by committee
  - Expected model change
  - Variance reduction
  - Estimated error reduction
  - Density-weighted methods


Active Learning Query Types (figure from Burr Settles, 2009, AL tutorial)

Membership Query Synthesis
- Dynamically construct query instances based on expected informativeness
- Applications
  - Character recognition
  - Robot scientist: find the optimal growth medium for a yeast
    - 3x cost decrease vs. the cheapest-next strategy
    - 100x cost decrease vs. random selection

Stream-based Selective Sampling
- Informativeness measure
  - Region of uncertainty / version space
- Applications
  - Part-of-speech tagging
  - Sensor scheduling
  - IR ranking
  - Word sense disambiguation (WSD)

Pool-based Active Learning
- Informativeness measure
- Applications
  - Cancer diagnosis
  - Text classification
  - Information extraction (IE)
  - Image classification & retrieval
  - Video classification & retrieval
  - Speech recognition

Pool-based Active Learning Loop (figure from Burr Settles, 2009, AL tutorial)

Space of Active Learning
- Query types / sampling methods:
  - Membership query synthesis
  - Stream-based selective sampling
  - Pool-based active learning
- Informativeness measures:
  - Uncertainty sampling
  - Query by committee
  - Expected model change
  - Variance reduction
  - Estimated error reduction
  - Density-weighted methods

Questions?

Instance Sampling in Active Learning
- Query types / sampling methods:
  - Membership query synthesis
  - Stream-based selective sampling
  - Pool-based active learning
- Informativeness measures:
  - Uncertainty sampling
  - Query by committee
  - Expected model change
  - Variance reduction
  - Estimated error reduction
  - Density-weighted methods

Uncertainty Sampling
- Select examples based on confidence in the prediction
  - Least confident
  - Margin sampling
  - Entropy-based models

Query by Committee
- Train a committee of hypotheses representing different regions of the version space
- Obtain some measure of (dis)agreement on the instances in the dataset (e.g., vote entropy; a sketch follows)
- Assume the most informative instance is the one on which the committee has the most disagreement
- Goal: minimize the version space
- No agreement on the size of the committee, but even 2-3 members provide good results
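Vote entropy, the disagreement measure named above, as a sketch: each committee member casts a hard vote, and the instance whose vote distribution has the highest entropy is queried.

    import numpy as np

    def vote_entropy(votes):
        # votes: (n_committee, n_instances) array of predicted class ids
        ent = np.zeros(votes.shape[1])
        for c in np.unique(votes):
            frac = (votes == c).mean(axis=0)    # fraction of the committee voting c
            ent -= frac * np.log(frac + 1e-12)  # 0 log 0 treated as 0
        return ent                              # query the argmax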

Competing Hypotheses (figure from Burr Settles, 2009, AL tutorial)

Expected Model Change
- Query the instance that would result in the largest expected change in h, given the current model and the expectation over possible labels
  - E.g., the instance that would produce the largest gradient step in the model parameters
- Prefer the instance x that leads to the most significant change in the model

Expected Model Change
- What learning algorithms does this work for?
- What are the issues?
  - Can be computationally expensive for large datasets and feature spaces
  - Can be led astray if features aren't properly scaled
    - How do you properly scale the features?

Estimated Error Reduction
- Other methods approximate the goal of minimizing future error by minimizing a surrogate (e.g., uncertainty, ...)
- Estimated error reduction attempts to directly minimize the expected error E[error]

Estimated Error Reduction
- Often computationally prohibitive
  - Binary logistic regression would be O(|U| |L| G), where G is the number of gradient-descent iterations to convergence
  - Conditional random fields would be O(T |Y|^(T+2) |U| |L| G), where T is the number of instances in the sequence

Variance Reduction
- Regression problems
  - E[error²] = noise + bias² + variance
  - The learner can't change the noise or the bias, so minimize the variance
- The Fisher information ratio is used for classification

Outlier Phenomenon
- Uncertainty sampling and query by committee might be hindered by querying many outliers

Density-Weighted Methods
- Uncertainty sampling and query by committee might be hindered by querying many outliers
- Density-weighted methods overcome this potential problem by also considering whether the example is representative of the input distribution
- Tend to work better than the base informativeness measures on their own

Diversity
- Naïve selection by the earlier methods results in selecting examples that are very similar to one another
- Must factor this in and look for diversity in the queries (a sketch follows)
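One common way to inject diversity, as a hedged sketch: greedily build a batch that trades informativeness off against similarity to the instances already chosen. The trade-off weight beta and the use of cosine similarity are assumptions, not from the slides; informativeness is a per-instance NumPy array.

    import numpy as np

    def diverse_batch(X, informativeness, k, beta=0.5):
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # unit rows for cosine
        chosen = [int(np.argmax(informativeness))]
        while len(chosen) < k:
            sim = Xn @ Xn[chosen].T                    # similarity to the current batch
            score = informativeness - beta * sim.max(axis=1)
            score[chosen] = -np.inf                    # no repeats
            chosen.append(int(np.argmax(score)))
        return chosen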

Active Learning Empirical Results
- Appears to work well, barring publication bias
(figure from Burr Settles, 2009, AL tutorial)

Labeling Costs
- Are all labels created equal?
  - Generating labels by experiments
  - Some instances are easier to label (e.g., shorter sentences)
  - Can pre-label data for a small savings
  - Experimental problems
- Value of information (VOI)
  - Considers labeling costs and estimated misclassification costs
  - Critical to the goal of active learning
  - Divide informativeness by cost? (a sketch follows)
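The "divide informativeness by cost" idea reduces to a one-line scoring rule; per-instance arrays are assumed, and this is only one of several VOI-style trade-offs.

    import numpy as np

    def cost_normalized_pick(informativeness, cost):
        # score each candidate by informativeness per unit labeling cost
        return int(np.argmax(np.asarray(informativeness) / np.asarray(cost)))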

Batch Mode Active Learning

Active Learning Evaluation
Learning curves for text classification: baseball vs. hockey. The curves plot classification accuracy as a function of the number of documents queried for two selection strategies: uncertainty sampling (active learning) and random sampling (passive learning). The active learning approach is superior here because its learning curve dominates that of random sampling. (From Burr Settles, 2009, AL tutorial)

Active Learning Evaluation
We can conclude that an active learning algorithm is superior to some other approach (e.g., a random baseline like traditional passive supervised learning) if it dominates the other for most or all of the points along their learning curves. (From Burr Settles, 2009, AL tutorial)