Practical Online Active Learning for Classification Claire Monteleoni (MIT / UCSD) Matti Kääriäinen (University of Helsinki)

Online learning Forecasting, real-time decision making, streaming applications, online classification, resource-constrained learning.

Online learning [M 2006] studies learning under these online constraints:
1. Access to the data observations is one-at-a-time only. Once a data point has been observed, it might never be seen again. The learner makes a prediction on each observation. → Models forecasting, temporal prediction problems (internet, stock market, the weather), and high-dimensional and/or streaming data applications.
2. Time and memory usage must not scale with data. Algorithms may not store previously seen data and perform batch learning. → Models resource-constrained learning, e.g. on small devices.

Active learning
Machine learning & vision applications: image classification, object detection/classification in video, document/webpage classification. Unlabeled data is abundant, but labels are expensive, so active learning is a useful model here: it allows for intelligent choices of which examples to label.
Goal: given a stream (or pool) of unlabeled data, use fewer labels to learn (to a fixed accuracy) than via supervised learning.

Online active learning: model

Online active learning: applications
Data-rich applications: image/webpage relevance filtering; speech recognition; your favorite data-rich vision/video application!
Resource-constrained applications: human-interactive learning on small devices (OCR on handhelds used by doctors, etc.); email/spam filtering; your favorite resource-constrained vision/video application!

Outline of talk
Online learning: formal framework; (supervised) online learning algorithms studied: Perceptron, modified-Perceptron (DKM).
Online active learning: formal framework; online active learning algorithms: query-by-committee, active modified-Perceptron (DKM), margin-based (CBGZ).
Application to OCR: motivation; results.
Conclusions and future work.

Online learning (supervised, iid setting)
Supervised online classification: labeled examples (x, y) are received one at a time; the learner predicts at each time step t: v_t(x_t).
Independently, identically distributed (iid) framework: assume observations x ∈ X are drawn independently from a fixed probability distribution D. No prior over the concept class H is assumed (non-Bayesian setting). The error rate of a classifier v is measured on distribution D: err(v) = P_{x~D}[v(x) ≠ y].
Goal: minimize the number of mistakes to learn the concept (w.h.p.) to a fixed final error rate ε on the input distribution.
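To make the protocol concrete, here is a minimal sketch of the supervised online loop above, in Python with NumPy; the names `stream` and `update` are illustrative stand-ins, not from the talk.

```python
import numpy as np

def online_learn(stream, update, d, seed=0):
    """Minimal online supervised loop: predict, observe the label, update.

    `stream` yields (x, y) pairs with x in R^d and y in {-1, +1};
    `update` maps (v, x, y) to the next hypothesis vector.  Time and
    memory per step are constant: no past examples are stored.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)          # arbitrary unit-norm initial hypothesis
    mistakes = 0
    for x, y in stream:
        if np.sign(v @ x) != y:     # predict v_t(x_t) before seeing y_t
            mistakes += 1
        v = update(v, x, y)
    return v, mistakes
```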

Problem framework
Target: u. Current hypothesis: v_t. Error region: ξ_t, with angle θ_t between u and v_t.
Assumptions: u passes through the origin; separability (the realizable case); D = U, i.e. x ~ Uniform on the unit sphere S. Error rate: ε_t.
(Figure: u, v_t, and the error region ξ_t on the sphere.)

Performance guarantees
Distribution-free mistake bound for Perceptron of O(1/γ²), if a margin γ exists.
Uniform, iid, separable setting:
[Baum 1989]: an upper bound on mistakes for Perceptron of Õ(d/ε²).
[Dasgupta, Kalai & M, COLT 2005]: a lower bound for Perceptron of Ω(1/ε²) mistakes, and a modified-Perceptron algorithm with a mistake bound of Õ(d log 1/ε).

Perceptron
Perceptron update: v_{t+1} = v_t + y_t x_t. The error does not decrease monotonically.
(Figure: u, v_t, x_t, and the updated v_{t+1}.)

A modified Perceptron update
Standard Perceptron update: v_{t+1} = v_t + y_t x_t.
Instead, weight the update by "confidence" w.r.t. the current hypothesis v_t: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t (with v_1 = y_0 x_0). (Similar to updates in [Blum, Frieze, Kannan & Vempala '96] and [Hampson & Kibler '99].)
Unlike Perceptron, the error decreases monotonically: cos(θ_{t+1}) = u · v_{t+1} = u · v_t + 2|v_t · x_t||u · x_t| ≥ u · v_t = cos(θ_t), and ||v_t|| = 1 (due to the factor of 2).

A modified Perceptron update
Perceptron update: v_{t+1} = v_t + y_t x_t. Modified Perceptron update: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t.
(Figure: u, x_t, and the hypotheses v_t and v_{t+1} under each update.)
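A sketch of the two update rules, usable as the `update` argument in the loop above; this simply transcribes the formulas on the slides, assuming examples x are normalized to unit length.

```python
import numpy as np

def perceptron_update(v, x, y):
    """Standard Perceptron: v_{t+1} = v_t + y_t x_t on a mistake."""
    if np.sign(v @ x) != y:
        v = v + y * x
    return v

def modified_perceptron_update(v, x, y):
    """DKM update: v_{t+1} = v_t + 2 y_t |v_t . x_t| x_t on a mistake.

    For unit-norm x, the factor of 2 preserves ||v_t|| = 1, and in the
    separable setting cos(theta_t) = u . v_t never decreases.
    """
    if np.sign(v @ x) != y:
        v = v + 2.0 * y * abs(v @ x) * x
    return v
```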

Selective sampling [Cohn, Atlas & Ladner '94]
Given: a stream (or pool) of unlabeled examples x ∈ X, drawn iid from an input distribution D over X. The learner may request labels on examples in the stream/pool, with (noiseless) oracle access to the correct labels y ∈ Y, at constant cost per label.
The error rate of any classifier v is measured on distribution D: err(v) = P_{x~D}[v(x) ≠ y].
PAC-like case: no prior on hypotheses is assumed (non-Bayesian). Goal: minimize the number of labels to learn the concept (w.h.p.) to a fixed final error rate ε on the input distribution.
We impose online constraints on time and memory: the PAC-like selective sampling framework plus online constraints gives the online active learning framework.

Performance guarantees
Bayesian, non-online, uniform, iid, separable setting: [Freund, Seung, Shamir & Tishby '97]: an upper bound on labels for the query-by-committee algorithm [SOS '92] of Õ(d log 1/ε).
Uniform, iid, separable setting: [Dasgupta, Kalai & M, COLT 2005]: a lower bound for Perceptron in the active learning context, paired with any active learning rule, of Ω(1/ε²) labels; an online active learning algorithm with a label bound of Õ(d log 1/ε); a bound of Õ(d log 1/ε) on total errors (labeled or unlabeled). OPT: an Ω(d log 1/ε) lower bound on labels for any active learning algorithm.

Active learning rule
Goal: filter to label just those points in the error region. But θ_t, and thus ε_t, are unknown!
Define the labeling region L = {x : |v_t · x| ≤ s_t}. Tradeoff in choosing the threshold s_t: if too high, we may wait too long for an error; if too low, the resulting update is too small. So choose the threshold s_t adaptively: start high, and halve it if there is no error in R consecutive labels.
(Figure: u, v_t, and the labeling region of width s_t.)
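Putting the filtering rule and the adaptive threshold together, an active learner along these lines might look as follows; `stream`, `oracle`, and the initialization are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def dkm_active_learn(stream, oracle, R, s0=1.0):
    """Sketch of a DKM-style online active learner.

    `stream` yields unlabeled unit vectors x; `oracle(x)` returns the
    true label in {-1, +1} at unit cost.  Only points inside the
    labeling region L = {x : |v_t . x| <= s_t} are queried.
    """
    v, s, streak, labels = None, s0, 0, 0
    for x in stream:
        if v is None:                       # initialize: v_1 = y_0 x_0
            y = oracle(x); labels += 1
            v = y * x
            continue
        if abs(v @ x) > s:                  # outside labeling region: discard
            continue
        y = oracle(x); labels += 1          # query the label
        if np.sign(v @ x) != y:             # error: modified-Perceptron update
            v = v + 2.0 * y * abs(v @ x) * x
            streak = 0
        else:
            streak += 1
            if streak == R:                 # no error in R consecutive labels:
                s /= 2.0                    # halve the threshold
                streak = 0
    return v, labels
```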

OCR application
We apply online active learning to OCR [M '06; M&K '07]: due to its potential efficacy for OCR on small devices; to empirically observe performance when the distributional and separability assumptions are relaxed; and to start bridging theory and practice.

Algorithms
DKM was stated implicitly above; for this non-uniform application, we start the threshold at 1.
[Cesa-Bianchi, Gentile & Zaniboni '06] algorithm (parameter b): filtering rule: flip a coin with probability b/(b + |x · v_t|); update rule: standard Perceptron. CBGZ analysis framework: no assumptions on the sequence (need not be iid); relative bounds on error w.r.t. the best linear classifier (regret); the fraction of labels queried depends on b.
Other margin-based (batch) methods: un-analyzed: [Tong & Koller '01], [Lewis & Gale '94]; recently analyzed: [Balcan, Broder & Zhang, COLT 2007].
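For comparison, one step of the CBGZ rule can be sketched as follows; the names are again illustrative, and only the coin-flip probability and the Perceptron update come from the slide.

```python
import numpy as np

def cbgz_step(v, x, oracle, b, rng):
    """One step of the CBGZ selective-sampling rule (parameter b).

    Query the label with probability b / (b + |x . v_t|), so smaller
    margins are queried more often; on a query, apply the standard
    mistake-driven Perceptron update.
    """
    if rng.random() < b / (b + abs(x @ v)):  # filtering rule: biased coin flip
        y = oracle(x)
        if np.sign(v @ x) != y:              # standard Perceptron update
            v = v + y * x
    return v
```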

Evaluation framework
Experiments with all 6 combinations of: update rule ∈ {Perceptron, DKM modified Perceptron} and active learning logic ∈ {DKM, CBGZ, random}.
MNIST (d=784) and USPS (d=256) OCR data; 7 problems, with approx. 10,000 examples each; 5 random restarts of 10-fold cross-validation. Parameters were first tuned to reach a target error ε per problem, on hold-out sets of approx. 2,000 examples, using 10-fold cross-validation.

Learning curves
(Figure: learning curves for an unseparable problem and an extremely easy problem.)

Learning curves

Statistical efficiency

More results
Mean ± standard deviation of labels to reach the ε threshold per problem (in parentheses). Active learning consistently outperformed random sampling: random sampling with Perceptron used 1.26–6.08x as many labels as active learning, and the factor was at least 2 for more than half of the problems.

More results and discussion
Individual hypotheses tested on the tabular results (to a fixed ε): both active learning rules, with both sub-algorithms, performed better than their random sampling counterparts; the difference between the top performers, DKMactivePerceptron and CBGZactivePerceptron, was not significant; Perceptron outperformed modified-Perceptron (DKMupdate) when used as the sub-algorithm to any active rule; DKMactive outperformed CBGZactive when paired with DKMupdate.
Possible sources of error. Fairness: tuning entails higher label usage, which was not accounted for; modified-Perceptron (DKMupdate) was not tuned (it has no parameters!); two-parameter algorithms should have been tuned jointly; DKMactive's R relates to the fold length, yet the tuning set was much smaller than the data. Overfitting: were parameters overfit to the holdout set for the tuned algorithms?

Conclusions and future work
We motivated and explained online active learning methods. If your problem is not online, you are better off using batch methods with active learning. Active learning uses far fewer labels than supervised learning (random sampling).
Future work: other applications! Kernelization; cost-sensitive labels; a margin version for exponential convergence, without the dependence on d; relaxing the separability assumption (the agnostic case faces a lower bound [K '06]); distributional relaxation? (such a bound is not possible under every distribution [D '04]).

Thank you!
Thanks to coauthor Matti Kääriäinen. Many thanks to: Sanjoy Dasgupta, Tommi Jaakkola, Adam Tauman Kalai, Luis Perez-Breva, Jason Rennie.