
1 Active and Proactive Machine Learning. Jaime Carbonell, Pinar Donmez, Jingrui He & Vamshi Ambati, Language Technologies Institute, Carnegie Mellon University. www.cs.cmu.edu/~jgc. 27 October 2010.

2 Why is Active Learning Important?
- Labeled data volumes << unlabeled data volumes:
  - 1.2% of all proteins have known structures
  - <0.01% of all galaxies in the Sloan Sky Survey have consensus type labels
  - <0.0001% of all web pages have topic labels
  - <<1E-10% of all internet sessions are labeled as to fraudulence (malware, etc.)
  - <0.0001 of all financial transactions are investigated w.r.t. fraudulence
- If labeling is costly or limited, select the instances with maximal impact for learning

3 Is (Pro)Active Learning Relevant to Language Technologies?
- Text classification: by topic, genre, difficulty, ...; in learning to rank search results
- Question answering: question-type classification; answer ranking
- Machine translation: selecting sentences to translate for LDLs (low-density languages); eliciting partial or full alignments

4 Active Learning
- Training data: $\{(x_i, y_i)\}_{i=1}^{n}$; special case: binary labels $y_i \in \{-1, +1\}$
- Functional space: $f \in \mathcal{F}$, e.g. linear classifiers $f(x) = \mathrm{sign}(w^\top x + b)$
- Fitness criterion: a.k.a. loss function $L(f(x), y)$, minimized in expectation over the data
- Sampling strategy: pick $x^{*} = \arg\max_{x \in U} \phi(x)$ from the unlabeled pool $U$ for some utility $\phi$

5 Sampling Strategies
- Random sampling (preserves distribution)
- Uncertainty sampling (Lewis, 1996; Tong & Koller, 2000): proximity to decision boundary; maximal distance to labeled x's
- Density sampling (kNN-inspired; McCallum & Nigam, 2004)
- Representative sampling (Xu et al., 2003)
- Instability sampling (probability-weighted): x's that maximally change the decision boundary
- Ensemble strategies:
  - Boosting-like ensemble (Baram, 2003)
  - DUAL (Donmez & Carbonell, 2007): dynamically switches strategies from density-based to uncertainty-based by estimating the derivative of expected residual error reduction
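A minimal sketch of the two base strategies, assuming a pool of unlabeled feature vectors and a probabilistic classifier; the function names are illustrative, not from the talk:

```python
import numpy as np

def uncertainty_scores(proba):
    """Margin-based uncertainty: a small gap between the top two
    class probabilities means the point is near the boundary."""
    part = np.sort(proba, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def density_scores(X, k=10):
    """kNN-inspired density: points with many close neighbors get
    high scores (negative mean distance to the k nearest)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = np.sort(d, axis=1)[:, :k]
    return -knn.mean(axis=1)

def select(proba, X, strategy="uncertainty"):
    s = uncertainty_scores(proba) if strategy == "uncertainty" else density_scores(X)
    return int(np.argmax(s))
```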

6 Which point to sample? (figure: scatter of points; grey = unlabeled, red = class A, brown = class B)

7 Density-Based Sampling (figure: the selected point is the centroid of the largest unsampled cluster)

8 Uncertainty Sampling (figure: the selected point is the one closest to the decision boundary)

9 Maximal Diversity Sampling (figure: the selected point is maximally distant from the labeled x's)

10 Ensemble-Based Possibilities (figure: uncertainty + diversity criteria; density + uncertainty criteria)

11 Strategy Selection: No Universal Optimum
- The optimal operating range for AL sampling strategies differs
- How to get the best of both worlds? (Hint: ensemble methods, e.g. DUAL)

12 How does DUAL do better?
- Runs DWUS (density-weighted uncertainty sampling) until it estimates a cross-over point
- Monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum
- After the cross-over (saturation) point, DUAL uses a mixture model
- The goal is to minimize expected future error: if we knew the future error of uncertainty sampling (US) to be zero, we would force the mixture weight fully onto US, but in practice we do not know it

13 More on DUAL [ECML 2007]
- After the cross-over, US does better, so the uncertainty score should be given more weight
- The mixture weight should reflect how well US performs; it can be calculated from the expected error of US on the unlabeled data*
- This yields the DUAL selection criterion: interpolate the uncertainty and density scores with that weight and pick the maximizing instance
* US is allowed to choose data only from among the already sampled instances, and its expected error is calculated on the remaining unlabeled set
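A hedged sketch of the DUAL-style mixture criterion: a weight pi, tied to the estimated error of uncertainty sampling, interpolates between the two scores after the cross-over. The weighting scheme below is a simplification for illustration, not the exact ECML 2007 formulation:

```python
import numpy as np

def dual_select(unc, dens, est_us_error, crossed_over):
    """unc, dens: per-point uncertainty / density scores (same length).
    est_us_error: estimated expected error of uncertainty sampling (US)
    on the unlabeled pool; lower error => trust US more.
    Before the cross-over, DUAL behaves like density-weighted US."""
    if not crossed_over:
        return int(np.argmax(unc * dens))        # DWUS-like product score
    pi = 1.0 - est_us_error                      # if US error were 0, pi -> 1
    return int(np.argmax(pi * unc + (1.0 - pi) * dens))
```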

14 Results: DUAL vs DWUS (results figure)

15 Beyond DUAL
- Paired sampling with geodesic density estimation (Donmez & Carbonell, SIAM 2008)
- Active rank learning: search results (Donmez & Carbonell, WWW 2008); in general (Donmez & Carbonell, ICML 2008)
- Structure learning: inferring 3D protein structure from 1D sequence remains an open problem

16 Active Sampling for RankSVM I
- Consider a candidate instance $x$
- Assume $x$ is added to the training set with a presumed label $y$
- The total loss on pairs that include $x$ is $\sum_{j=1}^{n} \ell\big(\mathrm{sgn}(y - y_j)(f(x) - f(x_j))\big)$, where $\ell(t) = \max(0, 1 - t)$ is the hinge loss
- $n$ is the number of training instances with a different label than $x$
- The objective function to be minimized becomes the regularized RankSVM objective, $\min_f \tfrac{1}{2}\|f\|^2 + C \sum_{\text{pairs}} \ell(\cdot)$, now including the new pairs

17 Active Sampling for RankSVM II
- Assume the current ranking function is $f_t$
- There are two possible cases for each pairwise hinge term: the margin constraint is violated (the term is active) or satisfied (the term vanishes)
- Assume the label of $x$ is estimated by the current ranker
- The derivative w.r.t. $f$ at a single point is then $-\mathrm{sgn}(y - y_j)$ for active terms, or $0$ otherwise

18 Active Sampling for RankSVM III
- Substitute $f_t$ into the previous equation to estimate the derivative of the added loss terms
- The magnitude of the total derivative estimates the ability of $x$ to change the current ranker if added into training
- Finally, sample $x^{*} = \arg\max_{x \in U} \|\text{total derivative at } x\|$
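A sketch under stated assumptions: pairwise hinge loss, a linear scoring function, numpy-array inputs, and the candidate paired with every labeled point of a different relevance grade; the selection score is the magnitude of the (sub)derivative of the added loss terms, as described above. This is an illustration of the idea, not the exact WWW 2008 equations:

```python
import numpy as np

def rank_svm_sample(w, X_lab, y_lab, X_pool, y_guess):
    """Pick the pool point whose pairwise hinge-loss terms would pull
    hardest on the current linear ranker w.  y_guess holds the assumed
    relevance grade of each candidate (e.g., the model's own prediction)."""
    f_lab = X_lab @ w
    scores = []
    for x, y in zip(X_pool, y_guess):
        fx = float(x @ w)
        grad = np.zeros_like(w)
        for xj, yj, fj in zip(X_lab, y_lab, f_lab):
            if y == yj:
                continue
            sign = 1.0 if y > yj else -1.0       # which side should rank higher
            if 1.0 - sign * (fx - fj) > 0.0:     # hinge term is active
                grad += -sign * (x - xj)         # subgradient of that term
        scores.append(np.linalg.norm(grad))
    return int(np.argmax(scores))
```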

19 Active Sampling for RankBoost I
- Again, estimate how the current ranker $H$ would change if $x$ were in the training set
- Estimate this change by the difference in ranking loss before and after $x$ is added
- The ranking loss w.r.t. $H$ is (Freund et al., 2003): $L(H) = \sum_{(x_i, x_j):\, y_i > y_j} \exp\big(H(x_j) - H(x_i)\big)$

20 Active Sampling for RankBoost II
- Difference in the ranking loss between the current and the enlarged set: $\Delta L(x) = \big|L_{S \cup \{x\}}(H) - L_{S}(H)\big|$
- $\Delta L(x)$ indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance
- Finally, the instance with the highest loss differential is sampled: $x^{*} = \arg\max_{x \in U} \Delta L(x)$
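A sketch of the loss-differential idea using the exponential ranking loss of Freund et al. (2003), assuming binary relevance and per-point scores from a current ranker H; the candidate's relevance is taken from the model's own guess, since the true label is unknown before querying:

```python
import numpy as np

def exp_rank_loss(H, rel):
    """Sum of exp(H(xj) - H(xi)) over pairs where xi is relevant (1)
    and xj is not (0): the RankBoost-style exponential loss."""
    pos, neg = H[rel == 1], H[rel == 0]
    return float(np.exp(neg[None, :] - pos[:, None]).sum())

def rank_boost_sample(H_lab, rel_lab, H_pool, rel_guess):
    """Select the candidate whose addition changes the loss the most."""
    base = exp_rank_loss(H_lab, rel_lab)
    diffs = []
    for h, r in zip(H_pool, rel_guess):
        H_new = np.append(H_lab, h)
        r_new = np.append(rel_lab, r)
        diffs.append(abs(exp_rank_loss(H_new, r_new) - base))
    return int(np.argmax(diffs))
```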

21 Results on TREC03 (results figure)

22 Active vs Proactive Learning

                    Active Learning                  Proactive Learning
Number of oracles   Individual (only one)            Multiple, with different capabilities, costs, and areas of expertise
Reliability         Infallible (100% right)          Variable across oracles and queries, depending on difficulty, expertise, ...
Reluctance          Indefatigable (always answers)   Variable across oracles and queries, depending on workload, certainty, ...
Cost per query      Invariant (free or constant)     Variable across oracles and queries, depending on workload, difficulty, ...

Note: "Oracle" ∈ {expert, experiment, computation, ...}

23 Active Learning is Awesome, but... is it Enough?
Traditional active learning assumes a single perfect source; going beyond it (proactive learning) means:
- Multiple sources with differing expertise: labeling noise, answer reluctance [JMLR '09, KDD '09, SDM '10, CIKM '08]
- A varying-cost model instead of a fixed labeling cost, depending on task difficulty, ambiguity, and expertise level
- Time-varying rather than fixed-over-time source properties

24 Scenario 1: Reluctance
- 2 oracles:
  - reliable oracle: expensive but always answers with a correct label
  - reluctant oracle: cheap but may not respond to some queries
- Define a utility score as the expected value of information at unit cost, e.g. $U(x) = \hat{P}(\text{answer} \mid x) \cdot V(x) / C$ for oracle cost $C$

25 How to estimate $\hat{P}(\text{answer} \mid x)$?
- Cluster the unlabeled data using k-means
- Ask the reluctant oracle for the label of each cluster centroid; if a label is received, increase $\hat{P}$ of nearby points; if no label, decrease $\hat{P}$ of nearby points
- The response variable equals 1 when a label is received, -1 otherwise
- The number of clusters depends on the clustering budget and the oracle fee
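A minimal sketch of this exploration step, assuming scikit-learn is available, k-means centroids are probed once each, and the answer probability of nearby points is nudged up or down; the +-1 response coding follows the slide, while the distance-decay weighting is an assumption of this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_answer_prob(X, ask_oracle, budget, fee):
    """ask_oracle(x) -> label or None (reluctant oracle).
    Returns a per-point estimate of P(oracle answers)."""
    k = max(1, int(budget // fee))               # clusters the budget affords
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    p = np.full(len(X), 0.5)                     # uninformed prior
    for c in km.cluster_centers_:
        z = 1.0 if ask_oracle(c) is not None else -1.0   # slide's +-1 coding
        w = np.exp(-np.linalg.norm(X - c, axis=1))       # assumed distance decay
        p = np.clip(p + 0.5 * z * w, 0.0, 1.0)
    return p
```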

26 Algorithm for Scenario 1 (algorithm listing shown as a figure)

27 Scenario 2: Fallibility
- Two oracles:
  - one perfect but expensive oracle
  - one fallible but cheap oracle that always answers
- The algorithm is similar to Scenario 1 with slight modifications
- During exploration, the fallible oracle provides the label together with its confidence; if the confidence falls below a threshold, we do not use the label, but we still update the oracle's reliability estimate

28 Scenario 3: Non-uniform Cost
- Uniform labeling cost: fraud detection, face recognition, etc.
- Non-uniform labeling cost: text categorization, medical diagnosis, protein structure prediction, etc.
- 2 oracles:
  - a fixed-cost oracle
  - a variable-cost oracle

29 Underlying Sampling Strategy
- Conditional-entropy-based sampling, weighted by a density measure
- Captures the information content of a close neighborhood: the entropy at $x$ is combined with that of the close neighbors of $x$
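A sketch of one plausible reading of this criterion: the predictive entropy at each point, weighted by the mean entropy of its k closest neighbors, so a point scores highly only when its whole neighborhood is informative. The exact weighting from the talk is not preserved; this combination is an assumption:

```python
import numpy as np

def entropy(proba):
    """Shannon entropy of each row of class probabilities."""
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def density_weighted_entropy(X, proba, k=10):
    """Score(x) = H(y|x) times the mean entropy over x's k nearest
    neighbors (the density-like neighborhood factor)."""
    H = entropy(proba)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    return H * H[nbrs].mean(axis=1)
```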

30 Results: Reluctance (results figure)

31 Results: Cost varies non-uniformly (results figure; improvements statistically significant, p < 0.01)

32 Sequential Bayesian Filtering [SDM '10]
- Track the states of multiple systems (here, oracle accuracies) as each evolves over time
- Observations (noisy labels) arrive sequentially
- Goal: estimate the posterior distribution over each oracle's accuracy at time $t$ given the noisy labels so far
(figure: accuracy changing with time $t$ under noisy labels)
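A minimal grid-based Bayesian filter illustrating the generic predict/update cycle, assuming the oracle's accuracy follows a slow random walk and each query yields a correct/incorrect observation; the SDM '10 model details are not in the transcript, so this is a sketch, not the paper's filter:

```python
import numpy as np

GRID = np.linspace(0.01, 0.99, 99)   # discretized accuracy values

def predict(belief, drift=0.05):
    """Predict step: diffuse the belief to model possible accuracy drift."""
    K = np.exp(-0.5 * ((GRID[:, None] - GRID[None, :]) / drift) ** 2)
    K /= K.sum(axis=0, keepdims=True)
    return K @ belief

def update(belief, correct):
    """Update step: Bernoulli likelihood of the observed (noisy) label."""
    like = GRID if correct else (1.0 - GRID)
    post = like * belief
    return post / post.sum()

# usage: start uniform, then alternate predict/update per observation
belief = np.ones_like(GRID) / len(GRID)
for obs in [True, True, False, True]:
    belief = update(predict(belief), obs)
print("posterior mean accuracy:", float((GRID * belief).sum()))
```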

33 A Closer Look at the Model [SDM '10]
- Predict step: propagate the belief over an oracle's accuracy forward in time
- Update step: condition the predicted belief on the newly observed noisy label

34 Predictor Selection [SDM '10]
(figure: probability of accuracy, anchored at the accuracy at the last time the source was selected)
- There is a chance that the accuracy might have increased
- Our belief about the accuracy diverges over time as the source goes unexplored

35 (figure: tracking results; red = true accuracy, blue = estimated, black = MLE)

36 Does Tracking Predictor Accuracy Actually Help in Proactive Learning? [SDM '10] (results figure)

37 Proactive Learning in General
- Multiple experts (a.k.a. oracles) with different areas of expertise, costs, reliabilities, and availability
- What question to ask, and whom to query?
  - Joint optimization of query and oracle selection
  - Referrals among oracles (with referral fees)
  - Learn about oracle capabilities while solving the active learning problem at hand
  - Non-static oracle properties

38 Current Issues in Proactive Learning
- Large numbers of oracles [Donmez, Carbonell & Schneider, KDD 2009]: based on a multi-armed bandit approach
- Non-stationary oracles [Donmez, Carbonell & Schneider, SDM 2010]: expertise changes with time (improves or decays); exploration vs. exploitation tradeoff
- What if the labeled set is empty for some classes? Minority-class discovery (unsupervised) [He & Carbonell, NIPS 2007, SIAM 2008, SDM 2009]; after first-instance discovery, proceed to proactive learning or to minority-class characterization [He & Carbonell, SIAM 2010]

39 Minority Classes vs Outliers
- Rare classes: a group of points, clustered, non-separable from the majority classes
- Outliers: a single point, scattered, separable

40 The Big Picture
(diagram: raw data → feature extraction → feature representation (relational, temporal) → unbalanced unlabeled data set → rare category detection → learning in unbalanced settings → classifier)

41 Minority Class Discovery Method
1. Calculate problem-specific similarity
2. Compute local densities at the current neighborhood scale t
3. Score each example by the change in local density around it
4. Query the top-scoring example
5. A new class? If no, increase t by 1 and return to step 3
6. If yes, output the discovered class (with relevance feedback)
7. Repeat until the budget is exhausted
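A hedged sketch of this discovery loop: score points by how much denser they are than their own neighbors (rare classes form small, compact clumps inside the majority), query the top one, and widen the neighborhood scale when no new class appears. This mimics the spirit of the method; the exact scoring of He & Carbonell is not preserved in the transcript:

```python
import numpy as np

def discover_minority(X, query_label, budget, t0=5):
    """query_label(i) -> class label of X[i] from the oracle."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    seen, t = set(), t0
    queried = np.zeros(len(X), dtype=bool)
    for _ in range(budget):
        r = np.sort(d, axis=1)[:, t - 1]            # distance to t-th neighbor
        counts = (d <= r[:, None]).sum(axis=1)      # local density at scale t
        nbr = np.argsort(d, axis=1)[:, :t]
        score = counts - counts[nbr].mean(axis=1)   # density change vs. neighbors
        score[queried] = -np.inf                    # never re-query a point
        i = int(np.argmax(score))
        queried[i] = True
        c = query_label(i)
        if c in seen:
            t = min(t + 1, len(X) - 1)              # no new class: widen scale
        else:
            seen.add(c)                             # new class discovered
    return seen
```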

42 Summary of Real Data Sets
- Abalone: 4177 examples, 7-dimensional features, 20 classes; largest class 16.50%, smallest class 0.34%
- Shuttle: 4515 examples, 9-dimensional features, 7 classes; largest class 75.53%, smallest class 0.13%

43 Results on Real Data Sets (figures: Abalone and Shuttle; curves for MALICE, Interleave, and random sampling)

44 Active Learning for MT
(diagram: Source Language Corpus → Active Learner → sampled corpus (S) → Expert Translator → parallel corpus (S, T) → Model Trainer → MT System)

45 ACT Framework: Active Crowd Translation
(diagram: Source Language Corpus → Sentence Selection → crowd translators producing (S, T1), (S, T2), ..., (S, Tn) → Translation Selection → Model Trainer → MT System)

46 Active Learning Strategy: Diminishing Density-Weighted Diversity Sampling
Experiments:
- Language pair: Spanish-English
- Iterations: 20; batch size: 1000 sentences each
- Translation: Moses phrase-based SMT
- Development set: 343 sentences; test set: 506 sentences
(graph: performance in BLEU vs. data in thousands of words)
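One plausible reading of diminishing density-weighted diversity sampling, shown as a sketch: sentences are scored by the corpus frequency of their n-grams (density), discounted for n-grams already covered by previously selected sentences (diversity, with diminishing returns). This is an illustration under those assumptions, not Ambati et al.'s exact formula:

```python
from collections import Counter

def ngrams(sent, n=3):
    toks = sent.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def select_batch(pool, batch_size, n=3):
    """Greedy selection: frequent n-grams are valuable (density), but
    their value diminishes once covered (diversity)."""
    freq = Counter(g for s in pool for g in ngrams(s, n))
    covered, batch = Counter(), []
    remaining = list(pool)
    for _ in range(min(batch_size, len(remaining))):
        def score(s):
            return sum(freq[g] / (1 + covered[g]) for g in ngrams(s, n))
        best = max(remaining, key=score)
        remaining.remove(best)
        batch.append(best)
        covered.update(ngrams(best, n))
    return batch
```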

47 Translation Selection from AMT (Amazon Mechanical Turk)
- Estimate translator reliability
- Translation selection: choose the candidate translation best supported by reliable translators
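A minimal sketch of reliability-weighted translation selection: each candidate translation accumulates the reliability of the translators who produced it, and the highest-weighted string wins. Agreement here is exact string match, an assumption; the ACT system's actual similarity measure is not in the transcript:

```python
from collections import defaultdict

def select_translation(candidates, reliability):
    """candidates: list of (translator_id, translation) pairs for one
    source sentence; reliability: translator_id -> weight in [0, 1]."""
    votes = defaultdict(float)
    for tid, trans in candidates:
        votes[trans] += reliability.get(tid, 0.5)   # unknown workers get 0.5
    return max(votes, key=votes.get)

# usage
cands = [("w1", "the cat sleeps"), ("w2", "the cat sleeps"), ("w3", "cat sleep")]
print(select_translation(cands, {"w1": 0.9, "w2": 0.6, "w3": 0.4}))
```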

48 Parting Thoughts
- Proactive learning: a new field, just started; new work and full details in the Donmez dissertation
  - Applications abound: e-science (computational biology), finance, network security, language technologies (MT), ...
  - Theory still in the making (e.g. Liu Yang)
  - Open challenge: proactive structure learning
- Rare-class discovery and classification: dovetails with active/proactive learning; new work and full details in the Jingrui He dissertation

49 THANK YOU!

50 Specially Designed Exponential Families [Efron & Tibshirani 1996]
- A favorable compromise between parametric and nonparametric density estimation
- Estimated density: $\hat{p}(x) = c_{\theta}\, p_0(x)\, \exp\big(\theta^\top t(x)\big)$, with carrier density $p_0$, normalizing parameter $c_{\theta}$, parameter vector $\theta$, and vector of sufficient statistics $t(x)$
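A small sketch of evaluating such a density on a 1-D grid, assuming a Gaussian kernel density estimate as the carrier and a quadratic sufficient statistic (both choices are assumptions of this sketch); the normalizer is computed numerically:

```python
import numpy as np

def carrier_kde(x, data, h=0.3):
    """Gaussian kernel density estimate p0(x) on 1-D data."""
    return np.exp(-0.5 * ((x[:, None] - data[None, :]) / h) ** 2).mean(axis=1) / (
        h * np.sqrt(2 * np.pi))

def exp_family_density(grid, data, theta):
    """p(x) = c * p0(x) * exp(theta . t(x)) with t(x) = (x, x^2);
    c normalizes the density numerically on the grid."""
    t = np.stack([grid, grid ** 2], axis=1)
    unnorm = carrier_kde(grid, data) * np.exp(t @ np.asarray(theta))
    c = 1.0 / np.trapz(unnorm, grid)
    return c * unnorm

grid = np.linspace(-4, 4, 401)
data = np.random.default_rng(0).normal(0, 1, 200)
p = exp_family_density(grid, data, theta=[0.2, -0.1])
print("integrates to ~1:", np.trapz(p, grid))
```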


52 SEDER Algorithm
- Carrier density: kernel density estimator
- To decouple the estimation of the different parameters: decompose the parameter vector per dimension, and relax the normalization constraint so each component can be estimated separately

53 Parameter Estimation
- Theorem 3 [SDM 2009]: the maximum likelihood estimates of the decomposed parameters satisfy a set of coupled fixed-point conditions

54 Parameter Estimation (cont.)
- Solving these conditions yields closed-form estimates involving a parameter that is positive in most cases

55 Scoring Function
- The estimated density $\hat{p}(x)$ from SEDER
- Scoring function: the norm of the gradient, $s(x) = \|\nabla \hat{p}(x)\|$, so points where the density changes fastest score highest
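A sketch of the gradient-norm score using central finite differences on any density estimate (the slides' closed-form gradient is not preserved, so numerical differentiation stands in for it; the scipy KDE in the usage example is an assumed stand-in for SEDER):

```python
import numpy as np
from scipy.stats import gaussian_kde

def gradient_norm_scores(density_fn, X, eps=1e-4):
    """Score each point by ||grad p(x)|| estimated with central
    finite differences along every feature dimension."""
    scores = []
    for x in X:
        g = np.zeros(len(x))
        for j in range(len(x)):
            e = np.zeros(len(x)); e[j] = eps
            g[j] = (density_fn(x + e) - density_fn(x - e)) / (2 * eps)
        scores.append(np.linalg.norm(g))
    return np.array(scores)

# usage with any density estimate p: X -> float, e.g. a Gaussian KDE
X = np.random.default_rng(1).normal(size=(100, 2))
kde = gaussian_kde(X.T)
p = lambda x: float(kde(x.reshape(-1, 1)))
print(gradient_norm_scores(p, X[:5]))
```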

56 Summary of Real Data Sets

Data Set     n     d   m   Largest Class  Smallest Class
Ecoli        336   7   6   42.56%         2.68%
Glass        214   9   6   35.51%         4.21%
Page Blocks  5473  10  5   89.77%         0.51%
Abalone      4177  7   20  16.50%         0.34%
Shuttle      4515  9   7   75.53%         0.13%

Moderately skewed: Ecoli, Glass. Extremely skewed: Page Blocks, Abalone, Shuttle.

57 Moderately Skewed Data Sets (figures: Ecoli and Glass; MALICE curves)

58 GRADE: Full Prior Information
1. For each rare class c, use its class prior
2. Calculate class-specific similarity
3. Compute local densities at the current scale t
4. Score each example by the class-specific change in local density
5. Query the top-scoring example
6. Class c? If no, increase t by 1 and return to step 4; if yes, apply relevance feedback
7. Output the discovered examples

59 Results on Real Data Sets (figures: Ecoli, Glass, Abalone, Shuttle; MALICE curves)

60 Performance Measures
- MAP (Mean Average Precision): the mean of the average precision (AP) values over all queries
- NDCG (Normalized Discounted Cumulative Gain): the impact of each relevant document is discounted as a function of its rank position
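A compact sketch of both measures, assuming results are given in ranked order with binary relevance for MAP and graded relevance for NDCG:

```python
import numpy as np

def average_precision(rel):
    """rel: binary relevance of results in ranked order."""
    rel = np.asarray(rel)
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    return float((rel * hits / ranks).sum() / max(rel.sum(), 1))

def mean_average_precision(queries):
    """queries: one binary relevance list per query."""
    return float(np.mean([average_precision(r) for r in queries]))

def ndcg(gains, k=10):
    """gains: graded relevance in ranked order; log2 rank discount."""
    g = np.asarray(gains, dtype=float)[:k]
    disc = 1.0 / np.log2(np.arange(2, len(g) + 2))
    dcg = float(((2 ** g - 1) * disc).sum())
    ideal = np.sort(np.asarray(gains, dtype=float))[::-1][:k]
    idcg = float(((2 ** ideal - 1) * disc[:len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

print(mean_average_precision([[1, 0, 1], [0, 1]]))   # toy example
print(ndcg([3, 2, 3, 0, 1, 2]))
```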

