Active Learning and Experimental Design

1 Active Learning and Experimental Design
University of Stuttgart
Anwar Sadat Delgado Monica

2 Outline
- Introduction
- Scenarios (Approaches to Querying)
- Query Strategy Frameworks: Uncertainty Sampling, Query-by-Committee, Expected Model Change, Expected Error Reduction, Variance Reduction, Density-Weighted Methods
- Problem Setting Variants
- Practical Considerations
- Conclusion

3 Definition
Active learning is a special case of semi-supervised machine learning in which a learning algorithm can interactively query the user (or some other information source) to obtain the desired outputs at new data points. In the statistics literature it is sometimes also called optimal experimental design. (Settles, 2010; Olsson, 2009)

4 Introduction: Learning paradigms
Machine learning tasks can be supervised (classification) or unsupervised (clustering and model building). There is usually an abundance of unlabeled data, which raises two questions: how much of it should be labeled, and which instances should be labeled? The answers matter because labeling has costs: time, money, etc.
Active learning: incrementally request labels for the instances believed to be most informative.

5 Passive Learning vs. Active Learning
General schema of a passive learner: for any supervised or unsupervised learning task, it gathers a significant amount of data randomly sampled from the underlying population distribution and then induces a classifier or model.
Con: gathering labeled data is often too expensive (time, money) and is frequently the most time-consuming and costly part of an application.
Figure: a passive learner receives a random data set from the world and then outputs a classifier or model. Traditional supervised learning algorithms use whatever labeled data is provided to induce a model; by contrast, active learning gives the learner a degree of control by letting it select which instances are labeled and added to the training set.

6 Passive Learning vs. Active Learning
General schema of an active learner. The difference: the ability to ask queries about the world based on past queries and responses. What exactly a query is, and what response it receives, depends on the task at hand.
Pro: the entire data set need not be labeled; it is the learner's task to request the labels it considers relevant.
Figure: an active learner gathers information about the world by asking queries and receiving responses.

7 Active Learning Heuristic
- Start with a pool of unlabeled data U.
- Pick a few points at random and get their labels from an oracle (e.g. a human annotator).
- Repeat:
  - Fit a classifier to the labels seen so far.
  - Pick the BEST unlabeled point to get a label for (closest to the boundary? most uncertain? most likely to decrease overall uncertainty?)
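A minimal sketch of this heuristic for binary classification, assuming scikit-learn's LogisticRegression as the model and a closest-to-the-boundary (posterior nearest 0.5) selection criterion; the oracle callable, seed size, and budget are placeholders, not part of the original slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, n_seed=5, budget=50, rng=None):
    """Pool-based active learning with an uncertainty (closest-to-0.5) criterion.

    X_pool : (n, d) array of unlabeled instances
    oracle : callable returning the true label for a pool index (e.g. a human annotator)
    """
    rng = rng or np.random.default_rng(0)
    unlabeled = list(range(len(X_pool)))
    # 1) Pick a few points at random and ask the oracle for their labels.
    labeled = list(rng.choice(unlabeled, size=n_seed, replace=False))
    unlabeled = [i for i in unlabeled if i not in labeled]
    y_labeled = [oracle(i) for i in labeled]

    model = LogisticRegression()
    for _ in range(budget):
        # 2) Fit a classifier to the labels seen so far.
        model.fit(X_pool[labeled], y_labeled)
        # 3) Pick the unlabeled point the model is least certain about
        #    (posterior probability of the positive class nearest 0.5).
        proba = model.predict_proba(X_pool[unlabeled])[:, 1]
        best = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]
        # 4) Query the oracle and move the point into the labeled set.
        labeled.append(best)
        y_labeled.append(oracle(best))
        unlabeled.remove(best)
    return model
```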

8 Active Learning (II) Start: Unlabeled data

9 Active Learning (III) Label a random subset

10 Active Learning (IV) Fit a classifier to labeled data

11 Active Learning (V) Pick the BEST next point to label (e.g. closest to boundary)

12 Active Learning (VI) Fit a classifier to labeled data

13 Active Learning (VII) Pick the BEST next point to label (e.g. closest to boundary)

14 Active Learning (VIII)
Fit a classifier to labeled data

15 Active learning examples
Active learning is well motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain.
Applications: speech recognition, information extraction, classification and filtering.
Figure: comparison of uncertainty sampling (active learning) vs. random sampling for text classification of Baseball vs. Hockey.

16 Scenarios (Approaches to Querying)
Three Main Active Learning Scenarios (Settles, 2010)

17 Membership Query Synthesis
The learner may request labels for any unlabeled instance in the input space, including queries that the learner generates de novo rather than samples drawn from some underlying natural distribution.
Pros:
- Computationally tractable (for finite domains)
- Can be extended to regression tasks
- Promising for automating experiments that do not require a human annotator
Cons:
- A human annotator may have difficulty interpreting and labeling arbitrary instances

18 Stream-Based Selective Sampling
Generally used when obtaining an unlabeled instance is free or relatively cheap: the instance is first sampled from the actual distribution, and the learner then decides whether or not to request its label. This approach is called stream-based or sequential active learning, as each unlabeled instance is typically drawn one at a time from the data source and the learner must decide whether to query or discard it.
Pro:
- Well suited when memory or processing power is limited, as with mobile and embedded devices
Con:
- The assumption that unlabeled data is available for free might not always hold
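A minimal sketch of the stream-based decision rule, assuming a binary probabilistic model with a predict_proba method and a fixed uncertainty threshold (both assumptions of this sketch, not of the slides); in practice the threshold, or a region-of-uncertainty test, would be tuned.

```python
import numpy as np

def process_stream(model, stream, oracle, threshold=0.2):
    """Stream-based selective sampling: examine one instance at a time and
    decide whether to query its label or discard it."""
    X_new, y_new = [], []
    for x in stream:                      # instances arrive one at a time
        p = model.predict_proba(x.reshape(1, -1))[0, 1]
        if abs(p - 0.5) < threshold:      # model is uncertain -> worth a label
            X_new.append(x)
            y_new.append(oracle(x))       # query the (possibly human) oracle
        # otherwise the instance is discarded at no labeling cost
    return np.array(X_new), np.array(y_new)
```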

19 Pool-Based Sampling
Figure: the pool-based active learning cycle (Settles, 2010)

20 Pool-Based Sampling
Assumes there is a small set of labeled data L and a large pool of unlabeled data U available. The learner is supplied with the pool of unlabeled examples, from which it selects queries.
Pros:
- Probably the most widely used sampling scenario (text classification, information extraction, image classification, video classification, speech recognition, cancer diagnosis)
- Applied to many real-world tasks
Cons:
- Computationally intensive
- Needs to evaluate the entire pool at each iteration

21 Query Strategy Frameworks
Uncertainty Sampling Query-By-Committee Expected Model Change Expected Error Reduction Variance Reduction Density-Weighted Methods

22 Uncertainty Sampling (I)
An active learner queries the instances about which it is least certain how to label (e.g. those closest to the decision boundary). There are three main strategies: least confident, margin sampling, and entropy.
When using probabilistic learning models for binary classification, uncertainty sampling simply queries the instance whose posterior probability of being positive is nearest 0.5.

23 Uncertainty Sampling (II)
For problems with three or more class labels, the learner might query the instance whose prediction is the least confident:
$x^*_{LC} = \operatorname{argmax}_x \; 1 - P_\theta(\hat{y} \mid x)$, where $\hat{y} = \operatorname{argmax}_y P_\theta(y \mid x)$ is the class label with the highest posterior probability under model θ.
One way to interpret this measure is as the expected 0/1-loss, i.e., the model's belief that it will mislabel x. In general, $x^*_A$ denotes the most informative instance (i.e., the best query) according to some query selection algorithm A.
Pro: appropriate if the objective is to reduce classification error.
Con: this strategy only considers information about the most probable label.

24 Uncertainty Sampling (III)
A partial solution is margin sampling, a multi-class uncertainty sampling variant:
$x^*_{M} = \operatorname{argmin}_x \; P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$, where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable class labels under the model.
Pros:
- Corrects the least confident strategy by incorporating the posterior of the second most likely label
- Appropriate if the objective is to reduce classification error
Con:
- For problems with very large label sets, it still ignores the output distribution of the remaining classes

25 Uncertainty Sampling (IV)
A more general solution is entropy:
$x^*_{H} = \operatorname{argmax}_x \; -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x)$, where $y_i$ ranges over all possible labelings.
Entropy is an information-theoretic measure of the amount of information needed to "encode" a distribution, and is often thought of as a measure of uncertainty or impurity in machine learning.
Pros:
- Generalizes easily to probabilistic multi-label classifiers and to more complex structured instances (e.g. sequences, trees)
- Appropriate if the objective is to minimize log-loss
Note: for binary classification, entropy-based sampling reduces to the margin and least confident strategies.
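To make the three measures concrete, here is a minimal numpy sketch (not from the original slides): each function takes a matrix of posterior class probabilities $P_\theta(y \mid x)$ for the candidate pool (one row per instance) and returns a score where larger means more informative; how the probabilities are obtained and how ties are broken is left to the caller.

```python
import numpy as np

def least_confident(proba):
    # 1 - P(y_hat | x): the model's belief that it will mislabel x
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # difference between the two most probable labels; a small margin means
    # high uncertainty, so negate it to keep "larger = more informative"
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy(proba, eps=1e-12):
    # H(y|x) = -sum_y P(y|x) log P(y|x), summed over all possible labels
    return -np.sum(proba * np.log(proba + eps), axis=1)

# Usage (hypothetical): proba = model.predict_proba(X_unlabeled)
# query_index = np.argmax(entropy(proba))
```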

26 Uncertainty Sampling (V)
Figure: heat maps illustrating the query behavior of the three strategies for a three-label classification problem. Simplex corners indicate where one label has very high probability, with the opposite edge showing the probability range for the other two classes when that label has very low probability. The simplex center represents a uniform posterior distribution. The most informative query region for each strategy is shown in dark red, radiating from the center.

27 Uncertainty Sampling (VI)
This strategy may also be employed with non-probabilistic classifiers. It is also applicable to regression problems (i.e., learning tasks where the output variable is a continuous value rather than a discrete class label): the learner queries the unlabeled instance for which the model has the highest output variance in its prediction. Active learning for regression problems is generally referred to as optimal experimental design.

28 Query-By-Committee (I)
The QBC approach maintains a committee C of models, all trained on the current labeled set L but representing competing hypotheses. Each committee member votes on the labeling of query candidates, and the learner picks the instances that generate the most disagreement among the hypotheses.
Goal of QBC: minimize the version space (the set of hypotheses consistent with the current labeled training data L), i.e., constrain the size of this space as much as possible with as few labeled instances as possible.

29 Query-By-Committee (II)
Figure: version space examples for (a) linear and (b) axis-parallel box classifiers. All hypotheses are consistent with the labeled training data in L (indicated by the shaded polygons), but each represents a different model in the version space.
To implement a QBC selection algorithm:
- Construct a committee of models that represent different regions of the version space
- Define a measure of disagreement among committee members
Note: there is no general agreement on the appropriate committee size, which may vary by model class or application.

30 Query-By-Committee (III)
Two approaches for measuring the level of disagreement:
1) Vote entropy:
$x^*_{VE} = \operatorname{argmax}_x \; -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$,
where $y_i$ ranges over all possible labelings, $V(y_i)$ is the number of "votes" a label receives from the committee members, and $C$ is the committee size.
2) Kullback-Leibler (KL) divergence:
$x^*_{KL} = \operatorname{argmax}_x \; \frac{1}{C}\sum_{c=1}^{C} D\bigl(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\bigr)$,
where $\theta^{(c)}$ represents a particular model in the committee and $\mathcal{C}$ is the committee as a whole.
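A minimal numpy sketch of both disagreement measures, assuming each committee member exposes a matrix of predicted class probabilities for the pool (an assumption of this sketch, not of the slides); vote entropy uses hard votes, while the KL measure compares each member's distribution to the consensus defined on the next slide.

```python
import numpy as np

def vote_entropy(committee_proba):
    """committee_proba: array of shape (C, n_instances, n_classes),
    one probability matrix per committee member."""
    C, n, k = committee_proba.shape
    votes = committee_proba.argmax(axis=2)               # hard vote of each member
    counts = np.stack([(votes == c).sum(axis=0) for c in range(k)], axis=1)
    frac = counts / C                                    # V(y)/C per label
    frac_nz = np.where(frac > 0, frac, 1.0)              # avoid log(0); 0*log(1) = 0
    return -np.sum(frac * np.log(frac_nz), axis=1)

def mean_kl_divergence(committee_proba, eps=1e-12):
    consensus = committee_proba.mean(axis=0)             # consensus P_C(y|x)
    kl = np.sum(committee_proba *
                np.log((committee_proba + eps) / (consensus + eps)), axis=2)
    return kl.mean(axis=0)                               # average KL over members

# query_index = np.argmax(mean_kl_divergence(committee_proba))
```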

31 Query-By-Committee (IV)
The "consensus" probability that $y_i$ is the correct label is given by
$P_{\mathcal{C}}(y_i \mid x) = \frac{1}{C}\sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x)$.
KL divergence is an information-theoretic measure of the difference between two probability distributions, so this disagreement measure considers the most informative query to be the one with the largest average difference between the label distribution of any one committee member and the consensus.
Aside from QBC, other strategies also attempt to minimize the version space:
- A selective sampling algorithm that uses a committee of two neural networks, the "most specific" and "most general" models (Cohn et al., 1994)
- A pool-based margin strategy for SVMs (Tong and Koller, 2000)
- The membership query algorithms (Angluin; King et al.)
Notes:
- Haussler (1994) shows that the size of the version space can grow exponentially with the size of L, so in general the version space of an arbitrary model class cannot be explicitly represented in practice.
- QBC can also be applied in regression settings by measuring disagreement as the variance among the committee members' output predictions. Since there is no notion of "version space" for models with continuous outputs, the interpretation is slightly different: L constrains the posterior joint probability P(Y, θ | L) of the predicted outputs and the model parameters. By integrating over a set of hypotheses and identifying queries that lie in controversial regions of the instance space, the learner collects data that reduces variance over both the output predictions and the model parameters (as opposed to uncertainty sampling, which considers only the output variance of a single hypothesis).

32 Expected Model Change (I)
A decision-theoretic approach: select the instance that would impart the greatest change to the current model if we knew its label.
Figure: the model before and after the queried instance is added.

33 Expected Model Change (II)
Expected Gradient Length (EGL) applies to models trained with gradient-based optimization. The learner should query the instance x which, if labeled and added to L, would result in the new training gradient of the largest magnitude. Since the true label is unknown, the length is computed as an expectation over the possible labelings:
$x^*_{EGL} = \operatorname{argmax}_x \; \sum_i P_\theta(y_i \mid x)\, \bigl\|\nabla \ell_\theta(\mathcal{L} \cup \langle x, y_i \rangle)\bigr\|$,
where $\nabla \ell_\theta(\mathcal{L})$ is the gradient of the objective function ℓ with respect to the model parameters θ, and $\nabla \ell_\theta(\mathcal{L} \cup \langle x, y_i \rangle)$ is the new gradient that would be obtained by adding the training tuple ⟨x, y_i⟩ to L.
Note: for computational efficiency we can approximate $\nabla \ell_\theta(\mathcal{L} \cup \langle x, y_i \rangle) \approx \nabla \ell_\theta(\langle x, y_i \rangle)$, since the training gradient on L is nearly zero at convergence and training instances are usually assumed to be independent.
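As an illustration only, a sketch of EGL for binary logistic regression, where the gradient of the log-loss for a single tuple ⟨x, y⟩ with respect to the weights is (P_θ(y=1|x) − y)·x; it relies on the approximation above, so the gradient of the candidate alone stands in for the new training gradient, and the bias term is ignored.

```python
import numpy as np

def expected_gradient_length(model, X_unlabeled):
    """EGL scores for a fitted binary logistic regression model (sketch).
    model must expose predict_proba; X_unlabeled is an (n, d) array."""
    proba = model.predict_proba(X_unlabeled)             # columns: P(y=0|x), P(y=1|x)
    scores = np.zeros(len(X_unlabeled))
    for y in (0, 1):
        # ||gradient|| for label y: |P(y=1|x) - y| * ||x||
        grad_norm = np.abs(proba[:, 1] - y) * np.linalg.norm(X_unlabeled, axis=1)
        scores += proba[:, y] * grad_norm                # expectation over labels
    return scores

# query_index = np.argmax(expected_gradient_length(model, X_unlabeled))
```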

34 Expected Model Change (III)
EGL prefers instances that are likely to most influence the model, regardless of the resulting query label.
Cons:
- Computationally expensive if both the feature space and the set of labelings are very large
- If features are not properly scaled, the EGL approach can be led astray: the informativeness of an instance may be over-estimated simply because one of its feature values is unusually large, or the corresponding parameter estimate is larger, both of which produce a gradient of high magnitude (a possible remedy is parameter regularization)

35 Expected Error Reduction (I)
Another decision-theoretic approach: query the instance that would most reduce the model's generalization error. The idea is to estimate the expected future error of a model trained on L ∪ ⟨x, y⟩ over the remaining unlabeled instances in U, and query the instance with minimal expected future error (risk).
One approach is to minimize the expected 0/1-loss:
$x^*_{0/1} = \operatorname{argmin}_x \; \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}\bigl(\hat{y} \mid x^{(u)}\bigr) \right)$,
where $\theta^{+\langle x, y_i \rangle}$ refers to the new model after it has been re-trained with the training tuple ⟨x, y_i⟩ added to L.
Notes:
- U is assumed to be representative of the test distribution and is used as a sort of validation set.
- As with EGL, the true label of each query instance is unknown, so we approximate using the expectation over all possible labels under the current model θ.
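A sketch of the 0/1-loss variant, assuming a scikit-learn-style estimator that can be cloned and refit; it re-trains one model per candidate and per possible label, which makes the cost discussed on the next slide explicit, so in practice the candidate set would be subsampled.

```python
import numpy as np
from sklearn.base import clone

def expected_error_reduction(model, X_lab, y_lab, X_pool, candidate_idx):
    """Expected future error (0/1-loss proxy) for each candidate pool index."""
    candidate_idx = np.asarray(candidate_idx)
    proba = model.predict_proba(X_pool[candidate_idx])
    classes = model.classes_
    risks = []
    for row, i in enumerate(candidate_idx):
        risk = 0.0
        for c, label in enumerate(classes):
            # retrain with the hypothetical labeled tuple <x, label> added to L
            m = clone(model).fit(np.vstack([X_lab, X_pool[i]]),
                                 np.append(y_lab, label))
            # expected 0/1-loss over the rest of the unlabeled pool
            rest = np.delete(candidate_idx, row)
            future = 1.0 - m.predict_proba(X_pool[rest]).max(axis=1)
            risk += proba[row, c] * future.sum()
        risks.append(risk)
    return np.array(risks)   # query the candidate with the minimum risk
```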

36 Expected Error Reduction (II)
Objectives:
- Reduce the expected total number of incorrect predictions (0/1-loss), or
- Minimize the expected log-loss, which is equivalent to reducing the expected entropy over U (i.e., maximizing the expected information gain of the query x, or the mutual information of the output variables over x and U)
It has been used successfully with naïve Bayes, Gaussian random fields, logistic regression, and SVMs.
Cons:
- The most computationally expensive query framework: it requires estimating the expected future error over U for each query, and a new model must be incrementally re-trained for each possible query labeling
- Because of this complexity, it is usually applied only to simple binary classification tasks
Notes:
- Roy and McCallum (2001) first proposed the expected error reduction framework, for text classification with naïve Bayes.
- For nonparametric model classes such as Gaussian random fields (Zhu et al., 2003), the incremental training procedure is efficient and exact, making the approach fairly practical.
- For many other model classes this is not the case: a binary logistic regression model would require O(ULG) time simply to choose the next query, where U is the size of the unlabeled pool, L is the size of the current training set, and G is the number of gradient computations required by the optimization procedure until convergence.

37 Variance Reduction (I)
Objective: reduce generalization error indirectly by minimizing the output variance.
For a regression problem, the learner's expected future error can be decomposed (the outer expectation is over the conditional density P(y|x), the inner ones over the labeled set L):
$E_T\!\left[(\hat{y} - y)^2 \mid x\right] = E\!\left[(y - E[y \mid x])^2\right] + \bigl(E_{\mathcal{L}}[\hat{y}] - E[y \mid x]\bigr)^2 + E_{\mathcal{L}}\!\left[(\hat{y} - E_{\mathcal{L}}[\hat{y}])^2\right]$
1) Noise: the variance of the true label y given only x, which does not depend on the model or training data. It may result from stochastic effects of the method used to obtain the labels, or because the feature representation is inadequate.
2) Bias: the error due to the model class itself, e.g., if a linear model is used to learn a function that is only approximately linear. This component is invariant for a fixed model class.
3) Variance: the remaining component of the learner's squared-loss with respect to the target function.
Since noise and bias are fixed, minimizing the variance is guaranteed to minimize the future generalization error of the model.

38 Variance Reduction (II)
The variance reduction query selection strategy then becomes
$x^*_{VR} = \operatorname{argmin}_x \; \langle \hat{\sigma}^2_{\hat{y}} \rangle$,
where $\langle \hat{\sigma}^2_{\hat{y}} \rangle$ is the estimated mean output variance once the model has been re-trained on the query x.
The output variance can be expressed in terms of the Fisher information matrix (MacKay, 1992):
$\hat{\sigma}^2_{\hat{y}}(x) \approx \left(\frac{\partial \hat{y}}{\partial \theta}\right)^{\!\top} F(\theta)^{-1} \left(\frac{\partial \hat{y}}{\partial \theta}\right)$
In other words, to minimize the variance over its parameter estimates, an active learner should select data that maximizes the Fisher information (or, equivalently, minimizes its inverse). The Fisher information F(θ) represents the overall uncertainty about the estimated model parameters θ.

39 Density-Weighted Methods (I)
The previous methods may spend time querying possible outliers simply because they are controversial.
Figure: uncertainty sampling can be a poor strategy for classification. Since A is on the decision boundary, it would be queried as the most uncertain; however, querying B is likely to yield more information about the data distribution as a whole.
The basic idea of density-weighted methods is to query instances that are not only "informative" but also "representative", i.e., inhabit dense regions of the input space.

40 Density-Weighted Methods (II)
1) The information density framework (Settles and Craven, 2008).
Main idea: informative instances should not only be those which are uncertain, but also those which are "representative" of the underlying distribution.
Figure: the queried instance lies in the densest region while still being very close to the decision boundary.

41 Density-Weighted Methods (III)
The Settles and Craven approach:
$x^*_{ID} = \operatorname{argmax}_x \; \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}\bigl(x, x^{(u)}\bigr) \right)^{\beta}$,
where $\phi_A(x)$ represents the informativeness of x according to some "base" query strategy A, such as an uncertainty sampling or QBC approach. The second term weights the informativeness of x by its average similarity to all other instances in the input distribution (as approximated by U), subject to a parameter β that controls the relative importance of the density term.
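A sketch of information density, assuming cosine similarity as sim(·, ·) and any of the earlier uncertainty scores as the base measure φ_A; the choice of similarity function, base strategy, and β are all up to the practitioner and are not prescribed by the slides.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density(base_scores, X_unlabeled, beta=1.0):
    """Weight base informativeness scores (e.g. entropy or QBC disagreement)
    by each instance's average similarity to the rest of the pool."""
    sim = cosine_similarity(X_unlabeled)          # (U, U) pairwise similarities
    density = sim.mean(axis=1)                    # average similarity to the pool
    return base_scores * density ** beta

# Usage (hypothetical), reusing the entropy() sketch from slide 25:
# scores = information_density(entropy(model.predict_proba(X_unlabeled)), X_unlabeled)
# query_index = np.argmax(scores)
```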

42 Density-Weighted Methods (IV)
- McCallum and Nigam (1998) also developed a density-weighted QBC approach for text classification with naïve Bayes (a special case of information density).
- Fujii et al. (1998) considered a query strategy for nearest-neighbor methods that selects queries that are least similar to the labeled instances in L and most similar to the unlabeled instances in U.
- Nguyen and Smeulders (2004) proposed a density-based approach that first clusters instances and tries to avoid querying outliers by propagating label information to instances in the same cluster.
- Xu et al. (2007) use clustering to construct sets of queries for batch-mode active learning with SVMs.

43 Density-Weighted Methods (V)
Pro:
- If densities can be pre-computed efficiently and cached for later use, the time required to select the next query is essentially no different from that of the base informativeness measure (e.g., uncertainty sampling). This is advantageous for conducting active learning interactively with oracles in real time.

44 Usage Comparison of Different Active Learning Approaches
Figure: usage comparison of different active learning approaches (Guyon, Cawley, Dror, Lemaire, 2009 — Active Learning Challenge).

45 Problem Setting Variants
Some common problem setting variants, generalizations, and extensions of traditional active learning:
- Active learning for structured outputs
- Active feature acquisition and classification
- Active class selection
- Active clustering

46 AL for Structured Outputs (I)
Many important learning problems involve predicting structured outputs on instances, such as sequences and trees.
Figure: example of an information extraction problem viewed as a sequence labeling task.

47 AL for Structured Outputs (II)
Unlike simpler classification tasks, each instance x in this setting is not represented by a single feature vector but by a structured sequence of feature vectors, and the appropriate label for a token often depends on its context in the sequence. Labels are typically predicted by a sequence model based on a probabilistic finite state machine, such as a CRF or HMM.
Settles and Craven, "An Analysis of Active Learning Strategies for Sequence Labeling Tasks" (2008), discuss how many active learning strategies can be adapted to structured labeling tasks.

48 Active Feature Acquisition and Classification (I)
In some learning domains, instances may have incomplete feature descriptions; for example, many data mining tasks in modern business are characterized by naturally incomplete customer data. The task of the model is to classify using this incomplete information as the feature set.
Active feature acquisition seeks to alleviate this by allowing the learner to request more complete feature information, under the assumption that additional features can be obtained at extra cost or effort.

49 Active Feature Acquisition and Classification (II)
The goal in active feature acquisition is to select the most informative features to obtain during training, rather than acquiring all new features for all training instances randomly or exhaustively. There have been multiple approaches to this problem, e.g.:
- Zheng and Padmanabhan proposed two "single-pass" approaches: (1) impute the missing values and then acquire the ones about which the model has least confidence; (2) impute the values, train a classifier on the imputed training instances, and acquire feature values only for the instances that are misclassified.
- Incremental active feature acquisition may instead acquire values for a few salient features at a time, e.g., by selecting a small batch of misclassified examples (Melville et al., 2004).

50 Active Feature Acquisition and Classification (III)
- Another option is to take a decision-theoretic approach and acquire the feature values which are expected to maximize some utility function (Saar-Tsechansky et al., 2009).
- Some work on active feature acquisition considers cases in which feature values can be acquired at test time rather than during training.
The difference between these settings and typical active learning is that the "oracle" provides salient feature values rather than training labels.
Note: since feature values can vary greatly in their acquisition costs (e.g., two different medical tests might provide roughly the same predictive power while one costs half as much as the other), some of these approaches are related in spirit to cost-sensitive active learning.

51 Active Class Selection
Until now we have assumed that the data needed for training the model is freely available and that labeling involves a cost. In the active class selection problem the situation is reversed: the learner can query a known class label, but obtaining each instance of that class incurs a cost.
Lomasky et al. (2007) propose several active class selection query algorithms for an "artificial nose" task, in which a machine learns to discriminate between different vapor types (the class labels) that must be chemically synthesized (to generate the instances).

52 Active Clustering
Clustering algorithms are the most common examples of unsupervised learning. Since active learning aims at selecting data that will reduce the model's uncertainty about labels, unsupervised active learning may seem counterintuitive.
Hofmann and Buhmann proposed an active clustering algorithm for proximity data based on an expected value of information criterion ("Active data clustering," NIPS). The idea is to generate (or subsample) the unlabeled instances in such a way that they self-organize into groupings with less overlap or noise than clusters induced from random sampling.

53 Practical Considerations
Until recently the main research question has been: "Can machines learn with fewer training instances if they ask questions?" The answer is yes, subject to some assumptions (e.g., a single oracle, the oracle is always right, the cost of labeling instances is uniform). These assumptions rarely hold in real-world situations, so the research trend has turned to a new question: "Can machines learn more economically if they ask questions?"

54 Noisy Oracles
The assumption that the oracle is always correct does not hold in real-world scenarios. Crowdsourcing tools and annotation games allow the result to be averaged out by obtaining labels from multiple oracles.
Question: when should the learner query a potentially noisy oracle, or even re-request the label for an instance that seems off? Research has shown that labeling quality can be improved by selective relabeling.
Open questions:
- How can the learner incorporate varying levels of noise from the same oracle?
- What if repeated labeling cannot improve quality at all?

55 Variable Labeling Costs
The aim is to reduce the total labeling cost; simply reducing the number of labels does not necessarily solve the problem. Annotation costs can be reduced by using trained models to assist in labeling query instances by pre-labeling them (similar to data parsing).
Kapoor et al. propose a method that accounts for both labeling costs and mislabeling costs: queries are evaluated by summing the labeling cost and the expected cost of mislabeling, which depends on the quality of the oracle (both the labeling cost and the oracle quality must be known beforehand).

56 Batch-Mode Active Learning
Most active learning research assumes queries are selected serially, one at a time. In the presence of multiple annotators, this slows down the learning process. Batch-mode active learning allows the learner to query instances in groups, which is better suited to parallel labeling.
The challenge is forming the query set: simply taking the "Q-best" queries according to an instance-level strategy usually causes information overlap and is not helpful. Research therefore focuses on approaches that explicitly incorporate diversity into the batch; a sketch of the cluster-based idea follows.
- Brinker (2003) considers an approach for SVMs that explicitly incorporates diversity among the instances in the batch.
- Xu et al. (2007) propose a similar approach for SVM active learning that also incorporates a density measure: they query the centroids of clusters of instances that lie closest to the decision boundary.
- Hoi et al. (2006a,b) extend the Fisher information framework to the batch-mode setting for binary logistic regression; Hoi et al. (2006b) exploit the properties of submodular functions to find batches that are guaranteed to be near-optimal.
- Guo and Schuurmans (2008) treat batch construction for logistic regression as a discriminative optimization problem and attempt to construct the most informative batch directly.
For the most part, these approaches improve over random batch sampling, which in turn is generally better than simple "Q-best" batch construction.
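A sketch in the spirit of the cluster-based approaches above (not any author's exact method): pre-filter the pool to the most uncertain instances, cluster them, and query the instance closest to each cluster center so the batch is both informative and diverse; KMeans and the pool_factor pre-filter are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_unlabeled, uncertainty_scores, batch_size=10, pool_factor=5):
    """Select a diverse batch of queries instead of the Q-best most uncertain ones."""
    # keep only the most uncertain candidates, then spread the batch across clusters
    top = np.argsort(uncertainty_scores)[-batch_size * pool_factor:]
    km = KMeans(n_clusters=batch_size, n_init=10).fit(X_unlabeled[top])
    batch = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(X_unlabeled[top] - center, axis=1)
        batch.append(top[int(np.argmin(dists))])   # instance closest to each centroid
    return np.array(batch)
```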

57 Alternative Query Types
Active learning typically assumes that a "query unit" is of the same type as the target concept to be learned: if the task is to assign class labels to text documents, the learner queries a document and the oracle provides its label. Alternatives include:
- Multiple-instance (MI) learning (Settles et al., 2008): instances are grouped into bags and the bags are labeled rather than the instances. A bag is labeled positive if any of its instances is positive, and negative if and only if all of its instances are negative. Labeling bags is easier than labeling individual instances.
- Mixed-granularity MI learning, also proposed by Settles, in which the learner can query both instances and bags.
- Querying on features rather than labels, possibly interleaving feature-based learning with instance labeling to improve accuracy (e.g., the presence of the word "football" in a text document usually indicates a sports document).

58 Multi-Task Active Learning
The same data instance can be labeled in multiple ways for different subtasks, and it is sometimes more economical to label all tasks for an instance simultaneously.
Qi et al. (2008) study a multi-task active learning scenario in which images may be labeled for several binary classification tasks in parallel. For example, an image might be labeled as containing a beach, sunset, mountain, field, etc.; these labels are not all mutually exclusive, but they are not entirely independent either.

59 Stopping Criteria
When is it good to stop?
- When the cost of acquiring new training data exceeds the cost of the errors made by the current model
- When the accuracy plateau has been reached (and recognized)
In practice, the real stopping criterion for applications is based on economic or other external factors, which likely come well before an intrinsic learner-decided threshold.

60 Analysis of Active Learning
A number of research studies, as well as work by large organizations such as Google, IBM, and CiteSeer, suggest that active learning works, though success is closely tied to the quality of the annotators. In most of the studied research it does reduce the number of instance labels required to achieve a given level of accuracy. However, there are also reports of users opting out of active learning because they lacked confidence in the approach.

61 Conclusions
Active learning is a growing area of research in machine learning, driven by the fact that data is ever more abundant and ever cheaper to obtain while labels remain costly. Much of the research focuses on improving the accuracy of the learner with fewer queries, and current work aims at applying active learning to practical scenarios and real problem solving.

