# Paper by Yoav Freund and Robert E. Schapire

A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Paper by Yoav Freund and Robert E. Schapire Presented by Steven S. Olson

Outline -The General Idea and Problem Definition -Underlying Concepts -On-line Allocation of Resources -Hedge Algorithm -Boosting -AdaBoost -Exam Questions

The Problem - Gambler is tired of losing, decides to let friends make bets on his behalf. - Gambler decides he will wager a fixed sum of money on each race, but will apportion his money among his friends, based on how well they're doing. - Can't know which friend will win, which will lose, so how to distribute the money?

Useful Terms - On-Line Learning: Information arrives one step at a time, and the learner must apply its model and make a prediction at each step. - Weak Learner: An algorithm whose chance of being correct is only marginally better than random guessing. Impractical by itself in most cases. - PAC: Probably Approximately Correct. With high probability, the prediction rule returned will be close to the actual result.

Ensemble Learning A central idea in this presentation. A machine learning paradigm where multiple learners, often weak ones, are used to solve a problem. More than one learner is used to increase the accuracy of results, the speed with which results are achieved, or both. Boosting, covered later, is one of the most important ensemble learning methods.

On Line Learning A method of learning where we don't know what information is coming, but we are well enough equipped to guess what it might be. Broadly speaking, it follows this cycle: 1. Nature gives us some side information 2. We give some prediction 3. Nature outputs some observation 4. The cycle is repeated

On Line Allocation of resources
The problem on-line learning aims to solve is: how do we best apportion resources dynamically among a set of options? Or put more simply, given a set of individual predictions, how much trust should we place in each of them? A good example of this is the gambling problem (a recurring example).

Formal On Line Model - Allocation agent: A (the gambler) - A particular strategy: i (a friend) - Set of strategies: {1, 2, ..., N} (N is the number of friends) - Number of time steps: T (number of races) - Distribution over strategies at step t: $p^t$ (how much money he gives to each friend) - Loss vector at step t: $\ell^t$ (money not gained from betting, one entry per friend), so the agent's loss at step t is $p^t \cdot \ell^t$

Formal On Line Model Con't
- The goal is to minimize the net loss of our algorithm A: its cumulative loss over the T steps minus the cumulative loss of the best single strategy.
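In symbols, a sketch of this goal, writing $p^t$ for the distribution and $\ell^t$ for the loss vector at step t (the names $L_A$ and $L_i$ for the cumulative losses are introduced here for illustration):

```latex
L_A = \sum_{t=1}^{T} p^t \cdot \ell^t, \qquad
L_i = \sum_{t=1}^{T} \ell_i^t, \qquad
\text{net loss} = L_A - \min_{1 \le i \le N} L_i
```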

The Hedge Algorithm The Hedge algorithm is the authors' on-line allocation algorithm. The idea: maintain a weight vector $w^t$ holding one weight per “strategy” (friend to give money to) at time t. The higher the weight, the better we think that strategy is. The algorithm allocates among the strategies using the current weight vector, after normalizing: $p^t = w^t / \sum_{i=1}^{N} w_i^t$.

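The Hedge Algorithm cont'd
After the loss vector $\ell^t$ has been received, the weight vector $w^t$ is updated using the multiplicative rule $w_i^{t+1} = w_i^t \, \beta^{\ell_i^t}$, where $\beta \in [0, 1]$ is the algorithm's parameter.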

The Hedge Algorithm Pseudo code
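A minimal Python sketch of the algorithm as described on the previous slides: normalize the weights, suffer the weighted loss, then apply the multiplicative update. The function name `hedge`, its signature, and the two-race data are assumptions made for illustration. The paper assumes losses in [0, 1]; the dollar losses of the gambling example are used here only to mirror the slides.

```python
def hedge(loss_vectors, beta, initial_weights=None):
    """Run Hedge(beta) over a sequence of loss vectors.

    Returns (total_loss, allocations, final_weights).
    """
    n = len(loss_vectors[0])
    weights = list(initial_weights) if initial_weights else [1.0 / n] * n
    total_loss = 0.0
    allocations = []
    for losses in loss_vectors:
        # Normalize the weights to get this round's allocation p^t.
        total_weight = sum(weights)
        p = [w / total_weight for w in weights]
        allocations.append(p)
        # The allocator suffers the weighted average loss p^t . l^t.
        total_loss += sum(p_i * l_i for p_i, l_i in zip(p, losses))
        # Multiplicative update: w_i <- w_i * beta ** l_i.
        weights = [w * beta ** l for w, l in zip(weights, losses)]
    return total_loss, allocations, weights


# Two races with three friends; each friend loses the same amount in both races.
total, allocations, weights = hedge([[2, 1, 4], [2, 1, 4]], beta=0.5)
print(round(total, 2))
print([round(p, 3) for p in allocations[1]])
```

In the second race the allocation shifts most of the money to friend 2, who lost the least in the first race.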

Hedge in action Gambler splits his money between three friends, giving \$5 to each, p1 = <.33, .33, .33>. Gambler records the loss of each friend: Friend 1 loses \$2, Friend 2 loses \$1, Friend 3 loses \$4. Loss vector l1 = <2, 1, 4>. Total loss: .33×2 + .33×1 + .33×4 = 2.33

Hedge in action con't The gambler sets new weights using this data and a β of 0.5 (taking .33 as 1/3). Friend 1 is weighted 1/3 × 0.5^2 = 0.0833. Friend 2 is weighted 1/3 × 0.5^1 = 0.1667. Friend 3 is weighted 1/3 × 0.5^4 = 0.0208. Total weight = 0.2708. The gambler then repeats the process, but now hedging his bets. p2 = <0.0833, 0.1667, 0.0208> / 0.2708 ≈ <0.308, 0.615, 0.077>

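Finding Bounds of Hedge
After a lengthy derivation, the bound for Hedge (with initial weights summing to 1) was found to be $L_{\mathrm{Hedge}(\beta)} \le \dfrac{-\ln\left(\sum_{i=1}^{N} w_i^1 \beta^{L_i}\right)}{1-\beta}$, where $L_i$ is the cumulative loss of strategy i. Or, in simpler terms, with uniform initial weights, $L_{\mathrm{Hedge}(\beta)} \le \dfrac{\min_i L_i \ln(1/\beta) + \ln N}{1-\beta}$.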
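As a numerical sanity check, the sketch below runs Hedge from uniform weights on random losses in [0, 1] and compares its cumulative loss with the bound $(\min_i L_i \ln(1/\beta) + \ln N)/(1-\beta)$ from the paper. The function names, the random data, and the seed are assumptions made for illustration; a check on one data set illustrates the bound, it does not prove it.

```python
import math
import random


def hedge_total_loss(loss_vectors, beta):
    """Cumulative loss of Hedge(beta) started from uniform weights."""
    n = len(loss_vectors[0])
    weights = [1.0 / n] * n
    total = 0.0
    for losses in loss_vectors:
        norm = sum(weights)
        total += sum(w / norm * l for w, l in zip(weights, losses))
        weights = [w * beta ** l for w, l in zip(weights, losses)]
    return total


def hedge_bound(loss_vectors, beta):
    """Right-hand side (min_i L_i * ln(1/beta) + ln N) / (1 - beta)."""
    n = len(loss_vectors[0])
    best = min(sum(column) for column in zip(*loss_vectors))
    return (best * math.log(1.0 / beta) + math.log(n)) / (1.0 - beta)


random.seed(0)  # arbitrary seed, only for reproducibility
losses = [[random.random() for _ in range(5)] for _ in range(200)]
for beta in (0.3, 0.6, 0.9):
    print(beta, hedge_total_loss(losses, beta) <= hedge_bound(losses, beta))
```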

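Hedge Bounds Cont'd The Hedge bound has the form $L \le c \min_i L_i + a \ln N$, with $c = \ln(1/\beta)/(1-\beta)$ and $a = 1/(1-\beta)$. Based on research done by V. G. Vovk, the constants a and c are optimal: no on-line allocation algorithm can achieve a bound of this form with both a smaller c and a smaller a.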

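How to choose β So far, we've looked at Hedge(β) for a given choice of β. In practice it is often useful to choose β so as to maximally exploit prior knowledge. Suppose we know an upper bound $\tilde{L}$ on the loss of the best strategy. Set $\beta = g(\tilde{L}/\ln N)$ where $g(z) = \dfrac{1}{1+\sqrt{2/z}}$. Then, $L_{\mathrm{Hedge}(\beta)} \le \min_i L_i + \sqrt{2\tilde{L}\ln N} + \ln N$.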
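A short sketch of this tuning rule, assuming the paper's choice $\beta = g(\tilde{L}/\ln N)$ with $g(z) = 1/(1+\sqrt{2/z})$, where $\tilde{L}$ is a known upper bound on the best strategy's loss. The function names and the numbers in the example are hypothetical.

```python
import math


def choose_beta(loss_upper_bound, num_strategies):
    """beta = g(L~ / ln N) with g(z) = 1 / (1 + sqrt(2 / z)).

    Assumes loss_upper_bound > 0 and num_strategies >= 2.
    """
    z = loss_upper_bound / math.log(num_strategies)
    return 1.0 / (1.0 + math.sqrt(2.0 / z))


def tuned_bound(best_loss, loss_upper_bound, num_strategies):
    """min_i L_i + sqrt(2 * L~ * ln N) + ln N."""
    ln_n = math.log(num_strategies)
    return best_loss + math.sqrt(2.0 * loss_upper_bound * ln_n) + ln_n


# Hypothetical numbers: 10 strategies, best strategy known to lose at most 50.
beta = choose_beta(loss_upper_bound=50.0, num_strategies=10)
print(round(beta, 3))
print(round(tuned_bound(best_loss=50.0, loss_upper_bound=50.0, num_strategies=10), 2))
```

A larger $\tilde{L}$ pushes β toward 1, so the weights move more cautiously when even the best strategy is expected to lose a lot.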

More Gambling - Our Gambler is tired of his crappy friends, creates program that will predict winner based on usual information. - To create this, he asks horse racing expert to explain methodology for choosing winner. Expert gives vague “rules of thumb” on a per race basis. So we have two problems here: how does the expert choose which races to draw rules from, and which rules do we give credence to?

The General Idea In 1988 M. Kearns and L.G. Valiant asked:
“Can a “weak” learning algorithm that performs just slightly better than random guessing be “boosted” into an arbitrarily accurate “strong” learning algorithm?” Or more simply, can we take a bunch of weak learners and make one good one?

Boosting - Boosting is the method by which we take “rules of thumb” and create a workable, highly accurate prediction rule. Formally: the booster is provided with labeled training examples (x1, y1),...,(xn, yn), where yi is the label associated with instance xi. So xi might be the observation data for a particular horse race, and yi the winner of that race. For each round t, we create a distribution Dt over the set of examples, which specifies the importance of each example in that round, and the weak learner returns a rule of thumb that does well under Dt. After a certain number of rounds, the booster must combine the weak rules of thumb into one strong rule.

Boosting Con't The aim of boosting is to convert a collection of weak learning algorithms into one strong one. Example: Take some rules used to determine which horses win, such as “the horse with the most wins will win,” or “the horse with the most experienced jockey will win,” rules that are often wrong, and from these rules figure out the winning horse. Problem: Which data do we use, and how do we combine the rules?

Boosting in practice 1. Split a training data set into multiple overlapping subsets. 2. Train a weak learner on one equally weighted set, until its accuracy is greater than 50%. 3. Train a weak learner on a new example set, now weighted to focus on the errors. 4. Repeat until we're out of examples. 5. Apply all learners to the test set to determine the final hypothesis.

The problem Previous boosting algorithms would take some learning method, run it a bunch of times, each time giving it a different distribution of examples, and finally combine all the generated hypotheses into one great big hypothesis. Problems: We need to know too much beforehand. Improvement in overall performance depends on the weakest rules.

Adaboost Adaboost, or Adaptive Boosting, combines the outputs of a bunch of weak learning algorithms into a weighted sum, which represents the boosted classifier. It introduced the idea of “adaptability” into boosting, in that subsequent weak learners are tweaked in favour of those instances previously misclassified. Adaboost is considered to be one of the best all-around out-of-the-box classifiers when decision trees are used as the weak classifiers. It is #7 on the Top 10 List of Data Mining Algorithms.

Adaboost Inputs: Sequence of N labeled examples <x1, y1>, ..., <xN, yN>. A distribution D over the N examples. Some weak learning algorithm; we will use WeakLearn. Integer T specifying the number of iterations. Initialize: Weight vector $w_i^1 = D(i)$ for i = 1, …, N

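Adaboost con’t Do: for t = 1, 2, …, T 1. Set $p^t = \dfrac{w^t}{\sum_{i=1}^{N} w_i^t}$
2. Call WeakLearn, providing it with the distribution $p^t$; get back a hypothesis $h_t: X \to [0, 1]$ 3. Calculate the error of $h_t$: $\epsilon_t = \sum_{i=1}^{N} p_i^t \, |h_t(x_i) - y_i|$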

Adaboost con’t 4. Set $\beta_t = \epsilon_t/(1-\epsilon_t)$, where $\epsilon_t$ is the error at iteration t. 5. Set the new weights vector to be $w_i^{t+1} = w_i^t \, \beta_t^{\,1-|h_t(x_i)-y_i|}$

What just happened? The weights of the examples are adjusted at each iteration: the multiplier is $\beta_t$ < 1 if the example was classified correctly, or 1 if incorrectly. Remember, we want the weight on examples we already get right to be low. The weights also get normalized, so no decrease is effectively an increase: misclassified examples gain relative weight. Every weak hypothesis gets a vote weighted by $\log(1/\beta_t)$, which grows as its error shrinks, so everything works.

Formal Procedure
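The procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the weak learner is a hypothetical one-dimensional threshold stump (`weak_learn`), the data set is made up, D is taken to be uniform, and labels are assumed to be 0 or 1. The final hypothesis outputs 1 when the votes $\sum_t \log(1/\beta_t)\, h_t(x)$ reach half of the total vote weight, as in the paper.

```python
import math


def weak_learn(xs, ys, p):
    """Hypothetical weak learner: best threshold stump under distribution p."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (0, 1):
            # sign=1 predicts 1 when x >= threshold; sign=0 predicts the opposite.
            def h(x, threshold=threshold, sign=sign):
                return sign if x >= threshold else 1 - sign
            error = sum(p_i * abs(h(x) - y) for p_i, x, y in zip(p, xs, ys))
            if best is None or error < best[0]:
                best = (error, h)
    return best[1]


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # w^1 = D, here the uniform distribution
    hypotheses, betas = [], []
    for _ in range(rounds):
        total = sum(weights)
        p = [w / total for w in weights]  # step 1: normalize
        h = weak_learn(xs, ys, p)  # step 2: call the weak learner
        # Step 3: weighted error of h under p.
        error = sum(p_i * abs(h(x) - y) for p_i, x, y in zip(p, xs, ys))
        # Guard (an addition for this sketch): keep beta finite and nonzero.
        error = min(max(error, 1e-10), 1 - 1e-10)
        beta = error / (1.0 - error)  # step 4
        # Step 5: shrink the weight of correctly classified examples.
        weights = [w * beta ** (1 - abs(h(x) - y))
                   for w, x, y in zip(weights, xs, ys)]
        hypotheses.append(h)
        betas.append(beta)

    def final_hypothesis(x):
        votes = [math.log(1.0 / b) for b in betas]
        score = sum(v * h(x) for v, h in zip(votes, hypotheses))
        return 1 if score >= 0.5 * sum(votes) else 0

    return final_hypothesis


# Made-up data: no single threshold separates the labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 0, 0, 1, 1, 0, 0]
h_final = adaboost(xs, ys, rounds=50)
training_error = sum(h_final(x) != y for x, y in zip(xs, ys)) / len(xs)
print(training_error)
```

No single stump fits this data, but a weighted vote of three stumps does, which is the kind of combination the booster is meant to find.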

The Beta If the error is 0.5, then β = 1 and the vote log(1/β) is 0: we get no information since we’re just guessing, and the timestep t isn’t used. For an error < 0.5, β < 1: correctly classified examples are down-weighted by a factor that shrinks with the error, and the vote grows as the error shrinks. In the final model generation, if a particular time step had > 0.5 error, then β > 1 and it will have a “negative” vote that grows with its error.

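Error Rate Freund and Schapire, in a later paper, found that the training error of Adaboost is bounded by $\prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_{t=1}^{T} \sqrt{1-4\gamma_t^2} \le \exp\left(-2\sum_{t=1}^{T}\gamma_t^2\right)$, where $\epsilon_t = 1/2 - \gamma_t$. So if each classifier is slightly better than random, so that $\gamma_t \ge \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in T, since the above bound is at most $e^{-2\gamma^2 T}$.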
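To get a feel for how fast the bound shrinks, the sketch below evaluates the product bound $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$ and the looser exponential bound $\exp(-2\sum_t \gamma_t^2)$, with $\gamma_t = 1/2 - \epsilon_t$, for a hypothetical run in which every weak hypothesis has error 0.4. The function names and the error values are assumptions made for illustration.

```python
import math


def training_error_bound(errors):
    """Product over rounds of 2 * sqrt(eps_t * (1 - eps_t))."""
    bound = 1.0
    for eps in errors:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound


def exponential_bound(errors):
    """exp(-2 * sum of gamma_t^2) with gamma_t = 1/2 - eps_t."""
    return math.exp(-2.0 * sum((0.5 - eps) ** 2 for eps in errors))


# Hypothetical run: every weak hypothesis has error 0.4 (gamma = 0.1).
for rounds in (10, 100, 500):
    errors = [0.4] * rounds
    print(rounds, training_error_bound(errors), exponential_bound(errors))
```

Even with weak hypotheses that are only 10 percentage points better than chance, both bounds fall quickly as the number of rounds grows.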

MORE Gambling So our gambler now has a pretty good scheme to make money. He goes and downloads the entire race history from the track's database. 1. He finds that odds are a PAC indicator (probably approximately correct), and comes up with some hypothesis accordingly. 2. He finds the error of using this predictor. 3. Next he looks at the data, focusing on examples that odds could not predict, and comes up with a new heuristic. (When it's sunny out, the most experienced jockey wins.) 4. He repeats this process until no more heuristics can be determined. 5. When some threshold of heuristics indicate a win for a given horse, he bets on it.

Pros: Very fast, easy to program (Schapire said you could do it in 10 lines), no parameters except T to tune, no prior knowledge needed about the weak learner, versatile. Cons: Weak classifiers that are too complex may lead to overfitting. Weak classifiers that are too weak can lead to low margins, and can also lead to overfitting. Adaboost is also particularly vulnerable to uniform noise.

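Adaboost.M1 For labels drawn from a finite set Y with more than two classes, we modify the error calculation as follows: $\epsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} p_i^t$
With the assumption that $\epsilon_t \le 1/2$. And we arrive at the final hypothesis by $h_f(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log(1/\beta_t)$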
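A small sketch of the two AdaBoost.M1 pieces named on this slide, assuming the paper's definitions: the error is the total weight of the misclassified examples, and the final label is the one with the largest sum of votes $\log(1/\beta_t)$ over the rounds that predicted it. The function names and the toy inputs are hypothetical.

```python
import math
from collections import defaultdict


def m1_error(predictions, labels, p):
    """Total weight p_i of the examples whose predicted label is wrong."""
    return sum(p_i for p_i, pred, y in zip(p, predictions, labels) if pred != y)


def m1_final_label(round_predictions, betas):
    """Label with the largest total vote log(1 / beta_t) among rounds predicting it."""
    votes = defaultdict(float)
    for label, beta in zip(round_predictions, betas):
        votes[label] += math.log(1.0 / beta)
    return max(votes, key=votes.get)


# Hypothetical example: one round's predictions on three weighted examples,
# then three rounds voting on the winner of a single race.
print(m1_error(["A", "B", "C"], ["A", "B", "B"], [0.5, 0.3, 0.2]))
print(m1_final_label(["A", "B", "B"], betas=[0.2, 0.5, 0.6]))
```

In the second call a single accurate round (β = 0.2) outvotes two weaker rounds that agree with each other, because votes are weighted by log(1/β) rather than counted.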

Exam Questions

Q1: What are we seeking to minimize in Resource Allocation?
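We seek to minimize the net loss $L_A - \min_i L_i$, where $L_A$ is the cumulative loss of the allocator and $L_i$ is the cumulative loss of strategy i. Or more simply, we are trying to minimize the total loss of the allocator relative to the loss of the best strategy. This gives us a consistent worst-case guarantee, and a way to hedge our bets.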

Q2: What is the goal of boosting?
The goal is to combine one or more weak learners into an arbitrarily accurate strong learner. In other words, to use better-than-chance heuristics in an ensemble for high predictive accuracy.