
1 Data Mining Ensembles Last modified 1/9/19

2 Ensemble Methods Ensemble methods construct a set of classifiers from the training data Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers In Olympic ice skating you have multiple judges. Why? When would multiple judges/classifiers help the most?

3 Why does it work? Suppose there are 25 base classifiers
Each classifier has error rate ε = 0.35 Assume the classifiers are independent The ensemble (majority vote) makes a wrong prediction only if at least 13 of the 25 classifiers are wrong: P(wrong) = Σ_{i=13..25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06 (verified in the sketch below) Practice has shown that even when independence does not hold, results are good
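A quick numeric check of this claim (my own sketch, not from the deck), assuming the base classifiers are independent, so the number of wrong votes follows a binomial distribution:

```python
# Probability that a majority vote of n independent classifiers is wrong,
# given each has base error rate eps (a binomial tail probability).
from math import comb

def ensemble_error(eps, n=25):
    need = n // 2 + 1  # 13 of 25 classifiers must be wrong for a wrong majority
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(need, n + 1))

print(round(ensemble_error(0.35), 3))  # ~0.06, far below the base rate of 0.35
print(round(ensemble_error(0.60), 3))  # ~0.85, WORSE than the base rate (next slides)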

4 Why does it work? Graph of ensemble error rate versus base classifier error rate, assuming an ensemble of 25 independent classifiers

5 Do Ensembles always help?
Based on the last graph, when will an ensemble hurt, even assuming the base classifiers are perfectly independent? When the error rate of the base classifier is worse than random guessing (error rate above 0.5, i.e., accuracy below 50%, for two classes) In the prior graph, when the base classifier error rate > 0.5, the ensemble error rate is higher than the base classifier error rate

6 Other reasons why it works
Classifier performance can be impacted by: Bias: the assumptions made to help with generalization ("simpler is better" is a bias) Variance: a learning method will give different results based on small changes (e.g., in the training data) When I run experiments and use random sampling with repeated runs, I get different results each time

7 Other reasons why it works (cont.)
Ensemble methods can help with both bias and variance Averaging over multiple runs reduces variance I observe this when I use 10 runs with random sampling and see that my learning curves are much smoother Ensemble methods are especially helpful for unstable classifier algorithms Decision trees are unstable, since small changes in the training data can greatly impact the structure of the learned decision tree If you combine different classifier methods into an ensemble, then you are using methods with different biases, so the result is perhaps more robust

8 Expressiveness of Ensembles
We discussed the expressiveness of a classifier: what does its decision boundary look like? An ensemble has more expressive power Consider 3 linear classifiers where you classify positive only if all three agree: the positive region is a triangle, a decision boundary no single linear classifier can express

9 Ways to Generate Multiple Classifiers
How many ways can you generate multiple classifiers (what can you vary)? Manipulate the training data Sample the data differently each time Examples: Bagging and Boosting Manipulate the input features Sample the features differently each time Makes especially good sense if there is redundancy Example: Random Forest Manipulate the learning algorithm Vary amount of pruning, learning parameters, or simply use completely different algorithms

10 Framework when varying the data

11 Examples of Ensemble Methods
How to generate an ensemble of classifiers? Bagging Boosting These methods have been shown to be quite effective A technique ignored by the textbook is to combine classifiers built separately By simple voting By voting and factoring in the reliability of each classifier

12 Bagging Sampling with replacement
Build a classifier on each bootstrap sample Each example has probability 1 − (1 − 1/n)^n of being included in a given bootstrap sample, which is about 63% for large n (since (1 − 1/n)^n → 1/e ≈ 0.37) Some examples will be picked more than once Combine the resulting classifiers, such as by majority voting Greatly reduces the variance when compared to a single base classifier

13 Bagging Sampling with replacement
Build a classifier on each bootstrap sample Each example has probability 1 − (1 − 1/n)^n ≈ 63% of being included in a given bootstrap sample

14 Bagging Algorithm
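The deck's algorithm figure is not reproduced in the transcript; below is a minimal sketch of the procedure (my own code, assuming numpy arrays and scikit-learn's DecisionTreeClassifier as the base learner; the helper names bagging_fit and bagging_predict are mine):

```python
# Bagging sketch: train one tree per bootstrap sample, then majority-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)      # sample n rows WITH replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_test)
    preds = []
    for col in votes.T:                               # majority vote per test example
        vals, counts = np.unique(col, return_counts=True)
        preds.append(vals[counts.argmax()])
    return np.array(preds)
```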

15 Bagging Example Consider a 1-dimensional data set:
The classifier is a decision stump Decision rule: x ≤ k versus x > k, where the split point k is chosen based on entropy The True branch (x ≤ k) predicts y_left and the False branch predicts y_right
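A minimal sketch of such a stump (my own code; the helper names fit_stump and stump_predict are mine, and the split point is chosen by minimizing the children's weighted entropy, as the slide describes):

```python
# Decision stump: pick split point k minimizing the children's weighted entropy.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def majority(labels):
    vals, counts = np.unique(labels, return_counts=True)
    return vals[counts.argmax()]

def fit_stump(x, y):
    best_k, best_h = None, np.inf
    xs = np.sort(np.unique(x))
    for k in (xs[:-1] + xs[1:]) / 2:          # candidate splits: midpoints of x values
        left, right = y[x <= k], y[x > k]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if h < best_h:
            best_k, best_h = k, h
    return best_k, majority(y[x <= best_k]), majority(y[x > best_k])

def stump_predict(stump, x):
    k, y_left, y_right = stump                # True branch -> y_left, False -> y_right
    return np.where(x <= k, y_left, y_right)
```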

16 Bagging Example

17 Bagging Example

18 Bagging Example Summary of Training sets:

19 Bagging Example Assume the test set is the same as the original data
Use majority vote to determine the predicted class of the ensemble classifier

20 Boosting An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records Initially, all N records are assigned equal weights Unlike bagging, the weights may change at the end of each boosting round

21 Boosting Records that are wrongly classified will have their weights increased and are more likely to be sampled Records that are classified correctly will have their weights decreased In the slide's example, record 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds

22 AdaBoost Base classifiers: C1, C2, …, CT Error rate of classifier Ci: εi = Σj wj δ(Ci(xj) ≠ yj), the weighted fraction of training records that Ci misclassifies (the weights wj sum to 1)
Importance of a classifier: αi = ½ ln((1 − εi)/εi) Note that a classifier is important (large αi) if its error rate εi is close to 0

23 AdaBoost Algorithm Weight update: wj^(i+1) = (wj^(i)/Zi) · exp(−αi) if Ci classifies record j correctly, and wj^(i+1) = (wj^(i)/Zi) · exp(+αi) if it misclassifies it, where Zi normalizes the weights so they sum to 1
If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated Classification: C*(x) = argmax_y Σi αi δ(Ci(x) = y), i.e., the class with the largest total importance-weighted vote

24 AdaBoost Algorithm
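The algorithm figure is again omitted from the transcript; here is a compact sketch of the update rules above (my own code, assuming labels are numpy arrays in {−1, +1} and using scikit-learn depth-1 trees as the weak learners):

```python
# AdaBoost sketch: reweight records each round, combine stumps by alpha-weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=3):
    n = len(X)
    w = np.full(n, 1.0 / n)                  # all records start with weight 1/n
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()             # weighted error rate
        if eps > 0.5:                        # worse than guessing (rare for trees):
            w = np.full(n, 1.0 / n)          # revert weights to 1/n and skip the round
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # classifier importance
        w *= np.exp(-alpha * y * pred)       # shrink correct, grow wrong weights
        w /= w.sum()                         # normalize (the Z term)
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)                    # alpha-weighted majority vote
```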

25 AdaBoost Example Consider the same 1-dimensional data set:
The classifier is again a decision stump Decision rule: x ≤ k versus x > k, with the split point k chosen based on entropy; the True branch predicts y_left and the False branch predicts y_right

26 AdaBoost Example Training sets for the first 3 boosting rounds, with a summary of each round:

27 AdaBoost Example Final predicted class for each test record, from the importance-weighted vote of the three rounds

28 Random Forest An ensemble of decision trees
Bootstrap sample of the training data, so there are different training examples for each classifier (like bagging) When splitting an internal node, the best split is chosen from a random selection of features, so the features that are used differ from tree to tree Trees are grown until the leaves are pure, with no pruning The final decision is made via majority voting of the trees in the forest

29 Random Forest Imagine that a few features are strongly predictive of the class but many are only weakly predictive Even if we use different training data, the trees are likely to still use the same strong features Because a random forest does not consider all features at each node, some weak features will be used, which leads to very different decision trees Random forests reduce the variance of the trees and are robust to noise Random Forest is also computationally fast

30 Random Forest The number of features considered at each node is a hyperparameter Suggestions include using the square root or the log of the number of features It can also be tuned using a validation set, as in the sketch below Do not worry about the out-of-bag error estimate
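A hedged example of tuning this hyperparameter (my own sketch using scikit-learn's RandomForestClassifier on synthetic data; the dataset and parameter values are illustrative, not from the deck):

```python
# Compare the suggested max_features settings on a held-out validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for max_features in ("sqrt", "log2", None):   # sqrt, log, and all features
    rf = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                random_state=0).fit(X_tr, y_tr)
    print(max_features, rf.score(X_val, y_val))  # validation accuracy
```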

31 Using Ensembles for Course Project
If your course project focuses on classification, you can improve your project by trying different classifiers and different ensemble methods

32 Ensembles in Action! The Netflix Prize: a $1 million prize for a 10% improvement over Netflix's CINEMATCH algorithm

33 Netflix Prize Video https://www.youtube.com/watch?v=ImpV70uLxyw
The contest is also summarized in the following slides

34 Netflix Netflix is a subscription-based movie and television rental service that offers media to subscribers: Physically by mail (common at the time of the contest) Over the internet It has a large catalog of movies and TV shows Subscriber base of over 10 million at the time of the contest; as of January 2018 it was 118 million

35 Recommendations Netflix offers users the ability to rate movies and TV shows that they have seen Based on those ratings, Netflix provides recommendations These recommendations are generated by an algorithm called Cinematch

36 Cinematch Uses straightforward statistical linear models with a lot of data conditioning This means that the more a subscriber rates, the more accurate the recommendations will become

37 Netflix Prize A competition for the best collaborative filtering algorithm to predict user ratings for movies and television shows, based on previous ratings Netflix offered a $1 million prize to the team that could improve Cinematch's accuracy by 10% A $50,000 progress prize was awarded to the team that made the most progress each year before the 10% mark was reached The contest started on October 2, 2006 and would run until at least October 2, 2011, depending on when a winner emerged

38 Metrics The accuracy of the algorithms was measured using root mean square error, or RMSE Chosen because it is a well-known, single-value metric that amplifies the contribution of large prediction errors, as in the sketch below
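For concreteness, a short computation of RMSE (the ratings here are made up for illustration, not contest data):

```python
# RMSE: square the errors, average, take the root; large errors dominate.
import numpy as np

actual    = np.array([4.0, 3.0, 5.0, 2.0])   # hypothetical true ratings
predicted = np.array([3.5, 3.0, 4.0, 2.5])   # hypothetical predicted ratings
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # sqrt(1.5 / 4) ~ 0.612
```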

39 Metrics Cinematch scored 0.9525 on the test subset
The winning team needed to score at least 10% lower: 0.9525 × 0.90 ≈ 0.8572, i.e., an RMSE of 0.8572 or less

40 Results The contest ended on June 26, 2009
The threshold was broken by the teams “BellKor's Pragmatic Chaos” and “The Ensemble”, both achieving a 10.06% improvement over Cinematch with an RMSE of 0.8567 “BellKor's Pragmatic Chaos” won the prize because the team submitted its results 20 minutes before “The Ensemble”

41 Netflix Prize Sequel Due to the success of their contest, Netflix announced another contest to further improve their recommender system Unfortunately, it was discovered that the anonymized customer data that they provided to the contestants could actually be used to identify individual customers This, combined with a resulting investigation by the FTC and a lawsuit, led Netflix to cancel their sequel


