1
Supervised Learning Fall 2004

2
**Introduction**

- Key idea
  - Known target concept (predict a certain attribute)
  - Find out how the other attributes can be used
- Algorithms
  - Rudimentary rules (e.g., 1R)
  - Statistical modeling (e.g., Naïve Bayes)
  - Divide and conquer: decision trees
  - Instance-based learning
  - Neural networks
  - Support vector machines

3
**1-Rule**

- Generate a one-level decision tree that tests one attribute
- Performs quite well!
- Basic idea: rules testing a single attribute
  - Classify according to frequency in the training data
  - Evaluate the error rate for each attribute
  - Choose the best attribute
- That's all, folks!

4
**The Weather Data (again)**


5
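**Apply 1R**

| | Attribute | Rules | Errors | Total errors |
|---|---|---|---|---|
| 1 | outlook | sunny → no | 2/5 | 4/14 |
| | | overcast → yes | 0/4 | |
| | | rainy → yes | 2/5 | |
| 2 | temperature | hot → no | 2/4 | 5/14 |
| | | mild → yes | 2/6 | |
| | | cool → yes | 1/4 | |
| 3 | humidity | high → no | 3/7 | 4/14 |
| | | normal → yes | 1/7 | |
| 4 | windy | false → yes | 2/8 | 5/14 |
| | | true → no | 3/6 | |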
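The 1R procedure can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the `ROWS` table is assumed to be the standard 14-instance nominal weather data set, and the names `one_r`, `ATTRS`, and `ROWS` are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed data: the standard nominal weather data set
# (outlook, temperature, humidity, windy, play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def one_r(rows, attrs):
    """Return (attribute, rules, errors) for the attribute with fewest errors."""
    best = None
    for i, attr in enumerate(attrs):
        counts = defaultdict(Counter)  # attribute value -> class frequencies
        for row in rows:
            counts[row[i]][row[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Ties between attributes keep the first one encountered.
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(ROWS, ATTRS)
print(attr, rules, f"{errors}/{len(ROWS)}")
```

On this data, outlook and humidity both make 4 errors out of 14; the sketch keeps the first of the tied attributes.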

6
**Other Features**

- Numeric values
  - Discretization:
    - Sort the training data
    - Split the range into categories
- Missing values
  - “Dummy” attribute

7
**Naïve Bayes Classifier**

- Allow all attributes to contribute equally
- Assumes
  - All attributes are equally important
  - All attributes are independent
- Realistic?
- Selection of attributes

8
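**Bayes Theorem**

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

- $H$: hypothesis; $E$: evidence
- $P(H \mid E)$: posterior probability, the conditional probability of $H$ given $E$
- $P(H)$: prior probability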

9
**Maximum a Posteriori (MAP)**

Maximum Likelihood (ML)

10
**Classification**

- Want to classify a new instance $(a_1, a_2, \ldots, a_n)$ into one of a finite number of categories from the set $V$.
- Bayesian approach: assign the most probable category $v_{MAP}$ given $(a_1, a_2, \ldots, a_n)$.
- Can we estimate the probabilities from the training data?

11
**Naïve Bayes Classifier**

- The second probability is easy to estimate. How?
- The first probability is difficult to estimate. Why?
- Assume independence (this is the naïve bit):
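The equations for this slide did not survive extraction. A standard way to write the argument, using the notation $v \in V$ and $(a_1, \ldots, a_n)$ from the classification slide (the exact form on the original slide is an assumption), is:

$$v_{MAP} = \arg\max_{v \in V} P(v \mid a_1, \ldots, a_n) = \arg\max_{v \in V} P(a_1, \ldots, a_n \mid v)\,P(v)$$

Here $P(a_1, \ldots, a_n \mid v)$ is the first probability and $P(v)$ is the second. Under the independence assumption the first factorizes, which gives

$$v_{NB} = \arg\max_{v \in V} P(v) \prod_{i=1}^{n} P(a_i \mid v)$$

Each factor $P(a_i \mid v)$ can be estimated from relative frequencies in the training data.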

12
**The Weather Data (yet again)**


13
**Estimation**

Given a new instance with outlook=sunny, temperature=high, humidity=high, windy=true

14
**Calculations continued …**

Similarly Thus

15
**Normalization**

Note that we can normalize to get the probabilities:
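The estimation and normalization steps can be sketched as follows. This is an illustration under stated assumptions: `ROWS` is assumed to be the standard nominal weather data, and because temperature in that data takes the values hot, mild, and cool, the sketch uses the hypothetical instance outlook=sunny, temperature=cool, humidity=high, windy=true. The function name `naive_bayes_scores` is also hypothetical.

```python
from collections import Counter

# Assumed data: the standard nominal weather data set
# (outlook, temperature, humidity, windy, play).
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def naive_bayes_scores(rows, instance):
    """Return the unnormalized score P(v) * prod_i P(a_i | v) for each class v."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for v, n_v in class_counts.items():
        score = n_v / len(rows)  # prior P(v)
        for i, a in enumerate(instance):
            n_av = sum(1 for r in rows if r[-1] == v and r[i] == a)
            score *= n_av / n_v  # relative-frequency estimate of P(a_i | v)
        scores[v] = score
    return scores

instance = ("sunny", "cool", "high", "true")  # hypothetical query instance
scores = naive_bayes_scores(ROWS, instance)
total = sum(scores.values())
probs = {v: s / total for v, s in scores.items()}  # normalize to sum to 1
print(scores, probs)
```

Dividing each score by their sum turns the two products into probabilities; for this instance the class "no" receives roughly 80% of the mass.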

16
**Problems …**

Suppose we had the following training data:

Now what?

17
**Laplace Estimator**

Replace estimates with
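The replacement formula on this slide was lost in extraction. One common form of the Laplace estimator adds a constant to every count; the sketch below assumes that form, and the function name `laplace_estimate` and the parameter `mu` are hypothetical.

```python
def laplace_estimate(count, total, num_values, mu=1.0):
    """Smoothed estimate (count + mu) / (total + mu * num_values).

    count:      times the attribute value occurs with the class
    total:      number of training instances of the class
    num_values: number of distinct values the attribute can take
    """
    return (count + mu) / (total + mu * num_values)

# Example: a value that never occurs with a class (0 of 5 instances,
# attribute with 3 possible values) no longer gets probability zero.
p = laplace_estimate(0, 5, 3)
print(p)
```

Because no estimate is exactly zero, a single unseen attribute value can no longer wipe out the whole product of probabilities.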

18
**Numeric Values**

- Assume a probability distribution for the numeric attributes, with density f(x)
  - normal
  - fit a distribution (better)
- Similarly as before

19
**Discussion**

- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
- Solutions:
  - Preselect which attributes to use
  - Non-naïve Bayesian methods: networks

20
**Decision Tree Learning**

- Basic algorithm:
  - Select an attribute to be tested
  - If classification is achieved, return the classification
  - Otherwise, branch by setting the attribute to each of its possible values
  - Repeat with each branch as your new tree
- Main issue: how to select attributes

21
**Deciding on Branching**

- What do we want to accomplish?
  - Make good predictions
  - Obtain simple-to-interpret rules
- No diversity (impurity) is best
  - No impurity: all instances in the same class
  - Maximum impurity: all classes equally likely
- Goal: select attributes to reduce impurity

22
**Measuring Impurity/Diversity**

Let's say we only have two classes:

- Minimum
- Gini index/Simpson diversity index
- Entropy

23
**Impurity Functions**

[Figure: the entropy, Gini index, and minimum impurity functions]
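The three impurity functions can be compared numerically in the two-class case. The formulas on the slides were lost in extraction, so the definitions below are the usual textbook ones and should be read as assumptions; `p` is the proportion of instances in the first class, and the function names are hypothetical.

```python
import math

def entropy2(p):
    """Two-class entropy in bits, using the convention 0 * log 0 = 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gini2(p):
    """Two-class Gini index: 1 - p^2 - (1 - p)^2."""
    return 1 - p ** 2 - (1 - p) ** 2

def min_impurity(p):
    """Minimum (misclassification) impurity: min(p, 1 - p)."""
    return min(p, 1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(p, round(entropy2(p), 3), round(gini2(p), 3), min_impurity(p))
```

All three are zero when every instance belongs to one class and largest when the two classes are equally likely.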

24
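**Entropy**

$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

- $S$: training data; $c$: number of classes; $p_i$: proportion of instances in $S$ classified as $i$
- Entropy is a measure of impurity in the training data $S$
- Measured in bits of information needed to encode the classification of a member of $S$
- Extreme cases
  - All members have the same classification: entropy is 0 (note: $0 \cdot \log 0 = 0$)
  - All classifications equally frequent: entropy is at its maximum, $\log_2 c$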

25
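**Expected Information Gain**

$$\text{Gain}(S, a) = \text{Entropy}(S) - \sum_{v \in \text{Values}(a)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

- $\text{Values}(a)$: all possible values for attribute $a$; $S_v$: the instances in $S$ for which $a = v$
- Gain(S,a) is the expected information provided about the classification from knowing the value of attribute a (reduction in the number of bits needed)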
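A short sketch of the gain computation, under the assumption that `ROWS` is the standard nominal weather data set (the table itself is not preserved in this transcript). The helper names `entropy` and `gain` are hypothetical.

```python
import math
from collections import Counter

# Assumed data: the standard nominal weather data set
# (outlook, temperature, humidity, windy, play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    """Expected information gain from splitting rows on attribute index i."""
    remainder = 0.0
    for v in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {attr: round(gain(ROWS, i), 3) for i, attr in enumerate(ATTRS)}
print(gains)
```

On this data outlook has the largest gain, so it would be selected for the root node.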

26
**The Weather Data (yet again)**


27
**Decision Tree: Root Node**

[Figure: root node Outlook with branches Sunny, Overcast, and Rainy, each showing the Yes/No instances that fall into it]

28
**Calculating the Entropy**


29
**Calculating the Gain**

Select!

30
**Next Level**

[Figure: tree with root Outlook (branches Sunny, Overcast, Rainy) and a candidate Temperature test at the next level, with Yes/No instances at the leaves]

31
**Calculating the Entropy**


32
**Calculating the Gain**

Select

33
**Final Tree**

- Outlook
  - Sunny: Humidity
    - High: No
    - Normal: Yes
  - Overcast: Yes
  - Rainy: Windy
    - True: No
    - False: Yes
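The whole construction can be sketched as a small recursive procedure in the style of ID3. This is an illustration only: `ROWS` is assumed to be the standard nominal weather data set, the function names are hypothetical, and a tree is represented as either a class label or a pair `(attribute, {value: subtree})`.

```python
import math
from collections import Counter

# Assumed data: the standard nominal weather data set
# (outlook, temperature, humidity, windy, play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    remainder = 0.0
    for v in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

def id3(rows, attr_idx):
    """Grow a tree by repeatedly splitting on the attribute with highest gain."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # pure node: classification achieved
        return labels[0]
    if not attr_idx:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_idx, key=lambda i: gain(rows, i))
    rest = [i for i in attr_idx if i != best]
    branches = {v: id3([r for r in rows if r[best] == v], rest)
                for v in sorted(set(r[best] for r in rows))}
    return (ATTRS[best], branches)

tree = id3(ROWS, [0, 1, 2, 3])
print(tree)
```

No pruning is done here, so the tree is grown until every leaf is pure or the attributes run out.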

34
**What's in a Tree?**

- Our final decision tree correctly classifies every instance
- Is this good?
- Two important concepts:
  - Overfitting
  - Pruning

35
**Overfitting**

- Two sources of abnormalities
  - Noise (randomness)
  - Outliers (measurement errors)
- Chasing every abnormality causes overfitting
  - Tree too large and complex
  - Does not generalize to new data
- Solution: prune the tree

36
**Pruning**

- Prepruning
  - Halt construction of the decision tree early
  - Use the same measure as in determining attributes, e.g., halt if InfoGain < K
  - Most frequent class becomes the leaf node
- Postpruning
  - Construct the complete decision tree
  - Prune it back
  - Prune to minimize expected error rates
  - Prune to minimize bits of encoding (Minimum Description Length principle)

37
**Scalability**

- Need to design for large amounts of data
- Two things to worry about
  - Large number of attributes
    - Leads to a large tree (prepruning?)
    - Takes a long time
  - Large amounts of data
    - Can the data be kept in memory?
    - Some new algorithms do not require all the data to be memory resident

38
**Discussion: Decision Trees**

- The most popular methods
  - Quite effective
  - Relatively simple
- Have discussed in detail the ID3 algorithm:
  - Information gain to select attributes
  - No pruning
  - Only handles nominal attributes

39
**Selecting Split Attributes**

- Other univariate splits
  - Gain ratio: C4.5 algorithm (J48 in Weka)
  - CART (not in Weka)
- Multivariate splits
  - May be possible to obtain better splits by considering two or more attributes simultaneously

40
**Instance-Based Learning**

- Classification
  - Do not construct an explicit description of how to classify
  - Store all training data (learning)
  - New example: find the most similar instance
  - Computing is done at the time of classification
- k-nearest neighbor

41
**K-Nearest Neighbor**

- Each instance lives in n-dimensional space
- Distance between instances
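A minimal k-nearest-neighbor sketch follows. The distance formula on the slide was lost in extraction, so Euclidean distance is assumed here; the training points, the query, and the function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k):
    """Majority vote among the k training instances closest to the query."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (point, class label)
train = [
    ((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
    ((3.0, 3.0), "-"), ((3.2, 2.9), "-"), ((2.9, 3.3), "-"),
]
query = (1.1, 1.0)
print(knn_classify(train, query, 1), knn_classify(train, query, 5))
```

With k = 1 or k = 3 the query is labeled "+", but with k = 5 the three more distant "-" instances outvote the two nearby "+" instances, which is why the choice of k matters.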

42
**Example: nearest neighbor**

[Figure: positive (+) and negative (-) training instances scattered around a query point xq. 1-nearest neighbor? 6-nearest neighbor?]

43
**Normalizing**

- Some attributes may take large values and others small
- Normalize
  - All attributes on equal footing
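The normalization formula on the slide did not survive extraction. One common choice is min-max scaling of each attribute to [0, 1]; the sketch below assumes that choice, and the function name and data are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((x - l) / (h - l) if h > l else 0.0  # constant column -> 0.0
              for x, l, h in zip(row, lo, hi))
        for row in rows
    ]

# Hypothetical instances: (income, age). Without scaling, income would
# dominate any distance computation.
rows = [(20000.0, 25.0), (60000.0, 45.0), (100000.0, 65.0)]
scaled = min_max_normalize(rows)
print(scaled)
```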

44
**Other Methods for Supervised Learning**

- Neural networks
- Support vector machines
- Optimization
- Rough set approach
- Fuzzy set approach

45
**Evaluating the Learning**

- Measure of performance
  - Classification: error rate
- Resubstitution error
  - Performance on the training set
  - Poor predictor of future performance
  - Overfitting
  - Useless for evaluation

46
**Test Set**

- Need a set of test instances
  - Independent of the training set instances
  - Representative of the underlying structure
- Sometimes: validation data
  - Fine-tune parameters
  - Independent of training and test data
- Plentiful data: no problem!

47
**Holdout Procedures**

- Common case: data set large but limited
- Usual procedure:
  - Reserve some data for testing
  - Use the remaining data for training
- Problems:
  - Want both sets as large as possible
  - Want both sets to be representative

48
**"Smart" Holdout**

- Simple check: are the proportions of classes about the same in each data set?
- Stratified holdout
  - Guarantee that classes are (approximately) proportionally represented
- Repeated holdout
  - Randomly select the holdout set several times and average the error rate estimates

49
**Holdout w/ Cross-Validation**

- Fixed number of partitions of the data (folds)
- In turn: each partition is used for testing and the remaining instances for training
- May use stratification and randomization
- Standard practice: stratified tenfold cross-validation
  - Instances divided randomly into the ten partitions

50
**Cross Validation**

- Fold 1: train on 90% of the data → model → test on 10% of the data → error rate e1
- Fold 2: train on 90% of the data → model → test on 10% of the data → error rate e2

51
**Cross-Validation**

- Final estimate of error
- Quality of estimate
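The formulas for the final estimate were lost in extraction. A sketch of stratified k-fold cross-validation is given below, assuming the final estimate is the average of the per-fold error rates and using their standard deviation as a rough indication of quality. The data, the trivial majority-class learner, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev

def stratified_folds(labels, k, seed=0):
    """Split instance indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for idx in indices:  # deal each class round-robin across the folds
            folds[pos % k].append(idx)
            pos += 1
    return folds

def cross_val_errors(labels, k, train_and_predict):
    """Return the error rate of each fold when it is used as the test set."""
    errors = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        preds = train_and_predict(train_idx, test_idx)
        wrong = sum(p != labels[i] for p, i in zip(preds, test_idx))
        errors.append(wrong / len(test_idx))
    return errors

# Hypothetical data and a stand-in learner that predicts the majority class.
labels = ["yes"] * 60 + ["no"] * 40

def majority_learner(train_idx, test_idx):
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return [majority] * len(test_idx)

fold_errors = cross_val_errors(labels, 10, majority_learner)
print(mean(fold_errors), stdev(fold_errors))
```

Any learner that maps training indices and test indices to predictions can be plugged in for `majority_learner`.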

52
**Leave-One-Out Holdout**

- n-fold cross-validation (on a set of n instances)
  - Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample

53
**Bootstrap**

- Sample with replacement n times
  - Use as training data
- Use the instances not in the training data for testing
- How many test instances are there?

54
**0.632 Bootstrap**

- On average, $e^{-1} n \approx 0.368\,n$ instances will be in the test set
- Thus, on average, 63.2% of the instances are in the training set
- Estimate the error rate as $e = 0.632 \cdot e_{test} + 0.368 \cdot e_{train}$
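A quick simulation illustrates where the 0.632 and 0.368 figures come from. The function names are hypothetical, and the weights in `error_632` are the usual ones for the 0.632 bootstrap (0.632 on the test error, 0.368 on the training error).

```python
import random

def bootstrap_split(n, seed=0):
    """Sample n indices with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def error_632(e_test, e_train):
    """Combine test and training error with the usual 0.632 bootstrap weights."""
    return 0.632 * e_test + 0.368 * e_train

n = 10000
train, test = bootstrap_split(n)
frac_test = len(test) / n           # empirically close to 0.368
never_picked = (1 - 1 / n) ** n     # probability an instance is never drawn
print(frac_test, never_picked, error_632(0.5, 0.0))
```

Each instance is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches e^-1 ≈ 0.368 as n grows.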

55
**Accuracy of our Estimate?**

- Suppose we observe $s$ successes in a testing set of $n_{test}$ instances ...
- We then estimate the success rate $R_{success} = s / n_{test}$.
- Each instance is either a success or a failure (Bernoulli trial with success probability $p$)
  - Mean $p$
  - Variance $p(1-p)$

56
**Properties of Estimate**

- We have
  - $E[R_{success}] = p$
  - $\text{Var}[R_{success}] = p(1-p)/n_{test}$
- If $n_{test}$ is large enough, the Central Limit Theorem (CLT) states that, approximately, $R_{success} \sim \text{Normal}(p,\ p(1-p)/n_{test})$

57
**Confidence Interval**

- CI for normal
- CI for p
- Level: look up in table
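The interval formulas on this slide were lost in extraction. The sketch below uses the simplest normal-approximation interval, plugging the observed success rate into the variance p(1-p)/n_test; the slide's own interval may have been derived differently. The function name and the example counts are hypothetical, and the z value is computed with the standard library instead of a table.

```python
import math
from statistics import NormalDist

def success_rate_ci(s, n_test, level=0.95):
    """Normal-approximation confidence interval for the success probability p."""
    p_hat = s / n_test
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for 95%
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_test)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical result: 750 successes on 1000 test instances.
lo, hi = success_rate_ci(750, 1000)
print(round(lo, 4), round(hi, 4))
```

The interval narrows with the square root of the number of test instances, so four times as much test data halves its width.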

58
**Comparing Algorithms**

- We know how to evaluate the results of our data mining algorithms (classification)
- How should we compare different algorithms?
  - Evaluate each algorithm
  - Rank
  - Select the best one
- We don't know whether this ranking is reliable

59
**Assessing Other Learning**

- Developed procedures for classification
- Association rules
  - Evaluated based on accuracy
  - Same methods as for classification
- Numerical prediction
  - Error rate no longer applies
  - Same principles:
    - use an independent test set and hold-out procedures
    - cross-validation or bootstrap

60
**Measures of Effectiveness**

- Need to compare:
  - Predicted values $p_1, p_2, \ldots, p_n$
  - Actual values $a_1, a_2, \ldots, a_n$
- Most common measure: mean-squared error, $\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2$

61
**Other Measures**

- Mean absolute error
- Relative squared error
- Relative absolute error
- Correlation
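The formulas for these measures were lost in extraction. The sketch below uses common textbook definitions, which should be read as assumptions: the relative measures compare the model against always predicting the mean of the actual values, and correlation is the Pearson coefficient. The example values are hypothetical.

```python
import math

def mse(p, a):
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def mae(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def relative_squared_error(p, a):
    a_bar = sum(a) / len(a)
    return (sum((pi - ai) ** 2 for pi, ai in zip(p, a))
            / sum((ai - a_bar) ** 2 for ai in a))

def relative_absolute_error(p, a):
    a_bar = sum(a) / len(a)
    return (sum(abs(pi - ai) for pi, ai in zip(p, a))
            / sum(abs(ai - a_bar) for ai in a))

def correlation(p, a):
    p_bar, a_bar = sum(p) / len(p), sum(a) / len(a)
    cov = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a))
    var_p = sum((pi - p_bar) ** 2 for pi in p)
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    return cov / math.sqrt(var_p * var_a)

predicted = [2.0, 4.0, 6.0, 8.0]   # hypothetical predictions
actual = [1.0, 5.0, 5.0, 9.0]      # hypothetical actual values
print(mse(predicted, actual), mae(predicted, actual),
      relative_squared_error(predicted, actual),
      relative_absolute_error(predicted, actual),
      round(correlation(predicted, actual), 4))
```

A relative error below 1 means the model does better than the mean predictor on that measure.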

62
**What to Do?**

- “Large” amounts of data
  - Hold out 1/3 of the data for testing
  - Train a model on 2/3 of the data
  - Estimate the error (or success) rate and calculate a CI
- “Moderate” amounts of data
  - Estimate the error rate: use 10-fold cross-validation with stratification, or use the bootstrap
  - Train the model on the entire data set

63
**Predicting Probabilities**

- Classification into k classes
- Predict probabilities $p_1, p_2, \ldots, p_k$ for each class
- Actual values $a_1, a_2, \ldots, a_k$, with $a_j = 1$ for the correct class and 0 otherwise
- No longer 0-1 error
- Quadratic loss function: $\sum_{j=1}^{k} (p_j - a_j)^2$

64
**Information Loss Function**

- Instead of the quadratic function, use $-\log_2 p_j$, where the j-th prediction is correct
- This is the information required to communicate which class is correct, in bits, with respect to the probability distribution
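The two loss functions can be compared on a single prediction. This sketch assumes the usual definitions (squared differences against a 0/1 indicator vector for the quadratic loss, and the negative base-2 logarithm of the probability given to the correct class for the information loss); the function names and the example distribution are hypothetical.

```python
import math

def quadratic_loss(probs, correct):
    """Sum over classes of (p_j - a_j)^2, with a_j = 1 only for the correct class."""
    return sum((p - (1.0 if j == correct else 0.0)) ** 2
               for j, p in enumerate(probs))

def information_loss(probs, correct):
    """Bits needed to communicate the correct class under the predicted distribution."""
    return -math.log2(probs[correct])

probs = [0.5, 0.25, 0.25]   # hypothetical predicted class probabilities
print(quadratic_loss(probs, 0), information_loss(probs, 0))
print(quadratic_loss(probs, 1), information_loss(probs, 1))
```

The information loss depends only on the probability assigned to the correct class and grows without bound as that probability approaches zero, whereas the quadratic loss is bounded.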

65
**Occam's Razor**

- Given a choice of theories that are equally good, the simplest theory should be chosen
- Physical sciences: any theory should be consistent with all empirical observations
- Data mining:
  - theory = predictive model
  - good theory = good prediction
- What is good? Do we minimize the error rate?

66
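**Minimum Description Length**

- MDL principle: minimize the size of the theory plus the information needed to specify the exceptions
- Suppose a training set E is mined, resulting in a theory T
- Want to minimize the number of bits needed to describe T plus the number of bits needed to describe the exceptions in E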

67
**Most Likely Theory**

- Suppose we want to maximize P[T|E]
- Bayes' rule
- Take logarithms
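The equations on this slide were lost in extraction. Written out in the slide's notation, and assuming the standard form of the argument, the two steps are:

$$P[T \mid E] = \frac{P[E \mid T]\,P[T]}{P[E]}$$

$$\log P[T \mid E] = \log P[E \mid T] + \log P[T] - \log P[E]$$

Since $P[E]$ does not depend on the theory $T$, it can be ignored when comparing theories, and negating the remaining terms turns the maximization into a minimization.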

68
**Information Function**

- Maximizing $P[T \mid E]$ is equivalent to minimizing $-\log P[E \mid T] - \log P[T]$
  - $-\log P[E \mid T]$: number of bits it takes to transmit the exceptions
  - $-\log P[T]$: number of bits it takes to transmit the theory
- That is, the MDL principle!

69
**Applications to Learning**

- Classification, association, numeric prediction
  - Several predictive models with 'similar' error rate (usually as small as possible)
  - Select between them using Occam's razor
  - Simplicity is subjective
  - Use the MDL principle
- Clustering
  - Important learning that is difficult to evaluate
  - Can use the MDL principle

70
**Comparing Mining Algorithms**

- We know how to evaluate the results
- Suppose we have two algorithms
  - Obtain two different models
  - Estimate the error rates e(1) and e(2)
  - Compare the estimates
  - Select the better one
- Problem?

71
**Weather Data Example**

- Suppose we learn the rule: if outlook=rainy then play=yes, otherwise play=no
- Test it on the following test set:
- Have zero error rate

72
**Different Test Set 2**

- Again, suppose we learn the rule: if outlook=rainy then play=yes, otherwise play=no
- Test it on a different test set:
- Have 100% error rate!

73
**Comparing Random Estimates**

- The estimated error rate is just an estimate (random)
- Need variance as well as point estimates
- Construct a t-test statistic
  - Average of differences in error rates
  - Estimated standard deviation
  - H0: difference = 0
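The test statistic itself was lost in extraction. A sketch of a paired t-statistic is shown below, assuming both algorithms are evaluated on the same folds so that their error rates can be differenced fold by fold. The function name and the error rates are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """t = mean(d) / (sd(d) / sqrt(k)) for per-fold differences d under H0: mean = 0."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    d_bar = mean(diffs)                # average of differences in error rates
    s_d = stdev(diffs)                 # estimated standard deviation
    return d_bar / (s_d / math.sqrt(len(diffs)))

# Hypothetical per-fold error rates for two algorithms on the same 5 folds.
errors_a = [0.20, 0.22, 0.18, 0.25, 0.21]
errors_b = [0.18, 0.21, 0.15, 0.24, 0.17]
t = paired_t_statistic(errors_a, errors_b)
print(round(t, 3))
```

The statistic is compared with a Student t distribution with k - 1 degrees of freedom; for 4 degrees of freedom the two-sided 5% critical value is about 2.776, so a value near 3.77 would lead to rejecting H0.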

74
**Discussion**

- We now know how to compare two learning algorithms and select the one with the better error rate
- We also know to select the simplest model that has a 'comparable' error rate
- Is it really better?
- Minimising the error rate can be misleading

75
**Examples of 'Good Models'**

- Application: loan approval
  - Model: no applicants default on loans
  - Evaluation: simple, low error rate
- Application: cancer diagnosis
  - Model: all tumors are benign
- Application: information assurance
  - Model: all visitors to the network are well intentioned

76
**What's Going On?**

- Many (most) data mining applications can be thought of as detecting exceptions
- Ignoring the exceptions does not significantly increase the error rate!
- Ignoring the exceptions often leads to a simple model!
- Thus, we can find a model that we evaluate as good but that completely misses the point
- Need to account for the cost of error types

77
**Accounting for Cost of Errors**

- Explicit modeling of the cost of each error
  - Costs may not be known
  - Often not practical
- Look at trade-offs
  - Visual inspection
  - Semi-automated learning
- Cost-sensitive learning
  - Assign costs to classes a priori

78
**Explicit Modeling of Cost**

Confusion Matrix (Displayed in Weka)
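A confusion matrix can be combined with a cost matrix to score a model by total cost instead of error rate. The numbers below are hypothetical and follow the loan-approval example: rows are actual classes, columns are predicted classes, in the order (default, no default), and a missed default is assumed to cost ten times as much as a wrongly rejected applicant.

```python
def error_rate(confusion):
    """Fraction of instances off the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

def total_cost(confusion, cost):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(confusion))
               for j in range(len(confusion[i])))

cost = [[0, 10],   # actual default:    predicted default, predicted no default
        [1, 0]]    # actual no default: predicted default, predicted no default

# Model A: "no applicants default" (never predicts default).
model_a = [[0, 50], [0, 950]]
# Model B: catches most defaults but rejects some good applicants.
model_b = [[40, 10], [100, 850]]

for name, m in (("A", model_a), ("B", model_b)):
    print(name, error_rate(m), total_cost(m, cost))
```

Model A has the lower error rate but the higher total cost, which is the situation the preceding slides warn about.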

79
**Cost Sensitive Learning**

- Have used cost information to evaluate learning
- Better: use cost information to learn
- Simple idea:
  - Increase the number of instances that demonstrate important behavior (e.g., classified as exceptions)
  - Applies to any learning algorithm

80
**Discussion**

- Evaluate learning
  - Estimate error rate
  - Minimum description length principle/Occam's Razor
- Comparison of algorithms
  - Based on evaluation
  - Make sure the difference is significant
- Cost of making errors may differ
  - Use evaluation procedures with caution
  - Incorporate into learning

81
**Engineering the Output**

- Prediction based on one model
  - Model performs well on one training set, but poorly on others
  - New data becomes available → new model
- Combine models
  - Bagging
  - Boosting
  - Stacking
- Combining models improves prediction but complicates the structure

82
**Bagging**

- Bias: error despite all the data in the world!
- Variance: error due to limited data
- Intuitive idea of bagging:
  - Assume we have several data sets
  - Apply the learning algorithm to each set
  - Vote on the prediction (classification/numeric)
- What type of error does this reduce?
- When is this beneficial?

83
**Bootstrap Aggregating**

- In practice: only one training data set
  - Create many sets from one
  - Sample with replacement (remember the bootstrap)
- Does this work?
  - Often gives improvements in predictive performance
  - Never degrades performance

84
**Boosting**

- Assume a stable learning procedure
  - Low variance
  - Bagging does very little
- Combine structurally different models
- Intuitive motivation:
  - Any given model may be good for a subset of the training data
  - Encourage models to explain part of the data

85
**AdaBoost.M1**

Generate models:

- Assign equal weight to each training instance
- Iterate:
  - Apply the learning algorithm and store the model
  - e ← error
  - If e = 0 or e > 0.5, terminate
  - For every instance: if classified correctly, multiply its weight by e/(1-e)
  - Normalize the weights
- Until STOP

86
**AdaBoost.M1**

Classification:

- Assign zero weight to each class
- For every model: add $-\log\frac{e}{1-e}$ to the weight of the class predicted by the model
- Return the class with the highest weight
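The two procedures can be sketched together as follows. This is an illustration, not the course's code: the weak learner is assumed to be a weighted decision stump on one numeric attribute, classes are coded as +1 and -1, the toy data are hypothetical, and the vote added for each model is taken to be -log(e/(1-e)), which matches the weight update e/(1-e) used during model generation.

```python
import math

def train_stump(X, y, w):
    """Best weighted stump on 1-D data: predict `left` if x <= t, else `right`."""
    best = None
    for t in sorted(set(X)):
        for left, right in ((1, -1), (-1, 1)):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (left if xi <= t else right) != yi)
            if best is None or err < best[0]:
                best = (err, t, left, right)
    return best

def adaboost_m1(X, y, rounds):
    """Generate up to `rounds` models with instance reweighting."""
    n = len(X)
    w = [1.0 / n] * n                      # equal weight for every instance
    models = []
    for _ in range(rounds):
        e, t, left, right = train_stump(X, y, w)
        if e == 0 or e > 0.5:              # termination test from the slide
            break
        models.append((e, t, left, right))
        for i in range(n):
            if (left if X[i] <= t else right) == y[i]:
                w[i] *= e / (1 - e)        # shrink correctly classified instances
        total = sum(w)
        w = [wi / total for wi in w]       # normalize the weights
    return models

def classify(models, x):
    """Weighted vote: each model adds -log(e / (1 - e)) to its predicted class."""
    votes = {}
    for e, t, left, right in models:
        pred = left if x <= t else right
        votes[pred] = votes.get(pred, 0.0) - math.log(e / (1 - e))
    return max(votes, key=votes.get)

# Hypothetical 1-D data that no single stump can classify perfectly.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
models = adaboost_m1(X, y, rounds=3)
print([classify(models, x) for x in X])
```

Each of the three stumps misclassifies several instances on its own, but their weighted vote classifies every training instance correctly. If the very first model already has zero error, this sketch stores no models, so a real implementation would need to handle that case.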

87
**Performance Analysis**

- Error of the combined classifier on the training data converges to zero at an exponential rate (very fast)
  - Questionable value due to possible overfitting
  - Must use independent test data
- Fails on test data if
  - The classifier is more complex than the training data justifies
  - Training error becomes too large too quickly
- Must achieve a balance between model complexity and the fit to the data

88
**Fitting versus Overfitting**

- Overfitting is very difficult to assess here
  - Assume we have reached zero error
  - May be beneficial to continue boosting!
  - Occam's razor?
- Build complex models from simple ones
- Boosting offers very significant improvement
  - Can hope for more improvement than with bagging
  - Can degrade performance
  - This never happens with bagging

89
**Stacking**

- Models of different types
- Meta learner:
  - Learn which learning algorithms are good
  - Combine learning algorithms intelligently
- Level-0 models: decision tree, Naïve Bayes, instance-based
- Level-1 model: meta learner

90
**Meta Learning**

- Hold out part of the training set
- Use the remaining data for training the level-0 methods
- Use the holdout data to train the level-1 learner
- Retrain the level-0 algorithms with all the data
- Comments:
  - Level-1 learning: use a very simple algorithm (e.g., a linear model)
  - Can use cross-validation to allow level-1 algorithms to train on all the data

91
**Supervised Learning**

- Two types of learning
  - Classification
  - Numerical prediction
- Classification learning algorithms
  - Decision trees
  - Naïve Bayes
  - Instance-based learning
  - Many others are part of Weka, browse!

92
**Other Issues in Supervised Learning**

- Evaluation
  - Accuracy: hold-out, bootstrap, cross-validation
  - Simplicity: MDL principle
  - Usefulness: cost-sensitive learning
- Metalearning
  - Bagging, Boosting, Stacking
