Three kinds of learning


1 Three kinds of learning
Supervised learning: learning some mapping from inputs to outputs.
Unsupervised learning: given “data”, what kinds of patterns can you find?
Reinforcement learning: learning from positive and negative reinforcement.

2 Categorical data example
Example from Ross Quinlan, Decision Tree Induction; graphics from Tom Mitchell, Machine Learning

3 Decision Tree Classification

4 Which feature to split on?
Try to classify as many examples as possible with each split. (This is a good split.)

5 Which feature to split on?
These are bad splits: no classifications are obtained.

6 Improving a good split

7 Decision Tree Algorithm Framework
Use the splitting criterion to decide on the best attribute to split on.
Each child is a new decision tree: recurse with the parent's feature removed.
If all data points in a child node are the same class, classify the node as that class.
If no attributes are left, classify by majority rule.
If no data points are left, no such example was seen: classify as the majority class of the entire dataset.
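A minimal Python sketch of this framework, assuming the examples are a list of (feature_dict, label) pairs and the splitting criterion is passed in as a scoring function; both representations are illustrative choices, not part of the original slides.

from collections import Counter

def majority_class(examples):
    # Most common label among the examples (majority rule).
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, score, default):
    if not examples:        # no data points left: no such example was seen;
        return default      # fall back on a majority class (the slide uses the
                            # entire dataset's; here the caller supplies it)
    labels = {label for _, label in examples}
    if len(labels) == 1:    # all data points in this node are the same class
        return labels.pop()
    if not attributes:      # no attributes left: majority rule
        return majority_class(examples)
    # Use the splitting criterion to pick the best attribute to split on.
    best = max(attributes, key=lambda a: score(examples, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Each child is a new decision tree: recurse with the parent's
    # feature removed from consideration.
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, c) for f, c in examples if f[best] == value]
        tree[best][value] = build_tree(subset, remaining, score,
                                       majority_class(examples))
    return tree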

8 How do we know which splits are good?
Want nodes as “pure” as possible. How do we quantify the “randomness” of a node? We want:
All elements +: “randomness” = 0
All elements –: “randomness” = 0
Half +, half –: “randomness” = 1
What should the “randomness” function look like? [Speaker notes: draw the plot from Mitchell; have the class break into groups and determine what the entropy function should be.]

9 Typical solution: Entropy
pp = proportion of + examples; pn = proportion of – examples.
Entropy = -pp log2(pp) - pn log2(pn)
A collection with low entropy is good. [Speaker notes: point out that entropy is better than quadratic solutions, etc., since it generalizes so nicely to three or more classes (keep adding terms).]
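A short sketch of this formula in Python, keeping the slide's pp/pn names and assuming the proportions sum to 1:

import math

def entropy(pp, pn):
    # Entropy of a node given the proportions of + and - examples.
    # Generalizes to three or more classes by adding one
    # -p * log2(p) term per class.
    return sum(-p * math.log2(p) for p in (pp, pn) if p > 0)

# entropy(1.0, 0.0) -> 0.0   all +: no randomness
# entropy(0.5, 0.5) -> 1.0   half and half: maximal randomness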

10 ID3 Criterion
Split on the feature with the most information gain.
Gain = entropy of the original node - weighted sum of the entropies of the child nodes
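A sketch of the gain computation, under the same illustrative data representation as the tree-building sketch above (a list of (feature_dict, label) pairs):

import math
from collections import Counter

def class_entropy(labels):
    # Entropy of a collection, computed from its class labels.
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    # Entropy of the original node, minus the size-weighted sum of
    # the entropies of the child nodes the split would create.
    parent = class_entropy([label for _, label in examples])
    children = 0.0
    for value in {feats[attribute] for feats, _ in examples}:
        subset = [label for feats, label in examples
                  if feats[attribute] == value]
        children += (len(subset) / len(examples)) * class_entropy(subset)
    return parent - children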

11 How good is this split?

12 How good is this split?

13 The big picture
Start with the root.
Find the attribute to split on with the most gain.
Recurse.

14 Assessment How do I know how well my decision tree works?
Training set: the data you use to build the decision tree.
Test set: data you did not use for training, used to assess the quality of the decision tree.

15 Issues on training and test sets
Do you know the correct classification for the test set? If you do, why not include it in the training set to get a better classifier? If you don’t, how can you measure the performance of your classifier?

16 Cross Validation
Tenfold cross-validation:
Ten iterations; each time, pull out a different tenth of the dataset to act as the test set.
Train on the remaining training set.
Measure performance on the test set.
Leave-one-out cross-validation: similar, but leave only one point out each time, then count correct vs. incorrect predictions.
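A sketch of tenfold cross-validation; train and evaluate are hypothetical stand-ins for building a classifier and measuring its accuracy on held-out points:

import random

def cross_validate(data, train, evaluate, k=10, seed=0):
    data = data[:]                      # don't shuffle the caller's list
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_set = folds[i]             # a different tenth each iteration
        train_set = [x for j, fold in enumerate(folds) if j != i
                     for x in fold]
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / k              # mean score across the k folds

# With k == len(data), this reduces to leave-one-out cross-validation.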

17 Noise and Overfitting
Can we always obtain a decision tree that is consistent with the data?
Do we always want a decision tree that is consistent with the data?
Example: predict which Carleton students become CEOs.
Features: state/country of origin, GPA letter, major, age, high school GPA, junior high GPA, ...
What happens with only a few features? What happens with many features?

18 Overfitting
Fitting a classifier “too closely” to the data: finding patterns that aren't really there.
Prevented in decision trees by pruning: when building trees, stop recursion on irrelevant attributes.
Do a statistical test at each node to determine whether recursion should continue, as sketched below.
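One common form of such a test is a chi-squared test on the class counts a split would produce; this sketch uses scipy, and the 0.05 significance threshold is an illustrative choice, not from the slides:

from scipy.stats import chi2_contingency

def split_is_significant(child_class_counts, alpha=0.05):
    # child_class_counts: one row per child node, one column per class,
    # e.g. [[9, 1], [2, 8]] for two children in a two-class problem.
    # If the children's class counts don't differ from chance, treat
    # the attribute as irrelevant and stop recursing.
    _, p_value, _, _ = chi2_contingency(child_class_counts)
    return p_value < alpha

# split_is_significant([[9, 1], [2, 8]]) -> True  (keep the split)
# split_is_significant([[5, 5], [5, 5]]) -> False (prune: stop recursing)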

19 Examples of decision trees using Weka
setenv WEKAHOME /Accounts/courses/cs327/weka-3-0-3
java -classpath weka.jar:$CLASSPATH weka.classifiers.j48.J48 -t data/weather.arff -C 0.09
-C is the confidence parameter for pruning.

20 Preventing overfitting by cross validation
Another technique to prevent overfitting (is this valid?):
Keep recursing on the decision tree as long as you continue to get improved accuracy on the test set.

21 Ensemble Methods
Many “weak” learners, when combined, can perform more strongly than any one by itself.
Bagging and boosting: many different learners vote on the classification.
Multiple algorithms, or different features, or both.

22 Bagging / Boosting
Bagging: a plain vote determines the answer.
Run one algorithm on random subsets of the data to obtain multiple classifiers.
Boosting: a weighted vote determines the answer.
On each iteration, weight more heavily the data points the learner got wrong.
What does it mean to “weight more heavily” for k-NN? For decision trees?
AdaBoost is recent (1997) and has quickly become popular.
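A sketch of bagging as described here; train is a hypothetical function that builds a classifier from data and returns a predict(x) callable, and bootstrap sampling (with replacement) stands in for "random subsets":

import random
from collections import Counter

def bag(data, train, n_learners=25, seed=0):
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        # Bootstrap sample: draw with replacement, same size as the data.
        sample = [rng.choice(data) for _ in data]
        learners.append(train(sample))
    def predict(x):
        # Unweighted majority vote (boosting would weight these votes).
        votes = Counter(learner(x) for learner in learners)
        return votes.most_common(1)[0][0]
    return predict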

23 Computational Learning Theory

24 Chapter 20 up next
Moving on to Chapter 20: statistical learning methods.
Skipping ahead; we will revisit earlier topics (perhaps) near the end of the course.
20.5: Neural networks
20.6: Support vector machines

