Published by Silvia Hall. Modified about 1 year ago.

1
Taming the Learning Zoo

2
Supervised Learning Zoo
Bayesian learning
Maximum likelihood
Maximum a posteriori
Decision trees
Support vector machines
Neural nets
k-Nearest-Neighbors

3
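Very Approximate "Cheat-Sheet" for Techniques Discussed in Class
Columns: Attributes | N scalability | D scalability | Capacity
[Some cells did not survive in this transcript; values are listed in the order they appear]
Bayes nets: D | Good
Naïve Bayes: D | Excellent | Low
Decision trees: D,C | Excellent | Fair
Neural nets: C | Poor | Good
SVMs: C | Good
Nearest neighbors: D,C | Learn: E, Eval: P | Poor | Excellent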

4
What Haven't We Covered?
Boosting
  Way of turning several "weak learners" into a "strong learner"
  E.g. AdaBoost and gradient boosting (the popular random forests algorithm uses bagging, a related ensemble technique)
Regression: predicting continuous outputs y=f(x)
  Neural nets, nearest neighbors work directly as described
  Least squares, locally weighted averaging
Unsupervised learning
  Clustering
  Density estimation
  Dimensionality reduction
  [Harder to quantify performance]

5
Agenda
Quantifying learner performance
  Cross validation
  Precision & recall
Model selection

6
Cross-Validation

7
Assessing Performance of a Learning Algorithm
Samples from X are typically unavailable
Take out some of the training set
  Train on the remaining training set
  Test on the excluded instances
  Cross-validation

8
Cross-Validation
Split original set of examples, train
[Figure: examples D (positive and negative) with a training portion set aside; hypothesis space H]

9
Cross-Validation
Evaluate hypothesis on testing set
[Figure: hypothesis space H; testing set]

10
Cross-Validation
Evaluate hypothesis on testing set
[Figure: hypothesis space H; predictions on the testing set]

11
Cross-Validation
Compare true concept against prediction
[Figure: hypothesis space H; true labels and predictions on the testing set]
9/13 correct

12
Common Splitting Strategies
k-fold cross-validation
[Figure: dataset divided into train and test portions]

13
Common Splitting Strategies
k-fold cross-validation
Leave-one-out (n-fold cross validation)
[Figure: dataset divided into train and test portions]

14
Computational Complexity
k-fold cross validation requires
  k training steps on n(k-1)/k datapoints
  k testing steps on n/k datapoints
  (There are efficient ways of computing L.O.O. estimates for some nonparametric techniques, e.g. Nearest Neighbors)
Average results reported
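The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the function names, the majority-label toy learner, and the 12-example dataset are all assumptions made for the example. Each of the k rounds trains on roughly n(k-1)/k examples and tests on roughly n/k, and setting k = n gives leave-one-out.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation on n items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(fit, predict, xs, ys, k):
    """Average test accuracy over k folds; fit and predict are caller-supplied."""
    accs = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        correct = sum(predict(model, xs[j]) == ys[j] for j in test)
        accs.append(correct / len(test))
    return sum(accs) / k  # "average results reported"

# Hypothetical toy learner: always predict the majority training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

xs = list(range(12))
ys = ["+"] * 8 + ["-"] * 4
cv_accuracy = cross_validate(fit_majority, predict_majority, xs, ys, k=4)
loo_accuracy = cross_validate(fit_majority, predict_majority, xs, ys, k=len(xs))  # leave-one-out
```

Note that the cost is dominated by the k calls to `fit`, which is why leave-one-out is expensive unless the learner admits a shortcut.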

15
Bootstrapping
Similar technique for estimating the confidence in the model parameters $\theta$
Procedure:
  1. Draw k hypothetical datasets from original data, either via cross validation or sampling with replacement.
  2. Fit the model to each dataset j to compute parameters $\theta_j$
  3. Return the standard deviation of $\theta_1,\dots,\theta_k$ (or a confidence interval)
Can also estimate confidence in a prediction y=f(x)

16
Simple Example: Average of N Numbers
Data $D=\{x^{(1)},\dots,x^{(N)}\}$, model is constant $\theta$
Learning: minimize $E(\theta) = \sum_i (x^{(i)} - \theta)^2$ => compute average
Repeat for j=1,…,k:
  Randomly sample $x'^{(1)},\dots,x'^{(N)}$ from D (with replacement)
  Learn $\theta_j = \frac{1}{N} \sum_i x'^{(i)}$
Return histogram of $\theta_1,\dots,\theta_k$
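As a rough sketch of this example, the following resamples a small dataset with replacement k times and collects the k averages. The data values, the function name `bootstrap_means`, and the choice k = 1000 are assumptions for illustration only; the spread of the collected averages is the bootstrap estimate of confidence in the fitted constant.

```python
import random
import statistics

def bootstrap_means(data, k, seed=0):
    """Return k bootstrap estimates of the mean of data."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(k):
        # Draw N points from D with replacement, then fit the constant model (the average).
        resample = [rng.choice(data) for _ in range(n)]
        means.append(sum(resample) / n)
    return means

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 11.0]  # made-up numbers
thetas = bootstrap_means(data, k=1000)

theta_hat = sum(data) / len(data)        # estimate from the full dataset
theta_std = statistics.stdev(thetas)     # spread of the bootstrap estimates
```

Plotting a histogram of `thetas` (with any plotting tool) gives the picture the slide refers to; `theta_std` summarizes it as a single number.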

17
Precision-Recall Curves

18
Precision vs. Recall
Precision
  # of true positives / (# true positives + # false positives)
Recall
  # of true positives / (# true positives + # false negatives)
A precise classifier is selective
A classifier with high recall is inclusive
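The two definitions above translate directly into code. The sketch below uses made-up labels and two hypothetical classifiers, one selective and one inclusive, to show how the two measures pull in different directions. Returning 1.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for binary labels given as 0/1 or booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

y_true    = [1, 1, 1, 1, 0, 0, 0, 0]
selective = [1, 0, 0, 0, 0, 0, 0, 0]  # predicts positive only when very sure
inclusive = [1, 1, 1, 1, 1, 1, 0, 0]  # predicts positive liberally

p_sel, r_sel = precision_recall(y_true, selective)  # precision 1.0, recall 0.25
p_inc, r_inc = precision_recall(y_true, inclusive)  # precision 4/6, recall 1.0
```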

19
Precision-Recall Curves
Measure precision vs. recall as the classification boundary is tuned
[Figure: plot of precision and recall, annotated "Better learning performance"]

20
Precision-Recall Curves
Measure precision vs. recall as the classification boundary is tuned
[Figure: precision-recall curves for Learner A and Learner B]
Which learner is better?

21
Area Under Curve
AUC-PR: measure the area under the precision-recall curve
[Figure: precision-recall curve with AUC=0.68]
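One way to trace a precision-recall curve and compute its area is to sort examples by classifier score and lower the threshold one example at a time. The sketch below assumes real-valued scores with no ties and at least one positive example, and uses the trapezoid rule, which is only one of several conventions for AUC-PR (it tends to be slightly optimistic). The scores and labels are invented for the example.

```python
def pr_curve(y_true, scores):
    """Return (recall, precision) points as the threshold sweeps from high to low.

    Assumes distinct scores and at least one positive label.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(1 for t in y_true if t)
    tp = fp = 0
    points = []
    for i in order:
        if y_true[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

def auc_pr(points):
    """Trapezoidal area under the curve, extended flat back to recall = 0."""
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # hypothetical classifier scores
auc = auc_pr(pr_curve(y_true, scores))
```

Comparing `auc` for two learners on the same test set gives the single-number comparison, at the price of hiding where along the curve each learner does well.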

22
AUC Metrics
A single number that measures "overall" performance across multiple thresholds
Useful for comparing many learners
"Smears out" PR curve
Note training / testing set dependence

23
Model Selection and Regularization

24
Complexity vs. Goodness of Fit
More complex models can fit the data better, but can overfit
Model selection: enumerate several possible hypothesis classes of increasing complexity, stop when cross-validated error levels off
Regularization: explicitly define a metric of complexity and penalize it in addition to loss

25
Model Selection with k-fold Cross-Validation
Parameterize learner by a complexity level C
Model selection pseudocode:
  For increasing levels of complexity C:
    errT[C],errV[C] = Cross-Validate(Learner,C,examples) [average k-fold CV training error, testing error]
    If errT has converged, [needed capacity reached]
      Find value Cbest that minimizes errV[C]
      Return Learner(Cbest,examples)
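The pseudocode above might look as follows in runnable form. Everything concrete here is an assumption chosen to keep the sketch self-contained: the learner is a piecewise-constant classifier on [0, 1) whose complexity C is its number of bins, the data are synthetic with 10% label noise, and "converged" is taken to mean that the training error changed by less than a tolerance `tol` between consecutive levels.

```python
import random

def fit_bins(xs, ys, C):
    """Hypothetical learner: majority label in each of C equal-width bins on [0, 1)."""
    votes = [0] * C
    for x, y in zip(xs, ys):
        votes[min(int(x * C), C - 1)] += 1 if y else -1
    return [v > 0 for v in votes]

def predict_bins(model, x):
    C = len(model)
    return model[min(int(x * C), C - 1)]

def error(model, xs, ys):
    return sum(predict_bins(model, x) != y for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, C, k=5):
    """Return (average training error, average validation error) over k folds."""
    n = len(xs)
    err_t = err_v = 0.0
    for i in range(k):
        held_out = set(range(i, n, k))
        tr = [j for j in range(n) if j not in held_out]
        model = fit_bins([xs[j] for j in tr], [ys[j] for j in tr], C)
        err_t += error(model, [xs[j] for j in tr], [ys[j] for j in tr]) / k
        err_v += error(model, [xs[j] for j in held_out], [ys[j] for j in held_out]) / k
    return err_t, err_v

def select_model(xs, ys, levels, k=5, tol=1e-3):
    err_t, err_v = {}, {}
    prev = None
    for C in levels:  # increasing levels of complexity
        err_t[C], err_v[C] = cross_validate(xs, ys, C, k)
        if prev is not None and abs(err_t[prev] - err_t[C]) < tol:
            break  # training error has converged: needed capacity reached
        prev = C
    c_best = min(err_v, key=err_v.get)  # level with the lowest validation error
    return c_best, fit_bins(xs, ys, c_best), err_t, err_v

rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [(x > 0.5) != (rng.random() < 0.1) for x in xs]  # threshold concept with 10% label noise
c_best, model, err_t, err_v = select_model(xs, ys, levels=range(1, 21))
```

The final model is refit on all examples at the chosen level, matching the last line of the pseudocode.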

26
Model Selection: Decision Trees
C is max depth of decision tree. Suppose N attributes
  For C=1,…,N:
    errT[C],errV[C] = Cross-Validate(Learner,C,examples)
    If errT has converged,
      Find value Cbest that minimizes errV[C]
      Return Learner(Cbest,examples)

27
Model Selection: Feature Selection Example
Have many potential features $f_1,\dots,f_N$
Complexity level C indicates number of features allowed for learning
  For C = 1,…,N:
    errT[C],errV[C] = Cross-Validate(Learner, examples[$f_1,\dots,f_C$])
    If errT has converged,
      Find value Cbest that minimizes errV[C]
      Return Learner(examples[$f_1,\dots,f_{Cbest}$])

28
Benefits / Drawbacks
Automatically chooses complexity level to perform well on hold-out sets
Expensive: many training / testing iterations
[But wait, if we fit complexity level to the testing set, aren't we "peeking?"]

29
Regularization
Let the learner penalize the inclusion of new features vs. accuracy on training set
  A feature is included if it improves accuracy significantly, otherwise it is left out
Leads to sparser models
Generalization to test set is considered implicitly
Much faster than cross-validation

30
Regularization
Minimize: Cost(h) = Loss(h) + Complexity(h)
Example with linear models $y = \theta^T x$:
  $L_2$ error: $\mathrm{Loss}(\theta) = \sum_i (y^{(i)} - \theta^T x^{(i)})^2$
  $L_q$ regularization: $\mathrm{Complexity}(\theta) = \sum_j |\theta_j|^q$
$L_2$ and $L_1$ are most popular in linear regularization
$L_2$ regularization leads to simple computation of optimal $\theta$
$L_1$ is more complex to optimize, but produces sparse models in which many coefficients are 0!
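To see why the $L_2$ case is simple to solve, here is a short worked derivation. It introduces two pieces of notation that are not on the slide: a weight $\lambda \ge 0$ on the complexity term, and the matrix $X$ whose rows are the $x^{(i)T}$, with $y$ the vector of the $y^{(i)}$.

```latex
\begin{aligned}
\mathrm{Cost}(\theta)
  &= \sum_i \bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2 + \lambda \sum_j |\theta_j|^2
   = \lVert y - X\theta \rVert^2 + \lambda \lVert \theta \rVert^2 \\
\nabla_\theta \mathrm{Cost}(\theta)
  &= -2 X^T (y - X\theta) + 2\lambda\theta = 0 \\
(X^T X + \lambda I)\,\theta &= X^T y \\
\theta^{*} &= (X^T X + \lambda I)^{-1} X^T y
\end{aligned}
```

For $\lambda > 0$ the matrix $X^T X + \lambda I$ is always invertible, so the optimum is a single linear solve. With $q = 1$ the term $|\theta_j|$ is not differentiable at 0, so there is no comparable closed form in general, which is the sense in which $L_1$ is harder to optimize.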

31
Data Dredging
As the number of attributes increases, the likelihood that a learner picks up on patterns arising purely by chance increases
In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes can overfit
  E.g., linear classifiers
Sparsity important to enforce
Many opportunities for charlatans in the big data age!
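A small simulation can make the first point concrete. In this sketch (the setup, the function name, and the sizes are all assumptions for illustration), the labels and every attribute are independent coin flips, so no attribute carries any real signal. The code records how well the single best attribute among the first d matches the labels on the training set as d grows.

```python
import random

def best_chance_accuracy(n, d_max, seed=0):
    """best[d-1] = training accuracy of the best single attribute among the first d,
    where labels and attributes are independent random bits."""
    rng = random.Random(seed)
    labels = [rng.random() < 0.5 for _ in range(n)]
    best, best_so_far = [], 0.0
    for _ in range(d_max):
        attr = [rng.random() < 0.5 for _ in range(n)]  # pure noise, unrelated to labels
        agree = sum(a == y for a, y in zip(attr, labels)) / n
        acc = max(agree, 1 - agree)  # the learner may use the attribute or its negation
        best_so_far = max(best_so_far, acc)
        best.append(best_so_far)
    return best

best = best_chance_accuracy(n=20, d_max=1000)
# Compare best[0], best[9], best[99], best[999]: the apparent accuracy of the best
# "pattern" can only go up as more noise attributes are searched.
```

Because the attributes are noise, any attribute selected this way is expected to perform at chance on fresh data, which is the overfitting the slide warns about.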

32
Issues in Practice
The distinctions between learning algorithms diminish when you have a lot of data
The web has made it much easier to gather large scale datasets than in early days of ML
Understanding data with many more attributes than examples is still a major challenge!
  Do humans just have really great priors?

33
Next Lectures
Intelligent agents (R&N Ch 2)
Markov Decision Processes
Reinforcement learning
Applications of AI: computer vision, robotics
