
1 Taming the Learning Zoo

2 Supervised Learning Zoo
Bayesian learning (find parameters of a probabilistic model): maximum likelihood, maximum a posteriori
Classification: decision trees (discrete attributes, few relevant), support vector machines (continuous attributes)
Regression: least squares (known structure, easy to interpret), neural nets (unknown structure, hard to interpret)
Nonparametric approaches: k-nearest-neighbors, locally-weighted averaging / regression

3 Very Approximate "Cheat-Sheet" for Techniques Discussed in Class
(Task: C = classification, R = regression; Attributes: D = discrete, C = continuous)

| Technique | Task | Attributes | N scalability | D scalability | Capacity |
| Bayes nets | C | D | Good | Good | |
| Naïve Bayes | C | D | Excellent | Excellent | Low |
| Decision trees | C | D, C | Excellent | Excellent | Fair |
| Linear least squares | R | C | Excellent | Excellent | Low |
| Nonlinear LS | R | C | Poor | Poor | Good |
| Neural nets | R | C | Poor | Poor | Good |
| SVMs | C | C | Good | Good | |
| Nearest neighbors | C | D, C | L:E, E:P | Poor | Excellent* |
| Locally-weighted averaging | R | C | L:E, E:P | Poor | Excellent* |
| Boosting | C | D, C | ? | ? | Excellent* |

4 Very Approximate "Cheat-Sheet" for Techniques Discussed in Class
(Task: C = classification, R = regression; Attributes: D = discrete, C = continuous)

| Technique | Task | Attributes | N scalability | D scalability | Capacity |
| Bayes nets | C | D | Good | Good | |
| Naïve Bayes | C | D | Excellent | Excellent | Low |
| Decision trees | C | D, C | Excellent | Excellent | Fair |
| Linear least squares | R | C | Excellent | Excellent | Low |
| Nonlinear LS | R | C | Poor | Poor | Good |
| Neural nets | R | C | Poor | Poor | Good |
| SVMs | C | C | Good | Good | |
| Nearest neighbors | C | D, C | L:E, E:P | Poor | Excellent* |
| Locally-weighted averaging | R | C | Good | Poor | Excellent* |
| Boosting | C | D, C | ? | ? | Excellent* |

Note: we have looked at a limited subset of existing techniques in this class (typically, the "classical" versions). Most techniques extend to:
both C/R tasks (e.g., support vector regression)
both continuous and discrete attributes
better scalability for certain types of problems
* With "sufficiently large" data sets; with "sufficiently diverse" weak learners.

5 Agenda
Quantifying learner performance
Cross-validation
Error vs. loss
Precision & recall
Model selection

6 Cross-Validation

7 Assessing Performance of a Learning Algorithm
New samples from X are typically unavailable, so:
Take out some of the training set
Train on the remaining training set
Test on the excluded instances
This is cross-validation.

8 Cross-Validation
Split the original set of examples, train.
(Figure: training examples D, labeled + and -, and the hypothesis space H.)

9 Cross-Validation
Evaluate the hypothesis on the testing set.
(Figure: hypothesis space H and a held-out testing set of + and - examples.)

10 Cross-Validation
Evaluate the hypothesis on the testing set.
(Figure: the learned hypothesis predicts labels for the testing set.)

11 Cross-Validation
Compare the true concept against the prediction.
(Figure: predicted vs. true labels on the testing set; 9/13 correct.)

12 Common Splitting Strategies
k-fold cross-validation
(Figure: the dataset split into train and test portions.)

13 Common Splitting Strategies
k-fold cross-validation
Leave-one-out (n-fold cross-validation)
(Figure: the dataset split into train and test portions.)

14 Computational Complexity
k-fold cross-validation requires:
k training steps on n(k-1)/k datapoints
k testing steps on n/k datapoints
(There are efficient ways of computing L.O.O. estimates for some nonparametric techniques, e.g., nearest neighbors.)
The results are averaged over the k folds and reported.
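A minimal sketch of k-fold cross-validation in Python (NumPy only); `train_fn` and `error_fn` are hypothetical placeholders for whichever learner and error measure are being evaluated:

```python
import numpy as np

def k_fold_cv(X, y, train_fn, error_fn, k=5, seed=0):
    """k-fold cross-validation: train on k-1 folds, test on the held-out fold.

    X, y: NumPy arrays of examples and labels.
    train_fn(X, y) -> hypothesis h; error_fn(h, X, y) -> scalar error.
    Returns the test error averaged over the k folds.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train_fn(X[train], y[train])              # k training steps on ~n(k-1)/k points
        errors.append(error_fn(h, X[test], y[test]))  # k testing steps on ~n/k points
    return float(np.mean(errors))
```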

15 Bootstrapping
Similar technique for estimating the confidence in the model parameters θ.
Procedure:
1. Draw k hypothetical datasets from the original data, either via cross-validation or sampling with replacement.
2. Fit the model for each dataset to compute parameters θ_k.
3. Return the standard deviation of θ_1, …, θ_k (or a confidence interval).
Can also estimate confidence in a prediction y = f(x).

16 Simple Example: Average of N Numbers
Data D = {x^(1), …, x^(N)}, model is a constant θ.
Learning: minimize E(θ) = Σ_i (x^(i) − θ)^2 ⇒ compute the average.
Repeat for j = 1, …, k:
Randomly sample a subset x^(1)′, …, x^(N)′ from D
Learn θ_j = (1/N) Σ_i x^(i)′
Return the histogram of θ_1, …, θ_k.
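A minimal sketch of this example in Python, assuming the k hypothetical datasets are drawn by sampling with replacement (the function name `bootstrap_mean` and the parameter values are illustrative):

```python
import numpy as np

def bootstrap_mean(data, k=1000, seed=0):
    """Bootstrap the average of N numbers: resample the data k times with
    replacement, learn theta_j = mean of the j-th resample, and return the
    estimates together with their standard deviation."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    thetas = np.array([rng.choice(data, size=n, replace=True).mean() for _ in range(k)])
    return thetas, float(thetas.std())

# Example: how confident are we in the average of 50 noisy measurements?
measurements = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=50)
thetas, spread = bootstrap_mean(measurements)
print(f"average = {thetas.mean():.3f}, bootstrap std of the average = {spread:.3f}")
```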

17 Beyond Error Rates

18 Beyond Error Rate
Predicting security risk: predicting "low risk" for a terrorist is far worse than predicting "high risk" for an innocent bystander (but maybe not 5 million of them).
Searching for images: returning irrelevant images is worse than omitting relevant ones.

19 Biased Sample Sets
Often there are orders of magnitude more negative examples than positive ones.
E.g., all images of Kris on Facebook: if I classify all images as "not Kris", I'll have >99.99% accuracy.
Examples of Kris should count much more than non-Kris examples!

20 False Positives
(Figure: true concept vs. learned concept in the (x1, x2) plane.)

21 False Positives
A new query: an example incorrectly predicted to be positive.
(Figure: true concept vs. learned concept in the (x1, x2) plane, with the new query shown.)

22 False Negatives
A new query: an example incorrectly predicted to be negative.
(Figure: true concept vs. learned concept in the (x1, x2) plane, with the new query shown.)

23 Precision vs. Recall
Precision = (# of relevant documents retrieved) / (# of total documents retrieved)
Recall = (# of relevant documents retrieved) / (# of total relevant documents)
Both are numbers between 0 and 1.

24 Precision vs. Recall
Precision = (# of true positives) / (# true positives + # false positives)
Recall = (# of true positives) / (# true positives + # false negatives)
A precise classifier is selective.
A classifier with high recall is inclusive.
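A minimal sketch of these two definitions in Python (NumPy only); the 0/1 label arrays in the example call are hypothetical inputs:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN).

    y_true, y_pred: arrays of 0/1 labels (1 = positive).
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_pred & y_true)    # predicted positive, actually positive
    fp = np.sum(y_pred & ~y_true)   # predicted positive, actually negative
    fn = np.sum(~y_pred & y_true)   # predicted negative, actually positive
    precision = tp / (tp + fp) if tp + fp else 1.0   # selective: few false positives
    recall = tp / (tp + fn) if tp + fn else 1.0      # inclusive: few false negatives
    return float(precision), float(recall)

print(precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # (0.666..., 0.666...)
```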

25 Reducing False Positive Rate
(Figure: true concept vs. learned concept in the (x1, x2) plane.)

26 Reducing False Negative Rate
(Figure: true concept vs. learned concept in the (x1, x2) plane.)

27 Precision-Recall Curves
Measure precision vs. recall as the classification boundary is tuned.
(Figure: precision-recall plot contrasting a perfect classifier with actual performance.)

28 Precision-Recall Curves
Measure precision vs. recall as the classification boundary is tuned.
(Figure: precision-recall plot marking settings that penalize false negatives, penalize false positives, or weight both equally.)

29 Precision-Recall Curves
Measure precision vs. recall as the classification boundary is tuned.
(Figure: precision-recall plot.)

30 Precision-Recall Curves
Measure precision vs. recall as the classification boundary is tuned.
(Figure: two precision-recall curves, the higher one indicating better learning performance.)

31 Option 1: Classification Thresholds
Many learning algorithms (e.g., linear models, neural nets, Bayes nets, SVMs) give a real-valued output v(x) that needs thresholding for classification:
v(x) > θ => positive label given to x
v(x) ≤ θ => negative label given to x
We may want to tune the threshold θ to get fewer false positives or false negatives.
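A minimal sketch of this thresholding scheme, sweeping θ over the observed scores to trace out the precision-recall trade-off (the `scores` array standing in for v(x) is an assumed input):

```python
import numpy as np

def pr_curve(scores, y_true):
    """Sweep the threshold theta over the observed scores: label x positive when
    v(x) > theta, and record the resulting (threshold, precision, recall) triple."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    curve = []
    for theta in np.sort(np.unique(scores))[::-1]:   # from strict to permissive
        y_pred = scores > theta                      # raising theta -> fewer false positives
        tp = np.sum(y_pred & y_true)
        fp = np.sum(y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        curve.append((float(theta), float(precision), float(recall)))
    return curve
```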

32 Option 2: Loss Functions & Weighted Datasets
General learning problem: "Given data D and loss function L, find the hypothesis from hypothesis class H that minimizes L."
Loss functions: L may contain weights to favor accuracy on positive or negative examples, e.g., L = 10 E+ + 1 E-, where E+ and E- are the errors on positive and negative examples.
Weighted datasets: attach a weight w to each example to indicate how important it is, or construct a resampled dataset D′ in which each example is duplicated proportionally to its w.
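A minimal sketch of both options, assuming 0/1 classification errors and per-example weights (the function names are illustrative, not prescribed by the slide):

```python
import numpy as np

def weighted_loss(y_true, y_pred, w_pos=10.0, w_neg=1.0):
    """Weighted 0/1 loss, mirroring L = 10 E+ + 1 E-: errors on positive examples
    cost w_pos each, errors on negative examples cost w_neg each."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    errors = y_true != y_pred
    return float(w_pos * np.sum(errors & y_true) + w_neg * np.sum(errors & ~y_true))

def resample_by_weight(X, y, w, seed=0):
    """Build a resampled dataset D' in which each example is drawn with
    probability proportional to its weight w."""
    rng = np.random.default_rng(seed)
    p = np.asarray(w, dtype=float)
    idx = rng.choice(len(y), size=len(y), replace=True, p=p / p.sum())
    return X[idx], y[idx]
```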

33 Model Selection

34 Complexity vs. Goodness of Fit
More complex models can fit the data better, but can overfit.
Model selection: enumerate several possible hypothesis classes of increasing complexity; stop when the cross-validated error levels off.
Regularization: explicitly define a metric of complexity and penalize it in addition to the loss.

35 Model Selection with k-Fold Cross-Validation
Parameterize the learner by a complexity level C.
Model selection pseudocode (a runnable sketch follows below):
For increasing levels of complexity C:
    errT[C], errV[C] = Cross-Validate(Learner, C, examples)
    If errT has converged:
        Find the value Cbest that minimizes errV[C]
        Return Learner(Cbest, examples)
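A minimal runnable version of the pseudocode, assuming the complexity level C is the degree of a polynomial regression, the error is squared loss, and convergence of errT is checked with a simple tolerance (all of these choices are assumptions, not prescribed by the slide):

```python
import numpy as np

def cross_validate(C, x, y, k=5, seed=0):
    """Return (training error, validation error) for a degree-C polynomial fit,
    each averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    err_t, err_v = [], []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = np.polyfit(x[tr], y[tr], C)
        err_t.append(np.mean((np.polyval(theta, x[tr]) - y[tr]) ** 2))
        err_v.append(np.mean((np.polyval(theta, x[val]) - y[val]) ** 2))
    return float(np.mean(err_t)), float(np.mean(err_v))

def select_model(x, y, max_C=10, tol=1e-3):
    """For increasing complexity C, cross-validate; once errT has converged,
    retrain on all examples at the C that minimizes errV."""
    errT, errV = {}, {}
    for C in range(1, max_C + 1):
        errT[C], errV[C] = cross_validate(C, x, y)
        if C > 1 and abs(errT[C] - errT[C - 1]) < tol:   # errT has converged
            break
    C_best = min(errV, key=errV.get)
    return np.polyfit(x, y, C_best), C_best              # Learner(Cbest, examples)
```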

36 Regularization
Minimize: Cost(h) = Loss(h) + Complexity(h)
Example with linear models y = θ^T x:
L2 error: Loss(θ) = Σ_i (y^(i) − θ^T x^(i))^2
Lq regularization: Complexity(θ) = Σ_j |θ_j|^q
L2 and L1 are the most popular choices for linear regularization.
L2 regularization leads to a simple computation of the optimal θ.
L1 is more complex to optimize, but produces sparse models in which many coefficients are 0!
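A minimal sketch of that "simple computation of the optimal θ" for L2 regularization, assuming a regularization weight λ on the complexity term (λ is not shown on the slide): minimizing Σ_i (y^(i) − θ^T x^(i))^2 + λ Σ_j θ_j^2 gives the closed form θ = (X^T X + λI)^(-1) X^T y.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """L2-regularized linear least squares (ridge regression).

    Minimizes sum_i (y^(i) - theta^T x^(i))^2 + lam * sum_j theta_j^2,
    whose unique optimum solves (X^T X + lam*I) theta = X^T y.
    X: (n, d) NumPy array of inputs; y: length-n array of targets.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

L1 regularization has no comparable closed form, which is why it is more complex to optimize; it is usually handled with iterative solvers.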

37 Data Dredging
As the number of attributes increases, so does the likelihood that a learner picks up on patterns that arise purely by chance.
In the extreme case where there are more attributes than datapoints (e.g., pixels in a video), even very simple hypothesis classes, such as linear classifiers, can overfit.
Many opportunities for charlatans in the big data age!
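A minimal sketch of this extreme case: with more attributes than datapoints, even an unregularized linear fit interpolates completely random labels (all data below are synthetic random numbers, so any apparent pattern is pure chance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # more attributes than datapoints
X = rng.normal(size=(n, d))           # random "features" with no real structure
y = rng.choice([-1.0, 1.0], size=n)   # random labels

# With d > n, the minimum-norm least-squares fit interpolates the training labels...
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("training accuracy:", np.mean(np.sign(X @ theta) == y))             # 1.0

# ...but it has learned nothing that transfers to fresh random data.
X_new = rng.normal(size=(n, d))
y_new = rng.choice([-1.0, 1.0], size=n)
print("accuracy on new data:", np.mean(np.sign(X_new @ theta) == y_new))  # around 0.5
```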

38 Other Topics in Machine Learning
Unsupervised learning: dimensionality reduction, clustering
Reinforcement learning: an agent acts and learns how to act in an environment by observing rewards
Learning from demonstration: an agent learns how to act in an environment by observing demonstrations from an expert

39 Issues in Practice
The distinctions between learning algorithms diminish when you have a lot of data.
The web has made it much easier to gather large-scale datasets than in the early days of ML.
Understanding data with many more attributes than examples is still a major challenge!
Do humans just have really great priors?

40 Next Lectures
Temporal sequence models (R&N 15)
Decision-theoretic planning
Reinforcement learning
Applications of AI

