Published by Perla Hiner. Modified over 4 years ago.

1
1 A Black-Box Approach to Machine Learning
Yoav Freund

2
2 Why do we need learning?
Computers need functions that map highly variable data:
  Speech recognition: audio signal -> words
  Image analysis: video signal -> objects
  Bioinformatics: micro-array images -> gene function
  Data mining: transaction logs -> customer classification
For accuracy, functions must be tuned to fit the data source.
For real-time processing, function computation has to be very fast.

3
3 The complexity/accuracy tradeoff
[Figure: error plotted against complexity, with a level marked "trivial performance".]

4
4 The speed/flexibility tradeoff
[Figure: flexibility plotted against speed, with Matlab code, Java code, machine code, digital hardware, and analog hardware placed along the tradeoff.]

5
5 Theory Vs. Practice
Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations.
  - I prove theorems.
Practitioner: I want a real-time algorithm that performs well on my problem.
  - I experiment.
My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
  - I do both.

6
6 Plan of talk
  The black-box approach
  Boosting
  Alternating decision trees
  A commercial application
  Boosting the margin
  Confidence rated predictions
  Online learning

7
7 The black-box approach
Statistical models are not generators, they are predictors.
A predictor is a function from observation X to action Z.
After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number).
Goal: find a predictor with small loss (in expectation, with high probability, cumulative, …).

8
8 Main software components
A predictor: maps an observation x to an action z.
A learner: takes training examples and produces a predictor.
We assume the predictor will be applied to examples similar to those on which it was trained.

9
9 Learning in a system
[Diagram components: learning system, training examples, predictor, target system, sensor data, action, feedback.]

10
10 Special case: Classification
Observation X - arbitrary (measurable) space
Prediction Z - {1,…,K}
Outcome Y - finite set {1,…,K}
Usually K=2 (binary classification)

11
11 Batch learning for binary classification
The slide names four quantities (their formulas were images):
  Data distribution
  Generalization error
  Training set
  Training error
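Under the standard batch-learning setup, the four quantities named on this slide would read roughly as follows. The symbols D, h, n, \epsilon and \hat{\epsilon} are assumed notation (n matches the example count used on the boosting slides); they are not recovered from the slide itself.

```latex
\begin{align*}
\text{Data distribution:}\quad & (x,y) \sim D, \qquad y \in \{-1,+1\} \\
\text{Generalization error:}\quad & \epsilon(h) = P_{(x,y)\sim D}\bigl[h(x) \neq y\bigr] \\
\text{Training set:}\quad & T = (x_1,y_1),\dots,(x_n,y_n) \sim D^n \\
\text{Training error:}\quad & \hat{\epsilon}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl[h(x_i) \neq y_i\bigr]
\end{align*}
```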

12
12 Boosting Combining weak learners

13
13 A weighted training set
  Feature vectors
  Binary labels {-1,+1}
  Positive weights

14
14 A weak learner
[Diagram: a weighted training set is fed to the weak learner, which outputs a weak rule h; h maps instances to predictions.]
The weak requirement: [formula not recovered]

15
15 The boosting process
(x1,y1,1/n), … (xn,yn,1/n) -> weak learner -> h1
(x1,y1,w1), … (xn,yn,wn) -> weak learner -> h2
The same step repeats with updated weights for h3, h4, …, h9, and so on up to hT.
Final rule: [formula not recovered]

16
16 Adaboost
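The slide body for this step was an image. As a minimal sketch of the standard AdaBoost loop (reweight, call the weak learner, combine by weighted vote), the following uses single-feature threshold rules ("stumps") as the weak learner. The function names, the stump learner, and the clipping constant are illustrative assumptions, not taken from the talk.

```python
import math

def train_stump(X, y, w):
    """Weak learner: best single-feature threshold rule under weights w.
    Returns (weighted_error, feature_index, threshold, sign)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (+1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (sign if x[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] > thr else -sign

def adaboost(X, y, rounds):
    """Returns a list of (alpha, stump) pairs; labels y must be in {-1,+1}."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform weights 1/n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Final rule: sign of the weighted vote of the weak rules."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional "interval" problem that no single stump can fit, three rounds of this sketch already drive the training error to zero, which illustrates how weak rules combine into an accurate one.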

17
17 Main property of Adaboost
If the advantages of the weak rules over random guessing are γ_1, γ_2, …, γ_T, then the training error of the final rule is at most exp(-2 Σ_{t=1..T} γ_t^2).
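For reference, the standard form of this training-error bound, with its intermediate product step, is sketched below. The notation is assumed: \epsilon_t is the weighted error of the t-th weak rule and \gamma_t = 1/2 - \epsilon_t is its advantage over random guessing.

```latex
\[
\hat{\epsilon}(H_{\text{final}})
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp\!\Bigl(-2 \sum_{t=1}^{T} \gamma_t^2\Bigr)
\]
```

The last step uses 1 - x \le e^{-x}. If every weak rule has advantage at least \gamma > 0, the bound decays as e^{-2\gamma^2 T}, i.e. exponentially fast in the number of rounds.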

18
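18 Boosting block diagram
[Diagram: the strong learner consists of a booster and a weak learner; the booster passes example weights to the weak learner, the weak learner returns weak rules, and the strong learner outputs an accurate rule.]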

19
19 What is a good weak learner?
The set of weak rules (features) should be:
  Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
  Simple enough to allow efficient search for a rule with non-trivial weighted training error.
  Small enough to avoid over-fitting.
Calculation of prediction from observations should be very fast.

20
20 Alternating decision trees Freund, Mason 1997

21
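21 Decision Trees
[Diagram: a decision tree with tests X>3 and Y>5 and yes/no branches leading to leaf labels; beside it, the X-Y plane split at X=3 and Y=5 with the corresponding region labels.]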

22
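22 A decision tree as a sum of weak rules.
[Diagram: root contribution -0.2; test X>3 (no: -0.1, yes: +0.1); test Y>5 (yes: +0.2, no: -0.3); the prediction is the sign of the summed contributions. The same contributions are shown on the X-Y plane.]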

23
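23 An alternating decision tree
[Diagram: the same tree as on the previous slide (root -0.2; X>3 with -0.1 / +0.1; Y>5 with +0.2 / -0.3) plus an additional test Y<1 with contributions 0.0 / +0.7; the prediction is the sign of the summed contributions.]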
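To make the "sum of contributions, then sign" idea concrete, here is a small sketch that evaluates an alternating decision tree built from the numbers on these slides. The nesting (Y>5 placed under the "yes" branch of X>3, Y<1 attached at the root) and the yes/no assignment of 0.0 and +0.7 are assumptions, since the diagram layout did not survive; the tuple encoding and function names are hypothetical.

```python
# A splitter is (test, value_if_no, value_if_yes, children_if_no, children_if_yes).
# Unlike an ordinary decision tree, a node may carry several splitters, and
# every splitter reached by the instance contributes to the score.

def adt_score(base, splitters, point):
    """Sum the base value and the contributions of all reachable splitters."""
    score = base
    for test, no_val, yes_val, no_kids, yes_kids in splitters:
        if test(point):
            score += yes_val + adt_score(0.0, yes_kids, point)
        else:
            score += no_val + adt_score(0.0, no_kids, point)
    return score

# Assumed structure, using the values that appear on the slide.
Y_GT_5 = (lambda p: p["Y"] > 5, -0.3, +0.2, [], [])
X_GT_3 = (lambda p: p["X"] > 3, -0.1, +0.1, [], [Y_GT_5])
Y_LT_1 = (lambda p: p["Y"] < 1, 0.0, +0.7, [], [])
ROOT_VALUE = -0.2
ROOT_SPLITTERS = [X_GT_3, Y_LT_1]

def adt_classify(point):
    """Prediction is the sign of the score; |score| acts as a confidence."""
    return 1 if adt_score(ROOT_VALUE, ROOT_SPLITTERS, point) > 0 else -1
```

Under these assumptions, the point X=4, Y=6 scores -0.2 + 0.1 + 0.2 + 0.0 = +0.1 and is classified +1, while X=2, Y=3 scores -0.3 and is classified -1.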

24
24 Example: Medical Diagnostics
Cleve dataset from UC Irvine database.
Heart disease diagnostics (+1 = healthy, -1 = sick).
13 features from tests (real-valued and discrete).
303 instances.

25
25 AD-tree for heart-disease diagnostics
Score > 0: healthy
Score < 0: sick

26
26 Commercial Deployment.

27
27 AT&T buisosity problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
Distinguish business/residence customers from call detail information (time of day, length of call, …).
230M telephone numbers, label unknown for ~30%.
260M calls / day.
Required computer resources:
  Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock).
  Significant: calculating the classification for ~70M customers.
  Negligible: learning (2 hours on 10K training examples on an off-line computer).

28
28 AD-tree for buisosity

29
29 AD-tree (Detail)

30
30 Quantifiable results
At 94% accuracy, coverage increased from 44% to 56%.
Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
[Figure: precision/recall, with axes labeled score and accuracy.]

31
31 Adaboost's resistance to over-fitting
Why statisticians find Adaboost interesting.

32
32 A very curious phenomenon
Boosting decision trees
Using 2,000,000 parameters

33
33 Large margins
Thesis: large margins => reliable predictions
Very similar to SVM.

34
34 Experimental Evidence

35
35 Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998
H: set of binary functions with VC-dimension d
C: [definition and bound not recovered]
No dependence on the number of combined functions!

36
36 Idea of Proof

37
37 Confidence rated predictions
Agreement gives confidence

38
38 A motivating example
[Figure: scatter of + and - examples; the regions where the two classes meet are marked "?" and labeled "Unsure".]

39
39 The algorithm
Freund, Mansour, Schapire 2001
The slide gives formulas (images) for:
  Parameters
  Hypothesis weight
  Empirical log ratio
  Prediction rule
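Only the headings of this slide survive. The sketch below shows one way the four headings could fit together, modeled on exponential-weight averaging: each hypothesis is weighted by exp(-eta * training error), the log of the ratio between the total weight voting +1 and the total weight voting -1 is computed, and the rule abstains when that ratio is within delta of zero. The exact formulas, the parameter names eta and delta, and the function name are assumptions, not text recovered from the slide.

```python
import math

def confidence_rated_predict(hypotheses, train, x, eta, delta):
    """Return +1 or -1 when the weighted vote is decisive, 0 to abstain.

    hypotheses: callables mapping an instance to -1 or +1
    train:      list of (instance, label) pairs with labels in {-1,+1}
    eta, delta: assumed tuning parameters (eta > 0, delta >= 0)
    """
    pos = neg = 0.0
    for h in hypotheses:
        # Hypothesis weight: exponentially small in its training error.
        err = sum(1 for xi, yi in train if h(xi) != yi) / len(train)
        weight = math.exp(-eta * err)
        if h(x) == 1:
            pos += weight
        else:
            neg += weight
    if neg == 0.0:          # unanimous +1: log ratio is +infinity
        return 1
    if pos == 0.0:          # unanimous -1: log ratio is -infinity
        return -1
    # Empirical log ratio of the two weighted camps.
    log_ratio = (1.0 / eta) * math.log(pos / neg)
    if log_ratio > delta:
        return 1
    if log_ratio < -delta:
        return -1
    return 0                # good hypotheses disagree: unsure
```

With threshold rules on the line, a query point far from the class boundary gets a unanimous vote, while a point between the classes splits the low-error hypotheses and triggers an abstention, which is the "agreement gives confidence" idea.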

40
40 Suggested tuning
Suppose H is a finite set.
Suggested tuning: [formula not recovered]
Yields: [formula not recovered]

41
41 Confidence Rating block diagram
[Diagram: candidate rules and training examples feed the rater-combiner, which outputs a confidence-rated rule.]

42
42 Face Detection
Viola & Jones 1999
Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).

43
43 Using confidence to save time
The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.
[Diagram: all boxes pass through Feature 1; each box is either rejected as "definitely not a face" or passed on as "might be a face" to Feature 2, and so on.]
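The early-exit idea can be sketched as follows: features are evaluated one at a time, and a box is dropped as soon as the running score falls below a rejection threshold for that step. This is a simplified illustration only; the per-step thresholds, the running-sum scoring, and the function name are assumptions and do not reproduce the actual detector's stage structure.

```python
def cascade_classify(features, reject_below, box):
    """Evaluate features in order and stop early on a confident rejection.

    features:     list of callables, each mapping a box to a real-valued score
    reject_below: one (hypothetical) rejection threshold per feature
    Returns (might_be_face, number_of_features_evaluated).
    """
    total = 0.0
    for used, (feature, threshold) in enumerate(zip(features, reject_below), start=1):
        total += feature(box)
        if total < threshold:
            return False, used        # definitely not a face: stop computing
    return True, len(features)        # survived every step: might be a face
```

Because most boxes in an image are rejected after the first few features, the average cost per box stays far below the cost of evaluating all features.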

44
44 Using confidence to train car detectors

45
45 Original Image Vs. difference image

46
46 Co-training
Blum and Mitchell 98
[Diagram: highway images are provided as raw B/W images and as difference images; a partially trained B/W-based classifier and a partially trained difference-based classifier each pass their confident predictions to the other.]

47
47 Co-Training Results
Levin, Freund, Viola 2002
[Figure: results for the raw image detector and the difference image detector, before and after co-training.]

48
48 Selective sampling
[Diagram: unlabeled data -> partially trained classifier -> sample of unconfident examples -> labeled examples.]
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby

49
49 Online learning Adapting to changes

50
50 Online learning
So far, the only statistical assumption was that data is generated IID.
Can we get rid of that assumption? Yes, if we consider prediction as a repeated game.
An expert is an algorithm that maps the past to a prediction.
Suppose we have a set of experts; we believe one is good, but we don't know which one.

51
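51 Online prediction game
In each round (formulas not recovered):
  Experts generate predictions
  Algorithm makes its own prediction
  Nature generates outcome
Total loss of expert i: [formula not recovered]
Total loss of algorithm: [formula not recovered]
Goal: for any sequence of events, [formula not recovered]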

52
52 A very simple example
Binary classification
N experts
One expert is known to be perfect
Algorithm: predict like the majority of experts that have made no mistake so far.
Bound: [formula not recovered]
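The algorithm described here is commonly called the halving algorithm. A minimal sketch, with hypothetical function and variable names: every mistake of the majority vote eliminates at least half of the surviving experts, so with one perfect expert among N the number of mistakes is at most log2(N).

```python
def run_halving(expert_predictions, outcomes):
    """Majority vote over experts that have made no mistake so far.

    expert_predictions[i][t] is expert i's 0/1 prediction in round t;
    outcomes[t] is the true 0/1 outcome. Assumes at least one expert is
    perfect, so the set of surviving experts never becomes empty.
    Returns (number_of_mistakes, surviving_experts).
    """
    alive = set(range(len(expert_predictions)))
    mistakes = 0
    for t, y in enumerate(outcomes):
        votes = [expert_predictions[i][t] for i in alive]
        guess = 1 if 2 * sum(votes) >= len(votes) else 0   # ties go to 1
        if guess != y:
            mistakes += 1          # majority was wrong, so >= half are removed
        alive = {i for i in alive if expert_predictions[i][t] == y}
    return mistakes, alive
```

For example, with 8 experts the sketch can make at most 3 mistakes, however the outcomes are ordered.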

53
53 History of online learning
Littlestone & Warmuth
Vovk
Vovk and Shafer's recent book: "Probability and Finance: It's Only a Game!"
Innumerable contributions from many fields: Hannan, Blackwell, Davisson, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer …

54
54 Lossless compression
X - arbitrary input space.
Y - {0,1}
Z - [0,1]
Log Loss: [formula not recovered]
Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.

55
55 Bayesian averaging
Folk theorem in Information Theory
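The statement itself is not in the extracted text. A common form of this result, written for the log loss named on the preceding slide, is sketched below; the symbols (z for a prediction in [0,1], y for the binary outcome, L_i for the cumulative loss of expert i, a uniform prior over N experts) are assumed notation.

```latex
\begin{align*}
\text{Log loss:}\quad & \ell(z, y) = -y \ln z - (1-y)\ln(1-z) \\
\text{Bayesian average:}\quad & z^{t} = \sum_{i=1}^{N} w_i^{t}\, z_i^{t},
  \qquad w_i^{t} = \frac{e^{-L_i^{t-1}}}{\sum_{j=1}^{N} e^{-L_j^{t-1}}} \\
\text{Guarantee:}\quad & L_A \;\le\; \min_{i} L_i + \ln N
\end{align*}
```

The guarantee follows because the mixture assigns the observed sequence probability (1/N) \sum_i e^{-L_i}, which is at least (1/N) e^{-\min_i L_i}; taking negative logarithms gives the bound.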

56
56 Game theoretical loss
X - arbitrary space
Y - a loss for each of N actions
Z - a distribution over N actions
Loss: [formula not recovered]

57
57 Learning in games
Freund and Schapire 94
An algorithm which knows T in advance guarantees: [formula not recovered]
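The guarantee on this slide was a formula and is missing. As a hedged illustration of the kind of algorithm involved, here is a minimal exponential-weights (Hedge-style) sketch that uses the horizon T to set its learning rate. The tuning eta = sqrt(8 ln N / T) and the resulting regret bound sqrt(T ln N / 2) for losses in [0,1] are the standard textbook analysis, not necessarily the exact bound shown in the talk; the function name is hypothetical.

```python
import math

def hedge_total_loss(loss_rounds):
    """Run exponential weights over N actions with losses in [0, 1].

    loss_rounds[t][i] is the loss of action i in round t. The horizon
    T = len(loss_rounds) is known in advance and used to tune eta.
    Returns the algorithm's total expected loss.
    """
    T = len(loss_rounds)
    n = len(loss_rounds[0])
    eta = math.sqrt(8.0 * math.log(n) / T)
    weights = [1.0] * n
    total = 0.0
    for losses in loss_rounds:
        z = sum(weights)
        probs = [w / z for w in weights]            # distribution over actions
        total += sum(p * l for p, l in zip(probs, losses))
        # Multiplicatively penalize each action by its observed loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total
```

The regret (total loss minus the best action's total loss) grows only like the square root of T, so the average per-round regret vanishes for any sequence of losses, with no IID assumption.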

58
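58 Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95
Algorithm cannot observe full outcome.
Instead, a single action is chosen at random according to the algorithm's distribution, and only the loss of that action is observed.
We describe an algorithm that guarantees, with probability [formula not recovered]: [bound not recovered]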

59
59 Why isn't online learning practical?
Prescriptions too similar to Bayesian approach.
Implementing low-level learning requires a large number of experts.
Computation increases linearly with the number of experts.
Potentially very powerful for combining a few high-level experts.

60
60 Online learning for detector deployment
[Diagram: a face detector library holds entries such as "Merl frontal 1.0" (code; B/W frontal face detector; indoor, neutral background; direct front-right lighting). A detector is downloaded into an adaptive real-time face detector, which takes images, outputs face detections, and uses feedback for online learning (OL).]
Detector can be adaptive!

61
61 Summary
By combining predictors we can:
  Improve accuracy.
  Estimate prediction confidence.
  Adapt on-line.
To make machine learning practical:
  Speed up the predictors.
  Concentrate human feedback on hard cases.
  Fuse data from several sources.
  Share predictor libraries.
