
Published by Perla Hiner. Modified over 2 years ago.

1
1 A Black-Box approach to machine learning
Yoav Freund

2
2 Why do we need learning?
Computers need functions that map highly variable data:
  Speech recognition: audio signal -> words
  Image analysis: video signal -> objects
  Bio-informatics: micro-array images -> gene function
  Data mining: transaction logs -> customer classification
For accuracy, functions must be tuned to fit the data source.
For real-time processing, function computation has to be very fast.

3
3 The complexity/accuracy tradeoff
[Figure: error plotted against complexity, with the level of trivial performance marked.]

4
4 The speed/flexibility tradeoff
[Figure: speed plotted against flexibility for Matlab code, Java code, machine code, digital hardware, and analog hardware.]

5
5 Theory vs. practice
Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations.
  - I prove theorems.
Practitioner: I want a real-time algorithm that performs well on my problem.
  - I experiment.
My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
  - I do both.

6
6 Plan of talk
The black-box approach
Boosting
Alternating decision trees
A commercial application
Boosting the margin
Confidence rated predictions
Online learning

7
7 The black-box approach
Statistical models are not generators, they are predictors.
A predictor is a function from observation X to action Z.
After the action is taken, outcome Y is observed, which implies a loss L (a real number).
Goal: find a predictor with small loss (in expectation, with high probability, cumulative, …).
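The observation/action/outcome/loss vocabulary can be made concrete with a small sketch. The 0-1 loss, the threshold predictor, and the four examples below are illustrative assumptions, not part of the talk.

```python
from typing import Callable, Sequence, Tuple


def zero_one_loss(action: int, outcome: int) -> float:
    """Loss L: 0 if the action matches the observed outcome, 1 otherwise."""
    return 0.0 if action == outcome else 1.0


def average_loss(
    predictor: Callable[[float], int],
    examples: Sequence[Tuple[float, int]],
    loss: Callable[[int, int], float] = zero_one_loss,
) -> float:
    """Average loss of a predictor over (observation, outcome) pairs."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(loss(predictor(x), y) for x, y in examples) / len(examples)


def threshold_predictor(x: float) -> int:
    """Hypothetical predictor: act +1 when the observation exceeds 0.5."""
    return 1 if x > 0.5 else -1


# Hypothetical (observation, outcome) pairs.
examples = [(0.9, 1), (0.2, -1), (0.7, -1), (0.1, -1)]
print(average_loss(threshold_predictor, examples))
```

The predictor is treated purely as a function; nothing in `average_loss` depends on how it was built, which is the sense of "black box" used here.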

8
8 Main software components
A predictor: maps an observation x to an action z.
A learner: maps training examples to a predictor.
We assume the predictor will be applied to examples similar to those on which it was trained.

9
9 Learning in a system
[Diagram: a learning system turns training examples into a predictor; the predictor maps sensor data from the target system to an action, and the target system returns feedback.]

10
10 Special case: Classification
Observation X - arbitrary (measurable) space
Prediction Z - {1,…,K}
Outcome Y - finite set {1,…,K}
Usually K=2 (binary classification)

11
11 Batch learning for binary classification
Data distribution:
Generalization error:
Training set:
Training error:
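For reference, the four quantities are usually defined as follows in the batch setting. The symbols (the distribution D, the rule h, the sample size m) are the conventional ones and are assumed here, since the slide's own notation is not shown.

```latex
\begin{align*}
\text{Data distribution:}\quad
  & (x, y) \sim \mathcal{D}, \qquad y \in \{-1, +1\} \\
\text{Generalization error:}\quad
  & \epsilon(h) = \Pr_{(x,y)\sim\mathcal{D}}\bigl[h(x) \neq y\bigr] \\
\text{Training set:}\quad
  & T = \bigl((x_1,y_1),\dots,(x_m,y_m)\bigr) \sim \mathcal{D}^{m} \\
\text{Training error:}\quad
  & \hat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\bigl[h(x_i) \neq y_i\bigr]
\end{align*}
```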

12
12 Boosting
Combining weak learners

13
13 A weighted training set
Feature vectors
Binary labels {-1,+1}
Positive weights

14
14 A weak learner
[Diagram: a weak learner takes a weighted training set and outputs a weak rule h, which maps instances to predictions.]
The weak requirement:
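A common concrete choice of weak learner is a decision stump: a single threshold on a single feature, chosen to minimize the weighted training error. The sketch below assumes that choice and reads the weak requirement as "weighted error below 1/2"; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

# A stump is (feature index, threshold, sign): predict sign if x[j] > threshold, else -sign.
Stump = Tuple[int, float, int]


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    j, threshold, sign = stump
    return sign if x[j] > threshold else -sign


def weighted_error(stump: Stump, X, y, w) -> float:
    """Fraction of the total weight that the stump misclassifies."""
    wrong = sum(wi for xi, yi, wi in zip(X, y, w) if stump_predict(stump, xi) != yi)
    return wrong / sum(w)


def train_stump(X: List[Sequence[float]], y: List[int], w: List[float]):
    """Exhaustive search over features, midpoint thresholds, and both signs."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for sign in (+1, -1):
                err = weighted_error((j, threshold, sign), X, y, w)
                if err < best_err:
                    best, best_err = (j, threshold, sign), err
    return best, best_err


# Toy weighted training set: one feature, uniform weights.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, +1, +1]
w = [0.25, 0.25, 0.25, 0.25]
print(train_stump(X, y, w))
```

Because the search is over a small, simple class, it is cheap, and the resulting rule is fast to evaluate.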

15
15 The boosting process
(x1,y1,1/n), … (xn,yn,1/n) -> weak learner -> h1
(x1,y1,w1), … (xn,yn,wn) -> weak learner -> h2
(x1,y1,w1), … (xn,yn,wn) -> weak learner -> h3
…
(x1,y1,w1), … (xn,yn,wn) -> weak learner -> hT
Final rule:

16
16 Adaboost
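A minimal sketch of the algorithm in its textbook form, assuming one-dimensional inputs, labels in {-1,+1}, and threshold stumps as the weak rules. The helper names and the toy data are hypothetical; the update `alpha = 0.5 * ln((1 - err) / err)` is the standard one for binary AdaBoost.

```python
import math


def stump_predict(stump, x):
    """stump = (threshold, sign): predict sign if x > threshold, else -sign."""
    threshold, sign = stump
    return sign if x > threshold else -sign


def best_stump(xs, ys, w):
    """Weak learner: the stump with the smallest weighted error (weights sum to 1)."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_err = None, float("inf")
    for threshold in thresholds:
        for sign in (+1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict((threshold, sign), x) != y)
            if err < best_err:
                best, best_err = (threshold, sign), err
    return best, best_err


def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                     # start from uniform weights
    ensemble = []                         # list of (alpha, stump)
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1 - 1e-12)   # avoid log(0)
        if err >= 0.5:                    # weak requirement violated: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of mistakes, decrease the weight of correct examples.
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Final rule: sign of the weighted vote of the weak rules."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can fit: positives on both sides of the negatives.
xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set every single stump errs on at least a third of the examples, while the weighted vote of three stumps fits all six.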

17
17 Main property of Adaboost
If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the training error of the final rule is at most
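In the standard analysis of Freund and Schapire, the bound takes the form below. It assumes the advantage of round t is defined as the gap between 1/2 and the weighted training error of the t-th weak rule; that normalization is an assumption here and may differ from the one used in the talk.

```latex
\gamma_t = \tfrac{1}{2} - \epsilon_t,
\qquad
\hat{\epsilon}(f_T)
\;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^{2}}
\;\le\; \exp\!\Bigl(-2 \sum_{t=1}^{T} \gamma_t^{2}\Bigr)
```

For example, if every weak rule has advantage at least 0.1, then after T = 500 rounds the bound is at most exp(-10), roughly 4.5e-5.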

18
18 Boosting block diagram
[Diagram: the booster sends example weights to the weak learner and receives weak rules back; together they form a strong learner that outputs an accurate rule.]

19
19 What is a good weak learner?
The set of weak rules (features) should be:
  Flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
  Simple enough to allow efficient search for a rule with non-trivial weighted training error.
  Small enough to avoid over-fitting.
Calculation of prediction from observations should be very fast.

20
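20 Alternating decision trees
Freund, Mason 1997

21
21 Decision trees
[Figure: a decision tree with splits X>3 and Y>5, shown next to the X-Y plane.]

22
22 A decision tree as a sum of weak rules
[Figure: the same tree, with splits Y>5 and X>3, written as the sign of a sum of rule contributions over the X-Y plane.]

23
23 An alternating decision tree
[Figure: an alternating decision tree over X and Y with splitter nodes including Y<1 and prediction values such as -0.2 and 0.0; the output is the sign of the summed prediction values.]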
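How such a tree is evaluated can be sketched as follows: every prediction node reached contributes its value, all splitters under a reached prediction node are followed, and the classification is the sign of the total. The node values and thresholds below are hypothetical, since the figure's numbers are not fully available.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PredictionNode:
    value: float
    splitters: List["SplitterNode"] = field(default_factory=list)


@dataclass
class SplitterNode:
    test: Callable[[Dict[str, float]], bool]
    if_true: PredictionNode
    if_false: PredictionNode


def score(node: PredictionNode, x: Dict[str, float]) -> float:
    """Sum the values of all prediction nodes reachable for instance x."""
    total = node.value
    for splitter in node.splitters:
        branch = splitter.if_true if splitter.test(x) else splitter.if_false
        total += score(branch, x)
    return total


def classify(root: PredictionNode, x: Dict[str, float]) -> int:
    return 1 if score(root, x) >= 0 else -1


# Hypothetical tree: two splitters under the root, one nested splitter.
nested = SplitterNode(lambda x: x["Y"] < 1, PredictionNode(-0.5), PredictionNode(+0.2))
root = PredictionNode(-0.2, [
    SplitterNode(lambda x: x["Y"] > 5, PredictionNode(+0.1), PredictionNode(-0.3)),
    SplitterNode(lambda x: x["X"] > 3, PredictionNode(+0.7, [nested]), PredictionNode(-0.1)),
])

print(score(root, {"X": 4, "Y": 6}), classify(root, {"X": 4, "Y": 6}))
```

The magnitude of the score, not only its sign, is available to the caller, which is what later slides use as a measure of confidence.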

24
24 Example: Medical diagnostics
Cleve dataset from the UC Irvine database.
Heart disease diagnostics (+1 = healthy, -1 = sick).
13 features from tests (real-valued and discrete).
303 instances.

25
25 AD-tree for heart-disease diagnostics
>0: Healthy
<0: Sick

26
26 Commercial Deployment.

27
27 AT&T buisosity problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
Distinguish business/residence customers from call detail information (time of day, length of call, …).
230M telephone numbers, label unknown for ~30%.
260M calls / day.
Required computer resources:
  Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock).
  Significant: calculating the classification for ~70M customers.
  Negligible: learning (2 hours on 10K training examples on an off-line computer).

28
28 AD-tree for buisosity

29
29 AD-tree (Detail)

30
30 Quantifiable results
At 94% accuracy, coverage increased from 44% to 56%.
Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
[Figure: accuracy plotted against score, and a precision/recall curve.]

31
31 Adaboost's resistance to over-fitting
Why statisticians find Adaboost interesting.

32
32 A very curious phenomenon
Boosting decision trees
Using 2,000,000 parameters

33
33 Large margins
Thesis: large margins => reliable predictions
Very similar to SVM.
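For a weighted vote, the margin of a labeled example is usually taken to be the weighted fraction of rules voting for the correct label minus the fraction voting against it. The sketch below assumes that normalized definition; the function name and the numbers are hypothetical.

```python
def normalized_margin(alphas, votes, y):
    """Margin in [-1, +1]: y * sum(alpha_t * h_t(x)) / sum(|alpha_t|).

    alphas: weights of the combined rules
    votes:  each rule's prediction h_t(x) in {-1, +1} on one example
    y:      the example's true label in {-1, +1}
    """
    total = sum(abs(a) for a in alphas)
    if total == 0:
        raise ValueError("at least one rule needs a non-zero weight")
    return y * sum(a * v for a, v in zip(alphas, votes)) / total


# Three rules; the two heaviest agree with the label, the lightest disagrees.
print(normalized_margin([0.5, 0.3, 0.2], [+1, +1, -1], y=+1))
```

A margin near +1 means nearly all of the vote agrees with the correct label; a margin near 0 means the vote was close, so the prediction is less reliable.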

34
34 Experimental Evidence

35
35 Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998
H: set of binary functions with VC-dimension d
C: set of convex combinations of functions in H
No dependence on the number of combined functions!
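For reference, the margin bound from the cited paper is usually stated as below. The symbols m (training set size), θ (margin threshold), δ (failure probability), D (data distribution), and S (training sample) are notation assumed here. With probability at least 1 - δ over the training sample, for every f in C and every θ > 0:

```latex
\Pr_{D}\bigl[\, y f(x) \le 0 \,\bigr]
\;\le\;
\Pr_{S}\bigl[\, y f(x) \le \theta \,\bigr]
\;+\;
O\!\left( \frac{1}{\sqrt{m}}
  \left( \frac{d \,\log^{2}(m/d)}{\theta^{2}} + \log\frac{1}{\delta} \right)^{1/2}
\right)
```

The number of combined rules does not appear on the right-hand side; only the VC-dimension d of the base class and the margin threshold θ do.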

36
36 Idea of Proof

37
37 Confidence rated predictions
Agreement gives confidence

38
38 A motivating example
[Figure: labeled examples with three query points marked "?"; one region is labeled "Unsure".]

39
39 The algorithm
Freund, Mansour, Schapire 2001
Parameters:
Hypothesis weight:
Empirical log ratio:
Prediction rule:
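A sketch of the rule as it is usually described for the cited work, under stated assumptions: each hypothesis h gets weight exp(-eta * training_error(h)), the empirical log ratio compares the total weight voting +1 with the total weight voting -1, and the predictor abstains when the ratio is within delta of zero. The parameter names `eta` and `delta`, the threshold hypotheses, and their error values are hypothetical.

```python
import math


def empirical_log_ratio(hypotheses, errors, x, eta):
    """(1/eta) * ln(weight of hypotheses voting +1 / weight voting -1)."""
    pos = sum(math.exp(-eta * e) for h, e in zip(hypotheses, errors) if h(x) == 1)
    neg = sum(math.exp(-eta * e) for h, e in zip(hypotheses, errors) if h(x) == -1)
    if neg == 0:
        return math.inf      # unanimous +1
    if pos == 0:
        return -math.inf     # unanimous -1
    return math.log(pos / neg) / eta


def predict_with_abstention(hypotheses, errors, x, eta, delta):
    """Return +1 or -1 when the weighted vote is decisive, 0 ("unsure") otherwise."""
    ratio = empirical_log_ratio(hypotheses, errors, x, eta)
    if ratio > delta:
        return 1
    if ratio < -delta:
        return -1
    return 0


# Three hypothetical threshold rules with similar training errors.
hypotheses = [lambda x, t=t: 1 if x > t else -1 for t in (1.0, 2.0, 3.0)]
errors = [0.10, 0.12, 0.11]

for x in (0.0, 2.5, 5.0):
    print(x, predict_with_abstention(hypotheses, errors, x, eta=10.0, delta=0.1))
```

Where the good hypotheses agree (x = 0 or x = 5) the rule commits; where they disagree (x = 2.5) it abstains.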

40
40 Suggested tuning
Suppose H is a finite set.
Yields:

41
41 Confidence rating block diagram
[Diagram: the rater-combiner takes training examples and candidate rules and outputs a confidence-rated rule.]

42
42 Face detection
Viola & Jones 1999
Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).

43
43 Using confidence to save time
The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated.
[Diagram: all boxes pass through feature 1, which separates "definitely not a face" from "might be a face"; the latter continue to feature 2, and so on.]
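The time saving comes from stopping early once a box is confidently rejected. The sketch below shows that idea only; it is not the Viola-Jones detector. The stage features, the cumulative-score rule, and the thresholds are hypothetical.

```python
def cascade_detect(box, stages):
    """Evaluate features one at a time and stop as soon as the box is rejected.

    stages: list of (feature_fn, reject_below); the running score must stay
            at or above reject_below after each stage.
    Returns (might_be_face, number_of_features_evaluated).
    """
    score = 0.0
    for evaluated, (feature_fn, reject_below) in enumerate(stages, start=1):
        score += feature_fn(box)
        if score < reject_below:
            return False, evaluated      # definitely not a face: stop early
    return True, len(stages)             # survived every stage: might be a face


# Hypothetical two-stage cascade over boxes given as [feature1, feature2] values.
stages = [
    (lambda box: box[0], 0.5),
    (lambda box: box[1], 1.2),
]

for box in ([0.1, 0.9], [0.6, 0.2], [0.8, 0.9]):
    print(box, cascade_detect(box, stages))
```

The first box is rejected after a single feature; only boxes that remain plausible pay for the later features.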

44
44 Using confidence to train car detectors

45
45 Original Image Vs. difference image

46
46 Co-training
Blum and Mitchell 98
[Diagram: highway images are provided as raw B/W images and as difference images; a partially trained B/W-based classifier and a partially trained difference-based classifier each produce confident predictions.]

47
47 Co-training results
Levin, Freund, Viola 2002
[Figure: results for the raw-image detector and the difference-image detector, before and after co-training.]

48
48 Selective sampling
[Diagram: a partially trained classifier filters unlabeled data into a sample of unconfident examples, which are labeled and added to the labeled examples.]
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
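The filtering step can be sketched as picking the unlabeled examples on which the current classifier's score is closest to zero. This is a simple confidence-threshold variant, not query-by-committee itself; the function name, the threshold, and the labeling budget are hypothetical.

```python
def select_queries(unlabeled, score, threshold, budget):
    """Return up to `budget` unlabeled examples with |score| below `threshold`,
    least confident first. `score` is the partially trained classifier's
    real-valued output; its sign is the prediction, its magnitude the confidence."""
    unconfident = [x for x in unlabeled if abs(score(x)) < threshold]
    unconfident.sort(key=lambda x: abs(score(x)))
    return unconfident[:budget]


# Hypothetical one-dimensional pool where the score is the input itself.
pool = [-2.0, -0.3, 0.1, 0.4, 1.5]
print(select_queries(pool, score=lambda x: x, threshold=0.5, budget=2))
```

Only the selected examples are sent to a human for labeling, which concentrates the labeling effort on the hard cases.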

49
49 Online learning Adapting to changes

50
50 Online learning
So far, the only statistical assumption was that data is generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeated game.
An expert is an algorithm that maps the past to a prediction.
Suppose we have a set of experts; we believe one is good, but we don't know which one.

51
51 Online prediction game
For each round t:
  Experts generate predictions:
  Algorithm makes its own prediction:
  Nature generates outcome:
Total loss of expert i:
Total loss of algorithm:
Goal: for any sequence of events

52
52 A very simple example
Binary classification.
N experts; one expert is known to be perfect.
Algorithm: predict like the majority of the experts that have made no mistake so far.
Bound:
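This majority rule is commonly known as the halving algorithm: every mistake of the algorithm eliminates at least half of the surviving experts, so it makes at most log2(N) mistakes. A minimal sketch, assuming predictions and outcomes in {-1,+1} and ties broken toward +1; the data is hypothetical.

```python
import math


def run_halving(expert_predictions, outcomes):
    """expert_predictions[t][i] is expert i's prediction in round t.

    Returns (number of mistakes made by the algorithm, set of surviving experts).
    """
    n_experts = len(expert_predictions[0])
    alive = set(range(n_experts))          # experts with no mistake so far
    mistakes = 0
    for predictions, outcome in zip(expert_predictions, outcomes):
        vote = sum(predictions[i] for i in alive)
        guess = 1 if vote >= 0 else -1     # majority of surviving experts
        if guess != outcome:
            mistakes += 1
        # Drop every expert that was wrong in this round.
        alive = {i for i in alive if predictions[i] == outcome}
    return mistakes, alive


# Four experts; expert 3 is perfect on this sequence.
expert_predictions = [
    [-1, -1, -1, +1],
    [+1, +1, +1, -1],
    [-1, +1, -1, +1],
]
outcomes = [+1, -1, +1]
mistakes, alive = run_halving(expert_predictions, outcomes)
print(mistakes, alive, math.log2(4))
```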

53
53 History of online learning
Littlestone & Warmuth
Vovk
Vovk and Shafer's recent book: "Probability and Finance: It's Only a Game!"
Innumerable contributions from many fields: Hannan, Blackwell, Davisson, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer …

54
54 Lossless compression
X - arbitrary input space
Y - {0,1}
Z - [0,1]
Log loss:
Entropy, lossless compression, MDL.
Statistical likelihood, standard probability theory.

55
55 Bayesian averaging Folk theorem in Information Theory
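The folk theorem is usually stated as follows: under log loss, the Bayesian mixture of N experts with a uniform prior has cumulative loss at most that of the best expert plus ln N. The sketch below checks this numerically, assuming outcomes in {0,1} and expert predictions strictly between 0 and 1; the data is hypothetical.

```python
import math


def log_loss(p, y):
    """Log loss of predicting probability p for outcome 1, when y is observed."""
    return -math.log(p if y == 1 else 1.0 - p)


def bayes_mixture_losses(expert_probs, outcomes):
    """expert_probs[t][i] is expert i's probability that outcome t equals 1.

    Returns (cumulative log loss of the mixture, list of cumulative expert losses).
    """
    n = len(expert_probs[0])
    weights = [1.0 / n] * n                 # uniform prior over experts
    mixture_loss = 0.0
    expert_loss = [0.0] * n
    for probs, y in zip(expert_probs, outcomes):
        p_mix = sum(w * p for w, p in zip(weights, probs))
        mixture_loss += log_loss(p_mix, y)
        for i, p in enumerate(probs):
            expert_loss[i] += log_loss(p, y)
        # Posterior update: multiply each weight by the expert's likelihood of y.
        weights = [w * (p if y == 1 else 1.0 - p) for w, p in zip(weights, probs)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return mixture_loss, expert_loss


expert_probs = [
    [0.9, 0.5, 0.2],
    [0.8, 0.5, 0.3],
    [0.3, 0.5, 0.6],
    [0.9, 0.5, 0.1],
]
outcomes = [1, 1, 0, 1]
mixture_loss, expert_loss = bayes_mixture_losses(expert_probs, outcomes)
print(mixture_loss, min(expert_loss) + math.log(3))
```

The guarantee holds for every outcome sequence, with no assumption on how the outcomes are generated.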

56
56 Game theoretical loss
X - arbitrary space
Y - a loss for each of N actions
Z - a distribution over N actions
Loss:

57
57 Learning in games
Freund and Schapire 94
An algorithm which knows T in advance guarantees:
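A minimal sketch of an exponential-weights (Hedge-style) algorithm for this setting, assuming losses in [0,1]. The learning rate sqrt(8 ln N / T) and the regret bound sqrt(T ln N / 2) are the common textbook tuning, used here as an assumption; the constants in the talk's own guarantee may differ. The loss sequence is hypothetical.

```python
import math


def hedge(loss_rows, eta):
    """loss_rows[t][i] is the loss in [0, 1] of action i in round t.

    Plays the distribution p_i proportional to exp(-eta * cumulative loss of i).
    Returns (expected cumulative loss of the algorithm, cumulative loss per action).
    """
    n = len(loss_rows[0])
    cumulative = [0.0] * n
    algorithm_loss = 0.0
    for losses in loss_rows:
        smallest = min(cumulative)          # shift exponents for numerical stability
        weights = [math.exp(-eta * (c - smallest)) for c in cumulative]
        total = sum(weights)
        probs = [w / total for w in weights]
        algorithm_loss += sum(p * l for p, l in zip(probs, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return algorithm_loss, cumulative


T, N = 50, 3
loss_rows = [[((t * (i + 1)) % 3) / 2.0 for i in range(N)] for t in range(T)]
eta = math.sqrt(8 * math.log(N) / T)        # requires knowing T in advance
algorithm_loss, cumulative = hedge(loss_rows, eta)
print(algorithm_loss, min(cumulative) + math.sqrt(T * math.log(N) / 2))
```

The learning rate depends on T, which is why the guarantee is stated for an algorithm that knows T in advance.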

58
58 Multi-arm bandits
Auer, Cesa-Bianchi, Freund, Schapire 95
Algorithm cannot observe full outcome.
Instead, a single action is chosen at random according to the algorithm's distribution, and only the loss of that action is observed.
We describe an algorithm that guarantees:
With probability
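A sketch in the spirit of the Exp3 algorithm from the cited line of work, written in its reward form (rewards in [0,1] rather than losses, which is a swap made here for simplicity). The exploration rate `gamma`, the reward function, and the number of rounds are hypothetical.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the weight-proportional distribution with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3(reward_fn, n_arms, rounds, gamma, rng):
    """reward_fn(t, arm) returns the reward in [0, 1] of the pulled arm only."""
    weights = [1.0] * n_arms
    total_reward = 0.0
    for t in range(rounds):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)          # the other arms stay unobserved
        total_reward += reward
        # Importance-weighted estimate keeps the update unbiased.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    return exp3_probabilities(weights, gamma), total_reward


# Hypothetical bandit: arm 0 always pays 1, the other arms pay 0.
rng = random.Random(0)
probs, total_reward = exp3(lambda t, arm: 1.0 if arm == 0 else 0.0,
                           n_arms=3, rounds=500, gamma=0.1, rng=rng)
print(probs, total_reward)
```

Dividing the observed reward by the probability of the pulled arm compensates for the fact that only one arm's outcome is seen per round.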

59
59 Why isn't online learning practical?
Prescriptions too similar to the Bayesian approach.
Implementing low-level learning requires a large number of experts.
Computation increases linearly with the number of experts.
Potentially very powerful for combining a few high-level experts.

60
60 Online learning for detector deployment
[Diagram: detectors are downloaded from a face detector library (example entry: Merl frontal 1.0, a B/W frontal face detector for indoor scenes with neutral background and direct front-right lighting) and combined by online learning (OL); images and feedback go in, face detections come out, giving an adaptive real-time face detector.]
Detector can be adaptive!

61
61 Summary
By combining predictors we can:
  Improve accuracy.
  Estimate prediction confidence.
  Adapt on-line.
To make machine learning practical:
  Speed up the predictors.
  Concentrate human feedback on hard cases.
  Fuse data from several sources.
  Share predictor libraries.

