# A Black-Box approach to machine learning (Yoav Freund)

1 A Black-Box approach to machine learning Yoav Freund

2 Why do we need learning? Computers need functions that map highly variable data. Speech recognition: audio signal -> words. Image analysis: video signal -> objects. Bio-informatics: micro-array images -> gene function. Data mining: transaction logs -> customer classification. For accuracy, functions must be tuned to fit the data source. For real-time processing, function computation has to be very fast.

3 The complexity/accuracy tradeoff [Figure: error plotted against complexity, with the level of trivial performance marked.]

4 The speed/flexibility tradeoff [Figure: speed plotted against flexibility for Matlab code, Java code, machine code, digital hardware, and analog hardware.]

5 Theory vs. practice. Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations. I prove theorems. Practitioner: I want a real-time algorithm that performs well on my problem. I experiment. My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. I do both.

6 Plan of talk: the black-box approach; boosting; alternating decision trees; a commercial application; boosting the margin; confidence-rated predictions; online learning.

7 The black-box approach. Statistical models are not generators, they are predictors. A predictor is a function from observation X to action Z. After the action is taken, an outcome Y is observed, which implies a loss L (a real number). Goal: find a predictor with small loss (in expectation, with high probability, cumulative, …).
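As a minimal sketch of the vocabulary on this slide, the snippet below treats a predictor as a plain function from an observation to an action and measures its average loss on a handful of observed outcomes. The names `zero_one_loss`, `average_loss`, and `threshold_predictor`, and the toy data, are illustrative assumptions and not part of the talk.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type names: a predictor maps an observation x to an action z,
# and a loss maps (action z, outcome y) to a real number.
Predictor = Callable[[float], int]
Loss = Callable[[int, int], float]


def zero_one_loss(z: int, y: int) -> float:
    """Loss incurred when action z is taken and outcome y is then observed."""
    return 0.0 if z == y else 1.0


def average_loss(predict: Predictor,
                 data: Sequence[Tuple[float, int]],
                 loss: Loss) -> float:
    """Average loss over observed (x, y) pairs; an estimate of the expected loss."""
    return sum(loss(predict(x), y) for x, y in data) / len(data)


def threshold_predictor(x: float) -> int:
    """Toy predictor: action +1 when the observation exceeds 0.5, else -1."""
    return 1 if x > 0.5 else -1


# Toy (observation, outcome) pairs; the last one is mispredicted.
observations = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1), (0.7, -1)]
print(average_loss(threshold_predictor, observations, zero_one_loss))  # prints 0.2
```

Nothing in `average_loss` looks inside the predictor, which is the sense in which the approach is black-box.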

8 Main software components [Diagram: a predictor maps x to z; a learner takes training examples and produces a predictor.] We assume the predictor will be applied to examples similar to those on which it was trained.

9 Learning in a system [Diagram components: training examples, learning system, predictor, target system, sensor data, action, feedback.]

10 Special case: Classification. Observation X: an arbitrary (measurable) space. Prediction Z: {1,…,K}; usually K=2 (binary classification). Outcome Y: the finite set {1,…,K}.

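11 Batch learning for binary classification. Data distribution: $(x, y) \sim D$, with $y \in \{-1,+1\}$. Generalization error: $\epsilon(h) = P_{(x,y)\sim D}[h(x) \neq y]$. Training set: $T = \{(x_1,y_1),\dots,(x_n,y_n)\}$, drawn i.i.d. from $D$. Training error: $\hat{\epsilon}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[h(x_i) \neq y_i]$.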

12 Boosting Combining weak learners

13 A weighted training set: feature vectors, binary labels {-1,+1}, positive weights.

14 A weak learner [Diagram: the weak learner takes a weighted training set and outputs a weak rule h, which maps instances to predictions.] The weak requirement: h must do slightly better than random guessing on the weighted training set.

15 The boosting process: the weak learner is first run on the uniformly weighted set (x1,y1,1/n), … (xn,yn,1/n) and returns h1. In every later round it is run on the reweighted set (x1,y1,w1), … (xn,yn,wn) and returns the next rule: h2, h3, …, hT. Final rule: the sign of a weighted sum of h1, …, hT.

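17 Main property of Adaboost: if the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \dots, \gamma_T$ (that is, weak rule $t$ has weighted training error $1/2 - \gamma_t$), then the training error of the final rule is at most $\exp\left(-2\sum_{t=1}^{T}\gamma_t^2\right)$.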
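The boosting process and the training-error property can be tried out on a toy problem. The sketch below is a textbook AdaBoost with one-dimensional decision stumps as the weak rules, not code from the talk; the dataset, the function names, and the choice of stumps are assumptions made for illustration. It checks the standard bound: with advantages $\gamma_t = 1/2 - \epsilon_t$, the training error is at most $\exp(-2\sum_t \gamma_t^2)$.

```python
import math


def stump(threshold, sign, x):
    """Weak rule on a scalar feature: predict `sign` if x > threshold, else -sign."""
    return sign if x > threshold else -sign


def weak_learner(xs, ys, ws):
    """Exhaustive search for the stump with the smallest weighted training error."""
    best = None
    for threshold in [min(xs) - 1.0] + sorted(xs):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if stump(threshold, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def adaboost(xs, ys, rounds):
    """Return the weighted rules and the advantage of each weak rule."""
    n = len(xs)
    ws = [1.0 / n] * n                      # round 1 uses uniform weights 1/n
    rules, gammas = [], []
    for _ in range(rounds):
        err, threshold, sign = weak_learner(xs, ys, ws)
        err = min(max(err, 1e-12), 1.0 - 1e-12)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        rules.append((alpha, threshold, sign))
        gammas.append(0.5 - err)            # advantage over random guessing
        # Increase the weight of misclassified examples, then renormalize.
        ws = [w * math.exp(-alpha * y * stump(threshold, sign, x))
              for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]
    return rules, gammas


def final_rule(rules, x):
    """Sign of the weighted sum of the weak rules."""
    score = sum(alpha * stump(threshold, sign, x)
                for alpha, threshold, sign in rules)
    return 1 if score > 0 else -1


# Toy data that no single stump classifies correctly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
rules, gammas = adaboost(xs, ys, rounds=10)
train_error = sum(final_rule(rules, x) != y for x, y in zip(xs, ys)) / len(xs)
bound = math.exp(-2.0 * sum(g * g for g in gammas))
print(train_error, bound)
```

The weak learner here is deliberately tiny. The booster only interacts with it through the weighted training set and the returned rule.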

18 Boosting block diagram [Diagram: inside the strong learner, the booster sends example weights to the weak learner and receives a weak rule back; the strong learner outputs an accurate rule.]

19 What is a good weak learner? The set of weak rules (features) should be: flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label; simple enough to allow efficient search for a rule with non-trivial weighted training error; small enough to avoid over-fitting. Calculation of the prediction from observations should be very fast.

20 Alternating decision trees Freund, Mason 1997

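21 Decision Trees [Figure: a decision tree that tests X>3 and Y>5 (yes/no branches, leaf predictions such as +1), shown next to the corresponding partition of the X-Y plane at X=3 and Y=5.]

22 A decision tree as a sum of weak rules. [Figure: the same tree rewritten as a sum. The root contributes -0.2; the test X>3 contributes -0.1 (no) or +0.1 (yes); the test Y>5 contributes -0.3 (no) or +0.2 (yes). The prediction is the sign of the sum.]

23 An alternating decision tree [Figure: the tree of the previous slide (root -0.2; X>3: no -0.1, yes +0.1; Y>5: no -0.3, yes +0.2) with an added test Y<1 contributing 0.0 (no) or +0.7 (yes). The prediction is the sign of the sum of all contributions reached.]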
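A small sketch of how an alternating decision tree is evaluated: start from the root value, add the contribution of every rule whose precondition holds, and take the sign. The numbers are read off the slide residue (root -0.2, tests X>3, Y>5, Y<1). Which branch each value belongs to, and where the Y<1 test attaches, are not recoverable from the transcript, so the arrangement below (Y>5 under "X>3 = yes", Y<1 at the root) is an assumption, as are the names `Rule`, `adtree_score`, and `adtree_predict`.

```python
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    precondition: Callable[[float, float], bool]  # is this rule reached?
    condition: Callable[[float, float], bool]     # the test at the rule
    if_yes: float                                 # contribution when the test holds
    if_no: float                                  # contribution when it does not


ROOT = -0.2
RULES: List[Rule] = [
    Rule(lambda x, y: True, lambda x, y: x > 3, +0.1, -0.1),
    # Assumed to sit under the "X>3 = yes" branch, as in the plain decision tree.
    Rule(lambda x, y: x > 3, lambda x, y: y > 5, +0.2, -0.3),
    # Assumed to attach at the root; the transcript does not say where it hangs.
    Rule(lambda x, y: True, lambda x, y: y < 1, +0.7, 0.0),
]


def adtree_score(x: float, y: float) -> float:
    """Sum the root value and the contribution of every rule that is reached."""
    score = ROOT
    for rule in RULES:
        if rule.precondition(x, y):
            score += rule.if_yes if rule.condition(x, y) else rule.if_no
    return score


def adtree_predict(x: float, y: float) -> int:
    """Classification is the sign of the score; its magnitude acts as a confidence."""
    return 1 if adtree_score(x, y) > 0 else -1


for point in [(4, 6), (2, 3), (4, 3), (2, 0.5)]:
    print(point, round(adtree_score(*point), 2), adtree_predict(*point))
```

Unlike an ordinary decision tree, several rules can contribute to one instance, which is why the representation is a sum rather than a single path.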

24 Example: Medical Diagnostics. Cleve dataset from the UC Irvine database. Heart disease diagnostics (+1 = healthy, -1 = sick). 13 features from tests (real-valued and discrete). 303 instances.

25 AD-tree for heart-disease diagnostics. Score >0: healthy. Score <0: sick.

26 Commercial Deployment.

27 AT&T buisosity problem (Freund, Mason, Rogers, Pregibon, Cortes 2000). Distinguish business/residence customers from call detail information (time of day, length of call, …). 230M telephone numbers, label unknown for ~30%. 260M calls / day. Required computer resources: Huge: counting log entries to produce statistics; uses specialized I/O-efficient sorting algorithms (Hancock). Significant: calculating the classification for ~70M customers. Negligible: learning (2 hours on 10K training examples on an off-line computer).

30 Quantifiable results. At 94% accuracy, coverage increased from 44% to 56%. Saved AT&T \$15M in the year 2000 in operations costs and missed opportunities. [Figure: precision/recall; labels: score, accuracy.]

32 A very curious phenomenon Boosting decision trees Using 2,000,000 parameters

33 Large margins. Thesis: large margins => reliable predictions. Very similar to SVM.
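For a weighted vote of ±1 rules, the margin usually meant in this context is the normalized quantity $y \sum_t \alpha_t h_t(x) / \sum_t |\alpha_t|$, which lies in [-1, +1] and is positive exactly when the vote is correct. The helper below is a small sketch of that definition; the function name and the example numbers are made up for illustration.

```python
def normalized_margin(alphas, votes, y):
    """Margin of a weighted vote on one example.

    alphas: weights of the combined rules.
    votes:  the rules' predictions h_t(x), each in {-1, +1}.
    y:      the true label in {-1, +1}.
    """
    total = sum(abs(a) for a in alphas)
    return y * sum(a * v for a, v in zip(alphas, votes)) / total


# Three rules; the two heaviest agree with the label, the lightest disagrees.
margin = normalized_margin([0.5, 0.3, 0.2], [1, 1, -1], y=1)
print(round(margin, 3))
```

A margin near +1 means nearly all of the weight agrees with the label; a margin near 0 means the vote was close.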

34 Experimental Evidence

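35 Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998). H: a set of binary functions with VC-dimension d. C: the convex combinations of functions in H. With high probability over a training set T of n examples, for every $f \in C$ and every $\theta > 0$: $P_D[y f(x) \le 0] \le P_T[y f(x) \le \theta] + \tilde{O}\left(\frac{1}{\theta}\sqrt{\frac{d}{n}}\right)$. No dependence on the number of combined functions!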

36 Idea of Proof

37 Confidence rated predictions Agreement gives confidence

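38 A motivating example [Figure: a sample of points labeled + and -; points marked "?" fall in a region labeled "Unsure".]

39 The algorithm (Freund, Mansour, Schapire 2001). Parameters: $\eta > 0$, $\Delta \ge 0$. Hypothesis weight: $w(h) = e^{-\eta \hat{\epsilon}(h)}$, where $\hat{\epsilon}(h)$ is the training error of $h$. Empirical log ratio: $\hat{l}(x) = \frac{1}{\eta} \ln \frac{\sum_{h: h(x)=+1} w(h)}{\sum_{h: h(x)=-1} w(h)}$. Prediction rule: predict +1 if $\hat{l}(x) > \Delta$, predict -1 if $\hat{l}(x) < -\Delta$, and abstain if $|\hat{l}(x)| \le \Delta$.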
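The rule can be sketched in a few lines. The version below follows the exponential-weights form of Freund, Mansour and Schapire as best it can be reconstructed: each hypothesis gets weight exp(-eta * training error), the log ratio of the weight voting +1 to the weight voting -1 is scaled by 1/eta, and the predictor abstains when that ratio is within delta of zero. The parameter names `eta` and `delta`, the threshold hypothesis class, and the toy data are assumptions for illustration.

```python
import math


def confidence_rated_predict(hypotheses, train, x, eta, delta):
    """Return +1 or -1 when the weighted vote is decisive, 0 to abstain."""
    def empirical_error(h):
        return sum(1 for xi, yi in train if h(xi) != yi) / len(train)

    # Total weight of the hypotheses voting +1 and of those voting -1 on x.
    pos = sum(math.exp(-eta * empirical_error(h)) for h in hypotheses if h(x) == 1)
    neg = sum(math.exp(-eta * empirical_error(h)) for h in hypotheses if h(x) == -1)

    if neg == 0.0:
        log_ratio = math.inf        # every hypothesis votes +1
    elif pos == 0.0:
        log_ratio = -math.inf       # every hypothesis votes -1
    else:
        log_ratio = math.log(pos / neg) / eta

    if log_ratio > delta:
        return 1
    if log_ratio < -delta:
        return -1
    return 0                        # good hypotheses disagree: unsure


# Hypothetical hypothesis class: thresholds on a scalar feature.
hypotheses = [lambda x, t=t: 1 if x > t else -1 for t in range(10)]
# No training data between 3 and 7, so rules consistent with the data disagree there.
train = [(1, -1), (2, -1), (3, -1), (7, 1), (8, 1), (9, 1)]

for query in (0.5, 5, 9.5):
    print(query, confidence_rated_predict(hypotheses, train, query, eta=10.0, delta=0.1))
```

In the gap between the two classes the low-error hypotheses split evenly, so the rule abstains there, which matches the "agreement gives confidence" idea.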

40 Suggested tuning Yields: Suppose H is a finite set.

41 Confidence Rating block diagram [Diagram: the rater-combiner takes training examples and candidate rules and outputs a confidence-rated rule.]

42 Face Detection. Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second). (Viola & Jones 1999)

43 Using confidence to save time. The detector combines 6000 simple features using Adaboost. In most boxes, only 8-9 features are calculated. [Diagram: all boxes go through Feature 1, which separates "definitely not a face" from "might be a face"; the boxes that might be a face go on to Feature 2.]
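The time saving comes from stopping as soon as a box is confidently rejected. The sketch below is a deliberately simplified illustration of that early-exit idea, not the Viola-Jones detector itself: the feature functions, the thresholds, and the dictionary-based "box" are all hypothetical.

```python
def cascade_detect(stages, box):
    """Evaluate features one at a time and stop at the first confident rejection.

    stages: list of (feature_fn, reject_below) pairs; the running score must stay
            at or above reject_below for the box to remain a candidate.
    Returns (might_be_a_face, number_of_features_evaluated).
    """
    score = 0.0
    for evaluated, (feature_fn, reject_below) in enumerate(stages, start=1):
        score += feature_fn(box)
        if score < reject_below:
            return False, evaluated      # definitely not a face: stop early
    return True, len(stages)             # survived every stage


# Hypothetical features read from a precomputed dictionary of box statistics.
stages = [
    (lambda b: b["contrast"], 0.2),
    (lambda b: b["symmetry"], 0.9),
    (lambda b: b["eyes"], 1.5),
]

background = {"contrast": 0.1, "symmetry": 0.0, "eyes": 0.0}
face_like = {"contrast": 0.5, "symmetry": 0.6, "eyes": 0.7}

print(cascade_detect(stages, background))  # rejected after a single feature
print(cascade_detect(stages, face_like))   # evaluated on all stages
```

Because most boxes in an image are background, the average cost per box stays close to the cost of the first few features.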

44 Using confidence to train car detectors

45 Original Image Vs. difference image

46 Co-training (Blum and Mitchell 98) [Diagram: highway images are available as raw B/W images and as difference images; a partially trained B/W-based classifier and a partially trained diff-based classifier exchange their confident predictions.]

47 Co-Training Results [Figure: raw-image detector and difference-image detector, before and after co-training.] (Levin, Freund, Viola 2002)

48 Selective sampling [Diagram: a partially trained classifier selects a sample of unconfident examples from the unlabeled data; these become labeled examples.] Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby.

49 Online learning Adapting to changes

50 Online learning. An expert is an algorithm that maps the past to a prediction. So far, the only statistical assumption was that data is generated IID. Can we get rid of that assumption? Yes, if we consider prediction as a repeating game. Suppose we have a set of experts; we believe one is good, but we don't know which one.

51 Online prediction game. For $t = 1, \dots, T$: experts generate predictions $z_t^1, \dots, z_t^N$; the algorithm makes its own prediction $z_t$; nature generates outcome $y_t$. Total loss of expert i: $L_i = \sum_{t=1}^{T} L(z_t^i, y_t)$. Total loss of algorithm: $L_A = \sum_{t=1}^{T} L(z_t, y_t)$. Goal: for any sequence of events, $L_A \le \min_i L_i + o(T)$.

52 A very simple example. Binary classification, N experts, one expert is known to be perfect. Algorithm: predict like the majority of experts that have made no mistake so far. Bound: at most $\log_2 N$ mistakes.
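The majority rule on this slide is often called the halving algorithm: every mistake eliminates at least half of the experts that were still perfect, so with one perfect expert among N there can be at most log2(N) mistakes. A minimal sketch follows; the threshold experts and the example sequence are made up for illustration.

```python
import math


def halving(experts, sequence):
    """Predict with the majority of experts that have made no mistake so far.

    experts:  functions mapping an observation to a label in {0, 1}.
    sequence: (observation, true label) pairs.
    Returns the number of mistakes made by the majority vote.
    """
    alive = list(experts)
    mistakes = 0
    for x, y in sequence:
        votes = [expert(x) for expert in alive]
        prediction = 1 if 2 * sum(votes) >= len(votes) else 0   # ties go to 1
        if prediction != y:
            mistakes += 1
        # Keep only the experts that were right on this round.
        alive = [expert for expert, vote in zip(alive, votes) if vote == y]
    return mistakes


# 16 hypothetical threshold experts; the one with k = 11 is perfect by construction.
experts = [lambda x, k=k: 1 if x >= k else 0 for k in range(16)]
truth = experts[11]
sequence = [(x, truth(x)) for x in [8, 12, 10, 11, 3, 15, 9, 13]]

mistakes = halving(experts, sequence)
print(mistakes, "<=", math.log2(len(experts)))
```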

53 History of online learning. Littlestone & Warmuth. Vovk. Vovk and Shafer's recent book: "Probability and Finance: It's Only a Game!". Innumerable contributions from many fields: Hannan, Blackwell, Davisson, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer …

54 Lossless compression. X: arbitrary input space. Y: {0,1}. Z: [0,1]. Log loss: $L(Z, Y) = -\log Z$ if $Y = 1$ and $-\log(1 - Z)$ if $Y = 0$. Related to entropy, lossless compression, MDL, statistical likelihood, and standard probability theory.

55 Bayesian averaging Folk theorem in Information Theory
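One standard way to state the folk theorem: if the predictor is the Bayes mixture of N experts under a uniform prior, its cumulative log loss exceeds that of the best expert by at most ln N, on every sequence. The sketch below checks this numerically; the three constant-probability experts and the outcome sequence are invented for the example.

```python
import math


def log_loss(z, y):
    """Log loss of predicting probability z for the event y = 1."""
    return -math.log(z if y == 1 else 1.0 - z)


def bayes_mixture(expert_probs, outcomes):
    """Run Bayesian averaging with a uniform prior over the experts.

    expert_probs[i][t] is expert i's probability that outcome t equals 1.
    Returns (cumulative log loss of the mixture, list of expert losses).
    """
    n = len(expert_probs)
    weights = [1.0 / n] * n
    mixture_loss = 0.0
    expert_loss = [0.0] * n
    for t, y in enumerate(outcomes):
        z = sum(w * probs[t] for w, probs in zip(weights, expert_probs))
        mixture_loss += log_loss(z, y)
        for i, probs in enumerate(expert_probs):
            expert_loss[i] += log_loss(probs[t], y)
        # Posterior update: multiply by each expert's likelihood of the outcome.
        likelihoods = [probs[t] if y == 1 else 1.0 - probs[t] for probs in expert_probs]
        weights = [w * l for w, l in zip(weights, likelihoods)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return mixture_loss, expert_loss


outcomes = [1, 1, 0, 1, 1, 1, 0, 1]
expert_probs = [[p] * len(outcomes) for p in (0.2, 0.5, 0.8)]
mixture_loss, expert_loss = bayes_mixture(expert_probs, outcomes)
print(mixture_loss, min(expert_loss) + math.log(len(expert_probs)))
```

The gap is at most ln N because the mixture assigns the whole sequence a probability equal to the average of the experts' probabilities, which is at least 1/N times the best one.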

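56 Game theoretical loss. X: arbitrary space. Y: a loss for each of N actions. Z: a distribution over N actions. Loss: the expected loss of an action drawn from Z, $L(Z, Y) = \sum_{i=1}^{N} Z_i Y_i$.

57 Learning in games (Freund and Schapire 94). An algorithm which knows T in advance guarantees: $L_A \le \min_i L_i + \sqrt{2 T \ln N} + \ln N$.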
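A sketch of the kind of algorithm this slide refers to: exponential weights over N actions (Hedge), with a learning rate that depends on the horizon T. The tuning used here, eta = sqrt(8 ln N / T), is the common textbook choice that gives regret at most sqrt(T ln N / 2) for losses in [0, 1]; it is an assumption and may differ from the constants on the original slide. The random loss table is only test data.

```python
import math
import random


def hedge(loss_rows, eta):
    """Exponential-weights algorithm over N actions.

    loss_rows[t][i] is the loss in [0, 1] of action i at round t.
    Returns (expected cumulative loss of the algorithm, cumulative loss per action).
    """
    n = len(loss_rows[0])
    totals = [0.0] * n
    algorithm_loss = 0.0
    for losses in loss_rows:
        best = min(totals)                                # shift for numerical stability
        weights = [math.exp(-eta * (c - best)) for c in totals]
        norm = sum(weights)
        # Expected loss when an action is drawn from the normalized weights.
        algorithm_loss += sum(w / norm * l for w, l in zip(weights, losses))
        totals = [c + l for c, l in zip(totals, losses)]
    return algorithm_loss, totals


T, N = 200, 5
rng = random.Random(0)
loss_rows = [[rng.random() for _ in range(N)] for _ in range(T)]

eta = math.sqrt(8.0 * math.log(N) / T)    # requires knowing T in advance
algorithm_loss, totals = hedge(loss_rows, eta)
regret = algorithm_loss - min(totals)
print(regret, math.sqrt(T * math.log(N) / 2.0))
```

The bound holds for any loss sequence, not just random ones, which is the point of dropping the IID assumption.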

58 Multi-arm bandits (Auer, Cesa-Bianchi, Freund, Schapire 95). The algorithm cannot observe the full outcome. Instead, a single action is chosen at random according to Z, and only the loss of that action is observed. We describe an algorithm that guarantees (with high probability):

59 Why isn't online learning practical? Prescriptions too similar to the Bayesian approach. Implementing low-level learning requires a large number of experts. Computation increases linearly with the number of experts. Potentially very powerful for combining a few high-level experts.

60 Online learning for detector deployment [Diagram: a face detector library holds entries such as "Merl frontal 1.0" (B/W frontal face detector; indoor, neutral background; direct front-right lighting; code). A detector is downloaded into an adaptive real-time face detector, where online learning (OL) uses images and feedback on the face detections.] Detector can be adaptive!

61 Summary. By combining predictors we can: improve accuracy; estimate prediction confidence; adapt on-line. To make machine learning practical: speed up the predictors; concentrate human feedback on hard cases; fuse data from several sources; share predictor libraries.