
1 Machine Learning Chapter 1. Introduction
Tom M. Mitchell

2 Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997 (used for lectures)
Reinforcement Learning: An Introduction, R. S. Sutton and A. G. Barto, The MIT Press, 1998 (used for student presentations)

3 Machine Learning How to construct computer programs that automatically improve with experience
Applications: data mining (medical applications, 1989), detecting fraudulent credit card transactions (1989), information filtering and learning users’ reading preferences, autonomous vehicles, backgammon at the level of world champions (1992), speech recognition (1989), optimizing energy costs
Machine learning theory: How does learning performance vary with the number of training examples presented? What learning algorithms are most appropriate for various types of learning tasks?

4 Example programs http://www.cs.cmu.edu/~tom/mlbook.html Face recognition
Decision tree learning code Data for financial loan analysis Bayes classifier code Data for analyzing text documents

5 Theoretical studies Fundamental relationships among the number of training examples observed, the number of hypotheses under consideration, and the expected error in learned hypotheses Biological systems

6 Def. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

7 Outline Why Machine Learning? What is a well-defined learning problem?
An example: learning to play checkers What questions should we ask about Machine Learning?

8 Why Machine Learning Recent progress in algorithms and theory
Growing flood of online data Computational power is available Budding industry

9 Three niches for machine learning:
Data mining: using historical data to improve decisions (e.g., medical records → medical knowledge)
Software applications we can't program by hand: autonomous driving, speech recognition
Self-customizing programs: newsreader that learns user interests

10 Typical Datamining Task (1/2)

11 Typical Datamining Task (2/2)
Given: 9714 patient records, each describing a pregnancy and birth Each patient record contains 215 features Learn to predict: Classes of future patients at high risk for Emergency Cesarean Section

12 Datamining Result One of 18 learned rules:
If   No previous vaginal delivery, and
     Abnormal 2nd Trimester Ultrasound, and
     Malpresentation at admission
Then Probability of Emergency C-Section is 0.6
Over training data: 26/41 = .63; over test data: 12/20 = .60

13 Credit Risk Analysis (1/2)
Data :

14 Credit Risk Analysis (2/2)
Rules learned from synthesized data:
If   Other-Delinquent-Accounts > 2, and
     Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = No   [Deny Credit Card application]
If   Other-Delinquent-Accounts = 0, and
     (Income > $30k) OR (Years-of-Credit > 3)
Then Profitable-Customer? = Yes  [Accept Credit Card application]

15 Other Prediction Problems (1/2)

16 Other Prediction Problems (2/2)

17 Problems Too Difficult to Program by Hand
ALVINN [Pomerleau] drives 70 mph on highways

18 Software that Customizes to User

19 Where Is this Headed? (1/2)
Today: tip of the iceberg First-generation algorithms: neural nets, decision trees, regression ... Applied to well-formatted database Budding industry

20 Where Is this Headed? (2/2)
Opportunity for tomorrow: enormous impact Learn across full mixed-media data Learn across multiple internal databases, plus the web and newsfeeds Learn by active experimentation Learn decisions rather than predictions Cumulative, lifelong learning Programming languages with learning embedded?

21 Relevant Disciplines Artificial intelligence Bayesian methods
Computational complexity theory Control theory Information theory Philosophy Psychology and neurobiology Statistics . . .

22 What is the Learning Problem?
Learning = Improving with experience at some task Improve over task T, with respect to performance measure P, based on experience E. E.g., Learn to play checkers T: Play checkers P: % of games won in world tournament E: opportunity to play against self

23 Learning to Play Checkers
T: Play checkers P: Percent of games won in world tournament What experience? What exactly should be learned? How shall it be represented? What specific algorithm to learn it?

24 Type of Training Experience
Direct or indirect? Teacher or not? A problem: is training experience representative of performance goal?

25 Choose the Target Function
ChooseMove : Board → Move ??
V : Board → ℝ ??
. . .

26 Possible Definition for Target Function V
if b is a final board state that is won, then V(b) = 100
if b is a final board state that is lost, then V(b) = −100
if b is a final board state that is drawn, then V(b) = 0
if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
This gives correct values, but is not operational

27 Choose Representation for Target Function
collection of rules? neural network ? polynomial function of board features? . . .

28 A Representation for Learned Function
V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b)
bp(b): number of black pieces on board b
rp(b): number of red pieces on b
bk(b): number of black kings on b
rk(b): number of red kings on b
bt(b): number of red pieces threatened by black (i.e., which can be taken on black's next turn)
rt(b): number of black pieces threatened by red

29 Obtaining Training Examples
V(b): the true target function
V̂(b): the learned function
Vtrain(b): the training value
One rule for estimating training values:
Vtrain(b) ← V̂(Successor(b))

30 Choose Weight Tuning Rule
LMS weight update rule: Do repeatedly:
Select a training example b at random
1. Compute error(b): error(b) = Vtrain(b) − V̂(b)
2. For each board feature fi, update weight wi: wi ← wi + c · fi · error(b)
c is some small constant, say 0.1, to moderate the rate of learning
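A minimal sketch of this LMS update in Python, assuming the six board features from the previous slide are given as a plain list (the board values here are made up for illustration):

```python
def v_hat(weights, features):
    """Learned evaluation function: w0 + w1*f1 + ... + w6*f6."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def lms_update(weights, features, v_train, c=0.1):
    """One LMS step: w_i <- w_i + c * f_i * error(b)."""
    error = v_train - v_hat(weights, features)
    weights[0] += c * error              # bias term uses f0 = 1
    for i, f in enumerate(features):
        weights[i + 1] += c * f * error
    return weights

# toy usage with a hypothetical board: [bp, rp, bk, rk, bt, rt]
weights = [0.0] * 7
board_features = [12, 11, 0, 0, 1, 0]
weights = lms_update(weights, board_features, v_train=5.0)
```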

31 Final design
The performance system: plays games
The critic: detects discrepancies between training values and the learned function (analysis)
The generalizer: generates new hypotheses
The experiment generator: generates new problems

32 Learning methods Backgammon: reinforcement learning, extending the six board features
Neural network: learns from the raw board itself; after a million games of self-play, plays at a level comparable to humans
Nearest Neighbor algorithm: stores many training examples and classifies a new case by finding the closest stored one
Genetic algorithm: generates many candidate programs and evolves them through survival of the fittest
Explanation-based learning: learns by analyzing the reasons for winning and losing

33 Design Choices

34 Some Issues in Machine Learning
What algorithms can approximate functions well (and when)? How does number of training examples influence accuracy? How does complexity of hypothesis representation impact it? How does noisy data influence accuracy? What are the theoretical limits of learnability? How can prior knowledge of learner help? What clues can we get from biological learning systems? How can systems alter their own representations?

35 Machine Learning Chapter 2
Machine Learning Chapter 2. Concept Learning and The General-to-specific Ordering Tom M. Mitchell

36 Outline Learning from examples
General-to-specific ordering over hypotheses Version spaces and candidate elimination algorithm Picking new examples The need for inductive bias Note: simple approach assuming no noise, illustrates key concepts

37 Training Examples for EnjoySport
What is the general concept?
Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes

38 Representing Hypotheses
Many possible representations
Here, h is a conjunction of constraints on attributes
Each constraint can be:
a specific value (e.g., Water = Warm)
don't care (e.g., Water = ?)
no value allowed (e.g., Water = ∅)
For example,
Sky     AirTemp  Humid  Wind    Water  Forecst
<Sunny  ?        ?      Strong  ?      Same>

39 Prototypical Concept Learning Task(1/2)
Given: Instances X: Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast Target function c: EnjoySport : X → {0, 1} Hypotheses H: Conjunctions of literals. E.g. <?, Cold, High, ?, ?, ?>. Training examples D: Positive and negative examples of the target function < x1, c(x1)>, … <xm, c(xm)> Determine: A hypothesis h in H such that h(x) =c(x) for all x in D.

40 Prototypical Concept Learning Task(2/2)
The inductive learning hypothesis: Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.

41 Instance, Hypotheses, and More- General-Than

42 Find-S Algorithm 1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint ai in h
     If the constraint ai in h is satisfied by x
     Then do nothing
     Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
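A small sketch of Find-S in Python for conjunctive hypotheses over nominal attributes; the attribute encoding ('EMPTY', a value, '?') is an assumption made for illustration:

```python
def find_s(examples, n_attrs):
    """Find-S: start with the most specific hypothesis and generalize on positives.

    examples: list of (attribute_tuple, label) with label True for positive.
    Constraints: 'EMPTY' (no value allowed), a specific value, or '?' (don't care).
    """
    h = ['EMPTY'] * n_attrs          # most specific hypothesis
    for x, label in examples:
        if not label:
            continue                 # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == 'EMPTY':
                h[i] = value         # first positive example: copy its values
            elif h[i] != value:
                h[i] = '?'           # generalize to "don't care"
    return h

# EnjoySport examples from the earlier slide
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
print(find_s(data, 6))   # -> ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```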

43 Hypothesis Space Search by Find-S

44 Complaints about Find-S
Can’t tell whether it has learned concept Can’t tell when training data inconsistent Picks a maximally specific h (why?) Depending on H, there might be several!

45 Version Spaces A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x) for each training example <x, c(x)> in D. Consistent(h, D) ≡ (∀<x, c(x)>∈D) h(x) = c(x) The version space, V SH,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with all training examples in D. V SH,D ≡ {h ∈ H | Consistent(h, D)}

46 The List-Then-Eliminate Algorithm:
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example <x, c(x)>, remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace

47 Example Version Space

48 Representing Version Spaces
The General boundary, G, of version space V SH,D is the set of its maximally general members The Specific boundary, S, of version space V SH,D is the set of its maximally specific members Every member of the version space lies between these boundaries V SH,D = {h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥ h ≥ s)} where x ≥ y means x is more general or equal to y

49 Candidate Elimination Algorithm (1/2)
G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d, do
  If d is a positive example
    Remove from G any hypothesis inconsistent with d
    For each hypothesis s in S that is not consistent with d
      Remove s from S
      Add to S all minimal generalizations h of s such that
        1. h is consistent with d, and
        2. some member of G is more general than h
      Remove from S any hypothesis that is more general than another hypothesis in S

50 Candidate Elimination Algorithm (2/2)
If d is a negative example
  Remove from S any hypothesis inconsistent with d
  For each hypothesis g in G that is not consistent with d
    Remove g from G
    Add to G all minimal specializations h of g such that
      1. h is consistent with d, and
      2. some member of S is more specific than h
    Remove from G any hypothesis that is less general than another hypothesis in G
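A compact sketch of Candidate-Elimination for the conjunctive-hypothesis language used above; the 'EMPTY'/'?' encoding and the helper names are assumptions for illustration, not taken from the text:

```python
def matches(h, x):
    """True if instance x satisfies every constraint in hypothesis h."""
    return all(c != 'EMPTY' and (c == '?' or c == v) for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """h1 >=g h2: every instance satisfying h2 also satisfies h1."""
    if 'EMPTY' in h2:
        return True                                    # h2 matches nothing
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def min_generalization(s, x):
    """Minimally generalize s to cover instance x."""
    return tuple(v if c == 'EMPTY' else (c if c == v else '?') for c, v in zip(s, x))

def min_specializations(g, x, domains):
    """Minimally specialize g so it no longer covers instance x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, c in enumerate(g) if c == '?'
            for v in domains[i] if v != x[i]]

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {tuple(['EMPTY'] * n)}, {tuple(['?'] * n)}
    for x, positive in examples:
        if positive:
            G = {g for g in G if matches(g, x)}
            for s in [s for s in S if not matches(s, x)]:
                S.remove(s)
                h = min_generalization(s, x)
                if any(more_general_or_equal(g, h) for g in G):
                    S.add(h)
            S = {s for s in S if not any(s != s2 and more_general_or_equal(s, s2) for s2 in S)}
        else:
            S = {s for s in S if not matches(s, x)}
            for g in [g for g in G if matches(g, x)]:
                G.remove(g)
                for h in min_specializations(g, x, domains):
                    if any(more_general_or_equal(h, s) for s in S):
                        G.add(h)
            G = {g for g in G if not any(g != g2 and more_general_or_equal(g2, g) for g2 in G)}
    return S, G

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
S, G = candidate_elimination(data, domains)   # `data` as in the Find-S sketch
# S = {(Sunny, Warm, ?, Strong, ?, ?)}, G = {(Sunny, ?, ?, ?, ?, ?), (?, Warm, ?, ?, ?, ?)}
```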

51 Example Trace

52 What Next Training Example?

53 How Should These Be Classified?
<Sunny Warm Normal Strong Cool Change> <Rainy Cool Normal Light Warm Same> <Sunny Warm Normal Light Warm Same>

54 What Justifies this Inductive Leap?
+ <Sunny Warm Normal Strong Cool Change>
+ <Sunny Warm Normal Light Warm Same>
S : <Sunny Warm Normal ? ? ?>
Why believe we can classify the unseen <Sunny Warm Normal Strong Warm Same>?

55 An UNBiased Learner Idea: Choose H that expresses every teachable
concept (i.e., H is the power set of X)
Consider H' = disjunctions, conjunctions, negations over previous H. E.g., <Sunny Warm Normal ? ? ?> ∨ <? ? ? ? ? Change>
What are S, G in this case?
S ←
G ←

56 Inductive Bias Consider Definition: concept learning algorithm L
instances X, target concept c training examples Dc = {<x, c(x)>} let L(xi, Dc) denote the classification assigned to the instance xi by L after training on data Dc. Definition: The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc (∀xi ∈ X)[(B ∧ Dc ∧ xi) ├ L(xi, Dc)] where A├ B means A logically entails B

57 Inductive Systems and Equivalent Deductive Systems

58 Three Learners with Different Biases
1. Rote learner: Store examples, Classify x iff it matches previously observed example. 2. Version space candidate elimination algorithm 3. Find-S

59 Summary Points 1. Concept learning as search through H
2. General-to-specific ordering over H 3. Version space candidate elimination algorithm 4. S and G boundaries characterize learner’s uncertainty 5. Learner can generate useful queries 6. Inductive leaps possible only if learner is biased 7. Inductive learners can be modelled by equivalent deductive systems

60 Machine Learning Chapter 3. Decision Tree Learning
Tom M. Mitchell

61 Abstract Decision tree representation ID3 learning algorithm
Entropy, Information gain Overfitting

62 Decision Tree for PlayTennis

63 A Tree to Predict C-Section Risk
Learned from medical records of 1000 women Negative examples are C-sections

64 Decision Trees Decision tree representation: How would we represent:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
How would we represent:
∧, ∨, XOR
(A ∧ B) ∨ (C ∧ ¬D ∧ E)
M of N

65 When to Consider Decision Trees
Instances describable by attribute-value pairs Target function is discrete valued Disjunctive hypothesis may be required Possibly noisy training data Examples: Equipment or medical diagnosis Credit risk analysis Modeling calendar scheduling preferences

66 Top-Down Induction of Decision Trees
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as decision attribute for node
3. For each value of A, create new descendant of node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes
Which attribute is best?

67 Entropy(1/2) S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

68 Entropy(2/2) Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)
Why? Information theory: an optimal-length code assigns −log2 p bits to a message having probability p.
So, the expected number of bits to encode ⊕ or ⊖ of a random member of S is:
p⊕(−log2 p⊕) + p⊖(−log2 p⊖)
Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

69 Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A
Gain(S, A) ≡ Entropy(S) − Σv∈Values(A) (|Sv| / |S|) · Entropy(Sv)
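A small sketch of entropy and information gain in Python, assuming examples are tuples of nominal attribute values with a separate label list:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute_index):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(x[attribute_index] for x in examples):
        subset = [lab for x, lab in zip(examples, labels) if x[attribute_index] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# 9 positive / 5 negative PlayTennis labels give Entropy(S) ~= 0.940
print(entropy(['+'] * 9 + ['-'] * 5))
```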

70 Training Examples

71 Selecting the Next Attribute(1/2)
Which attribute is the best classifier?

72 Selecting the Next Attribute(2/2)
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity)    = .970 − (3/5) 0.0 − (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain(Ssunny, Wind)        = .970 − (2/5) 1.0 − (3/5) .918 = .019

73 Hypothesis Space Search by ID3(1/2)

74 Hypothesis Space Search by ID3(2/2)
Hypothesis space is complete! → Target function surely in there...
Outputs a single hypothesis (which one?) → Can't play 20 questions...
No backtracking → Local minima...
Statistically-based search choices → Robust to noisy data...
Inductive bias: approx. "prefer shortest tree"

75 Inductive Bias in ID3 Note H is the power set of instances X
→ Unbiased? Not really... Preference for short trees, and for those with high information gain attributes near the root Bias is a preference for some hypotheses, rather than a restriction of hypothesis space H Occam's razor: prefer the shortest hypothesis that fits the data

76 Occam’s Razor Why prefer short hypotheses? Argument in favor :
Fewer short hyps. than long hyps. → a short hyp that fits data unlikely to be coincidence → a long hyp that fits data might be coincidence Argument opposed : There are many ways to define small sets of hyps e.g., all trees with a prime number of nodes that use attributes beginning with “Z” What's so special about small sets based on size of hypothesis??

77 Overfitting in Decision Trees
Consider adding noisy training example #15: Sunny, Hot, Normal, Strong, PlayTennis = No What effect on earlier tree?

78 Overfitting Consider error of hypothesis h over
training data: errortrain(h) entire distribution D of data: errorD(h) Hypothesis h ∈ H overfits training data if there is an alternative hypothesis h'∈ H such that errortrain(h) < errortrain(h') and errorD(h) > errorD(h')

79 Overfitting in Decision Tree Learning

80 Avoiding Overfitting How can we avoid overfitting?
stop growing when data split not statistically significant grow full tree, then post-prune How to select “best” tree : Measure performance over training data Measure performance over separate validation data set MDL: minimize size(tree) + size(misclassifications(tree))

81 Reduced-Error Pruning
Split data into training and validation set Do until further pruning is harmful: 1. Evaluate impact on validation set of pruning each possible node (plus those below it) 2. Greedily remove the one that most improves validation set accuracy produces smallest version of most accurate subtree What if data is limited?

82 Effect of Reduced-Error Pruning

83 Rule Post-Pruning 1. Convert tree to equivalent set of rules
2. Prune each rule independently of others 3. Sort final rules into desired sequence for use Perhaps most frequently used method (e.g., C4.5 )

84 Converting A Tree to Rules
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes ….

85 Continuous Valued Attributes
Create a discrete attribute to test a continuous one, e.g., Temperature = 82.5 → (Temperature > 72.3) ∈ {t, f}

86 Attributes with Many Values
Problem: If an attribute has many values, Gain will select it
Imagine using Date = Jun_3_1996 as an attribute
One approach: use GainRatio instead
GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) ≡ −Σi=1..c (|Si| / |S|) log2 (|Si| / |S|)
where Si is the subset of S for which A has value vi

87 Attributes with Costs Consider
medical diagnosis, BloodTest has cost $150
robotics, Width_from_1ft has cost 23 sec.
How to learn a consistent tree with low expected cost?
One approach: replace gain by
Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)
Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w
where w ∈ [0, 1] determines importance of cost

88 Unknown Attribute Values
What if some examples are missing values of A?
Use the training example anyway; sort it through the tree:
If node n tests A, assign the most common value of A among other examples sorted to node n
assign the most common value of A among other examples with the same target value
assign probability pi to each possible value vi of A, and assign fraction pi of the example to each descendant in the tree
Classify new examples in the same fashion

89 Machine Learning Chapter 4. Artificial Neural Networks
Tom M. Mitchell

90 Artificial Neural Networks
Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics

91 Connectionist Models (1/2)
Consider humans:
Neuron switching time ~ .001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4–5
Scene recognition time ~ .1 second
100 inference steps doesn't seem like enough → much parallel computation

92 Connectionist Models (2/2)
Properties of artificial neural nets (ANN’s): Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed process Emphasis on tuning weights automatically

93 When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued Output is a vector of values Possibly noisy data Form of target function is unknown Human readability of result is unimportant Examples: Speech phoneme recognition [Waibel] Image classification [Kanade, Baluja, Rowley] Financial prediction

94 ALVINN drives 70 mph on highways

95 Perceptron o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0, and −1 otherwise
Sometimes we'll use the simpler vector notation: o(x) = sgn(w · x)

96 Decision Surface of a Perceptron
Represents some useful functions What weights represent g(x1, x2) = AND(x1, x2)? But some functions not representable e.g., not linearly separable Therefore, we’ll want networks of these...

97 Perceptron training rule
wi ← wi + Δwi
where Δwi = η (t − o) xi
Where:
t = c(x) is the target value
o is the perceptron output
η is a small constant (e.g., .1) called the learning rate
Can prove it will converge if training data is linearly separable and η is sufficiently small
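A minimal sketch of the perceptron training rule in Python; the AND data set and the loop bounds are illustrative assumptions:

```python
def sgn(z):
    return 1 if z > 0 else -1

def perceptron_train(examples, n_features, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    w = [0.0] * (n_features + 1)            # w[0] is the bias weight w0
    for _ in range(epochs):
        for x, t in examples:               # t is +1 or -1
            o = sgn(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            w[0] += eta * (t - o)           # bias input x0 = 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

# AND(x1, x2) with inputs/outputs in {-1, +1}: linearly separable, so the rule converges
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = perceptron_train(data, 2)
```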

98 Gradient Descent (1/4) To understand, consider the simpler linear unit, where o = w0 + w1x1 + ··· + wnxn
Let's learn wi's that minimize the squared error
E[w] ≡ (1/2) Σd∈D (td − od)^2
where D is the set of training examples

99 Gradient Descent (2/4) Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Training rule: Δw = −η ∇E[w]
i.e., Δwi = −η ∂E/∂wi

100 Gradient Descent (3/4)

101 Gradient Descent (4/4) Initialize each wi to some small random value
Until the termination condition is met, Do
  Initialize each Δwi to zero.
  For each <x, t> in training_examples, Do
    * Input the instance x to the unit and compute the output o
    * For each linear unit weight wi, Do
        Δwi ← Δwi + η (t − o) xi
  For each linear unit weight wi, Do
      wi ← wi + Δwi

102 Summary Perceptron training rule guaranteed to succeed if
Training examples are linearly separable Sufficiently small learning rate  Linear unit training rule uses gradient descent Guaranteed to converge to hypothesis with minimum squared error Given sufficiently small learning rate  Even when training data contains noise Even when training data not separable by H

103 Incremental (Stochastic) Gradient Descent (1/2)
Batch mode Gradient Descent: Do until satisfied
1. Compute the gradient ∇ED[w]
2. w ← w − η ∇ED[w]
Incremental mode Gradient Descent: For each training example d in D
1. Compute the gradient ∇Ed[w]
2. w ← w − η ∇Ed[w]

104 Incremental (Stochastic) Gradient Descent (2/2)
Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough

105 Multilayer Networks of Sigmoid Units

106 Sigmoid Unit σ(x) = 1 / (1 + e^−x) is the sigmoid function
Nice property: dσ(x)/dx = σ(x) (1 − σ(x))
We can derive gradient descent rules to train
One sigmoid unit
Multilayer networks of sigmoid units → Backpropagation

107 Error Gradient for a Sigmoid Unit
But we know: So:

108 Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k: δk ← ok(1 − ok)(tk − ok)
    3. For each hidden unit h: δh ← oh(1 − oh) Σk∈outputs wh,k δk
    4. Update each network weight wi,j: wi,j ← wi,j + Δwi,j
       where Δwi,j = η δj xi,j
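A small sketch of this algorithm in Python for a single hidden layer of sigmoid units; the network sizes, learning rate, and the XOR data are illustrative assumptions (convergence on XOR is not guaranteed from every random start):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=5000):
    """Stochastic-gradient backpropagation; examples are (input_vector, target_vector)."""
    rnd = random.Random(0)
    # w_ih[i][j]: input i -> hidden j, w_ho[j][k]: hidden j -> output k (index 0 = bias)
    w_ih = [[rnd.uniform(-0.05, 0.05) for _ in range(n_hidden)] for _ in range(n_in + 1)]
    w_ho = [[rnd.uniform(-0.05, 0.05) for _ in range(n_out)] for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)                                          # bias input
            h = [sigmoid(sum(xs[i] * w_ih[i][j] for i in range(n_in + 1)))
                 for j in range(n_hidden)]
            hs = [1.0] + h
            o = [sigmoid(sum(hs[j] * w_ho[j][k] for j in range(n_hidden + 1)))
                 for k in range(n_out)]
            # output-unit errors: delta_k = o_k (1 - o_k)(t_k - o_k)
            d_out = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(n_out)]
            # hidden-unit errors: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
            d_hid = [h[j] * (1 - h[j]) * sum(w_ho[j + 1][k] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            for j in range(n_hidden + 1):
                for k in range(n_out):
                    w_ho[j][k] += eta * d_out[k] * hs[j]
            for i in range(n_in + 1):
                for j in range(n_hidden):
                    w_ih[i][j] += eta * d_hid[j] * xs[i]
    return w_ih, w_ho

# XOR requires a hidden layer
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
train_backprop(xor, n_in=2, n_hidden=2, n_out=1)
```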

109 More on Backpropagation
Gradient descent over the entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum
In practice, often works well (can run multiple times)
Often include weight momentum α:
Δwi,j(n) = η δj xi,j + α Δwi,j(n − 1)
Minimizes error over training examples
Will it generalize well to subsequent examples?
Training can take thousands of iterations → slow!
Using the network after training is very fast

110 Learning Hidden Layer Representations (1/2)
A target function: Can this be learned??

111 Learning Hidden Layer Representations (2/2)
A network: Learned hidden layer representation:

112 Training (1/3)

113 Training (2/3)

114 Training (3/3)

115 Convergence of Backpropagation
Gradient descent to some local minimum Perhaps not global minimum... Add momentum Stochastic gradient descent Train multiple nets with different initial weights Nature of convergence Initialize weights near zero Therefore, initial networks near-linear Increasingly non-linear functions possible as training progresses

116 Expressive Capabilities of ANNs
Boolean functions: Every boolean function can be represented by network with single hidden layer but might require exponential (in number of inputs) hidden units Continuous functions: Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989] Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

117 Overfitting in ANNs (1/2)

118 Overfitting in ANNs (2/2)

119 Neural Nets for Face Recognition
90% accurate learning head pose, and recognizing 1-of-20 faces

120 Learned Hidden Unit Weights

121 Alternative Error Functions
Penalize large weights: Train on target slopes as well as values: Tie together weights: e.g., in phoneme recognition network

122 Recurrent Networks
(a) Feedforward network  (b) Recurrent network  (c) Recurrent network unfolded in time

123 Machine Learning Chapter 5. Evaluating Hypotheses
Tom M. Mitchell

124 Evaluating Hypotheses
Sample error, true error Confidence intervals for observed hypothesis error Estimators Binomial distribution, Normal distribution, Central Limit Theorem Paired t tests Comparing learning methods

125 Two Definitions of Error
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
errorD(h) ≡ Prx∈D [f(x) ≠ h(x)]
The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:
errorS(h) ≡ (1/n) Σx∈S δ(f(x) ≠ h(x))
where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
How well does errorS(h) estimate errorD(h)?

126 Problems Estimating Error
1. Bias: If S is the training set, errorS(h) is optimistically biased
bias ≡ E[errorS(h)] − errorD(h)
For an unbiased estimate, h and S must be chosen independently
2. Variance: Even with unbiased S, errorS(h) may still vary from errorD(h)

127 Example Hypothesis h misclassifies 12 of the 40 examples in S
errorS(h) = 12 / 40 = .30 What is errorD(h) ?

128 Estimators Experiment:
1. choose sample S of size n according to distribution D 2. measure errorS(h) errorS(h) is a random variable (i.e., result of an experiment) errorS(h) is an unbiased estimator for errorD(h) Given observed errorS(h) what can we conclude about errorD(h) ?

129 Confidence Intervals If
S contains n examples, drawn independently of h and each other
n ≥ 30
Then, with approximately N% probability, errorD(h) lies in the interval
errorS(h) ± zN sqrt( errorS(h)(1 − errorS(h)) / n )
where
N%:  50%   68%   80%   90%   95%   98%   99%
zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
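A quick sketch of this interval in Python, using the errorS(h) = 12/40 example from the later slide:

```python
import math

def error_confidence_interval(error_s, n, z_n=1.96):
    """Approximate N% interval for true error given sample error on n >= 30 examples:
    error_S(h) +/- z_N * sqrt(error_S(h) * (1 - error_S(h)) / n)."""
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# hypothesis misclassifies 12 of 40 examples: errorS(h) = .30
print(error_confidence_interval(0.30, 40, z_n=1.96))   # ~ (0.158, 0.442)
```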

130 errorS(h) is a Random Variable
Rerun the experiment with different randomly drawn S (of size n) Probability of observing r misclassified examples:

131 Binomial Probability Distribution
Probability P(r) of r heads in n coin flips, if p = Pr(heads):
P(r) = (n! / (r!(n − r)!)) p^r (1 − p)^(n−r)
Expected, or mean value of X: E[X] = np
Variance of X: Var(X) = np(1 − p)
Standard deviation of X: σX = sqrt( np(1 − p) )

132 Normal Distribution Approximates Binomial
errorS(h) follows a Binomial distribution, with
mean μerrorS(h) = errorD(h)
standard deviation σerrorS(h) = sqrt( errorD(h)(1 − errorD(h)) / n )
Approximate this by a Normal distribution with the same mean and standard deviation

133 Normal Probability Distribution (1/2)
The probability that X will fall into the interval (a, b) is given by ∫a..b p(x) dx
Expected, or mean value of X: E[X] = μ
Variance of X: Var(X) = σ^2
Standard deviation of X: σX = σ

134 Normal Probability Distribution (2/2)
80% of the area (probability) lies in μ ± 1.28σ
N% of the area (probability) lies in μ ± zN σ
N%:  50%   68%   80%   90%   95%   98%   99%
zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

135 Confidence Intervals, More Correctly
If S contains n examples, drawn independently of h and each other, and n ≥ 30
Then, with approximately 95% probability, errorS(h) lies in the interval
errorD(h) ± 1.96 sqrt( errorD(h)(1 − errorD(h)) / n )
equivalently, errorD(h) lies in the interval
errorS(h) ± 1.96 sqrt( errorD(h)(1 − errorD(h)) / n )
which is approximately
errorS(h) ± 1.96 sqrt( errorS(h)(1 − errorS(h)) / n )

136 Central Limit Theorem Consider a set of independent, identically distributed random variables Y1, …, Yn, all governed by an arbitrary probability distribution with mean μ and finite variance σ^2. Define the sample mean Ȳ = (1/n) Σi=1..n Yi.
Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean μ and variance σ^2 / n.

137 Calculating Confidence Intervals
1. Pick the parameter p to estimate: errorD(h)
2. Choose an estimator: errorS(h)
3. Determine the probability distribution that governs the estimator: errorS(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30
4. Find the interval (L, U) such that N% of the probability mass falls in the interval: use the table of zN values

138 Difference Between Hypotheses
Test h1 on sample S1, test h2 on S2
1. Pick the parameter to estimate: d ≡ errorD(h1) − errorD(h2)
2. Choose an estimator: d̂ ≡ errorS1(h1) − errorS2(h2)
3. Determine the probability distribution that governs the estimator
4. Find the interval (L, U) such that N% of the probability mass falls in the interval

139 Paired t test to compare hA, hB
1. Partition data into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do
   δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄, where δ̄ ≡ (1/k) Σi=1..k δi
N% confidence interval estimate for d: δ̄ ± tN,k−1 sδ̄
where sδ̄ ≡ sqrt( (1 / (k(k − 1))) Σi=1..k (δi − δ̄)^2 )
Note δi is approximately Normally distributed

140 Comparing learning algorithms LA and LB (1/3)
What we'd like to estimate:
ES⊂D [errorD(LA(S)) − errorD(LB(S))]
where L(S) is the hypothesis output by learner L using training set S
i.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.
But, given limited data D0, what is a good estimator?
could partition D0 into training set S0 and test set T0, and measure
errorT0(LA(S0)) − errorT0(LB(S0))
even better, repeat this many times and average the results (next slide)

141 Comparing learning algorithms LA and LB (2/3)
1. Partition data D0 into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do
   use Ti for the test set, and the remaining data for training set Si
   Si ← { D0 − Ti }
   hA ← LA(Si)
   hB ← LB(Si)
   δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄, where δ̄ ≡ (1/k) Σi=1..k δi

142 Comparing learning algorithms LA and LB (3/3)
Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval
but this is not really correct, because the training sets in this algorithm are not independent (they overlap!)
more correct to view the algorithm as producing an estimate of
ES⊂D0 [errorD(LA(S)) − errorD(LB(S))]
instead of
ES⊂D [errorD(LA(S)) − errorD(LB(S))]
but even this approximation is better than no comparison

143 Machine Learning Chapter 6. Bayesian Learning
Tom M. Mitchell

144 Bayesian Learning Bayes Theorem MAP, ML hypotheses MAP learners
Minimum description length principle Bayes optimal classifier Naive Bayes learner Example: Learning over text data Bayesian belief networks Expectation Maximization algorithm

145 Two Roles for Bayesian Methods
Provides practical learning algorithms: Naive Bayes learning Bayesian belief network learning Combine prior knowledge (prior probabilities) with observed data Requires prior probabilities Provides useful conceptual framework Provides “gold standard” for evaluating other learning algorithms Additional insight into Occam’s razor

146 Bayes Theorem P(h|D) = P(D|h) P(h) / P(D)
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h

147 Choosing Hypotheses Generally want the most probable hypothesis given the training data
Maximum a posteriori hypothesis hMAP:
hMAP ≡ argmaxh∈H P(h|D) = argmaxh∈H P(D|h) P(h) / P(D) = argmaxh∈H P(D|h) P(h)
If we assume P(hi) = P(hj), we can further simplify and choose the Maximum likelihood (ML) hypothesis
hML = argmaxhi∈H P(D|hi)

148 Bayes Theorem Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = .008        P(¬cancer) = .992
P(+|cancer) = .98       P(−|cancer) = .02
P(+|¬cancer) = .03      P(−|¬cancer) = .97
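A quick check of the MAP comparison for this example in Python (unnormalized posteriors P(+|h)P(h) for each hypothesis):

```python
# Priors and likelihoods from the slide
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

post_cancer = p_pos_given_cancer * p_cancer      # 0.0078
post_not = p_pos_given_not * p_not               # 0.0298

print(post_cancer, post_not)                     # hMAP = not-cancer
print(post_cancer / (post_cancer + post_not))    # P(cancer|+) ~= 0.21
```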

149 Basic Formulas for Probabilities
Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Sum Rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if events A1, …, An are mutually exclusive with Σi=1..n P(Ai) = 1, then
P(B) = Σi=1..n P(B|Ai) P(Ai)

150 Brute Force MAP Hypothesis Learner
For each hypothesis h in H, calculate the posterior probability
P(h|D) = P(D|h) P(h) / P(D)
Output the hypothesis hMAP with the highest posterior probability
hMAP = argmaxh∈H P(h|D)

151 Relation to Concept Learning(1/2)
Consider our usual concept learning task instance space X, hypothesis space H, training examples D consider the FindS learning algorithm (outputs most specific hypothesis from the version space V SH,D) What would Bayes rule produce as the MAP hypothesis? Does FindS output a MAP hypothesis??

152 Relation to Concept Learning(2/2)
Assume a fixed set of instances <x1, …, xm>
Assume D is the set of classifications: D = <c(x1), …, c(xm)>
Choose P(D|h):
P(D|h) = 1 if h is consistent with D
P(D|h) = 0 otherwise
Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H
Then,
P(h|D) = 1 / |VSH,D| if h is consistent with D, and 0 otherwise

153 Evolution of Posterior Probabilities

154 Characterizing Learning Algorithms by Equivalent MAP Learners

155 Learning A Real Valued Function(1/2)
Consider any real-valued target function f
Training examples <xi, di>, where di is a noisy training value
di = f(xi) + ei
ei is a random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean = 0
Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
hML = argminh∈H Σi=1..m (di − h(xi))^2

156 Learning A Real Valued Function(2/2)
Maximize natural log of this instead...

157 Learning to Predict Probabilities
Consider predicting survival probability from patient data Training examples <xi, di>, where di is 1 or 0 Want to train neural network to output a probability given xi (not a 0 or 1) In this case can show Weight update rule for a sigmoid unit: where

158 Minimum Description Length Principle (1/2)
Occam's razor: prefer the shortest hypothesis
MDL: prefer the hypothesis h that minimizes
hMDL = argminh∈H LC1(h) + LC2(D|h)   (1)
where LC(x) is the description length of x under encoding C
Example: H = decision trees, D = training data labels
LC1(h) is # bits to describe tree h
LC2(D|h) is # bits to describe D given h
Note LC2(D|h) = 0 if examples are classified perfectly by h; need only describe exceptions
Hence hMDL trades off tree size for training errors

159 Minimum Description Length Principle (2/2)
Interesting fact from information theory: The optimal (shortest expected coding length) code for an event with probability p is −log2 p bits.
So interpret (1):
−log2 P(h) is the length of h under the optimal code
−log2 P(D|h) is the length of D given h under the optimal code
→ prefer the hypothesis that minimizes length(h) + length(misclassifications)

160 Most Probable Classification of New Instances
So far we've sought the most probable hypothesis given the data D (i.e., hMAP)
Given a new instance x, what is its most probable classification?
hMAP(x) is not the most probable classification!
Consider three possible hypotheses:
P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new instance x,
h1(x) = +, h2(x) = −, h3(x) = −
What's the most probable classification of x?

161 Bayes Optimal Classifier
Bayes optimal classification:
argmaxvj∈V Σhi∈H P(vj|hi) P(hi|D)
Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
therefore
Σhi∈H P(+|hi) P(hi|D) = .4
Σhi∈H P(−|hi) P(hi|D) = .6
and
argmaxvj∈V Σhi∈H P(vj|hi) P(hi|D) = −

162 Gibbs Classifier
The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
Surprising fact: Assume target concepts are drawn at random from H according to the priors on H. Then:
E[errorGibbs] ≤ 2 E[errorBayesOptimal]
Suppose a correct, uniform prior distribution over H; then
Pick any hypothesis from VS, with uniform probability
Its expected error is no worse than twice that of the Bayes optimal classifier

163 Naive Bayes Classifier (1/2)
Along with decision trees, neural networks, nearest nbr, one of the most practical learning methods. When to use Moderate or large training set available Attributes that describe instances are conditionally independent given classification Successful applications: Diagnosis Classifying text documents

164 Naive Bayes Classifier (2/2)
Assume target function f : X → V, where each instance x is described by attributes <a1, a2 … an>.
Most probable value of f(x) is:
vMAP = argmaxvj∈V P(vj | a1, a2 … an) = argmaxvj∈V P(a1, a2 … an | vj) P(vj)
Naive Bayes assumption:
P(a1, a2 … an | vj) = Πi P(ai | vj)
which gives the Naive Bayes classifier:
vNB = argmaxvj∈V P(vj) Πi P(ai | vj)

165 Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate P(vj)
    For each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)
Classify_New_Instance(x)
  vNB = argmaxvj∈V P̂(vj) Πai∈x P̂(ai|vj)

166 Naive Bayes: Example Consider PlayTennis again, and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong>
Want to compute vNB = argmaxvj∈V P(vj) Πi P(ai|vj):
P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021
→ vNB = n
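A minimal sketch of a naive Bayes learner for nominal attributes, assuming examples come as (attribute_tuple, label) pairs and using plain frequency estimates (no m-estimate smoothing):

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """Estimate P(v) and P(a_i|v) by relative frequencies."""
    class_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)          # (attr_index, label) -> Counter of values
    for x, label in examples:
        for i, value in enumerate(x):
            cond_counts[(i, label)][value] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    return priors, cond_counts, class_counts

def naive_bayes_classify(model, x):
    """v_NB = argmax_v P(v) * prod_i P(a_i|v)."""
    priors, cond_counts, class_counts = model
    best_v, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= cond_counts[(i, v)][value] / class_counts[v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```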

167 Naive Bayes: Subtleties (1/2)
1. The conditional independence assumption is often violated
...but it works surprisingly well anyway. Note we don't need the estimated posteriors to be correct; we need only that
argmaxvj∈V P̂(vj) Πi P̂(ai|vj) = argmaxvj∈V P(vj) P(a1, …, an|vj)
see [Domingos & Pazzani, 1996] for analysis
Naive Bayes posteriors are often unrealistically close to 1 or 0

168 Naive Bayes: Subtleties (2/2)
2. What if none of the training instances with target value vj have attribute value ai? Then
P̂(ai|vj) = 0, and so P̂(vj) Πi P̂(ai|vj) = 0
Typical solution is a Bayesian estimate for P̂(ai|vj):
P̂(ai|vj) ← (nc + mp) / (n + m)
where
n is the number of training examples for which v = vj
nc is the number of examples for which v = vj and a = ai
p is a prior estimate for P̂(ai|vj)
m is the weight given to the prior (i.e., the number of "virtual" examples)

169 Learning to Classify Text (1/4)
Why? Learn which news articles are of interest Learn to classify web pages by topic Naive Bayes is among most effective algorithms What attributes shall we use to represent text documents??

170 Learning to Classify Text (2/4)
Target concept Interesting? : Document → {⊕, ⊖}
1. Represent each document by a vector of words: one attribute per word position in the document
2. Learning: Use training examples to estimate
P(⊕), P(⊖), P(doc|⊕), P(doc|⊖)
Naive Bayes conditional independence assumption:
P(doc|vj) = Πi=1..length(doc) P(ai = wk | vj)
where P(ai = wk | vj) is the probability that the word in position i is wk, given vj
one more assumption: P(ai = wk | vj) = P(am = wk | vj), ∀i, m

171 Learning to Classify Text (3/4)
LEARN_NAIVE_BAYES_TEXT (Examples, V)
1. Collect all words and other tokens that occur in Examples
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
   For each target value vj in V do
     docsj ← subset of Examples for which the target value is vj
     P(vj) ← |docsj| / |Examples|
     Textj ← a single document created by concatenating all members of docsj

172 Learning to Classify Text (4/4)
n ← total number of words in Textj (counting duplicate words multiple times)
for each word wk in Vocabulary
  * nk ← number of times word wk occurs in Textj
  * P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT (Doc)
positions ← all word positions in Doc that contain tokens found in Vocabulary
Return vNB, where
vNB = argmaxvj∈V P(vj) Πi∈positions P(ai|vj)

173 Twenty NewsGroups Given 1000 training documents from each group Learn to classify new documents according to which newsgroup it came from Naive Bayes: 89% classification accuracy comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey alt.atheism soc.religion.christian talk.religion.misc talk.politics.mideast talk.politics.misc talk.politics.guns sci.space sci.crypt sci.electronics sci.med

174 Learning Curve for 20 Newsgroups
Accuracy vs. Training set size (1/3 withheld for test)

175 Bayesian Belief Networks
Interesting because:
Naive Bayes assumption of conditional independence is too restrictive
But it's intractable without some such assumptions...
Bayesian Belief networks describe conditional independence among subsets of variables
→ allows combining prior knowledge about (in)dependencies among variables with observed training data
(also called Bayes Nets)

176 Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
(∀xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
more compactly, we write P(X|Y, Z) = P(X|Z)
Example: Thunder is conditionally independent of Rain, given Lightning
P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
Naive Bayes uses conditional independence to justify
P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)

177 Bayesian Belief Network (1/2)
Network represents a set of conditional independence assertions: Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. Directed acyclic graph

178 Bayesian Belief Network (2/2)
Represents the joint probability distribution over all variables
e.g., P(Storm, BusTourGroup, …, ForestFire)
in general,
P(y1, …, yn) = Πi=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the immediate predecessors of Yi in the graph
so, the joint distribution is fully defined by the graph, plus the P(yi|Parents(Yi))

179 Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others? Bayes net contains all information needed for this inference If only one variable with unknown value, easy to infer it In general case, problem is NP hard In practice, can succeed in many cases Exact inference methods work well for some network structures Monte Carlo methods “simulate” the network randomly to calculate approximate solutions

180 Learning of Bayesian Networks
Several variants of this learning task
The network structure might be known or unknown
Training examples might provide values of all network variables, or just some
If the structure is known and we observe all variables:
Then it's as easy as training a Naive Bayes classifier

181 Learning Bayes Nets Suppose structure known, variables partially observable e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire... Similar to training neural network with hidden units In fact, can learn network conditional probability tables using gradient ascent! Converge to network h that (locally) maximizes P(D|h)

182 Gradient Ascent for Bayes Nets
Let wijk denote one entry in the conditional probability table for variable Yi in the network
wijk = P(Yi = yij | Parents(Yi) = the list uik of values)
e.g., if Yi = Campfire, then uik might be <Storm = T, BusTourGroup = F>
Perform gradient ascent by repeatedly
1. updating all wijk using training data D
2. then renormalizing the wijk to assure
   Σj wijk = 1 and 0 ≤ wijk ≤ 1

183 More on Learning Bayes Nets
The EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new wijk to maximize E[ln P(D|h)], where D now includes both observed and (calculated probabilities of) unobserved variables
When structure unknown...
Algorithms use greedy search to add/subtract edges and nodes
Active research topic

184 Summary: Bayesian Belief Networks
Combine prior knowledge with observed data Impact of prior knowledge (when correct!) is to lower the sample complexity Active research area Extend from boolean to real-valued variables Parameterized distributions instead of tables Extend to first-order instead of propositional systems More effective inference methods

185 Expectation Maximization (EM)
When to use: Data is only partially observable Unsupervised clustering (target value unobservable) Supervised learning (some instance attributes unobservable) Some uses: Train Bayesian Belief Networks Unsupervised clustering (AUTOCLASS) Learning Hidden Markov Models

186 Generating Data from Mixture of k Gaussians
Each instance x generated by 1. Choosing one of the k Gaussians with uniform probability 2. Generating an instance at random according to that Gaussian

187 EM for Estimating k Means (1/2)
Given:
Instances from X generated by a mixture of k Gaussian distributions
Unknown means <μ1, …, μk> of the k Gaussians
Don't know which instance xi was generated by which Gaussian
Determine:
Maximum likelihood estimates of <μ1, …, μk>
Think of the full description of each instance as yi = <xi, zi1, zi2>, where
zij is 1 if xi was generated by the jth Gaussian
xi is observable
zij is unobservable

188 EM for Estimating k Means (2/2)
EM Algorithm: Pick a random initial h = <μ1, μ2>, then iterate
E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds.
M step: Calculate a new maximum likelihood hypothesis h' = <μ'1, μ'2>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated above. Replace h = <μ1, μ2> by h' = <μ'1, μ'2>.
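A small sketch of this EM procedure for two 1-D Gaussians with equal, known variance and equal priors; the data and parameter choices are illustrative assumptions:

```python
import math, random

def em_two_means(xs, sigma=1.0, iters=50, seed=0):
    """EM for the means of a mixture of two equal-variance, equal-prior Gaussians.
    E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
    M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]."""
    rnd = random.Random(seed)
    mu = rnd.sample(xs, 2)                   # random initial hypothesis <mu1, mu2>
    for _ in range(iters):
        # E step: expected values of the hidden indicators z_ij
        resp = []
        for x in xs:
            p = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M step: maximum-likelihood means given the expected z_ij
        for j in range(2):
            num = sum(r[j] * x for r, x in zip(resp, xs))
            den = sum(r[j] for r in resp)
            mu[j] = num / den
    return mu

# two clusters of 1-D points around 0 and 5
data = [0.1, -0.2, 0.3, 0.0, 5.1, 4.8, 5.3, 4.9]
print(em_two_means(data))
```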

189 EM Algorithm Converges to local maximum likelihood h and provides estimates of hidden variables zij In fact, local maximum in E[ln P(Y|h)] Y is complete (observable plus unobservable variables) data Expected value is taken over possible values of unobserved variables in Y

190 General EM Problem Given:
Observed data X = {x1, …, xm}
Unobserved data Z = {z1, …, zm}
Parameterized probability distribution P(Y|h), where
Y = {y1, …, ym} is the full data, yi = xi ∪ zi
h are the parameters
Determine: h that (locally) maximizes E[ln P(Y|h)]
Many uses:
Train Bayesian belief networks
Unsupervised clustering (e.g., k means)
Hidden Markov Models

191 General EM Method
Define the likelihood function Q(h'|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:
Q(h'|h) ≡ E[ln P(Y|h') | h, X]
EM Algorithm:
Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.
Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function.

192 Machine Learning Chapter 7. Computational Learning Theory
Tom M. Mitchell

193 Computational Learning Theory (1/2)
Setting 1: learner poses queries to teacher Setting 2: teacher chooses examples Setting 3: randomly generated instances, labeled by teacher Probably approximately correct (PAC) learning Vapnik-Chervonenkis Dimension Mistake bounds

194 Computational Learning Theory (2/2)
What general laws constrain inductive learning? We seek theory to relate: Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples presented

195 Prototypical Concept Learning Task
Given:
Instances X: Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
Target function c: EnjoySport : X → {0, 1}
Hypotheses H: Conjunctions of literals. E.g. <?, Cold, High, ?, ?, ?>.
Training examples D: Positive and negative examples of the target function <x1, c(x1)>, … <xm, c(xm)>
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in D?
A hypothesis h in H such that h(x) = c(x) for all x in X?

196 Sample Complexity How many training examples are sufficient to learn the target concept? 1. If learner proposes instances, as queries to teacher Learner proposes instance x, teacher provides c(x) 2. If teacher (who knows c) provides training examples teacher provides sequence of examples of form <x, c(x)> 3. If some random process (e.g., nature) proposes instances instance x generated randomly, teacher provides c(x)

197 Sample Complexity: 1 Learner proposes instance x, teacher provides c(x) (assume c is in learner’s hypothesis space H) Optimal query strategy: play 20 questions pick instance x such that half of hypotheses in V S classify x positive, half classify x negative When this is possible, need log2 |H| queries to learn c when not possible, need even more

198 Sample Complexity: 2 Teacher (who knows c) provides training examples (assume c is in the learner's hypothesis space H)
Optimal teaching strategy: depends on the H used by the learner
Consider the case H = conjunctions of up to n boolean literals and their negations
e.g., (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, … each have 2 possible values.
If there are n possible boolean attributes in H, n + 1 examples suffice. Why?

199 Sample Complexity: 3 Given:
set of instances X
set of hypotheses H
set of possible target concepts C
training instances generated by a fixed, unknown probability distribution D over X
Learner observes a sequence D of training examples of the form <x, c(x)>, for some target concept c ∈ C
instances x are drawn from distribution D
teacher provides target value c(x) for each
Learner must output a hypothesis h estimating c
h is evaluated by its performance on subsequent instances drawn according to D
Note: probabilistic instances, noise-free classifications

200 True Error of a Hypothesis
Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
errorD(h) ≡ Prx∈D [c(x) ≠ h(x)]

201 Two Notions of Error Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over training instances
True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances
Our concern: Can we bound the true error of h given the training error of h?
First consider the case when the training error of h is zero (i.e., h ∈ VSH,D)

202 Exhausting the Version Space
Definition: The version space VSH,D is said to be ε-exhausted with respect to c and D, if every hypothesis h in VSH,D has error less than ε with respect to c and D.
(∀h ∈ VSH,D) errorD(h) < ε

203 How many examples will -exhaust the VS?
Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than
|H| e^(−εm)
Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε
If we want this probability to be below δ,
|H| e^(−εm) ≤ δ
then
m ≥ (1/ε)(ln|H| + ln(1/δ))

204 Learning Conjunctions of Boolean Literals
How many examples are sufficient to assure with probability at least (1 − δ) that every h in VSH,D satisfies errorD(h) ≤ ε?
Use our theorem: m ≥ (1/ε)(ln|H| + ln(1/δ))
Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and
m ≥ (1/ε)(ln 3^n + ln(1/δ)), or
m ≥ (1/ε)(n ln 3 + ln(1/δ))

205 How About EnjoySport? If H is as given in EnjoySport, then |H| = 973, and ... if we want to assure that with probability 95%, VS contains only hypotheses with errorD(h) ≤ .1, then it is sufficient to have m examples, where
m ≥ (1/.1)(ln 973 + ln(1/.05)) ≈ 98.8
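A quick check of this bound in Python:

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice to epsilon-exhaust
    the version space with probability at least 1 - delta (finite H, consistent learner)."""
    return (1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta))

print(sample_complexity(973, 0.1, 0.05))       # EnjoySport: ~98.8, so 99 examples
print(sample_complexity(3 ** 10, 0.1, 0.05))   # conjunctions of 10 boolean literals
```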

206 PAC Learning Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).

207 Agnostic Learning So far, we assumed c ∈ H
Agnostic learning setting: don't assume c ∈ H
What do we want then? The hypothesis h that makes the fewest errors on the training data
What is the sample complexity in this case?
m ≥ (1/(2ε^2))(ln|H| + ln(1/δ))
derived from Hoeffding bounds:
Pr[errorD(h) > errorS(h) + ε] ≤ e^(−2mε^2)

208 Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets. Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

209 Three Instances Shattered
Instance space X

210 The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.

211 VC Dim. of Linear Decision Surfaces

212 Sample Complexity from VC Dimension
How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least (1 − δ)?
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))

213 Mistake Bounds So far: how many examples needed to learn?
What about: how many mistakes before convergence? Let’s consider similar setting to PAC learning: Instances drawn at random from X according to distribution D Learner must classify each instance before receiving correct classification from teacher Can we bound the number of mistakes learner makes before converging?

214 Mistake Bounds: Find-S
Consider Find-S when H = conjunctions of boolean literals
Find-S:
Initialize h to the most specific hypothesis: l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln
For each positive training instance x, remove from h any literal that is not satisfied by x
Output hypothesis h.
How many mistakes before converging to the correct h?

215 Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm: Learn concept using version space Candidate-Elimination algorithm Classify new instances by majority vote of version space members How many mistakes before converging to correct h? ... in worst case? ... in best case?

216 Optimal Mistake Bounds
Let MA(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (maximum over all possible c ∈ C, and all possible training sequences).
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of MA(C).

217 Machine Learning Chapter 8. Instance-Based Learning
Tom M. Mitchell

218 Instance Based Learning (1/2)
k-Nearest Neighbor Locally weighted regression Radial basis functions Case-based reasoning Lazy and eager learning

219 Instance-Based Learning (2/2)
Key idea: just store all training examples <xi, f(xi)> Nearest neighbor: Given query instance xq, first locate nearest training example xn, then estimate k-Nearest neighbor: Given xq, take vote among its k nearest nbrs (if discrete-valued target function) take mean of f values of k nearest nbrs (if real-valued)

220 When To Consider Nearest Neighbor
Instances map to points in Rn Less than 20 attributes per instance Lots of training data Advantages: Training is very fast Learn complex target functions Don’t lose information Disadvantages: Slow at query time Easily fooled by irrelevant attributes

221 Voronoi Diagram

222 Behavior in the Limit Consider p(x), which defines the probability that instance x will be labeled 1 (positive) versus 0 (negative).
Nearest neighbor: As the number of training examples → ∞, approaches the Gibbs Algorithm
Gibbs: with probability p(x) predict 1, else 0
k-Nearest neighbor: As the number of training examples → ∞ and k gets large, approaches Bayes optimal
Bayes optimal: if p(x) > .5 then predict 1, else 0
Note Gibbs has at most twice the expected error of Bayes optimal

223 Distance-Weighted kNN
Might want to weight nearer neighbors more heavily...
f̂(xq) ← Σi=1..k wi f(xi) / Σi=1..k wi
where wi ≡ 1 / d(xq, xi)^2
and d(xq, xi) is the distance between xq and xi
Note it now makes sense to use all training examples instead of just k
→ Shepard's method

224 Curse of Dimensionality
Imagine instances described by 20 attributes, but only 2 are relevant to the target function
Curse of dimensionality: nearest neighbor is easily misled when X is high-dimensional
One approach:
Stretch the jth axis by weight zj, where z1, …, zn are chosen to minimize prediction error
Use cross-validation to automatically choose the weights z1, …, zn
Note setting zj to zero eliminates this dimension altogether
see [Moore and Lee, 1994]

225 Locally Weighted Regression
Note kNN forms a local approximation to f for each query point xq
Why not form an explicit approximation f̂(x) for the region surrounding xq?
Fit a linear function to the k nearest neighbors
Fit a quadratic, ...
Produces a "piecewise approximation" to f
Several choices of error to minimize:
Squared error over the k nearest neighbors
Distance-weighted squared error over all neighbors

226 Radial Basis Function Networks
Global approximation to target function, in terms of linear combination of local approximations Used, e.g., for image classification A different kind of neural network Closely related to distance-weighted regression, but “eager” instead of “lazy”

227 Radial Basis Function Networks
f̂(x) = w0 + Σu=1..k wu Ku(d(xu, x))
where ai(x) are the attributes describing instance x, and
one common choice for Ku(d(xu, x)) is
Ku(d(xu, x)) = e^(−d^2(xu, x) / (2σu^2))

228 Training Radial Basis Function Networks
Q1: What xu to use for each kernel function Ku(d(xu, x)) Scatter uniformly throughout instance space Or use training instances (reflects instance distribution) Q2: How to train weights (assume here Gaussian Ku) First choose variance (and perhaps mean) for each Ku e.g., use EM Then hold Ku fixed, and train linear output layer efficient methods to fit linear function

229 Case-Based Reasoning Can apply instance-based learning even when X ≠ ℝ^n
→ need a different "distance" metric
Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions

230 Case-Based Reasoning in CADET (1/3)
CADET: 75 stored examples of mechanical devices each training example: < qualitative function, mechanical structure > new query: desired function, target value: mechanical structure for this function Distance metric: match qualitative function descriptions

231 Case-Based Reasoning in CADET (2/3)
A stored case: T-junction pipe A problem specification: Water faucet

232 Case-Based Reasoning in CADET (3/3)
Instances represented by rich structural descriptions Multiple cases retrieved (and combined) to form solution to new problem Tight coupling between case retrieval and problem solving Bottom line: Simple matching of cases useful for tasks such as answering help-desk queries Area of ongoing research

233 Lazy and Eager Learning
Lazy: wait for query before generalizing k-Nearest Neighbor, Case based reasoning Eager: generalize before seeing query Radial basis function networks, ID3, Backpropagation, NaiveBayes, . . . Does it matter? Eager learner must create global approximation Lazy learner can create many local approximations if they use same H, lazy can represent more complex fns (e.g., consider H = linear functions)

234 Machine Learning Chapter 9. Genetic Algorithm
Tom M. Mitchell

235 Genetic Algorithms Evolutionary computation Prototypical GA
An example: GABIL Genetic Programming Individual learning and population evolution

236 Evolutionary Computation
Computational procedures patterned after biological evolution Search procedure that probabilistically applies search operators to set of points in the search space

237 Biological Evolution (1/3)
Lamarck and others: Species “transmute” over time. Darwin and Wallace: Consistent, heritable variation among individuals in population; natural selection of the fittest. Mendel and genetics: A mechanism for inheriting traits; genotype → phenotype mapping.

238 Biological Evolution (2/3)
GA(Fitness, Fitness_threshold, p, r, m) Initialize: P ← p random hypotheses. Evaluate: for each h in P, compute Fitness(h). While [maxh Fitness(h)] < Fitness_threshold: 1. Select: Probabilistically select (1−r)·p members of P to add to Ps.

239 Biological Evolution (3/3)
2. Crossover: Probabilistically select r·p/2 pairs of hypotheses from P. For each pair <h1, h2>, produce two offspring by applying the Crossover operator. Add all offspring to Ps. 3. Mutate: Invert a randomly selected bit in m·p random members of Ps. 4. Update: P ← Ps. 5. Evaluate: for each h in P, compute Fitness(h). Return the hypothesis from P that has the highest fitness.
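A compact sketch of this prototypical GA for fixed-length bit-string hypotheses, using fitness-proportionate selection and single-point crossover (all helper names and default parameter values here are illustrative):

import random

def ga(fitness, fitness_threshold, p=100, r=0.6, m=0.01, length=20):
    P = [[random.randint(0, 1) for _ in range(length)] for _ in range(p)]
    def select():                                              # fitness-proportionate selection
        return random.choices(P, weights=[fitness(h) for h in P])[0]
    while max(fitness(h) for h in P) < fitness_threshold:
        Ps = [select() for _ in range(int((1 - r) * p))]       # 1. Select
        for _ in range(int(r * p / 2)):                        # 2. Crossover
            h1, h2 = select(), select()
            cut = random.randrange(1, length)
            Ps += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
        for h in random.sample(Ps, int(m * p)):                # 3. Mutate
            bit = random.randrange(length)
            h[bit] = 1 - h[bit]
        P = Ps                                                 # 4. Update; 5. Evaluate on next loop test
    return max(P, key=fitness)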

240 Representing Hypotheses
Represent (Outlook = Overcast ∨ Rain) ∧ (Wind = Strong) by the bit string Outlook: 011, Wind: 10. Represent IF Wind = Strong THEN PlayTennis = yes by Outlook: 111, Wind: 10, PlayTennis: 10.

241 Operators for Genetic Algorithms
Figure (not reproduced): each operator is shown as initial strings, crossover mask, and offspring, for single-point crossover, two-point crossover, uniform crossover, and point mutation.
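Illustrative implementations of the four operators named above, for equal-length bit strings represented as lists (the mask convention, 1 = copy from the first parent, is an assumption):

import random

def single_point_crossover(h1, h2, cut):
    return h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]

def two_point_crossover(h1, h2, i, j):
    return h1[:i] + h2[i:j] + h1[j:], h2[:i] + h1[i:j] + h2[j:]

def uniform_crossover(h1, h2, mask):
    # mask bit 1 -> copy from first parent, 0 -> copy from second parent
    o1 = [a if m else b for a, b, m in zip(h1, h2, mask)]
    o2 = [b if m else a for a, b, m in zip(h1, h2, mask)]
    return o1, o2

def point_mutation(h):
    i = random.randrange(len(h))
    return h[:i] + [1 - h[i]] + h[i + 1:]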

242 Selecting Most Fit Hypotheses
Fitness proportionate selection: Pr(hi) = Fitness(hi) / Σj Fitness(hj) ... can lead to crowding. Tournament selection: Pick h1, h2 at random with uniform prob.; with probability p, select the more fit. Rank selection: Sort all hypotheses by fitness; prob of selection is proportional to rank.

243 GABIL [DeJong et al. 1993] Learn disjunctive set of propositional rules, competitive with C4.5. Fitness: Fitness(h) = (correct(h))². Representation: IF a1 = T ∧ a2 = F THEN c = T; IF a2 = T THEN c = F, represented by the bit string a1: 10, a2: 01, c: 1, a1: 11, a2: 10, c: 0. Genetic operators: ??? want variable length rule sets; want only well-formed bitstring hypotheses.

244 Crossover with Variable-Length Bitstrings
Start with two parent bit strings h1 and h2 (shown in the omitted figure). 1. Choose crossover points for h1, e.g., after bits 1, 8. 2. Now restrict points in h2 to those that produce bitstrings with well-defined semantics, e.g., <1, 3>, <1, 8>, <6, 8>. If we choose <1, 3>, the result is the pair of offspring shown in the figure.

245 GABIL Extensions Add new genetic operators, also applied probabilistically: 1. AddAlternative: generalize constraint on ai by changing a 0 to 1 2. DropCondition: generalize constraint on ai by changing every 0 to 1 And, add new field to bitstring to determine whether to allow these So now the learning strategy also evolves!

246 GABIL Results Performance of GABIL comparable to symbolic rule/tree learning methods C4.5, ID5R, AQ14. Average performance on a set of 12 synthetic problems: GABIL without AA and DC operators: 92.1% accuracy; GABIL with AA and DC operators: 95.2% accuracy; symbolic learning methods ranged from 91.2% to 96.6%.

247 Schemas How to characterize evolution of population in GA?
Schema = string containing 0, 1, * (“don’t care”). Typical schema: 10**0*. Instances of the above schema: 101101, 100000, ... Characterize population by number of instances representing each possible schema: m(s, t) = number of instances of schema s in pop at time t.

248 Consider Just Selection
f̄(t) = average fitness of pop. at time t
m(s, t) = number of instances of schema s in pop at time t
û(s, t) = ave. fitness of instances of s at time t
Probability of selecting h in one selection step: Pr(h) = f(h) / Σi f(hi) = f(h) / (n · f̄(t))
Probability of selecting an instance of s in one step: Pr(h ∈ s) = (û(s, t) / (n · f̄(t))) · m(s, t)
Expected number of instances of s after n selections: E[m(s, t+1)] = (û(s, t) / f̄(t)) · m(s, t)

249 Schema Theorem
m(s, t) = number of instances of schema s in pop at time t
f̄(t) = average fitness of pop. at time t
û(s, t) = ave. fitness of instances of s at time t
pc = probability of single point crossover operator
pm = probability of mutation operator
l = length of single bit strings
o(s) = number of defined (non “*”) bits in s
d(s) = distance between leftmost, rightmost defined bits in s
Schema Theorem: E[m(s, t+1)] ≥ (û(s, t) / f̄(t)) · m(s, t) · (1 − pc · d(s)/(l − 1)) · (1 − pm)^o(s)
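The quantity m(s, t) tracked by the theorem is easy to compute directly; a small illustrative check (function names are not from the slides):

def matches(schema, h):
    # '*' is a don't-care position; every other position must match exactly
    return all(s == '*' or s == b for s, b in zip(schema, h))

def m(schema, population):
    return sum(matches(schema, h) for h in population)

pop = ['101101', '100000', '011100', '100101']
print(m('10**0*', pop))   # -> 3 members of this population are instances of the schema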

250 Genetic Programming Population of programs represented by trees

251 Crossover

252 Block Problem (1/2) Goal: spell UNIVERSAL Terminals:
CS (“current stack”) = name of the top block on stack, or F. TB (“top correct block”) = name of topmost correct block on stack NN (“next necessary”) = name of the next block needed above TB in the stack

253 Block Problem (2/2) Primitive functions:
(MS x): (“move to stack”), if block x is on the table, moves x to the top of the stack and returns the value T. Otherwise, does nothing and returns the value F. (MT x): (“move to table”), if block x is somewhere in the stack, moves the block at the top of the stack to the table and returns the value T. Otherwise, returns F. (EQ x y): (“equal”), returns T if x equals y, and returns F otherwise. (NOT x): returns T if x = F, else returns F (DU x y): (“do until”) executes the expression x repeatedly until expression y returns the value T
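A hedged simulation of these terminals and primitives, sufficient to execute the learned program shown on the next slide; the world representation (a stack list plus a set of blocks on the table) and the simplified TB/NN computations are assumptions for illustration only:

GOAL = list("UNIVERSAL")

class BlocksWorld:
    def __init__(self, stack, table):
        self.stack, self.table = list(stack), set(table)
    # terminals
    def CS(self):                        # name of the top block on the stack, or F
        return self.stack[-1] if self.stack else False
    def TB(self):                        # topmost correct block (simplified)
        n = 0
        while n < len(self.stack) and self.stack[:n + 1] == GOAL[:n + 1]:
            n += 1
        return self.stack[n - 1] if n else False
    def NN(self):                        # next block needed (assumes the stack is a correct prefix)
        n = len(self.stack) if self.stack == GOAL[:len(self.stack)] else 0
        return GOAL[n] if n < len(GOAL) else False
    # primitive functions
    def MS(self, x):                     # move x from the table to the top of the stack
        if x in self.table:
            self.table.remove(x)
            self.stack.append(x)
            return True
        return False
    def MT(self, x):                     # if x is in the stack, move the top block to the table
        if x in self.stack:
            self.table.add(self.stack.pop())
            return True
        return False

def DU(x, y):                            # "do until": execute x repeatedly until y() is True
    while not y():
        x()

# the learned program: (EQ (DU (MT CS) (NOT CS)) (DU (MS NN) (NOT NN)))
w = BlocksWorld(stack="EVSRLAI", table="UN")      # hypothetical initial state
DU(lambda: w.MT(w.CS()), lambda: not w.CS())      # unstack everything onto the table
DU(lambda: w.MS(w.NN()), lambda: not w.NN())      # rebuild the goal spelling
print("".join(w.stack))                           # -> UNIVERSAL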

254 Learned Program
Trained to fit 166 test problems. Using a population of 300 programs, found this after 10 generations: (EQ (DU (MT CS)(NOT CS)) (DU (MS NN)(NOT NN)))

255 Genetic Programming More interesting example: design electronic filter circuits Individuals are programs that transform beginning circuit to final circuit, by adding/subtracting components and connections Use population of 640,000, run on 64 node parallel processor Discovers circuits competitive with best human designs

256 GP for Classifying Images
[Teller and Veloso, 1997] Fitness: based on coverage and accuracy. Representation: Primitives include Add, Sub, Mult, Div, Not, Max, Min, Read, Write, If-Then-Else, Either, Pixel, Least, Most, Ave, Variance, Difference, Mini, Library. Mini refers to a local subroutine that is separately co-evolved; Library refers to a global library subroutine (evolved by selecting the most useful Minis). Genetic operators: Crossover, mutation. Create “mating pools” and use rank proportionate reproduction.

257 Biological Evolution Lamarck (19th century)
Believed individual genetic makeup was altered by lifetime experience But current evidence contradicts this view What is the impact of individual learning on population evolution?

258 Baldwin Effect (1/2)
Assume: Individual learning has no direct influence on individual DNA, but the ability to learn reduces the need to “hard wire” traits in DNA. Then: The ability of individuals to learn will support a more diverse gene pool, because learning allows individuals with various “hard wired” traits to be successful; a more diverse gene pool will support faster evolution of the gene pool. ⇒ individual learning (indirectly) increases the rate of evolution.

259 Baldwin Effect (2/2) Plausible example:
1. New predator appears in environment 2. Individuals who can learn (to avoid it) will be selected 3. Increase in learning individuals will support more diverse gene pool 4. resulting in faster evolution 5. possibly resulting in new non-learned traits such as instinctive fear of predator

260 Computer Experiments on Baldwin Effect
[Hinton and Nowlan, 1987] Evolve simple neural networks: Some network weights fixed during lifetime, others trainable Genetic makeup determines which are fixed, and their weight values Results: With no individual learning, population failed to improve over time When individual learning allowed Early generations: population contained many individuals with many trainable weights Later generations: higher fitness, while number of trainable weights decreased

261 Summary: Evolutionary Programming
Conduct randomized, parallel, hill-climbing search through H. Approach learning as an optimization problem (optimize fitness). Nice feature: evaluation of Fitness can be very indirect; consider learning a rule set for multistep decision making: there is no issue of assigning credit/blame to individual steps.

262 Machine Learning Chapter 10. Learning Sets of Rules
Tom M. Mitchell

263 Learning Disjunctive Sets of Rules
Method 1: Learn decision tree, convert to rules Method 2: Sequential covering algorithm: 1. Learn one rule with high accuracy, any coverage 2. Remove positive examples covered by this rule 3. Repeat

264 Sequential Covering Algorithm
SEQUENTIAL-COVERING(Target_attribute, Attributes, Examples, Threshold)
Learned_rules ← {}
Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
while PERFORMANCE(Rule, Examples) > Threshold, do
  Learned_rules ← Learned_rules + Rule
  Examples ← Examples − {examples correctly classified by Rule}
  Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
Learned_rules ← sort Learned_rules according to PERFORMANCE over Examples
return Learned_rules
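A schematic Python rendering of the loop above; learn_one_rule, performance, and covers are assumed callables supplied by the caller, and examples are assumed to carry a boolean label attribute (all of these names are placeholders, not APIs from the text):

def sequential_covering(learn_one_rule, performance, covers, examples, threshold):
    learned_rules = []
    rule = learn_one_rule(examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # drop the positive examples this rule already classifies correctly
        examples = [e for e in examples if not (e.label and covers(rule, e))]
        rule = learn_one_rule(examples)
    # consult the best-performing rules first
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules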

265 Learn-One-Rule

266 Learn-One-Rule(Cont.)
Pos ← positive Examples
Neg ← negative Examples
while Pos, do
  Learn a NewRule
  - NewRule ← most general rule possible
  - NewRuleNeg ← Neg
  - while NewRuleNeg, do
    Add a new literal to specialize NewRule
    1. Candidate_literals ← generate candidates
    2. Best_literal ← argmax over L in Candidate_literals of Performance(SpecializeRule(NewRule, L))
    3. add Best_literal to NewRule preconditions
    4. NewRuleNeg ← subset of NewRuleNeg that satisfies NewRule preconditions
  - Learned_rules ← Learned_rules + NewRule
  - Pos ← Pos − {members of Pos covered by NewRule}
Return Learned_rules

267 Subtleties: Learn One Rule
1. May use beam search. 2. Easily generalizes to multi-valued target functions. 3. Choose evaluation function to guide search: Entropy (i.e., information gain); Sample accuracy: nc / n, where nc = correct rule predictions, n = all predictions; m-estimate: (nc + m·p) / (n + m), where p = prior probability and m = weight given to the prior.
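Small illustrative versions of the evaluation functions listed above (entropy of the covered examples, sample accuracy, and the m-estimate); the argument names are assumptions:

import math

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def sample_accuracy(n_correct, n_total):
    return n_correct / n_total                       # nc / n

def m_estimate(n_correct, n_total, prior, m):
    return (n_correct + m * prior) / (n_total + m)   # (nc + m·p) / (n + m)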

268 Variants of Rule Learning Programs
Sequential or simultaneous covering of data? General → specific, or specific → general? Generate-and-test, or example-driven? Whether and how to post-prune? What statistical evaluation function?

269 Learning First Order Rules
Why do that? Can learn sets of rules such as: Ancestor(x, y) ← Parent(x, y); Ancestor(x, y) ← Parent(x, z) ∧ Ancestor(z, y). General purpose programming language PROLOG: programs are sets of such rules.

270 First Order Rule for Classifying Web Pages
[Slattery, 1997] course(A) ← has-word(A, instructor), Not has-word(A, good), link-from(A, B), has-word(B, assign), Not link-from(B, C). Train: 31/31, Test: 31/34


272 Specializing Rules in FOIL

273 Information Gain in FOIL
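The gain formula on this slide is an image that did not survive the transcript; for reference, the standard FOIL gain for adding a candidate literal L to rule R (as in the textbook) is FoilGain(L, R) ≡ t · (log₂(p₁ / (p₁ + n₁)) − log₂(p₀ / (p₀ + n₀))), where p₀, n₀ are the positive and negative bindings of R, p₁, n₁ are those of the specialized rule R + L, and t is the number of positive bindings of R still covered after adding L.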

274 Induction as Inverted Deduction

275 Induction as Inverted Deduction(Cont’)

276 Induction as Inverted Deduction(Cont’)
Induction is, in fact, the inverse operation of deduction, and cannot be conceived to exist without the corresponding operation, so that the question of relative importance cannot arise. Who thinks of asking whether addition or subtraction is the more important process in arithmetic? But at the same time much difference in difficulty may exist between a direct and inverse operation; … it must be allowed that inductive investigations are of a far higher degree of difficulty and complexity than any questions of deduction…. (Jevons 1874)

277 Induction as Inverted Deduction(Cont’)

278 Induction as Inverted Deduction(Cont’)

279 Induction as Inverted Deduction(Cont’)

280 Deduction: Resolution Rule

281 Inverting Resolution

282 Inverted Resolution (Propositional)

283 First order resolution

284 Inverting First order resolution

285 Cigol

286 Progol

287 Machine Learning Chapter 11. Analytical Learning
Tom M. Mitchell

288 Outline Two formulations for learning: Inductive and Analytical
Perfect domain theories and Prolog-EBG

289 A Positive Example

290 The Inductive Generalization Problem
Given: Instances Hypotheses Target Concept Training examples of target concept Determine: Hypotheses consistent with the training examples

291 The Analytical Generalization Problem(Cont’)
Given: Instances Hypotheses Target Concept Training examples of target concept Domain theory for explaining examples Determine: Hypotheses consistent with the training examples and the domain theory

292 An Analytical Generalization Problem

293 Learning from Perfect Domain Theories
Assumes domain theory is correct (error-free) Prolog-EBG is algorithm that works under this assumption This assumption holds in chess and other search problems Allows us to assume explanation = proof Later we’ll discuss methods that assume approximate domain theories

294 Prolog EBG Initialize hypothesis = {}
For each positive training example not covered by hypothesis: 1. Explain how training example satisfies target concept, in terms of domain theory 2. Analyze the explanation to determine the most general conditions under which this explanation (proof) holds 3. Refine the hypothesis by adding a new rule, whose preconditions are the above conditions, and whose consequent asserts the target concept

295 Explanation of a Training Example

296 Computing the Weakest Preimage of Explanation

297 Regression Algorithm

298 Lessons from Safe-to-Stack Example
Justified generalization from single example Explanation determines feature relevance Regression determines needed feature constraints Generality of result depends on domain theory Still require multiple examples

299 Perspectives on Prolog-EBG
Theory-guided generalization from examples. Example-guided operationalization of theories. "Just" restating what the learner already "knows". Is it learning? Are you learning when you get better over time at chess, even though you already know everything in principle once you know the rules of the game? Are you learning when you sit in a mathematics class, even though those theorems follow deductively from the axioms you’ve already learned?

300 Machine Learning Chapter 12. Combining Inductive and Analytical Learning
Tom M. Mitchell

301 Inductive and Analytical Learning
Inductive learning: Hypothesis fits data; statistical inference; requires little prior knowledge; syntactic inductive bias. Analytical learning: Hypothesis fits domain theory; deductive inference; learns from scarce data; bias is domain theory.

302 What We Would Like General purpose learning method:
No domain theory → learn as well as inductive methods. Perfect domain theory → learn as well as Prolog-EBG. Accommodate arbitrary and unknown errors in domain theory. Accommodate arbitrary and unknown errors in training data.

303 Domain theory: Cup ← Stable, Liftable, Open Vessel; Stable ← BottomIsFlat; Liftable ← Graspable, Light; Graspable ← HasHandle; Open Vessel ← HasConcavity, ConcavityPointsUp. Training examples:

304 KBANN KBANN (data D, domain theory B)
1. Create a feedforward network h equivalent to B. 2. Use BACKPROP to tune h to fit D.

305 Neural Net Equivalent to Domain Theory

306 Creating Network Equivalent to Domain Theory
Create one unit per Horn clause rule (i.e., an AND unit). Connect unit inputs to corresponding clause antecedents. For each non-negated antecedent, corresponding input weight w ← W, where W is some constant. For each negated antecedent, input weight w ← −W. Threshold weight w0 ← −(n − .5)·W, where n is the number of non-negated antecedents. Finally, add many additional connections with near-zero weights. Example: Liftable ← Graspable, ¬Heavy
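A minimal sketch of this weight-setting rule for one Horn clause, producing an AND unit as a weight map plus threshold; the feature list, W = 4.0, and the tiny random extra weights are illustrative assumptions:

import random

def clause_to_unit(antecedents, negated, all_features, W=4.0):
    # antecedents / negated: feature names appearing un-negated / negated in the clause body
    weights = {}
    for f in all_features:
        if f in antecedents:
            weights[f] = W                            # non-negated antecedent -> +W
        elif f in negated:
            weights[f] = -W                           # negated antecedent -> -W
        else:
            weights[f] = random.uniform(-0.05, 0.05)  # near-zero additional connection
    n = len(antecedents)
    w0 = -(n - 0.5) * W                               # unit fires only when all antecedents hold
    return weights, w0

# e.g., the rule Liftable <- Graspable, Light over a small illustrative feature set
w, w0 = clause_to_unit({"Graspable", "Light"}, set(), ["Graspable", "Light", "HasHandle", "Heavy"])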

307 Result of refining the network

308 KBANN Results Classifying promoter regions in DNA (leave-one-out testing): Backpropagation: error rate 8/106; KBANN: 4/106. Similar improvements on other classification and control tasks.

309 Hypothesis space search in KBANN

310 EBNN Key idea: Previously learned approximate domain theory
Domain theory represented by collection of neural networks Learn target function as another neural network


312 Modified Objective for Gradient Descent


314 Hypothesis Space Search in EBNN

315 Search in FOCL

316 FOCL Results Recognizing legal chess endgame positions:
30 positive, 30 negative examples FOIL : 86% FOCL : 94% (using domain theory with 76% accuracy) NYNEX telephone network diagnosis 500 training examples FOIL : 90% FOCL : 98% (using domain theory with 95% accuracy)

317 Machine Learning Chapter 13. Reinforcement Learning
Tom M. Mitchell

318 Control Learning Consider learning to choose actions, e.g.,
Robot learning to dock on battery charger Learning to choose actions to optimize factory output Learning to play Backgammon Note several problem characteristics: Delayed reward Opportunity for active exploration Possibility that state only partially observable Possible need to learn multiple tasks with same sensors/effectors

319 One Example: TD-Gammon
Learn to play Backgammon Immediate reward +100 if win -100 if lose 0 for all other states Trained by playing 1.5 million games against itself Now approximately equal to best human player

320 Reinforcement Learning Problem

321 Markov Decision Processes
Assume finite set of states S, set of actions A. At each discrete time step, agent observes state st ∈ S and chooses action at ∈ A; then receives immediate reward rt, and state changes to st+1. Markov assumption: st+1 = δ(st, at) and rt = r(st, at), i.e., rt and st+1 depend only on current state and action. Functions δ and r may be nondeterministic. Functions δ and r not necessarily known to agent.

322 Agent's Learning Task

323 Value Function


325 What to Learn

326 Q Function

327 Training Rule to Learn Q

328 Q Learning for Deterministic Worlds
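A minimal tabular Q-learning sketch for the deterministic case, following the update rule Q̂(s, a) ← r + γ·maxa' Q̂(s', a'); the environment interface (env.reset / env.step returning (next_state, reward, done)) is an assumed placeholder, not something defined in the text:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                    # Q[(state, action)], initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice: some exploration is needed to visit all state-action pairs
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # deterministic-world training rule: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s2, act)] for act in actions)
            s = s2
    return Q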



331 Nondeterministic Case

332 Nondeterministic Case(Cont’)

333 Temporal Difference Learning
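The equations for these two slides are likewise missing from the transcript; a standard statement of the TD(λ) blend of n-step estimates discussed in this chapter (reconstructed from the textbook formulation, so treat it as a reference rather than a transcript) is: Q^(n)(st, at) ≡ rt + γ·rt+1 + … + γ^(n−1)·rt+n−1 + γ^n·maxa Q̂(st+n, a), and Q^λ(st, at) ≡ (1 − λ)·[ Q^(1)(st, at) + λ·Q^(2)(st, at) + λ²·Q^(3)(st, at) + … ]; TD(λ) reduces to the one-step Q-learning rule when λ = 0.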

334 Temporal Difference Learning(Cont’)

335 Subtleties and Ongoing Research

