1
Machine Learning Chapter 1. Introduction
Tom M. Mitchell
2
Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997 (lecture text)
Reinforcement Learning: An Introduction, R. S. Sutton and A. G. Barto, The MIT Press, 1998 (student presentations)
3
Machine Learning: how to construct computer programs that automatically improve with experience. Applications: data mining (medical applications, 1989), detecting fraudulent credit card transactions (1989), information filtering based on users' reading preferences, autonomous vehicles, backgammon at the level of world champions (1992), speech recognition (1989), optimizing energy costs. Machine learning theory asks: How does learning performance vary with the number of training examples presented? What learning algorithms are most appropriate for various types of learning tasks?
4
Example programs: http://www.cs.cmu.edu/~tom/mlbook.html Face recognition
Decision tree learning code Data for financial loan analysis Bayes classifier code Data for analyzing text documents
5
Theoretical studies: fundamental relationships among the number of training examples observed, the number of hypotheses under consideration, and the expected error in learned hypotheses. Biological systems
6
Def. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
7
Outline Why Machine Learning? What is a well-defined learning problem?
An example: learning to play checkers What questions should we ask about Machine Learning?
8
Why Machine Learning Recent progress in algorithms and theory
Growing flood of online data Computational power is available Budding industry
9
Three niches for machine learning:
Data mining: using historical data to improve decisions (e.g., mining medical records to obtain medical knowledge)
Software applications we can't program by hand: autonomous driving, speech recognition
Self-customizing programs: newsreader that learns user interests
10
Typical Datamining Task (1/2)
11
Typical Datamining Task (2/2)
Given: 9714 patient records, each describing a pregnancy and birth Each patient record contains 215 features Learn to predict: Classes of future patients at high risk for Emergency Cesarean Section
12
Datamining Result One of 18 learned rules:
If   No previous vaginal delivery, and
     Abnormal 2nd Trimester Ultrasound, and
     Malpresentation at admission
Then Probability of Emergency C-Section is 0.6
Over training data: 26/41 = .63; over test data: 12/20 = .60
13
Credit Risk Analysis (1/2)
Data :
14
Credit Risk Analysis (2/2)
Rules learned from synthesized data:
If   Other-Delinquent-Accounts > 2, and
     Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = No [Deny Credit Card application]
If   Other-Delinquent-Accounts = 0, and
     (Income > $30k) OR (Years-of-Credit > 3)
Then Profitable-Customer? = Yes [Accept Credit Card application]
15
Other Prediction Problems (1/2)
16
Other Prediction Problems (2/2)
17
Problems Too Difficult to Program by Hand
ALVINN [Pomerleau] drives 70 mph on highways
18
Software that Customizes to User
19
Where Is this Headed? (1/2)
Today: tip of the iceberg First-generation algorithms: neural nets, decision trees, regression ... Applied to well-formatted databases Budding industry
20
Where Is this Headed? (2/2)
Opportunity for tomorrow: enormous impact Learn across full mixed-media data Learn across multiple internal databases, plus the web and newsfeeds Learn by active experimentation Learn decisions rather than predictions Cumulative, lifelong learning Programming languages with learning embedded?
21
Relevant Disciplines Artificial intelligence Bayesian methods
Computational complexity theory Control theory Information theory Philosophy Psychology and neurobiology Statistics . . .
22
What is the Learning Problem?
Learning = Improving with experience at some task Improve over task T, with respect to performance measure P, based on experience E. E.g., Learn to play checkers T: Play checkers P: % of games won in world tournament E: opportunity to play against self
23
Learning to Play Checkers
T: Play checkers P: Percent of games won in world tournament What experience? What exactly should be learned? How shall it be represented? What specific algorithm to learn it?
24
Type of Training Experience
Direct or indirect? Teacher or not? A problem: is training experience representative of performance goal?
25
Choose the Target Function
ChooseMove : Board → Move
V : Board → ℝ
. . .
26
Possible Definition for Target Function V
if b is a final board state that is won, then V(b) = 100 if b is a final board state that is lost, then V(b) = -100 if b is a final board state that is drawn, then V(b) = 0 if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game. This gives correct values, but is not operational
27
Choose Representation for Target Function
collection of rules? neural network ? polynomial function of board features? . . .
28
A Representation for Learned Function
w0+ w1·bp(b)+w2·rp(b)+w3·bk(b)+w4·rk(b)+w5·bt(b)+w6·rt(b) bp(b) : number of black pieces on board b rp(b) : number of red pieces on b bk(b) : number of black kings on b rk(b) : number of red kings on b bt(b) : number of red pieces threatened by black (i.e., which can be taken on black's next turn) rt(b) : number of black pieces threatened by red
29
Obtaining Training Examples
V(b): the true target function
V̂(b): the learned function
Vtrain(b): the training value
One rule for estimating training values: Vtrain(b) ← V̂(Successor(b))
30
Choose Weight Tuning Rule
LMS weight update rule: Do repeatedly:
Select a training example b at random
1. Compute error(b): error(b) = Vtrain(b) – V̂(b)
2. For each board feature fi, update weight wi: wi ← wi + c · fi · error(b)
c is some small constant, say 0.1, to moderate the rate of learning
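A minimal Python sketch of this training loop, assuming the linear evaluation function and six board features from the preceding slides; the feature vectors and training values below are hypothetical stand-ins for positions generated by self-play:

```python
import random

def v_hat(weights, features):
    """Linear evaluation function: w0 + w1*bp(b) + ... + w6*rt(b)."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def lms_update(weights, features, v_train, c=0.1):
    """One LMS step: wi <- wi + c * fi * error(b)."""
    error = v_train - v_hat(weights, features)   # error(b) = Vtrain(b) - V^(b)
    weights[0] += c * error                      # bias weight; its feature is 1
    for i, f in enumerate(features):
        weights[i + 1] += c * f * error
    return weights

# Hypothetical (feature vector, training value) pairs.
examples = [([12, 12, 0, 0, 1, 1], 0.0), ([8, 2, 2, 0, 3, 0], 90.0)]
weights = [0.0] * 7
for _ in range(1000):
    feats, v_train = random.choice(examples)     # select an example at random
    lms_update(weights, feats, v_train)
print(weights)
```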
31
Final design
The performance system: plays games
The critic: detects differences between the game trace and the training values (analysis)
The generalizer: generates new hypotheses
The experiment generator: generates new problems
32
Learning methods
Backgammon: reinforcement learning over an extended set of six board features
Neural network: learns from the raw board itself over roughly a million games of self-play, reaching a level comparable to human players
Nearest Neighbor algorithm: stores many training examples, then handles a new case by finding and reusing the closest stored one
Genetic algorithm: creates many candidate programs and evolves them by survival of the fittest
Explanation-based learning: learns by analyzing the reasons for wins and losses
33
Design Choices
34
Some Issues in Machine Learning
What algorithms can approximate functions well (and when)? How does number of training examples influence accuracy? How does complexity of hypothesis representation impact it? How does noisy data influence accuracy? What are the theoretical limits of learnability? How can prior knowledge of learner help? What clues can we get from biological learning systems? How can systems alter their own representations?
35
Machine Learning Chapter 2. Concept Learning and The General-to-specific Ordering Tom M. Mitchell
36
Outline Learning from examples
General-to-specific ordering over hypotheses Version spaces and candidate elimination algorithm Picking new examples The need for inductive bias Note: simple approach assuming no noise, illustrates key concepts
37
Training Examples for EnjoySport
What is the general concept?
Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
Rainy | Cold    | High     | Strong | Warm  | Change   | No
Sunny | Warm    | High     | Strong | Cool  | Change   | Yes
38
Representing Hypotheses
Many possible representations. Here, h is a conjunction of constraints on attributes. Each constraint can be: a specific value (e.g., Water = Warm), don't care (e.g., Water = ?), no value allowed (e.g., Water = ∅). For example: Sky AirTemp Humid Wind Water Forecst <Sunny ? ? Strong ? Same>
39
Prototypical Concept Learning Task(1/2)
Given: Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast Target function c: EnjoySport : X → {0, 1} Hypotheses H: conjunctions of literals, e.g. <?, Cold, High, ?, ?, ?> Training examples D: positive and negative examples of the target function <x1, c(x1)>, … <xm, c(xm)> Determine: a hypothesis h in H such that h(x) = c(x) for all x in D.
40
Prototypical Concept Learning Task(2/2)
The inductive learning hypothesis: Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
41
Instance, Hypotheses, and More- General-Than
42
Find-S Algorithm 1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x For each attribute constraint ai in h If the constraint ai in h is satisfied by x Then do nothing Else replace ai in h by the next more general constraint that is satisfied by x 3. Output hypothesis h
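A Python sketch of Find-S for the conjunctive representation above (None plays the role of the most specific constraint ∅, '?' is don't-care); run on the EnjoySport table it yields <Sunny, Warm, ?, Strong, ?, ?>:

```python
def find_s(examples):
    """Find-S for conjunctive hypotheses; '?' = don't care, None = no value."""
    n = len(examples[0][0])
    h = [None] * n                      # most specific hypothesis in H
    for x, label in examples:
        if label != 'Yes':              # Find-S ignores negative examples
            continue
        for i in range(n):
            if h[i] is None:
                h[i] = x[i]             # first positive: copy attribute value
            elif h[i] != x[i]:
                h[i] = '?'              # minimal generalization
    return h

# The EnjoySport training set from the slide above
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```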
43
Hypothesis Space Search by Find-S
44
Complaints about Find-S
Can’t tell whether it has learned concept Can’t tell when training data inconsistent Picks a maximally specific h (why?) Depending on H, there might be several!
45
Version Spaces A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x) for each training example <x, c(x)> in D: Consistent(h, D) ≡ (∀<x, c(x)>∈D) h(x) = c(x) The version space, VS_H,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with all training examples in D: VS_H,D ≡ {h ∈ H | Consistent(h, D)}
46
The List-Then-Eliminate Algorithm:
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example <x, c(x)>, remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
47
Example Version Space
48
Representing Version Spaces
The General boundary, G, of version space VS_H,D is the set of its maximally general members. The Specific boundary, S, of version space VS_H,D is the set of its maximally specific members. Every member of the version space lies between these boundaries: VS_H,D = {h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥ h ≥ s)} where x ≥ y means x is more general than or equal to y
49
Candidate Elimination Algorithm (1/2)
G ← maximally general hypotheses in H S ← maximally specific hypotheses in H For each training example d, do If d is a positive example Remove from G any hypothesis inconsistent with d For each hypothesis s in S that is not consistent with d Remove s from S Add to S all minimal generalizations h of s such that 1. h is consistent with d, and 2. some member of G is more general than h Remove from S any hypothesis that is more general than another hypothesis in S
50
Candidate Elimination Algorithm (2/2)
If d is a negative example Remove from S any hypothesis inconsistent with d For each hypothesis g in G that is not consistent with d Remove g from G Add to G all minimal specializations h of g such that 1. h is consistent with d, and 2. some member of S is more specific than h Remove from G any hypothesis that is less general than another hypothesis in G
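A compact illustrative Python sketch of both passes for this conjunctive representation ('?' = don't care, None = ∅); attribute domains must be supplied, and this is a teaching sketch under those assumptions, not tuned code:

```python
def matches(h, x):
    return all(a == '?' or a == v for a, v in zip(h, x))

def more_general(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == '?' or (b is not None and a == b) for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = [tuple([None] * n)], [tuple(['?'] * n)]
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]
            new_S = []
            for s in S:
                if matches(s, x):
                    new_S.append(s)
                else:  # minimal generalization of s that covers x
                    h = tuple(x[i] if s[i] is None else
                              (s[i] if s[i] == x[i] else '?') for i in range(n))
                    if any(more_general(g, h) for g in G):
                        new_S.append(h)
            # drop members of S more general than another member of S
            S = [s for s in new_S
                 if not any(t != s and more_general(s, t) for t in new_S)]
        else:
            S = [s for s in S if not matches(s, x)]
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                else:  # minimal specializations of g that exclude x
                    for i in range(n):
                        if g[i] == '?':
                            for v in domains[i]:
                                if v != x[i]:
                                    h = g[:i] + (v,) + g[i + 1:]
                                    if any(more_general(h, s) for s in S):
                                        new_G.append(h)
            # drop members of G less general than another member of G
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G

# On the EnjoySport data this reproduces the boundary sets of the example trace:
# S = {('Sunny','Warm','?','Strong','?','?')}
# G = {('Sunny','?','?','?','?','?'), ('?','Warm','?','?','?','?')}
```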
51
Example Trace
52
What Next Training Example?
53
How Should These Be Classified?
<Sunny Warm Normal Strong Cool Change> <Rainy Cool Normal Light Warm Same> <Sunny Warm Normal Light Warm Same>
54
What Justifies this Inductive Leap?
+ <Sunny Warm Normal Strong Cool Change> + <Sunny Warm Normal Light Warm Same> S : <Sunny Warm Normal ? ? ?> Why believe we can classify the unseen <Sunny Warm Normal Strong Warm Same>?
55
An UNBiased Learner Idea: Choose H that expresses every teachable
concept (i.e., H is the power set of X) Consider H' = disjunctions, conjunctions, negations over previous H. E.g., <Sunny Warm Normal ? ? ?> ∨ <? ? ? ? ? Change> What are S, G in this case? S ← ? G ← ?
56
Inductive Bias Consider Definition: concept learning algorithm L
instances X, target concept c training examples Dc = {<x, c(x)>} let L(xi, Dc) denote the classification assigned to the instance xi by L after training on data Dc. Definition: The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc (∀xi ∈ X)[(B ∧ Dc ∧ xi) ├ L(xi, Dc)] where A├ B means A logically entails B
57
Inductive Systems and Equivalent Deductive Systems
58
Three Learners with Different Biases
1. Rote learner: Store examples, Classify x iff it matches previously observed example. 2. Version space candidate elimination algorithm 3. Find-S
59
Summary Points 1. Concept learning as search through H
2. General-to-specific ordering over H 3. Version space candidate elimination algorithm 4. S and G boundaries characterize learner’s uncertainty 5. Learner can generate useful queries 6. Inductive leaps possible only if learner is biased 7. Inductive learners can be modelled by equivalent deductive systems
60
Machine Learning Chapter 3. Decision Tree Learning
Tom M. Mitchell
61
Abstract Decision tree representation ID3 learning algorithm
Entropy, Information gain Overfitting
62
Decision Tree for PlayTennis
63
A Tree to Predict C-Section Risk
Learned from medical records of 1000 women Negative examples are C-sections
64
Decision Trees
Decision tree representation:
Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification
How would we represent: ∧, ∨, XOR? (A ∧ B) ∨ (C ∧ ¬D ∧ E)? M of N?
65
When to Consider Decision Trees
Instances describable by attribute-value pairs Target function is discrete valued Disjunctive hypothesis may be required Possibly noisy training data Examples: Equipment or medical diagnosis Credit risk analysis Modeling calendar scheduling preferences
66
Top-Down Induction of Decision Trees
Main loop: 1. A ← the “best” decision attribute for the next node 2. Assign A as decision attribute for node 3. For each value of A, create new descendant of node 4. Sort training examples to leaf nodes 5. If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes Which attribute is best?
67
Entropy(1/2) S is a sample of training examples
p⊕ is the proportion of positive examples in S p⊖ is the proportion of negative examples in S Entropy measures the impurity of S: Entropy(S) ≡ - p⊕ log2 p⊕ - p⊖ log2 p⊖
68
Entropy(2/2) Entropy(S) = expected number of bits needed to encode class (⊕ or ⊖) of randomly drawn member of S (under the optimal, shortest-length code) Why? Information theory: optimal length code assigns –log2 p bits to a message having probability p. So, expected number of bits to encode ⊕ or ⊖ of random member of S: p⊕(-log2 p⊕) + p⊖(-log2 p⊖) Entropy(S) ≡ - p⊕ log2 p⊕ - p⊖ log2 p⊖
69
Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A: Gain(S, A) ≡ Entropy(S) - Σ_{v∈Values(A)} (|Sv| / |S|) Entropy(Sv)
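A short Python sketch of both quantities, using a tiny hypothetical sample rather than the full PlayTennis table:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, label_key='PlayTennis'):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)."""
    labels = [e[label_key] for e in examples]
    gain, n = entropy(labels), len(examples)
    for v in set(e[attr] for e in examples):
        subset = [e[label_key] for e in examples if e[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Tiny illustration (hypothetical records, not the full PlayTennis table)
S = [{'Wind': 'Weak', 'PlayTennis': 'Yes'}, {'Wind': 'Strong', 'PlayTennis': 'No'},
     {'Wind': 'Weak', 'PlayTennis': 'Yes'}, {'Wind': 'Strong', 'PlayTennis': 'Yes'}]
print(entropy([e['PlayTennis'] for e in S]))   # 0.811...
print(information_gain(S, 'Wind'))             # 0.311...
```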
70
Training Examples
71
Selecting the Next Attribute(1/2)
Which attribute is the best classifier?
72
Selecting the Next Attribute(2/2)
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970
Gain(Ssunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(Ssunny, Wind) = .970 - (2/5) 1.0 - (3/5) .918 = .019
73
Hypothesis Space Search by ID3(1/2)
74
Hypothesis Space Search by ID3(2/2)
Hypothesis space is complete! → Target function surely in there... Outputs a single hypothesis (which one?) → Can't play 20 questions... No backtracking → Local minima... Statistically-based search choices → Robust to noisy data... Inductive bias: approx “prefer shortest tree”
75
Inductive Bias in ID3 Note H is the power set of instances X
→ Unbiased? Not really... Preference for short trees, and for those with high information gain attributes near the root Bias is a preference for some hypotheses, rather than a restriction of hypothesis space H Occam's razor: prefer the shortest hypothesis that fits the data
76
Occam’s Razor Why prefer short hypotheses? Argument in favor :
Fewer short hyps. than long hyps. → a short hyp that fits data unlikely to be coincidence → a long hyp that fits data might be coincidence Argument opposed : There are many ways to define small sets of hyps e.g., all trees with a prime number of nodes that use attributes beginning with “Z” What's so special about small sets based on size of hypothesis??
77
Overfitting in Decision Trees
Consider adding noisy training example #15: Sunny, Hot, Normal, Strong, PlayTennis = No What effect on earlier tree?
78
Overfitting Consider error of hypothesis h over
training data: errortrain(h) entire distribution D of data: errorD(h) Hypothesis h ∈ H overfits training data if there is an alternative hypothesis h'∈ H such that errortrain(h) < errortrain(h') and errorD(h) > errorD(h')
79
Overfitting in Decision Tree Learning
80
Avoiding Overfitting How can we avoid overfitting?
stop growing when data split not statistically significant grow full tree, then post-prune How to select “best” tree : Measure performance over training data Measure performance over separate validation data set MDL: minimize size(tree) + size(misclassifications(tree))
81
Reduced-Error Pruning
Split data into training and validation set Do until further pruning is harmful: 1. Evaluate impact on validation set of pruning each possible node (plus those below it) 2. Greedily remove the one that most improves validation set accuracy produces smallest version of most accurate subtree What if data is limited?
82
Effect of Reduced-Error Pruning
83
Rule Post-Pruning 1. Convert tree to equivalent set of rules
2. Prune each rule independently of others 3. Sort final rules into desired sequence for use Perhaps most frequently used method (e.g., C4.5 )
84
Converting A Tree to Rules
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes ….
85
Continuous Valued Attributes
Create a discrete attribute to test the continuous value, e.g., for Temperature = 82.5 define (Temperature > 72.3) ∈ {t, f}
86
Attributes with Many Values
Problem: If attribute has many values, Gain will select it Imagine using Date = Jun_3_1996 as attribute One approach: use GainRatio instead: GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A) SplitInformation(S, A) ≡ - Σ_{i=1..c} (|Si| / |S|) log2 (|Si| / |S|) where Si is the subset of S for which A has value vi
87
Attributes with Costs Consider
medical diagnosis, BloodTest has cost $150 robotics, Width_from_1ft has cost 23 sec. How to learn a consistent tree with low expected cost? One approach: replace gain by Tan and Schlimmer (1990): Gain²(S, A) / Cost(A) Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w where w ∈ [0,1] determines importance of cost
88
Unknown Attribute Values
What if some examples are missing values of A? Use the training example anyway, sort through the tree: If node n tests A, assign the most common value of A among the other examples sorted to node n, or assign the most common value of A among the other examples with the same target value, or assign probability pi to each possible value vi of A and assign fraction pi of the example to each descendant in the tree. Classify new examples in the same fashion.
89
Machine Learning Chapter 4. Artificial Neural Networks
Tom M. Mitchell
90
Artificial Neural Networks
Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics
91
Connectionist Models (1/2)
Consider humans: Neuron switching time ~ .001 second Number of neurons ~ 10^10 Connections per neuron ~ 10^4 to 10^5 Scene recognition time ~ .1 second 100 inference steps doesn't seem like enough → much parallel computation
92
Connectionist Models (2/2)
Properties of artificial neural nets (ANNs): Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed processing Emphasis on tuning weights automatically
93
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued Output is a vector of values Possibly noisy data Form of target function is unknown Human readability of result is unimportant Examples: Speech phoneme recognition [Waibel] Image classification [Kanade, Baluja, Rowley] Financial prediction
94
ALVINN drives 70 mph on highways
95
Perceptron o(x1, …, xn) = 1 if w0 + w1x1 + ··· + wnxn > 0, and -1 otherwise. Sometimes we'll use simpler vector notation: o(x) = sgn(w · x), where x0 = 1 and sgn(y) = 1 if y > 0, -1 otherwise
96
Decision Surface of a Perceptron
Represents some useful functions What weights represent g(x1, x2) = AND(x1, x2)? But some functions not representable e.g., not linearly separable Therefore, we’ll want networks of these...
97
Perceptron training rule
wi ← wi + Δwi, where Δwi = η (t – o) xi Where: t = c(x) is target value o is perceptron output η is small constant (e.g., .1) called learning rate Can prove it will converge if training data is linearly separable and η sufficiently small
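A minimal Python sketch of this rule, trained on AND(x1, x2), which is linearly separable, so the loop converges:

```python
def train_perceptron(data, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    Each example is ((x1,...,xn), t) with t in {-1, +1}; w[0] is the bias w0."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in data:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            w[0] += eta * (t - o)                 # bias input x0 = 1
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(and_data))
```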
98
Gradient Descent (1/4) To understand, consider simpler linear unit, where o = w0 + w1x1 + ··· + wnxn Let's learn wi's that minimize the squared error E[w] ≡ (1/2) Σ_{d∈D} (td – od)² where D is the set of training examples
99
Gradient Descent (2/4) Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn] Training rule: Δw = -η ∇E[w], i.e., Δwi = -η ∂E/∂wi
100
Gradient Descent (3/4)
101
Gradient Descent (4/4) Initialize each wi to some small random value
Until the termination condition is met, Do Initialize each Δwi to zero. For each <x, t> in training_examples, Do * Input the instance x to the unit and compute the output o * For each linear unit weight wi, Do Δwi ← Δwi + η (t – o) xi For each linear unit weight wi, Do wi ← wi + Δwi
102
Summary Perceptron training rule guaranteed to succeed if
Training examples are linearly separable Sufficiently small learning rate Linear unit training rule uses gradient descent Guaranteed to converge to hypothesis with minimum squared error Given sufficiently small learning rate Even when training data contains noise Even when training data not separable by H
103
Incremental (Stochastic) Gradient Descent (1/2)
Batch mode Gradient Descent: Do until satisfied 1. Compute the gradient ∇ED[w] 2. w ← w - η ∇ED[w] Incremental mode Gradient Descent: For each training example d in D 1. Compute the gradient ∇Ed[w] 2. w ← w - η ∇Ed[w]
104
Incremental (Stochastic) Gradient Descent (2/2)
Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough
105
Multilayer Networks of Sigmoid Units
106
Sigmoid Unit σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^-x) Nice property: dσ(x)/dx = σ(x) (1 - σ(x)) We can derive gradient descent rules to train One sigmoid unit Multilayer networks of sigmoid units → Backpropagation
We can derive gradient decent rules to train One sigmoid unit Multilayer networks of sigmoid units Backpropagation
107
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi [ (1/2) Σ_{d∈D} (td - od)² ] = Σ_{d∈D} (td - od) (-∂od/∂wi) But we know: ∂od/∂netd = od (1 - od) and ∂netd/∂wi = xi,d So: ∂E/∂wi = - Σ_{d∈D} (td - od) od (1 - od) xi,d
108
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, Do For each training example, Do 1. Input the training example to the network and compute the network outputs 2. For each output unit k: δk ← ok (1 - ok) (tk - ok) 3. For each hidden unit h: δh ← oh (1 - oh) Σ_{k∈outputs} wh,k δk 4. Update each network weight wi,j: wi,j ← wi,j + Δwi,j, where Δwi,j = η δj xi,j
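An illustrative numpy rendering of this loop for one hidden layer, trained on XOR; the architecture (3 hidden units), learning rate, and epoch count are arbitrary choices, and as the next slides note, convergence is to a local minimum, so a different seed may need a restart:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 3 hidden sigmoid units -> 1 sigmoid output
W1, b1 = rng.uniform(-0.5, 0.5, (3, 2)), np.zeros(3)
W2, b2 = rng.uniform(-0.5, 0.5, 3), 0.0
eta = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)              # XOR targets

for _ in range(10000):
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x + b1)                     # forward pass
        o = sigmoid(W2 @ h + b2)
        delta_o = o * (1 - o) * (t - o)              # output unit error term
        delta_h = h * (1 - h) * W2 * delta_o         # hidden unit error terms
        W2 += eta * delta_o * h                      # weight updates
        b2 += eta * delta_o
        W1 += eta * np.outer(delta_h, x)
        b1 += eta * delta_h

print([round(float(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)), 2) for x in X])
# outputs near [0, 1, 1, 0] on a successful run
```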
109
More on Backpropagation
Gradient descent over entire network weight vector Easily generalized to arbitrary directed graphs Will find a local, not necessarily global error minimum In practice, often works well (can run multiple times) Often include weight momentum α: Δwi,j(n) = η δj xi,j + α Δwi,j(n - 1) Minimizes error over training examples Will it generalize well to subsequent examples? Training can take thousands of iterations → slow! Using network after training is very fast
110
Learning Hidden Layer Representations (1/2)
A target function: Can this be learned??
111
Learning Hidden Layer Representations (2/2)
A network: Learned hidden layer representation:
112
Training (1/3)
113
Training (2/3)
114
Training (3/3)
115
Convergence of Backpropagation
Gradient descent to some local minimum Perhaps not global minimum... Add momentum Stochastic gradient descent Train multiple nets with different initial weights Nature of convergence Initialize weights near zero Therefore, initial networks near-linear Increasingly non-linear functions possible as training progresses
116
Expressive Capabilities of ANNs
Boolean functions: Every boolean function can be represented by network with single hidden layer but might require exponential (in number of inputs) hidden units Continuous functions: Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989] Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
117
Overfitting in ANNs (1/2)
118
Overfitting in ANNs (2/2)
119
Neural Nets for Face Recognition
90% accurate learning head pose, and recognizing 1-of-20 faces
120
Learned Hidden Unit Weights
121
Alternative Error Functions
Penalize large weights: Train on target slopes as well as values: Tie together weights: e.g., in phoneme recognition network
122
Recurrent Networks (a) Feedforward network (b) Recurrent network (c) Recurrent network unfolded in time
123
Machine Learning Chapter 5. Evaluating Hypotheses
Tom M. Mitchell
124
Evaluating Hypotheses
Sample error, true error Confidence intervals for observed hypothesis error Estimators Binomial distribution, Normal distribution, Central Limit Theorem Paired t tests Comparing learning methods
125
Two Definitions of Error
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D: errorD(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)] The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies: errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x)) where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise. How well does errorS(h) estimate errorD(h)?
126
Problems Estimating Error
1. Bias: If S is training set, errorS(h) is optimistically biased: bias ≡ E[errorS(h)] - errorD(h) For unbiased estimate, h and S must be chosen independently 2. Variance: Even with unbiased S, errorS(h) may still vary from errorD(h)
127
Example Hypothesis h misclassifies 12 of the 40 examples in S
errorS(h) = 12 / 40 = .30 What is errorD(h) ?
128
Estimators Experiment:
1. choose sample S of size n according to distribution D 2. measure errorS(h) errorS(h) is a random variable (i.e., result of an experiment) errorS(h) is an unbiased estimator for errorD(h) Given observed errorS(h) what can we conclude about errorD(h) ?
129
Confidence Intervals If
S contains n examples, drawn independently of h and each other, and n ≥ 30 Then, with approximately N% probability, errorD(h) lies in interval errorS(h) ± zN sqrt( errorS(h)(1 - errorS(h)) / n ) where
N%: 50%  68%  80%  90%  95%  98%  99%
zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58
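A small Python helper implementing this interval, checked against the 12-of-40 running example on the slides nearby:

```python
import math

def error_confidence_interval(r, n, z=1.96):
    """Two-sided N% CI for errorD(h) given r misclassified out of n examples.
    z defaults to 1.96 (95%); see the z_N table above for other levels."""
    e = r / n                                   # sample error errorS(h)
    half = z * math.sqrt(e * (1 - e) / n)
    return e - half, e + half

# h misclassifies 12 of 40 test examples
print(error_confidence_interval(12, 40))        # approx (0.158, 0.442)
```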
130
errorS(h) is a Random Variable
Rerun the experiment with different randomly drawn S (of size n) Probability of observing r misclassified examples: P(r) = (n choose r) errorD(h)^r (1 - errorD(h))^(n-r)
131
Binomial Probability Distribution
Probability P(r) of r heads in n coin flips, if p = Pr(heads): P(r) = (n choose r) p^r (1 - p)^(n-r) Expected, or mean value of X: E[X] = np Variance of X: Var(X) = np(1 - p) Standard deviation of X: σX = sqrt( np(1 - p) )
132
Normal Distribution Approximates Binomial
errorS(h) follows a Binomial distribution, with mean μ_errorS(h) = errorD(h) and standard deviation σ_errorS(h) = sqrt( errorD(h)(1 - errorD(h)) / n ) Approximate this by a Normal distribution with the same mean and standard deviation
133
Normal Probability Distribution (1/2)
p(x) = (1 / sqrt(2πσ²)) e^( -(x - μ)² / 2σ² ) The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx Expected, or mean value of X: E[X] = μ Variance of X: Var(X) = σ² Standard deviation of X: σX = σ
134
Normal Probability Distribution (2/2)
80% of area (probability) lies in μ ± 1.28σ N% of area (probability) lies in μ ± zN σ
N%: 50%  68%  80%  90%  95%  98%  99%
zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58
135
Confidence Intervals, More Correctly
If S contains n examples, drawn independently of h and each other, and n ≥ 30 Then, with approximately 95% probability, errorS(h) lies in interval errorD(h) ± 1.96 sqrt( errorD(h)(1 - errorD(h)) / n ) equivalently, errorD(h) lies in interval errorS(h) ± 1.96 sqrt( errorD(h)(1 - errorD(h)) / n ) which is approximately errorS(h) ± 1.96 sqrt( errorS(h)(1 - errorS(h)) / n )
136
Central Limit Theorem Consider a set of independent, identically distributed random variables Y1, …, Yn, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean Ȳ ≡ (1/n) Σ_{i=1..n} Yi. Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution, with mean μ and variance σ²/n.
137
Calculating Confidence Intervals
1. Pick parameter p to estimate: errorD(h) 2. Choose an estimator: errorS(h) 3. Determine probability distribution that governs estimator: errorS(h) governed by Binomial distribution, approximated by Normal when n ≥ 30 4. Find interval (L, U) such that N% of probability mass falls in the interval: use table of zN values
138
Difference Between Hypotheses
Test h1 on sample S1, test h2 on S2 1. Pick parameter to estimate: d ≡ errorD(h1) - errorD(h2) 2. Choose an estimator: d̂ ≡ errorS1(h1) – errorS2(h2) 3. Determine probability distribution that governs estimator 4. Find interval (L, U) such that N% of probability mass falls in the interval
139
Paired t test to compare hA, hB
1. Partition data into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30. 2. For i from 1 to k, do: δi ← errorTi(hA) - errorTi(hB) 3. Return the value δ̄ ≡ (1/k) Σ_{i=1..k} δi N% confidence interval estimate for d: δ̄ ± t_{N,k-1} s_δ̄ Note: δi approximately Normally distributed
140
Comparing learning algorithms LA and LB (1/3)
What we'd like to estimate: E_{S⊂D}[errorD(LA(S)) - errorD(LB(S))] where L(S) is the hypothesis output by learner L using training set S i.e., the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D. But, given limited data D0, what is a good estimator? Could partition D0 into training set S0 and test set T0, and measure errorT0(LA(S0)) - errorT0(LB(S0)) Even better, repeat this many times and average the results (next slide)
141
Comparing learning algorithms LA and LB (2/3)
1. Partition data D0 into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30. 2. For i from 1 to k, do: use Ti for the test set, and the remaining data for training set Si Si ← {D0 – Ti} hA ← LA(Si) hB ← LB(Si) δi ← errorTi(hA) - errorTi(hB) 3. Return the value δ̄ ≡ (1/k) Σ_{i=1..k} δi
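A Python sketch of this procedure; the two stand-in learners are hypothetical placeholders, and in practice each fold's test set should hold at least 30 examples as stated above:

```python
import random

def compare_learners(learner_a, learner_b, data, k=5):
    """delta_bar = (1/k) * sum_i [errorTi(hA) - errorTi(hB)] over k-fold splits.
    Each learner maps a training list of (x, y) pairs to a predict function."""
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k disjoint test sets
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        h_a, h_b = learner_a(train), learner_b(train)
        error = lambda h: sum(h(x) != y for x, y in test) / len(test)
        deltas.append(error(h_a) - error(h_b))
    return sum(deltas) / k

# Hypothetical stand-in learners, for illustration only.
def majority_learner(train):
    ys = [y for _, y in train]
    m = max(set(ys), key=ys.count)
    return lambda x: m

def always_false_learner(train):
    return lambda x: False

data = [((i,), i % 3 != 0) for i in range(60)]        # labels are 2/3 True
print(compare_learners(majority_learner, always_false_learner, data))
# negative value: the majority learner has the lower test error
```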
142
Comparing learning algorithms LA and LB (3/3)
Notice we'd like to use the paired t test on δ̄ to obtain a confidence interval, but this is not really correct, because the training sets in this algorithm are not independent (they overlap!) More correct to view the algorithm as producing an estimate of E_{S⊂D0}[errorD(LA(S)) - errorD(LB(S))] instead of E_{S⊂D}[errorD(LA(S)) - errorD(LB(S))] But even this approximation is better than no comparison
143
Machine Learning Chapter 6. Bayesian Learning
Tom M. Mitchell
144
Bayesian Learning Bayes Theorem MAP, ML hypotheses MAP learners
Minimum description length principle Bayes optimal classifier Naive Bayes learner Example: Learning over text data Bayesian belief networks Expectation Maximization algorithm
145
Two Roles for Bayesian Methods
Provides practical learning algorithms: Naive Bayes learning Bayesian belief network learning Combine prior knowledge (prior probabilities) with observed data Requires prior probabilities Provides useful conceptual framework Provides “gold standard” for evaluating other learning algorithms Additional insight into Occam’s razor
146
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
147
Choosing Hypotheses Generally want the most probable hypothesis given the training data. Maximum a posteriori hypothesis hMAP: hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h) If we assume P(hi) = P(hj), we can further simplify and choose the Maximum likelihood (ML) hypothesis: hML = argmax_{hi∈H} P(D|hi)
148
Bayes Theorem Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer. P(cancer) = .008, P(¬cancer) = .992 P(+|cancer) = .98, P(−|cancer) = .02 P(+|¬cancer) = .03, P(−|¬cancer) = .97
149
Basic Formulas for Probabilities
Product Rule: probability P(A ∧ B) of a conjunction of two events A and B: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A) Sum Rule: probability of a disjunction of two events A and B: P(A ∨ B) = P(A) + P(B) - P(A ∧ B) Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1..n} P(Ai) = 1, then P(B) = Σ_{i=1..n} P(B|Ai) P(Ai)
150
Brute Force MAP Hypothesis Learner
For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D) Output the hypothesis hMAP with the highest posterior probability: hMAP = argmax_{h∈H} P(h|D)
151
Relation to Concept Learning(1/2)
Consider our usual concept learning task: instance space X, hypothesis space H, training examples D Consider the Find-S learning algorithm (outputs most specific hypothesis from the version space VS_H,D) What would Bayes rule produce as the MAP hypothesis? Does Find-S output a MAP hypothesis?
152
Relation to Concept Learning(2/2)
Assume fixed set of instances <x1, …, xm> Assume D is the set of classifications: D = <c(x1), …, c(xm)> Choose P(D|h): P(D|h) = 1 if h consistent with D P(D|h) = 0 otherwise Choose P(h) to be uniform distribution: P(h) = 1/|H| for all h in H Then, P(h|D) = 1/|VS_H,D| if h is consistent with D, and 0 otherwise
153
Evolution of Posterior Probabilities
154
Characterizing Learning Algorithms by Equivalent MAP Learners
155
Learning A Real Valued Function(1/2)
Consider any real-valued target function f Training examples <xi, di>, where di is noisy training value di = f(xi) + ei ei is random variable (noise) drawn independently for each xi according to some Gaussian distribution with mean = 0 Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors: hML = argmin_{h∈H} Σ_{i=1..m} (di - h(xi))²
156
Learning A Real Valued Function(2/2)
Maximize natural log of this instead...
157
Learning to Predict Probabilities
Consider predicting survival probability from patient data Training examples <xi, di>, where di is 1 or 0 Want to train neural network to output a probability given xi (not a 0 or 1) In this case can show: hML = argmax_{h∈H} Σ_{i=1..m} di ln h(xi) + (1 - di) ln (1 - h(xi)) Weight update rule for a sigmoid unit: wjk ← wjk + Δwjk, where Δwjk = η Σ_{i=1..m} (di - h(xi)) xijk
158
Minimum Description Length Principle (1/2)
Occam's razor: prefer the shortest hypothesis MDL: prefer the hypothesis h that minimizes hMDL = argmin_{h∈H} LC1(h) + LC2(D|h)    (1) where LC(x) is the description length of x under encoding C Example: H = decision trees, D = training data labels LC1(h) is # bits to describe tree h LC2(D|h) is # bits to describe D given h Note LC2(D|h) = 0 if examples classified perfectly by h; need only describe exceptions Hence hMDL trades off tree size for training errors
159
Minimum Description Length Principle (2/2)
Interesting fact from information theory: The optimal (shortest expected coding length) code for an event with probability p is –log2p bits. So interpret (1): –log2P(h) is length of h under optimal code –log2P(D|h) is length of D given h under optimal code prefer the hypothesis that minimizes length(h) + length(misclassifications)
160
Most Probable Classification of New Instances
So far we've sought the most probable hypothesis given the data D (i.e., hMAP) Given new instance x, what is its most probable classification? hMAP(x) is not the most probable classification! Consider: Three possible hypotheses: P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3 Given new instance x, h1(x) = +, h2(x) = −, h3(x) = − What's most probable classification of x?
161
Bayes Optimal Classifier
Bayes optimal classification: argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) Example: P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1 P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0 P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0 therefore Σ_i P(+|hi) P(hi|D) = .4 and Σ_i P(−|hi) P(hi|D) = .6, so the Bayes optimal classification is −
162
Gibbs Classifier Bayes optimal classifier provides best result, but can be expensive if many hypotheses. Gibbs algorithm: 1. Choose one hypothesis at random, according to P(h|D) 2. Use this to classify new instance Surprising fact: Assume target concepts are drawn at random from H according to priors on H. Then: E[errorGibbs] ≤ 2 E[errorBayesOptimal] Suppose correct, uniform prior distribution over H. Then: Pick any hypothesis from VS, with uniform probability Its expected error no worse than twice Bayes optimal
163
Naive Bayes Classifier (1/2)
Along with decision trees, neural networks, nearest nbr, one of the most practical learning methods. When to use Moderate or large training set available Attributes that describe instances are conditionally independent given classification Successful applications: Diagnosis Classifying text documents
164
Naive Bayes Classifier (2/2)
Assume target function f : X → V, where each instance x is described by attributes <a1, a2 … an>. Most probable value of f(x) is: vMAP = argmax_{vj∈V} P(vj | a1, a2 … an) = argmax_{vj∈V} P(a1, a2 … an | vj) P(vj) Naive Bayes assumption: P(a1, a2 … an | vj) = Π_i P(ai | vj) which gives the Naive Bayes classifier: vNB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)
165
Naive Bayes Algorithm Naive_Bayes_Learn(examples): For each target value vj: P̂(vj) ← estimate P(vj) For each attribute value ai of each attribute a: P̂(ai|vj) ← estimate P(ai|vj) Classify_New_Instance(x): vNB = argmax_{vj∈V} P̂(vj) Π_{ai∈x} P̂(ai|vj)
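A compact Python sketch of both procedures for discrete attributes, estimating the probabilities by raw relative frequencies (no m-estimate smoothing, which a later slide motivates); applied to the 14-row PlayTennis table it should reproduce the computation on the next slide:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """Estimate P(vj) and P(ai|vj) by relative frequencies.
    Each example is (attribute_tuple, target_value)."""
    class_counts = Counter(v for _, v in examples)
    cond = defaultdict(Counter)                  # (position, vj) -> counts of ai
    for x, v in examples:
        for i, a in enumerate(x):
            cond[(i, v)][a] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}
    return priors, cond, class_counts

def naive_bayes_classify(model, x):
    """vNB = argmax_vj P(vj) * prod_i P(ai|vj)."""
    priors, cond, class_counts = model
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond[(i, v)][a] / class_counts[v]   # 0 if ai never seen with vj
        return p
    return max(priors, key=score)
```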
166
Naive Bayes: Example Consider PlayTennis again, and new instance
<Outlk = sun, Temp = cool, Humid = high, Wind = strong> Want to compute vNB = argmax_{vj} P(vj) Π_i P(ai|vj): P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005 P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021 → vNB = n
167
Naive Bayes: Subtleties (1/2)
1. Conditional independence assumption is often violated ...but it works surprisingly well anyway. Note we don't need estimated posteriors to be correct; need only that argmax_{vj} P̂(vj) Π_i P̂(ai|vj) = argmax_{vj} P(vj) P(a1, …, an|vj) See [Domingos & Pazzani, 1996] for analysis Naive Bayes posteriors often unrealistically close to 1 or 0
168
Naive Bayes: Subtleties (2/2)
2. What if none of the training instances with target value vj have attribute value ai? Then P̂(ai|vj) = 0, and P̂(vj) Π_i P̂(ai|vj) = 0 Typical solution is Bayesian estimate: P̂(ai|vj) ← (nc + mp) / (n + m) where n is number of training examples for which v = vj nc is number of examples for which v = vj and a = ai p is prior estimate for P̂(ai|vj) m is weight given to prior (i.e., number of “virtual” examples)
169
Learning to Classify Text (1/4)
Why? Learn which news articles are of interest Learn to classify web pages by topic Naive Bayes is among most effective algorithms What attributes shall we use to represent text documents??
170
Learning to Classify Text (2/4)
Target concept Interesting? : Document → {⊕, ⊖} 1. Represent each document by vector of words: one attribute per word position in document 2. Learning: use training examples to estimate P(⊕), P(⊖), P(doc|⊕), P(doc|⊖) Naive Bayes conditional independence assumption: P(doc|vj) = Π_{i=1..length(doc)} P(ai = wk | vj) where P(ai = wk | vj) is probability that word in position i is wk, given vj One more assumption: P(ai = wk | vj) = P(am = wk | vj), ∀i, m (word position doesn't matter)
171
Learning to Classify Text (3/4)
LEARN_NAIVE_BAYES_TEXT (Examples, V) 1. Collect all words and other tokens that occur in Examples: Vocabulary ← all distinct words and other tokens in Examples 2. Calculate the required P(vj) and P(wk|vj) probability terms: For each target value vj in V do docsj ← subset of Examples for which the target value is vj P(vj) ← |docsj| / |Examples| Textj ← a single document created by concatenating all members of docsj
172
Learning to Classify Text (4/4)
n ← total number of words in Textj (counting duplicate words multiple times) For each word wk in Vocabulary * nk ← number of times word wk occurs in Textj * P(wk|vj) ← (nk + 1) / (n + |Vocabulary|) CLASSIFY_NAIVE_BAYES_TEXT (Doc) positions ← all word positions in Doc that contain tokens found in Vocabulary Return vNB, where vNB = argmax_{vj∈V} P(vj) Π_{i∈positions} P(ai|vj)
173
Twenty NewsGroups Given 1000 training documents from each group Learn to classify new documents according to which newsgroup it came from Naive Bayes: 89% classification accuracy comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey alt.atheism soc.religion.christian talk.religion.misc talk.politics.mideast talk.politics.misc talk.politics.guns sci.space sci.crypt sci.electronics sci.med
174
Learning Curve for 20 Newsgroups
Accuracy vs. Training set size (1/3 withheld for test)
175
Bayesian Belief Networks
Interesting because: Naive Bayes assumption of conditional independence too restrictive But it’s intractable without some such assumptions... Bayesian Belief networks describe conditional independence among subsets of variables allows combining prior knowledge about (in)dependencies among variables with observed training data (also called Bayes Nets)
176
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if (∀xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk) more compactly, we write P(X|Y, Z) = P(X|Z) Example: Thunder is conditionally independent of Rain, given Lightning: P(Thunder|Rain, Lightning) = P(Thunder|Lightning) Naive Bayes uses cond. indep. to justify: P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)
177
Bayesian Belief Network (1/2)
Network represents a set of conditional independence assertions: Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. Directed acyclic graph
178
Bayesian Belief Network (2/2)
Represents joint probability distribution over all variables e.g., P(Storm, BusTourGroup, …, ForestFire) In general, P(y1, …, yn) = Π_{i=1..n} P(yi | Parents(Yi)) where Parents(Yi) denotes immediate predecessors of Yi in graph So, joint distribution is fully defined by graph, plus the P(yi | Parents(Yi))
179
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others? Bayes net contains all information needed for this inference If only one variable with unknown value, easy to infer it In general case, problem is NP hard In practice, can succeed in many cases Exact inference methods work well for some network structures Monte Carlo methods “simulate” the network randomly to calculate approximate solutions
180
Learning of Bayesian Networks
Several variants of this learning task Network structure might be known or unknown Training examples might provide values of all network variables, or just some If structure known and observe all variables Then it’s easy as training a Naive Bayes classifier
181
Learning Bayes Nets Suppose structure known, variables partially observable e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire... Similar to training neural network with hidden units In fact, can learn network conditional probability tables using gradient ascent! Converge to network h that (locally) maximizes P(D|h)
182
Gradient Ascent for Bayes Nets
Let wijk denote one entry in the conditional probability table for variable Yi in the network: wijk = P(Yi = yij | Parents(Yi) = the list uik of values) e.g., if Yi = Campfire, then uik might be <Storm = T, BusTourGroup = F> Perform gradient ascent by repeatedly: 1. Update all wijk using training data D: wijk ← wijk + η Σ_{d∈D} Ph(Yi = yij, Ui = uik | d) / wijk 2. Then, renormalize the wijk to assure Σ_j wijk = 1 and 0 ≤ wijk ≤ 1
183
More on Learning Bayes Nets
EM algorithm can also be used. Repeatedly: 1. Calculate probabilities of unobserved variables, assuming h 2. Calculate new wijk to maximize E[ln P(D|h)], where D now includes both observed and (calculated probabilities of) unobserved variables When structure unknown... Algorithms use greedy search to add/subtract edges and nodes Active research topic
184
Summary: Bayesian Belief Networks
Combine prior knowledge with observed data Impact of prior knowledge (when correct!) is to lower the sample complexity Active research area Extend from boolean to real-valued variables Parameterized distributions instead of tables Extend to first-order instead of propositional systems More effective inference methods …
185
Expectation Maximization (EM)
When to use: Data is only partially observable Unsupervised clustering (target value unobservable) Supervised learning (some instance attributes unobservable) Some uses: Train Bayesian Belief Networks Unsupervised clustering (AUTOCLASS) Learning Hidden Markov Models
186
Generating Data from Mixture of k Gaussians
Each instance x generated by 1. Choosing one of the k Gaussians with uniform probability 2. Generating an instance at random according to that Gaussian
187
EM for Estimating k Means (1/2)
Given: Instances from X generated by mixture of k Gaussian distributions Unknown means <μ1, …, μk> of the k Gaussians Don't know which instance xi was generated by which Gaussian Determine: Maximum likelihood estimates of <μ1, …, μk> Think of full description of each instance as yi = <xi, zi1, zi2>, where zij is 1 if xi was generated by the jth Gaussian; xi observable, zij unobservable
188
EM for Estimating k Means (2/2)
EM Algorithm: Pick random initial h = <μ1, μ2>, then iterate: E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds. M step: Calculate a new maximum likelihood hypothesis h' = <μ'1, μ'2>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated above. Replace h = <μ1, μ2> by h' = <μ'1, μ'2>.
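An illustrative Python version of these two steps for k = 2 one-dimensional Gaussians of known, equal variance; the data below are synthetic:

```python
import math
import random

def em_two_means(xs, sigma=1.0, iters=50):
    """EM for a mixture of two 1-D Gaussians with known, equal variance."""
    mu = random.sample(xs, 2)                   # random initial h = <mu1, mu2>
    for _ in range(iters):
        # E step: E[zij] proportional to exp(-(xi - muj)^2 / (2 sigma^2))
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M step: muj <- mean of the xi weighted by E[zij]
        mu = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
              for j in range(2)]
    return mu

random.seed(1)
data = ([random.gauss(0, 1) for _ in range(200)] +
        [random.gauss(5, 1) for _ in range(200)])
print(sorted(em_two_means(data)))               # approximately [0, 5]
```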
189
EM Algorithm Converges to local maximum likelihood h and provides estimates of hidden variables zij In fact, local maximum in E[ln P(Y|h)] Y is complete (observable plus unobservable variables) data Expected value is taken over possible values of unobserved variables in Y
190
General EM Problem Given:
Observed data X = {x1, …, xm} Unobserved data Z = {z1, …, zm} Parameterized probability distribution P(Y|h), where Y = {y1, …, ym} is the full data, yi = xi ∪ zi, and h are the parameters Determine: h that (locally) maximizes E[ln P(Y|h)] Many uses: Train Bayesian belief networks Unsupervised clustering (e.g., k means) Hidden Markov Models
191
General EM Method Define likelihood function Q(h'|h), which calculates Y = X ∪ Z using observed X and current parameters h to estimate Z: Q(h'|h) ← E[ln P(Y|h') | h, X] EM Algorithm: Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y. Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function.
192
Machine Learning Chapter 7. Computational Learning Theory
Tom M. Mitchell
193
Computational Learning Theory (1/2)
Setting 1: learner poses queries to teacher Setting 2: teacher chooses examples Setting 3: randomly generated instances, labeled by teacher Probably approximately correct (PAC) learning Vapnik-Chervonenkis Dimension Mistake bounds
194
Computational Learning Theory (2/2)
What general laws constrain inductive learning? We seek theory to relate: Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples presented
195
Prototypical Concept Learning Task
Given: Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast Target function c: EnjoySport : X → {0, 1} Hypotheses H: conjunctions of literals, e.g. <?, Cold, High, ?, ?, ?> Training examples D: positive and negative examples of the target function <x1, c(x1)>, … <xm, c(xm)> Determine: A hypothesis h in H such that h(x) = c(x) for all x in D? A hypothesis h in H such that h(x) = c(x) for all x in X?
196
Sample Complexity How many training examples are sufficient to learn the target concept? 1. If learner proposes instances, as queries to teacher Learner proposes instance x, teacher provides c(x) 2. If teacher (who knows c) provides training examples teacher provides sequence of examples of form <x, c(x)> 3. If some random process (e.g., nature) proposes instances instance x generated randomly, teacher provides c(x)
197
Sample Complexity: 1 Learner proposes instance x, teacher provides c(x) (assume c is in learner’s hypothesis space H) Optimal query strategy: play 20 questions pick instance x such that half of hypotheses in V S classify x positive, half classify x negative When this is possible, need log2 |H| queries to learn c when not possible, need even more
198
Sample Complexity: 2 Teacher (who knows c) provides training examples (assume c is in learner's hypothesis space H) Optimal teaching strategy: depends on H used by learner Consider the case H = conjunctions of up to n boolean literals and their negations e.g., (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, … each have 2 possible values. If n possible boolean attributes in H, n + 1 examples suffice. Why?
199
Sample Complexity: 3 Given:
set of instances X set of hypotheses H set of possible target concepts C training instances generated by a fixed, unknown probability distribution D over X Learner observes a sequence D of training examples of form <x, c(x)>, for some target concept c ∈ C: instances x are drawn from distribution D, teacher provides target value c(x) for each Learner must output a hypothesis h estimating c h is evaluated by its performance on subsequent instances drawn according to D Note: probabilistic instances, noise-free classifications
200
True Error of a Hypothesis
Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D: errorD(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]
201
Two Notions of Error Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over training instances True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances Our concern: Can we bound the true error of h given the training error of h? First consider when training error of h is zero (i.e., h ∈ VS_H,D)
202
Exhausting the Version Space
Definition: The version space VS_H,D is said to be ε-exhausted with respect to c and D, if every hypothesis h in VS_H,D has error less than ε with respect to c and D: (∀h ∈ VS_H,D) errorD(h) < ε
203
How many examples will ε-exhaust the VS?
Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than |H| e^(-εm) Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε If we want this probability to be below δ: |H| e^(-εm) ≤ δ then m ≥ (1/ε)(ln |H| + ln (1/δ))
204
Learning Conjunctions of Boolean Literals
How many examples are sufficient to assure with probability at least (1 - δ) that every h in VS_H,D satisfies errorD(h) ≤ ε? Use our theorem: m ≥ (1/ε)(ln |H| + ln (1/δ)) Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and m ≥ (1/ε)(ln 3^n + ln (1/δ)), or m ≥ (1/ε)(n ln 3 + ln (1/δ))
205
How About EnjoySport? If H is as given in EnjoySport then |H| = 973, and ... if we want to assure that with probability 95%, VS contains only hypotheses with errorD(h) ≤ .1, then it is sufficient to have m examples, where m ≥ (1/.1)(ln 973 + ln (1/.05)) ≈ 98.8
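This bound is easy to evaluate directly; a small Python helper, reproducing the ≈ 99 figure above:

```python
import math

def sample_complexity(h_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), from |H| e^(-eps*m) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# EnjoySport: |H| = 973, want errorD(h) <= 0.1 with probability 95%
print(sample_complexity(973, eps=0.1, delta=0.05))      # 99

# Conjunctions of n boolean literals: |H| = 3^n, here n = 10
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))  # 140
```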
206
PAC Learning Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H. Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
207
Agnostic Learning So far, assumed c ∈ H
Agnostic learning setting: don't assume c ∈ H What do we want then? The hypothesis h that makes fewest errors on training data. What is sample complexity in this case? m ≥ (1/(2ε²))(ln |H| + ln (1/δ)), derived from Hoeffding bounds: Pr[errorD(h) > errorS(h) + ε] ≤ e^(-2mε²)
208
Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets. Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
209
Three Instances Shattered
Instance space X
210
The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.
211
VC Dim. of Linear Decision Surfaces
212
Sample Complexity from VC Dimension
How many randomly drawn examples suffice to ε-exhaust VS_H,D with probability at least (1 - δ)? m ≥ (1/ε)(4 log2 (2/δ) + 8 VC(H) log2 (13/ε))
213
Mistake Bounds So far: how many examples needed to learn?
What about: how many mistakes before convergence? Let’s consider similar setting to PAC learning: Instances drawn at random from X according to distribution D Learner must classify each instance before receiving correct classification from teacher Can we bound the number of mistakes learner makes before converging?
214
Mistake Bounds: Find-S
Consider Find-S when H = conjunction of boolean literals Find-S: Initialize h to the most specific hypothesis l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 … ln ∧ ¬ln For each positive training instance x: remove from h any literal that is not satisfied by x Output hypothesis h. How many mistakes before converging to correct h?
215
Mistake Bounds: Halving Algorithm
Consider the Halving Algorithm: Learn concept using version space Candidate-Elimination algorithm Classify new instances by majority vote of version space members How many mistakes before converging to correct h? ... in worst case? ... in best case?
216
Optimal Mistake Bounds
Let MA(C) be the max number of mistakes made by algorithm A to learn concepts in C (maximum over all possible c ∈ C, and all possible training sequences). Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of MA(C): Opt(C) ≡ min_{A∈learning algorithms} MA(C)
217
Machine Learning Chapter 8. Instance-Based Learning
Tom M. Mitchell
218
Instance Based Learning (1/2)
k-Nearest Neighbor Locally weighted regression Radial basis functions Case-based reasoning Lazy and eager learning
219
Instance-Based Learning (2/2)
Key idea: just store all training examples <xi, f(xi)> Nearest neighbor: Given query instance xq, first locate nearest training example xn, then estimate f̂(xq) ← f(xn) k-Nearest neighbor: Given xq, take vote among its k nearest neighbors (if discrete-valued target function), or take mean of f values of k nearest neighbors (if real-valued): f̂(xq) ← Σ_{i=1..k} f(xi) / k
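A minimal Python sketch of both variants under Euclidean distance; the toy training set is hypothetical:

```python
import math
from collections import Counter

def knn_classify(training, xq, k=3):
    """Discrete-valued target: majority vote among the k nearest neighbors.
    training: list of (point_tuple, label); distance is Euclidean."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_regress(training, xq, k=3):
    """Real-valued target: mean of the k nearest neighbors' f values."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    return sum(f for _, f in nearest) / k

train = [((0, 0), 'neg'), ((0, 1), 'neg'), ((5, 5), 'pos'),
         ((6, 5), 'pos'), ((5, 6), 'pos')]
print(knn_classify(train, (4, 4)))   # 'pos'
```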
220
When To Consider Nearest Neighbor
Instances map to points in Rn Less than 20 attributes per instance Lots of training data Advantages: Training is very fast Learn complex target functions Don’t lose information Disadvantages: Slow at query time Easily fooled by irrelevant attributes
221
Voronoi Diagram
222
Behavior in the Limit Consider p(x) defines probability that instance x will be labeled 1 (positive) versus 0 (negative). Nearest neighbor: As number of training examples → ∞, approaches Gibbs Algorithm Gibbs: with probability p(x) predict 1, else 0 k-Nearest neighbor: As number of training examples → ∞ and k gets large, approaches Bayes optimal Bayes optimal: if p(x) > .5 then predict 1, else 0 Note Gibbs has at most twice the expected error of Bayes optimal
223
Distance-Weighted kNN
Might want to weight nearer neighbors more heavily: f̂(xq) ← Σ_{i=1..k} wi f(xi) / Σ_{i=1..k} wi where wi ≡ 1 / d(xq, xi)² and d(xq, xi) is distance between xq and xi Note: now it makes sense to use all training examples instead of just k → Shepard's method
224
Curse of Dimensionality
Imagine instances described by 20 attributes, but only 2 are relevant to target function Curse of dimensionality: nearest nbr is easily mislead when high-dimensional X One approach: Stretch jth axis by weight zj, where z1,…, zn chosen to minimize prediction error Use cross-validation to automatically choose weights z1,…, zn Note setting zj to zero eliminates this dimension altogether see [Moore and Lee, 1994]
225
Locally Weighted Regression
Note kNN forms local approximation to f for each query point xq Why not form an explicit approximation f̂(x) for region surrounding xq? Fit linear function to k nearest neighbors Fit quadratic, ... Produces “piecewise approximation” to f Several choices of error to minimize: Squared error over k nearest neighbors Distance-weighted squared error over all neighbors
226
Radial Basis Function Networks
Global approximation to target function, in terms of linear combination of local approximations Used, e.g., for image classification A different kind of neural network Closely related to distance-weighted regression, but “eager” instead of “lazy”
227
Radial Basis Function Networks
f(x) = w0 + Σ_{u=1..k} wu Ku(d(xu, x)) where ai(x) are the attributes describing instance x. One common choice for Ku(d(xu, x)) is the Gaussian kernel Ku(d(xu, x)) = e^( -d²(xu, x) / 2σu² )
228
Training Radial Basis Function Networks
Q1: What xu to use for each kernel function Ku(d(xu, x)) Scatter uniformly throughout instance space Or use training instances (reflects instance distribution) Q2: How to train weights (assume here Gaussian Ku) First choose variance (and perhaps mean) for each Ku e.g., use EM Then hold Ku fixed, and train linear output layer efficient methods to fit linear function
229
Case-Based Reasoning Can apply instance-based learning even when X ≠ ℝ^n → need different “distance” metric Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions
230
Case-Based Reasoning in CADET (1/3)
CADET: 75 stored examples of mechanical devices each training example: < qualitative function, mechanical structure > new query: desired function, target value: mechanical structure for this function Distance metric: match qualitative function descriptions
231
Case-Based Reasoning in CADET (2/3)
A stored case: T-junction pipe A problem specification: Water faucet
232
Case-Based Reasoning in CADET (3/3)
Instances represented by rich structural descriptions Multiple cases retrieved (and combined) to form solution to new problem Tight coupling between case retrieval and problem solving Bottom line: Simple matching of cases useful for tasks such as answering help-desk queries Area of ongoing research
233
Lazy and Eager Learning
Lazy: wait for query before generalizing k-Nearest Neighbor, Case based reasoning Eager: generalize before seeing query Radial basis function networks, ID3, Backpropagation, NaiveBayes, . . . Does it matter? Eager learner must create global approximation Lazy learner can create many local approximations if they use same H, lazy can represent more complex fns (e.g., consider H = linear functions)
234
Machine Learning Chapter 9. Genetic Algorithm
Tom M. Mitchell
235
Genetic Algorithms Evolutionary computation Prototypical GA
An example: GABIL Genetic Programming Individual learning and population evolution
236
Evolutionary Computation
Computational procedures patterned after biological evolution Search procedure that probabilistically applies search operators to set of points in the search space
237
Biological Evolution (1/3)
Lamarck and others: species "transmute" over time
Darwin and Wallace: consistent, heritable variation among individuals in a population; natural selection of the fittest
Mendel and genetics: a mechanism for inheriting traits; genotype → phenotype mapping
238
Biological Evolution (2/3)
GA(Fitness, Fitness_threshold, p, r, m)
Initialize: P ← p random hypotheses
Evaluate: for each h in P, compute Fitness(h)
While [maxh Fitness(h)] < Fitness_threshold:
1. Select: probabilistically select (1 − r)·p members of P to add to Ps.
239
Biological Evolution (3/3)
2. Crossover: probabilistically select r·p/2 pairs of hypotheses from P. For each pair ⟨h1, h2⟩, produce two offspring by applying the Crossover operator. Add all offspring to Ps.
3. Mutate: invert a randomly selected bit in m·p random members of Ps
4. Update: P ← Ps
5. Evaluate: for each h in P, compute Fitness(h)
Return the hypothesis from P that has the highest fitness.
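The loop above as a minimal Python sketch over bitstring hypotheses (fitness-proportionate selection, single-point crossover; the parameter defaults are illustrative):

import random

def ga(fitness, p=100, r=0.6, m=0.05, length=20, generations=50):
    # fitness must return a non-negative score for a list of 0/1 bits
    P = [[random.randint(0, 1) for _ in range(length)] for _ in range(p)]
    for _ in range(generations):
        scores = [fitness(h) for h in P]
        def select():                                    # fitness-proportionate
            return random.choices(P, weights=scores)[0]
        Ps = [select()[:] for _ in range(int((1 - r) * p))]   # 1. Select
        for _ in range(int(r * p / 2)):                       # 2. Crossover
            h1, h2 = select(), select()
            c = random.randrange(1, length)                   # single-point
            Ps += [h1[:c] + h2[c:], h2[:c] + h1[c:]]
        for h in random.sample(Ps, int(m * p)):               # 3. Mutate
            h[random.randrange(length)] ^= 1                  # invert one bit
        P = Ps                                                # 4. Update
    return max(P, key=fitness)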
240
Representing Hypotheses
Represent
(Outlook = Overcast ∨ Rain) ∧ (Wind = Strong)
by the bitstring: Outlook 011, Wind 10
Represent
IF Wind = Strong THEN PlayTennis = yes
by: Outlook 111, Wind 10, PlayTennis 10
241
Operators for Genetic Algorithms
(figure lost) The slide showed initial strings, crossover masks, and the resulting offspring for each operator: single-point crossover, two-point crossover, uniform crossover, and point mutation. In each case the mask determines which offspring bits come from which parent; see the sketch below.
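All three crossover operators are instances of one mask-based recombination; a small Python sketch (the masks shown are illustrative):

import random

def mask_crossover(h1, h2, mask):
    # offspring1 copies h1 where the mask bit is 1 and h2 where it is 0;
    # offspring2 takes the complementary bits
    o1 = [a if m else b for a, b, m in zip(h1, h2, mask)]
    o2 = [b if m else a for a, b, m in zip(h1, h2, mask)]
    return o1, o2

n = 11
single_point = [1]*5 + [0]*(n - 5)            # e.g., 11111000000
two_point    = [1]*2 + [0]*5 + [1]*(n - 7)    # e.g., 11000001111
uniform      = [random.randint(0, 1) for _ in range(n)]

def point_mutation(h):
    h = h[:]
    h[random.randrange(len(h))] ^= 1          # invert one random bit
    return h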
242
Selecting Most Fit Hypotheses
Fitness proportionate selection: Pr(hi) = Fitness(hi) / Σj Fitness(hj) ... can lead to crowding
Tournament selection: pick h1, h2 at random with uniform probability; with probability p, select the more fit
Rank selection: sort all hypotheses by fitness; probability of selection is proportional to rank
243
GABIL [DeJong et al. 1993]
Learn a disjunctive set of propositional rules, competitive with C4.5
Fitness: Fitness(h) = (correct(h))²
Representation:
IF a1 = T ∧ a2 = F THEN c = T; IF a2 = T THEN c = F
is represented by: a1 10, a2 01, c 1; a1 11, a2 10, c 0
Genetic operators: ???
want variable-length rule sets
want only well-formed bitstring hypotheses
244
Crossover with Variable-Length Bitstrings
Start with the two variable-length parent strings (the rule-set bitstrings of the previous slide).
1. Choose crossover points for h1, e.g., after bits 1 and 8.
2. Now restrict points in h2 to those that produce bitstrings with well-defined semantics, e.g., ⟨1, 3⟩, ⟨1, 8⟩, ⟨6, 8⟩.
If we choose ⟨1, 3⟩, the result is two offspring rule sets of different lengths.
245
GABIL Extensions Add new genetic operators, also applied probabilistically: 1. AddAlternative: generalize constraint on ai by changing a 0 to 1 2. DropCondition: generalize constraint on ai by changing every 0 to 1 And, add new field to bitstring to determine whether to allow these So now the learning strategy also evolves!
246
GABIL Results
Performance of GABIL comparable to symbolic rule/tree learning methods C4.5, ID5R, AQ14
Average performance on a set of 12 synthetic problems:
GABIL without AA and DC operators: 92.1% accuracy
GABIL with AA and DC operators: 95.2% accuracy
symbolic learning methods ranged from 91.2% to 96.6%
247
Schemas How to characterize evolution of population in GA?
Schema = string containing 0, 1, * ("don't care")
Typical schema: 10**0*
Instances of the above schema: e.g., 101101, 100000, ...
Characterize the population by the number of instances representing each possible schema
m(s, t) = number of instances of schema s in the population at time t
248
Consider Just Selection
f̄(t) = average fitness of population at time t
m(s, t) = number of instances of schema s in population at time t
û(s, t) = average fitness of instances of s at time t
Probability of selecting h in one selection step:
Pr(h) = f(h) / Σi=1..p f(hi) = f(h) / (p · f̄(t))
Probability of selecting an instance of s in one step:
Pr(h ∈ s) = (m(s, t) / p) · (û(s, t) / f̄(t))
Expected number of instances of s after p selections (one generation):
E[m(s, t+1)] = (û(s, t) / f̄(t)) · m(s, t)
249
Schema Theorem
E[m(s, t+1)] ≥ (û(s, t) / f̄(t)) · m(s, t) · (1 − pc · d(s)/(l − 1)) · (1 − pm)^o(s)
where m(s, t) = number of instances of schema s in population at time t; f̄(t) = average fitness of population at time t; û(s, t) = average fitness of instances of s at time t; pc = probability of the single-point crossover operator; pm = probability of the mutation operator; l = length of the individual bit strings; o(s) = number of defined (non-"*") bits in s; d(s) = distance between the leftmost and rightmost defined bits in s
250
Genetic Programming Population of programs represented by trees
251
Crossover
252
Block Problem (1/2) Goal: spell UNIVERSAL Terminals:
CS (“current stack”) = name of the top block on stack, or F. TB (“top correct block”) = name of topmost correct block on stack NN (“next necessary”) = name of the next block needed above TB in the stack
253
Block Problem (2/2) Primitive functions:
(MS x): (“move to stack”), if block x is on the table, moves x to the top of the stack and returns the value T. Otherwise, does nothing and returns the value F. (MT x): (“move to table”), if block x is somewhere in the stack, moves the block at the top of the stack to the table and returns the value T. Otherwise, returns F. (EQ x y): (“equal”), returns T if x equals y, and returns F otherwise. (NOT x): returns T if x = F, else returns F (DU x y): (“do until”) executes the expression x repeatedly until expression y returns the value T
254
Learned Program
Trained to fit 166 test problems. Using a population of 300 programs, found this after 10 generations:
(EQ (DU (MT CS)(NOT CS)) (DU (MS NN)(NOT NN)))
255
Genetic Programming
More interesting example: design electronic filter circuits
Individuals are programs that transform a beginning circuit to a final circuit by adding/subtracting components and connections
Used a population of 640,000, run on a 64-node parallel processor
Discovers circuits competitive with the best human designs
256
GP for Classifying Images
[Teller and Veloso, 1997]
Fitness: based on coverage and accuracy
Representation:
Primitives include Add, Sub, Mult, Div, Not, Max, Min, Read, Write, If-Then-Else, Either, Pixel, Least, Most, Ave, Variance, Difference, Mini, Library
Mini refers to a local subroutine that is separately co-evolved
Library refers to a global library subroutine (evolved by selecting the most useful Minis)
Genetic operators:
Crossover, mutation
Create "mating pools" and use rank-proportionate reproduction
257
Biological Evolution Lamarck (19th century)
Believed individual genetic makeup was altered by lifetime experience But current evidence contradicts this view What is the impact of individual learning on population evolution?
258
Baldwin Effect (1/2)
Assume: individual learning has no direct influence on individual DNA, but the ability to learn reduces the need to "hard wire" traits in the DNA.
Then: the ability of individuals to learn will support a more diverse gene pool (because learning allows individuals with various "hard wired" traits to succeed), and a more diverse gene pool will support faster evolution of the gene pool.
So individual learning (indirectly) increases the rate of evolution.
259
Baldwin Effect (2/2) Plausible example:
1. New predator appears in environment 2. Individuals who can learn (to avoid it) will be selected 3. Increase in learning individuals will support more diverse gene pool 4. resulting in faster evolution 5. possibly resulting in new non-learned traits such as instinctive fear of predator
260
Computer Experiments on Baldwin Effect
[Hinton and Nowlan, 1987] Evolve simple neural networks: Some network weights fixed during lifetime, others trainable Genetic makeup determines which are fixed, and their weight values Results: With no individual learning, population failed to improve over time When individual learning allowed Early generations: population contained many individuals with many trainable weights Later generations: higher fitness, while number of trainable weights decreased
261
Summary: Evolutionary Programming
Conduct randomized, parallel, hill-climbing search through H
Approach learning as an optimization problem (optimize fitness)
Nice feature: evaluation of Fitness can be very indirect
consider learning a rule set for multistep decision making
no issue of assigning credit/blame to individual steps
262
Machine Learning Chapter 10. Learning Sets of Rules
Tom M. Mitchell
263
Learning Disjunctive Sets of Rules
Method 1: Learn decision tree, convert to rules Method 2: Sequential covering algorithm: 1. Learn one rule with high accuracy, any coverage 2. Remove positive examples covered by this rule 3. Repeat
264
Sequential Covering Algorithm
SEQUENTIAL-COVERING(Target_attribute, Attributes, Examples, Threshold)
Learned_rules ← {}
Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
while PERFORMANCE(Rule, Examples) > Threshold, do
  Learned_rules ← Learned_rules + Rule
  Examples ← Examples − {examples correctly classified by Rule}
  Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)
Learned_rules ← sort Learned_rules according to PERFORMANCE over Examples
return Learned_rules
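The same algorithm as a Python skeleton; learn_one_rule, performance, and the rule.covers interface are caller-supplied stand-ins for illustration, not a fixed API:

def sequential_covering(target, attributes, examples, threshold,
                        learn_one_rule, performance):
    learned_rules = []
    rule = learn_one_rule(target, attributes, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # drop the examples this rule already classifies correctly
        examples = [e for e in examples if not rule.covers(e)]
        rule = learn_one_rule(target, attributes, examples)
    learned_rules.sort(key=lambda rl: performance(rl, examples), reverse=True)
    return learned_rules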
265
Learn-One-Rule
266
Learn-One-Rule(Cont.)
Pos ← positive Examples
Neg ← negative Examples
while Pos ≠ {}, do (learn a NewRule)
- NewRule ← most general rule possible
- NewRuleNeg ← Neg
- while NewRuleNeg ≠ {}, do (add a new literal to specialize NewRule)
  1. Candidate_literals ← generate candidates
  2. Best_literal ← argmax over L ∈ Candidate_literals of Performance(SpecializeRule(NewRule, L))
  3. add Best_literal to NewRule preconditions
  4. NewRuleNeg ← subset of NewRuleNeg that satisfies NewRule preconditions
- Learned_rules ← Learned_rules + NewRule
- Pos ← Pos − {members of Pos covered by NewRule}
Return Learned_rules
267
Subtleties: Learn One Rule
1. May use beam search
2. Easily generalizes to multi-valued target functions
3. Choose evaluation function to guide search:
Entropy (i.e., information gain)
Sample accuracy: nc / n, where nc = correct rule predictions, n = all predictions
m-estimate: (nc + m·p) / (n + m), where p is the prior probability of the predicted class and m weights the prior
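The three evaluation functions as small Python helpers (a hedged sketch; the counts are taken as given):

import math

def sample_accuracy(n_c, n):
    return n_c / n                       # nc correct out of n predictions

def m_estimate(n_c, n, p, m):
    # p = prior probability of the predicted class, m = strength of the prior
    return (n_c + m * p) / (n + m)

def entropy(class_counts):
    # entropy of the examples covered by a rule (lower is better)
    total = sum(class_counts)
    return -sum(c/total * math.log2(c/total) for c in class_counts if c)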
268
Variants of Rule Learning Programs
Sequential or simultaneous covering of data?
General → specific, or specific → general?
Generate-and-test, or example-driven?
Whether and how to post-prune?
What statistical evaluation function?
269
Learning First Order Rules
Why do that? Can learn sets of rules such as
Ancestor(x, y) ← Parent(x, y)
Ancestor(x, y) ← Parent(x, z) ∧ Ancestor(z, y)
General-purpose programming language PROLOG: programs are sets of such rules
270
First Order Rule for Classifying Web Pages
[Slattery, 1997]
course(A) ← has-word(A, instructor), ¬ has-word(A, good), link-from(A, B), has-word(B, assign), ¬ link-from(B, C)
Train: 31/31, Test: 31/34
272
Specializing Rules in FOIL
273
Information Gain in FOIL
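The slide's figure is lost. For reference, the chapter's gain measure when specializing rule R to R' by adding literal L, where p0, n0 (and p1, n1) are the positive and negative bindings of R (and R'), and t is the number of positive bindings of R still covered by R':

Foil_Gain(L, R) ≡ t ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )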
274
Induction as Inverted Deduction
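The slide's figure is lost. The chapter frames the task as follows: find a hypothesis h such that, together with background knowledge B, each training instance's classification follows deductively:

(∀ ⟨xi, f(xi)⟩ ∈ D) (B ∧ h ∧ xi) ⊢ f(xi)

Induction then amounts to inverting this entailment relation, which motivates the inverse resolution operators discussed below.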
275
Induction as Inverted Deduction(Cont’)
276
Induction as Inverted Deduction(Cont’)
Induction is, in fact, the inverse operation of deduction, and cannot be conceived to exist without the corresponding operation, so that the question of relative importance cannot arise. Who thinks of asking whether addition or subtraction is the more important process in arithmetic? But at the same time much difference in difficulty may exist between a direct and inverse operation; … it must be allowed that inductive investigations are of a far higher degree of difficulty and complexity than any questions of deduction…. (Jevons 1874)
277
Induction as Inverted Deduction(Cont’)
278
Induction as Inverted Deduction(Cont’)
279
Induction as Inverted Deduction(Cont’)
280
Deduction: Resolution Rule
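The slide's figure is lost. The propositional resolution rule it presented: given clauses C1 and C2, find a literal L that occurs in C1 such that ¬L occurs in C2, and form the resolvent

C = (C1 − {L}) ∪ (C2 − {¬L})

For example, from C1 = PlayTennis ∨ ¬Weekend and C2 = Weekend, resolution yields C = PlayTennis.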
281
Inverting Resolution
282
Inverted Resolution (Propositional)
283
First order resolution
284
Inverting First order resolution
285
Cigol
286
Progol
287
Machine Learning Chapter 11. Analytical Learning
Tom M. Mitchell
288
Outline Two formulations for learning: Inductive and Analytical
Perfect domain theories and Prolog-EBG
289
A Positive Example
290
The Inductive Generalization Problem
Given: instances, hypotheses, target concept, and training examples of the target concept
Determine: hypotheses consistent with the training examples
291
The Analytical Generalization Problem
Given: instances, hypotheses, target concept, training examples of the target concept, and a domain theory for explaining examples
Determine: hypotheses consistent with both the training examples and the domain theory
292
An Analytical Generalization Problem
293
Learning from Perfect Domain Theories
Assumes domain theory is correct (error-free) Prolog-EBG is algorithm that works under this assumption This assumption holds in chess and other search problems Allows us to assume explanation = proof Later we’ll discuss methods that assume approximate domain theories
294
Prolog EBG Initialize hypothesis = {}
For each positive training example not covered by hypothesis: 1. Explain how training example satisfies target concept, in terms of domain theory 2. Analyze the explanation to determine the most general conditions under which this explanation (proof) holds 3. Refine the hypothesis by adding a new rule, whose preconditions are the above conditions, and whose consequent asserts the target concept
295
Explanation of a Training Example
296
Computing the Weakest Preimage of Explanation
297
Regression Algorithm
298
Lessons from Safe-to-Stack Example
Justified generalization from single example Explanation determines feature relevance Regression determines needed feature constraints Generality of result depends on domain theory Still require multiple examples
299
Perspectives on Prolog-EBG
Theory-guided generalization from examples
Example-guided operationalization of theories
"Just" restating what the learner already "knows"
Is it learning?
Are you learning when you get better over time at chess? Even though you already know everything in principle, once you know the rules of the game...
Are you learning when you sit in a mathematics class? Even though those theorems follow deductively from the axioms you've already learned...
300
Machine Learning Chapter 12. Combining Inductive and Analytical Learning Tom M. Mitchell
301
Inductive and Analytical Learning
Inductive learning: hypothesis fits data; statistical inference; requires little prior knowledge; syntactic inductive bias
Analytical learning: hypothesis fits domain theory; deductive inference; learns from scarce data; bias is the domain theory
302
What We Would Like General purpose learning method:
No domain theory → learn as well as inductive methods
Perfect domain theory → learn as well as Prolog-EBG
Accommodate arbitrary and unknown errors in the domain theory
Accommodate arbitrary and unknown errors in the training data
303
Domain theory:
Cup ← Stable, Liftable, OpenVessel
Stable ← BottomIsFlat
Liftable ← Graspable, Light
Graspable ← HasHandle
OpenVessel ← HasConcavity, ConcavityPointsUp
Training examples:
304
KBANN
KBANN(data D, domain theory B)
1. Create a feedforward network h equivalent to B
2. Use BACKPROP to tune h to fit D
305
Neural Net Equivalent to Domain Theory
306
Creating Network Equivalent to Domain Theory
Create one unit per horn clause rule (i.e., an AND unit)
Connect unit inputs to the corresponding clause antecedents
For each non-negated antecedent, set the corresponding input weight w ← W, where W is some constant
For each negated antecedent, set the input weight w ← −W
Set the threshold weight w0 ← −(n − .5)W, where n is the number of non-negated antecedents
Finally, add many additional connections with near-zero weights
Example clause (illustrative): Liftable ← Graspable, ¬Heavy
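A sketch of building one such AND unit in Python (numpy; the clause encoding is an illustrative choice, not from the slides):

import numpy as np

def clause_to_unit(antecedents, all_features, W=4.0):
    # antecedents: list of (feature_name, negated) pairs, e.g. the illustrative
    # clause Liftable <- Graspable, not(Heavy) gives
    # [("Graspable", False), ("Heavy", True)]
    w = np.zeros(len(all_features))           # other inputs stay near zero
    n = 0                                     # count of non-negated antecedents
    for feature, negated in antecedents:
        w[all_features.index(feature)] = -W if negated else W
        n += 0 if negated else 1
    w0 = -(n - 0.5) * W                       # threshold weight
    return w, w0                              # unit fires iff w.x + w0 > 0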
307
Result of refining the network
308
KBANN Results
Classifying promoter regions in DNA, leave-one-out testing:
Backpropagation: error rate 8/106
KBANN: 4/106
Similar improvements on other classification and control tasks.
309
Hypothesis space search in KBANN
310
EBNN
Key idea: a previously learned approximate domain theory
Domain theory represented by a collection of neural networks
Learn the target function as another neural network
312
Modified Objective for Gradient Descent
314
Hypothesis Space Search in EBNN
315
Search in FOCL
316
FOCL Results Recognizing legal chess endgame positions:
30 positive, 30 negative examples FOIL : 86% FOCL : 94% (using domain theory with 76% accuracy) NYNEX telephone network diagnosis 500 training examples FOIL : 90% FOCL : 98% (using domain theory with 95% accuracy)
317
Machine Learning Chapter 13. Reinforcement Learning
Tom M. Mitchell
318
Control Learning Consider learning to choose actions, e.g.,
Robot learning to dock on battery charger Learning to choose actions to optimize factory output Learning to play Backgammon Note several problem characteristics: Delayed reward Opportunity for active exploration Possibility that state only partially observable Possible need to learn multiple tasks with same sensors/effectors
319
One Example: TD-Gammon
Learn to play Backgammon
Immediate reward: +100 if win, -100 if lose, 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to the best human player
320
Reinforcement Learning Problem
321
Markov Decision Processes
Assume a finite set of states S and a set of actions A
At each discrete time step, the agent observes state st ∈ S and chooses action at ∈ A
It then receives immediate reward rt, and the state changes to st+1
Markov assumption: st+1 = δ(st, at) and rt = r(st, at)
i.e., rt and st+1 depend only on the current state and action
The functions δ and r may be nondeterministic
The functions δ and r are not necessarily known to the agent
322
Agent's Learning Task
323
Value Function
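The slide's figure is lost. For reference, the chapter defines the discounted cumulative reward achieved by policy π from state st, with discount factor 0 ≤ γ < 1:

Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + ... ≡ Σi=0..∞ γ^i rt+i

The learning task is then to find the optimal policy π* = argmax over π of Vπ(s), for all states s.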
325
What to Learn
326
Q Function
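The slide's figure is lost. The chapter defines the Q function as the reward for taking action a in state s plus the discounted value of the resulting state:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

so the optimal policy can be written π*(s) = argmax over a of Q(s, a), with no need to know δ or r at decision time.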
327
Training Rule to Learn Q
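The slide's figure is lost. The deterministic training rule the chapter derives, where s' = δ(s, a) is the observed successor state and Q̂ is the learned estimate:

Q̂(s, a) ← r + γ max over a' of Q̂(s', a')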
328
Q Learning for Deterministic Worlds
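A table-based sketch of the algorithm in Python; the env interface (reset/terminal/step) is an illustrative stand-in, not from the slides:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9):
    Q = defaultdict(float)                     # Q̂(s, a), initialized to zero
    for _ in range(episodes):
        s = env.reset()
        while not env.terminal(s):
            a = random.choice(actions)         # any exploration strategy works
            s_next, r = env.step(s, a)         # deterministic world
            # training rule: Q̂(s,a) <- r + gamma * max over a' of Q̂(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q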
331
Nondeterministic Case
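The slides' figures are lost. In the nondeterministic case the chapter redefines V and Q as expected values and softens the training rule into a decaying weighted average (a hedged reconstruction):

Q̂n(s, a) ← (1 − αn) Q̂n−1(s, a) + αn [ r + γ max over a' of Q̂n−1(s', a') ], with αn = 1 / (1 + visitsn(s, a))

where visitsn(s, a) is the number of times the pair (s, a) has been visited up through iteration n.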
332
Nondeterministic Case(Cont’)
333
Temporal Difference Learning
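The slides' figures are lost. As a hedged reconstruction, the chapter's TD(λ) blends one-step and multi-step lookahead estimates recursively:

Qλ(st, at) ≡ rt + γ [ (1 − λ) max over a of Q̂(st+1, a) + λ Qλ(st+1, at+1) ]

Setting λ = 0 recovers the one-step Q-learning estimate; larger λ leans more heavily on observed future rewards.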
334
Temporal Difference Learning(Cont’)
335
Subtleties and Ongoing Research