
1 CMPT 310, Simon Fraser University. Oliver Schulte. Learning.

2 The Big Picture: AI for Model-Based Agents. Diagram (from Artificial Intelligence: A Modern Approach) relating the course topics: Action, Learning, Knowledge, Logic, Probability, Heuristics, Inference, Planning, Decision Theory, Game Theory, Reinforcement Learning, Machine Learning, Statistics.

3 Motivation. Building a knowledge base is a significant investment of time and resources; it is prone to error and needs debugging. Alternative approach: learn rules from examples. Grand vision: start with "seed rules" from an expert, then use examples to expand and refine them.

4 Overview. Many learning models exist. We will consider two representative ones that are widely used in AI: learning Bayesian network parameters, and learning a decision tree classifier.

5 Examples. Programming by example: Excel Flash Fill. Kaggle data science competitions.

6 Learning Bayesian Networks

7 Structure Learning Example: Sleep Disorder Network. Source: Fouron, Anne Gisèle (2006). Development of Bayesian Network models for obstructive sleep apnea syndrome assessment. M.Sc. Thesis, SFU.

8 Parameter Learning: Common Approach. An expert specifies the Bayesian network structure (nodes and links); the program fills in the parameters (conditional probabilities).

9 Parameter Learning Scenarios. Complete data (today); later: missing data (EM). Which estimator fits which combination of node types:

Child node \ Parent node | Discrete parent                      | Continuous parent
Discrete child           | maximum likelihood; decision trees   | logit distribution (logistic regression)
Continuous child         | conditional Gaussian (not discussed) | linear Gaussian (linear regression)

10 The Parameter Learning Problem. Input: a data table X (N x D), with one column per node (random variable) and one row per instance. How to fill in the Bayes net parameters? Example network: Humidity -> PlayTennis.

Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
1   | sunny    | hot         | high     | weak   | no
2   | sunny    | hot         | high     | strong | no
3   | overcast | hot         | high     | weak   | yes
4   | rain     | mild        | high     | weak   | yes
5   | rain     | cool        | normal   | weak   | yes
6   | rain     | cool        | normal   | strong | no
7   | overcast | cool        | normal   | strong | yes
8   | sunny    | mild        | high     | weak   | no
9   | sunny    | cool        | normal   | weak   | yes
10  | rain     | mild        | normal   | weak   | yes
11  | sunny    | mild        | normal   | strong | yes
12  | overcast | mild        | high     | strong | yes
13  | overcast | hot         | normal   | weak   | yes
14  | rain     | mild        | high     | strong | no
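
For readers who want to follow the parameter-learning examples in code, here is one way (a sketch of my own, not from the slides) to hold this table in Python; the later sketches reuse these column values.

```python
# The 14-row PlayTennis table from the slide, one dict per instance.
DATA = [
    {"Day": 1,  "Outlook": "sunny",    "Temperature": "hot",  "Humidity": "high",   "Wind": "weak",   "PlayTennis": "no"},
    {"Day": 2,  "Outlook": "sunny",    "Temperature": "hot",  "Humidity": "high",   "Wind": "strong", "PlayTennis": "no"},
    {"Day": 3,  "Outlook": "overcast", "Temperature": "hot",  "Humidity": "high",   "Wind": "weak",   "PlayTennis": "yes"},
    {"Day": 4,  "Outlook": "rain",     "Temperature": "mild", "Humidity": "high",   "Wind": "weak",   "PlayTennis": "yes"},
    {"Day": 5,  "Outlook": "rain",     "Temperature": "cool", "Humidity": "normal", "Wind": "weak",   "PlayTennis": "yes"},
    {"Day": 6,  "Outlook": "rain",     "Temperature": "cool", "Humidity": "normal", "Wind": "strong", "PlayTennis": "no"},
    {"Day": 7,  "Outlook": "overcast", "Temperature": "cool", "Humidity": "normal", "Wind": "strong", "PlayTennis": "yes"},
    {"Day": 8,  "Outlook": "sunny",    "Temperature": "mild", "Humidity": "high",   "Wind": "weak",   "PlayTennis": "no"},
    {"Day": 9,  "Outlook": "sunny",    "Temperature": "cool", "Humidity": "normal", "Wind": "weak",   "PlayTennis": "yes"},
    {"Day": 10, "Outlook": "rain",     "Temperature": "mild", "Humidity": "normal", "Wind": "weak",   "PlayTennis": "yes"},
    {"Day": 11, "Outlook": "sunny",    "Temperature": "mild", "Humidity": "normal", "Wind": "strong", "PlayTennis": "yes"},
    {"Day": 12, "Outlook": "overcast", "Temperature": "mild", "Humidity": "high",   "Wind": "strong", "PlayTennis": "yes"},
    {"Day": 13, "Outlook": "overcast", "Temperature": "hot",  "Humidity": "normal", "Wind": "weak",   "PlayTennis": "yes"},
    {"Day": 14, "Outlook": "rain",     "Temperature": "mild", "Humidity": "high",   "Wind": "strong", "PlayTennis": "no"},
]
```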

11 Start Small: Single Node. The network has a single node Humidity with parameter P(Humidity = high) = θ. What value would you choose? How about P(Humidity = high) = 50%?

Day | Humidity
1   | high
2   | high
3   | high
4   | high
5   | normal
6   | normal
7   | normal
8   | high
9   | normal
10  | normal
11  | normal
12  | high
13  | normal
14  | high

12 Parameters for Two Nodes. Network: Humidity -> PlayTennis. Parameters: P(Humidity = high) = θ; P(PlayTennis = yes | Humidity = high) = θ1; P(PlayTennis = yes | Humidity = normal) = θ2. Is θ the same as in the single-node model? How about θ1 = 3/7? How about θ2 = 6/7?

Day | Humidity | PlayTennis
1   | high   | no
2   | high   | no
3   | high   | yes
4   | high   | yes
5   | normal | yes
6   | normal | no
7   | normal | yes
8   | high   | no
9   | normal | yes
10  | normal | yes
11  | normal | yes
12  | high   | yes
13  | normal | yes
14  | high   | no

13 Maximum Likelihood Estimation

14 MLE. An important general principle: choose parameter values that maximize the likelihood of the data. Intuition: explain the data as well as possible. Recall from Bayes' theorem that the likelihood is P(data | parameters) = P(D | θ).

15 Finding the Maximum Likelihood Solution: Single Node. Model: P(Humidity = high) = θ. Because the data are independent and identically distributed (i.i.d.), the likelihood is a product over rows: each row with Humidity = high contributes a factor θ, and each row with Humidity = normal contributes a factor (1 - θ). 1. Write down the likelihood P(D | θ). 2. In the example, P(D | θ) = θ^7 (1 - θ)^7. 3. Maximize this function with respect to θ.

16 Solving the Equation. 1. It is often convenient to apply logarithms to products: ln P(D | θ) = 7 ln(θ) + 7 ln(1 - θ). 2. Take the derivative with respect to θ and set it to 0: 7/θ - 7/(1 - θ) = 0, which gives θ = 7/14 = 1/2. 3. Exercise: check whether the critical points of L found this way are maxima or minima.
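
As a quick numeric check of these steps (my own sketch, not part of the slides), the closed-form answer θ = 7/14 can be compared against a brute-force grid search over the likelihood θ^7 (1 - θ)^7:

```python
# Humidity column in row order: 7 "high" and 7 "normal" values.
humidity = ["high", "high", "high", "high", "normal", "normal", "normal",
            "high", "normal", "normal", "normal", "high", "normal", "high"]

n_high = humidity.count("high")      # 7
n = len(humidity)                    # 14
theta_mle = n_high / n               # closed form: 7/14 = 0.5

def likelihood(theta):
    # P(D | theta) = theta^7 * (1 - theta)^7 for this data set.
    return theta ** n_high * (1 - theta) ** (n - n_high)

# Brute-force check: the grid maximum should sit at theta = 0.5.
best_theta = max((t / 1000 for t in range(1, 1000)), key=likelihood)
print(theta_mle, best_theta)         # 0.5  0.5
```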

17 Finding the Maximum Likelihood Solution: Two Nodes. Network: Humidity -> PlayTennis, with P(Humidity = high) = θ, P(PlayTennis = yes | high) = θ1, P(PlayTennis = yes | normal) = θ2. Each row contributes the factor P(H, P | θ, θ1, θ2) shown below.

Humidity | PlayTennis | P(H, P | θ, θ1, θ2)
high   | no  | θ (1 - θ1)
high   | no  | θ (1 - θ1)
high   | yes | θ θ1
high   | yes | θ θ1
normal | yes | (1 - θ) θ2
normal | no  | (1 - θ) (1 - θ2)
normal | yes | (1 - θ) θ2
high   | no  | θ (1 - θ1)
normal | yes | (1 - θ) θ2
normal | yes | (1 - θ) θ2
normal | yes | (1 - θ) θ2
high   | yes | θ θ1
normal | yes | (1 - θ) θ2
high   | no  | θ (1 - θ1)

18 Finding the Maximum Likelihood Solution: Two Nodes. In a Bayes net, each parameter can be maximized separately: fixing a parent condition reduces the problem to a single-node problem. 1. In the example (multiplying the per-row factors from the previous slide), P(D | θ, θ1, θ2) = θ^7 (1 - θ)^7 (θ1)^3 (1 - θ1)^4 (θ2)^6 (1 - θ2). 2. Take logs, differentiate with respect to each parameter, and set the derivatives to 0.
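
Because the likelihood factorizes in this way, each conditional probability can be estimated by counting within its parent condition. A minimal sketch of that counting (my own code):

```python
# (Humidity, PlayTennis) pairs in row order, copied from the table above.
rows = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("high", "yes"),
    ("normal", "yes"), ("normal", "no"), ("normal", "yes"), ("high", "no"),
    ("normal", "yes"), ("normal", "yes"), ("normal", "yes"), ("high", "yes"),
    ("normal", "yes"), ("high", "no"),
]

# theta = P(Humidity = high): a plain frequency over all rows.
theta = sum(1 for h, _ in rows if h == "high") / len(rows)     # 7/14 = 0.5

def p_yes_given(humidity_value):
    # Fix the parent value, then count within that slice: a single-node problem.
    labels = [p for h, p in rows if h == humidity_value]
    return labels.count("yes") / len(labels)

theta1 = p_yes_given("high")      # 3/7
theta2 = p_yes_given("normal")    # 6/7
print(theta, theta1, theta2)
```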

19 Finding the Maximum Likelihood Solution: Single Node, More Than Two Possible Values. Example: Outlook, with P(Outlook = sunny) = θ1, P(Outlook = overcast) = θ2, P(Outlook = rain) = θ3. The Outlook column contains 5 sunny, 4 overcast, and 5 rain instances. 1. In the example, P(D | θ1, θ2, θ3) = (θ1)^5 (θ2)^4 (θ3)^5. 2. The MLE solution for two possible values can be generalized, but needs more advanced math (a constrained optimization, since θ1 + θ2 + θ3 = 1). 3. General solution: the MLE equals the observed frequencies.
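
The observed-frequency rule is a one-liner with collections.Counter (again my own sketch):

```python
from collections import Counter

# Outlook column in row order: 5 sunny, 4 overcast, 5 rain.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]

counts = Counter(outlook)
n = len(outlook)

# MLE for a multi-valued node: theta_k = count(value k) / N.
mle = {value: count / n for value, count in counts.items()}
print(mle)   # sunny: 5/14, overcast: 4/14, rain: 5/14
```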

20 Decision Tree Classifiers

21 Multiple Choice Question. A decision tree: 1. helps an agent make decisions. 2. uses a Bayesian network to compute probabilities. 3. contains nodes with attribute values. 4. assigns a class label to a list of attribute values.

22 Classification. Predict a single target or class label for an object, given a vector of features; that is, model the conditional probability P(label | features). Example: predict PlayTennis given the 4 other features (the same 14-row table shown on slide 10).

23 Decision Tree. A popular type of classifier that is easy to visualize, especially for discrete feature values but also for continuous ones. Learning is based on information theory.

24 Decision Tree Example

25 Exercise. Find decision trees to represent: A OR B; A AND B; (A AND B) OR (C AND NOT D AND E). (One possible shape of the first answer is sketched below.)
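
For the first formula only, a sketch of my own (not the official solution): test A at the root, and only the A = false branch needs a further test on B.

```python
def a_or_b_tree(a: bool, b: bool) -> str:
    """Decision tree for "A OR B", written as nested attribute tests."""
    if a:                 # root node: test attribute A
        return "yes"      # leaf: A is true, so A OR B is true
    if b:                 # child node: test B (only reached when A is false)
        return "yes"      # leaf
    return "no"           # leaf

# Quick check against the truth table of OR.
assert all(a_or_b_tree(a, b) == ("yes" if (a or b) else "no")
           for a in (True, False) for b in (True, False))
```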

26 Example: Rate of Reboot Failure

27 Big Decision Tree for NHL Goal Scoring

28 Decision Tree Learning. Basic loop (a recursive sketch follows below): 1. A := the "best" decision attribute for the next node. 2. For each value of A, create a new descendant of the node. 3. Assign the training examples to the leaf nodes. 4. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.
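
A compact recursive rendering of this loop (a sketch that assumes the caller supplies a scoring function, such as the information gain introduced on the following slides; names like learn_tree are my own):

```python
from collections import Counter

def learn_tree(examples, attributes, score):
    """examples: list of (feature_dict, label) pairs; attributes: list of
    attribute names; score(examples, attribute): how good a split would be."""
    labels = [label for _, label in examples]
    # STOP if the training examples are perfectly classified here.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: predict the majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # 1. A := the "best" decision attribute for the next node.
    best = max(attributes, key=lambda a: score(examples, a))
    remaining = [a for a in attributes if a != best]
    # 2. For each value of A, create a new descendant of the node.
    tree = {"split_on": best, "branches": {}}
    for value in {features[best] for features, _ in examples}:
        # 3. Assign the training examples to the new leaf nodes.
        subset = [(f, lab) for f, lab in examples if f[best] == value]
        # 4. Iterate over the new leaf nodes.
        tree["branches"][value] = learn_tree(subset, remaining, score)
    return tree
```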

29 Entropy

30 Multiple Choice Question. Entropy: 1. measures the amount of uncertainty in a probability distribution. 2. is a concept from relativity theory in physics. 3. refers to the flexibility of an intelligent agent. 4. is maximized by the ID3 algorithm.

31 Uncertainty and Probability. The more "balanced" a probability distribution, the less information it conveys (e.g., about the class label). How can this be quantified? Information theory: entropy measures balance. For a sample S with proportion p+ of positive and p- of negative examples: Entropy(S) = -p+ log2(p+) - p- log2(p-).
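
The two-class entropy formula in code (a sketch of my own; the PlayTennis sample, with 9 yes and 5 no labels, works out to about 0.94 bits):

```python
import math

def entropy(p_pos):
    """Entropy of a two-class sample with proportion p_pos of positives."""
    p_neg = 1.0 - p_pos
    # Convention: a term with probability 0 contributes 0 (0 * log 0 := 0).
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(0.5))      # 1.0 bit: a perfectly balanced sample
print(entropy(1.0))      # 0.0 bits: no uncertainty at all
print(entropy(9 / 14))   # ~0.940 bits: the PlayTennis sample (9 yes, 5 no)
```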

32 Entropy: General Definition. For a discrete random variable X with distribution p, H(X) = -Σ_x p(x) log2 p(x). Entropy is an important quantity in coding theory, statistical physics, and machine learning.

33 Intuition

34 Entropy

35 Coding Theory. Suppose X is discrete with 8 possible states ("messages"); how many bits are needed to transmit the state of X? Shannon's coding theorem: an optimal code assigns length -log2 p(x) to each message X = x. If all states are equally likely, each one needs -log2(1/8) = 3 bits.
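
A small numeric illustration of the -log2 p(x) rule (my own sketch): with 8 equally likely states every message costs 3 bits, while a skewed distribution over the same 8 states needs fewer bits on average, because frequent messages get shorter codes.

```python
import math

def expected_code_length(probs):
    # Expected optimal code length: sum_x p(x) * (-log2 p(x)), i.e. the entropy.
    return sum(-p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
print(-math.log2(uniform[0]))         # 3.0 bits for every message
print(expected_code_length(uniform))  # 3.0 bits on average

# Skewed distribution over the same 8 states (probabilities sum to 1).
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
print(expected_code_length(skewed))   # 2.0 bits on average
```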

36 Zipf's Law. General principle: frequent messages get shorter codes (e.g., abbreviations). This is the idea behind information compression.

37 Another Coding Example

38 The Kullback-Leibler Divergence. Measures an information-theoretic "distance" between two distributions p and q; the distributions can be discrete or continuous. Closely related to cross-entropy. Interpretation: KL(p || q) = Σ_x p(x) [(-log2 q(x)) - (-log2 p(x))] compares the code length of x under the wrong distribution q with its code length under the true distribution p.
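
A direct transcription of this definition for discrete distributions (my own sketch):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * (log2 p(x) - log2 q(x)), in bits: the extra
    code length paid for coding samples from p with a code optimized for q."""
    return sum(px * (math.log2(px) - math.log2(qx))
               for px, qx in zip(p, q) if px > 0)

p = [1 / 2, 1 / 4, 1 / 4]    # "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]    # "wrong" (uniform) distribution
print(kl_divergence(p, p))   # 0.0: no penalty for using the right code
print(kl_divergence(p, q))   # > 0; note KL(p||q) != KL(q||p) in general
```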

39 Information Gain: ID3 Decision Tree Learning

40 Splitting Criterion. Splitting on an attribute changes the entropy of the class label within each branch. We want to split on the attribute that yields the greatest reduction in entropy, averaged over its attribute values: Gain(S, A) = expected reduction in entropy due to splitting on A = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S with attribute value A = v.
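
Putting entropy and the splitting criterion together for the PlayTennis data (my own sketch; with the usual textbook numbers, Outlook gives the largest gain, roughly 0.25 bits, so it would be chosen as the root):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [label for _, label in examples]
    total = len(examples)
    remainder = 0.0
    for value in {features[attribute] for features, _ in examples}:
        subset = [lab for f, lab in examples if f[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Only the columns needed here, copied from the 14-row table on slide 10.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
wind = ["weak", "strong", "weak", "weak", "weak", "strong", "strong",
        "weak", "weak", "weak", "strong", "strong", "weak", "strong"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

examples = [({"Outlook": o, "Wind": w}, label)
            for o, w, label in zip(outlook, wind, play)]
print(gain(examples, "Outlook"))   # ~0.247 bits
print(gain(examples, "Wind"))      # ~0.048 bits
```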

41 Example

42 PlayTennis Example

