CMPT 310 Simon Fraser University Oliver Schulte Learning

2/13 The Big Picture: AI for Model-Based Agents
[Diagram from "Artificial Intelligence: A Modern Approach" relating Action, Learning, Knowledge, Logic, Probability, Heuristics, Inference, Planning, Decision Theory, Game Theory, Reinforcement Learning, Machine Learning, and Statistics.]

3/13 Motivation Building a knowledge base is a significant investment of time and resources. Prone to error, needs debugging. Alternative approach: Learn rules from examples. Grand Vision: Start with “seed rules” from expert, use examples to expand and refine.

4/13 Overview
Many learning models exist. We will consider two representative ones that are widely used in AI: learning Bayesian network parameters, and learning a decision tree classifier.

5/13 Examples
Program by example: Excel Flash Fill.
Kaggle Data Science Competitions.

6/13 Learning Bayesian Networks

7/13 Structure Learning Example: Sleep Disorder Network
Source: Fouron, Anne Gisèle (2006). Development of Bayesian Network models for obstructive sleep apnea syndrome assessment. M.Sc. Thesis, SFU.

8/13 Parameter Learning
Common approach: the expert specifies the Bayesian network structure (nodes and links); the program fills in the parameters (conditional probabilities).

9/13 Parameter Learning Scenarios
Complete data (today). Later: missing data (EM).
Distribution of the child node given its parents, by variable type:
- Discrete child, discrete parents: maximum likelihood estimates; decision trees.
- Discrete child, continuous parents: logit distribution (logistic regression).
- Continuous child, discrete parents: conditional Gaussian (not discussed).
- Continuous child, continuous parents: linear Gaussian (linear regression).

10/13 The Parameter Learning Problem
Input: a data table X (N x D). One column per node (random variable), one row per instance. How to fill in the Bayes net parameters?
[Bayes net nodes shown: Humidity, PlayTennis]

Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
1   | sunny    | hot         | high     | weak   | no
2   | sunny    | hot         | high     | strong | no
3   | overcast | hot         | high     | weak   | yes
4   | rain     | mild        | high     | weak   | yes
5   | rain     | cool        | normal   | weak   | yes
6   | rain     | cool        | normal   | strong | no
7   | overcast | cool        | normal   | strong | yes
8   | sunny    | mild        | high     | weak   | no
9   | sunny    | cool        | normal   | weak   | yes
10  | rain     | mild        | normal   | weak   | yes
11  | sunny    | mild        | normal   | strong | yes
12  | overcast | mild        | high     | strong | yes
13  | overcast | hot         | normal   | weak   | yes
14  | rain     | mild        | high     | strong | no

11/13 Start Small: Single Node
What would you choose? How about P(Humidity = high) = 50%?
Model: a single node Humidity with parameter P(Humidity = high) = θ.

Day | Humidity
1   | high
2   | high
3   | high
4   | high
5   | normal
6   | normal
7   | normal
8   | high
9   | normal
10  | normal
11  | normal
12  | high
13  | normal
14  | high

12/13 Parameters for Two Nodes
Network: Humidity -> PlayTennis. Parameters: P(Humidity = high) = θ; P(PlayTennis = yes | Humidity = high) = θ1; P(PlayTennis = yes | Humidity = normal) = θ2.
Is θ as in the single-node model? How about θ1 = 3/7? How about θ2 = 6/7?

Day | Humidity | PlayTennis
1   | high     | no
2   | high     | no
3   | high     | yes
4   | high     | yes
5   | normal   | yes
6   | normal   | no
7   | normal   | yes
8   | high     | no
9   | normal   | yes
10  | normal   | yes
11  | normal   | yes
12  | high     | yes
13  | normal   | yes
14  | high     | no

13/13 Maximum Likelihood Estimation

14/13 MLE
An important general principle: choose parameter values that maximize the likelihood of the data. Intuition: explain the data as well as possible. Recall from Bayes' theorem that the likelihood is P(data | parameters) = P(D | θ).
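As a formula (a standard statement of the principle; the notation is ours, not from the slide):

```latex
\hat{\theta}_{\mathrm{MLE}} \;=\; \arg\max_{\theta} \; P(D \mid \theta)
```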

15/13 Finding the Maximum Likelihood Solution: Single Node
Model: P(Humidity = high) = θ. Assume independent, identically distributed (iid) data, so each row contributes one factor: each "high" row contributes θ and each "normal" row contributes 1 - θ (7 rows of each in the table above).
1. Write down the likelihood P(D | θ) as a function of θ.
2. In the example, P(D | θ) = θ^7 (1 - θ)^7.
3. Maximize this function over θ.

16/13 Solving the Equation
1. It is often convenient to apply logarithms to products: ln(P(D | θ)) = 7 ln(θ) + 7 ln(1 - θ).
2. Find the derivative and set it to 0.
3. Exercise: try finding the minima of L (the log-likelihood) given above.
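Carrying out step 2 for the single-node example, a worked step not shown on the slide:

```latex
\frac{d}{d\theta}\Bigl[\,7\ln\theta + 7\ln(1-\theta)\,\Bigr]
  \;=\; \frac{7}{\theta} - \frac{7}{1-\theta} \;=\; 0
  \;\Longrightarrow\; 7(1-\theta) = 7\theta
  \;\Longrightarrow\; \hat{\theta} = \tfrac{7}{14} = 0.5
```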

17/13 Finding the Maximum Likelihood Solution: Two Nodes
Parameters: P(Humidity = high) = θ; P(PlayTennis = yes | Humidity = high) = θ1; P(PlayTennis = yes | Humidity = normal) = θ2. Each row contributes the factor P(H, P | θ, θ1, θ2) shown below.

Humidity | PlayTennis | P(H, P | θ, θ1, θ2)
high     | no         | θ x (1 - θ1)
high     | no         | θ x (1 - θ1)
high     | yes        | θ x θ1
high     | yes        | θ x θ1
normal   | yes        | (1 - θ) x θ2
normal   | no         | (1 - θ) x (1 - θ2)
normal   | yes        | (1 - θ) x θ2
high     | no         | θ x (1 - θ1)
normal   | yes        | (1 - θ) x θ2
normal   | yes        | (1 - θ) x θ2
normal   | yes        | (1 - θ) x θ2
high     | yes        | θ x θ1
normal   | yes        | (1 - θ) x θ2
high     | no         | θ x (1 - θ1)

18/13 Finding the Maximum Likelihood Solution: Two Nodes (continued)
In a Bayes net, we can maximize each parameter separately: fixing a parent condition reduces to a single-node problem.
1. In the example (multiplying the row factors from the previous slide), P(D | θ, θ1, θ2) = θ^7 (1 - θ)^7 (θ1)^3 (1 - θ1)^4 (θ2)^6 (1 - θ2).
2. Take logs, differentiate, and set the derivatives to 0.
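Setting each derivative to zero and using the exponents above gives the answers to the questions on slide 12 (worked out here; the steps are not shown on the slide):

```latex
\hat{\theta} = \frac{7}{14} = 0.5, \qquad
\hat{\theta}_1 = \frac{3}{3+4} = \frac{3}{7}, \qquad
\hat{\theta}_2 = \frac{6}{6+1} = \frac{6}{7}
```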

19/13 Finding the Maximum Likelihood Solution: Single Node, >2 Possible Values
Parameters: P(Outlook = sunny) = θ1, P(Outlook = overcast) = θ2, P(Outlook = rain) = θ3.

Day | Outlook
1   | sunny
2   | sunny
3   | overcast
4   | rain
5   | rain
6   | rain
7   | overcast
8   | sunny
9   | sunny
10  | rain
11  | sunny
12  | overcast
13  | overcast
14  | rain

1. In the example, P(D | θ1, θ2, θ3) = (θ1)^5 (θ2)^4 (θ3)^5.
2. The MLE solution for 2 possible values can be generalized, but needs more advanced math (the parameters are constrained to sum to 1).
3. General solution: MLE = observed frequencies.
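A minimal Python sketch of "MLE = observed frequencies" on the PlayTennis table above; the variable and function names are illustrative, not from the slides.

```python
from collections import Counter

# PlayTennis data from slide 10: (Outlook, Humidity, PlayTennis) per day.
data = [
    ("sunny", "high", "no"),        ("sunny", "high", "no"),
    ("overcast", "high", "yes"),    ("rain", "high", "yes"),
    ("rain", "normal", "yes"),      ("rain", "normal", "no"),
    ("overcast", "normal", "yes"),  ("sunny", "high", "no"),
    ("sunny", "normal", "yes"),     ("rain", "normal", "yes"),
    ("sunny", "normal", "yes"),     ("overcast", "high", "yes"),
    ("overcast", "normal", "yes"),  ("rain", "high", "no"),
]

# MLE for a root node: observed frequency of each value.
outlook_counts = Counter(outlook for outlook, _, _ in data)
p_outlook = {v: c / len(data) for v, c in outlook_counts.items()}
print(p_outlook)  # sunny: 5/14, overcast: 4/14, rain: 5/14

# MLE for a child node: observed frequency within each parent condition.
def mle_conditional(parent_value):
    rows = [(h, p) for _, h, p in data if h == parent_value]
    return sum(1 for _, p in rows if p == "yes") / len(rows)

print(mle_conditional("high"))    # 3/7, i.e. theta_1
print(mle_conditional("normal"))  # 6/7, i.e. theta_2
```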

Decision Tree Classifiers

21/13 Multiple Choice Question
A decision tree
1. helps an agent make decisions.
2. uses a Bayesian network to compute probabilities.
3. contains nodes with attribute values.
4. assigns a class label to a list of attribute values.

22/13 Classification
Predict a single target or class label for an object, given a vector of features; i.e., the conditional probability P(label | features). Example: predict PlayTennis given the 4 other features in the table shown earlier (slide 10).

23/13 Decision Tree Popular type of classifier. Easy to visualize. Especially for discrete values, but also for continuous. Learning: Information Theory.

24/13 Decision Tree Example

25/13 Exercise
Find a decision tree to represent:
1. A OR B
2. A AND B
3. (A AND B) OR (C AND NOT D AND E)
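One possible answer for the first formula, written as nested tests (an illustrative sketch only; the attribute names come from the exercise, the function is ours):

```python
def a_or_b(a: bool, b: bool) -> bool:
    """Decision tree for A OR B: root tests A, the A = false branch tests B."""
    if a:
        return True   # leaf: class = true
    return b          # subtree testing B: true -> true, false -> false
```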

26/13 Example: Rate of Reboot Failure

27/13 Big Decision Tree for NHL Goal Scoring

28/13 Decision Tree Learning
Basic loop:
1. A := the "best" decision attribute for the next node.
2. For each value of A, create a new descendant of the node.
3. Assign training examples to the leaf nodes.
4. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.
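A recursive Python sketch of this loop. The data representation (dicts for examples, nested dicts for the tree) and the pluggable best_attribute function are our illustrative choices, not part of the slides; an information-gain computation that could drive best_attribute is sketched after the Splitting Criterion slide below.

```python
from collections import Counter

def learn_tree(examples, attributes, target, best_attribute):
    """examples: list of dicts; returns a class label or a nested-dict tree."""
    labels = [ex[target] for ex in examples]
    # Stop: examples perfectly classified, or no attributes left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes, target)        # step 1: "best" attribute
    tree = {a: {}}
    for value in {ex[a] for ex in examples}:                # step 2: one branch per value
        subset = [ex for ex in examples if ex[a] == value]  # step 3: assign examples
        rest = [x for x in attributes if x != a]
        tree[a][value] = learn_tree(subset, rest, target, best_attribute)  # step 4: iterate
    return tree
```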

Entropy

30/13 Multiple Choice Question
Entropy
1. measures the amount of uncertainty in a probability distribution.
2. is a concept from relativity theory in physics.
3. refers to the flexibility of an intelligent agent.
4. is maximized by the ID3 algorithm.

31/13 Uncertainty and Probability
The more "balanced" a probability distribution, the less information it conveys (e.g., about the class label). How to quantify this? Information theory: entropy measures balance. For a sample S with proportion p+ of positive and p- of negative examples:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
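A small Python sketch of this two-class entropy, checked against the PlayTennis sample (9 yes, 5 no); the function name is illustrative, not from the slides.

```python
from math import log2

def entropy2(p_pos: float) -> float:
    """Entropy (in bits) of a two-class distribution with positive proportion p_pos."""
    p_neg = 1.0 - p_pos
    terms = [p * log2(p) for p in (p_pos, p_neg) if p > 0]  # convention: 0 * log 0 = 0
    return -sum(terms)

print(entropy2(9 / 14))  # PlayTennis sample: ~0.940 bits
print(entropy2(0.5))     # maximally "balanced": 1.0 bit
print(entropy2(1.0))     # pure sample: 0.0 bits
```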

32/13 Entropy: General Definition
An important quantity in: coding theory, statistical physics, and machine learning.
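The general definition this slide refers to, in its standard form (reproduced here because the slide's formula did not survive transcription):

```latex
H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x)
```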

33/13 Intuition

34/13 Entropy

35/13 Coding Theory
Coding theory: X is discrete with 8 possible states ("messages"); how many bits are needed to transmit the state of X? Shannon's source coding theorem: an optimal code assigns length -log2 p(x) to each "message" X = x. (Illustrated for the case where all states are equally likely.)
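For the equally likely case on this slide, the optimal code length works out as follows (a one-line check):

```latex
p(x) = \tfrac{1}{8} \;\Rightarrow\; -\log_2 p(x) = 3 \text{ bits per message},
\qquad H(X) = 8 \cdot \tfrac{1}{8} \cdot 3 = 3 \text{ bits.}
```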

36/13 Zipf’s Law General principle: frequent messages get shorter codes. e.g., abbreviations. Information Compression.

37/13 Another Coding Example

38/13 The Kullback-Leibler Divergence
Measures an information-theoretic "distance" between two distributions p and q; the distributions can be discrete or continuous. Closely related to cross-entropy. It compares the code length of x under the true distribution p with the code length of x under the wrong distribution q.
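The standard discrete form of this divergence, with the two code-length terms from the slide made explicit (the slide's own formula did not survive transcription):

```latex
D_{\mathrm{KL}}(p \,\|\, q)
  \;=\; \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)}
  \;=\; \underbrace{\sum_{x} p(x)\,\bigl(-\log_2 q(x)\bigr)}_{\text{code length of } x \text{ under wrong } q}
      \;-\; \underbrace{\sum_{x} p(x)\,\bigl(-\log_2 p(x)\bigr)}_{\text{code length of } x \text{ under true } p}
```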

Information Gain ID3 Decision Tree Learning

40/13 Splitting Criterion
Splitting on an attribute changes the entropy of the class label within each branch. We want to split on the attribute that gives the greatest reduction in entropy, averaged over its attribute values. Gain(S, A) = the expected reduction in entropy due to splitting on A.
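A Python sketch of Gain(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v), applied to the PlayTennis data; the names are illustrative, and the values in the comments are the usual ones for this example.

```python
from collections import Counter
from math import log2

# (Outlook, Humidity, Wind, PlayTennis) for days 1-14, from slide 10.
rows = [
    ("sunny", "high", "weak", "no"),         ("sunny", "high", "strong", "no"),
    ("overcast", "high", "weak", "yes"),     ("rain", "high", "weak", "yes"),
    ("rain", "normal", "weak", "yes"),       ("rain", "normal", "strong", "no"),
    ("overcast", "normal", "strong", "yes"), ("sunny", "high", "weak", "no"),
    ("sunny", "normal", "weak", "yes"),      ("rain", "normal", "weak", "yes"),
    ("sunny", "normal", "strong", "yes"),    ("overcast", "high", "strong", "yes"),
    ("overcast", "normal", "weak", "yes"),   ("rain", "high", "strong", "no"),
]

def entropy(labels):
    """Entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(attr_index):
    """Expected entropy reduction from splitting on the given attribute column."""
    labels = [r[-1] for r in rows]
    before = entropy(labels)
    after = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

print(round(gain(0), 3))  # Outlook:  ~0.247  (highest, so ID3 splits on it first)
print(round(gain(1), 3))  # Humidity: ~0.151
print(round(gain(2), 3))  # Wind:     ~0.048
```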

41/13 Example

42/13 PlayTennis Example