
What is learning? Learning, in a rather broad sense: improvement of performance on the basis of experience. Machine learning: a program learns if its performance at task T, with respect to performance measure P, improves with experience E.

Learning from Observations.  There are three main types of learning:  Supervised learning – used in environments where an action is followed by immediate feedback (the correct answer is given)  Reinforcement learning – used in environments where feedback on actions is delayed rather than immediate  Unsupervised learning – used where there is no feedback on actions at all

Inductive learning is the process of learning from pre-classified examples. The training set is T = {e1, e2, ..., en}, where each ei = (a, o) = (a1a2...am, o) pairs an attribute vector a with an output o. Hypothesis "goodness": choose h such that the error on the training set, Error(h) = Σi error(h(ai), oi), is minimized.

Inductive Learning – Supervised Learning  Gather a set of input-output examples from some application (the training set), e.g. stock forecasting  Train the learning model (decision tree, neural network, etc.) on the training set until "done"  The goal is to generalize to novel data not yet seen  Gather a further set of input-output examples from the same application (the test set) in order to validate what the system is doing  Then use the learning system on actual data  Formally: given a function f and a set of examples (x, f(x)), produce h such that h approximates f
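The formal view above can be sketched in a few lines: given examples (x, f(x)), pick the hypothesis h from a candidate set that minimizes error on the training set. The target function and the candidate hypotheses here are illustrative stand-ins, not from the slides.

```python
def train_error(h, examples):
    """Count the training examples that h misclassifies."""
    return sum(1 for x, y in examples if h(x) != y)

def learn(examples, hypotheses):
    """Return the candidate hypothesis with the lowest training error."""
    return min(hypotheses, key=lambda h: train_error(h, examples))

# Target function f (unknown to the learner): is x greater than 5?
examples = [(x, x > 5) for x in range(10)]

candidates = [
    lambda x: x > 3,
    lambda x: x > 5,        # happens to match f exactly
    lambda x: x % 2 == 0,
]

h = learn(examples, candidates)
print(train_error(h, examples))  # 0 -- the second candidate fits perfectly
```

In realistic settings the hypothesis space is far too large to enumerate like this, which is why algorithms such as ID3 search it greedily instead.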

Motivation  Cost of, and errors in, programming a solution by hand  Domain knowledge may be limited (e.g. financial forecasting)  Encoding/extracting domain knowledge may be expensive  Augment existing domain knowledge  Adaptability  A general, easy-to-use mechanism for a large set of applications  Do better than current approaches

Learn This!

Which?

What explanations do we prefer? Common biases in learning:  Minimize error on known examples  Information gain  Ockham's Razor – prefer the simplest hypothesis that describes the data (MML, MDL)  Without bias, you cannot learn! Bias influences what you will learn. Sometimes these biases are inherent in the basic learning algorithm you choose; sometimes they are implicit in the error function you are using.  So which biases are the "good" biases? The conservation law of generalization performance [Schaffer, 1994] shows that no learning bias (algorithm) can outperform any other bias (algorithm) over the space of all possible learning tasks.

Decision Trees  Take a set of attributes for a given situation or object and output a yes/no answer  Binary output – easily extendible to multiple output classes  Typically, each internal node in the tree corresponds to a test on a single attribute. However, more complicated tests are possible, and there are models that split on more than a single attribute; the test need only split the data reaching the node in some way.  The branches emanating from a decision node are labeled with the possible values of the test for that branch  Leaf nodes are where classification takes place: each leaf is labeled with the Boolean value (yes/no) to return if the node is reached. A leaf could also be labeled with a probability.

Decision Tree Expressiveness  Any Boolean function can be represented by a decision tree.  In the worst case, the size of the tree must be exponential in the number of variables, i.e. a full representation of the truth table. Example: parity.  Can we do better than that? Is there a representation for Boolean functions with better worst-case size than a decision tree? Wouldn't it be great if there were?  Each tree represents a disjunction of conjunctions of constraints on the attribute values of instances.

 The problem is, there are a whole lot of Boolean functions on n variables. o A truth table with n variables has 2^n rows, one row for each possible truth assignment to the variables. o For each row, the output of the function can be either a 0 or a 1. o So, we have 2^n bits that can each be set to either 0 or 1. o That means there are 2^(2^n) possible Boolean functions on n variables.
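The count above is easy to check directly: a truth table on n variables has 2^n rows, and each row's output can independently be 0 or 1.

```python
def num_boolean_functions(n):
    """Number of distinct Boolean functions on n variables."""
    rows = 2 ** n       # one truth-table row per assignment of the variables
    return 2 ** rows    # each row's output is independently 0 or 1

print(num_boolean_functions(1))  # 4
print(num_boolean_functions(2))  # 16
print(num_boolean_functions(3))  # 256
```

The double exponential grows fast: at n = 5 there are already 2^32 (over four billion) Boolean functions, which is why no representation can be compact for all of them.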

ID3/C4.5  Top-down induction of decision trees  Non-incremental (incremental extensions exist)  Highly used and successful  Attribute features are discrete; the output is also discrete  Searching for the smallest tree is too complex  Instead, use a greedy iterative approach

ID3 Learning Approach  C is a set of examples  A test on attribute A partitions C into {C1, C2, ..., Cw}, where w is the number of states of A  First find a good A for the root – the attribute which is "most important"  Continue recursively until the training set is unambiguously classified or there are no more "relevant" features (C4.5 actually expands fully and then prunes back statistically irrelevant partitions afterwards)

Choosing variables to split on – what should our bias be?  Bias: Ockham's razor – find the smallest possible tree  Roughly, we can approximate this by trying to minimize the depth of the tree  How? Many different ways. How about picking the attribute that maximizes classification accuracy for that step (a greedy approach)? If the first attribute classifies everything correctly, we're done (depth = 1)!

Information Theory  How much information do you need to be given in order to answer a yes/no question?  1 bit.  How much information do you need to answer a yes/no question if you know that the yes answer has probability 1?  0 bits – you already know the answer.  So, a yes/no question where each answer is 0.5 probable requires 1 bit of information to answer.  And a yes/no question where one answer is 100% probable requires 0 bits.
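The two endpoint cases above are the extremes of the binary entropy function, which is what ID3's gain calculation is built from. A minimal sketch:

```python
import math

def entropy(p_yes):
    """Bits of information needed to answer a yes/no question
    whose 'yes' answer has probability p_yes."""
    bits = 0.0
    for p in (p_yes, 1.0 - p_yes):
        if p > 0:
            bits -= p * math.log2(p)
    return bits

print(entropy(0.5))  # 1.0 -- a fair yes/no question costs one full bit
print(entropy(1.0))  # 0.0 -- a certain answer costs nothing
```

Every probability strictly between the extremes costs somewhere between 0 and 1 bit; e.g. a 90/10 question needs about 0.47 bits.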

ID3 Learning Algorithm 1. S = training set 2. Calculate the gain for each remaining attribute 3. Select the attribute with the highest gain and create a new node for each partition 4. For each partition: if it contains one class, end; else if it contains more than one class, go to 2 with the remaining attributes; else if it is empty, label it with the most common class of the parent (or set it as null) 5. If the attributes are exhausted (this will only happen for an inconsistent S), label with the majority class  Attributes which best discriminate between classes are chosen  If the same class ratios are found in the partitioned sets, then the gain is 0
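The steps above can be sketched compactly: compute the information gain of each remaining attribute, split on the best, and recurse until a partition is single-class or the attributes run out. The toy weather-style dataset and dictionary tree representation are illustrative choices, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr):
    """Information gain of splitting `examples` on `attr`."""
    before = entropy([label for _, label in examples])
    after = 0.0
    for v in {features[attr] for features, _ in examples}:
        subset = [label for features, label in examples if features[attr] == v]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

def id3(examples, attrs):
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority                    # leaf: single class, or attrs exhausted
    best = max(attrs, key=lambda a: gain(examples, a))
    node = {"attr": best, "branches": {}, "default": majority}
    for v in {f[best] for f, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == v]
        node["branches"][v] = id3(subset, [a for a in attrs if a != best])
    return node

def classify(tree, features):
    while isinstance(tree, dict):
        v = features.get(tree["attr"])
        tree = tree["branches"].get(v, tree["default"])
    return tree

data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"},  "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = id3(data, ["outlook", "windy"])
print(classify(tree, {"outlook": "sunny", "windy": "no"}))  # play
```

This sketch omits the noise-handling refinements (pruning, gain ratio, missing values) discussed below; it is the bare greedy recursion.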

Over-fitting  Definition: if h1 fits the training data better than h2, but h1 performs worse than h2 on new data, then h1 over-fits the training data  How do we avoid it?

ID3 Noise Handling Mechanisms – Early Stopping  Could allow only attributes with information gains exceeding some threshold, in order to sift out noise. However, this empirically tends to disallow relevant attribute tests.  Use a statistical test (such as chi-square) to decide confidence in whether an attribute is irrelevant. Gives the best ID3 results, and takes the amount of data into account, which the threshold approach does not.  Use a separate set of examples (a holdout set) to determine when to stop.  When you decide not to split on any more attributes, label the node with either the most common class, or with the probability of the most common class (good for learning a distribution vs. a function).

ID3 Noise Handling Mechanisms – Post Pruning  Consider each node, remove the subtree rooted at it, and test  Use a separate set of examples to prune  Use statistical tests  Rule Post Pruning: –Convert the tree to rules (one rule for each path from the root to a leaf) –Generalize each rule by considering dropping each of its pre-conditions –Sort the rules according to estimated accuracy, and consider them in this order

ID3 - Missing Attribute Values - Learning  Throw out data with missing attributes? Missing values are too common, the examples could be important, and we would not be prepared to generalize with missing attributes  Set the attribute to its most probable value overall  Set the attribute to its most probable value given the example's class – similar performance  Use a learning scheme (ID3, etc.) to fill in the attribute value, where the training set is made up of complete examples and the initial output class is treated as just another attribute. Better, but not always empirically convincing  Let "unknown" be just another attribute value – for ID3 this has the anomaly of apparently higher gain due to more attribute values, which can be fixed with gain ratio
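The two simplest imputation options above can be sketched in one helper: fill a missing attribute with the most common value overall, or with the most common value among examples of the same class. The tiny dataset and names are illustrative.

```python
from collections import Counter

def most_common_value(examples, attr, cls=None):
    """Most frequent non-missing value of `attr`, optionally
    restricted to examples whose label equals `cls`."""
    values = [f[attr] for f, label in examples
              if f.get(attr) is not None and (cls is None or label == cls)]
    return Counter(values).most_common(1)[0][0]

data = [
    ({"color": "red"},  "pos"),
    ({"color": "red"},  "pos"),
    ({"color": "blue"}, "neg"),
    ({"color": None},   "neg"),   # missing attribute value
]
print(most_common_value(data, "color"))             # red  (overall)
print(most_common_value(data, "color", cls="neg"))  # blue (given the class)
```

As the slide notes, conditioning on the class often changes the imputed value, which is exactly what happens here.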

ID3 - Missing Attribute Values - Execution  What do we do when, during execution, we arrive at a test on an attribute whose value is missing?  Each branch has a probability of being taken, based on what percentage of training-set examples went down each branch  Take all branches, but carry a weight representing that probability. Weights can be further modified (multiplied) by other missing attributes in the current test example as they continue down the tree.  This results in multiple active leaf nodes. Set the output to the leaf with the highest weight, or sum the weights for each output class and output the class with the largest sum.
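The weighted-branches scheme above can be sketched as follows. The tree representation, which stores each branch with the fraction of training examples that went down it, is an illustrative choice, not from the slides.

```python
def classify_weighted(tree, features, weight=1.0, totals=None):
    """Accumulate per-class weights over all leaves reachable from `tree`.
    Leaves are plain class labels; internal nodes map each branch value
    to a (subtree, training_fraction) pair."""
    if totals is None:
        totals = {}
    if not isinstance(tree, dict):            # leaf: deposit this path's weight
        totals[tree] = totals.get(tree, 0.0) + weight
        return totals
    value = features.get(tree["attr"])
    if value in tree["branches"]:             # attribute present: follow one branch
        subtree, _ = tree["branches"][value]
        classify_weighted(subtree, features, weight, totals)
    else:                                     # missing: take every branch, scaled
        for subtree, frac in tree["branches"].values():
            classify_weighted(subtree, features, weight * frac, totals)
    return totals

tree = {
    "attr": "outlook",
    "branches": {
        # (subtree, fraction of training examples down this branch)
        "sunny": ("play", 0.6),
        "rainy": ("stay", 0.4),
    },
}

totals = classify_weighted(tree, {})   # "outlook" missing at test time
print(max(totals, key=totals.get))     # play (weight 0.6 vs 0.4)
```

With the attribute present, exactly one leaf receives weight 1.0; with it missing, both leaves are active and the class with the larger summed weight wins, as the slide describes.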

ID3 - Conclusions  Good empirical results  Comparable application robustness and accuracy to neural networks, with faster learning (though neural networks handle continuous values better, for both input and output)  The most used and best known of current systems; used widely to aid in creating rules for expert systems