Information Extraction Lecture 6 – Decision Trees (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS.


Information Extraction Lecture 6 – Decision Trees (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS

Schedule In this lecture we will learn about decision trees. Sarawagi talks about classifier-based IE in Chapter 3. Unfortunately, the discussion is very technical; I would recommend reading it, but not worrying too much about the math (yet)

Decision Tree Representation for ‘Play Tennis?’  Internal node ~ test an attribute  Branch ~ attribute value  Leaf ~ classification result Slide from A. Kaban
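To make the representation concrete, here is a minimal sketch of how such a tree can be encoded and applied. The 'Play Tennis?' tree itself is only a figure in the original slides and is not reproduced in this transcript, so the attribute names and tree shape below are illustrative assumptions, not the slide's tree:

```python
# A decision tree as nested dicts: internal node = attribute test,
# keys of "branches" = attribute values, leaf = classification result.
# Hypothetical 'Play Tennis?'-style tree, for illustration only.
play_tennis_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain":     {"attribute": "wind",
                     "branches": {"strong": "no", "weak": "yes"}},
    },
}

def classify(tree, example):
    """Walk from the root to a leaf, following the branch that matches the example."""
    while isinstance(tree, dict):            # internal node: test an attribute
        value = example[tree["attribute"]]   # branch: the attribute's value
        tree = tree["branches"][value]
    return tree                              # leaf: the classification result

print(classify(play_tennis_tree, {"outlook": "sunny", "humidity": "high", "wind": "weak"}))  # -> "no"
```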

When is it useful?  Medical diagnosis  Equipment diagnosis  Credit risk analysis  etc Slide from A. Kaban

Outline Contingency tables – Census data set Information gain – Beach data set Learning an unpruned decision tree recursively – Gasoline usage data set Training error Test error Overfitting Avoiding overfitting

Using this idea for classification We will now look at a (toy) dataset presented in Winston's Artificial Intelligence textbook. It is often used in explaining decision trees. The variable we are trying to predict is "got_a_sunburn" (or "is_sunburned" if you like). We will look at different decision trees that can be used to correctly classify this data

Sunburn Data Collected Slide from A. Kaban
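The data table itself appears only as an image in the original slides. The list below is a reconstruction of the classic Winston sunburn data, so the individual attribute values are assumptions rather than a copy of the slide; they do, however, reproduce the disorder values computed later in the lecture:

```python
# Reconstructed sunburn data (assumed values): name, hair, height, weight, lotion, sunburned?
sunburn_data = [
    ("Sarah", "blonde", "average", "light",   "no",  True),
    ("Dana",  "blonde", "tall",    "average", "yes", False),
    ("Alex",  "brown",  "short",   "average", "yes", False),
    ("Annie", "blonde", "short",   "average", "no",  True),
    ("Emily", "red",    "average", "heavy",   "no",  True),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  False),
    ("John",  "brown",  "average", "heavy",   "no",  False),
    ("Katie", "blonde", "short",   "light",   "yes", False),
]
# 3 of the 8 people are sunburned (Sarah, Annie, Emily).
```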

Decision Tree 1 (tree diagram, not reproduced here): a deep tree that splits first on Height (average / tall / short) and then on Weight and Hair colour; its leaves separate Sarah, Emily and Annie (sunburned) from Dana, Pete, Alex, John and Katie. The rules it encodes are spelled out on the next slide. Slide from A. Kaban

Sunburn sufferers are...
if height = "average" then
    if weight = "light" then return(true)          ;;; Sarah
    elseif weight = "heavy" then
        if hair_colour = "red" then return(true)   ;;; Emily
elseif height = "short" then
    if hair_colour = "blonde" then
        if weight = "average" then return(true)    ;;; Annie
else return(false)                                 ;;; everyone else
Slide from A. Kaban

Decision Tree 2 (tree diagram): the root splits on Lotion used (no / yes), and each branch then splits on Hair colour; the leaves are Sarah, Annie (sunburned), Emily (sunburned) and Pete, John on the "no" side, and Dana, Katie and Alex (none sunburned) on the "yes" side. Slide from A. Kaban

Decision Tree 3 (tree diagram):
Hair colour
    blonde -> Lotion used
        no  -> Sarah, Annie (sunburned)
        yes -> Dana, Katie (not sunburned)
    red   -> Emily (sunburned)
    brown -> Alex, Pete, John (not sunburned)
Slide from A. Kaban

Summing up
- Irrelevant attributes do not classify the data well
- Using irrelevant attributes thus causes larger decision trees
- A computer could look for simpler decision trees
- Q: How?
Slide from A. Kaban

A: How WE did it?
- Q: Which is the best attribute for splitting up the data?
- A: The one which is most informative for the classification we want to get.
- Q: What does it mean 'more informative'?
- A: The attribute which best reduces the uncertainty or the disorder
Slide from A. Kaban

- We need a quantity to measure the disorder in a set of examples S = {s_1, s_2, s_3, ..., s_n}, where s_1 = "Sarah", s_2 = "Dana", ...
- Then we need a quantity to measure how much the disorder is reduced by knowing the value of a particular attribute
Slide from A. Kaban

What properties should the Disorder (D) have?
- Suppose that D(S) = 0 means that all the examples in S have the same class
- Suppose that D(S) = 1 means that half the examples in S are of one class and half are the opposite class
Slide from A. Kaban

Examples
- D({"Dana", "Pete"}) = 0
- D({"Sarah", "Annie", "Emily"}) = 0
- D({"Sarah", "Emily", "Alex", "John"}) = 1
- D({"Sarah", "Emily", "Alex"}) = ?
Slide from A. Kaban

D({"Sarah", "Emily", "Alex"}) = 0.918 (read off a plot of the disorder D against the proportion of positive examples p) Slide from A. Kaban

Definition of Disorder The Entropy measures the disorder of a set S containing a total of n examples, of which n+ are positive and n- are negative, and it is given by
D(n+, n-) = - (n+/n) log2 (n+/n) - (n-/n) log2 (n-/n), where n = n+ + n-
Check it! D(0,1) = ? D(1,0) = ? D(0.5,0.5) = ?
Slide from A. Kaban
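A small sketch of this definition (my own illustrative code, not from the lecture), answering the "check it" questions and the 0.918 example from the previous slides:

```python
from math import log2

def disorder(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples (0*log2(0) taken as 0)."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

print(disorder(0, 1))   # 0.0   -> all examples in one class: no disorder
print(disorder(1, 0))   # 0.0
print(disorder(4, 4))   # 1.0   -> an even split: maximal disorder
print(disorder(2, 1))   # 0.918 -> e.g. {"Sarah", "Emily", "Alex"}: 2 sunburned, 1 not
```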

Back to the beach (or the disorder of sunbathers)! D({"Sarah", "Dana", "Alex", "Annie", "Emily", "Pete", "John", "Katie"}) = D(3, 5) = 0.954 (3 of the 8 sunbathers are sunburned) Slide from A. Kaban

Some more useful properties of the Entropy Slide from A. Kaban

- So: we can measure the disorder
- What's left: we want to measure how much the disorder of a set would be reduced by knowing the value of a particular attribute.
Slide from A. Kaban

The Information Gain measures the expected reduction in entropy due to splitting on an attribute A:
Gain(S, A) = D(S) - sum over values v of A of (|S_v| / |S|) * D(S_v)
The average disorder is just the weighted sum of the disorders in the branches (subsets) created by the values of A. We want a large Gain, which is the same as a small average disorder created.
Slide from A. Kaban

Back to the beach: calculate the Average Disorder associated with Hair Colour (diagram: splitting on Hair colour puts Sarah, Annie, Dana and Katie in the blonde branch, Emily in the red branch, and Alex, Pete and John in the brown branch) Slide from A. Kaban

...Calculating the Disorder of the "blondes" The first term of the sum: D(S_blonde) = D({"Sarah", "Annie", "Dana", "Katie"}) = D(2, 2) = 1 Slide from A. Kaban

...Calculating the disorder of the others The second and third terms of the sum: S_red = {"Emily"}, S_brown = {"Alex", "Pete", "John"}. These are both 0, because within each set all the examples have the same class. So the average disorder created when splitting on 'hair colour' is (4/8)*1 + (1/8)*0 + (3/8)*0 = 0.5 Slide from A. Kaban

Which decision variable minimises the disorder?

Test     Avg disorder
Hair     0.5   (this is what we just computed)
Height   0.69
Weight   0.94
Lotion   0.61
(the average disorders of the other attributes are computed in the same way)

Which decision variable maximises the Info Gain then? Remember it's the one which minimises the avg disorder (see the Information Gain definition above).
Slide from A. Kaban
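These numbers can be reproduced with a short sketch. It again uses the reconstructed data from above, so the exact attribute values are assumptions, but the resulting averages match the table:

```python
from math import log2

# (hair, height, weight, lotion, sunburned?) -- reconstructed, assumed values
data = [
    ("blonde", "average", "light",   "no",  True),   # Sarah
    ("blonde", "tall",    "average", "yes", False),  # Dana
    ("brown",  "short",   "average", "yes", False),  # Alex
    ("blonde", "short",   "average", "no",  True),   # Annie
    ("red",    "average", "heavy",   "no",  True),   # Emily
    ("brown",  "tall",    "heavy",   "no",  False),  # Pete
    ("brown",  "average", "heavy",   "no",  False),  # John
    ("blonde", "short",   "light",   "yes", False),  # Katie
]
attributes = ["hair", "height", "weight", "lotion"]

def disorder(pos, neg):
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def average_disorder(rows, attr_index):
    """Weighted sum of branch disorders after splitting on one attribute."""
    branches = {}
    for row in rows:
        branches.setdefault(row[attr_index], []).append(row[-1])
    return sum(len(labels) / len(rows) * disorder(sum(labels), len(labels) - sum(labels))
               for labels in branches.values())

for i, name in enumerate(attributes):
    print(f"{name:7s} {average_disorder(data, i):.2f}")
# hair 0.50, height 0.69, weight 0.94, lotion 0.61 -- hair colour wins
```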

So what is the best decision tree?
Hair colour
    blonde -> ? (Sarah, Annie, Dana, Katie: still mixed)
    red    -> Emily (sunburned)
    brown  -> Alex, Pete, John (not sunburned)
Slide from A. Kaban
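The "?" is resolved by recursing: apply exactly the same procedure to the still-mixed blonde subset. A minimal, self-contained sketch (the lotion values are again the reconstructed, assumed ones) shows that within the blondes, splitting on "lotion used" brings the average disorder down to 0, which gives Decision Tree 3:

```python
from math import log2

def disorder(pos, neg):
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

# The blonde branch only (reconstructed rows): Sarah, Dana, Annie, Katie
# as (lotion_used, sunburned?) pairs.
blondes = [("no", True), ("yes", False), ("no", True), ("yes", False)]

branches = {}
for lotion, burned in blondes:
    branches.setdefault(lotion, []).append(burned)

avg = sum(len(v) / len(blondes) * disorder(sum(v), len(v) - sum(v)) for v in branches.values())
print(avg)  # 0.0 -> a pure split: "lotion used" is the best attribute inside the blonde branch
```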

Outline Contingency tables – Census data set Information gain – Beach data set Learning an unpruned decision tree recursively – Good/bad gasoline usage = "miles per gallon" data set Training error Test error Overfitting Avoiding overfitting

Predict good

Things I didn't discuss
- How to deal with real-valued inputs
  - Either: discretize these into buckets before building a decision tree (as was done on the gasoline usage data set)
  - Or, while building the decision tree, use less-than checks, e.g. try all of age < 24 versus age < 25 versus age < 26 (etc.) and find the best split of the age variable (see the sketch below)
  - But the details are complex and this requires many assumptions
- Information Gain can sometimes incorrectly favor splitting on attributes which have many possible values
  - There are alternative criteria that are sometimes preferred
  - There are also ways to correct Information Gain for this problem
- There are also very different solutions that work well for classification like this, like Naive Bayes or linear models in general
  - These associate one weight with each feature value, as we saw in the previous lecture!
  - The same ideas about generalization and overfitting apply!
  - We'll discuss these further in the next lecture
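As a hedged illustration of the "less-than checks" idea (my own sketch, with made-up ages and labels, not data from the lecture): sort the candidate thresholds, compute the information gain of each binary split, and keep the best one:

```python
from math import log2

def disorder(pos, neg):
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def best_threshold(values, labels):
    """Try 'value < t' splits for every candidate threshold and return the one with the largest gain."""
    base = disorder(sum(labels), len(labels) - sum(labels))
    best = None
    for t in sorted(set(values)):
        left  = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        if not left or not right:
            continue  # skip degenerate splits
        avg = (len(left) / len(labels)) * disorder(sum(left), len(left) - sum(left)) \
            + (len(right) / len(labels)) * disorder(sum(right), len(right) - sum(right))
        gain = base - avg
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical ages with a boolean label, purely for illustration:
ages   = [19, 22, 24, 25, 31, 40, 52, 61]
labels = [True, True, True, False, False, False, False, False]
print(best_threshold(ages, labels))  # -> (25, 0.954...): the split 'age < 25' separates the classes perfectly
```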

Slide sources – See Ata Kaban's machine learning class, particularly for the intuitive discussion of the Winston sunburn problem and Information Gain – See Andrew W. Moore's website for a longer presentation of his slides on decision trees, and slides on many other machine learning topics

Thank you for your attention!