1 Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island

2 What is Data Mining? Data mining is the application of machine learning techniques to large databases in order to extract knowledge. (KDD – Knowledge Discovery in Databases) This is no longer strictly true: data mining now encompasses other computational techniques outside the classic machine learning domain.

3 What is Machine Learning? Programs that get better with experience given a task and some performance measure. –Learning to classify news articles –Learning to recognize spoken words –Learning to play board games –Learning to classify customers

4 What is Knowledge? •Structural descriptions of data (transparent) –If-then-else rules –Decision trees •Models of data (non-transparent) –Neural Networks –Clustering (Self-Organizing Maps, k-Means) –Naïve-Bayes Classifiers

5 Why Data Mining? Oversimplifying somewhat: Queries allow you to retrieve existing knowledge from a database. Data mining induces new knowledge in a database.

6 Why Data Mining? (Cont.) Example: Give me a description of customers who spent more than $100 in my store.

7 Why Data Mining? (Cont.) The Query: •The only thing a query can do is give you a list of every single customer who spent more than $100. •Probably not very informative, except that you will most likely see a lot of customer records.

8 Why Data Mining? (Cont.) Data Mining Techniques: •Data mining techniques allow you to generate structural descriptions of the data in question, i.e., induce new knowledge. •In the case of rules this might look something like: IF age < 35 AND car = MINIVAN THEN spent > $100

9 Why Data Mining? (Cont.) • In principle, you could generate the same kind of knowledge you gain with data mining techniques using only queries: – look at the data set of customers who spent more than $100 and propose a hypothesis – test this hypothesis against your data using a query – if the query returns a non-null result set then you have found a description of a subset of your customers • This is a time-consuming, undirected search.

10 Decision Trees • Decision trees are concept learning algorithms • Once a concept is acquired the algorithm can classify objects according to this concept. • Concept Learning: – acquiring the definition of a general category given a sample of positive and negative examples of the category, – can be formulated as a problem of searching through a predefined space of potential concepts for the concept that best fits the training examples. • Best known algorithms: ID3, C4.5, CART

11 Example MI – Myocardial Infarction (Source: “Neural Networks and Artificial Intelligence for Biomedical Engineering”, IEEE Press, 2000) Below is a table of patients who have entered the emergency room complaining about chest pain – two types of diagnoses: Angina and Myocardial Infarction. Question: can we generalize beyond this data?

12 Example (Cont’d) C4.5 induces the following decision tree for the data: the root node tests Systolic Blood Pressure; the branch > 130 leads to the leaf Angina, and the branch <= 130 leads to the leaf Myocardial Infarction. (The slide also shows the corresponding decision surface.)

13 Definition of Concept Learning • Given: – A data universe X – A sample set S, where S ⊆ X – Some target concept c: X → {true, false} – Labeled training examples D, where D = { ⟨s, c(s)⟩ | s ∈ S } • Using D determine: – A function c′ such that c′(x) ≈ c(x) for all x ∈ X. • Notes: – This is called supervised learning because of the necessity of labeled data provided by the trainer. – Once we have determined c′ we can use it to make predictions on unseen elements of the data universe.

14 The Inductive Learning Hypothesis Any function found to approximate the target concept well over a sufficiently large set of training examples will also approximate the target concept well over other unobserved examples. In other words, we are able to generalize beyond what we have seen.

15 Recasting our Example as a Concept Learning Problem – The data universe X is the set of ordered pairs Systolic Blood Pressure × White Blood Count – The sample set S ⊆ X is the table of value pairs we are given – Target concept Diagnosis: X → {Angina, MI} – Training examples D = { ⟨s, Diagnosis(s)⟩ | s ∈ S } – Find a function Diagnosis′ that best describes D.

16 Recasting our Example as a Concept Induction Problem A definition of the learned function Diagnosis′: Diagnosis′(Systolic Blood Pressure, White Blood Count) = IF Systolic Blood Pressure > 130 THEN Diagnosis′ = Angina ELSE IF Systolic Blood Pressure <= 130 THEN Diagnosis′ = MI.
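The learned function Diagnosis′ can be written directly in executable form; a minimal sketch in Python, with argument names of my own choosing:

```python
def diagnosis(systolic_bp, white_blood_count):
    """Learned function Diagnosis' read off the induced decision tree.

    Note that white_blood_count is unused: the induced tree happens to
    test only systolic blood pressure.
    """
    if systolic_bp > 130:
        return "Angina"
    else:
        return "MI"
```

For example, diagnosis(140, 7000) returns "Angina", while diagnosis(120, 7000) returns "MI".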

17 Decision Tree Representation We can represent the learned function as a tree: • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification. In our example the root tests Systolic Blood Pressure, with the branch > 130 leading to the leaf Angina and the branch <= 130 leading to the leaf Myocardial Infarction.

18 Entropy • S is a sample of training examples • p₊ is the proportion of positive examples in S • p₋ is the proportion of negative examples in S • Entropy measures the impurity (randomness) of S: Entropy(S) = − p₊ log₂ p₊ − p₋ log₂ p₋
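The entropy formula above can be sketched as a small Python helper (a hypothetical function of mine, with labels given as 0/1 or boolean values):

```python
import math

def entropy(labels):
    """Entropy(S) = -p+ log2 p+  -  p- log2 p-  for binary labels."""
    n = len(labels)
    p_pos = sum(1 for x in labels if x) / n
    p_neg = 1 - p_pos
    # By convention 0 * log2(0) is treated as 0, so a pure sample has entropy 0.
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)
```

An evenly split sample has the maximal entropy of 1.0; a pure sample has entropy 0.0.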

19 Top-down Induction of Decision Trees • Recursive Algorithm • Main loop: – Let A be the attribute that minimizes the (weighted) entropy of the split at the current node – For each value of A, create a new descendant of the node – Sort the training examples to the leaf nodes – If the training examples are classified satisfactorily, then STOP, else iterate over the new leaf nodes.
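A minimal sketch of this recursive loop, assuming Python, discrete attribute values, and examples represented as dicts (ID3-style, without the refinements of C4.5 such as pruning or continuous attributes):

```python
import math
from collections import Counter, defaultdict

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, attrs, label):
    """Build a decision tree as nested dicts; leaves are class labels."""
    labels = [e[label] for e in examples]
    if len(set(labels)) == 1:              # all examples agree: pure leaf
        return labels[0]
    if not attrs:                          # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def split_entropy(a):                  # weighted entropy after splitting on a
        groups = defaultdict(list)
        for e in examples:
            groups[e[a]].append(e[label])
        return sum(len(g) / len(examples) * _entropy(g) for g in groups.values())

    best = min(attrs, key=split_entropy)   # attribute minimizing the entropy
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([e for e in examples if e[best] == v], rest, label)
                   for v in set(e[best] for e in examples)}}
```

On the emergency-room example (with blood pressure discretized into high/low) this induces a one-level tree that splits on blood pressure, mirroring the C4.5 tree on the earlier slide.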

20 Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A. In other words, Gain(S, A) is the information provided about the target concept, given the value of some attribute A.
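Written out, Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v), where S_v is the subset of S with value v for attribute A. A sketch in Python, again assuming examples represented as dicts with discrete attribute values:

```python
import math
from collections import Counter, defaultdict

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, label):
    """Gain(S, A): entropy of S minus the weighted entropy of the
    subsets S_v obtained by sorting S on the values of attribute A."""
    total = _entropy([e[label] for e in examples])
    groups = defaultdict(list)
    for e in examples:
        groups[e[attr]].append(e[label])
    remainder = sum(len(g) / len(examples) * _entropy(g)
                    for g in groups.values())
    return total - remainder
```

An attribute that splits the sample into pure subsets yields the maximal gain, equal to the entropy of S itself.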

21 Training, Evaluation and Prediction We know how to induce classification rules on the data, but: • How do we measure performance? • How do we use our rules to do prediction?

22 Training & Evaluation The simplest method of measuring performance is the hold-out method: • Given labeled data D, we divide the data into two sets: – A hold-out (test) set D_h of size h – A training set D_t = D − D_h • The error of the induced function c′_t is: error_h(c′_t) = (1/h) Σ_{s ∈ D_h} δ(c′_t(s), c(s)), where δ(p, q) = 1 if p ≠ q and 0 otherwise.
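The hold-out error is just the misclassification rate on the test set; a sketch with hypothetical names, where predict plays the role of c′_t and test_set the role of D_h:

```python
def holdout_error(predict, test_set, label):
    """error_h = (1/h) * sum of delta(c'(s), c(s)) over the hold-out set,
    i.e. the fraction of hold-out examples that are misclassified."""
    h = len(test_set)
    mistakes = sum(1 for s in test_set if predict(s) != s[label])
    return mistakes / h
```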

23 Training & Evaluation • However, since we trained and evaluated the learner on a finite set of data, we want to know what the confidence interval is. • We can compute the 95% confidence interval of error_h as follows: – Assume that the hold-out set D_h has h ≥ 30 members. – Assume that each d in D_h has been selected independently and according to the probability distribution over the domain. Then the true error lies in the interval error_h ± 1.96 √( error_h (1 − error_h) / h ).
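Under those two assumptions the interval can be computed directly; a sketch with hypothetical names:

```python
import math

def confidence_interval_95(error_h, h):
    """95% confidence interval error_h +/- 1.96 * sqrt(error_h(1-error_h)/h).

    The normal approximation behind this formula assumes h >= 30.
    """
    margin = 1.96 * math.sqrt(error_h * (1 - error_h) / h)
    return error_h - margin, error_h + margin
```

For example, an observed error of 0.1 on a hold-out set of 100 examples gives an interval of roughly (0.041, 0.159).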

24 Prediction • As we have said earlier, the induced function c′ ≈ c; that is, the induced function is an estimate of the target concept. • Therefore, we can use c′ to estimate (predict) the label for any unseen instance x ∈ X with an appropriate accuracy.

25 Summary • Data Mining is the application of machine learning algorithms to large databases in order to induce new knowledge. • Machine Learning can be considered to be a directed search over the space of all possible descriptions of the training data for the best description of the data set that also generalizes well to unseen instances. • Decision trees are concept learning algorithms that learn classification functions.

