Chapter 6 Classification and Prediction

1 Chapter 6 Classification and Prediction
Dr. Bernard Chen, Ph.D.
University of Central Arkansas, Fall 2009

2 Classification and Prediction
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. For example:
Decide whether a bank loan applicant is "safe" or "risky"
Predict whether a customer will buy a new computer
Analyze cancer data to predict which of three specific treatments should be applied

3 Classification
Classification is a two-step process
Learning step: constructs a model (classifier) based on the training set and the values (class labels) of a classifying attribute
Prediction step: uses the model to predict categorical class labels (discrete or nominal) for new data

4 Learning step: Model Construction
[Figure: the training data are fed to a classification algorithm, which produces a classifier (model), e.g., the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes']
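As an illustration, the learned rule from the figure can be written directly as a small Python function; the function and field names are hypothetical, chosen only to mirror the rule.

```python
# The learned rule from the figure, expressed as a Python function.
# Function and field names are illustrative, not from the lecture.
def tenured(rank, years):
    return "yes" if rank == "professor" or years > 6 else "no"

print(tenured("professor", 3))   # yes (rank matches)
print(tenured("assistant", 7))   # yes (years > 6)
print(tenured("assistant", 4))   # no
```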

5 Learning step
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae

6 Prediction step: Using the Model in Prediction
[Figure: the classifier is first evaluated on testing data, then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?]

7 Prediction step
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test-set samples that are correctly classified by the model
The test set must be independent of the training set; otherwise over-fitting will occur
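As a sketch, the accuracy rate described above can be computed as follows; `model` is assumed to be any object exposing a predict() method (a hypothetical interface, not a specific library's API).

```python
# Hedged sketch: accuracy = fraction of test samples whose predicted label
# matches the known label. `model.predict` is an assumed interface.
def accuracy(model, test_samples, test_labels):
    correct = sum(1 for x, y in zip(test_samples, test_labels)
                  if model.predict(x) == y)
    return correct / len(test_labels)   # e.g., 0.85 means 85% correct
```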

8 N-fold Cross-validation
To mitigate the over-fitting problem, n-fold cross-validation is usually used. For example, 7-fold cross-validation:
Divide the whole training dataset into 7 equal parts
Take the first part away and train the model on the remaining 6 portions
After the model is trained, feed the first part in as the testing dataset and obtain the accuracy
Repeat steps two and three, taking the second part away, and so on (a runnable sketch follows below)
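Below is a minimal, self-contained Python sketch of this procedure. To stay runnable without a classifier library, "training" is replaced by taking the majority class of the remaining folds; the stand-in model and the toy labels are illustrative assumptions, not part of the lecture.

```python
# Self-contained sketch of 7-fold cross-validation following the steps above.
def majority_label(labels):
    return max(set(labels), key=labels.count)

def cross_validate(data, labels, n=7):
    fold = len(data) // n
    scores = []
    for i in range(n):
        lo, hi = i * fold, (i + 1) * fold
        test_y = labels[lo:hi]                    # step 3: held-out part
        train_y = labels[:lo] + labels[hi:]       # step 2: remaining 6 parts
        pred = majority_label(train_y)            # "train" the stand-in model
        scores.append(sum(y == pred for y in test_y) / len(test_y))
    return sum(scores) / n                        # average accuracy over folds

labels = ["yes"] * 9 + ["no"] * 5                 # 14 samples, as in the running example
print(round(cross_validate(list(range(14)), labels), 3))
```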

9 Supervised Learning vs. Unsupervised Learning
Because the class label of each training tuple is provided, this step is also known as supervised learning
It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is unknown

10 Issues: Data Preparation
Data cleaning: preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection): remove irrelevant or redundant attributes
Data transformation: generalize and/or normalize data
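To make two of these steps concrete, here is a small pure-Python sketch (the sample attribute values are made up): mean imputation for missing values, then min-max normalization.

```python
# Illustrative sketch of two preparation steps named above: mean imputation
# for missing values (data cleaning) and min-max normalization to [0, 1]
# (data transformation).
def clean_and_normalize(column):
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)                 # fill missing with the mean
    filled = [v if v is not None else mean for v in column]
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]  # scale to [0, 1]

print(clean_and_normalize([30, None, 45, 60]))     # [0.0, 0.5, 0.5, 1.0]
```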

11 Issues: Evaluating Classification Methods
Accuracy
Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency for disk-resident databases
Interpretability

12 Decision Tree
Decision tree induction is the learning of decision trees from class-labeled training tuples
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute
Each branch represents an outcome of the test
Each leaf node holds a class label (a structural sketch follows below)
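A minimal sketch of this structure in Python, assuming categorical attribute tests; the class and the tiny example tree are illustrative, not taken from the lecture slides.

```python
# Minimal sketch of the tree structure: internal nodes test an attribute,
# branches map test outcomes to subtrees, leaves hold a class label.
class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute       # attribute tested at an internal node
        self.branches = branches or {}   # outcome -> child Node
        self.label = label               # class label, set only at a leaf

def classify(node, sample):
    if node.label is not None:           # leaf reached: return its class label
        return node.label
    outcome = sample[node.attribute]     # apply the test at this internal node
    return classify(node.branches[outcome], sample)

# toy tree: test age first, then (on the "<=30" branch) student status
yes, no = Node(label="yes"), Node(label="no")
student = Node(attribute="student", branches={"yes": yes, "no": no})
root = Node(attribute="age", branches={"<=30": student, "31..40": yes, ">40": yes})
print(classify(root, {"age": "<=30", "student": "yes"}))   # -> yes
```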

13 Decision Tree Example
[Figure: an example decision tree]

14 Decision Tree Algorithm
Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
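In practice this basic algorithm is available off the shelf. As a hedged example, assuming scikit-learn is installed, DecisionTreeClassifier with criterion="entropy" selects test attributes by information gain as described above; the integer encoding of the categorical attributes and the toy data are my own illustration.

```python
# Inducing a decision tree with scikit-learn (assumed installed).
# criterion="entropy" chooses splits by information gain. Categorical
# values are integer-encoded because the implementation expects numbers.
from sklearn.tree import DecisionTreeClassifier

# age: 0 = "<=30", 1 = "31..40", 2 = ">40"; student: 0 = no, 1 = yes
X = [[0, 0], [0, 1], [1, 0], [2, 1], [2, 0], [1, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]          # buys_computer labels

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(tree.predict([[0, 1]]))                          # classify a young student
```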


16 Decision Tree
[Figure: the 14-sample training set used in the running example]

17 Decision Tree
The partition "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's:
I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Weighting the three age partitions by size gives Info_age(D) = 0.694, so Gain(age) = 0.940 - 0.694 = 0.246
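A quick Python check of this arithmetic (log base 2):

```python
# Verifying I(2,3) with log base 2.
from math import log2

def I(*counts):
    """Expected information for a class distribution, e.g., I(2, 3)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(I(2, 3), 3))   # 0.971, the value computed above
print(round(I(9, 5), 3))   # 0.940, entropy of the full 14-sample set
```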

18 Decision Tree
Similarly, we can compute:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Since "age" (Gain = 0.246) obtains the highest information gain, we partition the tree on age

19 Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ(i=1..m) pi log2(pi)

20 Attribute Selection Measure: Information Gain (ID3/C4.5)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ(j=1..v) (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
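Putting the two formulas together, here is a self-contained sketch that reproduces Gain(age) = 0.246 from the running example; the counts for the two age groups not shown on slide 17 (31..40: 4 yes/0 no; >40: 3 yes/2 no) are assumed values consistent with Info_age(D) = 0.694.

```python
from math import log2

def I(*counts):                      # entropy, as defined on slide 19
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):    # Info_A(D), as defined above
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * I(*p) for p in partitions)

# age partitions of D as (yes, no) counts: <=30, 31..40, >40
parts = [(2, 3), (4, 0), (3, 2)]
gain_age = I(9, 5) - info_after_split(parts)
print(round(gain_age, 3))            # 0.246, matching slide 18
```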

21 Decision Tree

22 Decision Tree

