
1 Data Analytics UNIT-IV: Classification

2 Chapter Sections Decision trees – overview, general algorithm, decision tree algorithms, evaluating a decision tree. Naïve Bayes – Bayes' theorem, naïve Bayes classifier, smoothing, diagnostics. Diagnostics of classifiers. Additional classification methods.

3 Classification
Classification is widely used for prediction.
Most classification methods are supervised.
This chapter focuses on two fundamental classification methods:
- Decision trees
- Naïve Bayes

4 Decision Trees
A tree structure specifies a sequence of decisions.
- Given input X = {x1, x2, …, xn}, predict output Y
- Input attributes/features can be categorical or continuous
- A tree has a root node, internal nodes, and leaf nodes; each non-leaf node tests a particular input variable, and leaf nodes return class labels
- Depth of a node = minimum number of steps required to reach the node from the root
- A branch (connecting two nodes) specifies a decision outcome
Two varieties of decision trees:
- Classification trees: categorical output, often binary
- Regression trees: numeric output

5 Decision Trees Overview of a Decision Tree
Example of a decision tree that predicts whether customers will buy a product.

6 Decision Trees Overview of a Decision Tree
Example: will a bank client subscribe to a term deposit?

7 Decision Trees The General Algorithm
Constructing a tree T from a training set S requires a measure of attribute information.
- Simplistic method (data from the previous figure): purity = probability of the corresponding class, e.g., P(no) = 1789/2000 = 89.45% and P(yes) = 10.55%
- Entropy methods: entropy measures the impurity of an attribute; information gain measures the reduction in impurity achieved by splitting on an attribute

8 Decision Trees The General Algorithm
Entropy-based measures of attribute information:
- Entropy of the output variable X: H_X = -Σ_x P(X = x) log2 P(X = x)
- Conditional entropy given an attribute Y: H_X|Y = Σ_y P(Y = y) H_X|Y=y
- Information gain of an attribute = base entropy - conditional entropy, i.e., InfoGain(Y) = H_X - H_X|Y
A small R sketch of these calculations follows.
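Entropy and information gain for a binary split, a minimal R sketch. The class counts mirror the slide's example (1789 no, 211 yes out of 2000 records); the split counts are illustrative assumptions.

entropy <- function(p) {
  p <- p[p > 0]                        # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

base <- entropy(c(1789, 211) / 2000)   # entropy of the output variable

# hypothetical binary attribute splitting the data into two subsets
left  <- c(no = 1500, yes = 50)        # assumed counts
right <- c(no = 289,  yes = 161)       # assumed counts
n <- sum(left) + sum(right)
cond <- sum(left) / n * entropy(left / sum(left)) +
        sum(right) / n * entropy(right / sum(right))

info_gain <- base - cond               # base entropy minus conditional entropy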

9 Decision Trees The General Algorithm
Construct a tree T from a training set S:
1. Choose as the root node the most informative attribute A.
2. Partition S according to A's values.
3. Construct subtrees T1, T2, … for the subsets of S recursively, until one of the following occurs:
- All leaf nodes satisfy a minimum purity threshold.
- The tree cannot be further split under the minimum purity threshold.
- Another stopping criterion is satisfied, e.g., maximum depth.

10 Decision Trees Decision Tree Algorithms
ID3 algorithm (T = training set, P = output variable, A = candidate attribute): recursively choose the attribute with the greatest information gain, split T on its values, and repeat on each subset; a sketch follows below.
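A minimal ID3 sketch in R, assuming categorical attributes and the entropy() helper defined earlier; this is an illustrative reconstruction, not the book's exact pseudocode.

id3 <- function(T, P, attrs) {
  labels <- T[[P]]
  if (length(unique(labels)) == 1) return(unique(labels))   # pure node
  if (length(attrs) == 0) return(names(which.max(table(labels))))
  # pick the attribute with the greatest information gain
  base <- entropy(table(labels) / length(labels))
  gain <- sapply(attrs, function(A) {
    cond <- sum(sapply(split(labels, T[[A]]), function(s)
      length(s) / length(labels) * entropy(table(s) / length(s))))
    base - cond
  })
  best <- attrs[which.max(gain)]
  # split on the best attribute and recurse on each subset
  subtrees <- lapply(split(T, T[[best]]), id3, P = P,
                     attrs = setdiff(attrs, best))
  list(attribute = best, branches = subtrees)
}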

11 Decision Trees Decision Tree Algorithms
C4.5 algorithm:
- Handles missing data
- Handles both categorical and continuous variables
- Uses bottom-up pruning to address overfitting
CART (Classification And Regression Trees):
- Also handles continuous variables
- Uses the Gini diversity index as the information measure

12 Decision Trees Evaluating a Decision Tree
Decision trees are greedy algorithms:
- They take the best option at each step, which may not be best overall
- This is addressed by ensemble methods such as random forest
The model might overfit the data (figure: error versus tree size, blue = training set, red = test set).
To overcome overfitting (a pruning sketch follows below):
- Stop growing the tree early, or
- Grow the full tree, then prune
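A hedged sketch of growing and pruning a tree with the rpart package (not part of the book's example); the variable names (banktrain, subscribed) follow the bank example and are assumed to be loaded.

library(rpart)

# grow a full tree; cp = 0 imposes no complexity penalty on splits
fit <- rpart(subscribed ~ ., data = banktrain, method = "class",
             control = rpart.control(cp = 0))

# choose the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# prune the full tree back to that complexity
pruned <- prune(fit, cp = best_cp)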

13 Decision Trees Evaluating a Decision Tree
Decision trees partition the input space into rectangular decision regions, one per leaf.

14 Decision Trees Evaluating a Decision Tree
Advantages of decision trees:
- Computationally inexpensive
- Outputs are easy to interpret as a sequence of tests
- Show the importance of each input variable
Decision trees handle:
- Both numerical and categorical attributes
- Categorical attributes with many distinct values
- Variables with nonlinear effects on the outcome
- Variable interactions

15 Decision Trees Evaluating a Decision Tree
Disadvantages of decision trees:
- Sensitive to small variations in the training data
- Prone to overfitting, because each split reduces the training data available to subsequent splits
- Perform poorly if the dataset contains many irrelevant variables

16 Chapter Sections Decision trees – overview, general algorithm, decision tree algorithms, evaluating a decision tree. Naïve Bayes – Bayes' theorem, naïve Bayes classifier, smoothing, diagnostics. Diagnostics of classifiers. Additional classification methods.

17 Naïve Bayes The Naïve Bayes Classifier
- Based on Bayes' theorem (or Bayes' law)
- Assumes the features contribute independently
- Features (variables) are generally categorical; discretization converts continuous variables into categorical ones
- Output is usually a class label plus a probability score
- Log probabilities are often used instead of raw probabilities

18 Naïve Bayes Bayes' Theorem
P(C|A) = P(A|C) P(C) / P(A), where C = class and A = observed attributes.
A typical medical diagnosis example is used because doctors frequently get this kind of conditional reasoning wrong; a worked sketch follows below.
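A worked base-rate example in R with illustrative (assumed) numbers: 1% disease prevalence, a test with 99% sensitivity and a 5% false-positive rate.

prevalence  <- 0.01    # P(disease)
sensitivity <- 0.99    # P(positive | disease)
false_pos   <- 0.05    # P(positive | no disease)

# Bayes' theorem: P(disease | positive)
p_positive <- sensitivity * prevalence + false_pos * (1 - prevalence)
sensitivity * prevalence / p_positive   # about 0.167

Despite the accurate test, only about 17% of positives actually have the disease, which is the intuition this kind of medical example targets.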

19 Naïve Bayes Naïve Bayes Classifier
Conditional independence assumption: P(A|cj) = P(a1|cj) P(a2|cj) ··· P(am|cj).
Dropping the common denominator P(A), we get P(cj|A) ∝ P(a1|cj) ··· P(am|cj) P(cj).
The classifier finds the cj that maximizes P(cj|A).

20 Naïve Bayes Naïve Bayes Classifier
Example: will a client subscribe to a term deposit? Given the following bank client's record, is this client likely to subscribe?

21 Naïve Bayes Naïve Bayes Classifier
Compute probabilities for this record

22 Naïve Bayes Naïve Bayes Classifier
Compute the naïve Bayes classifier outputs for yes and no:
- The client is assigned the label subscribed = yes
- The scores are small, but their ratio is what counts
- Using logarithms helps avoid numerical underflow (see the sketch below)
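Why log probabilities help, in a minimal R sketch; the conditional probabilities are illustrative assumptions.

cond_probs <- rep(0.01, 200)    # 200 small conditional probabilities
prod(cond_probs)                # 1e-400 underflows to 0 in double precision
sum(log(cond_probs))            # about -921, perfectly representable

Comparing classes by summed log scores picks the same winner that comparing the raw products would in exact arithmetic, without ever underflowing.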

23 Naïve Bayes Smoothing
A smoothing technique assigns a small nonzero probability to rare events that are missing from the training data.
- E.g., Laplace smoothing adds one to every count, as if each event occurred once more often than it actually does in the dataset (see the sketch below)
- Smoothing is essential: without it, a single zero conditional probability forces P(cj|A) = 0 regardless of the other evidence
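Laplace smoothing, a minimal R sketch; the counts are illustrative assumptions for one conditional distribution.

counts <- c(seen_a = 150, seen_b = 61, never_seen = 0)

counts / sum(counts)                           # never_seen gets probability 0
(counts + 1) / (sum(counts) + length(counts))  # small but nonzero everywhere

With e1071, the same idea is available as the laplace argument: naiveBayes(subscribed ~ ., data = banktrain, laplace = 1).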

24 Naïve Bayes Diagnostics
Naïve Bayes advantages:
- Handles missing values
- Robust to irrelevant variables
- Simple to implement
- Computationally efficient; handles high-dimensional data efficiently
- Often competitive with other learning algorithms
- Reasonably resistant to overfitting
Naïve Bayes disadvantages:
- Assumes variables are conditionally independent, and is therefore sensitive to double counting correlated variables
- In its simplest form, used only for categorical variables

25 Naïve Bayes Naïve Bayes in R
This section explores two ways of using the naïve Bayes classifier on the term-deposit example:
- Manually compute the probabilities from scratch: tedious, with many R calculations
- Use the naiveBayes function from the e1071 package: much easier (starts on page 222); a sketch follows below
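A hedged sketch of the e1071 route; the file name follows the book's bank example.

library(e1071)

banktrain <- read.table("bank-sample.csv", header = TRUE, sep = ",")

# fit the classifier: the remaining columns predict `subscribed`
nb_model <- naiveBayes(subscribed ~ ., data = banktrain)

predict(nb_model, banktrain)                      # class labels (the default)
head(predict(nb_model, banktrain, type = "raw"))  # posterior probabilities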

26 Chapter Sections Decision trees- Overview, general algorithm, decision tree algorithm, evaluating a decision tree. Naïve Bayes – Bayes‟ Algorithm, Naïve Bayes Classifier, smoothing, diagnostics. Diagnostics of classifiers, Additional classification methods.

27 Diagnostics of Classifiers
The book has covered three classifiers: logistic regression, decision trees, and naïve Bayes.
A basic tool for evaluating classifier performance is the confusion matrix, which tabulates predicted versus actual classes.

28 Diagnostics of Classifiers
Bank marketing example: a training set of 2000 records and a test set of 100 records, evaluated below (a confusion-matrix sketch follows).
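Building the confusion matrix in R, a minimal sketch continuing the bank example (nb_model as fitted above).

banktest <- read.table("bank-sample-test.csv", header = TRUE, sep = ",")
predicted <- predict(nb_model, banktest)

# rows = predicted class, columns = actual class
conf <- table(predicted = predicted, actual = banktest$subscribed)
conf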

29 Diagnostics of Classifiers
Evaluation metrics derived from the confusion matrix counts (TP, TN, FP, FN):
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (true positive rate) = TP / (TP + FN)
- False positive rate = FP / (FP + TN)
- False negative rate = FN / (TP + FN)
These are computed in R below.
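Computing the metrics from the confusion matrix conf built above, treating "yes" as the positive class.

TP <- conf["yes", "yes"]; FP <- conf["yes", "no"]
FN <- conf["no",  "yes"]; TN <- conf["no",  "no"]

c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
  precision = TP / (TP + FP),
  recall    = TP / (TP + FN),    # true positive rate
  fpr       = FP / (FP + TN))    # false positive rate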

30 Diagnostics of Classifiers
Evaluation metrics on the 100-record bank marketing test set (table omitted); two of the metrics are marked as poor.

31 Diagnostics of Classifiers
ROC curve: good for evaluating binary classifiers. Bank marketing example: 2000-record training set, 100-record test set.
library(e1071)   # naiveBayes
library(ROCR)    # prediction, performance
banktrain <- read.table("bank-sample.csv", header = TRUE, sep = ",")
drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
banktest <- read.table("bank-sample-test.csv", header = TRUE, sep = ",")
banktest <- banktest[, !(names(banktest) %in% drops)]
nb_model <- naiveBayes(subscribed ~ ., data = banktrain)
nb_prediction <- predict(nb_model, banktest[, -ncol(banktest)], type = "raw")
score <- nb_nb_prediction <- nb_prediction[, "yes"]
score <- nb_prediction[, "yes"]           # posterior probability of "yes"
actual_class <- banktest$subscribed == "yes"
pred <- prediction(score, actual_class)   # ROCR prediction object
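To draw the ROC curve shown on the next slide and compute its AUC, a hedged continuation using ROCR:

perf <- performance(pred, "tpr", "fpr")    # true vs. false positive rate
plot(perf, lwd = 2)                        # the ROC curve
abline(a = 0, b = 1, lty = 2)              # chance diagonal
unlist(performance(pred, "auc")@y.values)  # area under the curve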

32 Diagnostics of Classifiers
ROC curve for the bank marketing example (2000-record training set, 100-record test set).

33 Chapter Sections Decision trees – overview, general algorithm, decision tree algorithms, evaluating a decision tree. Naïve Bayes – Bayes' theorem, naïve Bayes classifier, smoothing, diagnostics. Diagnostics of classifiers. Additional classification methods.

34 Additional Classification Methods
Ensemble methods use multiple models:
- Bagging: a bootstrap method that uses repeated sampling with replacement
- Boosting: similar to bagging, but an iterative procedure
- Random forest: an ensemble of decision trees (see the sketch below)
These models usually perform better than a single decision tree.
Support Vector Machine (SVM): a linear model whose decision boundary is determined by a small number of support vectors.
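A hedged random forest sketch on the bank example, using the randomForest package (an assumption; the slide names the method, not a package); banktrain and banktest are assumed loaded as above.

library(randomForest)

banktrain$subscribed <- as.factor(banktrain$subscribed)
rf_model <- randomForest(subscribed ~ ., data = banktrain,
                         ntree = 500,       # number of bootstrapped trees
                         importance = TRUE) # track variable importance

rf_model                      # prints the out-of-bag error estimate
importance(rf_model)          # variable importance measures
predict(rf_model, banktest)   # class predictions on the test set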

35 Summary
How to choose a suitable classifier among decision trees, naïve Bayes, and logistic regression.

