1 Classification II

2 Numeric Attributes
Numeric attributes can take many values
–Creating a branch for each value is not ideal
The value range is usually split into two parts
The splitting position is determined using the idea of information gain
Consider the following sorted temperature values and their class labels
64   65   68   69   70   71   72   72   75   75   80   81   83   85
yes  no   yes  yes  yes  no   no   yes  no   yes  yes  no   yes  yes
–We could create a split at any of the 11 positions between the distinct values
–For each split compute the information gain
–Select the split that gives the highest information gain (sketched in code below)
–E.g. temp < 71.5 produces 4 yes and 2 no, and
–temp > 71.5 produces 5 yes and 3 no
–Entropy([4,2],[5,3]) = 6/14*Entropy(4/6,2/6) + 8/14*Entropy(5/8,3/8) = 0.939 bits
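
A minimal Python sketch of this split search (illustrative, not the lecture's own code): every candidate threshold is scored by the weighted entropy of the two sides, and the threshold with the lowest weighted entropy is the one with the highest information gain.

```python
from math import log2

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
          "yes", "no", "yes", "yes", "no", "yes", "yes"]

def entropy(classes):
    """Entropy, in bits, of a list of class labels."""
    counts = [classes.count(c) for c in set(classes)]
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def split_entropy(threshold):
    """Weighted entropy of the two subsets produced by temp < threshold."""
    left  = [l for t, l in zip(temps, labels) if t < threshold]
    right = [l for t, l in zip(temps, labels) if t >= threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Candidate thresholds: midpoints between consecutive distinct values (11 of them)
distinct = sorted(set(temps))
for threshold in [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]:
    print(threshold, round(split_entropy(threshold), 3))   # 71.5 -> 0.939 bits
```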

3 Missing Values
An easy solution is to treat a missing value as a new possible value of the attribute
–E.g. if Outlook has a missing value, we have rainy, overcast, sunny and missing as its possible values
–This makes sense if the fact that the value is missing is itself significant (e.g. missing medical test results)
–Also, the learning method needs no modification
A more complex solution is to let the missing value receive a proportion of each of the known values of the attribute
–The proportions are estimated from the proportions of instances with a known value at the node
–E.g. if Outlook is missing in one instance, and 4 instances have rainy, 2 have overcast and 3 have sunny, then the missing value becomes {4/9 rainy, 2/9 overcast, 3/9 sunny}
–All the computations (such as information gain) are then performed with these weighted values (see the sketch after this slide)
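
One way the fractional-weight idea could be coded, assuming instances are dicts carrying a numeric weight (1.0 when fully observed); split_with_missing and the field names are illustrative only, not the lecture's API.

```python
from collections import Counter

def split_with_missing(instances, attribute):
    """Send instances with a known value down the matching branch; an instance
    with a missing value is sent down every branch with a fractional weight
    proportional to the known-value counts at this node."""
    counts = Counter(x[attribute] for x in instances if x[attribute] is not None)
    total = sum(counts.values())
    branches = {value: [] for value in counts}
    for x in instances:
        weight = x.get("weight", 1.0)
        if x[attribute] is not None:
            branches[x[attribute]].append((x, weight))
        else:
            for value, count in counts.items():
                branches[value].append((x, weight * count / total))
    return branches

# Outlook missing in one instance; 4 rainy, 2 overcast, 3 sunny known
data = ([{"Outlook": "rainy"}] * 4 + [{"Outlook": "overcast"}] * 2 +
        [{"Outlook": "sunny"}] * 3 + [{"Outlook": None}])
parts = split_with_missing(data, "Outlook")
print({v: round(sum(w for _, w in xs), 2) for v, xs in parts.items()})
# -> {'rainy': 4.44, 'overcast': 2.22, 'sunny': 3.33}
```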

4 Overfitting
When decision trees are grown until each leaf node is pure, they may learn unnecessary details from the training data
–This is called overfitting
–The unnecessary details may be noise, or the training data may not be representative
Overfitting makes the classifier perform poorly on independent test data
Two solutions
–Stop growing the tree early, before it overfits the training data – in practice it is hard to estimate when to stop
–Prune the overfitted parts of the decision tree by evaluating the utility of removing nodes, using part of the training data as validation data (a sketch follows this slide)
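
A rough sketch of the second solution (reduced-error pruning) under an assumed tree representation – internal nodes are dicts holding an attribute, its branches and a majority class label, leaves are plain labels – with instances as dicts carrying a "class" key. This is illustrative, not the lecture's algorithm.

```python
def predict(tree, instance):
    """Follow branches until a leaf label is reached; unseen values fall back
    to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(instance.get(tree["attr"]), tree["majority"])
    return tree

def accuracy(tree, instances):
    return sum(predict(tree, x) == x["class"] for x in instances) / len(instances)

def prune(tree, val):
    """Bottom-up reduced-error pruning: val holds the validation instances that
    reach this node; a subtree is collapsed to its majority label when the
    collapsed version does at least as well on those instances."""
    if not isinstance(tree, dict) or not val:
        return tree
    for value, subtree in tree["branches"].items():
        routed = [x for x in val if x.get(tree["attr"]) == value]
        tree["branches"][value] = prune(subtree, routed)
    keep = accuracy(tree, val)
    collapse = sum(x["class"] == tree["majority"] for x in val) / len(val)
    return tree["majority"] if collapse >= keep else tree
```

Calling prune(trained_tree, validation_instances) on a tree grown from the rest of the training data returns the pruned tree.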

5 Naïve Bayes Classifier
A simple classifier using observed probabilities
Assumes that all the attributes contribute towards the classification
–With equal importance
–Independently of one another
For some data sets this classifier achieves better results than decision trees
Makes use of Bayes' rule of conditional probability
Bayes' rule: if H is a hypothesis and E is its evidence, then P(H|E) = (P(E|H)P(H))/P(E)
P(H|E) – conditional probability – the probability of H given E

6 Example Data (Play is the class attribute)
Outlook    Temperature  Humidity  Windy  Play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no

7 Weather Data Counts and Probabilities
Counts of yes/no instances and the corresponding probabilities for each attribute value, given Play = yes (9 instances) and Play = no (5 instances):

Outlook      sunny     2 / 3    2/9   3/5
             overcast  4 / 0    4/9   0/5
             rainy     3 / 2    3/9   2/5
Temperature  hot       2 / 2    2/9   2/5
             mild      4 / 2    4/9   2/5
             cool      3 / 1    3/9   1/5
Humidity     high      3 / 4    3/9   4/5
             normal    6 / 1    6/9   1/5
Windy        false     6 / 2    6/9   2/5
             true      3 / 3    3/9   3/5
Play         yes 9 / no 5       9/14  5/14
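
The counts and fractions above can be reproduced with a few lines of Python; the data list mirrors the example-data slide and the variable names are illustrative.

```python
from collections import Counter, defaultdict

data = [
    ("sunny", "hot", "high", "false", "no"),       ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),   ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),   ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),   ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

class_counts = Counter(row[-1] for row in data)        # Counter({'yes': 9, 'no': 5})
cond = defaultdict(Counter)                            # counts per (attribute, class)
for row in data:
    for attribute, value in zip(attributes, row[:-1]):
        cond[(attribute, row[-1])][value] += 1

# P(Outlook = sunny | yes) = 2/9 and P(Outlook = sunny | no) = 3/5
print(cond[("Outlook", "yes")]["sunny"] / class_counts["yes"])   # 0.222...
print(cond[("Outlook", "no")]["sunny"] / class_counts["no"])     # 0.6
```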

8 Naïve Bayes Example
Given the training weather data, the algorithm computes the probabilities shown on the previous slide
For a test instance E = {sunny, cool, high, true} it uses Bayes' rule to compute the probabilities of Play = yes and Play = no
First let H be Play = yes; then from Bayes' rule
–P(yes|E) = (P(sunny|yes)*P(cool|yes)*P(high|yes)*P(true|yes)*P(yes))/P(E)
–All the probabilities on the right-hand side except P(E) are known from the previous slide
–P(yes|E) = (2/9*3/9*3/9*3/9*9/14)/P(E) = 0.0053/P(E)
–Similarly, P(no|E) = (P(sunny|no)*P(cool|no)*P(high|no)*P(true|no)*P(no))/P(E)
–P(no|E) = (3/5*1/5*4/5*3/5*5/14)/P(E) = 0.0206/P(E)
–Because P(yes|E) + P(no|E) = 1, 0.0053/P(E) + 0.0206/P(E) = 1
–So P(E) = 0.0053 + 0.0206
–P(yes|E) = 0.0053/(0.0053+0.0206) = 20.5%
–P(no|E) = 0.0206/(0.0053+0.0206) = 79.5%
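
A quick check of the arithmetic; the fractions are read straight off the previous slide's table.

```python
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # numerator of P(yes|E)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # numerator of P(no|E)
evidence = p_yes + p_no                          # P(E), since the two posteriors sum to 1

print(round(p_yes, 4), round(p_no, 4))                         # 0.0053 0.0206
print(round(p_yes / evidence, 3), round(p_no / evidence, 3))   # 0.205 0.795
```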

9 K-nearest neighbour
No model is built explicitly – instances are stored verbatim
For an unknown test instance, the k nearest training instances are found using a distance function and their most common class is assigned
For numeric attributes, Euclidean distance is a natural distance function
Consider two instances
–I1 with attribute values a1, a2, …, an and
–I2 with attribute values a1', a2', …, an'
–The Euclidean distance between the two instances is sqrt((a1-a1')^2 + (a2-a2')^2 + … + (an-an')^2)
Because different attributes have different scales, attribute values need to be normalized to lie between 0 and 1 (sketched in code below)
–ai = (vi – min(vi)) / (max(vi) – min(vi))
Finding nearest neighbours is computationally expensive, because the rudimentary approach computes the distance to every stored instance
–Speed-up techniques exist but are not studied here
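
A minimal sketch of the whole procedure – min-max normalization, Euclidean distance, and a majority vote over the k closest training instances. The function names are illustrative; in practice a test instance would be rescaled with the training minima and maxima.

```python
from collections import Counter
from math import sqrt

def normalize(rows):
    """Rescale each numeric attribute to [0, 1] with min-max normalization."""
    columns = list(zip(*rows))
    lows, highs = [min(c) for c in columns], [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_rows, train_labels, test_row, k=3):
    """Majority class among the k training instances closest to test_row."""
    nearest = sorted(zip((euclidean(test_row, r) for r in train_rows), train_labels))
    return Counter(label for _, label in nearest[:k]).most_common(1)[0][0]
```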

10 Classifier's Performance Metrics
The error rate of a classifier measures its overall performance
–Error rate = proportion of errors = number of misclassifications / total number of test instances
The error rate does not discriminate between different types of errors
A binary classifier (yes and no) makes two kinds of errors
–Calling an instance of 'no' an instance of 'yes' – false positives
–Calling an instance of 'yes' an instance of 'no' – false negatives
In practice, false positives and false negatives have different associated costs
–The cost of lending to a defaulter is larger than the lost-business cost of refusing a loan to a non-defaulter
–The cost of failing to detect a fire is larger than the cost of a false alarm

11 Confusion Matrix
The four possible outcomes of a binary classifier are usually shown in a confusion matrix
–TP – True Positives
–TN – True Negatives
–FP – False Positives
–FN – False Negatives
–P – total actual positives (TP+FN)
–N – total actual negatives (FP+TN)
A number of performance metrics are defined using these counts

                       Actual class
                       Yes     No
Predicted class  Yes   TP      FP     P' = TP+FP
                 No    FN      TN     N' = FN+TN
                       P       N

12 Performance Metrics Derived from the Confusion Matrix
True positive rate, TPR = TP/P = TP/(TP+FN)
–Also known as sensitivity and recall
False positive rate, FPR = FP/N = FP/(FP+TN)
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Error rate = 1 – Accuracy
Specificity = 1 – FPR
Positive predictive value = TP/P' = TP/(TP+FP)
–Also known as precision
Negative predictive value = TN/N' = TN/(TN+FN)
A short code sketch computing these metrics follows
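
These definitions translate directly into code; the counts passed in below are made up purely for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the metrics on this slide from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn                     # actual positives / negatives
    return {
        "TPR (sensitivity, recall)": tp / p,
        "FPR": fp / n,
        "specificity": 1 - fp / n,
        "accuracy": (tp + tn) / (p + n),
        "error rate": 1 - (tp + tn) / (p + n),
        "precision (PPV)": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))   # illustrative counts only
```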

13 ROC – Receiver Operating Characteristic
The metrics derived from the confusion matrix are useful for comparing classifiers
In particular, a plot of TPR on the y-axis against FPR on the x-axis is known as an ROC plot
[The slide shows an ROC plot with five classifiers A, B, C, D and E at different (FPR, TPR) points]
A is the ideal classifier because it has TPR = 1.0 and FPR = 0
E lies on the diagonal, which corresponds to random guessing
C performs worse than a random guess
–But the inverse of C, which is B, is better than D
Classifiers should aim to be in the north-west of the plot (high TPR, low FPR)

14 Testing Options 1
Testing the classifier on the training data is not useful
–Performance figures from such testing will be optimistic
–Because the classifier was trained on that very data
Ideally, a new data set called a 'test set' should be used for testing
–If the test set is large, the performance figures will be more realistic
–Creating a test set needs experts' time, so creating large test sets is expensive
–After testing, the test set is combined with the training data to produce a new classifier
–Sometimes a third data set called 'validation data' is used for fine-tuning a classifier or for selecting one classifier among many
In practice several strategies are used to make up for the lack of test data
–Holdout procedure – a certain proportion of the training data is held back as test data and the remainder is used for training
–Cross-validation
–Leave-one-out cross-validation

15 Testing Options 2
Cross-validation
–Partition the data into a fixed number of folds
–Use each fold in turn for testing while training on the remaining folds
–Every instance is used for testing exactly once
–10-fold cross-validation is standard, and repeating it 10 times is common (a sketch follows this slide)
Leave-one-out
–Is n-fold cross-validation, where n is the number of instances
–One instance is held out for testing while the remaining instances are used for training
–The results of the single-instance tests are averaged to obtain the final result
–Maximum utilization of the data for training
–No sampling of the test data – each instance is systematically used for testing
–High cost, because the classifier is trained n times
–Hard to ensure the training data is representative
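
A plain-Python sketch of k-fold cross-validation; train_and_evaluate is a hypothetical stand-in for whichever learner is being assessed and is assumed to return an accuracy on the test fold.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the n instance indices and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, train_and_evaluate, k=10):
    """Each fold is used for testing exactly once; the k scores are averaged.
    Leave-one-out is the special case k = len(data)."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_indices in enumerate(folds):
        test = [data[j] for j in test_indices]
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_evaluate(train, test))
    return sum(scores) / k
```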

