1 Test-Cost Sensitive Naïve Bayes Classification
X. Chai, L. Deng, Q. Yang, Dept. of Computer Science, The Hong Kong University of Science and Technology
C. Ling, Dept. of Computer Science, The University of Western Ontario

2 Example – Medical Diagnosis
A patient arrives with one known attribute and several unknown ones: temperature = 39°C, blood pressure = ?, blood test = ?, cardiogram = ?, assay = ?
Is the patient healthy? Which test should be taken first? Which test should be performed next?
Concern: cost the patient as little as possible while maintaining a low misdiagnosis risk.

3 Test-Cost Sensitive Learning
Traditional inductive learning techniques (decision trees, naïve Bayes) have been highly successful, but they do not handle the different types of costs incurred during classification.
Misclassification costs (C_mc): the costs incurred by classification errors. These distinguish between different types of classification errors, but neglect the possibility of obtaining missing values in a test case by performing attribute tests.
Test costs (C_test): the costs incurred by obtaining missing values of attributes.
Goal: minimize the total cost C_total = C_mc + C_test

4 Some Related Work
MDP-based cost-sensitive learning (Zubek and Dietterich 2002): the problem is cast as a Markov decision process, and solutions are given in terms of optimal policies. Drawback: very high computational cost to conduct the search.
Decision trees with minimal cost (Ling et al. 2004): both misclassification and test costs are considered in tree building, with minimal total cost (instead of information gain) as the splitting criterion. Drawbacks: attributes not appearing on the testing branch are ignored, although they are still informative for classification; and the method is not suitable for batch tests due to its sequential nature.

5 Decision trees with minimal cost (Ling et al. 2004)
Attribute selection criterion: minimal total cost (C_total = C_mc + C_test) instead of minimal entropy as in C4.5.
If growing the tree yields a smaller total cost, choose the attribute with the minimal total cost; otherwise, stop and form a leaf.
The leaf label is also chosen to minimize the total cost. Suppose the leaf has P positive examples and N negative examples, and let FP denote the cost of a false positive and FN the cost of a false negative:
If P×FN ≥ N×FP THEN label = positive ELSE label = negative
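A minimal sketch of this labelling rule in Python (my own illustration under the cost definitions above, not the authors' code):

```python
def label_leaf(P, N, FP, FN):
    """Label a leaf holding P positive and N negative training examples.

    Labelling the leaf negative misclassifies the P positives (cost P*FN);
    labelling it positive misclassifies the N negatives (cost N*FP).
    The label with the smaller misclassification cost is chosen.
    """
    return "positive" if P * FN >= N * FP else "negative"
```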

6 A Tree Building Example
Consider a node with P positive and N negative examples (P:N). If it is kept as a leaf:
C_mc = min(P×FN, N×FP), C_test = 0, C_total = C_mc + C_test
Now consider splitting on attribute A with test cost C, where branch A = v_1 receives P_1:N_1 examples and branch A = v_2 receives P_2:N_2 examples:
C'_mc = min(P_1×FN, N_1×FP) + min(P_2×FN, N_2×FP)
C'_test = (P_1 + N_1 + P_2 + N_2) × C
C'_total = C'_mc + C'_test
If C'_total < C_total, splitting on A would reduce the total cost, so choose the attribute with the minimal total cost for splitting. If C'_total ≥ C_total for all remaining attributes, no further sub-tree is built and the node becomes a leaf.
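In code, the comparison on this slide looks roughly as follows (a sketch generalized to any number of branches; `branches` and the helper names are my own, not from the slides):

```python
def leaf_cost(P, N, FP, FN):
    # Misclassification cost of the cheaper leaf label (previous slide).
    return min(P * FN, N * FP)

def split_total_cost(branches, test_cost, FP, FN):
    """Total cost of splitting on an attribute.

    branches: list of (P_i, N_i) class counts, one pair per attribute value.
    Every example in the node pays the test cost once.
    """
    c_mc = sum(leaf_cost(P_i, N_i, FP, FN) for P_i, N_i in branches)
    n_examples = sum(P_i + N_i for P_i, N_i in branches)
    return c_mc + n_examples * test_cost

# Split on A only if it beats keeping the node as a leaf:
# split_total_cost([(P1, N1), (P2, N2)], C, FP, FN) < leaf_cost(P, N, FP, FN)
```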

7 Sequential Test Strategy
Optimal Sequential Test (OST): each test example goes down the tree until it meets an attribute whose value is unknown in the example. The test is then performed and the missing value is revealed. The process continues until the example falls into a leaf node, whose label is used as the prediction. The total cost is the sum of the misclassification cost and the test costs.
Problems with the OST strategy:
The algorithm chooses a locally optimal attribute without backtracking, so the OST strategy is not globally optimal.
Attributes not appearing on the testing branch are ignored, although they are still informative for classification.
It is not suitable for batch tests due to its sequential nature.

8 Problem Formulation
Given:
D – a training dataset of N samples {x_1, …, x_N} from P classes {c_1, …, c_P}, where each sample x_i is described by M attributes (A_1, …, A_M), among which there can be missing values.
C – a misclassification cost matrix. C_ij = C(i, j) specifies the cost of classifying a sample from class c_i as belonging to class c_j.
T – a test-cost vector. T_k = T(k) specifies the cost of testing attribute A_k (1 ≤ k ≤ M).
Build:
csNB – a cost-sensitive naïve Bayes classifier.
S – a test strategy for every new case, with the aim of minimizing the sum of the misclassification cost C_mc and the test cost C_test.

9 csNB classification
Two procedures: learning and prediction.
Learning a csNB classifier is the same as learning a traditional NB classifier: estimate the prior probabilities P(c_j) and the likelihoods P(A_m = v_{m,k} | c_j) from the training dataset D. Missing values are simply ignored in the likelihood computation.
Prediction uses either a sequential test strategy or a batch test strategy.
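A minimal sketch of this learning step (standard naïve Bayes counting; the data layout and `None` as the missing-value marker are my assumptions):

```python
from collections import Counter, defaultdict

def learn_csnb(X, y):
    """X: list of samples, each a list of attribute values (None = missing).
    y: list of class labels. Returns class priors and likelihood counts."""
    class_counts = Counter(y)
    priors = {c: class_counts[c] / len(y) for c in class_counts}
    # counts[(m, v, c)] = number of class-c samples with attribute m == v
    counts = defaultdict(int)
    seen = Counter()  # number of non-missing values of attribute m in class c
    for xi, c in zip(X, y):
        for m, v in enumerate(xi):
            if v is not None:            # missing values are simply ignored
                counts[(m, v, c)] += 1
                seen[(m, c)] += 1
    # Likelihood estimate: P(A_m = v | c) ~= counts[(m, v, c)] / seen[(m, c)]
    return priors, counts, seen
```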

10 Sequential Test Strategy vs. Batch Test Strategy
What is a sequential test strategy? Decisions are made sequentially on whether a further test on an unknown attribute should be performed and, if so, which attribute to select, based on the values of the attributes initially known or previously tested. It is a test strategy designed on the fly during classification.
What is a batch test strategy? The selection of tests on unknown attributes must be determined in advance, before any test is carried out. It is a test strategy designed beforehand.
Both aim to minimize the sum of misclassification and test costs.

11 Example: Diagnosis of Hepatitis
Assume:
– 21% of patients are positive (c_1, have hepatitis): P(c_1) = 0.21
– 79% of patients are negative (c_2, healthy): P(c_2) = 0.79
– Misclassification costs: C_12 = 450, C_21 = 150, C_11 = C_22 = 0
– Four attributes describe a patient, each with a test cost and per-class likelihoods.
Suppose a patient arrives with all attribute values unknown: (?,?,?,?).
Sequential test: starting from (?,?,?,?), test ascites first; depending on the outcome, (?,?,?,pos) or (?,?,?,neg), test spiders or spleen next, and so on.
Batch test: from (?,?,?,?), test {spleen, spiders, ascites} all at once, observe e.g. (?,neg,neg,pos), and then classify.
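As a sanity check on these numbers (my own arithmetic, not on the original slide): with no tests performed, predicting negative costs P(c_1) × C_12 = 0.21 × 450 = 94.5 in expectation, while predicting positive costs P(c_2) × C_21 = 0.79 × 150 = 118.5. The no-test baseline is therefore to predict negative at an expected cost of 94.5, and a test strategy pays off only if it brings the expected total cost below that.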

12 Prediction with Sequential Test Strategy
Suppose x is a test example. Let A_known denote the set of known attributes and A_unknown the set of unknown attributes. The utility of testing an unknown attribute A_i ∈ A_unknown is defined as:
Util(A_i) = Gain(A_known, A_i) − T_i
where:
T_i is the test cost of attribute A_i, given by the test-cost vector T.
Gain(A_known, A_i) is the reduction in the expected misclassification cost if we know A_i's true value.

13 Prediction with Sequential Test Strategy
Gain(A_known, A_i) is defined as:
Gain(A_known, A_i) = ExpC_mc(A_known) − Σ_v P(A_i = v | A_known) × ExpC_mc(A_known ∪ {A_i = v})
where:
ExpC_mc(A_known) = min_j Σ_i P(c_i | A_known) × C_ij is the expected C_mc based on the known attributes A_known.
The sum over v takes the expectation over all possible values of A_i.

14 Prediction with Sequential Test Strategy
Overall, an attribute is worth testing if testing it offers more gain than the cost it incurs. By calculating the utilities of all unknown attributes in A_unknown, we can decide whether a further test is needed and, if so, which attribute to test.
After attribute A_i is tested, its true value is revealed and it is moved from A_unknown to A_known. The same procedure continues until:
no unknown attribute is left (A_unknown = ∅), or
the utility of testing every unknown attribute is non-positive.
Finally, the example is predicted as the class with the minimal expected misclassification cost, and C_test is the total cost of the tests performed.
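Putting slides 12–14 together, a sketch of the sequential loop might look like this. The helpers `expected_mc_cost` (returning min_j Σ_i P(c_i | known) × C_ij under csNB) and `p_value` (returning P(A = v | known)) are hypothetical names standing in for the csNB machinery, not the authors' API:

```python
def sequential_test(known, unknown, values, T,
                    expected_mc_cost, p_value, do_test):
    """known: dict attribute -> observed value; unknown: set of attributes.
    values[a]: possible values of attribute a; T[a]: its test cost.
    do_test(a): performs the actual test and returns the revealed value."""
    c_test = 0.0
    while unknown:
        base = expected_mc_cost(known)

        def utility(a):
            # Expected misclassification cost after learning a's value.
            exp_after = sum(p_value(a, v, known) *
                            expected_mc_cost({**known, a: v})
                            for v in values[a])
            return (base - exp_after) - T[a]   # Gain(a) - T_a

        best = max(unknown, key=utility)
        if utility(best) <= 0:
            break                              # no test is worth its cost
        known[best] = do_test(best)            # reveal the true value
        c_test += T[best]
        unknown.remove(best)
    return known, c_test                       # classify from `known` next
```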

15 csNB-sequential-predict Algorithm (flowchart)
1. Compute the utility of testing every unknown attribute.
2. Is a further test worthwhile? If no, classify. If yes, select the unknown attribute with the highest utility, test it, and return to step 1.

16 Prediction with Batch Test Strategy
A natural extension of the sequential test algorithm of csNB: all attributes with non-negative utility are selected at once. The selected batch of attributes is A_batch, with test cost C_test = Σ_{A_i ∈ A_batch} T_i.
After A_batch is tested, the values of these attributes are revealed and the class label is then predicted.
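Under the same assumptions as the sequential sketch above, the batch variant evaluates every unknown attribute once, up front:

```python
def batch_test(known, unknown, values, T, expected_mc_cost, p_value):
    """Select all unknown attributes whose utility is non-negative."""
    base = expected_mc_cost(known)

    def utility(a):
        exp_after = sum(p_value(a, v, known) *
                        expected_mc_cost({**known, a: v})
                        for v in values[a])
        return (base - exp_after) - T[a]

    batch = [a for a in unknown if utility(a) >= 0]
    return batch, sum(T[a] for a in batch)     # attributes to test, C_test
```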

17 Experiments
Experiments were carried out on eight datasets from the UCI ML repository (Ecoli, Heart, Australia, Voting, Breast, …).
Four algorithms were implemented for comparison:
csNB – the test-cost sensitive naïve Bayes.
csDT – the cost-sensitive decision trees proposed by Ling et al. (2004).
LNB – lazy naïve Bayes, which predicts based only on the known attributes and requires no tests on any unknown attribute.
ENB – exacting naïve Bayes, which requires all the missing values to be obtained before prediction.
The performance of the algorithms is measured in terms of the total cost C_total = C_mc + C_test, where C_mc is obtained by comparing the predicted and true labels of the test examples.

18 Experimental Results – Sequential Test
Average total cost comparisons of LNB, ENB, csNB, and csDT on the datasets Ecoli, Breast, Heart, and Thyroid. [chart]

19 Experimental Results – Sequential Test
Average total cost comparisons on the datasets Australia, Cars, Voting, and Mushroom. [chart]

20 Experimental Results – Sequential Test
Comparison of LNB, csNB, and csDT with an increasing percentage of unknown attributes on the Mushroom dataset. [chart]

21 Experimental Results – Sequential Test
Compared with csDT, csNB is more effective at balancing the misclassification and test costs.
Comparison of csNB and csDT with varying test costs (missing rates set to 20% and 60%) on the Mushroom dataset. [chart]

22 Experimental Results – Batch Test
Overall, csNB incurs 29.6% less total cost than csDT.
csDT has difficulty deriving batch test strategies due to the sequential nature of its tree building; csNB has no such constraint, and all attributes can be evaluated at the same level.

23 Conclusion and future work
We proposed a test-cost sensitive naïve Bayes algorithm for designing classifiers that minimize the sum of the misclassification cost and the test costs.
In the csNB framework, attributes can be intelligently selected to design both sequential and batch test strategies.
In the future, we plan to develop more effective algorithms and to consider more complicated situations in which the test cost of an attribute may be conditional on other attributes. It would also be interesting to consider the cost of obtaining the missing values in the training data.

24 THANK YOU! Q & A