# CSci 8980: Data Mining (Fall 2002)


### Slide 1: CSci 8980: Data Mining (Fall 2002)

Vipin Kumar, Army High Performance Computing Research Center, Department of Computer Science, University of Minnesota. http://www.cs.umn.edu/~kumar

### Slide 2: Instance-Based Classifiers

- Store the training instances
- Use the stored training instances to predict the class label of unseen cases

### Slide 3: Nearest-Neighbor Classifiers

- For data sets with continuous attributes
- Requires three things:
  - The set of stored instances
  - A distance metric to compute the distance between instances
  - The value of k, the number of nearest neighbors to retrieve
- For classification:
  - Retrieve the k nearest neighbors
  - Use the class labels of the nearest neighbors to determine the class label of the unseen instance (e.g., by taking a majority vote), as in the sketch below
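A minimal Python sketch of this procedure. The function names, the data layout, and the choice of Euclidean distance are illustrative assumptions, not prescribed by the slides:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Distance metric: Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(stored, x, k):
    """Classify x by majority vote among its k nearest stored instances.

    stored: list of (attribute_vector, class_label) pairs
    """
    neighbors = sorted(stored, key=lambda inst: euclidean(inst[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example with two well-separated 2-D classes
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify(training, (1.1, 0.9), k=3))  # -> "A"
```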

### Slide 4: Definition of Nearest Neighbor

The k-nearest neighbors of an instance x are the k data points with the smallest distances to x.

### Slide 5: 1-Nearest Neighbor

The decision regions of a 1-nearest-neighbor classifier form a Voronoi diagram: each training instance owns the region of points closer to it than to any other instance.

### Slide 6: Nearest-Neighbor Classification

- Compute the distance between two points:
  - Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
- Determine the class from the nearest-neighbor list:
  - Take the majority vote of class labels among the k nearest neighbors
  - Weighted vote: weigh each vote according to distance, e.g., with weight factor $w = 1/d^2$, as in the sketch after this list
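A sketch of the distance-weighted vote, reusing the `euclidean` helper from the earlier sketch; the `eps` guard is an implementation assumption for the case where a test point coincides with a stored instance:

```python
from collections import defaultdict

def weighted_knn_classify(stored, x, k, eps=1e-12):
    """Distance-weighted k-NN: each neighbor votes with weight w = 1/d^2."""
    neighbors = sorted(stored, key=lambda inst: euclidean(inst[0], x))[:k]
    scores = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, x)
        scores[label] += 1.0 / (d * d + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)
```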

### Slide 7: Nearest-Neighbor Classification (continued)

- Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large:
    - it can be computationally expensive
    - the neighborhood may include points from other classes

### Slide 8: Nearest-Neighbor Classification (continued)

- Problems with the Euclidean measure:
  - High-dimensional data: curse of dimensionality
  - Can produce counter-intuitive results, e.g., the pairs of binary vectors
    111111111110 vs 011111111111 and 100000000000 vs 000000000001
    have the same Euclidean distance, even though the first pair agrees on ten 1-valued attributes and the second pair shares none
  - Solution: normalization

### Slide 9: Nearest-Neighbor Classification (continued)

- Other issues:
  - k-NN classifiers are lazy learners:
    - they do not build models explicitly
    - unlike eager learners such as decision tree induction and rule-based systems
    - classifying unseen instances is relatively expensive
  - Scaling of attributes matters, e.g., Person: (height in meters, weight in lbs, Class):
    - height may vary from 1.5 m to 1.85 m
    - weight may vary from 90 lbs to 250 lbs
    - the distance measure would be dominated by differences in weight, so attributes should be rescaled to comparable ranges, as in the sketch below
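A minimal min-max rescaling sketch; the slides do not prescribe a particular normalization, so mapping each attribute to [0, 1] is an illustrative assumption:

```python
def min_max_normalize(instances):
    """Rescale each attribute to [0, 1] so no attribute dominates the distance."""
    cols = list(zip(*instances))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(row, lo, hi))
        for row in instances
    ]

people = [(1.5, 90.0), (1.85, 250.0), (1.7, 160.0)]
print(min_max_normalize(people))
# height and weight now both lie in [0, 1], so neither dominates the distance
```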

### Slide 10: Bayes Classifier

- A probabilistic (Bayesian) framework for solving the classification problem
- Conditional probability:
  $$P(C \mid A) = \frac{P(A, C)}{P(A)}, \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$
- Bayes theorem:
  $$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$

### Slide 11: Example of Bayes Theorem

- Given:
  - A doctor knows that meningitis causes a stiff neck 50% of the time
  - The prior probability of any patient having meningitis is 1/50,000
  - The prior probability of any patient having a stiff neck is 1/20
- If a patient has a stiff neck, what is the probability that he/she has meningitis?
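The calculation, reconstructed from the numbers given on the slide (the original formula image is missing):

$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50{,}000}{1/20} = 0.0002$$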

### Slide 12: Bayesian Classifiers

- Consider each attribute and the class label as random variables
- Given a record with attributes $(A_1, A_2, \ldots, A_n)$:
  - The goal is to predict the class C
  - Specifically, we want to find the value of C that maximizes $P(C \mid A_1, A_2, \ldots, A_n)$
- Can we estimate $P(C \mid A_1, A_2, \ldots, A_n)$ directly from the data?

### Slide 13: Bayesian Classifiers

- Approach:
  - Compute the posterior probability $P(C \mid A_1, A_2, \ldots, A_n)$ for all values of C using Bayes theorem:
    $$P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\,P(C)}{P(A_1, A_2, \ldots, A_n)}$$
  - Choose the value of C that maximizes $P(C \mid A_1, A_2, \ldots, A_n)$
  - Since the denominator is the same for every value of C, this is equivalent to choosing the value of C that maximizes $P(A_1, A_2, \ldots, A_n \mid C)\,P(C)$
- How do we estimate $P(A_1, A_2, \ldots, A_n \mid C)$?

### Slide 14: Naïve Bayes Classifier

- Assume the attributes $A_i$ are independent given the class:
  - $P(A_1, A_2, \ldots, A_n \mid C_j) = P(A_1 \mid C_j)\,P(A_2 \mid C_j) \cdots P(A_n \mid C_j)$
  - We can estimate $P(A_i \mid C_j)$ for all $A_i$ and $C_j$ from the data
  - A new point is classified as $C_j$ if $P(C_j) \prod_i P(A_i \mid C_j)$ is maximal, as in the sketch below
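A minimal sketch of this decision rule for discrete attributes, using simple counting estimates. The function names and the tiny data set (echoing the Refund/Marital Status attributes mentioned on later slides) are illustrative assumptions:

```python
from collections import Counter

def train_naive_bayes(records):
    """records: list of (attribute_tuple, class_label) pairs; returns count tables."""
    class_counts = Counter(label for _, label in records)
    cond_counts = Counter()  # keyed by (attribute_index, value, class_label)
    for attrs, label in records:
        for i, v in enumerate(attrs):
            cond_counts[(i, v, label)] += 1
    return class_counts, cond_counts, len(records)

def nb_classify(attrs, class_counts, cond_counts, n):
    """Return the class C maximizing P(C) * prod_i P(A_i | C)."""
    best_label, best_score = None, -1.0
    for label, nc in class_counts.items():
        score = nc / n  # prior P(C)
        for i, v in enumerate(attrs):
            score *= cond_counts[(i, v, label)] / nc  # P(A_i | C); zero if unseen
        if score > best_score:
            best_label, best_score = label, score
    return best_label

data = [(("No", "Married"), "No"), (("Yes", "Single"), "No"),
        (("No", "Single"), "Yes"), (("No", "Divorced"), "Yes")]
cc, cond, n = train_naive_bayes(data)
print(nb_classify(("No", "Married"), cc, cond, n))  # -> "No"
```

Note that an unseen attribute/class combination yields a zero factor; slide 19 addresses exactly this problem.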

### Slide 15: How to Estimate Probabilities from Data?

- Class priors: $P(C_k) = N_{C_k}/N$
  - e.g., P(No) = 7/10
- For discrete attributes: $P(A_i \mid C_k) = |A_{ik}| / N_{C_k}$
  - where $|A_{ik}|$ is the number of instances that have attribute value $A_i$ and belong to class $C_k$
  - Examples (from the slide's training set, not reproduced here): P(Married|No) = 4/7, P(Refund=Yes|Yes) = 0

### Slide 16: How to Estimate Probabilities from Data?

- For continuous attributes:
  - Discretize the range into bins:
    - one ordinal attribute per bin
    - violates the independence assumption
  - Two-way split: (A < v) or (A > v)
    - choose only one of the two splits as the new attribute
  - Assume the attribute obeys a certain probability distribution:
    - typically, a normal distribution is assumed
    - use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    - once the probability distribution is known, it can be used to estimate the conditional probability $P(A_i \mid c)$

### Slide 17: How to Estimate Probabilities from Data?

- Normal distribution:
  $$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\!\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
  - one pair of parameters $(\mu_{ij}, \sigma_{ij}^2)$ for each $(A_i, c_j)$ pair
- For (Income, Class=No):
  - sample mean = 110
  - sample variance = 2975
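Plugging the slide's estimates into the Gaussian density reproduces the value 0.0072 used on the next slide; a quick check, assuming the test value Income = 120:

```python
import math

mean, var = 110.0, 2975.0
x = 120.0  # Income = 120K, as in the next slide's test instance
density = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
print(round(density, 4))  # -> 0.0072
```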

### Slide 18: Example of Naïve Bayes Classifier

Given a test instance X = (Refund=No, Married, Income=120K):

- P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No) = 4/7 × 4/7 × 0.0072 = 0.0024
- P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes) = 1 × 0 × 1.2 × 10⁻⁹ = 0
- Since P(X|No)P(No) > P(X|Yes)P(Yes), we have P(No|X) > P(Yes|X), so the predicted class is No

### Slide 19: Naïve Bayes Classifier

- If one of the conditional probabilities is zero, the entire expression becomes zero
  - because independence turns the joint probability into a product of the individual probabilities
- Remedy: the Laplace correction (also known as the m-estimate), reconstructed below
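The slide's formula images are missing; these are the standard formulations, where $N_{ic}$ is the count of class-$c$ instances with attribute value $i$, $N_c$ is the count of class-$c$ instances, $v$ is the number of distinct values of the attribute, $p$ is a prior estimate of the probability, and $m$ is the weight given to that prior:

$$\text{Original: } P(A_i \mid C) = \frac{N_{ic}}{N_c}, \qquad \text{Laplace: } P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + v}, \qquad \text{m-estimate: } P(A_i \mid C) = \frac{N_{ic} + m\,p}{N_c + m}$$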

### Slide 20: Example of Naïve Bayes Classifier

- A: attributes, M: mammals, N: non-mammals
- Since P(A|M)P(M) > P(A|N)P(N), the instance is classified as a mammal

### Slide 21: Naïve Bayes (Summary)

- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability-estimate calculations
- Robust to irrelevant attributes
- The independence assumption may not hold for some attributes
  - use other techniques such as Bayesian Belief Networks (BBN)

### Slide 22: Model Evaluation

- Evaluating the performance of a model (in terms of its predictive ability)
- Confusion matrix:

|                  | Predicted Class=Yes | Predicted Class=No |
|------------------|---------------------|--------------------|
| Actual Class=Yes | a                   | b                  |
| Actual Class=No  | c                   | d                  |

- a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

### Slide 23: Model Evaluation

- What measures can we use to estimate performance? In terms of the confusion matrix:

|                  | Predicted Class=Yes | Predicted Class=No |
|------------------|---------------------|--------------------|
| Actual Class=Yes | a                   | b                  |
| Actual Class=No  | c                   | d                  |
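The slide's formula image is missing; the most widely used such measure is accuracy, with the standard definition:

$$\text{Accuracy} = \frac{a + d}{a + b + c + d}$$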

### Slide 24: Cost Matrix

|                  | Predicted Class=Yes | Predicted Class=No |
|------------------|---------------------|--------------------|
| Actual Class=Yes | C(Yes\|Yes)         | C(No\|Yes)         |
| Actual Class=No  | C(Yes\|No)          | C(No\|No)          |

C(i|j): the cost of misclassifying a class-j example as class i

### Slide 25: Cost-Sensitive Measures

- Precision is biased towards C(Yes|Yes) and C(Yes|No)
- Recall is biased towards C(Yes|Yes) and C(No|Yes)
- The F-measure is biased towards all cells except C(No|No); see the formulas below
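In terms of the confusion-matrix counts from slide 22 (the slide's formula images are missing; these are the standard definitions):

$$p = \frac{a}{a + c}, \qquad r = \frac{a}{a + b}, \qquad F = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$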

### Slide 26: ROC (Receiver Operating Characteristic)

- Developed in the 1950s in signal detection theory to analyze noisy signals
  - characterizes the trade-off between positive hits and false alarms
- An ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis)
- The performance of each classifier is represented as a point on the ROC curve
  - changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point

### Slide 27: ROC Curve

- A 1-dimensional data set containing two classes (positive and negative)
- Any point located at x > t is classified as positive
- At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88

### Slide 28: ROC Curve

Interpreting points (TP, FP):

- (0, 0): everything is declared negative
- (1, 1): everything is declared positive
- (1, 0): the ideal classifier
- Diagonal line:
  - random guessing
  - below the diagonal, the prediction is the opposite of the true class
- Area under the ROC curve (AUC):
  - another metric for comparing classifiers
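A minimal sketch of how such a curve can be traced by sweeping the decision threshold over classifier scores; the data representation is an illustrative assumption, not from the slides:

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs as the threshold sweeps over the scores.

    scores: classifier scores (higher means more confidently positive)
    labels: true labels, 1 for positive, 0 for negative
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above all scores: everything negative
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
# -> [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```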
