Supervised Learning (CS583, Bing Liu, UIC)

What is learning? "Learning denotes changes in a system that... enable a system to do the same task more efficiently the next time." – Herbert Simon. "Learning is constructing or modifying representations of what is being experienced." – Ryszard Michalski. "Learning is making useful changes in our minds." – Marvin Minsky.

Road Map: basic concepts; K-nearest neighbor; decision tree induction; evaluation of classifiers; rule induction; classification using association rules; naïve Bayesian classification; naïve Bayes for text classification; support vector machines; summary.

An example application: An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. A decision is needed: whether to put a new patient in an intensive-care unit. Due to the high cost of the ICU, patients who may survive less than a month are given higher priority. Problem: to predict high-risk patients and discriminate them from low-risk patients.

Another application: A credit card company receives thousands of applications for new cards. Each application contains information about the applicant: age, marital status, annual salary, outstanding debts, credit rating, etc. Problem: to decide whether an application should be approved, i.e., to classify applications into two categories, approved and not approved.

Machine learning and our focus: Like humans, systems learn from past experiences; but a computer does not have "experiences". A computer system learns from data, which represent some "past experiences" of an application domain. Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, high-risk or low-risk. The task is commonly called supervised learning, classification, or inductive learning.

The data and the goal. Data: a set of data records (also called examples, instances, or cases) described by – k attributes: A1, A2, …, Ak, and – a class: each example is labelled with a pre-defined class. Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

Classification: The classification problem is to identify to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known. The individual observations are analyzed into a set of quantifiable properties, known as explanatory variables or features. Two settings: binary classification and multiclass classification.

An example: data (loan application). Each record is labelled approved or not.

An example: the learning task. Learn a classification model from the data, then use the model to classify future loan applications into – Yes (approved) and – No (not approved). What is the class for the following case/instance?

Supervised vs. unsupervised learning. Supervised learning: classification is seen as supervised learning from examples. – Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes; it is as if a "teacher" gives the classes (supervision). – Test data are classified into these classes too. Unsupervised learning (clustering): – Class labels of the data are unknown. – Given a set of data, the task is to establish the existence of classes or clusters in the data.

Supervised learning process: two steps. Learning (training): learn a model using the training data. Testing: test the model using unseen test data to assess the model's accuracy.

What do we mean by learning? Given – a data set D, – a task T, and – a performance measure M, a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M. In other words, the learned model helps the system perform T better than with no learning.

Supervised concept learning: Given a training set of positive and negative examples of a concept, construct a description that will accurately classify whether future examples are positive or negative. That is, learn some good estimate of a function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi is either + (positive) or − (negative), or a probability distribution over +/−.

An example. Data: loan application data. Task: predict whether a loan should be approved or not. Performance measure: accuracy. No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%. We can do better than 60% with learning.

Another example: gene-expression microarray cancer classification. – Each gene is an individual feature. – Each sample belongs to one of the classes.

Fundamental assumption of learning. Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples). In practice, this assumption is often violated to a certain degree, and strong violations will clearly result in poor classification accuracy. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

Evaluating classification methods: Predictive accuracy. Efficiency – time to construct the model, time to use the model. Robustness: handling noise and missing values. Scalability: efficiency when the data is disk-resident. Interpretability: the understandability of, and insight provided by, the model. Compactness of the model: size of the tree, or the number of rules.

Evaluation methods. Holdout set: the available data set D is divided into two disjoint subsets, – the training set D_train (for learning a model) and – the test set D_test (for testing the model). Important: the training set should not be used in testing and the test set should not be used in learning. – An unseen test set provides an unbiased estimate of accuracy. The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.) This method is mainly used when the data set D is large.
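A minimal sketch of the holdout method in Python. The slides do not prescribe any library; scikit-learn's train_test_split, KNeighborsClassifier and its bundled iris data are used here purely as stand-ins for D, the learner and the model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # stand-in for the labelled data set D

# Divide D into two disjoint subsets: D_train for learning, D_test for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)   # learn on D_train only
print("holdout accuracy:", model.score(X_test, y_test))             # estimate on unseen D_test
```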

Evaluation methods (cont…). n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets. Use each subset in turn as the test set and combine the remaining n−1 subsets as the training set to learn a classifier. The procedure is run n times, which gives n accuracies. The final estimated accuracy of learning is the average of the n accuracies. 10-fold and 5-fold cross-validation are commonly used. This method is used when the available data is not large.
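To make the procedure concrete, here is a minimal plain-Python sketch of n-fold cross-validation. The function name and the train_fn interface are hypothetical, not from the slides.

```python
import random

def n_fold_cv_accuracy(data, labels, train_fn, n=10, seed=0):
    """Partition the data into n disjoint, roughly equal-size folds; use each fold
    once as the test set and the remaining n-1 folds as the training set;
    return the average of the n accuracies."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n] for i in range(n)]            # n disjoint subsets
    accuracies = []
    for i in range(n):
        test_idx = set(folds[i])
        train_X = [data[j] for j in indices if j not in test_idx]
        train_y = [labels[j] for j in indices if j not in test_idx]
        predict = train_fn(train_X, train_y)              # returns a function: example -> label
        correct = sum(predict(data[j]) == labels[j] for j in folds[i])
        accuracies.append(correct / len(folds[i]))
    return sum(accuracies) / n

# Leave-one-out cross-validation (next slide) is the special case n = len(data).
```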

Evaluation methods (cont…). Leave-one-out cross-validation: this method is used when the data set is very small. It is a special case of cross-validation in which each fold has only a single test example and all the rest of the data is used for training. If the original data has m examples, this is m-fold cross-validation.

Evaluation methods (cont…). Validation set: the available data is divided into three subsets, – a training set, – a validation set, and – a test set. A validation set is frequently used for estimating parameters of learning algorithms: the parameter values that give the best accuracy on the validation set are used as the final parameter values. Cross-validation can be used for parameter estimation as well.
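A sketch of parameter tuning with a separate validation set. Again, scikit-learn's KNeighborsClassifier and the iris data are stand-ins (the slides name no library or parameter), and the candidate values of k are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Three disjoint subsets: training, validation (for choosing parameters), test (final estimate).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7, 9):                                  # candidate parameter values
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc                          # keep the value best on validation

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, " test accuracy:", final.score(X_test, y_test))
```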

Classification measures. Accuracy is only one measure (error = 1 − accuracy), and it is not suitable in some applications. In text mining, we may only be interested in documents of a particular topic, which are only a small portion of a big document collection. In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class. – High accuracy does not mean any intrusion is detected. – E.g., with 1% intrusions, a classifier that does nothing achieves 99% accuracy. The class of interest is commonly called the positive class, and the rest the negative class.

Precision and recall measures. Used in information retrieval and text classification. We use a confusion matrix to introduce them.
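The slide's confusion matrix figure is not reproduced here; for reference, the standard 2 × 2 layout it refers to counts the four outcomes as:

                      Classified positive      Classified negative
Actually positive     TP (true positives)      FN (false negatives)
Actually negative     FP (false positives)     TN (true negatives)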

Precision and recall measures (cont…). Precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive. Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
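In code form, the two definitions read as follows (a minimal sketch; the function names are mine, not from the slides):

```python
def precision(tp, fp):
    # correctly classified positives / everything that was classified as positive
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # correctly classified positives / all actual positives in the test set
    return tp / (tp + fn) if (tp + fn) else 0.0
```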

Sensitivity and specificity. Count up the total number of each outcome (TP, FP, TN, FN) over a large dataset. Sensitivity = TP / (TP + FN): the likelihood of spotting a positive case when presented with one (e.g., the proportion of true edges found in an edge-detection task). Specificity = TN / (TN + FP): the likelihood of spotting a negative case when presented with one (the proportion of non-edges that we correctly identify).

An example. A confusion matrix in which only one positive example is classified correctly and no negative examples are classified wrongly gives – precision p = 100% and – recall r = 1%. Note: precision and recall only measure classification on the positive class.
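For instance, with 100 actual positives in the test set (a hypothetical count consistent with the slide's figures), TP = 1, FP = 0 and FN = 99, giving p = 1 / (1 + 0) = 100% and r = 1 / (1 + 99) = 1%.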

F1-value (also called F1-score). It is hard to compare two classifiers using two measures, so the F1-score combines precision and recall into one measure. The harmonic mean of two numbers tends to be closer to the smaller of the two, so for the F1-value to be large, both p and r must be large.
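For reference, the standard definition is F1 = 2pr / (p + r), the harmonic mean of p and r. With the previous example's p = 1.0 and r = 0.01, F1 = 2 × 1.0 × 0.01 / 1.01 ≈ 0.02, which reflects the poor recall far better than the 100% precision alone would.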

Receiver Operating Characteristic. A receiver operating characteristic (ROC) curve is a plot that illustrates the performance of a binary classifier system as its decision threshold is varied. It is created by plotting the fraction of true positives out of the actual positives (TPR = true positive rate) against the fraction of false positives out of the actual negatives (FPR = false positive rate) at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity (the true negative rate).
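A minimal sketch of how the points on an ROC curve are obtained (plain Python; the function name, labels and score values below are made up for illustration):

```python
def roc_points(y_true, scores, thresholds):
    """Return (FPR, TPR) pairs for a binary classifier's scores at the given thresholds.
    y_true: 1 for positive, 0 for negative; scores: higher means 'more positive'."""
    points = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
        fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
        fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
        tn = sum(p == 0 and y == 0 for p, y in zip(pred, y_true))
        tpr = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
        fpr = fp / (fp + tn) if (fp + tn) else 0.0   # 1 - specificity
        points.append((fpr, tpr))
    return points

# Example with made-up scores for three positives and three negatives:
y = [1, 1, 1, 0, 0, 0]
s = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
print(roc_points(y, s, thresholds=[0.2, 0.5, 0.8]))
```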

ROC Curve

Why the ROC curve? It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity). The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test; the closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. Sometimes we do not have a clear idea of which threshold to choose for classification, so we can use the ROC curve to measure the performance of the classifier across all thresholds.

A simple beginning: KNN (k nearest neighbors). Features of the method: all instances correspond to points in an n-dimensional Euclidean space; classification is delayed until a new instance arrives; classification is done by comparing the feature vectors of the different points; the target function may be discrete or real-valued.

KNN (k nearest neighbors), k = 1: the target sample is classified as positive using 1-NN. (Figure: a target sample plotted among positive and negative examples.)

KNN (k nearest neighbors), k = 3: the target sample is classified as negative using 3-NN. (Figure: the target sample plotted among positive and negative examples, now considering the three nearest neighbors.)

KNN (k nearest neighbors): find the k nearest neighbors of the test example and infer its class from their known classes, e.g. k = 3. (Figure: a scatter plot over features x1 and x2 with query points marked '?'.)
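A compact sketch of the procedure just described, in plain Python. Euclidean distance and majority voting follow the earlier slides; the function names and the (feature_vector, label) training format are my own conventions, not a prescribed implementation.

```python
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3, dist=euclidean):
    """train: list of (feature_vector, label) pairs; query: one feature vector.
    Returns the majority label among the k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up 2-D points:
train = [((1, 1), "Positive"), ((1, 2), "Positive"), ((5, 5), "Negative"), ((6, 5), "Negative")]
print(knn_classify(train, (2, 1), k=3))   # two of the three nearest are Positive
```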

Questions. What distance measure to use? – Often Euclidean distance is used. – Hamming distance. – Locally adaptive metrics. – More complicated with non-numeric data, or when different dimensions have different scales (see the sketch below). Choice of k? – Cross-validation. – 1-NN often performs well in practice. – k-NN is needed for overlapping classes. – Re-label all data according to k-NN, then classify with 1-NN.
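Two of the choices mentioned above, sketched in code: a Hamming distance for categorical features, and a min-max rescaling step for the case where dimensions have different scales. Both helper names are mine, not from the slides.

```python
def hamming(a, b):
    # number of positions where two (categorical) feature vectors disagree
    return sum(x != y for x, y in zip(a, b))

def min_max_scale(rows):
    """Rescale every numeric feature to [0, 1] so that no single dimension
    dominates a Euclidean or Manhattan distance."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]
```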

Distance Measurement

Training examples:

Sample  Feature1  Feature2  Feature3  Feature4  Feature5  Label
S1      1         2         3         4         2         Positive
S2      1         2         3         1         2         Positive
S3      1         2         3         1         1         Positive
S4      1         2         3         1         2         Positive
S5      2         1         3         1         3         Negative
S6      2         1         3         4         2         Negative
S7      2         1         3         4         1         Negative
S8      2         1         3         3         1         Negative

Test examples:

Sample  Feature1  Feature2  Feature3  Feature4  Feature5  Label
T1      1         2         3         4         1         ?
T2      2         1         3         1         3         ?

Distance Measurement (distance from each training sample to T1 and T2):

Sample  T1  T2
S1      1   6
S2      4   3
S3      4   4
S4      4   3
S5      6   0
S6      2   4
S7      2   5
S8      3   4

K=1: T1 = Positive and T2 = Negative. K=3: T1 = Negative and T2 = Positive.
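To reproduce the vote, here is a short sketch that takes the distances from the table above as given (it does not assume any particular distance metric) and classifies T1 and T2 by majority vote among the k nearest samples; running it gives the same classifications as stated on the slide. The dictionary layout and the vote function name are mine.

```python
from collections import Counter

# (distance to T1, distance to T2, label) for each training sample, from the table above.
distances = {
    "S1": (1, 6, "Positive"), "S2": (4, 3, "Positive"),
    "S3": (4, 4, "Positive"), "S4": (4, 3, "Positive"),
    "S5": (6, 0, "Negative"), "S6": (2, 4, "Negative"),
    "S7": (2, 5, "Negative"), "S8": (3, 4, "Negative"),
}

def vote(which, k):
    # sort samples by distance to the chosen test point and take a majority vote
    idx = 0 if which == "T1" else 1
    nearest = sorted(distances.values(), key=lambda row: row[idx])[:k]
    return Counter(label for *_, label in nearest).most_common(1)[0][0]

for k in (1, 3):
    print(f"k={k}: T1 -> {vote('T1', k)}, T2 -> {vote('T2', k)}")
```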

Feature Selection. Suppose we only choose Feature1 and Feature2, using the same training and test data as above.

Distance Measurement (using only Feature1 and Feature2):

Sample  T1  T2
S1      0   2
S2      0   2
S3      0   2
S4      0   2
S5      2   0
S6      2   0
S7      2   0
S8      2   0

K=1: T1 = Positive and T2 = Negative. K=3: T1 = Positive and T2 = Negative.