Computational Biology Lecture Slides Week 10 Classification (some parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar)


MBG404 Overview Data Generation Processing Storage Mining Pipelining

Data Mining
– Data mining (noun, Digital Technology): the process of collecting, searching through, and analyzing a large amount of data in a database, so as to discover patterns or relationships, e.g., the use of data mining to detect fraud.
– Machine learning: a branch of artificial intelligence concerned with the construction and study of systems that can learn from data. For example, a machine learning system could be trained on messages to learn to distinguish between spam and non-spam messages. After learning, it can be used to classify new messages into spam and non-spam folders.

Application in Biology
– Data exploration
  – Microarrays
  – Next-generation sequencing
– Prediction
  – microRNAs
  – Protein secondary structure

Parametrization

Types of Attributes
There are different types of attributes:
– Nominal. Examples: ID numbers, eye color, zip codes
– Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
– Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio. Examples: temperature in Kelvin, length, time, counts

Data Type
– Distinctness (nominal): equal or unequal (=, ≠)
– Order (ordinal): <, >
– Addition (interval): +, -
– Multiplication (ratio): *, /

Data Quality
– Missing data
– Noise
– False measurements
– Outliers
– Duplicate data
– Precision
– Bias
– Accuracy

Data Preprocessing
– Aggregation
– Sampling
– Dimensionality reduction
– Feature subset selection
– Feature creation
– Discretization and binarization
– Attribute transformation

Aggregation
Variation of Precipitation in Australia
[Figure: two panels, "Standard Deviation of Average Monthly Precipitation" and "Standard Deviation of Average Yearly Precipitation"]

Dimensionality Reduction: PCA

Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

Similarity
– Euclidean distance
– Simple matching coefficient
  – Jaccard coefficient
– Correlation
– Cosine similarity
– ...
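The measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the course: the function names are made up for this example, the vectors are assumed to have equal length, and the two binary vectors at the bottom are toy data.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching(p, q):
    """Fraction of positions where two binary vectors agree (0-0 and 1-1 both count)."""
    return sum(1 for a, b in zip(p, q) if a == b) / len(p)

def jaccard(p, q):
    """Like simple matching, but 0-0 agreements are ignored."""
    both = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return both / either if either else 0.0

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def correlation(p, q):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    return cosine_similarity([a - mean_p for a in p], [b - mean_q for b in q])

# Toy binary vectors (illustrative only): mostly zeros, no shared ones.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc = simple_matching(p, q)   # 7 of 10 positions agree
jac = jaccard(p, q)           # no position has a 1 in both vectors
```

The toy vectors show why the Jaccard coefficient sits under simple matching: for sparse binary data the many 0-0 agreements inflate simple matching, while Jaccard ignores them.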

Sampling
– Curse of dimensionality
  – Feature selection
  – Dimensionality reduction
  – Principal component analysis
  – Aggregation
  – Mapping of data to a different space

Sampling
– Dividing samples into
  – Training set
  – Test set
– It is not necessary to use all available samples in either set
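Dividing samples into a training set and a test set can be sketched as follows. The function name, the 30% test fraction, and the fixed seed are assumptions chosen for this illustration; the slides do not prescribe a particular ratio.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten placeholder samples, identified only by their index.
samples = list(range(10))
train, test = train_test_split(samples)
```

Every sample ends up in exactly one of the two lists, so the test set stays unseen while the model is being built.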

Classification
– Examples with known classes (labels)
– Learn rules for how the attributes define the classes
– Classify unknown samples into the appropriate class

Classification Workflow

End Theory I 5 min Mindmapping 10 min Break

Practice I

Exploring Data (Irises)
– Download the file Iris.txt
– Follow along

Exploring Data
– Frequencies
– Percentiles
– Mean, median
– Visualizations

Data Selection
– Selecting columns
– Filtering rows

Data Transformation
– Discretize
– Continuize
– Feature construction

Visualizations

End Practice I Break 15 min

Theory II

Classification Workflow

Illustrating Classification Task

Example of a Decision Tree
[Figure: training data table (column types: categorical, categorical, continuous, class) next to the model, a decision tree. Splitting attributes: Refund (Yes / No), MarSt (Married / Single, Divorced), TaxInc (< 80K / > 80K). Leaf labels: YES, NO.]

General Structure of Hunt’s Algorithm
Let D_t be the set of training records that reach a node t.
General procedure:
– If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t
– If D_t is an empty set, then t is a leaf node labeled by the default class, y_d
– If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
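The three cases of the procedure can be sketched as a short recursive function. This is a minimal sketch under several assumptions that are not on the slide: attributes are categorical, they are tested in the order given (no best-split search), the majority class of a node serves as the default class for its children, and recursion also stops when no attributes are left. The four records are invented toy data that only borrow the attribute names from the example.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """records: list of (attribute_dict, class_label) pairs reaching this node (D_t)."""
    if not records:                                   # D_t is empty: default-class leaf
        return default_class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:       # one class left (or nothing to test): leaf
        return majority
    attribute, remaining = attributes[0], attributes[1:]
    subsets = {}                                      # attribute test: one subset per value
    for values, label in records:
        subsets.setdefault(values[attribute], []).append((values, label))
    return {attribute: {value: hunt(subset, remaining, majority)
                        for value, subset in subsets.items()}}

def classify(tree, values, default_class):
    """Follow the attribute tests down to a leaf; unseen values get the default class."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(values[attribute], default_class)
    return tree

# Invented toy records: (attributes, class label).
records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
    ({"Refund": "Yes", "MarSt": "Married"}, "No"),
]
tree = hunt(records, ["Refund", "MarSt"], default_class="No")
prediction = classify(tree, {"Refund": "No", "MarSt": "Single"}, "No")
```

Because the subsets here are built only from attribute values that actually occur, the empty-set case is never reached in this sketch; it is kept to mirror the procedure as stated.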

Hunt’s Algorithm
[Figure: four stages of tree growth. (1) A single leaf, Don’t Cheat. (2) A split on Refund (Yes / No). (3) A further split on Marital Status (Single, Divorced / Married). (4) A further split on Taxable Income (< 80K / >= 80K).]

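How to Find the Best Split
[Figure: impurity measure before and after two candidate splits]
– Before splitting: M0
– Split on A? (Yes / No): nodes N1 and N2 with measures M1 and M2, combined into M12
– Split on B? (Yes / No): nodes N3 and N4 with measures M3 and M4, combined into M34
– Gain = M0 - M12 vs. M0 - M34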
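The comparison of M0 - M12 against M0 - M34 can be made concrete with a specific impurity measure. The sketch below assumes the Gini index as the measure M and weights each child node by its share of the records; the slide itself does not fix the measure, and the class counts are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index of one node: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def combined_impurity(children):
    """Impurity of a split (M12 or M34): child impurities weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)

def gain(parent, children):
    """Gain = impurity before the split minus combined impurity after it."""
    return gini(parent) - combined_impurity(children)

# Invented node: 6 positive and 6 negative records, so M0 = 0.5.
parent = ["+"] * 6 + ["-"] * 6
split_a = [["+"] * 5 + ["-"] * 1, ["+"] * 1 + ["-"] * 5]   # fairly pure children
split_b = [["+"] * 3 + ["-"] * 3, ["+"] * 3 + ["-"] * 3]   # children as mixed as the parent
gain_a = gain(parent, split_a)
gain_b = gain(parent, split_b)
best = "A" if gain_a > gain_b else "B"
```

Split A is preferred here because its children are purer than the parent, while split B leaves the class mix unchanged and gains nothing.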

Underfitting and Overfitting
[Figure: region labeled "Overfitting"]
Underfitting: when the model is too simple, both training and test errors are large.

Overfitting due to Noise
The decision boundary is distorted by a noise point.

Overfitting due to Insufficient Examples
– A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region.
– With too few training records in the region, the decision tree predicts the test examples using other training records that are irrelevant to the classification task.

Cost Matrix

                      PREDICTED CLASS
C(i|j)                Class=Yes     Class=No
ACTUAL    Class=Yes   C(Yes|Yes)    C(No|Yes)
CLASS     Class=No    C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
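A cost matrix is applied by multiplying each cell of the confusion matrix by the matching cost and summing. The sketch below keys both tables by (actual, predicted), so the entry for key (j, i) holds C(i|j). All counts and costs are illustrative values, not results from the course data.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over all (actual, predicted) cells."""
    return sum(confusion[cell] * cost[cell] for cell in confusion)

def accuracy(confusion):
    """Fraction of records on the diagonal (actual == predicted)."""
    correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
    return correct / sum(confusion.values())

# Keys are (actual, predicted); the numbers are made up for illustration.
confusion = {("Yes", "Yes"): 150, ("Yes", "No"): 40,
             ("No", "Yes"): 60, ("No", "No"): 250}
cost = {("Yes", "Yes"): -1,    # C(Yes|Yes): a correct Yes is rewarded
        ("Yes", "No"): 100,    # C(No|Yes): missing a Yes is expensive
        ("No", "Yes"): 1,      # C(Yes|No): a false alarm is cheap
        ("No", "No"): 0}       # C(No|No)
model_cost = total_cost(confusion, cost)
model_accuracy = accuracy(confusion)
```

With asymmetric costs like these, a model with good accuracy can still be costly, because the 40 missed Yes records dominate the total.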

Cost-Sensitive Measures
– Precision is biased towards C(Yes|Yes) & C(Yes|No)
– Recall is biased towards C(Yes|Yes) & C(No|Yes)
– F-measure is biased towards all except C(No|No)
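The standard definitions of these measures make the stated biases visible: each formula only touches the confusion-matrix cells named on the slide. The sketch below uses the usual definitions with TP = actual Yes predicted Yes, FP = actual No predicted Yes, and FN = actual Yes predicted No; the counts are illustrative and assumed to be non-zero.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean of the two)."""
    precision = tp / (tp + fp)    # uses the (Yes|Yes) and (Yes|No) cells only
    recall = tp / (tp + fn)       # uses the (Yes|Yes) and (No|Yes) cells only
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts; the true negatives (No|No) are never needed.
precision, recall, f_measure = precision_recall_f(tp=150, fp=60, fn=40)
```

The true-negative count does not appear anywhere in the function, which is what "all except C(No|No)" refers to.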

Receiver Operating Characteristic (ROC) Curve
Points are given as (TP, FP) rates:
– (0,0): declare everything to be the negative class
– (1,1): declare everything to be the positive class
– (1,0): ideal
– Diagonal line:
  – Random guessing
  – Below the diagonal line: prediction is the opposite of the true class

Using ROC for Model Comparison
– Neither model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR
– Area under the ROC curve
  – Ideal: area = 1
  – Random guess: area = 0.5
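An ROC curve and its area can be computed from classifier scores by lowering the decision threshold one record at a time. This is a minimal sketch with assumed function names and made-up scores; it expects labels coded as 1 (positive) and 0 (negative), at least one record of each class, and distinct scores (ties are not treated specially). Points are returned as (FPR, TPR), the usual plotting order with FPR on the x-axis.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), from 'everything negative' to 'everything positive'."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]           # threshold above every score
    tp = fp = 0
    for _, label in ranked:         # lower the threshold past one record at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up scores for six records (1 = positive class, 0 = negative class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

One negative record outscores one positive record in this toy data, so the area falls a little short of the ideal value of 1 while staying well above the 0.5 of random guessing.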

End Theory II 5 min Mindmapping 10 min Break

Practice II

Learning
– Supervised (Classification)
  – Classification
    – Decision tree
    – SVM

Classification
– Use the iris.txt file for classification
– Follow along as we classify

Classification
– Use the orangeexample file for classification
– We want to know whether the selected features can distinguish miRNAs from random sequences
– Try it yourself