Published by Giovanna Edger. Modified over 2 years ago.

1
Computational Biology Lecture Slides Week 10 Classification (some parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar)

2
MBG404 Overview Data Generation Processing Storage Mining Pipelining

3
Data Mining
Data mining (noun, Digital Technology): the process of collecting, searching through, and analyzing a large amount of data in a database, so as to discover patterns or relationships, e.g. the use of data mining to detect fraud.
Machine learning: a branch of artificial intelligence concerning the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.

4
Application in Biology
Data exploration
–Microarrays
–Next generation sequencing
Prediction
–microRNAs
–Protein secondary structure

5
Parametrization

6
Types of Attributes
There are different types of attributes:
–Nominal. Examples: ID numbers, eye color, zip codes
–Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
–Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
–Ratio. Examples: temperature in Kelvin, length, time, counts

7
Data Type
Distinctness (nominal): equal or unequal (=, ≠)
Order (ordinal): <, >
Addition (interval): +, -
Multiplication (ratio): *, /

8
Data Quality Missing data Noise False measurements Outliers Duplicate data Precision Bias Accuracy

9
Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation Discretization and Binarization Attribute Transformation

10
Aggregation
[Figure: Variation of Precipitation in Australia. Panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation]

11
Dimensionality Reduction: PCA

12
Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

13
Similarity
Euclidean distance
Simple matching coefficient
–Jaccard coefficient
Correlation
Cosine similarity
...
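The measures listed on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names are made up for this example, the simple matching and Jaccard functions assume binary (0/1) vectors, and p and q are assumed to have the same length.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two numeric vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching(p, q):
    # Fraction of positions where two binary vectors agree (1-1 or 0-0).
    return sum(1 for a, b in zip(p, q) if a == b) / len(p)

def jaccard(p, q):
    # Like simple matching, but 0-0 agreements are ignored.
    both = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return both / either if either else 0.0

def cosine(p, q):
    # Cosine of the angle between two vectors.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def correlation(p, q):
    # Pearson correlation: cosine similarity of the mean-centered vectors.
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    return cosine([a - mean_p for a in p], [b - mean_q for b in q])

# Two sparse binary vectors: they agree on most 0s but share no 1s.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jac = simple_matching(p, q), jaccard(p, q)
```

For the two sparse vectors, the simple matching coefficient is high because of the shared zeros, while the Jaccard coefficient is zero. That difference is the reason Jaccard is often preferred for sparse presence/absence data.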

14
Sampling Curse of dimensionality –Feature selection –Dimensionality reduction –Principal component analysis –Aggregation –Mapping of data to different space

15
Sampling
Dividing samples into
–Training set
–Test set
Using only a subset of the samples from each set
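A minimal sketch of dividing samples into a training set and a test set, assuming a random split. The function name, the 30% test fraction, and the fixed seed are illustrative choices, not values from the lecture.

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=0):
    # Shuffle a copy so the caller's data is left untouched.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # First n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(10), test_fraction=0.3)
```

The two sets are disjoint, so the test set only contains samples the model has not seen during training.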

16
Classification Examples with known classes (labels) Learn rules of how the attributes define the classes Classify unknown samples into the appropriate class

17
Classification Workflow

18
End Theory I 5 min Mindmapping 10 min Break

19
Practice I

20
Exploring Data (Irises) Download the file Iris.txt Follow along

21
Exploring Data Frequencies Percentiles Mean, Median Visualizations

22
Data Selection Selecting columns Filtering rows

23
Data Transformation Discretize Continuize Feature construction
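The first two transformations can be sketched as follows. This assumes that "discretize" means equal-width binning of a numeric attribute and that "continuize" means turning a categorical attribute into numeric indicator columns; the tool used in the practice session may offer other variants. The function names are made up for this example.

```python
def discretize_equal_width(values, n_bins):
    # Map each numeric value to a bin index 0..n_bins-1 of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        idx = int((v - lo) / width) if width else 0
        bins.append(min(idx, n_bins - 1))  # the maximum falls into the last bin
    return bins

def one_hot(values):
    # One indicator column per category, in sorted category order.
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

binned = discretize_equal_width([1.0, 2.0, 3.0, 4.0, 5.0], 2)
encoded = one_hot(["a", "b", "a"])
```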

24
Visualizations

25
End Practice I Break 15 min

26
Theory II

27
Classification Workflow

28
Illustrating Classification Task

29
Example of a Decision Tree
[Figure: training data table (attribute types: categorical, continuous; class label) next to the model, a decision tree. Splitting attributes: Refund (Yes / No), MarSt (Married / Single, Divorced), TaxInc (< 80K / > 80K); leaves labeled YES or NO.]

30
General Structure of Hunt’s Algorithm
Let D_t be the set of training records that reach a node t.
General Procedure:
–If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t
–If D_t is an empty set, then t is a leaf node labeled by the default class, y_d
–If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
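The three cases of the procedure map directly onto a short recursive function. The sketch below makes several simplifying assumptions: all attributes are categorical, the attribute test is simply the next attribute in a given list (choosing the best split is a separate step), and the majority class at a node serves as the default class for its children. The toy records are invented for illustration.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    # records: list of (attribute_dict, class_label) pairs reaching this node.
    if not records:
        return default_class                 # empty set: leaf with default class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                      # single class (or no test left): leaf
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for value in sorted({rec[attr] for rec, _ in records}):
        subset = [(rec, lab) for rec, lab in records if rec[attr] == value]
        children[value] = hunt(subset, rest, majority)  # recurse on each subset
    return (attr, children, majority)

def classify(tree, record):
    # Walk down internal nodes (tuples) until a leaf (class label) is reached.
    while isinstance(tree, tuple):
        attr, children, majority = tree
        tree = children.get(record[attr], majority)
    return tree

# Invented toy data in the spirit of the Refund / Marital Status example.
records = [
    ({"refund": "yes", "marital": "single"}, "no"),
    ({"refund": "no", "marital": "married"}, "no"),
    ({"refund": "no", "marital": "single"}, "yes"),
    ({"refund": "no", "marital": "divorced"}, "yes"),
    ({"refund": "yes", "marital": "married"}, "no"),
]
tree = hunt(records, ["refund", "marital"], default_class="no")
prediction = classify(tree, {"refund": "no", "marital": "single"})
```

Because each split only creates branches for attribute values that actually occur, the empty-set case is reached here only when the function is called with no records at all. A version that creates one branch per possible attribute value would hit it during recursion.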

31
Hunt’s Algorithm
[Figure: the tree grown in four steps. (1) A single leaf: Don’t Cheat. (2) A split on Refund (Yes / No). (3) A further split on Marital Status (Single, Divorced / Married). (4) A further split on Taxable Income (< 80K / >= 80K).]

32
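How to Find the Best Split
[Figure: before splitting, the node has impurity measure M0. Splitting on A? (Yes / No) gives nodes N1 and N2 with measures M1 and M2, combined into M12. Splitting on B? (Yes / No) gives nodes N3 and N4 with measures M3 and M4, combined into M34.]
Gain = M0 - M12 vs M0 - M34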
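The comparison of M0 - M12 against M0 - M34 can be made concrete with a small calculation. This sketch assumes the Gini index as the impurity measure M and a size-weighted average to combine the child nodes; the slide does not fix either choice, and the class counts are invented.

```python
from collections import Counter

def gini(labels):
    # Gini index: 0 for a pure node, larger for more mixed nodes.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity(children):
    # Combine child impurities, weighting each child by its share of records.
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)

def gain(parent, children):
    # M0 minus the combined impurity after the split.
    return gini(parent) - weighted_impurity(children)

parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 5, ["no"] * 5]                          # nodes N1, N2
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # nodes N3, N4
gain_a = gain(parent, split_a)
gain_b = gain(parent, split_b)
```

Here the test on A separates the classes perfectly while the test on B barely changes the class mix, so A has the larger gain and would be chosen.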

33
Underfitting and Overfitting
[Figure label: Overfitting]
Underfitting: when the model is too simple, both training and test errors are large

34
Overfitting due to Noise
The decision boundary is distorted by a noise point

35
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region.
–An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

36
Cost Matrix
C(i|j)              PREDICTED Class=Yes    PREDICTED Class=No
ACTUAL Class=Yes    C(Yes|Yes)             C(No|Yes)
ACTUAL Class=No     C(Yes|No)              C(No|No)
C(i|j): Cost of misclassifying a class j example as class i
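A cost matrix is applied by multiplying each cell of a model's confusion matrix by the matching cost and summing. The sketch below uses illustrative costs and counts that are not taken from the slide; dictionary keys are (predicted, actual) pairs to mirror the C(i|j) notation.

```python
# Illustrative costs: a missed "yes" is far more expensive than a false alarm.
cost = {("yes", "yes"): -1, ("no", "yes"): 100,
        ("yes", "no"): 1, ("no", "no"): 0}

# Illustrative confusion-matrix counts for two hypothetical models.
counts_m1 = {("yes", "yes"): 150, ("no", "yes"): 40,
             ("yes", "no"): 60, ("no", "no"): 250}
counts_m2 = {("yes", "yes"): 250, ("no", "yes"): 45,
             ("yes", "no"): 5, ("no", "no"): 200}

def total_cost(counts, cost):
    # Sum of count * cost over all four (predicted, actual) cells.
    return sum(counts[cell] * cost[cell] for cell in counts)

def accuracy(counts):
    # Share of examples where the predicted class equals the actual class.
    correct = sum(n for (pred, actual), n in counts.items() if pred == actual)
    return correct / sum(counts.values())

cost_m1, cost_m2 = total_cost(counts_m1, cost), total_cost(counts_m2, cost)
acc_m1, acc_m2 = accuracy(counts_m1), accuracy(counts_m2)
```

With these numbers the second model is more accurate but also more costly, because it makes more of the expensive errors. Accuracy alone can therefore pick the wrong model when misclassification costs are unequal.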

37
Cost-Sensitive Measures
–Precision is biased towards C(Yes|Yes) & C(Yes|No)
–Recall is biased towards C(Yes|Yes) & C(No|Yes)
–F-measure is biased towards all except C(No|No)
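The three measures can be written in terms of the confusion-matrix counts, using the standard definitions: TP for actual Yes predicted Yes, FP for actual No predicted Yes, FN for actual Yes predicted No. The counts in the example are invented.

```python
def precision(tp, fp):
    # Of everything predicted Yes, the share that is actually Yes.
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything actually Yes, the share that was predicted Yes.
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

p = precision(tp=150, fp=60)
r = recall(tp=150, fn=40)
f = f_measure(tp=150, fp=60, fn=40)
```

None of the three functions takes the true-negative count as an argument, which is the sense in which the F-measure ignores the C(No|No) cell.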

38
Receiver Operating Characteristic (ROC) Curve
(TP,FP):
–(0,0): declare everything to be negative class
–(1,1): declare everything to be positive class
–(1,0): ideal
Diagonal line:
–Random guessing
–Below diagonal line: prediction is opposite of the true class

39
Using ROC for Model Comparison
–No model consistently outperforms the other
–M1 is better for small FPR
–M2 is better for large FPR
–Area Under the ROC curve
–Ideal: Area = 1
–Random guess: Area = 0.5
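A rough sketch of how a ROC curve and its area can be computed from classifier scores. It assumes binary labels (1 = positive, 0 = negative), at least one example of each class, and no tied scores; the area uses the trapezoidal rule. Points are returned as (FPR, TPR), i.e. x first, which is the reverse of the (TP,FP) ordering used on the previous slide. Scores and labels are invented.

```python
def roc_points(scores, labels):
    # Sweep the decision threshold from the highest score downwards.
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]          # threshold above every score: all negative
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points                  # ends at (1.0, 1.0): all positive

def auc(points):
    # Trapezoidal area under the piecewise-linear curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6]
auc_perfect = auc(roc_points(scores, [1, 1, 0, 0]))   # positives ranked first
auc_mixed = auc(roc_points(scores, [1, 0, 1, 0]))     # partly correct ranking
auc_inverted = auc(roc_points(scores, [0, 0, 1, 1]))  # positives ranked last
```

A ranking that places every positive above every negative gives area 1, and a fully inverted ranking gives area 0, matching the "below the diagonal" case where predictions are the opposite of the true class.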

40
End Theory II 5 min Mindmapping 10 min Break

41
Practice II

42
Learning Supervised (Classification) –Classification Decision tree SVM

43
Classification Use the iris.txt file for classification Follow along as we classify

44
Classification
Use the orangeexample file for classification
We are interested in whether we can distinguish between miRNAs and random sequences with the selected features
Try yourself
