Computational Biology


1 Computational Biology
Classification (some parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar) Lecture Slides Week 10

2 MBG404 Overview
Processing, Pipelining, Generation, Data Storage, Mining

3 Data Mining
Data mining (noun, Digital Technology): the process of collecting, searching through, and analyzing a large amount of data in a database, so as to discover patterns or relationships, e.g., the use of data mining to detect fraud.
Machine learning: a branch of artificial intelligence concerning the construction and study of systems that can learn from data. For example, a machine learning system could be trained on messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new messages into spam and non-spam folders.

4 Application in Biology
Data exploration: Microarrays, Next generation sequencing
Prediction: microRNAs, Protein secondary structure

5 Parametrization

6 Types of Attributes
There are different types of attributes:
Nominal. Examples: ID numbers, eye color, zip codes
Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio. Examples: temperature in Kelvin, length, time, counts

7 Data Type
Distinctness (nominal): equal or unequal
Order (ordinal): >, <, >=, <=
Addition (interval): +, -
Multiplication (ratio): *, /

8 Data Quality
Missing data, Noise, False measurements, Outliers, Duplicate data, Precision, Bias, Accuracy

9 Data Preprocessing
Aggregation, Sampling, Dimensionality Reduction, Feature subset selection, Feature creation, Discretization and Binarization, Attribute Transformation

10 Aggregation
Variation of Precipitation in Australia
[Figure panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation]

11 Dimensionality Reduction: PCA

12 Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

13 Similarity
Euclidean distance, Simple matching coefficient, Jaccard coefficient, Correlation, Cosine similarity, ...
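
The measures listed on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are made up for this example, the matching and Jaccard coefficients assume binary (0/1) vectors, and correlation is computed as the cosine similarity of the mean-centered vectors.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching(p, q):
    """Fraction of positions where two binary vectors agree (0-0 and 1-1)."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / len(p)

def jaccard(p, q):
    """Like simple matching, but 0-0 agreements are ignored."""
    both = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return both / either if either else 0.0

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def correlation(p, q):
    """Pearson correlation: cosine similarity after mean-centering."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    return cosine_similarity([a - mean_p for a in p],
                             [b - mean_q for b in q])

# Made-up binary vectors: they agree mostly on zeros.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc_value = simple_matching(p, q)   # counts the shared zeros
jaccard_value = jaccard(p, q)       # ignores the shared zeros
```

The two binary vectors show why the choice of measure matters: they look similar under simple matching because of the many shared zeros, but share no 1s, so the Jaccard coefficient treats them as dissimilar.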

14 Sampling
Curse of dimensionality
Feature selection
Dimensionality reduction: Principal component analysis
Aggregation
Mapping of data to different space

15 Sampling
Dividing samples into: Training set, Test set
Using not all samples from both sets
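
Dividing samples into a training set and a test set could look roughly like the sketch below. The function name, the 30% test fraction, and the fixed seed are assumptions chosen for illustration, not values given in the lecture.

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and cut them into (training set, test set)."""
    rng = random.Random(seed)        # fixed seed makes the split repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten hypothetical sample IDs split 7 / 3.
train, test = train_test_split(range(10), test_fraction=0.3)
```

Every sample ends up in exactly one of the two sets, so the test set stays unseen during training.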

16 Classification
Examples with known classes (labels)
Learn rules of how the attributes define the classes
Classify unknown samples into the appropriate class

17 Classification Workflow

18 End Theory I 5 min Mindmapping 10 min Break

19 Practice I

20 Exploring Data (Irises)
Download the file Iris.txt
Follow along

21 Exploring Data
Frequencies, Percentiles, Mean, Median, Visualizations

22 Data Selection
Selecting columns, Filtering rows

23 Data Transformation
Discretize, Continuize, Feature construction
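
As one possible illustration of discretizing, the sketch below bins a continuous attribute into equal-width intervals. The function name, the choice of equal-width binning, and the example values are assumptions; the tool used in the practice session may discretize differently.

```python
def discretize_equal_width(values, n_bins):
    """Map each continuous value to a bin index 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                   # all values identical: a single bin
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical measurements (not taken from the Iris file).
lengths = [4.3, 5.0, 5.8, 6.4, 7.9]
bins = discretize_equal_width(lengths, 3)
```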

24 Visualizations

25 End Practice I Break 15 min

26 Theory II

27 Classification Workflow

28 Illustrating Classification Task

29 Example of a Decision Tree
Training Data (attribute types: categorical, continuous, class) -> Model: Decision Tree
Splitting Attributes: Refund, MarSt, TaxInc
Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

30 General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
General Procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
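
The general procedure can be sketched as a short recursive function. This is a simplified illustration, not the lecture's implementation: it handles categorical attributes only, tests attributes in the order given instead of searching for the best split, and passes the parent's majority class down as the default class. The function names and the four toy records are made up.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """records: list of (features_dict, label). Returns a label or (attr, children)."""
    if not records:
        return default_class                     # empty Dt: leaf with default class
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                          # pure Dt (or no tests left): leaf
    # Assumption: take the next attribute in the list as the attribute test.
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for value in sorted({features[attr] for features, _ in records}):
        subset = [(f, y) for f, y in records if f[attr] == value]
        children[value] = hunt(subset, rest, majority)   # recurse on each subset
    return (attr, children)

def classify(tree, features, default_class):
    """Follow the attribute tests until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        if features[attr] not in children:       # unseen attribute value
            return default_class
        tree = children[features[attr]]
    return tree

# Toy training records (invented for this sketch).
records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
    ({"Refund": "Yes", "MarSt": "Married"}, "No"),
]
tree = hunt(records, ["Refund", "MarSt"], default_class="No")
prediction = classify(tree, {"Refund": "No", "MarSt": "Single"}, "No")
```

A practical tree learner would replace the fixed attribute order with a search for the split that most reduces impurity, and would also handle continuous attributes such as taxable income.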

31 Hunt’s Algorithm
[Figure: successive trees grown by Hunt’s algorithm: a single leaf (Don’t Cheat); a split on Refund (Yes, No); a further split of the Refund = No branch on Marital Status (Single, Divorced vs. Married); and a final split on Taxable Income (< 80K vs. >= 80K)]

32 How to Find the Best Split
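Before Splitting: impurity measure M0
Split on A? (Yes, No): Node N1 and Node N2, with measures M1 and M2, combined into M12
Split on B? (Yes, No): Node N3 and Node N4, with measures M3 and M4, combined into M34
Gain = M0 - M12 vs M0 - M34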
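
To make the gain comparison concrete, the sketch below uses the Gini index as the impurity measure M. The slide does not name a measure, so Gini is an assumption here (entropy or misclassification error would slot in the same way), and the class counts are invented.

```python
def gini(counts):
    """Gini impurity of a node, given the number of records per class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_impurity(children):
    """Combine child impurities, weighting each child by its share of records."""
    total = sum(sum(counts) for counts in children)
    return sum(sum(counts) / total * gini(counts) for counts in children)

def gain(parent, children):
    """Impurity before the split minus combined impurity after it."""
    return gini(parent) - weighted_impurity(children)

parent = [6, 6]                    # M0: six records of each class
split_a = [[5, 1], [1, 5]]         # candidate A: nodes N1, N2 (gives M12)
split_b = [[3, 3], [3, 3]]         # candidate B: nodes N3, N4 (gives M34)

gain_a = gain(parent, split_a)     # M0 - M12
gain_b = gain(parent, split_b)     # M0 - M34
best = "A" if gain_a > gain_b else "B"
```

Split A separates the classes well and so yields a positive gain, while split B leaves both children as mixed as the parent and yields none.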

33 Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large

34 Overfitting due to Noise
Decision boundary is distorted by noise point

35 Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region. An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

36 Cost Matrix
C(i|j): Cost of misclassifying class j example as class i

                      PREDICTED CLASS
C(i|j)                Class=Yes     Class=No
ACTUAL   Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS    Class=No     C(Yes|No)     C(No|No)
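
One way to use a cost matrix is to weight each cell of the confusion matrix by its cost and sum the result. The sketch below assumes nested dictionaries keyed as [actual][predicted], so cost[j][i] holds C(i|j). All counts and cost values are invented for illustration.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(confusion[actual][predicted] * cost[actual][predicted]
               for actual in confusion
               for predicted in confusion[actual])

# confusion[actual][predicted]: hypothetical classifier results.
confusion = {
    "Yes": {"Yes": 40, "No": 10},
    "No":  {"Yes": 20, "No": 30},
}
# cost[actual][predicted] = C(predicted | actual): hypothetical costs in which
# missing a true "Yes" is far more expensive than a false alarm.
cost = {
    "Yes": {"Yes": -1, "No": 100},
    "No":  {"Yes": 1,  "No": 0},
}
model_cost = total_cost(confusion, cost)
```

With these numbers the ten missed "Yes" examples dominate the total, which is the kind of asymmetry that plain accuracy hides.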

37 Cost-Sensitive Measures
Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)
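
The sketch below computes the three measures from confusion-matrix counts, using the standard definitions. The variable names and counts are made up. Note which cells each function takes as input: none of them uses the true negatives, which corresponds to the C(No|No) cell.

```python
def precision(tp, fp):
    """Of everything predicted Yes, the fraction that is truly Yes."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly Yes, the fraction that was predicted Yes."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts: tp = actual Yes predicted Yes, fn = actual Yes predicted No,
# fp = actual No predicted Yes. True negatives are never needed.
tp, fn, fp = 40, 10, 20
p = precision(tp, fp)
r = recall(tp, fn)
f = f_measure(tp, fp, fn)
```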

38 Receiver Operating Characteristic (ROC) Curve
(TP,FP):
(0,0): declare everything to be negative class
(1,1): declare everything to be positive class
(1,0): ideal
Diagonal line: Random guessing
Below diagonal line: prediction is opposite of the true class

39 Using ROC for Model Comparison
No model consistently outperforms the other
M1 is better for small FPR
M2 is better for large FPR
Area Under the ROC curve
Ideal: Area = 1
Random guess: Area = 0.5
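
The area under the ROC curve can be computed by sweeping a threshold over the classifier scores and applying the trapezoid rule. This is a minimal sketch with invented scores and labels; the points are stored as (FPR, TPR), that is, x first and y second, and labels are assumed to be 1 for positive and 0 for negative, with at least one of each.

```python
def roc_points(scores, labels):
    """Return ROC points as (FPR, TPR), running from (0, 0) to (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Lower the threshold past all examples that share this score.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoid-rule area under a curve given as sorted (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.2, 0.1]
auc_ideal = auc(roc_points(scores, [1, 1, 0, 0]))      # positives ranked first
auc_opposite = auc(roc_points(scores, [0, 0, 1, 1]))   # ranking is reversed
auc_uninformative = auc(roc_points([0.5] * 4, [1, 0, 1, 0]))  # all scores tied
```

The three cases match the reference values on the slide: a perfect ranking gives area 1, a classifier that cannot separate the classes gives 0.5, and a ranking that is consistently opposite to the true class falls below the diagonal.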

40 End Theory II 5 min Mindmapping 10 min Break

41 Practice II

42 Learning
Supervised (Classification): Decision tree, SVM

43 Classification Use the iris.txt file for classification
Follow along as we classify

44 Classification
Use the orangeexample file for classification
We are interested in whether we can distinguish between miRNAs and random sequences with the selected features
Try it yourself

