Classification 10/03/07.

Diagnose disease by gene expression pattern
Golub et al. 1999

Two types of statistical learning
Supervised: The classes are predefined, and the class membership of a set of objects is known. The goal is to develop a rule that predicts the membership of a new object.
Unsupervised: Discover clusters of patterns from observed data. Both the clusters and the membership need to be identified.
Classification is a kind of supervised learning.

How good is good enough? Suppose a test is used to screen for a certain disease. The test has 99% sensitivity and 99% specificity. The disease is rare: 1 case out of 1 million people. Question: Is this test useful?

How good is good enough?
Misclassification rate of the test = $10^{-6} \times 0.01 + (1 - 10^{-6}) \times 0.01 = 0.01$.
If we predict that no one has the disease, the misclassification rate = $10^{-6} \times 1 = 10^{-6}$.
Does that mean the test is no good?
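The arithmetic for the screening example can be checked with a short sketch. The sensitivity, specificity, and prevalence are the values from the slide; the positive predictive value (PPV) is an extra quantity not on the slide, added here as one way to see how rarely a positive result indicates disease.

```python
# Worked numbers for the screening example (values taken from the slide).
sensitivity = 0.99   # P(test positive | diseased)
specificity = 0.99   # P(test negative | healthy)
prevalence = 1e-6    # 1 case per 1 million people

# Misclassification rate of the test: false negatives plus false positives.
test_error = prevalence * (1 - sensitivity) + (1 - prevalence) * (1 - specificity)

# Misclassification rate of the trivial rule "nobody has the disease":
# it is wrong only on the diseased cases.
trivial_error = prevalence * 1.0

# Positive predictive value, P(diseased | test positive), by Bayes' rule.
# This quantity is an addition for illustration, not part of the slide.
true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * (1 - specificity)
ppv = true_pos / (true_pos + false_pos)

print(f"test error    = {test_error:.6g}")
print(f"trivial error = {trivial_error:.6g}")
print(f"PPV           = {ppv:.6g}")
```

The trivial rule has the lower misclassification rate, yet it never detects a case, which is why the misclassification rate alone is not a sufficient criterion here.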

Loss function
Often our goal is to minimize the misclassification error rate. Sometimes an error in one direction outweighs an error in the other direction. For example, it is more costly to classify a sick patient as healthy than to classify a healthy patient as sick. In general, we want to minimize a loss function L(Ctrue, Cpredict).
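A minimal sketch of an asymmetric loss follows. The loss values (10 for a missed case, 1 for a false alarm) and the toy labels are hypothetical, chosen only to show that two rules with the same misclassification rate can have very different average loss.

```python
# Hypothetical loss values L(C_true, C_predict); the numbers are illustrative only.
LOSS = {
    ("sick", "sick"): 0.0,
    ("sick", "healthy"): 10.0,    # missed diagnosis, assumed 10 times worse
    ("healthy", "sick"): 1.0,     # false alarm
    ("healthy", "healthy"): 0.0,
}

def average_loss(true_labels, predicted_labels):
    """Mean of L(C_true, C_predict) over paired labels."""
    total = sum(LOSS[(t, p)] for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)

truth = ["sick", "sick", "healthy", "healthy", "healthy"]
rule_a = ["sick", "healthy", "healthy", "healthy", "healthy"]  # one missed case
rule_b = ["sick", "sick", "sick", "healthy", "healthy"]        # one false alarm

# Both rules make exactly one error, but the average losses differ.
print(average_loss(truth, rule_a), average_loss(truth, rule_b))
```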

Procedure for developing a classifier
Collect data with known class association.
Take out a subset and don’t touch it. This will be the testing subset.
Build a model using information from the rest of the data, i.e., the training set.
Apply the trained model to the testing data.
Evaluate model performance.
If you use all of the data to train your model, the model is evaluated on data it has already seen: it will be overfitted and its performance will be exaggerated.
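The hold-out step of the procedure can be sketched as below. The function name `train_test_split`, the 25% test fraction, and the toy data are assumptions made for illustration.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and set aside a fraction as the testing subset."""
    rng = random.Random(seed)
    shuffled = samples[:]            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    return shuffled[n_test:], shuffled[:n_test]   # (training set, testing set)

# Toy data: (measurement, class label) pairs.
data = [(float(i), "A" if i < 10 else "B") for i in range(20)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The testing subset returned here should not be consulted again until the final evaluation step.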

k-nearest-neighbor classifier
[Figure: an unknown case, marked "?", among labeled training points]

Find k-nearest neighbors
[Figure: the five nearest neighbors of the unknown case, numbered 1 to 5]
Classify the unknown case by majority vote. Despite its simplicity, kNN can be effective.
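A minimal sketch of the majority-vote rule, assuming Euclidean distance (the slide does not name a distance measure). The function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance is an assumption; other metrics could be substituted.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5)]
labels = ["class1", "class1", "class1", "class2", "class2"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, labels, (6.5, 6.0), k=3))
```

Every prediction scans all training points, which is the computational cost noted on the next slide.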

Issues with k-nearest-neighbor classifier
Computationally intensive.
How to choose k?
Nearest neighbors may not be close (especially when X is high dimensional).
Most genes are probably irrelevant to the prediction anyhow.
Pre-select features using dimension reduction methods (discussed by Prof. Cai last time).
Dimension reduction is important for other classifiers as well.

Feature selection
The dimension of the model (the number of genes) is very high.
It is hard to find close neighbors in high-dimensional space.
Many genes are irrelevant.
Pre-select genes using dimension reduction methods.
Dimension reduction is required for other models as well.

Feature selection

Feature selection methods
Stepwise regression, PCA, PLS, ridge regression, LASSO, etc. (Cai)

Classification Methods
Linear discriminant analysis (LDA)
Logistic regression
Classification trees
Support vector machine (SVM)
Neural network
Many other methods!

Linear methods
[Figure: Class 1, Class 2, and an unlabeled region marked "???"]

Linear Discriminant Analysis (LDA)
Approximate the probability distribution within each class by a Gaussian distribution.
[Figure: Class 1 and Class 2]

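Bayes Rule
The posterior distribution is $P(G = k \mid X = x) = \dfrac{\pi_k \, p_k(x)}{\sum_{l} \pi_l \, p_l(x)}$, where $\pi_k$ is the prior probability of class $k$ and $p_k(x)$ is the density of $X$ within class $k$.
Select the class $k$ with the largest posterior probability. This rule minimizes the average misclassification rate.
The maximum likelihood rule is equivalent to the Bayes rule with a uniform prior.
The decision boundary between classes $k$ and $l$ is $\{x : P(G = k \mid X = x) = P(G = l \mid X = x)\}$.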

Linear Discriminant Analysis
Assume

Linear Discriminant Analysis

LDA
The boundary is linear if the covariance matrices of the two classes are the same. Otherwise, the boundary is quadratic and the method is called quadratic discriminant analysis (QDA).
[Figure: Class 1 and Class 2]
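Why the shared-covariance case gives a linear boundary can be seen from the log-ratio of the posteriors. The derivation below is a sketch under the standard textbook assumptions, not necessarily the notation of the original slides: class $k$ has density $N(\mu_k, \Sigma)$ with a common covariance matrix $\Sigma$ and prior probability $\pi_k$.

```latex
% Assumption: p_k(x) is the N(\mu_k, \Sigma) density with a shared \Sigma.
\begin{align*}
\log \frac{P(G = k \mid X = x)}{P(G = l \mid X = x)}
  &= \log \frac{\pi_k}{\pi_l} + \log \frac{p_k(x)}{p_l(x)} \\
  &= \log \frac{\pi_k}{\pi_l}
     - \tfrac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l)
     + x^T \Sigma^{-1} (\mu_k - \mu_l).
\end{align*}
% The quadratic term -\tfrac{1}{2} x^T \Sigma^{-1} x appears in both
% log-densities and cancels, so the boundary (log-ratio = 0) is linear in x.
% With class-specific \Sigma_k and \Sigma_l, the term
% -\tfrac{1}{2} x^T (\Sigma_k^{-1} - \Sigma_l^{-1}) x remains: this is QDA.
```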

Diabetes Data Set

Logistic regression
Model the log-odds of the k-th class versus a reference class, e.g. the 1st class: $\log \dfrac{P(G = k \mid X = x)}{P(G = 1 \mid X = x)} = \beta_{k0} + \beta_k^T x$.
Select the k with the largest P(G = k | X = x).
Question: How to estimate the β’s?

Fitting logistic regression model
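Let $p_k(x; \theta) = P(G = k \mid X = x; \theta)$. Maximize the conditional log-likelihood $\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta)$, where $g_i$ is the observed class of the $i$-th sample.
In the special case of two classes, let $y_i = 0$ when $g_i = 1$, and $y_i = 1$ when $g_i = 2$. Then $\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right) \right]$, where $x_i$ includes a constant 1 for the intercept.
The maximum is achieved when $\dfrac{\partial \ell}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - p(x_i; \beta) \right) = 0$, with $p(x; \beta) = \dfrac{e^{\beta^T x}}{1 + e^{\beta^T x}}$.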

Fitting logistic regression model (ctd)
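Since this equation is non-linear in β, it can only be solved numerically. This is achieved by the Newton-Raphson method: $\beta^{new} = \beta^{old} - \left( \dfrac{\partial^2 \ell}{\partial \beta \, \partial \beta^T} \right)^{-1} \dfrac{\partial \ell}{\partial \beta}$, where $\ell$ is the conditional log-likelihood, the derivatives are evaluated at $\beta^{old}$, and $\dfrac{\partial^2 \ell}{\partial \beta \, \partial \beta^T} = -\sum_{i} x_i x_i^T \, p(x_i; \beta) \left( 1 - p(x_i; \beta) \right)$ with $p(x; \beta) = P(y = 1 \mid x; \beta)$.
Note: global convergence is not guaranteed.
For multiple classes, β can be solved similarly.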
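The Newton-Raphson iteration can be sketched for the two-class case with a single predictor plus an intercept, so that the 2 x 2 system can be solved by hand. The function name `fit_logistic_newton`, the fixed iteration count, and the toy data are assumptions; the sketch has no step-size control and assumes the classes are not perfectly separable.

```python
import math

def fit_logistic_newton(xs, ys, n_iter=25):
    """Two-class logistic regression, one predictor, fitted by Newton-Raphson.

    Model: P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Score vector: sum_i (1, x_i) * (y_i - p_i).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Negative Hessian: sum_i p_i (1 - p_i) * (1, x_i)(1, x_i)^T.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: beta_new = beta_old + (negative Hessian)^{-1} * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

# Toy data with overlapping classes (not separable).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
print(b0, b1)
```

At the returned values the score equations are satisfied to numerical precision, which is the condition stated for the maximum.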

Connection between LDA and logistic regression

Diabetes Data Set

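Naïve Bayes method
From Bayes’ rule, $P(G = k \mid X) = \dfrac{\pi_k \, p_k(X)}{\sum_{l} \pi_l \, p_l(X)}$.
If $X$ is high-dimensional (its dimension is the number of genes considered), $p_k(X)$ is difficult to estimate. However, if we assume that the $X_j$’s are independent of each other within each class, i.e., $p_k(X) = \prod_{j} p_{kj}(X_j)$, then each $p_{kj}(X_j)$ can be easily estimated.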

Naïve Bayes method
Therefore $P(G = k \mid X) \propto \pi_k \prod_{j} p_{kj}(X_j)$.
Note: Surprisingly, even though the assumption that the $X_j$’s are independent is almost never met, the naïve Bayes classifier often performs well, even beating more sophisticated methods.
So far we have discussed linear methods. Nonlinear methods are discussed next.
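A minimal sketch of a naïve Bayes classifier follows. The slides do not say how each one-dimensional density is estimated; modeling each feature as Gaussian within each class is an assumption made here, as are the function names and the toy data.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, labels):
    """Estimate a prior and per-feature Gaussian (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, g in zip(X, labels):
        by_class[g].append(row)
    model = {}
    for g, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance positive (an assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[g] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class maximizing log prior + sum_j log p_kj(x_j)."""
    def log_posterior(g):
        prior, means, variances = model[g]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (xj - m) ** 2 / (2 * var)
            for xj, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)

X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.7), (2.8, 0.3)]
labels = [1, 1, 1, 2, 2, 2]
nb_model = fit_gaussian_nb(X, labels)
print(predict_gaussian_nb(nb_model, (1.1, 2.0)), predict_gaussian_nb(nb_model, (3.1, 0.4)))
```

Working with sums of log-densities rather than products avoids numerical underflow when the number of features is large.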

Classification tree
Goal: Predict whether a person owns a house by asking a few questions with yes or no answers.
Predictors: Age, Car Type, etc.
[Figure: a decision tree that splits on Age (>= 30 vs. < 30) and on Car Type (sports car vs. minivan), with YES/NO leaves]

Regression tree: Algorithm
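The response is continuous.
Goal: select a partition of the predictor space into regions (nodes) $R_1, \ldots, R_M$, so that the response can be modeled as a constant $c_m$ in each region.
Step 1: For a splitting variable $X_j$ and a splitting point $s$, define $R_1(j, s) = \{X : X_j \le s\}$ and $R_2(j, s) = \{X : X_j > s\}$. Seek $j$ and $s$ so that $\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2$ is minimized.
Step 2: For each $R_m$, refine the partition by repeating Step 1. Stop when the number of nodes reaches a predefined cutoff.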
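Step 1 can be sketched as an exhaustive search over splitting variables and splitting points. Using the observed values of each variable as candidate split points is an assumption (midpoints between sorted values are another common choice); the function name `best_split` and the toy data are hypothetical.

```python
def best_split(X, y):
    """Return (j, s, sse) for the single split with the smallest summed squared error."""
    def sse(values):
        # Squared error around the region mean, the minimizing constant c_m.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for j in range(len(X[0])):
        # Skip the largest value: splitting there would leave R2 empty.
        for s in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= s]    # R1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] > s]    # R2(j, s)
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best

X = [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (10.0, 4.0), (11.0, 7.0), (12.0, 6.0)]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
split = best_split(X, y)
print(split)
```

Step 2 would apply the same search recursively to the observations that fall in each resulting region.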

Classification tree: Pruning
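Define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning the full tree $T_0$. Let $|T|$ be the number of terminal nodes of $T$, $N_m$ the number of observations in node $m$, and $Q_m(T)$ the quality term of node $m$. The quality of a tree is given by $\sum_{m=1}^{|T|} N_m Q_m(T)$.
Define a cost-complexity criterion for a pre-selected level $\alpha$: $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$.
Seek the subtree $T_\alpha$ that minimizes $C_\alpha(T)$.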

Classification tree: Pruning
Find the weak link, that is, the internal node whose collapse leads to the minimum increase of $\sum_m N_m Q_m(T)$, the total of the node quality terms weighted by node size. Collapse it, and repeat the procedure until a single-node tree is reached.
Theorem (Breiman et al. 1984): The optimal subtree is contained in the above sequence of subtrees.
The level of $\alpha$ can be determined through cross-validation. (We will talk about cross-validation later.)

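Classification tree
A classification tree differs from a regression tree in the quality term $Q_m(T)$. Let $\hat{p}_{mk}$ be the proportion of class $k$ observations in node $m$, and let $k(m) = \arg\max_k \hat{p}_{mk}$.
For a regression tree, minimize $Q_m(T) = \dfrac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$, where $\hat{c}_m$ is the mean response in node $m$.
For a classification tree, minimize one of the following:
Misclassification error: $1 - \hat{p}_{m k(m)}$
Gini index: $\sum_{k} \hat{p}_{mk} (1 - \hat{p}_{mk})$
Cross-entropy or deviance: $-\sum_{k} \hat{p}_{mk} \log \hat{p}_{mk}$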
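The three node impurity measures can be sketched directly from the class proportions in a node. The function names and the example node are hypothetical; the natural logarithm is used for the cross-entropy.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of each class among the observations in a node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def misclassification_error(labels):
    return 1.0 - max(class_proportions(labels))

def gini_index(labels):
    return sum(p * (1.0 - p) for p in class_proportions(labels))

def cross_entropy(labels):
    # Only classes present in the node are counted, so log(0) never occurs.
    return -sum(p * math.log(p) for p in class_proportions(labels))

node = ["A"] * 8 + ["B"] * 2
print(misclassification_error(node), gini_index(node), cross_entropy(node))
```

All three measures are zero for a pure node and largest when the classes are evenly mixed.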

Classification tree
Advantage
Visually intuitive
Mathematically “simple”
Drawback
Unstable: tree structures are sensitive to data
Theoretical properties are not well understood

Performance of a classifier
Cross-validation Bootstrap

Cross-validation
The data is divided into a training subset and a testing subset. Model building, including variable selection, tree structure, and so on, must be independent of the testing subset.
Example: n-fold cross-validation. A dataset is randomly divided into n subsets of equal size. Each subset is selected in turn as the testing set, whereas the rest are used as the training set.
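The n-fold partition can be sketched as an index generator. The function name `k_fold_indices` is hypothetical, and when the number of samples is not divisible by the number of folds the subsets differ in size by at most one.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)              # random division
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Any variable selection or tuning would have to be repeated inside each fold using only `train_idx`, so that the testing indices stay independent of model building.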

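Bootstrap methods
Idea: Draw random samples with replacement from the training data, each sample the same size as the original training set. Fit the model using the resampled data, then treat the original training data as testing data.
Estimate: $\widehat{\mathrm{Err}}_{boot} = \dfrac{1}{B} \dfrac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{*b}(x_i)\right)$, where $\hat{f}^{*b}$ is the model fitted to the $b$-th of $B$ bootstrap samples.
Improved version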
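The basic bootstrap idea can be sketched with a generic routine that accepts any fit and predict functions. The names `bootstrap_error`, `fit_majority`, and `predict_majority` are hypothetical, and the majority-class "model" is a deliberately trivial stand-in used only to make the sketch runnable.

```python
import random
from collections import Counter

def bootstrap_error(X, y, fit, predict, n_boot=200, seed=0):
    """Average misclassification rate on the original data over bootstrap fits."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]     # draw n with replacement
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        wrong = sum(predict(model, xi) != yi for xi, yi in zip(X, y))
        errors.append(wrong / n)
    return sum(errors) / n_boot

# Toy stand-in classifier: always predict the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

X = list(range(10))
y = ["A"] * 7 + ["B"] * 3
err = bootstrap_error(X, y, fit_majority, predict_majority)
print(err)
```

Because each bootstrap sample shares observations with the original data used for testing, this estimate tends to be optimistic.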

Use cross-validation to select parameters
A classifier may have several tunable parameters, for example the number of nearest neighbors k for kNN, or the pruning level α for a classification tree. These parameters can be selected by cross-validation (CV). In these cases, the full dataset is divided into three parts: training set, testing set 1, and testing set 2. Testing set 1 is used to tune the parameters, so it cannot be used to estimate model performance objectively. Testing set 2 is needed for that purpose.

Acknowledgement Sources of slides: Cheng Li
