Instance-based Classification Examine the training samples each time a new query instance is given. The relationship between the new query instance and training examples will be checked to assign a class label to the query instance.
KNN: k-Nearest Neighbor A test sample x is predicted by taking the most common class label among the k training samples to which x is most similar. Notation: $x_j$ is the jth training sample, $y_j$ is the class label of $x_j$, and $N_x$ is the set of the k nearest neighbors of x in the training set. The probability that x belongs to the ith class, $P(y = i \mid x)$, is estimated from these neighbors.
KNN: k-Nearest Neighbor, cont'd Proportion of the k nearest neighbors that belong to the ith class: $P(y = i \mid x) = \frac{1}{k} \sum_{x_j \in N_x} \mathbb{1}(y_j = i)$. The class i that maximizes this proportion is assigned as the label of x. Variants of KNN: filtering out irrelevant genes before applying KNN.
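The KNN rule on the two slides above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `knn_predict`, the Euclidean distance, and the toy two-gene samples are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Return (label, proportions) for query x from its k nearest neighbors."""
    # Sort training indices by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_x)), key=lambda j: math.dist(train_x[j], x))
    neighbors = order[:k]  # N_x: indices of the k nearest training samples
    counts = Counter(train_y[j] for j in neighbors)
    # Proportion of the k neighbors that belong to each class.
    proportions = {label: n / k for label, n in counts.items()}
    # The class with the largest proportion becomes the label of x.
    label = max(proportions, key=proportions.get)
    return label, proportions


# Hypothetical toy data: two expression values per sample.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML"]
label, proportions = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
```

Ties between classes are broken arbitrarily here; an odd k avoids ties in the two-class case.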
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring
Publication Info "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring" Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield, Lander. Appeared in Science, Volume 286, October 15, 1999. Whitehead Institute/MIT Center for Genome Research (http://www-genome.wi.mit.edu/cancer), with Dana-Farber (Boston), St. Jude (Memphis), and Ohio State. Additional publications by the same group show similar techniques applied to different disease modalities.
Cancer Classification Class Discovery: defining previously unrecognized tumor subtypes. Class Prediction: assigning particular tumor samples to already-defined classes. Given bone marrow samples: Which cancer classes are present among the samples? How many cancer classes are there (2, 4)? Given that the samples are from leukemia patients, what type of leukemia is each sample (AML vs. ALL)?
Leukemia: Definitions & Symptoms Cancer of the bone marrow. Myelogenous or lymphocytic, acute or chronic. Acute Myelogenous Leukemia (AML) vs. Acute Lymphocytic Leukemia (ALL), in adults and children. Marrow cannot produce appropriate amounts of red and white blood cells. Anemia -> weakness, minor infections; platelet deficiency -> easy bruising. AML: 10,000 new adult cases per year. ALL: 3,500/2,400 new adult/child cases per year.
Leukemia: Treatment & expected outcome Diagnosis via highly specialized laboratory. ALL: 58% survival rate. AML: 14% survival rate. Treatment: chemotherapy, bone marrow transplant. ALL: corticosteroids, vincristine, methotrexate, L-asparaginase. AML: daunorubicin, cytarabine. Correct diagnosis is very important for treatment options and expected outcome! A microarray could provide a systematic diagnosis option, BUT IT IS ONLY ONE TYPE OF DIAGNOSTIC TOOL!
Leukemia: Data set 38 bone marrow samples (27 ALL, 11 AML). 6817 human gene probes.
Cancer Class Prediction Learning Task –Given: Expression profiles of leukemia patients –Compute: A model distinguishing disease classes (e.g., AML vs. ALL patients) from expression data. Classification Task –Given: Expression profile of a new patient + A learned model (e.g., one computed in a learning task) –Determine: The disease class of the patient (e.g., whether the patient has AML or ALL)
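Cancer Class Prediction n genes measured in m patients. Each patient is a vector of expression values with a class label: $(g_{1,1}, \ldots, g_{1,n}) \rightarrow \text{class}_1$; $(g_{2,1}, \ldots, g_{2,n}) \rightarrow \text{class}_2$; ...; $(g_{m,1}, \ldots, g_{m,n}) \rightarrow \text{class}_m$.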
Cancer Class Prediction Approach Rank genes by their correlation with the class variable (AML/ALL). Select a subset of "informative" genes. Have these genes cast a weighted vote to classify a previously unclassified patient. Test the validity of the predictors.
Ranking Genes Rank genes by how predictive they are (individually) of the class, given the data $(g_{i,1}, \ldots, g_{i,n}) \rightarrow \text{class}_i$ for patients $i = 1, \ldots, m$.
Ranking Genes Split the expression values for a given gene g into two pools, one for each class (AML vs. ALL). Determine the mean $\mu$ and standard deviation $\sigma$ of each pool. Rank genes by the correlation metric (separation) $P(g, \text{class}) = \frac{\mu_{ALL} - \mu_{AML}}{\sigma_{ALL} + \sigma_{AML}}$, the mean difference between the classes relative to the SD within the classes.
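The separation metric on this slide is easy to compute directly. The sketch below is illustrative only: the gene names and expression values are invented, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the slide does not say which estimator is meant.

```python
from statistics import mean, stdev


def separation(values, labels):
    """P(g, class) = (mu_ALL - mu_AML) / (sigma_ALL + sigma_AML) for one gene."""
    pool_all = [v for v, c in zip(values, labels) if c == "ALL"]
    pool_aml = [v for v, c in zip(values, labels) if c == "AML"]
    # Sample standard deviation is an assumption, not stated on the slide.
    return (mean(pool_all) - mean(pool_aml)) / (stdev(pool_all) + stdev(pool_aml))


# Hypothetical toy data: one list of expression values per gene, one label per sample.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [10, 12, 11, 2, 3, 1],  # high in ALL
    "gene_b": [1, 2, 3, 9, 11, 10],   # high in AML
    "gene_c": [5, 6, 4, 5, 4, 6],     # no class difference
}

scores = {gene: separation(values, labels) for gene, values in expression.items()}
# Most ALL-like genes first, most AML-like genes last.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Large positive scores mark genes expressed more highly in ALL, large negative scores mark genes expressed more highly in AML, and scores near zero mark uninformative genes.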
Neighborhood Analysis Each gene g is represented by its expression vector $V(g) = (e_1, e_2, \ldots, e_n)$, where $e_i$ is the expression level of gene g in the ith sample. Idealized pattern: $c = (c_1, c_2, \ldots, c_n)$, where $c_i$ is 1 or 0 according to whether sample i belongs to class 1 or class 2. $c^*$ is an idealized random pattern. Count the number of genes having various levels of correlation with c, and compare with the corresponding distribution obtained for random patterns $c^*$.
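One way to picture the comparison against random patterns is a label-permutation loop. The sketch below is an assumption-laden illustration rather than the procedure from the paper: it uses the mean/SD separation score as the correlation measure, a hypothetical threshold of 2.0, 100 permutations, and invented expression values.

```python
import random
from statistics import mean, stdev


def separation(values, labels):
    """Separation score of one gene for a given labeling of the samples."""
    pool_all = [v for v, c in zip(values, labels) if c == "ALL"]
    pool_aml = [v for v, c in zip(values, labels) if c == "AML"]
    spread = stdev(pool_all) + stdev(pool_aml)
    # Guard against a zero denominator when both pools are constant.
    return 0.0 if spread == 0 else (mean(pool_all) - mean(pool_aml)) / spread


def count_neighbors(expression, labels, threshold):
    """Number of genes whose score with the labeling is at least threshold."""
    return sum(1 for values in expression.values()
               if separation(values, labels) >= threshold)


# Hypothetical toy data.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [10, 12, 11, 2, 3, 1],
    "gene_b": [1, 2, 3, 9, 11, 10],
    "gene_c": [5, 6, 4, 5, 4, 6],
}
threshold = 2.0  # illustrative value only

observed = count_neighbors(expression, labels, threshold)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
permuted_counts = []
for _ in range(100):
    shuffled = labels[:]
    rng.shuffle(shuffled)  # a random pattern c* with the same class sizes
    permuted_counts.append(count_neighbors(expression, shuffled, threshold))

# Fraction of random patterns whose neighborhood is at least as large as observed.
fraction = sum(n >= observed for n in permuted_counts) / len(permuted_counts)
```

A small `fraction` suggests that more genes track the real class labels than would be expected by chance.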
Selecting Informative Genes Select the $k_{ALL}$ top-ranked genes (highly expressed in ALL) and the $k_{AML}$ bottom-ranked genes (highly expressed in AML) under $P(g, \text{class}) = \frac{\mu_{ALL} - \mu_{AML}}{\sigma_{ALL} + \sigma_{AML}}$. In Golub's paper, the 25 most positively correlated and the 25 most negatively correlated genes are selected.
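Given per-gene scores, the selection step amounts to taking both ends of the ranking. The snippet is a small sketch with made-up scores and k = 2 on each side (the slide reports 25 on each side for the paper); `select_informative` is a hypothetical helper name.

```python
def select_informative(scores, k_all, k_aml):
    """Return (ALL markers, AML markers) from a gene -> P(g, class) mapping."""
    # Highest scores: genes highly expressed in ALL.
    all_markers = sorted(scores, key=scores.get, reverse=True)[:k_all]
    # Lowest (most negative) scores: genes highly expressed in AML.
    aml_markers = sorted(scores, key=scores.get)[:k_aml]
    return all_markers, aml_markers


# Hypothetical scores for six genes.
scores = {"g1": 4.5, "g2": -4.0, "g3": 0.1, "g4": 2.0, "g5": -1.5, "g6": -0.2}
all_markers, aml_markers = select_informative(scores, k_all=2, k_aml=2)
```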
Determine significant genes A 1% significance level means that 1% of random neighborhoods contain as many points as the observed neighborhood. P(g,c) > 0.30: 709 genes (intersects the 1% level). The median is ~150 genes (if totally random).
Weighted Voting Given a new patient to classify, each of the selected genes casts a weighted vote for exactly one class. The class that gets the most votes is the prediction.
Weighted Voting Suppose that x is the expression level measured for gene g in the patient. The vote of gene g is $V = P(g, \text{class}) \times \left| x - \frac{\mu_{ALL} + \mu_{AML}}{2} \right|$. The first factor is the weight for gene g, reflecting how well the gene is correlated with the class distinction. The second factor is the distance from the measurement to the class boundary, reflecting the deviation of the expression level in the sample from the average of the AML and ALL means.
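Prediction Weighted vote: $V_{AML} = \sum_i v_i w_i$, summed over the genes i whose vote is for AML, where $v_i = \left| x_i - \frac{\mu_{AML} + \mu_{ALL}}{2} \right|$ and $w_i$ is the weight of gene i.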
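The voting scheme can be sketched as follows. The slides do not spell out how a gene picks its class, so this example assumes the sign of P(g, class) × (x − boundary) decides it: positive means the measurement sits on the ALL side for that gene, negative means the AML side. The magnitude of the vote is the weight times the distance to the boundary. All gene parameters and the patient values are invented.

```python
def weighted_vote(sample, genes):
    """Sum the votes per class.

    sample: gene -> expression level x in the new patient
    genes:  gene -> (P, mu_all, mu_aml) learned from the training set
    """
    totals = {"ALL": 0.0, "AML": 0.0}
    for gene, (p, mu_all, mu_aml) in genes.items():
        boundary = (mu_all + mu_aml) / 2  # midpoint between the class means
        signed = p * (sample[gene] - boundary)
        # Assumed rule: positive -> ALL, otherwise AML (a zero vote adds nothing).
        winner = "ALL" if signed > 0 else "AML"
        totals[winner] += abs(signed)  # |P| * |x - boundary|
    return totals


# Hypothetical informative genes and one hypothetical patient.
genes = {
    "gene_a": (4.5, 11.0, 2.0),   # high in ALL
    "gene_b": (-4.0, 2.0, 10.0),  # high in AML
    "gene_c": (1.0, 6.0, 4.0),    # weakly high in ALL
}
sample = {"gene_a": 9.0, "gene_b": 3.0, "gene_c": 4.5}
totals = weighted_vote(sample, genes)
prediction = max(totals, key=totals.get)
```

Here `gene_a` and `gene_b` both vote for ALL, while `gene_c` casts a small vote for AML.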
Prediction Strength The "strength" of a prediction can be assessed as $PS = \frac{V_{winner} - V_{loser}}{V_{winner} + V_{loser}}$, where $V_{winner}$ is the summed vote (absolute value) for the winning class and $V_{loser}$ is the summed vote (absolute value) for the losing class.
Prediction Strength When classifying new cases, the algorithm ignores those cases where the strength of the prediction is below a threshold $\theta$. Prediction = ALL, if $V_{ALL} > V_{AML} \wedge PS > \theta$; AML, if $V_{AML} > V_{ALL} \wedge PS > \theta$; No-call, otherwise.
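The prediction-strength rule with a no-call option fits in one small function. This is a sketch only: the function name `call_class` is hypothetical, and the default threshold of 0.3 is an illustrative value chosen for the example, not one given on the slides.

```python
def call_class(v_all, v_aml, threshold=0.3):
    """Return (prediction, PS) from the summed votes of the two classes."""
    v_winner, v_loser = max(v_all, v_aml), min(v_all, v_aml)
    if v_winner + v_loser == 0:
        return "No-call", 0.0  # no gene voted at all
    ps = (v_winner - v_loser) / (v_winner + v_loser)
    if ps <= threshold:
        return "No-call", ps   # prediction too weak to report
    return ("ALL" if v_all > v_aml else "AML"), ps


strong = call_class(23.25, 0.5)  # votes heavily favor ALL
weak = call_class(5.0, 4.0)      # nearly balanced votes
aml = call_class(1.0, 3.0)       # votes favor AML
```

PS ranges from 0 (tied votes) to 1 (all votes for one class), so raising the threshold trades more no-calls for fewer errors.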
Experiments Cross-validation with the original set of patients –For i = 1 to 38: hold the ith sample aside; use the other 37 samples to determine the weights; with this set of weights, make a prediction on the ith sample. Testing with another set of 34 patients…
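The hold-one-out loop described on this slide can be written generically. In the sketch below the classifier is a deliberately simple stand-in (a single gene, assigned to the class with the nearer mean), not the weighted-voting predictor; the function names and the six toy samples are assumptions for illustration.

```python
from statistics import mean


def leave_one_out(samples, labels, fit, predict):
    """Predict each sample from a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]  # hold the ith sample aside
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)            # learn from the remaining samples
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_means(xs, ys):
    """Stand-in learner: class means of a single gene."""
    mu_all = mean([x for x, y in zip(xs, ys) if y == "ALL"])
    mu_aml = mean([x for x, y in zip(xs, ys) if y == "AML"])
    return mu_all, mu_aml


def predict_nearest_mean(model, x):
    mu_all, mu_aml = model
    return "ALL" if abs(x - mu_all) < abs(x - mu_aml) else "AML"


# Hypothetical expression values of one gene in six patients.
samples = [10.0, 12.0, 11.0, 2.0, 3.0, 1.0]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

predictions = leave_one_out(samples, labels, fit_means, predict_nearest_mean)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

For a faithful estimate, any gene selection would also have to be repeated inside the loop, using only the training samples of each round.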
"Training set" results were 36/38 with 100% accuracy, 2 unknown via cross-validation (37 train, 1 test) Independent "test set" consisted of 34 samples 24 bone marrow samples, 10 peripheral blood samples NOTE: "training set" was ONLY bone marrow samples "test set" contained childhood AML samples, different laboratories Strong predictions (PS=0.77) for 29/34 samples with 100% accuracy Low prediction strength from questionable laboratory Prediction: Results Slection of 8-200 genes gives roughly the same prediction quality.
Cancer Class Discovery Given –Expression profiles of leukemia patients Do –Cluster the profiles, leading to discovery of the subclasses of leukemia represented by the set of patients
Cancer Class Discovery Experiment Cluster the expression profiles of 38 patients in the training set –Using self-organizing maps with a predefined number of clusters (say, k) Run with k = 2 –Cluster 1 contained 1 AML, 24 ALL –Cluster 2 contained 10 AML, 3 ALL
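As a quick worked check of the k = 2 result, the cluster counts on this slide can be turned into an agreement figure by labeling each cluster with its majority class. The counts come from the slide; the majority-label measure itself is just one convenient summary chosen for this example.

```python
# Cluster contents for k = 2, as reported on the slide.
clusters = {
    1: {"AML": 1, "ALL": 24},
    2: {"AML": 10, "ALL": 3},
}

total = sum(sum(counts.values()) for counts in clusters.values())
# Samples that match the majority class of their cluster.
agree = sum(max(counts.values()) for counts in clusters.values())
majority = {cid: max(counts, key=counts.get) for cid, counts in clusters.items()}
agreement = agree / total
```

With these counts, 34 of the 38 samples fall in the cluster dominated by their own class, and the totals (11 AML, 27 ALL) match the training set.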
Cancer Class Discovery Experiment Run with k = 4 –Cluster 1 contained mostly AML –Cluster 2 contained mostly T-cell ALL –Cluster 3 contained mostly B-cell ALL –Cluster 4 contained mostly B-cell ALL The clustering thus largely separated AML from ALL and, within ALL, T-cell from B-cell cases, a distinction unlikely to arise by chance.