Presentation on theme: "Boosting For Tumor Classification With Gene Expression Data" — Presentation transcript:

1 Boosting For Tumor Classification With Gene Expression Data
Marcel Dettling and Peter Bühlmann, Bioinformatics, Vol. 19 (2003)
Presented by Hojung Cho, Topics for Bioinformatics, Oct

2 Outline
Background: microarrays, scoring algorithm, decision stumps, boosting (a primer)
Methods: feature preselection, LogitBoost, choice of the stopping parameter, multiclass approach
Results: data preprocessing, error rates, ROC curves, validation of the results, simulation
Discussion

3 Microarray data (p ≫ n)
Microarray data contain far more genes p than samples n (Park et al., 2001). Boosting of decision trees has been applied to the classification of gene expression data (Ben-Dor et al., 2000; Dudoit et al., 2002), but "AdaBoost did not yield good results compared to other classifiers."

4 The objective
Improve the performance of boosting for the classification of gene expression data by modifying the algorithm.
The strategies:
- Feature preselection with a nonparametric scoring method
- Binary LogitBoost with decision stumps
- Multiclass approach: reduce the multiclass problem to multiple binary classifications

5 Background: Scoring Algorithm
Nonparametric: allows data analysis without assuming an underlying distribution. Used for feature preselection: score each gene according to its strength for phenotype discrimination and consider only the genes that are differentially expressed across samples.

6 Scoring Algorithm
Sort the expression levels of a gene; group membership then yields a sequence of 0's and 1's. How well the 0's and 1's cluster together measures the correspondence between expression levels and group membership. The score is defined as the smallest number of swaps of consecutive digits necessary to arrive at a perfect splitting (in the slide's example: score = 4); equivalently, it counts how many of the pairwise rank comparisons between the two groups come out in the wrong order.
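The swap count can be computed in a single pass over the class labels sorted by expression, since each adjacent swap fixes exactly one misordered (1, 0) pair. A minimal Python sketch (function names are mine, not from the paper); the quality measure max(s, n0*n1 - s) encodes the next slide's remark that both near-zero and near-maximal scores indicate differential expression:

```python
import numpy as np

def swap_score(expression, labels):
    """Sort samples by expression level, then count the adjacent swaps
    needed to move all 1-labels behind the 0-labels, i.e. the number of
    (1, 0) inversions in the sorted label sequence."""
    seq = np.asarray(labels)[np.argsort(expression)]
    swaps, ones_seen = 0, 0
    for lab in seq:
        if lab == 1:
            ones_seen += 1
        else:              # this 0 must jump over every 1 before it
            swaps += ones_seen
    return swaps

def quality(expression, labels):
    """Assumed quality measure: a gene is interesting when its score is
    near 0 or near the maximum n0 * n1 (perfect separation either way)."""
    n1 = int(np.sum(labels))
    n0 = len(labels) - n1
    s = swap_score(expression, labels)
    return max(s, n0 * n1 - s)
```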

7 Background: Scoring Algorithm
Allows ordering of the genes according to their potential significance and captures to what extent a gene discriminates the response categories. Both a score near zero and a score near the maximum n0·n1 indicate a differentially expressed gene. The resulting quality measure is used to restrict the boosting classifier to this subset of genes.

8 Background: Decision Stumps
Decision trees with only a single split: the presence or absence of a single condition serves as the predicate, e.g. "predict basketball player if and only if height > 2 m" (the slide's figure shows a stump splitting on a test attribute into Label A and Label B). Stumps are the base learner for the boosting procedure. A weak learner is a subroutine that returns a hypothesis given some finite training set and performs only slightly better than random choice; it may be enhanced when combined with the feature preselection.
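A sketch of a stump fit by weighted least squares, which doubles as the base learner for the LogitBoost sketch further below (the weighted-regression formulation is my implementation choice, not code from the paper):

```python
import numpy as np

def fit_stump(X, z, w):
    """Weighted least-squares regression stump: one split on one feature,
    a constant prediction on each side. Returns (feature, threshold,
    left_value, right_value)."""
    n, p = X.shape
    best, best_err = None, np.inf
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, zs, ws = X[order, j], z[order], w[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue
            cl = np.average(zs[:i], weights=ws[:i])   # fitted value left of split
            cr = np.average(zs[i:], weights=ws[i:])   # fitted value right of split
            err = (np.sum(ws[:i] * (zs[:i] - cl) ** 2)
                   + np.sum(ws[i:] * (zs[i:] - cr) ** 2))
            if err < best_err:
                best, best_err = (j, 0.5 * (xs[i] + xs[i - 1]), cl, cr), err
    return best

def predict_stump(stump, X):
    j, thr, cl, cr = stump
    return np.where(X[:, j] <= thr, cl, cr)
```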

9 Background: Boosting
"AdaBoost does not yield very impressive results." LogitBoost relies on the binomial log-likelihood as its loss function and has been found to have a slight edge over AdaBoost in many classification problems. It usually performs better on noisy data, or when there are misspecifications or inhomogeneities of the class labels in the training data (both common in microarray data).

10 Methods
Rank and select features: keep the genes with the highest values of the quality measure; the number of preselected genes can be determined by cross-validation. Train the LogitBoost classifier using decision stumps as weak learners. Choice of the stopping parameter: leave-one-out cross-validation, choosing the m that maximizes the out-of-sample log-likelihood l(m), as sketched below.
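A minimal sketch of the leave-one-out choice of m. Here boost_probs is a hypothetical helper that refits the booster without sample i and returns its predicted P[Y = 1 | x_i] after each of the first M iterations; l(m) is the out-of-sample binomial log-likelihood:

```python
import numpy as np

def choose_stopping_m(X, y, boost_probs, M=100):
    """Pick the iteration count m maximizing the leave-one-out
    log-likelihood l(m) = sum_i [y_i log p_i(m) + (1 - y_i) log(1 - p_i(m))]."""
    loglik = np.zeros(M)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # (M,) array of predicted probabilities for the held-out sample
        p = np.clip(boost_probs(X[keep], y[keep], X[i], M), 1e-10, 1 - 1e-10)
        loglik += y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)
    return int(np.argmax(loglik)) + 1
```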

11 LogitBoost Algorithm
p(x) = P[Y = 1 | X = x], initialized with p^(0)(x) = 1/2 and F^(0)(x) = 0. The slide works through the first step of the iteration, i = 1.
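A sketch of the two-class LogitBoost recursion (Friedman, Hastie and Tibshirani, 2000) with stumps, matching the initialization shown on the slide; it reuses fit_stump and predict_stump from the stump sketch above and assumes y coded as 0/1:

```python
import numpy as np

def logitboost(X, y, M=100):
    """Start from F(x) = 0 and p(x) = 1/2, then repeatedly fit a stump to
    the working response by weighted least squares."""
    F = np.zeros(len(y))
    p = np.full(len(y), 0.5)                   # p^(0)(x) = 1/2
    stumps = []
    for m in range(M):                         # m = 1 is the slide's first step
        w = np.clip(p * (1 - p), 1e-10, None)  # weights w_i = p(1 - p)
        z = (y - p) / w                        # working response z_i
        stump = fit_stump(X, z, w)
        stumps.append(stump)
        F += 0.5 * predict_stump(stump, X)     # F <- F + f_m / 2
        p = 1 / (1 + np.exp(-2 * F))           # p = e^F / (e^F + e^-F)
    return stumps

def predict_proba(stumps, X):
    F = sum(0.5 * predict_stump(s, X) for s in stumps)
    return 1 / (1 + np.exp(-2 * F))
```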

12 Reducing multiclass to binary
Multiclass -> multiple binary problems: match each class against all the other classes (one-against-all). The binary probability estimates for Y = j are combined into multiclass probabilities via normalization and plugged into the Bayes classifier, as sketched below.
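A sketch of the one-against-all reduction, reusing logitboost and predict_proba from above: fit one binary classifier per class, normalize the probability estimates, and take the argmax (the Bayes classifier under equal costs):

```python
import numpy as np

def one_against_all(X, y, X_new, n_classes, M=100):
    # Column j: binary LogitBoost estimate of P[class j | x] vs. the rest.
    probs = np.column_stack([
        predict_proba(logitboost(X, (y == j).astype(int), M), X_new)
        for j in range(n_classes)])
    probs /= probs.sum(axis=1, keepdims=True)  # P[Y = j | x] via normalization
    return probs.argmax(axis=1)                # Bayes classifier
```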

13 Sample data sets
- Leukemia: 47 ALL, 25 AML; Affymetrix oligo; 3571 genes
- Colon: 40 tumor, 22 normal; 6500 genes
- Estrogen and Nodal: 25 ER+, 24 ER-; 7129 genes
- Lymphoma: 42 DLBCL, 9 follicular, 11 chronic; cDNA; 4026 genes
- NCI: 7 breast, 5 CNS, 7 colon, 6 leukemia, 8 melanoma, 9 NSCLC, 6 ovarian, 9 renal; 5244 genes
Data preprocessing:
- Leukemia, NCI: thresholding (floor 100, ceiling 16000), filtering (fold change > 5, max - min > 500), log transform, normalization (sketched below)
- Colon: log transform, normalization
- Estrogen and Nodal: thresholding, log transform, normalization
- Lymphoma: normalization
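A sketch of the Leukemia/NCI pipeline for a samples-by-genes matrix; the log base and the exact form of the normalization step are my assumptions:

```python
import numpy as np

def preprocess_leukemia_nci(X):
    X = np.clip(X, 100, 16000)                   # thresholding: floor 100, ceiling 16000
    mx, mn = X.max(axis=0), X.min(axis=0)
    keep = (mx / mn > 5) & (mx - mn > 500)       # filtering: fold change > 5, max - min > 500
    X = np.log10(X[:, keep])                     # log transform (base assumed)
    return (X - X.mean(axis=0)) / X.std(axis=0)  # normalization (assumed: per-gene standardization)
```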

14 Results – Error Rates (1): the test error using symmetric (equal) misclassification costs

15 Results – Error Rates (2)

16 Results: No. of Iterations & Performance
The choice of the stopping parameter for boosting is not very critical in any of the six datasets: "stopping after a large, but arbitrary number of 100 iterations is a reasonable strategy in the microarray data."

17 Results: ROC curves
The test error using asymmetric misclassification costs. Both boosting classifiers yield curves closer to the ideal ROC curve (red line) than the one from classification trees; boosting has an advantage at small false-negative rates.
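Sweeping the threshold on P[Y = 1 | x] is equivalent to varying the ratio of the two misclassification costs, and traces out an ROC curve like those on the slide (a generic sketch, not the paper's code):

```python
import numpy as np

def roc_points(p_hat, y, n_grid=101):
    """False- and true-positive rates as the probability threshold moves
    from 0 to 1; each threshold corresponds to one asymmetric cost ratio."""
    fpr, tpr = [], []
    for t in np.linspace(0, 1, n_grid):
        pred = p_hat >= t
        tpr.append(np.mean(pred[y == 1]))
        fpr.append(np.mean(pred[y == 0]))
    return np.array(fpr), np.array(tpr)
```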

18 Validation of the results
Error rates compared with published classifiers:
Disease             This study   AdaBoost (Ben-Dor et al.)   SVM (Furey et al.)   Others
Leukemia            1/34         %                           2-4/34               5/34 (Golub et al.)
Colon               %            %                           9.68%
NCI                 22.9%                                                         48% (Dudoit et al.)
Estrogen and Nodal: better predictions than the Bayesian approach (West et al.)
Lymphoma: N/A
One-against-all versus direct multiclass boosting:
Disease    # of classes   Multiclass (Friedman et al.)   One-against-all
NCI        8              36.10%                         22.90%
Lymphoma   3              8.06%                          1.61%

19 Simulation
Model dataset: gene expression profiles are drawn from a multivariate normal distribution whose covariance matrix is estimated from the colon dataset; one of two response classes is then assigned according to a Bernoulli distribution.
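A sketch of this simulation scheme. The multivariate normal profiles with colon-data covariance and the Bernoulli labels follow the slide; the logistic form of the class probability and the coefficient vector beta are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(mu, Sigma, beta, n):
    """Draw n gene-expression profiles from N(mu, Sigma), with Sigma
    estimated from the colon dataset, and assign Y ~ Bernoulli(p(x))."""
    X = rng.multivariate_normal(mu, Sigma, size=n)
    p = 1 / (1 + np.exp(-X @ beta))   # hypothetical class-probability model
    y = rng.binomial(1, p)
    return X, y
```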

20 Conclusion
Feature preselection generally improved the predictive power. LogitBoost performed slightly better than AdaBoost. Reducing multiclass problems to multiple binary problems yielded more accurate results.

21 Discussion
The edge of LogitBoost over AdaBoost is marginal and "far from significant". Did feature preselection really improve the performance, or was the setup manipulated to make LogitBoost perform better? On cross-validating algorithms with published data: authors have considerations other than the raw performance of the algorithms on the training datasets, and leave-one-out is just one way to cross-validate. Biological interpretation is a further discussion point.

