Supervised gene expression data analysis using SVMs and MLPs Giorgio Valentini

Supervised gene expression data analysis using SVMs and MLPs Giorgio Valentini e-mail: valenti@disi.unige.it

Outline
A real problem: lymphoma gene expression data analysis by machine learning methods:
- Diagnosis of tumors using a supervised approach
- Discovering groups of genes related to carcinogenic processes
- Discovering subgroups of diseases using gene expression data

DNA microarray
- DNA hybridization microarrays supply information about gene expression by measuring the mRNA levels of a large number of genes in a cell.
- They offer a snapshot of the overall functional status of a cell: virtually all differences in cell type or state are related to changes in the mRNA levels of many genes.
- DNA microarrays have been used in mutational analyses, genetic mapping studies, genome-wide monitoring of gene expression, pharmacogenomics and metabolic pathway analysis.

A DNA microarray image (E. coli)
- Each spot corresponds to the expression level of a particular gene.
- Red spots correspond to over-expressed genes.
- Green spots correspond to under-expressed genes.
- Yellow spots correspond to intermediate levels of gene expression.

Analyzing microarray data by machine learning methods
The large amount of gene expression data requires machine learning methods to analyze DNA microarray data and extract significant knowledge from them.
Unsupervised approach
- No or limited a priori knowledge.
- Clustering algorithms are used to group together similar expression patterns: grouping sets of genes, or grouping different cells or different functional states of the cell.
- Examples: hierarchical clustering, fuzzy or possibilistic clustering, self-organizing maps.
Supervised approach
- “A priori” biological and medical knowledge on the problem domain.
- Learning algorithms with labeled examples are used to associate gene expression data with classes: separating normal from cancerous tissues, classifying different classes of cells on a functional basis, predicting the functional class of unknown genes.
- Examples: multi-layer perceptrons, support vector machines, decision trees, ensembles of classifiers.

A real problem: a gene expression analysis of lymphoma
Biological problems
1. Separating cancerous and normal tissues using the overall information available.
2. Identifying groups of genes specifically related to the expression of two different tumour phenotypes through expression signatures.
Machine learning methods
1. For the first problem:
- Support Vector Machines (SVM): linear, RBF and polynomial kernels
- Multi-Layer Perceptron (MLP)
- Linear Perceptron (LP)
2. For the second problem, a two-step method:
- A priori knowledge and unsupervised methods select “candidate” subgroups
- SVM or MLP identify the most correlated subgroups

The data
Data from a specialized DNA microarray, named "Lymphochip", developed at the Stanford University School of Medicine:
- 96 tissue samples from normal and cancerous populations of human lymphocytes
- 4026 different genes preferentially expressed in lymphoid cells or with known roles in processes important in immunology or cancer
High-dimensional data and a small sample size: a challenging machine learning problem.

Types of lymphoma
Three main classes of lymphoma:
- Diffuse Large B-Cell Lymphoma (DLBCL)
- Follicular Lymphoma (FL)
- Chronic Lymphocytic Leukemia (CLL)
Transformed Cell Lines (TCL) and normal lymphoid tissues are also included.
Type of tissue           Number of samples
Normal lymphoid cells    24
DLBCL                    46
FL                        9
CLL                      11
TCL                       6

Visualizing data with Tree View

The first problem: Separating normal from cancerous tissues
Our first task consists in distinguishing cancerous from normal tissues using the overall information available, i.e. all the gene expression data. From a machine learning standpoint it is a dichotomous (two-class) classification problem.
Data characteristics:
- Small sample size
- High dimension
- Missing values
- Noise
Main practical goal: supporting functional-molecular diagnosis of tumors and polygenic diseases.

Supervised approaches to molecular classification of diseases
Several supervised methods have been applied to the analysis of cDNA microarrays and high-density oligonucleotide chips:
- Decision trees
- Fisher linear discriminant
- Multi-Layer Perceptrons
- Nearest-Neighbours classifiers
- Linear discriminant analysis
- Parzen windows
- Support Vector Machines
Proposed by different authors: Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001), Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001), Dudoit et al. (2002).

Why use Support Vector Machines?
“General” motivations
- SVMs are two-class classifiers theoretically founded on Vapnik's Statistical Learning Theory.
- They act as linear classifiers in a high-dimensional feature space obtained by a projection of the original input space. The resulting classifier is in general non-linear in the input space.
- SVMs achieve good generalization performance by maximizing the margin between the classes.
- The SVM learning algorithm has no local minima.
“Specific” motivations
- Kernels are well-suited to working with high-dimensional data.
- Small sample sizes require algorithms with good generalization capabilities.
- Automatic diagnosis of tumors requires high sensitivity and very effective classifiers.
- SVMs can identify mislabeled data (i.e. incorrect diagnoses).
- Specific kernels could be designed to incorporate “a priori” knowledge about the problem.
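For reference, margin maximization is usually written as the soft-margin optimization problem below. This is the standard textbook formulation, not one taken from the slides. The symbols are assumptions introduced here: x_i is the expression profile of sample i, y_i in {-1, +1} its class label, phi the projection into the feature space, xi_i the slack variables, m the number of training samples and C the regularization factor.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i \\
\text{subject to}\quad & y_i\bigl(w\cdot\phi(x_i) + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1,\dots,m
\end{aligned}
```

The geometric margin is 2/||w||, so minimizing ||w||^2 maximizes it. The problem is a convex quadratic program, which is why training has no local minima.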

SVM to classify cancerous and normal cells
We consider 3 standard SVM kernels:
- Gaussian
- Polynomial
- Dot-product
Varying:
- Values of the kernel parameters
- The regularization factor C
Comparing them with:
- MLP
- LP
Varying:
- Number of hidden units
- Backpropagation parameters
Estimation of the generalization error through:
- 10-fold cross-validation
- Leave-one-out
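The three kernels can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the slides do not give the exact parametrization, so the names `degree`, `coef0` and `sigma`, the grid values and the helper `candidate_models` are hypothetical.

```python
import math

def dot(x, z):
    """Plain inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """Dot-product kernel: K(x, z) = x . z"""
    return dot(x, z)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: K(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical search grid: kernel parameters and regularization factor C.
PARAM_GRID = {
    "linear": [{}],
    "polynomial": [{"degree": d} for d in (2, 3)],
    "gaussian": [{"sigma": s} for s in (0.1, 1.0, 10.0)],
}
C_VALUES = (0.1, 1.0, 10.0, 100.0)

def candidate_models():
    """Enumerate every (kernel name, kernel parameters, C) combination."""
    for name, settings in PARAM_GRID.items():
        for params in settings:
            for c in C_VALUES:
                yield name, params, c
```

Each candidate would then be trained and scored by cross-validation; the grid above is a placeholder, not the set of values used in the experiments.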

Results
Learning machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear                1.04             3.16       98.63       100.0
SVM-poly                  4.17             5.46       94.74       100.0
SVM-RBF                  25.00             4.48       75.00       100.0
MLP                       2.08             4.45       98.61
LP                        9.38            10.24       95.65        91.66
- 10-fold cross-validation and leave-one-out give similar estimates of the error.
- SVM-linear achieves the best results.
- Sensitivity is high, no matter what type of kernel function is used.
- The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.
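The reported error, precision and sensitivity follow from the confusion counts of each classifier. The sketch below shows the arithmetic. The counts are not given in the slides: they are inferred here as one set that is consistent with the reported percentages, assuming the 72 non-normal samples form the positive class and the 24 normal samples the negative class.

```python
def summarize(tp, fp, tn, fn):
    """Return error, precision and sensitivity as percentages."""
    total = tp + fp + tn + fn
    return {
        "error": 100.0 * (fp + fn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
    }

# Inferred counts (assumption): 72 tumour samples, 24 normal samples.
svm_linear = summarize(tp=72, fp=1, tn=23, fn=0)  # one normal sample called tumour
lp = summarize(tp=66, fp=3, tn=21, fn=6)          # nine errors in total

for name, stats in (("SVM-linear", svm_linear), ("LP", lp)):
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

Under these assumed counts the values agree with the reported figures to within rounding in the last digit.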

ROC analysis
- The ROC curve of the SVM-linear is ideal.
- The polynomial SVM also achieves a reasonably good ROC curve.
- The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is achieved only when it completely fails to detect normal cells.
- The ROC curve of the MLP is also nearly optimal.
- The linear perceptron shows a worse ROC curve, but with reasonable values lying in the upper-left part of the ROC plane.
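A minimal sketch of how ROC points are obtained from classifier outputs may help to read these curves. The scores and labels below are toy values invented for illustration, not outputs of the classifiers discussed here; `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(scores, labels):
    """Sweep a decision threshold over the scores.

    labels: 1 for tumour (positive), 0 for normal (negative).
    Returns a list of (false positive rate, true positive rate) points.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

toy_labels = [1, 1, 1, 0, 0]

# Scores that separate the classes perfectly: the curve hugs the top-left corner.
ideal = roc_points([0.9, 0.8, 0.7, 0.2, 0.1], toy_labels)

# A classifier that gives every sample the same score: only the diagonal is reachable.
uninformative = roc_points([0.5, 0.5, 0.5, 0.5, 0.5], toy_labels)
```

The second case mirrors the diagonal behaviour described for the SVM-RBF: full sensitivity is reached only at the point where every normal sample is also called positive.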

Summary of the results on the first problem
- Using hierarchical clustering, 14.6% of the examples are misclassified (Alizadeh, 2000), against 1.04% for the linear SVM, 2.08% for the MLP and 9.38% for the LP.
- Supervised methods exploit a priori biological knowledge (i.e. labeled data), while clustering methods use only gene expression data to group together different tissues, without any labeled data.
- The linear SVM achieves the best results, but the MLP and the 2nd degree polynomial SVM also show a relatively low generalization error.
- The linear SVM and the MLP can be used to build classifiers with high sensitivity and a low rate of false positives.
- These results must be considered with caution, because the available data set is too small to support general statements about the performance of the proposed learning machines.

The second problem: Identifying DLBCL subgroups
It starts from a hypothesis of Alizadeh et al. about the existence of two distinct functional types of lymphoma inside DLBCL. We actually consider two problems:
1. Validation of Alizadeh’s hypothesis. They identified two molecularly distinct subgroups of DLBCL: germinal centre B-like (GCB-like) and activated B-like (AB-like). These two classes correspond to patients with very different prognoses.
2. Finding the groups of genes most related to this separation. Different subsets of genes could be responsible for the distinction between these two DLBCL subgroups: the expression signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).

A feature selection approach based on “a priori” knowledge
- Finding the most correlated genes requires searching an exponential number of gene combinations (2^n - 1), where n is usually of the order of thousands.
- We need greedy algorithms and heuristic methods.
- Can we exploit “a priori” biological knowledge about the problem?
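To make the size of the search space concrete, the short sketch below counts the non-empty subsets of n genes and enumerates them for a toy case. The gene names are placeholders, and n = 4026 is the number of genes on the Lymphochip quoted earlier.

```python
from itertools import combinations

def count_nonempty_subsets(n):
    """Number of non-empty subsets of n genes: 2^n - 1."""
    return 2 ** n - 1

def enumerate_nonempty_subsets(items):
    """Yield every non-empty subset; feasible only for very small inputs."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

toy_genes = ["g1", "g2", "g3"]  # placeholder names
toy_subsets = list(enumerate_nonempty_subsets(toy_genes))

# With 4026 genes the count is a number with more than a thousand decimal digits.
lymphochip_digits = len(str(count_nonempty_subsets(4026)))
```

Exhaustive search is therefore out of the question, which motivates restricting the candidates to a handful of predefined gene groups.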

A heuristic method (1)
A two-stage approach:
I. Select groups of coordinately expressed genes.
II. Identify among them the ones most correlated to the disease.
We do not consider single genes: we consider only groups of coordinately expressed genes.

A heuristic method (2)
I. Selecting groups of coordinately expressed genes:
- Use “a priori” biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic processes
and/or
- Use unsupervised methods such as clustering algorithms to identify coordinately expressed sets of genes
II. Identifying the subgroups of genes most related to the disease:
1. Train a set of classifiers using only the subgroups of genes selected in the first stage.
2. Evaluate and rank the performance of the trained classifiers.
3. Select the subgroups whose classifiers achieve the best ranking.
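The second stage can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM, MLP and LP, and the data are a toy matrix invented for illustration; the group names and the helpers `loo_error` and `rank_gene_groups` are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_error(samples, labels):
    """Leave-one-out error rate of a nearest-centroid classifier."""
    errors = 0
    for i, held_out in enumerate(samples):
        train = [(s, y) for j, (s, y) in enumerate(zip(samples, labels)) if j != i]
        centroids = {
            cls: centroid([s for s, y in train if y == cls])
            for cls in {y for _, y in train}
        }
        predicted = min(centroids, key=lambda c: sq_dist(held_out, centroids[c]))
        errors += predicted != labels[i]
    return errors / len(samples)

def rank_gene_groups(data, labels, groups):
    """Train on each gene group alone and rank the groups by estimated error."""
    scored = []
    for name, gene_idx in groups.items():
        subset = [[row[g] for g in gene_idx] for row in data]
        scored.append((loo_error(subset, labels), name))
    return sorted(scored)

# Toy data: 6 samples x 4 genes. Genes 0-1 separate the classes, genes 2-3 do not.
data = [
    [2.0, 1.8, 0.1, -0.3],
    [1.9, 2.2, -0.2, 0.4],
    [2.1, 2.0, 0.3, 0.1],
    [-2.0, -1.9, 0.2, -0.1],
    [-1.8, -2.1, -0.1, 0.3],
    [-2.2, -2.0, 0.0, -0.4],
]
labels = ["class_a"] * 3 + ["class_b"] * 3
groups = {"informative": [0, 1], "noise": [2, 3]}

ranking = rank_gene_groups(data, labels, groups)
```

The group at the top of `ranking` is the one whose classifier has the lowest estimated error, which is the selection criterion of step 3.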

Applying the heuristic method
1. Selecting “candidate” subgroups of genes. We used biological knowledge and hierarchical clustering algorithms to select four subgroups:
- Proliferation: sets of genes involved in the biological process of proliferation
- T-cell: genes preferentially expressed in T-cells
- Lymphnode: sets of genes normally expressed in lymph nodes
- GCB: genes that distinguish germinal centre B-cells from other stages in B-cell ontogeny
2. Identifying the subgroups of genes most related to the GCB-like / AB-like separation:
- Training of SVM, MLP and LP classifiers using each subgroup of genes and all the subgroups together (All): 5 classification tasks
- Leave-one-out with gaussian, polynomial and linear SVM
- 10-fold cross-validation with gaussian, polynomial and linear SVM, MLP and LP
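Both error estimation schemes are the same procedure with a different number of folds. The sketch below generates the train/test index splits, assuming the 46 DLBCL samples listed earlier are the ones being classified; the function name `k_fold_splits` and the fixed random seed are choices made for this example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    With k == n_samples this reduces to leave-one-out.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        yield train, sorted(fold)

N_DLBCL = 46  # assumption: number of DLBCL samples from the data description

ten_fold = list(k_fold_splits(N_DLBCL, 10))
leave_one_out = list(k_fold_splits(N_DLBCL, N_DLBCL))
```

Each sample is held out exactly once in either scheme, so the per-fold errors can be averaged into a single estimate with its standard deviation.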

GCB signature
Learn. machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear             10.50            11.16       90.00
SVM-poly                8.70            14.54       96.67       88.33
SVM-RBF                 4.50             9.55      100.0        90.00
MLP                     8.70            10.50       90.90
LP                      8.70            10.50       90.90
All signatures
Learn. machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear             15.00            11.16       85.00
SVM-poly               14.00            18.97       93.33       76.67
SVM-RBF                10.00            10.54      100.00       76.67
MLP                     8.70            13.28       95.00       86.36
LP                     10.87            14.28       86.96       90.90

Results

The second problem: summary
- The results support the hypothesis of Alizadeh et al. about the existence of two distinct subgroups in DLBCL.
- The heuristic method identifies the GCB signature as a cluster of coordinately expressed genes related to the separation between the GCB-like and AB-like DLBCL subgroups.

Developments
I. Methods to discover subclasses of tumors on a molecular basis
- Integrating “a priori” biological knowledge, supervised machine learning methods and unsupervised clustering methods
- Stratifying patients into molecularly relevant categories, enhancing the discrimination power and precision of clinical trials
- New perspectives on the development of new cancer therapeutics based on a molecular understanding of the cancer phenotype
II. Methods to identify small subsets of genes correlated to tumors
- Refinements of the proposed heuristic method using clustering algorithms with semi-automatic selection of the number of significant subgroups of genes
- Greedy algorithms based on mutual information measures
Enhancing biological knowledge about tumoral processes
Automatic diagnosis of tumors using DNA microchips
Discovery of new subclasses of tumors
