Microarray Data Set The microarray data set we are dealing with is represented as a 2d numerical array.

1 Microarray Data Set
The microarray data set we are dealing with is represented as a 2-D numerical array.
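A minimal sketch of this representation, assuming NumPy and the common convention of genes as rows and samples as columns (the slides do not state the orientation, and the values below are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: 5 genes (rows) x 4 samples (columns).
expression = np.array([
    [2.1, 1.9, 0.3, 0.2],
    [0.5, 0.4, 0.6, 0.5],
    [1.0, 1.2, 3.1, 2.9],
    [0.1, 0.0, 0.2, 0.1],
    [4.0, 3.8, 1.1, 0.9],
])

n_genes, n_samples = expression.shape

# One row is the expression profile of a gene across all samples.
gene_profile = expression[2, :]
# One column is the expression profile of a sample across all genes.
sample_profile = expression[:, 0]
```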

2 Characteristics of Microarray Data
High dimensionality of gene space, low dimensionality of sample space.
  Thousands to tens of thousands of genes, tens to hundreds of samples.
Correlation among features (genes).
  Genes collaborate to function. Gene correlation characterizes how the system works.
A plethora of domain knowledge.
  A large body of knowledge has accumulated about the genes in question.

3 Microarray Data Analysis
Analysis from two angles:
  sample as object, gene as attribute
  gene as object, sample/condition as attribute

Here we map the samples and the genes into 2-dimensional space. The genes show some dense areas; if we remove the outliers and zoom in on a dense area, we again find finer dense areas and some outliers. The gene distribution therefore has a hierarchical dense structure. The samples, by contrast, are very sparse in the high-dimensional space. Even when they are mapped into 2-dimensional space, no class structure can be detected. We can partition the samples with many hyperplanes, but cannot judge which partition is better. So the techniques that are effective for gene-based analysis are not adequate for analyzing samples. Effective and efficient sample-based analysis remains a challenging problem.
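The two views can be sketched as the same matrix and its transpose, each projected into 2-dimensional space. The slides do not say which mapping was used; the sketch below assumes a PCA projection computed with NumPy's SVD, and `project_2d` and the random data are hypothetical.

```python
import numpy as np

def project_2d(objects):
    """Project the rows of `objects` onto their first two principal components."""
    centered = objects - objects.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 12))  # hypothetical: 200 genes x 12 samples

# Gene as object, sample as attribute: 200 points in 12-dimensional space.
genes_2d = project_2d(expression)
# Sample as object, gene as attribute: 12 points in 200-dimensional space.
samples_2d = project_2d(expression.T)
```

The sample view has far fewer points than dimensions, which is one way to see why the samples look sparse.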

4 Supervised Analysis
Select training samples (hold out…)
Sort genes (t-test, ranking…)
Select informative genes (top 50 ~ 200)
Cluster based on informative genes

[Figure: gene expression matrix (g1, g2, ..., g4131, g4132) with samples labeled Class 1 and Class 2]

The existing methods of selecting informative genes to cluster samples fall into two major categories: supervised analysis and unsupervised analysis. The supervised approach assumes that additional information is attached to some (or all) of the data, for example, that biological samples are labeled as diseased vs. normal. The best-known supervised method is neighborhood analysis, published in Science in 1999, which stimulated research on sample phenotype detection. Other supervised methods include tree harvesting, support vector machines, decision trees, genetic algorithms, artificial neural networks, and a variety of ranking-based methods. These supervised methods share the same basic steps: first, select a subset of samples as the training set; then, using the phenotypes as a reference, select a small percentage of informative genes that manifest the phenotype partition within the training samples; finally, group the whole set of samples according to the selected informative genes.
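The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the method of any of the cited papers: the data are synthetic, the ranking uses a Welch t-statistic, and the final grouping uses a nearest-class-centroid rule as a simple stand-in for the clustering step. All names (`t_scores`, `n_informative`, and so on) are hypothetical.

```python
import numpy as np

def t_scores(X, y):
    """Welch t-statistic per gene. X: genes x samples, y: 0/1 class labels."""
    a, b = X[:, y == 0], X[:, y == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

# Synthetic data: 500 genes x 40 samples; the first 20 genes differ by class.
rng = np.random.default_rng(0)
n_genes, n_samples, n_informative = 500, 40, 20
labels = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_genes, n_samples))
X[:n_informative, labels == 1] += 3.0

# Step 1: select training samples (half of each class), hold out the rest.
train = np.zeros(n_samples, dtype=bool)
train[:10] = True
train[20:30] = True

# Step 2: sort genes by |t| computed on the training samples only.
scores = np.abs(t_scores(X[:, train], labels[train]))
# Step 3: select the top-ranked genes as informative genes.
top = np.argsort(scores)[::-1][:n_informative]

# Step 4: group all samples using only the informative genes
# (nearest class centroid, an assumed stand-in for the clustering step).
centroids = np.stack([
    X[top][:, train & (labels == c)].mean(axis=1) for c in (0, 1)
])
dists = ((X[top].T[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
predicted = dists.argmin(axis=1)
```

Comparing `predicted[~train]` with `labels[~train]` gives an estimate of how well the selected genes separate the held-out samples.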

5 Phenotype Structure Mining
[Figure: samples versus genes (gene1, gene6, gene7, gene2, gene4, gene5, gene3), with the genes divided into informative genes and non-informative genes]

An informative gene is a gene that manifests the samples' phenotype distinction.
Phenotype structure: sample partition + informative genes.

6 Existing Feature Selection and Extraction Algorithms
The characteristics of microarray data sets make feature selection a critical process: too many features, too few samples.
Existing feature selection/extraction algorithms include:
  Single-gene-based discriminative scores, such as the t-test score, S2N, etc.
  Redundancy-removal-based FSS algorithms.
  General feature selection algorithms (Relief family, floating selection, etc.).
  General feature extraction algorithms: PCA, SVD, FLD, etc.
We have not seen feature extraction algorithms specific to microarray data.
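As a rough sketch of the first two families, the code below scores each gene with S2N, assuming the commonly used signal-to-noise definition (difference of class means divided by the sum of class standard deviations), and then applies a greedy correlation filter. The filter is a hypothetical stand-in for redundancy removal, not a specific published FSS algorithm, and the toy values are made up.

```python
import numpy as np

def s2n(X, y):
    """Signal-to-noise score per gene: (mean0 - mean1) / (std0 + std1)."""
    a, b = X[:, y == 0], X[:, y == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

def drop_redundant(X, ranked, max_corr=0.9):
    """Greedy filter: keep a gene only if its absolute correlation with
    every gene kept so far does not exceed max_corr."""
    corr = np.abs(np.corrcoef(X))
    kept = []
    for g in ranked:
        if all(corr[g, k] <= max_corr for k in kept):
            kept.append(int(g))
    return kept

# Toy data: 3 genes x 6 samples, first three samples in class 0.
X = np.array([
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # gene 0: separates the classes
    [1.0, 2.0, 3.0, 7.0, 8.0, 10.0],  # gene 1: nearly a copy of gene 0
    [5.0, 1.0, 4.0, 4.0, 1.0, 5.0],   # gene 2: same mean in both classes
])
y = np.array([0, 0, 0, 1, 1, 1])

scores = s2n(X, y)
ranked = np.argsort(-np.abs(scores), kind="stable")  # best genes first
kept = drop_redundant(X, ranked)
```

Here gene 1 ranks second but is dropped because it is almost perfectly correlated with gene 0, which illustrates why single-gene scores alone can select redundant features.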

