1 Mining Phenotype Structures. Chun Tang and Aidong Zhang, University at Buffalo, The State University of New York. Bioinformatics Journal, 20(6):829-838, 2004.

2 Microarray Data Analysis. Analysis from two angles: sample as object, gene as attribute; gene as object, sample/condition as attribute.

3 Supervised Analysis. Select training samples (hold-out); sort genes (t-test, ranking); select informative genes (top 50-200); cluster based on the informative genes. (Figure: an idealized expression matrix for genes g1..g4132, with blocks of "high" (1) and "low" (0) values distinguishing Class 1 from Class 2.)
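To make the supervised pipeline above concrete, here is a minimal Python sketch (not taken from the paper) of ranking genes by a two-sample t-test between the two known classes and keeping the top-ranked ones as informative genes. The function and parameter names (select_informative_genes, expr, labels, top_k) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

def select_informative_genes(expr, labels, top_k=50):
    """expr: genes x samples matrix; labels: 0/1 class label per sample."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    class0 = expr[:, labels == 0]
    class1 = expr[:, labels == 1]
    # Two-sample t statistic per gene (row); larger |t| = more discriminative.
    t_stat, _ = ttest_ind(class0, class1, axis=1)
    ranked = np.argsort(-np.abs(t_stat))
    return ranked[:top_k]   # indices of the top_k candidate informative genes
```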

4 Unsupervised Analysis. We focus on unsupervised sample partitioning, which assumes that no phenotype information is assigned to any sample. Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis. Many mature statistical methods cannot be applied unless the phenotypes of the samples are known in advance.

5 Unsupervised Analysis: Automatic Phenotype Structure Mining. An informative gene is a gene that manifests the samples' phenotype distinction. Phenotype structure: sample partition + informative genes. (Figure: a gene-by-sample matrix in which the informative genes separate the samples into groups while the non-informative genes show no pattern.)

6 Automatic Phenotype Structure Mining. Given an n × m gene expression matrix M and the number K of sample phenotypes, the goal is to find K mutually exclusive groups of samples matching their empirical phenotypes, and to find the set of informative genes that manifests this phenotype distinction. (Figure: gene expression matrix, mining step, and the resulting informative genes and phenotype distinction over the samples.)

7 Requirements. The expression levels of each informative gene should be similar across the samples within each phenotype. The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes.

8 Challenges (1). The number of genes is very large while the number of samples is very limited, so no distinct class structure of the samples can be properly detected by existing techniques.

9 Challenges (2). The few informative genes are buried in a large amount of noise. (Figure: among genes 1-15, only genes 5, 9 and 12 are informative.)

10 Challenges (3). The values within the data matrices are all real numbers, and none of the informative genes follows an ideal "high-low" pattern. (Figure: expression profiles of example genes such as PROTEASOME IOTA X59417, C-myb U22376, Fumarylacetoacetate M55150, and LTC4 synthase U50136.)

11 Related Work. New tools using traditional methods (TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR) apply SOM, K-means, hierarchical clustering, graph-based clustering, or PCA. The similarity measures used in these methods are based on the full gene space, and principal components do not necessarily have a strong correlation with the informative genes.

12 Related Work (cont'd). Clustering with feature selection (CLIFF, two-way ordering, SamCluster): (1) filter the invariant genes (rank, variance, PCA, CV); (2) partition the samples (Ncut, Min-Max Cut, hierarchical clustering); (3) prune genes based on the partition (Markov blanket filter, t-test).

13 Related Work (cont'd). Subspace clustering: bi-clustering, δ-clustering.

14 Related Work (cont'd). Subspace clustering only measures trend similarity, whereas our model requires each gene to show consistent signals on the samples of the same phenotype.

15 Related Work (cont'd). Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters; we want the genes that can differentiate all phenotypes.

16 Our Contributions. We recast the phenotype structure mining problem as an optimization problem. A series of statistics-based metrics are defined as objective functions, and a heuristic search method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.

17 Model: Measurements. Three quantities are defined on a candidate gene set G' and sample groups S1, S2, ...: intra-consistency, inter-divergence, and phenotype quality. (Figure: illustration on a small gene-by-sample matrix.)

18 Intra-consistency Measurement. Comparison of measures on two example data sets, Data(A) (NOT consistent) and Data(B) (consistent): residue = 0.1975 vs. 0.4506; MSR = 0.0494 vs. 0.4012; our measure = 339.0667 vs. 5.3000.

19 Intra-pattern-consistency (cont'd). Variance of a single gene over the samples within one phenotype; intra-pattern-consistency is the average row variance. In a subset of genes (the candidate informative genes), does every gene have good consistency on a set of samples? It is the average of the variances over the subset of genes; the smaller the intra-phenotype consistency, the better. (See the sketch below.)
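A hedged sketch of the intra-pattern-consistency described on this slide: average, over the candidate informative genes and the phenotype groups, of the variance of each gene within each group (smaller is better). The function and argument names are assumptions, not the authors' code.

```python
import numpy as np

def intra_consistency(expr, partition, genes):
    """expr: genes x samples array; partition: list of sample-index lists; genes: gene indices."""
    variances = []
    for g in genes:
        for group in partition:
            # variance of one gene on the samples within one phenotype group
            variances.append(np.var(expr[g, group]))
    return float(np.mean(variances))   # average row variance; smaller = more consistent
```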

20 Inter-pattern-divergence. Measured by the average block distance, which reflects both the intra-pattern-consistency and the inter-pattern-divergence of the same gene. How well can a subset of genes (the candidate informative genes) discriminate two phenotypes of samples? It is the sum of the average differences between the phenotypes; the larger the inter-phenotype divergence, the better. (See the sketch below.)
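A hedged sketch of the inter-pattern-divergence: for the candidate informative genes, sum over all phenotype pairs the average difference of the group means (larger is better). The exact block-distance formula appears only as an image on the slide, so the names and the averaging below are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def inter_divergence(expr, partition, genes):
    """expr: genes x samples array; partition: list of sample-index lists; genes: gene indices."""
    # mean expression of each candidate gene in each phenotype group
    means = np.array([[expr[g, group].mean() for group in partition] for g in genes])
    # sum over phenotype pairs of the average (over genes) absolute mean difference
    diffs = [np.abs(means[:, i] - means[:, j]).mean()
             for i, j in combinations(range(len(partition)), 2)]
    return float(np.sum(diffs))   # larger = better separation of the phenotypes
```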

21 Pattern Quality. The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large. The higher the quality value, the better.

22 Measurements: the formal definitions of intra-consistency, inter-divergence, and phenotype quality (the equations survive only as slide images).
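Because the formulas on this slide are preserved only as images, the sketch below combines the two measures above into a quality score in the spirit of the slides: high divergence and low consistency give high quality. The simple ratio used here is an illustrative assumption, not the published definition, and it reuses the intra_consistency and inter_divergence sketches above.

```python
def phenotype_quality(expr, partition, genes, eps=1e-9):
    # Reuses intra_consistency and inter_divergence from the sketches above.
    con = intra_consistency(expr, partition, genes)   # smaller is better
    div = inter_divergence(expr, partition, genes)    # larger is better
    # Illustrative ratio only: not the paper's exact formula.
    return div / (con + eps)                          # larger = better quality
```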

23 Phenotype Quality. Example values for three candidate structures: Data(A): Con = 4.25, Div = 41.60, quality = 14.2687; Data(B): Con = 3.44, Div = 25.20, quality = 9.6074; Data(C): Con = 4.52, Div = 46.16, quality = 15.3526 (highest phenotype quality).

24 Model: Formalized Problem. Input: m samples and n genes, the corresponding gene expression matrix M, and the number of phenotypes K. Output: a K-partition of the samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality is maximized.

25 Strategy. Maintain a candidate phenotype structure and iteratively adjust it toward the optimal solution. Basic elements: a candidate structure (a partition of the samples {S1, S2, ..., SK}, a subset of genes G' ⊆ G, and the corresponding phenotype quality); an adjustment (insert a gene not in G' into G', remove a gene from G', or move a sample from its group to another group); and the quality gain, which measures the change of phenotype quality before and after the adjustment. (A sketch of these elements follows.)
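A minimal sketch (an assumption, not the authors' code) of the candidate structure maintained during the search and of the quality gain of one adjustment, reusing the phenotype_quality sketch above.

```python
from dataclasses import dataclass

@dataclass
class CandidateStructure:
    partition: list        # K lists of sample indices {S1, ..., SK}
    genes: set             # candidate informative gene indices, G' subset of G
    quality: float = 0.0   # current phenotype quality of this structure

def quality_gain(expr, structure, adjusted_partition, adjusted_genes):
    # Change of phenotype quality before and after one adjustment
    # (gene insert/remove or sample move); positive gain = improvement.
    new_quality = phenotype_quality(expr, adjusted_partition, sorted(adjusted_genes))
    return new_quality - structure.quality
```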

26 Heuristic Searching (flow chart): candidate structure generation, then iterative adjusting: pick an object (gene or sample), evaluate the adjustment, and apply it to the intermediate candidate structure if the quality gain ΔΩ > 0.

27 Heuristic Searching. Start with a random K-partition of the samples and a subset of genes as the candidate informative space. Iteratively adjust the partition and the gene set toward a better solution, visiting genes and samples in random order: for each gene, try a possible insert/remove; for each sample, try the best movement. (Figure: insert a gene, remove a gene, move a sample.)

28 Heuristic Search. For each possible adjustment, compute the quality gain ΔΩ: for each gene, try a possible insert/remove; for each sample, try the best movement. If ΔΩ > 0, conduct the adjustment; if ΔΩ < 0, conduct the adjustment with a probability governed by T(i), a decreasing simulated-annealing function of the iteration number i, with T(0) = 1 and T(i) = 1/(i+1) in our implementation. (See the acceptance sketch below.)
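A hedged sketch of the acceptance rule on this slide: improving adjustments are always conducted, and worsening ones are conducted with a probability that shrinks as T(i) = 1/(i+1) decays. The exponential form of that probability is a common simulated-annealing choice assumed here; the transcript does not preserve the exact expression the authors used.

```python
import math
import random

def accept_adjustment(gain, iteration):
    T = 1.0 / (iteration + 1)      # T(0) = 1, T(i) = 1/(i+1)
    if gain > 0:
        return True                # always conduct improving adjustments
    # Assumed exponential acceptance of worsening adjustments (gain <= 0).
    return random.random() < math.exp(gain / T)
```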

29 Mutual Reinforcing Adjustment: Motivation. Drawbacks of the heuristic search method: blind initialization, equal treatment of samples and genes, and noisy samples. The phenotype quality of a subset of the informative genes on a partial phenotype should also be high. Because mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, we start from small groups whose data distribution and patterns are much easier to detect. The mining of phenotypes and of informative genes should mutually reinforce each other.

30 Mutual Reinforcing Adjustment: Motivation (illustration, panels A, B and C).

31 Mutual Reinforcing Adjustment: Major Steps. (1) Partition the matrix: divide the original matrix into a series of exclusive sub-matrices by partitioning both the samples and the genes. (2) Reference partition detection: post a partial or approximate phenotype structure, called a reference partition of the samples (compute the reference degree of each sample group, select K groups of samples, and do partition adjustment). (3) Gene adjustment: adjust the candidate informative genes (compute the phenotype quality for the reference partition on the gene set, and perform the possible adjustment of each gene). (4) Refinement phase.

32 Method Detail: Iteration Phase (flow diagram). Starting from the informative genes G' and all samples: partition the matrix, detect the reference partition, adjust the genes to obtain G'', and pass the reference partition and G'' to the next iteration.

33 Partitioning the Matrix. Partition the samples and the genes into multiple groups using CAST, with a threshold t deciding the size of each group, based on Pearson's correlation coefficient. Outliers are filtered out of every group; samples or genes in the same group share similar patterns. (See the affinity sketch below.)
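CAST itself is more involved than can be shown here; the sketch below only illustrates the affinity it is driven by in this step, Pearson's correlation coefficient, with a threshold t deciding whether two rows are similar enough to share a group. Treat it as an illustration of the measure, not an implementation of CAST; the names and the default threshold are assumptions.

```python
import numpy as np

def pearson_affinity(x, y):
    # Pearson's correlation coefficient between two expression profiles.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def same_group(x, y, t=0.8):
    # t plays the role of the CAST affinity threshold that controls group size;
    # the value 0.8 is an arbitrary placeholder.
    return pearson_affinity(x, y) >= t
```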

34 Reference Partition Detection. Select groups of samples as potential phenotypes: pick first the group with the highest reference degree, then select the other groups by considering the inter-phenotype divergence with respect to the already selected groups.

35 Check the Missing Samples. Probabilistically insert the remaining samples (those not in the selected groups) into the most probable matching group. Across iterations, use the candidate gene sets to improve the reference partition.

36 Gene Adjustment. Test the possible adjustments that lead to improvement: insert a gene or remove a gene.

37 Method: Refinement Phase. The partition corresponding to the best state may not cover all the samples. Add every sample not covered by the reference partition into its matching group, yielding the phenotypes of the samples. Then a gene adjustment phase is conducted: all adjustments with a positive quality gain are executed, yielding the informative space. Time complexity: O(n * m^2 * I), where I is the number of iterations.

38 Mining Multiple Phenotype Structures. Output: p phenotype structures, where the t-th structure is a K_t-partition of the samples (phenotypes) and a subset of genes (informative space) that manifests the sample partition; the overall phenotype quality is maximized. (Figure: an empirical phenotype structure and a hidden phenotype structure over the same samples.)

39 Extended Algorithm Strategy. Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution. Basic elements of each candidate structure: the structure itself (a K_t-partition of the samples, a subset of genes G_t ⊆ G, and the corresponding phenotype quality Φ_t); an adjustment (for a gene g_i not in G_t, insert it into G_t; for a gene g_i in G_t, move it to another structure G_t' (t ≠ t') or remove it from all structures; for a sample s_i in group S', move it to another group); and the quality gain, which measures the change of pattern quality of the states after the adjustment.

40 The Extended Algorithm (cont'd). (Figure: gene adjustments (insert, move, remove) and sample moves across candidate structures 1 and 2.)

41 Mining Multiple Phenotype Structures (cont'd): partially informative genes.

42 Formalized Problem. Input: m samples and n genes, the corresponding gene expression matrix M, the number of phenotype structures p, and the set of numbers {K_1, K_2, ..., K_p}. Output: p phenotype structures, where the t-th structure is a K_t-partition of the samples (phenotypes) and a subset of genes (informative space) that manifests the sample partition; the overall phenotype quality is maximized.

43 The Algorithm. Candidate structure generation: cluster the genes into p' groups (p' > p) using CAST; generate sample partitions one by one on the gene clusters and select the genes with the best quality. Iterative adjustment: for each gene, try a possible insert/move/remove; for each sample, examine all possible adjustments and select the best movement.

44 The Algorithm (cont'd). Gene: p possible adjustments (insert, move, remove). Sample: K_t - 1 possible adjustments for each partition.

45 The Algorithm (cont'd). Data standardization: the original gene intensity values are transformed into relative values (formula shown on the slide). Genes and samples are visited in random order, and negative actions are conducted with a probability (simulated annealing technique).

46 Experiments. Data sets: multiple-sclerosis data, MS-IFN: 4132 * 28 (14 MS vs. 14 IFN) and MS-CON: 4132 * 30 (15 MS vs. 15 control); leukemia data: 7129 * 38 (27 ALL vs. 11 AML) and 7129 * 34 (20 ALL vs. 14 AML); colon cancer data: 2000 * 62 (22 normal vs. 40 tumor colon tissues); hereditary breast cancer data: 3226 * 22 (7 BRCA1, 8 BRCA2, 7 sporadic).

47 Rand Index. A measurement of the agreement between the ground truth (P) and the result (Q): "a" is the number of pairs of objects that are in the same class in P and in the same class in Q; "b" is the number of pairs that are in the same class in P but not in the same class in Q; "c" is the number of pairs that are in the same class in Q but not in the same class in P; "d" is the number of pairs that are in different classes in both P and Q. (Figure: example pairs of samples s1, s2 under P and Q. See the sketch below.)
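A straightforward sketch of the Rand Index defined on this slide, counting sample pairs that the ground truth P and the result Q treat consistently and returning (a + d) / (a + b + c + d). Variable names are illustrative.

```python
from itertools import combinations

def rand_index(p_labels, q_labels):
    """p_labels, q_labels: cluster label per sample for P (truth) and Q (result)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p_labels)), 2):
        same_p = p_labels[i] == p_labels[j]
        same_q = q_labels[i] == q_labels[j]
        if same_p and same_q:
            a += 1   # same class in both P and Q
        elif same_p:
            b += 1   # same class in P, split in Q
        elif same_q:
            c += 1   # same class in Q, split in P
        else:
            d += 1   # different classes in both P and Q
    return (a + d) / (a + b + c + d)
```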

48 Phenotype Structure Detection. Rand index of each method on each data set.
Data set (size): MS-IFN (4132*28), MS-CON (4132*30), Leukemia-G1 (7129*38), Leukemia-G2 (7129*34), Colon (2000*62), Breast (3226*22).
J-Express: 0.4815, 0.4851, 0.5092, 0.4965, 0.4939, 0.4112
CLUTO: 0.4815, 0.4828, 0.5775, 0.4866, 0.4966, 0.6364
CIT: 0.4841, 0.4851, 0.6586, 0.4920, 0.4966, 0.5844
CNIO: 0.4815, 0.4920, 0.6017, 0.4920, 0.4939, 0.4112
CLUSFAVOR: 0.5238, 0.5402, 0.5092, 0.4920, 0.4939, 0.5844
δ-cluster: 0.4894, 0.4851, 0.5007, 0.4538, 0.4796, 0.4719
Heuristic: 0.8052, 0.6230, 0.9761, 0.7086, 0.6293, 0.8638
Mutual: 0.8387, 0.6513, 0.9778, 0.7558, 0.6827, 0.8749

49 Experiments. The mean value and standard deviation of the number of iterations and of the running time (in seconds) with respect to the matrix size.
Data size: iterations mean (std); running time mean (std).
4132*28: 158 (27.2); 180 (35.1)
4132*30: 168 (29.5); 195 (37.8)
7129*38: 171 (16.1); 436 (51.9)
7129*34: 198 (35.9); 458 (101.2)
2000*62: 133 (17.8); 479 (98.5)
3226*22: 157 (22.2); 167 (35.6)

50 Experimental Results (5): Phenotype Structure Detection (cont'd). The mutual reinforcing approach applied to the MS-IFN group. (A) shows the distribution of the original 28 samples; each point represents a sample with 4132 genes mapped to a two-dimensional space. (B) shows the distribution in the middle of the adjustment. (C) shows the distribution of the same 28 samples after the iterations; 76 genes were selected as the informative space.

51 Experimental Results (5): Informative Gene Selection.

52 Phenotype Structures.

53 Experimental Results (5): Informative Gene Selection (cont'd).

54 Experimental Results (5): Scalability Evaluation.

55 Conclusion from the Experiments. The work is motivated by the needs of emerging microarray data analysis. The strategy is designed for data with the following properties: the number of samples is limited but the gene dimension is very large; large volumes of irrelevant and redundant genes prevent accurate grouping of the samples; and analyzing objects along one dimension can enhance the detection of meaningful patterns along the other dimension.

