Mining Phenotypes and Informative Genes from Gene Expression Data Chun Tang, Aidong Zhang and Jian Pei Department of Computer Science and Engineering State.

Slides:



Advertisements
Similar presentations
1/38 Jochen Jäger University of Washington Department of Computer Science Advisors: Larry Ruzzo Rimli Sengupta Improved gene selection in microarrays by.
Advertisements

Data Mining Feature Selection. Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same.
An Association Analysis Approach to Biclustering website:
Instance-based Classification Examine the training samples each time a new query instance is given. The relationship between the new query instance and.
Minimum Redundancy and Maximum Relevance Feature Selection
The Broad Institute of MIT and Harvard Clustering.
Visual Analytics Research at WPI Dr. Matthew Ward and Dr. Elke Rundensteiner Computer Science Department.
University at BuffaloThe State University of New York Interactive Exploration of Coherent Patterns in Time-series Gene Expression Data Daxin Jiang Jian.
Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan
DNA Microarray Bioinformatics - #27611 Program Normalization exercise (from last week) Dimension reduction theory (PCA/Clustering) Dimension reduction.
27803::Systems Biology1CBS, Department of Systems Biology Schedule for the Afternoon 13:00 – 13:30ChIP-chip lecture 13:30 – 14:30Exercise 14:30 – 14:45Break.
Interactive Visualization of the Stock Market Graph Presented by Camilo Rostoker Department of Computer Science University of British.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
GEPAS -Gene Expression Pattern Analysis Suite Hongli Li Computer Science Department UMASS Lowell
The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.
Clustering (Part II) 11/26/07. Spectral Clustering.
Dimension Reduction and Feature Selection Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Agilent: The Company, The Myth, The Lengend. Agilent: Agilent Technologies Inc. (NYSE: A) is a world-wide, diverse technology company focused on expansion.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Who am I and what am I doing here? Allan Tucker A brief introduction to my research
Introduction to Bioinformatics - Tutorial no. 12
University at BuffaloThe State University of New York Mining Phenotype Structures Chun Tang and Aidong Zhang Bioinformatics Journal, 20(6): , 2004.
Subspace Differential Coexpression Analysis for the Discovery of Disease-related Dysregulations Gang Fang, Rui Kuang, Gaurav Pandey, Michael Steinbach,
Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.
33 rd International Conference on Very Large Data Bases, Sep. 2007, Vienna Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip.
CSCE822 Data Mining and Warehousing
PRESENTATION - people ISG (Intelligent Systems Group) Researching Group Donostia- San Sebastián Computer Science Faculty - University.
Presented by Liu Qi An introduction to Bioinformatics Algorithms Qi Liu
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
COMMUNITIES IN MULTI-MODE NETWORKS 1. Heterogeneous Network Heterogeneous kinds of objects in social media – YouTube Users, tags, videos, ads – Del.icio.us.
Cancer classification by Regularized Least Square Classifiers Annarita D’Addabbo a, Rosalia Maglietta a, Sabino Liuni b, Graziano Pesole b,c and Nicola.
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek.
Whole Genome Expression Analysis
University at BuffaloThe State University of New York Bioinformatics : Gene Expression Data Analysis Aidong Zhang Professor Computer Science and Engineering.
Exagen Diagnostics, Inc., all rights reserved Biomarker Discovery in Genomic Data with Partial Clinical Annotation Cole Harris, Noushin Ghaffari.
ALIGNMENT OF 3D ARTICULATE SHAPES. Articulated registration Input: Two or more 3d point clouds (possibly with connectivity information) of an articulated.
CSCE555 Bioinformatics Lecture 16 Identifying Differentially Expressed Genes from microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun.
The Broad Institute of MIT and Harvard Classification / Prediction.
A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside.
1 Effective Feature Selection Framework for Cluster Analysis of Microarray Data Gouchol Pok Computer Science Dept. Yanbian University China Keun Ho Ryu.
CS491JH: Data Mining in Bioinformatics Introduction to Microarray Technology Technology Background Data Processing Procedure Characteristics of Data Data.
Class Prediction and Discovery Using Gene Expression Data Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, Eric S. Lander 발표자 : 이인희.
Functional Genomics - clustering - classification - promoter analysis - expander tool - example - biclustering.
Clustering by Pattern Similarity in Large Data Sets Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu IBM T. J. Watson Research Center Presented by Edmond.
CSE 5331/7331 F'071 CSE 5331/7331 Fall 2007 Image Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University.
Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.
CSE 8331 Spring CSE 8331 Spring 2010 Image Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University.
EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science
Data Mining and Decision Trees 1.Data Mining and Biological Information 2.Data Mining and Machine Learning Techniques 3.Decision trees and C5 4.Applications.
Nuria Lopez-Bigas Methods and tools in functional genomics (microarrays) BCO17.
Literature Survey: Microarray Data Analysis Ei-Ei Gaw Arizona State University CSE 591 April 24, 2003.
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Databases, Ontologies and Text mining Session Introduction Part 2 Carole Goble, University of Manchester, UK Dietrich Rebholz-Schuhmann, EBI, UK Philip.
Presented by Ho Wai Shing
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2011.
Consensus Group Stable Feature Selection
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
GeWorkbench Overview Support Team Molecular Analysis Tools Knowledge Center Columbia University and The Broad Institute of MIT and Harvard.
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring T.R. Golub et al., Science 286, 531 (1999)
Clustering High-Dimensional Data. Clustering high-dimensional data – Many applications: text documents, DNA micro-array data – Major challenges: Many.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Enabling Grids for E-sciencE EGEE and gLite are registered trademarks GRID distribution supporting chaotic map clustering on large mixed.
1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang.
Subgraph Search Over Uncertain Graphs Erşan Demircioğlu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.
Subspace Clustering/Biclustering
GPX: Interactive Exploration of Time-series Microarray Data
Microarray Data Set The microarray data set we are dealing with is represented as a 2d numerical array.
Presentation transcript:

Mining Phenotypes and Informative Genes from Gene Expression Data Chun Tang, Aidong Zhang and Jian Pei Department of Computer Science and Engineering State University of New York at Buffalo

cDNA Microarray Experiment

w 11 w 12 w 13 w 21 w 22 w 23 w 31 w 32 w 33 sample 1sample 2sample 3 genes samples Microarray Data  asymmetric dimensionality 10 ~ 100 samples 1000 ~ genes

Scope and Goal Sample Partition Microaray Database Gene Expression Matrices Gene Expression Data Analysis Important patterns Important patterns Important patterns Gene Microarray Images Gene Expression Patterns Visualization

Microarray Data Analysis  Analysis from two angles  sample as object, gene as attribute  gene as object, sample/condition as attribute

Sample-based Analysis Informative Genes Non- informative Genes gene 1 gene 6 gene 7 gene 2 gene 4 gene 5 gene samples

Related Work  New tools using traditional methods :  Clustering with feature selection:  Subspace clustering TreeView CLUTO CIT SOTA GeneSpring J-Express CLUSFAVOR SOM K-means Hierarchical clustering Graph based clustering PCA

Quality Measurement  Intra-phenotype consistency:  Inter-phenotype divergency:  The quality of phenotype and informative genes:

Heuristic Searching  Starts with a random K-partition of samples and a subset of genes as the candidate of the informative space.  Iteratively adjust the partition and the gene set toward the optimal solution. ofor each gene, try possible insert/remove ofor each sample, try best movement.

Mutual Reinforcing Adjustment  Divide the original matrix into a series of exclusive sub-matrices based on partitioning both the samples and genes.  Post a partial or approximate phenotype structure called a reference partition of samples. ocompute reference degree for each sample groups; oselect k groups of samples; odo partition adjustment.  Adjust the candidate informative genes. ocompute W for reference partition on G operform possible adjustment of each genes  Refinement Phase

Reference Partition Detection  Reference degree: measurement of a sample group over all gene groups  The sample group having the highest reference degree  S p0, S p1, S p2 … S px,…  Partition adjustment: check the missing samples

Gene Adjustment  For each gene, try possible insert/remove

 The partition corresponding to the best state may not cover all the samples.  Add every sample not covered by the reference partition into its matching group  the phenotypes of the samples.  Then, a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain  informative space.  Time complexity O(n*m 2 *I) Refinement Phase

Phenotype Detection Data SetMS-IFNMS-CONLeukemia- G1 Leukemia- G2 ColonBreast Data Size4132*284132*307129*387129*342000*623226*22 J-Express SOTA CLUTO Kmeans/PCA SOM / PCA  -cluster Heuristic Mutual

Informative Gene Selection

References Agrawal, Rakesh, Gehrke, Johannes, Gunopulos, Dimitrios and Raghavan, Prabhakar. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, pages 94–105, Azuaje, Francisco. Making Genome Expression Data Meaningful: Prediction and Discovery of Classes of Cancer Through a Connectionist Learning Approach. In Proceedings of IEEE International Symposium on Bioinformatics and Biomedical Engineering, pages 208–213, Ben-Dor A., Friedman N. and Yakhini Z. Class discovery in gene expression data. In Proc. Fifth Annual Inter. Conf. on Computational Molecular Biology (RECOMB 2001), pages 31–38. ACM Press, Cheng Y., Church GM. Biclustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 8:93– 103, DeRisi J., Penland L., Brown P.O., Bittner M.L., Meltzer P.S., Ray M., Chen Y., Su Y.A. and Trent J.M. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics, 14:457–460, Getz G., Levine E. and Domany E. Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA, Vol. 97(22):12079–12084, October Golub T.R., Slonim D.K., Tamayo P., Huard C., Gassenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield D.D. and Lander E.S. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, Vol. 286(15):531–537, October Yang, Jiong, Wang, Wei, Wang, Haixun and Yu, Philip S. δ-cluster: Capturing Subspace Correlation in a Large Data Set. In Proceedings of 18th International Conference on Data Engineering (ICDE 2002), pages 517–528, Xing E.P. and Karp R.M. Cliff: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, Vol. 17(1):306–315,  Agrawal, Rakesh, Gehrke, Johannes, Gunopulos, Dimitrios and Raghavan, Prabhakar. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, pages 94–105,  Ben-Dor A., Friedman N. and Yakhini Z. Class discovery in gene expression data. In Proc. Fifth Annual Inter. Conf. on Computational Molecular Biology (RECOMB 2001), pages 31–38. ACM Press,  Cheng Y., Church GM. Biclustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 8:93–103,  Golub T.R., Slonim D.K., Tamayo P., Huard C., Gassenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield D.D. and Lander E.S. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, Vol. 286(15):531–537, October  Xing E.P. and Karp R.M. Cliff: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, Vol. 17(1):306–315, 2001.