Biclustering of Expression Data by Yizong Cheng and George M. Church. Presented by Bojun Yan, March 25, 2004.

Outline
1. Microarray and related research
   1.1 Microarray gene expression data
   1.2 Main research on microarray data
2. Why bicluster?
   2.1 Preceding research and its limitations
   2.2 The concept of a bicluster
   2.3 Similarity measures
3. The hardness of biclustering

4. Methods proposed in this paper
   4.1 Related work and the paper's goal
   4.2 Definition of the mean squared residue score
   4.3 Scores of some special matrices
   4.4 Theorems derived by the authors
   4.5 Algorithms proposed in this paper
5. Experiments
   5.1 Data preparation
   5.2 Determining algorithm parameters
   5.3 Final algorithm
   5.4 Results and display

1. Microarray and related research

1.1 Microarray gene expression data
Generated by DNA chips and other microarray techniques. Rows correspond to genes; columns correspond to conditions or samples.

1.2 Main research on microarray data
(1) Gene clustering: finding genes with similar functions
(2) Condition clustering: helpful for case analysis
(3) Classification: tumor classification, cancer prediction
(4) Gene selection: finding the genes related to a particular disease
(5) Gene networks: exploring the regulatory interactions between genes

1.3 The paper's target: biclustering

2. Why bicluster?

2.1 Preceding research and its limitations
Goal: discover regulatory patterns or condition similarities.
Methods: based on Euclidean distance or the dot product between (equally weighted) vectors
(1) Group genes (rows)
(2) Group conditions (columns)
Result: the genes or conditions are partitioned into mutually exclusive groups or hierarchies.

Limitation: discovering some similarity groups can obscure others.

2.2 The concept of a bicluster
Cluster the genes (rows) and conditions (columns) simultaneously, i.e. subspace clustering.

2.3 Similarity measures
(1) Distance metrics, such as Minkowski distances
(2) Cosine measure

(3) Pearson correlation
(4) Extended Jaccard similarity
(5) Mean squared residue (proposed by this paper)
    + A measure of the coherence of the genes and conditions in the bicluster
    + A symmetric function of the genes and conditions
    + Groups genes and conditions simultaneously

3. Hardness of biclustering
- The problem of finding a maximum bicluster with a score below a threshold includes, as a special case, the problem of finding a maximum biclique in a bipartite graph.
- Finding the largest constant square submatrix is proven to be NP-hard (Johnson, 1987).
- The problem of finding a minimum set of biclusters, either mutually exclusive or overlapping, that covers all elements of a data matrix has been shown to be NP-hard (Orlin, 1977).

4. Methods proposed in this paper

4.1 Related work and the paper's goal
(1) Related work
- Divisive algorithms partition the data into sets with approximately constant values; proposed by Morgan and Sonquist (1963) and Hartigan (1972).
- Hartigan mentioned that the partitioning criterion may be a two-way analysis-of-variance model, similar to the mean squared residue scoring proposed in this article.
- Mirkin (1996) presents a node addition algorithm.

- The term "biclustering" was used by Mirkin (1996) to mean simultaneous clustering of both the row and column sets of a data matrix. The terms "direct clustering" (Hartigan, 1972) and "box clustering" (Mirkin, 1996) have the same meaning.

(2) The paper's goal and criterion
Goal: find a set of genes showing strikingly similar up-regulation and down-regulation under a set of conditions.
Criterion: a low mean squared residue score together with a large variation from the constant is used to identify these genes and conditions.
Overlapping: biclusters should be allowed to overlap in expression data analysis.

4.2 Definition of the mean squared residue score
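The formulas on this slide were shown as images and are not in the transcript. From the Cheng and Church paper, the residue of an element and the mean squared residue score of a submatrix with row set I and column set J are:

```latex
% Residue of element a_{ij} within the bicluster (I, J)
R(i,j) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}

% Mean squared residue score of the bicluster
H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} R(i,j)^2

% with the row, column and overall means
a_{iJ} = \frac{1}{|J|}\sum_{j \in J} a_{ij}, \qquad
a_{Ij} = \frac{1}{|I|}\sum_{i \in I} a_{ij}, \qquad
a_{IJ} = \frac{1}{|I|\,|J|}\sum_{i \in I,\, j \in J} a_{ij}
```

A pair (I, J) with H(I,J) ≤ δ is called a δ-bicluster.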

The row variance: an accompanying score used to reject trivial (constant) biclusters; a reconstruction of its formula is given below.
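The formula itself did not survive extraction. As I recall it from the paper (treat this as an assumption to check against the original), the row variance is:

```latex
% Row variance of the bicluster (I, J); a large value indicates genuine
% up- and down-regulation rather than a flat, near-constant submatrix
V(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{iJ}\right)^2
```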

4.3 Scores of some special matrices
- A special case with a perfect score (a zero mean squared residue) is a constant bicluster, all of whose elements have a single value.
- For the matrix a_ij = ij, i, j > 0, no submatrix larger than a single cell has a score lower than 0.5.
- A K×K matrix of all 0s except a single 1 has the score given by the equation on the slide (a derivation is sketched below).
- A matrix with elements generated randomly and uniformly in the range [a, b] has an expected score of (b-a)²/12. For example, with the range [0, 800] the expected score is 53,333.
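The equation for the K×K case was also shown as an image. It can be reconstructed from the residue definition in Section 4.2 (a derivation, not a quote from the slide): with the single 1 at position (1,1), the row and column means through that element are 1/K, the overall mean is 1/K², and summing the squared residues over the three kinds of cells gives

```latex
H(I,J) = \frac{1}{K^2}\left[
    \left(\frac{(K-1)^2}{K^2}\right)^{2}
  + 2(K-1)\left(\frac{K-1}{K^2}\right)^{2}
  + (K-1)^2\left(\frac{1}{K^2}\right)^{2}
\right]
= \frac{(K-1)^2}{K^4}
```

which tends to zero as K grows, as expected for a single outlier in a large matrix.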

Some characteristics of the mean squared residue score:
(1) Adding a constant to every element of the matrix does not affect the H(I,J) score.
(2) Multiplying every element by a constant scales the score by the square of that constant.
(3) Neither operation affects the ranking of the biclusters within a matrix.
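A minimal numpy sketch of the score and of these three properties (my own illustration, not the authors' code; the function name and toy data are made up):

```python
import numpy as np

def msr(A):
    """Mean squared residue H(I, J) of a (sub)matrix A."""
    row_means = A.mean(axis=1, keepdims=True)    # a_iJ
    col_means = A.mean(axis=0, keepdims=True)    # a_Ij
    grand_mean = A.mean()                        # a_IJ
    residue = A - row_means - col_means + grand_mean
    return (residue ** 2).mean()

rng = np.random.default_rng(0)
A = rng.uniform(0, 800, size=(200, 100))         # toy uniform matrix on [0, 800]

print(msr(A))                                    # roughly (800 - 0)**2 / 12 = 53,333
print(np.isclose(msr(A + 5.0), msr(A)))          # (1) adding a constant: score unchanged
print(np.isclose(msr(3.0 * A), 9.0 * msr(A)))    # (2) multiplying by c: score scales by c**2
# (3) follows from (1) and (2): every bicluster is rescaled by the same factor,
#     so the relative ranking of biclusters is unchanged.
```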

4.4 Theorems derived by the authors (summarized below)
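The theorem statements on this slide are not in the transcript. The quantities the algorithms below rely on are the per-row and per-column contributions to the score; paraphrasing the paper's lemmas from memory (a summary, not the exact wording):

```latex
d(i) = \frac{1}{|J|}\sum_{j \in J}\left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2,
\qquad
e(j) = \frac{1}{|I|}\sum_{i \in I}\left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2
```

Removing a row or column whose contribution exceeds H(I,J) decreases the score (this justifies the deletion algorithms), and a row or column outside the bicluster whose contribution does not exceed H(I,J) can be added without increasing the score (Lemma 3, used by the node addition algorithm).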

Comments on Algorithm 0:
- Although it runs in polynomial time, Algorithm 0 is not efficient enough for a quick analysis of most expression data matrices.
- The complexity of Algorithm 0 is O((n+m)nm).

Comments on Algorithm 1:
- In each iteration, steps 1 and 2 must be completely recomputed.
- The time complexity of Algorithm 1 is O(nm).
- Higher efficiency than Algorithm 0, but still not the best.
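A compact Python sketch of the single node deletion idea (my reconstruction, assuming numpy; not the paper's exact pseudocode):

```python
import numpy as np

def single_node_deletion(A, rows, cols, delta):
    """Drop, one at a time, the row or column contributing most to H(I, J)
    until the score falls to delta (a sketch of Algorithm 1's idea)."""
    rows, cols = list(rows), list(cols)
    while True:
        sub = A[np.ix_(rows, cols)]
        residue2 = (sub - sub.mean(axis=1, keepdims=True)
                        - sub.mean(axis=0, keepdims=True) + sub.mean()) ** 2
        H = residue2.mean()
        if H <= delta:
            return rows, cols
        d = residue2.mean(axis=1)                 # d(i): per-row contribution
        e = residue2.mean(axis=0)                 # e(j): per-column contribution
        if len(rows) > 1 and (len(cols) <= 1 or d.max() >= e.max()):
            del rows[int(d.argmax())]             # remove the worst row
        elif len(cols) > 1:
            del cols[int(e.argmax())]             # remove the worst column
        else:
            return rows, cols                     # nothing left to remove
```

For example, single_node_deletion(A, range(A.shape[0]), range(A.shape[1]), delta=300) would start from the full matrix, as the paper's algorithms do.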

Comments on Algorithm 2:
- Requires a properly chosen parameter α > 1.
- Does not update the score after the removal of each node.
- The time complexity of Algorithm 2 is O(log n + log m).
- It may miss some large δ-biclusters.
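One reading of the multiple node deletion step (again a hedged sketch, not the paper's pseudocode): every row and column whose contribution exceeds α·H(I,J) is removed in bulk, without rescoring after each individual node, and the remaining submatrix is handed to the single node deletion sketch above.

```python
def multiple_node_deletion(A, rows, cols, delta, alpha=1.2):
    """Bulk deletion phase: drop every row/column with contribution > alpha * H(I, J).
    A sketch of Algorithm 2's idea; falls back to single node deletion when stuck."""
    rows, cols = list(rows), list(cols)
    while True:
        sub = A[np.ix_(rows, cols)]
        residue2 = (sub - sub.mean(axis=1, keepdims=True)
                        - sub.mean(axis=0, keepdims=True) + sub.mean()) ** 2
        H = residue2.mean()
        if H <= delta:
            break
        keep_rows = residue2.mean(axis=1) <= alpha * H
        keep_cols = residue2.mean(axis=0) <= alpha * H
        if keep_rows.all() and keep_cols.all():
            break                                  # no bulk progress is possible
        rows = [r for r, k in zip(rows, keep_rows) if k]
        cols = [c for c, k in zip(cols, keep_cols) if k]
    return single_node_deletion(A, rows, cols, delta)
```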

Comments on Algorithm 3:
- The time complexity is O(mn).
- The resulting δ-bicluster may still not be maximal, for two reasons:
  (1) Lemma 3 gives only a sufficient condition for adding rows and columns.
  (2) Adding rows and columns may drive the score far below δ, leaving room for further growth.
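A sketch of the node addition idea (my reconstruction; the paper also adds "inverted" rows, which is omitted here): a column or row outside the bicluster is admitted when its contribution does not exceed the current score H(I,J).

```python
def node_addition(A, rows, cols):
    """Re-admit columns whose contribution does not exceed H(I, J); a sketch of
    Algorithm 3's idea (rows, and inverted rows, are handled analogously)."""
    rows, cols = list(rows), list(cols)
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1)                  # a_iJ over the current bicluster
    grand = sub.mean()                            # a_IJ
    H = ((sub - row_means[:, None]
              - sub.mean(axis=0)[None, :] + grand) ** 2).mean()
    for j in range(A.shape[1]):
        if j in cols:
            continue
        col = A[rows, j]                          # candidate column restricted to rows I
        contrib = ((col - row_means - col.mean() + grand) ** 2).mean()
        if contrib <= H:
            cols.append(j)                        # adding it does not raise the score
    return rows, cols
```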

5. Experiment

5.1 Data preparation
Datasets and parameters:
(1) Yeast data: δ = 300, n = 100
(2) Human data: δ = 1200, n = 100
Missing data replacement: missing values are replaced with random numbers drawn from a uniform distribution.
The biclusters are compared with the clustering results of:
(1) Tavazoie et al. (1999)
(2) Alizadeh et al. (2000)

5.2 Determining algorithm parameters
- Guided by the clusters reported by Tavazoie et al. (1999) and Alizadeh et al. (2000).
- For the yeast data: δ = 300, α = 1.2.
- For the human gene data: δ = 1200, α = 1.2.
- The number of biclusters is n = 100.
- Masking discovered biclusters: each time a bicluster is discovered, its elements are replaced by random numbers, because the algorithms are deterministic (a sketch of this step follows).

5.3 Final algorithm
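A small sketch of the masking step (my illustration; the [0, 800] range is just the range used in the expected-score example above, not necessarily the range of the real data):

```python
import numpy as np

def mask_bicluster(A, rows, cols, low=0.0, high=800.0, rng=None):
    """Overwrite a discovered bicluster with uniform random values so the
    deterministic algorithms do not rediscover it on the next run."""
    rng = rng or np.random.default_rng()
    A[np.ix_(rows, cols)] = rng.uniform(low, high, size=(len(rows), len(cols)))
    return A
```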

Biclusters for Yeast data

Biclusters for human data