Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles Jin Chen Sep 2012.

Motivation
A subset of genes that shows correlated co-expression patterns across a subset of conditions is functionally related. Existing algorithms only address pure shifting or pure scaling patterns in gene expression profiles.
Example: P1 = P2 - 5 = P3 - 15 = P4 = P5 / 1.5 = P6 / 3. They are clustered into two groups.

Motivation
How can the previous genes be grouped into one cluster? We need to handle shifting and scaling patterns simultaneously.
Three genes g1, g2 and g3 are correlated under all the above conditions: g2 = -2.5 * g = -g1 + 30
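To illustrate why pure shifting and pure scaling models split such patterns, the following Python sketch builds profiles that satisfy P1 = P2 - 5 = P3 - 15 = P4 = P5 / 1.5 = P6 / 3 and tests pairs of them. The base values in p1, the tolerance, and all function names are invented for this example and do not come from the slides.

```python
def is_pure_shift(x, y, tol=1e-9):
    """True if y - x is the same constant at every condition."""
    diffs = [b - a for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol


def is_pure_scale(x, y, tol=1e-9):
    """True if y / x is the same constant at every condition (x nonzero)."""
    ratios = [b / a for a, b in zip(x, y)]
    return max(ratios) - min(ratios) <= tol


def is_shift_and_scale(x, y, tol=1e-9):
    """True if y = s1 * x + s2 with s1 != 0, fitted from the first two points."""
    if x[1] == x[0]:
        raise ValueError("the first two values of x must differ")
    s1 = (y[1] - y[0]) / (x[1] - x[0])
    s2 = y[0] - s1 * x[0]
    return s1 != 0 and all(abs(s1 * a + s2 - b) <= tol for a, b in zip(x, y))


# Hypothetical base profile; the others follow the relation on the slide.
p1 = [2.0, 4.0, 3.0, 6.0]
p2 = [v + 5 for v in p1]      # P1 = P2 - 5
p3 = [v + 15 for v in p1]     # P1 = P3 - 15
p5 = [v * 1.5 for v in p1]    # P1 = P5 / 1.5
p6 = [v * 3 for v in p1]      # P1 = P6 / 3

print(is_pure_shift(p2, p3))        # the shifted profiles pair up
print(is_pure_scale(p5, p6))        # the scaled profiles pair up
print(is_pure_shift(p2, p5), is_pure_scale(p2, p5))  # neither pure model links them
print(is_shift_and_scale(p2, p5))   # a combined model does
```

In this toy data p5 = 1.5 * p2 - 7.5, so only a model that allows a shift and a scale at the same time puts p2 and p5 in one group.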

Definition of Correlation
Correlation means any of a broad class of statistical relationships between random variables or data values. In this paper, we focus only on linear correlation, which covers shifting and scaling. Positive and negative correlation correspond to positive and negative scaling factors, respectively.
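A small Python sketch can make the link between scaling factor and correlation sign concrete. It computes the Pearson coefficient by hand for a hypothetical profile g1 and two linear transforms of it; the profile values and the transform 2 * g1 + 3 are assumptions for the example, while -g1 + 30 mirrors the relation quoted on the motivation slide.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


g1 = [10.0, 25.0, 5.0, 40.0, 15.0]     # hypothetical expression profile
g_pos = [2.0 * v + 3.0 for v in g1]     # positive scaling factor plus a shift
g_neg = [-1.0 * v + 30.0 for v in g1]   # negative scaling factor plus a shift

print(round(pearson(g1, g_pos), 6))
print(round(pearson(g1, g_neg), 6))
```

A positive scaling factor gives a coefficient of +1 and a negative one gives -1, whatever the shift, which is why linear correlation covers both shifting and scaling.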

Definition of Bi-clustering
Simultaneous clustering of the rows and columns of a matrix, e.g. grouping genes that have similar expression patterns under a subset of conditions.

Clustering
Definition: the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
[Figure: the result of a cluster analysis, shown as the coloring of the squares into three clusters. Label: K-means]

Density-based Subspace Clustering
- Discovers arbitrarily shaped clusters within a subspace.
- A cluster is regarded as a region in which the density of data objects exceeds a threshold.
- Suffers from the problem that each data object can be assigned to only one cluster.

Hierarchical Clustering
- Uses previously established clusters to find successive clusters.
- Falls into two categories: agglomerative ("bottom-up") and divisive ("top-down").
- Only applicable to full-space clustering.

Pattern-based and Tendency-based Biclustering
- Pattern-based biclustering measures similarity between objects based on the coherent pattern they exhibit. It only identifies pure shifting or pure scaling patterns.
- Tendency-based biclustering focuses on the linear ordering of gene expression levels, without any guarantee of coherence.
- Both methods fail to address negative correlation in a subspace.
- Both methods disregard the fact that patterns with small variations in expression values are probably of little biological meaning.
- Both methods miss co-regulated genes that exhibit shifting-and-scaling patterns due to varying individual sensitivities.

New algorithm: reg-cluster
A reg-cluster exhibits the following characteristics, which make it suitable for expression data analysis:
- The increase or decrease of gene expression levels across any two conditions of a reg-cluster is in proportion, allowing small variations defined by the coherence threshold.
- The increase or decrease of gene expression levels across any two conditions of a reg-cluster is significant with regard to the regulation threshold.
- Genes of a reg-cluster can be either positively correlated or negatively correlated.

Challenges
- The biggest challenge is the need for a novel coherent cluster model that can capture the more general shifting-and-scaling co-regulation patterns.
- Another challenge is how to apply a non-negative regulation threshold. Tendency-based models are not suitable for adopting a regulation threshold.

Regulation Measurement
Notation: $d_{ic_a}$ and $d_{ic_b}$ are the expression levels of gene $g_i$ under conditions $c_a$ and $c_b$, respectively; $\gamma$ is a user-defined gene expression threshold.
- $g_i$ is up-regulated from condition $c_b$ to $c_a$ if $d_{ic_a} - d_{ic_b} > \gamma$.
- $g_i$ is down-regulated from condition $c_b$ to $c_a$ if $d_{ic_a} - d_{ic_b} < -\gamma$.
We represent them as: [notation not preserved]
We call $c_b$ the regulation predecessor of $c_a$, and $c_a$ the regulation successor of $c_b$.
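The two inequalities translate directly into code. The sketch below is a minimal Python rendering; the function name `regulation` and the returned labels are hypothetical, and the threshold 4.5 in the usage lines is borrowed from the RWave example.

```python
def regulation(d_a, d_b, gamma):
    """Classify the change of one gene from condition c_b to condition c_a.

    d_a, d_b: expression levels under c_a and c_b.
    gamma: non-negative regulation threshold.
    """
    diff = d_a - d_b
    if diff > gamma:
        return "up"      # up-regulated from c_b to c_a
    if diff < -gamma:
        return "down"    # down-regulated from c_b to c_a
    return "none"        # change too small to count as regulation


print(regulation(10.0, 4.0, 4.5))
print(regulation(4.0, 10.0, 4.5))
print(regulation(5.0, 4.0, 4.5))
```

Differences whose magnitude does not exceed the threshold fall into neither category, which is how small fluctuations are kept out of the clusters.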

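RWave Model
The RWave model is used to find regulation chains efficiently. It keeps a record of the order of the bordering regulation relationships, i.e. the minimal condition pairs whose expression difference exceeds the threshold (in the example, $\gamma_1 = \gamma_2 = 4.5$ and $\gamma_3 = 1.8$).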
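One way to picture such a structure, under assumptions, is the following Python sketch. For a single gene it sorts the conditions by expression level and stores, for each condition, a pointer to the nearest lower condition whose level is smaller by more than the threshold. This is an illustrative reading of "bordering regulation relationships", not the paper's exact data structure; the function name `rwave`, the dictionary layout and the sample values are all hypothetical.

```python
def rwave(levels, gamma):
    """Build a simplified RWave-like index for one gene.

    levels: dict mapping condition name -> expression level.
    gamma: non-negative regulation threshold for this gene.
    Returns (order, pointers): conditions sorted by level, and for each
    condition the nearest lower condition (in sorted order) that differs
    by more than gamma, or None if there is no such condition.
    """
    order = sorted(levels, key=levels.get)
    pointers = {}
    p = -1  # index of the highest condition that is a predecessor so far
    for k, cond in enumerate(order):
        # Levels are non-descending, so the predecessor border only moves up.
        while p + 1 < k and levels[cond] - levels[order[p + 1]] > gamma:
            p += 1
        pointers[cond] = order[p] if p >= 0 else None
    return order, pointers


sample = {"c1": 10.0, "c2": 2.0, "c3": 7.0, "c4": 15.0}  # made-up levels
order, pointers = rwave(sample, 4.5)
print(order)
print(pointers)
```

Every condition at or below a pointer target is also a regulation predecessor, so storing only the bordering pair is enough to recover all regulation relationships of the gene.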

Coherent Measurement
The shifting-and-scaling correlation between the expression profiles $d_i$ and $d_j$ under condition set $Y$ can be expressed as a linear equation, with a scaling factor and a shifting offset relating one profile to the other. The correlation between $d_i$ and $d_j$ can equivalently be expressed as a condition on neighboring expression levels, where $d_{ic_{k+1}}$ and $d_{ic_k}$ are neighboring expression levels of gene $g_i$ after all the levels are sorted in non-descending order, and $d_{jc_{k+1}}$ and $d_{jc_k}$ are neighboring expression levels of gene $g_j$. Here $c_1$ and $c_2$ form the baseline condition pair.
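The equations themselves are not preserved in this transcript. The derivation below is a reconstruction from the surrounding definitions, not a quotation of the slide: the symbols $s_1$ (scaling factor) and $s_2$ (shifting offset) are labels assumed here, and the baseline difference is assumed to be nonzero.

```latex
\begin{align*}
% Assumed linear relation between the two profiles on condition set Y
d_{jc} &= s_1 \, d_{ic} + s_2 \quad \text{for all } c \in Y, \qquad s_1 \neq 0 \\
% Taking differences cancels the shift s_2
d_{jc_{k+1}} - d_{jc_k} &= s_1 \left( d_{ic_{k+1}} - d_{ic_k} \right) \\
d_{jc_2} - d_{jc_1} &= s_1 \left( d_{ic_2} - d_{ic_1} \right) \\
% Dividing by the baseline difference cancels the scale s_1
\frac{d_{ic_{k+1}} - d_{ic_k}}{d_{ic_2} - d_{ic_1}}
  &= \frac{d_{jc_{k+1}} - d_{jc_k}}{d_{jc_2} - d_{jc_1}}
\end{align*}
```

The last line involves only expression levels, so it can be checked without knowing $s_1$ or $s_2$, and it holds for negative $s_1$ as well because the sign cancels in the ratio.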

Coherent Measurement (cont’d)
A coherence score for gene $g_i$ on conditions $c_k$ and $c_{k+1}$, given the baseline condition pair $c_1$ and $c_2$, is defined as: [formula not preserved]
Genes that share the same coherence scores under a subset of conditions form a shifting-and-scaling pattern. In practice, a coherence threshold $\epsilon$ is applied to flexibly control the coherence of the clusters.
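As a hedged illustration, the Python sketch below assumes the coherence score is the change in expression between two neighboring conditions divided by the change across the baseline pair, which is the form suggested by the definitions on the previous slide. The function name, the data layout and the sample profiles are hypothetical.

```python
def coherence_scores(levels, chain, baseline):
    """Coherence scores of one gene along an ordered list of conditions.

    levels: dict mapping condition name -> expression level.
    chain: conditions in the order they are compared (neighbor to neighbor).
    baseline: pair (c1, c2) used to normalize every step.
    Assumed score for step (c_k, c_k+1):
        (d[c_k+1] - d[c_k]) / (d[c2] - d[c1])
    """
    c1, c2 = baseline
    base = levels[c2] - levels[c1]
    if base == 0:
        raise ValueError("baseline conditions must have different levels")
    return [(levels[b] - levels[a]) / base for a, b in zip(chain, chain[1:])]


chain = ["c1", "c2", "c3", "c4"]
g1 = {"c1": 0.0, "c2": 10.0, "c3": 15.0, "c4": 30.0}   # made-up profile
g2 = {c: -v + 30.0 for c, v in g1.items()}              # g2 = -g1 + 30
g3 = {c: 2.0 * v + 5.0 for c, v in g1.items()}          # g3 = 2 * g1 + 5

print(coherence_scores(g1, chain, ("c1", "c2")))
print(coherence_scores(g2, chain, ("c1", "c2")))
print(coherence_scores(g3, chain, ("c1", "c2")))
```

All three profiles produce the same list of scores, including the negatively correlated g2, because both the shift and the scaling factor cancel in the ratio.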

Reg-Cluster Model Definition
To decide whether a subset of genes forms a shifting-and-scaling pattern, the reg-cluster model proposed in this paper requires that both the regulation and the coherence requirements be satisfied:
- All the genes should form a regulation chain under the subset of conditions, either up-regulation or down-regulation.
- Any pair of genes should have a difference in coherence score smaller than the given coherence threshold $\epsilon$.
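The two requirements can be combined into a single membership test, sketched below in Python. Several choices are assumptions for the example rather than details taken from the slides: the first two conditions of the chain serve as the baseline pair, the coherence score is the step difference divided by the baseline difference, each gene has its own threshold in `gammas`, and all names and sample values are hypothetical.

```python
def chain_direction(levels, chain, gamma):
    """+1 if every step of the chain is up-regulated, -1 if every step is
    down-regulated, 0 if the gene does not follow the chain."""
    diffs = [levels[b] - levels[a] for a, b in zip(chain, chain[1:])]
    if all(d > gamma for d in diffs):
        return 1
    if all(d < -gamma for d in diffs):
        return -1
    return 0


def is_reg_cluster(genes, chain, gammas, epsilon):
    """Check the regulation and coherence requirements for a gene set.

    genes: dict gene name -> dict of condition -> expression level.
    chain: ordered conditions; its first two entries are the baseline pair.
    gammas: dict gene name -> non-negative regulation threshold.
    epsilon: coherence threshold (score differences must stay below it).
    """
    if len(chain) < 2 or not genes:
        return False
    c1, c2 = chain[0], chain[1]
    steps = list(zip(chain, chain[1:]))
    scores = []
    for name, levels in genes.items():
        # Regulation requirement: the gene must follow the chain up or down.
        if chain_direction(levels, chain, gammas[name]) == 0:
            return False
        base = levels[c2] - levels[c1]  # nonzero, since |base| > gamma >= 0
        scores.append([(levels[b] - levels[a]) / base for a, b in steps])
    # Coherence requirement: at every step, all pairs of genes must agree.
    for k in range(len(steps)):
        column = [s[k] for s in scores]
        if max(column) - min(column) >= epsilon:
            return False
    return True


chain = ["c1", "c2", "c3", "c4"]
g1 = {"c1": 0.0, "c2": 10.0, "c3": 15.0, "c4": 30.0}
g2 = {c: -v + 30.0 for c, v in g1.items()}           # negatively correlated
g3 = {"c1": 0.0, "c2": 10.0, "c3": 30.0, "c4": 35.0}  # regulated, not coherent
gammas = {"g1": 4.5, "g2": 4.5, "g3": 4.5}

print(is_reg_cluster({"g1": g1, "g2": g2}, chain, gammas, 0.1))
print(is_reg_cluster({"g1": g1, "g2": g2, "g3": g3}, chain, gammas, 0.1))
```

Checking the spread of scores at each step is equivalent to checking every pair of genes, since the largest pairwise difference is the maximum minus the minimum.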

Algorithm and pruning
The basic idea of the algorithm is to systematically identify the representative regulation chain for each validated reg-cluster. The algorithm performs a bi-directional depth-first search on the RWave model for representative regulation chains. Four pruning strategies are applied: minimum gene number, minimum condition number, regulation threshold and coherence threshold.

Algorithm and pruning (cont’d)
To avoid redundancy due to opposite chains with the same members, the authors also prune regulation chains in which fewer than half of the gene members are positively correlated. They call positively correlated gene members p-members and negatively correlated gene members n-members. Representative chains that survive the pruning steps and have a maximal gene set are output as reg-clusters.
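The redundancy rule can be sketched in a few lines of Python. The encoding of p-members as +1 and n-members as -1, and the function name, are assumptions made for this example; how the original algorithm treats a chain with exactly half p-members is not stated in the slides, and this sketch keeps it.

```python
def keep_as_representative(directions):
    """Decide whether a regulation chain survives the redundancy pruning.

    directions: one entry per gene member, +1 for a p-member (positively
    correlated with the chain) and -1 for an n-member.
    The chain is pruned when fewer than half of its members are p-members.
    """
    p_members = sum(1 for d in directions if d == 1)
    return 2 * p_members >= len(directions)


print(keep_as_representative([1, 1, -1]))    # majority p-members
print(keep_as_representative([1, -1, -1]))   # majority n-members
```

Reversing a chain turns every p-member into an n-member and vice versa, so of two opposite chains with the same members at most one can have a strict majority of p-members.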

Algorithm

Efficiency
The running time of reg-cluster is evaluated on synthetic datasets.

Effectiveness
The authors run reg-cluster on a benchmark 2D yeast gene expression dataset. They identify three bi-clusters that previous algorithms fail to identify.

Biological Significance Evaluation
The yeast genome Gene Ontology term finder is used to evaluate the biological significance of the bi-clusters in three categories.

Conclusions
Pros of reg-cluster:
- Identifies arbitrary shifting-and-scaling co-regulation patterns.
- Addresses both positive and negative correlation in the subspace.
- Allows a flexible regulation threshold to quantify up- or down-regulation.
- Experiments showed that the bi-clusters found are biologically significant in a variety of biological processes.

Discussion
- How should proper regulation and coherence thresholds be chosen to achieve a satisfactory tradeoff between sensitivity and specificity?
- The proposed model can only handle linear correlation between co-regulated genes. It will still miss many cases where co-regulated genes have non-linear patterns.
- Do we need a measure of similarity between bi-clusters in order to combine those that take part in similar biological processes?