Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles Jin Chen Sep 2012.

Motivation A subset of genes that shows correlated co-expression patterns across a subset of conditions is likely to be functionally related. Existing algorithms address only pure shifting or pure scaling patterns in gene expression profiles. Example: six profiles related by P1 = P2 - 5 = P3 - 15 = P4 = P5 / 1.5 = P6 / 3 are clustered by such algorithms into two groups rather than one.

Motivation How to group the previous genes into one cluster? We need to handle shifting and scaling patterns simultaneously! Three genes g1, g2 and g3 are correlated under all the above conditions: g2 = -2.5 * g3 + 35 = -g1 + 30
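The relation on this slide can be checked numerically. The following is a minimal sketch, not code from the paper: the helper `fit_shift_scale` and the expression levels chosen for g1 are hypothetical, and g2 and g3 are derived from the stated relation. A plain least-squares fit recovers the scaling and shifting factors with zero residual, for both negative and positive scaling.

```python
def fit_shift_scale(x, y):
    """Least-squares fit of y = scale * x + shift.

    Returns (scale, shift, max_abs_residual). Hypothetical helper for illustration.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    scale = sxy / sxx
    shift = mean_y - scale * mean_x
    residual = max(abs(b - (scale * a + shift)) for a, b in zip(x, y))
    return scale, shift, residual


# Assumed expression levels for g1 under five conditions (illustrative only).
g1 = [10.0, 20.0, 15.0, 25.0, 5.0]
g2 = [-v + 30 for v in g1]           # g2 = -g1 + 30
g3 = [(35 - v) / 2.5 for v in g2]    # rearranged from g2 = -2.5 * g3 + 35

scale_32, shift_32, res_32 = fit_shift_scale(g3, g2)  # negative scaling
scale_13, shift_13, res_13 = fit_shift_scale(g1, g3)  # positive scaling
```

A pure-shifting or pure-scaling model would need a scale of 1 or a shift of 0 respectively, which neither pair satisfies here.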

Definition of Correlation Correlation means any of a broad class of statistical relationships between random variables or data values. In this paper, we focus only on linear correlation, which includes shifting and scaling. Positive and negative correlation correspond to positive and negative scaling factors respectively.

Definition of Bi-clustering Simultaneous clustering of the rows and columns of a matrix, e.g. grouping genes that have similar expression patterns under a subset of conditions.

Clustering Definition: the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Figure: the result of a cluster analysis (K-means), shown as the coloring of the squares into three clusters.

Density-based Subspace Clustering Discovers arbitrary-shaped clusters in a subspace. A cluster is regarded as a region in which the density of data objects exceeds a threshold. It suffers from the limitation that each data object can be assigned to only one cluster.

Hierarchical Clustering Uses previously established clusters to find successive clusters. It has two categories: agglomerative ("bottom-up") and divisive ("top-down"). It is applicable only to full-space clustering.

Pattern-based and Tendency-based Biclustering Pattern-based biclustering measures similarity between objects based on the coherent pattern they exhibit. It identifies only pure shifting or pure scaling patterns. Tendency-based biclustering focuses on the linear ordering of gene expression levels, without any guarantee of coherence. Both methods fail to address negative correlation in a subspace. Both methods disregard the fact that patterns with small variations in expression values are probably of little biological meaning. Both methods miss co-regulated genes that have shifting-and-scaling patterns due to varying individual sensitivities.

New algorithm: reg-cluster A reg-cluster exhibits the following characteristics, which are suitable for expression data analysis: – The increase or decrease of gene expression levels across any two conditions of a reg-cluster is in proportion, allowing small variations defined by the coherence threshold. – The increase or decrease of gene expression levels across any two conditions of a reg-cluster is significant with regard to the regulation threshold. – Genes of a reg-cluster can be either positively correlated or negatively correlated.

Challenges The biggest challenge is the need for a novel coherent cluster model that can capture the more general shifting-and-scaling co-regulation patterns. Another challenge is how to apply a non-negative regulation threshold. Tendency-based models are not suitable for adopting a regulation threshold.

Regulation Measurement Notation: $d_{ic_a}$ and $d_{ic_b}$ are the expression levels of gene $g_i$ under conditions $c_a$ and $c_b$ respectively; $\gamma$ is a user-defined gene expression threshold. $g_i$ is up-regulated from condition $c_b$ to $c_a$ if $d_{ic_a} - d_{ic_b} > \gamma$, and down-regulated from condition $c_b$ to $c_a$ if $d_{ic_a} - d_{ic_b} < -\gamma$. We call $c_b$ the regulation predecessor of $c_a$ and $c_a$ the regulation successor of $c_b$.
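The two inequalities above translate directly into a small classifier. This is a minimal sketch: the function name `regulation` and the string labels are hypothetical, chosen only for illustration.

```python
def regulation(d_a, d_b, gamma):
    """Classify the change of one gene from condition c_b (level d_b) to c_a (level d_a).

    Returns "up" if d_a - d_b > gamma, "down" if d_a - d_b < -gamma,
    and "none" when the change does not exceed the threshold in either direction.
    """
    diff = d_a - d_b
    if diff > gamma:
        return "up"
    if diff < -gamma:
        return "down"
    return "none"


# Example with an assumed threshold gamma = 4.5
up_case = regulation(10.0, 4.0, 4.5)     # difference  6.0
down_case = regulation(4.0, 10.0, 4.5)   # difference -6.0
flat_case = regulation(6.0, 4.0, 4.5)    # difference  2.0, below the threshold
```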

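RWave Model The RWave model supports efficient discovery of regulation chains. For each gene, it sorts the conditions by expression level and keeps a record of the bordering regulation relationships, i.e. the closest pairs of conditions whose difference exceeds the gene's threshold (in the example, γ1 = γ2 = 4.5 and γ3 = 1.8).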
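One way to picture the bookkeeping is the sketch below. It is a simplified illustration under stated assumptions, not the RWave data structure from the paper: the function `nearest_regulation_predecessors`, the condition names, and the expression levels are all hypothetical. For one gene, it sorts the conditions by expression level and records, for each condition, the closest lower condition whose level differs by more than γ.

```python
def nearest_regulation_predecessors(levels, gamma):
    """levels maps condition name -> expression level of a single gene.

    Returns (order, pred): `order` lists conditions by non-descending level,
    and `pred[c]` is the closest condition below c (in that order) whose level
    is more than gamma lower than c's, or None if no such condition exists.
    """
    order = sorted(levels, key=levels.get)
    pred = {}
    j = -1  # index of the highest condition already known to lie > gamma below
    for i, c in enumerate(order):
        # Levels only grow along `order`, so j never needs to move backwards.
        while j + 1 < i and levels[c] - levels[order[j + 1]] > gamma:
            j += 1
        pred[c] = order[j] if j >= 0 else None
    return order, pred


# Assumed levels for one gene, with gamma = 4.5 as in the slide's example threshold
gene = {"c1": 10.0, "c2": 2.0, "c3": 7.0, "c4": 15.0, "c5": 3.0}
order, pred = nearest_regulation_predecessors(gene, 4.5)
```

Any condition ranked at or below `pred[c]` is also a regulation predecessor of `c`, so storing only the closest one is enough to enumerate them all.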

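Coherent Measurement The shifting-and-scaling correlation between the gene expression data $d_i$ and $d_j$ under condition set $Y$ can be expressed as a linear equation: $d_{jc} = s_1 d_{ic} + s_2$ for every condition $c \in Y$, where $s_1$ is the scaling factor and $s_2$ is the shifting factor. The correlation between $d_i$ and $d_j$ can equally be expressed by the following condition: $\frac{d_{ic_{k+1}} - d_{ic_k}}{d_{ic_2} - d_{ic_1}} = \frac{d_{jc_{k+1}} - d_{jc_k}}{d_{jc_2} - d_{jc_1}}$, where $d_{ic_{k+1}}$ and $d_{ic_k}$ are neighboring expression levels of gene $g_i$ after all the levels are sorted in non-descending order, and $d_{jc_{k+1}}$ and $d_{jc_k}$ are the corresponding expression levels of gene $g_j$. Here $c_1$ and $c_2$ form the baseline condition pair.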
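The step from the linear equation to the ratio condition can be worked out as follows. This is a short derivation sketch; the symbols $s_1$ (scaling factor) and $s_2$ (shifting factor) are names assumed here, and it requires $s_1 \neq 0$ and a non-zero baseline difference $d_{ic_2} - d_{ic_1}$.

```latex
\begin{align*}
d_{jc} &= s_1\, d_{ic} + s_2 \quad \text{for every } c \in Y \\
d_{jc_{k+1}} - d_{jc_k} &= s_1 \left( d_{ic_{k+1}} - d_{ic_k} \right)
  && \text{(the shift } s_2 \text{ cancels)} \\
d_{jc_2} - d_{jc_1} &= s_1 \left( d_{ic_2} - d_{ic_1} \right)
  && \text{(same step for the baseline pair)} \\
\frac{d_{jc_{k+1}} - d_{jc_k}}{d_{jc_2} - d_{jc_1}}
  &= \frac{d_{ic_{k+1}} - d_{ic_k}}{d_{ic_2} - d_{ic_1}}
  && \text{(the scale } s_1 \text{ cancels)}
\end{align*}
```

Because $s_1$ cancels regardless of its sign, the same condition covers both positively and negatively correlated genes.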

Coherent Measurement (cont’d) The coherent score for gene $g_i$ on conditions $c_k$ and $c_{k+1}$, given the baseline condition pair $c_1$ and $c_2$, is defined as: $H(i, c_1, c_2, c_k, c_{k+1}) = \frac{d_{ic_{k+1}} - d_{ic_k}}{d_{ic_2} - d_{ic_1}}$. Genes that share the same coherent scores under a subset of conditions exhibit a shifting-and-scaling pattern. In practice, a coherence threshold ε is applied to control the coherence of the clusters flexibly.
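The coherent score can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the function `coherence_scores`, the expression levels, and the choice of the first two listed conditions as the baseline pair are hypothetical. Each score is the difference between neighboring levels divided by the difference over the baseline pair.

```python
def coherence_scores(levels, baseline=(0, 1)):
    """levels: expression levels of one gene, listed in the shared condition order.

    Returns one score per neighboring condition pair (c_k, c_k+1): the difference
    of the two levels divided by the difference over the baseline condition pair.
    """
    b1, b2 = baseline
    denom = levels[b2] - levels[b1]
    if denom == 0:
        raise ValueError("baseline conditions must have different expression levels")
    return [(levels[k + 1] - levels[k]) / denom for k in range(len(levels) - 1)]


# Assumed levels for g_i; g_j is a negatively scaled and shifted copy of g_i.
g_i = [1.0, 3.0, 4.0, 8.0]
g_j = [-2.5 * v + 35 for v in g_i]   # [32.5, 27.5, 25.0, 15.0]

scores_i = coherence_scores(g_i)
scores_j = coherence_scores(g_j)
max_gap = max(abs(a - b) for a, b in zip(scores_i, scores_j))
```

Both genes yield the same scores even though one rises and the other falls, which is why a single threshold ε on the score difference can cover positive and negative correlation.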

Reg-Cluster Model Definition To decide whether a subset of genes exhibits a shifting-and-scaling pattern, the reg-cluster model proposed in this paper requires that both the regulation and the coherence requirements be satisfied: – All the genes form a regulation chain under the subset of conditions, either up-regulated or down-regulated. – Any pair of genes has a difference in coherence score smaller than the given coherence threshold ε.
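The two requirements can be combined into a single validity check. The following is a sketch under assumptions, not the authors' implementation: the function names, the per-gene thresholds in `gammas`, the example data, and the use of the first two conditions as the baseline pair are all hypothetical. Conditions are assumed to be listed in the order of the candidate regulation chain.

```python
from itertools import combinations


def chain_direction(levels, gamma):
    """Return "up" or "down" if every consecutive step exceeds gamma in that direction, else None."""
    diffs = [b - a for a, b in zip(levels, levels[1:])]
    if all(d > gamma for d in diffs):
        return "up"
    if all(d < -gamma for d in diffs):
        return "down"
    return None


def coherence_scores(levels):
    """Neighboring differences divided by the difference over the first two conditions."""
    denom = levels[1] - levels[0]
    return [(b - a) / denom for a, b in zip(levels, levels[1:])]


def is_reg_cluster(genes, gammas, eps):
    """genes: name -> levels along the candidate chain; gammas: name -> regulation threshold."""
    # Regulation requirement: every gene follows the chain, upwards or downwards.
    for name, levels in genes.items():
        if chain_direction(levels, gammas[name]) is None:
            return False
    # Coherence requirement: scores of every gene pair differ by less than eps.
    scores = {name: coherence_scores(levels) for name, levels in genes.items()}
    for a, b in combinations(genes, 2):
        if any(abs(x - y) >= eps for x, y in zip(scores[a], scores[b])):
            return False
    return True


genes = {
    "g1": [2.0, 8.0, 11.0, 20.0],
    "g2": [28.0, 22.0, 19.0, 10.0],   # g2 = -g1 + 30
    "g3": [2.8, 5.2, 6.4, 10.0],      # g3 = 0.4 * g1 + 2
}
valid = is_reg_cluster(genes, {"g1": 2.0, "g2": 2.0, "g3": 1.0}, eps=0.1)
too_strict = is_reg_cluster(genes, {"g1": 2.0, "g2": 2.0, "g3": 1.5}, eps=0.1)
```

Raising the threshold for g3 to 1.5 breaks the chain at its smallest step (about 1.2), so the same genes no longer qualify.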

Algorithm and pruning The basic idea of the algorithm is to systematically identify the representative regulation chain for each validated reg-cluster. The algorithm performs a bi-directional depth-first search on the RWave model for representative regulation chains. Four pruning strategies are applied: minimum gene number, minimum condition number, regulation threshold and coherence threshold.

Algorithm and pruning (cont’d) To avoid redundancy due to opposite chains with the same members, the authors also prune regulation chains in which fewer than half of the gene members are positively correlated. They call positively correlated gene members p-members and negatively correlated gene members n-members. Representative chains that survive the pruning steps and have a maximal gene set are output as reg-clusters.
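The redundancy pruning rule can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and p-members are taken to be the genes that are up-regulated along the chain, with n-members down-regulated along it.

```python
def split_members(directions):
    """directions: gene name -> "up" or "down" along the candidate chain.

    Returns (p_members, n_members) under the assumption stated above.
    """
    p_members = [g for g, d in directions.items() if d == "up"]
    n_members = [g for g, d in directions.items() if d == "down"]
    return p_members, n_members


def keep_chain(directions):
    """Keep a chain only if at least half of its gene members are p-members."""
    p_members, n_members = split_members(directions)
    return 2 * len(p_members) >= len(p_members) + len(n_members)


chain = {"g1": "up", "g2": "down", "g3": "up"}
# The opposite chain has the same genes with every direction inverted.
opposite = {g: ("down" if d == "up" else "up") for g, d in chain.items()}

kept = keep_chain(chain)
kept_opposite = keep_chain(opposite)
```

With an odd number of members exactly one of the two opposite chains survives; an even split keeps both under this simple rule and would need an additional tie-break.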

Algorithm

Efficiency The running time of reg-cluster is evaluated on synthetic datasets

Effectiveness The authors run reg-cluster on a benchmark 2D yeast gene expression dataset. They identify three bi-clusters that previous algorithms fail to identify.

Biological Significance Evaluation The yeast genome gene ontology term finder is used to evaluate the biological significance of the bi-clusters in three categories.

Conclusions Strengths of reg-cluster: it identifies arbitrary shifting-and-scaling co-regulation patterns. It addresses both positive and negative correlation in the subspace. It allows a flexible regulation threshold to quantify up- or down-regulation. Experiments showed that the bi-clusters found are biologically significant in a variety of biological processes.

Discussion How should the regulation and coherence thresholds be chosen to reach a satisfactory tradeoff between sensitivity and specificity? The proposed model can only handle linear correlation between co-regulated genes, so it will still miss the many cases where co-regulated genes have non-linear patterns. Do we need a measure of similarity between bi-clusters in order to combine those that take part in similar biological processes?
