Clustering of Gene Expression Time Series with Conditional Random Fields Yinyin Yuan and Chang-Tsun Li Computer Science Department.


1 Clustering of Gene Expression Time Series with Conditional Random Fields Yinyin Yuan and Chang-Tsun Li Computer Science Department

2 Microarray and Gene Expression: A microarray is a high-throughput technique that can assay the expression levels of a large number of genes in a tissue. A gene's expression level is the relative amount of mRNA produced at a specific time point and under given experimental conditions. Microarrays thus provide a means to decipher the logic of gene regulation by monitoring the expression of all genes in a tissue.

3 Gene Expression: Gene expression data are obtained from microarrays and organized into a gene expression matrix, which is then analyzed with various methodologies for medical and biological purposes.

4 Gene Expression Time Series: A time series is a sequence of gene expression values measured at successive time points, at either uniform or uneven intervals. Time series reveal more information than static data because successive points are strongly correlated. Time series clustering rests on the assumption that co-expression indicates co-regulation, so clustering identifies genes that share similar functions.

5 Probabilistic models: A key challenge in gene expression time series research is the development of efficient and reliable probabilistic models. Such models allow measurement of uncertainty, give an analytical measure of confidence in the clustering result, indicate the significance of a data point, and reflect temporal dependencies among the data points.

6 Goal: Identify highly informative genes; cluster the genes in the dataset; perform GO (Gene Ontology) analysis of biological function for each cluster.

7 HMMs and CRFs: HMMs are trained to maximize the joint probability of a set of observed data and their corresponding labels. Independence assumptions are needed to keep them computationally tractable, so representing long-range dependencies between genes and gene interactions is computationally infeasible.

8 Conditional Random Fields: CRFs are undirected graphical models that define a probability distribution over label sequences, globally conditioned on a set of observed features. X = {x_1, x_2, …, x_n} is the variable over the observations; Y = {y_1, y_2, …, y_n} is the variable over the corresponding labels. For each sample x_i, the model uses the observed data x_j and class labels y_j for all j in its voting pool N_i.

9 CRF Model: The CRF model can be expressed in a Gibbs form in terms of cost functions. The CRF model can be formulated as follows:

10 Cost function: The conditional random field model can also be expressed in a Gibbs form in terms of cost functions. Cost function:
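For orientation, a generic Gibbs form for the label of one sample given its voting pool can be written as below. This is a sketch in assumed notation, not the formula from the slide: U_i (the cost function) and Z_i (the normalizing constant) are names introduced here, and the specific cost function used in the talk is not reproduced.

```latex
P\left(y_i \mid x_i, x_{N_i}, y_{N_i}\right)
  = \frac{1}{Z_i} \exp\!\left(-U_i\left(y_i \mid x_i, x_{N_i}, y_{N_i}\right)\right),
\qquad
Z_i = \sum_{y'} \exp\!\left(-U_i\left(y' \mid x_i, x_{N_i}, y_{N_i}\right)\right)
```

Under this form, the label with the highest conditional probability is the one with the lowest cost, since Z_i does not depend on the candidate label.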

11 Potential function: Real-valued potential functions are obtained and used to form the cost function. D: the estimated threshold dividing the set of Euclidean distances into intra- and inter-class distances.

12 Finding the optimal labels: We adopt deterministic label selection; the optimal label is determined by

13 Pre-processing: Linear warping is used for data alignment. Data with τ time points are transformed into a (τ-1)-dimensional feature space. Differences between consecutive time points, inversely proportional to the time intervals, are used as features because they reflect the temporal structure of the time series. Voting pool: keeps the most similar sample, the most different sample, and k-2 randomly selected samples.
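The two pre-processing ideas on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`slope_features`, `build_voting_pool`) are hypothetical, the linear warping step is omitted, and the most similar and most different samples are found here by exhaustive search over all samples, whereas the talk refines the pool progressively.

```python
import math
import random


def slope_features(values, times):
    """Map a tau-point series to tau-1 features.

    Each feature is the difference between consecutive measurements
    divided by the length of the time interval between them.
    """
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def build_voting_pool(i, features, k, rng):
    """Return k sample indices forming the voting pool of sample i.

    Assumption: the pool holds the nearest sample, the farthest sample,
    and k-2 further samples drawn at random. Requires 2 <= k < len(features).
    """
    others = [j for j in range(len(features)) if j != i]
    nearest = min(others, key=lambda j: euclidean(features[i], features[j]))
    farthest = max(others, key=lambda j: euclidean(features[i], features[j]))
    rest = [j for j in others if j not in (nearest, farthest)]
    return [nearest, farthest] + rng.sample(rest, k - 2)
```

For example, `slope_features([1.0, 3.0, 6.0], [0, 1, 3])` divides the differences 2.0 and 3.0 by the intervals 1 and 2, so unevenly spaced time points are put on a common scale.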

14 Process: Initialization: each sample is assigned a random label, and voting pools are formed randomly. Samples then interact with each other progressively via their voting pools: update the labels, then update the voting pools, until steady.
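The loop described on this slide can be sketched roughly as below. The cost function is a stand-in, since the talk's own formula is not reproduced in this transcript: here each pool member j adds (distance to j minus D) to the cost of its label, and the threshold D is simply the mean pairwise distance. Both choices, and the names `cluster`, `n_labels` and `max_iter`, are assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cluster(features, k, n_labels, max_iter=50, seed=0):
    """Iterative label/pool updates; requires 2 <= k < len(features)."""
    rng = random.Random(seed)
    n = len(features)
    dist = [[euclidean(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    # Stand-in threshold D: mean of all pairwise distances (assumption).
    D = sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)

    # Initialization: random labels and random voting pools.
    labels = [rng.randrange(n_labels) for _ in range(n)]
    pools = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]

    for _ in range(max_iter):
        changed = False
        for i in range(n):
            # Stand-in cost: near pool members (distance < D) lower the cost
            # of their label, far members raise it. The current label starts
            # at cost 0 and wins ties because it is inserted first.
            cost = {labels[i]: 0.0}
            for j in pools[i]:
                cost[labels[j]] = cost.get(labels[j], 0.0) + dist[i][j] - D
            best = min(cost, key=cost.get)  # deterministic label selection
            if best != labels[i]:
                labels[i] = best
                changed = True

            # Pool update: keep the nearest and farthest current members,
            # refill the remaining slots with randomly drawn samples.
            nearest = min(pools[i], key=lambda j: dist[i][j])
            farthest = max(pools[i], key=lambda j: dist[i][j])
            keep = [nearest] if nearest == farthest else [nearest, farthest]
            candidates = [j for j in range(n) if j != i and j not in keep]
            pools[i] = keep + rng.sample(candidates, k - len(keep))
        if not changed:  # "until steady": no label changed in a full sweep
            break
    return labels
```

Because the pools are refilled at random on every sweep, a sample is eventually compared against many others even though each pool is small, which is the global interaction the slides attribute to the voting pool.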

15 Experimental Validation: Both a biological dataset and a simulated dataset are used. Adjusted Rand index: a similarity measure between two partitions. Yeast galactose dataset: gene expression measurements of galactose utilization in Saccharomyces cerevisiae; a subset of measurements of 205 genes whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings; 4 repeated measurements across 20 time points.
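The adjusted Rand index mentioned above can be computed from the contingency table of two partitions. The following is a small self-contained sketch of the standard pair-counting formula; the function name `adjusted_rand_index` is chosen here for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples.

    1.0 means identical partitions (up to renaming of labels); values
    near 0 are what chance agreement would give.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same samples")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0

    # Pairs placed together in both partitions, in A only, and in B only.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / total_pairs
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:
        # Only happens when both partitions are a single cluster or both
        # are all singletons, i.e. the partitions are identical.
        return 1.0
    return (together - expected) / (maximum - expected)
```

Label names do not matter: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` gives 1.0, while the crossed partition `[0, 1, 0, 1]` scores below zero.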

16 Results for Yeast galactose dataset: Figure: the four functional categories of the Yeast galactose dataset. Figure: experimental results on the Yeast galactose dataset. We obtained an average Rand index value of 0.943 in 10 experiments, greater than the result of 0.7 in Tjaden et al. 2006.

17 Simulated Dataset: Data are generated for 400 genes across 20 time points from six artificial patterns that model periodic, up-regulated, and down-regulated gene expression profiles. High Gaussian noise is added. Perfect partitions are obtained with 10 iterations.
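A dataset of this shape could be generated along the following lines. The six pattern formulas and the noise level `noise_sd` are invented for illustration; the slides only state that the patterns model periodic, up-regulated and down-regulated profiles and that the Gaussian noise is high.

```python
import math
import random


def make_patterns(n_points):
    """Six hypothetical noise-free profiles on n_points time points."""
    t = [i / (n_points - 1) for i in range(n_points)]
    return [
        [math.sin(2 * math.pi * u) for u in t],  # periodic
        [math.cos(2 * math.pi * u) for u in t],  # periodic, shifted phase
        [u for u in t],                          # up-regulated, linear
        [math.sqrt(u) for u in t],               # up-regulated, fast rise
        [1 - u for u in t],                      # down-regulated, linear
        [1 - math.sqrt(u) for u in t],           # down-regulated, fast drop
    ]


def simulate(n_genes=400, n_points=20, noise_sd=0.5, seed=0):
    """Return (data, labels): noisy profiles and their true pattern index."""
    rng = random.Random(seed)
    patterns = make_patterns(n_points)
    data, labels = [], []
    for g in range(n_genes):
        c = g % len(patterns)  # cycle through the patterns
        data.append([v + rng.gauss(0.0, noise_sd) for v in patterns[c]])
        labels.append(c)
    return data, labels
```

The returned `labels` serve as the ground-truth partition against which a clustering result can be scored.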

18 Conclusions: A novel unsupervised Conditional Random Fields model is proposed for efficient and accurate clustering of gene expression time series. All data points are randomly initialized. The randomness of the voting pool facilitates global interactions.

19 Future work: Various similarity measures; taking advantage of information from repeated measurements; training and testing procedures.

