
1 An unsupervised conditional random fields approach for clustering gene expression time series Chang-Tsun Li, Yinyin Yuan and Roland Wilson Bioinformatics, 2008

2 Outline Introduction Methods Results and Discussion Conclusions

3 Introduction Gene expression Gene expression time-series data Conditional random fields model Rand index

4 Gene expression Gene expression: the process by which inheritable information from a gene (the DNA sequence) is made into a functional gene product (protein or RNA). It is a temporal process, so time-course data of gene expression need to be recorded.

5 Gene expression data For example, the Cho mitotic cell-cycle dataset: 17 time points of expression data for synchronized yeast cells, with expression levels measured at 10-minute intervals after release from cell-cycle arrest.

6 Gene expression data Co-expression indicates co-regulation or related roles in functional pathways, so an efficient clustering method is needed to identify groups of co-expressed genes.

7 Conditional Random Fields A conditional random field (CRF) is a discriminative probabilistic model most often used for labeling or parsing sequential data. Let X be a random variable over the observations and Y a random variable over the corresponding labels. A CRF models the conditional distribution of Y given X.
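The slide's equation did not survive transcription; the standard CRF form it refers to, conditioned on the observations X, can be written as follows (standard notation; the paper's exact potential functions are not reproduced here):

```latex
P(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\!\Big( -\sum_{i=1}^{n} U\big(y_i \mid x_i, y_{N_i}\big) \Big)
```

where Z(X) is the normalizing partition function, U is a cost (energy) function, and y_{N_i} denotes the labels of the neighbourhood of series i.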

8 Rand index The Rand index is a measure of the similarity between two clusterings of the same data. It takes values between 0 and 1, with 0 indicating that the two clusterings disagree on every pair of points and 1 indicating that they are identical.
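The definition above can be computed directly as the fraction of point pairs on which two labelings agree; a minimal stdlib sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree:
    a pair agrees if both labelings put it in the same cluster,
    or both put it in different clusters."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Identical partitions (up to label names) score 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Note that label names do not matter, only the induced partition of the points.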

9 Methods Two steps: data transformation and model formulation. The proposed algorithm: the CRFs clustering algorithm.

10 Data Transformation How are the gene expression time series transformed into a multi-dimensional feature space? Two things must be taken into account: temporal correlation and the time intervals between samples.

11 Data Transformation Let n denote the number of time series, i.e. the number of genes. Each time series T_i of length τ is mapped to a point x_i in feature space, taking into account both the length of each time interval and the temporal correlation. Each τ-time-point time series is thus transformed into a (τ-1)-dimensional feature vector.
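One plausible reading of this mapping (an assumption, not the paper's exact formula) is the change in expression per unit time over each of the τ-1 intervals, which reflects both temporal order and unequal sampling intervals:

```python
def transform(series, times):
    """Map a tau-point time series to a (tau - 1)-dimensional feature
    vector. Hypothetical reading of the slide's mapping: the slope
    (change per unit time) over each consecutive interval."""
    return [(series[t + 1] - series[t]) / (times[t + 1] - times[t])
            for t in range(len(series) - 1)]

# A 3-time-point series sampled at t = 0, 10, 20 minutes -> 2 features.
print(transform([1.0, 3.0, 6.0], [0, 10, 20]))  # -> [0.2, 0.3]
```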

12 Model formulation A set of n observed time series, X = {x_1, x_2, …, x_n}, and a set of corresponding labels, Y = {y_1, y_2, …, y_n}. How is an optimal class label y_i assigned to time series i conditioned on the observed data x_i?

13 Model formulation For unsupervised clustering, two requirements must be met: the method must not rely on cluster centroids, and each time series must be allowed to form a singleton cluster. Two components address these requirements: the voting pool and the cost function.

14 Voting pool When a time series i is visited, its voting pool N_i is formed from k time series: the s most similar (MS) time series, the 1 most different (MD) time series, and k-s-1 time series selected at random.

15 Voting pool Only the class labels currently assigned to the members of the voting pool are taken as candidates; the class label of time series i is decided by its voting pool.

16 Voting pool What purposes do the MS, MD and randomly selected members serve? The MS members indicate which labels to choose; the MD member indicates which label to avoid; the randomly selected members introduce new information.
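The pool composition described above can be sketched as follows. Euclidean distance is an assumption here; the paper may use a different similarity measure:

```python
import math
import random

def voting_pool(i, points, k, s=1, rng=random):
    """Form the voting pool for series i as described on the slides:
    the s most-similar (MS) members, the 1 most-different (MD) member,
    and k - s - 1 members picked at random from the rest."""
    # Rank all other series by distance to series i.
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    ms, md = others[:s], [others[-1]]      # nearest s, plus the farthest
    rest = others[s:-1]                    # candidates for random picks
    return ms + md + rng.sample(rest, k - s - 1)
```

A usage example: with points `[(0,0), (1,0), (2,0), (10,0), (11,0)]` and `i=0, k=3`, the pool always contains index 1 (MS) and index 4 (MD), plus one random member.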

17 Cost function The cost function encourages 'good' labelings with low cost and penalizes 'poor' labelings with high cost. As in a CRF model, the conditional probability distribution and the prior take a Gibbs form.

18 Cost function The two cost functions depend on the same set of arguments, so they can be combined into a single new model.

19 Cost function How is the new cost function set? Neither the number of clusters nor centroid information needs to be specified. Instead, a threshold between intra- and inter-class distances is estimated from m randomly picked time series, using the average of all their minimum distances and the average of all their maximum distances.
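A minimal sketch of this threshold estimate, assuming Euclidean distance and assuming the two averages are combined by their midpoint (the slide does not preserve the exact combination used in the paper):

```python
import math
import random

def estimate_threshold(points, m, rng=random):
    """Estimate the threshold between intra- and inter-class distances
    from m randomly picked series: average the nearest-neighbour
    (minimum) and farthest-neighbour (maximum) distances over the
    sample, then take the midpoint (assumption)."""
    sample = rng.sample(range(len(points)), m)
    mins, maxs = [], []
    for i in sample:
        dists = [math.dist(points[i], points[j])
                 for j in range(len(points)) if j != i]
        mins.append(min(dists))   # likely an intra-class distance
        maxs.append(max(dists))   # certainly an inter-class distance
    return (sum(mins) / m + sum(maxs) / m) / 2
```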

20 Cost function The cost function is defined in terms of this estimated threshold, and each time series is assigned the label that minimizes the cost over the candidate labels in its voting pool.
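The 'good labeling has low cost' idea can be sketched with a simple threshold-based pairwise cost. This is an illustrative stand-in, not the paper's exact cost function: agreeing with a pool member closer than the threshold is rewarded, agreeing with a farther one is penalized, and the reverse for disagreements:

```python
import math

def assign_label(i, pool, labels, points, threshold):
    """Pick, among the labels currently held by the voting pool of
    series i, the one with the lowest total cost (sketch of the
    slide's idea, with an assumed pairwise cost)."""
    def cost(candidate):
        total = 0.0
        for j in pool:
            d = math.dist(points[i], points[j])
            # Same label: cheap if j is nearby; different label:
            # cheap if j is far away.
            total += (d - threshold) if labels[j] == candidate \
                     else (threshold - d)
        return total
    return min({labels[j] for j in pool}, key=cost)
```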

21 CRFs clustering algorithm The CRFs clustering algorithm combines the three components above: data transformation, voting-pool formation and cost-function minimization.

22 Results and Discussion Evaluation on toy cases Experiments on time-series datasets (simulated datasets, Yeast galactose dataset, Yeast cell-cycle dataset) Evaluation of class prediction

23 Evaluation on toy cases How does the voting pool affect performance? Experimental setting: three-dimensional feature patterns; five clusters, each with 30 patterns; voting pool sizes of 4, 6, 10 and 18; one MS, one MD and (k-2) random samples.

24 Evaluation on toy cases The five clusters are randomly created around five fixed centroids: μ1 = (40, 45, 100), μ2 = (85, 60, 100), μ3 = (65, 100, 80), μ4 = (55, 120, 120), μ5 = (120, 50, 120), all with the same variance of 64.
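This toy setting can be reproduced with the standard library (assuming independent Gaussian noise per coordinate, i.e. standard deviation sqrt(64) = 8):

```python
import random

def make_toy_data(seed=0):
    """Recreate the slide's toy setting: five 3-D Gaussian clusters of
    30 patterns each, centred on the listed centroids, with a common
    variance of 64 (standard deviation 8) per coordinate."""
    centroids = [(40, 45, 100), (85, 60, 100), (65, 100, 80),
                 (55, 120, 120), (120, 50, 120)]
    rng = random.Random(seed)
    data, truth = [], []
    for label, mu in enumerate(centroids):
        for _ in range(30):
            data.append(tuple(rng.gauss(m, 8) for m in mu))
            truth.append(label)
    return data, truth
```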

25 Evaluation on toy cases Three cases are compared: MS and MD included, MD excluded, and MS excluded.

26 Evaluation on toy cases The results: without MS, the algorithm can only guess at the correct labels; without MD, it cannot tell which labels are unsuitable.

27 Simulated datasets We synthesized five classes of 60, 60, 60, 80 and 60 gene expression profiles, with Gaussian noise added to all data.

28 Simulated datasets The results: the best score of 0.9903 is reached at pool size k = 12, and the running time tends to decline steadily. Both the clustering accuracy and the efficiency of the proposed method depend on the choice of pool size.

29 Simulated datasets Another experimental setting: three datasets of 320 gene profiles consisting of 4, 5 and 6 clusters, and three datasets of 400 gene profiles consisting of 4, 5 and 6 clusters.

30 Simulated datasets The results: good performance at voting pool sizes 6-12 for the 320-profile datasets and 8-16 for the 400-profile datasets; a pool size in the range of 2-4% of the dataset size is reasonable.

31 Yeast galactose dataset The Yeast galactose dataset consists of gene expression measurements of galactose utilization in Saccharomyces cerevisiae. We compared the adjusted Rand index of our algorithm against CORE (Tjaden, 2006), a supervised k-means clustering method.

32 Yeast galactose dataset The results: good performance at voting pool sizes 5-11, with an optimal adjusted Rand index of 0.9478; the optimal value for CORE, in contrast, is 0.7.
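These comparisons use the adjusted Rand index, the chance-corrected variant of the Rand index introduced earlier, so that random labelings score near 0 while identical partitions still score 1. A self-contained implementation via the contingency table:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: (index - expected index) /
    (max index - expected index), computed from pair counts."""
    n = len(labels_a)
    # Pairs placed together by both labelings.
    pairs_both = sum(comb(c, 2)
                     for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together by each labeling separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    return (pairs_both - expected) / (max_index - expected)
```

For serious use, `sklearn.metrics.adjusted_rand_score` computes the same quantity.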

33 Yeast cell-cycle dataset In the Yeast cell-cycle (Y5) dataset, more than 6000 yeast genes are measured over two cell cycles at 17 time points. The genes are identified based on their peak times in the five phases of the cell cycle: early G1 (G1E), late G1 (G1L), S, G2 and M.

34 Yeast cell-cycle dataset The results: the optimal adjusted Rand index is 0.4808. Within the effective pool-size range, the adjusted Rand index is better than 0.467, the best result among methods that use no labeled data, such as HMMs, k-means and splines.

35 Evaluation of class prediction Genes must be assigned to known classes, i.e. class prediction. The datasets (Y5, Yeast galactose, simulated) are divided into training and testing sets. A high prediction error rate reflects a high rate of misclassification in the training set.

36 Evaluation of class prediction The results: the Y5 dataset shows a high prediction error rate.

37 Conclusion An efficient unsupervised CRFs model for both gene class discovery and class prediction, requiring no a priori knowledge, is presented.

