# Yinyin Yuan and Chang-Tsun Li Computer Science Department

Clustering of Gene Expression Time Series with Conditional Random Fields
Yinyin Yuan and Chang-Tsun Li, Computer Science Department

Microarray and Gene Expression
Microarray is a high-throughput technique that can assay the expression levels of a large number of genes in a tissue. The gene expression level is the relative amount of mRNA produced at a specific time point under given experimental conditions. By monitoring the expression of all genes in a tissue, microarrays provide a means to decipher the logic of gene regulation. Analysis of and learning from these data at the molecular level are revolutionary in medicine because the data are highly informative. Innovative model systems are needed rather than straightforward adaptations of existing methodologies.

Gene Expression
Gene expression data are obtained from microarrays and organized into a gene expression matrix, which is then analysed with various methodologies for medical and biological purposes. Data acquisition comprises microarray image processing for data extraction, followed by transformation of the extracted data into a gene expression matrix for further processing. After image processing, image analysis software normally builds the matrix by organizing the data from multiple hybridizations. Each column describes the expression levels for one gene under a series of experimental conditions. In other words, each position in this matrix characterizes the expression level of one gene in a certain experiment. Obtaining the gene expression matrix is nontrivial: data normalization and treatment of replicate measurements are needed before values relating to the same gene can be compared with one another.
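As a minimal sketch of the replicate treatment and normalization mentioned above: the slides do not say which normalization is used, so averaging replicates and standardizing each profile to zero mean and unit variance are assumptions, and the gene names and values are invented for illustration.

```python
from statistics import mean, pstdev


def average_replicates(replicates):
    """Average replicate measurements condition by condition."""
    return [mean(values) for values in zip(*replicates)]


def standardize(profile):
    """Shift a profile to zero mean and scale it to unit variance (assumed scheme)."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        # A flat profile carries no shape information; map it to zeros.
        return [0.0 for _ in profile]
    return [(value - mu) / sigma for value in profile]


# Hypothetical raw data: gene -> list of replicate series over three conditions.
raw = {
    "geneA": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "geneB": [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0]],
}

# One normalized expression profile per gene.
matrix = {gene: standardize(average_replicates(reps)) for gene, reps in raw.items()}
```

Keeping the data keyed by gene avoids committing to a row or column orientation, which differs between tools.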

Gene Expression Time Series
A gene expression time series is a sequence of gene expression values measured at successive time points, at either uniform or uneven time intervals. In a static experiment only snapshots of gene expression are recorded, whereas a time series experiment measures a temporal process: microarray experiments are performed at consecutive time points to record a series of expression data. While static data are assumed to be independent, time series data have strong correlations between successive points and therefore reveal more information than static data. A time series experiment should accordingly be designed carefully, according to the available resources and the objective of the experiment. Parameters such as the sampling rate and the number of time points have to be decided with the gene regulation under study in mind. As gene expression is a temporal process, a time series is needed to determine the set of genes that are expressed under certain conditions, their expression levels, and the interactions between these genes. This lets us fully utilize the information from the experiments, because a time series reveals the pathway that leads from one state to the next, not just the stable state under a new condition.

Time Series Clustering
The underlying assumption in clustering gene expression data is that co-expression indicates co-regulation, so clustering should identify genes that share similar functions.

Probabilistic models
A key challenge of gene expression time series research is the development of efficient and reliable probabilistic models. Such models allow measurements of uncertainty, give an analytical measurement of the confidence of the clustering result, indicate the significance of a data point, and reflect temporal dependencies in the data points. In response, we propose an unsupervised

Goal
Identify highly informative genes. Cluster genes in the dataset. Perform GO (Gene Ontology) analysis of biological function for each cluster.

HMMs and CRFs
As a popular method for probabilistic sequence data modelling, Dynamic Bayesian Networks (DBNs) are trained to maximize the joint probability of a set of observed data and their corresponding labels \cite{dojer06applying, husmeier03sensitivity}. Hidden Markov Models (HMMs), a special case of DBNs, have previously been applied to sequence data modelling in many fields such as speech recognition. Both have been applied to gene expression time series clustering \cite{schliep05analyzing, ji03mining}. As generative models, both DBNs and HMMs assign a joint probability distribution $P(X,Y)$, where $X$ and $Y$ are random variables ranging over observation and label sequences, respectively. To define such a probability and remain computationally tractable, they have to make independence assumptions, which can be problematic. Furthermore, in gene expression data modelling, generative models cannot represent gene interactions and long-range dependencies between genes, because enumerating all possibilities is intractable.

Conditional Random Fields
CRFs are undirected graphical models that define a probability distribution over label sequences, globally conditioned on a set of observed features. Let $X$ be a random variable over the observations and $Y$ be a random variable over the corresponding labels. In the case of time series clustering, $X=\{x_{1}, x_{2},\ldots,x_{n}\}$ is the set of observed sequences and $Y=\{y_{1}, y_{2},\ldots,y_{n}\}$ is the set of corresponding labels. Let $G = (V, E)$ be an undirected graph such that $Y=(Y_{v})_{v\in V}$. If $Y$ obeys the Markov property with respect to $G$ when conditioned on $X$, then $(X,Y)$ is a conditional random field. When the nodes corresponding to the elements of $Y$ form a linear chain, the cliques are the edges and vertices, as shown. For a sample $x_{i}$, the model uses the observed data $x_{j}$ and class labels $y_{j}$ for all $j$ in its voting pool $N_{i}$.

CRFs Model The CRFs model can be formulated as follows
The conditional random field model of Eq.(\ref{eq1}) can also be expressed in a Gibbs form in terms of the cost functions $U^{c}_{i}(x_{i},x_{N_{i}}|y_{i},y_{N_{i}})$ and $U^{p}_{i}(y_{i}|y_{N_{i}})$, which are associated with the conditional probability and the prior of Eq.(\ref{eq1}), respectively. Since the two cost functions depend on the same set of 'variables', by properly integrating the two, we obtain a new model as

Cost function
The conditional random field model can also be expressed in a Gibbs form in terms of cost functions.

Potential function
Real-valued potential functions are obtained and used to form the cost function. Inferred from the graphical structure of the conditional random field, the potential function aims to factorize the joint distribution over $y$ by operating on pairs of dependent variables, that is, on the edges in $G$. The potential function $W_{i,j}$ is defined based on the Euclidean distance between samples $i$ and $j$, with $D$ the estimated threshold dividing the set of Euclidean distances into intra- and inter-class distances.

Finding the optimal labels
The optimal label $\hat{y_{i}}$ for a sample $i$ can be selected stochastically or deterministically according to Eq.(\ref{eq2}). In our work we adopt deterministic selection, so picking a label corresponding to a large value of $P(\cdot)$ is equivalent to picking a label corresponding to a small value of $U_{i}(\cdot)$. The optimal label $\hat{y_{i}}$ is therefore the one that minimizes the cost $U_{i}(\cdot)$.
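The deterministic selection rule can be sketched as an argmin over candidate labels. The equations for $W_{i,j}$ and $U_{i}$ are not reproduced in this transcript, so the pairwise potential below is an assumed form, not the authors' definition: agreeing with a pool member closer than the threshold `D` lowers the cost, and agreeing with one farther than `D` raises it. The names `cost` and `best_label` and the example pool are hypothetical.

```python
import math


def cost(x_i, label, pool, D):
    """Assumed cost: sum of pairwise potentials over the voting pool."""
    total = 0.0
    for x_j, y_j in pool:
        d = math.dist(x_i, x_j)
        # Assumed sign convention: sharing a label with a near sample
        # (d < D) is rewarded, sharing it with a far sample is penalized.
        total += (d - D) if label == y_j else (D - d)
    return total


def best_label(x_i, pool, D):
    """Deterministic selection: the candidate label with the smallest cost."""
    candidates = sorted({y_j for _, y_j in pool})
    return min(candidates, key=lambda y: cost(x_i, y, pool, D))


# Hypothetical voting pool of (feature vector, current label) pairs.
pool = [([0.0, 0.1], 1), ([0.1, 0.0], 1), ([5.0, 5.0], 2)]
chosen = best_label([0.0, 0.0], pool, D=1.0)
```

Sorting the candidates makes ties resolve to the smallest label, which keeps the selection reproducible.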

Pre-processing
Preprocessing of the time series data, such as alignment and smoothing, is necessary to remove variability in the timing of biological processes and random variation. Linear warping is used for data alignment: it transforms gene expression time series that start at different cell cycle phases and occur on different time scales into comparable data. After data alignment, a simple linear transformation maps the $\tau$-time-point data into a $\tau-1$ dimensional feature space. Differences between consecutive time points, inversely proportional to the time intervals, are used as features because they reflect the temporal structure of the time series. Before the first iteration of the labelling process starts, each sample is assigned a label randomly picked from the integer range $[1, n]$, where $n$ is the number of samples. The algorithm therefore starts with a set of $n$ singleton clusters, without the user specifying the number of clusters. Each individual sample then interacts with its randomly formed \emph{voting pool} to find its own identity progressively. The voting pool keeps one most-similar sample, one most-different sample and $k-2$ randomly selected samples. The rationale supporting our postulate is that the randomness of the voting pool facilitates global interactions in order to make local decisions.
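A minimal sketch of the feature transformation and the voting pool described above. The slope feature follows the description (consecutive differences divided by the time interval). How the most-similar and most-different samples are found is not stated, so searching over all other samples by Euclidean distance is an assumption; `to_features` and `voting_pool` are hypothetical names.

```python
import math
import random


def to_features(values, times):
    """Map tau time points to tau-1 slopes: difference divided by interval."""
    return [
        (values[t + 1] - values[t]) / (times[t + 1] - times[t])
        for t in range(len(values) - 1)
    ]


def voting_pool(i, features, k, rng):
    """Pool for sample i: nearest, farthest, and k-2 random other samples.

    Assumes k <= len(features) - 1 so that enough samples remain to draw from.
    """
    others = [j for j in range(len(features)) if j != i]
    ranked = sorted(others, key=lambda j: math.dist(features[i], features[j]))
    nearest, farthest = ranked[0], ranked[-1]
    # Draw the random members from the samples that are neither extreme.
    return [nearest, farthest] + rng.sample(ranked[1:-1], k - 2)


# Unevenly spaced measurements at times 0, 1 and 3 give two slope features.
slopes = to_features([1.0, 3.0, 6.0], [0, 1, 3])

features = [[0.0], [1.0], [2.0], [10.0]]
pool = voting_pool(0, features, k=3, rng=random.Random(0))
```

Dividing by the interval keeps unevenly sampled series comparable, since a large jump over a long gap is not mistaken for a steep change.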

Process
Initialization: each sample is assigned a random label and voting pools are formed randomly. Samples then interact with each other progressively via their voting pools: update the labels, update the voting pools, and repeat until the labelling is steady.
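The process can be sketched as a loop under stated assumptions. The pairwise potential is an assumed form (the transcript omits the equation), the threshold `D` is passed in rather than estimated, labels are updated in place, and `max_iter` is a hypothetical safeguard in case the labelling never becomes steady.

```python
import math
import random


def cost(i, label, pool, features, labels, D):
    """Assumed cost of giving sample i this label, summed over its pool."""
    total = 0.0
    for j in pool:
        d = math.dist(features[i], features[j])
        total += (d - D) if labels[j] == label else (D - d)
    return total


def make_pool(i, features, k, rng):
    """Nearest, farthest and k-2 random other samples (assumed construction)."""
    others = [j for j in range(len(features)) if j != i]
    ranked = sorted(others, key=lambda j: math.dist(features[i], features[j]))
    return [ranked[0], ranked[-1]] + rng.sample(ranked[1:-1], k - 2)


def cluster(features, D, k=4, max_iter=50, seed=0):
    """Iterate label and pool updates until no label changes."""
    rng = random.Random(seed)
    n = len(features)
    # Initialization: a random label from [1, n] for every sample.
    labels = [rng.randint(1, n) for _ in range(n)]
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            pool = make_pool(i, features, k, rng)  # pool is re-formed each pass
            candidates = sorted({labels[i]} | {labels[j] for j in pool})
            new = min(
                candidates,
                key=lambda y: cost(i, y, pool, features, labels, D),
            )
            if new != labels[i]:
                labels[i] = new
                changed = True
        if not changed:  # steady state reached
            break
    return labels


# Hypothetical one-dimensional features forming two well-separated groups.
features = [[0.0], [0.1], [0.2], [9.0], [9.1], [9.2]]
labels = cluster(features, D=1.0)
```

The number of clusters is never specified: it is whatever number of distinct labels survives when the loop stops.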

Experimental Validation
We use a biological dataset as well as simulated datasets for validation. It is widely accepted that the accuracy of an algorithm should be tested on both biological and simulated datasets. Simulated datasets are necessary because the biological meaning of real datasets is very often unclear. They also provide more controllable conditions for testing an algorithm and a standard for benchmarking. Their disadvantage is that they can overlook the real process and lose important biological features. Agreement is measured with the adjusted Rand index, a similarity measure between two partitions. The yeast galactose dataset, consisting of gene expression measurements in galactose utilization in {\it Saccharomyces cerevisiae}, is used in our experiment. We use a subset of measurements of 205 genes whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings \cite{ashburner00gene}, as illustrated in Fig. \ref{fig bio}. The experiments were conducted with four repeated measurements across 20 time points. We evaluate the accuracy of the algorithm by comparing the clustering result with the four functional categories as the ground truth.
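For reference, the adjusted Rand index can be computed from the contingency table of two partitions. The sketch below uses the standard pair-counting formula; the function name and the convention of returning 1.0 for degenerate inputs are choices made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(truth)
    if n < 2:
        return 1.0  # convention chosen here: too few samples to disagree
    # Pairs of samples placed together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    # Pairs placed together within each partition separately.
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = in_truth * in_pred / comb(n, 2)
    maximum = (in_truth + in_pred) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (together - expected) / (maximum - expected)


# Cluster names do not matter, only which samples are grouped together.
same = adjusted_rand_index([0, 0, 1, 1], [5, 5, 7, 7])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions and close to 0 for random agreement, and it can be negative when agreement is worse than chance.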

Results for Yeast galactose dataset
Figures: the four functional categories of the yeast galactose dataset; experimental results on the yeast galactose dataset (17 iterations). Over 10 experiments we obtained an average Rand index value greater than the 0.7 reported by Tjaden et al.

Simulated Dataset
Following \cite{schliep05analyzing, medvedovic04bayesian}, we generate simulation data for $n=400$ genes across $\tau=20$ time points from six artificial patterns that model periodic, up-regulated and down-regulated gene expression profiles. In particular, four classes are generated from sine waves with frequency and phase randomness relative to each other, and two classes are generated with linear functions as in Eq.(\ref{eq7}). High Gaussian noise $\varepsilon_{i}$ is added to all of the data, as demonstrated in Fig. \ref{fig sim}. Perfect partitions are obtained with 10 iterations.
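A dataset of this shape could be generated along the following lines. Eq.(\ref{eq7}) is not reproduced in the transcript, so the frequency and phase ranges, the slopes of the linear classes, the noise level and the class sizes are all assumptions chosen for illustration.

```python
import math
import random


def simulate(n_genes=400, n_times=20, noise_sd=0.5, seed=0):
    """Generate noisy profiles from four sine classes and two linear classes."""
    rng = random.Random(seed)
    # Assumed parameters: one random (frequency, phase) pair per sine class.
    sine_classes = [
        (rng.uniform(0.5, 2.0), rng.uniform(0.0, 2.0 * math.pi))
        for _ in range(4)
    ]
    data, classes = [], []
    for gene in range(n_genes):
        c = gene % 6  # assign classes round-robin (sizes are an assumption)
        profile = []
        for t in range(n_times):
            u = t / (n_times - 1)  # time rescaled to [0, 1]
            if c < 4:
                freq, phase = sine_classes[c]
                base = math.sin(2.0 * math.pi * freq * u + phase)  # periodic
            elif c == 4:
                base = 2.0 * u - 1.0  # up-regulated
            else:
                base = 1.0 - 2.0 * u  # down-regulated
            profile.append(base + rng.gauss(0.0, noise_sd))  # Gaussian noise
        data.append(profile)
        classes.append(c)
    return data, classes


data, classes = simulate()
```

Because the class of every gene is known, the returned `classes` list serves as ground truth when scoring a clustering result.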

Conclusions
A novel unsupervised Conditional Random Fields model for efficient and accurate gene expression time series clustering. All data points are randomly initialized. The randomness of the voting pool facilitates global interactions.

Future work
Various similarity measurements. Taking advantage of information from repeated measurements. Training and testing procedures.