GPX: Interactive Exploration of Time-series Microarray Data


1 GPX: Interactive Exploration of Time-series Microarray Data
Daxin Jiang, Jian Pei, and Aidong Zhang
Motivations
Specific features of time-series microarray data
Special requirements from the domain of biology
Most clustering algorithms may not address these problems effectively

2 Time-series Microarray Data
Gene expression levels are monitored at different time points during a time series.

3 Co-expressed Genes and Coherent Patterns
Parallel coordinates for Iyer’s data
Examples of co-expressed genes and coherent patterns in gene expression data
[1] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.

4 Example – Cell Cycle
The cell cycle: Early G1 phase, Late G1 phase, S phase, G2 phase, M phase
Expression patterns of cell-cycle regulated genes of yeast reported by Spellman et al.
[2] Spellman et al. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9.

5 Cluster Analysis
Partition the data set into several disjoint clusters
Each cluster is a group of co-expressed genes.
The centroid of the cluster is the coherent pattern.
Various clustering methods:
Partition-based approaches
Hierarchical approaches
Density-based approaches
……
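As a minimal sketch of "the centroid of the cluster is the coherent pattern", the snippet below averages the profiles of one cluster attribute by attribute. The function name `centroid` and the toy profiles are illustrative assumptions, not taken from the slides.

```python
def centroid(cluster):
    """Attribute-wise mean of the expression profiles in one cluster."""
    size = len(cluster)
    # zip(*cluster) groups the values measured at the same time point.
    return [sum(values) / size for values in zip(*cluster)]


# Three hypothetical co-expressed genes measured at four time points.
cluster = [
    [0.0, 1.0, 2.0, 1.0],
    [0.2, 1.2, 1.8, 0.8],
    [0.1, 0.8, 2.2, 1.2],
]
pattern = centroid(cluster)
print(pattern)
```

The centroid has one value per time point, so it can be plotted on the same axes as the member genes.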

6 What the Data Look Like
L. Zhang et al. Enhanced Visualization of Time Series through Higher Fourier Harmonics. BIOKDD 2003

7 High Connectivity of the Data
Two genes, ga and gb, with completely different patterns, connected by a “bridge”

8 Hierarchies of Co-expressed Genes and Coherent Patterns
The interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge

9 To Split or Not
Dependent on “domain knowledge”
group A1, group A2

10 Which Split Option to Choose
Dependent on “domain knowledge”
Various split options may correspond to different hypotheses regarding gene function.

11 What is a “Good” Clustering Algorithm
Form a hierarchical structure
Flexible and convenient to derive clusters
Support users’ domain knowledge
Handle the high connectivity effectively

12 Partition-based Approaches
Form a hierarchical structure? Yes, if we use it as the split strategy in the divisive approach
Flexible and convenient to derive clusters? No, since the parameters are hard to determine
Handle the high connectivity effectively? No, since it partitions the data set by force

13 Cut cluster borders by force

14 Hierarchical Approaches
Form a hierarchical structure? Sure
Flexible and convenient to derive clusters? Global threshold: convenient but not flexible
Handle the high connectivity effectively? Depends on which inter-cluster measure is used, e.g., complete-link may be better than single-link

15 Density-based Approaches
Form a hierarchical structure? Not explicitly; possible if we adjust the parameters level by level
Flexible and convenient to derive clusters?
DBSCAN and DENCLUE use global thresholds: not flexible
OPTICS plots the cluster structure: both flexible and convenient

16 Density-based Approaches
Handle the high connectivity effectively?
DBSCAN and OPTICS are not effective: “indirectly density-reachable” forms a chain
DENCLUE cuts the cluster border by force
Center-defined clusters: a local maximum of density is the “center” of a cluster; other objects in the cluster are “attracted” to the local maximum

17 Our Solution – An Interactive Approach
Adopt a divisive approach to form a hierarchical structure
Users can choose whether to split or not
Still need one parameter: robust and easy to determine
Plot the cluster structure of the data set
Users can explore the data set by “drill down” and “roll up” operations based on their domain knowledge
Apply a novel strategy to handle the high connectivity
Users can determine the cluster border

18 Pattern-based Strategy
To find co-expressed genes and coherent expression patterns
Cluster-based strategy
First find clusters as co-expressed genes
Then use centroids as coherent expression patterns
Pattern-based strategy
First find coherent expression patterns
Then determine the co-expressed genes conforming to the pattern
[Figure: a coherent pattern and a table of genes (g_i1, g_i2, ..., g_in) listed with their similarity to it, with values ranging from 0.99 down to -0.55]
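The second step of the pattern-based strategy, determining which genes conform to a given pattern, can be sketched as ranking genes by their similarity to that pattern. This is a minimal sketch, assuming Pearson's correlation coefficient as the similarity; the helper names `pearson` and `rank_by_pattern` and the toy profiles are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length profiles."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rank_by_pattern(pattern, genes):
    """Return (similarity, gene name) pairs, most similar gene first."""
    scored = [(pearson(pattern, profile), name) for name, profile in genes.items()]
    return sorted(scored, reverse=True)


# Hypothetical coherent pattern and gene profiles (assumed data).
pattern = [1.0, 2.0, 3.0, 4.0]
genes = {
    "g1": [1.1, 2.0, 3.2, 3.9],
    "g2": [2.0, 2.5, 2.4, 3.5],
    "g3": [4.0, 3.1, 2.2, 0.9],
}
ranking = rank_by_pattern(pattern, genes)
for similarity, name in ranking:
    print(name, round(similarity, 3))
```

Genes near the top of the ranking conform to the pattern; where to cut the list is left to the user.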

19 Distance Measure
Users are interested in overall shape
Euclidean distance does not work well
Figure panels: Shifting patterns, Scaling patterns
Normalize each data object O to O’ with a mean of 0 and a variance of 1
An object: $O = (o_1, o_2, \ldots, o_m)$
After normalization: $O' = (o'_1, o'_2, \ldots, o'_m)$, where $o'_i = \frac{o_i - \mu_O}{\sigma_O}$
m is the number of attributes; $\mu_O$ and $\sigma_O$ are the mean and the standard deviation of O, respectively.
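The normalization step can be sketched as follows. This is an illustrative sketch that assumes the population standard deviation (dividing by m); the slide does not say which form is used. The function name `normalize` and the sample values are hypothetical. It shows that a shifted copy and a scaled copy of a profile normalize to the same object, which is why the normalized form captures overall shape.

```python
import math


def normalize(obj):
    """Rescale a profile to mean 0 and variance 1 (population std, an assumption)."""
    m = len(obj)
    mean = sum(obj) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in obj) / m)
    if std == 0:
        raise ValueError("a constant profile has no shape to normalize")
    return [(x - mean) / std for x in obj]


base = [1.0, 3.0, 2.0, 5.0]
shifted = [x + 10.0 for x in base]  # shifting pattern
scaled = [x * 4.0 for x in base]    # scaling pattern

n_base = normalize(base)
n_shifted = normalize(shifted)
n_scaled = normalize(scaled)
print(n_base)
```

The raw Euclidean distance between `base` and `shifted` is large even though the shapes are identical; after normalization the distance is zero up to rounding.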

20 Distance Measure
Similarity and Distance between two genes (objects)
The similarity and distance measures defined above are consistent
Given objects O1, O2, O3 and O4, Similarity(O1,O2) ≥ Similarity(O3,O4) if and only if Distance(O1,O2) ≤ Distance(O3,O4)
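The consistency claim can be checked numerically. The sketch below assumes that similarity is Pearson's correlation coefficient and that distance is the Euclidean distance between normalized objects (normalized with the population standard deviation); the exact formulas on the slide are not reproduced in this transcript, so both are assumptions. Under them, the squared distance equals 2m(1 - similarity), so a higher similarity always means a smaller distance.

```python
import math


def normalize(obj):
    """Mean 0, variance 1, using the population standard deviation (assumed)."""
    m = len(obj)
    mean = sum(obj) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in obj) / m)
    return [(x - mean) / std for x in obj]


def similarity(a, b):
    """Pearson's correlation, computed from the normalized objects."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b))) / len(a)


def distance(a, b):
    """Euclidean distance between the normalized objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(normalize(a), normalize(b))))


o1, o2 = [1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 5.0]
o3, o4 = [4.0, 1.0, 3.0, 2.0], [1.0, 2.0, 3.0, 4.0]

m = len(o1)
lhs = distance(o1, o2) ** 2
rhs = 2 * m * (1 - similarity(o1, o2))
print(lhs, rhs)
print(similarity(o1, o2) >= similarity(o3, o4), distance(o1, o2) <= distance(o3, o4))
```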

21 A Density-based Model
A group of co-expressed genes forms a dense area;
Genes at the core area have high density, while genes at the boundary area have low density;
Genes at the boundary area are “attracted” towards the local maximum level by level.

22 Density Measures
Radius-based density
KNN-based density
DENCLUE density

23 Definition of Density
We modify the density definition by DENCLUE [3]
The influence function (attraction function)
Given a data set D, the definition uses d(Oi,Oj), the distance between Oi and Oj, a parameter, and the estimated average similarity within a cluster
[3] Hinneburg, A. et al. An efficient approach to clustering in large multimedia databases with noise. Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998.
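For orientation, here is a sketch of the unmodified DENCLUE-style density with a Gaussian influence function. It is not the modified definition used in GPX, whose formula is not reproduced in this transcript. The parameter name `sigma`, the function names, and the sample points are assumptions.

```python
import math


def gaussian_influence(dist, sigma):
    """Gaussian influence of one object on another at distance `dist`."""
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))


def density(index, objects, sigma):
    """Sum of the influences of all other objects on objects[index]."""
    target = objects[index]
    return sum(
        gaussian_influence(math.dist(target, other), sigma)
        for j, other in enumerate(objects)
        if j != index
    )


# Three nearby objects and one far away (assumed toy data).
objects = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
densities = [density(i, objects, sigma=0.5) for i in range(len(objects))]
print(densities)
```

The middle object of the dense group gets the highest density and the isolated object gets a density close to zero, matching the core versus boundary picture of the density-based model.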

24 Attraction Tree
The “attractor” of object O is its nearest neighbor with a higher density than O, denoted by O → Attractor(O).
We can derive an attraction tree based on the “attractor” relationship.
The weight of each edge e(Oi,Oj) on the attraction tree is defined as the similarity between Oi and Oj.
Use Pearson’s correlation coefficient as the similarity measure.
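The attractor relationship can be sketched as below. This is a minimal sketch under stated assumptions: densities are supplied by the caller, "nearest" is taken to mean the highest Pearson similarity, and the object with the highest density becomes the root. The names `pearson` and `attraction_tree` and the toy data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length profiles."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def attraction_tree(profiles, densities):
    """Map each object to (attractor, edge weight); the root maps to None."""
    tree = {}
    for i, profile in enumerate(profiles):
        # Only objects with strictly higher density can attract object i.
        candidates = [j for j in range(len(profiles)) if densities[j] > densities[i]]
        if not candidates:
            tree[i] = None  # nothing denser exists: this object is a root
            continue
        best = max(candidates, key=lambda j: pearson(profile, profiles[j]))
        tree[i] = (best, pearson(profile, profiles[best]))
    return tree


profiles = [
    [1.0, 2.0, 3.0, 4.0],  # 0: rising pattern, densest object
    [1.0, 2.0, 3.0, 5.0],  # 1: rising pattern
    [4.0, 3.0, 2.0, 1.0],  # 2: falling pattern
    [5.0, 3.0, 2.0, 1.0],  # 3: falling pattern
]
densities = [3.0, 2.0, 2.5, 1.0]  # assumed values, e.g. from a density estimate
tree = attraction_tree(profiles, densities)
print(tree)
```

In this toy example the two falling genes form their own subtree, attached to the root by a single low-weight edge, while edges inside each group have high weights.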

25 An Example of Attraction Tree
An example data set
The attraction tree
Three features of the attraction tree:
self-closed: a group of objects conforming to the same coherent pattern forms an attraction subtree
robust to intermediate genes (noise)
three levels of edge weights

26 Index List
Serialization of the attraction tree
Search the attraction tree based on the edge weight.
Order the genes in the “index list”.
Figure panels: The attraction tree, The index list
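One possible reading of this serialization is a depth-first traversal from the root that visits children in order of decreasing edge weight. The slide does not spell out the search order, so this is an assumption, as are the function name `index_list`, the tree encoding, and the toy tree.

```python
def index_list(tree):
    """Serialize a tree given as {node: (parent, weight)}; the root maps to None.

    Returns (node, weight of the edge to its parent) pairs in visiting order.
    """
    children = {}
    root = None
    for node, link in tree.items():
        if link is None:
            root = node
        else:
            parent, weight = link
            children.setdefault(parent, []).append((weight, node))

    order = []
    stack = [(root, None)]
    while stack:
        node, weight = stack.pop()
        order.append((node, weight))
        # Push children in ascending weight so the heaviest edge is popped first.
        for child_weight, child in sorted(children.get(node, [])):
            stack.append((child, child_weight))
    return order


# Hypothetical attraction tree with two subtrees under the root g1.
tree = {
    "g1": None,
    "g2": ("g1", 0.98),
    "g3": ("g2", 0.97),
    "g4": ("g1", 0.30),
    "g5": ("g4", 0.96),
}
order = index_list(tree)
print(order)
```

Under this reading, genes of one subtree end up next to each other in the list, and the low weight recorded for g4 shows up as a drop in a plot of similarity against list position.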

27 Index list
Similarity curve for Iyer’s data set

28 Coherent Pattern Index Graph
Compute the “coherent pattern index (CPI)” for each gene.
p is a parameter; Sim(gi) is the similarity between gi and its parent gj on the attraction tree.
Figure panels: The index list, The coherent pattern index graph

29

30 Figure panels: Early G1 phase, Late G1 phase, S phase, G2 phase, M phase

31 Validation Measure
Ground truth patterns: P1, P2, P3, P4, …, Pn
Reported patterns: C1, C2, C3, C4, …, Cm
P1 is matched by C4 with similarity (suppose Sim(P1,C4)=0.95)
P2 is matched by C1 with similarity (suppose Sim(P2,C1)=0.9)
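The matching step can be sketched as follows, assuming each ground-truth pattern is matched by the reported pattern with the highest Pearson similarity. The slide does not state the matching rule explicitly, so this rule, the name `match_patterns`, and the toy patterns are assumptions.

```python
import math


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length profiles."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def match_patterns(truth, reported):
    """For each ground-truth pattern, find the most similar reported pattern."""
    matches = {}
    for name, pattern in truth.items():
        best = max(reported, key=lambda c: pearson(pattern, reported[c]))
        matches[name] = (best, pearson(pattern, reported[best]))
    return matches


# Hypothetical ground-truth patterns P* and reported patterns C*.
truth = {"P1": [1.0, 2.0, 3.0, 4.0], "P2": [4.0, 3.0, 2.0, 1.0]}
reported = {"C1": [5.0, 3.0, 2.0, 0.0], "C2": [0.0, 1.0, 3.0, 4.0]}
matches = match_patterns(truth, reported)
for name, (best, sim) in matches.items():
    print(name, "is matched by", best, "with similarity", round(sim, 3))
```

Note that this rule lets two ground-truth patterns share the same reported pattern; whether the original measure allows that is not stated on the slide.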

32 Comparison With Other Approaches
The similarity between the pattern reported by different approaches and the corresponding pattern in the ground truth (if any)
Columns, in order: GPX (10), Kmeans (10), SOM (10), SOTA (50), Adapt (11), CLICK (7), CAST (9)
Pattern 1: 0.998 0.973 0.983 0.962 0.956 0.884 0.955
Pattern 2: 0.996 0.950 0.992 0.936 0.911 0.991 0.887
Pattern 3: 0.993 0.910 0.872 0.947 0.994 0.997
Pattern 4: 0.995 0.989 0.984 0.883 0.968
Pattern 5: 0.964 0.882 0.716 0.868 0.855
Pattern 6: 0.940 0.965 0.764 0.972 0.970
Pattern 7: 0.880 0.892 0.988 0.976 0.990 0.719
Pattern 8: 0.963 0.917 0.958 0.914 0.999
Pattern 9: 0.907 0.848 0.824 0.844 0.800
Pattern 10: 0.987 0.930 0.960 0.981
Rows with fewer than seven values are missing some cells; their values keep the original order and are not aligned to columns.

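33 Comparison With Other Approaches
Columns, in order: GPX (7), Kmeans (5), SOM (5), SOTA (99), Adapt (21), CLICK (9), CAST (19)
Pattern 1: 0.901 0.928 0.194 0.938 0.884 0.855 0.900
Pattern 2: 0.970 0.976 0.972 0.968 0.978
Pattern 3: 0.980 0.950 0.552 0.940 0.953 0.888
Pattern 4: 0.773 0.437 0.961 0.796 0.984
Pattern 5: 0.945 0.965 0.964 0.956 0.962
Rows with fewer than seven values are missing some cells; their values keep the original order and are not aligned to columns.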

34 Comparison with OPTICS
Figure panels: Iyer’s data set, Spellman’s data set

35 Effects of Parameters
Figure panels: Spellman’s data set, Iyer’s data set

36 Scalability
The algorithm scales well with large data sets.
The computation time is dominated by the distance calculation.

