


2 2 The need for Bioinformatics Bioinformatics: Application of computational techniques to the management and analysis of biological information. Clustering techniques applied on data not enough. Need a good visual representation

3 3 Agenda
Microarrays
Review of existing clustering and visualization techniques on gene expression data
The need for a customized visualization tool for use by Dr. Laxmi Parida & Dr. Ruhong Zhou of the computational biology group at the IBM Watson Research Center for visual analysis of protein characteristics
Introduce our new technique that makes use of an animated terrain, implemented in the program called PROTERAN

4 4 Function of Genes & Proteins Through the proteins they encode, genes orchestrate the mysteries of life. Protein functions vary widely, from mechanical support to transportation to regulation.

5 5 Still a lot of work ahead Traditional methods of discovering their functions were done on a gene-by-gene basis, thus throughput was low. Believed that many genes work together; this is not exhibited in a one-by-one fashion.

6 6 Microarrays Solve the throughput problem Allow scientists to see genes on a genomic level

7 7 Expression Matrix

          Experiment 1    Experiment 2    ...   Experiment M
Gene 1    C5_11/C3_11     C5_12/C3_12     ...   C5_1M/C3_1M
Gene 2    C5_21/C3_21     C5_22/C3_22     ...   C5_2M/C3_2M
...       ...             ...             ...   ...
Gene N    C5_N1/C3_N1     C5_N2/C3_N2     ...   C5_NM/C3_NM

8 8 Clustering & Visualization Techniques Review

9 9 Clustering Clustering: Act of grouping similar objects together. Applied to gene expression in order to find the function of unknown genes. Many different clustering techniques in the literature. Representative techniques are discussed next.

10 10 Determining similarity between two genes Choose a similarity distance to compare genes, e.g. Euclidean distance

          Experiment 1    Experiment 2    ...   Experiment M
Gene 1    C5_11/C3_11     C5_12/C3_12     ...   C5_1M/C3_1M
Gene 2    C5_21/C3_21     C5_22/C3_22     ...   C5_2M/C3_2M
...       ...             ...             ...   ...
Gene N    C5_N1/C3_N1     C5_N2/C3_N2     ...   C5_NM/C3_NM
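As a minimal sketch of the distance comparison, each gene can be treated as a vector of M expression ratios, one per experiment. The function name and the sample values below are illustrative assumptions, not data from the presentation.

```python
import math

def euclidean_distance(gene_a, gene_b):
    """Euclidean distance between two expression profiles of equal length."""
    if len(gene_a) != len(gene_b):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene_a, gene_b)))

# Hypothetical expression ratios for two genes across M = 3 experiments.
gene_1 = [1.0, 2.0, 4.0]
gene_2 = [1.0, 5.0, 8.0]
print(euclidean_distance(gene_1, gene_2))  # 5.0
```

A smaller distance means the two genes behave more alike across the experiments.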

11 11 Hierarchical Clustering
1. Create a distance matrix of all genes in relation to each other
2. Find the two closest genes
3. Merge these two genes and recompute the distance matrix
4. Repeat steps 2-3 until only one cluster is left

12 12 Dendrogram Binary tree with a distinguished root, which has all the data items at the leaves Re-orders the expression matrix to place similar genes beside each other

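13 13 Example: Agglomerative Hierarchical Clustering

Initial distance matrix:
       A  B  C  D
A      0  1  6  8
B         0  5  7
C            0  2
D               0

After merging A and B:
       (A,B)  C  D
(A,B)  0      5  7
C             0  2
D                0

After merging C and D:
       (A,B)  (C,D)
(A,B)  0      5
(C,D)         0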
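The four-item example can be replayed in code. The merged distances on the slide (5 and 7 from (A,B) to C and D, then 5 between (A,B) and (C,D)) are consistent with single linkage, where the distance between clusters is the smallest pairwise distance. That linkage rule is inferred here, not stated in the presentation, and the function name is illustrative.

```python
from itertools import combinations

def single_linkage(labels, pair_distance):
    """Agglomerative clustering with single linkage (assumed rule).

    labels: list of item names.
    pair_distance: dict mapping (a, b) -> distance for every pair of items.
    Returns the merges in order as (cluster_1, cluster_2, distance),
    where each cluster is a tuple of item names.
    """
    clusters = [(name,) for name in labels]
    # Step 1: distance matrix between all singleton clusters.
    dist = {frozenset([(a,), (b,)]): d for (a, b), d in pair_distance.items()}
    merges = []
    while len(clusters) > 1:
        # Step 2: find the two closest clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = x + y
        # Step 3: merge them and recompute distances to the new cluster,
        # keeping the smaller of the two old distances (single linkage).
        clusters = [c for c in clusters if c not in (x, y)]
        for c in clusters:
            dist[frozenset([merged, c])] = min(
                dist[frozenset([x, c])], dist[frozenset([y, c])]
            )
        clusters.append(merged)
        merges.append((x, y, dist[frozenset([x, y])]))
    return merges

# Distances from the example: A-B 1, A-C 6, A-D 8, B-C 5, B-D 7, C-D 2.
example = {("A", "B"): 1, ("A", "C"): 6, ("A", "D"): 8,
           ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 2}
for merge in single_linkage(["A", "B", "C", "D"], example):
    print(merge)
```

The sequence of merges is exactly the information a dendrogram draws: each merge becomes an internal node whose height is the merge distance.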

14 14 Advantages Familiar to biologists Few parameters to specify

15 15 Disadvantages Requires fast CPUs and large amounts of memory. Does not identify important clusters. Only represents hierarchically organized data. Does not scale up.

16 16 Disadvantages cont. A dendrogram can always be drawn in 2^(n-1) different leaf orderings (where n = number of elements), because each of its n-1 internal nodes can be flipped.

17 17 Self Organizing Maps (SOMs) User picks number of clusters called nodes Nodes randomly mapped to M-dimensional space (M = # of experiments) Node values are adjusted by random vectors picked from original data After node values settle vectors are clustered to closest node
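The training loop described here can be sketched in a few lines. This is a simplified illustration, not the implementation used by any of the tools reviewed: the 1-D chain of nodes, the linearly shrinking learning rate, and the neighbourhood weighting are all assumptions, as are the function names.

```python
import random

def closest_node(nodes, vector):
    """Index of the node nearest to vector (squared Euclidean distance)."""
    return min(
        range(len(nodes)),
        key=lambda j: sum((n - v) ** 2 for n, v in zip(nodes[j], vector)),
    )

def train_som(data, n_nodes, n_steps=1000, rate=0.5, seed=0):
    """Train a 1-D chain of nodes on data (a list of M-dimensional vectors)."""
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(v[k] for v in data) for k in range(dims)]
    highs = [max(v[k] for v in data) for k in range(dims)]
    # Nodes are mapped to random positions inside the data's bounding box.
    nodes = [[rng.uniform(lows[k], highs[k]) for k in range(dims)]
             for _ in range(n_nodes)]
    for step in range(n_steps):
        vector = rng.choice(data)             # random vector from the data
        alpha = rate * (1 - step / n_steps)   # shrinks so the nodes settle
        winner = closest_node(nodes, vector)
        for j, node in enumerate(nodes):
            # The winner moves most; chain neighbours move less (assumed rule).
            influence = alpha / (1 + abs(j - winner)) ** 2
            for k in range(dims):
                node[k] += influence * (vector[k] - node[k])
    return nodes

def assign(data, nodes):
    """Cluster every vector to its closest node."""
    return [closest_node(nodes, v) for v in data]

# Hypothetical expression profiles forming two loose groups.
profiles = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
trained = train_som(profiles, n_nodes=2)
print(assign(profiles, trained))
```

The number of nodes is the user-chosen number of clusters, which is why it has to be known before training starts.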

18 18 Visualization 1. Dendrogram 2. Error Bar Representation

19 19 Visualization 3. U-Matrix

20 20 Advantages User has partial control over structure Fuzzy Clusters Variety of visual techniques applicable

21 21 Disadvantages Knowledge of number of clusters beforehand Many parameters to specify

22 22 Principal Component Analysis (PCA) Mathematical technique that can be used to reduce the number of dimensions of data
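To illustrate the dimension reduction, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix and projects each row onto it. Power iteration is one common way to do this and is chosen here for brevity; it is an assumption, not the method used by the tools in this review. The function name and data are illustrative.

```python
import math

def first_principal_component(rows, n_iter=100):
    """Return (unit vector of greatest variance, 1-D score of each row)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dims)]
    centered = [[r[k] - means[k] for k in range(dims)] for r in rows]
    # Sample covariance matrix (dims x dims).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to its leading eigenvector. This assumes the data have
    # nonzero variance and the start vector is not orthogonal to that
    # eigenvector; otherwise the norm below could be zero.
    vec = [1.0] * dims
    for _ in range(n_iter):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    # Projecting onto the component reduces each row to a single number.
    scores = [sum(c[k] * vec[k] for k in range(dims)) for c in centered]
    return vec, scores

# Hypothetical 2-D data lying exactly on the line y = 2x.
component, scores = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(component)
print(scores)
```

Keeping the first two or three components in the same way gives the coordinates used for a 2D or 3D scatter plot.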

23 23 Visualization

24 24 Advantages No parameters required 3D Visualization

25 25 Disadvantages Little control over structure. Running time of O(N^3). Not applicable when input is a distance matrix.

26 26 Biclustering Clustering of both rows and columns simultaneously

27 27 Available Software
F-Scan: Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison.
TIGR SpotFinder: Spot identification.
Cluster: Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA. Available at: ...tm
Genesis: A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA. Available at: ...GenesisCenter.html
J-Express Pro 2.0: Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization. Available at: .../frm_jexpress.htm
TreeView: Cluster output visualization. Available at: ...tm

28 28 Protein Folding

29 29 Reaction Coordinates Folding determines the function of a protein. All-atom recreation of protein unrealistic. Reaction coordinates used to describe protein structure:
1. Fraction of Native Contacts
2. Radius of Gyration
3. RMSD from the native structure
4. Number of beta-strand Hydrogen Bonds
5. Number of alpha helix turns
6. Hydrophobic core radius of gyration
7. Principal Components
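Two of these reaction coordinates have simple closed forms, sketched below. The assumptions are loud ones: every atom is given equal weight in the radius of gyration (no mass weighting), and the RMSD is computed without first superimposing the two structures. The function names and coordinates are illustrative.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) positions."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    squared = sum(sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords)
    return math.sqrt(squared / n)

def rmsd(coords, native):
    """Root mean square deviation from the native structure.

    Assumes both lists hold the same atoms in the same order and that the
    structures are already aligned (no superposition is performed).
    """
    if len(coords) != len(native):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(
        sum((a[k] - b[k]) ** 2 for k in range(3)) for a, b in zip(coords, native)
    )
    return math.sqrt(squared / len(coords))

# Two hypothetical atoms placed symmetrically about the origin.
print(radius_of_gyration([(1, 0, 0), (-1, 0, 0)]))  # 1.0
print(rmsd([(0, 0, 0)], [(3, 4, 0)]))               # 5.0
```

Evaluating such functions on every snapshot of a folding trajectory yields one column of characteristic values per reaction coordinate.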

30 30 Protein States While folding, a protein goes through certain states The raw data is similar to microarray data. Dr. Parida and Dr. Zhou have developed their own techniques and clustered β-Hairpin data.

31 31 Reaction Coordinates used on the β-Hairpin
1. Number of native β-strand hydrogen bonds
2. Radius of gyration of the hydrophobic core residues
3. Radius of gyration of entire protein
4. Fraction of native contacts
5. Principal component 1
6. Principal component 2
7. Root mean square deviation (RMSD) from the native structure

32 32 Raw Data

33 33 Patterned Cluster
RED = Number of columns in pattern (also defined as the Pattern Type)
WHITE = Column Number
PURPLE = Column Value
YELLOW = Number of occurrences
GREEN = Occurrences
Example: 2 0 0.14 0.23 3 23 26 27

34 34 Sample Patterned Cluster File 20 7.3351 0.735 1006597288723594826-9483195748-9575295761-95763…120424-120426 20 7.3351 0.736 1003597288723594826-9483195748-9575295761-95763…95769 30 7.3354 -5.8816 3.292 10365972872071872359482694828-94831…95761-95763 30 7.3354 -5.8815 2.214 10565972872071872359482694828-94831…95761-95763 : 52 8.1443 0.8994 -3.8555 -33.5746 3.292 10894553359728720718723594826…95748-95752

35 35 The need for Visual Analysis of Patterned Cluster Data The β-Hairpin file is approximately 500 MB. Difficult to study the textual representation and get a global view. Very difficult to see the interaction of all patterned clusters in relation to each other. Also very difficult to remember all patterned clusters and their occurrence in time.

36 36 Visual Requirements Global View Navigation & Focus Relative growth Details of characteristics on demand

37 37 Need for Customized Tool All of the existing visualization techniques on microarrays had one or more drawbacks None were able to provide a visual for depicting relative growth of clusters.

38 38 Terrain Metaphor Has been shown to be a useful technique in searching a corpus of documents Very recently the idea has been applied to gene expression with high density clusters representing mountains

39 39 Using a Landscape Metaphor to solve our requirements Each mountain represents a patterned cluster Mountain growth represents evolution of patterned cluster Clicking on mountains returns details of patterned cluster


41 41 Mapping of Patterned Cluster Data into Terrain Geometry

42 42 Mapping of Patterned Cluster data into Terrain Geometry
Pattern Type: Number of columns in a patterned cluster
Column Combination: Unique number that identifies a combination of columns
Example: 2 0 0.14 0.23 3 23 26 27

43 43 Column Combinations
Number of column combinations = c! / ((c - t)! * t!)
c = number of characteristics
t = pattern type

Pattern Type    Number of Column Combinations
2               21
3               35
4               35
5               21
6               7
7               1
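The counts follow from the binomial coefficient with c = 7, the seven reaction coordinates used on the β-Hairpin. The sketch below checks the counts and enumerates the combinations, assuming the columns are numbered 0 to 6 and each combination is labelled by concatenating its column numbers; the helper name is illustrative.

```python
from itertools import combinations
from math import comb

N_CHARACTERISTICS = 7  # assumed: the seven reaction coordinates, numbered 0-6

def column_combinations(t, c=N_CHARACTERISTICS):
    """All combinations of t columns out of c, as labels such as '01' or '0123'."""
    return ["".join(str(col) for col in combo) for combo in combinations(range(c), t)]

for t in range(2, N_CHARACTERISTICS + 1):
    labels = column_combinations(t)
    # comb(c, t) equals c! / ((c - t)! * t!)
    print(t, comb(N_CHARACTERISTICS, t), labels[:3])
```

Pattern types 3 and 4 give the same count because choosing 3 of 7 columns to keep is equivalent to choosing 4 of 7 to leave out.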

44 44 Layout We first thought of using an automated layout technique. However, one of Dr. Zhou’s requirements was that the same pattern cluster should appear in the same position for consistent interpretation. Another was that larger pattern types (6 and 7 column) must be very distinguishably placed. Hence it was decided to use a manual layout design described next.

45 45 Layout
01 02 03   01234 01235 01236   012 013 014 015 016
04 05 06   01245 01246 01256   023 024 025 026 034
12 13 14   01345 01346 01356   035 036 045 046 056
15 16 23   01456 02345 02346   123 124 125 126 134
24 25 26   02356 02456 03456   135 136 145 146 156
34 35 36   12345 12346 12356   234 235 236 245 246
45 46 56   12456 13456 23456   256 345 346 356 456
0123 0124 0125 0126 0134
0135 0136 0145 0146 0156
012345 012346 012356   0234 0235 0236 0245 0246
0123456 012456 013456 023456   0256 0345 0346 0356 0456
123456   1234 1235 1236 1245 1246
1256 1345 1346 1356 1456
2345 2346 2356 2456 3456

46 46 Top Patterned Clusters Visualized The final requirement by Dr. Parida and Dr. Zhou is that only the top 10 largest patterned clusters of each column combination should be visualized.
Figure labels for combination 01, in the order they appear: 10th, 9th, 2nd, 3rd, 8th, highest, 4th, 7th, 6th, and 5th highest occurrence.
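Selecting the ten largest patterned clusters per column combination is a standard top-k selection. The sketch below assumes each cluster is available as a (combination, cluster id, number of occurrences) record; that record layout and the function name are hypothetical, not the PROTERAN file format.

```python
import heapq
from collections import defaultdict

def top_clusters(clusters, k=10):
    """Keep the k clusters with the most occurrences for each combination.

    clusters: iterable of (combination, cluster_id, n_occurrences).
    Returns {combination: [(n_occurrences, cluster_id), ...]}, largest first.
    """
    by_combination = defaultdict(list)
    for combination, cluster_id, count in clusters:
        by_combination[combination].append((count, cluster_id))
    return {combo: heapq.nlargest(k, items) for combo, items in by_combination.items()}

# Hypothetical records; k=2 keeps the output short.
records = [("01", "a", 5), ("01", "b", 9), ("01", "c", 1), ("02", "d", 3)]
print(top_clusters(records, k=2))
# {'01': [(9, 'b'), (5, 'a')], '02': [(3, 'd')]}
```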


48 48 Animated Terrain Evolution Time proceeds from 0 to the maximum number of experiments Each time unit all patterned clusters are checked If there is an occurrence the mountain’s height is increased
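The growth rule can be sketched as a loop over time units. This is an illustration under stated assumptions, not PROTERAN's code: each patterned cluster is represented as a set of the time units at which it occurs, every occurrence adds a fixed amount to the mountain's height, and the function name is hypothetical.

```python
def terrain_heights(occurrences, n_steps, growth=1.0):
    """Yield (time, heights) after each time unit.

    occurrences: dict mapping cluster id -> set of time units at which the
    patterned cluster occurs (an assumed representation).
    """
    heights = {cluster: 0.0 for cluster in occurrences}
    for t in range(n_steps):
        # Each time unit, every patterned cluster is checked.
        for cluster, times in occurrences.items():
            if t in times:
                heights[cluster] += growth  # an occurrence raises the mountain
        yield t, dict(heights)

# Two hypothetical patterned clusters over three time units.
for frame in terrain_heights({"a": {0, 2}, "b": {1}}, n_steps=3):
    print(frame)
```

Rendering one frame per yielded state gives the animation, so the relative growth of the mountains over time stays visible.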

49 49 Mountains of PROTERAN

50 50 Results & Extensions

51 51 Results Very encouraging feedback. The easy-to-use layout and interface allow:
1. Identification of states
2. Obtaining the values of patterned clusters
3. Seeing the relation of patterned clusters to each other as they grow over time
Even in initial use, Dr. Zhou said that he "was able to find that the hydrophobic core is largely formed before the beta-strand hydrogen bonds are formed."

52 52 Future of PROTERAN Introduced at the Intelligent Systems for Molecular Biology (ISMB) conference in Scotland, where it was very well received. Robert-Cedergren Bioinformatics Colloquium at University of Montreal (Sept 23-24)

53 53 Extensions Analyze with different types of protein data More generic layout with more characteristics Application with different types of data

54 54 Summary
1. Review of existing techniques to cluster and visualize gene expression data
2. Protein characteristics data is similar to gene expression data
3. None of the existing techniques was applicable, thus the need for a customized visual
4. Terrain Metaphor to solve our requirements, implemented in the program PROTERAN

55 55 Questions

