1 PROTERAN: ANIMATED TERRAIN EVOLUTION FOR VISUAL ANALYSIS OF PROTEIN FOLDING TRAJECTORY
2 The need for Bioinformatics Bioinformatics: Application of computational techniques to the management and analysis of biological information. Clustering techniques applied to the data are not enough on their own. A good visual representation is also needed
3 Agenda Microarrays Review of existing clustering and visualization techniques on gene expression data The need for a customized visualization tool for use by Dr. Laxmi Parida & Dr. Ruhong Zhou of the computational biology group at the IBM Watson Research Center for visual analysis of protein characteristics Introduce our new technique that makes use of an animated terrain, implemented in the program called PROTERAN
4 Function of Genes & Proteins Through the proteins they encode, genes orchestrate the mysteries of life Protein functions vary widely, from mechanical support to transportation to regulation.
5 Still a lot of work ahead Traditional methods of discovering their functions were done on a gene-by-gene basis, thus throughput was low. Believed that many genes work together; this is not exhibited in a one-by-one fashion.
6 Microarrays Solve the throughput problem Allow scientists to see genes on a genomic level
9 Clustering Clustering: Act of grouping similar objects together Applied to gene expression in order to find the function of unknown genes Many different clustering techniques in the literature. Representative techniques are discussed next.
10 Determining similarity between two genes Choose a similarity distance to compare genes, e.g. Euclidean distance
          Experiment 1   Experiment 2   ...   Experiment M
Gene 1    C5_11/C3_11    C5_12/C3_12    ...   C5_1M/C3_1M
Gene 2    C5_21/C3_21    C5_22/C3_22    ...   C5_2M/C3_2M
...
Gene N    C5_N1/C3_N1    C5_N2/C3_N2    ...   C5_NM/C3_NM
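As an illustration of the distance choice on this slide, the following minimal sketch computes the Euclidean distance between two expression profiles. The function name and the sample ratios are hypothetical and are not taken from the presentation.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two equal-length expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

# Hypothetical expression ratios for two genes over three experiments.
gene_1 = [1.0, 2.0, 3.0]
gene_2 = [1.0, 5.0, 7.0]
print(euclidean_distance(gene_1, gene_2))  # 5.0
```

A smaller value means the two genes behave more alike across the experiments. Other measures (for example correlation-based ones) could be substituted without changing the clustering steps that follow.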
11 Hierarchical Clustering 1. Create distance matrix of all genes in relation to each other 2. Find the two closest genes 3. Merge these two genes and redo distance matrix 4. Repeat steps 2-3 until only one cluster left
12 Dendrogram Binary tree with a distinguished root, which has all the data items at the leaves Re-orders the expression matrix to place similar genes beside each other
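13 Example: Agglomerative Hierarchical Clustering
Initial distance matrix:
        A   B   C   D
A       0   1   6   8
B           0   5   7
C               0   2
D                   0
After merging A and B:
        (A,B)   C   D
(A,B)   0       5   7
C               0   2
D                   0
After merging C and D:
        (A,B)   (C,D)
(A,B)   0       5
(C,D)           0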
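The example can be reproduced with a short sketch. The slide does not name the linkage rule; the merged distances it shows (5 and 7 after merging A and B, then 5) are consistent with single linkage, so that is what is assumed here. The function and variable names are illustrative only.

```python
def single_linkage(labels, dist):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(label,) for label in labels]

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Replace the two clusters with their union and repeat.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Pairwise distances from the example: A-B 1, A-C 6, A-D 8, B-C 5, B-D 7, C-D 2.
pairwise = {
    frozenset("AB"): 1, frozenset("AC"): 6, frozenset("AD"): 8,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 2,
}
for a, b, d in single_linkage("ABCD", pairwise):
    print(a, b, d)
```

The merge order is A with B at distance 1, C with D at distance 2, and finally the two pairs at distance 5, which is the tree a dendrogram would draw. The sketch recomputes distances on every pass for clarity rather than speed.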
14 Advantages Familiar to biologists Few parameters to specify
15 Disadvantages Requires fast CPUs and large amounts of memory Does not identify important clusters Only represents hierarchically organized data Does not scale up
16 Disadvantages cont. Dendrogram always offers $2^{n-1}$ representations (where n = number of elements), since each of the n-1 internal nodes can be flipped
17 Self-Organizing Maps (SOMs) User picks the number of clusters, called nodes Nodes randomly mapped to M-dimensional space (M = # of experiments) Node values are adjusted by random vectors picked from the original data After node values settle, vectors are clustered to the closest node
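The steps on this slide can be sketched as follows. This is a simplified illustration, not the implementation used by any of the tools mentioned in the presentation: it assumes a one-dimensional chain of nodes, a linearly decaying learning rate, and a neighbourhood limited to the two adjacent nodes. All names and parameter values are hypothetical.

```python
import random

def closest_node(nodes, vector):
    """Index of the node with the smallest squared Euclidean distance to vector."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((n - v) ** 2 for n, v in zip(nodes[i], vector)),
    )

def train_som(data, n_nodes, n_iter=2000, start_rate=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    low = [min(row[k] for row in data) for k in range(dim)]
    high = [max(row[k] for row in data) for k in range(dim)]
    # Nodes start at random positions inside the bounding box of the data.
    nodes = [[rng.uniform(low[k], high[k]) for k in range(dim)]
             for _ in range(n_nodes)]
    for step in range(n_iter):
        decay = 1.0 - step / n_iter          # shrinks from 1 towards 0
        vector = rng.choice(data)            # random vector from the data
        winner = closest_node(nodes, vector)
        # The winner moves most; its chain neighbours move less, and
        # their share shrinks as training proceeds.
        updates = ((winner, 1.0),
                   (winner - 1, 0.5 * decay),
                   (winner + 1, 0.5 * decay))
        for idx, strength in updates:
            if 0 <= idx < n_nodes:
                rate = start_rate * decay * strength
                for k in range(dim):
                    nodes[idx][k] += rate * (vector[k] - nodes[idx][k])
    return nodes

def assign_to_nodes(data, nodes):
    """After training, each vector is clustered to its closest node."""
    return [closest_node(nodes, row) for row in data]

# Hypothetical two-experiment expression profiles.
profiles = [[0.1, 0.2], [0.0, 0.3], [5.0, 5.1], [5.2, 4.9]]
nodes = train_som(profiles, n_nodes=2)
print(assign_to_nodes(profiles, nodes))
```

The number of nodes, the iteration count, and the learning rate all have to be chosen by the user, which is the parameter burden listed among the disadvantages.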
18 Visualization 1. Dendrogram 2. Error Bar Representation
19 Visualization 3. U-Matrix
20 Advantages User has partial control over structure Fuzzy Clusters Variety of visual techniques applicable
21 Disadvantages Knowledge of number of clusters beforehand Many parameters to specify
22 Principal Component Analysis (PCA) Mathematical technique that can be used to reduce the number of dimensions of data
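To make the dimension reduction concrete, here is a dependency-free sketch that finds the first principal component of a small data set and projects each row onto it. It uses power iteration on the covariance matrix as a stand-in for a full eigendecomposition; that choice, the function name, and the sample data are assumptions made for illustration.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dim)]
    centered = [[r[k] - means[k] for k in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, unless the start vector is orthogonal to it.
    vec = [1.0] * dim
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    # Project each centered row onto the component: one number per row.
    scores = [sum(c[k] * vec[k] for k in range(dim)) for c in centered]
    return vec, scores

# Hypothetical data lying exactly on the line y = x.
rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction, scores = first_principal_component(rows)
print(direction, scores)
```

Each two-dimensional row is reduced to a single score along the direction of greatest variance. Keeping the first two or three components in the same way gives the coordinates for a 2D or 3D plot.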
24 Advantages No parameters required 3D Visualization
25 Disadvantages Little control over structure Running time of $O(N^3)$ Not applicable when input is a distance matrix
26 Biclustering Clustering of both rows and columns simultaneously
27 Available Software
Software Name | Description | Available at
F-Scan | Quantification and analysis of fluorescently probed microarrays; scatterplots; multiple image comparison. |
TIGR SpotFinder | Spot identification. | http://www.tigr.org/software/
Cluster | Hierarchical clustering, K-means clustering, Self-Organizing Map (SOM), PCA | tm
Genesis | A Java suite containing various tools such as filters, normalization, visualization tools, common clustering algorithms, SOM, k-means, PCA, | GenesisCenter.html
J-Express Pro 2.0 | Hierarchical clustering, K-means, Principal Component Analysis, Self-organizing maps, Profile similarity search, Normalization and filtering, Raw data import, Project organization | /frm_jexpress.htm
TreeView | Cluster output visualization | http://rana.lbl.gov/EisenSoftware.htm
28 Protein Folding
29 Reaction Coordinates Folding determines the function of a protein All-atom recreation of a protein is unrealistic Reaction coordinates are used to describe protein structure 1. Fraction of Native Contacts 2. Radius of Gyration 3. RMSD from the native structure 4. Number of beta-strand Hydrogen Bonds 5. Number of alpha helix turns 6. Hydrophobic core radius of gyration 7. Principal Components
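Three of the reaction coordinates listed above reduce a whole conformation to a single number, which can be sketched as below. The sketch makes simplifying assumptions that the slides do not state: atoms are unweighted (no masses), RMSD is computed without first superimposing the structures, and a contact counts as formed when two atoms lie within a user-chosen cutoff. All names are hypothetical.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) positions."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_square = sum(math.dist(p, center) ** 2 for p in coords) / n
    return math.sqrt(mean_square)

def rmsd(coords, native):
    """RMSD between two conformations with the same atom order.

    Assumes the structures are already aligned; no superposition is done.
    """
    if len(coords) != len(native):
        raise ValueError("conformations must have the same number of atoms")
    mean_square = sum(math.dist(p, q) ** 2 for p, q in zip(coords, native)) / len(coords)
    return math.sqrt(mean_square)

def fraction_native_contacts(coords, native_pairs, cutoff):
    """Fraction of native contact pairs (i, j) currently within the cutoff."""
    formed = sum(1 for i, j in native_pairs
                 if math.dist(coords[i], coords[j]) <= cutoff)
    return formed / len(native_pairs)

# Hypothetical three-atom conformation and native contact list.
conformation = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(radius_of_gyration(conformation))
print(fraction_native_contacts(conformation, [(0, 1), (0, 2)], cutoff=2.0))  # 0.5
```

Evaluating such functions at every time step of a trajectory yields one row of numbers per step, which is the kind of table the later slides compare to microarray data.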
30 Protein States While folding, a protein goes through certain states The raw data is similar to microarray data. Dr. Parida and Dr. Zhou have developed their own techniques and clustered β-Hairpin data.
31 Reaction Coordinates used on the β-Hairpin 1. Number of native β-strand hydrogen bonds 2. Radius of gyration of the hydrophobic core residues 3. Radius of gyration of the entire protein 4. Fraction of native contacts 5. Principal component 1 6. Principal component 2 7. Root mean square deviation (RMSD) from the native structure.
32 Raw Data
33 Patterned Cluster RED = Number of columns in pattern. (Also defined as the Pattern Type) WHITE = Column Number PURPLE = Column Value YELLOW = Number of occurrences GREEN = Occurrences
34 Sample Patterned Cluster File … … … … : …
35 The need for Visual Analysis of Patterned Cluster Data β-Hairpin file approx 500MB large Difficult to study the textual representation and get a global view Very difficult to see interaction of all patterned clusters in relation to each other Also very difficult to remember all patterned clusters and their occurrence in time
36 Visual Requirements Global View Navigation & Focus Relative growth Details of characteristics on demand
37 Need for Customized Tool All of the existing visualization techniques on microarrays had one or more drawbacks None were able to provide a visual for depicting relative growth of clusters.
38 Terrain Metaphor Has been shown to be a useful technique in searching a corpus of documents Very recently the idea has been applied to gene expression with high density clusters representing mountains
39 Using a Landscape Metaphor to solve our requirements Each mountain represents a patterned cluster Mountain growth represents evolution of patterned cluster Clicking on mountains returns details of patterned cluster
41 Mapping of Patterned Cluster Data into Terrain Geometry
42 Mapping of Patterned Cluster data into Terrain Geometry Pattern Type: Number of columns in a patterned cluster Column Combination: Unique number that identifies a combination of columns
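43 Column Combinations Number of column combinations for a pattern type: $\frac{c!}{(c - t)!\, t!}$, where c = number of characteristics and t = pattern number (pattern type) [Table: Pattern Type vs. Number of Column Combinations]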
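The formula on this slide is the binomial coefficient, so the table of pattern types can be recomputed with a few lines. The value c = 7 is an assumption taken from the seven reaction coordinates listed for the β-hairpin; the function name is illustrative.

```python
import math

# Assumed from the seven reaction coordinates used on the β-hairpin.
N_CHARACTERISTICS = 7

def column_combinations(c, t):
    """Number of ways to choose t columns out of c: c! / ((c - t)! * t!)."""
    return math.factorial(c) // (math.factorial(c - t) * math.factorial(t))

for t in range(1, N_CHARACTERISTICS + 1):
    print(t, column_combinations(N_CHARACTERISTICS, t))
```

Under that assumption the counts for pattern types 1 through 7 are 7, 21, 35, 35, 21, 7 and 1, for 127 column combinations in total.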
44 Layout We first thought of using an automated layout technique. However, one of Dr. Zhou’s requirements was that the same pattern cluster should appear in the same position for consistent interpretation. Another was that larger pattern types (6 and 7 column) must be very distinguishably placed. Hence it was decided to use a manual layout design described next.
46 Top Patterned Clusters Visualized The final requirement from Dr. Parida and Dr. Zhou is that only the 10 largest patterned clusters of each column combination be visualized [Figure: positions of the highest through 10th-highest occurrence clusters of column combination 01]
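Selecting the clusters to draw could look like the sketch below. It assumes each patterned cluster is available as a (column combination, cluster id, number of occurrences) record and that "largest" means highest occurrence count, as the figure labels suggest. The record layout and names are hypothetical.

```python
import heapq
from collections import defaultdict

def top_clusters(clusters, k=10):
    """Keep the k clusters with the most occurrences in each column combination.

    clusters: iterable of (column_combination, cluster_id, n_occurrences).
    Returns {column_combination: [(n_occurrences, cluster_id), ...]} with
    each list sorted from highest to lowest occurrence count.
    """
    by_combination = defaultdict(list)
    for combination, cluster_id, n_occurrences in clusters:
        by_combination[combination].append((n_occurrences, cluster_id))
    return {
        combination: heapq.nlargest(k, members)
        for combination, members in by_combination.items()
    }

# Hypothetical records for two column combinations.
records = [("01", "a", 5), ("01", "b", 9), ("01", "c", 1), ("02", "d", 3)]
print(top_clusters(records, k=2))
```

Because every column combination contributes at most ten mountains, the total number of mountains stays bounded no matter how many patterned clusters the input file contains.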
47 PROTERAN LAYOUT
48 Animated Terrain Evolution Time proceeds from 0 to the maximum number of experiments Each time unit all patterned clusters are checked If there is an occurrence the mountain’s height is increased
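The update rule on this slide can be sketched as a generator that yields one frame of mountain heights per time unit. It assumes the occurrences of each patterned cluster are known as a set of time steps and that every occurrence adds a fixed increment to the height; the increment and all names are hypothetical, and rendering is left out.

```python
def evolve_heights(occurrences, n_time_steps, growth=1.0):
    """Yield (t, heights) for each time unit.

    occurrences: dict mapping cluster id -> set of time steps at which
    that patterned cluster occurs.
    """
    heights = {cluster_id: 0.0 for cluster_id in occurrences}
    for t in range(n_time_steps):
        # Each time unit, every patterned cluster is checked.
        for cluster_id, times in occurrences.items():
            if t in times:
                heights[cluster_id] += growth   # the mountain grows
        # Yield a copy so earlier frames are not changed by later growth.
        yield t, dict(heights)

# Hypothetical occurrences for two patterned clusters over three time units.
sample = {"p1": {0, 2}, "p2": {1}}
for t, frame in evolve_heights(sample, 3):
    print(t, frame)
```

Drawing each yielded frame in turn produces the animation: mountains that occur early rise first, so the relative growth of patterned clusters is visible as time advances.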
49 Mountains of PROTERAN
50 Results & Extensions
51 Results Very encouraging feedback Easy-to-use layout; the interface allows 1. Identification of states 2. Retrieval of patterned cluster values 3. Viewing how patterned clusters relate to each other as they grow over time On first use, Dr. Zhou said that “he was able to find that the hydrophobic core is largely formed before the beta-strand hydrogen bonds are formed.”
52 Future of PROTERAN Introduced at the Intelligent Systems for Molecular Biology (ISMB) conference in Scotland, where it was very well received Robert-Cedergren Bioinformatics Colloquium at University of Montreal (Sept th )
53 Extensions Analyze with different types of protein data More generic layout with more characteristics Application with different types of data
54 Summary 1. Review of existing techniques to cluster and visualize gene expression data 2. Protein characteristics data is similar to gene expression data 3. None of the existing techniques was applicable, hence the need for a customized visualization 4. Terrain metaphor to solve our requirements, implemented in the program PROTERAN