Gilad Lerman Department of Mathematics University of Minnesota


1 Gilad Lerman Department of Mathematics University of Minnesota
Exploring Functional Landscapes of Proteins via Manifold Embeddings of the Gene Ontology
Gilad Lerman, Department of Mathematics, University of Minnesota
IMA, UMN, 11/14/07; IPAM, UCLA, 11/29/07

2 Fundamental Problem in Molecular Evolution
How do we quantify the relationship between structure and function? More specifically: given two protein domains, how similar are they in terms of function? (That is, form a functional distance for protein domains.)

3 “Nothing in Biology makes sense except in the light of evolution”
Theodosius Dobzhansky ( )
[Slide background: a long DNA nucleotide sequence]
Evolution guides the construction of our functional metric. It is relevant in interpreting our results. Diffusion distances mimic evolutionary processes.

4 Evolution of This Talk…
Background
Framework to study structure-function
Functional distance between protein domains
Function-structure correlation
Convergent and divergent evolution
What’s next?

5 Background: Structure (Proteins)
Proteins are assembled spatially out of distinct structural units. These structural units are called protein domains. Protein domains fold independently. Example: Transferase (Methyltransferase), 1adm.

6 Decomposing a Protein into its Domains
Fibronectin protein–1fnf

7 3-D Structure Comparison
DALI: automated comparison of 3D protein structures by 2D distance matrices.
Z-score: structure similarity score.
Holm L, Sander C. JMB 1993, 233:

8 Function: Gene Ontology (GO)
GO goal: a controlled vocabulary of genes and gene products in any organism (since 1998).
Gene Ontology: tool for unification of biology. M. Ashburner et al. (The Gene Ontology Consortium). Nature Genet 25, 2000.
Three structured, species-independent vocabularies describe gene products in terms of: 1) biological processes, 2) cellular components, 3) molecular functions.
GO is friendly (google )

9 GO Demonstration

10 Sequence-Structure-Function
Structures: protein domains.
Sequences: amino acid sequences folding into domains.
Molecular functions: Gene Ontology (GO).
BLAST: Basic Local Alignment Search Tool.
Shakhnovich BE et al. BMC Bioinformatics. 2003, 4:34.
Shakhnovich BE. PLoS Comput Biol. 2005 Jun;1(1):e9.

11 Similarity Measures
Structure (protein domains): Z-scores, based on the 3-D geometry.
Sequences: BLAST, edit distances over alphabets.
Phylogenetic information: MI score.
Function: scores ????
Holm L, Sander C. JMB 1993, 233:
Altschul SF, et al. JMB 1990 Oct 5;215(3):
Pellegrini M, et al. Proc Natl Acad Sci U S A Apr 13;96(8):

12 Previous Functional “Distances”?
1. Similarity measures of ontologies (individual nodes). Lord PW et al., Bioinformatics, 2003, 19(10):
Assign local fractions $p(n)$ for each node; $p_{ms}(n_1, n_2) = \min\{p(n)\}$ among parents $n$ of $n_1$ and $n_2$.
2. “Distance” between protein domains (subgraphs). Shakhnovich BE, PLoS Comput Biol, 2005 Jun;1(1):e9.
$p_{A,i}$ / $p_{B,i}$: percentage of sequences that fold into structure A/B and are annotated as function $i$.
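The node-level measure in item 1 can be illustrated with a minimal Python sketch. The tiny ontology, the fractions `p`, and the function names are hypothetical toy values, not GO data. The sketch also assumes that a node counts among its own "parents" and that the similarity is taken as -ln of the minimum, as in information-content measures; the slide itself only states the minimum.

```python
import math

# Hypothetical toy ontology: child -> list of direct parents (a DAG).
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["a"]}

# Hypothetical local fractions p(n): share of annotations at or below node n.
p = {"root": 1.0, "a": 0.4, "b": 0.6, "c": 0.1, "d": 0.3}

def ancestors(node):
    """Return the node together with all of its ancestors in the DAG."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def p_ms(n1, n2):
    """Smallest fraction p(n) over the ancestors shared by n1 and n2."""
    return min(p[n] for n in ancestors(n1) & ancestors(n2))

def similarity(n1, n2):
    """Assumed similarity score: -ln of the minimum shared fraction."""
    return -math.log(p_ms(n1, n2))

print(p_ms("c", "d"), p_ms("c", "b"))
```

Nodes that share only the root get similarity zero, while nodes sharing a rare (low-fraction) parent score higher.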

13 Our Goal: Forming Distances
What’s given? The GO graph and the subgraphs of protein domains.
Questions:
How to form meaningful similarities (between nodes)?
How to form distances from similarities (nodes)?
How to use these to form distances between domains (subgraphs)?

14 Using Similarities to Create Distances for Nodes
Machine learning framework.
Given: points (nodes) $\{x_i\}_{i=1,\ldots,N}$ and similarities $K(x_i, x_j)$, such that $K$ is symmetric and positive (semidefinite).
Distance: $d^2(x_i, x_j) = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$.
Interpretation: if $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$, then $d^2(x_i, x_j) = \|\varphi(x_i) - \varphi(x_j)\|^2$.
$\varphi$: embedding from input space to feature space ($\mathbb{R}^N$). $K$: the kernel.
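The kernel-to-distance formula on this slide can be checked with a small Python sketch. The three points and the name `kernel_distance` are illustrative assumptions; the kernel here is just the Gram matrix of inner products, so the induced distance should agree with the ordinary Euclidean distance.

```python
import math

def kernel_distance(K, i, j):
    """Distance induced by a symmetric positive semidefinite kernel matrix K."""
    d2 = K[i][i] + K[j][j] - 2.0 * K[i][j]
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative round-off to zero

# Toy data (assumed): three points in the plane, with K their Gram matrix.
points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, kernel_distance(K, i, j), math.dist(points[i], points[j]))
```

Only kernel entries are used, which matches the later remark that the embedding itself is never needed.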

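15 The mapping φ
It can be obtained by either:
1. Find the eigenpairs $(u_1, \lambda_1), \ldots, (u_N, \lambda_N)$ of $K$ and set φ from them (the formula is missing from this transcript). Note that $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$.
2. Form the RKHS induced by $K$.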
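The formula that followed "set" did not survive in this transcript. One standard choice that satisfies the stated identity, assuming the eigenvectors $u_k$ are orthonormal and the eigenvalues $\lambda_k$ are nonnegative, is sketched below; it is a reconstruction under those assumptions, not necessarily the exact expression on the slide.

```latex
\varphi(x_i) = \left( \sqrt{\lambda_1}\, u_1(i), \ldots, \sqrt{\lambda_N}\, u_N(i) \right) \in \mathbb{R}^N ,
\qquad
\langle \varphi(x_i), \varphi(x_j) \rangle
  = \sum_{k=1}^{N} \lambda_k\, u_k(i)\, u_k(j)
  = K(x_i, x_j) .
```

The last equality is the spectral decomposition $K = \sum_k \lambda_k u_k u_k^{\top}$ read entrywise.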

16 The “manifold embedding”
[Figure: data → “φ” → embedding. Figure by Todd Wittman (mani)]
[Figure: data → “φ” → embedding. Figure by Coifman & Lafon]
Remark: we do not use “φ”, only the kernel $K$.

17 How to Assign Similarities?
Local/ad hoc similarity.
Global similarity: obtained by propagating local similarities (diffusion on the graph, mimicking an evolutionary process).

18 Forming a Diffusion Kernel
$n_{ij}$: number of domains (subgraphs) shared by nodes $i$ and $j$.
$K_m$ is a diffusion kernel with parameter $m$.
Fiedler M. Czech. Math. Journal, 25:
Chung F (book). AMS
Kondor R, Lafferty JD. ICML 2002:
Belkin M, Niyogi P. Tech Report 2002, U. Chicago
Ham J. et al. ICML 2004:
Coifman et al. PNAS 2005, 102 (21): 7426
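The kernel formula itself is not preserved in this transcript. As a rough Python sketch of how a diffusion kernel could be built from the shared-domain counts, the code below assumes a symmetric normalization of the count matrix followed by an m-th matrix power. The normalization, the toy counts, and the function names are assumptions for illustration, not the construction used in the talk.

```python
import math

def matmul(A, B):
    """Plain matrix product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_kernel(n_shared, m):
    """Assumed form: K_m = (D^{-1/2} W D^{-1/2})^m with W = shared-domain counts."""
    n = len(n_shared)
    deg = [sum(row) for row in n_shared]  # every node needs a positive degree
    A = [[n_shared[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        K = matmul(K, A)  # one more diffusion step
    return K

# Hypothetical counts: n_shared[i][j] = number of domains annotated by both i and j.
n_shared = [[2, 1, 0],
            [1, 2, 1],
            [0, 1, 2]]

K2 = diffusion_kernel(n_shared, 2)
print(K2[0][2])  # nodes 0 and 2 share no domain but are linked through node 1
```

After two steps the entry for nodes 0 and 2 becomes positive even though they share no domain directly, which is the "propagation of local similarities" described on the previous slide.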

19 Forming a Diffusion Distance
Formally: (formula not preserved in this transcript)
Interpretation: it describes the rate of connectivity between vertices according to paths of length $m$.
Szummer M, Jaakkola T. NIPS 2001, 14
Ham J. et al. ICML 2004:
Coifman et al. PNAS 2005, 102 (21): 7426.
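The formal definition is missing from this transcript. A reading consistent with the kernel-to-distance recipe of slide 14 is sketched below, under the assumption that $K_m = A^m$ for a symmetric matrix $A$ with orthonormal eigenpairs $(u_k, \lambda_k)$; the talk's exact normalization may differ.

```latex
d_m^2(x_i, x_j)
  = K_m(x_i, x_i) + K_m(x_j, x_j) - 2 K_m(x_i, x_j)
  = \sum_{k=1}^{N} \lambda_k^{m} \left( u_k(i) - u_k(j) \right)^2 .
```

The second equality follows from $K_m(x_i, x_j) = \sum_k \lambda_k^m u_k(i) u_k(j)$. For this to be a genuine squared distance the weights $\lambda_k^m$ must be nonnegative, which holds for even $m$ or when $A$ is positive semidefinite.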

20 Another Diffusion Distance
Another kernel…
The distance corresponding to this kernel is the expected time to travel from one vertex to another and then back again.
Coifman et al. PNAS 2005, 102 (21): 7426
Ham J. et al. ICML 2004:
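The "there and back again" quantity is usually called the commute time of a random walk. The Python sketch below computes it directly from expected hitting times on a small weighted graph rather than from the kernel, whose formula is not preserved here. The toy path graph and all function names are illustrative assumptions, and the graph is assumed to be connected.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def hitting_time(W, src, dst):
    """Expected number of random-walk steps to first reach dst from src."""
    if src == dst:
        return 0.0
    n = len(W)
    deg = [sum(row) for row in W]
    others = [v for v in range(n) if v != dst]
    # For u != dst: h(u) = 1 + sum_v P(u, v) h(v), with h(dst) = 0.
    A = [[(1.0 if u == v else 0.0) - W[u][v] / deg[u] for v in others]
         for u in others]
    h = solve(A, [1.0] * len(others))
    return h[others.index(src)]

def commute_time(W, i, j):
    """Expected time to go from i to j and then back to i."""
    return hitting_time(W, i, j) + hitting_time(W, j, i)

# Toy graph (assumed): unweighted path 0 - 1 - 2.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(commute_time(W, 0, 1), commute_time(W, 0, 2))
```

The commute time is symmetric in its two arguments, which is what makes it usable as a distance between vertices.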

21 “Distances” Between Domains
Given: $d(x, y)$, the diffusion distance between annotations $x$ and $y$.
Compute: $d(x, A)$, the distance between annotation $x$ and domain $A$; and $d(A, B)$, the “distance” between domains $A$ and $B$.
Dubuisson MP, Jain AK. IAPR
Memoli F, Sapiro G. Found. Comput. Math
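The slide's formulas for $d(x, A)$ and $d(A, B)$ are not preserved. One plausible reading, following the cited Dubuisson and Jain reference, is the modified Hausdorff distance: the point-to-set distance is a minimum, and the set-to-set value is the larger of the two averaged directed distances. The Python sketch below assumes that reading; the toy annotations (points on a line) and the function names are hypothetical.

```python
def point_to_set(x, A, d):
    """d(x, A): distance from annotation x to the nearest annotation in domain A."""
    return min(d(x, a) for a in A)

def modified_hausdorff(A, B, d):
    """Assumed d(A, B): max of the two averaged directed point-to-set distances."""
    a_to_b = sum(point_to_set(a, B, d) for a in A) / len(A)
    b_to_a = sum(point_to_set(b, A, d) for b in B) / len(B)
    return max(a_to_b, b_to_a)

def toy_distance(x, y):
    """Stand-in for the diffusion distance between two annotations."""
    return abs(x - y)

# Hypothetical domains, each a set of annotations placed on a line.
A = [0.0, 1.0]
B = [1.0, 3.0]
print(point_to_set(3.0, A, toy_distance), modified_hausdorff(A, B, toy_distance))
```

Averaging instead of taking a maximum over points makes the value less sensitive to a single outlying annotation; like the quotation marks on the slide suggest, the result need not satisfy the triangle inequality.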

22 Quick Summary
Formed a diffusion distance between functional annotations (nodes).
Formed functional distances between protein domains (subgraphs).

23 What’s Next
We put those distances in context with the geometric structure.
We indicate how those distances can be used to infer evolutionary information.

24 Comparisons

25 Functional Domain Universe Graph
FDUG: connect every pair of domains (nodes) whose functional distance is below $F_{max}$ by an edge; here $F_{max} = 0.23$.
Color the top nine commonly occurring folds (use SCOP).
Identify main functional domains: A: Heat Shock Proteins, B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors, F: Glucose/Galactose Enzymes, G: AAA-ATPases, H: Oxidoreductases, I: Dehydrogenases, J: Retroviral Integrases, K: Kinases.
Remark: Exonucleases are enzymes that cleave nucleotides one at a time from an end of a polynucleotide chain. These enzymes hydrolyze phosphodiester bonds from either the 3' or 5' terminus of polynucleotide molecules.
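The thresholding step that defines the FDUG can be sketched in Python as follows. The domain names and functional distances are made-up toy values; only the threshold 0.23 comes from the slide, and the helper names are hypothetical.

```python
F_MAX = 0.23  # threshold quoted on the slide

# Hypothetical functional distances between toy domains (not real data).
functional_distance = {
    ("dom1", "dom2"): 0.05,
    ("dom1", "dom3"): 0.40,
    ("dom2", "dom3"): 0.20,
    ("dom3", "dom4"): 0.90,
}

def build_fdug(distances, f_max):
    """Adjacency sets of the graph linking domains closer than f_max."""
    graph = {}
    for (a, b), f in distances.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if f < f_max:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph):
    """Group domains into clusters reachable through edges of the graph."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components

fdug = build_fdug(functional_distance, F_MAX)
print(connected_components(fdug))
```

Connected components of such a graph are one simple way to read off functional clusters like those labeled A through K.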

26 Observation
Domains sharing a fold classification form clusters with common functions.
Domains with related functions are proximal. Examples: H: Oxidoreductases and I: Dehydrogenases; B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors.
A dehydrogenase is an enzyme that oxidizes a substrate by transferring one or more protons and a pair of electrons to an acceptor, usually NAD/NADP or a flavin coenzyme such as FAD or FMN.
Oxidoreductases catalyze oxidation/reduction reactions. They are classified as EC 1 in the EC number classification.
Redox reactions include all chemical processes in which atoms have their oxidation number (oxidation state) changed. This can be a simple redox process, such as the oxidation of carbon to yield carbon dioxide or the reduction of carbon by hydrogen to yield methane, or a complex one, such as the oxidation of sugar in the human body through a series of very complex electron transfer processes. The term redox comes from the two concepts of reduction and oxidation. In simple terms: oxidation describes the loss of an electron by a molecule, atom or ion; reduction describes the gain of an electron by a molecule, atom or ion.

27 Traversing the FDUG
1hlv: Centromere Binding Protein. 1gdt: Site Specific Resolvase. 2hdd: Engrailed Transcription Factor. Fold: 3-helical bundle.
Recall B: DNA binding (proteins that help package DNA), E: transcription factors. Proteins work in pairs (bind as dimers).
2hdd: the double line represents the DNA (the proteins are hardly separated); it binds to specific DNA sequences.
1gdt: look only at the left part; the right is unrelated (part of the binding; 1gdt is required to start replication). The binding is not as specific, but also not too separated. 3 helices = 1 protein.
1hlv: more distant binding (it packages DNA); it binds non-specifically.

28 Divergent Evolution
Biological characteristics with a common evolutionary origin that have diverged over evolutionary time.
The previous example may indicate a case of divergent evolution (common ancestry).

29 Convergent Evolution
Definition in molecular evolution: two proteins with no apparent homology performing the same function.
We may identify such cases by searching for low F-scores (small functional distance) together with low Z-scores (low structural similarity, i.e. large structural distance).
Example: convergence of tRNA synthetases 1pys and 1a8h, F-score = .001, Z-score < 2. This example is well documented.
Mosyak L. et al. Nat Struct Biol 1995, 2:537-47
Sugiura I. et al. Nucleic Acids Res , D189-92
In evolutionary biology: organisms acquiring similar characteristics while evolving in separate and sometimes varying ecosystems.
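The search described on this slide amounts to a filter over domain pairs. A minimal Python sketch follows; the cutoffs, the record layout, and every value except the slide's F-score of .001 for 1pys/1a8h are made-up placeholders (the slide only states Z-score < 2 for that pair, so 1.5 is an arbitrary stand-in).

```python
F_CUTOFF = 0.01  # hypothetical cutoff for "functionally close"
Z_CUTOFF = 2.0   # the slide's example has Z-score < 2

# Toy records. Only f = 0.001 for 1pys/1a8h comes from the slide.
pairs = [
    {"a": "1pys", "b": "1a8h", "f": 0.001, "z": 1.5},   # z is a placeholder below 2
    {"a": "domX", "b": "domY", "f": 0.002, "z": 12.0},  # made up: similar structure
    {"a": "domX", "b": "domZ", "f": 0.600, "z": 1.0},   # made up: different function
]

def convergent_candidates(pairs, f_cutoff, z_cutoff):
    """Pairs that are functionally close yet structurally dissimilar."""
    return [(p["a"], p["b"]) for p in pairs
            if p["f"] < f_cutoff and p["z"] < z_cutoff]

print(convergent_candidates(pairs, F_CUTOFF, Z_CUTOFF))
```

Pairs passing the filter are only candidates; as with the documented tRNA synthetase case, each would still need independent confirmation.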

30 Summary
Defined a distance between protein functions (nodes) and a functional distance between protein domains (subgraphs).
Showed correlation with structure, sequence and phylogeny.
Explored the structure-function relation via the FDUG (functional domain universe graph).
Indicated examples of divergent and convergent evolution.

31 Some Future Projects
Extension to cellular components and processes, and their use in quantitative research of convergent evolution.
Infer function from structure (or vice versa) via supervised/semi-supervised learning.

32 Very Recent Interests
Hybrid Linear Modeling.
Another direction in evolution: study of the evolution of transcriptional response to osmotic stress; applying recent tools of knowledge discovery.

33 Thanks
Contact: lerman@umn.edu
Supplementary webpage:
Collaborator: Borya Shakhnovich, O’Shea Lab, Harvard
Support: NSF
Thanks: IPAM (Mark Green) for the 2003 proteomics workshop; R.R. Coifman (Yale), S. Lafon (Google), M. Maggioni (Duke); organizers of the current workshop

34 Embedding Annotations on top 2 coordinates

35 Embedding Protein Domains

