The Birth of Smooth Biological Codes in a Rough Evolutionary World Shalev Itzkovitz, Guy Shinar, Uri Alon T T.


o Biological codes are information channels or maps with a natural ‘fitness’ measure.
o Codes evolve and are selected according to their fitness or ‘smoothness’.
o The emergence of a code is a phase transition in an information channel.
o The topology of errors (noise) governs the emergent code.

Biological codes are (often) maps. A biological code is a mapping between two sets of molecules:
– Transcription network: proteins → DNA binding sites
– Protein-protein recognition: immune system…
– Protein synthesis: DNA → proteins (the genetic code)

Information flows from DNA to RNA to proteins through the genetic code. The 20 letters of the protein alphabet are the amino acids. Proteins are amino acid polymers.
DNA (4 letters): ACGGAGGTACCC
RNA (4 letters): ACGGAGGUACCC
Protein (20 letters): Thr Glu Val Pro
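The example on this slide can be reproduced with a short sketch. The names `CODON_TABLE`, `transcribe` and `translate` are illustrative, and only the four codons that appear in the example are tabulated (their assignments are those of the standard genetic code).

```python
# Minimal sketch: transcribe DNA into RNA, then translate codon by codon.
# Only the four codons of the slide's example are listed; a full table has 64 entries.
CODON_TABLE = {"ACG": "Thr", "GAG": "Glu", "GUA": "Val", "CCC": "Pro"}

def transcribe(dna):
    """Replace T by U (coding-strand convention)."""
    return dna.replace("T", "U")

def translate(rna):
    """Split the RNA into consecutive triplets and look each one up."""
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    codons = [rna[k:k + 3] for k in range(0, usable, 3)]
    return [CODON_TABLE[c] for c in codons]

rna = transcribe("ACGGAGGTACCC")
protein = translate(rna)
print(rna, protein)
```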

Each of the 20 amino acids has specific chemistry. Amino acid = backbone + specific side group. Amino acids can be hydrophilic, hydrophobic, basic, acidic… The diversity of amino acids allows proteins to perform a wide variety of functions efficiently.

Each of the 20 amino acids is encoded by a triplet of RNA letters. Genetic code = mapping of triplets to amino acids. The 64 = 4^3 triplet codons encode only 20 amino acids (degeneracy). Only 48 codons are discernible, due to the U-C “wobble” at the 3rd base. Example: ACG → Thr, GAG → Glu, GUA → Val, CCC → Pro.
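The two counts on this slide follow from direct enumeration. In the sketch below, the label `Y` for a third base that is either U or C is an illustrative convention, not something taken from the slides.

```python
from itertools import product

BASES = "UCAG"

# All triplets over a 4-letter alphabet: 4**3 = 64 codons.
codons = ["".join(c) for c in product(BASES, repeat=3)]

def wobble_class(codon):
    """Merge U and C at the third position (labelled 'Y' here by assumption)."""
    third = "Y" if codon[2] in "UC" else codon[2]
    return codon[:2] + third

# Codons that remain distinguishable once U and C are merged at the 3rd base.
discernible = {wobble_class(c) for c in codons}

print(len(codons), len(discernible))
```

The 16 first-two-base combinations each keep three distinguishable third-base classes, which gives 16 × 3 = 48.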

The genetic code is smooth, degenerate and compact.
– Redundancy: only 20 amino acids for 48 discernible codons.
– Degeneracy: mostly in the 3rd base.
– Close codons are separated by a single letter (Hamming distance = 1).
– Smoothness: close codons encode chemically similar amino acids (hydrophobic xUx, hydrophilic xAx).
– Compactness: a single contiguous domain for each amino acid.
The code is highly nonrandom (“one in a million” [Haig & Hurst]).
Figure legend: shades, lighter (darker) = low (high) polarity; letters, black (white) = hydrophobic (hydrophilic), yellow = medium. [Knight, Freeland, Landweber]

Biological codes evolve(d) to cope with inherent noise. Messages are written in molecular words that are read and interpreted by other molecules, which calculate the response, etc. Typical energy scale ~ a few k_B T. Thermal noise → errors. Information channels adapt to errors through evolution by selection and mutation. Some errors (mutations) are essential to evolution…

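The code is an information channel with an average distortion. A path α → i → j → β consists of encoding (α → i), misreading (i → j) and decoding (j → β), and carries the distortion $D_{\alpha\beta}$:
$$H_{UV} = \sum_{\text{paths}} P_{\alpha i j \beta}\, D_{\alpha\beta} = \sum_{\alpha, i, j, \beta} P_\alpha\, U_{\alpha i}\, W_{ij}\, V_{j\beta}\, D_{\alpha\beta}$$
U and V are binary matrices that determine the code. W is the stochastic misreading (noise) matrix.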
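A toy evaluation of the distortion sum may help make the formula concrete. Everything in the sketch below is an assumption chosen for illustration: two amino acids, four codons arranged on a cycle, a misreading rate `eps` to each cyclic neighbor, and a 0/1 distortion matrix. It compares a decoder that assigns neighboring codons to the same amino acid with one that alternates.

```python
def error_load(P, U, W, V, D):
    """H = sum over alpha, i, j, beta of P[a] * U[a][i] * W[i][j] * V[j][b] * D[a][b]."""
    n_aa, n_codons = len(U), len(W)
    total = 0.0
    for a in range(n_aa):
        for i in range(n_codons):
            for j in range(n_codons):
                for b in range(n_aa):
                    total += P[a] * U[a][i] * W[i][j] * V[j][b] * D[a][b]
    return total

eps, n = 0.05, 4  # assumed misreading rate and number of codons
# Misreading matrix: a codon is read correctly or as one of its two cyclic neighbors.
W = [[0.0] * n for _ in range(n)]
for i in range(n):
    W[i][i] = 1 - 2 * eps
    W[i][(i + 1) % n] = eps
    W[i][(i - 1) % n] = eps

P = [0.5, 0.5]        # amino-acid usage
D = [[0, 1], [1, 0]]  # distortion: 1 whenever the wrong amino acid is produced

# Smooth code: codons 0,1 decode to amino acid 0 and codons 2,3 to amino acid 1.
U_smooth = [[1, 0, 0, 0], [0, 0, 1, 0]]
V_smooth = [[1, 0], [1, 0], [0, 1], [0, 1]]
# Rough code: decoding alternates around the cycle.
U_rough = [[1, 0, 0, 0], [0, 1, 0, 0]]
V_rough = [[1, 0], [0, 1], [1, 0], [0, 1]]

H_smooth = error_load(P, U_smooth, W, V_smooth, D)
H_rough = error_load(P, U_rough, W, V_rough, D)
print(H_smooth, H_rough)
```

In this toy setting the smooth code has half the error-load of the alternating one (eps versus 2·eps), because one of the two possible misreadings of each codon is harmless.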

A fitter code is one with less distortion. The ‘error-load’ H measures the difference between the desired and the reproduced amino acids. H is a natural measure for the fitness of the code. For better codes, the encoding U and the decoding V are optimized with respect to the reading W. The decoded amino acids must be diverse enough to map diverse chemical properties. However, to minimize the impact of errors it is preferable to decode fewer amino acids.

Theories on the origin of the code: frozen accident or optimization? Frozen accident hypothesis: any change in the code affects all the proteins in the cell and will therefore be too harmful. Life began with very few amino acids; new amino acids were added until eventually the code became frozen in its present form [Crick 1968]. Load minimization hypothesis: Darwinian dynamics optimize the code to minimize errors in information flow (due to mutations, misreading) [Sonneborn, Zuckerkandl & Pauling… 1965].

Variant codes: evidence for ongoing optimization of the code. Variants of the “universal” genetic code exist in many organisms [Osawa, Jukes 1992]. All variants use the same twenty amino acids (universal invariant?). Continuity: most changes are to a neighboring amino acid (‘hydrodynamic’ flow?).

Codes compete by their error-load. A one-letter change in DNA can change one amino acid in one protein. If the new amino acid is similar to the original, the upset is minimal. The organism with the smallest error-load takes over the population. With a relatively small population and high noise levels in protein synthesis: weak selection forces ≪ random drift.

The code’s evolution reaches a steady state. Small effective population and strong drift. The population is in detailed balance and therefore P(fitness) ~ exp(fitness/T) [Lassig, Sella & Hirsh]. A smaller population is hotter: $T \sim 1/N_{\text{eff}}$. The Boltzmann probability $P_{UV} \sim \exp(-H_{UV}/T)$ minimizes a ‘free energy’
$$F = \langle H \rangle - TS = \sum_{UV} H_{UV} P_{UV} + T \sum_{UV} P_{UV} \log P_{UV}$$
F is used to optimize information channels…
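The role of the temperature can be illustrated numerically. The sketch below assumes a hypothetical list `H` of error-loads for four candidate codes (made-up numbers) and computes the Boltzmann probabilities and the free energy ⟨H⟩ − TS at a high and a low temperature.

```python
import math

def boltzmann(H, T):
    """Probabilities P ~ exp(-H/T), normalized over the candidate codes."""
    weights = [math.exp(-h / T) for h in H]
    Z = sum(weights)
    return [w / Z for w in weights]

def free_energy(H, P, T):
    """F = <H> - T*S with S = -sum P log P."""
    mean_H = sum(h * p for h, p in zip(H, P))
    entropy = -sum(p * math.log(p) for p in P if p > 0)
    return mean_H - T * entropy

H = [0.05, 0.10, 0.10, 0.20]  # hypothetical error-loads of four codes
hot = boltzmann(H, 100.0)     # small population: nearly uniform, no code preferred
cold = boltzmann(H, 0.005)    # large population: the fittest code dominates
uniform = [1 / len(H)] * len(H)

print(hot, cold)
print(free_energy(H, cold, 0.005), free_energy(H, uniform, 0.005))
```

At any temperature the Boltzmann distribution gives a free energy no larger than any other distribution over the same codes, which is the sense in which it "minimizes" F.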

At high T no code is chosen. At high T (small populations) the Boltzmann distribution implies that all codes are equally probable: $\langle U_{\alpha i} \rangle = 1/N_C$. The natural order parameter is $u_{\alpha i} = \langle U_{\alpha i} \rangle - 1/N_C$. At high T the state is random, ‘non-coding’: $u_{\alpha i} = 0$. Stability of F is determined by:
– w, the preference of the reading: $w = W - 1/N_C$
– d, the normalized chemical distance matrix
$$\delta F \sim u^{t} \left( T\, I_\delta \otimes I_w - w^2 \otimes d \right) u$$

The code emerges at a phase transition. When T is decreased below $T_c$, an inhomogeneous coding state appears:
$$\delta F \sim u^{t} \left( T\, I_\delta \otimes I_w - w^2 \otimes d \right) u$$
Critical temperature: $T_c = \lambda_w^2 \times \lambda_d$, where $\lambda_w$ and $\lambda_d$ are the maximal eigenvalues of w and d. The code is the mode $u_{\alpha i}$ of F that corresponds to these maximal eigenvalues. $T_c$ increases with the accuracy of reading w. The phase transition is continuous (2nd order). Analogous phase transitions occur in information channels.
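The step from the quadratic form to the critical temperature can be spelled out as follows. This is a sketch under assumptions the slides do not state explicitly: w and d are taken to be symmetric, with orthonormal eigenvectors $e^{w}_a$, $e^{d}_b$ and eigenvalues $\lambda_{w,a}$, $\lambda_{d,b}$, and the largest eigenvalue of d is taken to be positive.

```latex
% Expand the order parameter in the product eigenbasis (assumed orthonormal).
\begin{align}
u &= \sum_{a,b} c_{ab}\, \left( e^{w}_a \otimes e^{d}_b \right), \\
% Each mode contributes independently to the quadratic form.
\delta F &\sim \sum_{a,b} \left( T - \lambda_{w,a}^{2}\, \lambda_{d,b} \right) c_{ab}^{2}, \\
% The first coefficient to change sign on cooling fixes the critical temperature.
T_c &= \max_{a,b}\, \lambda_{w,a}^{2}\, \lambda_{d,b} = \lambda_w^{2} \times \lambda_d .
\end{align}
```

For $T > T_c$ every coefficient is positive and the non-coding state $u = 0$ is stable; just below $T_c$ only the mode with the maximal eigenvalues becomes unstable, which is why that mode is identified with the emerging code.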

Why twenty amino acids? The code is the mode $u_{\alpha i}$ that minimizes the free energy. This mode corresponds to the maximal eigenvalue of w. Knowledge of w at the phase transition yields the code. What can we say without such knowledge? (Why 20?) More amino acids: more sensitivity to errors. Fewer amino acids: reduced functionality of proteins. Historical mechanisms: freezing, biosynthetic, etc. Twenty as a topological feature of a generic evolutionary phase transition?

The probable errors define the graph and the topology of the genetic code. Graph = codon vertices + one-letter-difference edges (Hamming = 1). [Figure: the codon graph K4 × K4 × K3, with bases U, A, G, C at the 1st and 2nd positions and U/C, A, G at the 3rd.]

Topology and genus of a simpler code. Doublet code with 3 bases (A, C, U), 9 codons:
UU AU CU
UA AA CA
UC AC CC
V = vertices, E = edges, F = faces. Euler’s characteristic: χ = V – E + F. Euler genus (number of holes): γ = 1 – (1/2) χ. The doublet code with 3 bases is embedded on a torus. Each codon has d = 4 neighbors. Faces are quadrilateral mutation cycles, so F = V (d/4) = 9 and E = V (d/2) = 18, which gives χ = 9 – 18 + 9 = 0 and γ = 1.

The genetic code graph is holey. The 48-codon graph K4 × K4 × K3:
– Each codon has degree d = 3 + 3 + 2 = 8, therefore E = 48 (d/2) = 192 edges and F = 48 (d/4) = 96 faces.
– The Euler characteristic is χ = V – E + F = –48.
– Euler’s genus is γ = 1 – (1/2) χ = 25 (24 holes + Klein).
– Embedding by group automorphism analysis.
Can one hear the shape of the code?
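The counts for both the doublet code and the 48-codon graph can be checked by building the graphs explicitly. In the sketch below the vertex and edge counts come from enumeration, while the face count uses the slides' assumption that every face is a quadrilateral, F = V·d/4. The function names and the label `Y` for the merged U/C third base are illustrative.

```python
from itertools import product

def hamming_graph(alphabets):
    """Vertices are words with one letter per position; edges join words at Hamming distance 1."""
    vertices = ["".join(c) for c in product(*alphabets)]
    edges = [
        (a, b)
        for idx, a in enumerate(vertices)
        for b in vertices[idx + 1:]
        if sum(x != y for x, y in zip(a, b)) == 1
    ]
    return vertices, edges

def topology(alphabets):
    """Return (V, E, F, chi, Euler genus), assuming quadrilateral faces."""
    vertices, edges = hamming_graph(alphabets)
    V, E = len(vertices), len(edges)
    d = 2 * E // V          # the graph is regular, so this is the common degree
    F = V * d // 4          # assumption from the slides: all faces are 4-cycles
    chi = V - E + F
    return V, E, F, chi, 1 - chi // 2

doublet = topology(["ACU", "ACU"])          # simpler code: 3 bases, 2 positions
genetic = topology(["UCAG", "UCAG", "YAG"])  # 48 discernible codons
print(doublet, genetic)
```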

The genetic code has a spectrum. $u_{\alpha i}$ is the average preference of codon i to encode amino acid α. Every mode corresponds to an amino acid → number of modes = number of amino acids. The misreading w is actually the graph Laplacian, $w = -(\Delta - \Delta_{\text{random}})$, where $\Delta_{ij} = -W_{ij}$ and $\Delta_{ii} = \sum_{j \neq i} W_{ij}$. Δ measures the difference between codons and their neighbors, a natural measure for the error load. The maximal mode of w is the 2nd eigenmode of Δ. Courant’s theorem: the $u_{\alpha i}$ have a single maximum → a single contiguous domain for each amino acid.
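The statement that Δ measures the difference between codons and their neighbors can be made concrete with a small sketch. The four-codon cycle and the rate `eps` are assumptions for illustration; the Laplacian is built exactly as defined on the slide, and the quadratic form uᵗΔu is compared for a contiguous and an alternating assignment.

```python
def laplacian(W):
    """Delta_ij = -W_ij for i != j, Delta_ii = sum over j != i of W_ij."""
    n = len(W)
    delta = [[-W[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        delta[i][i] = sum(W[i][j] for j in range(n) if j != i)
    return delta

def quadratic_form(delta, u):
    """u^t Delta u, which equals the sum over edges of W_ij * (u_i - u_j)**2."""
    n = len(u)
    return sum(u[i] * delta[i][j] * u[j] for i in range(n) for j in range(n))

eps, n = 0.05, 4  # assumed misreading rate on a cycle of four codons
W = [[0.0] * n for _ in range(n)]
for i in range(n):
    W[i][(i + 1) % n] = eps
    W[i][(i - 1) % n] = eps

delta = laplacian(W)
constant = [1, 1, 1, 1]       # no preference anywhere: zero cost
contiguous = [1, 1, -1, -1]   # one domain per sign
alternating = [1, -1, 1, -1]  # fragmented domains

print(quadratic_form(delta, contiguous), quadratic_form(delta, alternating))
```

The contiguous assignment costs half as much as the alternating one in this toy case, consistent with the slide's point that low Laplacian modes correspond to compact domains.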

The $u_{\alpha i}$ have single compact domains with one maximum and one minimum (Courant’s theorem). Compact organization reduces the impact of errors. There is a single domain in any direction (linearity), i.e. for any combination $\sum_\alpha n_\alpha u_{\alpha i}$. The embedding in $\mathbb{R}^{N-1}$ is tight → the code graph contains the complete graph $K_N$ [Banchoff 1965, Colin de Verdière 1987]. Number of amino acids = N = chr(γ). Topology optimizes the amino-acid assignment, which is in compact domains.

The coloring number of the code graph is an upper limit for the number of amino acids. What is the minimal number of colors required in a map so that no two adjacent regions have the same color? The coloring number is a topological invariant and therefore a function of the genus alone. Heawood’s conjecture [Ringel & Youngs, Appel & Haken].
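The coloring number can be evaluated with Heawood's formula, chr(γ) = ⌊(7 + √(1 + 48γ)) / 2⌋. The formula itself does not survive in this transcript, so using it here is an assumption about what the slide showed; it is the standard statement of the result cited (it does not hold for the Klein bottle, whose coloring number is 6). The function name is illustrative.

```python
import math

def heawood_number(euler_genus):
    """Heawood's formula floor((7 + sqrt(1 + 48*g)) / 2), computed in exact integer arithmetic."""
    return (7 + math.isqrt(1 + 48 * euler_genus)) // 2

sphere = heawood_number(0)         # four-color theorem
torus = heawood_number(1)          # the doublet code with 3 bases
genetic_code = heawood_number(25)  # genus of the 48-codon graph

print(sphere, torus, genetic_code)
```

With the genus γ = 25 obtained for the 48-codon graph, the formula gives twenty.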

The genetic code coevolves with increasing accuracy of translation. A path for the evolution of codes: from early codes with higher codon degeneracy and fewer amino acids to lower-degeneracy codes with more amino acids. Preliminary simulations. Twenty amino acids is invariant even in variant codes. The 21st and 22nd amino acids are context dependent. [Figure: coloring number (chr #) by discernible base position (1st, 2nd, 3rd), e.g. K4 × K4.]

Summary. The code of 64 three-letter codons is patterned and degenerate, and maps only 20 amino acids. The governing evolutionary dynamics is an interplay between protein diversity and error penalty, described by a stochastic diffusion equation. The 1st excited state of this diffusive mapping dynamics on the high-genus surface of the code yields a pattern of 20 ordered amino acids (20 = the coloring number of the graph). Topology + dynamics → coloring (?)

The transcription network is a code that relates DNA sites and binding proteins. Reading DNA to synthesize proteins is controlled by a system of protein-DNA interactions (transcription network). The presence or absence of a transcription factor (TF) may repress or enhance the synthesis of a protein from a nearby gene. The transcription network is actually a code that relates proteins to their DNA targets. Like the genetic code, transcription is subject to evolutionary forces and adapts to minimize errors. [Figure labels: Pol, TF, DNA.]

Probable recognition errors define the binding sequence space. Sphere packing (Shannon). Overlap and continuity. Typical binding site: 6 base pairs = 12 bits. Hamming = 1 edges give the graph $K_4^6$ → 4096 ‘codons’. Analogy: TF ↔ amino acid, codon ↔ binding site.

Probable recognition errors define the binding sequence space. Coloring number estimate: v = 4^L (L = 6), e ~ 4^L (3/2)L, f ~ 4^L (3/4)L → γ ~ 4^L (3/8)L. The coloring number chr(γ) ~ 300.
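The estimate can be reproduced with closed-form counts, since every site of length L over 4 bases has degree 3L. The sketch assumes quadrilateral faces (f = v·d/4, as for the genetic code graph) and uses Heawood's formula for the coloring number; both are assumptions carried over from the earlier slides rather than statements on this one, and the function names are illustrative.

```python
import math

def heawood_number(euler_genus):
    """Heawood's formula floor((7 + sqrt(1 + 48*g)) / 2) in integer arithmetic."""
    return (7 + math.isqrt(1 + 48 * euler_genus)) // 2

def binding_site_topology(L, q=4):
    """Counts for the Hamming graph of length-L words over q letters."""
    v = q ** L
    d = (q - 1) * L   # each position can mutate to q - 1 other letters
    e = v * d // 2
    f = v * d // 4    # assumption: every face is a quadrilateral
    chi = v - e + f
    return v, e, f, (2 - chi) // 2

v, e, f, genus_exact = binding_site_topology(6)
genus_approx = v * 3 * 6 // 8  # the slide's leading-order estimate 4^L (3/8) L

print(v, e, f, genus_exact, genus_approx)
print(heawood_number(genus_exact), heawood_number(genus_approx))
```

The exact genus keeps the vertex term that the leading-order estimate drops; either way the coloring number comes out at roughly 300.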

Open questions: Why does the code exhaust the coloring limit? Other population dynamics models (‘quasi-species’)? Glassy ‘almost-frozen’ dynamics? The necessity of the wobble (64/48)? 25 acids? A generic phase transition scenario that does not depend finely on missing details of the evolutionary pathway. Although not much is known about the primordial environment, minimal assumptions about the topology of probable errors can yield characteristics of biological codes. In particular, the number twenty of amino acids in the present picture is reminiscent of a ‘shell magic number’.

Shalev Itzkovitz, Guy Shinar, Uri Alon, Guy Sella, J.-P. Eckmann, Elisha Moses