Quantifying contributions of mutations and homologous recombination to E. coli genomic diversity Sergei Maslov Department of Biosciences Brookhaven National.


1 Quantifying contributions of mutations and homologous recombination to E. coli genomic diversity Sergei Maslov Department of Biosciences Brookhaven National Laboratory, New York

2 Bacterial genome evolution happens in cooperation with phages

3 Variation between E. coli strains. Pan-genome of E. coli: M Touchon et al., PLoS Genetics (2009). Copy and Insert; Copy and Replace. Comparison of B vs K-12 strains of E. coli: FW Studier, P Daegelen, RE Lenski, S Maslov, JF Kim, JMB (2009)

4 Usual suspects are there but do not explain heterogeneity. Negative correlation with protein abundance: 2.5% of variation, P-value = $10^{-5}$. Positive correlation with distance from origin of replication: 0.4% of variation, P-value = $10^{-2}$.

5 High SNP numbers are clustered along the chromosome

6 Recombined Clonal

7 P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)

8

9 (SNPs by recombination) / (SNPs by clonal mutations): $r/\mu = 6 \pm 1$. Figure labels: clonal regions, recombined regions. P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)

10 Strains: K-12 vs ETEC-H10407, HS, O157-H7-Sakai. Neutral model: mutations and recombinations among 70 “genes”, population of $10^4$. C. Fraser et al. (2007) and (2009). P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)

11 Phase transition: $\Delta_c = 1.5\%$. P. Dixit, T. Y. Pang, Studier FW, Maslov S, PNAS submitted (2013)

12

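13 Why exponential tail? Time to coalescence: $\mathrm{Prob}(t) = \frac{1}{N_e}\left(1 - \frac{1}{N_e}\right)^{t-1} \approx \frac{1}{N_e}\exp(-t/N_e)$. With divergence $\Delta = 2\mu t$, the exponential slope is $1/(2\mu N_e)$, or $1/\theta$. Population size $N_e = (1 \pm 0.1) \times 10^{9}$, consistent with earlier estimates.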
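The coalescence-time formula on this slide can be checked numerically. The sketch below compares the exact geometric probability with its exponential approximation and converts time to divergence using $\Delta = 2\mu t$. The function names and the toy values of `n_e` and `mu` are illustrative assumptions, not numbers from the talk.

```python
import math


def coalescence_prob(t, n_e):
    """Exact geometric probability that two lineages coalesce exactly t generations back."""
    return (1.0 / n_e) * (1.0 - 1.0 / n_e) ** (t - 1)


def coalescence_prob_approx(t, n_e):
    """Exponential approximation of the geometric law, valid when n_e >> 1."""
    return math.exp(-t / n_e) / n_e


def divergence_density(delta, mu, n_e):
    """Density of pairwise divergence delta = 2*mu*t; decays as exp(-delta/theta)."""
    theta = 2.0 * mu * n_e  # theta = 2 * mu * N_e, so the slope is 1/theta
    return math.exp(-delta / theta) / theta


if __name__ == "__main__":
    n_e = 1000   # toy population size (assumption, far smaller than the slide's estimate)
    mu = 1e-5    # hypothetical mutation rate per site per generation
    for t in (1, 500, 2000):
        print(t, coalescence_prob(t, n_e), coalescence_prob_approx(t, n_e))
    print("theta =", 2.0 * mu * n_e)
```

On a semi-log plot, `divergence_density` is a straight line with slope $-1/\theta$, which is the exponential tail the slide refers to.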

14 Why $N_e \ll N$? Phages: but there are phages that cross species boundaries, and the slope is similar for different species. Restriction-modification systems: recombined segments are not continuous [Milkman R, Bridges MM. Genetics 1990]. Recombination efficiency: 20-30 identical bases are needed to start recombination; our slope predicts 60 bases, which roughly matches 30 at the beginning and 30 at the end. Species are defined by recombination.

15 Are our 30+ strains a representative sample? Fully sequenced genomes: 1000s of genes (unbiased and complete), 10s of strains (biased). MLST data: 10s of genes (biased), 1000s of strains (unbiased, I hope). Database: http://mlst.ucc.ie, ~3000 E. coli strains, 7 short regions of ~500 base pairs each in housekeeping genes.

16

17 MLST -- Genomes

18 Is it really phages? Phage capacity: 20 kb. Other strains: up to 40 kb. K-12 to B comparison. 1 kb: gene length.

19 Does the neutral model explain everything? At 3 standard deviations: 19 1-kb regions are supervariable; 29 1-kb regions are superconserved.

20 Collaborators & funding. Bill Studier (BNL), Purushottam Dixit (BNL), Tin Yau Pang (Stony Brook), Rich Lenski (Michigan State), Patrick Daegelen (France), Jinhyun Kim (Korea). DOE Systems Biology Knowledgebase (KBase): Adam Arkin (Berkeley), Rick Stevens (Argonne), Bob Cottingham (Oak Ridge), Mark Gerstein (Yale), Doreen Ware (Cold Spring Harbor), Mike Schatz (Cold Spring Harbor), Dave Weston (ORNL), 60+ other collaborators.

21 Thank you!

22

23

24 Genes encoded in bacterial genomes ~ Packages installed on Linux computers

25 Complex systems have many components: genes (bacteria); software packages (Linux OS). Components do not work alone: they need to be assembled to work. In individual systems only a subset of components is used: genome (bacteria), a bag of genes; computer (Linux OS), installed packages. Components have vastly different frequencies of use.

26 IKEA: has many components (image: Justin Pollard, http://www.designboom.com)

27 They need to be assembled to work (image: Justin Pollard, http://www.designboom.com)

28 Different frequencies of use: common vs rare

29 What determines the frequency of use? Popularity (AKA preferential attachment): frequency ~ self-amplifying popularity. Relevant for social systems: WWW links, Facebook friendships, scientific citations. Functional role: frequency ~ breadth or importance of the functional role. Relevant for biological and technological systems where selection adjusts undeserved popularity.

30 Empirical data on component frequencies. Bacterial genomes (eggnog.embl.de): 500 sequenced prokaryotic genomes, 44,000 orthologous gene families. Linux packages (popcon.ubuntu.com): 200,000 Linux packages installed on 2,000,000 individual computers. Binary tables: a component is either present or not in a given system.

31 Frequency distributions: $P(f) \sim f^{-1.5}$, except the top $\sqrt{N}$ “universal” components with $f \sim 1$. Figure labels: Cloud, Shell, Core, ORFans.

32 How to quantify functional importance? Components do not work alone. Breadth/importance ~ the component is needed for proper functioning of other components. Dependency network: A → B means A depends on B for its function. Formalized for Linux software packages; for metabolic enzymes, given by upstream-downstream positions in pathways. Frequency ~ dependency degree, $K_{dep}$. $K_{dep}$ = the total number of components that directly or indirectly depend on the selected one.

33

34 Frequency is positively correlated with functional importance. Correlation coefficient ~0.4 for both Linux and genes. Could be improved by using weighted dependency degree.

35 Tree-like metabolic network. Figure labels: $K_{dep} = 5$, $K_{dep} = 15$, TCA cycle.

36 Dependency degree distribution on a critical branching tree. $P(K) \sim K^{-1.5}$ for a critical branching tree. Paradox: $K_{max}^{-0.5} \sim 1/N \Rightarrow K_{max} = N^{2} > N$. Answer: parent tree size imposes a cutoff: there will be $\sqrt{N}$ “core” nodes with $K_{max} = N$ present in almost all systems (ribosomal genes or core metabolic enzymes). Need a new model: in a tree $D = 1$, while in real systems $D \sim 2 > 1$.

37 Dependency network evolution. New components are added gradually over time. Each new component depends on $D$ existing components selected randomly. $K_{dep}(t) \sim (t/N)^{-D}$. $P(K_{dep}(t) > K) = P(t/N < K^{-1/D}) = K^{-1/D}$. $P(K_{dep}) \sim K_{dep}^{-(1+1/D)} = K_{dep}^{-1.5}$ for $D = 2$. $N_{universal} = N^{(D-1)/D} = N^{0.5}$ for $D = 2$.
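A minimal simulation sketch of the growth model on this slide: components are added one at a time, each depending on up to D randomly chosen earlier components, and the dependency degree of a component is counted as the number of components that depend on it directly or indirectly. The function name `simulate_k_dep`, the bitmask bookkeeping, and the handling of the first few components (which have fewer than D predecessors) are illustrative choices, not details given in the talk.

```python
import random


def simulate_k_dep(n, d=2, seed=0):
    """Grow a dependency network of n components; return K_dep for each one.

    Component j (added at time j) depends on min(d, j) distinct earlier
    components chosen uniformly at random. K_dep[i] counts all components
    that depend on i directly or indirectly.
    """
    rng = random.Random(seed)
    deps = [rng.sample(range(j), min(d, j)) for j in range(n)]

    # reach[i] is a bitmask of every component that depends on i.
    # Dependents always have a larger index, so scanning from newest to
    # oldest guarantees reach[j] is complete before it is propagated.
    reach = [0] * n
    for j in range(n - 1, -1, -1):
        mask = reach[j] | (1 << j)
        for i in deps[j]:
            reach[i] |= mask
    return [bin(r).count("1") for r in reach]


if __name__ == "__main__":
    k_dep = simulate_k_dep(2000, d=2, seed=1)
    print("oldest component:", k_dep[0], "newest component:", k_dep[-1])
    print("ten largest K_dep:", sorted(k_dep, reverse=True)[:10])
```

Sorting `k_dep` in decreasing order and plotting it against rank on log-log axes gives a Zipf plot that can be compared with the $K_{dep}^{-1.5}$ prediction for $D = 2$.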

38 $K_{dep}$ decreases with layer number. Panels: Linux; model with $D = 2$.

39 Zipf plot for $K_{dep}$ distributions. Panels: metabolic enzymes vs model; Linux vs model.

40 Frequency distributions: $P(f) \sim f^{-1.5}$, except the top $\sqrt{N}$ “universal” components with $f \sim 1$. Figure labels: Shell, Core, ORFans, Cloud.

41 Why should we care about P(f)?

42 Metagenomes and pan-genomes. The Human Microbiome Project Consortium, Nature (2012). For $P(f) \sim f^{-1.5}$: (pan-genome size) $\sim$ (# of samples)$^{0.5}$.
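One way to see where the square-root law could come from is sketched below. It assumes, which the slide does not state, that each component is present independently in each sample with probability $f$, and that the number of samples $n$ is large but still well below $1/f_{min}$, the inverse of the smallest frequency. $G(n)$, $N_{tot}$ and $C$ are symbols introduced here for the pan-genome size, the total number of components and the normalization constant.

```latex
\begin{align}
G(n) &= N_{tot} \int P(f)\,\bigl[1-(1-f)^{n}\bigr]\,df
      \;\approx\; N_{tot}\, C \int f^{-1.5}\,\bigl(1-e^{-nf}\bigr)\,df \\
     &= N_{tot}\, C\, n^{0.5} \int_{0}^{\infty} x^{-1.5}\,\bigl(1-e^{-x}\bigr)\,dx
      \;=\; 2\sqrt{\pi}\; N_{tot}\, C\, n^{0.5}, \qquad x = nf .
\end{align}
```

The integral over $x$ converges at both ends, so all dependence on the number of samples sits in the prefactor $n^{0.5}$.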

43 Pan-genome of E. coli strains M Touchon et al. PLoS Genetics (2009)

44 Genome evolution in E. coli Studier FW, Daegelen P, Lenski RE, Maslov S, Kim JF J. Mol Biol. (2009) P. Dixit, T. Y. Pang, Studier FW, Maslov S, submitted (2013)

45 How many transcription factors does an organism need? Regulator genes; worker genes. S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009). TY Pang, S. Maslov, PLoS Comp Bio (2011)

46 $N_R \sim N_G^{2} \Rightarrow N_R/N_G \sim N_G$. Figure adapted from S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009)

47 “… bureaucracy grew by 5-7% per year ‘irrespective of any variation in the amount of work (if any) to be done.’” Why? 1) “An official wants to multiply subordinates, not rivals.” 2) “Officials make work for each other.” So that “Work expands so as to fill the time available for its completion.” Cyril Northcote Parkinson (1909-1993). Is this what happens in bacterial genomes? Probably not!

48 Economies of scale in bacterial evolution. $N_R = N_G^{2}/80{,}000 \Rightarrow N_G/N_R = 80{,}000/N_G$. Economies of scale: as the genome gets larger, new pathways get shorter.
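The quadratic scaling on this slide is easy to evaluate for a given genome size. The sketch below only restates the slide's formula; the function names and the round genome sizes in the loop are illustrative assumptions.

```python
SCALE = 80_000  # constant from the slide: N_R = N_G**2 / 80,000


def regulators(n_genes, scale=SCALE):
    """Number of regulator genes N_R predicted for a genome with n_genes genes."""
    return n_genes ** 2 / scale


def genes_per_regulator(n_genes, scale=SCALE):
    """Average number of genes per regulator, N_G / N_R = 80,000 / N_G."""
    return scale / n_genes


if __name__ == "__main__":
    # Hypothetical round genome sizes, chosen only to show the trend.
    for n_genes in (1000, 2000, 4000, 8000):
        print(n_genes, regulators(n_genes), genes_per_regulator(n_genes))
```

Doubling the genome size quadruples the predicted number of regulators and halves the number of genes per regulator, which is the "new pathways get shorter" trend.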

49 Horizontal gene transfer: entire pathways could be added in one step. Redundant enzymes are removed. Central metabolic core → anabolic pathways → biomass production. Figure labels: nutrient, nutrient.

50 Adapted from “scope-expansion” algorithm by R. Heinrich et al. Minimal metabolic pathways from reactions in KEGG database. (# of pathways or their regulators) $\sim$ (# of enzymes)$^{2}$. Axis labels: $N_G$, $N_R$.

51 What does it all mean for regulatory networks? Scale-free regulatory networks with “hubs” due to power-law distribution of branch sizes: $P(S) \sim S^{-3}$. Trends in complexity of regulation vs. genome size: $N_R \langle K_{out} \rangle = N_G \langle K_{in} \rangle$ = number of regulatory interactions. E. van Nimwegen, TIG (2003): $N_R/N_G = \langle K_{in} \rangle / \langle K_{out} \rangle$ increases with $N_G$. Either $\langle K_{out} \rangle$ decreases with $N_G$: functions become more specialized. Or $\langle K_{in} \rangle$ grows with $N_G$: regulation gets more coordinated & interconnected. Most likely both trends at once.

52 Regulatory templates: one worker, one boss: = 1 = const. Figure labels: nutrient, TF1, nutrient, TF2.

53 Regulatory templates: long top-to-bottom regulation, = const. Figure labels: nutrient, TF1, nutrient, TF2.

54 Regulatory templates: hierarchy & middle management. Figure labels: nutrient, TF1, TF2, TF3.

55 Histogram of the # of SNPs in genes. FW Studier, P Daegelen, RE Lenski, S Maslov, JF Kim, JMB (2009): comparison of B vs K-12 strains of E. coli. 50% of genes have very few SNPs: 1253 genes with 0 SNPs, 445 with 1 SNP, 232 with 2 SNPs. The remaining 50% are in an exponential tail up to 100 SNPs (10% divergence) and higher.

