1
Seven clusters and four types of symmetry in microbial genomes
Andrei Zinovyev
Bioinformatics service; group of M. Gromov
Tatyana Popova
R&D Centre in Biberach, Germany
Alexander Gorban
Centre for Mathematical Modelling

2
Symbol of GofG’05

3
Genomic sequence as a text in an unknown language
Text: tagggrcgcacgtggtgagctgatgctaggg
Frequency dictionaries:
singles ($N = 4 = 4^1$): t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
doublets ($N = 16 = 4^2$): ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
triplets ($N = 64 = 4^3$)
quadruplets ($N = 256 = 4^4$): tagg grcg cacg tggt gagc tgat gcta gggr
Running text: ..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
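A minimal sketch of how such frequency dictionaries could be built, assuming words are read as non-overlapping blocks of length k from the start of the text. The function name `frequency_dictionary` is hypothetical. Symbols outside a, c, g, t (such as the `r` in the slide's example) are counted as ordinary letters here, so the number of observed words can differ from $4^k$.

```python
from collections import Counter

def frequency_dictionary(text, k):
    """Relative frequencies of non-overlapping words of length k (hypothetical helper)."""
    # Cut the text into consecutive blocks of length k; an incomplete tail is dropped.
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

seq = "tagggrcgcacgtggtgagctgatgctaggg"  # example text from the slide
for k in (1, 2, 3, 4):
    d = frequency_dictionary(seq, k)
    # 4**k is the dictionary size N for a pure a/c/g/t text.
    print(k, 4 ** k, len(d))
```

For a short text most of the $4^k$ possible words never occur, which is one reason to work with fragments of a few hundred letters when k = 3.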

4
From text to geometry
Genomic text (scale label on the slide: $10^7$): cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Fragments:
cgtggtgagctgatgctagggrcgcac
ggtgagctgatgctagggrcgcacact
tgagctgatgctagggrcgcacaattc
gtgagctgatgctagggrcgcacggtg
……
gagctgatgctagggrcgcacaagtga
Fragments (length ~ …) → points in $\mathbb{R}^N$

5
Method of visualization: principal components analysis (PCA)
$\mathbb{R}^N \to \mathbb{R}^2$
[Figure: PCA plot]
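The pipeline from text to a PCA plot could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the window width (300) and step (30) are arbitrary choices, the input is a random toy text rather than a genome, and the helper names `triplet_vector`, `fragments` and `pca_2d` are hypothetical. PCA is done with a plain SVD of the centered data matrix.

```python
import itertools
import numpy as np

ALPHABET = "acgt"
# Map each of the 64 triplets to a coordinate index.
TRIPLET_INDEX = {"".join(t): i
                 for i, t in enumerate(itertools.product(ALPHABET, repeat=3))}

def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets in one fragment: a point in R^64."""
    v = np.zeros(len(TRIPLET_INDEX))
    for i in range(0, len(fragment) - 2, 3):
        j = TRIPLET_INDEX.get(fragment[i:i + 3])
        if j is not None:  # triplets containing non-acgt symbols are skipped
            v[j] += 1
    s = v.sum()
    return v / s if s > 0 else v

def fragments(text, width, step):
    """Sliding-window fragments of the text (width and step are assumptions)."""
    return [text[i:i + width] for i in range(0, len(text) - width + 1, step)]

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
text = "".join(rng.choice(list(ALPHABET), size=3000))  # toy text, not a genome
X = np.array([triplet_vector(f) for f in fragments(text, width=300, step=30)])
Y = pca_2d(X)  # one 2D point per fragment
print(X.shape, Y.shape)
```

A random text has no reading frame, so its projection shows no cluster structure; the sketch only demonstrates the mechanics of the mapping.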

6
Caulobacter crescentus
singles: $N = 4$
doublets: $N = 16$
triplets: $N = 64$ !!!
quadruplets: $N = 256$
The information in genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

7
First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

8
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
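The three triplet readings of one strand shown on this slide differ only by the starting offset. A minimal sketch, with the hypothetical helper `frame_triplets`, that reproduces them for the first example sequence:

```python
def frame_triplets(text, shift):
    """Non-overlapping triplets of text, read from offset shift (0, 1 or 2)."""
    return [text[i:i + 3] for i in range(shift, len(text) - 2, 3)]

seq = "gtgagctgatgctagggrcgcacgtggtgagc"  # first sequence on the slide
for shift in range(3):
    # Each shift gives a different set of triplets from the same text.
    print(shift, " ".join(frame_triplets(seq, shift)))
```

The triplet rows printed on the slide are sub-ranges of these three readings.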

9
Non-coding parts
gtgagctgatgctagggr cgcacgaat
Point mutations: insertions, deletions (example letter on the slide: a)

10
The flower-like 7 clusters structure is flat

11
Seven classes vs seven clusters
Stanford; TIGR; Georgia Institute of Technology

12
Computational gene prediction: accuracy > 90%

13
Mean-field approximation for triplet frequencies
$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers
Position-specific letter frequencies + correlations: 12 numbers

14
Why hexagonal symmetry?
$\text{GC-content} = P_C + P_G$

15
Genome codon usage and mean-field approximation
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct … (correct frameshift): 64 frequencies $F_{IJK}$
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct: 12 frequencies $P^1_I, P^2_J, P^3_K$

16
$P^j_I$ are linear functions of GC-content
[Figure panels: eubacteria, archaea]

17
THE MYSTERY OF TWO STRAIGHT LINES ???
$\mathbb{R}^{12} \to \mathbb{R}^{64}$
$F_{IJK} = P^1_I P^2_J P^3_K + \text{correlations}$
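The decomposition of codon usage into a mean-field product plus correlations can be illustrated with a small sketch. The codon list below is the toy example from the earlier slide (uppercased), not real genome statistics, and the helper names are hypothetical. GC-content is computed here as $P_C + P_G$ averaged over the three codon positions, which is an assumption about how the position-specific values are combined.

```python
import itertools
from collections import Counter

LETTERS = "ACGT"
ALL_CODONS = ["".join(t) for t in itertools.product(LETTERS, repeat=3)]

def codon_frequencies(codons):
    """64 codon frequencies F_IJK from a list of codons read in the correct frame."""
    counts = Counter(codons)
    total = sum(counts.values())
    return {c: counts[c] / total for c in ALL_CODONS}

def position_frequencies(F):
    """12 numbers: frequency of each letter at codon positions 1, 2 and 3."""
    P = [{x: 0.0 for x in LETTERS} for _ in range(3)]
    for codon, f in F.items():
        for pos, letter in enumerate(codon):
            P[pos][letter] += f
    return P

def mean_field(P):
    """Mean-field approximation: product of position-specific letter frequencies."""
    return {c: P[0][c[0]] * P[1][c[1]] * P[2][c[2]] for c in ALL_CODONS}

codons = ["ATG", "GAT", "GCT", "AGG", "GTC", "GCA", "CGC", "TAA"]  # toy example
F = codon_frequencies(codons)
P = position_frequencies(F)
M = mean_field(P)
correlations = {c: F[c] - M[c] for c in ALL_CODONS}  # what the product misses
gc_content = sum(P[j]["C"] + P[j]["G"] for j in range(3)) / 3
print(round(gc_content, 3))
```

Both $F$ and the mean-field product sum to one over the 64 codons, so the correlation terms always sum to zero.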

18
Codon usage signature 0-+

19
19 possible eubacterial signatures

20
Example: Palindromic signatures

21
Four symmetry types of the basic 7-cluster structure (eubacteria)
flower-like
degenerated
perpendicular triangles
parallel triangles

22
B. halodurans (GC = 44%)
S. coelicolor (GC = 72%)
F. nucleatum (GC = 27%)
E. coli (GC = 51%)

23
Web site: cluster structures in genomic sequences

24
Human genome (chr19)
[Figure panels: non-repetitive sequences, repetitive sequences; singles, doublets, triplets]

25
Letter frequencies (3 dimensions)
GC-content (50%)
Purine-Pyrimidine (33%)
Amino-Keto (17%)
[Figure: three panels with a, t, c, g labels]
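Since the four letter frequencies sum to one, they span three dimensions, and the three contrasts named on this slide form an orthogonal basis of that space. A minimal sketch of the coordinates, using the standard groupings purine = {a, g}, pyrimidine = {c, t}, amino = {a, c}, keto = {g, t}; the function name and the sign conventions are assumptions made for illustration:

```python
from collections import Counter

def letter_coordinates(fragment):
    """Three orthogonal contrasts of the a, c, g, t frequencies of a fragment."""
    counts = Counter(fragment.lower())
    a, c, g, t = (counts[x] for x in "acgt")
    n = a + c + g + t  # symbols other than a, c, g, t are ignored
    if n == 0:
        raise ValueError("fragment contains no a, c, g or t")
    return {
        "gc_vs_at": (g + c - a - t) / n,              # GC-content axis
        "purine_vs_pyrimidine": (a + g - c - t) / n,  # purine-pyrimidine axis
        "amino_vs_keto": (a + c - g - t) / n,         # amino-keto axis
    }

print(letter_coordinates("tagggrcgcacgtggtgagctgatgctaggg"))
```

Each coordinate lies between -1 and 1, and a fragment with equal letter frequencies sits at the origin.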

26
Good non-linear 2D representation (elastic principal manifolds)
[Figure: map with A, T, G, C labels; scale from 0% to 100%]

27
Measuring densities
[Figure: maps with A, T, G, C labels]

28
Contrasting density distribution (two ideas)
Noise is Gaussian
Noise is smooth

29
Contrasted density
[Figure: maps with A, T, G, C labels]

30
Excluding repeats
[Figure: maps with A, T, G, C labels]

31
[Figure: maps with A, T, G, C labels]

32
Papers (type Zinovyev in Google)
Gorban A, Zinovyev A. PCA deciphers genome. arXiv preprint.
Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. Physica A 353.
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. In Silico Biology 5, 0025.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. In Silico Biology, V.3.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. Open Systems and Information Dynamics 10 (4).

33
People
Dr. Tanya Popova, Institute of Computational Modeling, Russia
Professor Alexander Gorban, University of Leicester, UK

