Hierarchical Cluster Structures and Symmetries in Genomic Sequences Andrei Zinovyev Institut des Hautes Études Scientifiques group of M.Gromov.


1 Hierarchical Cluster Structures and Symmetries in Genomic Sequences Andrei Zinovyev Institut des Hautes Études Scientifiques Math@Bio group of M.Gromov

2 Plan of the talk. Genomic sequences: geometric approach, clustering; Genomic sequence as text; Basic 7-cluster structure; Global structure of codon frequencies; Internal structure of codon frequencies; Applications

3 Introduction Frequency dictionaries

4 Genomic sequence as a text in unknown language
tagggrcgcacgtggtgagctgatgctaggg
frequency dictionaries:
singles (N = 4 = 4^1): t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
doublets (N = 16 = 4^2): ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
triplets (N = 64 = 4^3)
quadruplets (N = 256 = 4^4): tagg grcg cacg tggt gagc tgat gcta gggr
gggrcgccacgttggtgagctgatgctagggrcgacgtgg tagggrcgcacgtggtgagctgatgctagggrcgacgtgg agggrcgcacgtggtgagctgatgctagggrcgacgtggc..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
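A frequency dictionary of this kind can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the author's code: the function name `frequency_dictionary` is hypothetical, and skipping words that contain a letter outside a, c, g, t (such as the `r` in the example text) is an assumption of the sketch.

```python
from collections import Counter
from itertools import product

ALPHABET = "acgt"

def frequency_dictionary(text, n):
    """Frequencies of non-overlapping n-letter words read from position 0.

    Words containing a letter outside ALPHABET are skipped
    (an assumption of this sketch, not something the slide states).
    """
    words = [text[i:i + n] for i in range(0, len(text) - n + 1, n)]
    valid = [w for w in words if set(w) <= set(ALPHABET)]
    counts = Counter(valid)
    total = len(valid)
    # One entry per possible word, so the dictionary always has 4**n keys.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(ALPHABET, repeat=n)
    }

text = "tagggrcgcacgtggtgagctgatgctaggg"
for n in (1, 2, 3, 4):
    d = frequency_dictionary(text, n)
    print(n, len(d))  # dictionary sizes 4, 16, 64, 256
```

The dictionary size depends only on the word length n, which is why the same short text yields vectors of dimension 4, 16, 64 or 256.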

5 From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc (10^7)
fragments: cgtggtgagctgatgctagggrcgcac ggtgagctgatgctagggrcgcacact tgagctgatgctagggrcgcacaattc gtgagctgatgctagggrcgcacggtg …… gagctgatgctagggrcgcacaagtga
length ~300-400; 3000-4000 fragments; R^N
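The step from text to geometry can be sketched as follows: cut the sequence into fragments with a sliding window and turn each fragment into its triplet-frequency vector, giving one point in R^64 per fragment. The window width of 300 follows the slide; the step size, the function name `window_vectors`, and the random test sequence are assumptions made for illustration only.

```python
import numpy as np
from itertools import product

TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def window_vectors(seq, width=300, step=30):
    """Map each sliding window to a 64-dimensional triplet-frequency vector.

    `step` is a hypothetical parameter; the slide gives only the fragment
    length (~300-400) and the number of fragments (3000-4000).
    """
    rows = []
    for start in range(0, len(seq) - width + 1, step):
        window = seq[start:start + width]
        v = np.zeros(len(TRIPLETS))
        # Count non-overlapping triplets, read from the window start.
        for i in range(0, width - 2, 3):
            j = INDEX.get(window[i:i + 3])
            if j is not None:  # skip triplets with letters outside acgt
                v[j] += 1
        total = v.sum()
        rows.append(v / total if total else v)
    return np.array(rows)

# Synthetic sequence, used only to show the shapes involved.
rng = np.random.default_rng(0)
seq = "".join(rng.choice(list("acgt"), size=3000))
X = window_vectors(seq)
print(X.shape)  # one row per fragment, 64 columns
```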

6 Method of visualization: principal components analysis, R^N → R^2 (PCA plot)
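A minimal sketch of the projection R^N → R^2, assuming plain PCA on mean-centered data computed through a singular value decomposition. The slide does not say how the data were normalized, so the centering-only preprocessing and the name `pca_project` are assumptions.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the first k principal components."""
    Xc = X - X.mean(axis=0)                  # center each coordinate
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # coordinates in the PCA plane

# Synthetic data standing in for the fragment vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 64))
Y = pca_project(X)
print(Y.shape)  # (200, 2)
```

Each row of `Y` is one fragment on the PCA plot; clusters in the plot correspond to groups of fragments with similar frequency dictionaries.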

7 Chapter 1 Basic 7-cluster structure (level 1 of non-randomness)

8 Caulobacter crescentus: singles (N=4), doublets (N=16), triplets (N=64), quadruplets (N=256). !!! The information in genomic sequence is encoded by non-overlapping triplets

9 First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

10 Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
tga tgc tag ggr cgc acg tgg
ctg atg cta ggg rcg cac gtg
gct gat gct agg grc gca cgt
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
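The three readings of each sequence on this slide differ only in the offset at which triplet counting starts. A short sketch, with the hypothetical helper `read_in_phase`, reproduces them for the first example sequence:

```python
def read_in_phase(seq, phase):
    """Split seq into non-overlapping triplets starting at offset phase (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]

seq = "gtgagctgatgctagggrcgcacgtggtgagc"
for phase in range(3):
    print(phase, " ".join(read_in_phase(seq, phase)))
```

The same stretch of text therefore produces three different triplet dictionaries, one per phase, which is why a window over a coding region can land in different clusters depending on where it starts.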

11 Non-coding parts gtgagctgatgctagggr cgcacgaat Point mutations: insertions, deletions a

12 Mean-field approximation for triplet frequencies. F_IJK: frequency of triplet IJK (I, J, K ∈ {A, C, G, T}): F_AAA, F_AAT, F_AAC, …, F_GGC, F_GGG: 64 numbers; letter frequency + correlations: 12 numbers
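One common reading of this reduction from 64 to 12 numbers is the product form F_IJK ≈ P^1_I · P^2_J · P^3_K, where P^j_I is the frequency of letter I at position j of a triplet (3 positions × 4 letters = 12 numbers). The sketch below assumes that reading; the function names and the short example sequence are illustrative, not taken from the talk.

```python
from collections import Counter
from itertools import product

LETTERS = "acgt"

def triplet_frequencies(seq):
    """The 64 frequencies F_IJK of non-overlapping triplets read from position 0."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    triplets = [t for t in triplets if set(t) <= set(LETTERS)]
    counts = Counter(triplets)
    n = len(triplets)
    return {"".join(p): counts["".join(p)] / n
            for p in product(LETTERS, repeat=3)}

def position_frequencies(F):
    """The 12 numbers P[j][I]: frequency of letter I at triplet position j."""
    P = [dict.fromkeys(LETTERS, 0.0) for _ in range(3)]
    for triplet, f in F.items():
        for pos, letter in enumerate(triplet):
            P[pos][letter] += f
    return P

def mean_field(P):
    """Product approximation of F_IJK built from the 12 position frequencies."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product(LETTERS, repeat=3)}

seq = "atggatgctaggtaa"   # made-up example: atg gat gct agg taa
F = triplet_frequencies(seq)
P = position_frequencies(F)
M = mean_field(P)
print(len(F), sum(len(p) for p in P), len(M))  # 64 12 64
```

The difference between `F` and `M` measures how much of the triplet distribution is not explained by position-specific letter frequencies alone.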

13 Why hexagonal symmetry? 0-+, -+0, +0-, +-0, -0+, 0+-. GC-content = P_C + P_G

14 Chapter 2 Global structure of codon frequencies (143 complete bacterial genomes)

15 Genome codon usage and mean-field approximation
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct … (correct frameshift): 64 frequencies F_IJK
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct: 12 frequencies P^1_I, P^2_J, P^3_K

16 Global structure of codon frequencies (plot legend: eubacteria, archaea)

17 P^j_I are linear functions of GC-content

18 Four symmetry types of the basic 7-cluster structure (eubacteria): flower-like, degenerated, perpendicular triangles, parallel triangles

19 Chapter 3 Internal structure of codon frequencies (level 2 of non-randomness)

20 Second level of hierarchy ?

21 Distribution of genes in R^64: function1, function2, function3

22 Fast-growing bacteria (plot labels: IV, II, I, III). Genes of class I (most genes); genes of class II (highly expressed); genes of class III (unusual); genes of class IV (hydrophobic proteins)

23 Escherichia coli. Genes of class I (most genes); genes of class II (highly expressed); genes of class III (unusual); genes of class IV (hydrophobic proteins)

24 Chapter 4 Applications

25 Computational gene prediction: accuracy >90%

26 Protein expression optimization IV II I III gene sequence S, protein A gene sequence S, same protein A, higher expression

27 Web-site http://www.ihes.fr/~zinovyev/7clusters cluster structures in genomic sequences

28 Papers
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

29 People: Dr. Tanya Popova, Institute of Computational Modeling, Russia; Professor Alexander Gorban, University of Leicester, UK

