Simple cluster structure of triplet distributions in genetic texts. Andrei Zinovyev, Institut des Hautes Études Scientifiques, Bures-sur-Yvette.

1 Simple cluster structure of triplet distributions in genetic texts. Andrei Zinovyev, Institut des Hautes Études Scientifiques, Bures-sur-Yvette

2 Markov chain models: transition probabilities = frequencies of N-grams (…AGGTCGATC…)

3 Sliding window of width W over the sequence AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC. Triplet frequencies in the window: f_AAA, f_AAC, f_GGG, … = f_ijk, with i, j, k in {A, C, G, T}

4 Protein-coding sequences. Bacterial gene: AGGTCG ATG AATCCGTATTGACAAATGAATCCG TAA TGACATGACAATCCAACATGACAAT. Triplet distribution in the correct frame: f_ijk; in the two shifted frames: f_ijk^(1), f_ijk^(2)
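The three distributions on this slide can be obtained from one sequence by reading non-overlapping triplets from offsets 0, 1 and 2. Below is a minimal Python sketch; the function name `triplet_frequencies` is hypothetical, and the test string is the ATG…TAA fragment shown on the slide.

```python
from collections import Counter
from itertools import product

# All 64 triplets over the alphabet {A, C, G, T}.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]

def triplet_frequencies(seq, phase=0):
    """Frequencies of non-overlapping triplets read from offset `phase`."""
    counts = Counter(seq[i:i + 3] for i in range(phase, len(seq) - 2, 3))
    total = sum(counts[t] for t in TRIPLETS)
    return {t: (counts[t] / total if total else 0.0) for t in TRIPLETS}

# Fragment from the slide, from the start codon ATG to the stop codon TAA.
gene = "ATGAATCCGTATTGACAAATGAATCCGTAA"
f0 = triplet_frequencies(gene, 0)  # correct frame
f1 = triplet_frequencies(gene, 1)  # frame shifted by one base
f2 = triplet_frequencies(gene, 2)  # frame shifted by two bases
```

Each result is a 64-entry distribution, so the three frames can be compared as points in the same space.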

5 “Shadow” genes. The two complementary strands: TCCAGC TTA TGAGGCATAACTGTTTACTGAGGC CAT ACT GTACTGTTAGGTTGTACTGTTA and AGGTCG AAT ACTCCGTATTGACAAATGACTCCG GTA TGACATGACAATCCAACATGACAAT (shadow gene)
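A gene located on the complementary strand appears on the forward strand as its reverse complement, which is why the start codon ATG shows up as CAT and the stop codon TAA as TTA. The sketch below assumes that this is what "shadow gene" refers to here; the function name and the toy gene are illustrative only.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the reading direction."""
    return seq.translate(COMPLEMENT)[::-1]

gene = "ATGAAATAA"                 # toy gene: start codon, one codon, stop codon
shadow = reverse_complement(gene)  # how the forward strand reads it
```

Applying the function twice returns the original sequence, so the mapping between a gene and its shadow is one-to-one.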

6 When can we detect genes (by their content)?
1. When non-coding regions are very different in base composition (e.g., different GC-content).
2. When distances between the phases are large.

7 Simple experiment
1. Only the forward strands of genomes are used for triplet counting.
2. Every p positions in the sequence, open a window (x - W/2, x + W/2) of size W centered at position x.
3. Every window, starting from the first base pair, is divided into W/3 non-overlapping triplets, and the frequencies f_ijk of all triplets are calculated.
4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence.
5. Every data point X_i = {x_is} corresponds to one window and has 64 coordinates, the frequencies of all possible triplets, s = 1, …, 64.
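The five steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's code: windows are indexed by their left edge rather than their center, incomplete windows at the end of the sequence are dropped (so the number of points is slightly below [L/p]), and triplets containing characters other than A, C, G, T are skipped. The names `window_dataset` and `INDEX` are hypothetical.

```python
from itertools import product

TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
INDEX = {t: s for s, t in enumerate(TRIPLETS)}  # triplet -> coordinate s

def window_dataset(seq, W, p):
    """One 64-dimensional frequency vector per window of width W, step p."""
    assert W % 3 == 0, "W must be a multiple of 3"
    points = []
    for start in range(0, len(seq) - W + 1, p):
        window = seq[start:start + W]
        x = [0.0] * 64
        # Split the window into W/3 non-overlapping triplets.
        for i in range(0, W, 3):
            s = INDEX.get(window[i:i + 3])
            if s is not None:  # assumption: ignore ambiguous bases such as N
                x[s] += 1
        total = sum(x)
        points.append([c / total for c in x] if total else x)
    return points

# Toy sequence: a perfect repeat of one codon.
toy = "ATG" * 20
in_frame = window_dataset(toy, W=12, p=3)  # every window starts in frame
any_frame = window_dataset(toy, W=12, p=1)  # windows cycle through the phases
```

With a step p that is not a multiple of 3, consecutive windows cycle through the three phases, which is what lets the phase-shifted distributions show up as separate groups of points.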

8 Principal Component Analysis: 1st principal axis (direction of maximal dispersion), 2nd principal axis
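For readers who want to reproduce the projection, the first two principal axes can be found by power iteration on the centered data, removing the first axis before searching for the second. The sketch below uses only the standard library and is one common way to compute PCA; the talk does not say which algorithm was used, and all function names are hypothetical.

```python
import random

def center(data):
    """Subtract the mean of every coordinate."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in data]

def first_axis(X, iters=200, seed=0):
    """Power iteration for the direction of maximal dispersion of centered X."""
    rng = random.Random(seed)
    d = len(X[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in X]  # X v
        w = [sum(X[i][j] * proj[i] for i in range(len(X)))      # X^T (X v)
             for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            break
        v = [c / norm for c in w]
    return v

def deflate(X, v):
    """Remove the component along v so the next axis is orthogonal to it."""
    out = []
    for r in X:
        t = sum(a * b for a, b in zip(r, v))
        out.append([a - t * b for a, b in zip(r, v)])
    return out

def principal_axes(data, k=2):
    X = center(data)
    axes = []
    for _ in range(k):
        v = first_axis(X)
        axes.append(v)
        X = deflate(X, v)
    return axes

# Toy data: spread is largest along the first coordinate.
axes = principal_axes([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

The sign of each axis is arbitrary, so only the direction is meaningful.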

9 ViDaExpert tool

10 Caulobacter crescentus (GenBank NC_002696)

11 “Path” of sliding window

12 Helicobacter pylori (GenBank NC_000921)

13 Saccharomyces cerevisiae chromosome IV

14 Model sequences (random codon usage)

15 Model sequences (random codon usage + 50% of frequencies are set to 0)

16 Graph of coding phase

17 Assessment
Columns: Sequence, L, W, % of coding bases, Sn1, Sp1, Sn2, Sp2
Sequences: Helicobacter pylori, complete genome (NC_000921); Caulobacter crescentus, complete genome (NC_002696); Prototheca wickerhamii mitochondrion (NC_001613); Saccharomyces cerevisiae chromosome III (NC_001135); Saccharomyces cerevisiae chromosome IV (NC_001136)
L: 1643831, 4016947, 55328, 316613, 1531929
W: 300, 120, 399
% of coding bases: 90, 91, 49, 69, 73
Sn1, Sp1, Sn2, Sp2: 0.93 0.82 0.90 0.89 0.97 0.93 0.88 0.91 0.93 0.94 0.84 0.90 0.92 0.98 0.95 0.90 0.92
Model texts: RANDOM, RANDOM_BIAS
L, W: 100000, 500
% of coding bases: 49, 45
Sn1, Sp1, Sn2, Sp2: 0.90 0.99 0.61 0.83 0.82 0.94 0.77 0.90
Completely blind prediction
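The slide does not define Sn and Sp or say how the two variants (1 and 2) differ. The sketch below assumes the usual gene-finding convention at the level of single bases: sensitivity Sn = TP / (TP + FN) and specificity Sp = TP / (TP + FP), where a "positive" is a base labeled as coding. The function name and the toy labels are hypothetical.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Base-level Sn and Sp from two equal-length sequences of 0/1 labels.

    Assumed convention (not stated on the slide):
      Sn = TP / (TP + FN), Sp = TP / (TP + FP).
    """
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Toy annotation: 4 coding bases, 3 found, 1 missed, 1 false alarm.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
guess = [1, 1, 1, 0, 1, 0, 0, 0]
sn, sp = sensitivity_specificity(truth, guess)
```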

18 Dependence on window size

19 W = 51, W = 252, W = 900, W = 2000

20 State of the art: GLIMMER strategy
1. Use a Markov model (MM) of 5th order (hexamers).
2. Use interpolation for transition probabilities.
3. Use long ORFs (>500 bp) as the learning dataset.
Problems:
1. The number of hexamers to be evaluated is still large.
2. Applicable only to assembled genomes of good quality (<1 frameshift per 1000 bp).
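To see why the hexamer count matters, recall that a 5th-order Markov model conditions each base on the five preceding ones, so its transition probabilities are ratios of hexamer counts to pentamer-context counts. The sketch below is a plain maximum-likelihood estimate for illustration only; it is not GLIMMER's code and omits the interpolation step, and the function name is hypothetical.

```python
from collections import Counter

def fifth_order_transitions(seq):
    """Estimate P(x6 | x1..x5) as count(hexamer) / count(its 5-base context)."""
    hexamers = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    contexts = Counter()
    for h, c in hexamers.items():
        contexts[h[:5]] += c
    return {h: c / contexts[h[:5]] for h, c in hexamers.items()}

# Number of parameters to estimate: one per possible hexamer.
n_hexamers = 4 ** 6

# Toy sequence: ten A's followed by one C.
probs = fifth_order_transitions("AAAAAAAAAAC")
```

With 4096 hexamers, a short training set leaves many counts at or near zero, which is the problem interpolation is meant to address.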

21 What can we learn from this game?
Learning can be replaced with self-learning.
Bacterial gene-finders work relatively well when the concentration of coding sequences is high.
Correlations in the order of codons are small.
Codon usage is approximately the same along the genome.
The method presented allows self-learning on pieces of even unassembled DNA (>150 bp).
The method offers an alternative to the HMM view of the gene recognition problem.

22 Acknowledgements: Professor Alexander Gorban, Professor Misha Gromov. Contact: http://www.ihes.fr/~zinovyev

