Simple cluster structure of triplet distributions in genetic texts
Andrei Zinovyev, Institut des Hautes Études Scientifiques, Bures-sur-Yvette



Markov chain models
…AGGTCGATC…
Transition probabilities = frequencies of N-grams
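The link between N-gram frequencies and Markov chain transition probabilities can be sketched as follows. This is a minimal illustration, not the author's code: the function name `transition_probabilities` and the parameter `order` are hypothetical, and no pseudocounts or smoothing are applied.

```python
from collections import Counter

def transition_probabilities(seq, order=2):
    """Estimate P(next base | previous `order` bases) from (order+1)-gram counts."""
    # Count every overlapping N-gram with N = order + 1.
    ngram_counts = Counter(seq[i:i + order + 1] for i in range(len(seq) - order))
    # Total count of each context (the first `order` letters of the N-gram).
    context_totals = Counter()
    for ngram, count in ngram_counts.items():
        context_totals[ngram[:-1]] += count
    # Transition probability = N-gram count / count of its context.
    return {ngram: count / context_totals[ngram[:-1]]
            for ngram, count in ngram_counts.items()}

probs = transition_probabilities("AGGTCGATCAATCCGTATTGACAATCC", order=2)
```

Each key of `probs` is a triplet such as `"AGG"`, read as "context `AG` followed by `G`"; the values for one context sum to 1.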

Sliding window of width W
AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC
Triplet frequencies: $f_{AAA}, f_{AAC}, \dots, f_{GGG} = f_{ijk}$, with $i, j, k \in \{A, C, G, T\}$

Protein-coding sequences
AGGTCG ATG AATCCGTATTGACAAATGAATCCG TAA TGACATGACAATCCAACATGACAAT
Bacterial gene (from ATG to TAA). Triplet frequencies in the correct frame: $f_{ijk}$; in the two shifted frames: $f^{(1)}_{ijk}$, $f^{(2)}_{ijk}$
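The three triplet distributions of a coding sequence can be illustrated with a short sketch. It assumes that the superscripts (1) and (2) denote reading the same sequence with an offset of one and two bases; the function name `frame_triplet_frequencies` and the toy gene are hypothetical.

```python
from collections import Counter

def frame_triplet_frequencies(seq, shift=0):
    """Frequencies of non-overlapping triplets read from offset `shift` (0, 1 or 2)."""
    triplets = [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]
    counts = Counter(triplets)
    total = len(triplets)
    return {t: c / total for t, c in counts.items()}

gene = "ATGAATCCGTAA"                            # toy gene: ATG ... TAA
f0 = frame_triplet_frequencies(gene, shift=0)    # correct frame
f1 = frame_triplet_frequencies(gene, shift=1)    # frame shifted by one base
f2 = frame_triplet_frequencies(gene, shift=2)    # frame shifted by two bases
```

For a real gene the three dictionaries differ systematically, which is what makes the coding phase detectable from content alone.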

“Shadow” genes
TCCAGC TTA TGAGGCATAACTGTTTACTGAGGC CAT ACT GTACTGTTAGGTTGTACTGTTA
AGGTCG AAT ACTCCGTATTGACAAATGACTCCG GTA TGACATGACAATCCAACATGACAAT
(the two strands are complementary; label: shadow gene)
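A gene located on the opposite strand shows up on the strand being read as the reverse complement of its codons; in the slide, TTA and CAT are the reverse complements of TAA and ATG. The sketch below derives the "shadow" triplet distribution from a codon distribution under that assumption. The names `reverse_complement` and `shadow_frequencies` are hypothetical and the input frequencies are toy values.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(triplet):
    """Reverse the triplet and complement each base, e.g. ATG -> CAT."""
    return "".join(COMPLEMENT[base] for base in reversed(triplet))

def shadow_frequencies(codon_freqs):
    """Triplet frequencies seen on the opposite strand of a gene with `codon_freqs`."""
    return {reverse_complement(codon): freq for codon, freq in codon_freqs.items()}

shadow = shadow_frequencies({"ATG": 0.5, "TAA": 0.5})   # toy codon usage
```

Because the mapping is a one-to-one relabelling of the 64 triplets, a shadow distribution carries the same information as the original one but occupies a different point in the 64-dimensional frequency space.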

When can we detect genes (by their content)?
1. When non-coding regions are very different in base composition (e.g., different GC-content)
2. When distances between the phases are large (figure label: non-coding)

Simple experiment
1. Only the forward strands of genomes are used for triplet counting
2. Every p positions in the sequence, open a window (x − W/2, x + W/2) of size W centered at position x
3. Every window, starting from its first base pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets $f_{ijk}$ are calculated
4. The dataset consists of N = [L/p] points, where L is the entire length of the sequence
5. Every data point $X_i = \{x_{is}\}$ corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets, s = 1,…,64
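The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: the function names are hypothetical, only windows that fit entirely inside the sequence are used (so the number of points is slightly below [L/p]), and triplets containing letters other than A, C, G, T are ignored.

```python
from collections import Counter
from itertools import product

# All 64 triplets in a fixed order, so every data point has the same coordinates.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]

def window_triplet_frequencies(window):
    """64 frequencies of the non-overlapping triplets of one window."""
    counts = Counter(window[i:i + 3] for i in range(0, len(window) - 2, 3))
    total = sum(counts[t] for t in TRIPLETS)
    if total == 0:
        return [0.0] * 64
    return [counts[t] / total for t in TRIPLETS]

def sliding_window_dataset(seq, W, p):
    """One 64-dimensional point per window of size W, opened every p positions."""
    half = W // 2
    points = []
    for x in range(half, len(seq) - half, p):   # x is the window centre
        window = seq[x - half:x - half + W]
        points.append(window_triplet_frequencies(window))
    return points

points = sliding_window_dataset("ACG" * 20, W=12, p=3)
```

Choosing W as a multiple of 3 keeps the triplet grid of a window complete; p controls how densely the sequence is sampled.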

Principal Component Analysis
(figure labels: maximal dispersion, 1st principal axis, 2nd principal axis)
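The idea of principal axes as directions of maximal dispersion can be illustrated with a small dependency-free sketch. It finds the leading eigenvectors of the covariance matrix by power iteration with deflation, which is a technique chosen here for brevity and is not taken from the slides; a linear-algebra library would normally be used instead. All function names are hypothetical.

```python
import math
import random

def covariance_matrix(points):
    """Covariance matrix of the data points (rows) after subtracting the mean."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    return [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
            for a in range(d)]

def mat_vec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def leading_axis(cov, iterations=500, seed=0):
    """Unit vector of maximal dispersion and the variance along it."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(cov))]   # random start direction
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:          # no dispersion left in any direction
            break
        v = [x / norm for x in w]
    variance = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, variance

def principal_axes(points, k=2):
    """First k principal axes as (unit vector, variance) pairs."""
    cov = covariance_matrix(points)
    axes = []
    for _ in range(k):
        v, variance = leading_axis(cov)
        axes.append((v, variance))
        # Deflation: remove the dispersion already explained by this axis.
        cov = [[cov[a][b] - variance * v[a] * v[b] for b in range(len(v))]
               for a in range(len(v))]
    return axes

axes = principal_axes([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

Projecting the 64-dimensional window points onto the first two or three axes gives the kind of low-dimensional picture in which clusters can be inspected visually.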

ViDaExpert tool

Caulobacter crescentus (GenBank NC_002696)

“Path” of sliding window

Helicobacter pylori (GenBank NC_000921)

Saccharomyces cerevisiae chromosome IV

Model sequences (random codon usage)

Model sequences (random codon usage + 50% of frequencies set to 0)

Graph of coding phase

Assessment
Table columns: Sequence | L | W | % of coding bases | Sn1 | Sp1 | Sn2 | Sp2
Sequences:
- Helicobacter pylori, complete genome (NC_000921)
- Caulobacter crescentus, complete genome (NC_002696)
- Prototheca wickerhamii mitochondrion (NC_001613)
- Saccharomyces cerevisiae chromosome III (NC_001135)
- Saccharomyces cerevisiae chromosome IV (NC_001136)
- Model text RANDOM
- Model text RANDOM_BIAS
Completely blind prediction
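The slides do not define Sn and Sp, nor the difference between the indices 1 and 2. The sketch below assumes the convention common in gene-finding assessments, computed per base: Sn = TP / (TP + FN) and Sp = TP / (TP + FP). The function name and the toy annotation are hypothetical.

```python
def sensitivity_specificity(true_coding, predicted_coding):
    """Per-base Sn and Sp from two equally long sequences of booleans."""
    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)        # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)    # coding, missed
    fp = sum(1 for t, p in pairs if p and not t)    # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

truth      = [True, True, True, False, False]   # toy annotation
prediction = [True, True, False, False, False]  # toy prediction
sn, sp = sensitivity_specificity(truth, prediction)
```

With this definition Sp measures how many of the predicted coding bases are really coding, which differs from the statistical specificity TN / (TN + FP).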

Dependence on window size
(figure panels: W = 51, W = 252, W = 900, W = 2000)

State of the art: GLIMMER strategy
1. Use Markov models (MM) of 5th order (hexamers)
2. Use interpolation for transition probabilities
3. Use long ORFs (>500 bp) as the learning dataset
Problems:
1. The number of hexamers to be evaluated is still large
2. Applicable only to assembled genomes of good quality (<1 frameshift per 1000 bp)

What can we learn from this game?
- Learning can be replaced with self-learning
- Bacterial gene-finders work relatively well when the concentration of coding sequences is high
- Correlations in the order of codons are small
- Codon usage is approximately the same along the genome
- The method presented allows self-learning even on pieces of unassembled DNA (>150 bp)
- The method gives an alternative to the HMM view of the problem of gene recognition

Acknowledgements
Professor Alexander Gorban
Professor Misha Gromov
My coordinates: