A markovian approach for the analysis of the gene structure C. MelodeLima 1, L. Guéguen 1, C. Gautier 1 and D. Piau 2 1 Biométrie et Biologie Evolutive.

1 A Markovian approach for the analysis of the gene structure. C. MelodeLima (1), L. Guéguen (1), C. Gautier (1) and D. Piau (2). (1) Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard Lyon 1, France. (2) Institut Camille Jordan, UMR CNRS 5208, Université Claude Bernard Lyon 1, France. PRABI.

2 Contents: Introduction; HMM for the genomic structure of DNA sequences; Discrimination method based on HMM; Conclusion; Direction of research.

3 Introduction. Intensive sequencing. Genes represent only 3% of the human genome. Markovian models are widely used for the identification of genes. We propose an analysis of the structural properties of genes, using a discrimination method based on HMMs.

4 Introduction. Hidden Markov model. Advantages: each state represents a different type of region in the sequence; the complexity of the algorithm is linear with respect to the length of the sequence. Drawback: the distribution of the sojourn time in a given state is geometric, whereas the empirical distribution of exon lengths is not geometric.

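5 HMM for the genomic structure of DNA sequences. Structure of the HMM: two states, CDS (coding sequence) and non-CDS, linked by transition probabilities t1 and t2, with complements 1-t1 and 1-t2. Each state has its own base probabilities: pA, pC, pG, pT in one state and qA, qC, qG, qT in the other.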
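
A minimal sketch of how such a two-state model can be used to label a sequence, using the Viterbi algorithm in log space. All numeric values are illustrative assumptions rather than estimates from the presentation, and reading 1-t1 and 1-t2 as the probabilities of staying in a state is also an assumption. The running time is proportional to the sequence length for a fixed number of states.

```python
import math

STATES = ("CDS", "nonCDS")

# Illustrative parameters (assumptions, not estimates from the talk).
T1, T2 = 0.1, 0.1  # assumed: probability of leaving CDS / leaving nonCDS
TRANS = {
    "CDS": {"CDS": 1 - T1, "nonCDS": T1},
    "nonCDS": {"CDS": T2, "nonCDS": 1 - T2},
}
EMIT = {
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # p_A, p_C, p_G, p_T
    "nonCDS": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # q_A, q_C, q_G, q_T
}
START = {"CDS": 0.5, "nonCDS": 0.5}  # assumed uniform initial distribution


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters, a GC-rich stretch followed by an AT-rich stretch is labelled CDS and then nonCDS, with a single switch at the boundary.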

6 HMM for the genomic structure of DNA sequences. Several biological properties of DNA sequences were taken into account. Model of order 5.

7 HMM for the genomic structure of DNA sequences. Several biological properties of DNA sequences were taken into account. Model of order 5. [Figure: hidden states S_{t-6}, ..., S_t and observed bases X_{t-6}, ..., X_t.]

12 HMM for the genomic structure of DNA sequences. Several biological properties of DNA sequences were taken into account. Length distributions of exons and introns according to their position in genes: intergenic region, single exon, initial exon, initial intron, internal intron, internal exon, terminal intron, terminal exon.

16 HMM for the genomic structure of DNA sequences. Several biological properties of DNA sequences were taken into account. Length distributions of exons and introns according to their position in genes: intergenic region, single exon, initial exon, initial intron, internal intron, internal exon, terminal intron, terminal exon. Direct and reverse strands.

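17 HMM for the genomic structure of DNA sequences. Several biological properties of DNA sequences were taken into account. Codons. [Figure: an Exon state (transition probabilities p and 1-p) and its three codon-position states frame 0, frame 1 and frame 2 (transition probabilities p, p, p and 1-p).]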

18 HMM for the genomic structure of DNA sequences. Length of a hidden state: the sojourn time in an HMM state must follow a geometric law. For the CDS state, with self-transition probability p and exit probability 1-p, the sojourn time T follows the geometric law $P(T = n) = p^{n-1}(1-p)$. Time spent in state CDS and its probability: 1: $1-p$; 2: $p(1-p)$; 3: $p^2(1-p)$; …; n: $p^{n-1}(1-p)$.
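
The geometric law can be checked numerically with a short sketch. The function names and the value p = 0.9 are illustrative assumptions. Because the probabilities decrease strictly with n, the most probable sojourn time is always 1, which is why a single state cannot reproduce a length distribution that peaks away from 1.

```python
def geometric_pmf(n, p):
    """P(T = n) = p**(n - 1) * (1 - p), with p the self-transition probability."""
    if n < 1:
        return 0.0
    return p ** (n - 1) * (1 - p)


def mean_sojourn(p):
    """Expected sojourn time of a state with self-transition probability p."""
    return 1.0 / (1.0 - p)


p = 0.9  # illustrative value, not taken from the talk
probs = [geometric_pmf(n, p) for n in range(1, 6)]  # P(T=1), ..., P(T=5)
```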

19 HMM for the genomic structure of DNA sequences. Method: estimation of the length of a region. [Figure: probability versus length of the internal exons.] Geometric laws do not fit the empirical distribution of exon lengths.

20 HMM for the genomic structure of DNA sequences. Method: estimation of the length of a region. [Figure: probability versus length of the internal exons.] Geometric laws do not fit the empirical distribution of exon lengths. We suggest replacing a single state (State) by several successive states (State 1, State 2).

21 HMM for the genomic structure of DNA sequences. Method: estimation of the length of a region. We suggest replacing a single state (State) by several successive states (State 1, State 2). [Figure: probability versus length of the internal exons.] Good fit with sums of 5 geometric random variables.
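
A minimal sketch of the length law obtained when one state is replaced by a chain of k states that share the same self-transition probability p. The total sojourn time is then a sum of k independent geometric variables, which follows a negative binomial law. The function names and the values used in any call are illustrative assumptions.

```python
from math import comb


def sum_geometric_pmf(n, k, p):
    """P(T = n) for T a sum of k geometric sojourn times.

    Each of the k states has self-transition probability p, so
    P(T = n) = C(n - 1, k - 1) * p**(n - k) * (1 - p)**k for n >= k.
    """
    if n < k:
        return 0.0
    return comb(n - 1, k - 1) * p ** (n - k) * (1 - p) ** k


def mean_length(k, p):
    """Expected total sojourn time: k times the mean of one geometric law."""
    return k / (1.0 - p)
```

For k = 1 this reduces to the geometric law. For k > 1 the distribution starts at n = k and rises before it falls, so it can follow a length distribution with a peak.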

22 HMM for the genomic structure of DNA sequences. Method: estimation of the length of a region. Data: human genome, extracted from HOVERGEN. Candidate length distributions: a sum of m geometric laws of equal parameter, with m = 1..7; a sum of 2 or 3 geometric laws of different parameters. For each region, we choose the parameters that minimize the Kolmogorov-Smirnov distance; we do not use the maximum likelihood method.
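
A minimal sketch of a Kolmogorov-Smirnov fit for the equal-parameter case, assuming a plain grid search over the number m of geometric laws and their common self-transition probability p. The grid search, the function names and the grid resolution are assumptions; the presentation does not say how the minimization was carried out.

```python
from math import comb


def model_cdf(max_len, m, p):
    """CDF on 1..max_len of a sum of m geometric laws (self-transition p)."""
    cdf, total = [], 0.0
    for n in range(1, max_len + 1):
        if n >= m:
            total += comb(n - 1, m - 1) * p ** (n - m) * (1 - p) ** m
        cdf.append(total)
    return cdf


def empirical_cdf(lengths, max_len):
    """Empirical CDF on 1..max_len of observed region lengths (integers >= 1)."""
    counts = [0] * (max_len + 1)
    for x in lengths:
        counts[x] += 1
    cdf, total = [], 0
    for n in range(1, max_len + 1):
        total += counts[n]
        cdf.append(total / len(lengths))
    return cdf


def ks_distance(lengths, m, p):
    """Largest absolute gap between the empirical CDF and the model CDF."""
    max_len = max(lengths)
    emp = empirical_cdf(lengths, max_len)
    mod = model_cdf(max_len, m, p)
    return max(abs(e - f) for e, f in zip(emp, mod))


def fit_ks(lengths, m_values=range(1, 8), grid=200):
    """Return (distance, m, p) minimizing the KS distance over a grid."""
    best = None
    for m in m_values:
        for i in range(1, grid):
            p = i / grid
            d = ks_distance(lengths, m, p)
            if best is None or d < best[0]:
                best = (d, m, p)
    return best
```

Both CDFs are step functions that jump only at integers, so evaluating the gap at each integer up to the largest observed length is enough to obtain the supremum.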

23 HMM for the genomic structure of DNA sequences. Results: estimation of the length of a region. [Figure: probability versus length of the initial exon, with the maximum likelihood estimation and the Kolmogorov-Smirnov estimation.]

24 HMM for the genomic structure of DNA sequences. Results: estimation of the length distribution of internal exons. [Figure: probability versus length of the internal exons; sum of 5 geometric laws, p = 1/26.] The model fits the empirical distribution very well.

25 HMM for the genomic structure of DNA sequences. Results: estimation of the length distribution of intronless genes. Sum of 2 geometric laws, p = 1/440. Many small genes with single exons are pseudogenes.

28 Discrimination method based on HMM. Method: a model for initial, internal and terminal exons. Emission probabilities for each state are estimated by the frequencies of 6-letter words (model of order 5). Discrimination method to test the homogeneity between regions, with HMM 1: initial exon and HMM 2: internal exon: $D = \frac{\log P(S \mid \mathrm{HMM}_1) - \log P(S \mid \mathrm{HMM}_2)}{|S|}$ (Eq. 1), where S is the test sequence of length |S| and $P(S \mid \mathrm{HMM})$ is the sequence likelihood. The sequence is characterized by the HMM with the best likelihood.
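
A minimal sketch of the score D of Eq. 1, under a stated simplification: each region model is reduced to a single Markov chain of order 5 whose conditional probabilities come from 6-letter word counts. The add-one pseudocount, the function names and the choice to condition on the first `order` bases are assumptions, not details given in the presentation.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=5, pseudo=1.0):
    """Estimate P(base | previous `order` bases) from word counts.

    Returns a function logprob(context, base). The pseudocount (an
    assumption) keeps unseen words from having zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def logprob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudo * len(ALPHABET)
        return math.log((c.get(base, 0.0) + pseudo) / total)

    return logprob


def log_likelihood(seq, logprob, order=5):
    """Log-likelihood of seq, conditioning on its first `order` bases."""
    return sum(logprob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def discrimination_score(seq, model1, model2, order=5):
    """D = (log P(S | model1) - log P(S | model2)) / |S|."""
    diff = log_likelihood(seq, model1, order) - log_likelihood(seq, model2, order)
    return diff / len(seq)
```

A positive D means the sequence is better explained by the first model, a negative D by the second. Dividing by |S| makes scores comparable across sequences of different lengths.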

29 Discrimination method based on HMM. Quality of the decision: we want to know whether the models are well adapted to their regions (HMMs are compared pairwise). Of N initial exon sequences, the decision classifies N_1 as initial exons and N - N_1 as internal exons. Each model is characterized by the frequency of sequence recognition.
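
A minimal sketch of the recognition frequency N_1 / N, assuming each test sequence has already been given a discrimination score D and that a sequence counts as recognized by its own model when D > 0. The function name and the treatment of ties (D = 0 counted as not recognized) are assumptions.

```python
def recognition_frequency(scores):
    """Fraction N_1 / N of test sequences assigned to their own model.

    `scores` holds one value of D per test sequence, computed with the
    sequence's own model as the first model. Ties (D == 0) are counted
    as not recognized, which is an arbitrary choice.
    """
    n = len(scores)
    n1 = sum(1 for d in scores if d > 0)
    return n1 / n
```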

30 Discrimination method based on HMM. Results: comparison of different HMMs on different test sequences. Internal exon ≈ terminal exon; initial exon ≠ internal exon; initial exon ≠ terminal exon.

34 Discrimination method based on HMM. Results: break in the homogeneity of the first coding exon. To determine the break point in first exon sequences, we consider different HMMs. The HMM representing the initial exon was split into 2 HMMs around the k-th base: a "Start" HMM is trained on the first k bases and an "End" HMM is trained on the remaining bases.
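
A minimal sketch of how the training sets for the "Start" and "End" models could be built for one candidate break point k. The function name is hypothetical, and the handling of exons shorter than k (kept whole in the "Start" set, absent from the "End" set) is an assumption.

```python
def split_at(seqs, k):
    """Split each initial exon sequence around its k-th base.

    Returns (starts, ends): the first k bases of every sequence, and the
    remaining bases of every sequence longer than k.
    """
    starts = [s[:k] for s in seqs]
    ends = [s[k:] for s in seqs if len(s) > k]
    return starts, ends
```

Repeating the split for several values of k and training one pair of models per value gives the family of models that are then compared.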

36 Discrimination method based on HMM. Results: break in the homogeneity of the first coding exon. [Figure legend: M_EI 80; other models.]

40 Discrimination method based on HMM. Results: initial exons. [Figure: HMM Start and HMM End; 25% and 75% with signal peptide (SignalP); 90% and 10% without signal peptide.] HMM Start characterizes the signal peptide well.

42 Conclusion. Modelling of the exon length distribution: sums of geometric laws fit the distribution of exon lengths well, and the model has relatively few parameters: a sum of 5 geometric laws of the same parameter (internal exons); a sum of 3 geometric laws of different parameters (terminal exons). Discrimination method based on HMM: bad annotation of the intronless genes in the database; homogeneity between internal and terminal exons; break in the homogeneity of the initial exon around the 80th base; signal peptide.

48 Direction of research. Markovian models for the analysis of the organization of genomes. [Figure, Versteeg 2003: chromosome 9, with GC content, gene density, intron size, repeated elements and gene expression.]

49 Direction of research. Superposition of structures in genomes. [Figure: a chromosome viewed at the isochore level, the gene level, the exon-intron level (intron, exon) and the codon level (acc gcc agt tac ccc aga).]

50 Direction of research. Scan the genome. Build 3 HMMs adapted to the organization of each of the 3 isochore classes H, M and L: H = [72%, 100%], M = ]56%, 72%[, L = [0%, 56%]. Human chromosomes are divided into overlapping 100 kb segments; two successive segments overlap by half of their length. Bayesian approach: for each segment and for each model (H, L and M), we compute the probability P[Model | Segment]. Each segment is characterized by the model with the best probability.
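
A minimal sketch of the scanning step, assuming the log-likelihood of each segment under each model has already been computed elsewhere. The function names, the handling of the last (possibly shorter) segment and the uniform prior over the three models are assumptions; the presentation does not state which prior was used.

```python
import math


def overlapping_segments(length, size=100_000):
    """Return (start, end) windows; successive windows overlap by half."""
    step = size // 2
    segments = []
    start = 0
    while start < length:
        end = min(start + size, length)
        segments.append((start, end))
        if end == length:  # last window reached the end of the chromosome
            break
        start += step
    return segments


def posteriors(log_likelihoods, priors=None):
    """P[Model | Segment] from log P[Segment | Model] by Bayes' rule."""
    models = list(log_likelihoods)
    if priors is None:
        # Uniform prior over the models: an assumption.
        priors = {m: 1.0 / len(models) for m in models}
    joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in models}
    # Log-sum-exp keeps the normalization stable for very negative values.
    top = max(joint.values())
    norm = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {m: math.exp(joint[m] - norm) for m in models}
```

A segment is then assigned to the model with the largest posterior, for example `max(post, key=post.get)` for a dictionary `post` returned by `posteriors`.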

51 Direction of research. Results: human chromosome 1. [Figure: distribution of isochores (model H, model M, model L), gene density and G+C content.]

52 Comparative genomic analysis. Comparing the human genome with the genomes of different organisms can be useful to: better understand the structure and function of human genes; study evolutionary changes among organisms; help identify the genes that are conserved among species.

53 Direction of research. [Figure: human, chimpanzee, mouse, chicken, Tetraodon.]

