Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar.


1 Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

2 Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification and finishing Project schema

3 Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification and finishing Project schema

4 Why gene prediction? experimental way?

5 Why gene prediction? Exponential growth of sequences. Metagenomics: only ~1% of microbes grow in the lab. New sequencing technology.

6 How to do it?

7 It is a complicated task, let’s break it into parts

8 How to do it? It is a complicated task, let’s break it into parts Genome

9 How to do it? It is a complicated task, let’s break it into parts Genome

10 How to do it? Protein-coding gene prediction Phillip Lee & Divya Anjan Kumar Homology Search ab initio approach Nadeem Bulsara & Neha Gupta

11 How to do it? RNA gene prediction Amanda McCook & Chengwei Luo tRNA rRNA sRNA

12 Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification and finishing Project schema

13 Homology Search

14

15 Strategy

16 Open reading frame (ORF)

17 How/Why find ORF?

18

19

20 Protein Database Searches

21 Domain searches

22 Limits of Extrinsic Prediction

23 ab initio Prediction

24 Homology Search is not Enough! Biased and incomplete database: sequenced genomes are not evenly distributed across the tree of life and do not reflect its diversity. (Figure annotation: number of sequenced genomes clustered here)

25 ab initio Gene Prediction

26 Features

27 ORFs (6 frames)
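
A minimal sketch of a six-frame ORF scan, assuming a plain DNA string, the standard stop codons, and ATG/GTG/TTG as start codons. The function names and the `min_codons` cutoff are illustrative choices, not something taken from the slides.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codon set for this sketch
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for start-to-stop ORFs on one strand.

    `end` is exclusive and includes the stop codon; the length check
    counts codons including the stop.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first in-frame start opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return found


def six_frame_orfs(seq: str, min_codons: int = 3):
    """Scan three frames on each strand; minus-strand coordinates refer
    to the reverse-complemented sequence."""
    seq = seq.upper()
    result = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    result += [("-", f, s, e) for f, s, e in orfs_in_strand(rc, min_codons)]
    return result
```

The scan keeps only the first start codon seen in each frame, so it reports the longest ORF per stop codon; choosing among alternative starts is a separate step.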

28 Codon Statistics

29 Features (Contd.)

30 Probabilistic View

31 Supervised Techniques

32 Unsupervised Techniques

33 Usually Used Tools GeneMark GLIMMER EasyGene PRODIGAL

34 GeneMark: developed in 1993 at the Georgia Institute of Technology as one of the first gene-finding tools. Uses Markov chains with dicodon statistics to model coding and noncoding reading frames. Shortcoming: inability to find exact gene boundaries.
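
To illustrate the idea of scoring a sequence with coding and noncoding Markov chains, here is a simplified sketch. It assumes a homogeneous fixed-order chain with add-one smoothing, which is simpler than the frame-dependent dicodon models GeneMark actually uses; `train_markov` and `log_odds` are hypothetical names.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, alphabet="ACGT"):
    """Estimate P(base | previous `order` bases) with add-one smoothing.

    Returns a function prob(context, base).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values())
        return (ctx.get(base, 0) + 1) / (total + len(alphabet))

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over the bases of `seq`.

    A positive score means the coding model explains the sequence better.
    `order` must match the order used for training both models.
    """
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding(ctx, base) / noncoding(ctx, base))
    return score
```

In practice the two models would be trained on known genes and on intergenic regions of the same or a related genome; the training sets are not specified here.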

35 GeneMark.hmm

36 The probability that a functional sequence X underlies the DNA sequence S is calculated as $P(X \mid S) = P(x_1, x_2, \ldots, x_L \mid b_1, b_2, \ldots, b_L)$. The Viterbi algorithm then finds the functional sequence $X^*$ such that $P(X^* \mid S)$ is the largest among all possible values of X. A ribosome binding site model was also added to improve the accuracy of translational start site prediction.
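
The Viterbi step can be sketched on a toy two-state model. Everything below is an assumption for illustration: the states `C` (coding) and `N` (noncoding), the transition and emission probabilities, and the function name. The code maximizes the joint probability P(X, S), which yields the same X* as maximizing P(X | S) because P(S) is fixed for a given sequence.

```python
import math

# Toy model (made-up numbers): coding is GC-rich, noncoding is AT-rich.
STATES = ("C", "N")
START_P = {"C": 0.5, "N": 0.5}
TRANS_P = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT_P = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable state path for the observed bases."""
    # best[t][s]: best log-probability of any path that ends in state s at t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev  # remember where the best path came from
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Working in log space avoids numerical underflow on genome-length inputs, which is why the products are written as sums.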

37 GeneMark: the RBS feature overcomes this problem by defining a 5-position nucleotide matrix based on an alignment of 325 E. coli genes whose RBS signals have already been annotated. It uses the consensus sequence AGGAG to search upstream of any alternative start codons for genes predicted by the HMM. GeneMarkS: considered the best gene prediction tool. Based on unsupervised learning. Even in prokaryotic genomes, gene overlaps are quite common.
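
As a rough illustration of the AGGAG consensus search upstream of a candidate start codon, the sketch below counts mismatches in a sliding window. The window size, the mismatch tolerance, and the name `rbs_hits` are assumptions; the real tool scores sites with a position matrix rather than a mismatch count.

```python
def rbs_hits(genome, start_pos, motif="AGGAG", window=20, max_mismatch=1):
    """Find motif-like sites in the `window` bases upstream of `start_pos`.

    Returns a list of (genome_position, mismatches) for every site whose
    Hamming distance to the motif is at most `max_mismatch`.
    """
    upstream_begin = max(0, start_pos - window)
    region = genome[upstream_begin:start_pos]
    hits = []
    for i in range(len(region) - len(motif) + 1):
        site = region[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(site, motif))
        if mismatches <= max_mismatch:
            hits.append((upstream_begin + i, mismatches))
    return hits
```

A start codon with a nearby hit would be preferred over an alternative start without one; how the two signals are weighted is left open here.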

38 GLIMMER: the first tool to use interpolated Markov models (IMMs). Predictions are based on variable context (oligomers of variable length), which is more flexible than fixed-order Markov models. Principle: an IMM combines probabilities based on the 0, 1, ..., k previous bases; in this case k = 8 is used. The highest orders apply to oligomers that occur frequently; for rarely occurring oligomers, 5th order or lower may be used instead. Maintained by Steven Salzberg and Art Delcher at the University of Maryland, College Park.
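
The interpolation idea can be sketched as follows. This is a simplified illustration, not GLIMMER's implementation: the weight given to each order is assumed to grow linearly with how often the context was seen (capped at 1 once a `threshold` count is reached), whereas the real tool uses a more elaborate weighting. The class name and parameters are hypothetical.

```python
from collections import defaultdict


class InterpolatedMarkovModel:
    def __init__(self, max_order=8, threshold=400):
        self.k = max_order
        self.threshold = threshold  # count at which a context is fully trusted
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seqs):
        """Count each base after every context of length 0..k."""
        for s in seqs:
            for i in range(len(s)):
                for order in range(0, min(self.k, i) + 1):
                    self.counts[s[i - order:i]][s[i]] += 1

    def prob(self, context, base):
        """Blend estimates from order 0 up to the longest seen context."""
        context = context[-self.k:] if self.k else ""
        p = 0.25  # uniform prior over A, C, G, T
        for order in range(0, len(context) + 1):
            ctx = context[len(context) - order:]
            seen = self.counts.get(ctx)
            if not seen:
                break  # longer contexts cannot have been seen either
            total = sum(seen.values())
            lam = min(1.0, total / self.threshold)
            # Frequent contexts dominate; rare ones defer to lower orders.
            p = lam * seen.get(base, 0) / total + (1 - lam) * p
        return p
```

Because each step is a convex combination of two probability distributions, the blended values still sum to 1 over the four bases.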

39 Glimmer development. Glimmer 2 (1999): increased the sensitivity of prediction by adding the concept of the ICM (Interpolated Context Model). Glimmer 3 (2007): overcomes the shortcomings of previous versions by taking into account the sum of the RBS score, the IMM coding potential, and a start codon score that depends on the relative frequency of each possible start codon in the same training set used for RBS determination. The algorithm scores every ORF (open reading frame) with the IMM in reverse, from the stop codon to the start codon; the score is the sum of the log-likelihoods of the bases contained in the ORF.
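
One way to picture the reverse scoring is a running sum that starts at the stop codon and moves toward the 5' end, so every candidate start gets the score of the ORF it would produce. The sketch below assumes per-base log-likelihood scores are already available and folds the RBS and start codon scores into a single hypothetical `start_bonus` per candidate; none of these names come from Glimmer itself.

```python
def best_start(per_base_scores, candidate_starts, start_bonus=None):
    """Pick the start position with the highest total score.

    per_base_scores[i] is the log-likelihood contribution of base i;
    the last element is the base just before the stop codon.
    """
    start_bonus = start_bonus or {}
    suffix = [0.0] * len(per_base_scores)
    running = 0.0
    # Accumulate from the stop codon end back toward the 5' end.
    for i in range(len(per_base_scores) - 1, -1, -1):
        running += per_base_scores[i]
        suffix[i] = running
    # suffix[s] is the coding score of the ORF that begins at s.
    return max(candidate_starts, key=lambda s: suffix[s] + start_bonus.get(s, 0.0))
```

Scanning once from the stop codon gives the score of every candidate start in a single pass, which is the practical appeal of scoring in reverse.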

40 Glimmer3.02

41 PRODIGAL: Prokaryotic Dynamic Programming Gene-finding Algorithm. Developed at Oak Ridge National Laboratory and the University of Tennessee.

42 PRODIGAL-Features

43

44 EasyGene: developed at the University of Copenhagen. Statistical significance is the measure used for gene prediction.

45 Comparison of Different Tools

46 Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification and finishing Project schema

47 RNA Gene Prediction

48 Why Predict RNA?

49 Regulatory sRNA

50 sRNA Challenges

51 Fundamental Methodology

52 RFAM

53 What Is Covariance? Fig: Christian Weile et al. BMC Genomics (2007) 8:244

54 Noncomparative Prediction Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612

55 Noncomparative Prediction *Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

56 Comparative + Noncomparative. Effective sRNA prediction in V. cholerae; non-enterobacteria; sRNAPredict2; 32 novel sRNAs predicted, 9 tested, 6 confirmed. Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

57 Software *Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1 Eva K. Freyhult et al. Genome Res. (2007) 17:117

58 Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification and finishing Project schema

59 Modification & Finishing: consensus strategy to integrate ab initio results; broken gene recruiting; TIS correcting; IS calling; operon annotating; gene presence/absence analysis

60 Modification & Finishing [Flowchart labels: ab initio results; consensus strategy (pass / fail); homology search; broken gene recruiting; candidate fragments]
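
A consensus step over several ab initio predictors might look like the sketch below. It assumes each tool reports genes as (strand, start, stop) and that two calls refer to the same gene when they share strand and stop coordinate, since tools tend to disagree on the start rather than the stop. The vote threshold and all names are illustrative; the slides do not specify the rule actually used.

```python
from collections import defaultdict


def consensus_genes(predictions, min_votes=2):
    """Split gene calls into those supported by enough tools and the rest.

    predictions: {tool_name: [(strand, start, stop), ...]}
    Returns (passed, failed), each a sorted list of (strand, stop) keys.
    """
    votes = defaultdict(set)
    for tool, genes in predictions.items():
        for strand, _start, stop in genes:
            votes[(strand, stop)].add(tool)  # one vote per tool per gene
    passed, failed = [], []
    for key in sorted(votes):
        (passed if len(votes[key]) >= min_votes else failed).append(key)
    return passed, failed
```

Calls that fail the vote would then be candidates for the homology search, rather than being discarded outright.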

61 Modification & Finishing: TIS correcting. Start codon redundancy: ATG, GTG, TTG, CTG. Markov iteration; experimentally verified data. Leaderless genes.
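
Start codon redundancy means one ORF can contain several plausible translation initiation sites. A small sketch, assuming the ORF string begins in frame, lists every in-frame candidate; the function name is hypothetical.

```python
START_CODONS = ("ATG", "GTG", "TTG", "CTG")


def candidate_starts(orf):
    """Return (offset, codon) for each in-frame start codon in `orf`."""
    return [
        (i, orf[i:i + 3])
        for i in range(0, len(orf) - 2, 3)
        if orf[i:i + 3] in START_CODONS
    ]
```

Each candidate would then be ranked by additional evidence, such as an upstream RBS signal or agreement with experimentally verified starts.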

62 Modification & Finishing: IS calling; operon annotating. IS Finder DB

63 Modification & Finishing Gene Presence/absence analysis

64 Gene Prediction Introduction Protein-coding gene prediction RNA gene prediction Modification and finishing Project schema

65 Schema (proposed)

66 assembly group

67 Schema (proposed) assembly group

