Cédric Notredame (22/04/2015) Finding Genes In a Genome Cédric Notredame.

1 Cédric Notredame (22/04/2015) Finding Genes In a Genome Cédric Notredame

2 Naked Genome

3 All Dressed Up!

6 Naked Genomes are Useless
Useful Genome => Accurate Annotation
-Experimental Methods: ESTs, THS, DNA Chips…
-Computational Methods: Homology, Ab-Initio

7 ANNOTATION
-Where are the genes?
-What do they do: Biochemistry?
-When do they do it: Regulation?
-Who do they do it for: Metabolic?

8 Outline
Naked Genome => Fully Dressed Sequence
1. Cleaning the genome
2. Similarity methods
3. Experimental Methods
4. Ab-initio Methods
Eukaryotes, Prokaryotes
5. How Good Are The Methods?

9 Outline: Eukaryotes, Prokaryotes

11 Gene Fishing in Prokaryotic Genomes

12 What is a Prokaryotic Gene?
[Diagram labels: Gene, Promoter, RBS, ATG, ORF, STOP, Terminator, mRNA, Protein]

13 What is a Prokaryotic Gene: Operon

14 1-Ab-initio:
-ORFing
-Codon Bias
2-Homology Based Methods
3-Regulatory Sequence Detection
-Non Coding
-Short Genes
[Diagram labels: Promoter, RBS, mRNA, STOP, Terminator]

15 Prokaryotic Genomes
-High Gene Density: Haemophilus influenzae: 85%
-No Introns
-Operons
In a prokaryotic genome, any ORF longer than 300 nt can SAFELY be considered to be a gene.

16 Prokaryotic Genomes: Clean-up, ORFing, Homology Search, Gene Prediction, Promoter Detection

17 Cleaning Your DNA Sequence

18 Cleaning a DNA Sequence
Is My Sequence Contaminated?
-Cloning may lead to the inclusion of vector sequences.
-These sequences must be removed.

19 Paste in your new sequence

20 Crop: our sequence displays two vector contaminations

21 Contamination Matters
Contaminations look like horizontal transfers, BUT a genuine genome may contain similarity to the cloning vector (antibiotic resistance).
-Wrong Phylogeny
-Error Propagation in Secondary Databases
-Eukaryote genomes can also be cleaned this way

22 ORFing Prokaryotic Genomes

23 Prokaryotic Genomes: ORFing
Where are the ORFs in my sequence?

24 Prokaryotic Genomes: ORFing
ATG (Start) Codons, STOP Codons
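
The ORFing idea on these slides (scan each reading frame from an ATG to the next in-frame STOP, and keep ORFs longer than 300 nt) can be sketched in a few lines of Python. This is only an illustration, not the GORF implementation: the function names, the coordinate convention and the choice of the first ATG after a stop are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """List ATG..STOP ORFs of at least min_len nt on both strands.

    Each hit is (strand, frame, start, end); start/end are 0-based,
    end-exclusive positions on the strand that was scanned.
    Assumption: an ORF starts at the first ATG seen after the previous stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:   # length includes the stop codon
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # close it and keep scanning
    return orfs
```

The 300 nt default mirrors the rule of thumb quoted on slide 15; lowering `min_len` recovers shorter candidates at the cost of more random ORFs.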

25 Prokaryotic Genomes: ORFing

26 Prokaryotic Genomes: GORF
www.ncbi.nih.gov/gorf/gorf.html

27 Prokaryotic Genomes: GORF

28 Prokaryotic Genomes: GORF TO COG TO BLAST

29 Prokaryotic Genomes: GORF

30 GORF: Can You Trust it?
Random ORF => Random 3rd Position
Real ORF => Biased 3rd Position
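
One simple way to look at the third-position bias mentioned on this slide is to tabulate base frequencies separately for codon positions 1, 2 and 3 of a candidate ORF. The sketch below does that; the function names and the use of GC content at position 3 as a summary are assumptions for illustration, not something GORF is stated to compute.

```python
from collections import Counter


def position_composition(orf):
    """Base frequencies at codon positions 1, 2 and 3 of an in-frame sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    comps = []
    for pos in range(3):
        counts = Counter(orf[pos:usable:3])   # every third base from this offset
        total = sum(counts.values()) or 1     # avoid division by zero on empty input
        comps.append({b: counts.get(b, 0) / total for b in "ACGT"})
    return comps


def gc3(orf):
    """Fraction of G or C at the third codon position."""
    third = position_composition(orf)[2]
    return third["G"] + third["C"]
```

A third-position profile close to the genome-wide base composition is what a random ORF would be expected to show; a real coding ORF typically departs from it.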

31 GORF: Can You Trust it?

32 Prokaryotic Genomes: GORFing cDNAs
-Works with Bacterial Genomes
-Works with Eukaryotic cDNA
-Good enough for ~85% proteome
BUT…
-Will NOT detect SHORT genes
-Will NOT detect Non Coding Genes

33 Ab-Initio Gene Predictions In Prokaryotic Genomes

34 Predicting Genes
What are the sequences in my genome that LOOK LIKE genes?

35 Using The Codon Biases

36 Using The Codon Biases
Coding regions do NOT look like random DNA:
-Codon Bias

37 Real Genes Use Mostly the Optimal Codons

38 Predicting Genes
ALL the characteristics of a gene can be built into a model: Hidden Markov Model

39 Hidden Markov Model
-Each Nucleotide has a STATE: Coding/Non Coding …
-This STATE is HIDDEN
-The HMM tries to UNCOVER the STATE of each Nucleotide.

40 Hidden Markov Model: Occasionally Dishonest Casino
-This STATE is HIDDEN in the data
Observation: 122234455666125654151661661515566616166661
State : FFFFFFFFLLLLFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLL
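
To make the casino example concrete, here is a minimal Viterbi decoder that recovers the most probable hidden state path (F = fair die, L = loaded die) from a string of rolls. The slide gives no parameters, so the start, transition and emission probabilities below are assumptions, taken from the usual textbook version of this example; only the idea of uncovering the hidden state comes from the slide.

```python
import math

STATES = ("F", "L")
# Assumed parameters (not given on the slide).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},               # fair die
    "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5},   # loaded towards 6
}


def viterbi(obs):
    """Return the most probable F/L state string for a string of die rolls."""
    if not obs:
        return ""
    # Work in log space to avoid underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for sym in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][sym])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Calling `viterbi` on the observation string from the slide yields one F/L label per roll; the exact labelling depends on the assumed parameters, so it need not match the state line shown above.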

41 GeneMark

42 Simplified HMM for Coding Regions
[Diagram: S, two blocks of 64 codons, E]
Example entries (codon, probability, amino acid):
GGG 0.02 G
GGA 0.00 G
GGT 0.6 G
GGC 0.38 G
TGG 1.00 W

43 Simplified HMM for Coding Regions: Emission Proba, Transition Proba

44 Simplified HMM for Coding Regions
Proba of seq (GGG-TGG Given Model) = Proba(GGG)*Proba(GGG->TGG)*Proba(TGG)
HMM order 5: 6th nucleotide depends on the 5 previous
Takes into account Codon Bias AND dipeptide composition
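
The product on this slide can be written out as a small function that multiplies codon probabilities and codon-to-codon transition probabilities along a sequence. The codon values are the ones shown on slide 42; the transition values are hypothetical placeholders, since the slides do not list any, and the function name is an assumption for the example.

```python
import math

# Codon probabilities as shown on slide 42.
EMISSION = {"GGG": 0.02, "GGA": 0.00, "GGT": 0.60, "GGC": 0.38, "TGG": 1.00}

# Hypothetical codon-to-codon transition probabilities (not from the slides).
TRANSITION = {("GGG", "TGG"): 0.05, ("GGT", "TGG"): 0.05, ("GGC", "TGG"): 0.05}


def log_prob(codons, emission=EMISSION, transition=TRANSITION):
    """Log-probability of a codon list under the simplified coding model.

    Follows the slide: Proba(c1) * Proba(c1->c2) * Proba(c2) * ...
    Returns -inf as soon as any factor is zero or missing.
    """
    logp = 0.0
    for i, codon in enumerate(codons):
        e = emission.get(codon, 0.0)
        if e == 0.0:
            return -math.inf
        logp += math.log(e)
        if i > 0:
            t = transition.get((codons[i - 1], codon), 0.0)
            if t == 0.0:
                return -math.inf
            logp += math.log(t)
    return logp
```

With these numbers, `math.exp(log_prob(["GGG", "TGG"]))` is 0.02 * 0.05 * 1.00, while any sequence using GGA scores minus infinity because that codon has probability 0.00 in the table.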

45 Translate Predicted Genes into Proteins
Text Output
http://opal.biology.gatech.edu/GeneMark/

47 Non Standard FASTA

48 GLIMMER: An alternative to GeneMark

49 Main Problems

50 GeneMark and HMM predictions
-Work Very Well
-Good enough for ~99% proteome
BUT…
-Will NOT detect Some SHORT genes
-Will NOT detect Non Coding Genes

51 Which Program?
The established programs ALL work well.
No point in fighting if your users have their mind set on a brand…

52 If Your Gene is NON-Coding…
The only existing models for NON-Coding genes are those for tRNA

53 Homology Based Gene Prediction In Prokaryotic Genomes

54 BLASTx
What are the portions of my genome that look like a known Gene/Protein?

55 blastx: nucleotide (translated to protein) VS protein
Non Coding, but works only for highly similar sequences (>70%)
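
blastx compares a nucleotide query to protein sequences by first translating the query in all six reading frames. The sketch below shows only that translation step, using the standard genetic code; it is an illustration of the principle, not BLAST code, and the function names and frame labels are assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (first base varies slowest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate an upper-case DNA string codon by codon ('X' for unknown codons)."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]   # reverse complement for the minus strand
    return {
        f"{sign}{frame + 1}": translate(strand[frame:])
        for sign, strand in (("+", seq), ("-", rc))
        for frame in range(3)
    }
```

Each of the six protein strings would then be compared against the protein database; stops appear as `*`.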

56 BlastX and HMM predictions
-Very Reliable on Prokaryotes
-Can Help in Eukaryotes
BUT…
-Needs Homology => Depends on the databases

57 Finding Promoters In Prokaryotic Genomes

58 Promoter Hunting
Are there known promoters in my sequence?
Ideal for
-Finding Small Proteins
-Finding Non Coding Genes
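
As a rough illustration of promoter hunting, the sketch below scans a sequence for two short boxes separated by a spacer, allowing a few mismatches. The slides do not specify any motif; the defaults used here (the E. coli sigma-70 consensus TTGACA and TATAAT with a 16 to 18 bp spacer) are an assumption for the example, and real tools such as the databases cited on the following slides rely on curated matrices rather than a plain consensus.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=(16, 17, 18), max_mismatch=2):
    """Find box35 + spacer + box10 candidates on the forward strand.

    Returns (box35_start, box10_start, spacer_length, total_mismatches)
    tuples with 0-based start positions.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = mismatches(seq[i:i + len(box35)], box35)
        if m35 > max_mismatch:
            continue
        for sp in spacers:
            j = i + len(box35) + sp
            window = seq[j:j + len(box10)]
            if len(window) < len(box10):
                continue                      # second box would run off the end
            total = m35 + mismatches(window, box10)
            if total <= max_mismatch:
                hits.append((i, j, sp, total))
    return hits
```

Running the same scan on the reverse complement covers the other strand. Consensus matching like this produces many false positives on real genomes, so hits are best treated as candidates to check against known promoters.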

63 prodoric.tu-bs.de/

64 prodoric.tu-bs.de/

66 rsat.ulb.ac.be/rsat/RSA_home.cgi

68 Fishing Genes In Eukaryotic Genomes

70 1-Transcript Based Methods
2-Homology Based Methods
3-Ab-initio:
-HMMs
4-Regulatory Sequence Detection
[Diagram labels: Promoter, exon, mRNA (form2), mRNA (form2)]

71 Eukaryote Genomes: Clean-up, Transcripts, Prediction, Homology, Promoter Detection

72 Know your Opponent …

73 Exons are longer in Vertebrates

74 Introns are longer in Vertebrates
-100 bp in Fungi
-1000 bp in Vertebrates

75 Genes contain more Introns in Mammals

76 Cleaning Eukaryotic Genomes

77 Repeats
-Repeats => Transposable elements, simple repeats
-RepeatMasker => Smith and Waterman Clean-up.
Avoiding Repeats
-Plus: Remove lots of noise.
-Minus: Changes Sequence Statistics.

78 Homology Based Gene Prediction In Eukaryotic Genomes

79 Homology Based Predictions
What are the portions of my genome that look like a known protein?

80 Three Tools
-GeneWise: Most Common
-Procrustes: Most Sophisticated
-BlastX/TBlastX: Simplest

81 BLASTX
blastx: Genome (translated to protein) VS protein

82 TBLASTX: Exon Fishing
tblastx: Genome (translated to protein) VS ESTs (translated to protein)

83 Procrustes
[Diagram labels: genomic sequence, Protein]

84 40% id

85 GeneWise
[Diagram labels: genomic sequence, Protein]
www.ebi.ac.uk/Wise2/advanced.html

86 Transcript Based Gene Prediction In Eukaryotic Genomes

87 Gene indices
Using Established ESTs Collections

88 [Diagram labels: putative mRNA, 5' UTR, exon 1, exon 2, 3' UTR, AAAAAA..., expressed sequence tags (ESTs)]
ESTs give us an insight into this complexity
1-Cluster the ESTs to reconstitute a gene

89 Gene indices: Typical Workflow
-EMBL database: Quality clipping
-EST Collection: Quality clipping
-BLAST search, clustering
-Assembly, Consensus sequences
-Visualization

90 Gene indices
Alignment, consensus

91 Gene indices: Alignment Software
-Phrap (Phil Green)
-CAP3 (X. Huang)
-TIGR assembler
-GAP4 (R. Staden)

92 Gene indices: Consensus sequences
-Reduced error rate
-Long Consensus
-Efficient database search
-exon/intron boundaries
-Alternative Splicing

93 Gene indices
Clustering of EST and mRNA sequences of an organism to reduce redundancy in sequence data.
Goal: One cluster => One Gene
-UniGene (NCBI)
-TIGR Gene Indices
-STACK (SANBI)
-GeneNest (DKFZ, MPI)
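
The "one cluster, one gene" goal can be illustrated with a toy single-linkage clustering: two ESTs are joined when they share enough k-mers, and joins are merged transitively with a union-find structure. This is a sketch under simplifying assumptions (same-strand comparison only, exact k-mers, all pairs compared); it is not how UniGene, TIGR Gene Indices, STACK or GeneNest are implemented, and the names and thresholds are made up for the example.

```python
def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=12, min_shared=5):
    """Single-linkage clustering of ESTs that share at least min_shared k-mers.

    ests: dict mapping EST name -> sequence.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    names = list(ests)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    sets = {n: kmers(ests[n].upper(), k) for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(sets[a] & sets[b]) >= min_shared:
                parent[find(a)] = find(b)      # merge the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())
```

The all-against-all loop is quadratic in the number of ESTs, which is fine for a demonstration but not for a full EST collection; a real pipeline would also compare reverse complements and assemble each cluster into a consensus afterwards.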

94 GeneNest
genenest.molgen.mpg.de

95 TIGR Gene Indices: Alignment scheme
www.tigr.org

96 UniGene
www.ncbi.nih.nlm.gov/UniGene

97 UniGene
www.ncbi.nih.nlm.gov/UniGene

98 Gene indices

99 Gene indices: Applications
-Detection of exon/intron boundaries
-Detection of alternative splicing
-Detection of Single Nucleotide Polymorphisms
-Genome annotation
-Analysis of gene expression
-Design of DNA-chips/arrays

100 Mapping of EST consensus sequences on genomic DNA
[Diagram labels: genomic sequence, exons, consensus sequence (mRNA)]

101 Comparing Your Genome with Transcripts: How to do It? How Long?
BLAST: 36 hours
-Popular and well described
-HSPs tend to mangle Introns
EST_GENOME: 80 hours
-Dynamic programming post-process
-Slow and sometimes hard to use
BLAT: 0.5 hours
-Next Generation
-Looks for nearly identical seq.
SIM4 (pbil.univ-lyon1.fr/sim4.php)
-Similar to BLAT (slower)
-Allows Large Gaps

103 Mapping cDNA on genomic DNA
splicenest.molgen.mpg.de

104 Gene indices: Applications
-Detection of exon/intron boundaries
-Detection of alternative splicing
-Detection of Single Nucleotide Polymorphisms
-Genome annotation
-Analysis of gene expression
-Design of DNA-chips/arrays

105 Alternative Splicing
[Diagram labels: genomic sequence, exons, consensus sequence (mRNA), splice variant]

107 Alternative Splicing: Splice variants of APECED gene
[Figure labels: number of sequences, genomic sequence, alternative variants]
splicenest.molgen.mpg.de

108 Alternative Splicing (additional exon)
Splice variants of adenylosuccinate lyase
1-skipped exon
2-unspliced?
3-gene prediction errors?
splicenest.molgen.mpg.de

109 Alternative Splicing (alternative donor site)

110 Alternative Splicing (unknown gene Hs16936)

111 Ab-Initio Gene Prediction In Eukaryotic Genomes

113 Three Categories of Methods
Rule Based
–Uses explicit set of rules to make decisions
–GeneFinder
Neural Network
–Uses a data set to build rules.
–Grail
HMM
–Finding the state of each Nucleotide (Coding…)
–Genscan

114 Rule Based - GeneFinder
CodonBias -> score1
Splice Site description -> score2
ORFs -> score3
Proba (Gene) = F(score1, score2, score3..)

115 Neural Networks - Grail
GrailExp: Measure several coding potentials
-Blast
-Coding Potential
-…
Feed all the scores into the Neural Network
Train Neural Network on Known Genes: Discriminate

117 HMM-Genscan
Genscan models genes with a Hidden Markov Model that models coding and non-coding regions.
Fifth order inhomogeneous HMM
–Fifth order: use 6-tuples (two codons)
–Inhomogeneous: each position is special (0,1,2)
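
The two properties listed on this slide (fifth order, and a separate model for each codon position) can be illustrated with a small three-periodic Markov chain trained on known coding sequences. This is a sketch of the general idea only, not Genscan's actual model: the function names, the pseudocount and the convention that training sequences start in frame are assumptions.

```python
import math
from collections import defaultdict


def train_periodic_markov(coding_seqs, order=5, pseudocount=1.0):
    """Count next-base occurrences given the previous `order` bases,
    separately for each codon position (index modulo 3).

    Assumes every training sequence starts at codon position 0.
    Returns counts[phase][context][base].
    """
    counts = [defaultdict(lambda: dict.fromkeys("ACGT", pseudocount)) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in "ACGT" and set(context) <= set("ACGT"):
                counts[i % 3][context][base] += 1
    return counts


def log_likelihood(seq, counts, order=5):
    """Log-probability of seq under the periodic model, read in frame 0."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        row = counts[i % 3][seq[i - order:i]]   # unseen contexts fall back to uniform
        if seq[i] not in row:
            continue                             # skip ambiguous bases such as N
        total += math.log(row[seq[i]] / sum(row.values()))
    return total
```

Scoring the same window in the three possible frames and comparing the results gives a simple coding-frame indicator; a full gene finder combines such scores with splice-site and length models.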

118 Burge, C. and S. Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997. 268(1): p. 78-94

119 Your Genomic Sequence, A Collection of Proteins

121 Evaluating Eukaryote Gene Prediction

122 PMID:11042160

123 Nucleotide Accuracy
Sensitivity: $Sn = \frac{TP}{TP + FN}$
Specificity: $Sp = \frac{TP}{TP + FP}$
Approximate correlation: $AC = 0.5 \left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FP} + \frac{TN}{TN + FN} \right) - 1$
Correlation coefficient: $CC = \frac{(TN \cdot TP) - (FN \cdot FP)}{\left( (TP + FN)(TN + FP)(TP + FP)(TN + FN) \right)^{1/2}}$
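
The four nucleotide-level measures can be computed directly from the counts of true/false positives and negatives. The sketch below assumes the usual gene-prediction definitions (specificity as TP / (TP + FP), and a correlation coefficient with a minus sign in the numerator); the function name and the returned dictionary are illustrative choices, and no guard is included for empty denominators.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Sensitivity, specificity, approximate correlation and correlation
    coefficient from nucleotide-level counts.

    tp: coding bases predicted coding      fp: non-coding predicted coding
    tn: non-coding predicted non-coding    fn: coding predicted non-coding
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "AC": ac, "CC": cc}
```

For example, with 80 true positives, 20 false positives, 880 true negatives and 20 false negatives, Sn and Sp are both 0.8 and AC and CC both come out at 7/9. AC stays defined in some edge cases where a CC denominator would be zero, which is why both are usually reported.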

124 Exon Accuracy

125 Blastx: Good Gene Hunter/Poor Modeler
GeneWise: Best Homology Gene Modeling

126 GenScan: Distracted by Complete Genome
Use GenomeScan instead

127 GenScan: Ab-Initio Methods are more Robust

128 http://www.cs.ubc.ca/~rogic/evaluation.html
PMID: 8786136

130 High and Low GC contents can confuse Predictions

132 Annnnnnd The Winner is …

133 http://www.cbs.dtu.dk/services/HMMgene/

134 Working on a Genome

135 www.sanger.ac.uk/Software/Artemis/

137 Wrapping It Up

138 Predicting Genes
-Using Homology: BlastX, Procrustes
-ORFing: GORF
-Cleaning Up Data

139 Predicting Genes
-Promoter Prediction: PRODORIC
-Transcript Based Predictions: GeneNest, UniGene, BLAT
-Ab-Initio Predictions with HMMs: GenomeScan and HMMgene

