Chapter 8 Gene Prediction. Automated sequencing of genomes requires automated gene assignment. Includes detection of open reading frames (ORFs). Identification.


1 Chapter 8 Gene Prediction

2 Automated sequencing of genomes requires automated gene assignment
- Includes detection of open reading frames (ORFs)
- Identification of the introns and exons
- Gene prediction is a very difficult problem in pattern recognition
- Coding regions generally do not have conserved sequences
- Much progress made with prokaryotic gene prediction
- Eukaryotic genes are more difficult to predict correctly

3 Ab initio methods
- Predict genes from the given sequence alone
- Use gene signals
  - Start/stop codons
  - Intron splice sites
  - Transcription factor binding sites
  - Ribosomal binding sites
  - Poly-A sites
  - Codons demand a multiple of three nucleotides
- Use gene content
  - Nucleotide composition: use HMMs
Homology-based methods
- Matches to known genes
- Matches to cDNA
Consensus-based methods
- Use output from more than one program

4 Prokaryotic gene structure
- ATG is the start codon (GTG or TTG less frequent)
- Ribosome binding site (Shine-Dalgarno sequence), complementary to 16S rRNA of the ribosome: AGGAGGT
- TAG stop codon (TAA and TGA also act as stop codons)
- Transcription termination site (ρ-independent termination)
  - Stem-loop secondary structure followed by a string of Ts

5 Translate sequence into 6 reading frames
- A stop codon occurs randomly about every 20 codons
- Look for a frame longer than 30 codons (normally codons)
- Presence of start codon and Shine-Dalgarno sequence
- Translate putative ORF into protein, and search databases
- Non-randomness of 3rd base of codon, more frequently G/C
- Plotting wobble base GC% can identify ORFs
- 3rd base also repeats, thus repetition gives a clue on gene location
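
The ORF search described on this slide can be sketched roughly as follows. This is only an illustration, not the method of any particular program: the function names are made up here, the start codons are the ones listed for prokaryotes (ATG, GTG, TTG), the stop codons are the standard TAA/TAG/TGA, and the 30-codon cutoff is taken from the slide. Coordinates on the minus strand are relative to the reverse-complemented sequence.

```python
# Illustrative six-frame ORF scan; names and defaults are assumptions.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, orf) for ORFs longer than min_codons.

    An ORF runs from the first start codon in a frame to the next in-frame
    stop codon. Minus-strand coordinates refer to the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3  # codons before the stop
                    if n_codons > min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


def gc3(orf):
    """Fraction of G or C at the third (wobble) position of each codon."""
    thirds = orf[2::3]
    return sum(base in "GC" for base in thirds) / len(thirds)
```

A candidate ORF returned by `find_orfs` could then be checked with `gc3` for the wobble-base G/C bias mentioned above, or translated and searched against a protein database.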

6 Markov chains and HMMs
- Order k: each position depends on the k previous positions
- The higher the order of a Markov model to describe a gene, the more non-randomness the model includes
- Genes described in codons or hexamers
- HMMs trained with known genes
- Codon pairs are often found, thus 6-nucleotide patterns often occur in ORFs: 5th-order Markov chain
- 5th-order HMM gives very accurate gene predictions
- Problem may be that in short genes there are not enough hexamers
- Interpolated Markov Model (IMM) samples Markov chains of different lengths
  - Weighting scheme places less weight on rare k-mers
  - Final probability is the probability of all weighted k-mers
- Typical and atypical genes
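
To make the idea of a 5th-order Markov chain concrete, here is a minimal sketch of training one on example sequences and scoring a new sequence as a log-odds ratio against a non-coding model. It is a plain fixed-order chain, not an interpolated model and not the three-periodic models that real gene finders use; the function names, the pseudocount smoothing and the toy training strings in the usage note are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, k=5, alphabet="ACGT", pseudocount=1.0):
    """Return prob(context, base) estimated from k-mer transition counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(alphabet)
        # Pseudocounts keep unseen hexamers from getting probability zero.
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, noncoding_prob, k=5):
    """Sum of log(P_coding / P_noncoding) over every position after the first k."""
    score = 0.0
    for i in range(k, len(seq)):
        context, base = seq[i - k:i], seq[i]
        score += math.log(coding_prob(context, base) / noncoding_prob(context, base))
    return score
```

With `coding = train_markov(known_genes)` and `noncoding = train_markov(intergenic)`, a positive `log_odds` suggests the sequence looks more like the coding training set. The slide's caveat applies directly: short training sets leave most hexamer contexts unseen, which is the gap interpolated models try to fill.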

7 GeneMark (http://exon.gatech.edu/genemark/)
- Trained on complete microbial genomes
- Most closely related organism used for predictions
Glimmer (Gene Locator and Interpolated Markov Modeler) (http://www.cbcb.umd.edu/software/glimmer/)
FGENESB (http://linux1.softberry.com/)
- 5th-order HMM
- Trained with bacterial sequences
- Linear discriminant analysis (LDA)
RBSFinder (ftp://ftp.tigr.org)
- Takes output from Glimmer and searches for S-D sequences close to start sites

9 Performance evaluation
- Sensitivity: Sn = TP/(TP+FN)
- Specificity: Sp = TP/(TP+FP)
- Correlation coefficient: CC = (TP·TN - FP·FN) / ([TP+FP][TN+FN][TP+FN][TN+FP])^(1/2)
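
As a small worked example of these measures, the sketch below computes Sn, Sp and CC from counts of true/false positives and negatives. The function names are illustrative, and CC is computed in its standard four-term form (the Matthews correlation coefficient), with the square root taken over (TP+FP)(TN+FN)(TP+FN)(TN+FP). Note that "specificity" here follows the gene-prediction convention TP/(TP+FP).

```python
import math


def sensitivity(tp, fn):
    """Sn: fraction of true coding positions (or exons) that were predicted."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Sp as used in gene prediction: fraction of predictions that are correct."""
    return tp / (tp + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """CC in its standard four-term form; returns 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For instance, with TP=80, TN=90, FP=10, FN=20 the sensitivity is 0.8 and the specificity is 80/90; a perfect prediction (FP=FN=0) gives CC=1.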

10 Gene prediction in Eukaryotes
- Low gene density (3% in humans)
- Space between genes very large, with multiply repeated sequences and transposable elements
- Eukaryotic genes are split (introns/exons)
- Transcript is capped (methylation of 5’ residue)
- Splicing in spliceosome
- Alternative splicing
- Polyadenylation (~250 As added) downstream of CAATAAA(T/C) consensus box
- Major issue: identification of splicing sites
  - GT-AG rule (GTAAGT / Y12NCAG at the 5’/3’ intron splice junctions)
- Codon use frequencies
- ATG start codon
  - Kozak sequence (CCGCCATGG)

11 Ab initio programs
- Gene signals
  - Start/stop
  - Putative splice signals
  - Consensus sequences
  - Poly-A sites
- Gene content
  - Coding statistics
  - Non-random nucleotide distributions
  - Hexamer frequencies
- HMMs

12 Discriminant analysis
- Plot 2D graph of coding length versus 3’ splice site
- Place diagonal line (LDA) that separates true coding from non-coding sequences based on learnt knowledge
- QDA fits a quadratic curve
- FGENES uses LDA
- MZEF (Michael Zhang’s Exon Finder) uses QDA
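
The "diagonal line" of LDA can be illustrated with a two-feature Fisher discriminant. The sketch below is not how FGENES is implemented; it simply fits a line to labelled training points, where each point is a hypothetical (coding length, 3’ splice site score) pair. The function names and the idea of a numeric splice-site score are assumptions for illustration.

```python
def mean(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Within-class scatter terms (sxx, sxy, syy) around the mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_lda(coding, noncoding):
    """Return (w, threshold): the line w.x = threshold separates the two classes."""
    m1, m0 = mean(coding), mean(noncoding)
    a1, b1, c1 = scatter(coding, m1)
    a0, b0, c0 = scatter(noncoding, m0)
    a, b, c = a1 + a0, b1 + b0, c1 + c0  # pooled scatter matrix [[a, b], [b, c]]
    det = a * c - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = inverse(pooled scatter) * (m1 - m0)
    w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)
    mid = ((m1[0] + m0[0]) / 2, (m1[1] + m0[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def is_coding(point, w, threshold):
    """Classify a point by which side of the fitted line it falls on."""
    return w[0] * point[0] + w[1] * point[1] > threshold
```

Once `w, threshold = fit_lda(coding_points, noncoding_points)` has been fitted on known examples, `is_coding((length, score), w, threshold)` classifies a new candidate. QDA would replace the straight line with a quadratic boundary by giving each class its own covariance.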

13 Neural Nets
- A series of input, hidden and output layers
- Gene structure information is fed to the input layer, and is separated into several classes
  - Hexamer frequencies
  - Splice sites
  - GC composition
- Weights are calculated in the hidden layer to generate output of exon
- When the input layer is challenged with a new sequence, the rules that were generated to output exon are applied to the new sequence

14 HMMs
GenScan (http://genes.mit.edu/GENSCAN.html)
- 5th-order HMM
- Combines hexamer frequencies with coding signals
  - Initiation codons
  - TATA boxes
  - CAP site
  - Poly-A
- Trained on vertebrate, Arabidopsis and maize data
- Extensively used in the human genome project
HMMgene (http://www.cbs.dtu.dk/services/HMMgene)
- Identifies subregions of exons from cDNA or proteins
- Locks such regions and uses HMM extension into neighboring regions

17 Homology based programs
- Use translations to search for EST, cDNA and proteins in databases
GenomeScan (http://genes.mit.edu/genomescan.html)
- Combines GENSCAN with BLASTX
EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
- Compares EST and cDNA to user sequence
TwinScan
- Similar to GenomeScan

19 Consensus-based programs
- Use several different programs to generate lists of predicted exons
- Only common predicted exons are retained
GeneComber (http://www.bioinformatics.ubc.ca/gencombver/index.php)
- Combines HMMgene with GenScan
DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
- Combines FGENESH, GENSCAN and HMMgene
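
The "keep only the common exons" step can be sketched as a simple vote over exon coordinates. This is a simplification and not how GeneComber or DIGIT actually combine predictions: it assumes each program's output has already been parsed into (start, end) pairs and that exons agree only when both coordinates match exactly. The function name and the `min_votes` parameter are illustrative.

```python
def consensus_exons(predictions, min_votes=None):
    """Keep exons predicted by at least min_votes programs.

    predictions maps a program name to a collection of (start, end) exons.
    By default an exon must be predicted by every program.
    """
    if min_votes is None:
        min_votes = len(predictions)
    votes = {}
    for exons in predictions.values():
        for exon in set(exons):  # count each program at most once per exon
            votes[exon] = votes.get(exon, 0) + 1
    return sorted(exon for exon, n in votes.items() if n >= min_votes)
```

Requiring agreement from all programs favors specificity at the cost of sensitivity; lowering `min_votes` to a majority relaxes that trade-off.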

20 Accuracy
[Table: nucleotide-level Sn, Sp and CC, and exon-level Sn, Sp, (Sn+Sp)/2, ME and WE, for FGENES, GeneMark, Genie, GenScan, HMMgene, Morgan and MZEF; values not recoverable]

21 Chapter 9 Promoter and regulatory element prediction

22 Promoters are short regions upstream of the transcription start site
- Contain short (6-8 nt) transcription factor recognition sites
- Extremely laborious to define by experiment
- Sequence is not translated into protein, so no homology matching is possible
- Each promoter is unique, with a unique combination of factor binding sites, thus no consensus promoter

23 Prokaryotic gene
[Figure labels: polymerase, TF, TF site, -35 box, -10 box, ORF]
- σ70 factor binds to -35 and -10 boxes and recruits the full polymerase enzyme
- -35 box consensus sequence: TTGACA
- -10 box consensus sequence: TATAAT
- Transcription factors that activate or repress transcription
  - Bind to regulatory elements
  - DNA loops to allow long-distance interactions

24 Eukaryotic gene structure
[Figure labels: TF site, TATA, Inr, Pol II]
- Polymerase I, II and III
- Basal transcription factors (TFIID, TFIIA, TFIIB, etc.)
- TATA box (TATA(A/T)A(A/T))
- “Housekeeping” genes often do not contain TATA boxes
- Initiator site (Inr) (C/T)(C/T)CA(C/T)(C/T) coincides with transcription start
- Many TF sites
  - Activation/repression

25 Ab initio methods
- Promoter signals
  - TATA boxes
  - Hexamer frequencies
- Consensus sequence matching
- PSSM
- Numerous FPs
- HMMs incorporate neighboring information
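
PSSM matching, as listed on this slide, can be sketched as follows: build a log-odds matrix from a few aligned example sites, then slide it along a sequence and report windows scoring above a threshold. The function names, the uniform 0.25 background, the pseudocount and the threshold are all assumptions for illustration, and the code expects upper-case A/C/G/T input.

```python
import math


def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Build a log2-odds PSSM (list of per-position dicts) from aligned sites."""
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def scan(seq, pssm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pssm[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, score))
    return hits
```

Because the motifs are only six to eight nucleotides long, a permissive threshold returns many chance matches in a long genome, which is the source of the numerous false positives noted above.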

26 Promoter prediction in prokaryotes
- Find operon
- Upstream of first gene is promoter
- Wang rules (distance between genes, no ρ-independent termination, number of genomes that display linkage)
BPROM (http://www.softberry.com)
- Based on arbitrary setting of operon gene distances
- 200 bp upstream of first gene
- Many FPs
FindTerm (http://sun1.softberry.com)
- Searches for ρ-independent termination signals

27 Prediction in eukaryotes
- Searching for consensus sequences in databases (TransFac)
- Increase specificity by searching for CpG islands
  - High density of transcription factor binding sites
CpGProD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
- CG% in moving window
Eponine (http://servlet.sanger.ac.uk:8080/eponine/)
- Matches TATA box, CCAAT box, CpG island to PSSM
Cluster-Buster (http://zlab.bu.edu/cluster-buster/cbust.html)
- Detects high concentrations of TF sites
FirstEF (http://rulai.cshl.org/tools/FirstEF/)
- QDA of first exon boundary
McPromoter (http://genes.mit.edu/McPromoter.html)
- Neural net of DNA bendability, TATA box, initiator box
- Trained for Drosophila and human sequences
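
A moving-window CpG island scan can be sketched as below. This is not CpGProD's algorithm; it is a generic illustration that, besides the CG% named on the slide, also computes the observed/expected CpG ratio. The defaults (200 bp window, GC fraction of at least 0.5, ratio of at least 0.6) are commonly quoted CpG island criteria and are assumptions here, as is the function name.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return (start, gc_fraction, obs_exp_ratio) for windows passing both cutoffs."""
    seq = seq.upper()
    hits = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        c, g = w.count("C"), w.count("G")
        gc = (c + g) / window
        cpg = w.count("CG")  # observed CpG dinucleotides
        # Expected CpG count under independence is c * g / window.
        oe = cpg * window / (c * g) if c and g else 0.0
        if gc >= min_gc and oe >= min_oe:
            hits.append((i, gc, oe))
    return hits
```

Overlapping windows that pass could then be merged into island candidates and checked for nearby genes.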

28 Phylogenetic footprinting technique
- Identify conserved regulatory sites
  - Human-chimpanzee too close
  - Human-fish too distant
  - Human-mouse appropriate
ConSite (http://mordor.cgb.ki.se/cgi-bin/CONSITE/consite)
- Aligns two sequences by global alignment algorithm
- Identifies conserved regions and compares them to the TRANSFAC database
- High scoring hits returned as positives
rVISTA (http://rvista.dcode.org)
- Identifies TRANSFAC sites in two orthologous sequences
- Aligns sequences with local alignment algorithm
- Highest identity regions returned as hits
Bayes aligner (http://www.bioinfo.rpi.edu/applications/bayesian/bayes/bayes.align12.pl)
- Aligns two sequences with Bayesian algorithm
- Even weakly conserved regions identified

29 Expression-profiling based method
- Microarray analyses allow identification of co-regulated genes
- Assume that promoters contain similar regulatory sites
- Find such sites by EM and Gibbs sampling using iteration of PSSM
- Co-expressed genes may be regulated at higher levels
MEME (http://meme.sdsc.edu/meme/website/meme-intro.html)
AlignACE (http://atlas.med.harvard.edu/cgi-bin/alignace.pl)
- Gibbs sampling algorithm

30 Web humour…

