Srr-1 from Streptococcus. i/v nonpolar s serine (polar uncharged) n/s/t polar uncharged s serine (polar uncharged) e glutamic acid (neg. charge) s serine.


1 Srr-1 from Streptococcus

2

3 i/v nonpolar s serine (polar uncharged) n/s/t polar uncharged s serine (polar uncharged) e glutamic acid (neg. charge) s serine (polar uncharged)

4 Streptococcal Srr proteins
S, signal sequence
N, non-repeat region
RI, small repeat region I
RII, large repeat region II
A, cell wall sorting signal
(X)S, di-peptide repeat motif.

5 Gene prediction sequence

6 Prokaryotic gene
“Small” genomes, high gene density
–Haemophilus influenzae genome 85% genic
Operons
–One transcript, many genes
No introns
–One gene, one protein
Open reading frames
–One ORF per gene
–ORFs begin with start, end with stop codon
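The rule that an ORF begins with a start codon and ends with a stop codon can be sketched in a few lines of Python. This is a minimal illustration, not one of the tools named in these slides: the function name `find_orfs` and the `min_codons` threshold are assumptions made for the example, only ATG is treated as a start codon, and only the three forward-strand reading frames are scanned.

```python
# Minimal ORF scan (illustrative sketch; names and thresholds are assumptions).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the A in ATG, end is the index just past the
    stop codon, and the codon count includes both start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATAGGG"):
        print(frame, start, end)
```

A real prokaryotic gene finder would also scan the reverse complement, accept alternative start codons, and apply a much larger minimum length.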

7 Eukaryotic Gene
Much lower gene density
Undergo several post-transcriptional modifications
–5’ cap
–Poly-A tail
–Splicing

8 Goal of Genomics
To understand the function of every gene in an organism
1. Sequence the genome
2. Characterize each gene
–Some are already known
–Many are similar to known genes
–40% are unknown (no homolog characterized)

9 Collating the evidence DNA databases (EMBL/Genbank/DDBJ) Protein databases (Swall) TrEMBL (automatic translation of CDS from DNA db’s) Swissprot (curated data) mRNA (cDNA) dbEST (ESTs) Genomic (finished, draft) Genome Browsers (Ensembl, UCSC, NCBI) Genome assembly Gene prediction Gene/Protein info Supporting evidence Exon/intron structure Ancillary databases Reference sequences (REFSEQ) NM_00001 (mRNA) XM_00001 (predicted mRNA) Domain databases (Interpro, CDD) PFAM, ProDom Smart, Prints Prosite, TIGRfam LocusLink/Gene Gene/Locus Pubmed Unigene Omim Homology maps Human mutation db supporting evidence

10 Genome Browsers
Ensembl: www.ensembl.org
–EBI and Sanger collaboration
–Gene build, predicts novel genes
UCSC: genome.ucsc.edu
–University of California, Santa Cruz
–Annotates other gene builds
NCBI: www.ncbi.nlm.nih.gov/mapview/
–NCBI map viewer
–Gene build, predicts novel genes

11 Predicting genes
Open Reading Frames (ORFs)
–frequency of stop codons
–simple algorithm, easy to interpret
Composition bias
–coding vs. noncoding
Sequence signals
–enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
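As a rough illustration of composition bias, the sketch below computes the GC fraction in sliding windows along a sequence. It is only a crude proxy chosen for this example: the slides do not say which statistic is meant, and gene finders typically rely on richer measures such as codon usage. The function names `gc_fraction` and `windowed_gc` and the default window and step sizes are assumptions.

```python
# Sliding-window GC content (illustrative sketch; names and defaults are assumptions).

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq, window=100, step=50):
    """Return (window_start, gc_fraction) for each full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: an AT-rich half followed by a GC-rich half.
    for start, gc in windowed_gc("AAAATTTTGGGGCCCC", window=8, step=8):
        print(start, gc)
```

Windows whose composition departs sharply from the genome-wide background are candidates for closer inspection, not predictions in themselves.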

12

13 Predicted genes are of 4 types
Known genes (highest quality)
–as catalogued by the reference sequence project
–Ensembl known genes (red genes)
–NCBI known genes
Novel genes (1) (high quality)
–based on similarity to known genes, or cDNAs
–these need not have 100% matching supporting evidence
–Ensembl novel genes (black)
–NCBI Loc genes

14 Predicted genes are of 4 types
Novel genes (2) (high quality)
–based on the presence of ESTs
–resource of alternative splicing
–EST genes in Ensembl (purple)
–Database of transcribed sequences (DOTs) Assembly
Ab initio gene prediction (questionable)
–Single organism: Genscan
–Comparative information: Twinscan
Pseudogenes
–match a known gene but with a disrupted ORF
–a minefield!

15 Gene prediction programs
Ab initio gene prediction
–First ones predicted single exons, e.g. GRAIL (Uberbacher, ‘91) or MZEF (Zhang, ‘97)
–Later, predict entire genes e.g. Genscan (Burge ‘97) and Fgenesh (Solovyev, ‘95)
–Predict individual exons based on codon usage and sequence signals (start, stop, splice sites) followed by assembly of putative exons into genes
–Genscan predicts 90% of coding nucleotides, and 70% of coding exons (Guigo, ‘00)
–Gene prediction methods alone cannot accurately identify every gene in a genome

16 Twinscan
Gene structure prediction model
Extends probability model of GENSCAN
Exploits homology between two related genomes
Notable improvement on GENSCAN

17 Output from Artemis

18 Bias in nucleotide frequency

19 Prediction of URO-D structure using different programs

20 Prediction of URO-D structure using GRAIL and an external EST database

21 Prediction of URO-D structure using GENEWISE and different species as targets

22 Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output

23 Supporting evidence mRNA reverse transcription cDNA Expressed Sequence Tag (EST) full length cDNA sequence

24 Measuring accuracy
Sn = Sensitivity = TP/(TP+FN)
–How many exons were found out of total present?
Sp = Specificity = TP/(TP+FP)
–How many predicted exons were correct out of total exons predicted?
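The two formulas can be worked through at the exon level with a short sketch. The assumptions here are that an exon is represented as a `(start, end)` tuple and that a predicted exon counts as a true positive only when both coordinates match an annotated exon exactly. The function name `exon_accuracy` and the coordinates are made up for illustration.

```python
# Exon-level Sn and Sp as defined on the slide (illustrative sketch).

def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp) for exons given as (start, end) tuples.

    Sn = TP / (TP + FN), Sp = TP / (TP + FP), with exact coordinate
    matching (an assumption of this sketch).
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and really present
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


if __name__ == "__main__":
    # Hypothetical coordinates, invented for this example.
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(100, 200), (300, 400), (450, 480)]
    sn, sp = exon_accuracy(predicted, annotated)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

With these made-up exons, two of the four annotated exons are found (Sn = 2/4) and two of the three predictions are correct (Sp = 2/3).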

25 Twinscan

26 Why the errors?
First exons tend to be short so there is less information to use.
Parameters for one organism may not be useful for another organism. Quality degrades with phylogenetic distance.
EST libraries contaminated with genomic sequences
Pseudogenes - test rate of synonymous substitutions (stops are rarer)

27 Other sources of gene prediction
ORF detectors
–NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
Promoter predictors
–CSHL: http://rulai.cshl.org/software/index1.htm
–BDGP: fruitfly.org/seq_tools/promoter.html
–ICG: TATA-Box predictor
PolyA signal predictors
–CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors
–BDGP: http://www.fruitfly.org/seq_tools/splice.html
Start-/stop-codon identifiers
–DNALC: Translator/ORF-Finder
–BCM: Searchlauncher

