Srr-1 from Streptococcus. i/v nonpolar s serine (polar uncharged) n/s/t polar uncharged s serine (polar uncharged) e glutamic acid (neg. charge) s serine (polar uncharged).

Srr-1 from Streptococcus

–i/v: isoleucine/valine (nonpolar)
–s: serine (polar uncharged)
–n/s/t: asparagine/serine/threonine (polar uncharged)
–s: serine (polar uncharged)
–e: glutamic acid (negatively charged)
–s: serine (polar uncharged)
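The residue pattern above can be searched for with a regular expression. This is a minimal sketch, assuming the six positions form one contiguous motif read left to right; the slide does not confirm that. The function name `find_motif` and the toy sequence are hypothetical and are not taken from Srr-1.

```python
import re

# Assumption: the six listed positions are contiguous, in one-letter codes:
# [I or V] S [N, S or T] S E S
MOTIF = re.compile(r"[IV]S[NST]SES")

def find_motif(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein.upper())]

# Hypothetical toy sequence, not a real Srr-1 fragment.
toy = "MKISNSESAAVSTSESLL"
for start, hit in find_motif(toy):
    print(start, hit)
```

Note that `finditer` reports non-overlapping matches only; a lookahead pattern would be needed if overlapping repeats matter.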

Streptococcal Srr proteins
S, signal sequence
N, non-repeat region
RI, small repeat region I
RII, large repeat region II
A, cell wall sorting signal
(X)S, di-peptide repeat motif

Gene prediction sequence

Prokaryotic gene
“Small” genomes, high gene density
–Haemophilus influenzae genome 85% genic
Operons
–One transcript, many genes
No introns
–One gene, one protein
Open reading frames
–One ORF per gene
–ORFs begin with start, end with stop codon
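Because a prokaryotic ORF runs from a start codon to a stop codon with no introns, a simple scan already finds candidates. The sketch below is illustrative and makes several assumptions: only ATG counts as a start, only the forward strand is scanned (a real scan would also cover the reverse complement), and the name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for ORFs on the forward strand.

    start is the 0-based index of the ATG; end is the index just past the
    stop codon. min_codons counts codons from the ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs

# Hypothetical toy sequence.
dna = "CCATGAAATTTTAGGG"
for frame, start, end in find_orfs(dna):
    print(frame, start, end, dna[start:end])
```

An ORF that is still open at the end of the sequence is not reported, which is a deliberate simplification.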

Eukaryotic Gene
Much lower gene density
Transcripts undergo several post-transcriptional modifications
–5′ cap
–Poly-A tail
–Splicing

Goal of Genomics
To understand the function of every gene in an organism
1. Sequence the genome
2. Characterize each gene
–Some are already known
–Many are similar to known genes
–40% are unknown (no homolog characterized)

Collating the evidence
DNA databases (EMBL/GenBank/DDBJ)
Protein databases (SWALL)
–TrEMBL (automatic translation of CDS from DNA databases)
–Swiss-Prot (curated data)
mRNA (cDNA)
dbEST (ESTs)
Genomic (finished, draft)
Genome browsers (Ensembl, UCSC, NCBI)
–Genome assembly
–Gene prediction
–Gene/protein info
–Supporting evidence
–Exon/intron structure
Ancillary databases
–Reference sequences (RefSeq): NM_00001 (mRNA), XM_00001 (predicted mRNA)
–Domain databases (InterPro, CDD): Pfam, ProDom, SMART, PRINTS, PROSITE, TIGRfam
–LocusLink/Gene: gene/locus
–PubMed
–UniGene
–OMIM
–Homology maps
–Human mutation db
–Supporting evidence

Genome Browsers
Ensembl: www.ensembl.org
–EBI and Sanger collaboration
–Gene build, predicts novel genes
UCSC: genome.ucsc.edu
–University of California, Santa Cruz
–Annotates other gene builds
NCBI: www.ncbi.nlm.nih.gov/mapview/
–NCBI Map Viewer
–Gene build, predicts novel genes

Predicting genes
Open Reading Frames (ORFs)
–frequency of stop codons
–simple algorithm, easy to interpret
Composition bias
–coding vs. noncoding
Sequence signals
–enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
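The stop-codon argument can be made concrete with a little arithmetic. As a rough sketch, assume all 64 codons are equally likely in noncoding sequence (real genomes deviate from this); then 3 of 64 codons are stops, and the chance that a reading frame stays open for n codons is (61/64)^n. The function names below are hypothetical.

```python
import math

def p_open(n_codons: int, p_stop: float = 3 / 64) -> float:
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons

def min_codons_for(alpha: float, p_stop: float = 3 / 64) -> int:
    """Smallest ORF length (in codons) whose chance occurrence is <= alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - p_stop))

for n in (50, 100, 300):
    print(n, p_open(n))
print(min_codons_for(0.01))
```

Under this uniform model an open frame of roughly 100 codons is already unlikely to arise by chance, which is why long ORFs are a useful, easy-to-interpret signal of coding sequence.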

Predicted genes are of 4 types
Known genes (highest quality)
–as catalogued by the reference sequence project
–Ensembl known genes (red genes)
–NCBI known genes
Novel genes (1) (high quality)
–based on similarity to known genes, or cDNAs
–these need not have 100% matching supporting evidence
–Ensembl novel genes (black)
–NCBI Loc genes

Predicted genes are of 4 types
Novel genes (2) (high quality)
–based on the presence of ESTs
–resource of alternative splicing
–EST genes in Ensembl (purple)
–Database of transcribed sequences (DOTs) Assembly
Ab initio gene prediction (questionable)
–Single organism: Genscan
–Comparative information: Twinscan
Pseudogenes
–match a known gene but with a disrupted ORF
–a minefield!

Gene prediction programs
Ab initio gene prediction
–First ones predicted single exons, e.g. GRAIL (Uberbacher, ’91) or MZEF (Zhang, ’97)
–Later ones predict entire genes, e.g. Genscan (Burge, ’97) and Fgenesh (Solovyev, ’95)
–Predict individual exons based on codon usage and sequence signals (start, stop, splice sites), followed by assembly of putative exons into genes
–Genscan predicts 90% of coding nucleotides and 70% of coding exons (Guigo, ’00)
–Gene prediction methods alone cannot accurately identify every gene in a genome

Twinscan
Gene structure prediction model
–Extends probability model of GENSCAN
–Exploits homology between two related genomes
–Notable improvement on GENSCAN

Output from Artemis

Bias in nucleotide frequency

Prediction of URO-D structure using different programs

Prediction of URO-D structure using GRAIL and an external EST database

Prediction of URO-D structure using GENEWISE and different species as targets

Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output

Supporting evidence
mRNA → reverse transcription → cDNA
–Expressed Sequence Tag (EST)
–full-length cDNA sequence

Measuring accuracy
Sn = Sensitivity = TP/(TP+FN)
–How many exons were found out of total present?
Sp = Specificity = TP/(TP+FP)
–How many predicted exons were correct out of total exons predicted?
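The two measures above can be computed at the exon level as follows. This is a minimal sketch: it assumes an exon counts as a true positive only when both boundaries match the annotation exactly, and the function name `exon_accuracy` and the coordinates are made up for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp) for exons given as (start, end) tuples.

    Assumption: a predicted exon is correct only if both of its
    boundaries match an annotated exon exactly.
    """
    pred, true = set(predicted), set(annotated)
    tp = len(pred & true)   # predicted and really present
    fn = len(true - pred)   # present but missed
    fp = len(pred - true)   # predicted but not present
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Hypothetical coordinates for illustration only.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (520, 600)]
sn, sp = exon_accuracy(predicted, annotated)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Here two of four annotated exons are recovered (Sn = 0.50) and two of three predictions are correct (Sp = 0.67); the exon with one shifted boundary counts as both a miss and a false prediction.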

Twinscan

Why the errors?
–First exons tend to be short, so there is less information to use.
–Parameters for one organism may not be useful for another organism. Quality degrades with phylogenetic distance.
–EST libraries contaminated with genomic sequences
–Pseudogenes: test rate of synonymous substitutions (stops are rarer)

Other sources of gene prediction
ORF detectors
–NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
Promoter predictors
–CSHL: http://rulai.cshl.org/software/index1.htm
–BDGP: fruitfly.org/seq_tools/promoter.html
–ICG: TATA-Box predictor
PolyA signal predictors
–CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors
–BDGP: http://www.fruitfly.org/seq_tools/splice.html
Start-/stop-codon identifiers
–DNALC: Translator/ORF-Finder
–BCM: Searchlauncher
