Srr-1 from Streptococcus

Presentation transcript:

Srr-1 from Streptococcus

I/V: nonpolar
S: serine (polar uncharged)
N/S/T: polar uncharged
S: serine (polar uncharged)
E: glutamic acid (negative charge)
S: serine (polar uncharged)

Streptococcal Srr proteins
S, signal sequence
N, non-repeat region
RI, small repeat region I
RII, large repeat region II
A, cell wall sorting signal
(X)S, di-peptide repeat motif

Gene prediction sequence

Prokaryotic gene
“Small” genomes, high gene density
–Haemophilus influenzae genome 85% genic
Operons
–One transcript, many genes
No introns
–One gene, one protein
Open reading frames
–One ORF per gene
–ORFs begin with a start codon and end with a stop codon
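The rule that an ORF begins with a start codon and ends with a stop codon can be sketched as a simple scan. This is a minimal illustration, not the method of any particular tool: it assumes ATG as the only start codon, looks at the forward strand only, and the name `find_orfs` and the `min_codons` cutoff are hypothetical.

```python
# Minimal forward-strand ORF scan (illustrative sketch only).
# Assumptions: ATG is the only start codon; TAA/TAG/TGA are the stops;
# the reverse strand is ignored.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG, end is the index just past the
    stop codon, and frame is 0, 1 or 2. min_codons counts the start and
    stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGATT"))  # [(2, 11, 2)]
```

In the example, the only complete ORF is ATG AAA TGA in frame 2. A second ATG in frame 1 has no in-frame stop before the sequence ends, so it is not reported.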

Eukaryotic gene
Much lower gene density
Transcripts undergo several post-transcriptional modifications
–5’ cap
–Poly-A tail
–Splicing

Goal of Genomics
To understand the function of every gene in an organism
1. Sequence the genome
2. Characterize each gene
–Some are already known
–Many are similar to known genes
–40% are unknown (no homolog characterized)

Collating the evidence
DNA databases (EMBL/GenBank/DDBJ)
–mRNA (cDNA)
–dbEST (ESTs)
–Genomic (finished, draft)
Protein databases (Swall)
–TrEMBL (automatic translation of CDS from DNA db’s)
–Swissprot (curated data)
Genome Browsers (Ensembl, UCSC, NCBI)
–Genome assembly
–Gene prediction
–Gene/Protein info
–Supporting evidence
–Exon/intron structure
Ancillary databases
–Reference sequences (REFSEQ): NM_00001 (mRNA), XM_00001 (predicted mRNA)
–Domain databases (Interpro, CDD): PFAM, ProDom, Smart, Prints, Prosite, TIGRfam
–LocusLink/Gene (Gene/Locus)
–Pubmed
–Unigene
–Omim
–Homology maps
–Human mutation db
–Supporting evidence

Genome Browsers
Ensembl: EBI and Sanger collaboration
–Gene build, predicts novel genes
UCSC: genome.ucsc.edu, University of California, Santa Cruz
–Annotates other gene builds
NCBI: NCBI map viewer
–Gene build, predicts novel genes

Predicting genes
Open Reading Frames (ORFs)
–frequency of stop codons
–simple algorithm, easy to interpret
Composition bias
–coding vs. noncoding
Sequence signals
–enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
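The stop-codon criterion rests on simple arithmetic: 3 of the 64 codons are stops, so in random sequence with equal base frequencies a stop is expected about once every 21 codons, and long stop-free runs are unlikely by chance. The sketch below measures this for one reading frame. The function names are hypothetical and the code illustrates the idea only, not any specific gene finder.

```python
# Stop-codon statistics for a single reading frame (illustrative sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Expected stop-codon fraction in random sequence with equal base
# frequencies: 3 stop codons out of 64 possible codons.
EXPECTED_RANDOM_STOP_FRACTION = 3 / 64


def codons_in_frame(seq, frame=0):
    """Split seq into complete codons, starting at offset `frame`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def stop_codon_fraction(seq, frame=0):
    """Fraction of codons in the given frame that are stop codons."""
    codons = codons_in_frame(seq, frame)
    if not codons:
        return 0.0
    return sum(c in STOP_CODONS for c in codons) / len(codons)


def longest_stop_free_run(seq, frame=0):
    """Length, in codons, of the longest run without a stop codon."""
    longest = current = 0
    for codon in codons_in_frame(seq, frame):
        if codon in STOP_CODONS:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


print(longest_stop_free_run("ATGAAACCCTAAGGG"))  # 3
```

A frame whose stop-codon fraction is well below `EXPECTED_RANDOM_STOP_FRACTION` over a long stretch is a candidate coding region. How long is "long enough" depends on the genome's base composition and is not fixed here.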

Predicted genes are of 4 types
Known genes (highest quality)
–as catalogued by the reference sequence project
–Ensembl known genes (red genes)
–NCBI known genes
Novel genes (1) (high quality)
–based on similarity to known genes or cDNAs
–these need not have 100% matching supporting evidence
–Ensembl novel genes (black)
–NCBI Loc genes

Predicted genes are of 4 types
Novel genes (2) (high quality)
–based on the presence of ESTs
–resource of alternative splicing
–EST genes in Ensembl (purple)
–Database of transcribed sequences (DOTs) assembly
Ab initio gene prediction (questionable)
–Single organism: Genscan
–Comparative information: Twinscan
Pseudogenes
–match a known gene but have a disrupted ORF: a minefield!

Gene prediction programs
Ab initio gene prediction
–The first programs predicted single exons, e.g. GRAIL (Uberbacher, ‘91) or MZEF (Zhang, ‘97)
–Later programs predict entire genes, e.g. Genscan (Burge, ‘97) and Fgenesh (Solovyev, ‘95)
–They predict individual exons based on codon usage and sequence signals (start, stop, splice sites), then assemble the putative exons into genes
–Genscan predicts 90% of coding nucleotides and 70% of coding exons (Guigo, ‘00)
–Gene prediction methods alone cannot accurately identify every gene in a genome

Twinscan
Gene structure prediction model
Extends the probability model of GENSCAN
Exploits homology between two related genomes
Notable improvement on GENSCAN

Output from Artemis

Bias in nucleotide frequency

Prediction of URO-D structure using different programs

Prediction of URO-D structure using GRAIL and an external EST database

Prediction of URO-D structure using GENEWISE and different species as targets

Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output

Supporting evidence
mRNA → reverse transcription → cDNA
cDNA is sequenced either as an Expressed Sequence Tag (EST) or as a full-length cDNA sequence

Measuring accuracy
Sn = Sensitivity = TP/(TP+FN)
–How many exons were found out of the total present?
Sp = Specificity = TP/(TP+FP)
–How many predicted exons were correct out of the total exons predicted?
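The two measures can be computed at the exon level as sketched below. This is a minimal example under one stated assumption: a predicted exon counts as a true positive only if both of its boundaries match an annotated exon exactly. The function name `exon_accuracy` and the coordinates are hypothetical.

```python
# Exon-level sensitivity and specificity (illustrative sketch).
# Assumption: an exon is a (start, end) pair and is "correct" only when
# both coordinates match an annotated exon exactly.
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp) using the definitions given on the slide."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and really present
    fp = len(predicted - annotated)   # predicted but not present
    fn = len(annotated - predicted)   # present but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 600)]
sn, sp = exon_accuracy(predicted, annotated)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # Sn = 0.50, Sp = 0.67
```

Here two of the four annotated exons are recovered (Sn = 2/4) and two of the three predictions are correct (Sp = 2/3). The third prediction misses its start coordinate by 10 bases, so it counts as both a false positive and a false negative. Note that Sp as defined here is the quantity usually called precision outside the gene-prediction literature.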

Twinscan

Why the errors?
First exons tend to be short, so there is less information to use.
Parameters for one organism may not be useful for another organism.
Quality degrades with phylogenetic distance.
EST libraries are contaminated with genomic sequences.
Pseudogenes: test the rate of synonymous substitutions (stops are rarer).

Other sources of gene prediction
ORF detectors
–NCBI: ***
Promoter predictors
–CSHL:
–BDGP: fruitfly.org/seq_tools/promoter.html
–ICG: TATA-Box predictor
PolyA signal predictors
–CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors
–BDGP:
Start-/stop-codon identifiers
–DNALC: Translator/ORF-Finder
–BCM: Searchlauncher