

Curation Tools
Gary Williams, Sanger Institute

SAB 2008

Gene curation – prediction software
Gene prediction software is good, but not perfect. Out of 100 Twinscan predictions checked:
– 55 were predicted correctly
– 29 differed from the curated sequence
– 7 merged/split genes incorrectly
– 1 predicted a pseudogene as CDS
– 2 missed a gene entirely
– 6 predicted a gene where none exists

Gene curation – sources of data
We have traditionally relied heavily on EST transcription data to correct predictions. We now have many additional data sources:
– Protein homology
– Mass-spec peptides
– Chip-based expression data
– Comparative species synteny/homology
– Other data coming (ENCODE etc.)

Confirming the correct structure
Evidence for a correct structure:
– Protein homology, transcript data, ab initio predictions, mass-spec peptides, tiling array, trans-spliced leader sequence, strong splice sites, etc.
Evidence against a correct structure:
– Unmatched instances of the above
– Frameshifts in protein alignment
– Overlapping exons
– Genes overlapping repeat regions

How to curate efficiently
– Ad hoc lists of problems
– Scan by eye
– Find anomalous regions

Curation methodology
Lists of problems
– Keep returning to previously curated regions
– Tedious to get to the next genome position
Scan by eye
– Pilot scan of 1 Mb done
– Inefficient and error-prone because most gene models are now correct
Find problem areas
– Database of evidence against “good” gene structure
– Look for concentrations of anomalies

Anomalous regions database
– We have a database of problem regions.
– Anomaly = a conflict with the curated data.
– Assumption: problem areas that need the most curation will have more anomalies than other places.
[Figure; labels: Problem areas, Anomalies]

Anomaly database
– Anomalies that have been seen can be flagged to be ignored in future.
– All anomalies in a region are presented for inspection en masse.
– We can track what has been seen and measure progress.
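The flag-and-track behaviour described on this slide could look roughly like the following Python sketch. The `Anomaly` record, its field names, and both helper functions are hypothetical illustrations; the slides do not show the real schema or code.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    """Hypothetical anomaly record; field names are assumptions."""
    anomaly_id: int
    window: int            # index of the genomic window holding the anomaly
    kind: str              # e.g. "UNMATCHED_TWINSCAN"
    ignored: bool = False  # set once a curator has seen and dismissed it


def pending_in_window(anomalies, window):
    """Return every anomaly in a window that has not been flagged as ignored."""
    return [a for a in anomalies if a.window == window and not a.ignored]


def progress(anomalies):
    """Fraction of all anomalies that curators have already flagged."""
    if not anomalies:
        return 0.0
    return sum(a.ignored for a in anomalies) / len(anomalies)


anomalies = [
    Anomaly(1, 0, "UNMATCHED_TWINSCAN"),
    Anomaly(2, 0, "SHORT_EXON"),
    Anomaly(3, 1, "UNMATCHED_TSL"),
    Anomaly(4, 1, "UNMATCHED_PROTEIN"),
]
anomalies[0].ignored = True  # a curator inspected anomaly 1 and dismissed it

remaining = pending_in_window(anomalies, 0)
done = progress(anomalies)
```

Keeping the flag on the anomaly itself means a dismissed anomaly never resurfaces, while the rest of its window is still presented together.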

Simple anomalies
– Protein homology unmatched by curated CDS
– Unmatched conserved coding regions
– Unmatched TSL sites
– Unmatched Twinscan/Genefinder
– Short exons (< 30 bases)
– CDS exons overlapping repeat region
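Two of the checks in this list are simple interval tests. A minimal Python sketch follows, assuming exons and repeats are given as 1-based inclusive (start, end) pairs on the same strand; the function names are hypothetical, and only the 30-base threshold comes from the slide.

```python
MIN_EXON_LENGTH = 30  # threshold from the slide: exons shorter than 30 bases


def exon_length(exon):
    """Length of a 1-based, inclusive (start, end) interval (assumed convention)."""
    start, end = exon
    return end - start + 1


def short_exons(exons, min_length=MIN_EXON_LENGTH):
    """Exons shorter than the minimum length."""
    return [e for e in exons if exon_length(e) < min_length]


def overlaps(a, b):
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exons_overlapping_repeats(exons, repeats):
    """Exons that overlap any repeat region."""
    return [e for e in exons if any(overlaps(e, r) for r in repeats)]


exons = [(100, 120), (200, 400), (500, 529)]
repeats = [(390, 450)]

too_short = short_exons(exons)
in_repeat = exons_overlapping_repeats(exons, repeats)
```

Here the 21-base exon is reported as short, the 30-base exon is not, and only the exon ending at 400 touches the repeat.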

Unmatched anomalies
[Figure; track labels: Anomalies, Expression, CDS, Protein hits, Twinscan, Splice sites]

Frameshift in exon
[Figure; track labels: Anomalies, Expression, CDS exon, Protein hits (Frame 1, Frame 2, Frame 3)]

Anomaly database
– Store anomalies in each 10 Kb region
– Sort windows by sum of anomaly scores
– Curator selects next 10 Kb window
– Curator selects anomaly to curate
– Acedb editor displays region
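The first two steps, binning anomalies into 10 Kb windows and ranking the windows by summed score, can be sketched in Python as below. This is an illustration only: the tuple layout, the anomaly type names, the scores, and the choice to assign each anomaly to the window containing its start coordinate are all assumptions, not details given in the slides.

```python
from collections import defaultdict

WINDOW_SIZE = 10_000  # 10 Kb windows, as stated on the slide


def rank_windows(anomalies, window_size=WINDOW_SIZE):
    """Sum anomaly scores per window and return windows, highest total first.

    anomalies: iterable of (chromosome, start, end, anomaly_type, score).
    Assumption: an anomaly belongs to the window that contains its start.
    """
    totals = defaultdict(float)
    for chrom, start, _end, _anomaly_type, score in anomalies:
        totals[(chrom, start // window_size)] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


example = [
    ("I", 1_200, 1_500, "UNMATCHED_TWINSCAN", 1.0),
    ("I", 3_400, 3_420, "SHORT_EXON", 1.0),
    ("I", 25_000, 25_900, "UNMATCHED_PROTEIN", 2.5),
    ("II", 9_999, 10_050, "UNMATCHED_TSL", 1.0),
]

ranked = rank_windows(example)
for (chrom, window), total in ranked:
    print(chrom, window * WINDOW_SIZE, total)
```

A curator would then work down the ranked list, opening the highest-scoring window first.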

Anomaly database – list of regions
[Screenshot: list of 10 Kb windows sorted by anomaly score]

Anomaly database – select region
[Screenshot; labels: Select a region, List of anomalies in region]

Anomaly database – select anomaly
[Screenshot; labels: Select an anomaly, Display of the anomaly (unmatched Twinscan)]

Efficiency
– Standard set of anomalies for curators to work on.
– Anomalies are not missed.
– Can quickly accept or reject regions to curate after a cursory glance.
– Makes finding problem areas easy: concentrate efforts on problem regions, with no unnecessary repeat visits to a region.
– Complex problem areas can still take a long time to solve.

Other anomalies
Work is continuing to add new types of anomaly:
– Tiling array expressed regions
– Conflicts with nGASP prediction
– Missing/extra exons compared with homologous genes
Adding a new anomaly type requires no changes to the database or curation tool, and the new type is amalgamated with the existing anomalies. Any new data can easily be added.
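One way a database can accept new anomaly types without schema changes is to store the type as an ordinary column value rather than as a separate table per type. The Python sketch below shows that idea with an in-memory SQLite database. The table layout, column names, and type names are hypothetical; the slides do not describe the actual schema or database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE anomaly (
        id           INTEGER PRIMARY KEY,
        chromosome   TEXT    NOT NULL,
        chrom_start  INTEGER NOT NULL,
        chrom_end    INTEGER NOT NULL,
        anomaly_type TEXT    NOT NULL,  -- free text, so new types need no schema change
        score        REAL    NOT NULL
    )
    """
)

insert = (
    "INSERT INTO anomaly "
    "(chromosome, chrom_start, chrom_end, anomaly_type, score) "
    "VALUES (?, ?, ?, ?, ?)"
)

# Existing anomaly types.
conn.executemany(insert, [
    ("I", 1200, 1500, "UNMATCHED_TWINSCAN", 1.0),
    ("I", 3400, 3420, "SHORT_EXON", 1.0),
])

# A new anomaly type is just another row value; the table is unchanged.
conn.execute(insert, ("I", 5000, 5400, "TILING_ARRAY_EXPRESSED", 1.0))

counts = conn.execute(
    "SELECT anomaly_type, COUNT(*) FROM anomaly "
    "GROUP BY anomaly_type ORDER BY anomaly_type"
).fetchall()
```

Because every type shares the same columns, window scoring and the curation display can treat a new type exactly like the old ones.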

Other species
– The anomaly database system can be used for curating the Tier II species.
– We will make the anomalies data for Tier II species available on the Genome Browser for users to see, as with C. elegans.
– The curation database system could be made available for use by other model organism projects.

end

More anomalies
– Frame-shifts defined by protein homologies.
– Genes to potentially be merged by protein homology evidence.
– Genes to potentially be split by protein groups evidence.

Megabase scan changes
[Figure; labels: St. Louis only, Hinxton only, Agreed by both]
Plus 7 agreed discrepancies

Unmatched anomalies
[Figure; track labels: Twinscan, C. remanei Protein, C. briggsae sequence conservations (codingWABA), TSL, C. briggsae Protein, C. elegans Protein, No curated CDS]

Frame-shifts by protein homology
– A protein aligned by BLAST.
– Small/no apparent intron.
– Near-contiguous regions of the protein.
[Figure; labels: Frame-shift, Frame 1, Frame 2]
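The three criteria on this slide suggest a simple test over consecutive alignment blocks of one protein. The Python sketch below is one possible reading, not the method actually used: the `Hsp` record, both gap thresholds, and the restriction to forward-strand hits are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One protein alignment block on the genome (hypothetical record)."""
    protein: str
    g_start: int  # genomic start, forward strand assumed
    g_end: int    # genomic end, inclusive
    p_start: int  # start position in the protein
    p_end: int    # end position in the protein, inclusive
    frame: int    # reading frame: 1, 2 or 3


MAX_GENOME_GAP = 15   # assumed: too small a gap to be a real intron
MAX_PROTEIN_GAP = 5   # assumed: blocks must be near-contiguous in the protein


def find_frameshifts(hsps, max_genome_gap=MAX_GENOME_GAP,
                     max_protein_gap=MAX_PROTEIN_GAP):
    """Return pairs of adjacent blocks from one protein that change frame
    with little or no genomic gap and near-contiguous protein coordinates."""
    by_protein = {}
    for hsp in hsps:
        by_protein.setdefault(hsp.protein, []).append(hsp)

    found = []
    for hits in by_protein.values():
        hits.sort(key=lambda h: h.g_start)
        for left, right in zip(hits, hits[1:]):
            genome_gap = right.g_start - left.g_end - 1
            protein_gap = right.p_start - left.p_end - 1
            if (left.frame != right.frame
                    and abs(genome_gap) <= max_genome_gap
                    and abs(protein_gap) <= max_protein_gap):
                found.append((left, right))
    return found


hsps = [
    Hsp("protA", 1000, 1149, 1, 50, 1),
    Hsp("protA", 1152, 1301, 51, 100, 3),  # 2-base gap, frame changes
    Hsp("protB", 5000, 5149, 1, 50, 2),
    Hsp("protB", 5600, 5749, 51, 100, 3),  # 450-base gap, plausibly an intron
]

shifts = find_frameshifts(hsps)
```

Only the protA pair is reported; the protB blocks also change frame, but the gap between them is large enough to be an intron.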

Frameshift in exon

Genes to merge by protein homology?
One protein matches two CDS in contiguous regions of the protein.
[Figure; labels: CDS 1, CDS 2]

Genes to merge by protein homology?
[Figure; labels: CDS 1, CDS 2; FlyBase, Human, SwissProt, TrEMBL; Proteins homologous to the two CDS]
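The merge criterion, one protein matching two CDS in contiguous regions of the protein, can be sketched in Python as follows. The input layout, the function name, and the gap threshold are assumptions. The sketch also does not check that the two CDS are neighbours on the genome, which a real implementation would presumably need to do.

```python
def merge_candidates(hits, max_protein_gap=10):
    """Find CDS pairs matched by adjacent stretches of the same protein.

    hits: list of (protein, cds, p_start, p_end), where p_start and p_end
    are inclusive positions within the protein.
    max_protein_gap is an assumed tolerance for "contiguous".
    """
    by_protein = {}
    for protein, cds, p_start, p_end in hits:
        by_protein.setdefault(protein, []).append((p_start, p_end, cds))

    candidates = []
    for protein, rows in by_protein.items():
        rows.sort()  # order the matches along the protein
        for (_a_start, a_end, cds_a), (b_start, _b_end, cds_b) in zip(rows, rows[1:]):
            gap = b_start - a_end - 1
            if cds_a != cds_b and abs(gap) <= max_protein_gap:
                candidates.append((cds_a, cds_b, protein))
    return candidates


hits = [
    ("protA", "CDS1", 1, 200),
    ("protA", "CDS2", 205, 400),  # continues protA almost where CDS1 stops
    ("protB", "CDS1", 1, 150),
    ("protB", "CDS3", 300, 450),  # large gap in protB, so not contiguous
]

to_merge = merge_candidates(hits)
```

The more independent proteins support the same CDS pair, the stronger the case for merging, so counting supporting proteins per pair would be a natural extension.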

Gene to split by protein groups?
No members in common between the two non-overlapping groups.
[Figure; labels: CDS, Protein group 1, Protein group 2]

Gene to split by protein groups?
[Figure; labels: protein group 1, protein group 2, protein group 3]
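The split test can be sketched in Python by clustering protein hits that overlap along a CDS into groups, then checking whether neighbouring groups share any proteins. The input layout and function names are hypothetical, and grouping by simple positional overlap is an assumption about what "protein group" means on these slides.

```python
def protein_groups(hits):
    """Cluster protein hits on one CDS into groups of overlapping hits.

    hits: list of (protein, start, end), inclusive coordinates on the CDS.
    """
    groups = []
    for protein, start, end in sorted(hits, key=lambda h: h[1]):
        if groups and start <= groups[-1]["end"]:
            # Overlaps the current group: extend it and record the protein.
            groups[-1]["end"] = max(groups[-1]["end"], end)
            groups[-1]["proteins"].add(protein)
        else:
            groups.append({"start": start, "end": end, "proteins": {protein}})
    return groups


def split_candidate(hits):
    """True if two neighbouring, non-overlapping groups share no proteins."""
    groups = protein_groups(hits)
    return any(a["proteins"].isdisjoint(b["proteins"])
               for a, b in zip(groups, groups[1:]))


# Two groups with entirely different proteins: possibly two fused genes.
fused = [("p1", 1, 300), ("p2", 50, 320), ("p3", 600, 900), ("p4", 650, 950)]

# Protein p1 matches both halves, so the CDS looks like a single gene.
single = [("p1", 1, 300), ("p1", 600, 900), ("p2", 50, 320)]

fused_flag = split_candidate(fused)
single_flag = split_candidate(single)
```

A protein that hits both groups is evidence that the two halves belong together, which is why shared membership suppresses the flag.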

We will continue to do…
C. elegans genomic sequence changes
– Transcript data
– 3rd party submissions
C. elegans gene model curation
– Curation tool anomalies
– User input
– Literature

Progress – anomalies checked

nGASP problems in C. elegans
nGASP gene predictors are still not perfect. Out of 100 Jigsaw predictions checked (Twinscan figures in parentheses):
– 81 (55) were predicted correctly
– 1 (0) correctly indicated a required change
– 10 (25) differed (7 probably incorrectly)
– 3 (7) merged/split genes incorrectly
– 3 (1) predicted a pseudogene as CDS
– 1 (2) missed a gene entirely
– 1 (6) predicted a gene where none exists