
1 Annotation and Visualization Doreen Ware

2 Project Challenges
- Rapidly growing sequence data
- Full annotation of all clones
- New high-performance computing cluster
  - 2,000 nodes
  - Scheduling system (Sun Grid Engine)
  - NFS issues
- EnsEMBL code integration

3 Milestones
- www.maizesequence.org released
- Customized entry points of the Ensembl browser for the maize community
- Adapted modules to the new compute cluster, Blue Helix, and automated gene predictions, MDR analysis, and RepeatMasker
- Alignments of cereal sequence using Gramene Biopipe (needs to be automated)
- Transitioned from annotating finished BACs to all BACs as they become available
- BLAST server
- FTP site
- DAS server (displaying Twinscan annotations)

4 Index Page

5 Maizesequence.org RSS BAC Notification
Users can be notified of sequence and annotation updates to a particular region of interest on the FPC map via an RSS (Really Simple Syndication) notification system. Data is delivered as XML to the user's favorite feed reader or is parsed in RSS-enabled browsers. The URL for any given query is persistent and dynamically retrieves database updates in the user-specified region.
…
www.maizesequence.org/Zea_mays/notification
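
Because the notifications are delivered as XML, a script can consume them as easily as a feed reader. The following is a minimal sketch, assuming the feed follows the common RSS 2.0 layout (`channel`, `item`, `title`, `pubDate`). The sample document and the function name `list_updates` are hypothetical and are not taken from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content: titles and dates are illustrative only,
# not real output from the notification system.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example region updates</title>
    <item>
      <title>Example BAC updated</title>
      <pubDate>Mon, 03 Sep 2007 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Another example BAC updated</title>
      <pubDate>Mon, 10 Sep 2007 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def list_updates(feed_xml):
    """Return (title, pubDate) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, published in list_updates(SAMPLE_FEED):
        print(published, title)
```

Since the query URL is persistent, the same parsing step could be rerun on a schedule against a saved query to pick up new entries.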

6 Maizesequence.org FTP and Blast DB
Ensembl BAC DB, Weekly Bulk Genome Dump
Maize FTP: BACs, BAC Contigs, Ab initio predictions, Ab initio translations
Maize Blast: BAC Contigs, Ab initio predictions, Ab initio translations
BACs, BAC Contigs, FgenesH predictions (TE and non-TE classes), and FgenesH translations are dumped on a weekly basis.
Sequence dumps are posted to the FTP site (ftp.maizesequence.org).
Sequence dumps are also used to update the blast databases (www.maizesequence.org/Multi/blastview).

7 MapView

8 CytoView

9 ContigView

10 GeneView

11

12 ExportView

13

14

15 Maize Databases and Annotation Pipeline

16 Classification of Gene Models
Ab initio gene prediction on non-masked contigs with FGENESH using Monocot parameters.
Classified gene models by BLASTP to GenBank NRAA.
TE = Alignment to transposable elements (TE), as specified within curated database.
NH = No detectable homology.
WH = Significant alignment to non-TE.
Corrupted_translation = Ensembl translation inconsistent with FGENESH.

Gene Model Class                 Minimum/Maximum   Average   Median   Standard Deviation
TE size (bases)                  5123,913          2,739     2,402    1,916
WH size (bases)                  7325,816          2,465     1,829    2,146
NH size (bases)                  319,465           975       645      944
Corrupted_translation (bases)    825,869           2,251     1,845    1,813

Data generated as of September 2007
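
The four classes above can be read as a small decision rule. The following is a sketch only: the E-value cutoff, the precedence of TE over non-TE hits when a model has both, and the names `classify_gene_model` and `max_evalue` are assumptions for illustration and are not stated on the slide.

```python
def classify_gene_model(hits, ensembl_translation, fgenesh_translation,
                        max_evalue=1e-10):
    """Assign a gene model to TE, WH, NH, or Corrupted_translation.

    hits: list of (evalue, is_te) tuples, one per BLASTP alignment, where
          is_te marks subjects listed in a curated transposable-element set.
    max_evalue: hypothetical significance cutoff (an assumption).
    """
    # A translation that differs between Ensembl and FGENESH is flagged first.
    if ensembl_translation != fgenesh_translation:
        return "Corrupted_translation"

    # Keep only the alignments that pass the significance cutoff.
    significant = [is_te for evalue, is_te in hits if evalue <= max_evalue]

    if not significant:
        return "NH"   # no detectable homology
    if any(significant):
        return "TE"   # assumption: any significant TE hit takes precedence
    return "WH"       # significant alignment to non-TE only


if __name__ == "__main__":
    print(classify_gene_model([(1e-50, True)], "MKV", "MKV"))
    print(classify_gene_model([(1e-30, False)], "MKV", "MKV"))
    print(classify_gene_model([(0.5, False)], "MKV", "MKV"))
    print(classify_gene_model([], "MKV", "MKL"))
```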

17 Nucleotide Coverage of Mathematically-Defined Repeats in 10,352 Annotated BACs (130,978 Contigs)

MDR Type*      Total Nucleotides   Nucleotide Coverage
2 copies       1,325,811,407       79.11%
10 copies      937,789,153         55.96%
100 copies     602,350,024         35.94%
1000 copies    218,650,689         13.05%

*Mathematically defined repeats (MDRs) indicate regions of repetitive DNA. The frequency of each constituent 20-mer along the BAC sequence was determined within the raw reads of the maize whole genome shotgun sequence (DOE Joint Genome Institute). MDR type 2 copies indicates regions over which 20-mers occurred two or more times. Thus, MDR type 10 copies, MDR type 100 copies, and MDR type 1000 copies indicate, respectively, regions over which 20-mers occurred ten or more times, one hundred or more times, and one thousand or more times. The most repetitive regions correspond to regions in the MDR type 1000 copies. The least repetitive regions correspond to areas in the MDR type 2 copies.

Data generated as of September 2007
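
The MDR definition in the footnote can be sketched in a few lines. This is an illustration under stated assumptions, not the project's pipeline: a base is counted as covered if any 20-mer overlapping it meets the copy threshold, strands are not merged (no reverse complements), and the names `count_kmers` and `mdr_coverage` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k=20):
    """Count how often each k-mer occurs across a collection of raw reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mdr_coverage(sequence, counts, min_copies, k=20):
    """Fraction of bases in `sequence` covered by k-mers seen >= min_copies times.

    Assumption: a base is covered when at least one k-mer overlapping it
    reaches the threshold in the read-derived counts.
    """
    if not sequence:
        return 0.0
    covered = [False] * len(sequence)
    for i in range(len(sequence) - k + 1):
        # Counter returns 0 for k-mers never seen in the reads.
        if counts[sequence[i:i + k]] >= min_copies:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(sequence)


if __name__ == "__main__":
    # Toy data with k=4 so the example stays readable; the slide uses 20-mers.
    toy_counts = count_kmers(["ACGTACGT", "ACGTTTTT"], k=4)
    for threshold in (2, 10):
        print(threshold, mdr_coverage("ACGTGG", toy_counts, threshold, k=4))
```

Raising the threshold can only shrink the covered region, which matches the falling coverage from the 2-copy row to the 1000-copy row of the table.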

18 Nucleotide Coverage of Repeats in 10,352 Annotated BACs (130,978 Contigs)

Repeat Type*                                       Total Nucleotides   Nucleotide Coverage
MIPS/REcat Class I Retroelements                   1,503,929,793       75.66%
MIPS/REcat Class II/III Transposable Elements      36,620,646          1.84%
MIPS/REcat Other                                   16,048,937          0.81%
All Repeats                                        1,553,118,769       78.13%

*Repetitive sequences were annotated and masked using RepeatMasker and the MIPS-REdat library.

Data generated as of September 2007
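
Nucleotide coverage in a table like this is the number of masked bases divided by the total number of bases. A minimal sketch follows, assuming the masked contigs are available as soft-masked strings in which repeat bases are lowercase. That input format and the name `masked_coverage` are assumptions for illustration, not a description of how the figures above were produced.

```python
def masked_coverage(sequences):
    """Return (masked_bases, total_bases, fraction) for soft-masked sequences.

    Assumption: lowercase letters mark bases that were masked as repeats.
    """
    total = sum(len(seq) for seq in sequences)
    masked = sum(1 for seq in sequences for base in seq if base.islower())
    fraction = masked / total if total else 0.0
    return masked, total, fraction


if __name__ == "__main__":
    masked, total, fraction = masked_coverage(["ACGTacgt", "ttAA"])
    print(f"{masked:,} of {total:,} bases masked ({fraction:.2%})")
```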

19 Outreach and Collaborations
- MaizeGDB
- EBI EnsEMBL
- Gramene
- Maize Array Working Group
- Maize Optical Map
- Transposon Annotation
- TWINSCAN
- Vmatch
- Student Annotation (Howard Hughes)

20 Objectives for Year 3
- Whole Genome Alignments for rice, maize and Arabidopsis
- Evidence based gene builds
- Gramene modified Ensembl pipeline and FGENESH++ in combiner mode
- BioMart for complex query generation
- SyntenyView based on whole genome alignment
- Transition from Gramene Biopipe -> Ensembl Exonerate pipeline to automate sequence alignments
- Annotation of non-coding RNA using tRNAScan and microRNA
- Gene Ontology using dbxref pipeline
- Incorporation in Gramene Compara builds; GeneTree view
- MySQL Database dumps
- Tutorials for website using Camtasia
- Submit paper on MDR analysis
Shiran Pasternak, Apurva Narechania, Linda McMahan, Joshua Stein

