The Trace Archive Steven Leonard PSG.

TraceArchive
  Background
  Contents
  Database and File Structure
  Web pages
  Future Work

Background
  Human ramp-up 2000
  Mouse Sequencing Consortium, 6th October 2000
  .. In fact, the incorporation of the whole genome shotgun sequencing component has led to adoption of a new, even more rapid data release policy whereby the actual raw data (that is, individual DNA sequence traces, about 500 bases long taken directly from the automated instruments) will be deposited regularly in newly-established public databases operated by the NCBI and EBI ..

Contents
  CENTRE   SPECIES     TRACE_TYPE     COUNT
  BCM      Human       shotgun       300247
  BCM      Mouse       WGS            27886
  BCM      Mouse       shotgun       368707
  BCM      Rat         WGS           962663
  BCM      Rat         shotgun       221646
  SC       Mouse       WGS          3195771
  SC       Zebrafish   WGS          1684255
  WIBR     Mouse       WGS          8746451
  WUGSC    Mouse       WGS          3374037

Tar file format
  TRACEINFO: tab delimited text or XML
  traces: SCF, ABI etc.
  MD5 checksums:
    ./traces/mI2C-a1174a02.p1c.scf.gz dd4875dd4201381232cfb95cde0cb6e3

<volume_date>2001-06-15</volume_date>
<volume_name>wugsc-mouse-wgs-992643442</volume_name>
<volume_version>0.2</volume_version>
<trace>
  <trace_name>jdx52e12.g1</trace_name>
  <trace_file>./traces/WUGSCarchive.010615.010754/jdx52e12.g1.scf.gz</trace_file>
  <center_name>WUGSC</center_name>
  <center_project>M_WGS013Z001</center_project>
  <chemistry_type>t</chemistry_type>
  <clip_quality_left>112</clip_quality_left>
  <clip_quality_right>551</clip_quality_right>
  <clip_vector_left>0</clip_vector_left>
  <iteration>1</iteration>
  <plate_id>jdx52</plate_id>
  <program_id>phred-980904.a</program_id>
  <run_date>2000-12-8</run_date>
  <run_machine_id>190</run_machine_id>
  <source_type>G</source_type>
  <species_code>mus musculus</species_code>
  <strategy>WGS</strategy>
  <submission_type>new</submission_type>
  <subspecies_id>C57BL/6J</subspecies_id>
  <svector_code>potw13</svector_code>
  <template_id>jdx52e12</template_id>
  <trace_direction>R</trace_direction>
  <trace_end>R</trace_end>
  <trace_format>scf</trace_format>
  <trace_type_code>WGS</trace_type_code>
  <well_id>e12</well_id>
</trace>
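As a rough illustration, an XML TRACEINFO fragment like the one above could be read with Python's standard library. This is only a sketch: the `<volume>` wrapper element and the function name `parse_traceinfo` are assumptions introduced here so that the fragment parses as a single document, and only a subset of the fields is repeated.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the TRACEINFO fragment from the slide.
# The <volume> root element is an assumption added for this sketch.
TRACEINFO_XML = """<volume>
  <volume_date>2001-06-15</volume_date>
  <volume_name>wugsc-mouse-wgs-992643442</volume_name>
  <volume_version>0.2</volume_version>
  <trace>
    <trace_name>jdx52e12.g1</trace_name>
    <trace_file>./traces/WUGSCarchive.010615.010754/jdx52e12.g1.scf.gz</trace_file>
    <center_name>WUGSC</center_name>
    <clip_quality_left>112</clip_quality_left>
    <clip_quality_right>551</clip_quality_right>
    <species_code>mus musculus</species_code>
    <trace_type_code>WGS</trace_type_code>
  </trace>
</volume>"""


def parse_traceinfo(xml_text):
    """Return (volume-level fields, list of per-trace field dicts)."""
    root = ET.fromstring(xml_text)
    # Everything that is not a <trace> is treated as a volume-level field.
    volume = {child.tag: child.text for child in root if child.tag != "trace"}
    traces = [
        {field.tag: field.text for field in trace}
        for trace in root.findall("trace")
    ]
    return volume, traces


volume, traces = parse_traceinfo(TRACEINFO_XML)
for trace in traces:
    print(trace["trace_name"], trace["clip_quality_left"], trace["clip_quality_right"])
```

All values come back as strings, so numeric fields such as the clip positions need an explicit `int()` before use.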

Text
  (header line of field names, then one line per trace; shown here one field per row)
  Field             Record 1            Record 2
  trace_name        19866873850757      19866873850773
  trace_format      scf                 scf
  trace_direction   F                   F
  trace_end         F                   F
  center_name       CRA                 CRA
  center_project    RatBN               RatBN
  seq_lib_id        RatBN2.5.2L         RatBN2.5.2L
  species_code      Rattus norvegicus   Rattus norvegicus
  source_type       G                   G
  strategy          WGS                 WGS
  trace_type_code   shotgun             shotgun
  submission_type   new                 new
  clone_id          19600430726162      19600430726163
  template_id       19667033328873      19667033328874
  run_machine_id    S100000035          S100000035
  plate_id          NU02001XBI          NU02001XBI
  well_id           A01                 A02
  run_lane          001                 017
  insert_size       3000                3000
  insert_stdev      0                   0
  primer_code       M13 Forward         M13 Forward
  svector_code      pUC194C             pUC194C
  trace_file (record 1)   NU02001XBI#0984066701_A01_001_00000019866873850757.pro.scf
  trace_file (record 2)   NU02001XBI#0984066701_A02_017_00000019866873850773.pro.scf
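The text form of TRACEINFO can be read in much the same way. The sketch below assumes the file is tab delimited with a single header line of field names, and rebuilds the first record from the slide in memory; the function name `read_traceinfo_text` is hypothetical.

```python
import csv
import io

FIELDS = [
    "trace_name", "trace_format", "trace_direction", "trace_end",
    "center_name", "center_project", "seq_lib_id", "species_code",
    "source_type", "strategy", "trace_type_code", "submission_type",
    "clone_id", "template_id", "run_machine_id", "plate_id", "well_id",
    "run_lane", "insert_size", "insert_stdev", "primer_code",
    "svector_code", "trace_file",
]

# First record from the slide, one value per field.
ROW = [
    "19866873850757", "scf", "F", "F", "CRA", "RatBN", "RatBN2.5.2L",
    "Rattus norvegicus", "G", "WGS", "shotgun", "new", "19600430726162",
    "19667033328873", "S100000035", "NU02001XBI", "A01", "001", "3000",
    "0", "M13 Forward", "pUC194C",
    "NU02001XBI#0984066701_A01_001_00000019866873850757.pro.scf",
]

SAMPLE = "\t".join(FIELDS) + "\n" + "\t".join(ROW) + "\n"


def read_traceinfo_text(handle):
    """Read tab-delimited TRACEINFO into a list of dicts keyed by field name."""
    return list(csv.DictReader(handle, delimiter="\t"))


records = read_traceinfo_text(io.StringIO(SAMPLE))
print(records[0]["trace_name"], records[0]["species_code"])
```

Splitting on tabs rather than whitespace matters here, because values such as `Rattus norvegicus` and `M13 Forward` contain spaces.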

Trace Format
  gzipped SCF v3.0
  convert_trace input output (Staden iolib-1.8.7, James Bonfield)
    input: SCF, ABI, etc. (CTF, ZTR)
    output: gzipped SCF v3.0 (CTF, ZTR and EXP)

Database
  only TRACEINFO
    organism, centre
    clip L/R
    trace file location
  Index tar file with index_tar
    tarfile name, trace name, offset
  Extract files using convert_tar (Staden iolib-1.8.7)
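The idea of indexing a tar file by tarfile name, trace name and offset can be sketched with Python's `tarfile` module. This is not the Staden `index_tar` tool, whose output format is not shown on the slide; `build_tar_index` and `read_member` are hypothetical helpers, and the tarball is assumed to be uncompressed so that byte offsets are usable directly.

```python
import io
import tarfile


def build_tar_index(fileobj, tarfile_name):
    """Map each member name to (tarfile name, data offset, size in bytes)."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (tarfile_name, member.offset_data, member.size)
    return index


def read_member(fileobj, offset, size):
    """Fetch one member's bytes by seeking, without scanning the archive."""
    fileobj.seek(offset)
    return fileobj.read(size)


# Build a small in-memory tarball so the sketch is self-contained.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [
        ("traces/a.scf.gz", b"first trace"),
        ("traces/b.scf.gz", b"second trace"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buffer.seek(0)
index = build_tar_index(buffer, "demo.tar")
_, offset, size = index["traces/b.scf.gz"]
data = read_member(buffer, offset, size)
print(data)
```

With the index held in a database, a single trace can be served from a large tarball with one seek and one read.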

File Structure
  Tarballs < 1.5 Gbytes
  gzipped SCF v3.0
  early tarballs 1 to 1
    single job to extract and re-make tarball
  now have two jobs
    extract and collect

1. Extract
  consistency check of TRACEINFO, MD5 and traces
  for each trace
    extract trace
    check MD5
    convert to gzipped SCF v3.0
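A minimal sketch of the consistency check in the extract step follows. It assumes the MD5 file holds one `path digest` pair per line, as in the example checksum line on the tar file format slide, and that trace contents are available as bytes; `parse_md5_manifest` and `check_extract` are hypothetical names.

```python
import hashlib


def parse_md5_manifest(text):
    """Parse lines of '<path> <md5 hex digest>' into {path: digest}."""
    manifest = {}
    for line in text.splitlines():
        if line.strip():
            path, digest = line.rsplit(None, 1)
            manifest[path] = digest.lower()
    return manifest


def check_extract(traceinfo_files, manifest, trace_data):
    """Return (path, problem) pairs; an empty list means all three sources agree."""
    listed, summed, present = set(traceinfo_files), set(manifest), set(trace_data)
    problems = []
    for path in sorted(listed | summed | present):
        if path not in listed:
            problems.append((path, "not in TRACEINFO"))
        elif path not in summed:
            problems.append((path, "no MD5 entry"))
        elif path not in present:
            problems.append((path, "trace file missing"))
        elif hashlib.md5(trace_data[path]).hexdigest() != manifest[path]:
            problems.append((path, "MD5 mismatch"))
    return problems


# Illustrative data only: the payload is a placeholder, not a real trace.
payload = b"placeholder trace bytes"
path = "./traces/example.scf.gz"
manifest = parse_md5_manifest(f"{path} {hashlib.md5(payload).hexdigest()}\n")

ok = check_extract([path], manifest, {path: payload})
bad = check_extract([path], manifest, {path: payload + b"corrupted"})
print(ok, bad)
```

Reporting every disagreement, rather than stopping at the first, makes it easier to decide whether a tarball needs to be requested again or only a few traces skipped.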

2. Collect
  given a set of directories
  calculate size of SCF files
  gather trace info
    verify/add defaults
    TRACEINFO files in original tarballs
    Experiment files
  gather into tarballs
    account for duplicate traces
    calculate MD5 checksums
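The "gather into tarballs" part of the collect step can be sketched as a simple grouping pass. The assumptions are stated plainly: the 1.5 Gbyte limit comes from the File Structure slide, tar header overhead is ignored, and a duplicate trace name is resolved by keeping the first occurrence, which may not be how the real pipeline handled duplicates. `plan_tarballs` is a hypothetical name.

```python
LIMIT_BYTES = int(1.5 * 1024 ** 3)  # tarballs are kept under 1.5 Gbytes


def plan_tarballs(traces, limit_bytes=LIMIT_BYTES):
    """Group (trace_name, size) pairs into batches whose total stays under the limit.

    Duplicate trace names are skipped after their first occurrence
    (an assumption made for this sketch).
    """
    seen = set()
    batches, current, current_size = [], [], 0
    for name, size in traces:
        if name in seen:
            continue
        seen.add(name)
        # Start a new tarball when adding this trace would reach the limit.
        if current and current_size + size >= limit_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Small limit so the grouping is easy to follow by hand.
demo = [("a", 60), ("b", 50), ("a", 60), ("c", 30)]
batches = plan_tarballs(demo, limit_bytes=100)
print(batches)
```

With a limit of 100, trace `a` fills the first batch, the repeated `a` is dropped, and `b` and `c` share the second batch.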

3. Insert
  given a set of tarballs
  extract TRACEINFO and parse XML
  index the tarball
  for each trace
    check for pre-existing traces
    get dictionary IDs
    populate trace_info/ancillary tables

Finally
  generate Fasta/Quality files
  generate Clip files
  update FTP site
  re-build SSAHA hash tables

Given a list of traces
  generate a single fasta file
  Extend this to generate a tarfile
    SCF
    FASTA
    QUAL
    TRACEINFO text or XML
  Need to restrict size of tarballs and cache previous results

Clipping Info
  Pass ml2C-b205c06.p1c 22 571 274 276 550
  Fail ml2C-b205c07.p1c Contam CVEC:pBACe3.6
  Fail ml2C-b111e07.p1c Qual

Where the above is "Pass|Fail", readname, start, end, GC, AT, (end-start+1), so the start and stop positions are inclusive and base counting starts at 1.

I define the starting quality clip point as the start of the first 20bp window where the integrated error rate within the window drops below 1.00 and the ending quality clip point as the end of the last 20bp window where the integrated error rate rises above 1.00.
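One possible reading of the quality clip rule is sketched below. Several things are assumptions rather than details from the slide: base qualities are taken to be phred-style values q with error probability 10^(-q/10), the "integrated error rate" of a window is taken to be the sum of those probabilities, and the end point is taken as the end of the last window whose sum is still below the threshold. The function name `quality_clip` is hypothetical.

```python
def quality_clip(quals, window=20, max_error=1.0):
    """Return (start, end, length) using 1-based inclusive positions, or None.

    A window is 'good' when the summed per-base error probability is
    below max_error. None means no window passed, which would correspond
    to a quality failure.
    """
    errors = [10 ** (-q / 10) for q in quals]
    good_starts = [
        start
        for start in range(len(errors) - window + 1)
        if sum(errors[start:start + window]) < max_error
    ]
    if not good_starts:
        return None
    clip_start = good_starts[0] + 1        # first base of the first good window
    clip_end = good_starts[-1] + window    # last base of the last good window
    return clip_start, clip_end, clip_end - clip_start + 1


# Synthetic read: poor quality at both ends, good quality in the middle.
quals = [5] * 30 + [40] * 100 + [5] * 30
clip = quality_clip(quals)
print(clip)
```

The third value matches the (end-start+1) column of the clip file, so a returned clip of (22, 571, 550) would reproduce the coordinates of the "Pass" line above.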

Problems
  Duplicate trace names
    1,000,000+ Whitehead
    <10,000 WUGSC/BCM
    None Sanger (5,000,000)
  Bad tarballs/tapes
  FTP errors
  Back it up
    lost 150 Gbytes, 3,000,000+ traces

Synchronise with NCBI
          Traces       Traces/Gbyte   Gbytes/million traces
  NCBI    23 million   20,000         50
  SC      19 million   30,000         33
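The Gbytes-per-million figures follow directly from the packing densities, and the same arithmetic gives a rough total footprint for each site. The totals below are derived here from the slide's numbers and are not stated in the talk; the function names are hypothetical.

```python
def gbytes_per_million(traces_per_gbyte):
    """Storage needed for one million traces at a given packing density."""
    return 1_000_000 / traces_per_gbyte


def total_gbytes(n_traces, traces_per_gbyte):
    """Approximate storage for n_traces at a given packing density."""
    return n_traces / traces_per_gbyte


for site, n_traces, density in [
    ("NCBI", 23_000_000, 20_000),
    ("SC", 19_000_000, 30_000),
]:
    print(
        f"{site}: {gbytes_per_million(density):.0f} Gbytes/million traces, "
        f"about {total_gbytes(n_traces, density):.0f} Gbytes in total"
    )
```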

Acknowledgements
  Sanger
    Richard Durbin, Jim Mullikin
    Andy Smith, Simon Mercer
    Tony Cox + Web team
    James Cuff + SSG
    Santhi Sivadasan, Martin Widlake
  NCBI
    Vladimir Alekseyev, Eugene Yaschenko, Deanna Church

Future Work
  Other organisms
    C. briggsae
    Dicty
    Tetraodon
    Xenopus
  ESTs
    submit to EMBL from the archive
  True Trace Server
    Asp
    Gap4, etc.

==> Netscape
  URL http://trace.ensembl.org/
  search on trace_name
  trace/list of traces
  add quality values
  add trace (need Java)

==> Netscape
  SSAHA Server
  14 Gbytes to build, 4-5 Gbytes to run
  preload file in browser for search
  passed reads
  organism specific
  modified headers

==> Netscape
  FTP site
  fasta/qual/traceinfo
  fixed clip
  updated
  Duplicate/Updated

==> Netscape
  Chromosome 1