The Trace Archive Steven Leonard PSG
TraceArchive
  Background
  Contents
  Database and File Structure
  Web pages
  Future Work
Background
  Human ramp-up, 2000
  Mouse Sequencing Consortium, 6th October 2000:
  ... In fact, the incorporation of the whole genome shotgun sequencing component has led to adoption of a new, even more rapid data release policy whereby the actual raw data (that is, individual DNA sequence traces, about 500 bases long, taken directly from the automated instruments) will be deposited regularly in newly-established public databases operated by the NCBI and EBI ...
Contents
  CENTRE  SPECIES    TRACE_TYPE    COUNT
  BCM     Human      shotgun      300247
  BCM     Mouse      WGS           27886
  BCM     Mouse      shotgun      368707
  BCM     Rat        WGS          962663
  BCM     Rat        shotgun      221646
  SC      Mouse      WGS         3195771
  SC      Zebrafish  WGS         1684255
  WIBR    Mouse      WGS         8746451
  WUGSC   Mouse      WGS         3374037
Tar file format
  TRACEINFO      tab delimited text or XML
  traces         SCF, ABI etc.
  MD5 checksums  ./traces/mI2C-a1174a02.p1c.scf.gz dd4875dd4201381232cfb95cde0cb6e3
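The checksum entry above pairs a trace path with an MD5 digest, so a receiving site can verify each file after unpacking. A minimal sketch, assuming one "path digest" pair per line separated by whitespace; the function names are hypothetical and not part of the archive's own tools:

```python
import hashlib
from pathlib import Path


def parse_md5_manifest(text):
    """Map each trace path to its expected MD5 hex digest."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<path> <digest>", split on the last run of whitespace.
        path, digest = line.rsplit(None, 1)
        expected[path] = digest.lower()
    return expected


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_mismatches(manifest_text, root="."):
    """Return manifest paths that are missing or whose checksum differs."""
    bad = []
    for rel_path, digest in parse_md5_manifest(manifest_text).items():
        target = Path(root) / rel_path
        if not target.is_file() or md5_of_file(target) != digest:
            bad.append(rel_path)
    return bad
```

Calling `find_mismatches(manifest_text, root=unpacked_dir)` on an unpacked tarball returns an empty list when every trace matches its checksum.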
<volume_name>wugsc-mouse-wgs-992643442</volume_name>
<volume_date>2001-06-15</volume_date>
<volume_version>0.2</volume_version>
<trace>
  <trace_name>jdx52e12.g1</trace_name>
  <trace_file>./traces/WUGSCarchive.010615.010754/jdx52e12.g1.scf.gz</trace_file>
  <center_name>WUGSC</center_name>
  <center_project>M_WGS013Z001</center_project>
  <chemistry_type>t</chemistry_type>
  <clip_quality_left>112</clip_quality_left>
  <clip_quality_right>551</clip_quality_right>
  <clip_vector_left>0</clip_vector_left>
  <iteration>1</iteration>
  <plate_id>jdx52</plate_id>
  <program_id>phred-980904.a</program_id>
  <run_date>2000-12-8</run_date>
  <run_machine_id>190</run_machine_id>
  <source_type>G</source_type>
  <species_code>mus musculus</species_code>
  <strategy>WGS</strategy>
  <submission_type>new</submission_type>
  <subspecies_id>C57BL/6J</subspecies_id>
  <svector_code>potw13</svector_code>
  <template_id>jdx52e12</template_id>
  <trace_direction>R</trace_direction>
  <trace_end>R</trace_end>
  <trace_format>scf</trace_format>
  <trace_type_code>WGS</trace_type_code>
  <well_id>e12</well_id>
</trace>
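A TRACEINFO fragment like the one above can be read with a standard XML parser. A minimal sketch, assuming the volume fields and `<trace>` records arrive without a single enclosing root element (as on the slide), so the code wraps them in a hypothetical `<volume>` element first; a real file with an XML declaration or its own root would need different handling:

```python
import xml.etree.ElementTree as ET


def parse_traceinfo_xml(fragment):
    """Return (volume_fields, list_of_trace_dicts) from a TRACEINFO fragment."""
    # Assumption: the fragment has no root element, so add a hypothetical one.
    root = ET.fromstring("<volume>" + fragment + "</volume>")
    volume = {child.tag: (child.text or "").strip()
              for child in root if child.tag != "trace"}
    traces = [{field.tag: (field.text or "").strip() for field in trace}
              for trace in root.findall("trace")]
    return volume, traces


if __name__ == "__main__":
    sample = (
        "<volume_name>wugsc-mouse-wgs-992643442</volume_name>"
        "<trace><trace_name>jdx52e12.g1</trace_name>"
        "<clip_quality_left>112</clip_quality_left>"
        "<clip_quality_right>551</clip_quality_right></trace>"
    )
    volume, traces = parse_traceinfo_xml(sample)
    print(volume["volume_name"], traces[0]["trace_name"])
```

All values are kept as strings; converting clip positions to integers is left to the caller.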
Text (tab delimited; shown here one field per row)
  FIELD            RECORD 1          RECORD 2
  trace_name       19866873850757    19866873850773
  trace_format     scf               scf
  trace_direction  F                 F
  trace_end        F                 F
  center_name      CRA               CRA
  center_project   RatBN             RatBN
  seq_lib_id       RatBN2.5.2L       RatBN2.5.2L
  species_code     Rattus norvegicus Rattus norvegicus
  source_type      G                 G
  strategy         WGS               WGS
  trace_type_code  shotgun           shotgun
  submission_type  new               new
  clone_id         19600430726162    19600430726163
  template_id      19667033328873    19667033328874
  run_machine_id   S100000035        S100000035
  plate_id         NU02001XBI        NU02001XBI
  well_id          A01               A02
  run_lane         001               017
  insert_size      3000              3000
  insert_stdev     0                 0
  primer_code      M13 Forward       M13 Forward
  svector_code     pUC194C           pUC194C
  trace_file (record 1)  NU02001XBI#0984066701_A01_001_00000019866873850757.pro.scf
  trace_file (record 2)  NU02001XBI#0984066701_A02_017_00000019866873850773.pro.scf
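The text form of TRACEINFO is a header row of field names followed by one tab-delimited row per trace. A minimal sketch of reading it with Python's standard `csv` module; the function name is hypothetical, and the sketch assumes the first line is always the header:

```python
import csv
import io


def read_traceinfo_text(text):
    """Parse tab-delimited TRACEINFO text into one dict per trace."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


if __name__ == "__main__":
    sample = (
        "trace_name\ttrace_format\tspecies_code\tprimer_code\n"
        "19866873850757\tscf\tRattus norvegicus\tM13 Forward\n"
    )
    for record in read_traceinfo_text(sample):
        print(record["trace_name"], record["species_code"])
```

Splitting on tabs rather than on whitespace matters here because values such as `Rattus norvegicus` and `M13 Forward` contain spaces.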
Trace Format
  gzipped SCF v3.0
  convert_trace input output
    Staden iolib-1.8.7 (James Bonfield)
    input: SCF, ABI, etc. (CTF, ZTR)
    output: gzipped SCF v3.0 (CTF, ZTR and EXP)
Database
  only TRACEINFO: organism, centre, clip L/R, trace file location
  Index tar file with index_tar: tarfile name, trace name, offset
  Extract files using convert_tar (Staden iolib-1.8.7)
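The idea of indexing a tarball by tarfile name, trace name and offset can be illustrated with Python's standard `tarfile` module. This is a sketch of the concept only, not the behaviour of `index_tar` or `convert_tar`; it assumes an uncompressed tar file and uses the basename of each member as the trace name:

```python
import os
import tarfile


def index_tarball(tar_path):
    """Return (tarfile name, trace name, data offset, size) per regular file."""
    entries = []
    with tarfile.open(tar_path, "r:") as tar:  # "r:" reads uncompressed tar only
        for member in tar:
            if member.isfile():
                entries.append((os.path.basename(tar_path),
                                os.path.basename(member.name),
                                member.offset_data,  # where the file's bytes start
                                member.size))
    return entries


def read_at(tar_path, offset, size):
    """Fetch one member's bytes directly, without scanning the tarball."""
    with open(tar_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)
```

With the offset and size stored in a database, a single trace can be pulled out with one seek rather than a sequential read of a tarball that may be over a gigabyte.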
File Structure
  Tarballs < 1.5 Gbytes
  gzipped SCF v3.0
  early tarballs 1 to 1: single job to extract and re-make tarball
  now have two jobs: extract and collect
1. Extract
  consistency check: TRACEINFO, MD5 and traces
  for each trace:
    extract trace
    check MD5
    convert to gzipped SCF v3.0
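The consistency check in the extract step can be pictured as a comparison of three lists of trace paths: those named in TRACEINFO, those with an MD5 entry, and those actually present in the tarball. A minimal sketch under that assumption; the function and the report keys are hypothetical:

```python
import posixpath


def consistency_report(traceinfo_files, md5_files, tar_members):
    """Compare the three views of a tarball by normalised trace file path."""
    def norm(paths):
        # "./traces/x" and "traces/x" should count as the same file.
        return {posixpath.normpath(p) for p in paths}

    info, md5, members = norm(traceinfo_files), norm(md5_files), norm(tar_members)
    return {
        "no_checksum": sorted(info - md5),          # in TRACEINFO, no MD5 entry
        "no_trace_file": sorted(info - members),    # in TRACEINFO, file absent
        "not_in_traceinfo": sorted(members - info)  # file present, undescribed
    }
```

A tarball passes when all three lists in the report are empty; only then is it worth extracting and converting each trace.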
2. Collect
  given a set of directories
  calc. size of SCF files
  gather trace info
    verify/add defaults
    TRACEINFO files in original tarballs
    Experiment files
  gather into tarballs
    account for duplicate traces
    calc. MD5 checksums
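The collect step groups converted traces into tarballs under the size limit while accounting for duplicates. A minimal sketch of one way to plan the grouping, assuming "< 1.5 Gbytes" means 1.5 x 1024^3 bytes and that a repeated trace name is simply skipped; how the archive actually resolves duplicates is not stated on the slides:

```python
MAX_TARBALL_BYTES = int(1.5 * 1024 ** 3)  # assumed reading of "< 1.5 Gbytes"


def plan_tarballs(traces, limit=MAX_TARBALL_BYTES):
    """Group (trace_name, size_in_bytes) pairs into batches below `limit`.

    Traces are taken in the order given. A repeated trace name is skipped
    (an assumption), so each trace lands in exactly one tarball. A single
    trace larger than `limit` still gets a batch of its own.
    """
    batches, current, current_size, seen = [], [], 0, set()
    for name, size in traces:
        if name in seen:
            continue
        seen.add(name)
        if current and current_size + size >= limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

The sizes passed in would be those of the gzipped SCF files calculated earlier in the step; tar header overhead is ignored here.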
3. Insert
  given a set of tarballs
  extract TRACEINFO and parse XML
  index the tarball
  for each trace:
    check for pre-existing traces
    get dictionary IDs
    populate trace_info/ancillary tables
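The per-trace part of the insert step (check for an existing trace, look up dictionary IDs, populate the table) can be sketched with an in-memory SQLite database. SQLite stands in for whatever database the archive uses; apart from the `trace_info` table name, the schema, column names and function names are hypothetical:

```python
import sqlite3


def create_schema(db):
    """Create a toy schema: one dictionary table and one trace_info table."""
    db.executescript("""
        CREATE TABLE dictionary (id INTEGER PRIMARY KEY, kind TEXT, value TEXT,
                                 UNIQUE (kind, value));
        CREATE TABLE trace_info (trace_name TEXT PRIMARY KEY, centre_id INTEGER,
                                 species_id INTEGER, tarfile TEXT,
                                 tar_offset INTEGER);
    """)


def dictionary_id(db, kind, value):
    """Return the id for (kind, value), creating the entry if needed."""
    db.execute("INSERT OR IGNORE INTO dictionary (kind, value) VALUES (?, ?)",
               (kind, value))
    row = db.execute("SELECT id FROM dictionary WHERE kind = ? AND value = ?",
                     (kind, value)).fetchone()
    return row[0]


def insert_trace(db, trace, tarfile_name, offset):
    """Insert one trace; return False if the trace name already exists."""
    exists = db.execute("SELECT 1 FROM trace_info WHERE trace_name = ?",
                        (trace["trace_name"],)).fetchone()
    if exists:
        return False
    db.execute("INSERT INTO trace_info VALUES (?, ?, ?, ?, ?)",
               (trace["trace_name"],
                dictionary_id(db, "centre", trace["center_name"]),
                dictionary_id(db, "species", trace["species_code"]),
                tarfile_name, offset))
    return True
```

Storing repeated strings such as centre and species as small integer IDs keeps the per-trace rows compact, which matters at tens of millions of traces.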
Finally
  generate Fasta/Quality files
  generate Clip files
  update FTP site
  re-build SSAHA hash tables
Given a list of traces
  generate a single FASTA file
  extend this to generate a tarfile: SCF, FASTA, QUAL, TRACEINFO (text or XML)
  need to restrict size of tarballs and cache previous results
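Generating a single FASTA file, with a matching quality file, from a list of traces might look like the sketch below. It assumes the base calls and quality values have already been extracted from each SCF file, which is not shown; the function names and line widths are hypothetical:

```python
def fasta_record(name, sequence, width=60):
    """Format one FASTA record with the sequence wrapped at `width` bases."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def qual_record(name, qualities, per_line=20):
    """Format one quality record as space-separated integers."""
    lines = [">" + name]
    for i in range(0, len(qualities), per_line):
        lines.append(" ".join(str(q) for q in qualities[i:i + per_line]))
    return "\n".join(lines) + "\n"


def write_fasta_and_qual(reads, fasta_handle, qual_handle):
    """Write (name, sequence, qualities) tuples to paired FASTA/QUAL files."""
    for name, sequence, qualities in reads:
        if len(sequence) != len(qualities):
            raise ValueError("sequence/quality length mismatch for " + name)
        fasta_handle.write(fasta_record(name, sequence))
        qual_handle.write(qual_record(name, qualities))
```

Both files list the reads in the same order under the same names, so downstream tools can pair them line by line.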
Clipping Info
  Pass ml2C-b205c06.p1c 22 571 274 276 550
  Fail ml2C-b205c07.p1c Contam CVEC:pBACe3.6
  Fail ml2C-b111e07.p1c Qual
  The fields above are "Pass|Fail", readname, start, end, GC, AT, (end-start+1), so the start and stop positions are inclusive and base counting starts at 1.
  I define the starting quality clip point as the start of the first 20 bp window where the integrated error rate within the window drops below 1.00 and the ending quality clip point as the end of the last 20 bp window where the integrated error rate rises above 1.00.
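The clip record layout and the window rule above can be sketched as follows. Two assumptions are made: the "integrated error rate" is taken to be the sum of per-base error probabilities 10^(-Q/10) over the window, with phred-scaled qualities, and the end point is taken as the last base of the last window whose sum is still below 1.00. The function names are hypothetical.

```python
def parse_clip_line(line):
    """Parse one clip record into a dict; Fail records carry a reason."""
    fields = line.split()
    status, readname = fields[0], fields[1]
    if status == "Pass":
        start, end, gc, at, length = (int(x) for x in fields[2:7])
        return {"status": status, "readname": readname, "start": start,
                "end": end, "gc": gc, "at": at, "length": length}
    return {"status": status, "readname": readname,
            "reason": " ".join(fields[2:])}


def window_errors(quals, window=20):
    """Expected errors per window, assuming phred-scaled quality values."""
    probs = [10 ** (-q / 10) for q in quals]
    return [sum(probs[i:i + window]) for i in range(len(probs) - window + 1)]


def quality_clip(quals, window=20, max_errors=1.0):
    """Return 1-based inclusive (start, end), or None if no window passes."""
    good = [i for i, err in enumerate(window_errors(quals, window))
            if err < max_errors]
    if not good:
        return None
    start = good[0] + 1       # first base of the first passing window
    end = good[-1] + window   # last base of the last passing window
    return start, end
```

For the Pass record on the slide, `parse_clip_line` gives start 22 and end 571, and the length field 550 equals end - start + 1, which also matches GC + AT (274 + 276).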
Problems
  Duplicate trace names
    1,000,000+ Whitehead
    <10,000 WUGSC/BCM
    None Sanger (5,000,000)
  Bad tarballs/tapes
  FTP errors
  Back it up: lost 150 Gbytes, 3,000,000+ traces
Synchronise with NCBI
  NCBI  23 million traces  20,000 traces/Gbyte  50 Gbytes/million traces
  SC    19 million traces  30,000 traces/Gbyte  33 Gbytes/million traces
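The two storage figures on this slide are consistent with each other: Gbytes per million traces is just one million divided by traces per Gbyte. A small check of that arithmetic, with a hypothetical function name:

```python
def gbytes_per_million(traces_per_gbyte):
    """Storage needed for one million traces at a given packing density."""
    return 1_000_000 / traces_per_gbyte


if __name__ == "__main__":
    for density in (20_000, 30_000):
        print(f"{density} traces/Gbyte -> "
              f"{gbytes_per_million(density):.0f} Gbytes/million traces")
```

At 20,000 traces/Gbyte this gives 50 Gbytes per million traces, and at 30,000 traces/Gbyte about 33, matching the slide.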
Acknowledgements
  Sanger
    Richard Durbin, Jim Mullikin
    Andy Smith, Simon Mercer
    Tony Cox + Web team
    James Cuff + SSG
    Santhi Sivadasan, Martin Widlake
  NCBI
    Vladimir Alekseyev, Eugene Yaschenko, Deanna Church
Future Work
  Other organisms: C. briggsae, Dicty, Tetraodon, Xenopus
  ESTs: submit to EMBL from the archive
  True Trace Server: Asp, Gap4, etc.
==> Netscape
  URL http://trace.ensembl.org/
  search on trace_name
  trace/list of traces
  add quality values
  add trace (need Java)
==> Netscape
  SSAHA Server
  14 Gbytes to build, 4-5 Gbytes to run
  preload file in browser for search
  passed reads
  organism specific
  modified headers
==> Netscape FTP site fasta/qual/traceinfo fixed clip updated Duplicate/Updated
==> NETSCAPE Chromosome 1