The Trace Archive Steven Leonard PSG.


1 The Trace Archive Steven Leonard PSG

2 TraceArchive
Background
Contents
Database and File Structure
Web pages
Future Work

3 Background
Human ramp-up 2000
Mouse Sequencing Consortium, 6th October 2000
".. In fact, the incorporation of the whole genome shotgun sequencing component has led to adoption of a new, even more rapid data release policy whereby the actual raw data (that is, individual DNA sequence traces, about 500 bases long taken directly from the automated instruments) will be deposited regularly in a newly-established public databases operated by the NCBI and EBI .."

4 Contents
CENTRE  SPECIES    TRACE_TYPE  COUNT
BCM     Human      shotgun     300247
BCM     Mouse      WGS
BCM     Mouse      shotgun
BCM     Rat        WGS
BCM     Rat        shotgun
SC      Mouse      WGS
SC      Zebrafish  WGS
WIBR    Mouse      WGS
WUGSC   Mouse      WGS

5 Tar file format
TRACEINFO
  tab delimited text or XML
traces
  SCF, ABI etc.
MD5 checksums
  ./traces/mI2C-a1174a02.p1c.scf.gz dd4875dd cfb95cde0cb6e3
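The MD5 check implied by this tarball layout can be sketched as follows. This is a minimal illustration, not the archive's own tooling: it assumes the checksum list has one "path checksum" pair per line, and the function names `md5_of_file`, `parse_manifest` and `find_problems` are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse 'path checksum' lines into a dict (this layout is an assumption)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            expected[parts[0]] = parts[1].lower()
    return expected


def find_problems(expected, root="."):
    """Return {path: 'missing' | 'mismatch'} for files that fail the check."""
    problems = {}
    for rel_path, checksum in expected.items():
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            problems[rel_path] = "missing"
        elif md5_of_file(full_path) != checksum:
            problems[rel_path] = "mismatch"
    return problems
```

Calling `find_problems(parse_manifest(text), root)` on an unpacked tarball would return an empty dict when every listed trace is present and intact.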

6
<volume_name>wugsc-mouse-wgs </volume_name>
<volume_date>2001-06-15</volume_date>
<volume_version>0.2</volume_version>
<trace>
  <trace_name>jdx52e12.g1</trace_name>
  <trace_file>./traces/WUGSCarchive /jdx52e12.g1.scf.gz</trace_file>
  <center_name>WUGSC</center_name>
  <center_project>M_WGS013Z001</center_project>
  <chemistry_type>t</chemistry_type>
  <clip_quality_left>112</clip_quality_left>
  <clip_quality_right>551</clip_quality_right>
  <clip_vector_left>0</clip_vector_left>
  <iteration>1</iteration>
  <plate_id>jdx52</plate_id>
  <program_id>phred a</program_id>
  <run_date> </run_date>
  <run_machine_id>190</run_machine_id>
  <source_type>G</source_type>
  <species_code>mus musculus</species_code>
  <strategy>WGS</strategy>
  <submission_type>new</submission_type>
  <subspecies_id>C57BL/6J</subspecies_id>
  <svector_code>potw13</svector_code>
  <template_id>jdx52e12</template_id>
  <trace_direction>R</trace_direction>
  <trace_end>R</trace_end>
  <trace_format>scf</trace_format>
  <trace_type_code>WGS</trace_type_code>
  <well_id>e12</well_id>
</trace>
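As a rough illustration of reading one such record, the sketch below parses a single `<trace>` element with Python's standard library. The sample uses a subset of the fields from the slide; the function name `trace_fields` is hypothetical, and treating the two quality clip values as inclusive positions is an assumption.

```python
import xml.etree.ElementTree as ET

# A subset of the fields shown on the slide, for one trace record.
SAMPLE = """<trace>
  <trace_name>jdx52e12.g1</trace_name>
  <center_name>WUGSC</center_name>
  <clip_quality_left>112</clip_quality_left>
  <clip_quality_right>551</clip_quality_right>
  <species_code>mus musculus</species_code>
  <trace_format>scf</trace_format>
</trace>"""


def trace_fields(trace_xml):
    """Return {tag: text} for the children of a single <trace> element."""
    element = ET.fromstring(trace_xml)
    return {child.tag: (child.text or "").strip() for child in element}


record = trace_fields(SAMPLE)
# Assumption: both clip positions are inclusive, so the length adds 1.
clipped_length = int(record["clip_quality_right"]) - int(record["clip_quality_left"]) + 1
print(record["trace_name"], clipped_length)
```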

7 Text
Column headers:
trace_name trace_format trace_direction trace_end center_name center_project seq_lib_id species_code source_type strategy trace_type_code submission_type clone_id template_id run_machine_id plate_id well_id run_lane insert_size insert_stdev primer_code svector_code trace_file
Records:
scf F F CRA RatBN RatBN2.5.2L Rattus norvegicus G WGS shotgun new S NU02001XBI A M13 Forward pUC194C NU02001XBI# _A01_001_ pro.scf
scf F F CRA RatBN RatBN2.5.2L Rattus norvegicus G WGS shotgun new S NU02001XBI A M13 Forward pUC194C NU02001XBI# _A02_017_ pro.scf
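A tab-delimited TRACEINFO file of this kind could be read along the following lines. This is only a sketch: it uses a handful of the column names from the slide, the `trace_name` value is a made-up placeholder, and `read_traceinfo` is a hypothetical helper.

```python
import csv
import io

# A few of the column names from the slide; the full header has more fields.
HEADER = ["trace_name", "trace_format", "trace_direction",
          "trace_end", "center_name", "species_code"]

# "example_trace_1" is a placeholder, not a real trace name.
ROW = ["example_trace_1", "scf", "F", "F", "CRA", "Rattus norvegicus"]

SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def read_traceinfo(text):
    """Return one dict per record, keyed by the header line's column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


records = read_traceinfo(SAMPLE)
print(records[0]["species_code"])
```

Splitting on tabs rather than whitespace matters here because values such as the species name contain spaces.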

8 Trace Format
gzipped SCF v3.0
convert_trace input output
  Staden iolib-1.8.7, James Bonfield
  input: SCF, ABI, etc. (CTF, ZTR)
  output: gzipped SCF v3.0 (CTF, ZTR and EXP)

9 Database
only TRACEINFO
  organism, centre
  clip L/R
  trace file location
Index tar file with index_tar
  tarfile name, trace name, offset
Extract files using convert_tar
  Staden iolib-1.8.7
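The idea of indexing a tarball by (tarfile name, trace name, offset) can be illustrated with Python's `tarfile` module. This is not the Staden `index_tar` or `convert_tar` program, whose options and index format are not shown in the slides; the function names are hypothetical and the sketch assumes an uncompressed tar file.

```python
import tarfile


def index_tarball(tar_path):
    """Return {member name: (tar path, data offset, size)} for regular files."""
    index = {}
    # Mode "r:" accepts uncompressed tar files only, so offsets are byte
    # positions in the file on disk.
    with tarfile.open(tar_path, "r:") as archive:
        for member in archive:
            if member.isfile():
                index[member.name] = (tar_path, member.offset_data, member.size)
    return index


def read_member(index, name):
    """Fetch one member's bytes by seeking straight to its stored offset."""
    tar_path, offset, size = index[name]
    with open(tar_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)
```

With the index stored in a database, a single trace can be served without scanning or unpacking the whole tarball.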

10

11 File Structure
Tar balls < 1.5 Gbytes
gzipped SCF v3.0
early tarballs 1 to 1
  single job to extract and re-make tarball
now have two jobs
  extract and collect

12 1. Extract
consistency check
  TRACEINFO, MD5 and traces
for each trace
  extract trace
  check MD5
  convert to gzipped SCF v3.0

13 2. Collect
given a set of directories
Calc. size of SCF files
gathers trace info
  verify/add defaults
  TRACEINFO files in original tarballs
  Experiment files
Gather into tarballs
  account for duplicate traces
  calc. MD5 checksums
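The gathering step, combined with the tarball size limit mentioned earlier, might look something like the sketch below. It is a simple greedy grouping under stated assumptions: `group_by_size` is a hypothetical name, 1.5 Gbytes is taken as 1,500,000,000 bytes, and a duplicate is assumed to mean a repeated trace name, of which only the first is kept.

```python
def group_by_size(files, limit_bytes=1_500_000_000):
    """Group (name, size) pairs into batches that stay under limit_bytes.

    Repeated names are skipped, so each trace lands in one batch only.
    """
    batches, current, current_size = [], [], 0
    seen = set()
    for name, size in files:
        if name in seen:
            continue  # keep only the first copy of a duplicate trace
        seen.add(name)
        # Start a new batch when adding this file would exceed the limit.
        if current and current_size + size > limit_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

A file larger than the limit still gets a batch of its own here; a real pipeline would need to decide how to handle that case.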

14 3. Insert
given a set of tarballs
extract TRACEINFO and parse XML
index the tarball
for each trace
  check for pre-existing traces
  get dictionary IDs
  populate trace_info/ancillary tables

15 Finally
generate Fasta/Quality files
generate Clip files
Update FTP site
Re-build SSAHA hash tables

16 Given a list of traces
generate a single fasta file
Extend this to generate a tarfile
  SCF
  FASTA
  QUAL
  TRACEINFO text or XML
Need to restrict size of tarballs and cache previous results

17 Clipping Info
Pass ml2C-b205c06.p1c
Fail ml2C-b205c07.p1c Contam CVEC:pBACe3.6
Fail ml2C-b111e07.p1c Qual
Where the above is "Pass|Fail", readname, start, end, GC, AT, (end-start+1) so the start and stop positions are inclusive and base counting starts at 1.
I define the starting quality clip point as the start of the first 20bp window where the integrated error rate within the window drops below 1.00 and the ending quality clip point as the end of the last 20bp window where the integrated error rate rises above 1.00.
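One possible reading of this clipping rule is sketched below. It assumes phred-style quality values, so that each base contributes an error probability of 10^(-Q/10) and the "integrated error rate" is the sum of those probabilities over a window. It also interprets the end point as the end of the last window whose summed error is still below the threshold. Both points are assumptions, and the function names are hypothetical.

```python
def window_error(quals, start, width):
    """Sum of per-base error probabilities over one window (phred scale assumed)."""
    return sum(10 ** (-q / 10) for q in quals[start:start + width])


def quality_clip(quals, width=20, max_error=1.0):
    """Return 1-based inclusive (start, end) clip points, or None.

    start is the first base of the first window whose summed error is below
    max_error; end is the last base of the last such window.
    """
    passing = [i for i in range(len(quals) - width + 1)
               if window_error(quals, i, width) < max_error]
    if not passing:
        return None  # no window is good enough, so the read fails on quality
    start = passing[0] + 1     # convert 0-based window start to 1-based
    end = passing[-1] + width  # 1-based position of the window's last base
    return start, end
```

The clipped length then follows the slide's convention of (end-start+1), since both positions are inclusive.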

18 Problems
Duplicate trace names
  1,000, Whitehead
  <10, WUGSC/BCM
  None Sanger (5,000,000)
Bad tarballs/tapes
FTP errors
Back it up
  lost 150 Gbytes, 3,000,000+ traces

19 Synchronise with NCBI
NCBI
  23 million traces
  20,000 traces/Gbyte
  50 Gbytes/million traces
SC
  19 million traces
  30,000 traces/Gbyte
  33 Gbytes/million traces
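The per-million figures follow directly from the traces-per-Gbyte densities, as the short calculation below shows. The totals it derives are simply what the slide's numbers imply, not figures stated in the talk, and the function names are hypothetical.

```python
def gbytes_per_million(traces_per_gbyte):
    """Storage needed for one million traces at a given packing density."""
    return 1_000_000 / traces_per_gbyte


def total_gbytes(million_traces, traces_per_gbyte):
    """Implied total storage for a collection of the given size."""
    return million_traces * gbytes_per_million(traces_per_gbyte)


for site, millions, density in [("NCBI", 23, 20_000), ("SC", 19, 30_000)]:
    print(site,
          round(gbytes_per_million(density)), "Gbytes/million traces,",
          round(total_gbytes(millions, density)), "Gbytes implied total")
```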

20 Acknowledgements
Sanger
  Richard Durbin, Jim Mullikin
  Andy Smith, Simon Mercer
  Tony Cox + Web team
  James Cuff + SSG
  Santhi Sivadasan, Martin Widlake
NCBI
  Vladimir Alekseyev, Eugene Yaschenko, Deanna Church

21 Future Work
Other organisms
  C.briggsae
  Dicty
  Tetraodon
  Xenopus
ESTs
submit to EMBL from the archive
True Trace Server
  Asp
  Gap4, etc.

22 ==> Netscape
URL http://trace.ensembl.org/
search on trace_name
  trace/list of traces
  add quality values
  add trace (need Java)

23 ==> Netscape SSAHA Server
14 Gbytes to build, 4-5 Gbytes to run
preload file in browser for search
passed reads
organism specific
modified headers

24 ==> Netscape FTP site
fasta/qual/traceinfo
fixed clip
updated
Duplicate/Updated

25 ==> Netscape Chromosome 1

