November 18, 2003, SC’03: Optimizing Genomic Data Storage for Wide Accessibility. Joint Genome Institute (JGI), NERSC Center, Computational Research Division.

1 Optimizing Genomic Data Storage for Wide Accessibility
Joint Genome Institute (JGI), NERSC Center, Computational Research Division (LBNL)

2 Collaborators
-- Nancy Meyer, NERSC - HPSS
-- Harvard Holmes, NERSC - HPSS
-- Jonathan Carter, NERSC - User Services
-- Horst Simon, NERSC Center Director
-- Susan Lucas, JGI-PGF - Head, Production Sequencing
-- Arthur Kobayashi, JGI-PGF - Production Informatics
-- Eddy Rubin, JGI Director
-- Arie Shoshani, LBNL Computational Research Division
Millions of Microbes Everywhere

3 Outline
-- General Goals
-- Genomic Data
-- Life after the Human Genome Project
-- NERSC Storage Systems
-- Data Management
-- Future Directions

4 General Goals
1. Distribute, archive, and enhance access to the data generated at DOE’s Joint Genome Institute (JGI) Production Genomic Facility (PGF).
2. Serve as a resource for community access to these data.
3. Establish a long-term collaboration between the JGI and the NERSC Center.
High Performance Storage System (HPSS)

5 Environmental Genomics
Carbon Cycle

6 Environmental Genomics
-- < 1% of microbes are culturable
-- Many unculturables live in interdependent consortia of considerable diversity
-- Aim: to recover genome-scale sequences and reveal metabolic capabilities
-- How can we understand the action of microbes at the molecular level?
-- What is the structure of natural microbial populations?
-- What is a microbial species?

7 Future environmental targets for JGI
Whole metagenome shotgun sequencing and targeted fosmid-based methods can be used to recover useful draft genomes.
(Newman and Banfield, Science 2002)

8 JGI Microbial Program
JGI microbial sequencing targets a broad range of bacteria and archaea with relevance to:
-- Bioremediation
-- Carbon Sequestration
-- Global Climate Change
-- Biodiversity
-- Biomass Conversion
-- Energy Production
-- Disease

9 [Figure: tree of life showing the domains Bacteria, Archaea, and Eucarya (plants, animals, fungi); annotation: "Single origin of mitochondria?"]

10 JGI Microbial Program
Lactic acid bacteria
-- Lactobacillus gasseri (Klaenhammer)
-- Oenococcus oeni (Mills)
Complex polysaccharide degradation
-- Clostridium thermocellum (Wu)
-- Microbulbifer degradans (Weiner)
-- (complements white rot fungus sequence)
Phototrophic bacteria
-- Rhodospirillum rubrum (Roberts) (complements Rhodopseudomonas palustris and Rhodobacter sphaeroides)
Toxic waste degradation and microbial ecology
-- Desulfuromonas acetoxidans (Lovley)
-- Desulfovibrio desulfuricans
Microbes in extreme environments
-- Psychrobacter (Thomashow)
-- Methanococcoides burtonii (Sowers, Cavicchioli)
Infectious diseases of plants and animals
-- Ehrlichia chaffeensis (Yu)
-- Pseudomonas syringae (Lindow)
Anaerobic methane oxidizing consortium
-- “ball of bugs” (DeLong, Monterey Bay)
-- one (or two?!) reverse methanogenic archaea in core plus sulfur reducing bacterium on surface

11 JGI - Then & Now
Then:
-- Single project: Human Genome (chromosomes 5, 16, and 19)
-- All data sent to NCBI/GenBank for storage and distribution
-- Minimum local responsibility for data stewardship
-- Relatively low production sequencing rate
Now:
-- Dozens of whole genome projects (2 million to more than a billion bases each)
-- Multiple species (microbial to vertebrates)
-- Complex environmental genomic communities
-- Full responsibility for data storage and distribution
-- Limited storage capacity
-- Production sequencing rate is increasing

12 JGI Monthly Production
[Chart: monthly production in millions of bases; panels: 5-yr history, 12 months]

13 This is Not Raw Data
  1 CAGGTCAACG GATCATCTGT TTCTGACCAT TCCTTCCCGT TCCTGACCCC AGGGAGTGCA
 61 GGGTGTCCTA GCCAAGCCGG CGTCCCTCCT AGTAGTACCG CTGCTCTCTA ACCTCAGGAC
121 GTCAAGGGCC TAGAGCGACA GATGTTTCCC AGCAGGGGGT TCTGAGGCTG TGCGCCCAGA
181 TCGCGAGAGA GGCAAGTGGG GTGACGAGGT CGTGCACTGA GGGTGGACGT AGAGGCCAGG
241 AGTAGCAGGC GGCCGGGGAA AAGAGGTGGA GAAAGGAAAA AAGAGGAGAA AAGTGGAGGA
301 GGGCGAGTAG GGGGGTGGGG CAGAGAGGGG CGGGCCCGAG TGCGCCCCCC GCCCCCAGCC
361 CCGCTCTGCC AGCTCCCTCC CAGCCCAGCC GGCTACATCT GGCGGCTGCC CTCCCTTGTT
421 TCCGCTGCAT CCAGACTTCC TCAGGCGGTG GCTGGAGGCT GCGCATCTGG GGCTTTAAAC
481 ATACAAAGGG ATTGCCAGGA CCTGCGGCGG CGGCGGCGGC GGCGGGGGCT GGGGCGCGGG
541 GGCCGGACCA TGAGCCGCTG AGCCGGGCAA ACCCCAGGCC ACCGAGCCAG CGGACCCTCG
601 GAGCGCAGCC CTGCGCCGCG GACCAGGCTC CAACCAGGCG GCGAGGCGGC CACACGCACC
661 GAGCCAGCGA CCCCCGGGCG ACGCGCGGGG CCAGGGAGCG CTACGATGGA GGCGCTAATG
721 GCCCGGGGCG CGCTCACGGG TCCCCTGAGG GCGCTCTGTC TCCTGGGCTG CCTGCTGAGC
781 CACGCCGCCG CCGCGCCGTC GCCCATCATC AAGTTCCCCG GCGATGTCGC CCCCAAAACG
841 GACAAAGAGT TGGCAGTGGT GAGTTGCT

14 Neither is This

15 These are the Raw Data

16 Genome Sequencing
-- Start with genomic DNA
-- Make sheared fragments
-- Sequence both ends of fragments
-- Reconstruct genome computationally
-- Provide genome and tools to community
-- High-throughput computational analysis

17 Paired Plasmid Sequencing

18 JGI Data Production
-- Millions of files per month of raw trace data
-- 100 assembled projects per month (50MB-250MB) and several large assembled projects per year
-- More data are being generated than ever before
-- Currently, trace data are maintained online only while projects are in process.
-- Whole completed projects are available to download. They are large and contain millions of files.

19 JGI Raw Data Organization
Project = series of libraries that define a genome
Library = series of plates
Plate = 384 clones
Clone = 2 lanes
1 lane = ~1MB, distributed into 4 files:
-- 1 FASTA file = 1KB
-- 1 scf file = 50KB
-- 1 abd file = 250KB
-- 1 rsd/ab1 file = 650KB
In May-03, PGF ran 2.5 million successful lanes = 2.5TB/month; 10 million files (0.75TB/month (9 TB/year) non-trace files)
This does not include any assembly, database or metadata!
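The arithmetic behind the slide's monthly totals can be checked with a short sketch. The per-file sizes and the lane count come from the slide; the decimal unit conversion and the function names are assumptions made for illustration. Note that the 0.75TB/month figure matches the three smaller files (FASTA, scf, abd) summed over all lanes. That reading is an inference from the numbers, not something the slide states.

```python
# Per-lane file sizes in KB, as listed on the slide.
FILE_SIZES_KB = {"fasta": 1, "scf": 50, "abd": 250, "rsd_ab1": 650}

# Successful lanes run in May 2003, per the slide.
LANES_PER_MONTH = 2_500_000

# Assumption: decimal units (1 TB = 10^9 KB).
KB_PER_TB = 1_000_000_000


def monthly_volume_tb(sizes_kb, lanes):
    """Total data volume per month in TB for the given per-lane files."""
    return sum(sizes_kb.values()) * lanes / KB_PER_TB


def monthly_file_count(sizes_kb, lanes):
    """Number of files written per month (one of each type per lane)."""
    return len(sizes_kb) * lanes


total_tb = monthly_volume_tb(FILE_SIZES_KB, LANES_PER_MONTH)

# Everything except the largest (rsd/ab1) file.
smaller_files = {k: v for k, v in FILE_SIZES_KB.items() if k != "rsd_ab1"}
smaller_tb = monthly_volume_tb(smaller_files, LANES_PER_MONTH)

print(f"files per month:        {monthly_file_count(FILE_SIZES_KB, LANES_PER_MONTH):,}")
print(f"all files, TB/month:    {total_tb:.2f}")
print(f"smaller files TB/month: {smaller_tb:.2f}")
print(f"smaller files TB/year:  {smaller_tb * 12:.1f}")
```

The four files sum to 951KB per lane, which the slide rounds to ~1MB, so the computed total comes out slightly under the quoted 2.5TB/month.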

20 Current Access to JGI Data
-- Access to these data is in demand by scientific fields that were not anticipated by the Human Genome Project: microbiologists, environmental scientists, evolutionary scientists, and GtL (Genomes to Life) projects.
-- The computational sophistication of the user community is uneven, at best.
-- Not everyone will want the same kind of files.
-- GenBank is not capable of serving all of the JGI’s needs.

21 Current Access to JGI Data (cont.)
-- The data are processed by researchers using iterative and pattern-matching techniques, often requiring access to data that span several projects and genomes. This is different from the Human Genome Project.
-- Currently, this requires downloading whole projects and then unpacking the project files to access the data: millions of files to unpack, and slow transfers of whole project files.
-- At best, the raw data used to generate the sequences in a project are very difficult to retrieve and interrogate.

22 NERSC Storage Systems
-- DOE’s largest unclassified storage systems, with a current archival capacity of 8 PB
-- Robust and available 24x7, with high reliability and excellent network connectivity
-- Very configurable; currently provide good service for both large streaming data and concurrent direct access
-- Experienced and innovative staff are adding new capabilities and distributing storage as the NERSC Center data requirements change over time

23 Distribute and Enhance Access
1. Initially, we plan to hold all the sequence data online or near-line. We will prototype and select the best way to do this:
-- distributed file systems
-- local file systems
-- cached web servers
-- tools
2. Collaborate with JGI to organize and cluster the sequence data so they can be retrieved in meaningful pieces.

24 Distribute and Enhance Access (cont.)
3. Distribute the data between JGI and NERSC/HPSS:
-- Develop tools and methodologies to move the data between JGI and NERSC/HPSS for timely access to sequence data as they are being generated.
-- Incorporate this into regular site backups.
4. Build a web interface to the data providing a consistent view of the data (allowing the data to be distributed underneath) with a link to the data at JGI for ease of access.

25 Data Organization Requirements
1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the OPM tools
2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome Project.
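As an illustration of the metadata requirements above (a schema, a database behind it, a query interface, and checks on entered values), here is a minimal sketch using Python's built-in sqlite3 module. The slides do not describe the actual schema or the OPM tools; the table names, the columns (including `organism` and `barcode`), and both helper functions are hypothetical. The Project / Library / Plate hierarchy follows the raw data organization slide.

```python
import sqlite3

# Hypothetical schema: NOT NULL enforces metadata entry,
# CHECK and REFERENCES constraints validate what is entered.
SCHEMA = """
CREATE TABLE project (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    organism TEXT NOT NULL CHECK (length(organism) > 0)
);
CREATE TABLE library (
    id         INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    name       TEXT NOT NULL
);
CREATE TABLE plate (
    id          INTEGER PRIMARY KEY,
    library_id  INTEGER NOT NULL REFERENCES library(id),
    barcode     TEXT NOT NULL UNIQUE,
    clone_count INTEGER NOT NULL CHECK (clone_count BETWEEN 1 AND 384)
);
"""


def open_catalog(path=":memory:"):
    """Create the catalog database and turn on foreign-key enforcement."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def plates_for_project(conn, project_name):
    """Query interface: barcodes of all plates belonging to a project."""
    rows = conn.execute(
        """
        SELECT plate.barcode
        FROM plate
        JOIN library ON plate.library_id = library.id
        JOIN project ON library.project_id = project.id
        WHERE project.name = ?
        ORDER BY plate.barcode
        """,
        (project_name,),
    )
    return [row[0] for row in rows]
```

With this layout, an insert that omits a required field, points at a nonexistent project, or claims more than 384 clones on a plate is rejected by the database with `sqlite3.IntegrityError` rather than by application code.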

26 Data Organization Requirements (cont.)
3. Robust massive file movement
-- from daily generated files into NERSC's HPSS
-- ensure correctness in spite of system, network, and HPSS transient failures
-- automated reporting of errors / failures
-- possible use of HRM technology
4. Managing annotations of genomic data
-- need to support history of annotation, perhaps by version hierarchy
-- need for a controlled vocabulary (an ontology) for searching the annotations
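The file-movement requirement (correctness despite transient failures, plus error reporting) is commonly met by verifying a checksum after each transfer and retrying on failure. The sketch below shows that general pattern only. It is not the HRM technology named on the slide, and it does not talk to HPSS: `shutil.copyfile` stands in for the real transfer step, and the function names and parameters are hypothetical.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path):
    """Checksum a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_verification(source, destination, copy=shutil.copyfile,
                               max_attempts=3, delay_seconds=0.0):
    """Copy source to destination, retrying until the checksums match.

    `copy` is a stand-in for the real transfer mechanism (assumption).
    Returns the list of error messages from failed attempts so they can
    be reported; raises RuntimeError if every attempt fails.
    """
    errors = []
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        try:
            copy(source, destination)
            if sha256_of(destination) == expected:
                return errors
            errors.append(f"attempt {attempt}: checksum mismatch")
        except OSError as exc:
            errors.append(f"attempt {attempt}: {exc}")
        time.sleep(delay_seconds)
    raise RuntimeError(f"transfer of {source} failed: {errors}")
```

Returning the per-attempt errors, rather than discarding them on eventual success, gives an automated report something to record even when the retry recovers.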

27 Future Goals
1. Hold more partial and raw data online.
2. Enhance searching these data using annotated databases.
3. Enhance current iterative processing of the data by moving some of this processing close to the data. For example, some programs could run on the web server with access to a local file system of data for matches and selections of data.
NERSC to become the repository of DOE genomic data, focusing on microbial and environmental genomics.

