November 18, 2003, SC’03: Optimizing Genomic Data Storage for Wide Accessibility. Joint Genome Institute (JGI), NERSC Center, Computational Research Division.

Presentation transcript:

Optimizing Genomic Data Storage for Wide Accessibility
Joint Genome Institute (JGI)
NERSC Center
Computational Research Division (LBNL)

Collaborators
-- Nancy Meyer, NERSC - HPSS
-- Harvard Holmes, NERSC - HPSS
-- Jonathan Carter, NERSC - User Services
-- Horst Simon, NERSC Center Director
-- Susan Lucas, JGI-PGF - Head, Production Sequencing
-- Arthur Kobayashi, JGI-PGF - Production Informatics
-- Eddy Rubin, JGI Director
-- Arie Shoshani, LBNL Computational Research Division
[Figure: Millions of Microbes Everywhere]

-- General Goals
-- Genomic Data
-- Life after the Human Genome Project
-- NERSC Storage Systems
-- Data Management
-- Future Directions

General Goals
1. Distribute, archive, and enhance access to the data generated at DOE’s Joint Genome Institute (JGI) Production Genomic Facility (PGF).
2. Serve as a resource for community access to these data.
3. Establish a long-term collaboration between the JGI and the NERSC Center.
[Figure: High Performance Storage System (HPSS)]

Environmental Genomics: Carbon Cycle

Environmental Genomics
-- Fewer than 1% of microbes are culturable
-- Many unculturables live in interdependent consortia of considerable diversity
-- Aim: to recover genome-scale sequences and reveal metabolic capabilities
-- How can we understand the action of microbes at the molecular level?
-- What is the structure of natural microbial populations?
-- What is a microbial species?

Future environmental targets for JGI
[Figure source: Newman and Banfield, Science 2002]
Whole metagenome shotgun sequencing and targeted fosmid-based methods can be used to recover useful draft genomes.

JGI Microbial Program
JGI microbial sequencing targets a broad range of bacteria and archaea with relevance to:
-- Bioremediation
-- Carbon Sequestration
-- Global Climate Change
-- Biodiversity
-- Biomass Conversion
-- Energy Production
-- Disease

[Figure: tree of life with three domains, Bacteria, Archaea, and Eucarya (plants, animals, fungi); annotation: "Single origin of mitochondria?"]

JGI Microbial Program
Lactic acid bacteria:
-- Lactobacillus gasseri (Klaenhammer)
-- Oenococcus oeni (Mills)
Complex polysaccharide degradation (complements white rot fungus sequence):
-- Clostridium thermocellum (Wu)
-- Microbulbifer degradans (Weiner)
Phototrophic bacteria:
-- Rhodospirillum rubrum (Roberts) (complements Rhodopseudomonas palustris and Rhodobacter sphaeroides)
Toxic waste degradation and microbial ecology:
-- Desulfuromonas acetoxidans (Lovley)
-- Desulfovibrio desulfuricans
Microbes in extreme environments:
-- Psychrobacter (Thomashow)
-- Methanococcoides burtonii (Sowers, Cavicchioli)
Infectious diseases of plants and animals:
-- Ehrlichia chaffeensis (Yu)
-- Pseudomonas syringae (Lindow)
Anaerobic methane oxidizing consortium, the “ball of bugs” (DeLong, Monterey Bay):
-- one (or two?!) reverse methanogenic archaea in core plus sulfur reducing bacterium on surface

JGI - Then & Now
Then:
-- Single project: Human Genome (chromosomes 5, 16, and 19)
-- All data sent to NCBI/GenBank for storage and distribution
-- Minimum local responsibility for data stewardship
-- Relatively low production sequencing rate
Now:
-- Dozens of whole genome projects (2 million to more than a billion bases each)
-- Multiple species (microbial to vertebrates)
-- Complex environmental genomic communities
-- Full responsibility for data storage and distribution
-- Limited storage capacity
-- Production sequencing rate is increasing

JGI Monthly Production
[Figure: monthly production in millions of bases; panels: 5-year history and 12 months]

This is Not Raw Data
  1 CAGGTCAACG GATCATCTGT TTCTGACCAT TCCTTCCCGT TCCTGACCCC AGGGAGTGCA
 61 GGGTGTCCTA GCCAAGCCGG CGTCCCTCCT AGTAGTACCG CTGCTCTCTA ACCTCAGGAC
121 GTCAAGGGCC TAGAGCGACA GATGTTTCCC AGCAGGGGGT TCTGAGGCTG TGCGCCCAGA
181 TCGCGAGAGA GGCAAGTGGG GTGACGAGGT CGTGCACTGA GGGTGGACGT AGAGGCCAGG
241 AGTAGCAGGC GGCCGGGGAA AAGAGGTGGA GAAAGGAAAA AAGAGGAGAA AAGTGGAGGA
301 GGGCGAGTAG GGGGGTGGGG CAGAGAGGGG CGGGCCCGAG TGCGCCCCCC GCCCCCAGCC
361 CCGCTCTGCC AGCTCCCTCC CAGCCCAGCC GGCTACATCT GGCGGCTGCC CTCCCTTGTT
421 TCCGCTGCAT CCAGACTTCC TCAGGCGGTG GCTGGAGGCT GCGCATCTGG GGCTTTAAAC
481 ATACAAAGGG ATTGCCAGGA CCTGCGGCGG CGGCGGCGGC GGCGGGGGCT GGGGCGCGGG
541 GGCCGGACCA TGAGCCGCTG AGCCGGGCAA ACCCCAGGCC ACCGAGCCAG CGGACCCTCG
601 GAGCGCAGCC CTGCGCCGCG GACCAGGCTC CAACCAGGCG GCGAGGCGGC CACACGCACC
661 GAGCCAGCGA CCCCCGGGCG ACGCGCGGGG CCAGGGAGCG CTACGATGGA GGCGCTAATG
721 GCCCGGGGCG CGCTCACGGG TCCCCTGAGG GCGCTCTGTC TCCTGGGCTG CCTGCTGAGC
781 CACGCCGCCG CCGCGCCGTC GCCCATCATC AAGTTCCCCG GCGATGTCGC CCCCAAAACG
841 GACAAAGAGT TGGCAGTGGT GAGTTGCT

Neither is This

These are the Raw Data

Genome Sequencing
-- Start with genomic DNA
-- Make sheared fragments
-- Sequence both ends of fragments
-- Reconstruct genome computationally
-- Provide genome and tools to community
-- High-throughput computational analysis
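The shearing and both-ends sequencing steps can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the fixed-offset "shearing" (real shearing is random), the tiny read length, and the 30-base example string. It is not the JGI pipeline, and it stops before the assembly step.

```python
# Toy illustration of shotgun fragments and paired-end reads (hypothetical names).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def shear(genome, fragment_len, step):
    """Cut overlapping fragments at fixed offsets (a stand-in for random shearing)."""
    last_start = len(genome) - fragment_len
    return [genome[i:i + fragment_len] for i in range(0, last_start + 1, step)]


def paired_end_reads(fragment, read_len):
    """Read both ends of a fragment; the far end is read on the opposite strand."""
    forward = fragment[:read_len]
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse


genome = "CAGGTCAACGGATCATCTGTTTCTGACCAT"  # arbitrary 30-base example
fragments = shear(genome, fragment_len=20, step=5)
pairs = [paired_end_reads(fragment, read_len=6) for fragment in fragments]
```

Because the two reads of a pair come from the same fragment, an assembler can use their known approximate separation when reconstructing the genome.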

Paired Plasmid Sequencing

JGI Data Production
-- Millions of files per month of raw trace data
-- 100 assembled projects per month (50MB-250MB) and several large assembled projects per year
-- More data are being generated than ever before
-- Currently, trace data are maintained online only while projects are in process.
-- Whole completed projects are available to download. They are large and contain millions of files.

JGI Raw Data Organization
Project = series of libraries that define a genome
Library = series of plates
Plate = 384 clones
Clone = 2 lanes
1 lane = ~1MB, distributed into 4 files:
-- 1 FASTA file = 1KB
-- 1 scf file = 50KB
-- 1 abd file = 250KB
-- 1 rsd/ab1 file = 650KB
In May 2003, PGF ran 2.5 million successful lanes = 2.5TB/month and 10 million files (0.75TB/month, or 9TB/year, of non-trace files).
This does not include any assembly, database or metadata!
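The volume figures on this slide can be checked with a short calculation. This sketch assumes decimal units (1TB = 10^9 KB) and assumes that "non-trace files" means every file except the rsd/ab1 file; that reading is not stated on the slide, but it reproduces the 0.75TB/month figure. The variable names are illustrative.

```python
# Back-of-the-envelope check of the May 2003 data volumes (assumed decimal units).
FILE_SIZES_KB = {"fasta": 1, "scf": 50, "abd": 250, "rsd_ab1": 650}
LANES_PER_MONTH = 2_500_000
KB_PER_TB = 1_000_000_000


def monthly_volume_tb(sizes_kb, lanes):
    """Total monthly volume in TB for the given per-lane file sizes."""
    return sum(sizes_kb.values()) * lanes / KB_PER_TB


# All four files per lane: 951KB, which the slide rounds to ~1MB.
total_tb = monthly_volume_tb(FILE_SIZES_KB, LANES_PER_MONTH)

# Assumption: "non-trace" excludes only the rsd/ab1 file.
non_trace_sizes = {k: v for k, v in FILE_SIZES_KB.items() if k != "rsd_ab1"}
non_trace_tb = monthly_volume_tb(non_trace_sizes, LANES_PER_MONTH)
non_trace_tb_per_year = non_trace_tb * 12

files_per_month = len(FILE_SIZES_KB) * LANES_PER_MONTH
lanes_per_plate = 384 * 2  # 384 clones per plate, 2 lanes per clone
```

Under these assumptions the totals come to about 2.4TB/month for all files and about 0.75TB/month (9TB/year) for the non-trace files, in line with the rounded figures on the slide.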

Current Access to JGI Data
Access to these data is in demand by scientific fields that were not anticipated by the Human Genome Project:
-- Microbiologists
-- Environmental Scientists
-- Evolutionary Scientists
-- Genomes to Life (GtL) projects
The computational sophistication of the user community is uneven, at best. Not everyone will want the same kind of files.
GenBank is not capable of serving all of the JGI’s needs.

Current Access to JGI Data (cont.)
-- The data are processed by researchers using iterative and pattern-matching techniques that often require access to data spanning several projects and genomes. This is different from the Human Genome Project.
-- Currently, this requires downloading whole projects and then unpacking the project files to access the data: millions of files to unpack, and slow transfers of whole project files.
-- At best, the raw data used to generate the sequences in a project are very difficult to retrieve and interrogate.

NERSC Storage Systems
-- DOE’s largest unclassified storage systems, with a current archival capacity of 8PB
-- Robust and available 24x7, with high reliability and excellent network connectivity
-- Very configurable; currently provides good service for both large streaming data and concurrent direct access
-- Experienced and innovative staff are adding new capabilities and distributing storage as the NERSC Center data requirements change over time

Distribute and Enhance Access
1. Initially, we plan to hold all the sequence data online or near-line. We will prototype and select the best way to do this:
-- distributed file systems
-- local file systems
-- cached web servers
-- tools
2. Collaborate with JGI to organize and cluster the sequence data so they can be retrieved in meaningful pieces.

Distribute and Enhance Access (cont.)
3. Distribute the data between JGI and NERSC/HPSS:
-- Develop tools and methodologies to move the data between JGI and NERSC/HPSS for timely access to sequence data as they are being generated.
-- Incorporate this into regular site backups.
4. Build a web interface to the data providing a consistent view of the data (allowing the data to be distributed underneath), with a link to the data at JGI for ease of access.

Data Organization Requirements
1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the OPM tools
2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome Project.
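One way to picture the metadata requirements is a small relational schema that follows the project, library, plate, lane hierarchy and lets the database itself enforce entry and correctness. This is a minimal sketch using SQLite from the Python standard library; all table names, column names, and constraints are hypothetical, and it is not the OPM tooling mentioned on the slide. The forward/reverse labels for a clone's two lanes are also an assumption.

```python
# Hypothetical metadata catalog sketch (SQLite, in memory).
import sqlite3

SCHEMA = """
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    organism   TEXT NOT NULL
);
CREATE TABLE library (
    library_id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(project_id)
);
CREATE TABLE plate (
    plate_id   INTEGER PRIMARY KEY,
    library_id INTEGER NOT NULL REFERENCES library(library_id)
);
CREATE TABLE lane (
    lane_id   INTEGER PRIMARY KEY,
    plate_id  INTEGER NOT NULL REFERENCES plate(plate_id),
    well      INTEGER NOT NULL CHECK (well BETWEEN 1 AND 384),
    direction TEXT NOT NULL CHECK (direction IN ('forward', 'reverse')),
    UNIQUE (plate_id, well, direction)
);
"""


def open_catalog():
    """Create an in-memory catalog with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def lanes_for_project(conn, project_id):
    """Query interface example: list every lane that belongs to a project."""
    rows = conn.execute(
        """
        SELECT lane.lane_id
        FROM lane
        JOIN plate ON plate.plate_id = lane.plate_id
        JOIN library ON library.library_id = plate.library_id
        WHERE library.project_id = ?
        ORDER BY lane.lane_id
        """,
        (project_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The NOT NULL, CHECK, and foreign-key constraints stand in for the "procedure to enforce metadata entry" and the "checks on the correctness": a lane with a well number outside 1 to 384, or one that points at a nonexistent plate, is rejected at insert time.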

Data Organization Requirements (cont.)
3. Robust massive file movement
-- from daily generated files into NERSC's HPSS
-- ensure correctness in spite of system, network, and HPSS transient failures
-- automated reporting of errors / failures
-- possible use of HRM technology
4. Managing annotations of genomic data
-- need to support history of annotation, perhaps by version hierarchy
-- need for a controlled vocabulary (an ontology) for searching the annotations
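The file-movement requirement (correctness despite transient failures, plus automated error reporting) can be sketched as a checksum-verified copy with retries. This is an illustration only: it copies between local paths rather than into HPSS, it does not use HRM, and the function names, retry count, and choice of SHA-256 are all assumptions.

```python
# Hypothetical sketch: verified transfer with retry and failure reporting.
import hashlib
import shutil
import time


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_verified(src, dst, copy=shutil.copyfile, retries=3, delay_s=0.0):
    """Copy src to dst, retrying on I/O errors or checksum mismatch.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    expected = sha256_of(src)
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            copy(src, dst)
            if sha256_of(dst) == expected:
                return attempt
            last_error = "checksum mismatch"
        except OSError as exc:
            last_error = str(exc)
        if attempt < retries:
            time.sleep(delay_s)
    raise RuntimeError(
        f"transfer of {src} failed after {retries} attempts: {last_error}"
    )


def transfer_batch(pairs, **kwargs):
    """Transfer many (src, dst) pairs; return (src, error) entries for reporting."""
    failures = []
    for src, dst in pairs:
        try:
            transfer_verified(src, dst, **kwargs)
        except RuntimeError as exc:
            failures.append((str(src), str(exc)))
    return failures
```

The `copy` parameter is there so the same retry-and-verify logic could wrap whatever transfer mechanism is actually used; the list returned by `transfer_batch` is what an automated error report would be built from.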

Future Goals
1. Hold more partial and raw data online.
2. Enhance searching of these data using annotated databases.
3. Enhance current iterative processing of the data by moving some of this processing close to the data. For example, some programs could run on the web server with access to a local file system of data for matches and selections of data.
NERSC to become the repository of DOE genomic data, focusing on microbial and environmental genomics.