1 Trends in Genomics: Big Data, the NCBI perspective, and 1,000 Genomes in the Cloud
-- Don Preuss, NCBI/NLM/NIH
"Every decade a new, lower priced computer class forms with new programming platform, network, and interface resulting in new usage and industry." – Bell's Law of computer classes

2 Outline
Emerging trends in "Big Data", large-scale networking, and "the cloud" in the genomics community
Trends in data transfer and data compression
Cloud initiatives – 1,000 Genomes in the cloud

3 National Center for Biotechnology Information
Created by Public Law in 1988 as part of the National Library of Medicine at NIH to:
Create automated systems for knowledge about molecular biology, biochemistry, and genetics
Perform research into advanced methods of analyzing and interpreting molecular biology data
Enable biotechnology researchers and medical care personnel to use the systems and methods developed
The NCBI advances science and health by providing access to biomedical and genomic information.
Builders and providers of GenBank, Entrez, BLAST, PubMed, dbGaP, SRA, dbSNP, PubChem, and much, much more
Center for basic research and training in computational biology

4 NCBI Daily Users
Web page views: 28 million per day
Web users: 3.1 million per day
Data downloaded: 26.6 TB per day
Peak web hits: 7,000 per second

5

6 Sequencers

7 DNA Sequencing Caught in Deluge of Data
BGI, based in China, is the world's largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day. BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines, because that would take weeks. Instead, it ships computer disks containing the data via FedEx.
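A back-of-envelope calculation makes the "weeks" claim concrete. The per-genome raw data volume (~100 GB) and the sustained 1 Gb/s link assumed below are illustrative figures, not numbers from the slide:

```python
# Rough check of the "would take weeks" claim (illustrative assumptions only;
# the slide does not state per-genome sizes or available link speeds).
GENOMES_PER_DAY = 2000
BYTES_PER_GENOME = 100e9        # assume ~100 GB of raw reads per genome
LINK_BITS_PER_SEC = 1e9         # assume a 1 Gb/s sustained connection

daily_output_bytes = GENOMES_PER_DAY * BYTES_PER_GENOME        # ~200 TB/day
transfer_seconds = daily_output_bytes * 8 / LINK_BITS_PER_SEC

print(f"One day of output: {daily_output_bytes / 1e12:.0f} TB")
print(f"Transfer time at 1 Gb/s: {transfer_seconds / 86400:.1f} days")
```

Under these assumptions a single day of output takes roughly 18 days to transmit, so the backlog only grows and shipping disks wins.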

8 Big Data in Scientific Discovery
Physics: Large Hadron Collider
Biology: 1,000 Genomes Project
(Trunnell 2012)

9 NLM I2 Traffic Stats

10 Getting exponential growth under control

11 What is the Big Data Problem in Biology?
Example: reducing the 1000 Genomes dataset (shown on the slide as a bar chart of size in terabytes, with bars for Total Project Size, Lossless cSRA, Lossy cSRA, Analysis, and Genotypes):
Submitted BAM (250 TB): read IDs as strings; original and recalibrated quality scores; additional analysis tags
cSRA, lossless: read IDs as integers; 40-level read qualities using recalibrated quality scores
cSRA, lossy (85 TB): 8-level qualities for all sites; uniform binning of recalibrated quality scores (see the binning sketch after this slide)
Variant Call Format (VCF) analysis (30 TB): genotype likelihoods for all variants
Genotypes: 0.1 TB
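Below is a minimal sketch of what uniform quality binning looks like in practice. The slide does not specify cSRA's exact bin boundaries or representative values, so the 8-level scheme here is purely illustrative:

```python
# Minimal sketch of lossy quality-score binning, the idea behind the
# "8-level qualities" row above. Bin boundaries are illustrative; the slide
# does not specify the exact scheme cSRA uses.
def bin_quality(phred: int, levels: int = 8, max_q: int = 40) -> int:
    """Map a Phred quality score (0..max_q) onto one of `levels` uniform bins,
    returning the midpoint score of that bin."""
    width = (max_q + 1) / levels                   # uniform bin width
    bin_index = min(int(phred / width), levels - 1)
    return int(bin_index * width + width / 2)      # representative (midpoint) value

# Example: a read's qualities before and after binning
original = [2, 11, 17, 23, 28, 33, 37, 40]
binned = [bin_quality(q) for q in original]
print(binned)  # at most 8 distinct values, so the quality stream compresses far better
```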

12 Flicek

13 Problem: Enable Access to Data
The 1,000 Genomes data set is very large
Many sites do not have the capacity for terabyte-scale downloads
Request – can the 1,000 Genomes Project store the data in the cloud?
Reduces cost for extramural investigators and increases accessibility of the data
It also supports Federal Open Data: "A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government…"
Latest release announced at #ICGH2011; more releases coming
Part of the National Big Data Initiative announcement

14 Why is NCBI interested in cloud computing?
Quantity of data: NCBI has petabytes of sequence data that are made available to researchers around the world.
Bandwidth: NIH has substantial network capacity, and many sites, especially those on Internet2, have the capacity to download these data sets. For many others it is not available, reducing their practical access to research data.
Analysis tools and platforms: Some need simple tools, such as extracting a portion of the data (a chromosome or region of interest); see the sketch after this slide. Others use more complex tools, such as genome browsers or epigenomics analysis tools running on Elastic MapReduce.
If we can bring compute to the data, we can improve access to the data.
References in this talk to any specific commercial products, process, service, manufacturer, company, or trademark do not constitute endorsement or recommendation by the U.S. Government, HHS, or NIH. As an agency of the U.S. Government, NIH cannot endorse or appear to endorse any specific commercial products or services.
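As a concrete illustration of the "simple tools" case, extracting one region of interest from an indexed BAM takes only a few lines with pysam. The file names and coordinates below are placeholders, and the sketch assumes a locally available, coordinate-sorted, indexed BAM rather than any particular 1,000 Genomes file:

```python
import pysam  # htslib-based BAM/CRAM access

# Placeholder inputs: any coordinate-sorted, indexed BAM will do.
BAM_PATH = "sample.bam"                  # hypothetical file; expects sample.bam.bai alongside
REGION = ("20", 1_000_000, 1_100_000)    # chromosome, start, end (0-based, half-open)

with pysam.AlignmentFile(BAM_PATH, "rb") as bam, \
     pysam.AlignmentFile("region.bam", "wb", template=bam) as out:
    # fetch() uses the index, so only the requested slice is read from disk
    for read in bam.fetch(*REGION):
        out.write(read)
```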

15 1,000 Genomes in the Cloud
The 1,000 Genomes Project files are loaded into Amazon S3
Millions of files have been uploaded (200 TB)
AMIs have been developed to analyze and review the data: CloudBioLinux, Galaxy
This is a public data set, with storage provided by AWS
NIH is funding several efforts to port genome pipelines to cloud computing environments
Research labs, such as those at Emory and UCSC, have placed versions of their software in AWS to make 1,000 Genomes data readily accessible through browser interfaces in the cloud
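For example, a public data set in S3 can be listed without AWS credentials using an unsigned client. The bucket name "1000genomes" below is an assumption; the slide only states that the files are stored in Amazon S3:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: public data sets need no AWS credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# "1000genomes" is an assumed bucket name for illustration only;
# the slide does not name the bucket.
resp = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```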

16 What is Galaxy?
Galaxy is a framework for integrating computational tools. It allows nearly any tool that can be run from the command line to be wrapped in a structured, well-defined interface.
On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analyses, a workflow system for convenient reuse, data management, sharing, publishing, and more.
Galaxy has also made it easy for researchers to extend their compute power into cloud compute systems. Tools like Galaxy make it possible for a researcher to take advantage of much greater compute power without having to worry about the infrastructure details.
(From ASMB tutorial)
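Galaxy's real tool wrappers are declarative definitions maintained by the project and are not shown here. Purely as a conceptual sketch of the "wrap any command-line tool in a structured, well-defined interface" idea, here is a minimal example; the quality_filter tool, its flags, and the file names are hypothetical:

```python
import subprocess
from dataclasses import dataclass

# Hypothetical declared interface for a command-line tool. This is only a
# conceptual sketch of putting a CLI tool behind typed, documented parameters,
# not Galaxy's actual tool-definition mechanism.
@dataclass
class FilterByQuality:
    input_path: str          # input file (e.g., FASTQ)
    output_path: str         # where to write filtered reads
    min_quality: int = 20    # typed parameter with a default

    def command(self) -> list[str]:
        # Build the exact command line from the structured parameters.
        return ["quality_filter",          # hypothetical CLI tool
                "--min-quality", str(self.min_quality),
                "--in", self.input_path,
                "--out", self.output_path]

    def run(self) -> None:
        subprocess.run(self.command(), check=True)

# Usage (the caller sees typed parameters, not shell syntax):
# FilterByQuality("reads.fastq", "filtered.fastq", min_quality=30).run()
```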

17

18 Summary/Questions
Compression will help slow the growth of this big data problem
Other big data problems remain
New file formats will compress data close to the sequencers
Last-mile networking is a big issue and blocks access for many researchers
The cloud will enable access for many more researchers internationally and at underserved institutions

