Computational challenges Sean Eddy HHMI Janelia Farm Research Campus
My charge
  2008: 2 Tb
  2009: 32 Tb
  2010: 150 Tb
  2011: 165 Tb
  How will we keep up with this?
    maintaining/annotating quality
    storage
    communication (network bandwidth)
    analysis (including software and databases)
    integration
Ewan Birney, Michael Brent, Jeremy Buhler, Goran Ceric, Barak Cohen, Richard Durbin, Jonathan Eisen, Rob Finn, Paul Gardner, Ian Holmes, Scott Hunicke-Smith, Rob Knight, David Konerding, Saul Kravitz, Anthony Leonardo, Rob Mitra, Ryan Richt, Jason Stajich, Lincoln Stein, Granger Sutton, George Weinstock, Rick Wilson
  Dan Meiron (Dept. of Applied & Computational Mathematics and Aeronautics, Caltech)
  Steven Brenner (UC Berkeley)
  David Dooling (WashU Genome Center)
  Vivien Bonazzi and Adam Felsenfeld (NIH NHGRI)
  http://cryptogenomicon.org
FY09 NHGRI: $488M
  about 40M databases
  about 60M informatics
FY09 NIH: $29,000M
HHMI: $760M; Janelia Farm alone: $120M
CERN: $1000M; SLAC: $300M; LSST: $45M (start 2015)
Informatics challenges affect all biomedical research.
NHGRI lacks resources to solve these problems alone.
First planning priority is dealing with NHGRI’s own data well.
At the same time:
  lead and catalyze: show others how to do it; best practices
  work together: NCBI, EBI, NCI, others share our problems
Data volume per se is not the problem.
[Figure: sensor data volume versus GIG data capacity (services, transport and storage), plotted from 2000 through today, 2010, and 2015 and beyond, on a scale from terabytes (10^12) through petabytes (10^15), exabytes (10^18) and zettabytes (10^21) to yottabytes (10^24), with the difference labeled "Capability Gap". Labeled data sources: UUVs, Predator UAV video, Global Hawk data, Firescout VTUAV data, Large Data JCTD, and future sensors X, Y and Z. Theater data stream (2006): ~270 TB of NTM data/year. Example, one theater's storage capacity for 2006 and 2010: values shown are 12 TB and 250 TB. Source: Bob Gourley, http://ctovision.typepad.com/InfoSharingTechnologyFutures.ppt]
Moore’s law: CPU power doubles in ~18-24 mo.
Hard drive capacity doubles in ~12 mo.
Network bandwidth doubles in ~20 mo.
Fundamental computing capabilities should increase:
  7-10x in 5 years
  50-100x in 10 years
We project in 3-5 years:
  100x increase in sequencing volume
Therefore:
  yes, next-gen sequencing tech bumps us up;
  and we can’t just sit on our hands;
  but we only have to be a little more clever
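As a rough check on these ranges, the sketch below converts a doubling time into a growth factor using factor = 2^(elapsed months / doubling months). The doubling times are the ones quoted for CPU, hard drive and network; the function and variable names are illustrative only, and treating growth as a clean exponential is an assumption.

```python
# Illustrative sketch: growth implied by a fixed doubling time.
# Assumes smooth exponential growth; names here are hypothetical.

def growth_factor(years: float, doubling_months: float) -> float:
    """Return the multiplicative increase after `years` given a doubling time in months."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling times in months, as quoted in the talk.
DOUBLING_MONTHS = {
    "CPU (24 mo)": 24,
    "CPU (18 mo)": 18,
    "hard drive": 12,
    "network": 20,
}

if __name__ == "__main__":
    for name, months in DOUBLING_MONTHS.items():
        five = growth_factor(5, months)
        ten = growth_factor(10, months)
        print(f"{name:12s}  5 yr: {five:7.1f}x   10 yr: {ten:8.1f}x")
```

With these inputs, an 18-24 month doubling time works out to roughly 6-10x over 5 years and roughly 32-100x over 10 years, which is in the neighborhood of the ranges quoted above; a 12-month doubling time gives 32x over 5 years.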
Fortunately, we are not alone.
For example: Microsoft is constructing a new $500M data center in Chicago.
  Four new electrical substations totalling 168 MW of power.
  About 200 40-foot truckable containers, each containing ~1000-2000 servers.
  Estimated 200K-400K servers total.
Comparisons to Google, Microsoft, etc. aren’t entirely appropriate; the scale of their budgets isn’t comparable to ours.
  Google FY2007: $11.5B; ~$1B to computing hardware
  They do, though, give us early warning of coming trends (container data centers; cloud computing).
Private sector datasets and computing capacity are already huge.
  Google, Yahoo!, Microsoft: probably ~100 PB or so
  eBay, Facebook, Walmart: probably ~10 PB or so
CERN Large Hadron Collider (LHC)
  ~10 PB/year at start
  ~1000 PB in ~10 years
  2500 physicists collaborating
  http://www.cern.ch
Pan-STARRS (Haleakala, Hawaii)
  US Air Force
  now: 800 TB/year
  soon: 4 PB/year
Large Synoptic Survey Telescope (LSST)
  NSF, DOE, and private donors
  ~5-10 PB/year at start in 2012
  ~100 PB by 2025
http://www.lsst.org; http://pan-starrs.ifa.hawaii.edu/public/
1. Petabyte data volumes are manageable using commodity tech
  Pan-STARRS: 80 “data bricks”, RAID-6; 3 PB for ~$1M
2. Just because you can store raw data doesn’t mean you should
  data filtering at the source and at every stop along the way, using a strategy appropriate to a particular experiment/analysis
  CERN LHC ATLAS detector generates 10^5 times more data than is stored (40 million events/sec; 200/sec stored)
3. Distributed, hierarchical, redundant data archives and analysis (CERN LHC’s four tiers of data centers: 10 Tier 1 sites, 30 Tier 2 sites)
4. Computational infrastructure is integral to experimental design
Hardware technology is important, but is not where we are stressed.
Our single most important problem is the democratization of sequence analysis.
  Biology has become an informatics- and data-heavy science, but we lack a culture that supports pervasive computational analysis.
  Our weak links are computational infrastructure and the training and expertise of bench scientists.
The good old days:
  [Diagram labels: one genome; cloning, mapping, sequencing; assembly; genome center; international DNA databases]
[Diagram labels, sites and resources: PI’s lab; genome centers; ENCODE centers, CEGS; departmental core sequencers; international databases; genome browsers; model organism databases; boutique databases; supplementary material]
[Diagram labels, data types: reference genome assemblies; comparative sequence; transcript sequence; ChIP-seq, CLIP-seq; resequencing (mutants, variants); phenotype data]
1. Evolving toward a tiered structure (like physics/astronomy)
2. Must integrate lots of different data (unlike physics/astronomy)
A return to a paper as a unit of advance, not a genome
  The output of genome sequencing and assembly is simple, modular, and well-understood, including the associated quality metrics.
  This meant we could shoot pre-publication data into the databases and it was useful.
  Now that next-gen sequencing is a multipurpose digital assay tool, details of methods, experimental design, and analysis all matter again.
  We’ve been calling this information “metadata”, as if it were merely a database format issue to solve with XML; it is not. It’s the information in a properly written paper.
Democratization means:
  For an individual PI’s lab to generate reliable/reusable datasets, integrate them with other datasets, conduct large-scale computational analyses, and write great papers, with results that can in turn be integrated with others.
  Those individual labs need:
    good software for mapping/assembling sequence
    access to reliable, modular, well-annotated datasets
    good software for integrated data analysis
    efficient means of sharing/distributing their datasets
Bottlenecks (challenges, opportunities):
  Availability of good software.
  Availability of other datasets in a form that can be most readily integrated into analysis workflows.
  Computing infrastructure to do the work.
The main challenge with software:
  Software and database infrastructure requires engineering discipline and science.
  Our culture values science, not engineering.
  Commercialization path largely hasn’t worked: why? market too small? too dynamic? poisoned by open source?
  The result: a software literature full of good ideas that don’t get fully baked; tools that work in one place but aren’t portable.
  Commercialization isn’t a complete answer anyway: tools themselves are research, require open publication.
Suggested approach to better software:
  There is currently little support niche for the engineering of robust research software in biology (exceptions include NCI caBIG; NCBI).
  “Centers of excellence” in software engineering could be established to harden/productize research tools while they’re still in R&D phase: reward engineering for its own sake (compare Tech D funding at genome centers).
  Encourage commercialization of stable tools once they’ve left R&D phase: SBIR mechanisms.
  Compare Road Map NCBCs: http://www.ncbcs.org
One desired outcome:
  earlier, more widespread adoption of analysis best practices (no more using BLAST to map short reads)
  more efficient use of time and computational resources; less big iron and less global warming required
The main challenges with datasets:
  overly reliant on monolithic, overly centralized international databases
  versioning: instability of coordinate systems interferes with data integration
  poor ability to improve annotation and quality of archived data
An aggregated monolithic database makes sense if you’re going to search it all at once.
  Historically, we think in terms of the sequence databases and homology searches.
  Literature, text search also makes sense (Google).
  But does a monolithic archive make sense for all data?
  For example: is the Short Read Archive useful?
An approach for better datasets: modularity
  Do one thing well; define standards for input/output so tools can be chained together in powerful combinations.
  Akin to CERN/LHC tiered structure, where each tier digests data from previous tier, adding new information while compressing the previous.
  Example: I really don’t want the raw short reads from your ChIP-seq experiment; I want the histogram of them mapped to a reference genome, with defined methods and reliability measures.
  Modularity rather than tiers, because our data isn’t a hierarchical single experiment like the LHC.
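The ChIP-seq example can be sketched as a small reduction step: mapped read positions go in, a compact binned histogram comes out. This is a minimal illustration under stated assumptions, not a recommended pipeline: the input is assumed to be (chromosome, 0-based start) pairs already produced by some read mapper, the 100 bp bin width is arbitrary, and the tab-separated chrom/start/end/count output in the style of bedGraph is one possible exchange format rather than one named in the talk. All function names are hypothetical.

```python
# Illustrative sketch: reduce mapped read starts to a binned histogram.
# Input format, bin width and output format are assumptions.
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Bin = Tuple[str, int]  # (chromosome, bin start coordinate)


def coverage_histogram(mapped_reads: Iterable[Tuple[str, int]],
                       bin_size: int = 100) -> Dict[Bin, int]:
    """Count read start positions falling in each fixed-width bin."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts: Counter = Counter()
    for chrom, start in mapped_reads:
        bin_start = (start // bin_size) * bin_size
        counts[(chrom, bin_start)] += 1
    return dict(counts)


def to_interval_lines(hist: Dict[Bin, int], bin_size: int = 100) -> List[str]:
    """Render the histogram as sorted chrom<TAB>start<TAB>end<TAB>count lines."""
    return [
        f"{chrom}\t{bin_start}\t{bin_start + bin_size}\t{count}"
        for (chrom, bin_start), count in sorted(hist.items())
    ]


if __name__ == "__main__":
    # Hypothetical mapped read starts standing in for a real experiment.
    reads = [("chr1", 5), ("chr1", 99), ("chr1", 100), ("chr2", 250)]
    for line in to_interval_lines(coverage_histogram(reads)):
        print(line)
```

The output is much smaller than the reads it summarizes and can be passed straight to a downstream tool, which is the sense in which each module adds information while compressing what came before. A real module would also record the mapping method and reliability measures alongside the counts.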
The lowest level of modularity is supplementary material
  Supplementary material should be electronic datasets in standard exchange formats suitable for integration with other data (not an unreviewed, wordy alternative version of the same paper to circumvent page limits).
  Not an NHGRI problem; a community problem requiring consciousness-raising and commitment at journals.
  R. Gentleman, Reproducible research: a case study. Stat Appl Genet Mol Biol 4:Article2 (2005)
Consider the fate of a coding gene annotation
  top-down: International DNA databases; International protein databases; Model organism databases
  bottom-up: integration-ready data from supplementary material; Model organism databases; International protein databases; International DNA databases
Main challenge with computing infrastructure:
  Efficient large-scale analysis and data requires data centers.
  Data centers exhibit strong economies of scale, due to load balancing, space, cooling, power, staffing.
  Most individual labs cannot justify cost of an efficient data center, nor can they keep it loaded.
  NIH traditionally funds at individual lab level.
  Individual labs are wasting money on subscaled computing.
Example: the Janelia Farm data center, circa 2006:
  528 nodes (1056 cores): 480 w/ 4 GB RAM, 40 w/ 8 GB RAM, 8 w/ 64 GB RAM
  1 gigabit to each node; 10 gigabit between racks
  EMC DMX-3 + 8 EMC Celerra file servers; MPFSi (parallel NFS)
  200 TB disk; 1 PB offline backup
  crucially: entire datacenter is accessible on our desktops (no transfer lag in/out)
  2 full time staff (including one demigod)
  $3M capital expense, recurring every 3 years
  serves ~40 labs with widely mixed needs
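The figures on this slide can be turned into a rough per-lab cost. The sketch below is a back-of-the-envelope calculation only: it treats "~40 labs" as exactly 40, splits the capital cost evenly across labs, and leaves out staff, power, cooling and space, none of which are quantified in the talk. The constant and function names are illustrative.

```python
# Illustrative back-of-the-envelope figures for the data center described above.
# Even cost sharing and exactly 40 labs are assumptions.

# (node count, GB of RAM per node), as listed on the slide
NODE_GROUPS = [(480, 4), (40, 8), (8, 64)]

CAPITAL_USD = 3_000_000   # capital expense per refresh cycle
REFRESH_YEARS = 3         # refresh interval from the slide
LABS_SERVED = 40          # "~40 labs", taken as exactly 40 here


def total_nodes(groups):
    """Sum node counts across hardware groups."""
    return sum(count for count, _ in groups)


def total_ram_gb(groups):
    """Sum RAM across all nodes, in GB."""
    return sum(count * ram_gb for count, ram_gb in groups)


def capital_per_lab_per_year(capital_usd, refresh_years, labs):
    """Annualized capital cost divided evenly among labs."""
    return capital_usd / refresh_years / labs


if __name__ == "__main__":
    print("nodes:", total_nodes(NODE_GROUPS))
    print("total RAM (GB):", total_ram_gb(NODE_GROUPS))
    print("capital per lab per year (USD):",
          capital_per_lab_per_year(CAPITAL_USD, REFRESH_YEARS, LABS_SERVED))
```

Under these assumptions the hardware works out to about $25,000 per lab per year, and the three node groups do sum to the 528 nodes stated.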
Approaches to computational infrastructure:
  Enable department- or institute-level data centers (NCRR? however, requires plan for the 3-year technology refresh rate; more a consumable than a capital expense)
  Web services (“SOA”, “service-oriented architecture”)
    For certain well-defined computational tasks, a remote server can process a query and return a formatted answer.
    Includes annotation/integration problems: DAS, for example.
  Cloud computing
    For arbitrary computational tasks, you can create a virtual machine image, send it to a remote server, and have it execute there.
    “Move the compute to the data”: large datasets can be hosted.
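The case for "move the compute to the data" is largely a matter of transfer time, which a short calculation makes concrete. The sizes and link speed below are hypothetical placeholders (a 10 TB dataset, a 5 GB virtual machine image, a 1 Gbit/s link), not figures from the talk, and the model ignores latency and protocol overhead apart from an optional efficiency factor.

```python
# Illustrative sketch: time to ship a dataset versus a VM image over a link.
# All sizes and the link speed are hypothetical placeholders.

TB = 10**12  # bytes, decimal terabyte
GB = 10**9   # bytes, decimal gigabyte


def transfer_seconds(size_bytes: float, link_bits_per_second: float,
                     efficiency: float = 1.0) -> float:
    """Idealized transfer time: bits to move divided by usable bandwidth."""
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


if __name__ == "__main__":
    link = 1e9  # 1 Gbit/s, assumed
    dataset_s = transfer_seconds(10 * TB, link)
    image_s = transfer_seconds(5 * GB, link)
    print(f"dataset to compute: {dataset_s / 3600:.1f} hours")
    print(f"VM image to data:   {image_s:.0f} seconds")
```

Under these assumed numbers, moving the dataset takes about 22 hours while moving the machine image takes about 40 seconds, which is why hosting large datasets next to the compute is attractive.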
Recommendations:
  1. Develop “centers of excellence” for software engineering.
  2. Modularize the organization of databases for key genomic resources, reduce reliance on monolithic centralized archives: think tiers, except not in a hierarchy.
  3. Strengthen that modularity all the way down to the level of supplementary material in publications: reproducible methods, integration-ready results.
  4. Plan for hardware infrastructure at department, institute level.
  5. Catalyze development of web services and cloud computing resources, especially on hosted large datasets.
  6. Engage resources outside traditional biology: create “grand challenges” attractive to high-performance computing community.