
Data- and Compute-Driven Transformation of Modern Science Edward Seidel Assistant Director, Mathematical and Physical Sciences, NSF (Director, Office of.


1 Data- and Compute-Driven Transformation of Modern Science
Edward Seidel, Assistant Director, Mathematical and Physical Sciences, NSF (Director, Office of Cyberinfrastructure)

2 Profound Transformation of Science: Gravitational Physics
• Galileo, Newton usher in birth of modern science: c. 1600
• Problem: single “particle” (apple) in gravitational field (general 2-body problem already too hard)
• Methods
  • Data: notebooks (Kbytes)
  • Theory: driven by data
  • Computation: calculus by hand (1 Flop/s)
• Collaboration
  • 1 brilliant scientist, 1-2 students

3 Profound Transformation of Science: Collision of Two Black Holes
Science Result: The “Pair of Pants”
• Year: 1994
• Team size ~ 10
• Data produced ~ 50 Mbytes
Science Result: The “Pair of Pants”
• Year: 1972
• Team size: 1 person (S. Hawking)
• Computation: Flop/s
• Data produced: ~ Kbytes (text, hand-drawn sketch)
• 400 years later…same!

4 Move to 3D: 1000x more data!
3D Collision Science Result
• Year: 1998
• Team size ~ 15
• Data produced ~ 50 Gbytes

5 Just ahead: Complexity of Universe. LHC, Gamma-ray bursts!
• Gamma-ray bursts!
  • GR now soluble: complex problems in relativistic astrophysics
  • All energy emitted in lifetime of sun bursts out in a few seconds: what are they?! Colliding BH-NS? SN?
  • GR, hydrodynamics, nuclear physics, radiation transport, neutrinos, magnetic fields: globally distributed collab!
  • Scalable algorithms, complex AMR codes, viz, PFlops*week, PB output!
• LHC: Higgs particle?
  • ~10K scientists, 33+ countries, 25 PB
  • Planetary lab for scientific discovery!
Remote Instrument

6 Grand Challenge Communities Combine it All... Where is it going to go?
Same CI useful for black holes, hurricanes

7 Framing the Question: Science is Radically Revolutionized by CI
• Modern science
  • Data- and compute-intensive
  • Integrative
  • Multiscale
• Collaborations for Complexity
  • Individuals, groups
  • Teams, communities
• Must transition NSF CI approach to support
  • Integrative, multiscale
  • 4 centuries of constancy, 4 decades of 10^9 to 10^12 change!
…But such radical change cannot be adequately addressed with (our current) incremental approach!
We still think like this…
Students take note!

8 Data Crisis: Information Big Bang
Scientific Computing and Imaging Institute, University of Utah
• PCAST Digital Data
• NSF Experts Study
• Wired, Nature
• Industry: Storage Networking Industry Association (SNIA) 100 Year Archive Requirements Survey Report
  • “there is a pending crisis in archiving… we have to create long-term methods for preserving information, for making it available for analysis in the future.”
  • 80% of respondents: > 50 yrs; 68%: > 100 yrs

9 Explosive Trends in Data Growth
• Comparative Metagenomics
  • DNA sequencing of entire families of organisms
  • Already hundreds of TB, thousands of users
• HD Collaborations and Optiportals
  • Multichannel HD, gigapixel visualizations
• Petascale-Exascale simulation
  • They generate peta-exabytes per simulation!
• Square Kilometer Array
  • 3000 radio receivers, 1 km² area!
  • 19 countries! Possibly beginning in 201X, operational 202X
  • Data: exabyte per week! Analysis: Exaflops!
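To put the "exabyte per week" figure in perspective, the short sketch below converts it into a sustained data rate. It assumes a decimal exabyte (10^18 bytes) and a perfectly even flow over the week; the function name is made up for illustration and the result is back-of-the-envelope only.

```python
# Convert a weekly data volume into the sustained rate needed to move it.
# Assumption: 1 exabyte = 10**18 bytes (decimal), data flows evenly all week.
BYTES_PER_EXABYTE = 10**18
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def sustained_rate_bytes_per_s(exabytes_per_week: float) -> float:
    """Average bytes per second needed to move the given weekly volume."""
    return exabytes_per_week * BYTES_PER_EXABYTE / SECONDS_PER_WEEK


rate = sustained_rate_bytes_per_s(1.0)
print(f"{rate / 1e12:.2f} TB/s, {rate * 8 / 1e12:.1f} Tbit/s")
```

Under these assumptions one exabyte per week works out to roughly 1.65 terabytes every second, sustained, which is why end-to-end networking and storage sit alongside compute in the talk.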

10 Provenance in Science
• Provenance is as important as the result
• Not a new issue
  • Lab notebooks have been used for a long time
• What is new?
  • Large volumes of data
  • Complex analyses: computational processes
  • Writing notes is no longer an option…
• GC Communities require open, sharable data, standards, metadata
Figure: DNA recombination, by Lederberg (labels: When, Observed data, Annotation). Source: Juliana Freire, U of Utah
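As a rough illustration of what a machine-written "lab notebook" entry could look like, the sketch below records where a piece of data came from, a checksum of its contents, the processing steps applied, and when the record was made. The record layout, the function name, and the example values are all hypothetical; the talk does not prescribe any particular schema or tool.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(data: bytes, source: str, steps: list) -> dict:
    """Build a small provenance record for one data product (hypothetical schema)."""
    return {
        "source": source,                             # where the data came from
        "sha256": hashlib.sha256(data).hexdigest(),   # fingerprint of the exact bytes
        "size_bytes": len(data),
        "steps": list(steps),                         # processing applied, in order
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Example values are placeholders, not taken from the talk.
record = provenance_record(b"observed data", "lab-notebook-scan", ["calibrate", "annotate"])
print(json.dumps(record, indent=2))
```

Storing the checksum next to the list of steps lets a later reader confirm that they hold the same bytes the analysis was run on, which is the part handwritten notes cannot cover at large data volumes.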

11 “Data Deluge” Drives Change at NSF
• Data issues resonate the most across NSF units!
• DataNet – $100M investment in Sustainable Archive & Access Partners – development of widely accessible network of interactive data archives; driven by today’s grand challenges, integrating multiple disciplines
• INTEROP – Community-led interoperability – interdisciplinary, community approaches to combine and re-use data in ways not envisioned by their creators
• Data-intensive computing: SDSC Gordon facility
• NSF Data Policy – The “Data Working Group” (NSF-wide group of Program Directors) working to assure that data are shared within and across disciplines
There is a major shift in science towards data-intensive methods. NSF is responding…

12 NSF Vision and National CI Blueprint
[Diagram labels: Track 1, Track 2, Campus, DataNet, Software, Nets, Education]
Crisis: I need all of this to start to solve my problem!
Science is becoming unreproducible in this environment. Validation? Provenance? Reproducibility?

13 The Shift Towards Data: Implications
• All science is becoming data-dominated
  • Experiment, computation, theory
• Totally new methodologies
  • Algorithms, mathematics
  • All disciplines from science and engineering to arts and humanities
• End-to-end networking becomes critical part of CI ecosystem
  • Campuses, please note!
• How do we train “data-intensive” scientists?
• Data policy becomes critical!

14 Recent NSF Activities on Data Policy and Implementation

15 Fundamental points on data and publication policy
• Publicly funded scientific data and publications should be available, and science benefits
• There has to be a place to keep data, and a way to access it
• There needs to be an affordable, sustainable cost model for this
Open questions:
• Who pays? The NSF? The Institution? What is the cost model? What is reasonable?
• What data must be made available? Raw data? Peer reviewed?
• When is it available? 6 months? 1 year? After publication?
• Where is it placed? Author web site? Library? NSF sites?
• How long is it made available?
• How do we enforce it post-award?
There is great variability in requirements across science communities: peer review can help guide this process.

16 Changes Coming for Data!
• Long-standing NSF Policy on Data (Proposal & Award Policies & Procedures Guide)
  • “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data... created or gathered in the course of work under NSF grants”
• NSF will soon require a Data Management Plan (DMP), subject to peer review; criterion for award
  • The DMP will be in the form of a 2-page supplementary document to the proposal
  • It will not be possible to submit proposals without such a document
  • Customization by discipline, program necessary

17 Upcoming Implementation of NSF Data Policy
• Directorate-Specific Issues & Peer Review
  • Many details are implemented/enforced via peer review and Program Officer discretion, including things like embargo period, standards, etc.
  • The challenge at NSF is that “no one size fits all” so each Directorate will be responsible for its own recommendations for DMP content, appropriate institutional repositories, etc.
  • This does not address Open Access as applied

18 Electronic Access to Scientific Publications

19 Why is this Important?
• Science requires it
  • Science progress accelerated by making publications available and searchable
  • Results in one community need to easily propagate to another for multidisciplinary complex problem solving
  • Search technologies can be brought to bear
  • Publications need to be associated with rich information: videos of simulations, supporting data, simulation and analysis codes…
• Equality and broadening participation
  • Young scientists at smaller universities at a needless disadvantage without it. They may lose journal access. This hurts science and puts talent at risk
• US Administration focus on transparency and accountability

20 Current Activities
• We have begun serious discussions within NSF on these issues
  • National Science Board Committee on Data started
  • Goals similar to those for Data
• We have had numerous visits from funding agencies from around the world
  • Primary topic: what is NSF doing on OA?
• Discussion with various publishers, libraries to explore options
  • Quality of science relies on peer review systems of best journals; need a way to support OA

21 On Working with Publishers
• Quality of science, identification of talented scientists: we rely on the peer review systems of the best journals
  • NSF receives an assurance that the work done on a grant meets a standard
  • Universities use impact factors as part of their tenure and promotion process
I believe it is in the interests of science, and hence the public interest, to help journals find a viable OA business model.
Bernard Schutz, Presentation to NSF, May 2009

22 Final Remarks
• Science is becoming collaborative and data dominated
• We are accelerating efforts to advance NSF in all aspects of data
• Science requires that data be open and accessible; we are working to achieve this
• All forms of data are important, and must be more tightly connected in the future
  • Collections, software, publications
• Time is of the essence

