
1 U.S. Department of Energy Office of Science
New Opportunities for Data and Information Management: Finding the Dots, Connecting the Dots, Understanding the Dots
Raymond L. Orbach, Director, Office of Science
2006 AAAS Annual Meeting, February 19, 2006, St. Louis, MO

2 DOE Office of Science
• Supports basic research that underpins DOE missions
• Constructs and operates large scientific facilities for the U.S. scientific community
  - Accelerators, synchrotron light sources, neutron sources
• Seven Program Offices
  - Advanced Scientific Computing Research (ASCR)
  - Basic Energy Sciences (BES)
  - Biological and Environmental Research (BER)
  - Fusion Energy Sciences (FES)
  - High Energy Physics (HEP)
  - Nuclear Physics (NP)
  - Workforce Development (WD)

3 The FY 2007 President’s Request for science funding is a 14.1% increase and sets the Office of Science on a path to doubling by 2016.
An historic opportunity for our country: a renaissance for U.S. science and continued global competitiveness.

4 Data Storage Funding

                                                    FY 2006    FY 2007
Data Storage Funding Including R&D (ASCR+HEP+NP)    $34M       $37.6M

Current experiment and simulation data storage capacity for the Office of Science is about 100 petabytes and is expected to more than double by FY 2009.

5 Data Sources
Three Pillars of Scientific Discovery: Experiment, Theory, and Simulation
Two different kinds of very large data sets:
• Experimental data
  - High energy physics, environment and climate observation data, biological mass-spectrometry
  - Data needs to be retained for long term
• Simulation data
  - Astrophysics, climate, fusion, catalysis, QCD
  - From computationally expensive large simulations
  - Post processing of data using quantum Monte Carlo, analytics and graphical analysis, perturbation theory, and molecular dynamics

6 PetaCache Project
HEP Data Analysis: Beyond Data Mining
• BaBar Data Challenge:
  - 2 petabytes stored, 10-100 terabytes intense access/inquiry
  - 1–15 kilobytes (small) data objects
  - Hundreds of users, thousands of batch jobs
• PetaCache project (SLAC: David Leith and Richard Mount)
  - Revolutionize access to huge datasets: first innovative solid-state disk as intermediate storage for HPC data searches
  - 100 times smaller latency than disk
  - At least 500 times faster throughput than disk
  - Builds Feature Database structures to accelerate the retrieval of data
• Expected Impact
  - BaBar: From analyst’s idea to seeing the result, nine months becomes one day.

7 Sheer Volume of Data
• Climate
  - Now: 20-40 Terabytes/year
  - 5 years: 5-10 Petabytes/year
• Fusion
  - Now: 100 Megabytes/15 min
  - 5 years: 1000 Megabytes/2 min
Advanced Mathematics and Algorithms
• Huge dimensional space
• Combinatorial challenge
• Complicated by noisy data
• Requires high-performance computers
Providing Predictive Understanding
• Produce hydrogen-based energy
• Stabilize carbon dioxide
• Clean up and dispose of toxic waste
Finding the Dots
Understanding the Dots
Connecting the Dots in Science
ORNL: Nagiza Samatova
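As a rough check on the growth these figures imply, the short Python sketch below converts the quoted rates to common units and computes the ratios. It assumes decimal units (1 petabyte = 1000 terabytes) and that the low and high ends of the climate range scale together; the variable and function names are illustrative only.

```python
# Growth factors implied by the data rates quoted on the slide.
# Assumption: decimal units, 1 PB = 1000 TB.

def rate_mb_per_min(megabytes, minutes):
    """Average data rate in megabytes per minute."""
    return megabytes / minutes

fusion_now = rate_mb_per_min(100, 15)      # 100 MB every 15 min
fusion_future = rate_mb_per_min(1000, 2)   # 1000 MB every 2 min
fusion_growth = fusion_future / fusion_now

TB_PER_PB = 1000
climate_now_tb = (20, 40)                             # TB/year, low and high
climate_future_tb = (5 * TB_PER_PB, 10 * TB_PER_PB)   # 5-10 PB/year in TB
climate_growth = tuple(
    future / now for future, now in zip(climate_future_tb, climate_now_tb)
)

print(f"Fusion:  {fusion_now:.2f} -> {fusion_future:.0f} MB/min (x{fusion_growth:.0f})")
print(f"Climate: x{climate_growth[0]:.0f} to x{climate_growth[1]:.0f}")
```

Under these assumptions the fusion rate grows by a factor of about 75 and the climate volume by a factor of about 250 over the five-year horizon.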

8 Connecting the Dots in Combustion, Fusion, and Structural Biology

Finding the DOTS: Large-scale simulations in support of combustion grand challenges are generating terabytes of data per simulation. Of particular interest in these simulations are transient events such as ignition, extinction, and re-ignition, which are not well understood. Similar problems also exist in high-resolution, ultra-high speed images of edge turbulence in the National Spherical Torus Experiment at PPPL. In structural biology, the interaction between two proteins forming a molecular machine can be described as the set of contacting amino acid residues. The set of features is very large, and is generated by the combinations of different chemical identities, orientation patterns, and spatial arrangements of the residues.

Connecting the DOTS: In combustion, it is unclear what features in the simulation data and their nonlinear dynamic effects could be used to characterize such events. Simulations need to be carried out to explore different possibilities. In fusion, extracting features that could characterize the plasma blobs is relevant to the analysis of Poincaré sections for the particle orbits. For the two interacting proteins, the number of distinctly different variants of subunits forming the molecular machine is in the millions or billions, even after applying sophisticated filtering algorithms. The correlations between the subunits establish the connection between the dots.

Understanding the DOTS: A complete understanding of the correlations and chemical reactions inherent in the turbulent flow during combustion is still beyond our reach. In fusion, each particle orbit in a Poincaré section is generated when a particle intersects a plane perpendicular to the magnetic axis. Identifying and classifying the orbits is of significant importance in understanding and stabilizing the plasma. Multiple connectable groups of amino acids can be constructed for the interacting proteins, with probabilities giving the likelihood for each variant. Finding the "optimal" solution is important. For example, high scoring interfaces may represent a dynamic picture of the protein machine workings, or additional "ports" suitable for not-yet-discovered protein subunits and other co-factors.
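To make the Poincaré-section idea concrete, here is a minimal Python sketch. It is not the analysis used at PPPL: it follows a synthetic orbit winding around a torus and records where that orbit crosses one fixed plane. The orbit model, the radii, the winding parameter `q`, and the function names are all assumptions chosen for illustration.

```python
import math

def torus_orbit(n_steps, dt, q=1.6, r_major=3.0, r_minor=1.0):
    """Sample a synthetic orbit winding around a torus (hypothetical model).

    The toroidal angle advances at unit rate, the poloidal angle at rate 1/q.
    """
    points = []
    for i in range(n_steps):
        phi = i * dt          # toroidal angle
        theta = phi / q       # poloidal angle
        big_r = r_major + r_minor * math.cos(theta)
        points.append((big_r * math.cos(phi),
                       big_r * math.sin(phi),
                       r_minor * math.sin(theta)))
    return points

def poincare_section(points):
    """Return (R, Z) points where the orbit crosses the half-plane y = 0, x > 0."""
    hits = []
    for (x0, y0, z0), (x1, y1, z1) in zip(points, points[1:]):
        if y0 < 0.0 <= y1 and x1 > 0.0:      # upward crossing of y = 0
            s = -y0 / (y1 - y0)              # linear interpolation weight
            hits.append((x0 + s * (x1 - x0), z0 + s * (z1 - z0)))
    return hits

orbit = torus_orbit(n_steps=20000, dt=0.01)
section = poincare_section(orbit)
for big_r, z in section[:5]:
    print(f"R = {big_r:.3f}, Z = {z:.3f}")
```

For this idealized orbit every crossing lands on a circle of radius `r_minor` around `R = r_major`; in real data, the shape traced by the crossings is what distinguishes one class of orbit from another.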

9 Office of Science Decadal Data Challenge
Mathematical and Computational Challenges and Needs
“Curse of Dimensionality”: Interpretation of high dimensional data
Challenges:
• Going beyond classical Bayesian theory of probabilistic quantification to address long range and non-linear correlations between features in noisy data
• Mathematical description of complex geometric shapes in their spatial and temporal dimensions
• Enumeration and optimization of multivariate functions on complex graphs that describe relationships between identified features
• Low rank approximations and generalized separation of variables to reduce the dimension without destroying information
• New harmonic and discrete mathematics and new algorithms for fast extraction of correlations and patterns
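As a small illustration of the low-rank item above, the Python sketch below computes a rank-1 approximation of a matrix by power iteration. This is the textbook technique, not the new mathematics the slide calls for, and the test matrix (a separable function g(x)h(y) plus a small perturbation) is invented for the example.

```python
import math

def rank_one_approx(matrix, iterations=100):
    """Approximate the leading singular triplet (sigma, u, v) by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # One pass of v <- normalize(A^T (A v))
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

# Hypothetical data: a separable table g(x) * h(y) plus a small perturbation.
g = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 0.5, 0.25]
a = [[g[i] * h[j] + 0.01 * (-1) ** (i + j) for j in range(len(h))]
     for i in range(len(g))]

sigma, u, v = rank_one_approx(a)

# Relative Frobenius error of the rank-1 reconstruction sigma * u * v^T.
err = math.sqrt(sum((a[i][j] - sigma * u[i] * v[j]) ** 2
                    for i in range(len(g)) for j in range(len(h))))
total = math.sqrt(sum(x * x for row in a for x in row))
rel_err = err / total
print(f"relative error of rank-1 approximation: {rel_err:.4f}")
```

Here a 4 x 3 table is replaced by one scalar and two vectors (8 numbers instead of 12) at small relative error; the saving grows quickly with the number of dimensions, which is why separable representations are attractive for high dimensional data.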

10 Office of Science Response to the Data Challenge
The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Some of the elements of the research program are:
• Bayesian Theory: New research to develop efficient ways for dealing with both local and long-range correlations between features, including Bayesian estimators to correctly estimate the simultaneous appearance of “striking” features at precisely defined locations, and mechanisms to incorporate partial analytical models to supplement missing statistics.
• Mathematical description of complex geometric shapes: New research on the stochastic theory of shapes to classify geometric shapes in terms of stochastic models, which are essential for the rigorous comparisons needed for pattern discovery. We intend to develop high performance scalable algorithms for querying, searching, tracking, and reconstruction of high dimensional shapes from incomplete information.
• Enumeration and optimization of multivariate functions on complex graphs: New research to develop efficient methodologies for the hierarchical enumeration of composite objects, including analytical methods for dynamically constraining the search space. We intend to develop optimization methods to deal with novel spaces formed by graphs of identified features (dots) and their relationships (connections). Such spaces typically have hundreds of variables and dimensions. Additionally, we intend to develop computational libraries to efficiently handle an enormous number of possible variants through construction of subgraph indexing schemes and efficient lookup methods.
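One simple way to picture a subgraph indexing scheme with fast lookup is an inverted index from small graph fragments to the graphs that contain them. The Python sketch below is only an illustration under that assumption, not the planned Office of Science library; the toy graphs, the residue labels, and all function names are hypothetical.

```python
from collections import defaultdict

def edge_keys(labels, edges):
    """Order-independent label pairs, one per edge of a graph."""
    return {tuple(sorted((labels[a], labels[b]))) for a, b in edges}

def build_index(graphs):
    """Map each edge label pair to the ids of the graphs that contain it."""
    index = defaultdict(set)
    for graph_id, (labels, edges) in graphs.items():
        for key in edge_keys(labels, edges):
            index[key].add(graph_id)
    return index

def candidate_graphs(index, labels, edges):
    """Ids of graphs containing every edge label pair of the query pattern.

    This is a filter only: survivors still need an exact subgraph match.
    """
    keys = edge_keys(labels, edges)
    if not keys:
        return set()
    return set.intersection(*(index.get(key, set()) for key in keys))

# Hypothetical toy data: nodes are features (dots) with labels,
# edges are relationships (connections) between them.
graphs = {
    "variant_a": ({0: "LYS", 1: "ASP", 2: "GLU"}, [(0, 1), (1, 2)]),
    "variant_b": ({0: "LYS", 1: "ASP", 2: "HIS"}, [(0, 1), (0, 2)]),
    "variant_c": ({0: "GLU", 1: "HIS"}, [(0, 1)]),
}
index = build_index(graphs)

query_labels, query_edges = {0: "ASP", 1: "LYS"}, [(0, 1)]
print(sorted(candidate_graphs(index, query_labels, query_edges)))
```

The index lets a query discard most variants with a few set intersections, so the expensive exact matching runs only on the surviving candidates.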

