U.S. Department of Energy Office of Science U.S. Department of Energy Office of Science New Opportunities for Data and Information Management: Finding.

Presentation transcript:

U.S. Department of Energy Office of Science
New Opportunities for Data and Information Management: Finding the Dots, Connecting the Dots, Understanding the Dots
Raymond L. Orbach, Director, Office of Science
2006 AAAS Annual Meeting, February 19, 2006, St. Louis, MO

DOE Office of Science
• Supports basic research that underpins DOE missions
• Constructs and operates large scientific facilities for the U.S. scientific community
  - Accelerators, synchrotron light sources, neutron sources
• Seven Program Offices
  - Advanced Scientific Computing Research (ASCR)
  - Basic Energy Sciences (BES)
  - Biological and Environmental Research (BER)
  - Fusion Energy Sciences (FES)
  - High Energy Physics (HEP)
  - Nuclear Physics (NP)
  - Workforce Development (WD)

The FY 2007 President’s Request for science funding is a 14.1% increase and sets the Office of Science on a path to doubling by 2016.
An historic opportunity for our country – a renaissance for U.S. science and continued global competitiveness.

Data Storage Funding
                                                      FY 2006    FY 2007
Data storage funding including R&D (ASCR+HEP+NP)      $34M       $37.6M
Current experiment and simulation data storage capacity for the Office of Science is about 100 petabytes and is expected to more than double by FY 2009.

Data Sources
Three Pillars of Scientific Discovery: Experiment, Theory, and Simulation
Two different kinds of very large data sets:
• Experimental data
  - High energy physics, environment and climate observation data, biological mass-spectrometry
  - Data needs to be retained for long term
• Simulation data
  - Astrophysics, climate, fusion, catalysis, QCD
  - From computationally expensive large simulations
  - Post processing of data using quantum Monte Carlo, analytics and graphical analysis, perturbation theory, and molecular dynamics

PetaCache Project
HEP Data Analysis: Beyond Data Mining
• BaBar Data Challenge:
  - 2 petabytes stored, terabytes intense access/inquiry
  - 1–15 kilobytes (small) data objects
  - Hundreds of users, thousands of batch jobs
• PetaCache project (SLAC: David Leith and Richard Mount)
  - Revolutionize access to huge datasets: first innovative solid-state disk as intermediate storage for HPC data searches
  - 100 times smaller latency than disk
  - At least 500 times faster throughput than disk
  - Builds Feature Database structures to accelerate the retrieval of data
• Expected Impact
  - BaBar: From analyst’s idea to seeing the result – nine months becomes one day.

Finding the Dots, Understanding the Dots, Connecting the Dots in Science (ORNL: Nagiza Samatova)
Sheer Volume of Data
• Climate
  - Now: Terabytes/year
  - 5 years: 5-10 Petabytes/year
• Fusion
  - Now: 100 Megabytes/15 min
  - 5 years: 1000 Megabytes/2 min
Advanced Mathematics and Algorithms
• Huge dimensional space
• Combinatorial challenge
• Complicated by noisy data
• Requires high-performance computers
Providing Predictive Understanding
• Produce hydrogen-based energy
• Stabilize carbon dioxide
• Clean and dispose toxic waste

Connecting the Dots in Combustion, Fusion, and Structural Biology
Finding the DOTS – Large-scale simulations in support of combustion grand challenges are generating terabytes of data per simulation. Of particular interest in these simulations are transient events such as ignition, extinction, and re-ignition, which are not well understood. Similar problems also exist in high-resolution, ultra-high speed images of edge turbulence in the National Spherical Torus Experiment at PPPL. In structural biology, the interaction between two proteins forming a molecular machine can be described as the set of contacting amino acid residues. The set of features is very large, and is generated by the combinations of different chemical identities, orientation patterns, and spatial arrangements of the residues.
Connecting the DOTS – In combustion, it is unclear which features in the simulation data, and which of their nonlinear dynamic effects, could be used to characterize such events. Simulations need to be carried out to explore different possibilities. In fusion, extracting features that could characterize the plasma blobs is relevant to the analysis of Poincaré sections for the particle orbits. For the two interacting proteins, the number of distinctly different variants of subunits forming the molecular machine is in the millions or billions, even after applying sophisticated filtering algorithms. The correlations between the subunits establish the connection between the dots.
Understanding the DOTS – A complete understanding of the correlations and chemical reactions inherent in the turbulent flow during combustion is still beyond our reach. In fusion, each particle orbit in a Poincaré section is generated when a particle intersects a plane perpendicular to the magnetic axis. Identifying and classifying the orbits is of significant importance in understanding and stabilizing the plasma. Multiple connectable groups of amino acids can be constructed for the interacting proteins, with probabilities giving the likelihood for each variant. Finding the "optimal" solution is important. For example, high scoring interfaces may represent a dynamic picture of the protein machine workings, or additional "ports" suitable for not-yet-discovered protein subunits and other co-factors.
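
The Poincaré section described in this slide can be illustrated with a small sketch. It is not taken from the talk: the circular-torus geometry, the function names poincare_section and toy_orbit, and every parameter value are assumptions chosen only to show the idea. The orbit is a made-up curve that winds around a torus, and the section plane is the half-plane y = 0, x > 0, which is perpendicular to the circular axis at that location. Crossings are located by linear interpolation between consecutive samples.

```python
import math

def poincare_section(points):
    """Collect (R, Z) points where an orbit crosses the half-plane y = 0, x > 0.

    Only crossings from y < 0 to y >= 0 are kept, so each toroidal turn
    contributes at most one point.
    """
    crossings = []
    for (x0, y0, z0), (x1, y1, z1) in zip(points, points[1:]):
        if y0 < 0.0 <= y1 and x0 > 0.0 and x1 > 0.0:
            s = -y0 / (y1 - y0)  # fraction of the step at which y reaches 0
            crossings.append((x0 + s * (x1 - x0), z0 + s * (z1 - z0)))
    return crossings

def toy_orbit(major_radius=3.0, minor_radius=1.0, winding=0.3,
              turns=20, steps_per_turn=2000):
    """Hypothetical orbit winding around a circular torus (not a real plasma)."""
    points = []
    for i in range(turns * steps_per_turn):
        phi = 2.0 * math.pi * (i + 0.5) / steps_per_turn  # toroidal angle
        theta = winding * phi                              # poloidal angle
        big_r = major_radius + minor_radius * math.cos(theta)
        points.append((big_r * math.cos(phi),
                       big_r * math.sin(phi),
                       minor_radius * math.sin(theta)))
    return points

section = poincare_section(toy_orbit())
```

For this toy orbit the section points trace a circle of radius minor_radius around (major_radius, 0). In real data, the shape traced by the points is the kind of pattern that the orbit classification mentioned above would have to recognize.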

Office of Science Decadal Data Challenge
Mathematical and Computational Challenges and Needs
“Curse of Dimensionality” – Interpretation of high dimensional data
Challenges:
• Going beyond classical Bayesian theory of probabilistic quantification to address long range and non-linear correlations between features in noisy data
• Mathematical description of complex geometric shapes in their spatial and temporal dimensions
• Enumeration and optimization of multivariate functions on complex graphs that describe relationships between identified features
• Low rank approximations and generalized separation of variables to reduce the dimension without destroying information
• New harmonic and discrete mathematics and new algorithms for fast extraction of correlations and patterns
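
As a rough illustration of the low rank approximation item above, the sketch below finds a rank-one approximation of a small matrix by power iteration. This is a textbook technique used here as a stand-in, not the method proposed in the talk. The function names and the example matrix are assumptions, and a production code would normally call a library SVD routine such as numpy.linalg.svd instead.

```python
import math

def rank_one_approximation(matrix, iterations=200):
    """Return (sigma, u, v) so that sigma * u[i] * v[j] approximates matrix[i][j].

    Uses power iteration. The uniform starting vector is an assumption and
    can fail if it is orthogonal to the leading right singular vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    sigma = 0.0
    for _ in range(iterations):
        # u <- A v, normalized to unit length
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        # v <- A^T u; its length is the current singular value estimate
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        if sigma == 0.0:
            break
        v = [x / sigma for x in v]
    return sigma, u, v

def reconstruct(sigma, u, v):
    """Rebuild the dense rank-one matrix from its factors."""
    return [[sigma * ui * vj for vj in v] for ui in u]

# Example: a 3 x 4 matrix that is exactly rank one (outer product of two vectors).
example = [[a * b for b in (4.0, 5.0, 6.0, 7.0)] for a in (1.0, 2.0, 3.0)]
sigma, u, v = rank_one_approximation(example)
approx = reconstruct(sigma, u, v)
```

The factors hold rows + cols + 1 numbers instead of rows * cols, which is the dimension reduction the item refers to. Whether information is "destroyed" depends on how much of the matrix lies outside the retained rank.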

Office of Science Response to the Data Challenge
The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Some of the elements of the research program are:
• Bayesian Theory – New research to develop efficient ways for dealing with both local and long-range correlations between features, including Bayesian estimators to correctly estimate the simultaneous appearance of “striking” features at precisely defined locations, and mechanisms to incorporate partial analytical models to supplement missing statistics.
• Mathematical description of complex geometric shapes – New research on the stochastic theory of shapes to classify geometric shapes in terms of stochastic models, which are essential for the rigorous comparisons needed for pattern discovery. We intend to develop high performance scalable algorithms for querying, searching, tracking, and reconstruction of high dimensional shapes from incomplete information.
• Enumeration and optimization of multivariate functions on complex graphs – New research to develop efficient methodologies for the hierarchical enumeration of composite objects, including analytical methods for dynamically constraining the search space. We intend to develop optimization methods to deal with novel spaces formed by graphs of identified features (dots) and their relationships (connections). Such spaces typically have hundreds of variables and dimensions. Additionally, we intend to develop computational libraries to efficiently handle an enormous number of possible variants through construction of subgraph indexing schemes and efficient lookup methods.
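
One simple way to picture a subgraph indexing scheme with fast lookup is an inverted index from connections to the variants that contain them. The sketch below is an assumption for illustration only, not the library the talk proposes. It supposes that every dot has a unique name, so that a pattern is contained in a variant exactly when all of the pattern's connections appear in it. The names edge_key, build_edge_index, find_variants, and the sample data are hypothetical.

```python
from collections import defaultdict

def edge_key(a, b):
    """Order-independent key for an undirected connection between two dots."""
    return (a, b) if a <= b else (b, a)

def build_edge_index(variants):
    """Map each connection to the set of variant ids that contain it."""
    index = defaultdict(set)
    for variant_id, edges in variants.items():
        for a, b in edges:
            index[edge_key(a, b)].add(variant_id)
    return index

def find_variants(index, pattern_edges):
    """Return ids of variants that contain every connection in the pattern."""
    postings = [index.get(edge_key(a, b), set()) for a, b in pattern_edges]
    if not postings:
        return set()
    postings.sort(key=len)  # intersect the rarest connections first
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical variants: each is a list of connections between named dots.
variants = {
    "v1": [("A", "B"), ("B", "C")],
    "v2": [("A", "B"), ("C", "D")],
    "v3": [("B", "A"), ("B", "C"), ("C", "D")],
}
index = build_edge_index(variants)
matches = find_variants(index, [("A", "B"), ("C", "B")])  # {"v1", "v3"}
```

The lookup touches only the variants listed under the pattern's connections rather than scanning every variant. With unnamed or repeated dot labels, the same index would only narrow the candidates, and a full subgraph-matching step would still be needed.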