
Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov contains an extensive publication list

Scientific Data Management Center: Participating Institutions
Center PI: Arie Shoshani, LBNL
DOE Laboratories co-PIs:
- Bill Gropp, Rob Ross (ANL)
- Arie Shoshani, Doron Rotem (LBNL)
- Terence Critchlow, Chandrika Kamath (LLNL)
- Nagiza Samatova, Andy White (ORNL)
Universities co-PIs:
- Mladen Vouk (North Carolina State)
- Alok Choudhary (Northwestern)
- Reagan Moore, Bertram Ludaescher (UC San Diego / SDSC)
- Calton Pu (Georgia Tech)

Phases of Scientific Exploration: Data Generation
From large-scale simulations or experiments; data grows as fast as computational power. Examples:
- HENP: 100 Teraops and 10 Petabytes by 2006
- Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); a T42 run is about 1 TB per 100-year run, so a factor of roughly 10-20 more data
Problems:
- Can't dump the data to storage fast enough - wastes compute resources
- Can't move terabytes of data over the WAN robustly - wastes the scientist's time
- Can't steer the simulation - wastes time and resources
- Need to reorganize and transform data - large data-intensive tasks slow progress

Phases of Scientific Exploration: Data Analysis
Analysis of large data volumes - all the data cannot fit in memory. Problems:
- Finding the relevant data - needs efficient indexing
- Cluster analysis - needs linear scaling
- Feature selection - needs efficient high-dimensional analysis
- Data heterogeneity - combining data from diverse sources
- Streamlining analysis steps - the output of one step must match the input of the next

Example Data Flow in TSI (Logistical Network). Courtesy: John Blondin.

Goal: Reduce the Data Management Overhead
- Efficiency - example: parallel I/O, indexing, matching storage structures to the application
- Effectiveness - example: access data by attributes, not files; facilitate massive data movement
- New algorithms - example: specialized PCA techniques to separate signals or to achieve better spatial data compression
- Enabling ad-hoc exploration of data - example: an exploratory "run and render" capability to analyze and visualize simulation output while the code is running

Approach
Use an integrated framework that:
- Provides a scientific workflow capability
- Supports data mining and analysis tools
- Accelerates storage of and access to data
Simplify data management tasks for the scientist:
- Hide details of the underlying parallel and indexing technology
- Permit assembly of modules using a simple graphical workflow description tool
The SDM framework stacks three layers between the scientific application and scientific understanding: from bottom to top, the Storage Efficient Access layer, the Data Mining & Analysis layer, and the Scientific Process Automation layer.

Technology Details by Layer

Accomplishments: Storage Efficient Access (SEA)
Parallel Virtual File System: enhancements and deployment.
Developed Parallel netCDF (sketched below):
- Enables high-performance parallel I/O to netCDF datasets
- Achieves up to a 10-fold performance improvement over HDF5
[Diagram: with serial netCDF, processes P0-P3 funnel through shared-memory communication into a single netCDF writer in front of the parallel file system; with Parallel netCDF, P0-P3 write to the parallel file system directly.]
Enhanced ROMIO:
- Provides MPI-IO access to PVFS
- Advanced parallel file system interfaces for more efficient access
Developed PVFS2:
- Adds Myrinet GM and InfiniBand support
- Improved fault tolerance and asynchronous I/O
- Offered by Dell and HP for clusters
Deployed an HPSS Storage Resource Manager (SRM) with PVFS:
- Automatic access of HPSS files to PVFS through the MPI-IO library
- SRM is a middleware component
[Figure: FLASH I/O benchmark performance before and after, 8x8x8 block sizes]
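To make the Parallel netCDF point concrete, here is a minimal PnetCDF write sketch (not from the slides): each MPI rank collectively defines the file and writes its own slice of a variable. The file name, dimension size, and variable name are illustrative; error checking is omitted for brevity.

```c
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* collective file creation and definition phase */
    int ncid, dimid, varid;
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * 4, &dimid);
    ncmpi_def_var(ncid, "temp", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* each rank writes its own contiguous 4-element slice, collectively */
    float buf[4] = { rank, rank, rank, rank };
    MPI_Offset start = (MPI_Offset)rank * 4, count = 4;
    ncmpi_put_vara_float_all(ncid, varid, &start, &count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```

The collective `_all` call is what lets the MPI-IO layer underneath (ROMIO over PVFS, in the configuration described above) merge the per-rank requests into efficient file-system accesses.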

Robust Multi-file Replication
Problem: moving thousands of files robustly
- Takes many hours
- Needs error recovery from mass storage system failures and network failures
- Solution: use Storage Resource Managers (SRMs)
Problem: too slow
- Use parallel streams
- Use concurrent transfers
- Use large FTP windows
- Pre-stage files from MSS
(A retry sketch follows below.)
[Diagram: the DataMover moves files from NCAR to an LBNL (or anywhere) disk cache. SRM-COPY requests thousands of files; SRM-GET fetches one file at a time; the SRMs get the file list from the MSS and stage/archive files, with one SRM performing reads and the other performing writes; GridFTP GET (pull mode) carries the network transfer.]
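The DataMover internals are not shown in the slides; as a hedged illustration of the error-recovery and parallel-stream ideas only, the sketch below wraps a GridFTP command-line transfer (globus-url-copy, assumed to be installed) in a retry loop. The URLs, retry count, and stream count are made up; the real DataMover drives transfers through SRM interfaces rather than shelling out.

```c
#include <stdio.h>
#include <stdlib.h>

/* Try one file up to max_retries times; returns 0 on success. */
static int transfer_with_retry(const char *src, const char *dst, int max_retries) {
    char cmd[4096];
    /* -p 4: four parallel TCP streams per file, per the slide's advice */
    snprintf(cmd, sizeof cmd, "globus-url-copy -p 4 %s %s", src, dst);
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        if (system(cmd) == 0)
            return 0;                       /* transfer succeeded */
        fprintf(stderr, "attempt %d/%d failed for %s, retrying\n",
                attempt, max_retries, src);
    }
    return -1;                              /* give up; caller can re-queue */
}

int main(void) {
    /* hypothetical source/target URLs; a real run would loop over the
     * file list obtained from the source SRM */
    return transfer_with_retry("gsiftp://source.example.gov/data/run01.nc",
                               "file:///cache/run01.nc", 3);
}
```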

Accomplishments: Data Mining and Analysis (DMA)
Developed Parallel-VTK:
- Efficient 2D/3D parallel scientific visualization for netCDF and HDF files
- Built on top of PnetCDF
Developed a "region tracking" tool (sketched below):
- For exploring 2D/3D scientific databases
- Uses bitmap technology to identify regions based on multi-attribute conditions
Implemented an Independent Component Analysis (ICA) module:
- Used for accurate signal separation
- Used for discovering key parameters that correlate with observed data
Developed highly effective data reduction:
- Achieves a 15-fold reduction with a high level of accuracy
- Uses parallel Principal Component Analysis (PCA) technology
Developed ASPECT:
- A framework that supports a rich set of pluggable data analysis tools, including all the tools above
- A rich suite of statistical tools based on the R package
[Figures: combustion region tracking; El Nino signal (red) and its estimation (blue) closely match]
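As a hedged sketch of the bitmap idea behind the region-tracking tool: one bit per mesh cell marks whether that cell satisfies a condition, and a multi-attribute condition becomes a bitwise AND over the bitmaps. The center's actual indexes are compressed and binned; the attributes, thresholds, and mesh size below are synthetic.

```c
#include <stdint.h>
#include <stdio.h>

#define NCELLS 64                      /* illustrative mesh size */
#define NWORDS ((NCELLS + 63) / 64)

/* build a bitmap of the cells where value > threshold */
static void build_bitmap(const double *v, double thresh, uint64_t *bm) {
    for (int i = 0; i < NCELLS; i++)
        if (v[i] > thresh)
            bm[i / 64] |= 1ULL << (i % 64);
}

int main(void) {
    double entropy[NCELLS], temp[NCELLS];
    for (int i = 0; i < NCELLS; i++) {  /* synthetic field data */
        entropy[i] = 20.0 * i;
        temp[i] = (i % 2) ? 900.0 : 1100.0;
    }
    uint64_t a[NWORDS] = {0}, b[NWORDS] = {0};
    build_bitmap(entropy, 1000.0, a);   /* entropy > 1000 */
    build_bitmap(temp, 1000.0, b);      /* temp > 1000 */
    /* cells satisfying both conditions: one AND covers 64 cells */
    for (int w = 0; w < NWORDS; w++) {
        uint64_t hit = a[w] & b[w];
        for (int bit = 0; bit < 64; bit++)
            if ((hit >> bit) & 1)
                printf("cell %d in region\n", w * 64 + bit);
    }
    return 0;
}
```

Adjacent hits found this way can then be grown into connected regions and matched across time steps, which is how the tool tracks a flame front.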

ASPECT Analysis Environment
Pipeline: Data Select -> Data Access -> Correlate -> Render -> Display
Example: Select (temp, pressure) From astro-data Where (step=101) and (entropy>1000); take a sample; run a pVTK filter; run an R analysis; visualize the scatter plot in QT.
Data Mining & Analysis layer: the Select Data, Take Sample, pVTK, and R analysis tools, passing data by reading and writing named buffers.
Storage Efficient Access layer: bitmap-index selection ("use bitmap (condition)", "get variables (var-names, ranges)") over Parallel netCDF and PVFS, sitting on the hardware, OS, and MSS (HPSS).

Accomplishments: Scientific Process Automation (SPA)
Unique requirements of scientific workflows:
- Moving large volumes between modules - requires tightly-coupled, efficient data movement
- Specification of granularity-based iteration, e.g. in spatio-temporal simulations a time step is a "granule" (see the sketch below)
- Support for data transformation of complex data types (including file formats, e.g. netCDF, HDF)
- Dynamic steering of the workflow by the user
- Dynamic user examination of results
Developed a working scientific workflow system:
- Automatic microarray analysis
- Uses web-wrapping tools developed by the center
- Uses the Kepler workflow engine; Kepler is an adaptation of the UC Berkeley tool Ptolemy
- Workflow steps are defined graphically, and workflow results are presented to the user
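Kepler's actual APIs are not shown in the slides, so the following is only a hedged illustration of granularity-based iteration: a driver streams one time-step "granule" at a time through downstream modules, which is what lets a user examine intermediate results while the run proceeds. All names here are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* one unit of dataflow between workflow modules */
typedef struct { int step; double *field; size_t n; } Granule;

/* hypothetical downstream modules */
static void transform(Granule *g) { (void)g; /* e.g. reorganize the step's data */ }
static void render(const Granule *g) { printf("rendered step %d\n", g->step); }

static void run_workflow(int nsteps) {
    for (int t = 0; t < nsteps; t++) {
        Granule g = { .step = t, .field = NULL, .n = 0 };
        /* each granule flows through the modules as soon as it is ready,
         * instead of waiting for the whole dataset */
        transform(&g);
        render(&g);
    }
}

int main(void) { run_workflow(5); return 0; }
```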

GUI for setting up and running workflows

Re-applying Technology
SDM technology, developed for one application, can be effectively targeted at many other applications...

Technology                   Initial Application   New Applications
Parallel NetCDF              Astrophysics          Climate
Parallel VTK                 Astrophysics          Climate
Compressed bitmaps           HENP                  Combustion, Astrophysics
Storage Resource Managers    HENP                  Astrophysics
Feature Selection            Climate               Fusion
Scientific Workflow          Biology               Astrophysics (planned)

Broad Impact of the SDM Center
Astrophysics: high-speed storage technology, parallel NetCDF, parallel VTK, and ASPECT integration software used for the Terascale Supernova Initiative (TSI) and FLASH simulations
- Tony Mezzacappa (ORNL), John Blondin (NCSU), Mike Zingale (U of Chicago), Mike Papka (ANL)
Climate: high-speed storage technology, parallel NetCDF, and ICA technology used for climate modeling projects
- Ben Santer (LLNL), John Drake (ORNL), John Michalakes (NCAR)
Combustion: compressed bitmap indexing used for fast generation of flame regions and tracking their progress over time
- Wendy Koegler, Jacqueline Chen (Sandia Lab)
[Figures: ASCI FLASH with parallel NetCDF; dimensionality reduction; region growing]

Broad Impact (cont.)
Biology: Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data
- Matt Coleman (LLNL)
High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving the data from HPSS
- Doug Olson (LBNL), Eric Hjort (LBNL), Jerome Lauret (BNL)
Fusion: a combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a Tokamak
- Keith Burrell (General Atomics)
[Figures: building a scientific workflow; dynamic monitoring of HPSS file transfers; identifying key parameters for the DIII-D Tokamak]

Goals for Years 4-5
Fully develop the integrated SDM framework:
- Implement the 3-layer framework on the SDM center facility
- Provide a way to select only the components needed
- Develop self-guiding web pages on the use of SDM components, using existing successful examples as guides
Generalize components for reuse:
- Develop general interfaces between components in the layers
- Support loosely-coupled WSDL interfaces
- Support tightly-coupled components for efficient dataflow
Integrate operation of components in the framework:
- Hide details from the user - automate parallel access and indexing
- Develop a reusable library of components that can be selected for use in the workflow system