
I/O Analysis and Optimization for an AMR Cosmology Simulation
Jianwei Li, Wei-keng Liao, Alok Choudhary, Valerie Taylor
ECE Department, Northwestern University

ENZO Background
Purpose: Simulate the formation of a cluster of galaxies (gas and stars), starting near the big bang and running until the present day. Used to test theories of how galaxies form by comparing the results with what is actually observed in the sky today. The spatial distribution of cosmic objects is highly irregular.
Algorithm: Adaptive Mesh Refinement (AMR).
Implementation: Parallelism is achieved by domain decomposition of 3-D grids; dynamic load balancing uses MPI.
Datasets: Visualized as spatial and temporal domain refinement snapshots.
[Figure: spatial/temporal domain refinement, with temperature and density snapshots at times T0 through T3]

AMR Algorithm
A multi-scale algorithm that achieves high spatial resolution in localized regions.
Recursively produces a deep, dynamic hierarchy of increasingly refined grid patches.
[Figure: refinement hierarchy (Level 0, Level 1, Level 2), with the combined hierarchical AMR data sets shown as bounding boxes]

Execution Flow
Read the root grid and some initial pre-refined subgrids.
Partition the grid data among multiple processors.
Main loop of evolution over time:
–Solve the hydro equations to advance the solution by dt on each grid
–Recursively evolve the grid hierarchy on each level using the AMR control algorithm
–Checkpoint: write out the grid data (hierarchical simulation results) at the current time stamp
–Adaptively refine the grids and rebuild the new, finer hierarchy
–Redistribute grid data to balance the load
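To make the flow above concrete, here is a minimal sketch of such an evolution loop in C with MPI. All function names (ReadRootGrid, EvolveLevel, WriteCheckpoint, and so on) are hypothetical placeholders standing in for the corresponding ENZO stages, not the actual ENZO API, and the stop time and time step are arbitrary example values.

/* Minimal sketch of the ENZO-style evolution loop described above.
 * All functions are hypothetical placeholders, not the actual ENZO code. */
#include <mpi.h>

static void   ReadRootGrid(void)              { /* root grid + pre-refined subgrids */ }
static void   PartitionGrids(void)            { /* distribute grid data among processes */ }
static double EvolveLevel(int level, double dt) { (void)level; return dt; } /* recursive AMR control */
static void   WriteCheckpoint(double t)       { (void)t; }  /* dump grid data + metadata */
static void   RebuildHierarchy(void)          { }           /* adaptive refinement */
static void   LoadBalance(void)               { }           /* redistribute grids */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    ReadRootGrid();
    PartitionGrids();

    double t = 0.0, t_final = 1.0, dt = 0.01;   /* example values */
    while (t < t_final) {
        dt = EvolveLevel(0, dt);   /* advance the solution, recursing over levels */
        t += dt;
        WriteCheckpoint(t);        /* hierarchical simulation results at time t */
        RebuildHierarchy();        /* refine grids, rebuild the finer hierarchy */
        LoadBalance();             /* move grids to balance the work */
    }

    MPI_Finalize();
    return 0;
}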

Parallelization
A hierarchical data structure, containing grid metadata and grid hierarchy information, is maintained on all processors.
Each hierarchy node points to a real grid that resides only on its allocated processor.
A grid is owned by only one processor, but one processor can own many grids. Each processor performs computation on its own grids.
The top grid is partitioned into multiple grids among all processors.
Subgrids are distributed among all processors; some subgrids can even be partitioned and redistributed during load balancing.
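A hedged sketch of what such a replicated hierarchy node might look like follows; the struct and all field names are illustrative assumptions, not the actual ENZO data structures.

/* Hypothetical sketch of a grid-hierarchy node replicated on all processors.
 * Field names are illustrative, not taken from ENZO. */
typedef struct GridNode {
    int    level;              /* refinement level of this grid patch        */
    int    owner_rank;         /* MPI rank that holds the real grid data     */
    int    dims[3];            /* grid dimensions                            */
    double left_edge[3];       /* bounding box of the patch in the domain    */
    double right_edge[3];
    struct GridNode *parent;   /* coarser parent grid                        */
    struct GridNode *child;    /* first child on the next finer level        */
    struct GridNode *sibling;  /* next grid on the same level                */
    void  *local_data;         /* non-NULL only on owner_rank                */
} GridNode;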

ENZO Datasets
There are three kinds of datasets involved in ENZO I/O operations: the Baryon Fields and Particle Datasets contained in each grid, and the Boundary data.
● Baryon Fields (Gas): a number of 3-D float-type arrays (Density Field, Energy Fields, Velocity [X, Y, Z] Fields, Temperature Field, Dark Matter Field, ...)
● Particle Datasets (Stars): a number of 1-D float-type arrays and one integer array (Particle ID, Particle Position [X, Y, Z], Particle Velocity [X, Y, Z], Particle Mass, Particle Attributes, ...)
● Boundary data: boundary types and values on each boundary side of each dimension for each Baryon Field; a number of 2-D float-type arrays.
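As a rough illustration of how these per-grid datasets fit together, here is a hedged C sketch; the struct layout, field count, and names are assumptions for illustration only, not the actual ENZO declarations.

/* Illustrative sketch of the per-grid datasets listed above.
 * Layout and names are assumptions, not the actual ENZO code. */
#define NUM_BARYON_FIELDS 6   /* e.g. density, energies, velocity x/y/z */

typedef struct Grid {
    int    dims[3];                          /* 3-D extent of this grid       */
    float *baryon_field[NUM_BARYON_FIELDS];  /* 3-D float arrays, flattened   */

    int    num_particles;
    int   *particle_id;                      /* 1-D integer array             */
    float *particle_pos[3];                  /* 1-D float arrays, X/Y/Z       */
    float *particle_vel[3];
    float *particle_mass;
    float *particle_attr;                    /* additional attributes         */
} Grid;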

Data Partition
Baryon Field datasets are partitioned in a (Block, Block, Block) fashion.
Particle datasets: the 1-D particle data arrays are partitioned based on which grid sub-domain each particle position falls within, so the access pattern is totally irregular.
Boundary data are maintained on all processors and are not partitioned.
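To make the (Block, Block, Block) partition concrete, the following hedged sketch builds an MPI subarray datatype describing one process's block of a 3-D Baryon Field. The global size (128^3) and the 2x2x2 process grid are made-up example values, not ENZO's.

/* Hedged sketch: describe one process's (Block, Block, Block) piece of a
 * global 3-D field with an MPI subarray datatype.  Assumes exactly 8 ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int gsizes[3] = {128, 128, 128};   /* global field dimensions */
    int psizes[3] = {2, 2, 2};         /* 2x2x2 process grid      */
    int lsizes[3], starts[3], coords[3];

    /* map this rank onto the process grid */
    coords[0] =  rank / (psizes[1] * psizes[2]);
    coords[1] = (rank /  psizes[2]) % psizes[1];
    coords[2] =  rank %  psizes[2];

    for (int d = 0; d < 3; d++) {
        lsizes[d] = gsizes[d] / psizes[d];   /* local block size       */
        starts[d] = coords[d] * lsizes[d];   /* offset in global array */
    }

    MPI_Datatype block_type;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_FLOAT, &block_type);
    MPI_Type_commit(&block_type);

    /* block_type can now serve as the filetype in MPI_File_set_view for
       collective reads/writes of this field (see the MPI-IO slide below). */

    MPI_Type_free(&block_type);
    MPI_Finalize();
    return 0;
}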

I/O Patterns
Real data
–Read from the initialized physical variables of the starting grids
–Periodically write out the physical solutions of all grids: during the simulation loop, all grids are dumped to files in time sequence (checkpointing)
–The datasets are read/written in a fixed, pre-defined sequence order
–Even the access to the grid hierarchy follows a certain sequence
Metadata
–Metadata of the grid hierarchy is written out recursively
–Metadata of each dataset is written out together with the grid's real data
Visualizing data
–In ENZO, there is no separate visualization data written out directly by the simulation
–Visualization is performed by another program that takes the checkpointed grid data (real data and metadata) as input and performs projection

I/O Approach
Original approach:
–Real data is read/written as one grid per file by the allocated processor
–The initial grids are read by one processor and then partitioned among all processors
–For the partitioned top grid, the data is first combined from the other processors and then written out by the root processor
–Grid hierarchy metadata is written to text files by the root processor
–Grid dataset metadata is written to the same grid file
Other choices:
–Store the real data in one single file with parallel I/O access
–Generate visualization data directly, without re-reading and processing the checkpointed grid files
Advantages:
–Parallel I/O largely improves the I/O performance
–Storing all real data in one single file makes pre-fetching easier
–Visualization is real time; re-reading and processing a large number of distributed grid files is very time consuming
Disadvantages:
–Managing the metadata needs a lot more work, and parallel I/O is not so easy to implement

Sequential HDF4 I/O
This is the original I/O implementation:
–Each grid (real data and metadata) is read/written sequentially and independently by its allocated processor, using the HDF4 I/O library
–The grid hierarchy metadata is written to a separate ASCII file using the native I/O library
Advantages:
–HDF4 provides a self-describing data format, with metadata stored together with the real data in the same file
Disadvantages:
–Does not provide parallel I/O facilities, so performance is low
–Cannot combine datasets into a single file, or extra time must be spent explicitly combining datasets in memory before writing them to a file
–Storing the metadata with the real data brings some overhead and makes access to the real data inefficient
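For reference, here is a minimal sketch of how one grid dataset might be written sequentially through the HDF4 SD interface. The file name, dataset name, and array size are made-up examples, not the actual ENZO code.

/* Hedged sketch of a sequential HDF4 (SD interface) write of one 3-D
 * Baryon Field.  Names and sizes are examples only. */
#include "mfhdf.h"
#include <stdlib.h>

int main(void)
{
    int32 dims[3] = {32, 32, 32};
    float *density = calloc(32 * 32 * 32, sizeof(float));

    int32 sd_id  = SDstart("grid0001.hdf", DFACC_CREATE);
    int32 sds_id = SDcreate(sd_id, "Density", DFNT_FLOAT32, 3, dims);

    int32 start[3] = {0, 0, 0};             /* write the whole array       */
    SDwritedata(sds_id, start, NULL, dims, density);

    SDendaccess(sds_id);                    /* one grid per file, written  */
    SDend(sd_id);                           /* sequentially by its owner   */

    free(density);
    return 0;
}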

Native I/O
Advantages:
–Flexible: any parallel I/O technique can be applied at the application level
–Performance can potentially be very good
Disadvantages:
–Implementation is non-trivial; the user has to handle a lot of tasks
–Hard for the programmer to manage the metadata
–Lots of work is needed to achieve good performance
–Platform dependent

Parallel I/O using MPI-IO
Advantages:
–Parallel I/O access
–Collective I/O
–Easy to implement application-level two-phase I/O
–High performance is expected
–Easy to combine grids into a single file
–Also easy to directly write visualization data
Disadvantages:
–Only beneficial for raw data; metadata is usually small, and reading or writing metadata using MPI-IO can only make performance worse
–Needs more implementation effort to manage the metadata separately
One possible solution: use XML to manage the hierarchical metadata.
[Figures: collective read of a Baryon dataset with two-phase I/O; application-level two-phase I/O for a particle dataset]
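A hedged sketch of a collective read like the one illustrated above, combining a subarray filetype (as on the Data Partition slide) with MPI_File_set_view and MPI_File_read_all. The file name, global size, and 1-D decomposition are example assumptions, not the actual ENZO layout.

/* Hedged sketch: collective read of one 3-D Baryon Field from a single
 * shared file.  Assumes the number of processes divides 128. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* example: 1-D decomposition along the first dimension */
    int gsizes[3] = {128, 128, 128};
    int lsizes[3] = {128 / nprocs, 128, 128};
    int starts[3] = {rank * lsizes[0], 0, 0};

    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    float *density = malloc((size_t)lsizes[0] * lsizes[1] * lsizes[2]
                            * sizeof(float));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "baryon_fields.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* each process sees only its own block of the field */
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);
    /* collective (two-phase) read: the MPI-IO layer merges the requests */
    MPI_File_read_all(fh, density,
                      lsizes[0] * lsizes[1] * lsizes[2],
                      MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(density);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}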

Parallel I/O using HDF5
Advantages:
–Self-describing data format; easy to manage metadata
–Parallel I/O access on top of MPI-IO
–Groups and datasets are organized in a hierarchical/tree structure, making it easy to combine grids into a single file
Disadvantages:
–Implementation overhead: packing and unpacking hyperslabs is currently handled recursively, which takes a relatively long time
–Some features are not yet complete: creating and closing datasets are collective operations, which adds synchronization to parallel I/O access; adding/changing attributes can only be done on processor 0, which also limits the parallel performance of writing real data
–Mixing the metadata with the real data leaves the real data unaligned on appropriate boundaries, which results in high variance in access time between processes
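For comparison with the MPI-IO sketch, here is a hedged sketch of a collective write of one field through parallel HDF5 on top of MPI-IO. The file name, dataset name, sizes, and 1-D decomposition are example assumptions.

/* Hedged sketch: collective write of one 3-D field with parallel HDF5.
 * Assumes the number of processes divides 128. */
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* example: 1-D decomposition along the first dimension */
    hsize_t gdims[3] = {128, 128, 128};
    hsize_t ldims[3] = {128 / (hsize_t)nprocs, 128, 128};
    hsize_t start[3] = {(hsize_t)rank * ldims[0], 0, 0};

    float *density = calloc((size_t)(ldims[0] * ldims[1] * ldims[2]),
                            sizeof(float));

    /* open one shared file for parallel access via MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("grids.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* dataset creation is collective: every process must call it */
    hid_t filespace = H5Screate_simple(3, gdims, NULL);
    hid_t dset = H5Dcreate2(file, "Density", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* each process selects its own hyperslab of the file dataspace */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, ldims, NULL);
    hid_t memspace = H5Screate_simple(3, ldims, NULL);

    /* collective data transfer */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, density);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    free(density);
    MPI_Finalize();
    return 0;
}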

Performance Evaluation on Origin2000
[Table: amount of data read/written by the ENZO cosmology simulation for different problem sizes (AMR64, AMR128, AMR256); Read is 2.78 MB for AMR64 and 22.15 MB for AMR128; the remaining values were not recovered from the slide]
[Figure: I/O performance of the ENZO application on SGI Origin2000 with XFS]
Result: significant I/O performance improvement of MPI-IO over HDF4 I/O.

Performance Evaluation on IBM SP (Using GPFS)
[Figure: I/O performance of the ENZO application on IBM SP-2 with GPFS]
The performance of our parallel I/O using MPI-IO is worse than that of the original HDF4 I/O. This happens because the data access pattern in this application does not fit well with the disk file striping and distribution pattern of the parallel file system. Each process may access small chunks of data while the physical distribution of the file on the disks uses a very large striping size; the chunks of data requested by one process may span multiple I/O nodes, or multiple processes may try to access data on a single I/O node.

Performance Evaluation on Linux Cluster (Using PVFS)
[Figure: I/O performance of the ENZO application on a Linux cluster with PVFS (8 compute nodes and 8 I/O nodes)]
Like GPFS, PVFS under MPI-IO uses a fixed striping scheme, specified by striping parameters at setup time, and the physical data partition pattern is also fixed, hence not tailored to specific parallel I/O applications. More importantly, the striping and partition patterns are uniform across the I/O nodes, which is good for efficient utilization of disk space but not flexible, hence not good enough for performance. For various types of access patterns, especially those in which each process accesses a large number of strided, small data chunks, there may be significant skew between the application access pattern and the physical file partition/distribution pattern, so the communication overhead (between compute nodes and I/O nodes) of using parallel I/O may be very large.
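Where the file system allows it, striping parameters can at least be communicated to the MPI-IO layer through hints at file-open time. The sketch below uses the standard ROMIO hint names "striping_factor", "striping_unit", and "cb_nodes"; the values and the file name are arbitrary examples, and whether the hints take effect depends on the underlying file system.

/* Hedged sketch: passing striping-related hints to MPI-IO at file creation.
 * Hint values and the file name are arbitrary examples; the file system may
 * honor or ignore them. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");       /* stripe over 8 I/O nodes */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripe size        */
    MPI_Info_set(info, "cb_nodes", "8");              /* collective-buffering aggregators */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "grids.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes as in the MPI-IO sketch above ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}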

Performance Evaluation on Linux Cluster (Using Local Disk)
[Figure: I/O performance of the ENZO application on a Linux cluster with each compute node accessing its local disk through the PVFS interface]
The I/O operations of each compute node are performed on its local disk; the only overhead of MPI-IO is the user-level communication among compute nodes. As expected, MPI-IO has much better overall performance than HDF4 sequential I/O, and it scales well with an increasing number of processors. However, unlike real PVFS, which generates integrated files, the file system used in this experiment keeps no metadata about the partitioned file, and there is no way to extract the distributed output files for other applications to use.

Performance Evaluation: HDF5
[Figure: comparison of I/O write performance, HDF5 I/O vs. MPI-IO, on SGI Origin2000]
The performance of HDF5 I/O is much worse than we expected. Although it uses MPI-IO for its parallel I/O access and has optimizations based on access patterns and other metadata, the overhead of HDF5 is very significant, as discussed above.