IO Best Practices For Franklin
Katie Antypas
User Services Group
Kantypas@lbl.gov
NERSC User Group Meeting
September 19, 2007

NERSC User Group Meeting, September 17, 2007
Outline
Goals and scope of tutorial
IO Formats
Parallel IO strategies
Striping
Recommendations
Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, Nick Cardo for Franklin/Lustre tutorials and the NERSC-IO group for feedback

Goals
Give a very high-level answer to the question “How should I do my IO on Franklin?”
With X GB of data to output, running on Y processors: do this.

Axes of IO
Striping
Total Output Size
Number of Processors
IO Library
File Size Per Processor
Chunking
Number of Files per Output Dump
Blocksize
Transfer Size
File System Hints
Strided or Contiguous Access
Collective vs Independent
Weak vs Strong Scaling
This is why IO is complicated...

Axes of IO
Striping
Total File Size
Number of Processors
IO Library
File Size Per Processor
Number of Writers
Blocksize
Transfer Size
Primarily large block IO, transfer size same as blocksize
Used HDF5
Some Basic Tips
Strong Scaling

Parallel I/O: A User Perspective
Wish List
–Write data from multiple processors into a single file
–File can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g. want to see the logical data layout, not the physical layout)
–Do so with the same performance as writing one file per processor (users only write one file per processor because of performance problems)
–And make all of the above portable from one machine to the next

I/O Formats

Common Storage Formats
ASCII
–Slow
–Takes more space!
–Inaccurate
Binary
–Non-portable (e.g. byte ordering and type sizes)
–Not future proof
–Parallel I/O using MPI-IO
Self-Describing formats
–NetCDF/HDF4, HDF5, Parallel NetCDF
–Example in HDF5: API implements Object DB model in portable file
–Parallel I/O using: pHDF5/pNetCDF (hides MPI-IO)
Community File Formats
–FITS, HDF-EOS, SAF, PDB, Plot3D
–Modern implementations built on top of HDF, NetCDF, or other self-describing object-model API
Many NERSC users at this level. We would like to encourage users to transition to a higher-level IO library
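
The ASCII and raw-binary drawbacks listed above can be seen in a few lines. This is only an illustrative sketch: the sample value and the six-digit text format are assumptions chosen for the example, not something taken from the talk.

```python
import struct

value = 0.1234567890123456  # hypothetical sample value

# ASCII: a short text format drops digits, and a full-precision
# text form takes more room than the 8-byte binary double.
short_text = "%.6g" % value
full_text = repr(value)
binary = struct.pack("<d", value)  # 8-byte little-endian double

ascii_is_inexact = float(short_text) != value
ascii_is_larger = len(full_text) > len(binary)

# Raw binary: the same 4 bytes decode to different integers
# depending on the byte order the reader assumes.
raw = struct.pack("<i", 1)               # written as little-endian
read_little = struct.unpack("<i", raw)[0]  # read back correctly
read_big = struct.unpack(">i", raw)[0]     # misread as big-endian
```

A self-describing format records the byte order and type sizes in the file, so the reader does not have to guess them the way the last line does.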

HDF5 Library
HDF5 is a general purpose library and file format for storing scientific data
Can store data structures, arrays, vectors, grids, complex data types, text
Can use basic HDF5 types (integers, floats, reals) or user defined types such as multi-dimensional arrays, objects and strings
Stores metadata necessary for portability: endian type, size, architecture

HDF5 Data Model
Groups
–Arranged in directory hierarchy
–Root group is always ‘/’
Datasets
–Dataspace
–Datatype
Attributes
–Bind to Group & Dataset
References
–Similar to softlinks
–Can also be subsets of data
[Figure: example HDF5 file tree with nodes “/” (root), “Dataset0” (type, space), “Dataset1” (type, space), “subgrp”, “Dataset0.1” (type, space) and “Dataset0.2” (type, space), and attributes “time”=0.2345, “validity”=None, “author”=Jane Doe and “date”=10/24/2006]

A Plug for Self Describing Formats...
Application developers shouldn’t care about the physical layout of data
Using your own binary file format forces the user to understand the layers below the application to get optimal IO performance
Every time code is ported to a new machine, or the underlying file system is changed or upgraded, the user is required to make changes to improve IO performance
Let other people do the work
–HDF5 can be optimized for given platforms and file systems by HDF5 developers
–User can stay with the high level
But what about performance?

IO Library Overhead
Very little, if any, overhead from HDF5 for one file per processor IO compared to POSIX and MPI-IO
Data from Hongzhang Shan

Ways to do Parallel IO

Serial I/O
[Figure: processors 0 to 5 and a single file]
Each processor sends its data to the master, who then writes the data to a file
Advantages
–Simple
–May perform ok for very small IO sizes
Disadvantages
–Not scalable
–Not efficient, slow for any large number of processors or data sizes
–May not be possible if memory constrained

Parallel I/O Multi-file
[Figure: processors 0 to 5, each with its own file]
Each processor writes its own data to a separate file
Advantages
–Simple to program
–Can be fast (up to a point)
Disadvantages
–Can quickly accumulate many files
–With Lustre, can hit the metadata server limit
–Hard to manage
–Requires post processing
–Difficult for storage systems such as HPSS to handle many small files

Flash Center IO Nightmare…
Large 32,000 processor run on LLNL BG/L
Parallel IO libraries not yet available
Intensive I/O application
–checkpoint files: 0.7 TB, dumped every 4 hours, 200 dumps
  –used for restarting the run
  –full resolution snapshots of entire grid
–plotfiles: 20GB each, 700 dumps
  –coarsened by a factor of two averaging
  –single precision
  –subset of grid variables
–particle files
  –1400 particle files, 470MB each
154 TB of disk capacity
74 million files!
Unix tool problems
2 years later, still trying to sift through data and sew files together

Parallel I/O Single-file
[Figure: processors 0 to 5 writing to one file]
Each processor writes its own data to the same file using MPI-IO mapping
Advantages
–Single file
–Manageable data
Disadvantages
–Lower performance than one file per processor at some concurrencies

Parallel IO single file
[Figure: processors 0 to 5, each writing a section of one shared array of data]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write
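
One common way to obtain the offset each processor needs is an exclusive prefix sum over the per-processor element counts. The sketch below computes it serially in plain Python; the counts are hypothetical, and the function name is made up for illustration. In an MPI code the same value is typically obtained with a collective such as MPI_Exscan.

```python
from itertools import accumulate

def write_offsets(counts):
    """Return the starting element offset for each rank.

    counts[r] is the number of elements rank r will write.
    Rank r starts where ranks 0..r-1 end (exclusive prefix sum).
    """
    return list(accumulate(counts, initial=0))[:-1]

counts = [3, 5, 2, 4]            # hypothetical element counts, ranks 0..3
offsets = write_offsets(counts)  # element offsets into the shared array
total_elements = sum(counts)     # size of the whole shared array
```

Multiplying an element offset by the element size in bytes gives the byte offset into the shared file.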

Trade offs
It isn’t hard to have speed, portability or usability. It is hard to have speed, portability and usability in the same implementation
Ideally users want speed, portability and usability
–speed: one file per processor
–portability: high level IO library
–usability: single shared file and own file format or community file format layered on top of high level IO library

Benchmarking Methodology and Results

Disclaimer
IO runs done during production time
Rates dependent on other jobs running on the system
Focus on trends rather than one or two outliers
Some tests were run twice, others only once

Peak IO Performance on Franklin
Expectation that IO rates will continue to rise linearly
Back end saturated at ~250 processors
Weak scaling IO, ~300 MB/proc
Peak performance ~11GB/Sec (5 DDNs * ~2GB/sec)
Image from Julian Borrill
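
The peak figure above gives a rough lower bound on how long any output dump can take. The sketch below is a back-of-the-envelope calculation only: it assumes the ~11 GB/sec peak applies to the whole dump and ignores contention from other jobs, and the function name is made up for illustration.

```python
def min_write_seconds(aggregate_gb, peak_gb_per_sec=11.0):
    """Best-case time to write aggregate_gb at the quoted peak rate."""
    return aggregate_gb / peak_gb_per_sec

# Rough product of the hardware figures on the slide: 5 DDNs at ~2 GB/sec each.
ddn_estimate_gb_per_sec = 5 * 2

best_case_100gb = min_write_seconds(100)    # about 9 seconds
best_case_1tb = min_write_seconds(1000)     # about 91 seconds
```

Measured times that sit well above these bounds point to overheads other than raw bandwidth.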

Description of IOR
Developed by LLNL, used for the Purple procurement
Focuses on parallel/sequential read/write operations that are typical in scientific applications
Can exercise one file per processor or shared file access for a common set of testing parameters
Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5 and parallel-netCDF
Parameterized parallel file access patterns to mimic different application situations

Benchmark Methodology
[Figure: processors 0 to 5 writing to a single shared file, and processors 0 to 5 each writing to their own file]
Focus on performance difference between single shared file and one file per processor

Benchmark Methodology
Using IOR HDF5 Interface
Contiguous IO
Not intended to be a scaling study
Blocksize and transfer size always the same but vary from run to run
Goal is to fill out the chart opposite with the best IO strategy
[Chart: empty grid of Processors (256, 512, 1024, 2048, 4096) against Aggregate Output Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]

Small Aggregate Output Sizes 100 MB - 1 GB
One File per Processor vs Shared File - GB/Sec
[Plots: Aggregate File Size 100 MB and Aggregate File Size 1 GB]
Clearly the ‘one file per processor’ strategy wins in the low concurrency cases, correct?
Peak performance line: anything greater than this is due to caching effects or timer granularity

Small Aggregate Output Sizes 100 MB - 1 GB
One File per Processor vs Shared File - Time
[Plots: Aggregate File Size 100 MB and Aggregate File Size 1 GB]
But when looking at absolute time, the difference doesn’t seem so big...

Aggregate Output Size 100GB
One File per Processor vs Shared File
[Plots: Rate in GB/Sec and Time in Seconds, annotated with the peak performance line, 390 MB/proc, 24 MB/proc and 2.5 mins]
Is there anything we can do to improve the performance of the 4096 processor shared file case?
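
The per-processor figures quoted on these plots follow from dividing the aggregate output size by the processor count. The sketch below assumes decimal units (1 GB = 1000 MB), which is an inference: it is the convention that reproduces the quoted numbers once the fractions are dropped. The function name is made up for illustration.

```python
def mb_per_proc(aggregate_gb, nprocs):
    """Per-processor output in MB, assuming 1 GB = 1000 MB."""
    return aggregate_gb * 1000 / nprocs

# 100 GB aggregate output
low_concurrency = mb_per_proc(100, 256)     # about 390 MB/proc
high_concurrency = mb_per_proc(100, 4096)   # about 24 MB/proc

# 1 TB aggregate output, taken as 1000 GB
tb_1024 = mb_per_proc(1000, 1024)           # about 976 MB/proc
tb_4096 = mb_per_proc(1000, 4096)           # about 244 MB/proc
```

At a fixed aggregate size, the per-processor write shrinks as the processor count grows, so the same total output becomes many small writes at high concurrency.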

Hybrid Model
[Figure: processors 0 to 5 split into groups, each group writing to its own shared file]
Examine the 4096 processor case more closely
Group subsets of processors to write to separate shared files
Try grouping 64, 256, 512, 1024, and 2048 processors to see the performance difference from the file per processor case vs the single shared file case

Effect of Grouping Processors into Separate Smaller Shared Files
100GB Aggregate Output Size on 4096 procs
[Plot: results against Number of Files, with bars labeled 1 file per proc, 64 procs write to single file, 512 procs write to single file, 2048 procs write to single file and Single Shared File]
Each processor writes out 24MB
Only difference between runs is the number of files into which processors are grouped
Created a new MPI communicator in IOR for multiple shared files
User gains some from grouping files
Since very little data is written per processor, overhead for synchronization dominates
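
The grouping described above amounts to mapping each global rank to a file index and a rank within that file's group. The sketch below shows only that arithmetic in plain Python; the function name is hypothetical, and the talk does not say how the IOR communicator was actually built. In an MPI code, the file index would typically serve as the color passed to MPI_Comm_split, which is an assumption about one reasonable implementation.

```python
def shared_file_group(rank, procs_per_file):
    """Map a global rank to (file index, rank within that file's group)."""
    return rank // procs_per_file, rank % procs_per_file

nprocs = 4096
procs_per_file = 512                   # one of the group sizes tried in the talk
num_files = nprocs // procs_per_file   # number of shared files written

# Example: where does global rank 1000 write?
file_index, local_rank = shared_file_group(1000, procs_per_file)
```

Setting procs_per_file to 1 recovers the one file per processor case, and setting it to nprocs recovers the single shared file case.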

Aggregate Output Size 1TB
One File per Processor vs Shared File
[Plots: Rate in GB/Sec and Time in Seconds, annotated with 976 MB/proc, 244 MB/proc and ~3 mins]
Is there anything we can do to improve the performance of the 4096 processor shared file case?

Effect of Grouping Processors into Separate Smaller Shared Files
[Plot: bars labeled 1 file per proc, 64 procs write to single file, 512 procs write to single file, 2048 procs write to single file and Single Shared File]
Each processor writes out 244MB
Only difference between runs is the number of files into which processors are grouped
Created a new MPI communicator in IOR for multiple shared files
Effect from grouping files is fairly substantial
But do users want to do this?
Important to show HDF5 developers, so that splitting files can be made easier in the API

Effect of Grouping Processors into Separate Smaller Shared Files
[Plot: bars labeled 1 file per proc, 64 procs write to single file, 512 procs write to single file and Single Shared File]
Each processor writes out 488MB
Only difference between runs is the number of files into which processors are grouped
Created a new MPI communicator in IOR for multiple shared files

What is Striping?
The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers
A file is said to be striped when read and write operations access multiple OSTs concurrently
Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth

What is Striping?
File striping will most likely improve performance for applications which read or write to a single (or multiple) large shared files
Striping will likely have little effect for the following types of IO patterns
–Serial IO where a single processor performs all the IO
–Multiple nodes perform IO, but access files at different times
–Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)
–One file per processor

Striping Commands
Striping can be set at a file or directory level
Set striping on a directory and all files created in that directory will inherit the striping level of the directory
Moving a file into a directory with a set striping will NOT change the striping of that file
lfs setstripe
stripe-size
–Number of bytes in each stripe (multiple of 64k block)
OST offset
–Always keep this -1
–Choose starting OST in round robin
stripe count
–Number of OSTs to stripe over
–-1: stripe over all OSTs
–1: stripe over one OST
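
To see how stripe size and stripe count interact, the sketch below maps a byte offset in a file to the stripe target that would hold it, assuming a simple round-robin layout. It is a simplified model rather than Lustre code: the function name is made up, and the result is an index within the file's own set of OSTs, not an absolute OST number. The example values use the Franklin defaults of a 1MB stripe size over 4 OSTs.

```python
MB = 1024 * 1024

def ost_for_offset(byte_offset, stripe_size, stripe_count):
    """Index (0 .. stripe_count-1) of the stripe target holding byte_offset.

    Assumes stripes are laid out round robin across the targets.
    """
    return (byte_offset // stripe_size) % stripe_count

# Franklin defaults: 1MB stripe size, stripe count 4
first_mb = ost_for_offset(0, MB, 4)               # first stripe, target 0
second_mb = ost_for_offset(MB, MB, 4)             # second stripe, target 1
fifth_mb = ost_for_offset(4 * MB, MB, 4)          # wraps back to target 0
mid_stripe = ost_for_offset(5 * MB + MB // 2, MB, 4)
```

With a stripe count of 1 every offset maps to the same target, which is why a single large shared file gains nothing from it, while a larger count spreads consecutive stripes over more IO servers.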

Stripe-Count Suggestions
Franklin Default Striping
–1MB stripe size
–Round robin starting OST (OST Offset -1)
–Stripe over 4 OSTs (Stripe count 4)
Many small files, one file per proc
–Use default striping
–Or 0 -1 1
Large shared files
–Stripe over all available OSTs (0 -1 -1)
–Or some number larger than 4 (0 -1 X)
Stripe over odd numbers?
Prime numbers?

Recommendations
[Chart: grid of Processors (256, 512, 1024, 2048, 4096) against Aggregate File Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB), with each cell marked according to the legend]
Legend
–Single Shared File, Default or No Striping
–Single Shared File, Stripe over some OSTs (~10)
–Single Shared File, Stripe over many OSTs
–Single Shared File, Stripe over many OSTs OR File per processor with default striping
–Benefits to mod n shared files
–N/A

Recommendations
Think about the big picture
–Run time vs Post Processing trade off
–Decide how much IO overhead you can afford
–Data Analysis
–Portability
–Longevity
  –h5dump works on all platforms
  –Can view an old file with h5dump
  –If you use your own binary format you must keep track of not only your file format version but the version of your file reader as well
–Storability

Recommendations
Use a standard IO format, even if you are following a one file per processor model
One file per processor model really only makes some sense when writing out very large files at high concurrencies; for small files, overhead is low
If you must do one file per processor IO then at least put it in a standard IO format so pieces can be put back together more easily
Splitting large shared files into a few files appears promising
–Option for some users, but requires code changes and output format changes
–Could be implemented better in IO library APIs
Follow striping recommendations
Ask the consultants, we are here to help!

Questions?
