IBM ATS Deep Computing © 2007 IBM Corporation
High Performance IO
HPC Workshop – University of Kentucky
May 9, 2007 – May 10, 2007
Andrew Komornicki, Ph.D.
Balaji Veeraraghavan, Ph.D.
Agenda
  Introduction
  General IO performance
  Results of some small tests
  Modular IO libraries, Linux and AIX
I/O Optimization
  Analyze the IO pattern
  Determine optimization method
  Optimize in user space
  Minimize source code changes
  Possibly relink with libtkio.so
General I/O Performance
  C:
    Do not use fopen(), fread(), or fwrite(); these are inefficient due to small (4KB) IO blocks and extra memory copies
    Use instead: POSIX open(), read(), write()
    Direct (raw) IO will eliminate an additional memory copy
  FORTRAN:
    Use unformatted IO
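The POSIX calls recommended above can be sketched as follows. This is a minimal illustration, not code from the workshop: the function names write_blocks and read_blocks are hypothetical, and the block size is left to the caller so that large requests (for example 256 KB) can be compared against small ones.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `total` bytes to `path` in requests of `block` bytes.
 * Returns the number of bytes written, or -1 on error.
 * (Hypothetical helper, for illustration only.) */
long long write_blocks(const char *path, long long total, size_t block)
{
    assert(block > 0 && total >= 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    char *buf = malloc(block);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    memset(buf, 'x', block);

    long long done = 0;
    while (done < total) {
        long long left = total - done;
        size_t want = left < (long long)block ? (size_t)left : block;
        ssize_t n = write(fd, buf, want);   /* may be a short write */
        if (n <= 0) {
            done = -1;
            break;
        }
        done += n;
    }

    free(buf);
    if (close(fd) != 0)
        done = -1;
    return done;
}

/* Read `path` to end of file in requests of `block` bytes.
 * Returns the number of bytes read, or -1 on error. */
long long read_blocks(const char *path, size_t block)
{
    assert(block > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(block);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long done = 0;
    ssize_t n;
    while ((n = read(fd, buf, block)) > 0)
        done += n;
    if (n < 0)
        done = -1;

    free(buf);
    close(fd);
    return done;
}
```

Calling write_blocks("scratch.dat", 1 << 20, 256 * 1024) followed by read_blocks("scratch.dat", 256 * 1024) moves 1 MB in four requests each way; timing the same calls with a 4096-byte block gives a rough feel for the request-size effect on a given file system.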
Asynchronous IO, an example
  Non-blocking IO: aio_read(), aio_write(), aio_return()
  Completion notification:
    Polling with aio_error()
    Block until complete with aio_suspend()
  Cancellation of IO requests: aio_cancel()
  Large file enabled: removes the 2GB file size limitation
  POSIX conforming
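A minimal sketch of the calls listed above, assuming a POSIX system with <aio.h>. The function name aio_write_and_wait is hypothetical; it issues one aio_write(), waits with aio_suspend(), checks aio_error(), and collects the result with aio_return().

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `len` bytes from `buf` to the start of `path` asynchronously,
 * then wait for completion. Returns bytes written, or -1 on error.
 * (Hypothetical helper, for illustration only.) */
ssize_t aio_write_and_wait(const char *path, const void *buf, size_t len)
{
    assert(buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    /* Queue the request; the call returns before the data is written. */
    if (aio_write(&cb) != 0) {
        close(fd);
        return -1;
    }

    /* The application could do useful work here while the IO proceeds. */

    /* Block until the request leaves the in-progress state.
     * aio_suspend() can return early (for example on a signal),
     * so aio_error() is rechecked each time around the loop. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    int err = aio_error(&cb);      /* 0 on success, else an errno value */
    ssize_t n = aio_return(&cb);   /* call exactly once per request */

    close(fd);
    return err == 0 ? n : -1;
}
```

On older glibc versions the program may need to be linked with -lrt. Replacing the aio_suspend() loop with periodic aio_error() checks gives the polling style of completion notification instead.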
Results of Bonnie IO test
  Run on Blade system in San Mateo Lab
  System memory: 5 Gbytes
  File systems: ext2 and ext3
  All tests done in five stages:
    Writing with putc()...done
    Rewriting...done
    Writing intelligently...done
    Reading with getc()...done
    Reading intelligently...done
Results of Bonnie IO test, Block IO performance
  Size (MB)    Write (Kbytes/sec)    Read (Kbytes/sec)
  ,524 2,233, ,237 1,658, ,599 50, ,656 50,677
Results of Bonnie IO test
  Results for ext2 file system, time in seconds
  Size (MB)    User    System    Elapsed
Results of Bonnie IO test
  Results for ext3 file system, time in seconds
  Size (MB)    User    System    Elapsed
Modular I/O (MIO)
  Familiar and flexible runtime interface
  MIO modules: mio, trace, pf
  MIO available on both Linux and AIX
MIO user code interface
  open         MIO_open
  read         MIO_read
  write        MIO_write
  close        MIO_close
  lseek        MIO_lseek
  fcntl        MIO_fcntl
  ftruncate    MIO_ftruncate
MIO run time interface
  MIO_STATS="file name"
  MIO_FILES=" *.dat* [trace|pf ] *.inp [aix]"
  MIO_DEBUG="ALL"
  MIO_DEFAULTS="trace/mbytes, pf/cache=10m"
trace module
  summary of file activity
  binary events file
  low cpu overhead
  typical options: /stats /mbytes /gbytes /tbytes /events=mio.evt
pf module
  User selectable cache size
  User selectable page size
  User selectable prefetch depth
  Direct or system buffered IO
  Global or private cache
  Usage summary
pf module
  detects sequential I/O
  user memory buffering
  options: /global /cache_size=10m /page_size=1m /prefetch=1 /stride=1 /direct /stats
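The slides do not show how pf detects sequential I/O, so the following is only an assumed illustration of the general idea: remember the cache page touched by the previous request and, when the next request lands exactly one stride further on, nominate the following page for prefetch. The struct and function names are hypothetical, and the page_size and stride fields merely echo the /page_size and /stride options.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequential-access detector; not the actual pf algorithm. */
struct seq_detector {
    long long page_size;   /* bytes per cache page */
    long long stride;      /* page step that counts as sequential */
    long long last_page;   /* page of the previous request, -1 if none */
};

void seq_init(struct seq_detector *d, long long page_size, long long stride)
{
    assert(d != NULL && page_size > 0 && stride > 0);
    d->page_size = page_size;
    d->stride    = stride;
    d->last_page = -1;
}

/* Record a request at byte `offset`.
 * Returns the page index worth prefetching when the request continues a
 * sequential pattern, or -1 when no pattern is detected. */
long long seq_access(struct seq_detector *d, long long offset)
{
    long long page = offset / d->page_size;
    long long next = -1;

    if (d->last_page >= 0 && page == d->last_page + d->stride)
        next = page + d->stride;

    d->last_page = page;
    return next;
}
```

With a 1 MB page and a stride of 1, requests at offsets 0 and 1 MB lead the detector to suggest page 2, while a jump to an unrelated offset resets the pattern.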
Relink with libtkio.a
  libtkio.a has shared object members tkio.so, 32 bit and 64 bit
  Entry points for:
    open, open64, close, read, write, lseek, lseek64
    fcntl, ffinfo, fstat, fstat64, fstatfs, fsync
    ftruncate, ftruncate64
    unlink, aio_...
Default tkio behavior
  Uses dlopen and dlsym for runtime linking
  tkio entry    calls
  open64        libc(shr.o) open64
  close         libc(shr.o) close
  read          libc(shr.o) read
  write         libc(shr.o) write
  lseek64       libc(shr.o) lseek64
  fsync         libc(shr.o) fsync
  ……
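The runtime linking step above can be sketched with the standard dlopen() and dlsym() calls. This is not the tkio source: the function names are hypothetical, and the sketch searches the already-loaded program image with dlopen(NULL, ...) rather than naming libc(shr.o) explicitly.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

typedef ssize_t (*write_fn)(int, const void *, size_t);

/* Look up the write() entry point at run time.
 * Returns NULL if the symbol cannot be resolved. */
write_fn resolve_write(void)
{
    /* A NULL file name yields a handle for the main program and the
     * libraries loaded with it, which includes the C library. */
    void *handle = dlopen(NULL, RTLD_NOW);
    if (handle == NULL)
        return NULL;

    /* Assigning through void ** is the conversion POSIX suggests for
     * turning the object pointer from dlsym() into a function pointer. */
    write_fn fn = NULL;
    *(void **)(&fn) = dlsym(handle, "write");
    return fn;
}

/* Forward a write request through the dynamically resolved entry point. */
ssize_t write_via_dlsym(int fd, const void *buf, size_t len)
{
    static write_fn fn;   /* resolved on first use, then cached */
    if (fn == NULL)
        fn = resolve_write();
    assert(fn != NULL);
    return fn(fd, buf, len);
}
```

On older glibc versions the program may need to be linked with -ldl. A wrapper library built this way can decide at load time which implementation each entry point should forward to.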
tkio runtime interface
  setenv TKIO_ALTLIB so_name/print/abort
  export TKIO_ALTLIB=so_name/print/abort
  so_name is the name of a shared library, either name.so or libname.a(name.so)
  tkio calls a function in so_name that returns a structure filled with I/O entry points to replace the default entry points
  /print option outputs a message to stderr indicating success of load
  /abort issues exit(-1) if load is not successful
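The idea of a structure of replaceable entry points can be illustrated as below. The actual tkio structure layout and the name of the function exported by the alternate library are not given in the slides, so every name here (io_entry_points, get_alt_ptrs, select_entry_points, and so on) is hypothetical, and the alternate write simply counts calls before forwarding to libc.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical table of I/O entry points. */
struct io_entry_points {
    ssize_t (*read_fn)(int, void *, size_t);
    ssize_t (*write_fn)(int, const void *, size_t);
    off_t   (*lseek_fn)(int, off_t, int);
    int     (*close_fn)(int);
};

static long traced_writes;   /* bookkeeping kept by the alternate layer */

/* Alternate write: count the call, then forward to libc. */
static ssize_t counting_write(int fd, const void *buf, size_t len)
{
    traced_writes++;
    return write(fd, buf, len);
}

/* Default table: every entry forwards straight to libc. */
static struct io_entry_points default_ptrs(void)
{
    struct io_entry_points t = { read, write, lseek, close };
    return t;
}

/* Stand-in for the function an alternate library would export:
 * it returns a table whose entries replace the defaults. */
struct io_entry_points get_alt_ptrs(void)
{
    struct io_entry_points t = default_ptrs();
    t.write_fn = counting_write;
    return t;
}

static struct io_entry_points active;

/* Choose the default or the alternate table, as an environment
 * variable setting might do at startup. */
void select_entry_points(int use_alt)
{
    active = use_alt ? get_alt_ptrs() : default_ptrs();
}

/* The wrapper the application is linked against. */
ssize_t wrapped_write(int fd, const void *buf, size_t len)
{
    assert(active.write_fn != NULL);
    return active.write_fn(fd, buf, len);
}

long traced_write_count(void)
{
    return traced_writes;
}
```

Because the application only ever calls the wrapper, swapping the table changes the behavior of every write without touching application source.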
tkio using MIO
  setenv TKIO_ALTLIB get_mio_ptrs_64.so
  tkio entry    calls
  open64        libmio(mio.o) MIO_open64
  close         libmio(mio.o) MIO_close
  read          libmio(mio.o) MIO_read
  write         libmio(mio.o) MIO_write
  lseek64       libmio(mio.o) MIO_lseek64
  fsync         libmio(mio.o) MIO_fsync
  …
Diagram, demonstration only
  Layers: Application, Fortran I/O, stdio, libtkio, libc, libmio, kernel
  stdio calls: fopen, fwrite, fread, fclose
  Entry points: open64, write, read, lseek64, close
  Default targets: ->open64, ->write, ->read, ->lseek64, ->close
  libmio targets: ->MIO_open64, ->MIO_write, ->MIO_read, ->MIO_lseek64, ->MIO_close
Diagram
  Layers: libtkio, libc, libmio, kernel
  Entry points: open64, write, read, lseek64, close
  Default targets: ->open64, ->write, ->read, ->lseek64, ->close
  libmio targets: ->MIO_open64, ->MIO_write, ->MIO_read, ->MIO_lseek64, ->MIO_close
  MIO modules: trace, pf, aix
System buffered Data Movement
  Diagram labels: user space, kernel, 256kb, system buffers, MIO space
pf cached Data Movement
  Diagram labels: user space, kernel, 256kb, 5 x 2mb, system buffers, MIO space
O_DIRECT Data Movement
  Diagram labels: user space, kernel, O_DIRECT, 256kb, 5 x 2mb, system buffers, MIO space
Asynchronous Data Movement
  Diagram labels: user space, kernel, O_DIRECT, 256kb, 5 x 2mb, system buffers, MIO space
MSC.NASTRAN trace output from program pf
  Trace close : program pf : /bmwfs/cdh108.T20536_13.SCR300 : (281946/ )= mbytes/s
  current size=0 max_size=16277
  mode =0777 sector size=4096 oflags =0x302=RDWR CREAT TRUNC
  open write read seek fcntl trunc close size
  Annotations: Min/Max request size in bytes; Mbytes requested and Mbytes delivered; Number of occurrences
MSC.NASTRAN trace output
  Trace close : pf aix : /bmwfs/cdh108.T20536_13.SCR300 : (276645/ )= mbytes/s
  current size=0 max_size=16276
  mode =0777 sector size=4096 oflags =0x =RDWR CREAT TRUNC DIRECT
  open write awrite suspend mbytes/s
  read aread suspend mbytes/s
  seek fcntl trunc close size pages
MSC.NASTRAN pf output
  pf close for /bmwfs/cdh108.T20536_13.SCR300
  global cache 0: 150 pages of bytes
  29739/29749 pages not preread for write
  / prefetches : prefetch=
  write behinds writes reads page writes
  37772/33124 mbytes transferred
  program --> > pf --> > aix
  program < <-- pf < <-- aix
DataView file activity plot
  Axes: time (seconds), file position (bytes)
DataView file activity plot
  Axes: time (seconds), file position (bytes)
Asynchronous I/O plotting
  Axes: time (seconds), file position (bytes)
  Plot labels: suspend time, hidden time, queuing time
Cache page activity
  Axes: time (seconds), file position (bytes)
MSC.Nastran performance gains
  16 cpu 32GB NH2 node
  2.2M dof, 767GB I/O, 8 copies
  2GB memory per copy
  114MB/sec
  198MB/sec
  8 SSA, 16 loops, 4 disk/loop
MIO Summary
  Demonstrated performance gains
  Simple to implement
  Flexible run time interface
  Delivered as a shared object library
  Contact: