Efficient I/O on the Cray XT
Jeff Larkin, with help from Gene Wagenbreth

Overview
- What's the problem?
- "Typical" Application I/O
- I/O Solutions
- A Solution That Works
- Graphs, so many Graphs
- Take Home Notes

What's The Problem?
- Flops are cheap, bandwidth isn't
- Machines and applications aren't getting any smaller
- But... isn't Lustre enough? Can't I use libraries? Doesn't it just work?
- Without user or programmer intervention, I/O will not perform at peak
- There is no silver bullet

"Typical" Application I/O
- THERE IS NO TYPICAL APPLICATION I/O
- There are several common methods, but two are very common and problematic:
    Single-writer reduction (simple)
    N-writer/N-reader to N files (efficient)

Single-writer Reduction
- The Plan
    All processors send their data to 1 I/O node for output
    File striped across the maximum number of OSTs
- The Problem
    Even with maximum striping, 1 node will never achieve maximum bandwidth
    Single-node I/O bandwidth is approximately 200 MB/s
    Reading or writing a terabyte would require more than 1 hour at current I/O rates
[Diagram: all compute nodes send their data through a single writer to Lustre]
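For concreteness, a minimal sketch of this single-writer pattern (illustrative only, not code from the talk; the buffer names and sizes are invented):

      program single_writer
      use mpi
      implicit none
      integer, parameter :: nlocal = 1000
      real(8) :: localbuf(nlocal)
      real(8), allocatable :: globalbuf(:)
      integer :: rank, nprocs, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      localbuf = real(rank, 8)
      if (rank == 0) then
         allocate(globalbuf(nlocal*nprocs))
      else
         allocate(globalbuf(1))        ! recvbuf only matters at the root
      end if

      ! every rank funnels its data to rank 0 ...
      call MPI_GATHER(localbuf, nlocal, MPI_REAL8, globalbuf, nlocal, &
                      MPI_REAL8, 0, MPI_COMM_WORLD, ierr)

      ! ... and rank 0 alone writes the file, so the job is capped at
      ! single-node bandwidth no matter how many OSTs the file spans
      if (rank == 0) then
         open(unit=10, file='output.dat', form='unformatted', access='stream')
         write(10) globalbuf
         close(10)
      end if

      call MPI_FINALIZE(ierr)
      end program single_writer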

N-Writer to N-Files
- The Plan
    Every process opens its own file and dumps its data
    Files striped to 1 OST
- The Problem
    Can lead to slow opens and general filesystem slowness
    If the writes are not large, performance will suffer
    Inconvenient: the output can only be used as input for the same number of nodes
- One Modification
    Use MPI-I/O to write just 1 shared file
    Still suffers when the I/O comes in small buffers
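A matching sketch of the N-writer/N-file pattern (again illustrative only; the file-naming convention is invented):

      program n_writers
      use mpi
      implicit none
      integer, parameter :: nlocal = 1000
      real(8) :: localbuf(nlocal)
      integer :: rank, ierr
      character(len=32) :: fname

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      localbuf = real(rank, 8)

      ! every rank opens and writes its own file; with tens of thousands
      ! of ranks the opens alone can stress the metadata server
      write(fname, '(a,i6.6,a)') 'output.', rank, '.dat'
      open(unit=10, file=trim(fname), form='unformatted', access='stream')
      write(10) localbuf
      close(10)

      call MPI_FINALIZE(ierr)
      end program n_writers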

What does efficient I/O look like?

Striking a Balance: Filesystem Limits vs. Application Needs

Subset of Readers/Writers Approach
- The Plan: combine the best of our first two I/O methods
    Choose a subset of nodes to do I/O
    Send output to, or receive input from, 1 node in your subset
- The Benefits
    I/O buffering
    High bandwidth, low filesystem stress
- The Costs
    I/O nodes must sacrifice memory for the buffer
    Requires code changes
[Diagram: compute nodes in each group funnel their data through one I/O node to Lustre]

Subset of Readers/Writers Approach
- Assumes the job runs on thousands of nodes
- Assumes the job needs to do large I/O
- From the data partitioning, identify groups of nodes such that:
    each node belongs to a single group
    data in each group is contiguous on disk
    there are approximately the same number of groups as OSTs
- Pick one node from each group to be the I/O node
- Use MPI to transfer data within a group to its I/O node
- Each I/O node reads/writes the shared disk file

Example Code: MPI Subset Communicator

Create an MPI communicator that includes only the I/O nodes:

      ! extract the group of all ranks, keep only the I/O tasks, and
      ! build a communicator over that subset
      call MPI_COMM_GROUP(MPI_COMM_WORLD, WORLD_GROUP, ierr)
      call MPI_GROUP_INCL(WORLD_GROUP, niotasks, listofiotasks, IO_GROUP, ierr)
      call MPI_COMM_CREATE(MPI_COMM_WORLD, IO_GROUP, MPI_COMM_IO, ierr)
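One way the inputs to MPI_GROUP_INCL above might be filled in, sketched under the assumption of a fixed group size; the stride-based choice and every name other than niotasks/listofiotasks are invented, and the fragment assumes the MPI setup of the surrounding example:

      integer :: nprocs, groupsize, niotasks, i, ierr
      integer, allocatable :: listofiotasks(:)

      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      groupsize = 100                            ! compute ranks per I/O group (assumed)
      niotasks  = nprocs / groupsize
      allocate(listofiotasks(niotasks))
      do i = 1, niotasks
         listofiotasks(i) = (i - 1) * groupsize  ! first rank of each group does the I/O
      end do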

Example Code: MPI I/O

Open:
      call MPI_FILE_OPEN(MPI_COMM_IO, trim(filename), filemode, finfo, mpifh, ierr)

Read/write:
      call MPI_FILE_WRITE_AT(mpifh, offset, iobuf, bufsize, MPI_REAL8, status, ierr)

Close:
      call MPI_FILE_CLOSE(mpifh, ierr)
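A sketch of how the open arguments above might be prepared. The "striping_factor" and "striping_unit" keys are standard MPI-IO (ROMIO) hints that Lustre honors at file creation, but the specific values and the byte-offset formula here are assumptions, not taken from the talk:

      integer :: filemode, finfo, ierr, myiorank, bufsize
      integer(kind=MPI_OFFSET_KIND) :: offset

      filemode = IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY)

      call MPI_INFO_CREATE(finfo, ierr)
      call MPI_INFO_SET(finfo, 'striping_factor', '150', ierr)     ! stripe over 150 OSTs
      call MPI_INFO_SET(finfo, 'striping_unit', '20971520', ierr)  ! 20 MB stripe size

      ! with the default file view, MPI_FILE_WRITE_AT takes a byte offset:
      ! each I/O node writes its group's contiguous block of the shared file
      bufsize = 1310720                          ! 10 MB of real(8) values (assumed)
      call MPI_COMM_RANK(MPI_COMM_IO, myiorank, ierr)
      offset = int(myiorank, MPI_OFFSET_KIND) * int(bufsize, MPI_OFFSET_KIND) * 8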

Example Code: I/O Code Outline
- I/O node:
    copy (scatter) this node's data into the I/O buffer
    loop over the non-I/O nodes in this group:
        MPI_RECV data from the compute node
        copy (scatter) the data into the I/O buffer
    write the I/O buffer to disk
- Non-I/O node:
    copy data into the MPI buffer
    MPI_SEND the data to the I/O node
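A sketch that puts this outline together, reusing the file handle from the previous slides. The argument list, the message tag, and the assumption that each group's data packs contiguously into the I/O buffer are mine, not the talk's; error handling is omitted:

      subroutine write_group_data(is_ionode, my_ionode, ngroup, group_ranks, &
                                  mydata, nlocal, mpifh, group_offset)
      use mpi
      implicit none
      logical, intent(in) :: is_ionode
      integer, intent(in) :: my_ionode             ! world rank of my group's I/O node
      integer, intent(in) :: ngroup, nlocal, mpifh
      integer, intent(in) :: group_ranks(ngroup)   ! world ranks in my group, I/O node first
      real(8), intent(in) :: mydata(nlocal)
      integer(kind=MPI_OFFSET_KIND), intent(in) :: group_offset  ! byte offset of this group's block

      real(8), allocatable :: iobuf(:)
      integer :: i, ierr, status(MPI_STATUS_SIZE)

      if (is_ionode) then
         allocate(iobuf(nlocal*ngroup))
         iobuf(1:nlocal) = mydata                  ! this node's own contribution
         do i = 2, ngroup                          ! gather the rest of the group
            call MPI_RECV(iobuf((i-1)*nlocal+1), nlocal, MPI_REAL8, &
                          group_ranks(i), 0, MPI_COMM_WORLD, status, ierr)
         end do
         ! one large, contiguous write of the whole group's data
         call MPI_FILE_WRITE_AT(mpifh, group_offset, iobuf, nlocal*ngroup, &
                                MPI_REAL8, status, ierr)
         deallocate(iobuf)
      else
         call MPI_SEND(mydata, nlocal, MPI_REAL8, my_ionode, 0, &
                       MPI_COMM_WORLD, ierr)
      end if
      end subroutine write_group_data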

Sample Partitioning: POP
- Data is 3D: X, Y, Z
- The X and Y dimensions are partitioned in blocks
- Sample 4-node partition (diagram):
    Each of the 4 colored blocks represents one node's part of the data
    Each of the two lighter-colored blocks represents 1 I/O node
    I/O groups should be arranged so their data is contiguous on disk
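A sketch of how a 2D block partitioning like POP's might be mapped to I/O groups, assuming one group per block row so that each group's data is contiguous on disk; the process-grid dimensions and all the names here are invented:

      integer :: npx, rank, px, py, mygroup, groupsize, my_ionode, ierr

      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      npx = 10                        ! blocks across the X dimension (assumed)
      px  = mod(rank, npx)            ! this rank's block column
      py  = rank / npx                ! this rank's block row

      mygroup   = py                  ! one I/O group per block row
      groupsize = npx                 ! every rank in a row joins the same group
      my_ionode = py * npx            ! first rank in the row acts as the I/O node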

Sample Partitioning: POP
- Given a nearly square partitioning, the number of nodes simultaneously performing I/O is approximately the square root of the total number of compute nodes:
    2,500 compute nodes - 50 I/O nodes
    10,000 compute nodes - 100 I/O nodes
- Many partitions allow a reasonable assignment of I/O nodes
- For example, an array of 8-byte reals dimensioned (300, 400, 40) on each of 10,000 nodes:
    4.8 million elements on each node
    48 billion elements total
    384 gigabytes of data
    … seconds to read or write at … Gbyte/s
    100 I/O nodes
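The sizing arithmetic above, written out as a quick check; the 10,000-node count is inferred from the 48-billion-element total divided by 4.8 million elements per node, since the node count itself was lost in the transcript:

      program sizing
      implicit none
      integer(8) :: per_node, total, bytes
      per_node = 300_8 * 400_8 * 40_8      ! 4,800,000 elements per node
      total    = per_node * 10000_8        ! 48,000,000,000 elements in all
      bytes    = total * 8_8               ! 384,000,000,000 bytes, i.e. 384 GB
      print '(a,i0)', 'elements per node: ', per_node
      print '(a,i0)', 'elements in total: ', total
      print '(a,i0)', 'bytes in total:    ', bytes
      end program sizing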

A Subset of Writers Benchmark

Benchmark Results: Things to Know
- Uses MPI_FILE_WRITE_AT rather than file partitioning
- Only write data is shown (sorry); read data was largely similar
- Initial benchmarking showed the MPI transfer cost to be marginal, so transfers were excluded from later benchmarking
- Real application data is in the works; come to CUG

Benchmark Results: 1 I/O Node - Stripes
- Single I/O node, 10 MB buffer, 20 MB stripe size: bandwidth of the I/O write to disk
    Write bandwidth was essentially flat across the stripe counts tested: 134, 135, 139, 149, and 148 MB/s
- Using a single I/O node:
    the number of stripes doesn't matter
    the stripe size doesn't matter (timings not shown)

Benchmark Results: 1 I/O Node - Stripes

Benchmark Results: 1 I/O Node – Buffer Size
- Single node, single stripe: bandwidth of the I/O write to disk for different buffer sizes
    Buffer size is the size of contiguous memory on one I/O node written to disk with one write
- Buffer size should be at least 10 megabytes

50 Writers, Varying Stripe Count, Size and Buffer Size

150 Stripes, Varying Writers, Buffer, and Stripe Sizes

Cliff's Take Home Notes
- Do large I/O operations in parallel with MPI-I/O
- Create a natural partitioning of nodes so that data goes to disk in a way that makes sense
- Stripe across as close to the maximum number of OSTs as your partitioning allows
- Use buffers of at least 1 MB; 10 MB if you can afford it
- Make your I/O flexible so that you can tune it to the problem and the machine
    One hard-coded solution will meet your needs some of the time, but not all of the time
- Come to CUG 2007 and see the application results!