Slide 1: Comparison of Communication and I/O of the Cray T3E and IBM SP
Jonathan Carter, NERSC User Services
National Energy Research Scientific Computing Center

Slide 2: Overview
–Node Characteristics
–Interconnect Characteristics
–MPI Performance
–I/O Configuration
–I/O Performance

Slide 3: T3E Architecture
Distributed memory, single-CPU processing elements.
(Diagram: CPU and memory pairs connected by the interconnect.)

Slide 4: T3E Communication Network
Processing Elements (PEs) are connected by a 3D torus.

Slide 5: T3E Communication Network
–The peak bandwidth of the torus is about 600 Mbyte/sec bidirectional
–Sustainable bandwidth is about 480 Mbyte/sec bidirectional
–Latency is about 1 μs
–The SHMEM API gives a latency of 1 μs and a bandwidth of 350 Mbyte/sec bidirectional
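The SHMEM figures above refer to one-sided put/get operations. As a rough illustration of that programming model (not code from the talk), here is a minimal C sketch using OpenSHMEM-style calls; the T3E-era library initialized with start_pes() and used typed shmem_put routines, so the exact names here are assumptions.

    /* Minimal sketch of a one-sided SHMEM put between two PEs.
     * Assumes an OpenSHMEM-style API; the original Cray T3E SHMEM
     * library used start_pes() and typed shmem_put routines instead. */
    #include <shmem.h>
    #include <stdio.h>
    #include <string.h>

    #define NBYTES 1048576          /* 1 Mbyte payload */

    static char target[NBYTES];     /* symmetric (remotely accessible) buffer */

    int main(void) {
        char source[NBYTES];
        shmem_init();                        /* start_pes(0) on the T3E */
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        memset(source, me, NBYTES);
        shmem_barrier_all();

        if (me == 0 && npes > 1) {
            /* One-sided put: PE 0 writes directly into PE 1's memory. */
            shmem_putmem(target, source, NBYTES, 1);
            shmem_quiet();                   /* wait for the put to complete */
        }
        shmem_barrier_all();

        if (me == 1)
            printf("PE 1 received %d bytes from PE 0\n", NBYTES);

        shmem_finalize();
        return 0;
    }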

Slide 6: SP Architecture
Cluster of SMP nodes.
(Diagram: multi-CPU nodes with shared memory connected by the interconnect.)

Slide 7: SP Communication Network
–Nodes are connected via adapters to the SP Switch
–The switch is composed of boards, each of which links 16 nodes
–Boards are linked together to form a larger network
(Diagram: switch board with attached nodes.)

Slide 8: SP Communication Network
–The peak bandwidth of the adapter and switch is 300 Mbyte/sec bidirectional
–Latency of the switch is about 2 μs
–Sustainable bandwidth is about 185 Mbyte/sec bidirectional

Slide 9: MPI Performance
Intra-node is 1 MPI process per node; running 2 MPI processes per node (the typical configuration) will halve the bandwidth.
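Bandwidth and latency figures of this kind are typically obtained with a simple ping-pong test. The following C sketch is an illustration of that technique, not code from the talk: it times round trips of a fixed-size message between ranks 0 and 1 and reports the resulting one-way bandwidth.

    /* Simple MPI ping-pong sketch for estimating point-to-point
     * bandwidth; illustration only, not code from the original talk. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv) {
        const int nbytes = 1 << 20;   /* 1 Mbyte message */
        const int reps   = 100;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(nbytes);
        memset(buf, 0, nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0) {
            double one_way = (t1 - t0) / (2.0 * reps);   /* seconds per message */
            printf("bandwidth: %.1f Mbyte/sec\n", nbytes / one_way / 1.0e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }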

Slide 10: MPI Performance

Slide 11: MPI Performance

Slide 12: T3E I/O Configuration
–PEs do not have local disk
–All PEs access all filesystems equivalently
–The path for (optimum) I/O generally looks like:
  –PE to I/O node via the torus
  –I/O node to Fibre Channel Node (FCN) via the GigaRing
  –FCN to disk array via Fibre loop
–In some cases, data on an APP PE must be transferred to a system buffer on an OS PE and then out to an FCN

Slide 13: T3E I/O Configuration
(Diagram: PEs, I/O nodes, FCNs, the GigaRing, and disk arrays.)

Slide 14: SP I/O Configuration
–Nodes have local disk: one SCSI disk for all local filesystems (non-optimal)
–All nodes access General Parallel File System (GPFS) filesystems equivalently
–The path for GPFS I/O looks like:
  –Node to GPFS node via IP over the switch
  –GPFS node to disk array via SSA loop

Slide 15: SP I/O Configuration
(Diagram: compute nodes, the switch, GPFS nodes, and disk arrays.)

Slide 16: T3E Filesystems
/usr/tmp
–fast
–subject to 14-day purge, not backed up
–check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes)
$TMPDIR
–fast
–purged at end of job or session
–shares quota with /usr/tmp
$HOME
–slower
–permanent, backed up
–check quota with quota (usually 2 Gbyte and 3500 inodes)

Slide 17: SP Filesystems
/scratch and $SCRATCH
–global
–fast (GPFS)
–subject to 14-day purge (or purged at session end for $SCRATCH), not backed up
–check quota with myquota (usually 100 Gbyte and 6000 inodes)
$TMPDIR
–local (created in /scr), only 2 Gbyte total
–slower
–purged at end of job or session
$HOME
–global
–slower (GPFS)
–permanent, not backed up yet
–check quota with myquota (usually 4 Gbyte and 5000 inodes)

Slide 18: Types of I/O
A bewildering number of choices on both machines:
–Standard language I/O: Fortran or C (ANSI or POSIX)
–Vendor extensions to language I/O
–MPI I/O
–Cray FFIO library (can be used from Fortran or C)
–IBM MIO library (requires code changes)

Slide 19: Standard Language I/O
–Fortran direct access is slightly more efficient than sequential access on both the T3E (see the comments on FFIO later) and the SP. It also allows file transferability.
–C language I/O (fopen, fwrite, etc.) is inefficient on both machines.
–POSIX standard I/O (open, read, etc.) can be efficient on the T3E, but requires care (see the comments on FFIO later). It works well on the SP.
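To illustrate why request size matters, here is a hedged C sketch (not from the talk; file names and sizes are arbitrary) that writes the same data once with many small stdio requests and once with a few large POSIX write requests; large requests are the pattern to aim for on both machines.

    /* Sketch contrasting many small stdio writes with large POSIX writes.
     * Illustration only; file names and sizes are arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define TOTAL   (64L * 1024 * 1024)   /* 64 Mbyte of data      */
    #define SMALL   512                   /* small stdio request   */
    #define LARGE   (4 * 1024 * 1024)     /* large POSIX request   */

    int main(void) {
        char *buf = malloc(LARGE);
        memset(buf, 'x', LARGE);

        /* Many small requests through C stdio. */
        FILE *fp = fopen("stdio.dat", "w");
        for (long done = 0; done < TOTAL; done += SMALL)
            fwrite(buf, 1, SMALL, fp);
        fclose(fp);

        /* A few large requests through POSIX I/O: far fewer calls. */
        int fd = open("posix.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (long done = 0; done < TOTAL; done += LARGE)
            write(fd, buf, LARGE);
        close(fd);

        free(buf);
        return 0;
    }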

Slide 20: Vendor Extensions to Language I/O
–Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable.
–IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.

Slide 21: MPI I/O
Part of MPI-2. An interface for high-performance parallel I/O:
–data partitioning
–collective I/O
–asynchronous I/O
–portability and interoperability between the T3E and SP
A different subset is implemented on the T3E and the SP.
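As a hedged sketch of the MPI I/O interface (standard MPI-2 calls; not code from the talk), the following has each rank write a contiguous block of one shared file using a file view and a collective write.

    /* MPI I/O sketch: each rank writes its own contiguous block of one
     * shared file via a file view and a collective write. Illustration only. */
    #include <mpi.h>
    #include <stdlib.h>

    #define COUNT 1048576                     /* doubles per rank */

    int main(int argc, char **argv) {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *data = malloc(COUNT * sizeof(double));
        for (int i = 0; i < COUNT; i++)
            data[i] = rank;                   /* dummy payload */

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank's view starts at its own byte offset in the file. */
        MPI_Offset disp = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native",
                          MPI_INFO_NULL);

        /* Collective write lets the implementation merge requests. */
        MPI_File_write_all(fh, data, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(data);
        MPI_Finalize();
        return 0;
    }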

Slide 22: Summary of access routines for the T3E

Slide 23: Summary of access routines for the SP

Slide 24: Cray FFIO Library
–FFIO is a set of I/O layers tuned for different I/O characteristics
–Buffering of data (configurable size)
–Caching of data (configurable size)
–Available to regular Fortran I/O without reprogramming
–Available for C through POSIX-like calls, e.g. ffopen, ffwrite
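For C code, the FFIO entry points named above mirror their POSIX counterparts. The following minimal sketch is written under that assumption; the <ffio.h> header and the exact prototypes (which may take additional optional status arguments on UNICOS/mk) are assumptions, not verified.

    /* Hedged sketch of calling FFIO from C; assumes ffopen/ffwrite/ffclose
     * mirror open/write/close as the slide describes. The <ffio.h> header
     * and exact prototypes are assumptions, not verified against UNICOS/mk. */
    #include <fcntl.h>
    #include <ffio.h>        /* assumed header for the FFIO C interface */
    #include <string.h>

    #define NBYTES (4 * 1024 * 1024)

    int main(void) {
        static char buf[NBYTES];
        memset(buf, 'x', NBYTES);

        /* The FFIO layer actually used (e.g. bufa) is selected at run time
         * with the assign command, not in the source code. */
        int fd = ffopen("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        ffwrite(fd, buf, NBYTES);
        ffclose(fd);
        return 0;
    }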

Slide 25: FFIO and the assign Command
The assign command controls program behavior at runtime:
–which FFIO layer is active
–striping across multiple partitions
–lots more
Scope of assign:
–file name
–Fortran unit number
–file type (e.g. all sequential unformatted files)

Slide 26: IBM MIO Library
–User interface is based on the POSIX I/O routines, so it requires program modification
–Useful trace module to collect statistics
–Not much experience yet with using it on GPFS filesystems
–Coming soon

Slide 27: I/O Strategies: Exclusive-Access Files
Each process reads and writes a separate file (see the sketch below).
–Language I/O
  –On the T3E, increase language I/O performance with the FFIO library (for example, specify a large buffer with the bufa layer). For Fortran direct access, the default buffer is only the maximum of the record length and 32 Kbytes.
  –On the SP, read/write large amounts of data per request.
–MPI I/O
  –Read/write large amounts of data per request.
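A minimal sketch of the exclusive-access strategy (illustration only, with assumed file names): each MPI rank opens its own file, named by rank, and issues large write requests.

    /* Exclusive-access files: each rank writes its own file with large
     * requests. Illustration only; path and sizes are arbitrary. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK  (8 * 1024 * 1024)   /* 8 Mbyte per request   */
    #define NCHUNK 25                  /* 200 Mbyte per process */

    int main(int argc, char **argv) {
        int rank;
        char fname[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(CHUNK);
        memset(buf, rank & 0xff, CHUNK);

        /* One file per process, named by rank. */
        snprintf(fname, sizeof(fname), "out.%04d", rank);
        int fd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        for (int i = 0; i < NCHUNK; i++)
            write(fd, buf, CHUNK);
        close(fd);

        free(buf);
        MPI_Finalize();
        return 0;
    }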

Slide 28: bufa FFIO Layer Overview
–bufa is an asynchronous buffering layer; it performs read-ahead and write-behind
–Specify the buffer size with -F bufa:bs:nbufs, where bs is the buffer size in units of 4-Kbyte blocks and nbufs is the number of buffers
–Buffer space increases your application's memory requirements

Slide 29: I/O Strategies: Shared Files
All PEs read and write the same file simultaneously.
–Language I/O (requires the FFIO library global layer on the T3E)
–MPI I/O
–On the T3E, language I/O with the FFIO global layer plus Cray extensions for additional flexibility

Slide 30: Positioning Within a Shared File
Positioning of a read or write is your responsibility; file pointers are private.
–Fortran: use a direct access file and read/write(rec=num), or use the Cray T3E extensions setpos and getpos to position the file pointer (not portable)
–C: use ffseek
–MPI I/O: the MPI I/O file view generally takes care of this; positioning routines are also available
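For the MPI I/O case, the explicit-offset routines make the positioning visible. A hedged sketch using standard MPI-2 calls (not code from the talk): each rank writes at an offset of rank times its block size, so no shared or private file pointer needs to be moved.

    /* Positioning in a shared file with explicit offsets: each rank
     * writes at rank * BLOCK without touching a file pointer. */
    #include <mpi.h>
    #include <string.h>

    #define BLOCK (1024 * 1024)        /* 1 Mbyte block per rank */

    int main(int argc, char **argv) {
        int rank;
        char buf[BLOCK];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, rank & 0xff, BLOCK);

        MPI_File_open(MPI_COMM_WORLD, "positioned.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Explicit offset: no file view or seek needed. */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK;
        MPI_File_write_at(fh, offset, buf, BLOCK, MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }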

Slide 31: global FFIO Layer Overview
–global is a caching and buffering layer which enables multiple PEs to read and write the same file
–If one PE has already read the data, an additional read request from another PE results in a remote memory copy
–File open is a synchronizing event: by default all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
–Specify the buffer size with -F global:bs:nbufs, where bs is the buffer size in units of 4-Kbyte blocks and nbufs is the number of buffers per PE

Slide 32: GPFS and Shared Files
–On the T3E, the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs.
–On the SP, GPFS implements a safe update scheme via tokens and a token manager.
  –If two processes access the same block of a GPFS file (256 Kbytes), a negotiation is conducted between the nodes and the token manager to determine the order of updates. This can slow down I/O considerably.
  –MPI I/O merges requests from different processes to alleviate this problem.
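One practical consequence: if processes write disjoint regions of a shared GPFS file and each region is sized and aligned to a multiple of the 256-Kbyte block size, no two processes touch the same block and the negotiation described above is avoided. A hedged sketch of that layout (illustration only, not code from the talk):

    /* Align each rank's region of a shared GPFS file to the 256-Kbyte
     * block size so no two ranks update the same block. Illustration only. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define GPFS_BLOCK (256 * 1024)    /* GPFS block size described above */
    #define NBLOCKS    16              /* whole blocks per rank           */

    int main(int argc, char **argv) {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t nbytes = (size_t)NBLOCKS * GPFS_BLOCK;
        char *buf = malloc(nbytes);
        memset(buf, rank & 0xff, nbytes);

        MPI_File_open(MPI_COMM_WORLD, "aligned.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank starts on a block boundary and writes whole blocks. */
        MPI_Offset offset = (MPI_Offset)rank * nbytes;
        MPI_File_write_at(fh, offset, buf, (int)nbytes, MPI_CHAR,
                          MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }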

Slide 33: I/O Performance Comparison
Each process writes a 200-Mbyte file; 2 processes per node on the SP.

Slide 34: Further Information
–I/O on the T3E tutorial by Richard Gerber at als
–Cray publication: Application Programmer's I/O Guide
–Cray publication: Cray T3E Fortran Optimization Guide
–man assign
–XL Fortran User's Guide