NERSC Users’ Group, Oct. 3, 2005
Interconnect and MPI
Bill Saphir

NERSC Users’ Group, Oct. 3, 2005
What this talk will cover
InfiniBand fabric
–Overview of InfiniBand: past, present, future
–Configuration on Jacquard
MPI
–How to use it
–Limitations/workarounds
–Plans

NERSC Users’ Group, Oct. 3, 2005
InfiniBand
Industry-standard high-performance network
–Many years in development; came near death at one point, but has come roaring back
–Originally seen as a PCI replacement
–Retains the ability to connect directly to disk controllers
High performance
–Direct user-space access to hardware
  The kernel is not involved in the transfer
  No memory-to-memory copies needed
  Protected access ensures security and safety
–Supports RDMA (put/get) and Send/Receive models

NERSC Users’ Group, Oct. 3, 2005
IB Link Speeds
Single channel: 2.5 Gb/s in each direction, simultaneously
A “4X” link is 10 Gb/s
–10 bits encode 8 bits of data (error correction/detection)
–1 GB/s bidirectional per 4X link
–PCI-X is 1 GB/s total bandwidth, so it is not possible to fully utilize an IB 4X link with PCI-X
A 12X link is three 4X links
–One fat pipe to the routing logic rather than three separate links
Double Data Rate (DDR) is 2X the speed (5 Gb/s links)
–Just now becoming available
–Very short cable lengths (about 3 m)
Quad Data Rate (QDR) is envisioned
–Cable-length issues with copper
–Optical is possible but expensive
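
As a sanity check on the numbers above, assuming the standard 8b/10b line coding and a 64-bit, 133 MHz PCI-X bus (neither of which is stated explicitly on the slide):
  4 lanes × 2.5 Gb/s = 10 Gb/s of signaling in each direction
  10 Gb/s × 8/10 = 8 Gb/s, i.e. about 1 GB/s of data in each direction
  PCI-X: 8 bytes × 133 MHz ≈ 1 GB/s total, shared by both directions, so a 4X HCA behind PCI-X cannot run at full rate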

NERSC Users’ Group, Oct. 3, 2005
IB Switching
Large switches are created by connecting smaller switches together
The current small-switch building block is 24 ports
The usual configuration is a fat tree
–An N-port switch can be used to build an N²/2-port fat tree using 3N/2 switches (max 288 ports for N=24)
–Larger switches available from some vendors are actually a 2-level fat tree based on 24-port switches
–A fat tree has “full bisection bandwidth”: it supports all nodes communicating at the same time
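
Working the fat-tree formula above through for the 24-port building block (the leaf/spine breakdown below is implied by the formula rather than stated on the slide):
  Ports: N²/2 = 24²/2 = 288
  Switches: 3N/2 = 36, i.e. 24 leaf (L1) switches with 12 ports down to nodes and 12 ports up, plus 12 spine (L2) switches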

NERSC Users’ Group, Oct. 3, 2005
Example: 120-port thin tree (diagram: one L2 switch above a row of L1 switches)
All small switches have 24 ports (7 switches total: 6 at L1, 1 at L2)
Each L1 switch has 4 “up” connections to the L2 switch and 20 “down” connections to nodes

NERSC Users’ Group, Oct. 3, 2005
Example: 96-port fat tree (Clos) (diagram: a row of L2 switches above a row of L1 switches)
All small switches have 24 ports (12 switches total: 8 at L1, 4 at L2)
Each L1 switch has 12 “up” connections (3 to each L2 switch) and 12 “down” connections to nodes

NERSC Users’ Group, Oct. 3, 2005
InfiniBand Routing
InfiniBand is “destination routed”
–Switches make forwarding decisions based on the destination of a packet
–Even though a fat tree has full bisection bandwidth, hot spots are possible
–The routing scheme makes it more difficult to avoid network “hot spots” (not yet clear whether Jacquard users are impacted)
–Workarounds are available and will be addressed in future versions of MPI

NERSC Users’ Group, Oct. 3, 2005
Jacquard configuration
Jacquard is a “2-level” fat tree
–24-port switches at L1 (to nodes)
–96-port switches at L2
–Really a 3-level tree, because the 96-port switches are 2-level trees internally
–4X connections (1 GB/s) to all nodes
–Innovation: 12X uplinks from L1 to L2, i.e. a smaller number of fat pipes
Full bisection bandwidth
–Supports all nodes communicating at the same time
–The network supports 2X what the PCI-X buses can sustain

NERSC Users’ Group, Oct. 3, 2005
InfiniBand Software
The IB software interface was originally called the “Virtual Interface Architecture” (VI Architecture or VIA)
NERSC wrote the first MPI for VIA (MVICH), the basis for the current MPI implementation on Jacquard
Microsoft derailed inclusion of an API in the standard
The de-facto current standard is VAPI, from Mellanox (part of OpenIB generation 1 software)
OpenIB Gen 2 will have a slightly different interface

NERSC Users’ Group, Oct. 3, 2005
MPI for InfiniBand
Jacquard uses MVAPICH (MPICH + VAPI)
–Based on MVICH from NERSC (MPICH + VIA) and MPICH from ANL
–OSU: porting to VAPI plus performance improvements
–Support path: OSU -> Mellanox -> LNXI -> NERSC
  Support mechanisms/responsibilities are being discussed
MPI-1 functionality
NERSC is tracking Open MPI for InfiniBand

NERSC Users’ Group, Oct. 3, 2005
Compiling/Linking MPI
MPI versioning is controlled by modules
–“module load mvapich” is in the default startup files
–The compiler is loaded independently
mpicc/mpif90 (a minimal example program appears below)
–mpicc -o myprog myprog.c
–mpif90 -o myprog myprog.f
–Use the currently loaded pathscale module
–Automatically find MPI include files
–Automatically find MPI libraries
–The latest version uses shared libraries
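
For reference, a minimal myprog.c that exercises the toolchain might look like the following sketch; only the file name and the compile command come from the slide, the program body is illustrative:

    /* myprog.c: minimal MPI program to check the compiler and MPI setup */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes in the job */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Build it with “mpicc -o myprog myprog.c” and launch it with “mpirun ./myprog” inside a batch job, as described on the next slide.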

NERSC Users’ Group, Oct. 3, 2005
Running MPI programs
Always use the “mpirun” command
–Written by NERSC
–Integrates PBS and MPI
–Runs with processor affinity enabled
Inside a PBS job (a sample batch script appears below):
–mpirun ./a.out
  Runs a.out on all processors allocated by PBS
  No need for the “$PBS_NODEFILE” hack
  Make sure to request ppn=2 with PBS
–“-np N” is optional; it can be used to run on fewer processors
On a login node
–“mpirun -np 32 ./a.out” just works
–Internally: creates a PBS script (on 32 processors), then runs the script interactively using “qsub -I” and expect
–Max wallclock time: 30 minutes
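
A minimal batch script for the “inside a PBS job” case might look like this sketch; the resource-request lines use generic PBS syntax and a made-up node count, so check NERSC documentation for the exact form expected on Jacquard:

    #!/bin/sh
    #PBS -l nodes=4:ppn=2        # hypothetical request: 4 nodes, both processors per node (ppn=2)
    #PBS -l walltime=00:30:00    # hypothetical wallclock request
    cd $PBS_O_WORKDIR            # start in the directory the job was submitted from
    mpirun ./a.out               # runs on all processors allocated by PBS

Submit it with qsub; no -np argument and no $PBS_NODEFILE manipulation are needed.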

NERSC Users’ Group, Oct. 3, 2005
mpirun current limitations
Currently propagates only these environment variables:
–FILENV, LD_LIBRARY_PATH, LD_PRELOAD
–To propagate other variables: ask NERSC
Does not directly support MPMD
–To run different binaries on different nodes, use a starter script that “execs” the correct binary based on the value of MPIRUN_RANK (a sketch appears below)
Does not allow redirection of standard input, e.g.
–mpirun a.out < file
Does not propagate $PATH, so “./a.out” is needed even if “.” is in $PATH
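
The MPMD workaround above could be implemented with a starter script along these lines; the rank split and binary names are hypothetical, only the MPIRUN_RANK variable comes from the slide:

    #!/bin/sh
    # starter script: replace this process with a different binary depending on its rank
    if [ "$MPIRUN_RANK" -lt 4 ]; then
        exec ./master    # hypothetical binary for ranks 0-3
    else
        exec ./worker    # hypothetical binary for the remaining ranks
    fi

Launch it as “mpirun ./starter.sh” so that every MPI process runs the script and then execs the appropriate binary.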

NERSC Users’ Group, Oct. 3, 2005
Orphan processes
mpirun (using ssh) has a habit of leaving “orphan” processes on nodes when a program fails
PBS (with NERSC additions) goes to great lengths to clean these up between jobs
mpirun detects whether it has been called previously in the same PBS job; if so, it first tries to clean up orphan processes in case the previous run failed

NERSC Users’ Group, Oct. 3, 2005
Peeking inside mpirun
mpirun currently uses ssh to start up processes (the internal starter is called “mpirun_rsh”; do not use it yourself)
NERSC expects to move to PBS-based startup (internal starter called “mpiexec”)
–May help with orphan processes, accounting, the ability to redirect standard input, and direct MPMD support
Do not use mpirun_rsh or mpiexec directly; they are not supported by NERSC

NERSC Users’ Group, Oct. 3, 2005
MPI Memory Use
Current MVAPICH uses a lot of memory per process, linear in the number of MPI processes
Per process:
–64 MB, plus
–276 KB per process up to 64 processes
–1.2 MB per process above 64
Due to a limitation in the VI Architecture that does not exist in InfiniBand but was carried forward
Future implementations of MPI will have lower memory use
Note: getrusage() does not report memory use under Linux
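
Reading the figures above as cumulative, a rough estimate for a 256-process job (an interpretation of the slide, not a number taken from it) is:
  64 MB + 64 × 276 KB + 192 × 1.2 MB ≈ 64 + 18 + 230 ≈ 312 MB of MPI-internal memory per process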

NERSC Users’ Group, Oct. 3, 2005
MPI Performance
Ping-pong bandwidth:
–800 MB/s (Seaborg: 320 MB/s); drops to 500 MB/s for messages above 200 KB
–Theoretical peak: 1000 MB/s
Ping-pong latency:
–5.6 us between nodes (Seaborg: 24 us by default; 21 us with MPI_SINGLE_THREAD)
–0.6 us within a node
“Random ring bandwidth”:
–184 MB/s (Seaborg: ~43 MB/s at 4 nodes)
–Measures contention in the network
–Theoretical peak: 250 MB/s
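
Ping-pong figures like those above are usually measured with a loop of paired sends and receives; the following is a sketch of that standard technique, not the benchmark actually used for these numbers. Use a small message size for latency and a large one for bandwidth:

    /* pingpong.c: crude ping-pong timing between ranks 0 and 1 (run with at least 2 processes) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000;
        const int nbytes = 1 << 20;     /* 1 MB messages; shrink this to measure latency */
        char *buf = malloc(nbytes);
        MPI_Status st;
        double t0, t;
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {            /* rank 0 sends first, then waits for the echo */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {     /* rank 1 echoes each message back */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time per message */
        if (rank == 0)
            printf("%d-byte messages: %.2f us one-way, %.1f MB/s\n",
                   nbytes, t * 1e6, nbytes / t / 1e6);
        MPI_Finalize();
        free(buf);
        return 0;
    }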

NERSC Users’ Group, Oct. 3, 2005
MPI Futures
A different startup mechanism: fewer orphans, faster startup, full environment propagated
Lower memory use
More control over the memory registration cache
Higher bandwidth

NERSC Users’ Group, Oct. 3, 2005
Summary
“All you need to know”:
–mpicc/mpif77/mpif90/mpicxx
–mpirun -np N ./a.out