1 Performance Modeling
• Basic Model
  » Needed to evaluate approaches
  » Must be simple
• Synchronization delays
• Main components
  » Latency and bandwidth
  » Load balancing
• Other effects on performance
  » Understand deviations from the model

2 Latency and Bandwidth
• Simplest model: time = s + r n
• s includes both hardware (gate delays) and software (context switch, setup) overheads
• r includes both hardware (raw bandwidth of the interconnect and memory system) and software (packetization, copies between user and system) costs
• Head-to-head and ping-pong values may differ
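
As a concrete illustration of the model, here is a minimal ping-pong sketch (not from the slides; the message size, iteration count, and fitting approach are arbitrary choices) measuring the one-way time from which s and r can be estimated:

    #include <mpi.h>
    #include <stdio.h>

    /* Ping-pong between ranks 0 and 1. Repeating this for several
       message sizes and fitting time(n) = s + r*n gives the latency s
       (intercept) and per-byte cost r (slope). */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        enum { NITER = 1000, NBYTES = 1 << 20 };
        static char buf[NBYTES];
        double t0 = MPI_Wtime();
        for (int i = 0; i < NITER; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* Half the round-trip time is the one-way time per message. */
        double oneway = (MPI_Wtime() - t0) / (2.0 * NITER);
        if (rank == 0)
            printf("one-way time for %d bytes: %g s\n", NBYTES, oneway);
        MPI_Finalize();
        return 0;
    }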

3 Interpreting Latency and Bandwidth
• Bandwidth is the inverse of the slope of the line: time = latency + (1/rate) × size_of_message
• Latency is sometimes described as the "time to send a message of zero bytes". This is true only for the simple model, and the number quoted is sometimes misleading. As an illustrative example: with 20 µs latency and a 100 MB/s rate, a 1 MB message takes about 20 µs + 10 ms ≈ 10 ms, so bandwidth, not latency, dominates large messages.
[Figure: Time to Send Message vs. Message Size, with the intercept labeled "Latency", 1/slope labeled "Bandwidth", and an annotation marking a point as "Not latency".]

4 Including Contention
• Lack of contention is the greatest limitation of the latency/bandwidth model
• The hyperbolic model of Stoica, Sultan, and Keyes provides a way to estimate the effects of contention for different communication patterns; see ftp://ftp.icase.edu/pub/techreports/96/ ps.Z

5 Synchronization Delays
• Message passing is a cooperative method: if the partner doesn't react quickly, a delay results
• There is a performance tradeoff in reacting quickly: it requires devoting resources to checking for things to do

6 Polling Mode
[Timeline diagram spanning the sender's user/MPI layers and the receiver's system/user layers: MPI_Send issues a Request, an Acknowledgment comes back once the receiver polls, then the Transfer proceeds; the wait for the acknowledgment appears as a sync delay.]

7 Interrupt Mode
[Timeline diagram: the same Request / Acknowledgment / Transfer exchange, with the receiver notified by an interrupt rather than a poll.]
• Cost of an interrupt is (usually) higher than that of polling

8 Example of the Effect of Polling
• IBM SP2 MPI_Allreduce times for each mode
• Times in µsec; similar effects on other operations
  [Table of measured times not reproduced in the transcript]
• BUT some programs (with extensive computing) can show better performance with interrupt mode

9 Observing Synchronization Delays
• Three processors send data to the same process, one sending a short message and another sending a long message
[Timeline figures comparing the eager and rendezvous protocols for this pattern]
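
A hypothetical sketch of such a test (the ranks, sizes, and tags here are invented for illustration): ranks 1 and 2 send to rank 0, and the receive timestamps reveal whether the short message had to wait behind the long one, which depends on whether the implementation chose the eager or the rendezvous protocol.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        enum { SHORT_N = 16, LONG_N = 1 << 22 };
        static char buf[LONG_N];
        if (rank == 1) {                     /* short (eager-sized) message */
            MPI_Send(buf, SHORT_N, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
        } else if (rank == 2) {              /* long (rendezvous-sized) message */
            MPI_Send(buf, LONG_N, MPI_BYTE, 0, 2, MPI_COMM_WORLD);
        } else if (rank == 0) {
            double t0 = MPI_Wtime();
            MPI_Recv(buf, LONG_N, MPI_BYTE, 2, 2, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            double t1 = MPI_Wtime();
            MPI_Recv(buf, SHORT_N, MPI_BYTE, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            double t2 = MPI_Wtime();
            printf("long: %g s, short afterwards: %g s\n", t1 - t0, t2 - t1);
        }
        MPI_Finalize();
        return 0;
    }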

10 Other Impacts on Performance
• Contention
• Memory copies
• Packet sizes and stepping

11 Contention
• Point-to-point analysis ignores the fact that communication links are (usually) shared
• The easiest model is to share bandwidth equally: if K messages can share a link at one time, give each 1/K of the bandwidth
• "Topology doesn't matter anymore" is not true, but there is less you can do about it (just like cache memory)
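
A minimal sketch of the equal-sharing model as code (the function name and parameters are ours, not the slides'):

    /* Transfer time for an n-byte message when K messages share a link:
       each message sees 1/K of the bandwidth, i.e. K times the
       per-byte cost r, on top of the latency s. */
    double contended_time(double s, double r, double n, int K) {
        return s + (K * r) * n;
    }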

12 Effect of Contention
• The IBM SP2 has a multistage switch. This test shows the point-to-point bandwidth with half the nodes sending and half receiving.
[Bandwidth graph not reproduced in the transcript]

13 Memory Copies
• Memory copies are the primary source of performance problems
• Cost of non-contiguous datatypes
• Single-processor memcpy is often much slower than the hardware. Measured memcpy performance:
[Table of memcpy rates not reproduced in the transcript]
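
A minimal, hypothetical sketch of how such a memcpy rate can be measured (buffer size and iteration count are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Time repeated memcpy of a large buffer and report the copy rate. */
    int main(void) {
        const size_t n = 1 << 24;            /* 16 MB */
        const int iters = 50;
        char *src = malloc(n), *dst = malloc(n);
        if (!src || !dst) return 1;
        memset(src, 1, n);                   /* touch pages before timing */
        memset(dst, 0, n);
        clock_t t0 = clock();
        for (int i = 0; i < iters; i++)
            memcpy(dst, src, n);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("memcpy: %.1f MB/s\n", (double)n * iters / secs / 1e6);
        free(src); free(dst);
        return 0;
    }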

14 Example: Performance Impact of Memory Copies
• Assume n bytes are sent eagerly (and buffered):
  » s + r n + c n
• Rendezvous, not buffered:
  » s + s + (s + r n)
• Rendezvous is faster if s < c n / 2, since 3s + r n < s + r n + c n exactly when 2s < c n
  » Assumes no delays in responding to the rendezvous control information
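
The same comparison as a toy cost model (function names and parameters are illustrative, not from the slides):

    /* Eager:      one message plus a buffer copy (per-byte copy cost c).
       Rendezvous: request + acknowledgment + data message, no copy. */
    double eager_time(double s, double r, double c, double n) {
        return s + r * n + c * n;
    }
    double rendezvous_time(double s, double r, double n) {
        return 3 * s + r * n;
    }
    /* 3s + rn < s + rn + cn  <=>  2s < cn  <=>  s < cn/2 */
    int rendezvous_is_faster(double s, double c, double n) {
        return s < c * n / 2;
    }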

15 Example: Why MPI Datatypes
• Handling non-contiguous data
• Assume the data must be packed/unpacked on each end:
  » c n + (s + r n) + c n
• Or it can be moved directly:
  » s + r′ n
  » r′ is probably > r, but < (2c + r)
• The MPI implementation must copy the data anyway (into a network buffer or shared memory); having the datatype permits removing two copies

16 Performance of MPI Datatypes
• Test: a 1000-element vector of doubles with a stride of 24 doubles (results in MB/sec)
  » MPI_Type_vector
  » MPI_Type_struct (MPI_UB for stride)
  » User packs and unpacks by hand
• Performance is very dependent on the implementation and should improve with time
[Table of MB/sec results not reproduced in the transcript]
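
A sketch of the first and third variants under these test parameters (the destination rank and tag are invented; base must point to at least N*STRIDE doubles):

    #include <mpi.h>

    #define N      1000   /* elements sent */
    #define STRIDE 24     /* spacing between elements, in doubles */

    /* Variant 1: describe the strided layout once and send it directly. */
    void send_with_datatype(const double *base, int dest) {
        MPI_Datatype vec;
        MPI_Type_vector(N, 1, STRIDE, MPI_DOUBLE, &vec);
        MPI_Type_commit(&vec);
        MPI_Send((void *)base, 1, vec, dest, 0, MPI_COMM_WORLD);
        MPI_Type_free(&vec);
    }

    /* Variant 3: pack by hand into a contiguous buffer, then send that. */
    void send_hand_packed(const double *base, int dest) {
        double packed[N];
        for (int i = 0; i < N; i++)
            packed[i] = base[i * STRIDE];
        MPI_Send(packed, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }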

17 Packet Sizes
• Data is sent in fixed- or maximum-sized "packets"
• This introduces a ceil(n / packet_size) term into the cost model
• The result is a staircase appearance in the performance graph
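
A minimal sketch of a packetized cost model that produces this staircase (the per-packet overhead p is a modeling assumption, not a measured constant):

    #include <math.h>

    /* Each of the ceil(n/P) packets pays a fixed overhead p in addition
       to the usual latency s and per-byte cost r. */
    double packetized_time(double s, double r, double p, double n, double P) {
        return s + r * n + p * ceil(n / P);
    }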

18 Example of Packetization
• Packets contain 232 bytes of data (the first is 200 bytes, so the MPI header is probably 32 bytes)
• Data from mpptest, available at ftp://ftp.mcs.anl.gov/pub/mpi/misc/perftest.tar.gz
[Performance graph not reproduced in the transcript]