Parallel System Performance CS 524 – High-Performance Computing

Parallel System Performance
Parallel system = algorithm + hardware
Measures of the problem:
- Problem size: e.g., the dimension N in vector and matrix computations
- Floating-point operations
- Execution time
Measures of the hardware:
- Number of processors, p
- Interconnection network performance (channel bandwidth, cost, diameter, etc.)
- Memory system characteristics (sizes, bandwidth, etc.)

Performance Metrics
Execution time:
- Serial run time (T_S): the time elapsed between the beginning and the end of execution on a sequential computer
- Parallel run time (T_P): the time that elapses from the moment parallel execution starts to the moment the last processor finishes execution
Speedup (S): the ratio of the serial execution time of the best sequential algorithm to the parallel execution time
Efficiency (E): the effective fractional utilization of the parallel hardware
Cost (C): the sum of the times each processor spends on the problem

Speedup
Speedup, S = T_S / T_P
- Measures the benefit of parallelizing a program
- Usually less than the number of processors, p (sublinear speedup)
- Can S be greater than p (superlinear speedup)?
[Figure: speedup S versus processor count p, showing sublinear (typical), linear, and superlinear curves]

Efficiency and Cost
Efficiency, E = S/p
- Measures utilization of the processors for the problem's computation only
- Usually ranges from 0 to 1
- Can efficiency be greater than 1?
Cost, C = p*T_P (also known as work or processor-time product)
- Measures the sum of the times spent by each processor
- Cost-optimal: the cost of solving a problem on a parallel computer is proportional to the execution time of the fastest known sequential algorithm on a single processor
Note that E = T_S / C
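To make these definitions concrete, here is a minimal C sketch that computes speedup, efficiency, and cost from a serial time, a parallel time, and a processor count. The timings and processor count in main are invented values for illustration, not measurements from the course.

    #include <stdio.h>

    /* Speedup: ratio of serial time to parallel time. */
    double speedup(double t_serial, double t_parallel) {
        return t_serial / t_parallel;
    }

    /* Efficiency: speedup divided by the number of processors. */
    double efficiency(double t_serial, double t_parallel, int p) {
        return speedup(t_serial, t_parallel) / p;
    }

    /* Cost (work): number of processors times parallel run time. */
    double cost(double t_parallel, int p) {
        return p * t_parallel;
    }

    int main(void) {
        double t_s = 100.0;   /* hypothetical serial run time (seconds)   */
        double t_p = 14.0;    /* hypothetical parallel run time (seconds) */
        int    p   = 8;       /* number of processors                     */

        printf("S = %.2f\n", speedup(t_s, t_p));               /* 7.14  */
        printf("E = %.2f\n", efficiency(t_s, t_p, p));         /* 0.89  */
        printf("C = %.1f processor-seconds\n", cost(t_p, p));  /* 112.0 */
        return 0;
    }

Note that the printed efficiency equals T_S / (p*T_P), i.e. T_S / C, matching the identity on the slide.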

Amdahl's Law
Let W = work needed to solve a problem and W_S = work that is serial (i.e., not parallelizable).
The maximum possible speedup on p processors (assuming no superlinear speedup) is:
S = W / [W_S + (W - W_S)/p]
- If a problem has 10% serial computation, the maximum speedup is 10
- If a problem has 1% serial computation, the maximum speedup is 100
Speedup is bounded above by W/W_S as the number of processors p increases
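As a quick numerical check of the two bullet points above, the following C sketch evaluates the bound with the work normalized to W = 1, so the serial fraction f = W_S/W is the only problem parameter; the processor counts are arbitrary.

    #include <stdio.h>

    /* Amdahl speedup for serial fraction f on p processors:
       S = 1 / (f + (1 - f)/p), with the total work W normalized to 1. */
    double amdahl(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void) {
        int procs[] = {10, 100, 1000, 100000};
        for (int i = 0; i < 4; i++) {
            int p = procs[i];
            printf("p = %6d: f = 0.10 -> S = %6.2f, f = 0.01 -> S = %6.2f\n",
                   p, amdahl(0.10, p), amdahl(0.01, p));
        }
        /* As p grows, S approaches 1/f: 10 for f = 0.10 and 100 for f = 0.01. */
        return 0;
    }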

Execution Time
In a distributed memory model, the execution time is T_P = t_comp + t_comm
- t_comp: computation time
- t_comm: communication time for explicit sends and receives of messages
In a shared memory model, the execution time T_P consists of computation time and communication time for memory accesses. Communication is not specified explicitly; hence, the execution time is CPU time, determined in a manner similar to that for sequential algorithms.
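One way to observe this decomposition in a distributed memory (message passing) program is to time the computation and communication phases separately with MPI_Wtime. The sketch below does this for a toy partial-sum computation followed by an MPI_Allreduce; the array size is an arbitrary choice for illustration.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Computation phase: local partial sum. */
        double t0 = MPI_Wtime();
        double local = 0.0;
        for (int i = 0; i < N; i++) local += a[i];
        double t_comp = MPI_Wtime() - t0;

        /* Communication phase: combine partial sums across processes. */
        t0 = MPI_Wtime();
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t_comm = MPI_Wtime() - t0;

        if (rank == 0)
            printf("T_P ~= t_comp + t_comm = %.6f + %.6f s\n", t_comp, t_comm);

        MPI_Finalize();
        return 0;
    }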

Message Passing Communication Overhead
Parameters for determining the communication time, t_comm:
- Startup time (t_s): the time required to handle a message at the sending processor, including the time to prepare the message, the time to execute the routing algorithm, and the time to establish an interface between the local processor and the router.
- Per-hop time (t_h): the time for the message header to travel between two directly connected processors. Also known as node latency.
- Per-word transfer time (t_w): the time for a word to traverse a link. If the channel bandwidth is r words per second, then the per-word transfer time is t_w = 1/r.
t_comm = t_s + t_h + t_w

Store-and-Forward Routing (1)
Store-and-forward routing: a message traverses a path with multiple links; each intermediate processor on the path forwards the message to the next processor only after it has received and stored the entire message.

Store-and-Forward Routing (2)
Communication overhead/cost:
- Message size = m words
- Path length = l links
- Communication overhead: t_comm = t_s + (m*t_w + t_h)*l
- Usually t_h is small compared to m*t_w, so the communication cost simplifies to t_comm = t_s + m*t_w*l
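A small C sketch of this cost model, using invented parameter values (times in microseconds) to show that dropping the t_h term changes the result very little:

    #include <stdio.h>

    /* Store-and-forward cost: t_s + (m*t_w + t_h) * l */
    double sf_cost(double ts, double th, double tw, int m, int l) {
        return ts + (m * tw + th) * l;
    }

    int main(void) {
        double ts = 50.0, th = 1.0, tw = 0.5;   /* hypothetical times in microseconds */
        int m = 1000, l = 4;                    /* 1000-word message over 4 links     */

        printf("full model : %.1f us\n", sf_cost(ts, th, tw, m, l));  /* 2054.0 */
        printf("simplified : %.1f us\n", ts + m * tw * l);            /* 2050.0 */
        return 0;
    }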

Cut-Through Routing (1)
Cut-through routing: a message is forwarded at an intermediate node without waiting for the entire message to arrive.

Cut-Through Routing (2)
Wormhole routing is cut-through routing with pipelining through the network
- The message is partitioned into small pieces, called flits (flow control digits)
- There is no buffering in memory; a busy link causes the worm to stall, and deadlock may ensue
Communication cost/overhead:
- Message size = m words
- Path length = l links
- Communication cost: t_comm = t_s + m*t_w + l*t_h
- Again, considering t_h to be small compared to m*t_w, the communication cost simplifies to t_comm = t_s + m*t_w
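For comparison with the store-and-forward sketch above, here is the corresponding C sketch of the cut-through model with the same invented parameter values; it shows how pipelining removes the factor of l from the m*t_w term.

    #include <stdio.h>

    /* Cut-through cost:       t_s + m*t_w + l*t_h      */
    /* Store-and-forward cost: t_s + (m*t_w + t_h) * l  */
    double ct_cost(double ts, double th, double tw, int m, int l) {
        return ts + m * tw + l * th;
    }

    double sf_cost(double ts, double th, double tw, int m, int l) {
        return ts + (m * tw + th) * l;
    }

    int main(void) {
        double ts = 50.0, th = 1.0, tw = 0.5;   /* hypothetical times in microseconds */
        int m = 1000, l = 4;

        printf("store-and-forward: %.1f us\n", sf_cost(ts, th, tw, m, l)); /* 2054.0 */
        printf("cut-through      : %.1f us\n", ct_cost(ts, th, tw, m, l)); /*  554.0 */
        /* Cut-through pays the m*t_w term once rather than once per link,
           so the gap grows with both message size m and path length l. */
        return 0;
    }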