
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory units using a variety of interconnection networks.
Dynamic/Indirect networks:
– cross bar
– bus based
Static/Direct networks:
– completely connected
– star connected
– linear array
– ring
– mesh
– hypercube

2 More on Interconnects Dynamic Interconnect: Communication links are connected to one another dynamically by switching elements to establish paths among processors and memory banks. Normally used for shared address space computers. Static/Direct Interconnect: Consists of point-to-point communication links among processors. Typically used for message passing computers.

3 Dynamic Interconnect Cross bar switching: p processors, m memory banks. [Figure: a p x m grid of switching elements connecting processors P1..Pp to memory banks M1..Mm.]

4 Dynamic Interconnect A cross bar switch is a non-blocking network, i.e. the connection of a processor to a memory bank does not block the connection of any other processor to any other memory bank. The total number of switching elements required is f(p*m), approximately f(p*p) assuming p = m. As p increases, the complexity of the switching network grows as p*p, so cross bar switches are not scalable in terms of cost.
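The quadratic switch count behind the cost argument can be sketched as a tiny computation (the function name and sample sizes are illustrative, not from the slides):

```python
# Sketch: a p x m crossbar needs one switching element per
# processor/memory-bank pair, so the count grows as ~p^2 when m = p.

def crossbar_switches(p, m=None):
    """Number of switching elements in a p x m crossbar (m defaults to p)."""
    if m is None:
        m = p  # assume as many memory banks as processors
    return p * m

for p in (4, 16, 64):
    print(p, crossbar_switches(p))  # 16, 256, 4096 switches
```

Quadrupling the processor count multiplies the switch count by sixteen, which is why the slide calls crossbars not cost-scalable.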

5 Dynamic Interconnect Bus based network: Processors are connected to global memory by means of a common data path called a bus. [Figure: processors P sharing a single bus to global memory.]

6 Dynamic Interconnect [Figure: the bus-based network again, now with a cache between each processor and the bus.]

7 Bus with and without cache [Figure: plot of performance vs. number of processors for two bus configurations. Which curve is with cache and which without?]

8 Dynamic Interconnect Bus based network: simplicity of construction; provides uniform access to shared memory. However, the bus can carry only a limited amount of data between the memory and the processors, and as the number of processors increases each processor spends more time waiting for memory access while the bus is used by other processors.

9 Static Interconnect Completely Connected: Each processor has a direct communication link to every other processor. Star Connected Network: The middle processor is the central processor; every other processor is connected to it. Counterpart of the cross bar switch in dynamic interconnects.

10 Static Interconnect [Figures: linear array, ring, and mesh network topologies.]

11 Static Interconnect Torus or Wraparound Mesh: [Figure: a mesh with wraparound links on each row and column.]

12 Static Interconnect Hypercube Network: A multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional hypercube consists of p = 2^d processors. [Figure: 0-, 1-, 2-, and 3-dimensional hypercubes.]
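A useful property of the hypercube (standard, though not spelled out on the slide) is that node labels are d-bit numbers and two nodes are linked exactly when their labels differ in one bit, so neighbors can be enumerated by flipping bits; a minimal sketch:

```python
# Sketch: in a d-dimensional hypercube (p = 2**d processors), the
# neighbors of a node are obtained by flipping each of its d label bits.

def hypercube_neighbors(node, d):
    """Labels of the d neighbors of `node` in a d-dimensional hypercube."""
    return [node ^ (1 << i) for i in range(d)]

# Node 0 in a 3-D hypercube (8 processors) has 3 neighbors: 1, 2, 4.
print(sorted(hypercube_neighbors(0, 3)))  # [1, 2, 4]
```

Each node therefore has exactly d = log2(p) links, which is what keeps the hypercube's wiring cost modest compared with a completely connected network.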

13 More on Static Interconnects Diameter: Maximum distance between any two processors in the network (the distance between two processors is defined as the shortest path, in terms of links, between them). This relates to communication time. The diameter of a completely connected network is 1, of a star network 2, and of a ring p/2 (for even p). Connectivity: A measure of the multiplicity of paths between any two processors (the number of arcs that must be removed to break the network into two). High connectivity is desired since it lowers contention for communication resources. Connectivity is 1 for a linear array, 1 for a star, 2 for a ring, 2 for a mesh, and 4 for a torus.
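The diameter formulas quoted above can be collected in a small lookup; the hypercube entry (log2 p) is a standard result added here for comparison, not stated on this slide:

```python
# Sketch: diameters per topology, p = number of processors.
import math

def diameter(topology, p):
    formulas = {
        "completely_connected": lambda p: 1,
        "star": lambda p: 2,               # via the central processor
        "ring": lambda p: p // 2,          # assumes even p, as on the slide
        "hypercube": lambda p: int(math.log2(p)),  # standard, not on slide
    }
    return formulas[topology](p)

print(diameter("ring", 8))       # 4
print(diameter("hypercube", 8))  # 3
```

The contrast is the point: with the same p = 8 processors, a ring needs up to 4 hops while a hypercube needs at most 3, and the gap widens quickly as p grows.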

14 More on Static Interconnects Bisection width: Minimum number of communication links that have to be removed to partition the network into two equal halves. Bisection width is 2 for a ring, sqrt(p) for a mesh with p processors, p/2 for a hypercube, and (p*p)/4 for a completely connected network (p even). Channel width: Number of physical wires in each communication link. Channel rate: Peak rate at which a single physical wire can deliver bits. Channel BW: Peak rate at which data can be communicated between the ends of a communication link ( = (channel width) * (channel rate) ). Bisection BW: Minimum volume of communication allowed between any two halves of the network with an equal number of processors ( = (bisection width) * (channel BW) ).

15 Communication Time Modeling Tcomm = Nmsg * Tmsg, where Nmsg = number of non-overlapping messages and Tmsg = time for one point-to-point communication. With L = length of the message (e.g. in words): Tmsg = ts + tw * L, where ts = startup time (latency, size independent) and tw = asymptotic time per word (1/BW).
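The model is easy to evaluate directly; here is a minimal sketch, with hypothetical network parameters (ts = 10 microseconds startup, tw = 0.1 microseconds per word — illustrative values, not from the slides):

```python
# Sketch of the slide's point-to-point cost model: Tmsg = ts + tw * L.

def t_msg(ts, tw, L):
    """Time for one message of L words: startup plus per-word cost."""
    return ts + tw * L

def t_comm(n_msg, ts, tw, L):
    """Total time for n_msg non-overlapping messages of L words each."""
    return n_msg * t_msg(ts, tw, L)

# Hypothetical: 4 messages of 1000 words, ts = 10 us, tw = 0.1 us/word.
print(t_comm(4, 10e-6, 0.1e-6, 1000))  # 4 * (10 us + 100 us) = 440 us
```

Note how the startup term dominates for short messages and the per-word term for long ones, which is why batching many small messages into fewer large ones usually pays off.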

16 Performance and Scalability Terms Serial runtime (Ts): Time elapsed between the beginning and the end of execution of a sequential program. Parallel runtime (Tn): Time that elapses from the moment a parallel computer starts executing to the moment the last processor finishes execution. Speedup (S): Ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on N processors: S = Ts/Tn.
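As a quick sketch of the definition (the runtimes below are made-up measurements, not from the slides):

```python
# Sketch: speedup from measured serial and parallel runtimes, S = Ts / Tn.

def speedup(ts, tn):
    """Best serial runtime over parallel runtime on N processors."""
    return ts / tn

# Hypothetical measurement: 100 s serial, 12.5 s on 16 processors.
print(speedup(100.0, 12.5))  # 8.0
```

Note that Ts must be the best sequential algorithm's time; measuring speedup against a slow serial baseline inflates S.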

17 Performance and Scalability Terms Efficiency: Measure of the fraction of time for which a processor is usefully employed; defined as the ratio of speedup to the number of processors, E = S/N. Amdahl's law: discussed before. Scalability: An algorithm is scalable if the level of parallelism increases at least linearly with the problem size. An architecture is scalable if it continues to yield the same performance per processor, albeit used on a larger problem size, as the number of processors increases. Algorithm and architecture scalability are important since they allow a user to solve larger problems in the same amount of time by buying a parallel computer with more processors.
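A sketch tying efficiency to the speedup above, together with Amdahl's law in its standard form (the slide only mentions the law by name; the formula and the sample numbers here are the textbook version, added for concreteness):

```python
# Sketch: efficiency E = S/N, plus Amdahl's law for a serial fraction f.

def efficiency(S, N):
    """Fraction of time each of N processors is usefully employed."""
    return S / N

def amdahl_speedup(f_serial, N):
    """Speedup bound when a fraction f_serial of the work cannot be parallelized."""
    return 1.0 / (f_serial + (1.0 - f_serial) / N)

print(efficiency(8.0, 16))                # 0.5
print(round(amdahl_speedup(0.1, 16), 2))  # 1/(0.1 + 0.9/16) = 6.4
```

Even a 10% serial fraction caps the 16-processor speedup at 6.4, i.e. efficiency 40%, which motivates the scalability definitions above.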

18 Performance and Scalability Terms Superlinear speedup: In practice a speedup greater than N (on N processors) is called superlinear speedup. This is observed due to: 1. A non-optimal sequential algorithm. 2. The sequential problem may not fit in one processor's main memory and so requires slow secondary storage, whereas on multiple processors the problem fits in the combined main memory of N processors.

19 Sources of Parallel Overhead Interprocessor communication: Time to transfer data between processors is usually the most significant source of parallel processing overhead. Load imbalance: In some parallel applications it is impossible to distribute the subtask workload equally among the processors, so at some point all but one processor may be done and waiting for the last processor to complete. Extra computation: Sometimes the best sequential algorithm is not easily parallelizable, and one is forced to use a parallel algorithm based on a poorer but more easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of send/recv, which leads to extra computation.

20 CPU Performance Comparison MIPS (millions of instructions per second) = IC / (execution time in seconds * 10^6) = clock rate / (CPI * 10^6), where IC = instruction count for a program and CPI = CPU clock cycles for a program / IC. MIPS is not an accurate measure for comparing performance among computers: MIPS depends on the instruction set, making it difficult to compare the MIPS of computers with different instruction sets; MIPS varies between programs on the same computer; and MIPS can vary inversely to performance.
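Both forms of the MIPS formula can be sketched directly (the 2 GHz / CPI 2 machine below is hypothetical, chosen so the two forms visibly agree):

```python
# Sketch of the slide's two equivalent MIPS formulas.

def mips_from_time(instr_count, exec_time_s):
    """MIPS = IC / (execution time in seconds * 10^6)."""
    return instr_count / (exec_time_s * 1e6)

def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

# Hypothetical machine: 2 GHz clock, CPI of 2.
print(mips_from_cpi(2e9, 2.0))   # 1000.0
# Consistent run: 5e9 instructions in 5 s on that machine.
print(mips_from_time(5e9, 5.0))  # 1000.0
```

The same 1000-MIPS machine would post a different rating on a program with a different CPI, which is exactly the slide's objection to MIPS as a cross-machine metric.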

21 CPU Performance Comparison MFLOPS (millions of floating point operations per second) = (# of floating point operations in a program) / (execution time in seconds * 10^6). MFLOPS is dependent on the machine and on the program (the same program running on different computers would execute a different # of instructions but the same # of FP operations). MFLOPS is also not a consistent and useful measure of performance because: the set of FP operations is not consistent across machines, e.g. some have divide instructions and some don't; and the MFLOPS rating for a single program cannot be generalized to establish a single performance metric for a computer.
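The MFLOPS formula as a one-liner, with a hypothetical run (3e9 floating-point operations in 2 s; the numbers are illustrative, not from the slides):

```python
# Sketch of the slide's MFLOPS formula.

def mflops(fp_ops, exec_time_s):
    """MFLOPS = FP operation count / (execution time in seconds * 10^6)."""
    return fp_ops / (exec_time_s * 1e6)

# Hypothetical run: 3e9 floating-point operations in 2 s.
print(mflops(3e9, 2.0))  # 1500.0
```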

22 CPU Performance Comparison Execution time is the principal measure of performance. Unlike execution time, a single MIPS or MFLOPS rating tempts one to characterize a machine without naming the program, specifying the I/O, or describing the versions of the OS and compilers.