Chapter 7 Multicores, Multiprocessors, and Clusters.

Presentation transcript:

Chapter 7 Multicores, Multiprocessors, and Clusters

Chapter 7 Multicores, Multiprocessors, and Clusters 2 FIGURE 7.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 3 FIGURE 7.2 Classic organization of a shared memory multiprocessor. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 4 FIGURE 7.3 The last four levels of a reduction that sums results from each processor, from bottom to top. For all processors whose number i is less than half, add the sum produced by processor number (i + half) to its sum. Copyright © 2009 Elsevier, Inc. All rights reserved.
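
The reduction in Figure 7.3 can be sketched in code. Below is a minimal shared-memory version, assuming OpenMP for thread creation and barriers, with eight threads and the names sum, half, and pn chosen purely for illustration (they are not the book's code).

#include <omp.h>
#include <stdio.h>

#define P 8                 /* number of processors/threads -- assumed for illustration */

double sum[P];              /* sum[i] holds the partial sum produced by processor i */

/* Tree reduction as described in the caption: at each level, every processor
 * whose number pn is less than half adds the partial sum of processor
 * (pn + half) into its own slot, so the number of active processors is cut
 * in half until sum[0] holds the total. */
void reduce_partial_sums(void)
{
    #pragma omp parallel num_threads(P)
    {
        int pn = omp_get_thread_num();      /* this processor's number */
        int half = P;
        while (half > 1) {
            #pragma omp barrier             /* wait for all partial sums at this level */
            if (half % 2 != 0 && pn == 0)
                sum[0] += sum[half - 1];    /* odd count: processor 0 absorbs the extra element */
            half = half / 2;
            if (pn < half)
                sum[pn] += sum[pn + half];  /* add the sum from processor (pn + half) */
        }
    }
}

int main(void)
{
    for (int i = 0; i < P; i++)
        sum[i] = i + 1.0;                   /* pretend each processor already computed a partial sum */
    reduce_partial_sums();
    printf("total = %g\n", sum[0]);         /* 1 + 2 + ... + 8 = 36 */
    return 0;
}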

Chapter 7 Multicores, Multiprocessors, and Clusters 5 FIGURE 7.4 Classic organization of a multiprocessor with multiple private address spaces, traditionally called a message-passing multiprocessor. Note that unlike the SMP in Figure 7.2, the interconnection network is not between the caches and memory but is instead between processor-memory nodes. Copyright © 2009 Elsevier, Inc. All rights reserved.
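
To contrast with the shared-memory reduction above, here is a minimal message-passing sketch of the same sum, assuming MPI as the communication layer and a power-of-two node count for brevity. Each node keeps its partial sum in private memory and results are combined only through explicit sends and receives over the interconnection network between processor-memory nodes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int pn, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &pn);   /* this node's number */
    MPI_Comm_size(MPI_COMM_WORLD, &p);    /* number of processor-memory nodes */

    double my_sum = pn + 1.0;             /* pretend this node computed a partial sum */

    /* Halve the set of active senders each step; assumes p is a power of two. */
    for (int half = p / 2; half >= 1; half /= 2) {
        if (pn >= half && pn < 2 * half) {
            /* Upper half of the active nodes ship their partial sums down. */
            MPI_Send(&my_sum, 1, MPI_DOUBLE, pn - half, 0, MPI_COMM_WORLD);
        } else if (pn < half) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, pn + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            my_sum += other;              /* combine the received partial sum */
        }
    }

    if (pn == 0)
        printf("total = %g\n", my_sum);   /* MPI_Reduce would do the same in one call */

    MPI_Finalize();
    return 0;
}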

Chapter 7 Multicores, Multiprocessors, and Clusters 6 FIGURE 7.5 How four threads use the issue slots of a superscalar processor in different approaches. The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together in three multithreading options. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to four different threads in the multithreading processors. The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss in throughput for coarse multithreading. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 7 FIGURE 7.6 Hardware categorization and examples based on number of instruction streams and data streams: SISD, SIMD, MISD, and MIMD. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 8 FIGURE 7.7 Comparing single core of a Sun UltraSPARC T2 (Niagara 2) to a single Tesla multiprocessor. The T2 core is a single processor and uses hardware-supported multithreading with eight threads. The Tesla multiprocessor contains eight streaming processors and uses hardware-supported multithreading with 24 warps of 32 threads (eight processors times four clock cycles). The T2 can switch every clock cycle, while the Tesla can switch only every two or four clock cycles. One way to compare the two is that the T2 can only multithread the processor over time, while Tesla can multithread over time and over space; that is, across the eight streaming processors as well as segments of four clock cycles. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 9 FIGURE 7.8 Hardware categorization of processor architectures and examples based on static versus dynamic and ILP versus DLP. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 10 FIGURE 7.9 Network topologies that have appeared in commercial parallel processors. The colored circles represent switches and the black squares represent processor-memory nodes. Even though a switch has many links, generally only one goes to the processor. The Boolean n-cube topology is an n-dimensional interconnect with 2ⁿ nodes, requiring n links per switch (plus one for the processor) and thus n nearest-neighbor nodes. Frequently, these basic topologies have been supplemented with extra arcs to improve performance and reliability. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 11 FIGURE 7.10 Popular multistage network topologies for eight nodes. The switches in these drawings are simpler than in earlier drawings because the links are unidirectional; data comes in at the bottom and exits out the right link. The switch box in c can pass A to C and B to D or B to C and A to D. The crossbar uses n² switches, where n is the number of processors, while the Omega network uses 2n log₂ n of the smaller switches, grouped into large switch boxes, each of which is logically composed of four of the smaller switches. In this case, the crossbar uses 64 switches versus 12 switch boxes, or 48 switches, in the Omega network. The crossbar, however, can support any combination of messages between processors, while the Omega network cannot. Copyright © 2009 Elsevier, Inc. All rights reserved.
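
A quick arithmetic check of the counts quoted in this caption, assuming n = 8 processors as in the figure (the program is purely illustrative):

#include <math.h>
#include <stdio.h>

/* Reproduce the switch counts for n = 8 processors: a crossbar needs n*n
 * small switches, while an Omega network needs (n/2)*log2(n) larger switch
 * boxes, each built from four small 2x2 switches, i.e., 2n*log2(n) small
 * switches in total. */
int main(void)
{
    int n = 8;                                   /* number of processors */
    int crossbar_switches = n * n;               /* 64 for n = 8 */
    int omega_boxes = (n / 2) * (int)log2(n);    /* 12 switch boxes for n = 8 */
    int omega_switches = 4 * omega_boxes;        /* 48 small switches for n = 8 */

    printf("crossbar: %d switches\n", crossbar_switches);
    printf("omega:    %d boxes = %d switches\n", omega_boxes, omega_switches);
    return 0;
}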

Chapter 7 Multicores, Multiprocessors, and Clusters 12 FIGURE 7.11 Examples of parallel benchmarks. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 13 FIGURE 7.12 Arithmetic intensity, specified as the number of floating-point operations to run the program divided by the number of bytes accessed in main memory [Williams, Patterson, 2008]. Some kernels have an arithmetic intensity that scales with problem size, such as Dense Matrix, but there are many kernels with arithmetic intensities independent of problem size. For kernels in the former case, weak scaling can lead to different results, since it puts much less demand on the memory system. Copyright © 2009 Elsevier, Inc. All rights reserved.
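
As an illustrative calculation (not taken from the book), the arithmetic intensity of a simple y[i] = a*x[i] + y[i] loop over doubles, assuming no cache reuse, works out to roughly 0.08 FLOPs/byte:

#include <stdio.h>

/* Arithmetic intensity = floating-point operations / bytes moved to and from
 * main memory. For y[i] = a*x[i] + y[i] over doubles with no cache reuse,
 * each iteration does 2 FLOPs and touches 24 bytes (read x[i], read y[i],
 * write y[i]). */
int main(void)
{
    long n = 1000000;                          /* assumed vector length */
    double flops = 2.0 * n;                    /* one multiply and one add per element */
    double bytes = 3.0 * sizeof(double) * n;   /* two reads and one write per element */
    printf("arithmetic intensity = %.3f FLOPs/byte\n", flops / bytes);  /* ~0.083 */
    return 0;
}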

Chapter 7 Multicores, Multiprocessors, and Clusters 14 FIGURE 7.13 Roofline Model [Williams, Patterson, 2008]. This example has a peak floating-point performance of 16 GFLOPS/sec and a peak memory bandwidth of 16 GB/sec from the Stream benchmark. (Since Stream is actually four measurements, this line is the average of the four.) The dotted vertical line in color on the left represents Kernel 1, which has an arithmetic intensity of 0.5 FLOPs/byte. It is limited by memory bandwidth to no more than 8 GFLOPS/sec on this Opteron X2. The dotted vertical line to the right represents Kernel 2, which has an arithmetic intensity of 4 FLOPs/byte. It is limited only by computation to 16 GFLOPS/sec. (This data is based on the AMD Opteron X2 (Revision F) using dual cores running at 2 GHz in a dual socket system.) Copyright © 2009 Elsevier, Inc. All rights reserved.
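
The bound imposed by the Roofline model is min(peak floating-point rate, peak memory bandwidth × arithmetic intensity). A small sketch using only the numbers quoted in this caption reproduces the 8 and 16 GFLOPS/sec limits for the two kernels:

#include <stdio.h>

/* Attainable GFLOPS/sec = min(peak floating-point rate,
 *                             peak memory bandwidth * arithmetic intensity). */
static double roofline(double peak_gflops, double peak_gbytes, double ai)
{
    double memory_bound = peak_gbytes * ai;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void)
{
    double peak_gflops = 16.0;   /* Opteron X2 peak FP performance (from the caption) */
    double peak_gbytes = 16.0;   /* Stream memory bandwidth (from the caption) */

    printf("Kernel 1 (AI = 0.5): %.1f GFLOPS/sec\n", roofline(peak_gflops, peak_gbytes, 0.5)); /* 8  */
    printf("Kernel 2 (AI = 4.0): %.1f GFLOPS/sec\n", roofline(peak_gflops, peak_gbytes, 4.0)); /* 16 */
    return 0;
}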

Chapter 7 Multicores, Multiprocessors, and Clusters 15 FIGURE 7.14 Roofline models of two generations of Opterons. The Opteron X2 roofline, which is the same as in Figure 7.13, is in black, and the Opteron X4 roofline is in color. The bigger ridge point of the Opteron X4 means that kernels that were computationally bound on the Opteron X2 could be memory-performance bound on the Opteron X4. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 16 FIGURE 7.15 Roofline model with ceilings. The top graph shows the computational ceilings of 8 GFLOPS/sec if the floating-point operation mix is imbalanced and 2 GFLOPS/sec if the optimizations to increase ILP and SIMD are also missing. The bottom graph shows the memory bandwidth ceilings of 11 GB/sec without software prefetching and 4.8 GB/sec if memory affinity optimizations are also missing. Copyright © 2009 Elsevier, Inc. All rights reserved.
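
A sketch of how such ceilings tighten the Roofline bound, using the ceiling values quoted in this caption and an arithmetic intensity of 0.5 (Kernel 1) purely for illustration; the particular combinations of ceilings shown are assumptions, not measurements:

#include <stdio.h>

/* Each missing optimization replaces the corresponding peak in the roofline
 * formula min(FP ceiling, bandwidth ceiling * AI) with a lower ceiling. */
static double bound(double fp_ceiling, double bw_ceiling, double ai)
{
    double memory_bound = bw_ceiling * ai;
    return memory_bound < fp_ceiling ? memory_bound : fp_ceiling;
}

int main(void)
{
    double ai = 0.5;   /* arithmetic intensity of Kernel 1, for illustration */

    /* Computational ceilings: 16 -> 8 GFLOPS/sec with an imbalanced FP mix,
     * -> 2 GFLOPS/sec when ILP and SIMD optimizations are also missing.
     * Bandwidth ceilings: 16 -> 11 GB/sec without software prefetching,
     * -> 4.8 GB/sec when memory affinity optimizations are also missing. */
    printf("all optimizations:        %.1f GFLOPS/sec\n", bound(16.0, 16.0, ai));
    printf("no prefetch, no affinity: %.1f GFLOPS/sec\n", bound(16.0, 4.8, ai));
    printf("no ILP/SIMD, no balance:  %.1f GFLOPS/sec\n", bound(2.0, 16.0, ai));
    return 0;
}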

Chapter 7 Multicores, Multiprocessors, and Clusters 17 FIGURE 7.16 Roofline model with ceilings, overlapping areas shaded, and the two kernels from Figure 7.13. Kernels whose arithmetic intensity lands in the blue trapezoid on the right should focus on computation optimizations, and kernels whose arithmetic intensity lands in the gray triangle in the lower left should focus on memory bandwidth optimizations. Those that land in the blue-gray parallelogram in the middle need to worry about both. As Kernel 1 falls in the parallelogram in the middle, try optimizing ILP and SIMD, memory affinity, and software prefetching. Kernel 2 falls in the trapezoid on the right, so try optimizing ILP and SIMD and the balance of floating-point operations. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 18 FIGURE 7.17 Four recent multiprocessors, each using two sockets for the processors. Starting from the upper left-hand corner, the computers are: (a) Intel Xeon e5345 (Clovertown), (b) AMD Opteron X4 (Barcelona), (c) Sun UltraSPARC T2 (Niagara 2), and (d) IBM Cell QS20. Note that the Intel Xeon e5345 (Clovertown) has a separate north bridge chip not found in the other microprocessors. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 19 FIGURE 7.18 Characteristics of the four recent multicores. Although the Xeon e5345 and Opteron X4 have the same DRAM speed, the Stream benchmark shows a higher practical memory bandwidth for the Opteron X4 because of the inefficiencies of the front side bus on the Xeon e5345. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 20 FIGURE 7.19 Roofline model for the multicore multiprocessors in Figure 7.17. The ceilings are the same as in Figure 7.15. Starting from the upper left-hand corner, the computers are: (a) Intel Xeon e5345 (Clovertown), (b) AMD Opteron X4 (Barcelona), (c) Sun UltraSPARC T2 (Niagara 2), and (d) IBM Cell QS20. Note that the ridge points for the four microprocessors intersect the x-axis at arithmetic intensities of 6, 4, 1/3, and 3/4, respectively. The dashed vertical lines are for the two kernels of this section, and the stars mark the performance achieved for these kernels after all the optimizations. SpMV is the pair of dashed vertical lines on the left. It has two lines because its arithmetic intensity improved with register blocking optimizations. LBMHD is the dashed vertical lines on the right. It has a pair of lines in (a) and (b) because a cache optimization skips filling the cache block on a miss when the processor would write new data into the entire block. That optimization increases the arithmetic intensity from 0.70 to 1.07. It is a single line in (c) at 0.70 because the UltraSPARC T2 does not offer the cache optimization. It is a single line at 1.07 in (d) because Cell has a local store loaded by DMA, so the program doesn't fetch unnecessary data as caches do. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 21 FIGURE 7.20 Performance of SpMV on the four multicores. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 22 FIGURE 7.21 Performance of LBMHD on the four multicores. Copyright © 2009 Elsevier, Inc. All rights reserved.

Chapter 7 Multicores, Multiprocessors, and Clusters 23 FIGURE 7.22 Base versus fully optimized performance of the four multicores on the two kernels. Note the high fraction of fully optimized performance delivered by the Sun UltraSPARC T2 (Niagara 2). There is no base performance column for the IBM Cell because there is no way to port the code to the SPEs without caches. While you could run the code on the Power core, it has an order of magnitude lower performance than the SPEs, so we ignore it in this figure. Copyright © 2009 Elsevier, Inc. All rights reserved.