Slides Prepared from the CI-Tutor Courses at NCSA http://ci-tutor.ncsa.uiuc.edu/ By S. Masoud Sadjadi School of Computing and Information Sciences Florida International University March 2009 Parallel Computing Explained About the IBM Regatta P690
Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 7 Cache Tuning 8 Parallel Performance Analysis 9 About the IBM Regatta P690 9.1 IBM p690 General Overview 9.2 IBM p690 Building Blocks 9.3 Features Performed by the Hardware 9.4 The Operating System 9.5 Further Information
About the IBM Regatta P690 To obtain your program’s top performance, it is important to understand the architecture of the computer system on which the code runs. This chapter describes the architecture of NCSA's IBM p690. Technical details on the size and design of the processors, memory, cache, and the interconnect network are covered along with technical specifications for the compute rate, memory size and speed, and interconnect bandwidth.
IBM p690 General Overview The p690 is IBM's latest Symmetric Multi-Processor (SMP) machine with Distributed Shared Memory (DSM). This means that memory is physically distributed and logically shared. It is based on the Power4 architecture and is a successor to the Power3-II based RS/6000 SP system. IBM p690 Scalability The IBM p690 is a flexible, modular, and scalable architecture. It scales in these terms: number of processors; memory size; I/O and memory bandwidth; and interconnect bandwidth.
Agenda 9 About the IBM Regatta P690 9.1 IBM p690 General Overview 9.2 IBM p690 Building Blocks 9.2.1 Power4 Core 9.2.2 Multi-Chip Modules 9.2.3 The Processor 9.2.4 Cache Architecture 9.2.5 Memory Subsystem 9.3 Features Performed by the Hardware 9.4 The Operating System 9.5 Further Information
IBM p690 Building Blocks An IBM p690 system is built from a number of fundamental building blocks. The first of these building blocks is the Power4 Core, which includes the processors and the L1 and L2 caches. At NCSA, four of these Power4 Cores are linked to form a Multi-Chip Module, which also includes the L3 cache. Four Multi-Chip Modules are in turn linked to form a 32-processor system (see figure on the next slide). Each of these components will be described in the following sections.
32-processor IBM p690 configuration (Image courtesy of IBM)
Power4 Core The Power4 Chip contains: two processors; local caches (L1); a cache external to the processors and shared by both (L2); I/O and interconnect interfaces.
The POWER4 chip (Image courtesy of IBM)
Multi-Chip Modules Four Power4 Chips are assembled to form a Multi-Chip Module (MCM) that contains 8 processors. Each MCM also supports the L3 cache for each Power4 chip. Multiple MCM interconnection (Image courtesy of IBM)
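The building-block counts can be checked with a little arithmetic. The following is a minimal sketch using only the counts given in these slides (2 processors per Power4 chip, 4 chips per MCM, 4 MCMs in the NCSA configuration); the constant and function names are made up for illustration.

```python
# Counts taken from the slides; the names are illustrative only.
PROCESSORS_PER_CHIP = 2   # each Power4 chip holds two processors
CHIPS_PER_MCM = 4         # four Power4 chips per Multi-Chip Module
MCMS_PER_SYSTEM = 4       # NCSA configuration: four MCMs


def processors_per_mcm():
    """Processors contained in one Multi-Chip Module."""
    return PROCESSORS_PER_CHIP * CHIPS_PER_MCM


def processors_per_system():
    """Processors in the full system built from all MCMs."""
    return processors_per_mcm() * MCMS_PER_SYSTEM


print(processors_per_mcm())     # 8
print(processors_per_system())  # 32
```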
The Processor The processors at the heart of the Power4 Core use speculative, superscalar, out-of-order execution. The Power4 is a 4-way superscalar RISC architecture running instructions on its 8 pipelined execution units. Speed of the Processor The NCSA IBM p690 has CPUs running at 1.3 GHz. 64-Bit Processor Execution Units There are 8 independent, fully pipelined execution units: 2 load/store units for memory access; 2 identical floating point execution units capable of fused multiply/add; 2 fixed point execution units; 1 branch execution unit; 1 logic operation unit.
The Processor Each processor can perform 4 floating point operations, fetch 8 instructions, and complete 5 instructions per cycle. It can handle up to 200 in-flight instructions. Performance Numbers Peak Performance: 4 floating point operations per cycle; 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS. MIPS Rating: 5 instructions per cycle; 1.3 Gcycles/sec * 5 instructions/cycle yields 6.5 billion instructions per second, or 6500 MIPS. Instruction Set The instruction set architecture (ISA) on the IBM p690 is the PowerPC AS instruction set.
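The peak numbers follow directly from the clock rate and the per-cycle limits. The sketch below redoes the arithmetic for one processor and, assuming perfect scaling (which real codes will not reach), for the 32-processor system; the function names are illustrative. Note that 1.3 GHz times 5 instructions per cycle is 6.5 billion instructions per second, which is 6500 MIPS.

```python
CLOCK_HZ = 1.3e9              # processor clock from the slides
FLOPS_PER_CYCLE = 4           # 2 FPUs, each doing a fused multiply/add
INSTRUCTIONS_PER_CYCLE = 5    # instructions completed per cycle
PROCESSORS = 32               # NCSA configuration


def peak_gflops(clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak floating point rate of one processor, in GFLOPS."""
    return clock_hz * flops_per_cycle / 1e9


def peak_mips(clock_hz=CLOCK_HZ, ipc=INSTRUCTIONS_PER_CYCLE):
    """Theoretical peak instruction rate of one processor, in MIPS."""
    return clock_hz * ipc / 1e6


print(peak_gflops())               # about 5.2 GFLOPS per processor
print(peak_mips())                 # about 6500 MIPS per processor
print(peak_gflops() * PROCESSORS)  # about 166.4 GFLOPS, ideal scaling only
```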
Cache Architecture Each Power4 Core has both a primary (L1) cache associated with each processor and a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has an L3 cache. Level 1 Cache The Level 1 cache is in the processor core. It has split instruction and data caches. L1 Instruction Cache The properties of the Instruction Cache are: 64KB in size; direct mapped; cache line size is 128 bytes. L1 Data Cache The properties of the L1 Data Cache are: 32KB in size; 2-way set associative; FIFO replacement policy; 2-way interleaved; cache line size is 128 bytes. Peak speed is achieved when the data accessed in a loop is entirely contained in the L1 data cache.
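The size, line size, and associativity figures determine how many lines and sets each L1 cache has. The sketch below derives those counts and then maps an address to a set using plain modulo indexing. The modulo mapping is an assumption made for illustration; the slides do not say how the hardware selects a set.

```python
KB = 1024


def cache_geometry(size_bytes, line_bytes, ways):
    """Return (number of lines, number of sets) for a cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets


def set_index(address, line_bytes, sets):
    """Set an address falls in, ASSUMING simple modulo indexing."""
    return (address // line_bytes) % sets


# L1 instruction cache: 64KB, direct mapped (1 way), 128-byte lines
print(cache_geometry(64 * KB, 128, 1))   # (512, 512)

# L1 data cache: 32KB, 2-way set associative, 128-byte lines
d_lines, d_sets = cache_geometry(32 * KB, 128, 2)
print(d_lines, d_sets)                   # 256 128

# Under the modulo assumption, addresses 16KB apart share a set
print(set_index(0, 128, d_sets))         # 0
print(set_index(16 * KB, 128, d_sets))   # 0
```

Under this assumed mapping, a loop that repeatedly touches three or more arrays whose elements sit a multiple of 16KB apart would compete for the two ways of a single set, which is one way a working set smaller than 32KB can still miss in the L1 data cache.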
Cache Architecture Level 2 Cache on the Power4 Chip When the processor can't find a data element in the L1 cache, it looks in the L2 cache. The properties of the L2 Cache are: external to the processor; unified instruction and data cache; 1.41MB per Power4 chip (2 processors); 8-way set associative; split between 3 controllers; cache line size is 128 bytes; pseudo LRU replacement policy for cache coherence; 124.8 GB/s peak bandwidth from L2.
Cache Architecture Level 3 Cache on the Multi-Chip Module When the processor can't find a data element in the L2 cache, it looks in the L3 cache. The properties of the L3 Cache are: external to the Power4 Core; unified instruction and data cache; 128MB per Multi-Chip Module (8 processors); 8-way set associative; cache line size is 512 bytes; 55.5 GB/s peak bandwidth from L3.
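To compare the cache bandwidth figures with the processor's appetite for data, it can help to express them as bytes per processor clock cycle. The sketch below divides the 124.8 GB/s figure listed for the L2 cache and the 55.5 GB/s figure listed under the L3 cache by the 1.3 GHz processor clock. It assumes 1 GB = 10^9 bytes and uses the processor clock purely as a yardstick; the slides do not say at what clock the caches themselves run.

```python
CLOCK_HZ = 1.3e9  # processor clock from the slides


def bytes_per_cycle(bandwidth_gb_per_s, clock_hz=CLOCK_HZ):
    """Peak bytes delivered per processor cycle (assumes 1 GB = 1e9 bytes)."""
    return bandwidth_gb_per_s * 1e9 / clock_hz


print(round(bytes_per_cycle(124.8), 2))  # 96.0  (L2 figure)
print(round(bytes_per_cycle(55.5), 2))   # 42.69 (L3 figure)
```

The L2 figure works out to 96 bytes per processor cycle, which is less than one 128-byte cache line per cycle.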
Memory Subsystem The total memory is physically distributed among the Multi-Chip Modules of the p690 system (see the diagram in the next slide). Memory Latencies The latency penalties for each of the levels of the memory hierarchy are: L1 Cache - 4 cycles; L2 Cache - 14 cycles; L3 Cache - 102 cycles; Main Memory - 400 cycles.
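The latencies above are easier to reason about in nanoseconds and as a weighted average. The sketch below converts each latency at the 1.3 GHz clock and then estimates an average access cost. The access mix in the example is hypothetical, chosen only to illustrate the calculation, and the sketch assumes each listed latency is the full cost of an access satisfied at that level.

```python
CLOCK_HZ = 1.3e9  # processor clock from the slides

# Latencies in cycles, as listed on the slide
LATENCY_CYCLES = {
    "L1 cache": 4,
    "L2 cache": 14,
    "L3 cache": 102,
    "Main memory": 400,
}


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles / clock_hz * 1e9


def average_access_cycles(fractions):
    """Weighted average latency; fractions maps level -> share of accesses."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(LATENCY_CYCLES[level] * f for level, f in fractions.items())


for level, cycles in LATENCY_CYCLES.items():
    print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.2f} ns")

# HYPOTHETICAL access mix, not measured on any real code
example_mix = {"L1 cache": 0.90, "L2 cache": 0.07,
               "L3 cache": 0.02, "Main memory": 0.01}
print(average_access_cycles(example_mix))
```

With this made-up mix the average comes to roughly 10.6 cycles, more than twice the L1 latency even though 90% of accesses hit in L1, which shows how heavily the rare trips to L3 and main memory weigh on the total.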
Memory distribution within an MCM
Features Performed by the Hardware The following are done completely by the hardware, transparent to the user: global memory addressing (makes the system memory shared); address resolution; maintaining cache coherency; automatic page migration from remote to local memory (to reduce interconnect memory transactions).
The Operating System The operating system is AIX. NCSA's p690 system is currently running version 5.1 of AIX. Version 5.1 has a full 64-bit file system. Compatibility AIX 5.1 is highly compatible with both BSD and System V Unix.
Further Information Computer Architecture: A Quantitative Approach, John Hennessy, et al., Morgan Kaufmann Publishers, 2nd Edition, 1996. Computer Organization and Design: The Hardware/Software Interface, David A. Patterson, et al., Morgan Kaufmann Publishers, 2nd Edition, 1997. IBM P Series at the URL: http://www-03.ibm.com/systems/p/hardware/highend/590/index.html IBM p690 Documentation at NCSA at the URL: http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/