Computer Organization & Design 计算机组成与设计


1 Computer Organization & Design 计算机组成与设计
Weidong Wang (王维东), College of Information Science & Electronic Engineering, Institute of Information and Communication Network Engineering (ICAN), Zhejiang University

2 Course Information
Instructor: Weidong WANG
Tel(O): ; Mobile: ; Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306
TAs: Binbin CHEN (陈彬彬); Jiayun CHEN (陈佳云)
TA Office Hours: Wednesday & Saturday, 14:00-16:30, Xindian (High-Tech) Building 308 (SMS and email also work)
WeChat group: “2017计组群”

3 Lecture 13: Introduction to Multi-core Processors

4 Motivation: Single-Processor Performance Scaling

5 Multi-core Chips (aka Chip Multi-Processors, or CMPs)

6 Sample of Multi-core Options

7 Sample of Multi-core Options

8 And There is Much More… heterogeneous cores, multiple sockets, clusters

9 Vector Supercomputers
Epitomized by the Cray-1, 1976:
Scalar unit: load/store architecture
Vector extension: vector registers, vector instructions
Implementation: hardwired control, highly pipelined functional units, interleaved memory system, no data caches, no virtual memory

10 Vector Programming Model
Figure: scalar registers r0-r15 alongside vector registers v0-v15, each holding up to VLRMAX elements; the vector length register (VLR) sets how many elements an operation processes
Vector arithmetic instructions, e.g. ADDV v3, v1, v2, operate elementwise on vector registers v1 and v2, writing v3
Vector load and store instructions, e.g. LV v1, r1, r2, move VLR elements between memory and a vector register, with the base address in r1 and the stride in r2
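
To make these semantics concrete, here is a minimal scalar emulation of the two instructions shown above, written in C-style code. The names LV, ADDV, vreg_t, and VLRMAX mirror the slide's notation but are illustrative only, not a real ISA or library:

    // Illustrative emulation of vector-register semantics.
    enum { VLRMAX = 64 };                              // max elements per vector register
    typedef struct { double elem[VLRMAX]; } vreg_t;    // one vector register

    // LV v1, r1, r2: load vlr elements into v1 from base address r1,
    // advancing by stride elements (r2) between consecutive loads.
    void LV(vreg_t *v1, const double *base, long stride, int vlr) {
        for (int i = 0; i < vlr; i++)
            v1->elem[i] = base[i * stride];
    }

    // ADDV v3, v1, v2: elementwise add of the first vlr elements.
    void ADDV(vreg_t *v3, const vreg_t *v1, const vreg_t *v2, int vlr) {
        for (int i = 0; i < vlr; i++)
            v3->elem[i] = v1->elem[i] + v2->elem[i];
    }

A real vector machine issues each of these as a single instruction and pipelines the element operations; the loops here only spell out the semantics.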

11 Multimedia Extensions (aka SIMD extensions)
Very short vectors added to existing ISAs for microprocessors
Use existing 64-bit registers split into 2x32b, 4x16b, or 8x8b lanes
This concept was first used on the Lincoln Labs TX-2 computer in 1957, with a 36b datapath split into 2x18b or 4x9b
Newer designs have 128-bit registers (PowerPC AltiVec, Intel SSE2/3/4)
A single instruction operates on all elements within the register, e.g. four 16-bit adds at once
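
As a sketch of what one such packed instruction accomplishes, the function below emulates a 4x16b add on a 64-bit register in portable C. A real multimedia extension performs all four lane additions in a single hardware instruction; the name paddw_4x16 is made up for illustration:

    #include <stdint.h>

    // Emulated packed add: four independent 16-bit adds inside one 64-bit word.
    uint64_t paddw_4x16(uint64_t a, uint64_t b) {
        uint64_t result = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t ea  = (uint16_t)(a >> (16 * lane));   // extract one lane
            uint16_t eb  = (uint16_t)(b >> (16 * lane));
            uint16_t sum = (uint16_t)(ea + eb);            // carries stay inside the lane
            result |= (uint64_t)sum << (16 * lane);        // repack the lane
        }
        return result;
    }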

12 Supercomputers
Definitions of a supercomputer:
Fastest machine in the world at a given task
A device to turn a compute-bound problem into an I/O-bound problem
Any machine costing $30M+
Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer

13 CDC 6600 (Seymour Cray, 1963)
A fast pipelined machine with 60-bit words
128 Kword main memory capacity, 32 banks
Ten functional units (parallel, unpipelined): floating point (adder, 2 multipliers, divider); integer (adder, 2 incrementers, ...)
Hardwired control (no microcoding)
Scoreboard for dynamic scheduling of instructions
Ten peripheral processors for input/output, each a fast multi-threaded 12-bit integer ALU
Very fast clock, 10 MHz (FP add in 4 clocks)
>400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based cooling technology
Fastest machine in the world for 5 years (until the 7600); over 100 sold ($7-10M each)

14 CDC6600: Vector Addition

          B0 ← -n
    loop: JZE B0, exit
          A0 ← B0 + a0    ; load X0
          A1 ← B0 + b0    ; load X1
          X6 ← X0 + X1
          A6 ← B0 + c0    ; store X6
          B0 ← B0 + 1
          jump loop
    exit:

Ai = address register; Bi = index register; Xi = data register
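
For reference, the loop above is just an elementwise vector sum. A plain C rendering, with hypothetical array names x, y, and z standing in for the base addresses a0, b0, and c0:

    // z[i] = x[i] + y[i] for i = 0..n-1, matching the CDC6600 loop.
    void vadd(const double *x, const double *y, double *z, long n) {
        for (long i = 0; i < n; i++)
            z[i] = x[i] + y[i];    // X6 <- X0 + X1 on each iteration
    }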

15 Supercomputer Applications
Typical application areas:
Military research (nuclear weapons, cryptography)
Scientific research
Weather forecasting
Oil exploration
Industrial design (car crash simulation)
Bioinformatics
Cryptography
All involve huge computations on large data sets
In the 70s-80s, supercomputer meant vector machine

16 BlueGene/Q Compute Chip
System-on-a-chip design: integrates processors, memory, and networking logic into a single chip
360 mm² in Cu-45 (SOI) technology, ~1.47 B transistors
16 user + 1 service processors, plus 1 redundant processor; all processors are symmetric, each 4-way multi-threaded, 64-bit PowerISA, 1.6 GHz
L1 I/D cache = 16 kB/16 kB, with L1 prefetch engines
Each processor has a quad FPU (4-wide double precision, SIMD) for peak performance
Central shared L2 cache: 32 MB eDRAM; the multiversioned cache supports transactional memory, speculative execution, and atomic ops
Dual memory controller: 16 GB external DDR3 memory at 1.33 Gb/s, 2 x 16-byte-wide interface (+ECC)
Chip-to-chip networking: router logic integrated into the BQC chip
External I/O: PCIe Gen2 interface

17 Blue Gene/Q Packaging Hierarchy
1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2, or 4 I/O drawers
7. System: 20 PF/s
5-D topology: 16x16x16x12x2 (a Q32 card is 2x2x2x2x2 and a midplane is 4x4x4x4x2)
Ref: SC2010

18 Graphics Processing Units (GPUs)
Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s), including high-performance floating-point units
Provided workstation-like graphics for PCs
Users could configure the graphics pipeline, but not really program it
Over time, more programmability was added, e.g. the new language Cg for writing small programs that run on each vertex or each pixel, and Windows DirectX variants
Massively parallel (millions of vertices or pixels per frame) but a very constrained programming model
Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading
An incredibly difficult programming model, since general computation had to be expressed through the graphics pipeline
Vertex shader: colors are computed only at a triangle's three vertices; the other pixels inside the triangle are obtained by interpolation
Pixel shader: the color of every pixel is computed

19 General-Purpose GPUs (GP-GPUs)
In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA, the “Compute Unified Device Architecture”
Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas
Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution
This lecture uses a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics; describing graphics processing would probably need another course
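
A minimal sketch of the attached-processor model in CUDA, assuming a made-up kernel name vadd and omitting error checking. The host CPU allocates device memory, issues the data-parallel kernel, and waits for it to finish:

    #include <cuda_runtime.h>

    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one CUDA thread per element
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;                                // device pointers
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));
        // ... fill a and b, e.g. with cudaMemcpy from host arrays ...
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);      // host issues the kernel
        cudaDeviceSynchronize();                         // wait for the GPU
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }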

20 “Single Instruction, Multiple Thread” (SIMT)
GPUs use a SIMT model, where the individual scalar instruction streams of the CUDA threads are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp)
Figure: microthreads µT0-µT7 each issue the same scalar instruction stream (ld x; mul a; ld y; add; st y), and each instruction executes SIMD-fashion across the warp
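
As a sketch of how one CUDA thread gives rise to the scalar stream in the figure, consider a hypothetical saxpy-style kernel; each microthread executes these five operations, and the hardware runs 32 such threads (one warp) in lockstep:

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float xv = x[i];     // ld x
            float p  = a * xv;   // mul a
            float yv = y[i];     // ld y
            float s  = p + yv;   // add
            y[i] = s;            // st y
        }
    }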

21 Nvidia Fermi GF100 GPU [Nvidia, 2010]

22 GPU Future
High-end desktops have a separate GPU chip, but the trend is toward integrating the GPU on the same die as the CPU (already done in laptops, tablets, and smartphones)
Advantage: memory shared with the CPU, so no need to transfer data
Disadvantage: reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system, i.e. graphics DRAM (GDDR) versus regular DRAM (DDR3)
Will the GP-GPU survive, or will improvements in CPU data-level parallelism make it redundant?
On the same die, the CPU and GPU would have the same memory bandwidth, though the GPU might have more FLOPS, as these are needed for graphics anyway

23 Another HW Issue: Memory Model for Multi-core
The memory model is implicit, and hard to handle

24 Symmetric Multiprocessors
Figure: processors and memory on a shared CPU-memory bus, with a bridge to an I/O bus carrying the I/O controller, graphics output, and networks
Symmetric: all memory is equally far away from all processors, and any processor can do any I/O (e.g. set up a DMA transfer)

25 Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system)
Producer-consumer: a consumer process must wait until the producer process has produced data
Mutual exclusion: ensure that only one process uses a resource at a given time
Figure: a producer feeding a consumer, and two processes P1 and P2 contending for a shared resource

26 A Producer-Consumer Example
The producer appends items at the tail of a shared queue; the consumer removes them from the head.

    Producer posting item x:
        Load  Rtail, (tail)
        Store (Rtail), x          ; write the item into the queue
        Rtail = Rtail + 1
        Store (tail), Rtail       ; publish the new tail

    Consumer:
        Load  Rhead, (head)
    spin:
        Load  Rtail, (tail)
        if Rhead == Rtail goto spin   ; wait until the queue is non-empty
        Load  R, (Rhead)              ; read the item
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)

The program is written assuming instructions are executed in order. Problems?
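
One way to see the problem, and one fix, is to make the ordering explicit. On a machine that reorders memory operations, the consumer can observe the new tail before the producer's store of the item itself, and read stale data. Below is a minimal single-producer, single-consumer sketch using C++11 atomics; the queue layout and names are hypothetical, not from the slide:

    #include <atomic>

    constexpr int QSIZE = 1024;
    int queue[QSIZE];
    std::atomic<int> head{0}, tail{0};

    void produce(int x) {                              // Producer posting item x
        int t = tail.load(std::memory_order_relaxed);
        queue[t % QSIZE] = x;                          // Store (Rtail), x
        tail.store(t + 1, std::memory_order_release);  // publish: the item's store
    }                                                  // cannot drift after this

    bool consume(int *out) {                           // Consumer
        int h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) // spin test: sees the new tail
            return false;                              // only after the item is visible
        *out = queue[h % QSIZE];                       // Load R, (Rhead)
        head.store(h + 1, std::memory_order_relaxed);
        return true;
    }

The release/acquire pair supplies exactly the ordering the slide's plain loads and stores silently assume.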

27 Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
Uniprocessor cache miss traffic
Traffic caused by communication, which results in invalidations and subsequent cache misses
This adds a 4th C, the coherence miss (sometimes called a communication miss), joining the compulsory, capacity, and conflict misses

28 Coherency一致性Misses True sharing misses arise from the communication of data through the cache coherence mechanism Invalidates due to 1st write to shared block Reads by another CPU of modified block in different cache Miss would still occur if block size were 1 word False sharing misses when a block is invalidated because some word in the block, other than the one being read, is written into Invalidation does not cause a new value to be communicated, but only causes an extra cache miss Block is shared, but no word in block is actually shared  miss would not occur if block size were 1 word 40

29 Homework
Readings: read the book; read Parallel Processors from Client to Cloud.pdf; read Appendix B (TH-2 HPC) in Computer Organization and Design (COD), Fifth Edition
Homework: HW4 due; HW5; Project 2

30 Acknowledgements
These slides contain material from the following courses: UCB CS152, Stanford EE108B, and MIT 6.823

