
1 Computer Organization & Design
Weidong Wang (王维东)
wdwang@zju.edu.cn
College of Information Science & Electronic Engineering
Institute of Information and Communication Engineering
Zhejiang University

2 Course Information
Instructor: Weidong WANG
– Email: wdwang@zju.edu.cn
– Tel (O): 0571-87953170
– Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306, or email whenever
TA (mobile, email):
» Lu Huang 黄露, 13516719473/6719473; eliver8801@zju.edu.cn
» Hanqi Shen 沈翰祺, 15067115046; 542886864@qq.com
» Office Hours: Wednesday & Saturday, 14:00-16:30
» Xindian (High-Tech) Building 308 (you can also reach us by SMS or email)

3 Lecture 13: Introduction to Multi-core Processors

4 Motivation: Single-Processor Performance Scaling

5 Multi-core Chips (aka Chip Multiprocessors, or CMPs)

6 Sample of Multi-core Options

7 (figure-only slide)

8 And There is Much More…
(figure annotations: heterogeneous; clusters; sockets)

9 Vector Supercomputers
Epitomized by the Cray-1, 1976:
Scalar unit
– Load/store architecture
Vector extension
– Vector registers
– Vector instructions
Implementation
– Hardwired control
– Highly pipelined functional units
– Interleaved memory system
– No data caches
– No virtual memory

10 Vector Programming Model
Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] … [VLRMAX-1]. A vector length register (VLR) sets how many elements each vector instruction processes.
Vector arithmetic instructions, e.g. ADDV v3, v1, v2, operate elementwise on elements [0] … [VLR-1] of their operand registers.
Vector load and store instructions, e.g. LV v1, r1, r2, move a vector between memory and a vector register, with the base address in r1 and the stride in r2.
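A minimal C sketch of the loop such vector instructions implement: elementwise addition of two arrays. A compiler for a vector machine would strip-mine this loop, covering up to VLRMAX elements per LV/LV/ADDV/SV group (the function is illustrative, not from the slides):

    /* C equivalent of ADDV v3, v1, v2 applied across a whole array. */
    void vadd(double *c, const double *a, const double *b, long n) {
        for (long i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* one ADDV covers VLR elements at a time */
    }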

11 Multimedia Extensions (aka SIMD extensions)
Very short vectors added to existing ISAs for microprocessors.
Use existing 64-bit registers split into 2x32b, 4x16b, or 8x8b lanes.
– This concept was first used on the Lincoln Labs TX-2 computer in 1957, with a 36b datapath split into 2x18b or 4x9b.
– Newer designs have 128-bit registers (PowerPC AltiVec, Intel SSE2/3/4).
A single instruction operates on all elements within the register.
(figure: a 64-bit register treated as four 16-bit lanes, performing 4x16b adds in parallel)
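As a concrete, hedged example of such an extension on x86, the SSE2 intrinsic _mm_add_epi16 performs eight 16-bit adds with one instruction on a 128-bit register; the slide's 4x16b case is the same idea on the older 64-bit MMX registers:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Eight lane-wise 16-bit additions in a single instruction (PADDW). */
    __m128i add_8x16(__m128i a, __m128i b) {
        return _mm_add_epi16(a, b);
    }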

12 Supercomputers
Definition of a supercomputer:
– Fastest machine in the world at a given task
– A device to turn a compute-bound problem into an I/O-bound problem
– Any machine costing $30M+
– Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

13 CDC 6600 (Seymour Cray, 1963)
A fast pipelined machine with 60-bit words
– 128 Kword main memory capacity, 32 banks
Ten functional units (parallel, unpipelined)
– Floating point: adder, 2 multipliers, divider
– Integer: adder, 2 incrementers, ...
Hardwired control (no microcoding)
Scoreboard for dynamic scheduling of instructions
Ten peripheral processors for input/output
– a fast multi-threaded 12-bit integer ALU
Very fast clock, 10 MHz (FP add in 4 clocks)
>400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based cooling technology
Fastest machine in the world for 5 years (until the 7600)
– over 100 sold ($7-10M each)

14 CDC6600: Vector Addition

          B0 ← -n
    loop: JZE B0, exit
          A0 ← B0 + a0   ; load X0
          A1 ← B0 + b0   ; load X1
          X6 ← X0 + X1
          A6 ← B0 + c0   ; store X6
          B0 ← B0 + 1
          jump loop

Ai = address register; Bi = index register; Xi = data register

15 Supercomputer Applications
Typical application areas:
– Military research (nuclear weapons, cryptography)
– Scientific research
– Weather forecasting
– Oil exploration
– Industrial design (car crash simulation)
– Bioinformatics
– Cryptography
All involve huge computations on large data sets.
In the 70s-80s: supercomputer = vector machine.

16 BlueGene/Q Compute Chip
360 mm² Cu-45 technology (SOI), ~1.47 B transistors
16 user + 1 service processors
– plus 1 redundant processor
– all processors are symmetric
– each 4-way multi-threaded
– 64-bit PowerISA™
– 1.6 GHz
– L1 I/D cache = 16kB/16kB
– L1 prefetch engines
– each processor has a quad FPU (4-wide double precision, SIMD)
– peak performance 204.8 GFLOPS @ 55 W
Central shared L2 cache: 32 MB
– eDRAM
– multiversioned cache; will support transactional memory and speculative execution
– supports atomic ops
Dual memory controller
– 16 GB external DDR3 memory
– 1.33 Gb/s
– 2 × 16-byte-wide interface (+ECC)
Chip-to-chip networking
– router logic integrated into the BQC chip
External IO
– PCIe Gen2 interface
System-on-a-Chip design: integrates processors, memory, and networking logic into a single chip

17 Blue Gene/Q packaging hierarchy (Ref: SC2010)
1. Chip: 16 cores
2. Module: single chip
3. Compute Card: one single-chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: 20 PF/s

18 Graphics Processing Units (GPUs)
Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s), including high-performance floating-point units
– Provided workstation-like graphics for PCs
– The user could configure the graphics pipeline, but not really program it
Over time, more programmability was added (2001-2005)
– E.g., the new language Cg for writing small programs run on each vertex or each pixel; also Windows DirectX variants
– Massively parallel (millions of vertices or pixels per frame) but a very constrained programming model
Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations
– An incredibly difficult programming model, as it had to use the graphics-pipeline model for general computation
(Vertex shader: colors are computed only at a triangle's three vertices; the remaining pixels inside the triangle are obtained by interpolation. Pixel shader: the color of every pixel is computed.)

19 General-Purpose GPUs (GP-GPUs)
In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA
– "Compute Unified Device Architecture"
– Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas
Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution
This lecture uses a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics
– Would probably need another course to describe graphics processing

20 "Single Instruction, Multiple Thread"
GPUs use a SIMT model, where the individual scalar instruction streams of many CUDA threads are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp)
(figure: microthreads µT0-µT7 execute the scalar stream ld x; mul a; ld y; add; st y in lockstep, i.e., SIMD execution across a warp)
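A minimal CUDA sketch of a kernel whose per-thread instruction stream matches the slide's ld x; mul a; ld y; add; st y example (computing y[i] = a*x[i] + y[i]); names and launch configuration are illustrative:

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one CUDA thread per element */
        if (i < n)
            y[i] = a * x[i] + y[i];   /* ld x, mul a, ld y, add, st y */
    }

    /* Launch example: the hardware groups threads 32-to-a-warp for SIMD execution:
       saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);                        */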

21 Nvidia Fermi GF100 GPU [Nvidia, 2010]

22 GPU Future
High-end desktops have a separate GPU chip, but the trend is towards integrating the GPU on the same die as the CPU (already done in laptops, tablets, and smartphones)
– Advantage: shared memory with the CPU, no need to transfer data
– Disadvantage: reduced memory bandwidth compared to a dedicated, smaller-capacity specialized memory system
» Graphics DRAM (GDDR) versus regular DRAM (DDR3)
Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant?
– On the same die, CPU and GPU should have the same memory bandwidth
– GPU might still have more FLOPS, as needed for graphics anyway

23 Another HW Issue: Memory Model for Multi-core
(figure annotations: implicit; hard to handle)

24 Symmetric Multiprocessors
All memory is equally far away from all processors ("symmetric").
Any processor can do any I/O (set up a DMA transfer).
(figure: processors on a shared CPU-memory bus with memory, a bus bridge to an I/O bus, I/O controllers, graphics output, and networks)

25 Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).
Producer-consumer: a consumer process must wait until the producer process has produced data.
Mutual exclusion: ensure that only one process uses a resource at a given time.
(figure: processes P1 and P2 sharing a resource; producer feeding a consumer)
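A minimal sketch of mutual exclusion using a POSIX mutex (the slide names the concept but not an API; the counter is an illustrative shared resource):

    #include <pthread.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    long shared_counter;   /* the shared resource */

    void use_resource(void) {
        pthread_mutex_lock(&lock);    /* only one thread past this point at a time */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }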

26 A Producer-Consumer Example
The program is written assuming instructions are executed in order.

Producer posting item x:
      Load  Rtail, (tail)
      Store (Rtail), x
      Rtail = Rtail + 1
      Store (tail), Rtail

Consumer:
      Load  Rhead, (head)
spin: Load  Rtail, (tail)
      if Rhead == Rtail goto spin
      Load  R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
      process(R)

Problems?
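The "Problems?" prompt points at instruction reordering: if the hardware reorders the producer's two stores, the consumer can observe the new tail before the data is in the buffer. A hedged C11 sketch of the same single-producer/single-consumer queue, using release/acquire ordering to restore correctness on out-of-order machines (queue size and names are illustrative):

    #include <stdatomic.h>

    #define QSIZE 256
    int buf[QSIZE];
    atomic_int head, tail;

    void produce(int x) {                        /* producer posting item x */
        int t = atomic_load_explicit(&tail, memory_order_relaxed);
        buf[t % QSIZE] = x;                      /* Store (Rtail), x */
        atomic_store_explicit(&tail, t + 1,      /* Store (tail), Rtail */
                              memory_order_release); /* data made visible before tail */
    }

    int consume(void) {
        int h = atomic_load_explicit(&head, memory_order_relaxed);
        while (atomic_load_explicit(&tail, memory_order_acquire) == h)
            ;                                    /* spin: wait for producer */
        int r = buf[h % QSIZE];                  /* Load R, (Rhead) */
        atomic_store_explicit(&head, h + 1, memory_order_relaxed);
        return r;                                /* caller does process(R) */
    }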

27 Cache Coherence Problem: Example

28 Hardware Cache Coherence Using Snooping

29 Quick Question

30 Snoopy Cache (Goodman 1983)
Idea: have the cache watch (or snoop upon) DMA transfers, and then "do the right thing".
Snoopy cache tags are dual-ported:
– one address/R-W port drives the memory bus when the cache is bus master
– a second snoopy read port is attached to the memory bus
(figure: processor and cache with dual-ported tags-and-state and data lines, connected to the memory bus)

31 Snoopy Cache Actions for DMA

Observed bus cycle   Cache state          Cache action
DMA Read             Address not cached   No action
(Memory → Disk)      Cached, unmodified   No action
                     Cached, modified     Cache intervenes
DMA Write            Address not cached   No action
(Disk → Memory)      Cached, unmodified   Cache purges its copy
                     Cached, modified     ???

32 MSI: A Simple Coherence Protocol for Write-Back Caches

33 Cache State Transition Diagram: the MSI Protocol
Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag.
Transitions for the line in processor P1's cache:
– I → M: write miss (P1 gets line from memory)
– I → S: read miss (P1 gets line from memory)
– S → M: P1 intent to write
– M → S: other processor reads (P1 writes back)
– M → I: other processor intent to write (P1 writes back)
– S → I: other processor intent to write
– M loops on P1 reads or writes; S loops on reads by any processor
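The same transitions written as a next-state function in C (event names follow the diagram; this is an illustrative sketch, not code from the course):

    typedef enum { INVALID, SHARED, MODIFIED } MsiState;
    typedef enum { PROC_READ, PROC_WRITE,  /* this processor reads / intends to write */
                   BUS_READ,  BUS_WRITE    /* other processor's read / intent-to-write seen on the bus */
    } MsiEvent;

    /* Returns the next state for one cache line; sets *writeback
       when the line must be written back to memory. */
    MsiState msi_next(MsiState s, MsiEvent e, int *writeback) {
        *writeback = 0;
        switch (s) {
        case INVALID:
            if (e == PROC_WRITE) return MODIFIED;  /* write miss */
            if (e == PROC_READ)  return SHARED;    /* read miss  */
            return INVALID;
        case SHARED:
            if (e == PROC_WRITE) return MODIFIED;  /* intent to write */
            if (e == BUS_WRITE)  return INVALID;   /* other CPU's intent to write */
            return SHARED;
        case MODIFIED:
            if (e == BUS_READ)  { *writeback = 1; return SHARED;  }
            if (e == BUS_WRITE) { *writeback = 1; return INVALID; }
            return MODIFIED;                        /* local reads/writes hit */
        }
        return INVALID;  /* unreachable */
    }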

34 MSI Example with 2 Cores

35 Two-Processor Example (reading and writing the same cache line)
Each processor has its own copy of the MSI diagram; the "other processor" events in P1's diagram are generated by P2's bus requests, and vice versa:
– P1's line: I → M on a write miss; I → S on a read miss; S → M on P1's intent to write; M → S when P2 reads (P1 writes back); M → I and S → I on P2's intent to write; M loops on P1 reads or writes.
– P2's line: the same transitions with P1 and P2 swapped (e.g., M → S when P1 reads, P2 writes back).
Example access sequence to trace: P1 reads, P1 writes; P2 reads, P2 writes; P1 writes; P2 writes; P1 reads, P1 writes.

36 Observation
If a line is in the M state, then no other cache can have a copy of the line!
– Memory stays coherent: multiple differing copies cannot exist.
(The rest of the slide repeats the MSI transition diagram from slide 33.)

37 Quick Questions

38 But there is much more

39 Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
1. Uniprocessor cache-miss traffic
2. Traffic caused by communication
– results in invalidations and subsequent cache misses
This adds a 4th C: the coherence miss
– joins compulsory, capacity, and conflict misses
– (sometimes called a communication miss)

40 Coherence Misses
1. True sharing misses arise from the communication of data through the cache-coherence mechanism:
– invalidates due to the first write to a shared block
– reads by another CPU of a block modified in a different cache
– the miss would still occur if the block size were one word
2. False sharing misses arise when a block is invalidated because some word in the block, other than the one being read, is written to:
– the invalidation does not communicate a new value; it only causes an extra cache miss
– the block is shared, but no word in it is actually shared, so the miss would not occur if the block size were one word
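A small C/pthreads sketch of the false-sharing case (64-byte cache lines assumed; sizes and names are illustrative): without the padding, the two counters share one cache line, so each thread's writes invalidate the other core's copy even though no word is actually shared. The padding gives each counter its own line and removes the coherence misses:

    #include <pthread.h>

    /* One counter per thread; pad so each sits in its own 64-byte line. */
    struct padded { long v; char pad[64 - sizeof(long)]; } counters[2];

    void *bump(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 100000000L; i++)
            counters[id].v++;   /* each thread writes only its own counter */
        return 0;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, 0, bump, (void *)0L);
        pthread_create(&t1, 0, bump, (void *)1L);
        pthread_join(t0, 0);
        pthread_join(t1, 0);
        return 0;
    }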

41 Homework
Readings:
– Read the book;
– Read "Parallel Processors from Client to Cloud.pdf";
HW10 (-5th)
– Project 2
– 6.3
– Reading Appendix B: TH-2 HPC in Computer Organization and Design (COD)
» (Fifth Edition)

42 Acknowledgements
These slides contain material from the courses:
– UCB CS152
– Stanford EE108B
– Also MIT course 6.823

