Introduction to Many-Core Architectures Henk Corporaal ASCI Winterschool on Embedded Systems Soesterberg, March 2010.


1 Introduction to Many-Core Architectures Henk Corporaal ASCI Winterschool on Embedded Systems Soesterberg, March 2010

2 Intel Trends (K. Olukotun): Core i7, 3 GHz, 100 W

3 ASCI Winterschool 2010Henk Corporaal(3) System-level integration (Chuck Moore, AMD at MICRO 2008) Single-chip CPU Era: 1986 –2004 Extreme focus on single-threaded performance Multi-issue, out-of-order execution plus moderate cache hierarchy Chip Multiprocessor (CMP) Era: 2004 –2010 Early: Hasty integration of multiple cores into same chip/package Mid-life: Address some of the HW scalability and interference issues Current: Homogeneous CPUs plus moderate system-level functionality System-level Integration Era: ~2010 onward Integration of substantial system-level functionality Heterogeneous processors and accelerators Introspective control systems for managing on-chip resources & events

4 Why many core?
Running into the frequency wall, ILP wall, memory wall, energy wall
Chip area enabler: Moore's law goes well below 22 nm – what to do with all this area? Multiple processors fit easily on a single die
Application demands
Cost effective (just connect existing processors or processor cores)
Low power: parallelism may allow lowering Vdd
Performance/Watt is the new metric!

5 Low power through parallelism
Sequential processor: switching capacitance C, frequency f, voltage V → P1 = f·C·V²
Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V → P2 = (f/2)·2C·V'² = f·C·V'² < P1
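The slide's arithmetic can be checked with a few lines of C (function name and the concrete numbers below are mine, purely illustrative):

```c
#include <assert.h>

/* Dynamic power: P = f * C * V^2 */
static double dyn_power(double f, double c, double v) {
    return f * c * v * v;
}

/* Sequential: frequency f, capacitance C, voltage V.
 * Parallel: two units give 2C total switching capacitance, each
 * running at f/2; the relaxed timing lets Vdd drop below V, so
 * dyn_power(f/2, 2*C, V') < dyn_power(f, C, V) whenever V' < V. */
```

With unit values, dyn_power(1, 1, 1) = 1 for the sequential machine, while the two-unit version at V' = 0.8 burns only 0.64 of that at the same throughput.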

6 How low can Vdd go? Subthreshold JPEG encoder engine, Vdd 0.4 – 1.2 Volt

7 Computational efficiency: how many MOPS/Watt? (Yifan He et al., DAC 2010)

8 Computational efficiency: what do we need? 3G wireless, mobile HD video (Woh et al., ISCA 2009)

9 ASCI Winterschool 2010Henk Corporaal(9) Intel's opinion: 48-core x86

10 ASCI Winterschool 2010Henk Corporaal(10) Outline Classifications of Parallel Architectures Examples Various (research) architectures GPUs Cell Intel multi-cores How much performance do you really get? Roofline model Trends & Conclusions

11 Classifications
Performance / parallelism driven: 4-5 D; Flynn
Communication & memory: message passing / shared memory
Shared memory issues: coherency, consistency, synchronization
Interconnect

12 Flynn's Taxonomy
SISD (Single Instruction, Single Data): uniprocessors
SIMD (Single Instruction, Multiple Data): vector architectures also belong to this class; multimedia extensions (MMX, SSE, VIS, AltiVec, …); examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, …
MISD (Multiple Instruction, Single Data): systolic arrays / stream-based processing
MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin; flexible; most widely used

13 Flynn's Taxonomy

14 Enhance performance: 4 architecture methods
(Super)-pipelining
Powerful instructions: MD-technique (multiple data operands per operation), MO-technique (multiple operations per instruction)
Multiple instruction issue – single stream: superscalar; multiple streams: single core with multiple threads (Simultaneous Multi-Threading), or multiple cores

15 Architecture methods: Pipelined Execution of Instructions
Purpose of pipelining: reduce #gate_levels in the critical path; reduce CPI close to one (instead of a large number for the multicycle machine); more efficient hardware
Problems – hazards cause pipeline stalls: structural hazards (add more hardware), control hazards / branch penalties (use branch prediction), data hazards (bypassing required)
Simple 5-stage pipeline: IF (Instruction Fetch), DC (Instruction Decode), RF (Register Fetch), EX (Execute instruction), WB (Write Result Register)
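The "CPI close to one" claim follows from simple counting; a minimal sketch (ideal pipeline, no hazards; helper names mine):

```c
#include <assert.h>

/* Ideal k-stage pipeline: the first instruction needs k cycles to
 * flow through; every following instruction completes one cycle
 * later, so total cycles = stages + (n_instr - 1). */
static int pipeline_cycles(int stages, int n_instr) {
    return stages + (n_instr - 1);
}

/* Effective CPI = cycles / instructions; it approaches 1 as the
 * instruction count grows. */
static double pipeline_cpi(int stages, int n_instr) {
    return (double)pipeline_cycles(stages, n_instr) / n_instr;
}
```

For the slide's 5-stage pipeline and 4 instructions this gives 8 cycles; over 1000 instructions the CPI is already 1.004.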

16 Architecture methods: Pipelined Execution of Instructions
Superpipelining: split one or more of the critical pipeline stages
Superpipelining degree S: S(architecture) = Σ_{Op ∈ I_set} f(Op) × lt(Op), where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op

17 Architecture methods: Powerful Instructions (1)
MD-technique: multiple data operands per operation
SIMD: Single Instruction Multiple Data
Vector instruction: for (i=0; i<64; i++) c[i] = a[i] + 5*b[i]; or c = a + 5*b
Assembly:
set vl,64
ldv v1,0(r2)
mulvi v2,v1,5
ldv v1,0(r1)
addv v3,v1,v2
stv v3,0(r3)

18 Architecture methods: Powerful Instructions (1)
SIMD computing: all PEs (Processing Elements) execute the same operation at the same time
Typical mesh or hypercube connectivity
Exploits data locality of e.g. image processing applications
Dense encoding (few instruction bits needed)

19 Architecture methods: Powerful Instructions (1)
Sub-word parallelism: SIMD on a restricted scale, used for multimedia instructions
Examples: MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, TriMedia II
Example: Σ_{i=1..4} |a_i − b_i|

20 Architecture methods: Powerful Instructions (2)
MO-technique: multiple operations per instruction
Two options: CISC (Complex Instruction Set Computer) and VLIW (Very Long Instruction Word)
VLIW instruction example (one field per FU):
FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13

21 VLIW architecture: central Register File
Exec units 1–9, grouped under issue slots 1–3, all share one central register file
Q: How many ports does the register file need for n-issue?

22 Architecture methods: multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e. which instructions can be executed in parallel?
User: specifies multiple instruction streams → multi-processor: MIMD (Multiple Instruction, Multiple Data)
HW: run-time detection of ready instructions → superscalar
Compiler: compile into a dataflow representation → dataflow processors

23 Four-dimensional representation of the architecture design space
Axes: instructions/cycle I, superpipelining degree S, operations/instruction O, data/operation D
(Figure places CISC, RISC, superpipelined, superscalar, VLIW, dataflow, MIMD, vector, and SIMD in this space)

24 Architecture design space
Example values of I, O, D, S for different architectures: CISC, RISC, VLIW, superscalar, SIMD, MIMD, GPU, Top500 Jaguar ???
Mpar = I × O × D × S
You should exploit this amount of parallelism!
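Mpar is just a product; a sketch in C (the example values – a 4-issue superscalar and a 128-PE SIMD – are illustrative assumptions, not the slide's omitted table entries):

```c
#include <assert.h>

/* Mpar = I * O * D * S: instructions/cycle x operations/instruction
 * x data/operation x superpipelining degree. */
static double mpar(double i, double o, double d, double s) {
    return i * o * d * s;
}

/* e.g. 4-issue superscalar: mpar(4,1,1,1) = 4
 *      128-PE SIMD:         mpar(1,1,128,1) = 128 */
```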

25 Communication
Parallel architecture extends traditional computer architecture with a communication network: abstractions (HW/SW interface) and an organizational structure to realize the abstraction efficiently
(Figure: processing nodes connected by a communication network)

26 Communication models: Shared Memory
Coherence problem, memory consistency issue, synchronization problem
(Figure: processes P1 and P2 read/write a shared memory)

27 Communication models: Shared memory
Shared address space; communication primitives: load, store, atomic swap
Two varieties: physically shared => Symmetric Multi-Processors (SMP), usually combined with local caching; physically distributed => Distributed Shared Memory (DSM)

28 SMP: Symmetric Multi-Processor
Memory: centralized with uniform access time (UMA), bus interconnect, I/O
Examples: Sun Enterprise 6000, SGI Challenge, Intel
Each processor has one or more cache levels; the interconnect can be 1 bus, N busses, or any network

29 DSM: Distributed Shared Memory
Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
(Figure: processor + cache + memory nodes attached to an interconnection network)

30 ASCI Winterschool 2010Henk Corporaal(30) Shared Address Model Summary Each processor can name every physical location in the machine Each process can name all data it shares with other processes Data transfer via load and store Data size: byte, word,... or cache blocks Memory hierarchy model applies: communication moves data to local proc. cache

31 ASCI Winterschool 2010Henk Corporaal(31) Three fundamental issues for shared memory multiprocessors Coherence, about: Do I see the most recent data? Consistency, about: When do I see a written value? e.g. do different processors see writes at the same time (w.r.t. other memory accesses)? Synchronization How to synchronize processes? how to protect access to shared data?

32 Communication models: Message Passing
Communication primitives: e.g. send, receive library calls
Standard MPI: Message Passing Interface
Note that MP can be built on top of SM, and vice versa!
(Figure: process P1 sends to process P2 through a FIFO)

33 Message Passing Model
Explicit message send and receive operations
Send specifies local buffer + receiving process on remote computer
Receive specifies sending process on remote computer + local buffer to place data
Typically blocking communication, but may use DMA
Message structure: Header | Data | Trailer

34 Message passing communication
Nodes (processor + cache + memory + DMA) attach to the interconnection network through network interfaces

35 Communication Models: Comparison
Shared memory: compatibility with well-understood language mechanisms; ease of programming for complex or dynamic communication patterns; sharing of large data structures; efficient for small items; supports hardware caching
Message passing: simpler hardware; explicit communication; implicit synchronization (with any communication)

36 Interconnect
How to connect your cores? Some options:
Connect everybody: single bus, hierarchical bus, or a NoC (multi-hop via routers; any topology possible; easy 2D layout helps)
Connect with e.g. neighbors only: e.g. using shift operations in SIMD, or using dual-ported memories to connect 2 cores

37 Bus (shared) or Network (switched)
Network: claimed to be more scalable – no bus arbitration, point-to-point connections – but router overhead
Example: NoC with a 2x4 mesh routing network (each node attached to a router R)

38 ASCI Winterschool 2010Henk Corporaal(38) Historical Perspective Early machines were: Collection of microprocessors. Communication was performed using bi-directional queues between nearest neighbors. Messages were forwarded by processors on path Store and forward networking There was a strong emphasis on topology in algorithms, in order to minimize the number of hops => minimize time

39 ASCI Winterschool 2010Henk Corporaal(39) Design Characteristics of a Network Topology (how things are connected): Crossbar, ring, 2-D and 3-D meshes or torus, hypercube, tree, butterfly, perfect shuffle,.... Routing algorithm (path used): Example in 2D torus: all east-west then all north-south (avoids deadlock) Switching strategy: Circuit switching: full path reserved for entire message, like the telephone. Packet switching: message broken into separately-routed packets, like the post office. Flow control and buffering (what if there is congestion): Stall, store data temporarily in buffers re-route data to other nodes tell source node to temporarily halt, discard, etc. QoS guarantees, Error handling, …., etc, etc.

40 ASCI Winterschool 2010Henk Corporaal(40) Switch / Network Topology Topology determines: Degree: number of links from a node Diameter: max number of links crossed between nodes Average distance: number of links to random destination Bisection: minimum number of links that separate the network into two halves Bisection bandwidth = link bandwidth * bisection

41 Bisection Bandwidth
Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves – the bandwidth across the narrowest part of the network
(Figure panels: a bisection cut vs. a cut that is not a bisection; bisection bw = link bw; bisection bw = sqrt(n) × link bw)
Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
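The figure's two formulas can be written as small helpers (names mine; the 2-D mesh takes its side length, so the node count is n = side × side and the bisecting cut crosses `side` links):

```c
#include <assert.h>

/* 2-D mesh of side x side nodes: the narrowest bisecting cut
 * crosses `side` = sqrt(n) links. */
static double mesh2d_bisection_bw(int side, double link_bw) {
    return side * link_bw;
}

/* A ring of any size is cut into two halves by exactly 2 links. */
static double ring_bisection_bw(double link_bw) {
    return 2.0 * link_bw;
}
```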

42 Common Topologies (N = number of nodes, n = dimension)

Type       | Degree  | Diameter       | Ave Dist      | Bisection
1D mesh    | 2       | N-1            | N/3           | 1
2D mesh    | 4       | 2(N^1/2 - 1)   | 2 N^1/2 / 3   | N^1/2
3D mesh    | 6       | 3(N^1/3 - 1)   | 3 N^1/3 / 3   | N^2/3
nD mesh    | 2n      | n(N^1/n - 1)   | n N^1/n / 3   | N^((n-1)/n)
Ring       | 2       | N/2            | N/4           | 2
2D torus   | 4       | N^1/2          | N^1/2 / 2     | 2 N^1/2
Hypercube  | log2 N  | n = log2 N     | n/2           | N/2
2D tree    | 3       | 2 log2 N       | ~2 log2 N     | 1
Crossbar   | N-1     | 1              | 1             | N^2/2

43 Topologies in Real High End Machines (newest first)
Red Storm (Opteron + Cray network, future): 3D mesh
Blue Gene/L: 3D torus
SGI Altix: fat tree
Cray X1: 4D hypercube (approx.)
Myricom (Millennium): arbitrary
Quadrics (in HP Alpha server clusters): fat tree
IBM SP: fat tree (approx.)
SGI Origin: hypercube
Intel Paragon: 2D mesh
BBN Butterfly: butterfly

44 ASCI Winterschool 2010Henk Corporaal(44) Network: Performance metrics Network Bandwidth Need high bandwidth in communication How does it scale with number of nodes? Communication Latency Affects performance, since processor may have to wait Affects ease of programming, since it requires more thought to overlap communication and computation How can a mechanism help hide latency? overlap message send with computation, prefetch data, switch to other task or thread

45 Examples of many-core / PE architectures
SIMD: Xetal (320 PEs), IMAP (128 PEs), AnySP (Michigan Univ.)
VLIW: Itanium, TRIPS / EDGE, ADRES
Multi-threaded (idea: hide long latencies): Denelcor HEP (1982), SUN Niagara (2005)
Multi-processor: RaW, PicoChip, Intel/AMD, GRID, farms, …
Hybrid: e.g. Imagine, GPUs, XC-Core – actually, most are hybrid!

46 ASCI Winterschool 2010Henk Corporaal(46) IMAP from NEC NEC IMAP SIMD 128 PEs Supports indirect addressing e.g. LD r1, (r2) Each PE 5-issue VLIW

47 TRIPS (Univ. of Texas at Austin / IBM): a statically mapped dataflow architecture
R: register file, E: execution unit, D: data cache, I: instruction cache, G: global control

48 ASCI Winterschool 2010Henk Corporaal(48) Compiling for TRIPS 1. Form hyperblocks (use unrolling, predication, inlining to enlarge scope) 2. Spatial map operations of each hyperblock registers are accessed at hyperblock boundaries 3. Schedule hyperblocks

49 Multithreaded Categories
(Figure: issue slots over processor cycles; colors denote threads 1–5, white is an idle slot)
Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading (Intel calls this 'Hyperthreading')

50 ASCI Winterschool 2010Henk Corporaal(50) SUN Niagara processing element 4 threads per processor 4 copies of PC logic, Instr. buffer, Store buffer, Register file

51 ASCI Winterschool 2010Henk Corporaal(51) Really BIG: Jaguar-Cray XT5-HE Oak Ridge Nat Lab 224,256 AMD Opteron cores 2.33 PetaFlop peak perf. 299 Tbyte main memory 10 Petabyte disk 478GB/s mem bandwidth 6.9 MegaWatt 3D torus TOP 500 #1 (Nov 2009)

52 ASCI Winterschool 2010Henk Corporaal(52) Graphic Processing Units (GPUs) NVIDIA GT 340 (2010) ATI 5970 (2009)

53 ASCI Winterschool 2010Henk Corporaal(53) Why GPUs

54 ASCI Winterschool 2010Henk Corporaal(54) In Need of TeraFlops? 3 * GTX PEs 5.3 TeraFlop

55 How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
J. Roca, et al., "Workload Characterization of 3D Games", IISWC 2006
T. Mitra, et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999
Die photo of GeForce GTX 280 (source: NVIDIA)

56 ASCI Winterschool 2010Henk Corporaal(56) How Do CPUs Spend Their Die Area? CPUs are designed for low latency instead of high throughput Die photo of Intel Penryn (source: Intel)

57 ASCI Winterschool 2010Henk Corporaal(57) GPU: Graphics Processing Unit The Utah teapot: From polygon mesh to image pixel.

58 The Graphics Pipeline
K. Fatahalian, et al., "GPUs: a Closer Look", ACM Queue 2008


62 GPUs: what's inside?
Basically SIMD: a single instruction stream operates on multiple data streams; all PEs execute the same instruction at the same time; PEs operate concurrently on their own piece of memory
However, a GPU is far more complex!

63 CPU Programming: NVIDIA CUDA example
Single thread program:
float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}
CUDA program:
float A[4][8];
kernelF<<<4, 8>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
The CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP). Hardware converts TLP into DLP at run time.
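What the kernel computes can be emulated on the host in plain C, which makes the TLP-to-DLP mapping explicit: each (blockIdx.x, threadIdx.x) pair of the 4×8 launch touches exactly one element (function name mine):

```c
#include <assert.h>

/* Host-side emulation of the kernelF<<<4,8>>> launch: blockIdx.x
 * selects the row, threadIdx.x the column; each "thread" increments
 * its own element. */
static void kernelF_host(float A[4][8]) {
    for (int block = 0; block < 4; block++)        /* blockIdx.x  */
        for (int thread = 0; thread < 8; thread++) /* threadIdx.x */
            A[block][thread]++;
}
```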

64 System Architecture
Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008

65 NVIDIA Tesla Architecture (G80)
Erik Lindholm, et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008

66 ASCI Winterschool 2010Henk Corporaal(66) Texture Processor Cluster (TPC)

67 ASCI Winterschool 2010Henk Corporaal(67) Deeply pipelined SM for high throughput One instruction executed by a warp of 32 threads One warp is executed on 8 PEs over 4 shader cycles Let's start with a simple example: execution of 1 instruction

68 ASCI Winterschool 2010Henk Corporaal(68) Issue an Instruction for 32 Threads

69 ASCI Winterschool 2010Henk Corporaal(69) Read Source Operands of 32 Threads

70 ASCI Winterschool 2010Henk Corporaal(70) Buffer Source Operands to Op Collector

71 ASCI Winterschool 2010Henk Corporaal(71) Execute Threads 0~7

72 ASCI Winterschool 2010Henk Corporaal(72) Execute Threads 8~15

73 ASCI Winterschool 2010Henk Corporaal(73) Execute Threads 16~23

74 ASCI Winterschool 2010Henk Corporaal(74) Execute Threads 24~31

75 ASCI Winterschool 2010Henk Corporaal(75) Write Back from Result Queue to Reg

76 ASCI Winterschool 2010Henk Corporaal(76) Warp: Basic Scheduling Unit in Hardware One warp consists of 32 consecutive threads Warps are transparent to programmer, formed at run time

77 ASCI Winterschool 2010Henk Corporaal(77) Warp Scheduling Schedule at most 24 warps in an interleaved manner Zero overhead for interleaved issue of warps

78 Handling Branch
Threads within a warp are free to branch.
if( $r17 > $r19 ){ $r16 = $r20 + $r31 } else { $r16 = $r21 - $r32 } $r18 = $r15 + $r16
The assembly code on the right is disassembled from CUDA binary (cubin) using "decuda".

79 ASCI Winterschool 2010Henk Corporaal(79) Branch Divergence within a Warp If threads within a warp diverge, both paths have to be executed. Masks are set to filter out threads not executing on current path.
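The masking scheme can be sketched in C – a simulation of the mechanism, not actual GPU code (names mine). The point the slide makes falls out directly: a diverged warp sweeps all 32 lanes once per path.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32

/* Divergence sketch: evaluate the predicate for every lane first,
 * then run BOTH paths over the whole warp, using the mask to filter
 * out lanes not on the current path. */
static void warp_if_else(const int cond[WARP_SIZE],
                         const int a[WARP_SIZE], const int b[WARP_SIZE],
                         int out[WARP_SIZE]) {
    uint32_t mask = 0;
    for (int t = 0; t < WARP_SIZE; t++)
        if (cond[t]) mask |= (uint32_t)1 << t;
    for (int t = 0; t < WARP_SIZE; t++)              /* "then" pass */
        if (mask & ((uint32_t)1 << t)) out[t] = a[t];
    for (int t = 0; t < WARP_SIZE; t++)              /* "else" pass */
        if (!(mask & ((uint32_t)1 << t))) out[t] = b[t];
}
```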

80 ASCI Winterschool 2010Henk Corporaal(80) CPU Programming: NVIDIA CUDA example Single thread program float A[4][8]; do-all(i=0;i<4;i++){ do-all(j=0;j<8;j++){ A[i][j]++; } CUDA program float A[4][8]; kernelF >>(A); __device__ kernelF(A){ i = blockIdx.x; j = threadIdx.x; A[i][j]++; } CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP). Hardware converts TLP into DLP at run time.

81 CUDA Programming
kernelF<<<grid, block>>>(A);
__device__ kernelF(A){
  i = blockDim.x * blockIdx.y + blockIdx.x;
  j = threadDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
Both grid and thread block can have a two-dimensional index.

82 ASCI Winterschool 2010Henk Corporaal(82) Mapping Thread Blocks to SMs One thread block can only run on one SM Thread block can not migrate from one SM to another SM Threads of the same thread block can share data using shared memory Example: mapping 12 thread blocks on 4 SMs.

83 ASCI Winterschool 2010Henk Corporaal(83) Mapping Thread Blocks (0,0)/(0,1)/(0,2)/(0,3)

84 CUDA Compilation Trajectory
cudafe: CUDA front end
nvopencc: customized Open64 compiler for CUDA
ptx: high-level assembly code (documented)
ptxas: ptx assembler
cubin: CUDA binary (can be disassembled with decuda)

85 Optimization Guide
Optimizations on memory latency tolerance: reduce register pressure, reduce shared memory pressure
Optimizations on memory bandwidth: global memory coalescing, shared memory bank conflicts, grouping byte accesses, avoid partition camping
Optimizations on computation efficiency: mul/add balancing, increase floating-point proportion
Optimizations on operational intensity: use tiled algorithms, tune thread granularity

86 Global Memory: Coalesced Access
NVIDIA, "CUDA Programming Guide"
Perfectly coalesced; threads are allowed to skip their LD/ST

87 Global Memory: Non-Coalesced Access
NVIDIA, "CUDA Programming Guide"
Non-consecutive addresses; starting address not aligned to 128 bytes; stride larger than one word
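The coalescing rule for a G80-generation half-warp can be written as a predicate – a simplified version of the hardware rule, names mine:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified G80 half-warp coalescing check: 16 threads accessing
 * 4-byte words qualify when the base address is aligned to the
 * 64-byte segment and thread t touches base + 4*t (unit stride).
 * Anything else splits into separate memory transactions. */
static bool is_coalesced(const unsigned addr[16]) {
    if (addr[0] % 64u != 0u) return false;            /* misaligned base */
    for (int t = 1; t < 16; t++)
        if (addr[t] != addr[0] + 4u * (unsigned)t)    /* non-unit stride */
            return false;
    return true;
}
```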

88 Shared Memory: without Bank Conflict
NVIDIA, "CUDA Programming Guide"
One access per bank; one access per bank with shuffling; all threads access the same address (broadcast); partial broadcast, skipping some banks

89 Shared Memory: with Bank Conflict
NVIDIA, "CUDA Programming Guide"
More than one address accessed per bank; broadcast plus more than one address per bank

90 Optimizing MatrixMul
Matrix multiplication example from the 5kk70 course at TU/e; the course provides it as a hands-on example.

91 ATI Cypress (RV870): 1600 shader ALUs (ref: Tom's Hardware)

92 ATI Cypress (RV870): VLIW PEs (ref: Tom's Hardware)

93 Intel Larrabee: x86 cores, 8/16/32 cores
Larry Seiler, et al., "Larrabee: a many-core x86 architecture for visual computing", SIGGRAPH 2008

94 CELL PS3
Cell Broadband Engine at 3.2 GHz, NVIDIA RSX "Reality Synthesizer" GPU, South Bridge (USB, network, media, drives)
Main memory: XDR DRAM, 64 pins × 3.2 Gbps/pin = 25.6 GB/s; video memory: GDDR3, 128 pins × 1.4 Gbps/pin = 22.4 GB/s
Links: Cell–RSX 20 GB/s and 15 GB/s; South Bridge 2.5 GB/s

95 ASCI Winterschool 2010Henk Corporaal(95) CELL – the architecture 1 x PPE 64-bit PowerPC L1: 32 KB I$ + 32 KB D$ L2: 512 KB 8 x SPE cores: Local store: 256 KB 128 x 128 bit vector registers Hybrid memory model: PPE: Rd/Wr SPEs: Asynchronous DMA EIB: 205 GB/s sustained aggregate bandwidth Processor-to-memory bandwidth: 25.6 GB/s Processor-to-processor: 20 GB/s in each direction

96 ASCI Winterschool 2010Henk Corporaal(96)

97 ASCI Winterschool 2010Henk Corporaal(97) Intel / AMD x86 – Historical overview

98 Nehalem architecture
In recent processors: Core i7 & Xeon 5500s
Quad core, 3 cache levels, 2 TLB levels, 2 branch predictors, out-of-order execution, Simultaneous Multithreading, DVFS (dynamic voltage & frequency scaling)
(Figure: 1 core)

99 Nehalem pipeline (1/2)
Front end: Instruction Fetch and PreDecode → Instruction Queue → Decode (with micro-code ROM) → Rename/Alloc
Back end: Scheduler → EXE Unit Clusters 0/1/2, Load, Store → Retirement unit (Re-Order Buffer)
Memory: L1D cache and DTLB → L2 cache → inclusive L3 cache shared by all cores
QPI: QuickPath Interconnect (2x20 bit)

100 ASCI Winterschool 2010Henk Corporaal(100) Nehalem pipeline (2/2)

101 Tylersburg: connecting 2 quad cores

Level | Capacity    | Assoc. (ways) | Line size (B) | Latency (clk) | Throughput (clk) | Write policy
L1D   | 4 x 32 KiB  | 8             | 64            | 4             | 1                | Writeback
L1I   | 4 x 32 KiB  | 4             | 64            | N/A           | N/A              | N/A
L2U   | 4 x 256 KiB | 8             | 64            | 10            | Varies           | Writeback
L3U   | 1 x 8 MiB   |               | 64            |               | Varies           | Writeback

102 Programming these architectures: N-tap FIR
C code:
int i, j;
for (i = 0; i < M; i ++){
    out[i] = 0;
    for (j = 0; j < N; j ++)
        out[i] += in[i+j] * coeff[j];
}

103 ASCI Winterschool 2010Henk Corporaal(103)

104 FIR with x86 SSE Intrinsics
__m128 X, XH, XL, Y, C, H;
int i, j;
for(i = 0; i < (M/4); i ++){
    XL = _mm_load_ps(&in[i*4]);
    Y  = _mm_setzero_ps();
    for(j = 0; j < (N/4); j ++){
        XH = XL;
        XL = _mm_load_ps(&in[(i+j+1)*4]);
        C  = _mm_load_ps(&coeff[j*4]);
        H  = _mm_shuffle_ps(C, C, _MM_SHUFFLE(0,0,0,0));
        X  = _mm_mul_ps(XH, H);
        Y  = _mm_add_ps(Y, X);
        H  = _mm_shuffle_ps(C, C, _MM_SHUFFLE(1,1,1,1));
        /* _mm_alignr_epi8 works on integer vectors, hence the casts */
        X  = _mm_castsi128_ps(_mm_alignr_epi8(
                 _mm_castps_si128(XL), _mm_castps_si128(XH), 4));
        X  = _mm_mul_ps(X, H);
        Y  = _mm_add_ps(Y, X);
        H  = _mm_shuffle_ps(C, C, _MM_SHUFFLE(2,2,2,2));
        X  = _mm_castsi128_ps(_mm_alignr_epi8(
                 _mm_castps_si128(XL), _mm_castps_si128(XH), 8));
        X  = _mm_mul_ps(X, H);
        Y  = _mm_add_ps(Y, X);
        H  = _mm_shuffle_ps(C, C, _MM_SHUFFLE(3,3,3,3));
        X  = _mm_castsi128_ps(_mm_alignr_epi8(
                 _mm_castps_si128(XL), _mm_castps_si128(XH), 12));
        X  = _mm_mul_ps(X, H);
        Y  = _mm_add_ps(Y, X);
    }
    _mm_store_ps(&out[i*4], Y);
}

105 FIR using pthread
pthread_t fir_threads[N_THREAD];
fir_arg fa[N_THREAD];
tsize = M/N_THREAD;
for(i = 0; i < N_THREAD; i ++){
    /* … Initialize thread parameters fa[i] … */
    rc = pthread_create(&fir_threads[i], NULL, fir_kernel, (void *)&fa[i]);
}
for(i = 0; i < N_THREAD; i ++)
    pthread_join(fir_threads[i], NULL);   /* wait for all FIR threads */

106 ASCI Winterschool 2010Henk Corporaal(106) x86 FIR speedup On Intel Core 2 Quad Q8300, gcc optimization level 2 Input: ~5M samples #threads in pthread: 4

107 FIR kernel on CELL SPE
Vectorization is similar to SSE:
vector float X, XH, XL, Y, H;
int i, j;
for(i = 0; i < (M/4); i ++){
    XL = in[i];
    Y  = spu_splats(0.0f);
    for(j = 0; j < (N/4); j ++){
        XH = XL;
        XL = in[i+j+1];
        H  = spu_splats(coeff[j*4]);
        Y  = spu_madd(XH, H, Y);
        H  = spu_splats(coeff[j*4+1]);
        X  = spu_shuffle(XH, XL, SHUFFLE_X1);
        Y  = spu_madd(X, H, Y);
        H  = spu_splats(coeff[j*4+2]);
        X  = spu_shuffle(XH, XL, SHUFFLE_X2);
        Y  = spu_madd(X, H, Y);
        H  = spu_splats(coeff[j*4+3]);
        X  = spu_shuffle(XH, XL, SHUFFLE_X3);
        Y  = spu_madd(X, H, Y);
    }
    out[i] = Y;
}

108 SPE DMA double buffering
float iBuf[2][BUF_SIZE];
float oBuf[2][BUF_SIZE];
int idx = 0, next_idx;
int buffers = size/BUF_SIZE;
mfc_get(iBuf[idx], argp, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
for(int i = 1; i < buffers; i++){
    wait_for_dma(tag[idx]);
    next_idx = idx^1;
    mfc_get(iBuf[next_idx], argp, BUF_SIZE*sizeof(float), tag[next_idx], 0, 0);
    fir_kernel(oBuf[idx], iBuf[idx], coeff, BUF_SIZE, taps);
    mfc_put(oBuf[idx], outbuf, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
    idx = next_idx;
}
/* Finish up the last block... */

109 CELL FIR speedup
On PlayStation 3: CELL with six accessible SPEs; input ~6M samples
Speed-up compared to a scalar implementation on the PPE

110 Roofline Model
Introduced by Samuel Williams and David Patterson
Performance in GFlops/sec vs. operational intensity in Flops/Byte
Two rooflines: peak performance (horizontal) and peak bandwidth (slanted); the ridge point marks the balanced architecture for a given application
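The model itself is a single min(); a sketch (function and parameter names mine):

```c
#include <assert.h>

/* Roofline: attainable GFlop/s = min(peak compute, OI * peak BW).
 * Below the ridge point (OI = peak_gflops / peak_gbs) the kernel is
 * bandwidth-bound; above it, compute-bound. */
static double roofline_gflops(double peak_gflops, double peak_gbs,
                              double oi_flops_per_byte) {
    double bw_bound = oi_flops_per_byte * peak_gbs;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For an assumed machine with 100 GFlop/s peak and 50 GB/s, a kernel at 1 Flop/Byte attains 50 GFlop/s (bandwidth-bound), while one at 4 Flops/Byte hits the 100 GFlop/s roof.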

111 ASCI Winterschool 2010Henk Corporaal(111) Roofline Model of GT8800 GPU

112 ASCI Winterschool 2010Henk Corporaal(112) Roofline Model Threads of one warp diverge into different paths at branch.

113 ASCI Winterschool 2010Henk Corporaal(113) Roofline Model In G80 architecture, a non-coalesced global memory access will be separated into 16 accesses.

114 Roofline Model
Previous examples assume memory latency can be hidden; otherwise the program can be latency-bound.
Z. Guz, et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Comp Arch Letters, 2009
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009
r_m: fraction of memory instructions among all instructions; t_avg: average memory latency; CPI_exe: cycles per instruction
There is one memory instruction in every 1/r_m instructions, i.e. one memory instruction every (1/r_m) × CPI_exe cycles. It takes t_avg × r_m / CPI_exe threads to hide the memory latency.
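The thread-count formula can be checked directly (function name mine): one memory instruction issues every (1/r_m) × CPI_exe cycles, and each takes t_avg cycles, so full overlap needs t_avg / ((1/r_m) × CPI_exe) threads.

```c
#include <assert.h>

/* Threads needed to fully overlap memory latency:
 *   t_avg * r_m / CPI_exe
 * e.g. t_avg = 400 cycles, r_m = 0.25, CPI_exe = 1 -> 100 threads. */
static double threads_to_hide_latency(double t_avg, double r_m,
                                      double cpi_exe) {
    return t_avg * r_m / cpi_exe;
}
```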

115 Roofline Model
If there are not enough threads to hide the memory latency, the memory latency can become the bottleneck.
Samuel Williams, "Auto-tuning Performance on Multicore Computers", PhD Thesis, UC Berkeley, 2008
S. Hong, et al., "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA 2009

116 Four Architectures
Sun Victoria Falls: MT SPARC cores on a crossbar (179 GB/s read, 90 GB/s write), 2 x 4 MB shared L2 (16-way, 64b interleaved), 4 coherency hubs, 2 x 128b controllers to 667 MHz FBDIMMs (10.66 GB/s), 8 x 6.4 GB/s links (1 per hub per direction)
AMD Barcelona: Opteron cores with 512 KB victim caches, 2 MB shared quasi-victim cache (32-way), SRI / crossbar, 2 x 64b memory controllers to 667 MHz DDR2 DIMMs, HyperTransport at 4 GB/s each direction
NVIDIA G80: thread clusters, 192 KB L2 (textures only), 24 ROPs, 6 x 64b memory controllers, 86.4 GB/s to 768 MB of 900 MHz GDDR3 device DRAM
IBM Cell Blade: VMT PPE with 512 KB L2 plus 8 SPEs (256 KB local store + MFC each) on the EIB ring network, XDR memory controllers to 512 MB XDR DRAM at 25.6 GB/s, BIF link at <20 GB/s each direction

117 32b Rooflines for the Four (in-core parallelism)
Single-precision Roofline models (attainable GFlop/s vs. flop:DRAM byte ratio, log-log scale) for the SMPs used in this work: Sun Victoria Falls, AMD Barcelona, NVIDIA G80, IBM Cell Blade; based on micro-benchmarks, experience, and manuals; assumes perfect SPMD
Ceilings = in-core parallelism; labels include: peak SP, mul/add imbalance, w/out FMA, w/out SIMD, w/out ILP, w/out memory coalescing, w/out NUMA, w/out SW prefetch, w/out DMA concurrency
Can the compiler find all this parallelism?

118 Let's conclude: Trends
Reliability + fault tolerance: requires run-time management, process migration
Power is the new metric: low-power management at all levels – scenarios, subthreshold, back biasing, …
Virtualization (1): do not disturb other applications – composability
Virtualization (2): one virtual target platform avoids the porting problem – one intermediate supporting multiple targets; huge RT management support, JITC, multiple OSes
Compute servers
Transactional memory
3D: integrate different dies

119 3D using Through Silicon Vias (TSV)
Using TSVs: face-to-back (scalable); flip-chip: face-to-face (limited to 2 die tiers)
4 um pitch in 2011 (ITRS 2007)
Can enlarge device area (from Woo et al., HPCA 2009)

120 ASCI Winterschool 2010Henk Corporaal(120) Don't forget Amdahl However, see next slide!
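Amdahl's law in one function (name mine): with parallel fraction p on n cores the serial part (1 − p) caps the speedup at 1/(1 − p), no matter how many cores a Jaguar-class machine offers.

```c
#include <assert.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n)
 * where p is the parallelizable fraction and n the core count. */
static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, p = 0.5 on 4 cores yields only 1.6x, and even p = 0.9 on unlimited cores tops out just below 10x.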

121 Trends: homogeneous vs heterogeneous – where do we go?
Homogeneous: easier to program; favored by DLP / vector parallelism; fault tolerance / task migration
Heterogeneous: energy-efficiency demands; higher speedup – Amdahl++ (see Hill and Marty, HPCA'08, on Amdahl's law in the multicore era)
Memory-dominated designs suggest a homogeneous sea of heterogeneous cores
A sea of reconfigurable compute or processor blocks? Many examples: Smart Memory, SmartCell, PicoChip, MathStar FPOA, Stretch, XPP, etc.

122 What does a future architecture look like?
A couple of high-performance (low-latency) cores – sequential code should also run fast
Add a whole battery of wide vector processors
Some shared memory (to reduce copying of large data structures)
Levels 2 and 3 in 3D technology: huge bandwidth; exploit large vectors
Accelerators for dedicated domains
OS support (runtime mapping, DVFS, use of accelerators)

123 But the real problem is…
Parallel programming is the real bottleneck – we need new programming models, like transaction-based programming
That's what we will talk about this week…

124 ASCI Winterschool 2010Henk Corporaal(124)
