Introduction. Håkon Kvale Stensland, August 22nd, 2014. INF5063: Programming heterogeneous multi-core processors.

1 Introduction. Håkon Kvale Stensland, August 22nd, 2014. INF5063: Programming heterogeneous multi-core processors

2 INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland University of Oslo INF5063

3 Overview
• Course topic and scope
• Background for parallel processing using heterogeneous multi-core processors
• Examples of heterogeneous architectures
• Vector Processing

4 INF5063: The Course

5 People
• Håkon Kvale Stensland, email: haakonks @ ifi
• Preben Nenseth Olsen, email: prebenno @ ifi
• Carsten Griwodz, email: griff @ ifi
• Professor Pål Halvorsen, email: paalh @ ifi

6 Time and place
• Lectures: Friday 09:00 - 16:00, Storstua (Simula Research Laboratory)
  − Friday August 22nd
  − Friday September 19th
  − Friday October 17th
  − Friday November 21st
• Group exercises: The time reserved on IFI’s pages will not be used!

7 Plan for Today (Session 1)
09:15 – 10:00: Course Introduction
10:00 – 10:15: Break
10:15 – 11:00: Introduction to SSE/AVX
11:00 – 11:15: Break
11:15 – 12:00: Introduction to Video Processing
12:00 – 13:00: Lunch (Will be provided by Simula)
13:00 – 13:45: Using SIMD for Video Processing
13:45 – 14:00: Break
14:00 – 14:45: Codec 63 (c63) Home Exam Precode
14:45 – 15:00: Home Exam 1 is presented

8 About INF5063: Topic & Scope
• Content: The course gives …
  − … an overview of heterogeneous multi-core processors in general and three variants in particular and a modern general-purpose core (architectures and use)
  − … an introduction to working with heterogeneous multi-core processors
      SSEx / AVXx for x86
      Nvidia’s family of GPUs and the CUDA programming framework
      Multiple machines connected with Dolphin PCIe links
  − … some ideas of how to use/program heterogeneous multi-core processors

9 About INF5063: Topic & Scope
• Tasks: The important part of the course is the lab assignments, where you program each of the three examples of heterogeneous multi-core processors
• 3 graded home exams (counting 33% each):
  − Deliver code and make a demonstration explaining your design and code to the class
      1. On the x86: Video encoding – Improve the performance of video compression by using SSE instructions.
      2. On the Nvidia graphics cards: Video encoding – Improve the performance of video compression using the GPU architecture.
      3. On the distributed systems: Video encoding – The same as above, but exploit the parallelism on multiple GPUs and computers connected with Dolphin PCIe links.
• Students will be working together in groups of two. Try to find a partner during the session today!
• Competition at the end of the course! Have the fastest implementation of the code!

10 Background and Motivation: Moore’s Law

11 Motivation: Intel View
• >billion transistors integrated
  − 2014: 7.1 billion (nVIDIA GK110, Kepler)
  − 1971: 2,300 (Intel 4004)

12 Motivation: Intel View
• >billion transistors integrated
• Clock frequency has not increased since 2006
  − 2014 (Still): 5 GHz – IBM Power6

13 Motivation: Intel View
• >billion transistors integrated
• Clock frequency has not increased since 2006
• Power? Heat?

14 Motivation: Intel View
• Soon >billion transistors integrated
• Clock frequency can still increase
• Future applications will demand TIPS
• Power? Heat?

15 Motivation
“Future applications will demand more instructions per second”
“Think platform beyond a single processor”
“Exploit concurrency at multiple levels”
“Power will be the limiter due to complexity and leakage”
Distribute workload on multiple cores

16 Multicores!

17 Symmetric Multi-Core Processors: Intel Haswell

18 Symmetric Multi-Core Processors: AMD “Piledriver”

19 Symmetric Multi-Core Processors
• Good
  − Growing computational power
• Problematic
  − Growing die sizes
  − Unused resources
      Some cores used much more than others
      Many core parts frequently unused
• Why not spread the load better?
• Heterogeneous Architectures!

20 x86 – heterogeneous?

21 Intel Haswell Core Architecture

22 Intel Haswell Architecture
Thus, an x86 is definitely parallel and has heterogeneous (internal) cores. The cores have many (hidden) complicating factors. One of these is out-of-order execution.

23 Out-of-order execution
• Beginning with the Pentium Pro, out-of-order execution has improved the micro-architecture design
  − Execution of an instruction is delayed if its input data is not yet available, while later independent instructions can proceed.
  − The Haswell architecture has 192 instructions in the out-of-order window.
• Instructions are split into micro-operations (μ-ops)
  − ADD EAX, EBX % Add content in EBX to EAX
      Simple operation which generates only 1 μ-op
  − ADD EAX, [mem1] % Add content in mem1 to EAX
      Operation which generates 2 μ-ops: 1) load mem1 into an (unnamed) register, 2) add
  − ADD [mem1], EAX % Add content in EAX to mem1
      Operation which generates 3 μ-ops: 1) load mem1 into an (unnamed) register, 2) add, 3) write result back to memory
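
As a rough illustration of why data dependencies matter to an out-of-order core, the C sketch below sums an array in two ways. The function names `sum_chained` and `sum_split` are made up for this example. Whether the second version actually runs faster depends on the compiler and the CPU, so no timing is claimed here.

```c
#include <stddef.h>

/* One accumulator: every add needs the result of the previous add,
   so the adds form a single dependency chain. */
long sum_chained(const int *v, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Four accumulators: the four adds in one iteration do not depend on
   each other, so an out-of-order core is free to overlap them. */
long sum_split(const int *v, size_t n)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        a0 += v[i];
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same sum; only the shape of the dependency graph that the hardware sees differs.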

24 Intel Haswell
• The x86 architecture “should” be a nice, easy introduction to more complex heterogeneous architectures later in the course…

25 Co-Processors
• The original IBM PC included a socket for an Intel 8087 floating point co-processor (FPU)
  − 50-fold speed-up of floating point operations
• Intel kept the co-processor up to i486
  − 486DX contained an optimized i487 block
  − Still separate pipeline (pipeline flush when starting and ending use)
  − Communication over an internal bus
• Commodore Amiga was one of the earlier machines that used multiple processors
  − Motorola 680x0 main processor
  − Blitter (block image transferrer - moving data, fill operations, line drawing, performing boolean operations)
  − Copper (Co-Processor - change address for video RAM on the fly)

26 nVIDIA Tegra K1 ARM SoC
• One of many multi-core processors for handheld devices
• 4 (5?) ARM Cortex-A15 processors
  − Out-of-order design
  − 32-bit ARMv7 instruction set
  − Cache-coherent cores
  − 2.2 GHz
• Several “dedicated” co-processors:
  − 4K Video Decoder
  − 4K Video Encoder
  − Audio Processor
  − 2x Image Processor
• Fully programmable Kepler-family GPU with 192 simple cores.

27 Intel SCC (Single Chip Cloud Computer)
• Prototype processor from Intel’s Tera-scale project
• 24 «tiles» connected in a mesh
• 1.3 billion transistors
• Intel SCC Tile:
  − Two Intel P54C (Pentium) cores
  − In-order-execution design
  − IA32 architecture
  − NO SIMD units(!!)
• No cache coherency
• Only a limited number of chips were made, for research.

28 Intel MIC («Larrabee» / «Knights Ferry» / «Knights Corner»)
• Introduced in 2009 as a consumer GPU from Intel
• Canceled in May 2010 due to “disappointing performance”
• Launched in 2013 as a dedicated “accelerator” card (Intel Xeon Phi)
• 50+ cores connected with a ring bus
• Intel “Larrabee” core:
  − Based on the Intel P54C Pentium core
  − In-order-execution design
  − Multi-threaded (4-way)
  − 64-bit architecture
  − 512-bit SIMD unit (LRBni)
  − 256 kB L2 Cache
• Cache coherency
• Software rendering

29 General Purpose Computing on GPU
• The
  − high arithmetic precision
  − extreme parallel nature
  − optimized, special-purpose instructions
  − available resources
  − …
  … of the GPU allow general, non-graphics-related operations to be performed on the GPU
• Generic computing workload is off-loaded from the CPU to the GPU
• More generically: Heterogeneous multi-core processing

30 nVIDIA Kepler Architecture – GK110
  − 7.08 billion transistors
  − 2880 “cores”
  − 384-bit memory bus (GDDR5)
  − 336.40 GB/sec memory bandwidth
  − 5121 GFLOPS single precision performance
  − PCI Express 3.0
Found in:
  Tesla K20c, K20x, K40c and K40x
  GeForce GTX 780, 780 Ti, Titan, Titan Black and Titan Z
  Quadro K5200 and K6000
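
The two headline numbers on this slide can be sanity-checked with simple arithmetic. The sketch below relies on values that are not on the slide and are assumptions made for illustration: one fused multiply-add (counted as 2 floating-point operations) per core per cycle, a core clock of 0.889 GHz, and an effective GDDR5 data rate of 7.008 Gbit/s per pin. The helper names are hypothetical.

```c
/* Peak single-precision throughput in GFLOPS:
   cores x floating-point operations per core per cycle x clock in GHz. */
double peak_gflops(int cores, int flops_per_cycle, double clock_ghz)
{
    return cores * flops_per_cycle * clock_ghz;
}

/* Peak memory bandwidth in GB/s:
   bus width in bytes x effective data rate per pin in Gbit/s. */
double peak_bandwidth_gbs(int bus_width_bits, double gbit_per_pin)
{
    return (bus_width_bits / 8.0) * gbit_per_pin;
}
```

Under these assumptions, `peak_gflops(2880, 2, 0.889)` comes to roughly 5121 and `peak_bandwidth_gbs(384, 7.008)` to roughly 336.4, which is in line with the listed figures. Cards with other clocks will give other peaks.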

31 nVIDIA GK110 Streaming Multiprocessor (SMX)
  − Fundamental thread block unit
  − 192 stream processors (SPs) (scalar ALU for threads)
  − 64 double-precision ALUs
  − 32 special function units (SFUs) (cos, sin, log, ...)
  − 65,536 x 32-bit local register files (RFs)
  − 16 / 48 kB level 1 cache
  − 16 / 48 kB shared memory
  − 48 kB Read-Only Data Cache
  − 1536 kB global level 2 cache

32 STI (Sony, Toshiba, IBM) Cell
• Motivation for the Cell
  − Cheap processor
  − Energy efficient
  − For games and media processing
  − Short time-to-market
• Conclusion
  − Use a multi-core chip
  − Design around an existing, power-efficient design
  − Add simple cores specific for game and media processing requirements

33 STI (Sony, Toshiba, IBM) Cell
• Cell is a 9-core processor
  − Power Processing Element (PPE)
      Conventional Power processor
      Not supposed to perform all operations itself, acting like a controller
      Running conventional OS
  − Synergistic Processing Elements (SPE)
      Specialized co-processors for specific types of code, i.e., very high performance vector processors
      Local stores
      Can do general purpose operations
      The PPE can start, stop, interrupt and schedule processes running on an SPE

34 Vector processing: x86, SSE, AVX

35 Types of Parallel Processing/Computing?
• Bit-level parallelism
  − 4-bit → 8-bit → 16-bit → 32-bit → 64-bit → …
• Instruction level parallelism
  − classic RISC pipeline (fetch, decode, execute, memory, write back)
      IF  ID  EX  MEM WB
          IF  ID  EX  MEM WB
              IF  ID  EX  MEM WB
                  IF  ID  EX  MEM WB
                      IF  ID  EX  MEM WB

36 Types of Parallel Processing/Computing?
• Task parallelism
  − Different operations are performed concurrently
  − Task parallelism is achieved when the processors execute different threads (or processes) on the same or different data
  − Examples?

37 Types of Parallel Processing/Computing?
• Data parallelism
  − Distribution of data across different parallel computing nodes
  − Data parallelism is achieved when each processor performs the same task on different pieces of the data
  − Examples?
  − When should we not use data parallelism?
      for each element a
          perform the same (set of) instruction(s) on a
      end
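
The pseudocode on this slide can be written out in C. The sketch below uses two hypothetical functions, named only for this example: `scale_all` is a loop whose iterations are independent and therefore data-parallel, while `running_sum` shows one case where plain data parallelism does not apply directly, because each element needs the result computed for the previous one.

```c
#include <stddef.h>

/* Data-parallel: each iteration touches only element i, so the
   iterations could be split across cores or SIMD lanes in any order. */
void scale_all(float *a, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * factor;
}

/* Not directly data-parallel: element i needs the finished value of
   element i - 1 (a loop-carried dependency). */
void running_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```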

38 Flynn's taxonomy
                 Single instruction   Multiple instruction
  Single data    SISD                 MISD
  Multiple data  SIMD                 MIMD

39 Vector processors
• A vector processor (or array processor)
  − CPU that implements an instruction set containing instructions that operate on one-dimensional arrays (vectors)
  − Example systems:
      Cray-1 (1976): 4096-bit registers (64 x 64-bit floats)
      IBM: POWER with ViVA (Virtual Vector Architecture), 128-bit registers; Cell SPE, 128-bit registers
      SUN: UltraSPARC with VIS, 64-bit registers
      NEC: SX-6/SX-9 (2008), Earth Simulator 1/2, with 4096-bit registers (used in different ways); up to 512 nodes of 8/16 cores (8192 cores); each core has 6 parallel instruction units and shares 72 4096-bit registers

40 Vector processors
People use vector processing in many areas…
• Scientific computing
• Multimedia Processing (compression, graphics, image processing, …)
• Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
• Lossy Compression (JPEG, MPEG video and audio)
• Lossless Compression (Zero removal, RLE, Differencing, LZW)
• Cryptography (RSA, DES/IDEA, SHA/MD5)
• Speech and handwriting recognition
• Operating systems (memcpy, memset, parity, …)
• Networking (checksum, …)
• Databases (hash/join, data mining, updates)
• …

41 Vector processors
• Instruction sets?
  − MMX
  − SSEx (several extensions)
  − AVX
  − AltiVec
  − 3DNow!
  − VIS
  − MDMX
  − FMA
  − …

42 Vector Instruction sets
• MMX
  − MMX is officially a meaningless initialism trademarked by Intel; unofficially:
      MultiMedia eXtension
      Multiple Math eXtension
      Matrix Math eXtension
  − Introduced on the “Pentium with MMX Technology” in 1997.
  − SIMD (Single Instruction, Multiple Data) computation processes multiple data in parallel with a single instruction, resulting in significant performance improvement; MMX gives 2 x 32-bit computations at once.
  − MMX defined 8 “new” 64-bit integer registers (mm0 ~ mm7), which were aliases for the existing x87 FPU registers, reusing 64 (out of 80) bits in the floating point registers.

43 Vector Instruction sets
• SSE
  − Streaming SIMD Extensions (SSE)
      SSE – Pentium III (1999)
      SSE2 – Pentium 4 (Willamette) (2000)
      SSE3 – Pentium 4 (Prescott) (2004)
      SSE4.1 – Penryn (2006)
      SSE4.2 – Nehalem (2008)
  − SIMD - 4 x 32-bit simultaneous computations (128-bit)
  − SSE defines 8 new 128-bit registers (xmm0 ~ xmm7) for single-precision floating-point computations. Since each register is 128 bits long, it can hold a total of four 32-bit floating-point numbers.
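
To make “4 x 32-bit simultaneous computations” concrete, here is a minimal C sketch that adds two float arrays four elements at a time with SSE intrinsics from `<xmmintrin.h>`. The function name `add_floats_sse` and the `USE_SSE` macro are assumptions made for this example. The code falls back to a plain scalar loop when the compiler does not report SSE support.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */
#define USE_SSE 1
#else
#define USE_SSE 0
#endif

/* out[i] = a[i] + b[i]; four floats per instruction where SSE is available. */
void add_floats_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if USE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats, no alignment needed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));   /* 4 additions at once */
    }
#endif
    for (; i < n; i++)   /* scalar tail, or the whole array without SSE */
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch works on any array; aligned variants exist but require 16-byte aligned data.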

44 Vector Instruction sets
• AVX
  − Advanced Vector Extensions (AVX)
      AVX – Sandy Bridge (2011)
      AVX2 – Haswell (2013)
  − SIMD - 8 x 32-bit simultaneous computations (256-bit)
  − AVX increases the width of the SIMD registers from 128 bits to 256 bits. It is now possible to store up to 8 x 32-bit floating point numbers.
  − AVX2 also introduced support for integer operations, vector shifts, gather support and three-operand fused multiply-accumulate (FMA3).
  − The next version of AVX is called AVX-512. It will extend AVX to 512 bits and is scheduled to be launched in 2015 with the next generation of the Xeon Phi MIC processor (“Knights Landing”).
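
A 256-bit register holds eight floats, so one AVX instruction can perform eight additions. The C sketch below shows this with intrinsics from `<immintrin.h>`. The function name `add_floats_avx` and the `USE_AVX` macro are assumptions made for this example. The AVX path is only compiled when the compiler is told to target AVX (for example with `-mavx` on GCC or Clang); otherwise the scalar loop handles the whole array.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>   /* AVX intrinsics: __m256, _mm256_loadu_ps, ... */
#define USE_AVX 1
#else
#define USE_AVX 0
#endif

/* out[i] = a[i] + b[i]; eight floats per instruction where AVX is enabled. */
void add_floats_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if USE_AVX
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* load 8 floats, no alignment needed */
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));   /* 8 additions at once */
    }
#endif
    for (; i < n; i++)   /* scalar tail, or the whole array without AVX */
        out[i] = a[i] + b[i];
}
```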

45 The End: Summary
• Heterogeneous multi-core processors are already everywhere
• Challenge: programming
  − Need to know the capabilities of the system
  − Different abilities in different cores
  − Memory bandwidth
  − Memory sharing efficiency
  − Need new methods to program the different components

