Multicore / Multiprocessor Architectures

1 Multicore / Multiprocessor Architectures
CDA, Spring 2016: Introduction to Computer Organization. Multicore / Multiprocessor Architectures. 7 and 12 April 2016.

2 Multicore Architectures
- Introduction: what are multicores?
- Why multicores? Power and performance perspectives
- Multiprocessor architectures
- Conclusion
Copyright © Prabhat Mishra

3

4 How to Reduce Power Consumption
- Multicore: one core at 2 GHz vs. two cores at 1 GHz each, with the same aggregate performance.
- Power ∝ freq², so a 1 GHz core needs one-fourth the power of a 2 GHz core, and two 1 GHz cores together require half the power/energy (see the arithmetic below).
- New challenge: performance. How do we utilize the cores? It is difficult to find enough parallelism in programs to keep them all busy.
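To make this explicit, here is the slide's claim worked out under its Power ∝ freq² model (a simplification: real dynamic power also depends on supply voltage and switched capacitance):

```latex
P \propto f^{2}
\;\Rightarrow\;
\frac{P_{1\,\mathrm{GHz}}}{P_{2\,\mathrm{GHz}}} = \frac{1^{2}}{2^{2}} = \frac{1}{4},
\qquad
\frac{2 \, P_{1\,\mathrm{GHz}}}{P_{2\,\mathrm{GHz}}} = \frac{2}{4} = \frac{1}{2}
```

One 1 GHz core draws a quarter of the power of a 2 GHz core, so two of them together draw half, at (ideally) the same aggregate throughput.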

5 Reducing Energy Consumption
- Pentium: Max Temp = deg C. Crusoe: Max Temp = 48.2 deg C.
- Both processors are running the same multimedia application.
- Infrared (FLIR) cameras can be used to capture the thermal distribution.

6 Introduction
- The never-ending story: complex applications demand faster computation. How far can we go with uniprocessors?
- Parallel processors now play a major role. Connecting multiple microprocessors is the logical way to improve performance, since not much is left to exploit in ILP.
- Server and embedded software already contain parallelism.
- Multiprocessor architectures will become increasingly attractive due to the slowdown in uniprocessor advances.

7 Levels of Parallelism
- Bit-level parallelism (1970 to ~1985): 4-bit, 8-bit, 16-bit, and 32-bit microprocessors.
- Instruction-level parallelism (~1985 to today): pipelining, superscalar, VLIW, out-of-order execution / dynamic instruction scheduling.
- Process-level or thread-level parallelism: servers are parallel; desktop dual-processor PCs; multicore architectures (CPUs, GPUs).

8 Taxonomy of Parallel Architectures
Flynn classification:
- SISD (Single Instruction, Single Data): uniprocessors.
- MISD (Multiple Instruction, Single Data): multiple processors on a single data stream; no commercial prototypes. Can be thought of as successive refinement of a given set of data by multiple processing units.
- SIMD (Single Instruction, Multiple Data): examples are the Illiac-IV and CM-2. Simple programming model, low overhead, and flexibility; all custom integrated circuits.
- MIMD (Multiple Instruction, Multiple Data): examples are the Sun Enterprise 5000, Cray T3D, and SGI Origin. Flexible but difficult to program (no unifying model of parallelism); uses off-the-shelf microprocessors. In practice, MIMD designs have <= 128 processors.

9 MIMD
- Two types: centralized shared-memory multiprocessors and distributed-memory multiprocessors.
- MIMD exploits thread-level parallelism: the program should have at least n threads or processes for a MIMD machine with n processors (see the sketch below).
- Threads can be of different types: independent programs, or parallel iterations of a loop (extracted by the compiler).
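A minimal sketch of the "at least n threads for n processors" point, applied to the parallel-loop case, using C++ std::thread (the partitioning scheme and all names here are illustrative, not from the slides):

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Each thread runs its own instruction stream on its own data slice:
// the essence of MIMD thread-level parallelism.
void work(int tid, int begin, int end) {
    long sum = 0;
    for (int i = begin; i < end; ++i) sum += i;   // stand-in for real work
    std::printf("thread %d handled iterations [%d, %d)\n", tid, begin, end);
}

int main() {
    const int n = 4;             // ideally one thread per processor
    const int iters = 1000000;
    std::vector<std::thread> pool;
    for (int t = 0; t < n; ++t)  // split the loop's iterations across n threads
        pool.emplace_back(work, t, t * (iters / n), (t + 1) * (iters / n));
    for (auto& th : pool) th.join();
    return 0;
}
```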

10 Centralized Shared-Memory Multiprocessor

11 Centralized Shared-Memory Multiprocessor
- A small number of processors share a centralized memory, using multiple buses or switches and multiple memory banks.
- Main memory has a symmetric relationship to all processors and uniform access time from any processor.
- SMP: symmetric shared-memory multiprocessor. UMA: uniform memory access architecture.
- Growth in processor performance and in memory bandwidth requirements makes the centralized-memory paradigm less attractive.

12 Distributed-Memory Multiprocessors

13 Distributed-Memory Multiprocessors
- Distributing memory has two benefits: it is a cost-effective way to scale memory bandwidth, and it reduces local memory access time.
- Communicating data between processors becomes more complex and has higher latency.
- Two approaches to data communication:
  - Shared address space (without centralized memory): the same physical address refers to the same memory location. DSM: distributed shared-memory architectures. NUMA: non-uniform memory access, since access time depends on the location of the data.
  - Logically disjoint address spaces: multicomputers.

14 Small-Scale Shared Memory
Caches serve to:
- increase bandwidth relative to the bus/memory,
- reduce access latency.
This is valuable for both private data and shared data. But what about cache consistency?

15 Example: Cache Coherence Problem
[Figure: three processors P1, P2, P3, each with a private cache, connected over a bus to main memory and I/O devices; memory initially holds u:5. Events: (1) P1 reads u, caching u:5; (2) P3 reads u, caching u:5; (3) P3 writes u = 7; (4) P1 reads u: u = ?; (5) P2 reads u: u = ?]
- Processors see different values for u after event 3.
- With write-back caches, the value written back to memory depends on which cache flushes or writes back its value first; processes accessing main memory may see a very stale value.
- Unacceptable for programming, and it happens frequently!
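The same scenario rendered as code (an illustrative sketch only: this is a data race in C++ terms, and on real coherent hardware the protocol exists precisely to rule out the stale-read outcome described in the comments):

```cpp
#include <cstdio>
#include <thread>

int u = 5;  // shared variable, initially 5 in memory, cached by each processor

int main() {
    std::thread p3([] { u = 7; });  // event 3: P3 writes u = 7
                                    // (sits in P3's write-back cache)
    std::thread p1([] { std::printf("P1 reads u = %d\n", u); });  // event 4
    std::thread p2([] { std::printf("P2 reads u = %d\n", u); });  // event 5
    p3.join(); p1.join(); p2.join();
    // Without hardware coherence, P1 could keep returning its cached 5 while
    // P3 holds 7; with write-back caches, memory itself still says 5 until
    // P3's block is flushed. Coherence protocols (and, at the language level,
    // std::atomic or locks) are what make shared memory dependable.
    return 0;
}
```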

16 4 C’s: Sources of Cache Misses
- Compulsory misses (aka cold-start misses): the first access to a block.
- Capacity misses: due to finite cache size; a replaced block is later accessed again.
- Conflict misses (aka collision misses): occur in a non-fully-associative cache due to competition for entries in a set; they would not occur in a fully associative cache of the same total size.
- Coherence misses: caused by cache blocks invalidated through sharing with other processors.
(Source: Morgan Kaufmann, Chapter 5: Large and Fast: Exploiting Memory Hierarchy.)
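A sketch of where the first three miss types come from (the array size, stride, and cache assumptions are mine, chosen for illustration; coherence misses need a second processor invalidating blocks, as on the previous slide):

```cpp
#include <cstdlib>

int main() {
    const int n = 1 << 22;  // 16 MB of ints: larger than typical caches
    int* a = static_cast<int*>(std::malloc(n * sizeof(int)));

    // Compulsory (cold-start) misses: the first pass touches every block
    // for the first time, so these misses are unavoidable.
    for (int i = 0; i < n; ++i) a[i] = i;

    // Capacity misses: the array exceeds the cache, so by the time we wrap
    // around to a[0] it has been evicted, even in a fully associative cache.
    long sum = 0;
    for (int i = 0; i < n; ++i) sum += a[i];

    // Conflict misses: a large power-of-two stride maps successive accesses
    // to the same set of a set-associative cache, evicting blocks that a
    // fully associative cache of the same total size would have kept.
    for (int i = 0; i < n; i += 4096) sum += a[i];

    std::free(a);
    return static_cast<int>(sum & 1);  // keep the loops from being optimized away
}
```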

17 Graphics Processing Units (GPUs)
- Moore's Law will come to an end; many complicated solutions have been proposed.
- A simple solution: spatial parallelism.
- SIMD model (single instruction, multiple data streams).
- GPUs provide a SIMD grid with a local and shared memory model.

18 GPUs – Nvidia CUDA Hierarchy
- Map each process to a thread.
- Group threads into blocks, and group blocks into grids, for efficient memory access.
- Memory coalescing operations also give faster data transfer (see the sketch below).
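A minimal CUDA sketch of this hierarchy (vector addition is my example, not the slide's): each thread handles one element, threads are grouped into blocks, blocks into a grid, and consecutive threads touch consecutive addresses so their memory accesses coalesce.

```cuda
#include <cstdio>

// One thread per element: threadIdx/blockIdx/blockDim express the
// thread -> block -> grid hierarchy described on the slide.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];  // neighboring threads hit neighboring
                             // addresses, so the accesses coalesce
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                 // threads in a block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // blocks in the grid
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();
    std::printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```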

19 Graphics Processing Units (GPUs)
- Nvidia Fermi GPU: 3 GB DRAM, 512 cores, CUDA architecture.
- Execution hierarchy: thread, thread block, grid of thread blocks.
- Intelligent CUDA compiler.

20 Nvidia Tesla 20xx GPU Board

21 Graphics Processing Units (GPUs)
- Nvidia Maxwell GM100: 8 GB memory, 6,144 cores, CUDA architecture.
- Threads can be spawned internally (dynamic parallelism).
- 32 cores per streaming multiprocessor; 128 KB L1 and 2 MB L2 cache.
- CUDA compiler v5.2+.

22 GPU Problems and Solutions
- GPUs are designed for graphics rendering, not for general-purpose computing (no unifying model of parallelism).
- Memory hierarchy: local memory (fast, small), shared memory (slower, larger), global memory (slow, gigabytes).
- How to circumvent the data-movement cost? Clever hand coding is costly and app-specific; automatic coding is sub-optimal and needs software support. A common hand-coded idiom appears below.
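One standard way to cut data-movement cost by hand (a sketch of the shared-memory staging idiom; the kernel and sizes are illustrative): stage data once in fast on-chip shared memory, reuse it there, and touch slow global memory as little as possible.

```cuda
// Per-block sum: stage global data in shared memory once, then reduce on chip.
// Launch with 256 threads per block to match the tile size.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];              // fast on-chip shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      // one coalesced global read
    __syncthreads();

    // Tree reduction: all remaining traffic stays on chip.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0]; // one global write per block
}
```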

23 Advantages and Disadvantages
- GPUs provide fast parallel computing and work best for parallel solutions; sequential programs can actually run slower.
- Amdahl's Law describes the speedup on N processors, where P is the fraction of the program that is parallel and S is the fraction that is sequential:

  Speedup = 1 / (S + P/N)
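A worked instance (the numbers are mine, not the slide's): with P = 0.9, S = 0.1, and N = 8 processors,

```latex
\mathrm{Speedup} = \frac{1}{S + P/N} = \frac{1}{0.1 + 0.9/8}
                 = \frac{1}{0.2125} \approx 4.7
```

Even with infinitely many processors the speedup is capped at 1/S = 10, which is why the sequential fraction dominates.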

24 Multicore CPUs
- Intel Nehalem: 45 nm circuit technology, multiple cores; workstations and laptops. Heat dissipation?
- Intel Xeon (2001-present): 2 to >60 cores; servers and HPC arrays.
[Figure: dual-Nehalem chip.]

25 Intel Multicore CPU Performance
[Chart comparing Intel multicore CPU performance against a single core.]

26 Conclusions
- Parallel machines need parallel solutions: inherently sequential programs don't benefit much from parallelism.
- Two main types of parallel architectures: SIMD (single instruction, multiple data streams) and MIMD (multiple instruction, multiple data streams).
- Modern parallel architectures (multicores): GPUs exploit SIMD parallelism for general-purpose parallel computing solutions, while multicore CPUs are more amenable to MIMD parallel applications.

