
ADVANCED COMPUTER ARCHITECTURE
Fundamental Concepts: Computing Models
Samira Khan, University of Virginia, Jan 30, 2019
The content and concept of this course are adapted from CMU ECE 740

AGENDA
- Review from last lecture
- Flynn's taxonomy of computers
- Single core → multi-core → accelerators

REVIEWS
Due on Feb 6, 2019
- Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ISCA 2011.
- Y.-H. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016.

Data Flow Characteristics
- Data-driven execution of instruction-level graphical code
  - Nodes are operators
  - Arcs are data (I/O)
  - As opposed to control-driven execution
- Only real dependencies constrain processing
  - No sequential instruction stream, no program counter
- Operations execute asynchronously
  - Execution is triggered by the presence of data
- Single-assignment languages and functional programming
  - E.g., SISAL in the Manchester Data Flow Computer
  - No mutable state
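
To make the firing rule concrete, here is a minimal Python sketch (my own illustration, not from the slides) of data-driven evaluation: each node fires as soon as all of its input tokens are present, and no program counter orders the operations.

```python
import operator

# A tiny dataflow graph for (a + b) * (a - b).
# Each node: (operator, input arcs, output arc); arc names are hypothetical.
NODES = [
    (operator.add, ("a", "b"), "sum"),
    (operator.sub, ("a", "b"), "diff"),
    (operator.mul, ("sum", "diff"), "result"),
]

def run_dataflow(tokens):
    """tokens: dict of initially available arc values. Fires any node whose
    inputs are all present, in no particular order, until nothing can fire."""
    remaining = list(NODES)
    while remaining:
        fireable = [n for n in remaining if all(a in tokens for a in n[1])]
        if not fireable:
            break  # a real machine would wait here for more tokens
        for op, ins, out in fireable:
            tokens[out] = op(*(tokens[a] for a in ins))  # single assignment
            remaining.remove((op, ins, out))
    return tokens

print(run_dataflow({"a": 7, "b": 3})["result"])  # (7+3)*(7-3) = 40
```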

Data Flow Advantages/Disadvantages
Advantages
- Very good at exploiting irregular parallelism
- Only real dependencies constrain processing
Disadvantages
- Debugging is difficult (no precise state)
- Interrupt/exception handling is difficult (what are the precise-state semantics?)
- Implementing dynamic data structures is difficult in pure dataflow models
- Too much parallelism? (parallelism control needed)
- High bookkeeping overhead (tag matching, data storage)
- Instruction cycle is inefficient (delay between dependent instructions); memory locality is not exploited

OOO EXECUTION: RESTRICTED DATAFLOW
- An out-of-order engine dynamically builds the dataflow graph of a piece of the program
  - Which piece? The dataflow graph is limited to the instruction window
  - Instruction window: all decoded but not yet retired instructions
- Can we do this for the whole program? Why would we want to?
  - In other words, how can we have a large instruction window?
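
As an illustration of "restricted dataflow" (a sketch of the idea, not of real issue logic), the following models dataflow-order issue inside an instruction window, assuming register renaming has already made every destination unique:

```python
def dataflow_issue(window):
    """window: (dest, srcs) tuples in program order, with unique dests
    (as renaming guarantees). Returns the groups of instructions that can
    fire together each cycle, assuming unit latency and unlimited units."""
    produced = {dest for dest, _ in window}
    ready = set()                      # values computed so far
    remaining = list(window)
    schedule = []
    while remaining:
        # fire when every source is ready or comes from outside the window
        fire = [ins for ins in remaining
                if all(s in ready or s not in produced for s in ins[1])]
        if not fire:
            break                      # would indicate a dependence cycle
        schedule.append([dest for dest, _ in fire])
        ready.update(dest for dest, _ in fire)
        remaining = [ins for ins in remaining if ins not in fire]
    return schedule

# r1 and r3 fire together; r2 waits one cycle for r1, exactly as the
# dataflow graph (not program order) dictates.
print(dataflow_issue([("r1", ("a", "b")),
                      ("r2", ("r1", "c")),
                      ("r3", ("a", "c"))]))   # [['r1', 'r3'], ['r2']]
```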

FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
- SISD: Single instruction operates on a single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on a single data element
  - Closest forms: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

SIMD PROCESSING
- Single instruction operates on multiple data elements
  - In time or in space
- Multiple processing elements
- Time-space duality
  - Array processor: instruction operates on multiple data elements at the same time
  - Vector processor: instruction operates on multiple data elements in consecutive time steps

ARRAY VS. VECTOR PROCESSORS

Instruction stream (4-element vector):
LD  VR ← A[3:0]
ADD VR ← VR, 1
MUL VR ← VR, 2
ST  A[3:0] ← VR

Array processor: same op @ same time, different ops @ same space (space = PEs, time runs downward)

Time  PE0  PE1  PE2  PE3
t0    LD0  LD1  LD2  LD3
t1    AD0  AD1  AD2  AD3
t2    MU0  MU1  MU2  MU3
t3    ST0  ST1  ST2  ST3

Vector processor: same op @ same space, different ops @ same time (pipelined functional units)

Time  LD    ADD   MUL   ST
t0    LD0   -     -     -
t1    LD1   AD0   -     -
t2    LD2   AD1   MU0   -
t3    LD3   AD2   MU1   ST0
t4    -     AD3   MU2   ST1
t5    -     -     MU3   ST2
t6    -     -     -     ST3
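
The same instruction stream at the language level: a sketch using NumPy, whose whole-array operations mirror the one-instruction-many-elements semantics (the sample values of A are my own; whether the hardware executes this in array or vector fashion is invisible at this level):

```python
import numpy as np

A = np.array([3, 1, 4, 1])       # stands in for A[3:0]

# Scalar (SISD) version: one operation per element per instruction
out = A.copy()
for i in range(len(out)):
    out[i] = (out[i] + 1) * 2

# SIMD version: each whole-array operation is one logical vector instruction
VR = A + 1                        # ADD VR <- VR, 1 (all elements)
VR = VR * 2                       # MUL VR <- VR, 2 (all elements)
assert (VR == out).all()
print(VR)                         # [ 8  4 10  4]
```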

SCALAR PROCESSING
- Conventional form of processing (von Neumann model)
- Each instruction operates on a single data element, e.g.:
  add r1, r2, r3

SIMD ARRAY PROCESSING Array processor

VECTOR PROCESSOR ADVANTAGES
+ No dependencies within a vector
  - Pipelining and parallelization work well
  - Can have very deep pipelines: no dependencies!
+ Each instruction generates a lot of work
  - Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
  - Interleaving multiple banks for higher memory bandwidth (see the sketch below)
  - Prefetching
+ No need to explicitly code loops
  - Fewer branches in the instruction sequence
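
A minimal sketch of why regular accesses interleave well across banks (the 16-bank figure matches the Cray-1 below; the addresses are my own examples): low-order interleaving puts consecutive words in consecutive banks, so a unit-stride vector load keeps all banks busy in parallel.

```python
NUM_BANKS = 16                     # e.g., the Cray-1 below has 16 banks

def bank_of(addr):
    """Low-order interleaving: consecutive words map to consecutive banks."""
    return addr % NUM_BANKS

# Unit stride touches every bank once before reusing any:
print([bank_of(a) for a in range(100, 108)])      # [4, 5, 6, 7, 8, 9, 10, 11]
# A stride equal to the bank count hammers one bank (worst case):
print([bank_of(100 + 16 * i) for i in range(4)])  # [4, 4, 4, 4]
```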

VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.
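
The linked-list search is a useful mental test case: each step needs the pointer produced by the previous step, so there is no vector of independent elements to operate on. A minimal sketch (the node layout is my own):

```python
def find(head, key):
    """Pointer chasing: every iteration depends on the previous load,
    so the loop cannot be expressed as one operation over a vector."""
    node = head
    while node is not None:
        if node["key"] == key:
            return node
        node = node["next"]        # serial dependence through memory
    return None

# Build 1 -> 2 -> 3 and search it
n3 = {"key": 3, "next": None}
n2 = {"key": 2, "next": n3}
n1 = {"key": 1, "next": n2}
print(find(n1, 3)["key"])          # 3
```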

VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

VECTOR MACHINE EXAMPLE: CRAY-1
- Russell, "The CRAY-1 Computer System," CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
- 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers

AMDAHL’S LAW: BOTTLENECK ANALYSIS
- Speedup = time_without_enhancement / time_with_enhancement
- Suppose an enhancement speeds up a fraction f of a task by a factor of S:
  time_enhanced = time_original × (1 − f) + time_original × (f / S)
  Speedup_overall = 1 / ((1 − f) + f / S)
[Figure: time_original split into portions (1 − f) and f; in time_enhanced the f portion shrinks to f/S]
- Focus on bottlenecks with large f (and large S)
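
A quick numeric check of the formula (the numbers are my own illustration): speeding up 90% of a task by 10× yields far less than 10× overall, because the untouched 10% starts to dominate.

```python
def amdahl_speedup(f, S):
    """Overall speedup when a fraction f of execution time is sped up by S."""
    return 1.0 / ((1.0 - f) + f / S)

print(amdahl_speedup(0.9, 10))    # ~5.26x, not 10x
print(amdahl_speedup(0.9, 1e9))   # ~10x: the limit as S grows is 1 / (1 - f)
```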

SYSTOLIC ARRAYS

WHY SYSTOLIC ARCHITECTURES?
- Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
- Similar to an assembly line of processing elements
  - Different people work on the same car
  - Many cars are assembled simultaneously
- Why? Special-purpose accelerators/architectures need:
  - Simple, regular design (keep the number of unique parts small and regular)
  - High concurrency → high performance
  - Balanced computation and I/O (memory) bandwidth

SYSTOLIC ARRAYS
- Memory: heart; PEs: cells
- Memory pulses data through the cells
- H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.

SYSTOLIC ARCHITECTURES
- Basic principle: replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  - Balance computation and memory bandwidth
- Differences from pipelining:
  - These are individual PEs
  - The array structure can be non-linear and multi-dimensional
  - PE connections can be multidirectional (and of different speeds)
  - PEs can have local memory and execute kernels (rather than a piece of an instruction)

SYSTOLIC COMPUTATION EXAMPLE
- Convolution
  - Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  - Many image processing tasks

SYSTOLIC ARCHITECTURE FOR CONVOLUTION

[Animation: the x values stream past the stationary weights W3, W2, W1 while y1 accumulates one product per step]
Step 1: y1 = w1x1 (previously y1 = 0)
Step 2: y1 = w1x1 + w2x2
Step 3: y1 = w1x1 + w2x2 + w3x3

CONVOLUTION
y1 = w1x1 + w2x2 + w3x3
y2 = w1x2 + w2x3 + w3x4
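
To make the orchestration concrete, here is a minimal cycle-level model in Python (my own sketch of a weight-stationary design in the spirit of these slides, not a transcription of the exact figure): each PE keeps one weight, the current input is broadcast to all PEs, and partial sums hop one PE per cycle, exiting when complete.

```python
def systolic_conv(w, x):
    """Cycle-level model of a 1-D systolic convolver: PE j holds weight
    w[j]; x[t] is broadcast to every PE each cycle; partial sums advance
    one PE per cycle, so y[i] = w[0]x[i] + w[1]x[i+1] + ... emerges in order."""
    K, N = len(w), len(x)
    pes = [None] * K                 # partial sum resident in each PE
    out = []
    for t in range(N):               # one clock cycle per input element
        if pes[-1] is not None:      # a finished y leaves the array
            out.append(pes[-1])
        pes[1:] = pes[:-1]           # partial sums advance one PE
        pes[0] = 0 if t <= N - K else None  # inject a new y while outputs remain
        for j in range(K):           # every PE sees the broadcast x[t]
            if pes[j] is not None:
                pes[j] += w[j] * x[t]
    out.append(pes[-1])              # drain the last completed sum
    return out

# y1 = 1*1 + 2*2 + 3*3 = 14, y2 = 1*2 + 2*3 + 3*4 = 20, y3 = 26
print(systolic_conv([1, 2, 3], [1, 2, 3, 4, 5]))   # [14, 20, 26]
```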

Convolution: Another Design

[Animation: the weights stay put (W3, W2, W1, left to right); the x inputs and the partial sums y both stream through the array, so each y collects its w3, w2, and w1 terms in successive cycles]
Frame 1: x1 enters
Frame 2: x2 follows
Frame 3: x3 enters; y1 is injected
Frame 4: y1 = w3x3; y2 injected
Frame 5: y1 = w2x2 + w3x3; y2 = w3x4; y3 injected
Frame 6: y1 = w1x1 + w2x2 + w3x3 (complete); y2 = w2x3 + w3x4; y3 = w3x5
Frame 7: y2 = w1x2 + w2x3 + w3x4 (complete); y3 = w2x4 + w3x5; y4 = w3x6
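
Whichever design is used, the outputs must match the direct formula, so a plain loop makes a handy reference against which a systolic model (like the sketch above) can be checked. The helper name is mine:

```python
def conv_direct(w, x):
    """Reference convolution: y[i] = sum_j w[j] * x[i + j]."""
    K = len(w)
    return [sum(w[j] * x[i + j] for j in range(K))
            for i in range(len(x) - K + 1)]

print(conv_direct([1, 2, 3], [1, 2, 3, 4, 5]))   # [14, 20, 26], same as above
```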

More Programmability
- Each PE in a systolic array
  - Can store multiple "weights"
  - Weights can be selected on the fly
  - Eases implementation of, e.g., adaptive filtering
- Taken further
  - Each PE can have its own data and instruction memory
  - Data memory → to store partial/temporary results, constants
  - Leads to stream processing, pipeline parallelism
    - More generally, staged execution

SYSTOLIC ARRAYS: PROS AND CONS
- Advantage: specialized (computation needs to fit the PE organization/functions)
  → improved efficiency, simple design, high concurrency/performance
  → good at doing more with a smaller memory bandwidth requirement
- Downside: specialized
  → not generally applicable, because the computation needs to fit the PE functions/organization

The WARP Computer
- H. T. Kung, CMU, 1984–1988
- Linear array of 10 cells, each cell a 10 MFLOPS programmable processor
- Attached to a general-purpose host machine
- HLL and optimizing compiler to program the systolic array
- Used extensively to accelerate vision and robotics tasks
- Annaratone et al., "Warp Architecture and Implementation," ISCA 1986.
- Annaratone et al., "The Warp Computer: Architecture, Implementation, and Performance," IEEE TC 1987.


AGENDA
- Review from last lecture
- Flynn's taxonomy of computers
- Single core → multi-core → accelerators

MULTIPLE CORES ON CHIP
- Simpler and lower power than a single large core
- Large-scale parallelism on chip
  - AMD Barcelona: 4 cores
  - Intel Core i7: 8 cores
  - IBM Cell BE: 8+1 cores
  - IBM POWER7: 8 cores
  - Nvidia Fermi: 448 "cores"
  - Intel SCC: 48 cores, networked
  - Tilera TILE Gx: 100 cores, networked
  - Sun Niagara II: 8 cores

MOORE’S LAW Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.

MULTI-CORE
- Idea: put multiple processors on the same die
- Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
- What else could you do with the die area you dedicate to multiple processors?
  - Have a bigger, more powerful core
  - Have larger caches in the memory hierarchy
  - Integrate platform components on chip (e.g., network interface, memory controllers)

WHY MULTI-CORE?
- Alternative: bigger, more powerful single core
  - Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
  + Improves single-thread performance transparently to the programmer and compiler
  - Very difficult to design (scalable algorithms for improving single-thread performance are elusive)
  - Power hungry: many out-of-order execution structures consume significant power/area when scaled up. Why?
  - Diminishing returns on performance
  - Does not significantly help memory-bound application performance (scalable algorithms for this are elusive)

MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming; see the sketch below)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
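
A minimal illustration of that first disadvantage (a sketch; the function names and toy workload are mine): on a multi-core chip the extra throughput only materializes once the programmer splits the work into independent tasks, e.g. with Python's multiprocessing module.

```python
from multiprocessing import Pool

def kernel(n):
    """Stand-in compute-bound task (a hypothetical workload)."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [200_000] * 8
    # Sequential: a single core does everything
    seq = [kernel(n) for n in jobs]
    # Parallel: the programmer must expose the independent tasks explicitly
    with Pool(processes=4) as pool:
        par = pool.map(kernel, jobs)
    assert seq == par          # same results; speedup depends on core count
```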
