
1 ADVANCED COMPUTER ARCHITECTURE
Fundamental Concepts: Computing Models
Samira Khan, University of Virginia, Jan 30, 2019
The content and concept of this course are adapted from CMU ECE 740

2 AGENDA
Review from last lecture
Flynn's taxonomy of computers
Single core → multi-core → accelerators

3 REVIEWS
Due on Feb 6, 2019
Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ISCA 2011.
Y.-H. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016.

4 Data Flow Characteristics
Data-driven execution of instruction-level graphical code
  Nodes are operators
  Arcs are data (I/O)
  As opposed to control-driven execution
Only real dependencies constrain processing
  No sequential I-stream, no program counter
Operations execute asynchronously
  Execution is triggered by the presence of data
Single-assignment languages and functional programming
  E.g., SISAL in the Manchester Data Flow Computer
  No mutable state
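To make the firing rule concrete, here is a minimal C sketch (my illustration, not from the lecture): each node keeps presence flags for its input tokens and executes as soon as both are present, so the arrival order of data, not a program counter, drives execution. The node names and operand values are made up.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical dataflow node: fires when both input tokens have arrived. */
typedef struct {
    const char *name;
    char op;              /* '+' or '*' */
    bool has_a, has_b;    /* token presence flags */
    int a, b;
} Node;

static bool try_fire(Node *n, int *result) {
    if (!(n->has_a && n->has_b))
        return false;                      /* operands not all present: cannot fire */
    *result = (n->op == '+') ? n->a + n->b : n->a * n->b;
    printf("%s fired -> %d\n", n->name, *result);
    n->has_a = n->has_b = false;           /* consume the tokens */
    return true;
}

int main(void) {
    /* Graph: mul = (x + y) * z.  Token arrival, not a program counter,
       determines when each node executes. */
    Node n1 = {"add", '+', false, false, 0, 0};
    Node n2 = {"mul", '*', false, false, 0, 0};
    int t;

    n2.b = 5; n2.has_b = true;             /* z arrives first; mul still cannot fire */
    n1.a = 2; n1.has_a = true;
    n1.b = 3; n1.has_b = true;             /* both operands of add now present */
    if (try_fire(&n1, &t)) { n2.a = t; n2.has_a = true; }  /* result token flows along the arc */
    try_fire(&n2, &t);                     /* fires only now that both tokens are present */
    return 0;
}
```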

5 Data Flow Advantages/Disadvantages
Advantages
  Very good at exploiting irregular parallelism
  Only real dependencies constrain processing
Disadvantages
  Debugging is difficult (no precise state)
  Interrupt/exception handling is difficult (what are the precise-state semantics?)
  Implementing dynamic data structures is difficult in pure dataflow models
  Too much parallelism? (parallelism control needed)
  High bookkeeping overhead (tag matching, data storage)
  Instruction cycle is inefficient (delay between dependent instructions); memory locality is not exploited

6 OOO EXECUTION: RESTRICTED DATAFLOW
An out-of-order engine dynamically builds the dataflow graph of a piece of the program
  Which piece? The dataflow graph is limited to the instruction window
  Instruction window: all decoded but not yet retired instructions
Can we do it for the whole program? Why would we want to?
In other words, how can we have a large instruction window?

7 FLYNN'S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
SISD: Single instruction operates on a single data element
SIMD: Single instruction operates on multiple data elements
  Array processor
  Vector processor
MISD: Multiple instructions operate on a single data element
  Closest forms: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor
  Multithreaded processor

8 SIMD PROCESSING
Single instruction operates on multiple data elements
  In time or in space
Multiple processing elements
Time-space duality
  Array processor: instruction operates on multiple data elements at the same time
  Vector processor: instruction operates on multiple data elements in consecutive time steps
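As a concrete illustration of "single instruction operates on multiple data elements," here is a small sketch using x86 SSE2 intrinsics (my example, not from the slides; it assumes an SSE2-capable machine): one _mm_add_epi32 adds four 32-bit elements where scalar code would need four separate adds.

```c
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int b[4] = {10, 20, 30, 40};
    int c[4];

    /* One SIMD instruction adds four 32-bit elements at once,
       versus four scalar add instructions. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi32(va, vb);
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 4; i++)
        printf("%d ", c[i]);   /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}
```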

9 ARRAY VS. VECTOR PROCESSORS
Example instruction stream over a 4-element vector:
  LD  VR <- A[3:0]
  ADD VR <- VR, 1
  MUL VR <- VR, 2
  ST  A[3:0] <- VR
Array processor: same operation on all elements at the same time, in different space (LD0-LD3 together, then AD0-AD3, then MU0-MU3, then ST0-ST3)
Vector processor: operations on the elements at different times, in the same space (LD0, LD1, LD2, ... in consecutive cycles, with AD0 starting while LD1 executes, and so on down the pipeline)
[Slide shows the time-space diagram contrasting the two executions]
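For reference, a scalar C loop equivalent to the four vector instructions above (a sketch; the function name and test values are mine). On a vector machine each line of the loop body becomes a single instruction applied to all four elements.

```c
#include <stdio.h>

/* Scalar equivalent of:  LD VR <- A[3:0];  ADD VR <- VR, 1;
   MUL VR <- VR, 2;  ST A[3:0] <- VR                          */
void scale_bias(int A[4]) {
    for (int i = 0; i < 4; i++) {
        int v = A[i];   /* LD  */
        v = v + 1;      /* ADD */
        v = v * 2;      /* MUL */
        A[i] = v;       /* ST  */
    }
}

int main(void) {
    int A[4] = {3, 2, 1, 0};
    scale_bias(A);
    for (int i = 0; i < 4; i++) printf("%d ", A[i]);   /* prints: 8 6 4 2 */
    printf("\n");
    return 0;
}
```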

10 SCALAR PROCESSING Conventional form of processing (von Neumann model)
add r1, r2, r3

11 SIMD ARRAY PROCESSING Array processor

12 VECTOR PROCESSOR ADVANTAGES
+ No dependencies within a vector
  Pipelining and parallelization work well
  Can have very deep pipelines, no dependencies!
+ Each instruction generates a lot of work
  Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
  Interleaving multiple banks for higher memory bandwidth
  Prefetching
+ No need to explicitly code loops
  Fewer branches in the instruction sequence

13 VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
   -- How about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.

14 VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

15 VECTOR MACHINE EXAMPLE: CRAY-1
Russell, "The CRAY-1 Computer System," CACM 1978.
Scalar and vector modes
8 64-element vector registers, 64 bits per element
16 memory banks
8 64-bit scalar registers
8 24-bit address registers

16 AMDAHL'S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
  time_enhanced = time_original * (1 - f) + time_original * (f / S)
  Speedup_overall = 1 / ((1 - f) + f / S)
[Slide shows a bar diagram splitting time_original into (1 - f) and f, and time_enhanced into (1 - f) and f/S]
Focus on bottlenecks with large f (and large S)
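A small worked example of the formula (mine, not from the slide): with f = 0.9 and S = 10 the overall speedup is 1 / (0.1 + 0.09) ≈ 5.3, and even an infinitely fast enhancement cannot push it past 10 because the serial 10% remains.

```c
#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the work
   is accelerated by a factor S. */
static double amdahl(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    printf("f=0.90, S=10   -> %.2f\n", amdahl(0.90, 10));    /* ~5.26 */
    printf("f=0.90, S=1e6  -> %.2f\n", amdahl(0.90, 1e6));   /* ~10: limited by 1-f */
    printf("f=0.99, S=10   -> %.2f\n", amdahl(0.99, 10));    /* ~9.17 */
    return 0;
}
```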

17 FLYNN'S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
SISD: Single instruction operates on a single data element
SIMD: Single instruction operates on multiple data elements
  Array processor
  Vector processor
MISD: Multiple instructions operate on a single data element
  Closest forms: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor
  Multithreaded processor

18 SYSTOLIC ARRAYS

19 WHY SYSTOLIC ARCHITECTURES?
Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements
  Different people work on the same car
  Many cars are assembled simultaneously
Why? Special-purpose accelerators/architectures need
  Simple, regular design (keep the number of unique parts small and regular)
  High concurrency → high performance
  Balanced computation and I/O (memory) bandwidth

20 SYSTOLIC ARRAYS
Memory: heart; PEs: cells
Memory pulses data through the cells
H. T. Kung, "Why Systolic Architectures?," IEEE Computer, 1982.

21 SYSTOLIC ARCHITECTURES
Basic principle: replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  Balance computation and memory bandwidth
Differences from pipelining:
  These are individual PEs
  Array structure can be non-linear and multi-dimensional
  PE connections can be multidirectional (and of different speeds)
  PEs can have local memory and execute kernels (rather than a piece of the instruction)

22 SYSTOLIC COMPUTATION EXAMPLE
Convolution
  Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  Many image processing tasks

23 SYSTOLIC ARCHITECTURE FOR CONVOLUTION

24 [Animation frame] x1 enters the array of PEs holding W3, W2, W1; y1 advances from 0 to w1·x1

25 [Animation frame] x2 enters; y1 advances from w1·x1 to w1·x1 + w2·x2

26 [Animation frame] x3 enters; y1 advances from w1·x1 + w2·x2 to w1·x1 + w2·x2 + w3·x3

27 CONVOLUTION
y1 = w1·x1 + w2·x2 + w3·x3
y2 = w1·x2 + w2·x3 + w3·x4
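The same computation written directly in C (a sketch; the array sizes and weight values are made up): the systolic designs that follow compute exactly these sums, but by pulsing the data through the PEs.

```c
#include <stdio.h>

/* Direct 1-D convolution as on the slide (0-indexed here):
   y[i] = w[0]*x[i] + w[1]*x[i+1] + w[2]*x[i+2], with K weights
   and N inputs producing N - K + 1 outputs.                     */
void convolve(const double *x, int N, const double *w, int K, double *y) {
    for (int i = 0; i + K <= N; i++) {
        double acc = 0.0;
        for (int j = 0; j < K; j++)
            acc += w[j] * x[i + j];
        y[i] = acc;
    }
}

int main(void) {
    double w[3] = {1.0, 2.0, 3.0};            /* w1, w2, w3 (made-up values) */
    double x[4] = {1.0, 2.0, 3.0, 4.0};       /* x1..x4 */
    double y[2];
    convolve(x, 4, w, 3, y);
    printf("y1 = %.1f  y2 = %.1f\n", y[0], y[1]);   /* y1 = 14.0, y2 = 20.0 */
    return 0;
}
```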

28 Convolution: Another Design

29 [Animation frame] x1 enters the array; weights W3, W2, W1 are stationary in the PEs

30 [Animation frame] x2 follows x1 into the array

31 [Animation frame] x3 enters; partial sum y1 is injected into the array

32 [Animation frame] y2 is injected; y1 = w3·x3 so far

33 [Animation frame] y3 is injected; y2 = w3·x4; y1 = w2·x2 + w3·x3

34 [Animation frame] y4 is injected; y3 = w3·x5; y2 = w2·x3 + w3·x4; y1 = w1·x1 + w2·x2 + w3·x3 is complete

35 [Animation frame] y5 is injected; y4 = w3·x6; y3 = w2·x4 + w3·x5; y2 = w1·x2 + w2·x3 + w3·x4 is complete
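Here is a rough C simulation of this second design (my sketch, not from the lecture): each PE keeps one stationary weight and one in-flight partial sum, a new partial sum is injected every cycle, and sums advance one PE per cycle, picking up w3·x, then w2·x, then w1·x, as in the frames above. For simplicity the inputs are read from the x array; in the real array they would stream through the PEs from the other direction. The weight and input values are made up.

```c
#include <stdio.h>

#define K 3                       /* number of weights = number of PEs */
#define N 8                       /* number of inputs                  */

int main(void) {
    double w[K] = {2.0, 1.0, 0.5};            /* w1, w2, w3 (made-up values) */
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[N - K + 1];

    /* Each PE k keeps one stationary weight w[k] and one in-flight partial
       sum; id[k] records which output that partial sum belongs to.         */
    double acc[K];
    int id[K];
    for (int k = 0; k < K; k++) id[k] = -1;   /* -1 = PE currently empty */

    for (int cycle = 0; cycle <= N; cycle++) {
        /* PE 0 holds a finished output: hand it back to memory. */
        if (id[0] >= 0) y[id[0]] = acc[0];

        /* Partial sums advance one PE per cycle (PE k+1 -> PE k); each PE
           adds its own weight times the matching input as the sum passes. */
        for (int k = 0; k < K - 1; k++) {
            id[k]  = id[k + 1];
            acc[k] = (id[k] >= 0) ? acc[k + 1] + w[k] * x[id[k] + k] : 0.0;
        }

        /* One new (fresh) partial sum enters the last PE every cycle. */
        if (cycle <= N - K) {
            id[K - 1]  = cycle;
            acc[K - 1] = w[K - 1] * x[cycle + K - 1];
        } else {
            id[K - 1] = -1;
        }
    }

    for (int i = 0; i <= N - K; i++)
        printf("y%d = %.1f\n", i + 1, y[i]);
    return 0;
}
```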

36 More Programmability
Each PE in a systolic array
  Can store multiple "weights"
  Weights can be selected on the fly
  Eases implementation of, e.g., adaptive filtering
Taken further
  Each PE can have its own data and instruction memory
  Data memory → to store partial/temporary results, constants
  Leads to stream processing, pipeline parallelism
  More generally, staged execution

37 SYSTOLIC ARRAYS: PROS AND CONS
Advantage: specialized (computation needs to fit the PE organization/functions)
  → improved efficiency, simple design, high concurrency/performance
  → good at doing more with less memory bandwidth requirement
Downside: specialized
  → not generally applicable, because the computation needs to fit the PE functions/organization

38 The WARP Computer
H. T. Kung, CMU, 1984-1988
Linear array of 10 cells, each cell a 10 MFLOPS programmable processor
Attached to a general-purpose host machine
HLL and optimizing compiler to program the systolic array
Used extensively to accelerate vision and robotics tasks
Annaratone et al., "Warp Architecture and Implementation," ISCA 1986.
Annaratone et al., "The Warp Computer: Architecture, Implementation, and Performance," IEEE TC 1987.

39 The WARP Computer

40 The WARP Computer

41 AGENDA
Review from last lecture
Flynn's taxonomy of computers
Single core → multi-core → accelerators

42 MULTIPLE CORES ON CHIP
Simpler and lower power than a single large core
Large-scale parallelism on chip
  AMD Barcelona: 4 cores
  Intel Core i7: 8 cores
  IBM Cell BE: 8+1 cores
  IBM POWER7: 8 cores
  Nvidia Fermi: 448 "cores"
  Intel SCC: 48 cores, networked
  Tilera TILE Gx: 100 cores, networked
  Sun Niagara II: 8 cores

43 MOORE’S LAW Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.

44 [Figure-only slide]

45 MULTI-CORE
Idea: put multiple processors on the same die
Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
What else could you do with the die area you dedicate to multiple processors?
  Have a bigger, more powerful core
  Have larger caches in the memory hierarchy
  Integrate platform components on chip (e.g., network interface, memory controllers)

46 WHY MULTI-CORE?
Alternative: bigger, more powerful single core
  Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance remain elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled up. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this are also elusive)

47 MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming; see the sketch below)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
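A minimal pthreads sketch of the first disadvantage (my example, not from the lecture; compile with -pthread): the extra cores only help once the programmer splits the work into parallel tasks, here one partial sum per thread, combined at the end.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1 << 20)

static double data[N];

typedef struct { int lo, hi; double sum; } Chunk;

/* Each thread sums its own chunk; the cores work independently. */
static void *partial_sum(void *arg) {
    Chunk *c = (Chunk *)arg;
    c->sum = 0.0;
    for (int i = c->lo; i < c->hi; i++)
        c->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NTHREADS];
    Chunk chunk[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        chunk[t].lo = t * (N / NTHREADS);
        chunk[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, partial_sum, &chunk[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunk[t].sum;   /* combine the per-thread results */
    }
    printf("total = %.0f\n", total);   /* 1048576 */
    return 0;
}
```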


