
1 COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia, Sep 4, 2017
The content and concept of this course are adapted from CMU ECE 740

2 AGENDA
Logistics
Review from last lecture
Fundamental concepts
  Computing models
  ISA Tradeoffs

3 LOGISTICS
Review 2: due Wednesday
  Dennis and Misunas, "A preliminary architecture for a basic data flow processor," ISCA 1974.
  Kung, H. T., "Why Systolic Architectures?," IEEE Computer 1982.
Participation
  Discuss on Piazza and in class (5% of grade)
Project list
  Will be open Wednesday; start early
  Be prepared to spend time on the project

4 FLYNN'S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
SISD: Single instruction operates on a single data element
SIMD: Single instruction operates on multiple data elements
  Array processor
  Vector processor
MISD: Multiple instructions operate on a single data element
  Closest forms: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor
  Multithreaded processor

5 SIMD PROCESSING
Single instruction operates on multiple data elements
  In time or in space
Multiple processing elements
Time-space duality
  Array processor: instruction operates on multiple data elements at the same time
  Vector processor: instruction operates on multiple data elements in consecutive time steps

6 SCALAR CODE EXAMPLE
For i = 0 to 49
  C[i] = (A[i] + B[i]) / 2
Scalar code (cycles per instruction on the right):
  MOVI R0 = 50             1
  MOVA R1 = A              1
  MOVA R2 = B              1
  MOVA R3 = C              1
X: LD R4 = MEM[R1++]       11   ;autoincrement addressing
  LD R5 = MEM[R2++]        11
  ADD R6 = R4 + R5         4
  SHFR R7 = R6 >> 1        1
  ST MEM[R3++] = R7        11
  DECBNZ --R0, X           2    ;decrement and branch if NZ
304 dynamic instructions (4 setup + 50 iterations x 6 instructions)
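
For reference, a minimal C version of the loop the scalar assembly implements (a sketch; the shift right by 1 stands in for the divide by 2, just as the SHFR instruction does):

    #include <stdint.h>

    #define N 50

    /* Scalar loop: one element per iteration, as in the assembly above. */
    void avg_scalar(const uint32_t A[N], const uint32_t B[N], uint32_t C[N])
    {
        for (int i = 0; i < N; i++)
            C[i] = (A[i] + B[i]) >> 1;   /* divide by 2 via logical shift right */
    }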

7 VECTOR CODE EXAMPLE
A loop is vectorizable if each iteration is independent of any other
For i = 0 to 49
  C[i] = (A[i] + B[i]) / 2
Vectorized loop (cycles per instruction on the right; VLN = vector length = 50):
  MOVI VLEN = 50           1
  MOVI VSTR = 1            1
  VLD V0 = A               11 + VLN - 1
  VLD V1 = B               11 + VLN - 1
  VADD V2 = V0 + V1        4 + VLN - 1
  VSHFR V3 = V2 >> 1       1 + VLN - 1
  VST C = V3               11 + VLN - 1
7 dynamic instructions
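
The same computation in C using GCC/Clang vector extensions, a rough software analogue of the vector instructions above (a sketch, not the slide's ISA; the 4-wide vec4 type is an illustrative choice, and the scalar tail handles the two elements left over because 50 is not a multiple of 4):

    #include <stdint.h>
    #include <string.h>

    #define N 50

    typedef uint32_t vec4 __attribute__((vector_size(16)));  /* 4 x 32-bit lanes */

    void avg_vector(const uint32_t *A, const uint32_t *B, uint32_t *C)
    {
        int i = 0;
        for (; i + 4 <= N; i += 4) {          /* one "vector instruction" per 4 elements */
            vec4 va, vb, vc;
            memcpy(&va, &A[i], sizeof va);    /* memcpy: safe unaligned vector load */
            memcpy(&vb, &B[i], sizeof vb);
            vc = (va + vb) >> 1;              /* element-wise add and shift, like VADD/VSHFR */
            memcpy(&C[i], &vc, sizeof vc);
        }
        for (; i < N; i++)                    /* scalar tail: 50 % 4 = 2 elements */
            C[i] = (A[i] + B[i]) >> 1;
    }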

8 SCALAR CODE EXECUTION TIME
Scalar execution time on an in-order processor with 1 memory bank
  First two loads in the loop cannot be pipelined: 2*11 cycles
  4 + 50*40 = 2004 cycles
Scalar execution time on an in-order processor with 16 banks (word-interleaved)
  First two loads in the loop can be pipelined
  4 + 50*30 = 1504 cycles
Why 16 banks?
  11-cycle memory access latency
  Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
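
The slide's arithmetic, spelled out (a sketch; per-iteration costs use the instruction latencies from slide 6, and the 16-bank case assumes the second load issues one cycle after the first, hiding 10 of its 11 cycles):

    #include <stdio.h>

    int main(void)
    {
        int setup = 4;    /* MOVI + 3x MOVA, 1 cycle each */
        int iters = 50;

        /* 1 bank: everything serializes: 11 + 11 + 4 + 1 + 11 + 2 = 40 cycles/iteration */
        int per_iter_1bank  = 11 + 11 + 4 + 1 + 11 + 2;
        /* 16 banks: the two loads overlap, saving 10 cycles per iteration */
        int per_iter_16bank = per_iter_1bank - 10;

        printf("1 bank:   %d cycles\n", setup + iters * per_iter_1bank);   /* 2004 */
        printf("16 banks: %d cycles\n", setup + iters * per_iter_16bank);  /* 1504 */
        return 0;
    }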

9 VECTOR CODE EXECUTION TIME
No chaining, i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
16 memory banks (word-interleaved)
285 cycles
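
The 285-cycle total, reproduced under the assumption that the vector instructions run back to back with no overlap (no chaining, and a single memory port serializes the two loads); a sketch using the startup latencies from slide 7:

    #include <stdio.h>

    int main(void)
    {
        int vlen = 50;
        int total = 1 + 1;                       /* MOVI VLEN, MOVI VSTR: 1 cycle each */
        int startup[] = {11, 11, 4, 1, 11};      /* VLD, VLD, VADD, VSHFR, VST */

        for (int i = 0; i < 5; i++)
            total += startup[i] + vlen - 1;      /* each vector op: startup + VLN - 1 */

        printf("%d cycles\n", total);            /* 2 + 60 + 60 + 53 + 50 + 60 = 285 */
        return 0;
    }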

10 VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.

11 VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

12 FURTHER READING: SIMD
Recommended
  H&P, Appendix on Vector Processors
  Russell, "The CRAY-1 computer system," CACM 1978.

13 VECTOR MACHINE EXAMPLE: CRAY-1
Russell, "The CRAY-1 computer system," CACM 1978.
Scalar and vector modes
8 64-element vector registers
  64 bits per element
16 memory banks
8 64-bit scalar registers
8 24-bit address registers

14 AMDAHL'S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
  time_enhanced = time_original*(1 - f) + time_original*(f/S)
  Speedup_overall = 1 / ((1 - f) + f/S)
[Diagram: time_original split into segments (1 - f) and f; time_enhanced split into (1 - f) and f/S]
Focus on bottlenecks with large f (and large S)
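
A small numeric check of the formula (values chosen only for illustration):

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction f of the task
       is accelerated by a factor S. */
    double speedup(double f, double S)
    {
        return 1.0 / ((1.0 - f) + f / S);
    }

    int main(void)
    {
        printf("%.2f\n", speedup(0.5, 10.0));  /* 1.82: half the task made 10x faster */
        printf("%.2f\n", speedup(0.9, 10.0));  /* 5.26: a large f pays off */
        printf("%.2f\n", speedup(0.5, 1e9));   /* 2.00: capped at 1/(1-f), however large S gets */
        return 0;
    }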

15 FLYNN'S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
SISD: Single instruction operates on a single data element
SIMD: Single instruction operates on multiple data elements
  Array processor
  Vector processor
MISD: Multiple instructions operate on a single data element
  Closest forms: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor
  Multithreaded processor

16 SYSTOLIC ARRAYS

17 WHY SYSTOLIC ARCHITECTURES?
Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements
  Different people work on the same car
  Many cars are assembled simultaneously
Why? Special-purpose accelerators/architectures need
  Simple, regular design (keep # of unique parts small and regular)
  High concurrency -> high performance
  Balanced computation and I/O (memory) bandwidth

18 SYSTOLIC ARRAYS
Memory: heart
PEs: cells
Memory pulses data through the cells
H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.

19 SYSTOLIC ARCHITECTURES
Basic principle: replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
  Balance computation and memory bandwidth
Differences from pipelining:
  These are individual PEs
  Array structure can be non-linear and multi-dimensional
  PE connections can be multidirectional (and of different speeds)
  PEs can have local memory and execute kernels (rather than a piece of the instruction)

20 SYSTOLIC COMPUTATION EXAMPLE
Convolution
  Used in filtering, pattern matching, correlation, polynomial evaluation, etc.
  Many image processing tasks

21 CONVOLUTION
y1 = w1*x1 + w2*x2 + w3*x3
y2 = w1*x2 + w2*x3 + w3*x4
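
A minimal software simulation of this 3-tap convolution in systolic style, modeled on the variant in Kung's paper where the weights stay in the PEs, each input reaches every PE on the same beat, and partial sums march one PE per beat (a sketch; the weight and input values are illustrative):

    #include <stdio.h>

    #define TAPS 3
    #define NX   4

    int main(void)
    {
        int w[TAPS] = {1, 2, 3};          /* w1..w3, stationary, one per PE */
        int x[NX]   = {10, 20, 30, 40};   /* input stream x1..x4 */
        int y[TAPS] = {0};                /* partial-sum register in each PE */

        /* One beat per input: every PE adds its own w*x term to the partial
           sum taken from its left neighbor; a finished y exits the last PE. */
        for (int t = 0; t < NX; t++) {
            for (int k = TAPS - 1; k > 0; k--)
                y[k] = y[k - 1] + w[k] * x[t];
            y[0] = w[0] * x[t];           /* first PE starts a new output */
            if (t >= TAPS - 1)
                printf("y%d = %d\n", t - TAPS + 2, y[TAPS - 1]);  /* y1 = 140, y2 = 200 */
        }
        return 0;
    }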

22 SYSTOLIC ARRAYS: PROS AND CONS
Advantage
  Specialized (computation needs to fit the PE organization/functions)
  -> improved efficiency, simple design, high concurrency/performance
  -> can do more with less memory bandwidth
Downside
  Specialized -> not generally applicable, because the computation needs to fit the PE functions/organization

23 AGENDA
Logistics
Review from last lecture
Fundamental concepts
  Computing models
  ISA Tradeoffs

24 LEVELS OF TRANSFORMATION
[Stack diagram: Problem / Algorithm / Program/Language / ISA / Microarchitecture / Logic / Circuits]
ISA
  Agreed-upon interface between software and hardware
    SW/compiler assumes, HW promises
  What the software writer needs to know to write system/user programs
  ISA is the interface between hardware and software: a contract that the hardware promises to satisfy; the builder/user interface
Microarchitecture
  Specific implementation of an ISA
  Not visible to the software
Microprocessor
  ISA, uarch, circuits
  "Architecture" = ISA + microarchitecture

25 ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
  Gas pedal: interface for "acceleration"
  Internals of the engine: implement "acceleration"
  Add instruction vs. adder implementation
The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  Bit-serial, ripple-carry, and carry-lookahead adders
  The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
Uarch usually changes faster than ISA
  Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  Why?
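
A toy illustration of the contract (a sketch: both functions satisfy the same ISA-level 32-bit add, but the second models a ripple-carry implementation that produces one sum bit at a time):

    #include <stdint.h>
    #include <assert.h>

    uint32_t add_native(uint32_t a, uint32_t b) { return a + b; }

    /* Ripple-carry: each bit position computes a sum bit and a carry-out
       that "ripples" into the next position, as a ripple-carry adder would. */
    uint32_t add_ripple(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0, carry = 0;
        for (int i = 0; i < 32; i++) {
            uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            sum  |= (ai ^ bi ^ carry) << i;           /* sum bit */
            carry = (ai & bi) | (carry & (ai ^ bi));  /* carry-out */
        }
        return sum;
    }

    int main(void)
    {
        /* Same ISA-visible behavior, different "hardware": the contract holds. */
        assert(add_ripple(0xFFFFFFFFu, 1u) == add_native(0xFFFFFFFFu, 1u));
        assert(add_ripple(123456789u, 987654321u) == add_native(123456789u, 987654321u));
        return 0;
    }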

26 ISA
Instructions
  Opcodes, Addressing Modes, Data Types
  Instruction Types and Formats
  Registers, Condition Codes
Memory
  Address space, Addressability, Alignment
  Virtual memory management
Call, Interrupt/Exception Handling
Access Control, Priority/Privilege
I/O
Task Management
Power and Thermal Management
Multi-threading support, Multiprocessor support

27 Microarchitecture
Implementation of the ISA under specific design constraints and goals
Anything done in hardware without exposure to software
  Pipelining
  In-order versus out-of-order instruction execution
  Memory access scheduling policy
  Speculative execution
  Superscalar processing (multiple instruction issue?)
  Clock gating
  Caching? Levels, size, associativity, replacement policy
  Prefetching?
  Voltage/frequency scaling?
  Error correction?

28 COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia, Sep 4, 2017
The content and concept of this course are adapted from CMU ECE 740

