
1 COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia, Sep 10, 2018
The content and concept of this course are adapted from CMU ECE 740

2 AGENDA
Review from last lecture
Fundamental concepts
- Computing models
- ISA tradeoffs

3 FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
- Array processor
- Vector processor
MISD: Multiple instructions operate on single data element
- Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
- Multiprocessor
- Multithreaded processor

4 VECTOR PROCESSOR
++ Vector operations
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   -- Very inefficient if parallelism is irregular
   -- How about searching for a key in a linked list?
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks
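To make the slide's contrast concrete, here is a minimal C sketch (illustrative code, not from the deck). The first loop has regular, independent iterations that a vectorizing compiler can map onto vector/SIMD instructions; the linked-list search is inherently serial, since each step needs the pointer produced by the previous one.

```c
/* Regular (data/SIMD) parallelism: iteration i touches only a[i], b[i],
   c[i], so all iterations are independent and vectorize cleanly. */
void vec_add(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Irregular parallelism: the address of the next element is not known
   until the current one has been loaded, so the traversal is a serial
   chain of dependent loads, a poor fit for vector hardware. */
struct node { int key; struct node *next; };

struct node *find(struct node *head, int key) {
    for (struct node *p = head; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}
```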

5 VECTOR MACHINE EXAMPLE: CRAY-1
Russell, “The CRAY-1 computer system,” CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
  - 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers

6 AMDAHL’S LAW: BOTTLENECK ANALYSIS
Speedup = time_without_enhancement / time_with_enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:
time_enhanced = time_original·(1-f) + time_original·(f/S)
Speedup_overall = 1 / ((1-f) + f/S)
[Figure: time_original split into portions (1-f) and f; time_enhanced split into (1-f) and f/S]
Focus on bottlenecks with large f (and large S)
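A quick worked example (a minimal sketch; the numbers are illustrative, not from the slides). Speeding up 90% of a task by 10x yields only about 5.3x overall, and even an unboundedly large S cannot push the speedup past 1/(1-f):

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the task is
   accelerated by a factor S. */
static double amdahl(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    /* 90% of the work accelerated 10x -> ~5.26x overall. */
    printf("f=0.90, S=10   -> %.2fx\n", amdahl(0.90, 10.0));
    /* Even as S grows without bound, the untouched 10% caps the
       overall speedup at 1/(1-f) = 10x. */
    printf("f=0.90, S=1e9  -> %.2fx\n", amdahl(0.90, 1e9));
    return 0;
}
```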

7 FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
- Array processor
- Vector processor
MISD: Multiple instructions operate on single data element
- Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
- Multiprocessor
- Multithreaded processor

8 SYSTOLIC ARRAYS

9 WHY SYSTOLIC ARCHITECTURES?
Idea: data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements:
- Different people work on the same car
- Many cars are assembled simultaneously
Why? Special-purpose accelerators/architectures need:
- Simple, regular design (keep # of unique parts small and regular)
- High concurrency → high performance
- Balanced computation and I/O (memory) bandwidth

10 SYSTOLIC ARRAYS
Analogy: memory is the heart, the PEs are the cells; memory pulses data through the cells.
H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982.

11 SYSTOLIC ARCHITECTURES
Basic principle: replace one PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
- Balance computation and memory bandwidth
Differences from pipelining:
- These are individual PEs
- Array structure can be non-linear and multi-dimensional
- PE connections can be multidirectional (and of different speeds)
- PEs can have local memory and execute kernels (rather than a piece of the instruction)

12 SYSTOLIC COMPUTATION EXAMPLE
Convolution: used in filtering, pattern matching, correlation, polynomial evaluation, etc.
Many image processing tasks

13 SYSTOLIC ARCHITECTURE FOR CONVOLUTION

14 Step 1: x1 enters the array; the PE holding w1 computes y1 = w1x1 (y1 starts at 0). The weights w3, w2, w1 stay resident in the PEs.

15 Step 2: x2 enters; the partial sum becomes y1 = w1x1 + w2x2.

16 Step 3: x3 enters; the result is complete: y1 = w1x1 + w2x2 + w3x3.

17 CONVOLUTION
y1 = w1x1 + w2x2 + w3x3
y2 = w1x2 + w2x3 + w3x4
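A plain-C reference for the computation the array performs (an illustrative sketch, not from the deck). The systolic version produces exactly these sums, but instead of re-reading memory for every product it streams the x values through a chain of PEs, each holding one weight and adding into a passing partial result:

```c
/* 1-D convolution as on the slides (0-based indexing):
   y[i] = w[0]*x[i] + w[1]*x[i+1] + ... + w[k-1]*x[i+k-1],
   i.e., y1 = w1x1 + w2x2 + w3x3 and y2 = w1x2 + w2x3 + w3x4 for k = 3. */
void conv1d(float *y, const float *x, const float *w, int n, int k) {
    for (int i = 0; i + k <= n; i++) {  /* one output per window position */
        float acc = 0.0f;
        for (int j = 0; j < k; j++)
            acc += w[j] * x[i + j];
        y[i] = acc;
    }
}
```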

18 SYSTOLIC ARRAYS: PROS AND CONS
Advantage:
- Specialized (computation needs to fit PE organization/functions) → improved efficiency, simple design, high concurrency/performance
- Good at doing more with a smaller memory bandwidth requirement
Downside:
- Specialized → not generally applicable, because the computation needs to fit the PE functions/organization

19 ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
- Gas pedal: interface for “acceleration”
- Internals of the engine: implement “acceleration”
- Add instruction vs. adder implementation
The implementation (uarch) can vary as long as it satisfies the specification (ISA)
- Bit-serial, ripple-carry, carry-lookahead adders
- The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
The uarch usually changes faster than the ISA
- Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
- Why?
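The add example, sketched in C (illustrative only): both functions meet the same specification, the interface of a 32-bit add, while the second mimics a ripple-carry organization, one of the several implementations the slide lists.

```c
#include <stdint.h>

/* The specification (ISA view): a 32-bit add. */
uint32_t add_spec(uint32_t a, uint32_t b) { return a + b; }

/* One possible implementation (uarch view): ripple-carry style,
   computing one bit at a time and propagating the carry. Slower,
   but behaviorally identical; the interface is unchanged. */
uint32_t add_ripple(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        sum |= (ai ^ bi ^ carry) << i;                    /* sum bit   */
        carry = (ai & bi) | (ai & carry) | (bi & carry);  /* carry out */
    }
    return sum;
}
```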

20 TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
ISA-level tradeoffs
Uarch-level tradeoffs
System- and task-level tradeoffs
How to divide the labor between hardware and software

21 ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap
- Closer to high-level language (HLL) or closer to hardware control signals?
- → Complex vs. simple instructions
- RISC vs. CISC vs. HLL machines
  - FFT, QUICKSORT, POLY, FP instructions?
  - VAX INDEX instruction (array access with bounds checking), e.g., A[i][j][k] in one instruction with a bounds check
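To see what a single bounds-checked access buys, here is a simplified C sketch of INDEX-like semantics (a hypothetical helper, loosely modeled on the description in the VAX Architecture Handbook; the real instruction raises a subscript-range trap rather than calling abort()). A[i][j][k] becomes three such steps, each a single INDEX instruction on the VAX:

```c
#include <stdio.h>
#include <stdlib.h>

/* Check one subscript against its bounds and fold it into a running
   offset: index_out = (index_in + subscript) * size. */
long index_step(long index_in, long subscript, long low, long high, long size) {
    if (subscript < low || subscript > high) {  /* bounds check */
        fprintf(stderr, "subscript out of range\n");
        abort();                                /* VAX traps instead */
    }
    return (index_in + subscript) * size;
}

/* A[i][j][k] for a D1 x D2 x D3 array of longs: the chained steps
   compute (i*D2 + j)*D3 + k, checking every subscript on the way. */
long bounds_checked_load(const long *A, long i, long j, long k,
                         long D1, long D2, long D3) {
    long off = 0;
    off = index_step(off, i, 0, D1 - 1, D2);
    off = index_step(off, j, 0, D2 - 1, D3);
    off = index_step(off, k, 0, D3 - 1, 1);
    return A[off];
}
```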

22 SEMANTIC GAP
[Figure: a stack from High-Level Language (software) down to Control Signals (hardware), with the ISA in between; the semantic gap is the distance between the HLL and the ISA]

23 SEMANTIC GAP
[Figure: the same stack, showing that CISC places the ISA closer to the high-level language while RISC places it closer to the hardware control signals]

24 ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap
- Closer to high-level language (HLL) or closer to hardware control signals?
- → Complex vs. simple instructions; RISC vs. CISC vs. HLL machines
- FFT, QUICKSORT, POLY, FP instructions? VAX INDEX instruction (array access with bounds checking)
Tradeoffs:
- Simple compiler, complex hardware vs. complex compiler, simple hardware
- Caveat: translation (indirection) can change the tradeoff!
- Burden of backward compatibility
- Performance?
  - Optimization opportunity: for the VAX INDEX instruction, who (compiler vs. hardware) puts more effort into optimization?
  - Instruction size, code size

25 X86: SMALL SEMANTIC GAP: STRING OPERATIONS
REP MOVS DEST SRC
How many instructions does this take in Alpha?
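For comparison, the same block copy as a load/store ISA must express it (a minimal C sketch, not actual Alpha assembly): per byte, a load, a store, two pointer increments, a count decrement, and a conditional branch, versus one REP MOVS on x86.

```c
#include <stddef.h>

/* What REP MOVS does in a single x86 instruction: copy n bytes from
   src to dest. A load/store ISA has to spell out the loop, so the
   compiler emits several instructions per iteration (or an unrolled,
   word-at-a-time variant of the same idea). */
void block_copy(unsigned char *dest, const unsigned char *src, size_t n) {
    while (n--)
        *dest++ = *src++;
}
```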

26 SMALL SEMANTIC GAP EXAMPLES IN VAX
FIND FIRST: find the first set bit in a bit field; helps OS resource allocation operations
SAVE CONTEXT, LOAD CONTEXT: special context-switching instructions
INSQUE, REMQUE: operations on a doubly linked list
INDEX: array access with bounds checking
STRING operations: compare strings, find substrings, …
Cyclic redundancy check instruction
EDITPC: implements editing functions to display fixed-format output
Digital Equipment Corp., “VAX Architecture Handbook”

27 CISC vs. RISC
Which one is easy to optimize?
  RISC-style copy loop:      CISC:
    X: MOV                   REPMOVS
       ADD
       COMP
       JMP X

28 SMALL VERSUS LARGE SEMANTIC GAP
CISC vs. RISC
- Complex instruction set computer → complex instructions; initially motivated by “not good enough” code generation
- Reduced instruction set computer → simple instructions; John Cocke, mid 1970s, IBM 801
  - Goal: enable better compiler control and optimization
RISC motivated by:
- Memory stalls (no work done in a complex instruction when there is a memory stall? When is this correct?)
- Simplifying the hardware → lower cost, higher frequency
- Enabling the compiler to optimize the code better
  - Find fine-grained parallelism to reduce stalls

29 SMALL VERSUS LARGE SEMANTIC GAP
John Cocke’s RISC (large semantic gap) concept: compiler generates control signals (open microcode)
Advantages of small semantic gap (complex instructions):
+ Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
+ Simpler compiler
Disadvantages:
- Larger chunks of work → compiler has less opportunity to optimize
- More complex hardware → translation to control signals and optimization needs to be done by hardware
Read Colwell et al., “Instruction Sets and Beyond: Computers, Complexity, and Controversy,” IEEE Computer 1985.

30 HOW HIGH OR LOW CAN YOU GO?
Very large semantic gap:
- Each instruction specifies the complete set of control signals in the machine
- Compiler generates control signals
- Open microcode (John Cocke, 1970s)
- Gave way to optimizing compilers
Very small semantic gap:
- ISA is (almost) the same as a high-level language
- Java machines, LISP machines, object-oriented machines, capability-based machines

31 EFFECT OF TRANSLATION One can translate from one ISA to another ISA to change the semantic gap tradeoffs Examples Intel’s and AMD’s x86 implementations translate x86 instructions into programmer-invisible microoperations (simple instructions) in hardware Transmeta’s x86 implementations translated x86 instructions into “secret” VLIW instructions in software (code morphing software) Think about the tradeoffs

32 TRANSLATION LAYER
[Figure: two stacks. Left: the semantic gap spans from the High-Level Language (software) through the ISA down to Control Signals (hardware). Right: the x86 ISA sits above a translation layer that produces uops (a uISA); the translation layer is not exposed to the programmer]

33 COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models
Samira Khan, University of Virginia, Sep 10, 2018
The content and concept of this course are adapted from CMU ECE 740

