
1 COMPUTER ARCHITECTURE CS 6354
Fundamental Concepts: Computing Models and ISA Tradeoffs
Samira Khan, University of Virginia
Jan 28, 2016
The content and concepts of this course are adapted from CMU ECE 740.

2 AGENDA
Project Proposal and Ideas
Review from last lecture
Fundamental concepts
– Computing models
– ISA Tradeoffs

3 RESEARCH PROJECT
Your chance to explore, in depth, a computer architecture topic that interests you
Perhaps even publish your innovation in a top computer architecture conference
Start thinking about your project topic now!
Interact with me and the TA
Read the project topics handout carefully
Groups of 2-3 students (will finalize this later)
Proposal due: Feb 18

4 RESEARCH PROJECT
Goal: Develop (new) insight
– Solve a problem in a new way, or evaluate/analyze systems/ideas
– Type 1: Develop new ideas to solve an important problem; rigorously evaluate the benefits and limitations of the ideas
– Type 2: Derive insight from rigorous analysis and understanding of existing systems or previously proposed ideas; propose potential new solutions based on the new insight
The problem and ideas need to be concrete
The problem and goals need to be very clear

5 RESEARCH PROPOSAL OUTLINE
The Problem: What is the problem you are trying to solve?
– Define it very clearly. Explain why it is important.
Novelty: Why has previous research not solved this problem? What are its shortcomings?
– Describe/cite all relevant works you know of and explain why these works are inadequate to solve the problem. This will be your literature survey.
Idea: What is your initial idea/insight? What new solution are you proposing to the problem? Why does it make sense? How does/could it solve the problem better?
Hypothesis: What is the main hypothesis you will test?
Methodology: How will you test the hypothesis/ideas? Describe what simulator or model you will use and what initial experiments you will do.
Plan: Describe the steps you will take. What will you accomplish by Milestone 1, Milestone 2, and the Final Report? Give 75%, 100%, 125%, and moonshot goals.
All research projects can (and should) be described in this fashion.

6 HEILMEIER’S CATECHISM (VERSION 1)
What are you trying to do? Articulate your objectives using absolutely no jargon.
How is it done today, and what are the limits of current practice?
What's new in your approach and why do you think it will be successful?
Who cares? If you're successful, what difference will it make?
What are the risks and the payoffs?
How much will it cost? How long will it take?
What are the midterm and final "exams" to check for success?

7 HEILMEIER’S CATECHISM (VERSION 2)
What is the problem? Why is it hard?
How is it solved today?
What is the new technical idea? Why can we succeed now?
What is the impact if successful?
http://en.wikipedia.org/wiki/George_H._Heilmeier

8 SUPPLEMENTARY READINGS ON RESEARCH, WRITING, REVIEWS
Hamming, "You and Your Research," Bell Communications Research Colloquium Seminar, 7 March 1986. http://www.cs.virginia.edu/~robins/YouAndYourResearch.html
Levin and Redell, "How (and How Not) to Write a Good Systems Paper," OSR 1983.
Smith, "The Task of the Referee," IEEE Computer 1990.
– Read this to get an idea of the publication process
SP Jones, "How to Write a Great Research Paper"
Fong, "How to Write a CS Research Paper: A Bibliography"

9 WHERE TO GET PROJECT TOPICS/IDEAS
Project topics handout
Assigned readings
– Mutlu and Subramanian, "Research Problems and Opportunities in Memory Systems"
Recent conference proceedings
– ISCA: http://www.informatik.uni-trier.de/~ley/db/conf/isca/
– MICRO: http://www.informatik.uni-trier.de/~ley/db/conf/micro/
– HPCA: http://www.informatik.uni-trier.de/~ley/db/conf/hpca/
– ASPLOS: http://www.informatik.uni-trier.de/~ley/db/conf/asplos/

10 LAST LECTURE RECAP
Why Study Computer Architecture?
Von Neumann Model
Data Flow Architecture
SIMD
– Array
– Vector

11 REVIEW: THE DATA FLOW MODEL
Von Neumann model: An instruction is fetched and executed in control flow order
– As specified by the instruction pointer
– Sequential unless an explicit control flow instruction redirects execution
Dataflow model: An instruction is fetched and executed in data flow order
– i.e., when its operands are ready
– i.e., there is no instruction pointer
– Instruction ordering is specified by data flow dependences
Each instruction specifies "who" should receive its result
An instruction can "fire" whenever all of its operands have been received
– Potentially many instructions can execute at the same time
– Inherently more parallel
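
The firing rule is the essence of the model. As a concrete illustration, here is a minimal sketch of a dataflow interpreter in C; the node layout and the (2 + 3) * 4 example are invented for illustration and are not taken from the lecture:

    /* Minimal sketch of the dataflow firing rule (illustrative only):
     * a node fires once all of its operands have arrived, then forwards
     * its result to the consumer it names. No instruction pointer. */
    #include <stdio.h>

    typedef struct Node Node;
    struct Node {
        const char *name;
        int needed;              /* operands still outstanding */
        int operand[2];          /* operand slots */
        int (*op)(int, int);     /* operation applied on firing */
        Node *consumer;          /* "who" receives the result */
        int slot;                /* operand slot at the consumer */
    };

    static int add(int a, int b) { return a + b; }
    static int mul(int a, int b) { return a * b; }

    /* Deliver a token; fire as soon as all operands are present. */
    static void send(Node *n, int which, int value) {
        n->operand[which] = value;
        if (--n->needed == 0) {
            int r = n->op(n->operand[0], n->operand[1]);
            printf("%s fires -> %d\n", n->name, r);
            if (n->consumer)
                send(n->consumer, n->slot, r);
        }
    }

    int main(void) {
        /* Dataflow graph for (2 + 3) * 4. */
        Node m = { "mul", 2, {0, 0}, mul, NULL, 0 };
        Node a = { "add", 2, {0, 0}, add, &m, 0 };
        send(&a, 0, 2);   /* tokens arrive; execution order is driven */
        send(&m, 1, 4);   /* purely by data availability, not by an   */
        send(&a, 1, 3);   /* instruction pointer                      */
        return 0;
    }

The multiply cannot fire until both the add's result and the constant 4 have arrived, exactly the data-flow-order execution described above.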

12 REVIEW: FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
– Array processor
– Vector processor
MISD: Multiple instructions operate on single data element
– Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
– Multiprocessor
– Multithreaded processor

13 REVIEW: SIMD PROCESSING
Single instruction operates on multiple data elements
– In time or in space
Multiple processing elements
Time-space duality
– Array processor: Instruction operates on multiple data elements at the same time
– Vector processor: Instruction operates on multiple data elements in consecutive time steps

14 REVIEW: VECTOR PROCESSOR ADVANTAGES
+ No dependencies within a vector
– Pipelining and parallelization work well
– Can have very deep pipelines: no dependencies!
+ Each instruction generates a lot of work
– Reduces instruction fetch bandwidth
+ Highly regular memory access pattern
– Interleaving multiple banks for higher memory bandwidth
– Prefetching
+ No need to explicitly code loops
– Fewer branches in the instruction sequence

15 SCALAR CODE EXAMPLE
For i = 0 to 49
– C[i] = (A[i] + B[i]) / 2
Scalar code (per-instruction latency in cycles on the right):

    MOVI R0 = 50           ; 1
    MOVA R1 = A            ; 1
    MOVA R2 = B            ; 1
    MOVA R3 = C            ; 1
X:  LD R4 = MEM[R1++]      ; 11  (autoincrement addressing)
    LD R5 = MEM[R2++]      ; 11
    ADD R6 = R4 + R5       ; 4
    SHFR R7 = R6 >> 1      ; 1   (shift right by 1 divides by 2)
    ST MEM[R3++] = R7      ; 11
    DECBNZ --R0, X         ; 2   (decrement and branch if not zero)

304 dynamic instructions (4 setup + 50 iterations × 6 instructions)

16 VECTOR PROCESSORS
A vector is a one-dimensional array of numbers
Many scientific/commercial programs use vectors

    for (i = 0; i <= 49; i++)
        C[i] = (A[i] + B[i]) / 2;

A vector processor is one whose instructions operate on vectors rather than scalar (single data) values
Basic requirements
– Need to load/store vectors → vector registers (contain vectors)
– Need to operate on vectors of different lengths → vector length register (VLEN)
– Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR)
Stride: distance between two elements of a vector (see the sketch below)
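
To make the stride concept concrete, here is a minimal sketch in C of what a strided vector load does; the function name and types are illustrative, not an actual ISA definition:

    /* Illustrative semantics of a strided vector load, VLD V0 = A:
     * gather VLEN elements spaced VSTR words apart (sketch only). */
    #include <stddef.h>

    void vld(long vreg[], const long *base, size_t vlen, size_t vstr) {
        for (size_t i = 0; i < vlen; i++)
            vreg[i] = base[i * vstr];  /* VSTR = 1: consecutive elements */
    }

With VSTR = 1 the load walks consecutive elements, as in the loop above; accessing a column of a row-major matrix would instead use a stride equal to the row length.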

17 VECTOR CODE EXAMPLE
A loop is vectorizable if each iteration is independent of any other
For i = 0 to 49
– C[i] = (A[i] + B[i]) / 2
Vectorized loop (per-instruction latency in cycles on the right):

    MOVI VLEN = 50         ; 1
    MOVI VSTR = 1          ; 1
    VLD V0 = A             ; 11 + VLEN - 1
    VLD V1 = B             ; 11 + VLEN - 1
    VADD V2 = V0 + V1      ; 4 + VLEN - 1
    VSHFR V3 = V2 >> 1     ; 1 + VLEN - 1
    VST C = V3             ; 11 + VLEN - 1

7 dynamic instructions
(11 + VLEN - 1: the first element pays the full 11-cycle latency; the remaining 49 stream out of the pipeline one per cycle.)

18 SCALAR CODE EXECUTION TIME
Scalar execution time on an in-order processor with 1 bank
– First two loads in the loop cannot be pipelined: 2 × 11 cycles
– 4 + 50 × 40 = 2004 cycles
Scalar execution time on an in-order processor with 16 banks (word-interleaved)
– First two loads in the loop can be pipelined
– 4 + 50 × 30 = 1504 cycles
Why 16 banks?
– 11-cycle memory access latency
– Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
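
Where the per-iteration counts come from (worked out from the latencies listed on slide 15):

    1 bank:   11 + 11 + 4 + 1 + 11 + 2 = 40 cycles per iteration
    16 banks: the second load issues right behind the first instead of
              waiting out its 11-cycle latency, saving 10 cycles:
              40 - 10 = 30 cycles per iteration
    Setup:    the 4 MOVI/MOVA instructions account for the leading 4 cycles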

19 VECTOR CODE EXECUTION TIME
No chaining
– i.e., the output of a vector functional unit cannot be used as the input of another (no vector data forwarding)
16 memory banks (word-interleaved)
285 cycles
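
Where 285 comes from (a reconstruction from the per-instruction latencies on slide 17: without chaining, each vector instruction must complete before its dependent successor starts, and the single load/store path serializes the two loads and the store):

      1 + 1          ; MOVI VLEN, MOVI VSTR
    + (11 + 49)      ; VLD V0 = A
    + (11 + 49)      ; VLD V1 = B
    + (4 + 49)       ; VADD
    + (1 + 49)       ; VSHFR
    + (11 + 49)      ; VST
    = 285 cycles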

20 VECTOR PROCESSOR DISADVANTAGES
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
   -- How about searching for a key in a linked list?
Fisher, "Very Long Instruction Word Architectures and the ELI-512," ISCA 1983.

21 VECTOR PROCESSOR LIMITATIONS
-- Memory (bandwidth) can easily become a bottleneck, especially if
   1. compute/memory operation balance is not maintained
   2. data is not mapped appropriately to memory banks

22 FURTHER READING: SIMD
Recommended
– H&P, Appendix on Vector Processors
– Russell, "The CRAY-1 Computer System," CACM 1978.

23 VECTOR MACHINE EXAMPLE: CRAY-1
Russell, "The CRAY-1 Computer System," CACM 1978.
Scalar and vector modes
8 64-element vector registers, 64 bits per element
16 memory banks
8 64-bit scalar registers
8 24-bit address registers

24 AMDAHL’S LAW: BOTTLENECK ANALYSIS
Speedup = time without enhancement / time with enhancement
Suppose an enhancement speeds up a fraction f of a task by a factor of S:

    time_enhanced = time_original × (1 - f) + time_original × (f / S)
    Speedup_overall = 1 / ((1 - f) + f / S)

[Figure: the original execution time split into a (1 - f) portion and an f portion; in the enhanced version the f portion shrinks to f/S.]
Focus on bottlenecks with large f (and large S)
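
A worked example (not on the slide): suppose an enhancement speeds up 90% of a task (f = 0.9) by a factor of 10 (S = 10):

    Speedup = 1 / ((1 - 0.9) + 0.9 / 10)
            = 1 / (0.1 + 0.09)
            ≈ 5.26

Even a 10× speedup on 90% of the work yields barely 5× overall; the untouched 10% becomes the bottleneck, which is why large f matters as much as large S.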

25 FLYNN’S TAXONOMY OF COMPUTERS
Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
– Array processor
– Vector processor
MISD: Multiple instructions operate on single data element
– Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
– Multiprocessor
– Multithreaded processor

26 SYSTOLIC ARRAYS

27 WHY SYSTOLIC ARCHITECTURES?
Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory
Similar to an assembly line of processing elements
– Different people work on the same car
– Many cars are assembled simultaneously
– Can be two-dimensional
Why? Special-purpose accelerators/architectures need
– Simple, regular design (keep # unique parts small and regular)
– High concurrency → high performance
– Balanced computation and I/O (memory) bandwidth

28 SYSTOLIC ARRAYS
H. T. Kung, "Why Systolic Architectures?," IEEE Computer 1982.
The analogy: memory is the heart, the PEs are the cells
Memory pulses data through the cells

29 SYSTOLIC ARCHITECTURES
Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate the flow of data between the PEs
– Balance computation and memory bandwidth
Differences from pipelining:
– These are individual PEs
– The array structure can be non-linear and multi-dimensional
– PE connections can be multidirectional (and of different speeds)
– PEs can have local memory and execute kernels (rather than a piece of the instruction)
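
As a concrete illustration, here is a small C simulation of a linear systolic array in the spirit of Kung's classic convolution example; the 3-tap design, register arrangement, and all names are illustrative, not taken from the lecture. Each cell holds one stationary weight; inputs march through two registers per cell and partial sums through one, so data pulses one cell per clock:

    /* Systolic 3-tap FIR: y[n] = w[0]*x[n] + w[1]*x[n-1] + w[2]*x[n-2].
     * Sketch only: each cell multiplies its stationary weight by the
     * passing input and adds it to the partial sum flowing alongside. */
    #include <stdio.h>

    #define K 3                        /* taps = number of cells */
    #define N 5                        /* input length */

    int main(void) {
        const int w[K] = {1, 2, 3};    /* stationary weights */
        const int x[N] = {1, 0, 0, 4, 5};

        int xr[K][2] = {{0}};          /* two x-delay registers per cell */
        int yr[K] = {0};               /* one partial-sum register per cell */

        for (int t = 0; t < N + K - 1; t++) {
            int nxr[K][2], nyr[K];

            /* Compute every register's next value from the OLD state,
             * then commit: this models one synchronous clock pulse. */
            for (int k = 0; k < K; k++) {
                int xin = (k == 0) ? ((t < N) ? x[t] : 0) : xr[k-1][1];
                int yin = (k == 0) ? 0 : yr[k-1];
                nxr[k][0] = xin;           /* input advances one stage */
                nxr[k][1] = xr[k][0];
                nyr[k] = yin + w[k] * xin; /* multiply-accumulate */
            }
            for (int k = 0; k < K; k++) {
                xr[k][0] = nxr[k][0];
                xr[k][1] = nxr[k][1];
                yr[k] = nyr[k];
            }

            if (t >= K - 1)            /* results drain after K-1 cycles */
                printf("y[%d] = %d\n", t - K + 1, yr[K-1]);
        }
        return 0;
    }

Note how the structure reflects the bullets above: every cell is identical (simple, regular design), all K multipliers work concurrently, and each input word is fetched from memory once yet used K times (balanced computation and I/O bandwidth).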

30 SYSTOLIC ARRAYS: PROS AND CONS
Advantage:
– Specialized (the computation needs to fit the PE organization/functions)
  → improved efficiency, simple design, high concurrency/performance
  → does more with a smaller memory bandwidth requirement
Downside:
– Specialized → not generally applicable, because the computation needs to fit the PE functions/organization

31 AGENDA
Project Proposal and Ideas
Review from last lecture
Fundamental concepts
– Computing models
– ISA Tradeoffs

32 LEVELS OF TRANSFORMATION
ISA
– Agreed-upon interface between software and hardware: SW/compiler assumes, HW promises
– What the software writer needs to know to write system/user programs
Microarchitecture
– Specific implementation of an ISA
– Not visible to the software
Microprocessor
– ISA, uarch, circuits
– "Architecture" = ISA + microarchitecture
(Levels of transformation: Problem → Algorithm → Program/Language → ISA → Microarchitecture → Logic → Circuits)

33 ISA VS. MICROARCHITECTURE
What is part of the ISA vs. the uarch?
– Gas pedal: interface for "acceleration"
– Internals of the engine: implement "acceleration"
– Add instruction vs. adder implementation
The implementation (uarch) can vary as long as it satisfies the specification (ISA)
– Bit-serial, ripple-carry, carry-lookahead adders
– The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, …
Uarch usually changes faster than ISA
– Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
– Why?

34 ISA
Instructions
– Opcodes, Addressing Modes, Data Types
– Instruction Types and Formats
– Registers, Condition Codes
Memory
– Address space, Addressability, Alignment
– Virtual memory management
Call, Interrupt/Exception Handling
Access Control, Priority/Privilege
I/O
Task Management
Power and Thermal Management
Multi-threading support, Multiprocessor support

35 MICROARCHITECTURE
Implementation of the ISA under specific design constraints and goals
Anything done in hardware without exposure to software
– Pipelining
– In-order versus out-of-order instruction execution
– Memory access scheduling policy
– Speculative execution
– Superscalar processing (multiple instruction issue?)
– Clock gating
– Caching? Levels, size, associativity, replacement policy
– Prefetching?
– Voltage/frequency scaling?
– Error correction?

36 DESIGN POINT
A set of design considerations and their importance
– Leads to tradeoffs in both ISA and uarch
Considerations
– Cost
– Performance
– Maximum power consumption
– Energy consumption (battery life)
– Availability
– Reliability and Correctness (or is it?)
– Time to Market
Design point determined by the "Problem" space (application space)
(Levels of transformation: Problem → Algorithm → Program/Language → ISA → Microarchitecture → Logic → Circuits)

37 TRADEOFFS: SOUL OF COMPUTER ARCHITECTURE
ISA-level tradeoffs
Uarch-level tradeoffs
System and task-level tradeoffs
– How to divide the labor between hardware and software

38 ISA-LEVEL TRADEOFFS: SEMANTIC GAP
Where to place the ISA? Semantic gap
– Closer to high-level language (HLL) or closer to hardware control signals? → Complex vs. simple instructions
– RISC vs. CISC vs. HLL machines
  FFT, QUICKSORT, POLY, FP instructions?
  VAX INDEX instruction (array access with bounds checking; sketched below)
– Tradeoffs:
  Simple compiler, complex hardware vs. complex compiler, simple hardware
  Caveat: Translation (indirection) can change the tradeoff!
  Burden of backward compatibility
  Performance?
  – Optimization opportunity: for the VAX INDEX instruction, who (compiler vs. hardware) puts more effort into optimization?
  – Instruction size, code size
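
To make the semantic-gap example concrete, here is a C sketch of roughly what one VAX INDEX step does; the operand details are simplified and should be treated as illustrative, not as the exact VAX definition:

    /* Roughly the work of the VAX INDEX instruction: bounds-check a
     * subscript and fold it into a (possibly multi-dimensional) index
     * computation. Simplified sketch; the actual VAX semantics differ
     * in detail (e.g., a failing check raises a hardware trap). */
    #include <stdio.h>
    #include <stdlib.h>

    long index_step(long subscript, long low, long high,
                    long size, long indexin) {
        if (subscript < low || subscript > high) {  /* bounds check */
            fprintf(stderr, "subscript range trap\n");
            exit(EXIT_FAILURE);
        }
        return (indexin + subscript) * size;        /* index accumulate */
    }

A RISC ISA would expose the compare, branches, add, and multiply as separate instructions for the compiler to schedule and optimize; the CISC choice buries the same sequence in hardware. That is exactly the compiler-vs-hardware optimization question raised on this slide.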


