Lecture 23: Advanced Microarchitecture


E85 Digital Design & Computer Engineering

Advanced Microarchitecture
- Deep Pipelining
- Micro-operations
- Branch Prediction
- Superscalar Processors
- Out-of-Order Processors
- Register Renaming
- SIMD
- Multithreading
- Multiprocessors

Deep Pipelining
- 10-20 stages typical
- Number of stages limited by:
  - Pipeline hazards
  - Sequencing overhead
  - Power
  - Cost
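As a rough sketch of the sequencing-overhead limit (the symbols here are illustrative, not from the slides): if a single-cycle logic delay t_logic is split across N stages, each separated by pipeline registers with combined clock-to-Q and setup overhead t_overhead, the cycle time is approximately

    T_c = t_logic / N + t_overhead

As N grows, the fixed per-stage overhead dominates and further pipelining gives diminishing returns; hazards add stalls on top of this, and higher clock rates raise power.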

Micro-operations
- Decompose complex instructions into a series of simple instructions called micro-operations (micro-ops or µ-ops)
- At run-time, complex instructions are decoded into one or more micro-ops
- Used heavily in CISC (complex instruction set computer) architectures (e.g., x86)
- Used for some ARM instructions, for example:

  Complex Op          Micro-op Sequence
  LDR R1, [R2], #4    LDR R1, [R2]
                      ADD R2, R2, #4

- Without µ-ops, the post-indexed load would need a second write port on the register file

Micro-operations
- Allow for dense code (fewer memory accesses)
- Yet preserve the simplicity of RISC hardware
- ARM strikes a balance by choosing instructions that:
  - Give better code density than pure RISC instruction sets (such as MIPS)
  - Enable more efficient decoding than CISC instruction sets (such as x86)

Branch Prediction
- Guess whether a branch will be taken
- Backward branches are usually taken (loops)
- Consider history to improve the guess
- Good prediction reduces the fraction of branches requiring a flush

Branch Prediction
- Ideal pipelined processor: CPI = 1
- Branch misprediction increases CPI
- Static branch prediction:
  - Check direction of branch (forward or backward)
  - If backward, predict taken
  - Else, predict not taken
- Dynamic branch prediction (see the sketch below):
  - Keep a history of the last several hundred (or thousand) branches in a branch target buffer, recording:
    - Branch destination
    - Whether the branch was taken
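A minimal sketch of what one branch target buffer entry might hold; the struct fields, table size, and indexing scheme are illustrative assumptions, not details from the slides:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BTB_ENTRIES 1024  /* hypothetical size; real sizes vary */

    /* One BTB entry: which branch it is, where it goes, and its
       taken/not-taken history. */
    typedef struct {
        uint32_t pc;      /* address of the branch instruction */
        uint32_t target;  /* branch destination */
        uint8_t  counter; /* e.g., a 2-bit saturating counter */
        bool     valid;
    } btb_entry_t;

    static btb_entry_t btb[BTB_ENTRIES];

    /* Index with the low PC bits; the stored PC disambiguates collisions. */
    static btb_entry_t *btb_lookup(uint32_t pc) {
        btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
        return (e->valid && e->pc == pc) ? e : NULL;
    }

    int main(void) {
        btb[(0x8010 >> 2) % BTB_ENTRIES] =
            (btb_entry_t){0x8010, 0x8030, 3, true};  /* branch at 0x8010 -> 0x8030 */
        btb_entry_t *e = btb_lookup(0x8010);
        if (e) printf("predict %s, target 0x%x\n",
                      e->counter >= 2 ? "taken" : "not taken", e->target);
        return 0;
    }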

Branch Prediction Example

        MOV  R1, #0        ; R1 = sum
        MOV  R0, #0        ; R0 = i
FOR                        ; for (i = 0; i < 10; i = i + 1)
        CMP  R0, #10
        BGE  DONE
        ADD  R1, R1, R0    ; sum = sum + i
        ADD  R0, R0, #1
        B    FOR
DONE

1-Bit Branch Predictor
- Remembers whether the branch was taken the last time and does the same thing
- Mispredicts the first and last branch of a loop

2-Bit Branch Predictor
- Only mispredicts the last branch of a loop (see the simulation below)
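A small simulation (a sketch, not from the slides) of both predictors on the BGE branch of the loop above, which is not taken for i = 0..9 and taken once at i = 10. After a first warm-up pass, the printed counts match the claims: two misses per pass for the 1-bit predictor (first and last iteration) and one for the 2-bit predictor (last iteration only):

    #include <stdbool.h>
    #include <stdio.h>

    /* Outcomes of BGE DONE: not taken 10 times, then taken once. */
    static const bool taken[11] = {0,0,0,0,0,0,0,0,0,0,1};

    int main(void) {
        bool one_bit = true;  /* 1-bit state: the last outcome seen */
        int  two_bit = 3;     /* 2-bit saturating counter, 0..3; >= 2 predicts taken */

        for (int pass = 0; pass < 3; pass++) {  /* run the loop three times */
            int miss1 = 0, miss2 = 0;
            for (int i = 0; i < 11; i++) {
                bool actual = taken[i];
                if (one_bit != actual) miss1++;
                one_bit = actual;                   /* update 1-bit state */
                if ((two_bit >= 2) != actual) miss2++;
                two_bit += actual ? (two_bit < 3)   /* saturating increment */
                                  : -(two_bit > 0); /* saturating decrement */
            }
            printf("pass %d: 1-bit misses %d, 2-bit misses %d\n",
                   pass, miss1, miss2);
        }
        return 0;
    }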

Superscalar
- Multiple copies of the datapath execute multiple instructions at once
- Dependencies make it tricky to issue multiple instructions at once

Superscalar Example
- Ideal IPC: 2
- Actual IPC: 2

Superscalar with Dependencies
- Ideal IPC: 2
- Actual IPC: 6/5 = 1.2
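The slides' figure is not reproduced in this transcript; the program is presumably the same six-instruction sequence listed in the out-of-order example below. One plausible dual-issue, in-order schedule consistent with the 6/5 figure, assuming a one-cycle load-use bubble and forwarding between ALU operations:

    Cycle   Issue slot 0         Issue slot 1
    1       LDR R8, [R0, #40]    (ADD stalls: needs R8 from the load)
    2       (load-use bubble)
    3       ADD R9, R8, R1       SUB R8, R2, R3
    4       AND R10, R4, R8      ORR R11, R5, R6
    5       STR R7, [R11, #80]

Six instructions complete in five cycles, so IPC = 6/5 = 1.2.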

Out-of-Order Processor
- Looks ahead across multiple instructions
- Issues as many instructions as possible at once
- Issues instructions out of order (as long as there are no dependencies)
- Dependencies (see the hazard-classification sketch below):
  - RAW (read after write): one instruction writes a register, a later instruction reads it
  - WAR (write after read): one instruction reads a register, a later instruction writes it
  - WAW (write after write): one instruction writes a register, a later instruction writes it again
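A minimal sketch (the struct and -1 "unused" convention are illustrative assumptions) of how a scheduler might classify the dependency between an earlier and a later instruction, using the first three instructions of the example below:

    #include <stdio.h>

    typedef struct { int dst, src1, src2; } instr_t;  /* -1 = unused field */

    /* Report the hazard(s) a later instruction has on an earlier one. */
    static void classify(instr_t early, instr_t late) {
        if (early.dst >= 0 && (late.src1 == early.dst || late.src2 == early.dst))
            puts("RAW: later instruction reads a register the earlier one writes");
        if (late.dst >= 0 && (early.src1 == late.dst || early.src2 == late.dst))
            puts("WAR: later instruction writes a register the earlier one reads");
        if (early.dst >= 0 && early.dst == late.dst)
            puts("WAW: both instructions write the same register");
    }

    int main(void) {
        instr_t ldr = {8, 0, -1};  /* LDR R8, [R0, #40] */
        instr_t add = {9, 8, 1};   /* ADD R9, R8, R1    */
        instr_t sub = {8, 2, 3};   /* SUB R8, R2, R3    */
        classify(ldr, add);        /* RAW on R8 */
        classify(add, sub);        /* WAR on R8 */
        classify(ldr, sub);        /* WAW on R8 */
        return 0;
    }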

Out-of-Order Processor
- Instruction-level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
- Scoreboard (see the sketch below): a table that keeps track of:
  - Instructions waiting to issue
  - Available functional units
  - Dependencies
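A minimal scoreboard sketch; the field names, window size, and issue rule are illustrative assumptions, not the slides' hardware:

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_FUS  4
    #define WIN_SIZE 8

    /* One scoreboard row per instruction waiting to issue. */
    typedef struct {
        bool busy;                   /* slot holds a waiting instruction */
        int  fu;                     /* functional unit it needs */
        int  dst, src1, src2;        /* register operands */
        bool src1_ready, src2_ready; /* RAW dependencies resolved? */
    } sb_entry_t;

    typedef struct {
        sb_entry_t window[WIN_SIZE]; /* instructions waiting to issue */
        bool fu_busy[NUM_FUS];       /* which functional units are taken */
    } scoreboard_t;

    /* An entry may issue once both operands are ready and its FU is free. */
    static bool can_issue(const scoreboard_t *sb, const sb_entry_t *e) {
        return e->busy && e->src1_ready && e->src2_ready && !sb->fu_busy[e->fu];
    }

    int main(void) {
        scoreboard_t sb = {0};
        /* ADD R9, R8, R1 waiting on R8 (RAW on the earlier load): */
        sb.window[0] = (sb_entry_t){true, 0, 9, 8, 1, false, true};
        printf("can issue: %d\n", can_issue(&sb, &sb.window[0])); /* 0 */
        sb.window[0].src1_ready = true;  /* load writes back R8 */
        printf("can issue: %d\n", can_issue(&sb, &sb.window[0])); /* 1 */
        return 0;
    }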

Out-of-Order Processor Example

    LDR R8, [R0, #40]
    ADD R9, R8, R1
    SUB R8, R2, R3
    AND R10, R4, R8
    ORR R11, R5, R6
    STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/4 = 1.5

Register Renaming

    LDR R8, [R0, #40]
    ADD R9, R8, R1
    SUB R8, R2, R3
    AND R10, R4, R8
    ORR R11, R5, R6
    STR R7, [R11, #80]

Ideal IPC: 2
Actual IPC: 6/3 = 2
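A minimal renaming sketch (an illustration under assumed register counts, not the slides' hardware): a map table sends each architectural register to a physical register, and every new destination gets a fresh physical register. Renaming the second write to R8 (the SUB) removes the WAR and WAW hazards, leaving only true RAW dependencies:

    #include <stdio.h>

    #define NUM_ARCH 16
    #define NUM_PHYS 32

    static int map[NUM_ARCH]; /* architectural -> physical register */
    static int next_free;     /* next free physical register (simplified free list) */

    static int rename_src(int arch) { return map[arch]; }
    static int rename_dst(int arch) { return map[arch] = next_free++; }

    int main(void) {
        for (int r = 0; r < NUM_ARCH; r++) map[r] = r;  /* identity at start */
        next_free = NUM_ARCH;

        /* SUB R8, R2, R3: rename sources first, then allocate a fresh
           physical register for the destination R8. */
        int s1 = rename_src(2), s2 = rename_src(3), d = rename_dst(8);
        printf("SUB P%d, P%d, P%d\n", d, s1, s2);       /* SUB P16, P2, P3 */

        /* AND R10, R4, R8 now reads the renamed R8 (P16), so its WAR/WAW
           conflicts with the earlier LDR R8 are gone. */
        int s1b = rename_src(4), s2b = rename_src(8);
        int db  = rename_dst(10);
        printf("AND P%d, P%d, P%d\n", db, s1b, s2b);    /* AND P17, P4, P16 */
        return 0;
    }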

SIMD
- Single Instruction Multiple Data (SIMD)
- Single instruction acts on multiple pieces of data at once
- Common application: graphics
- Perform short arithmetic operations (also called packed arithmetic)
- For example, add eight 8-bit elements (see the sketch below)
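A sketch of that example in plain C rather than real SIMD intrinsics: eight 8-bit lanes packed into one 64-bit word are added in a single operation, with masking standing in for the hardware's per-lane carry isolation:

    #include <stdint.h>
    #include <stdio.h>

    /* Packed add of eight 8-bit lanes in one 64-bit word. The masks keep
       each lane's carry from rippling into its neighbor, which is what a
       SIMD add instruction does in hardware. */
    static uint64_t paddb(uint64_t a, uint64_t b) {
        uint64_t low = (a & 0x7f7f7f7f7f7f7f7fULL) + (b & 0x7f7f7f7f7f7f7f7fULL);
        return low ^ ((a ^ b) & 0x8080808080808080ULL);  /* restore bit 7 per lane */
    }

    int main(void) {
        uint64_t a = 0x0102030405060708ULL;  /* lanes 1,2,3,4,5,6,7,8 */
        uint64_t b = 0x1010101010101010ULL;  /* add 0x10 to every lane */
        printf("%016llx\n", (unsigned long long)paddb(a, b));  /* 1112131415161718 */
        return 0;
    }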

Advanced Architecture Techniques
- Multithreading
  - Word processor: thread for typing, spell checking, printing
- Multiprocessors
  - Multiple processors (cores) on a single chip

Threading: Definitions
- Process: program running on a computer
  - Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
- Thread: part of a program
  - Each process has multiple threads: e.g., a word processor may have threads for typing, spell checking, printing

Threads in a Conventional Processor
- One thread runs at a time
- When one thread stalls (for example, waiting for memory):
  - Architectural state of that thread is stored
  - Architectural state of a waiting thread is loaded into the processor and it runs
  - Called context switching
- Appears to the user as if all threads are running simultaneously

Multithreading
- Multiple copies of architectural state
- Multiple threads active at once:
  - When one thread stalls, another runs immediately
  - If one thread can't keep all execution units busy, another thread can use them
- Does not increase instruction-level parallelism (ILP) of a single thread, but increases throughput
- Intel calls this "hyperthreading"

Multiprocessors
- Multiple processors (cores) with a method of communication between them
- Types:
  - Homogeneous: multiple cores with shared main memory
  - Heterogeneous: separate cores for different tasks (for example, DSP and CPU in a cell phone)
  - Clusters: each core has its own memory system

Other Resources
- Hennessy & Patterson, Computer Architecture: A Quantitative Approach
- Conferences (see www.cs.wisc.edu/~arch/www/):
  - ISCA (International Symposium on Computer Architecture)
  - HPCA (International Symposium on High Performance Computer Architecture)