Architecture Basics ECE 454 Computer Systems Programming


Architecture Basics
ECE 454 Computer Systems Programming
Cristiana Amza

Topics:
Basics of computer architecture
Pipelining, branches, superscalar, out-of-order execution

Motivation: Understand Loop Unrolling

Original loop:
    j = 0;
    while (j < 100) {
        a[j] = b[j+1];
        j += 1;
    }

Unrolled by 2:
    j = 0;
    while (j < 99) {
        a[j]   = b[j+1];
        a[j+1] = b[j+2];
        j += 2;
    }

Unrolling reduces loop overhead (fewer adds to update j, fewer loop condition tests) and enables more aggressive instruction scheduling (more instructions for the scheduler to move around).
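Written out as a complete program, the transformation looks roughly like this sketch (the array names a and b and the bound of 100 come from the slide; the main function and the initialization are added here only to make it runnable):

    #include <stdio.h>

    #define N 100

    int a[N], b[N + 1];

    /* Original loop: one element per iteration. */
    void copy_simple(void) {
        int j = 0;
        while (j < 100) {
            a[j] = b[j + 1];
            j += 1;
        }
    }

    /* Unrolled by 2: half the index updates and condition tests,
     * and two independent statements per iteration for the scheduler. */
    void copy_unrolled(void) {
        int j = 0;
        while (j < 99) {
            a[j]     = b[j + 1];
            a[j + 1] = b[j + 2];
            j += 2;
        }
    }

    int main(void) {
        for (int i = 0; i <= N; i++) b[i] = i;
        copy_unrolled();
        printf("a[0]=%d a[99]=%d\n", a[0], a[99]);  /* expect 1 and 100 */
        return 0;
    }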

Motivation: Understand Pointer vs. Array Code

Array code:
    .L24:                           # Loop:
        addl (%eax,%edx,4),%ecx     # sum += data[i]
        incl %edx                   # i++
        cmpl %esi,%edx              # i:length
        jl .L24                     # if < goto Loop

Pointer code:
    .L30:                           # Loop:
        addl (%eax),%ecx            # sum += *data
        addl $4,%eax                # data++
        cmpl %edx,%eax              # data:dend
        jb .L30                     # if < goto Loop

Performance:
Array code: 4 instructions in 2 clock cycles.
Pointer code: almost the same 4 instructions in 3 clock cycles.
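The slide shows only the compiled loops. C source that plausibly produces them looks like the following sketch; the function names sum_array and sum_pointer and the variable dend are illustrative, not taken from the slide:

    #include <stdio.h>

    /* Array version: index into data[], roughly the .L24 loop above. */
    int sum_array(int *data, int length) {
        int sum = 0;
        for (int i = 0; i < length; i++)
            sum += data[i];
        return sum;
    }

    /* Pointer version: walk a pointer to the end, roughly the .L30 loop. */
    int sum_pointer(int *data, int length) {
        int sum = 0;
        int *dend = data + length;
        for (; data < dend; data++)
            sum += *data;
        return sum;
    }

    int main(void) {
        int data[5] = {1, 2, 3, 4, 5};
        printf("%d %d\n", sum_array(data, 5), sum_pointer(data, 5));  /* both 15 */
        return 0;
    }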

Motivation: Understand Parallelism

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = (x * data[i]) * data[i+1];
    }
All multiplies are performed in sequence.

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = x * (data[i] * data[i+1]);
    }
The multiplies overlap.
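Here is a runnable sketch of the two combining loops, assuming a floating-point product in the style of the CS:APP combine examples; the function names and the data initialization are illustrative. Note that floating-point reassociation can change results slightly, which is why the compiler will not make this transformation for you by default.

    #include <stdio.h>

    #define N 16

    /* Sequential chain: each multiply needs the previous x, so the
     * multiplies cannot overlap. */
    double combine_seq(const double *data, int n) {
        double x = 1.0;
        int limit = n - 1;          /* handle pairs; assume n is even here */
        int i;
        for (i = 0; i < limit; i += 2)
            x = (x * data[i]) * data[i + 1];
        return x;
    }

    /* Reassociated: data[i] * data[i+1] does not depend on x, so it can
     * overlap with the multiply that folds the previous pair into x. */
    double combine_reassoc(const double *data, int n) {
        double x = 1.0;
        int limit = n - 1;
        int i;
        for (i = 0; i < limit; i += 2)
            x = x * (data[i] * data[i + 1]);
        return x;
    }

    int main(void) {
        double data[N];
        for (int i = 0; i < N; i++) data[i] = 1.0 + i * 0.01;
        printf("%f %f\n", combine_seq(data, N), combine_reassoc(data, N));
        return 0;
    }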

Modern CPU Design
[Block diagram: the Instruction Control unit (fetch control, instruction cache, instruction decode, register file, retirement unit, and a "Prediction OK?" feedback path) sends fetch addresses to memory and operations to the Execution unit; the functional units (integer/branch, general integer, FP add, FP mult/div, load, store) exchange addresses and data with the data cache and return operation results as register updates.]

RISC and Pipelining
1980: Patterson (Berkeley) coins the term RISC.
RISC design simplifies implementation:
Small number of instruction formats.
Simple instruction processing.
RISC leads naturally to a pipelined implementation:
Partition activities into stages.
Each stage is a simple computation.

RISC Pipeline
Pipelining does not make an individual instruction faster: each instruction still takes multiple cycles to complete, just as before. By overlapping instructions, however, one instruction can finish every cycle.
Reduce CPI from 5 → 1 (ideally).

Pipelines and Branch Prediction

    BNEZ R3, L1     # branch decision is resolved in the ID stage

Which instruction should we fetch next? Must we wait/stall fetching until the branch direction is known?
Solution: predict the branch, e.g., predict BNEZ taken or not taken.

Pipelines and Branch Prediction
Why not just wait/stall? How bad is the problem (isn't it just one cycle)?
Branch instructions make up 15%-25% of all instructions.
Pipelines are deeper, so the branch is not resolved until much later and the misprediction penalty is larger. (Why not resolve the branch earlier on a deep pipeline? Stages such as register renaming, where the small set of architectural registers is mapped onto a much larger set of internal registers, plus out-of-order scheduling and other front-end work, all take time before the branch can execute.)
Multiple instruction issue (superscalar) means more instructions must be flushed and refetched on a misprediction.
Object-oriented programming produces more indirect branches, which are harder for the compiler to predict.
Fortunately, branch prediction is effective. An illustrative example follows below.
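A small, hypothetical C example of the difference between a branch that predictors handle well and one they cannot (the function name and the data are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    /* The loop-closing branch (i < n) is taken on every iteration but the
     * last, so a predictor gets it right almost always.  The branch on
     * v[i] is easy if the data is sorted or heavily biased, but close to a
     * coin flip (and therefore slow) if the data is random. */
    long count_big(const int *v, int n, int threshold) {
        long count = 0;
        for (int i = 0; i < n; i++)
            if (v[i] > threshold)
                count++;
        return count;
    }

    int main(void) {
        enum { N = 1 << 20 };
        int *v = malloc(N * sizeof *v);
        for (int i = 0; i < N; i++)
            v[i] = rand() % 256;       /* random data: the v[i] branch mispredicts often */
        printf("%ld\n", count_big(v, N, 128));
        free(v);
        return 0;
    }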

Branch Prediction: Solution
Solution: predict branch directions (branch prediction).
Intuition: predict the future based on history.
Local prediction: each branch is predicted based only on its own history.
Problem?

Branch Prediction: Solution

    if (a == 2) a = 0;
    if (b == 2) b = 0;
    if (a != b) ...

Should a branch's prediction depend only on its own history? Here the outcome of the third branch is determined by the first two.
Global predictor intuition: predict based on both the global and the local history.
(m, n) prediction uses a 2-D table:
An m-bit vector stores the global branch history (the outcomes of the last m executed branches).
The value of this m-bit vector indexes into an n-bit local history entry.
Branch prediction is important: roughly 30K bits is the standard size of the prediction tables on the Intel Pentium 4!
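To make the global-history idea concrete, here is a toy predictor simulation in the spirit of a gshare-style global predictor. The table size, the hash, and the example branch address are illustrative assumptions, not the design of any particular Intel part:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Table of 2-bit saturating counters indexed by (branch PC XOR global
     * history), echoing the (m, n) idea on the slide. */
    #define HIST_BITS  12
    #define TABLE_SIZE (1u << HIST_BITS)

    static uint8_t  counters[TABLE_SIZE];  /* 0..3: strongly/weakly not-taken/taken */
    static uint32_t global_history;        /* last HIST_BITS branch outcomes */

    static uint32_t index_for(uint32_t pc) {
        return (pc ^ global_history) & (TABLE_SIZE - 1);
    }

    bool predict(uint32_t pc) {
        return counters[index_for(pc)] >= 2;   /* taken if counter is 2 or 3 */
    }

    void update(uint32_t pc, bool taken) {
        uint8_t *c = &counters[index_for(pc)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        global_history = ((global_history << 1) | taken) & (TABLE_SIZE - 1);
    }

    int main(void) {
        /* A branch that strictly alternates: a single per-branch counter does
         * poorly, but with global history the predictor learns the pattern
         * after a short warm-up. */
        int correct = 0;
        for (int i = 0; i < 10000; i++) {
            bool actual = i & 1;
            if (predict(0x400123) == actual) correct++;   /* 0x400123: made-up PC */
            update(0x400123, actual);
        }
        printf("accuracy: %.1f%%\n", 100.0 * correct / 10000);
        return 0;
    }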

Instruction-Level Parallelism
[Figure: nine application instructions executed on a single-issue pipeline versus a superscalar pipeline; the superscalar issues several instructions per cycle and finishes in less execution time.]

Data Dependency: Obstacle to a Perfect Pipeline

    DIV F0, F2, F4      // F0 = F2 / F4
    ADD F10, F0, F8     // F10 = F0 + F8
    SUB F12, F8, F14    // F12 = F8 - F14

In-order execution:
    DIV F0,F2,F4        executes
    ADD F10,F0,F8       STALL: waiting for F0 to be written
    SUB F12,F8,F14      STALL: waiting for F0 to be written (necessary?)

Out-of-Order Execution: Solving the Data Dependency

    DIV F0, F2, F4      // F0 = F2 / F4
    ADD F10, F0, F8     // F10 = F0 + F8
    SUB F12, F8, F14    // F12 = F8 - F14

Out-of-order execution:
    DIV F0,F2,F4        executes
    SUB F12,F8,F14      does not wait (as long as it's safe): it does not use F0
    ADD F10,F0,F8       STALL: waiting for F0 to be written
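The same reasoning expressed in C, as an illustrative analogue of the slide's FP sequence (the function name and the test values are made up):

    #include <stdio.h>

    /* The subtraction does not use f0, so neither the compiler nor an
     * out-of-order core needs to wait for the slow divide before computing it. */
    void compute(double f2, double f4, double f8, double f14,
                 double *f10, double *f12) {
        double f0 = f2 / f4;   /* DIV: long latency               */
        *f10 = f0 + f8;        /* ADD: depends on the divide      */
        *f12 = f8 - f14;       /* SUB: independent, can run early */
    }

    int main(void) {
        double f10, f12;
        compute(16.0, 4.0, 1.0, 0.5, &f10, &f12);
        printf("%f %f\n", f10, f12);   /* 5.0 and 0.5 */
        return 0;
    }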

Out-of-Order Execution Masks Cache-Miss Delay

In-order:
    inst1
    inst2
    inst3
    inst4
    load (misses cache)
    [cache miss latency: later instructions wait]
    inst5 (must wait for load value)
    inst6

Out-of-order:
    inst1
    load (misses cache), issued as early as possible
    inst2, inst3, inst4, inst6 execute during the cache miss latency
    inst5 (must wait for load value) completes once the load returns

Out-of-Order Execution
In practice, much more complicated:
Reservation stations hold instructions until their operands are available and they can execute.
Register renaming, etc.

Instruction-Level Parallelism
[Figure: the same nine application instructions on a single-issue pipeline, a superscalar, and an out-of-order superscalar; each step further reduces execution time.]

The Limits of Instruction-Level Parallelism
[Figure: execution time of nine instructions on an out-of-order superscalar versus a wider out-of-order superscalar; the wider machine is barely faster.]
Diminishing returns for wider superscalars.

Multithreading the "Old Fashioned" Way
[Figure: two applications of nine instructions each share one core by fast context switching; their executions are serialized, so total execution time is the sum of the two.]

Simultaneous Multithreading (SMT), aka Hyperthreading
[Figure: with fast context switching the two applications run one after the other; with hyperthreading their instructions are interleaved in the same cycles, so total execution time drops.]
SMT: 20-30% faster than context switching.

A Bit of History for Intel Processors

    Year   Processor     Technology           CPI
    1971   4004          no pipeline          n
    1985   386           pipeline             close to 1
                         branch prediction    closer to 1
    1993   Pentium       superscalar          < 1
    1995   PentiumPro    out-of-order exec.   << 1
    1999   Pentium III   deep pipeline        shorter cycle
    2000   Pentium 4     SMT                  < 1?

32-bit to 64-bit Computing
Why 64 bit?
Address space: 32-bit gives 4 GB; 64-bit gives roughly 18 million TB. Benefits large databases and media processing.
OSs and counters: a 64-bit counter will not overflow (if only incremented).
Math and cryptography: better performance for large/precise-value math.
Drawbacks:
Pointers now take 64 bits instead of 32, i.e., code size increases.
We are unlikely to go to 128-bit.
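A quick check you can run on a 64-bit machine versus a 32-bit one; the sizes in the comments assume the usual LP64 (64-bit Linux) and ILP32 (32-bit) conventions:

    #include <stdio.h>

    /* On x86-64, pointers and longs are 8 bytes, so pointer-heavy data
     * structures grow relative to a 32-bit build. */
    int main(void) {
        printf("sizeof(void *) = %zu\n", sizeof(void *));   /* 8 on x86-64, 4 on ia32 */
        printf("sizeof(long)   = %zu\n", sizeof(long));     /* 8 on LP64, 4 on ILP32  */
        printf("sizeof(int)    = %zu\n", sizeof(int));      /* 4 on both              */
        return 0;
    }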

Core2 Architecture (2006): UG machines!

Summary (UG Machines CPU Core Architectural Features)
64-bit instructions.
Deeply pipelined: 14 stages.
Branches are predicted.
Superscalar:
Can issue multiple instructions at the same time.
Can issue instructions out-of-order.