Spring’19 Prof. Eric Rotenberg

1 ECE 721 Overview
Spring’19, Prof. Eric Rotenberg

2 Performance Strategies
Application class: sequential programs (721 focus)
  Nature of parallelism: Instruction-Level Parallelism (ILP), irregular and fine-grained
  Example applications: most ordinary apps, operating systems, etc.
  Architecture approach: general-purpose superscalar or VLIW; speculation is key theme
Application class: data-parallel programs
  Nature of parallelism: Data-Level Parallelism (DLP), regular and fine-grained
  Example applications: multimedia, games, network processing, scientific apps, etc.
  Architecture approach: vector, SIMD, SIMT (GPGPU), special-purpose, ASICs, FPGAs, etc.
Application class: thread-parallel programs
  Nature of parallelism: Thread-Level Parallelism (TLP), regular and coarse-grained
  Example applications: data center apps, multimedia, games, network processing, scientific apps, etc.
  Architecture approach: parallel computers, multi-core

3 The Big Five ILP Techniques (ECE 463/563 Review)
Pipelining: Overlap instructions for higher throughput
Caches, prefetching: Bridge processor/memory speed gap
Branch prediction: Remove control dependencies for effective pipelining
Out-of-order execution: Mitigate data dependencies and latencies by decoupling future independent instructions from earlier stalled instructions
Superscalar / VLIW: Exceed scalar (1 instr./cycle) performance via multiple-instruction issue (N instr./cycle)
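
To make the throughput argument behind pipelining and multiple issue concrete, here is a minimal sketch that counts cycles for an idealized pipeline with no stalls. The function name `ideal_cycles` and the instruction counts, depth, and width used below are assumptions chosen for illustration; they do not come from the course material.

```python
import math

def ideal_cycles(num_instrs, depth, width=1):
    """Cycles to run num_instrs on an idealized pipeline with no stalls.

    The first group needs `depth` cycles to traverse the pipeline; each
    further group of up to `width` instructions completes one cycle later.
    """
    if num_instrs <= 0:
        return 0
    groups = math.ceil(num_instrs / width)
    return depth + groups - 1

# Hypothetical workload: 100 instructions on a 5-stage machine.
unpipelined = 100 * 5                             # one instruction at a time
scalar = ideal_cycles(100, depth=5)               # 5 + 100 - 1
four_wide = ideal_cycles(100, depth=5, width=4)   # 5 + 25 - 1
print(unpipelined, scalar, four_wide)             # prints: 500 104 29
```

Real pipelines fall short of these counts because of the control and data dependencies that the other techniques on this slide are meant to address.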

4 ILP Scaling in Commercial Processors
Pentium: pipeline depth (fetch to execute) 5; issue width 1 instr.; in-flight instructions ~5
Pentium-III: pipeline depth 10; issue width 3 µ-ops; in-flight instructions ~40
Pentium-IV: pipeline depth 20; in-flight instructions 126
IBM Power4: pipeline depth 12; issue width 5 instr.; in-flight instructions 200
IBM Power8: issue width 10 (issue), 8 (retire); in-flight instructions 224

5 ECE 721 Topics
Modern Superscalar Processors
  Contemporary organization
  Physical Register File
  Memory dependencies
  Canonical superscalar pipeline
  Implementation details
  Superscalar complexity
  Case studies
Possible Next-Generation Superscalars
  Large-Window Processors
  High-ILP Processors
Specialization
  New source of performance, as speed and power benefits of technology scaling decrease
  Efficiency
  Forms: reconfigurable processor, heterogeneous multi-core processor, accelerators

6 Topic 1: Modern Superscalar Processors

7 Style 1 (ECE 563)
[Figure: pipeline block diagram. Labeled blocks: branch predictor; instruction fetch; I$; decode, rename, register read, dispatch; ARF; issue queue (IQ); ROB; OOO issue; execution; function units (FUs); D$; complete; retire.]

8 Style 2 (ECE 721)
[Figure: pipeline block diagram. Labeled blocks: branch predictor; instruction fetch; I$; Free list; decode, rename, dispatch; rename Map; Arch. Map (exception recovery); Shadow Maps (branch misp. recovery); Issue Queue (IQ); OOO issue; Physical RF; execution; Function Units (FUs); complete; Active List (head, tail); retire.]
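
The rename map and free list named in this diagram can be illustrated with a small sketch. This is a toy model under stated assumptions, not the 721sim implementation: the class `Renamer`, its method names, and the register counts are all hypothetical, and the Arch. Map, Shadow Maps, and Active List from the figure are left out.

```python
from collections import deque

class Renamer:
    """Toy rename stage: logical-to-physical map plus a free list."""

    def __init__(self, num_logical, num_physical):
        assert num_physical > num_logical
        # Assumption: logical register i starts out in physical register i.
        self.map = list(range(num_logical))
        self.free_list = deque(range(num_logical, num_physical))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (phys_dst, phys_srcs, prev_phys_dst)."""
        phys_srcs = [self.map[s] for s in srcs]  # read sources before updating the map
        if not self.free_list:
            raise RuntimeError("rename stall: no free physical registers")
        prev = self.map[dst]                     # can be freed when this instruction retires
        phys_dst = self.free_list.popleft()
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs, prev

    def retire(self, prev_phys_dst):
        """At retirement the destination's previous mapping is dead; recycle it."""
        self.free_list.append(prev_phys_dst)

r = Renamer(num_logical=4, num_physical=8)
# r1 = r2 + r3 ; r1 = r1 + r2  (each write to r1 gets a fresh physical register)
first = r.rename(dst=1, srcs=[2, 3])
second = r.rename(dst=1, srcs=[1, 2])
print(first, second)   # prints: (4, [2, 3], 1) (5, [4, 2], 4)
```

The second instruction reads physical register 4, the value produced by the first, while writing to physical register 5, which is how renaming removes the write-after-write hazard on r1.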

9 Superscalar Complexity

10 Case Studies
ARM Cortex A15

11 Case Studies
IBM Power8 microarchitecture block diagram [Image credit: The Linley Group]

12 Topic 2: Possible Next-Gen. Superscalars
Large-window processors
High-ILP processors

13 Large-Window Processors
Checkpoint processing and recovery (CPR): Large virtual window (e.g., 1K to 8K in-flight instructions!) with small physical resources
Continual Flow Pipelines (CFP): Program continues executing for 100s of cycles while L2-miss instructions are deferred
Run-Ahead Execution: Efficiently exploits memory-level parallelism
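
One way to see why memory-level parallelism matters is a back-of-the-envelope stall model. The sketch below is an assumption-laden simplification: the function `miss_stall_cycles`, the 200-cycle miss latency, and the overlap factor are all hypothetical, and every miss is assumed independent and equally long.

```python
import math

def miss_stall_cycles(num_misses, miss_latency, overlap=1):
    """Stall cycles when up to `overlap` independent misses are serviced concurrently."""
    return math.ceil(num_misses / overlap) * miss_latency

# Hypothetical numbers: 8 independent cache misses, 200 cycles each.
serial = miss_stall_cycles(8, 200)                 # one miss at a time
overlapped = miss_stall_cycles(8, 200, overlap=4)  # 4 misses in flight together
print(serial, overlapped)                          # prints: 1600 400
```

A larger window, or running ahead past a miss, is what lets the processor find later independent misses and raise the effective overlap.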

14 High-ILP Processors
E.g., 16-way superscalar
Explored in the 1990s
Why revisit today?
  Frequency has peaked due to reaching air-cooled power limit
  Increase performance through parallelism, including ILP

15 High-ILP Processors (cont.)
Two challenges
  ILP bottlenecks
  Logic complexity in all pipeline stages increases cycle time and power
Two sets of solutions
  Need advanced speculation techniques to overcome ILP bottlenecks
  Need a complexity-effective microarchitecture

16 Advanced Speculation Techniques
ILP bottleneck: control-flow, branch mispredictions
  Advanced speculation techniques: multipath execution, control independence, other
ILP bottleneck: control-flow, fetch bandwidth
  Advanced speculation techniques: trace cache
ILP bottleneck: data-flow
  Advanced speculation techniques: value prediction, instr./trace reuse

17 Advanced Speculation
Value prediction: Predict values and execute dependent instructions in parallel
Trace-level reuse: Collapse 10s of instructions into a few cycles
Control independence: Don’t squash all instructions after mispredicted branch
Confidence, multipath: Execute both paths of a branch if not confident of prediction

18 Value Prediction
[Figure: value prediction diagram with predictions p1, p2, p3 and equality (=) checks.]
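
As a rough illustration of value prediction, the sketch below implements a last-value predictor, one simple scheme among several. It is not taken from the slides: the class `LastValuePredictor`, the confidence threshold, and the example PC and value stream are all hypothetical.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self, threshold=2, max_conf=3):
        self.table = {}            # pc -> [last_value, confidence]
        self.threshold = threshold
        self.max_conf = max_conf

    def predict(self, pc):
        """Return the predicted value, or None if not confident enough."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Train with the value the instruction actually produced."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]
        elif entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)
        else:
            entry[0], entry[1] = actual, 0   # new value, confidence starts over

vp = LastValuePredictor()
predictions = []
for actual in [7, 7, 7, 7, 9, 9]:            # hypothetical results of one instruction
    predictions.append(vp.predict(0x400))
    vp.update(0x400, actual)
print(predictions)   # prints: [None, None, None, 7, 7, None]
```

The fifth prediction (7 when the actual value is 9) is a misprediction: in hardware the equality check would catch it and the dependent instructions would be re-executed.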

19 Control Independence
[Figure: mispredicted branch; save control-independent, data-independent (CIDI) instructions.]
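
The distinction between control-independent instructions that can be saved and those that must be redone can be sketched as a register-taint pass. This is a simplified model under stated assumptions: it tracks register dependences only (no memory), and the function `classify_control_independent`, the register numbers, and the instruction encoding as `(dst, srcs)` pairs are hypothetical.

```python
def classify_control_independent(wrong_path_dsts, correct_path_dsts, ci_instrs):
    """Split control-independent instructions into CIDI (keep) and data-dependent (redo).

    ci_instrs: (dst, srcs) pairs after the reconvergent point, in program order.
    Returns two lists of indices into ci_instrs.
    """
    # Registers written on either path may have held the wrong value.
    tainted = set(wrong_path_dsts) | set(correct_path_dsts)
    cidi, redo = [], []
    for index, (dst, srcs) in enumerate(ci_instrs):
        if any(s in tainted for s in srcs):
            redo.append(index)
            tainted.add(dst)       # consumers of this result must be redone too
        else:
            cidi.append(index)
            tainted.discard(dst)   # dst now holds a value unaffected by the branch
    return cidi, redo

# Hypothetical case: wrong path wrote r1, correct path writes r2.
ci = [(3, [4, 5]), (6, [1]), (7, [6]), (1, [4]), (8, [1])]
cidi, redo = classify_control_independent({1}, {2}, ci)
print(cidi, redo)   # prints: [0, 3, 4] [1, 2]
```

Only the instructions in `redo` need to be re-executed after the misprediction is repaired; the rest keep their results.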

20 Multipath Execution
[Figure: a chain of branches labeled confident, “unconfident” (from confidence estimator), confident; execution is labeled 1st thread and 2nd thread.]
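
The link between a confidence estimator and multipath execution can be sketched as follows. The estimator here uses resetting counters, which is one possible design and an assumption on my part, as are the names `ResettingConfidence` and `paths_to_fetch`, the threshold, and the example branch PC.

```python
class ResettingConfidence:
    """Per-branch counter: +1 on a correct prediction, reset to 0 on a misprediction."""

    def __init__(self, threshold=4, max_count=15):
        self.counters = {}         # branch pc -> count of consecutive correct predictions
        self.threshold = threshold
        self.max_count = max_count

    def is_confident(self, pc):
        return self.counters.get(pc, 0) >= self.threshold

    def update(self, pc, prediction_correct):
        if prediction_correct:
            self.counters[pc] = min(self.counters.get(pc, 0) + 1, self.max_count)
        else:
            self.counters[pc] = 0

def paths_to_fetch(estimator, pc, predicted_taken):
    """Follow only the predicted path when confident; otherwise fork both paths."""
    if estimator.is_confident(pc):
        return [predicted_taken]
    return [predicted_taken, not predicted_taken]

est = ResettingConfidence(threshold=2)
before = paths_to_fetch(est, 0x40, True)       # untrained: fork
est.update(0x40, True)
est.update(0x40, True)
trained = paths_to_fetch(est, 0x40, True)      # confident: single path
est.update(0x40, False)
after_miss = paths_to_fetch(est, 0x40, True)   # counter reset: fork again
print(before, trained, after_miss)   # prints: [True, False] [True] [True, False]
```

Forking only on low-confidence branches limits the extra fetch and execution bandwidth spent on the path that will be discarded.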

21 Simultaneous Multithreading (SMT)
Run multiple independent threads on wide processor at same time
  Naturally increases ILP (more independent instructions)
SMT hardware can also be repurposed for other microarchitecture techniques
  Multipath execution
  Pre-execution, helper threads, etc.

22 Complexity-effective: Hierarchical Processors
[Figure: hierarchical processor block diagram. Labeled blocks: Trace Predictor, Trace Cache, Global Registers, Local Registers, Function Units.]

23 Topic 3: Specialization

24 Specialization
Past: Generic superscalar microarchitecture
  Generic means not as efficient as possible for individual tasks
  Didn’t care:
    Exponential performance improvements: technology + microarchitecture (5 ILP techniques) = frequency
    Vdd scaling kept power in check
Future: Specialize
  Need to specialize hardware to tasks because the scaling “gravy train” is over

25 Forms of Specialization
Reconfigurable processor
  Adaptive core: e.g., adjustable width, depth, or structure sizes
  Core fusion: aggregate narrow cores to form a wide processor
Single-ISA heterogeneous multi-core processor
  Multiple core “types”, each with different superscalar dimensions
  Basic form commercialized: big and little cores (e.g., ARM’s big.LITTLE)
  More advanced form: non-monotonic cores (can’t be performance-ranked)
Accelerators
  CPU + Accelerators
  Many variants: programmable loop accelerators, compound circuits, reconfigurable arrays of ALUs, conservation cores
  ASICs, FPGAs, GPGPU now mainstream in data centers

26 Single-ISA Heterogeneous Multi-core Processor
Include many differently-designed cores on a chip
Fundamentally change what a “general-purpose processor” looks like
Overcome barrier to deploying new microarchitecture ideas: general applicability not an issue

27 Project Frameworks
721sim (C++)
  Required
  Cycle-level execute-at-execute simulator of a superscalar processor
  Projects 1 and 2 are training ground
FabScalar (Verilog)
  Optional
  Highly-parameterized synthesizable RTL design of a superscalar core
  Width, depth, and size are configurable
Other
  Must be approved by instructor
  If needed by your custom research project
  Compilers, other simulators, etc.

28 FabScalar-based Chips from NCSU
H3 (3D Heterogeneous Processor)
  Technology: IBM 8RF (130nm)
  Dimensions: 5.25mm x 5.25mm
  Area: 27.6mm²
  Transistors: 14.6 Million
  Cells: 1.1 Million
  Nets: 721 Thousand
  Memory macros: 56
  Clock domains: 10

29 FabScalar-based Chips from NCSU
AnyCore: A Width and Size Adaptive Superscalar Core

