
1 Carnegie Mellon University 2 Intel Corporation Chris Fallin 1 Chris Wilkerson 2 Onur Mutlu 1 The Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores

Executive Summary Problem: General purpose core design is a compromise between performance and energy-efficiency Our Goal: Design a core that achieves high performance and high energy efficiency at the same time Two New Observations: 1) Applications exhibit fine-grained heterogeneity in regions of code, 2) A core can exploit this heterogeneity by grouping tens of instructions into atomic blocks and executing each block in the best-fit backend Heterogeneous Block Architecture (HBA): 1) Forms atomic blocks of code, 2) Dynamically determines the best-fit (most efficient) backend for each block, 3) Specializes the block for that backend Initial HBA Implementation: Chooses from Out-of-order, VLIW or In-order backends, with simple schedule stability and stall heuristics Results: HBA provides higher efficiency than four previous designs at negligible performance loss; HBA enables new tradeoff points 2

Picture of HBA 3 (Diagram: a shared frontend with “specialization” logic feeds multiple backends — Backend 1, 2, 3; in the example instantiation, out-of-order and VLIW/IO backends.)

Talk Agenda Background and Motivation Two New Observations The Heterogeneous Block Architecture (HBA) An Initial HBA Design Experimental Evaluation Conclusions 4

Competing Goals in Core Design High performance and high energy-efficiency  Difficult to achieve both at the same time High performance: Sophisticated features to extract it  Out-of-order execution (OoO), complex branch prediction, wide instruction issue, … High energy efficiency: Use of only features that provide the highest efficiency for each workload  Adaptation of resources to different workload requirements Today’s high-performance designs: Features may not yield high performance, but every piece of code pays for their energy penalty 5

Principle: Exploiting Heterogeneity Past observation 1: Workloads have different characteristics at coarse granularity (thousands to millions of instructions)  Each workload or phase may require different features Past observation 2: This heterogeneity can be exploited by switching “modes” of operation  Separate out-of-order and in-order cores  Separate out-of-order and in-order pipelines with shared frontend Past coarse-grained heterogeneous designs  provide higher energy efficiency  at slightly lower performance vs. a homogeneous OoO core 6

Talk Agenda Background and Motivation Two New Observations The Heterogeneous Block Architecture (HBA) An Initial HBA Design Experimental Evaluation Conclusions 7

Two Key Observations Fine-Grained Heterogeneity of Application Behavior  Small chunks (10s or 100s of instructions) of code have different execution characteristics Some instruction chunks always have the same instruction schedule in an out-of-order processor Nearby chunks can have different schedules Atomic Block Execution with Multiple Execution Backends  A core can exploit fine-grained heterogeneity with Having multiple execution backends Dividing the program into atomic blocks Executing each block in the best-fit backend 8

Fine-Grained Heterogeneity Small chunks of code have different execution characteristics: some are a good fit for a wide VLIW, some for a wide OoO, and others for a narrow in-order engine 9

Instruction Schedule Stability Varies at Fine Grain Observation 1: Some regions of code are scheduled in the same instruction scheduling order in an OoO engine across different dynamic instances Observation 2: Stability of instruction schedules of (nearby) code regions can be very different An experiment:  Same-schedule chunk: A chunk of code that has the same schedule it had in its previous dynamic instance  Examined all chunks of <=16 instructions in 248 workloads  Computed the fraction of “same schedule chunks” in each workload 10

Instruction Schedule Stability Varies at Fine Grain 1. Many chunks have the same schedule in their previous instances  Can reuse the previous instruction scheduling order the next time 2. Stability of instruction schedules of chunks can be very different  Need a core that can dynamically schedule insts. in some chunks 3. Temporally-adjacent chunks can have different schedule stability  Need a core that can switch quickly between multiple schedulers
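The measurement behind these observations can be sketched as follows. This is a simplified illustration, not the actual experiment infrastructure: chunk identities and “schedules” stand in for the real microarchitectural state, and the real study tracked OoO issue order over chunks of at most 16 instructions across 248 workloads.

```python
def same_schedule_fraction(trace):
    """trace: list of (chunk_id, schedule) pairs, one per dynamic chunk,
    where schedule is a tuple giving the issue order the OoO core chose.
    Returns the fraction of repeat chunk instances whose schedule matches
    the chunk's previous dynamic instance ("same-schedule chunks")."""
    last_schedule = {}   # chunk_id -> schedule of previous dynamic instance
    same = total = 0
    for chunk_id, schedule in trace:
        if chunk_id in last_schedule:        # chunk seen before
            total += 1
            if last_schedule[chunk_id] == schedule:
                same += 1                    # schedule repeated exactly
        last_schedule[chunk_id] = schedule
    return same / total if total else 0.0

# Illustrative trace: chunk A keeps a stable schedule, chunk B does not.
trace = [("A", (0, 1, 2)), ("B", (2, 0, 1)),
         ("A", (0, 1, 2)), ("B", (0, 1, 2)),
         ("A", (0, 1, 2))]
frac = same_schedule_fraction(trace)   # 2 of 3 repeat instances match
```

A chunk with a high same-schedule fraction is the kind of region that can reuse its learned schedule on a statically scheduled backend.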

Two Key Observations Fine-Grained Heterogeneity of Application Behavior  Small chunks (10s or 100s of instructions) of code have different execution characteristics Some instruction chunks always have the same schedule Nearby chunks can have different schedules Atomic Block Execution with Multiple Execution Backends  A core can exploit fine-grained heterogeneity with Having multiple execution backends Dividing the program into atomic blocks Executing each block in the best-fit backend 12

Atomic Block Execution w/ Multiple Backends Fine grained heterogeneity can be exploited by a core that:  Has multiple (specialized) execution backends  Divides the program into atomic (all-or-none) blocks  Executes each atomic block in the best-fit backend 13 Atomicity enables specialization of a block for a backend (can freely reorder/rewrite/eliminate instructions in an atomic block)

Talk Agenda Background and Motivation Two New Observations The Heterogeneous Block Architecture (HBA) An Initial HBA Design Experimental Evaluation Conclusions 14

HBA: Principles Multiple different execution backends  Customized for different block execution requirements: OoO, VLIW, in-order, SIMD, …  Can be simultaneously active, executing different blocks  Enables efficient exploitation of fine-grained heterogeneity Block atomicity  Either the entire block executes or none of it  Enables specialization (reordering and rewriting) of instructions freely within the block to fit a particular backend Exploit stable block characteristics (instruction schedules)  E.g., reuse schedule learned by the OoO backend in VLIW/IO  Enables high efficiency by adapting to stable behavior 15

HBA Operation at High Level 16 (Diagram: the application’s instruction sequence is divided into atomic blocks — Block 1, 2, 3; a shared frontend with “specialization” logic dispatches each block to one of several backends — Backend 1, 2, 3.)

HBA Design: Three Components (I) Block formation and fetch Block sequencing and value communication Block execution 17

HBA Design: Three Components (II) 18 (Block diagram: the block frontend — branch predictor, block cache, block formation, block liveout rename — feeds the block core — global RF, subscription matrix, sequencer, block execution units, block ROB & atomic retire, shared LSQ, L1D cache.) 1. Frontend forms blocks dynamically 2. Blocks are dispatched and communicate via global registers 3. Heterogeneous block execution units execute different types of blocks in a specialized way

Talk Agenda Background and Motivation Two New Observations The Heterogeneous Block Architecture (HBA) An Initial HBA Design Experimental Evaluation Conclusions 19

An Example HBA Design: OoO/VLIW Block formation and fetch Block sequencing and value communication Block execution Two types of backends  OoO scheduling  VLIW/in-order scheduling 20

Block Formation and Fetch Atomic blocks are microarchitectural and dynamically formed Block info cache (BIC) stores metadata about a formed block  Indexed with PC and branch path of a block (a la Trace Caches)  Metadata info: backend type, instruction schedule, … Frontend fetches instructions/metadata from I-cache/BIC If miss in BIC, instructions are sent to OoO backend Otherwise, buffer instructions until the entire block is fetched The first time a block is executed on the OoO backend, its OoO version is formed (& OoO schedule/names stored in BIC)  Max 16 instructions, ends in a hard-to-predict or indirect branch 21
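The fetch-time decision around the BIC can be sketched as below. The data structures are simplified stand-ins for the hardware (a software dictionary in place of a set-associative cache), and the metadata fields are illustrative.

```python
class BlockInfoCache:
    """Toy model of the Block Info Cache (BIC), indexed by a block's
    start PC and its branch path, as in trace caches."""
    def __init__(self):
        self.entries = {}   # (pc, branch_path) -> metadata dict

    def lookup(self, pc, branch_path):
        return self.entries.get((pc, branch_path))

    def install(self, pc, branch_path, backend, schedule):
        # Metadata recorded after the OoO backend first executes the block:
        # which backend to use next time, and the learned schedule.
        self.entries[(pc, branch_path)] = {"backend": backend,
                                           "schedule": schedule}

def choose_backend(bic, pc, branch_path):
    meta = bic.lookup(pc, branch_path)
    if meta is None:
        # BIC miss: no metadata yet, so the block executes on the OoO
        # backend, which will form its OoO version and fill the BIC.
        return "OoO"
    return meta["backend"]

bic = BlockInfoCache()
first = choose_backend(bic, 0x400, "TNT")           # miss -> OoO backend
bic.install(0x400, "TNT", "VLIW", (0, 2, 1, 3))     # schedule learned by OoO
second = choose_backend(bic, 0x400, "TNT")          # hit -> VLIW backend
```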

An Example HBA Design: OoO/VLIW Block formation and fetch Block sequencing and value communication Block execution Two types of backends  OoO scheduling  VLIW/in-order scheduling 22

Block Sequencing and Communication Block Sequencing  Block based reorder buffer for in-order block sequencing  All subsequent blocks squashed on a branch misprediction  Current block squashed on an exception or intra-block misprediction  Single-instruction sequencing from non-optimized code upon an exception in a block to reach the exception point Value Communication  Across blocks: via the Global Register File  Within block: via the Local Register File  Only liveins and liveouts to a block need to be renamed and allocated global registers [Sprangle & Patt, MICRO 1994] Reduces energy consumption of register file and bypass units 23
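The livein/liveout analysis implied above can be sketched as follows: within an atomic block, only registers read before being written (liveins) and registers written by the block (liveouts) need global renaming; purely block-local values stay in the local register file. The (dests, srcs) instruction encoding is an illustrative assumption, not the real ISA representation.

```python
def liveins_liveouts(block):
    """block: list of (dest_regs, src_regs) tuples in program order.
    Returns the block's livein and liveout register sets."""
    written, liveins = set(), set()
    for dests, srcs in block:
        # A source is a livein only if no earlier instruction in the
        # block has already written it (it comes from outside the block).
        liveins |= set(srcs) - written
        written |= set(dests)
    # Conservatively, every register written in the block may be read by
    # a later block, so all written registers are treated as liveouts.
    liveouts = written
    return liveins, liveouts

# r1 = r2 + r3; r4 = r1 * r2  ->  liveins {r2, r3}; liveouts {r1, r4}
ins, outs = liveins_liveouts([(["r1"], ["r2", "r3"]),
                              (["r4"], ["r1", "r2"])])
```

Note that r1 is produced and consumed inside the block, so its value can flow through the local register file; only the livein and liveout sets touch the global RF, which is where the renaming-energy savings come from.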

An Example HBA Design: OoO/VLIW Block formation and fetch Block sequencing and value communication Block execution Two types of backends  OoO scheduling  VLIW/in-order scheduling 24

Block Execution Each backend contains  Execution units, scheduler, local register file Each backend receives  A block specialized for it (at block dispatch time)  Liveins for the executing block (as they become available) A backend executes the block  As specified by its logic and any information from the BIC Each backend produces  Liveouts to be written to the Global RegFile (as available)  Branch misprediction, exception results, block completion signal to be sent to the Block Sequencer Memory operations are handled via a traditional LSQ 25

An Example HBA Design: OoO/VLIW Block formation and fetch Block sequencing and value communication Block execution Two types of backends  OoO scheduling  VLIW/in-order scheduling 26

OoO and VLIW Backends 27 Two customized backend types (physically, they are unified) Out-of-order: dynamic scheduling with a matrix scheduler; renaming performed before the backend (block specialization logic); no need to maintain per-instruction order VLIW/In-order: static scheduling with a stall-on-use policy; no renaming or matrix computation logic (done for the previous instance of the block); VLIW blocks specialized by OoO backends; no per-instruction order

OoO Backend 28 Dynamically-scheduled implementation behind the generic block interface (block dispatch, liveins/liveouts as (reg, val) pairs, resolved branches, block completion). (Diagram: a dispatch queue feeds a CAM-free scheduler with per-instruction “srcs ready?” bits and consumer wakeup vectors; selection logic, a local RF, and execution units sit on the livein and writeback buses, with liveout/branch results and a completion counter, plus a path to shared EUs.)

VLIW/In-Order Backend 29 Statically-scheduled implementation behind the same generic block interface (block dispatch, liveins/liveouts as (reg, val) pairs, resolved branches, block completion). (Diagram: a dispatch queue feeds in-order issue logic backed by a ready-bit scoreboard; a local RF and execution units sit on the livein and writeback buses, with liveout/branch results and a completion counter, plus a path to shared EUs.)
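The stall-on-use policy mentioned above can be sketched as a small timing model: instructions issue one per cycle in the pre-recorded static order, and the pipeline stalls only when an instruction actually uses a source whose ready bit is not yet set in the scoreboard (e.g., a pending load). Latencies and the instruction format are illustrative assumptions.

```python
def run_inorder(schedule, latency):
    """schedule: list of (dest, srcs) in static (pre-scheduled) order.
    latency: dict mapping dest reg -> cycles until its value is ready
             (default 1 for simple ops).
    Returns (total_cycles, stall_cycles)."""
    ready_at = {}   # reg -> cycle at which its value becomes ready
    cycle = 0
    stalls = 0
    for dest, srcs in schedule:
        # Stall-on-use: wait until every source register is ready.
        need = max((ready_at.get(s, 0) for s in srcs), default=0)
        if need > cycle:
            stalls += need - cycle
            cycle = need
        ready_at[dest] = cycle + latency.get(dest, 1)
        cycle += 1   # one instruction issues per cycle
    return cycle, stalls

# A load into r1 (3-cycle latency) followed by an immediate use of r1:
# the dependent add must stall until the load's ready bit is set.
cycles, stalls = run_inorder([("r1", []), ("r2", ["r1"])], {"r1": 3})
```

A stall that turns out to be unnecessary in a given dynamic instance (the value was actually ready) is the kind of “false stall” the next slide's switching heuristic counts.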

Backend Switching and Specialization OoO to VLIW: If dynamic schedule is the same for N previous executions of the block on the OoO backend:  Record the schedule in the Block Info Cache  Mark the block to be executed on the VLIW backend VLIW to OoO: If there are too many false stalls in the block on the VLIW backend  Mark the block to be executed on the OoO backend 30
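The two switching heuristics above can be sketched as per-block state machines. The thresholds N and MAX_FALSE_STALLS are illustrative parameters, not the values used in the actual design.

```python
N = 4                  # identical OoO schedules required before going VLIW
MAX_FALSE_STALLS = 8   # false stalls tolerated before falling back to OoO

class BlockState:
    """Per-block switching state (conceptually part of the BIC metadata)."""
    def __init__(self):
        self.backend = "OoO"
        self.last_schedule = None
        self.stable_count = 0

    def after_ooo_run(self, schedule):
        # OoO -> VLIW: the dynamic schedule was the same for N consecutive
        # executions, so record it and mark the block for the VLIW backend.
        if schedule == self.last_schedule:
            self.stable_count += 1
        else:
            self.stable_count = 0
            self.last_schedule = schedule
        if self.stable_count >= N:
            self.backend = "VLIW"

    def after_vliw_run(self, false_stalls):
        # VLIW -> OoO: too many false stalls means the recorded static
        # schedule no longer matches the block's dynamic behavior.
        if false_stalls > MAX_FALSE_STALLS:
            self.backend = "OoO"
            self.stable_count = 0

b = BlockState()
for _ in range(5):
    b.after_ooo_run((0, 1, 2, 3))   # same schedule every time
switched = b.backend                # promoted to the VLIW backend
b.after_vliw_run(false_stalls=20)   # schedule stopped fitting
fallback = b.backend                # demoted back to the OoO backend
```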

Talk Agenda Background and Motivation Two New Observations The Heterogeneous Block Architecture (HBA) An Initial HBA Design Experimental Evaluation Conclusions 31

Experimental Methodology Models  Cycle-accurate execution-driven x86-64 simulator – ISA remains unchanged  Energy model with modified McPAT [Li+ MICRO’09] – various leakage parameters investigated Workloads  184 different representative execution checkpoints  SPEC CPU2006, Olden, MediaBench, Firefox, Ffmpeg, Adobe Flash player, MySQL, lighttpd web server, LaTeX, Octave, an x86 simulator 32

Simulation Setup 33

Four Core Comparison Points 1. Out-of-order core with large, monolithic backend 2. Clustered out-of-order core with split backend [Farkas+ MICRO’97] 3. Coarse-grained heterogeneous design [Lukefahr+ MICRO’12]  Combines out-of-order and in-order cores  Switches mode for coarse-grained intervals based on a performance model (ideal in our results)  Only one mode can be active: in-order or out-of-order 4. Coarse-grained heterogeneous design, with clustered OoO core 34

Key Results: Power, EPI, Performance HBA greatly reduces power and energy-per-instruction (>30%) HBA performance is within 1% of monolithic out-of-order core HBA’s performance higher than coarse-grained hetero. core HBA is the most power- and energy-efficient design among the five different core configurations 35 Averaged across 184 workloads

Energy Analysis (I) HBA reduces energy/power by concurrently  Decoupling Execution Backends to Simplify the Core (Clustering)  Exploiting Block Atomicity to Simplify the Core (Atomicity)  Exploiting Heterogeneous Backends to Specialize (Heterogeneity) What is the relative energy benefit of each when applied successively? 36

Energy Analysis (II) 37 +Clustering: 8% +Atomicity: 17% +Heterogeneity: 6% HBA effectively takes advantage of clustering, atomic blocks, and heterogeneous backends to reduce energy Averaged across 184 workloads

Comparison to Best Previous Design Coarse-grained heterogeneity + clustering the OoO core 38 +Coarse-grained hetero: 9% +Clustering: 9% HBA: 15% lower EPI HBA provides higher efficiency (+15%) at higher performance (+2%) than coarse-grained heterogeneous core with clustering Averaged across 184 workloads

Per-Workload Results HBA reduces energy on almost all workloads HBA can reduce or improve performance (analysis in paper) 39 S-curve of all 184 workloads

Power-Performance Tradeoff Space HBA enables new design points in the tradeoff space 40 (Chart: power vs. relative performance, normalized to the baseline OoO core; the GOAL region is high performance at low power) Averaged across 184 workloads

Cost of Heterogeneity Higher core area  Due to more backends  Can be optimized with good design (unified backends)  Core area is an increasingly small fraction of an SoC (only 17% in Apple’s A7)  Analyzed in the paper, but our design is not optimized for area Increased leakage power  HBA benefits reduce with higher transistor leakage  Analyzed in the paper 41

More Results and Analyses in the Paper Performance bottlenecks Fraction of OoO vs. VLIW blocks Energy optimizations in VLIW mode Energy optimizations in general Area and leakage Other and detailed results available in our technical report: Chris Fallin, Chris Wilkerson, and Onur Mutlu, “The Heterogeneous Block Architecture,” SAFARI Technical Report, Carnegie Mellon University, March 2014.

Talk Agenda Background and Motivation Two New Observations The Heterogeneous Block Architecture (HBA) An Initial HBA Design Experimental Evaluation Conclusions 43

Conclusions HBA is the first heterogeneous core substrate that enables  concurrent execution of fine-grained code blocks  on the most-efficient backend for each block Three characteristics of HBA  Atomic block execution to exploit fine-grained heterogeneity  Exploits stable instruction schedules to adapt code to backends  Uses clustering, atomicity and heterogeneity to reduce energy HBA greatly improves energy efficiency over four core designs A flexible execution substrate for exploiting fine-grained heterogeneity in core design 44

Building on This Work … HBA is a substrate for efficient core design One can design different cores using this substrate:  More backends of different types (CPU, SIMD, reconfigurable logic, …)  Better switching heuristics between backends  More optimization of each backend  Dynamic code optimizations specific to each backend  … 45

1 Carnegie Mellon University 2 Intel Corporation Chris Fallin 1 Chris Wilkerson 2 Onur Mutlu 1 The Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores

Why Atomic Blocks? Each block can be pre-specialized to its backend (pre-rename for OoO, pre-schedule VLIW in hardware, reorder instructions, eliminate instructions) Each backend can operate independently User-visible ISA can remain unchanged despite multiple execution backends 47

Nearby Chunks Have Different Behavior 48

In-Order vs. Superscalar vs. OoO 49

Baseline EPI Breakdown 50

Retired Block Types (I) 51

Retired Block Types (II) 52

Retired Block Types (III) 53

BIC Size Sensitivity 54

Livein-Liveout Statistics 55

Auxiliary Data: Benefits of HBA 56

Auxiliary Data: Benefits of HBA 57

Some Specific Workloads 58

Related Works Forming and reusing instruction schedules [DIF, ISCA’97][Transmeta’00][Banerjia, IEEE TOC’98][Palomar, ICCD’09] Coarse-grained heterogeneous cores [Lukefahr, MICRO’12] Atomic blocks and block-structured ISA [Melvin&Patt, MICRO’88, IJPP’95] Instruction prescheduling [Michaud&Seznec, HPCA’01] Yoga [Villavieja+, HPS Tech Report’14] 59