1 FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template
Niket K. Choudhary, Salil V. Wadhavkar, Tanmay A. Shah, Hiran Mayukh, Jayneel Gandhi, Brandon H. Dwiel, Sandeep Navada, Hashem H. Najaf-abadi, Eric Rotenberg
Center for Efficient, Scalable, and Reliable Computing
Department of Electrical & Computer Engineering
North Carolina State University

2 Single Core to Multiple Cores
© Niket K. Choudhary, 38th Int'l Symp. on Computer Architecture
o Generic microarchitecture
o One-size-fits-all approach
o Sub-optimal performance for individual applications
o Power inefficient
o Exciting opportunity to exploit application diversity
o Employ many microarchitecturally diverse core designs
o Higher performance on individual applications
o Power efficient
[Figure: a single generic core alongside diverse cores A, B, C, and D]

3 Application Diversity
[Figure: App. 1, App. 2, and App. 3 differ in ILP characteristics; a heterogeneous multi-core with Core A, Core B, Core C, and Core D differs in structure sizes, superscalar width, and pipeline depth]
Customize each core to an application, class of application, or class of application behavior

4 Benefits of Employing Diverse Cores
o R. Kumar et al. Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction (MICRO 2003)
o R. Kumar et al. Core Architecture Optimization for Heterogeneous Chip Multiprocessors (PACT 2006)
o B. C. Lee et al. Efficiency Trends and Limits from Comprehensive Microarchitectural Adaptivity (ASPLOS 2008)
o M. A. Suleman et al. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures (ASPLOS 2009)
o H. H. Najaf-abadi et al. Core-Selectability in Chip Multiprocessors (PACT 2009)
Prior works have shown significant performance and power advantages

5 Achilles Heel of Employing Diverse Cores
o Designing and verifying a core is expensive
o Designing and verifying many different core types is prohibitively expensive and impractical
o No prior research in heterogeneous multi-core has addressed this challenge
[Figure: Core A, Core B, Core C, Core D]

6 FabScalar
o Automate the generation of superscalar processors
o Our approach: frame superscalar processors in a canonical form
  o All superscalar processors have the same set of canonical pipeline stages and interfaces among them, expressed by a Canonical Superscalar Template
  o A Canonical Pipeline Stage Library (CPSL) provides many different designs for each canonical pipeline stage that differ in major superscalar dimensions
o Automation is enabled because of
  o Invariant interfaces among canonical pipeline stages
  o Confinement of microarchitectural diversity within the canonical pipeline stages

7 Canonical Superscalar Template

8 Canonical Pipeline Stage Library (CPSL)
Microarchitectural diversity is focused along key dimensions:
1) Superscalar complexity
  o Superscalar width, i.e., number of superscalar ways
  o Sizes of stage-specific structures for extracting instruction-level parallelism (ILP)
2) Sub-pipelining
  o Pipeline depth of a canonical stage
3) Stage-specific design choices
  o E.g., different speculation alternatives, recovery alternatives, etc.
Example: Issue Stage
  o Width: 1-wide, 2-wide, 3-wide
  o Issue queue: iq-cam:8, iq-cam:32, iq-cam:64
  o Depth: 1-deep, 2-deep, 3-deep
  o Issuing policy: oldest-first, critical-first
  o Squash policy: complete or selective
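
To give a feel for how quickly these dimensions multiply, the Issue Stage example can be enumerated as a cross product of its listed options. This is only an illustrative sketch: the class and field names are hypothetical (not FabScalar identifiers), and it assumes every combination is available in the library, which the slide does not state.

```python
from dataclasses import dataclass
from itertools import product

# Option values are the ones listed for the Issue Stage on the slide.
WIDTHS = (1, 2, 3)
IQ_CAM_SIZES = (8, 32, 64)
DEPTHS = (1, 2, 3)
ISSUE_POLICIES = ("oldest-first", "critical-first")
SQUASH_POLICIES = ("complete", "selective")


@dataclass(frozen=True)
class IssueStageVariant:  # hypothetical name, for illustration only
    width: int
    iq_cam_entries: int
    depth: int
    issue_policy: str
    squash_policy: str


def enumerate_issue_variants():
    """Return every combination of the listed options (assumed to all be valid)."""
    combos = product(WIDTHS, IQ_CAM_SIZES, DEPTHS, ISSUE_POLICIES, SQUASH_POLICIES)
    return [IssueStageVariant(*combo) for combo in combos]


variants = enumerate_issue_variants()
print(len(variants))  # 3 * 3 * 3 * 2 * 2 = 108
```

Under that assumption a single canonical stage already contributes 108 candidate designs, which is why confining diversity inside each stage matters for automation.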

9 Core Generator
[Figure: the Core Generator takes a core configuration, the CPSL, and the Canonical Superscalar Template (Fetch, Decode, Rename, Dispatch, Issue, Register Read, Execute, Writeback, Retire) and produces synthesizable RTL of a core customized for App. 1; the Fetch, Rename, and Issue stages are called out]

10 Core Generator
[Figure: the Core Generator takes a core configuration, the CPSL, and the Canonical Superscalar Template (Fetch, Decode, Rename, Dispatch, Issue, Register Read, Execute, Writeback, Retire) and produces synthesizable RTL of a core customized for App. 2; the Fetch, Rename, and Issue stages are called out]
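
The composition step in the two generator figures can be sketched in a few lines. This is a toy model under stated assumptions, not the FabScalar tool: the library contents, variant keys, and module names are invented, and the function only returns a list of chosen modules rather than emitting RTL. It relies on the one property the talk does state, that every core has the same canonical stages, so a configuration only has to pick one variant per stage.

```python
# Canonical stage order, as named in the template figure.
CANONICAL_STAGES = (
    "Fetch", "Decode", "Rename", "Dispatch", "Issue",
    "Register Read", "Execute", "Writeback", "Retire",
)

# Toy stand-in for the CPSL: stage -> {variant key: module name}.
# Every variant key and module name here is hypothetical.
TOY_CPSL = {
    stage: {"default": stage.lower().replace(" ", "_") + "_default"}
    for stage in CANONICAL_STAGES
}
TOY_CPSL["Fetch"]["2-wide"] = "fetch_w2"
TOY_CPSL["Issue"]["2-wide"] = "issue_w2"


def compose_core(config, cpsl=TOY_CPSL):
    """Pick one variant per canonical stage; unspecified stages use 'default'."""
    unknown = set(config) - set(CANONICAL_STAGES)
    if unknown:
        raise ValueError(f"not canonical stages: {sorted(unknown)}")
    core = []
    for stage in CANONICAL_STAGES:
        key = config.get(stage, "default")
        if key not in cpsl[stage]:
            raise KeyError(f"no variant {key!r} for stage {stage}")
        core.append((stage, cpsl[stage][key]))
    return core


core_for_app1 = compose_core({"Fetch": "2-wide", "Issue": "2-wide"})
for stage, module in core_for_app1:
    print(f"{stage:13s} -> {module}")
```

Because the stage list is fixed, a different application only changes the configuration dictionary; the composition loop itself stays the same.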

11 Addressing Design-Effort Problem
o FabScalar boosts designer productivity by generating RTL designs of whole cores
o Quality RTL is an essential starting point of chip design cycle
o Starting point for design tuning, verification, and physical design
o Highly-ported RAMs and CAMs are pervasive in superscalars
o FabMem: generates layouts of highly-ported RAMs and CAMs
o See paper for more about FabMem

12 Outline
o Quality assessment of FabScalar-generated cores
  o Functional and IPC validation
  o Timing validation
  o Suitability for standard ASIC and FPGA flows
o Extensibility of CPSL
o G21: a workload-agnostic heterogeneous multi-core
o Future work and conclusion

13 Validation Results
o Evaluate the quality of the register-transfer-level (RTL) designs produced by FabScalar along three fronts:
  o Functional and IPC validation
  o Timing validation
  o Suitability for physical design
EDA tool(s) / library used:
  Functional verification: Cadence NC-Verilog, vers s006
  Logic synthesis: Synopsys Design Compiler, vers. X SP3
  Place & route: Cadence SoC Encounter, vers. 7.1
  Standard cell library: FreePDK 45nm
  SPICE model: BSIM4 Predictive Technology Model

14 Functional & IPC Validation
o Unit testing on isolated canonical pipeline stages of different widths/depths
o Generate multiple arbitrary cores and run CPU2000 benchmarks on them
[Table: configurations of Core-1 through Core-12; only the entries that survive in this transcript are listed, in their original order]
  Fetch/Decode/Rename/Dispatch width:
  Issue/RR/Execute/WB/Retire width:
  Function unit mix (simple, complex, branch, load/store): 1,1,1,1 | 3,1,1,1 | 2,1,1,1 | 3,1,1,1 | 5,1,1,1 | 1,1,1,1 | 3,1,1,1 | 1,1,1,1 | 3,1,1,1
  Fetch queue:
  Active list (ROB):
  Physical register file (PRF):
  Issue queue (IQ):
  Load queue / store queue (LQ/SQ): 32 / / 1632 / 32
  Branch predictor: bimodal bimodal with block-ahead gshare
  Branch history table (BHT) (# entries): 64K
  Branch target buffer (BTB) (# entries): 4K
  Return address stack (RAS):
  Branch order buffer (BOB):
  Fetch depth:
  Rename depth:
  Issue depth, total / wakeup-select loop: 2 / 2 | 1 / 1 | 3 / 2 | 2 / 2 | 3 / 2 | 2 / 2
  Register Read (and Writeback) depth:
  Fetch-to-execute pipeline depth:

17 Functional & IPC Validation
o RTL successfully simulates 100M-instruction SimPoints from different benchmarks
o IPC from RTL closely tracks IPC from the C++ simulator
o IPC differences among cores correlate with microarchitecture differences

18 Timing Validation
o Compare cycle times and raw fetch-to-execute delays of FabScalar-generated cores with three different commercial processors:
  o 90nm POWER5
  o 180nm Alpha
  o 65nm MIPS32 74K
o All processors implement RISC ISAs
o Represent extremes from highly custom designed to fully synthesized
o Convert all delays to fanout-of-4 (FO4)
[Figure: POWER5 pipeline stages, integer pipeline, fetch-to-execute; from B. Sinharoy et al., IBM Journal of R&D, 2005]
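
Converting delays to FO4 is what makes designs in 180nm, 90nm, 65nm, and 45nm processes comparable: a delay is divided by the delay of one inverter driving four copies of itself in the same technology. A minimal sketch follows; the function name and all numbers are hypothetical and are not taken from the talk, and the FO4 delay of each process is assumed to be supplied by the caller (for example from a SPICE measurement).

```python
def to_fo4(delay_ps: float, fo4_ps: float) -> float:
    """Express a delay as a multiple of the technology's FO4 inverter delay."""
    if fo4_ps <= 0:
        raise ValueError("fo4_ps must be positive")
    return delay_ps / fo4_ps


# Hypothetical example: two processes whose raw cycle times differ by 2x.
cycle_old = to_fo4(delay_ps=1000.0, fo4_ps=40.0)  # slower, older process
cycle_new = to_fo4(delay_ps=500.0, fo4_ps=20.0)   # faster, newer process
print(cycle_old, cycle_new)  # 25.0 25.0
```

In this made-up case the two designs have the same logic depth per cycle even though one clocks twice as fast, which is the kind of technology-independent comparison the FO4 metric is meant to allow.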

19 Timing Validation
[Table: commercial processors and their FabScalar counterparts; columns: Power5 | Alpha | MIPS 74K]
  Fetch width: 8 | 4 | 4
  Dispatch width: 5 | 4 | 2
  Issue width: 8 | 6 | 1
  Fetch queue:
  Issue queue(s): Int+Ld/St: 36, FP: 24, Br.: 12, CR: 10 | Int: 20, FP: 15 | Int: 8, Agen: 8
  Physical register file(s): Int: 120, FP: 120 | Int: 80, FP: 72 | 64
  Load queue / store queue: 32 / 32 8 / 8
  L1 I$ / L1 D$ (KB): 64 / 32 | 64 / 64 | 32 / 32
  Fetch-to-execute pipeline depth: 126
  Cycle time of commercial core: 23 FO4 | 25 FO4 | 33 FO4
  Cycle time of FabScalar core: 29 FO4 | 37 FO4 | 32 FO4
  Cycle time of deeper FabScalar core: 25 FO4 (depth=15) | 26 FO4 (depth=11) | N/A
  Raw fetch-to-execute delay of FabScalar core: 291 FO4 | 188 FO4 | 384 FO4
  Cycle time of FabScalar core with ideal latch-based design: 24 FO4 | 32 FO4 | N/A
  with ideal latch-based design: 17% | 14%
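
The two percentages in the last row are consistent with the cycle-time rows if they are read as the reduction in FabScalar cycle time obtained with the ideal latch-based design. That reading is an assumption, since the row label did not survive intact; the short check below uses only the FO4 values from the table, and the function name is hypothetical.

```python
def reduction_pct(baseline_fo4: float, improved_fo4: float) -> float:
    """Percentage reduction in cycle time relative to the baseline."""
    return 100.0 * (baseline_fo4 - improved_fo4) / baseline_fo4


# FabScalar cycle times from the table: baseline vs. ideal latch-based design.
power5_like = reduction_pct(29, 24)  # core compared against POWER5
alpha_like = reduction_pct(37, 32)   # core compared against Alpha
print(round(power5_like), round(alpha_like))  # 17 14
```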

20 Physical Design Validation
o Synthesized and place-and-routed a 4-way superscalar processor
o Also synthesized the same core to a Virtex-5 FPGA in a BEE3 system
ASIC flow:
  Technology: 45nm
  Die area (excluding L1 caches): 2.6 mm²
  Clock frequency: 500 MHz
  Timing-critical path: Next-PC logic
FPGA flow (columns: bzip | gzip | mcf | parser):
  FPGA & verilog retired instr.: 10M
  FPGA & verilog cycles: 1.13M | 1.7M | 0.84M | 1.16M
  FPGA & verilog IPC:
  FPGA simulation time (s):
  Verilog simulation time (s): 4,018 | 5,536 | 2,870 | 3,748
  Simulation speedup: 5,357 | 4,538 | 2,372 | 4,308
  FPGA speed (MHz): 50
  FPGA effective speed (MHz):

21 Extensibility
o Extensibility of CPSL is important for proliferating microarchitectural diversity
o Two examples:
  o LMP: Load Misspeculation Predictor (fix IPC bottleneck)
  o DEAP: Decoupled Effective Ahead Pipelining for conditional branch predictors (fix cycle-time)
[Table columns: Design | Canonical pipestages modified | Signals added to interface | Implementation effort]
  LMP | Dispatch, Execute (LSU), Retire | Dispatch/LSU, Retire/Dispatch | 2days-1author
  DEAP | Fetch-14days-1author

22 Providing Diverse Cores
o FabScalar framework provides a design space of almost 38,000 different cores
o Fit the most complex configuration for a given clock-period and pipeline depth (to maximize single-thread performance)
o Prior approaches
  o Customize a core to a specific application
  o Co-customize multiple cores to a specific multiprogrammed workload
o Customizing cores to specific workloads has two drawbacks:
  o Computationally intensive design-space exploration
  o Not robust for general performance (or for an arbitrary workload)
With FabScalar, a chip with many different superscalar core types is conceivable

23 A Workload-Agnostic Heterogeneous Multi-Core
o G21: Generic heterogeneous multi-core
o Not trained for specific workloads
o 21 core types provide a wide range of microarchitectural diversity
o Maximizes single-thread performance for arbitrary instruction-level behavior
[Figure: G21 core selection; axes: cycle time (ns) and superscalar width; annotations: larger structures, higher frequency]

24 Analysis of G21
o Consider two multi-core designs
o Best-1:
  o Homogeneous multi-core with a single core type
  o The best harmonic-mean of BIPS across benchmarks
  o I.e., single core type customized to workload as a whole
o G21:
  o Proposed heterogeneous multi-core
  o Assume ideal benchmark-to-core mapping for G21
o Peak BIPS of a benchmark
  o Highest BIPS possible for benchmark, considering entire design space
  o I.e., customize core to individual benchmark
  o Used only as upper-bound on performance
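
The three quantities defined on this slide can be computed from a table of per-core, per-benchmark BIPS. The sketch below shows one way to do that; the BIPS values, core names, and benchmark names are invented for illustration, and the two-core heterogeneous set merely stands in for G21.

```python
from statistics import harmonic_mean

# Hypothetical data: BIPS[core][benchmark]. All values are made up.
BIPS = {
    "core_a": {"bench1": 2.0, "bench2": 1.0, "bench3": 1.5},
    "core_b": {"bench1": 1.0, "bench2": 2.0, "bench3": 1.5},
    "core_c": {"bench1": 1.6, "bench2": 1.6, "bench3": 1.2},
}
BENCHMARKS = ("bench1", "bench2", "bench3")


def best_1(bips):
    """Single core type with the best harmonic-mean BIPS across benchmarks."""
    return max(bips, key=lambda c: harmonic_mean([bips[c][b] for b in BENCHMARKS]))


def ideal_mapping(bips, cores):
    """Per-benchmark BIPS when each benchmark runs on its best core in `cores`."""
    return {b: max(bips[c][b] for c in cores) for b in BENCHMARKS}


peak = ideal_mapping(BIPS, BIPS)                    # whole design space: upper bound
hetero = ideal_mapping(BIPS, ["core_a", "core_b"])  # stand-in for a heterogeneous set
homog = BIPS[best_1(BIPS)]                          # Best-1 on every benchmark

for b in BENCHMARKS:
    print(b, round(homog[b] / peak[b], 2), round(hetero[b] / peak[b], 2))
```

With these invented numbers the balanced core_c wins the harmonic mean but reaches only 80% of peak on every benchmark, while the two specialized cores together reach peak on all three. The real comparison depends on measured BIPS, which this sketch does not attempt to reproduce.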

25 Analysis of G21
[Figure: Performance of Best-1 and G21]
o Best-1 is within 10% of peak performance
o Worst sub-optimality: 70% of peak performance
[Table: Best-1 core; columns: cycle time (ns) | fetch/issue width | fetch-to-execute depth | L1 I$ | L1 D$ | IQ | LQ/SQ | Phys. Reg. File; surviving entries: Best / /32128]

26 Analysis of G21
[Figure: Performance of Best-1 and G21]
o Severe sub-optimality (note: sub-optimality can become more severe for unknown workload)

27 Analysis of G21
[Figure: Performance of Best-1 and G21]
o G21 is within 3% of peak performance
o Worst sub-optimality: 88% of peak performance

28 Analysis of G21
o G21 highlights the merits of workload-agnostic design
  o Low computational complexity
  o Robust performance for unknown workloads
o G21 outperforms Best-1 (even though Best-1 is customized to workload)
  o Diversity not only delivers better efficiency (BIPS/Watt) but also delivers better raw performance
  o E.g. microarchitecture-loops and instruction-level behavior
o G21 is highly representative of the entire design space w.r.t. single-thread performance
  o Can be used to distill fewer cores

29 Future Work
o Correct-by-Construction
  o Application of formal verification & tools
o Expanding CPSL
  o Floating-point and multimedia instr. support
  o More features ...
o Selection of Cores
  o G21 is a preliminary study
  o N-of-G21: factor in other metrics, e.g. power, area
o FabFPGA
  o Automate the mapping of FabScalar-generated cores to FPGAs
  o Accelerate verification
  o Design space exploration

30 Conclusion
o Providing microarchitecturally diverse cores has significant benefits but requires multiple core designs
o FabScalar addresses the practical issue of designing and verifying multiple cores
o FabScalar is a novel toolset for automatically composing the RTL designs of arbitrary superscalar cores
o FabScalar toolset is available as open-source gateware

