RAMP Gold: ParLab InfiniCore Model
Krste Asanovic, UC Berkeley
RAMP Retreat, January 16, 2008

Presentation transcript:

1 RAMP Gold: ParLab InfiniCore Model. Krste Asanovic, UC Berkeley. RAMP Retreat, January 16, 2008

2 Outline
- UCB Parallel Computing Laboratory (ParLab) overview
- InfiniCore: UCB’s manycore prototype architecture
- RAMP Gold: a RAMP model for InfiniCore

3 UCB ParLab Overview
Goal: make it easy to write correct software that runs efficiently on manycore.
[Slide diagram: the ParLab stack. Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) build on Motifs/Dwarfs. A Productivity Layer provides the Composition & Coordination Language (C&CL) with its compiler/interpreter, parallel libraries, parallel frameworks, sketching, and autotuners; an Efficiency Layer holds legacy code, schedulers, communication & synchronization primitives, efficiency languages, efficiency language compilers, and type systems. Correctness tools (static verification, dynamic checking, debugging with replay, directed testing) cut across both layers. Beneath sit the OS (legacy OS, OS libraries+services, hypervisor) and the architecture (multicore/GPGPU, InfiniCore/RAMP Gold).]

4 “Manycore” covers a huge design space
[Slide diagram: one chip mixing “fat” cores, “thin” cores, and special-purpose cores/HW accelerators, each core with its own L1; an L2 interconnect to multiple on-chip L2 cache/RAM banks; a memory & I/O interconnect to multiple off-chip DRAM/Flash channels and fast serial I/O ports. Many alternative memory hierarchies are possible.]

5 Narrowing our search space
Laptops/handhelds => single-socket systems
- Don’t expect >1 manycore chip per platform
- Servers/HPC will probably use multiple single-socket blades
Homogeneous, general-purpose cores
- Present most of the interesting design challenges
- Resulting designs can later be specialized for improved efficiency
“Simple” in-order cores
- Want a low energy/op floor
- Want a high performance/area ceiling
- More predictable performance
A “tiled” physical design
- Reduces logical/physical design verification costs
- Enables design reuse across a large family of parts
- Provides natural locality to reduce latency and energy/op
- Natural redundancy for yield enhancement & surviving failures

6 InfiniCore
ParLab’s “strawman” manycore architecture
- A playground (punching bag?) for trying out architecture ideas
Highlights:
- Flexible hardware partitioning & protected communication
- Latency-tolerant CPUs
- Fast and flexible synchronization primitives
- Configurable memory hierarchy and user-level DMA
- Pervasive QoS and performance counters

7 InfiniCore Architecture Overview
Four separate on-chip network types:
- Control networks combine 1-bit signals in a combinational tree for interrupts & barriers
- Active message networks carry register-register messages between cores
- The L2/coherence network connects L1 caches to L2 slices and, indirectly, to memory
- The memory network connects L2 slices to memory controllers
I/O devices and accelerators potentially attach to all network types. Flash replaces rotating disks; the only high-speed I/O is network & display.
[Slide diagram: cores (each with L1 I- and D-caches) and accelerator/I/O interfaces sit on the control/barrier, active message, L2/coherence, and memory networks; L2 slices (L2 RAM, L2 tags, L2 control) connect through memory controllers (MEMC) to DRAM and Flash via the I/O pins.]

8 Physical View of the Tiled Architecture
[Slide diagram: an array of identical tiles, each containing a core, L1 I- and D-caches, an L2 cache slice, and an interconnect switch, with DRAM, Flash, and I/O channels at the chip edge.]

9 Core Internals
RISC-style 64-bit instruction set
- SPARC V9 used for pragmatic reasons
In-order pipeline with decoupled single-lane (64-bit) vector unit (VU)
- The integer control unit generates/checks addresses in order, giving precise exceptions on vector loads/stores
- The VU runs behind, executing queued instructions on queued load data
- The VU executes both scalar & vector code and can mix them (e.g., vector load plus scalar ALU)
- Each VU cycle: 2 ALU ops, 1 load, 1 store (all 64-bit)
Vector regfile is configurable to trade reduced instruction fetch for fewer register spills
- 256 total registers (e.g., 32 regs x 8 elements, or 8 regs x 32 elements)
Decoupling is a cheap way to tolerate memory latency inside a thread (scalar & vector); a sketch of the idea follows below. Vectors increase performance, reduce energy/op, and increase the effective decoupling queue size.
[Slide diagram: a control processor (Int 64b, “1-3 issue?”) with GPRs, L1 I$/D$, and TLB/PLB issues through a command queue to the decoupled vector unit (Int/FP 64b, 2x64b FLOPS/clock) with its vector registers; load data queues (store queues not shown) connect to the outer levels of the memory hierarchy.]
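
The decoupling idea can be illustrated in a few lines. This is a minimal software analogue (the 100-cycle latency and the toy program are assumptions, not from the slide): the control processor runs ahead, launching loads and queuing instructions, while the vector unit runs behind and stalls only if the next load’s data has not yet returned.

    from collections import deque

    MEM_LATENCY = 100  # illustrative miss latency, in cycles
    program = [("load", 0), ("alu", None), ("load", 8), ("alu", None)]

    cmd_q = deque()   # instructions queued for the decoupled vector unit
    data_q = deque()  # (ready_cycle, value) for loads launched early

    # Control processor runs ahead: it generates/checks addresses in
    # order (precise exceptions) and launches loads immediately.
    for issue_cycle, (op, addr) in enumerate(program):
        if op == "load":
            data_q.append((issue_cycle + MEM_LATENCY, f"mem[{addr}]"))
        cmd_q.append(op)

    # Vector unit runs behind: memory latency is hidden unless the
    # data for the next queued load has not arrived yet.
    cycle = len(program)
    while cmd_q:
        op = cmd_q.popleft()
        if op == "load":
            ready, value = data_q.popleft()
            cycle = max(cycle, ready)  # stall only if data not back yet
            print(f"cycle {cycle}: load completes with {value}")
        else:
            print(f"cycle {cycle}: ALU op executes")
        cycle += 1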

10 Cache Coherence
L1 cache coherence is tracked at the L2 memory managers (set of readers)
- All cases except a write to a currently read-shared line are handled in pure hardware
- The writer gets a trap on the memory response and invokes a handler
- The same process is used for transactional memory (TM)
- Cache tags are visible to user-level software in the partition, which is useful for TM swapping
[Slide diagram: the same tile/network picture as the architecture overview, with cores, L1 caches, L2 slices, memory controllers (MEMC), DRAM, and Flash on the four networks.]
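
The reader-set bookkeeping and the trap-on-write behavior can be sketched briefly. This is a toy model under my own assumptions (a dictionary as the directory, a trap flag carried on the response); the slide specifies only the behavior, not this representation:

    readers = {}  # line address -> set of core ids holding the line in L1

    def l2_read(core, line):
        # Pure hardware case: record the reader and return the data.
        readers.setdefault(line, set()).add(core)
        return "data"

    def l2_write(core, line):
        sharers = readers.get(line, set()) - {core}
        if sharers:
            # Write to a currently read-shared line: not resolved in
            # hardware; the memory response carries a trap so the
            # writer invokes a software handler (also the TM hook).
            return {"trap": True, "sharers": sharers}
        readers[line] = {core}
        return {"trap": False, "sharers": set()}

    l2_read(0, 0x80)          # core 0 shares the line
    print(l2_write(1, 0x80))  # {'trap': True, 'sharers': {0}}
    print(l2_write(1, 0xC0))  # {'trap': False, 'sharers': set()}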

11 RAMP Gold: A Model of the ParLab InfiniCore Target
The target is a single-socket tiled manycore system
- Based on the SPARC ISA (v8 -> v9)
- Distributed coherent caches
- Multiple on-chip networks (barrier, active message, coherence, memory)
- Multiple DRAM channels
Split timing/functional models, both in hardware, with host multithreading of both. Expect to model up to 1024 64-bit cores per system (8 BEE3 boards); predicted peak performance is around 1-10 GIPS with full timing models.

12 Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
A multithreaded emulation engine reduces FPGA resource use and improves emulator throughput, and it hides emulation latencies (e.g., communicating across FPGAs).
[Slide diagram: four target-model CPUs map onto one multithreaded host emulation engine on the FPGA: a single hardware pipeline (I$, IR, GPRs, D$) holds multiple copies of CPU state, selecting among per-target PCs each host cycle.]
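
A minimal software analogue of host multithreading (the target count, state fields, and step function are illustrative assumptions): one host pipeline loop interleaves the targets round-robin, and each target owns a private copy of architectural state, so a single engine emulates many CPUs.

    NUM_TARGETS = 4  # targets multiplexed onto one host pipeline

    class TargetState:
        """One copy of per-CPU architectural state."""
        def __init__(self):
            self.pc = 0
            self.gpr = [0] * 32

    targets = [TargetState() for _ in range(NUM_TARGETS)]

    def step(state):
        # Stand-in for fetch/decode/execute of one target instruction.
        state.gpr[1] += 1
        state.pc += 4

    # Single hardware pipeline, multiple copies of CPU state:
    for host_cycle in range(16):
        step(targets[host_cycle % NUM_TARGETS])  # thread-select, then execute

    print([t.pc for t in targets])  # [16, 16, 16, 16]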

13 Split Functional/Timing Models (HAsim, Emer (MIT/Intel); FAST, Chiou (UT Austin))
The functional model executes the CPU ISA correctly, with no timing information
- The functional model only needs to be developed once per ISA
The timing model captures pipeline timing details and does not need to execute code
- Much easier to change the timing model for architectural experimentation
- Without an RTL design, one cannot be 100% certain that the timing is accurate
Many possible splits exist between the timing and functional models.
[Slide diagram: a functional model paired with a timing model.]
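
The division of labor can be shown in a few lines. In this sketch (the instruction encoding and latency table are my assumptions), the functional model alone computes values and the timing model alone charges cycles, so timing experiments never touch the once-per-ISA functional model:

    def functional_step(state, insn):
        # Correct ISA semantics; knows nothing about time.
        if insn["op"] == "add":
            state[insn["rd"]] = state[insn["rs1"]] + state[insn["rs2"]]
        return insn["op"]

    LATENCY = {"add": 1, "load": 3, "branch": 2}  # illustrative numbers

    def timing_step(op, cycle):
        # Pipeline timing only; never computes a result value.
        return cycle + LATENCY[op]

    state = {"r1": 2, "r2": 3, "r3": 0}
    cycle = 0
    for insn in [{"op": "add", "rd": "r3", "rs1": "r1", "rs2": "r2"}]:
        cycle = timing_step(functional_step(state, insn), cycle)
    print(state["r3"], cycle)  # 5 1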

14 The RAMP Gold Approach
- Split (and decoupled) functional and timing models
- Host multithreading of both the functional and timing models

15 Multithreaded Functional & Timing Models
- An MT-Unit multiplexes multiple target units on a single host engine
- An MT-Channel multiplexes multiple target channels over a single host link
[Slide diagram: a functional model pipeline with per-target architectural state and a timing model pipeline with per-target timing state, each an MT-Unit, connected to each other by MT-Channels.]

16 RAMP Gold CPU Model (v0.1)
[Slide diagram: a functional PC/Fetch stage with per-target PCs and a functional ALU sit alongside Decode/Issue, Execute, and Commit timing stages, each with per-target GPR copies and timing state. Fetch commands, instructions, immediates, PC values, addresses, load/store data, and status flow between the stages and the instruction and data memory interfaces.]

17 RAMP Gold Memory Model (v0.1)
[Slide diagram: the multithreaded CPU models access a host DRAM cache backed by BEE board DRAM, with duplicate paths for the instruction and data interfaces.]

18 Matching physical resources to utilization
Only implement enough functional units to match expected utilization (the sketch below works the numbers). For a single-issue core, expected IPC is ~0.6.
Regfile ports (per timing model):
- Read ports (1.2 operands/instruction): 0.6 * 1.2 = 0.72
- Write ports (0.8 operands/instruction): 0.6 * 0.8 = 0.48
Instruction mix: Mem 0.3, FPU 0.1, Int 0.5, Branch 0.1. Therefore we only need (per timing model):
- 0.6 * 0.3 = 0.18 memory ports
- 0.6 * 0.1 = 0.06 FPUs
- 0.6 * 0.5 = 0.30 integer execution units
- 0.6 * 0.1 = 0.06 branch execution units
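
The slide’s arithmetic, written out: expected demand per timing model is IPC times frequency of use, and an engine hosting many timing models rounds the summed demand up to whole units. The rates below come from the slide; the 64-model host count is my assumption for illustration:

    import math

    IPC = 0.6
    mix = {"mem": 0.3, "fpu": 0.1, "int": 0.5, "branch": 0.1}
    models = 64  # assumed number of timing models sharing one engine

    print(f"read ports/model:  {IPC * 1.2:.2f}")   # 0.72
    print(f"write ports/model: {IPC * 0.8:.2f}")   # 0.48
    for unit, freq in mix.items():
        demand = IPC * freq                        # e.g., mem: 0.18
        print(f"{unit}: {demand:.2f}/model -> "
              f"{math.ceil(demand * models)} units for {models} models")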

19 Balancing Resource Utilization
[Slide diagram: several timing models and register files share a pool of functional units (one FPU, one memory port, several integer units, one branch unit) through a regfile/operand interconnect.]

20 RAMP Gold Capacity Estimates
For a SPARC V8 (32-bit) pipeline, purely functional (no timing model), integer only:
- For BEE3, predict 64 CPUs/engine and 8 engines/FPGA (LX110), or 512 CPUs/FPGA
- Throughput of 150 MHz * 8 engines = 1200 MIPS/FPGA
- 8 BEE3 boards * 4 FPGAs/board * 1200 MIPS/FPGA = ~38 GIPS/system
- Perhaps a 4x reduction in capacity with v9, the FPU, and timing models
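
A quick check of the slide’s arithmetic (all inputs are the slide’s own figures; the last line applies the slide’s own “perhaps 4x” derating):

    cpus_per_engine = 64
    engines_per_fpga = 8
    mhz_per_engine = 150
    fpgas = 8 * 4  # 8 BEE3 boards * 4 FPGAs/board

    print(cpus_per_engine * engines_per_fpga)          # 512 CPUs/FPGA
    mips_per_fpga = mhz_per_engine * engines_per_fpga  # 1200 MIPS/FPGA
    print(mips_per_fpga * fpgas / 1000)                # 38.4 GIPS/system
    print(mips_per_fpga * fpgas / 1000 / 4)            # ~9.6 GIPS with v9/FPU/timing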

