Hybrid PC architecture Jeremy Sugerman Kayvon Fatahalian.


1 Hybrid PC architecture Jeremy Sugerman Kayvon Fatahalian

2 Trends
- Multi-core CPUs
- Generalized GPUs: Brook, CTM, CUDA
- Tighter CPU-GPU coupling: PS3, Xbox 360, AMD "Fusion" (faster bus, but GPU still treated as a batch coprocessor)

3 CPU-GPU coupling
- Important apps (e.g. game engines) exhibit workloads suitable for both CPU- and GPU-style cores
  - GPU friendly: geometry processing, shading, physics (fluids/particles)
  - CPU friendly: IO, AI/planning, collisions, adaptive algorithms

4 CPU-GPU coupling
- Current: coarse-granularity interaction
  - Control: CPU launches a batch of work, waits for results before sending more commands (multi-pass)
  - Necessitates algorithmic changes
- GPU is a slave coprocessor
  - Limited mechanisms to create new work
  - CPU must deliver LARGE batches
  - CPU sends GPU commands via the "driver" model

5 Fundamentally different cores
- "CPU" cores
  - Small number (tens) of HW threads
  - Software (OS) thread scheduling
  - Memory system prioritizes minimizing latency
- "GPU" cores
  - Many HW threads (>1000), hardware scheduled
  - Minimize per-thread state (state kept on-chip): shared PC, wide SIMD execution, small register file, no thread stack
  - Memory system prioritizes throughput
- (Not clear: sync, SW-managed memory, isolation, resource constraints)

6 GPU as a giant scheduler
[Diagram: pipeline stages IA → VS → GS → RS → PS → OM linked by on-chip queues and fed by a command buffer; per-stage work amplification ranges from 1-to-1 and 1-to-N (bounded) through 1-to-(0 or X) (X static) to 1-to-N (unbounded); data buffers and the output stream live off-chip.]

7 GPU as a giant scheduler
[Diagram: a hardware scheduler with a thread scoreboard dispatches work from on-chip command, vertex, primitive, and fragment queues to the processing cores (VS/GS/PS) and fixed-function units (IA, RS, and OM, the latter doing read-modify-write); data buffers are off-chip.]

8 GPU as a giant scheduler
- The rasterizer (+ input command processor) is a domain-specific HW work scheduler
  - Millions of work items per frame
  - On-chip queues of work
  - Thousands of HW threads active at once
  - CPU threads (via API commands), GS programs, and fixed-function logic generate work
  - The pipeline describes dependencies
- What is the work here?
  - Vertices
  - Geometric primitives
  - Fragments
  - In the future: rays?
  - Well-defined resource requirements for each category
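The pipeline-as-queues view on the slides above can be sketched as follows. This is an illustrative model only: the stage names follow Direct3D, but the fan-out factors, class names, and queue behavior are assumptions, not the actual hardware.

```python
from collections import deque

class Stage:
    """One pipeline stage: an on-chip input queue plus a work-amplification factor."""
    def __init__(self, name, fan_out):
        self.name = name
        self.fan_out = fan_out      # outputs produced per input item (illustrative)
        self.queue = deque()        # stand-in for the stage's on-chip queue

# VS is 1-to-1; GS and RS amplify work (factors here are made up)
pipeline = [Stage("VS", 1), Stage("GS", 3), Stage("RS", 16), Stage("PS", 1)]

def run(vertices):
    """Push work through the pipeline; each stage may create work for the next."""
    pipeline[0].queue.extend(range(vertices))
    counts = {}
    for i, stage in enumerate(pipeline):
        produced = 0
        while stage.queue:
            item = stage.queue.popleft()
            produced += stage.fan_out
            if i + 1 < len(pipeline):
                pipeline[i + 1].queue.extend([item] * stage.fan_out)
        counts[stage.name] = produced
    return counts

print(run(8))  # {'VS': 8, 'GS': 24, 'RS': 384, 'PS': 384}
```

The point of the sketch is the amplification: a small command-buffer input fans out into millions of fragments, which is why the hardware scheduler and its on-chip queues are central.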

9 The project
- Investigate making "GPU" cores first-class execution engines in a multi-core system
- Add:
  - Fine-granularity interaction between cores
  - Processing work on any core can create new work (for any other core)
- Hypothesis: scheduling work (actions) is the key problem
  - Keeping state on-chip
- Drive architecture simulation with an interactive graphics pipeline augmented with ray tracing

10 Our architecture
- Multi-core processor = some "CPU"-style + some "GPU"-style cores
- Unified system address space
- "Good" interconnect between cores
- Actions (work) on any core can create new work
- Potentially:
  - Software-managed configurable L2
  - Synchronization/signaling primitives across actions

11 Need new scheduler
- The GPU HW scheduler leverages highly domain-specific information
  - Knows dependencies
  - Knows resources used by threads
- Need to move to a more general-purpose HW/SW scheduler, yet still do well
- Questions
  - What scheduling algorithms?
  - What information is needed to make decisions?
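One candidate answer to "what scheduling algorithms?" is a consumer-first policy: always run the non-empty stage furthest down the pipeline, so existing work is retired before upstream stages create more, keeping queues small enough to stay on-chip. A minimal sketch, with hypothetical stage names:

```python
def pick_stage(pipeline_order, queue_depth):
    """Consumer-first policy: run the deepest non-empty stage so work is
    drained before more is created, bounding total queued (on-chip) state.
    pipeline_order lists stages upstream-to-downstream; queue_depth maps
    stage name -> items waiting in its input queue."""
    for stage in reversed(pipeline_order):
        if queue_depth.get(stage, 0) > 0:
            return stage
    return None  # nothing to run

order = ["VS", "GS", "RS", "PS"]
print(pick_stage(order, {"VS": 10, "RS": 4}))  # 'RS': drain fragments before making more
```

This policy uses exactly the dependency information the slide says the GPU scheduler already exploits; the open question is how a general-purpose scheduler obtains it.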

12 Programming model = queues
- Model the system as a collection of work queues
  - Create work = enqueue
  - SW-driven dispatch of "CPU" core work
  - HW-driven dispatch of "GPU" core work
  - Application code does not dequeue
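A minimal sketch of this enqueue-only model: application kernels may only create work (enqueue); only the dispatcher, standing in for the SW (CPU) and HW (GPU) dispatch paths, ever dequeues. All kernel and queue names here are illustrative.

```python
from collections import deque

queues = {"cpu": deque(), "gpu": deque()}   # one queue per core class

def enqueue(target, kernel, item):
    """The only operation application code gets: create work on a queue."""
    queues[target].append((kernel, item))

def shade(item):                  # a "GPU" kernel: produces a result, no new work
    return item * 2

def intersect(item):              # a "CPU" kernel: creates work for a GPU core
    enqueue("gpu", shade, item + 1)

def dispatch():
    """Stand-in for the SW-driven (CPU) and HW-driven (GPU) dispatchers."""
    results = []
    while any(queues.values()):
        for q in queues.values():
            while q:
                kernel, item = q.popleft()
                r = kernel(item)
                if r is not None:
                    results.append(r)
    return results

enqueue("cpu", intersect, 3)
print(dispatch())  # [8]
```

Because application code never dequeues, the system is free to decide when, where, and in what order queued work actually runs.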

13 Benefits of queues
- Describe classes of work
  - Associate queues with environments: GPU (no gather), GPU + gather, GPU + create work (bounded), CPU, CPU + SW-managed L2
- Opportunity to coalesce/reorder work
  - Fine-grained creation, bulk execution
- Describe dependencies
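"Fine-grained creation, bulk execution" can be sketched as a coalescing step: items are enqueued one at a time, but dispatched in SIMD-width batches so wide "GPU" cores stay full. The width and batching policy below are illustrative assumptions.

```python
SIMD_WIDTH = 8  # hypothetical execution width of a "GPU" core

def batches(queue, width=SIMD_WIDTH):
    """Coalesce individually created work items into full SIMD batches;
    only the trailing batch may be partial (and run at lower efficiency)."""
    return [queue[i:i + width] for i in range(0, len(queue), width)]

work = list(range(19))                   # 19 items created one by one
print([len(b) for b in batches(work)])   # [8, 8, 3]
```

Because the queue sits between creation and dispatch, the scheduler can also reorder items here, e.g. grouping work that shares a kernel or resource binding to batch state changes.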

14 Decisions
- Granularity of work
  - Enqueue elements or batches?
- "Coherence" of work (batching state changes)
  - Associate kernels/resources with queues (part of the environment)?
- Constraints on enqueue
  - Fail gracefully in case of explosion
- Scheduling policy
  - Minimize state (size of queues)
  - How to understand dependencies
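The "fail gracefully in case of explosion" constraint can be sketched as a bounded queue: since queues model finite on-chip storage, enqueue can fail, and the producer must have a fallback, here spilling to off-chip memory. The capacity and spill policy are illustrative assumptions, not a proposed mechanism.

```python
from collections import deque

class BoundedQueue:
    """A work queue modeling finite on-chip storage."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.spill = []                # stand-in for off-chip overflow storage

    def enqueue(self, item):
        if len(self.items) < self.capacity:
            self.items.append(item)
            return True                # work stayed on-chip
        self.spill.append(item)        # graceful failure: spill off-chip
        return False

q = BoundedQueue(capacity=4)
results = [q.enqueue(i) for i in range(6)]
print(results)                          # [True, True, True, True, False, False]
print(list(q.items), q.spill)           # [0, 1, 2, 3] [4, 5]
```

A returned False could equally well throttle the producer instead of spilling; which response is right is exactly the kind of decision this slide raises.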

15 First steps
- Coarse architecture simulation
  - Hello world = run CPU + GPU threads; GPU threads create other threads
  - Identify GPU ISA additions
- Establish what information the scheduler needs
  - What are the "environments"?
- Eventually drive the simulation with a hybrid renderer

16 Evaluation
- Compare against architectural alternatives:
  1. Multi-pass rendering (very coarse-grain) with a domain-specific scheduler
     - Paper: "GPU" microarchitecture comparison with our design
     - Scheduling resources
     - On-chip state / performance tradeoff
     - On-chip bandwidth
  2. Many-core homogeneous CPU

17 Summary
- Hypothesis: elevating "GPU" cores to first-class execution engines is a better way to build a hybrid system
  - Apps with dynamic/irregular components
  - Performance
  - Ease of programming
- Allow all cores to generate new work by adding to system queues
- Scheduling the work in these queues is the key issue (goal: keep queues on-chip)

18 Three fronts
- GPU micro-architecture
  - GPU work creating GPU work
  - Generalization of the DirectX 10 GS
- CPU-GPU integration
  - GPU cores as first-class execution environments (dump the driver model)
  - Unified view of work throughout the machine
  - Any core creates work for other cores
- GPU resource management
  - Ability to correctly manage/virtualize GPU resources
  - Window manager

