
1 Programming Many-Core Systems with GRAMPS (Jeremy Sugerman, 14 May 2010)

2 The single fast core era is over
Trends:
–Changing metrics: ‘scale out’, not just ‘scale up’
–Increasing diversity: many different mixes of ‘cores’
Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!

3 High-level programming models
Two major advantages over threads & locks:
–Constructs to express/expose parallelism
–Scheduling support to help manage concurrency, communication, and synchronization
Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

4 My biases: workloads
Interesting applications have irregularity
Large bundles of coherent work are efficient
The producer-consumer idiom is important
Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

5 My target audience
Highly informed, but (good) lazy:
–Understands the hardware and best practices
–Dislikes rote work; prefers power over constraints
Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

6 Contributions: Design of GRAMPS
Programs are graphs of stages and queues
Queues:
–Maximum capacities, packet sizes
Stages:
–No, limited, or total automatic parallelism
–Fixed, variable, or reduction (in-place) outputs
(Figure: Simple Graphics Pipeline)

7 7 Contributions: Implementation Broad application scope: –Rendering, MapReduce, image processing, … Multi-platform applicability: –GRAMPS runtimes for three architectures Performance: –Scale-out parallelism, controlled data footprint –Compares well to schedulers from other models (Also: Tunable)

8 Outline
GRAMPS overview
Study 1: Future graphics architectures
Study 2: Current multi-core CPUs
Comparison with schedulers from other parallel programming models

9 GRAMPS Overview

10 GRAMPS
Programs are graphs of stages and queues:
–Expose the program structure
–Leave the program internals unconstrained

11 Writing a GRAMPS program
Design the application graph and queues
Design the stages
Instantiate and launch
(Figure: Cookie Dough Pipeline. Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
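
To make those three steps concrete, here is a minimal C++ sketch of building and launching a small GRAMPS-style graph. Everything in the gramps:: namespace (Graph, QueueDesc, addThreadStage, launchAndWait, …) is an invented, illustrative API rather than the actual GRAMPS interface; the two-stage pipeline loosely mirrors the cookie dough example.

    #include "gramps.h"  // hypothetical runtime header; all gramps:: names are invented

    // Stage entry points; bodies are sketched on the following slides.
    void MixDough(gramps::ThreadContext& ctx);     // Thread stage: stateful
    void FormCookies(gramps::ShaderContext& ctx);  // Shader stage: data-parallel

    int main() {
        gramps::Graph graph;

        // 1. Design the queues: bounded capacity, fixed packet size.
        gramps::QueueDesc desc;
        desc.maxPackets  = 16;    // maximum capacity bounds footprint
        desc.packetBytes = 4096;  // packets are the scheduling granularity
        gramps::QueueId doughQ = graph.addQueue(desc);
        gramps::QueueId trayQ  = graph.addQueue(desc);

        // 2. Design the stages and wire them to the queues.
        graph.addThreadStage(MixDough,    /*inputs=*/{},       /*outputs=*/{doughQ});
        graph.addShaderStage(FormCookies, /*inputs=*/{doughQ}, /*outputs=*/{trayQ});

        // 3. Instantiate and launch; the runtime schedules until the graph drains.
        graph.launchAndWait();
        return 0;
    }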

12 Queues
Bounded size; operate at “packet” granularity
–“Opaque” and “Collection” packets
GRAMPS can optionally preserve ordering
–Required for some workloads; adds overhead

13 Thread (and Fixed) stages
Preemptible, long-lived, stateful
–Often merge, compare, or repack inputs
Queue operations: Reserve/Commit
(Fixed: Thread stages in custom hardware)
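
A rough sketch of what a Thread stage’s Reserve/Commit loop might look like, using the same invented gramps:: API as above (reserveInput, commitOutput, and the repack helper are all illustrative):

    // Hypothetical Thread-stage body: a long-lived, stateful loop. Reserve
    // and Commit are the only queue operations, and the only points where
    // the runtime may preempt this stage.
    void MixDough(gramps::ThreadContext& ctx) {
        while (true) {
            gramps::Packet in = ctx.reserveInput(/*queue=*/0);  // may block
            if (in.isEndOfStream()) break;

            gramps::Packet out = ctx.reserveOutput(/*queue=*/0);
            repack(in, out);        // merge/compare/repack work goes here
            ctx.commitOutput(out);  // publish the packet downstream
            ctx.commitInput(in);    // release the input slot
        }
    }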

14 Shader stages
Automatically parallelized:
–Horde of non-preemptible, stateless instances
–Pre-reserve/post-commit
Push: variable/conditional output support
–GRAMPS coalesces pushed elements into full packets
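
A corresponding Shader-stage sketch (same invented API; worthKeeping and makeCookie are hypothetical helpers): the runtime pre-reserves the input packet and post-commits outputs when the instance finishes, and push lets an instance emit a variable number of elements that GRAMPS coalesces into full packets.

    // Hypothetical Shader-stage body: stateless and non-preemptible, with
    // one instance launched per input packet.
    void FormCookies(gramps::ShaderContext& ctx) {
        const gramps::Packet& in = ctx.input();  // pre-reserved by the runtime
        for (size_t i = 0; i < in.numElements(); ++i) {
            if (worthKeeping(in.element(i)))     // conditional output
                ctx.push(/*queue=*/0, makeCookie(in.element(i)));
        }
        // No explicit commit: outputs are post-committed when the instance ends.
    }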

15 Queue sets: Mutual exclusion
Independent exclusive (serial) subqueues
–Created statically or on first output
–Densely or sparsely indexed
Bonus: automatically instanced Thread stages
(Figure: Cookie Dough Pipeline)

16 Queue sets: Mutual exclusion (cont.)
(Figure: Cookie Dough Pipeline, now with a queue set)
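
A sketch of how a stage might target a queue set (again with invented API names; trayFor is a hypothetical keying function): each key names an exclusive subqueue, so work for one tray is serialized while different trays proceed in parallel.

    // Hypothetical producer writing into a queue set. Each subqueue is
    // exclusive (serial), so the automatically instanced consumer stage
    // never needs a lock for per-tray state.
    void BinCookies(gramps::ShaderContext& ctx) {
        const gramps::Packet& in = ctx.input();
        for (size_t i = 0; i < in.numElements(); ++i) {
            unsigned tray = trayFor(in.element(i));  // dense or sparse index
            ctx.pushToSubqueue(/*queueSet=*/0, /*key=*/tray, in.element(i));
        }
    }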

17 A few other tidbits
Instanced Thread stages
Queues as barriers / read all-at-once
In-place Shader stages / coalescing inputs

18 Formative influences
The Graphics Pipeline, early GPGPU
“Streaming”
Work-queues and task-queues

19 Study 1: Future Graphics Architectures (with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

20 Graphics is a natural first domain
Table stakes for commodity parallelism
GPUs are full of heterogeneity
Poised to transition from a fixed/configurable pipeline to a programmable one
We have a lot of experience in it

21 The Graphics Pipeline in GRAMPS
Graph and setup are (application) software
–Can be customized or completely replaced
Like the transition to programmable shading
–Not (unthinkably) radical
Fits current hardware: FIFOs, cores, rasterizer, …

22 Reminder: Design goals
Broad application scope
Multi-platform applicability
Performance: scale-out, footprint-aware

23 The Experiment
Three renderers:
–Rasterization, Ray Tracer, Hybrid
Two simulated future architectures
–Simple scheduler for each

24 Scope: Two(-plus) renderers
(Figures: Rasterization Pipeline, with ray tracing extension; Ray Tracing Graph)

25 Platforms: Two simulated systems
CPU-like: 8 fat cores, rasterizer
GPU-like: 1 fat core, 4 micro cores, rasterizer, scheduler

26 Performance: Metrics
“Maximize machine utilization while keeping working sets small”
Priority #1: Scale-out parallelism
–Parallel utilization
Priority #2: ‘Reasonable’ bandwidth / storage
–Worst-case total footprint of all queues
–Inherently a trade-off versus utilization

27 Performance: Scheduling
Simple prototype scheduler (both platforms):
Static stage priorities
Only preempt on Reserve and Commit
No dynamic weighting of current queue sizes
(Figure: graph stages labeled from lowest to highest priority)
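
A minimal, self-contained C++ sketch of that policy as stated on the slide (this paraphrases the described rules, not the actual prototype code; the Stage interface is assumed for illustration):

    #include <vector>

    // Assumed minimal stage interface, purely for illustration.
    struct Stage {
        virtual bool runnable() const = 0;  // input available and output space free
        virtual int  priority() const = 0;  // static, assigned from graph position
        virtual ~Stage() = default;
    };

    // A core runs the highest-priority runnable stage. It revisits this
    // choice only at Reserve/Commit boundaries (no other preemption), and
    // applies no dynamic weighting by current queue sizes.
    Stage* pickNextStage(const std::vector<Stage*>& stages) {
        Stage* best = nullptr;
        for (Stage* s : stages)
            if (s->runnable() && (!best || s->priority() > best->priority()))
                best = s;
        return best;  // null: nothing runnable on this core right now
    }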

28 Performance: Results
Utilization: 95+% for all but rasterized Fairy (~80%)
Footprint: < 600 KB CPU-like, < 1.5 MB GPU-like
Surprised how well the simple scheduler worked
Maintaining order costs footprint

29 Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

30 Reminder: Design goals
Broad application scope
Multi-platform applicability
Performance: scale-out, footprint-aware

31 The Experiment
9 applications, 13 configurations
One (more) architecture: multi-core x86
–It’s real (no simulation here)
–Built with pthreads, locks, and atomics
Per-pthread task-priority-queues with work-stealing
–More advanced scheduling

32 Scope: Application bonanza
GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
MapReduce: Hist (reduce/combine), LR (reduce/combine), PCA
Cilk(-like): Mergesort
CUDA: Gaussian, SRAD
StreamIt: FM, TDE

33 Scope: Many different idioms
(Figures: application graphs for FM, Merge Sort, Ray Tracer, SRAD, MapReduce)

34 Platform: 2x quad-core Nehalem
Queues: copy in/out, global (shared) buffer
Threads: user-level scheduled contexts
Shaders: create one task per input packet
Native: 8 Hyper-Threaded Core i7 cores

35 Performance: Metrics (reminder)
“Maximize machine utilization while keeping working sets small”
Priority #1: Scale-out parallelism
Priority #2: ‘Reasonable’ bandwidth / storage

36 Performance: Scheduling
Static per-stage priorities (still)
Work-stealing task-priority-queues
Eagerly create one task per packet (naïve)
Keep running stages until a low watermark
–(Limited dynamic weighting of queue depths)
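
A compilable C++ sketch of the per-pthread task-priority-queue structure this slide describes; the details (a mutex per queue, stealing through the same pop path, the Task payload) are illustrative simplifications rather than the actual runtime:

    #include <deque>
    #include <map>
    #include <mutex>
    #include <utility>

    struct Task { int stage; /* ... one packet's worth of stage work ... */ };

    // Per-worker structure: tasks grouped by static stage priority. A worker
    // drains its highest-priority bucket first and keeps a stage running
    // until its queue falls below a low watermark; an idle worker steals
    // from a victim through the same highest-priority-first path.
    class TaskPriorityQueue {
        std::map<int, std::deque<Task>> buckets_;  // priority -> FIFO of tasks
        std::mutex lock_;                          // coarse, but stealing is rare
    public:
        void push(int priority, Task t) {
            std::lock_guard<std::mutex> g(lock_);
            buckets_[priority].push_back(std::move(t));
        }
        bool pop(Task& out) {  // local pop: highest priority first
            std::lock_guard<std::mutex> g(lock_);
            for (auto it = buckets_.rbegin(); it != buckets_.rend(); ++it) {
                if (!it->second.empty()) {
                    out = std::move(it->second.front());
                    it->second.pop_front();
                    return true;
                }
            }
            return false;
        }
        bool steal(Task& out) { return pop(out); }  // victims yield work the same way
    };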

37 Performance: Good scale-out
(Footprint: good; detail a little later)
(Figure: parallel speedup versus hardware threads)

38 Performance: Low overheads
‘App’ and ‘Queue’ time are both useful work.
(Figure: execution time breakdown by percentage, 8 cores / 16 hyperthreads)

39 Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

40 Three archetypes
Task-Stealing (Cilk, TBB):
–Pro: low overhead with fine-granularity tasks
–Con: no producer-consumer, priorities, or data-parallel support
Breadth-First (CUDA, OpenCL):
–Pro: simple scheduler (one stage at a time)
–Con: no producer-consumer, no pipeline parallelism
Static (StreamIt / Streaming):
–Pro: no runtime scheduler; complex schedules possible
–Con: cannot adapt to irregular workloads

41 GRAMPS is a natural framework
(Table: GRAMPS vs. Task-Stealing, Breadth-First, and Static, compared on Shader Support, Producer-Consumer, Structured ‘Work’, and Adaptive; GRAMPS is the only one covering all four)

42 The Experiment
Re-use the exact same application code
Modify the scheduler per archetype:
–Task-Stealing: unbounded queues, no priorities, (amortized) preempt to child tasks
–Breadth-First: unbounded queues, one stage at a time, top-to-bottom
–Static: unbounded queues, offline per-thread schedule using SAS / SGMS

43 Seeing is believing (ray tracer)
(Figures: execution visualizations for GRAMPS, Breadth-First, Static (SAS), and Task-Stealing)

44 Comparison: Execution time
Mostly similar: good parallelism, load balance
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

45 Comparison: Execution time
Breadth-First can exhibit load imbalance
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

46 Comparison: Execution time
Task-Stealing can ping-pong and cause contention
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

47 Comparison: Footprint
Breadth-First is pathological (as expected)
(Figure: relative packet footprint versus GRAMPS, log scale)

48 Footprint: GRAMPS & Task-Stealing
(Figures: relative packet footprint; relative task footprint)

49 Footprint: GRAMPS & Task-Stealing
GRAMPS gets insight from the graph:
–(Application-specified) queue bounds
–Group tasks by stage for priority, preemption
(Figures: MapReduce and Ray Tracer footprints)

50 Static scheduling is challenging
Generating good static schedules is *hard*.
Static schedules are fragile:
–Small mismatches compound
–Hardware itself is dynamic (cache traffic, IRQs, …)
Limited upside: dynamic scheduling is cheap!
(Figures: execution time; packet footprint)

51 Discussion (for multi-core CPUs)
Adaptive scheduling is the obvious choice.
–Better load balance / handling of irregularity
Semantic insight (the application graph) gives a big advantage in managing footprint.
More cores and development maturity → more complex graphs, and thus more advantage.

52 Conclusion

53 Contributions revisited
GRAMPS programming model design
–Graph of heterogeneous stages and queues
Good results from an actual implementation
–Broad scope: wide range of applications
–Multi-platform: three different architectures
–Performance: high parallelism, good footprint

54 Anecdotes and intuitions
Structure helps: an explicit graph is handy.
Simple (principled) dynamic scheduling works.
Queues impedance-match heterogeneity.
Graphs with cycles and push both paid off.
(Also: paired instrumentation and visualization help enormously.)

55 Conclusion: Future trends revisited
Core counts are increasing
–Parallel programming models
Memory and bandwidth are precious
–Working set, locality (i.e., footprint) management
Power and performance are driving heterogeneity
–All ‘cores’ need to communicate, interoperate
→ GRAMPS fits them well.

56 Thanks
Eric, for agreeing to make this happen.
Christos, for throwing helpers at me.
Kurt, Mendel, and Pat, for, well, a lot.
John Gerth, for tireless computer servitude.
Melissa (and Heather and Ada before her).

57 Thanks
My practice audiences
My many collaborators: Daniel, Kayvon, Mike, Tim
Supporters at NVIDIA, ATI/AMD, Intel
Supporters at VMware
Everyone who entertained, informed, and challenged me, and made me think

58 Thanks
My funding agencies:
–Rambus Stanford Graduate Fellowship
–Department of the Army Research
–Stanford Pervasive Parallelism Laboratory

59 Q&A
Thank you for listening! Questions?

60 Extra Material (Backup)

61 Data: CPU-Like & GPU-Like

62 Footprint data: Native

63 Tunability
Diagnosis:
–Raw counters, statistics, logs
–Grampsviz
Optimize / control:
–Graph topology (e.g., sort-middle vs. sort-last)
–Queue watermarks (e.g., 10x win for ray tracing)
–Packet size: match SIMD widths, share data

64 Tunability: Grampsviz (1)
(Figure: GPU-like rasterization pipeline visualization)

65 Tunability: Grampsviz (2)
(Figure: CPU-like histogram (MapReduce) visualization, showing Reduce and Combine stages)

66 Tunability: Knobs
Graph topology/design: (Figures: Sort-Middle vs. Sort-Last)
Sizing critical queues

67 Alternatives

68 Alternate contribution formulation
Design of the GRAMPS model
–Structure computation as a graph of heterogeneous stages
–Communication via programmer-sized queues
Many applications written in GRAMPS
GRAMPS runtimes for three platforms (dynamic scheduling)
Evaluation of the GRAMPS scheduler against Task-Stealing, Breadth-First, Static

69 A few other tidbits
In-place Shader stages / coalescing inputs
Instanced Thread stages
Queues as barriers / read all-at-once
(Figure: Image Histogram Pipeline)

70 Performance: Good scale-out
(Footprint: good; detail a little later)
(Figure: parallel speedup versus hardware threads)

71 Seeing is believing (ray tracer)
(Figures: execution visualizations for GRAMPS, Static (SAS), Task-Stealing, and Breadth-First)

72 Comparison: Execution time
Small ‘Sched’ time, even with large graphs
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

