
1 Programming Many-Core Systems with GRAMPS (Jeremy Sugerman, 14 May 2010)

2 The single fast core era is over
Trends:
–Changing metrics: ‘scale out’, not just ‘scale up’
–Increasing diversity: many different mixes of ‘cores’
Today’s (and tomorrow’s) machines: commodity, heterogeneous, many-core
Problem: How does one program all this complexity?!

3 High-level programming models
Two major advantages over threads & locks:
–Constructs to express/expose parallelism
–Scheduling support to help manage concurrency, communication, and synchronization
Widespread in research and industry: OpenGL/Direct3D, SQL, Brook, Cilk, CUDA, OpenCL, StreamIt, TBB, …

4 My biases: workloads
Interesting applications have irregularity
Large bundles of coherent work are efficient
The producer-consumer idiom is important
Goal: Rebuild coherence dynamically by aggregating related work as it is generated.

5 My target audience
Highly informed, but (good) lazy:
–Understands the hardware and best practices
–Dislikes rote work; prefers power over constraints
Goal: Let systems-savvy developers efficiently develop programs that efficiently map onto their hardware.

6 Contributions: Design of GRAMPS
Programs are graphs of stages and queues
Queues:
–Maximum capacities, packet sizes
Stages:
–No, limited, or total automatic parallelism
–Fixed, variable, or reduction (in-place) outputs
(Figure: Simple Graphics Pipeline)

7 7 Contributions: Implementation Broad application scope: –Rendering, MapReduce, image processing, … Multi-platform applicability: –GRAMPS runtimes for three architectures Performance: –Scale-out parallelism, controlled data footprint –Compares well to schedulers from other models (Also: Tunable)

8 Outline
GRAMPS overview
Study 1: Future graphics architectures
Study 2: Current multi-core CPUs
Comparison with schedulers from other parallel programming models

9 GRAMPS Overview

10 GRAMPS
Programs are graphs of stages and queues:
–Expose the program structure
–Leave the program internals unconstrained

11 Writing a GRAMPS program
Design the application graph and queues
Design the stages
Instantiate and launch
(Figure: Cookie Dough Pipeline. Credit: http://www.foodnetwork.com/recipes/alton-brown/the-chewy-recipe/index.html)
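
To make those three steps concrete, here is a minimal C++ sketch of building and launching a small GRAMPS-style graph. Everything in the gramps:: namespace (Graph, QueueDesc, addThreadStage, launchAndWait, …) is an invented, illustrative API rather than the actual GRAMPS interface; the two-stage pipeline loosely mirrors the cookie dough example.

    #include "gramps.h"  // hypothetical runtime header; all gramps:: names are invented

    // Stage entry points; bodies are sketched on the following slides.
    void MixDough(gramps::ThreadContext& ctx);     // Thread stage: stateful
    void FormCookies(gramps::ShaderContext& ctx);  // Shader stage: data-parallel

    int main() {
        gramps::Graph graph;

        // 1. Design the queues: bounded capacity, fixed packet size.
        gramps::QueueDesc desc;
        desc.maxPackets  = 16;    // maximum capacity bounds footprint
        desc.packetBytes = 4096;  // packets are the scheduling granularity
        gramps::QueueId doughQ = graph.addQueue(desc);
        gramps::QueueId trayQ  = graph.addQueue(desc);

        // 2. Design the stages and wire them to the queues.
        graph.addThreadStage(MixDough,    /*inputs=*/{},       /*outputs=*/{doughQ});
        graph.addShaderStage(FormCookies, /*inputs=*/{doughQ}, /*outputs=*/{trayQ});

        // 3. Instantiate and launch; the runtime schedules until the graph drains.
        graph.launchAndWait();
        return 0;
    }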

12 Queues
Bounded size; operate at “packet” granularity
–“Opaque” and “Collection” packets
GRAMPS can optionally preserve ordering
–Required for some workloads; adds overhead

13 Thread (and Fixed) stages
Preemptible, long-lived, stateful
–Often merge, compare, or repack inputs
Queue operations: Reserve/Commit
(Fixed: Thread stages in custom hardware)
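
A rough sketch of what a Thread stage’s Reserve/Commit loop might look like, using the same invented gramps:: API as above (reserveInput, commitOutput, and the repack helper are all illustrative):

    // Hypothetical Thread-stage body: a long-lived, stateful loop. Reserve
    // and Commit are the only queue operations, and the only points where
    // the runtime may preempt this stage.
    void MixDough(gramps::ThreadContext& ctx) {
        while (true) {
            gramps::Packet in = ctx.reserveInput(/*queue=*/0);  // may block
            if (in.isEndOfStream()) break;

            gramps::Packet out = ctx.reserveOutput(/*queue=*/0);
            repack(in, out);        // merge/compare/repack work goes here
            ctx.commitOutput(out);  // publish the packet downstream
            ctx.commitInput(in);    // release the input slot
        }
    }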

14 Shader stages
Automatically parallelized:
–Horde of non-preemptible, stateless instances
–Pre-reserve/post-commit
Push: variable/conditional output support
–GRAMPS coalesces pushed elements into full packets
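
A corresponding Shader-stage sketch (same invented API; worthKeeping and makeCookie are hypothetical helpers): the runtime pre-reserves the input packet and post-commits outputs when the instance finishes, and push lets an instance emit a variable number of elements that GRAMPS coalesces into full packets.

    // Hypothetical Shader-stage body: stateless and non-preemptible, with
    // one instance launched per input packet.
    void FormCookies(gramps::ShaderContext& ctx) {
        const gramps::Packet& in = ctx.input();  // pre-reserved by the runtime
        for (size_t i = 0; i < in.numElements(); ++i) {
            if (worthKeeping(in.element(i)))     // conditional output
                ctx.push(/*queue=*/0, makeCookie(in.element(i)));
        }
        // No explicit commit: outputs are post-committed when the instance ends.
    }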

15 Queue sets: Mutual exclusion
Independent exclusive (serial) subqueues
–Created statically or on first output
–Densely or sparsely indexed
Bonus: automatically instanced Thread stages
(Figure: Cookie Dough Pipeline)

16 Queue sets: Mutual exclusion (cont.)
(Figure: Cookie Dough Pipeline, now with a queue set)
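
A sketch of how a stage might target a queue set (again with invented API names; trayFor is a hypothetical keying function): each key names an exclusive subqueue, so work for one tray is serialized while different trays proceed in parallel.

    // Hypothetical producer writing into a queue set. Each subqueue is
    // exclusive (serial), so the automatically instanced consumer stage
    // never needs a lock for per-tray state.
    void BinCookies(gramps::ShaderContext& ctx) {
        const gramps::Packet& in = ctx.input();
        for (size_t i = 0; i < in.numElements(); ++i) {
            unsigned tray = trayFor(in.element(i));  // dense or sparse index
            ctx.pushToSubqueue(/*queueSet=*/0, /*key=*/tray, in.element(i));
        }
    }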

17 A few other tidbits
Instanced Thread stages
Queues as barriers / read all-at-once
In-place Shader stages / coalescing inputs

18 Formative influences
The Graphics Pipeline, early GPGPU
“Streaming”
Work-queues and task-queues

19 Study 1: Future Graphics Architectures (with Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan; appeared in ACM Transactions on Graphics, January 2009)

20 Graphics is a natural first domain
Table stakes for commodity parallelism
GPUs are full of heterogeneity
Poised to transition from a fixed/configurable pipeline to a programmable one
We have a lot of experience in it

21 The Graphics Pipeline in GRAMPS
Graph and setup are (application) software
–Can be customized or completely replaced
Like the transition to programmable shading
–Not (unthinkably) radical
Fits current hardware: FIFOs, cores, rasterizer, …

22 Reminder: Design goals
Broad application scope
Multi-platform applicability
Performance: scale-out, footprint-aware

23 The Experiment
Three renderers:
–Rasterization, Ray Tracer, Hybrid
Two simulated future architectures
–Simple scheduler for each

24 Scope: Two(-plus) renderers
(Figures: Rasterization Pipeline, with ray tracing extension; Ray Tracing Graph)

25 Platforms: Two simulated systems
CPU-like: 8 fat cores, rasterizer
GPU-like: 1 fat core, 4 micro cores, rasterizer, scheduler

26 Performance: Metrics
“Maximize machine utilization while keeping working sets small”
Priority #1: Scale-out parallelism
–Parallel utilization
Priority #2: ‘Reasonable’ bandwidth / storage
–Worst-case total footprint of all queues
–Inherently a trade-off versus utilization

27 Performance: Scheduling
Simple prototype scheduler (both platforms):
Static stage priorities
Only preempt on Reserve and Commit
No dynamic weighting of current queue sizes
(Figure: graph stages labeled from lowest to highest priority)
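
A minimal, self-contained C++ sketch of that policy as stated on the slide (this paraphrases the described rules, not the actual prototype code; the Stage interface is assumed for illustration):

    #include <vector>

    // Assumed minimal stage interface, purely for illustration.
    struct Stage {
        virtual bool runnable() const = 0;  // input available and output space free
        virtual int  priority() const = 0;  // static, assigned from graph position
        virtual ~Stage() = default;
    };

    // A core runs the highest-priority runnable stage. It revisits this
    // choice only at Reserve/Commit boundaries (no other preemption), and
    // applies no dynamic weighting by current queue sizes.
    Stage* pickNextStage(const std::vector<Stage*>& stages) {
        Stage* best = nullptr;
        for (Stage* s : stages)
            if (s->runnable() && (!best || s->priority() > best->priority()))
                best = s;
        return best;  // null: nothing runnable on this core right now
    }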

28 Performance: Results
Utilization: 95+% for all but rasterized Fairy (~80%)
Footprint: < 600 KB CPU-like, < 1.5 MB GPU-like
Surprised how well the simple scheduler worked
Maintaining order costs footprint

29 Study 2: Current Multi-core CPUs (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

30 Reminder: Design goals
Broad application scope
Multi-platform applicability
Performance: scale-out, footprint-aware

31 The Experiment
9 applications, 13 configurations
One (more) architecture: multi-core x86
–It’s real (no simulation here)
–Built with pthreads, locks, and atomics
Per-pthread task-priority-queues with work-stealing
–More advanced scheduling

32 Scope: Application bonanza
GRAMPS: Ray tracer (0, 1 bounce), Spheres (no rasterization, though)
MapReduce: Hist (reduce/combine), LR (reduce/combine), PCA
Cilk(-like): Mergesort
CUDA: Gaussian, SRAD
StreamIt: FM, TDE

33 Scope: Many different idioms
(Figures: application graphs for FM, Merge Sort, Ray Tracer, SRAD, MapReduce)

34 Platform: 2x quad-core Nehalem
Queues: copy in/out, global (shared) buffer
Threads: user-level scheduled contexts
Shaders: create one task per input packet
Native: 8 Hyper-Threaded Core i7 cores

35 Performance: Metrics (reminder)
“Maximize machine utilization while keeping working sets small”
Priority #1: Scale-out parallelism
Priority #2: ‘Reasonable’ bandwidth / storage

36 Performance: Scheduling
Static per-stage priorities (still)
Work-stealing task-priority-queues
Eagerly create one task per packet (naïve)
Keep running stages until a low watermark
–(Limited dynamic weighting of queue depths)
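
A compilable C++ sketch of the per-pthread task-priority-queue structure this slide describes; the details (a mutex per queue, stealing through the same pop path, the Task payload) are illustrative simplifications rather than the actual runtime:

    #include <deque>
    #include <map>
    #include <mutex>
    #include <utility>

    struct Task { int stage; /* ... one packet's worth of stage work ... */ };

    // Per-worker structure: tasks grouped by static stage priority. A worker
    // drains its highest-priority bucket first and keeps a stage running
    // until its queue falls below a low watermark; an idle worker steals
    // from a victim through the same highest-priority-first path.
    class TaskPriorityQueue {
        std::map<int, std::deque<Task>> buckets_;  // priority -> FIFO of tasks
        std::mutex lock_;                          // coarse, but stealing is rare
    public:
        void push(int priority, Task t) {
            std::lock_guard<std::mutex> g(lock_);
            buckets_[priority].push_back(std::move(t));
        }
        bool pop(Task& out) {  // local pop: highest priority first
            std::lock_guard<std::mutex> g(lock_);
            for (auto it = buckets_.rbegin(); it != buckets_.rend(); ++it) {
                if (!it->second.empty()) {
                    out = std::move(it->second.front());
                    it->second.pop_front();
                    return true;
                }
            }
            return false;
        }
        bool steal(Task& out) { return pop(out); }  // victims yield work the same way
    };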

37 Performance: Good scale-out
(Footprint: good; detail a little later)
(Figure: parallel speedup versus hardware threads)

38 Performance: Low overheads
‘App’ and ‘Queue’ time are both useful work.
(Figure: execution time breakdown by percentage, 8 cores / 16 hyperthreads)

39 Comparison with Other Schedulers (with (alphabetically) Christos Kozyrakis, David Lo, Daniel Sanchez, Richard Yoo; submitted to PACT 2010)

40 Three archetypes
Task-Stealing (Cilk, TBB):
–Pro: low overhead with fine-granularity tasks
–Con: no producer-consumer, priorities, or data-parallel support
Breadth-First (CUDA, OpenCL):
–Pro: simple scheduler (one stage at a time)
–Con: no producer-consumer, no pipeline parallelism
Static (StreamIt / Streaming):
–Pro: no runtime scheduler; complex schedules possible
–Con: cannot adapt to irregular workloads

41 GRAMPS is a natural framework
(Table: GRAMPS vs. Task-Stealing, Breadth-First, and Static, compared on Shader Support, Producer-Consumer, Structured ‘Work’, and Adaptive; GRAMPS is the only one covering all four)

42 The Experiment
Re-use the exact same application code
Modify the scheduler per archetype:
–Task-Stealing: unbounded queues, no priorities, (amortized) preempt to child tasks
–Breadth-First: unbounded queues, one stage at a time, top-to-bottom
–Static: unbounded queues, offline per-thread schedule using SAS / SGMS

43 Seeing is believing (ray tracer)
(Figures: execution visualizations for GRAMPS, Breadth-First, Static (SAS), and Task-Stealing)

44 Comparison: Execution time
Mostly similar: good parallelism, load balance
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

45 Comparison: Execution time
Breadth-First can exhibit load imbalance
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

46 Comparison: Execution time
Task-Stealing can ping-pong and cause contention
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

47 Comparison: Footprint
Breadth-First is pathological (as expected)
(Figure: relative packet footprint versus GRAMPS, log scale)

48 Footprint: GRAMPS & Task-Stealing
(Figures: relative packet footprint; relative task footprint)

49 Footprint: GRAMPS & Task-Stealing
GRAMPS gets insight from the graph:
–(Application-specified) queue bounds
–Group tasks by stage for priority, preemption
(Figures: MapReduce and Ray Tracer footprints)

50 Static scheduling is challenging
Generating good static schedules is *hard*.
Static schedules are fragile:
–Small mismatches compound
–Hardware itself is dynamic (cache traffic, IRQs, …)
Limited upside: dynamic scheduling is cheap!
(Figures: execution time; packet footprint)

51 Discussion (for multi-core CPUs)
Adaptive scheduling is the obvious choice.
–Better load balance / handling of irregularity
Semantic insight (the application graph) gives a big advantage in managing footprint.
More cores and development maturity → more complex graphs, and thus more advantage.

52 Conclusion

53 Contributions revisited
GRAMPS programming model design
–Graph of heterogeneous stages and queues
Good results from an actual implementation
–Broad scope: wide range of applications
–Multi-platform: three different architectures
–Performance: high parallelism, good footprint

54 Anecdotes and intuitions
Structure helps: an explicit graph is handy.
Simple (principled) dynamic scheduling works.
Queues impedance-match heterogeneity.
Graphs with cycles and push both paid off.
(Also: paired instrumentation and visualization help enormously.)

55 Conclusion: Future trends revisited
Core counts are increasing
–Parallel programming models
Memory and bandwidth are precious
–Working set, locality (i.e., footprint) management
Power and performance are driving heterogeneity
–All ‘cores’ need to communicate, interoperate
→ GRAMPS fits them well.

56 Thanks
Eric, for agreeing to make this happen.
Christos, for throwing helpers at me.
Kurt, Mendel, and Pat, for, well, a lot.
John Gerth, for tireless computer servitude.
Melissa (and Heather and Ada before her).

57 Thanks
My practice audiences
My many collaborators: Daniel, Kayvon, Mike, Tim
Supporters at NVIDIA, ATI/AMD, Intel
Supporters at VMware
Everyone who entertained, informed, and challenged me, and made me think

58 Thanks
My funding agencies:
–Rambus Stanford Graduate Fellowship
–Department of the Army Research
–Stanford Pervasive Parallelism Laboratory

59 Q&A
Thank you for listening! Questions?

60 Extra Material (Backup)

61 Data: CPU-Like & GPU-Like

62 Footprint data: Native

63 Tunability
Diagnosis:
–Raw counters, statistics, logs
–Grampsviz
Optimize / control:
–Graph topology (e.g., sort-middle vs. sort-last)
–Queue watermarks (e.g., 10x win for ray tracing)
–Packet size: match SIMD widths, share data

64 Tunability: Grampsviz (1)
(Figure: GPU-like rasterization pipeline visualization)

65 Tunability: Grampsviz (2)
(Figure: CPU-like histogram (MapReduce) visualization, showing Reduce and Combine stages)

66 Tunability: Knobs
Graph topology/design: (Figures: Sort-Middle vs. Sort-Last)
Sizing critical queues

67 Alternatives

68 Alternate contribution formulation
Design of the GRAMPS model
–Structure computation as a graph of heterogeneous stages
–Communication via programmer-sized queues
Many applications written in GRAMPS
GRAMPS runtimes for three platforms (dynamic scheduling)
Evaluation of the GRAMPS scheduler against Task-Stealing, Breadth-First, Static

69 A few other tidbits
In-place Shader stages / coalescing inputs
Instanced Thread stages
Queues as barriers / read all-at-once
(Figure: Image Histogram Pipeline)

70 Performance: Good scale-out
(Footprint: good; detail a little later)
(Figure: parallel speedup versus hardware threads)

71 Seeing is believing (ray tracer)
(Figures: execution visualizations for GRAMPS, Static (SAS), Task-Stealing, and Breadth-First)

72 Comparison: Execution time
Small ‘Sched’ time, even with large graphs
(Figure: time breakdown by percentage for GRAMPS, Task-Stealing, Breadth-First)

