
1 DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou
University of Illinois at Urbana-Champaign and Intel
denovo@cs.illinois.edu

2 Motivation
Goal: Safe and efficient parallel computing
– Easy, safe programming model
– Complexity-, power-, performance-scalable hardware
Today: shared-memory
– Complex, power- and performance-inefficient hardware
  * Complex directory coherence, unnecessary traffic, ...
– Difficult programming model
  * Data races, non-determinism, composability/modularity?, testing?
– Mismatched interface between HW and SW, a.k.a. memory model
  * Can't specify "what value a read can return"
  * Data races defy acceptable semantics
⇒ Fundamentally broken for hardware & software

3 Motivation (repeats the previous slide, adding:)
Banish shared memory?

4 Motivation (repeats the previous slide, adding:)
Banish wild shared memory! Need disciplined shared memory!

5 What is Shared-Memory?
Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

7 What is Shared-Memory?
Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

9 What is Shared-Memory?
Disciplined Shared-Memory = Global address space + Explicit, structured side-effects
(implicit, anywhere communication and synchronization is replaced by explicit, structured side-effects)

10 Benefits of Explicit Effects
Disciplined shared memory = explicit effects + structured parallel control
Strong safety properties – Deterministic Parallel Java (DPJ)
– Determinism-by-default, explicit & safe non-determinism
– Simple semantics, composability, testability
Efficiency: complexity, performance, power – DeNovo
– Simplify coherence and consistency
– Optimize communication and storage layout
⇒ Simple programming model AND complexity-, power-, performance-scalable hardware

11 DeNovo Hardware Project
Exploit discipline for efficient hardware
– Current driver is DPJ (Deterministic Parallel Java)
– End goal is language-oblivious interface
Software research strategy
– Deterministic codes
– Safe non-determinism
– OS, legacy, …
Hardware research strategy
– On-chip: coherence, consistency, communication, data layout
– Off-chip: similar ideas apply, next step

12 Current Hardware Limitations
Complexity
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
Storage overhead
– Directory overhead for sharer lists
Performance and power inefficiencies
– Invalidation and ack messages
– False sharing
– Indirection through the directory
– Fixed cache-line communication granularity

13 Contributions (vs. Current Hardware Limitations)
Each limitation on the previous slide has a DeNovo answer:
Complexity
– Subtle races and numerous transient states in the protocol → no transient states
– Hard to extend for optimizations → simple to extend for optimizations
Storage overhead
– Directory overhead for sharer lists → no storage overhead for directory information
Performance and power inefficiencies
– Invalidation and ack messages → no invalidation and ack messages
– False sharing → no false sharing
– Indirection through the directory → no indirection through the directory
– Traffic (fixed cache-line communication) → flexible, not cache-line, communication
Up to 81% reduction in memory stall time

14 Outline
Motivation
Background: DPJ
Base DeNovo Protocol
DeNovo Optimizations
Evaluation
– Complexity
– Performance
Conclusion and Future Work

15 Background: DPJ [OOPSLA 09]
Extension to Java; fully Java-compatible
Structured parallel control: nested fork-join style
– foreach, cobegin
A novel region-based type and effect system
Speedups close to hand-written Java programs
Expressive enough for irregular, dynamic parallelism

16 Regions and Effects
Region: a name for a set of memory locations
– Programmer assigns a region to each field and array cell
– Regions partition the heap
Effect: a read or write on a region
– Programmer summarizes effects of method bodies
Compiler checks that
– Region types are consistent, effect summaries are correct
– Parallel tasks are non-interfering (no conflicts)
– Simple, modular type checking (no inter-procedural …)
Programs that type-check are guaranteed determinism
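As a hedged illustration of the non-interference check (our example, reusing the cobegin and writes syntax of the Pair class on the next slide), two parallel branches that write the same region are rejected at compile time:

    // Hypothetical DPJ-style code, not from the talk: fails to type-check
    cobegin {
        setX(1);   // writes Blue
        setX(2);   // writes Blue – conflicts with the branch above
    }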

17 Example: A Pair Class
Declaring and using region names – region names have static scope (one per class)

    class Pair {
        region Blue, Red;
        int X in Blue;
        int Y in Red;
        void setX(int x) writes Blue { this.X = x; }
        void setY(int y) writes Red { this.Y = y; }
        void setXY(int x, int y) writes Blue; writes Red {
            cobegin {
                setX(x); // writes Blue
                setY(y); // writes Red
            }
        }
    }

[Diagram: a Pair object whose field X = 3 lives in region Pair.Blue and whose field Y = 42 lives in region Pair.Red]

18 Example: A Pair Class
Writing method effect summaries – each method in the code above declares the regions it writes (writes Blue, writes Red)

19 Example: A Pair Class
Expressing parallelism – in setXY above, cobegin runs setX and setY concurrently; their effects (writes Blue, writes Red) are inferred and checked to be non-interfering

20 Outline
Motivation
Background: DPJ
Base DeNovo Protocol
DeNovo Optimizations
Evaluation
– Complexity
– Performance
Conclusion and Future Work

21 Memory Consistency Model
Guaranteed determinism ⇒ read returns value of last write in sequential order:
1. Same task in this parallel phase
2. Or before this parallel phase
[Diagram: a parallel phase containing a LD 0xa and a ST 0xa in the same task, with another ST 0xa before the phase]

22 Memory Consistency Model (cont.)
[Diagram: same as above, but the relevant ST 0xa is in the previous parallel phase; a coherence mechanism must deliver its value to the LD 0xa]
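A hedged DPJ-style sketch (ours; it mirrors the Phase1 syntax of the Example Run slide later in the talk, and the "reads" effect clause is our assumption) of what the model guarantees across phases:

    Phase1 writes Blue {            // parallel phase 1
        foreach i in 0, size {
            PArray[i].setX(i);      // last write to each X in sequential order
        }
    }
    Phase2 reads Blue {             // next parallel phase
        foreach i in 0, size {
            int v = PArray[i].X;    // guaranteed to see the value written in Phase1
        }
    }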

23 Cache Coherence
Coherence enforcement requires
1. Invalidating stale copies in caches
2. Tracking the up-to-date copy
Explicit effects
– Compiler knows all regions written in this parallel phase
– Cache can self-invalidate before the next parallel phase
  * Invalidates data in writeable regions not accessed by itself
Registration
– Directory keeps track of one up-to-date copy
– Writer updates before the next parallel phase
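A minimal Java sketch (ours, simulator-style; the class and field names are assumptions, not the authors' code) of the self-invalidation step: at a phase boundary the L1 drops words in regions written during the phase that it did not access itself, without receiving any invalidation messages.

    import java.util.Set;

    enum WordState { INVALID, VALID, REGISTERED }

    class L1Cache {
        WordState[] state;     // per-word coherence state
        int[] region;          // region id of each cached word, known from the declared effects
        boolean[] touched;     // accessed by this core during the current phase?

        // Called before the next parallel phase begins.
        void selfInvalidate(Set<Integer> regionsWrittenThisPhase) {
            for (int w = 0; w < state.length; w++) {
                if (state[w] == WordState.VALID
                        && regionsWrittenThisPhase.contains(region[w])
                        && !touched[w]) {
                    state[w] = WordState.INVALID;  // stale copy dropped locally
                }
                touched[w] = false;                // reset for the next phase
            }
        }
    }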

24 Basic DeNovo Coherence
Assume (for now): private L1, shared L2; single-word line
– Data-race freedom at word granularity
No space overhead
– Keep valid data or registered core id
– L2 data arrays double as directory
No transient states
[State diagram: Invalid → Valid on a read; any state → Registered on a write (registration)]
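Reusing the WordState enum from the sketch above, the per-word protocol reduces to two transition functions; there are no transient states because a read of an invalid word simply becomes Valid once the data arrives, and a write immediately registers the word:

    // Hedged rendering of the slide's state diagram (ours)
    class DeNovoWordProtocol {
        WordState onRead(WordState s) {
            return (s == WordState.INVALID) ? WordState.VALID : s;  // fetch data if invalid
        }
        WordState onWrite(WordState s) {
            return WordState.REGISTERED;  // also sends a registration request to the L2
        }
    }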

25 Example Run

    class Pair {
        int X in DeNovo-region Blue;
        int Y in DeNovo-region Red;
        void setX(int x) writes Blue { this.X = x; }
    }
    Pair PArray[size];
    ...
    Phase1 writes Blue { // DeNovo effect
        foreach i in 0, size {
            PArray[i].setX(…);
        }
        self_invalidate(Blue);
    }

[Animation: per-word states (R = Registered, V = Valid, I = Invalid) for the X and Y fields in the L1s of Core 1 and Core 2 and in the shared L2. Starting from all-Valid, Core 1 writes X0–X2 and Core 2 writes X3–X5; registration acks leave each writer's X words Registered, and the L2 records the registered core id (C1 or C2) in place of the data. At the end of the phase, self_invalidate(Blue) makes each L1 invalidate the X words registered by the other core, while the untouched Y words stay Valid.]

26 Practical DeNovo Coherence
Basic protocol impractical
– High tag storage overhead (a tag per word)
Address/transfer granularity > coherence granularity
DeNovo line-based protocol
– Traditional software-oblivious spatial locality
– Coherence granularity still at word level
  * No word-level false sharing
"Line merging" cache: one tag per line with per-word state, e.g. V V R under a single tag (see the sketch below)
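A small sketch (ours, reusing WordState from above) of the line-merging idea: the tag is per line, but coherence state stays per word, so one line can hold words registered by different writers without word-level false sharing.

    class CacheLine {
        long tag;           // one tag per line: address/transfer granularity
        WordState[] word;   // per-word state, e.g. { VALID, VALID, REGISTERED }
        int[] data;         // data value, or the registered core id when REGISTERED
    }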

27 Current Hardware Limitations (recap: ✔ = addressed by the base DeNovo protocol)
Complexity
✔ Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
Storage overhead
✔ Directory overhead for sharer lists
Performance and power inefficiencies
✔ Invalidation and ack messages
✔ False sharing
– Indirection through the directory
– Fixed cache-line communication granularity

28 Protocol Optimizations
Insights
1. A traditional directory must be updated at every transfer
   ⇒ DeNovo can copy valid data around freely
2. Traditional systems send a cache line at a time
   ⇒ DeNovo uses regions to transfer only relevant data
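A hedged sketch (ours, reusing the CacheLine and WordState types above) combining both insights: any cache holding non-Invalid words may answer a request directly, since Valid data is never stale, and the reply carries only the words the requester's region will touch rather than a full fixed-size line.

    class Responder {
        // Hypothetical responder logic, not the authors' implementation
        int[] respond(CacheLine line, boolean[] wordsNeededByRequester) {
            int[] reply = new int[line.word.length];
            for (int w = 0; w < line.word.length; w++) {
                if (wordsNeededByRequester[w] && line.word[w] != WordState.INVALID) {
                    reply[w] = line.data[w];  // only relevant, up-to-date words travel
                }
            }
            return reply;
        }
    }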

29 Protocol Optimizations
Direct cache-to-cache transfer; flexible communication
[Animation: Core 1's L1 holds X3–X5 Invalid while Core 2's L1 holds them Registered. On LD X3, the request is answered directly by Core 2's L1 rather than through the shared L2, and the reply can also carry other useful words of the line such as Y3 and Z3.]

30 Protocol Optimizations
Direct cache-to-cache transfer; flexible communication
[Animation: the same LD X3 now returns all the relevant region data, X3, X4, and X5, in a single message; Core 1's L1 ends up holding them Valid.]

31 Current Hardware Limitations (recap: all addressed ✔)
Complexity
✔ Subtle races and numerous transient states in the protocol
✔ Hard to extend for optimizations
Storage overhead
✔ Directory overhead for sharer lists
Performance and power inefficiencies
✔ Invalidation and ack messages
✔ False sharing
✔ Indirection through the directory
✔ Fixed cache-line communication granularity

32 Outline
Motivation
Background: DPJ
Base DeNovo Protocol
DeNovo Optimizations
Evaluation
– Complexity
– Performance
Conclusion and Future Work

33 Protocol Verification
DeNovo vs. MESI (word granularity) with Murphi model checking
Correctness
– Six bugs found in the MESI protocol
  * Difficult to find and fix
– Three bugs found in the DeNovo protocol
  * Simple to fix
Complexity
– 15x fewer reachable states for DeNovo
– 20x faster verification runtime

34 Performance Evaluation Methodology
Simulator: Simics + GEMS + Garnet
System parameters
– 64 cores
– Simple in-order core model
Workloads
– FFT, LU, Barnes-Hut, and radix from SPLASH-2
– bodytrack and fluidanimate from PARSEC 2.1
– kd-Tree (two versions) [HPG 09]

35 MESI Word (MW) vs. DeNovo Word (DW)
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DW's performance competitive with MW

36 MESI Line (ML) vs. DeNovo Line (DL)
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DL has about the same or better memory stall time than ML
DL outperforms ML significantly on apps with false sharing

37 Optimizations on DeNovo Line
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
Combined optimizations perform best
– Except for LU and bodytrack
– Apps with low spatial locality suffer from line-granularity allocation

38 Network Traffic
[Chart: network traffic for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DeNovo has less traffic than MESI in most cases
DeNovo incurs more write traffic
– Due to word-granularity registration
– Can be mitigated with a "write-combining" optimization (sketched below)
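A hedged reading of that write-combining idea as a small Java sketch (the class, interface, and line size are our assumptions): buffer word-granularity registration requests and send one combined message per cache line instead of one per word.

    import java.util.*;

    class RegistrationCombiner {
        static final int WORDS_PER_LINE = 8;  // assumed line size
        private final Map<Long, BitSet> pending = new HashMap<>();  // line address -> words to register

        void registerWord(long wordAddr) {
            long line = wordAddr / WORDS_PER_LINE;
            pending.computeIfAbsent(line, k -> new BitSet())
                   .set((int) (wordAddr % WORDS_PER_LINE));
        }

        // Flush combined registrations, e.g. when the buffer fills or the phase ends.
        void flush(Network net) {
            pending.forEach((line, words) -> net.sendRegistration(line, words));
            pending.clear();
        }
    }

    interface Network {  // hypothetical network interface for this sketch
        void sendRegistration(long lineAddr, BitSet words);
    }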

39 Conclusion and Future Work
DeNovo rethinks hardware for disciplined models
Complexity
– No transient states: 20x faster to verify than MESI
– Extensible: optimizations without new states
Storage overhead
– No directory overhead
Performance and power inefficiencies
– No invalidations, acks, false sharing, indirection
– Flexible, not cache-line, communication
– Up to 81% reduction in memory stall time
Future work: region-driven layout, off-chip memory, safe non-determinism, sync, OS, legacy

40 Thank You!

