DeNovo†: Rethinking Hardware for Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung, Rob Bocchino, Sarita Adve, Vikram Adve
Other collaborators:
– Languages: Adam Welc, Tatiana Shpeisman, Yang Ni (Intel)
– Applications: John Hart, Victor Lu
† De novo = from the beginning, anew

Motivation
Goal: power-, complexity-, performance-scalable hardware
Today's shared memory:
– Directory-based coherence: complex, unscalable
– Address, communication, and coherence granularity is the cache line: software-oblivious; inefficient in power, bandwidth, latency, and area, especially for object-oriented codes
– Difficult programming model: data races, non-determinism, no safety/composability/modularity, …
– Can't even specify "what value a read can return," a.k.a. the memory model: data races defy acceptable semantics; hardware and software are mismatched
Fundamentally broken for hardware and software
Banish shared memory? No: banish wild shared memory! Need disciplined shared memory!

What is Shared-Memory?
Wild shared-memory = global address space + implicit, anywhere communication and synchronization
Disciplined shared-memory = global address space + explicit, structured side-effects (replacing implicit, anywhere communication and synchronization)

Disciplined Shared-Memory
Top-down view: programming model, language, … (earlier today)
– Use explicit effects for semantic guarantees: data-race-freedom, determinism-by-default, controlled non-determinism
– Reward: simple semantics, safety, composability, …
Bottom-up view: hardware, runtime, …
– Use explicit effects plus the above guarantees for simple coherence and consistency, and for software-aware address/communication/coherence granularity and data layout
– Reward: power-, complexity-, performance-scalable hardware
The top-down and bottom-up views are synergistic! (See the interface sketch below.)
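To make the effect interface concrete, here is a minimal C++ sketch of the idea, assuming a hypothetical `Effects` descriptor and `run_phase` hook (neither is DPJ's or DeNovo's actual API): software summarizes the regions a race-free parallel phase may read and write, and the write set is what the memory system consumes at the phase boundary.

```cpp
// Hypothetical sketch of explicit, structured side-effects: every name here
// is invented for illustration, not taken from DPJ or DeNovo.
#include <cstdint>
#include <functional>
#include <set>

using RegionId = std::uint8_t;     // DeNovo needs only a handful of region IDs

struct Effects {
    std::set<RegionId> reads;      // regions this parallel phase may read
    std::set<RegionId> writes;     // regions this parallel phase may write
};

// Run one data-race-free parallel phase whose side-effects are fully
// described by fx; the write set is handed to the caches at the boundary.
void run_phase(const Effects& fx, const std::function<void()>& body) {
    body();                        // tasks run in parallel, race-free by construction
    // ... phase boundary: each cache self-invalidates its stale copies of
    // data in fx.writes that it does not own (see the coherence slides) ...
    (void)fx;
}

int main() {
    Effects fx;
    fx.reads  = {0, 1};            // e.g., two read-only regions
    fx.writes = {2};               // e.g., one region written this phase
    run_phase(fx, [] { /* parallel tasks would go here */ });
}
```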

DeNovo Research Strategy
1. Start with deterministic (hence data-race-free) codes: the common and best case, and the basis for extension to other codes
2. Then disciplined non-deterministic codes
3. Then wild non-deterministic, legacy codes
Work with languages for the best hardware-software interface:
– current driver is DPJ; the end goal is a language-oblivious interface
Work with realistic applications: current work uses kd-trees

Progress Summary
1. Deterministic codes:
– language-level model with DPJ, and its translation to the hardware interface
– design for simple coherence and software-aware communication and layout
– baseline coherence implemented; the rest is underway
2. Disciplined non-deterministic codes:
– language-level model with DPJ (and Intel)
3. Wild non-deterministic, legacy codes:
– just begun; will be influenced by the above
Collaborations with languages and applications groups: DPJ; first scalable SAH-quality kd-tree construction

Coherence and Consistency: Key Insights
Guaranteed determinism ⇒ a read should return the value of the last write in sequential order:
– from the same task in this parallel phase, if it exists
– otherwise from the previous parallel phase (there are no concurrent conflicting writes in this phase)
Explicit effects ⇒
– the compiler knows all regions written in this parallel phase
– the cache can self-invalidate before the next parallel phase, invalidating data in writeable regions that it did not write itself (see the sketch below)
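A minimal sketch of the self-invalidation step, assuming hypothetical per-word cache state (the real design tracks further details): at the phase boundary each cache drops its Valid words whose region is in the compiler-provided write set, while Registered words, which this core itself wrote, survive.

```cpp
// Hypothetical sketch of self-invalidation at a parallel phase boundary.
#include <cstdint>
#include <set>
#include <vector>

using RegionId = std::uint8_t;

enum class State { Invalid, Valid, Registered };   // the three DeNovo states

struct CachedWord {
    State    state  = State::Invalid;
    RegionId region = 0;           // which software region this word belongs to
};

// Called locally on every core at the end of a phase; writeSet is the
// compiler-provided set of regions the phase may have written.
void self_invalidate(std::vector<CachedWord>& cache,
                     const std::set<RegionId>& writeSet) {
    for (CachedWord& w : cache) {
        // Registered words were written by this core, so they stay up to date;
        // Valid words in a written region may be stale and are dropped.
        if (w.state == State::Valid && writeSet.count(w.region) != 0)
            w.state = State::Invalid;
    }
}
```

Because race-free software guarantees no concurrent conflicting writes, dropping these words locally is always safe, which is what removes the need for writer-initiated invalidations and sharer lists.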

Today's Coherence Protocols
Snooping: broadcast; requires ordered networks
Directory: avoids broadcast through indirection, but has three problems DeNovo addresses:
– Complexity: races in the protocol. Race-free software ⇒ an (almost) race-free coherence protocol, with no transient states: a much simpler protocol
– Overhead: sharer lists. Explicit effects enable self-invalidation ⇒ no need for sharer lists
– Performance: all cache misses go through the directory. DeNovo's directory tracks only the one up-to-date copy, not sharers or serialization, so data copies can move from cache to cache without telling the directory

Baseline DeNovo Coherence
Assume (for now): private L1s, a shared L2, and single-word lines
– The directory tracks the one current copy of a line, not sharers
– The L2 data arrays double as the directory: each entry keeps either valid data or a registered core ID, so there is no space overhead
– Software inserts self-invalidations for regions with write effects
– L1 states: Invalid, Valid, Registered
– No transient states: the protocol ≈ the 3-state textbook pictures
– Formal specification and verification with Intel
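The following toy C++ rendering, with invented names and a map standing in for the L2 arrays, shows the shape of the three-state protocol; it illustrates the slide's description and is not the actual DeNovo implementation.

```cpp
// Toy model of the baseline protocol: single-word lines, three stable states,
// and an L2/directory entry that holds either data or a registered core ID.
#include <cstdint>
#include <unordered_map>
#include <variant>

enum class State { Invalid, Valid, Registered };

struct L1Entry { State state = State::Invalid; std::uint64_t data = 0; };

struct Directory {
    // Either up-to-date data (uint64_t) or the registered core's ID (int),
    // mirroring "L2 data arrays double as directory": no extra space needed.
    std::unordered_map<std::uint64_t, std::variant<std::uint64_t, int>> entry;

    std::uint64_t fetch(std::uint64_t addr) {
        auto& e = entry[addr];
        // If a core is registered, real hardware forwards the request to it
        // and the owner responds directly; that exchange is elided here.
        return std::holds_alternative<std::uint64_t>(e)
                   ? std::get<std::uint64_t>(e) : 0;
    }
    void register_core(std::uint64_t addr, int coreId) { entry[addr] = coreId; }
};

// L1 read: Invalid misses to the directory; Valid and Registered hit locally.
std::uint64_t l1_read(L1Entry& e, std::uint64_t addr, Directory& dir) {
    if (e.state == State::Invalid) { e.data = dir.fetch(addr); e.state = State::Valid; }
    return e.data;
}

// L1 write: register this core as the single owner, then write locally.
// No transient states are needed because race-free phases rule out
// concurrent conflicting requests for the same word.
void l1_write(L1Entry& e, std::uint64_t addr, std::uint64_t v,
              int coreId, Directory& dir) {
    if (e.state != State::Registered) dir.register_core(addr, coreId);
    e.data = v; e.state = State::Registered;
}
```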

DeNovo Region Granularity
DPJ regions are too fine-grained:
– they ensure accesses to individual objects/fields don't interfere
– that is too many regions for hardware to track
DeNovo only needs the aggregate data written in a phase:
– e.g., one field of an entire data structure can be summarized as a single region (see the sketch below)
Open question: can we aggregate to few enough regions without excessive invalidations?
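A small illustrative sketch of the aggregation step, assuming a hypothetical compiler-side table (for the evaluated apps the translation was actually done by hand): every per-object DPJ region naming the same field of the same data structure collapses to one hardware region ID.

```cpp
// Hypothetical aggregation of fine-grained DPJ regions into hardware regions.
#include <cstdint>
#include <string>
#include <unordered_map>

using HwRegion = std::uint8_t;     // the evaluated apps needed only 2-8 of these

struct RegionAggregator {
    std::unordered_map<std::string, HwRegion> table;
    HwRegion next = 0;

    // The same field of the same structure maps to the same hardware region,
    // no matter which object instance the DPJ region was declared for.
    HwRegion aggregate(const std::string& structField) {  // e.g., "TreeNode.pos"
        auto it = table.find(structField);
        if (it != table.end()) return it->second;
        return table[structField] = next++;
    }
};
```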

Evaluation Methodology
Modified Wisconsin GEMS + (Intel!) Simics simulator
Four apps: LU, FFT, Barnes, kd-tree
– DPJ regions converted into DeNovo regions by hand:

  App        Barnes   FFT   LU   Kd-tree
  # Regions       8     3    2         4

Compared DeNovo vs. MESI, both with single-word lines
– Goal: does DeNovo's simplicity impact performance? E.g., are self-invalidations too conservative? (Efficiency enhancements are the next step.)

Results for Baseline Coherence
DeNovo is comparable to MESI
The simple protocol is thus a sound foundation for efficiency enhancements

Improving Performance & Power
Insight: valid data can always be copied to another cache without a demand access and without going through the directory:
– if a later demand read sees the copy, it must be correct
– there are no false-sharing effects (no loss of "ownership")
This enables:
⇒ a simple line-based protocol with per-word valid bits (see the sketch below)
⇒ getting a line from anywhere, not just from the directory/L2: point-to-point, sender-initiated transfers
⇒ transferring more than a line at a time: point-to-point bulk transfer
⇒ region-driven transfer granularity: the AoS-vs-SoA optimization becomes natural
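A hedged sketch of the per-word valid bits, with hypothetical structures: when a line arrives from a peer cache, only words the receiver holds as Invalid are filled, so a stale word in the incoming copy can never overwrite up-to-date local data, and any later demand read that hits a filled word is correct.

```cpp
// Hypothetical line with per-word coherence state, enabling opportunistic
// cache-to-cache transfers that never go through the directory.
#include <array>
#include <cstdint>

constexpr int WORDS_PER_LINE = 8;

enum class State { Invalid, Valid, Registered };

struct L1Line {
    std::array<State,         WORDS_PER_LINE> state{};   // per-word state
    std::array<std::uint64_t, WORDS_PER_LINE> data{};
};

// Merge a line pushed or pulled from a peer cache; theirValid marks which
// words were Valid or Registered at the sender.
void merge_peer_line(L1Line& mine,
                     const std::array<std::uint64_t, WORDS_PER_LINE>& theirs,
                     const std::array<bool, WORDS_PER_LINE>& theirValid) {
    for (int w = 0; w < WORDS_PER_LINE; ++w) {
        if (mine.state[w] == State::Invalid && theirValid[w]) {
            mine.data[w]  = theirs[w];
            mine.state[w] = State::Valid;  // safe: a later read of it is correct
        }
    }
}
```

Since no local Valid or Registered word is ever replaced, there is no loss of ownership and hence no false-sharing ping-pong, which is exactly the property the slide relies on.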

Towards Ideal Efficiency
Current systems: address, transfer, and coherence granularity = a fixed cache line
DeNovo so far: transfer granularity is flexible, but address granularity is still the line and coherence granularity is still the word
Next step: region-centered caches
– Use regions for memory layout: region-based "pool allocation," with fields of the same region at fixed strides (see the sketch below)
– Cache banks devoted to regions
– Regions accessed together determine the address and transfer granularity
– Regions with the same sharing behavior determine the coherence granularity
Also applicable to main memory and pin bandwidth; interacts with the runtime scheduler
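To make the layout idea concrete, a small sketch contrasting a conventional array-of-structures layout with region-based pool allocation (structure-of-arrays); the node type and region names are invented for illustration, not taken from the DeNovo applications.

```cpp
// Hypothetical region-based pool allocation: each region's field lives in
// its own contiguous pool at a fixed stride, so a phase that writes only
// one region touches (and transfers) only that pool.
#include <cstddef>
#include <vector>

// Conventional AoS: each node interleaves fields from different regions,
// so every cache line mixes data with different sharing behavior.
struct NodeAoS { double mass; double pos[3]; double force[3]; };

// SoA with one pool per region: node i's field sits at index i (or 3*i)
// of its pool, a stride the hardware can compute from the region alone.
struct NodePoolsSoA {
    std::vector<double> mass;    // region "mass":  read-mostly
    std::vector<double> pos;     // region "pos":   written in one phase
    std::vector<double> force;   // region "force": written in another phase
    explicit NodePoolsSoA(std::size_t n) : mass(n), pos(3 * n), force(3 * n) {}
};
```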

Summary
Current shared-memory models are fundamentally broken: semantics, programmability, and hardware
Disciplined programming models solve these problems
DeNovo = hardware for disciplined programming:
– a software-driven memory hierarchy
– coherence, consistency, communication, data layout, …
– simpler, faster, cooler, cheaper, …
Sponsor interactions:
– Nick Carter; Mani Azimi, Akhilesh Kumar, Ching-Tsun Chou
– non-deterministic model: Adam Welc, Tatiana Shpeisman, Yang Ni
– SCC prototype exploration: Jim Held

Next Steps
Phase 1:
– full implementation, verification, and results for deterministic codes
– design and results for disciplined non-determinism
– design for wild non-deterministic, legacy codes
– continued work with language and application groups
– exploration of prototyping on SCC
Phase 2:
– design and simulation results for a complete DeNovo system running large applications
– a language-oblivious hardware-software interface
– prototype and tech transfer