Memory Models: A Case for Rethinking Parallel Languages and Hardware †
Sarita V. Adve, University of Illinois
Acks: Mark Hill, Kourosh Gharachorloo, Jeremy Manson, Bill Pugh, Hans Boehm, Doug Lea, Herb Sutter, Vikram Adve, Rob Bocchino, Marc Snir, Byn Choi, Rakesh Komuravelli, Hyojin Sung
† Also a paper by S. V. Adve & H.-J. Boehm, to appear in CACM

Memory Consistency Models
Parallelism for the masses!
Shared-memory most common
Memory model = Legal values for reads

20 Years of Memory Models
Memory model is at the heart of concurrency semantics
– 20-year journey from confusion to convergence, at last!
– Hard lessons learned
– Implications for the future
Current way to specify concurrency semantics is too hard
– Fundamentally broken
Must rethink parallel languages and hardware
– E.g., Illinois Deterministic Parallel Java, DeNovo architecture

What is a Memory Model?
Memory model defines what values a read can return

Initially A = B = C = Flag = 0

Thread 1              Thread 2
A = 26                while (Flag != 1) {;}
B = 90                r1 = B
Flag = 1              r2 = A
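
The slide's example can be written directly as Java threads. The sketch below is illustrative only (class and variable names are ours); the fields are plain, so the program has data races, and the commented outcomes hold only under SC.

    class MotivatingExample {
        static int a = 0, b = 0, flag = 0;

        public static void main(String[] args) {
            new Thread(() -> { a = 26; b = 90; flag = 1; }).start();
            new Thread(() -> {
                while (flag != 1) { }  // spin until Thread 1 sets flag
                int r1 = b;            // under SC: 90
                int r2 = a;            // under SC: 26
                System.out.println(r1 + " " + r2);
            }).start();
            // A weaker model decides whether r1 or r2 may instead return 0
            // (e.g., if the writes to a/b are reordered past flag) and
            // whether the spin loop is even guaranteed to terminate.
        }
    }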

Memory Model is Key to Concurrency Semantics
Interface between program and transformers of program
– Defines what values a read can return
(Diagram: C++ program → Compiler → Dynamic optimizer → Hardware; assembly sits between compiler and hardware)
Weakest system component exposed to the programmer
– Language-level model has implications for hardware
Interface must last beyond trends

Desirable Properties of a Memory Model
3 Ps:
– Programmability
– Performance
– Portability
Challenge: hard to satisfy all 3 Ps
– Late 1980s–90s: Largely driven by hardware
  Lots of models, little consensus
– 2000 onwards: Largely driven by languages/compilers
  Consensus model for Java, C++ (C, others ongoing)
  Had to deal with mismatches in hardware models
Path to convergence has lessons for the future

Programmability – SC [Lamport79]
Programmability: Sequential consistency (SC) most intuitive
– Operations of a single thread in program order
– All operations in a total order or atomic
But Performance?
– Recent (complex) hardware techniques boost performance with SC
– But compiler transformations still inhibited
But Portability?
– Almost all h/w and compilers violate SC today
⇒ SC not practical, but…

Next Best Thing – SC Almost Always
Parallel programming too hard even with SC
– Programmers (want to) write well-structured code
– Explicit synchronization, no data races

Thread 1          Thread 2
Lock(L)           Lock(L)
Read Data1        Read Data2
Write Data2       Write Data1
…                 …
Unlock(L)         Unlock(L)

– SC for such programs much easier: can reorder data accesses
⇒ Data-race-free model [AdveHill90]
– SC for data-race-free programs
– No guarantees for programs with data races
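
In Java, the slide's pattern is simply guarding all shared accesses with one lock. A minimal sketch (class and field names are hypothetical): every access to data1/data2 happens while holding the same lock, so no two conflicting accesses can occur back to back, the program is data-race-free, and the system may freely reorder the data accesses within each critical section.

    class SharedData {
        private final Object l = new Object();  // the lock L
        private int data1, data2;

        void thread1Work() {
            synchronized (l) {   // Lock(L)
                int r = data1;   // Read Data1
                data2 = r + 1;   // Write Data2
            }                    // Unlock(L)
        }

        void thread2Work() {
            synchronized (l) {   // Lock(L)
                int r = data2;   // Read Data2
                data1 = r + 1;   // Write Data1
            }                    // Unlock(L)
        }
    }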

Definition of a Data Race
Distinguish between data and non-data (synchronization) accesses
Only need to define for SC executions ⇒ total order
Two memory accesses form a race if
– From different threads, to same location, at least one is a write
– Occur one after another in the total order

Thread 1          Thread 2
Write, A, 26
Write, B, 90
                  Read, Flag, 0
Write, Flag, 1
                  Read, Flag, 1
                  Read, B, 90
                  Read, A, 26

A race with a data access is a data race
Data-race-free program = No data race in any SC execution

Data-Race-Free Model
Data-race-free model = SC for data-race-free programs
– Does not preclude races for wait-free constructs, etc.
Requires races be explicitly identified as synchronization
– E.g., use volatile variables in Java, atomics in C++
Dekker's algorithm:

Initially Flag1 = Flag2 = 0; volatile Flag1, Flag2

Thread 1              Thread 2
Flag1 = 1             Flag2 = 1
if Flag2 == 0         if Flag1 == 0
  //critical section    //critical section

SC prohibits both loads returning 0
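
With the flags declared volatile, the Dekker fragment above is runnable Java; the harness below is our own sketch. The program is data-race-free, so the Java model guarantees SC for it, and at most one thread can enter the critical section.

    class Dekker {
        static volatile int flag1 = 0, flag2 = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> {
                flag1 = 1;
                if (flag2 == 0) { /* critical section */ }
            });
            Thread t2 = new Thread(() -> {
                flag2 = 1;
                if (flag1 == 0) { /* critical section */ }
            });
            t1.start(); t2.start();
            t1.join(); t2.join();
            // SC prohibits both reads returning 0. Without volatile, each
            // store could be reordered after the following load and both
            // threads could enter the critical section.
        }
    }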

Data-Race-Free Approach
Programmer's model: SC for data-race-free programs
Programmability
– Simplicity of SC, for data-race-free programs
Performance
– Specifies minimal constraints (for SC-centric view)
Portability
– Language must provide way to identify races
– Hardware must provide way to preserve ordering on races
– Compiler must translate correctly

1990s in Practice (The Memory Models Mess)
Hardware
– Implementation/performance-centric view
– Different vendors had different models – most non-SC
  Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
– Various ordering guarantees + fences to impose other orders
  (Diagram: a fence inserted between memory operations – LD, ST, fence, LD)
– Many ambiguities – due to complexity, by design(?), …
High-level languages
– Most shared-memory programming with Pthreads, OpenMP
  Incomplete, ambiguous model specs
  Memory model property of language, not library [Boehm05]
– Java – commercially successful language with threads
  Chapter 17 of Java language spec on memory model
  But hard to interpret, badly broken

2000 – 2004: Java Memory Model
~2000: Bill Pugh publicized fatal flaws in Java model
Lobbied Sun to form expert group to revise Java model
Open process via mailing list
– Diverse participants
– Took 5 years of intense, spirited debates
– Many competing models
Final consensus model approved in 2005 for Java 5.0 [MansonPughAdve POPL 2005]

Java Memory Model Highlights
Quick agreement that SC for data-race-free was required
Missing piece: Semantics for programs with data races
– Java cannot have undefined semantics for ANY program
– Must ensure safety/security guarantees
– Limit damage from data races in untrusted code
Goal: Satisfy security/safety w/ maximum system flexibility
– Problem: “safety/security, limited damage” w/ threads very vague

Java Memory Model Highlights
Initially X = Y = 0

Thread 1          Thread 2
r1 = X            r2 = Y
Y = r1            X = r2

Is r1 = r2 = 42 allowed? Data races produce a causality loop!
Defining a causality loop was surprisingly hard
Common compiler optimizations seem to violate “causality”
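
In Java the example reads as follows (a sketch; the fields are plain, so the program is racy). The hard part is stating a "no causality loop" rule that forbids r1 == r2 == 42 without also outlawing ordinary compiler optimizations.

    class OutOfThinAir {
        static int x = 0, y = 0;

        static void thread1() { int r1 = x; y = r1; }
        static void thread2() { int r2 = y; x = r2; }
        // r1 == r2 == 42 would mean 42 appeared "out of thin air": each
        // write only copies what the other thread wrote, so no step of
        // any execution ever produces 42. The model must forbid this
        // outcome for safety, yet legitimate optimizations that guess a
        // value and later validate it superficially resemble such loops.
    }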

Java Memory Model Highlights
Final model based on consensus, but complex
– Programmers can (must) use “SC for data-race-free”
– But system designers must deal with complexity
– Correctness tools, racy programs, debuggers, …??
– Recent discovery of bugs [SevcikAspinall08]

2005 – : C++, Microsoft Prism, Multicore
~2005: Hans Boehm initiated C++ concurrency model
– Prior status: no threads in C++, most concurrency w/ Pthreads
Microsoft concurrently started its own internal effort
C++ easier than Java because it is unsafe
– Data-race-free is a plausible model
BUT multicore ⇒ new h/w optimizations, more scrutiny
– Mismatched h/w and programming views became painfully obvious
– Debate over whether SC for data-race-free is inefficient w/ hardware models

Hardware Implications of Data-Race-Free
Synchronization (volatiles/atomics) must appear SC
– Each thread's synch must appear in program order

synch Flag1, Flag2

Thread 1              Thread 2
Flag1 = 1             Flag2 = 1
Fence                 Fence
if Flag2 == 0         if Flag1 == 0
  critical section      critical section

SC ⇒ both reads cannot return 0
– Requires efficient fences between synch stores/loads
– All synchs must appear in a total order (atomic)
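
In hardware terms the requirement is a full fence between each thread's synchronization store and its subsequent load. The sketch below illustrates the fence placement with Java's VarHandle.fullFence() (a real Java 9+ API); this is only an illustration of the hardware requirement, not a recommended idiom, since in portable Java you would simply declare the flags volatile.

    import java.lang.invoke.VarHandle;

    class FencedDekker {
        static int flag1 = 0, flag2 = 0;  // plain fields; ordering comes from fences

        static void thread1() {
            flag1 = 1;
            VarHandle.fullFence();        // keep the store before the load
            if (flag2 == 0) { /* critical section */ }
        }

        static void thread2() {
            flag2 = 1;
            VarHandle.fullFence();
            if (flag1 == 0) { /* critical section */ }
        }
    }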

Implications of Atomic Synch Writes
Independent reads, independent writes (IRIW):

Initially X = Y = 0

T1          T2          T3           T4
X = 1       Y = 1       … = Y (1)    … = X (1)
                        fence        fence
                        … = X (0)    … = Y (0)
(values in parentheses show the outcome SC forbids)

SC ⇒ no thread sees new value until old copies invalidated
– Shared caches w/ hyperthreading/multicore make this harder
– Programmers don't usually use IRIW
– Why pay cost for SC in h/w if not useful to s/w?
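
The IRIW test is straightforward to write in Java (a sketch; the harness details are ours). With volatile variables the program is data-race-free, so Java promises SC for it, which is exactly the write-atomicity cost the slide questions.

    class IRIW {
        static volatile int x = 0, y = 0;
        static int r1, r2, r3, r4;

        public static void main(String[] args) throws InterruptedException {
            Thread[] ts = {
                new Thread(() -> x = 1),
                new Thread(() -> y = 1),
                new Thread(() -> { r1 = y; r2 = x; }),
                new Thread(() -> { r3 = x; r4 = y; })
            };
            for (Thread t : ts) t.start();
            for (Thread t : ts) t.join();
            // Under SC, r1==1 && r2==0 && r3==1 && r4==0 is impossible:
            // the third thread would have seen Y's write before X's and
            // the fourth the reverse, so the two readers would disagree
            // on the order of the two independent writes.
        }
    }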

C++ Challenges
2006: Pressure to change Java/C++ to remove SC baseline
– To accommodate some hardware vendors
But what is the alternative?
– Must allow some hardware optimizations
– But must be teachable to undergrads
Showed such an alternative (probably) does not exist

C++ Compromise
Default C++ model is data-race-free
– AMD, Intel, … on board
But
– Some systems need expensive fence for SC
– Some programmers really want more flexibility
C++ specifies low-level model only for experts
– Complicates spec, but only for experts
– We are not advertising this part
[BoehmAdve PLDI 2008]

Summary of Current Status
Convergence to “SC for data-race-free” as baseline
For programs with data races
– Minimal but complex semantics for safe languages
– No semantics for unsafe languages

Lessons Learned
SC for data-race-free minimal baseline
Specifying semantics for programs with data races is HARD
– But “no semantics for data races” also has problems
  Not an option for safe languages; debugging; correctness-checking tools
Hardware-software mismatch for some code
– “Simple” optimizations have unintended consequences
⇒ State-of-the-art is fundamentally broken
Banish wild shared-memory!
We need
– Higher-level disciplined models that enforce discipline
– Hardware co-designed with high-level models
⇒ Deterministic Parallel Java [V. Adve et al.], DeNovo hardware

Research Agenda for Languages
Disciplined shared-memory models
– Simple
– Enforceable
– Expressive
– Performance
Key: What discipline? How to enforce it?

Data-Race-Free
A near-term discipline: Data-race-free
Enforcement
– Ideally, language prohibits by design
– Else, runtime catches as exception
But data-race-free still not sufficiently high level

Deterministic-by-Default Parallel Programming
Even data-race-free parallel programs are too hard
– Multiple interleavings due to unordered synchronization (or races)
– Makes reasoning and testing hard
But many algorithms are deterministic
– Fixed input gives fixed output
– Standard model for sequential programs
– Also holds for many transformative parallel programs
  Parallelism not part of problem specification, only for performance
Why write such an algorithm in non-deterministic style, then struggle to understand and control its behavior?

Deterministic-by-Default Model
Parallel programs should be deterministic-by-default
– Sequential semantics (easier than SC!)
If non-determinism is needed
– It should be explicitly requested
– It should be isolated from deterministic parts
Enforcement:
– Ideally, language prohibits by design
– Else, runtime
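
A minimal Java illustration of "parallelism only for performance" (our example, not from the talk): the parallel version below computes exactly what the sequential version computes, so it can be reasoned about and tested with sequential semantics.

    import java.util.Arrays;

    class Scale {
        // Deterministic: the same input array and factor always give the
        // same output, whether or not .parallel() is present.
        static double[] scale(double[] a, double k) {
            return Arrays.stream(a)
                         .parallel()       // delete this line: same result
                         .map(v -> v * k)
                         .toArray();
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(
                scale(new double[]{1, 2, 3}, 10)));  // [10.0, 20.0, 30.0]
        }
    }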

State-of-the-Art
Many deterministic languages today
– Functional, pure data parallel, some domain-specific, …
– Much recent work on runtime, library-based approaches
Our work: Language approach for modern O-O methods
– Deterministic Parallel Java (DPJ) [V. Adve et al.]

Deterministic Parallel Java (DPJ)
Object-oriented type and effect system
– Use “named” regions to partition the heap
– Annotate methods with effect summaries: regions read or written
– If program type-checks, guaranteed deterministic
  * Simple, modular compiler checking
  * No run-time checks today, may add in future
– Side benefit: regions, effects are valuable documentation
Extended sequential subset of Java (DPC++ ongoing)
– Initial evaluation for expressivity, performance [OOPSLA09]
– Integrating disciplined non-determinism
– Encapsulating frameworks and unchecked code
– Semi-automatic tool for effect annotations [ASE09]
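
A rough flavor of the annotations, approximated from the DPJ papers (the exact syntax may differ): region parameters name heap partitions, fields live in regions, and method effect summaries let the compiler prove that parallel branches touch disjoint regions.

    // DPJ-like sketch, not plain Java: 'region', 'in', 'writes', and
    // 'cobegin' are DPJ extensions, shown approximately.
    class Pair<region R1, region R2> {
        int left  in R1;
        int right in R2;

        void setLeft(int v)  writes R1 { left  = v; }
        void setRight(int v) writes R2 { right = v; }

        void update() {
            cobegin {          // run both statements in parallel
                setLeft(1);    // effect: writes R1
                setRight(2);   // effect: writes R2 – disjoint from R1,
            }                  // so the compiler can prove determinism
        }
    }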

Research Agenda for Hardware
Current hardware not matched even to current model
Near term: ISA changes, speculation
Long term: Co-design hardware with new software models

Illinois DeNovo Project
Design hardware to exploit disciplined parallelism
– Simpler hardware
– Scalable performance
– Power/energy efficiency
Working with DPJ as example disciplined model
– Exploit data-race-freedom, region/effect information
  * Simpler coherence
  * Efficient communication: point-to-point, bulk, …
  * Efficient data layout: region- vs. cache-line-centric memory
  * New hardware/software interface

Cache Coherence
Commonly accepted definition (software-oblivious)
– All writes to the same location appear in the same order
– Source of much complexity
– Coherence protocols to scale to 1000 cores?
What do we really need (software-aware)?
– Get the right data to the right task at the right time
– Disciplined models make it easier to determine what is “right”
Assuming only for-each loops, a read must return the value of
– The last write in its own task, or
– The last write in the previous for-each loop

Today's Coherence Protocols
Snooping
– Broadcast, ordered networks
Directory – avoids broadcast through a level of indirection
– Complexity: Races in protocol
  But race-free software ⇒ race-free coherence protocol
– Performance: Level of indirection
  But definition of coherence no longer requires serialization
– Overhead: Sharer list
  But region-effects enable self-invalidations
+ No false sharing, flexible communication granularity, region-based data layout
⇒ Simpler, more efficient DeNovo protocol
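
To make the self-invalidation idea concrete, here is a hypothetical software model of it (all types and names are our assumptions, not the actual DeNovo design): at a parallel-phase boundary each core discards its cached copies of data in regions that other cores' effect summaries say were written, so no sharer lists or writer-initiated invalidation messages are needed.

    import java.util.List;
    import java.util.Set;

    class SelfInvalidatingCache {
        interface Line {
            String region();    // region this cached data belongs to
            void invalidate();  // drop the copy; re-fetch on next read
        }

        private final List<Line> lines;

        SelfInvalidatingCache(List<Line> lines) { this.lines = lines; }

        // Called at a parallel-phase boundary with the regions other
        // cores wrote during the phase (known from effect summaries).
        void phaseBoundary(Set<String> regionsWrittenByOthers) {
            for (Line line : lines) {
                if (regionsWrittenByOthers.contains(line.region())) {
                    line.invalidate();  // local copy may be stale
                }
            }
        }
    }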

Conclusions
Current way to specify concurrency semantics is fundamentally broken
– Best we can do is SC for data-race-free
  * But cannot hide from programs with data races
– Mismatched hardware-software
  * Simple optimizations give unintended consequences
Need
– High-level disciplined models that enforce discipline
– Hardware co-designed with high-level model
DPJ – deterministic-by-default parallel programming
DeNovo – hardware for disciplined parallel programming
Previous memory-model convergence came from a similar process
– But this time, let's co-design s/w and h/w