(C) 2003 Multifacet Project, University of Wisconsin-Madison. Revisiting “Multiprocessors Should Support Simple Memory Consistency Models”. Mark D. Hill, Multifacet.

Presentation transcript:

(C) 2003 Multifacet Project, University of Wisconsin-Madison
Revisiting “Multiprocessors Should Support Simple Memory Consistency Models”
Mark D. Hill
Multifacet Project, Computer Sciences Department, University of Wisconsin-Madison
October 2003

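Wisconsin Multifacet Project Dagstuhl 10/
High- vs. Low-Level Memory Model Interface
[Figure: software (C w/ Posix, Java w/ Threads, HPF) layered over multithreaded hardware. The relaxed high-level (HL) interface is the subject of most of this workshop; the relaxed hardware (HW) interface is the subject of this talk.]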

Outline
Subroutine Call
  –Value Prediction & Memory Model Subtleties
Review Original Paper [Computer, Dec. 1998]
  –Commercial Memory Model Classes
  –Performance Similarities & Differences
  –Predictions & Recommendation
Revisit in 2003
  –Revisiting 1998 Predictions
  –“SC + ILP = RC?” Paper
  –Revisiting Commercial Memory Model Classes
  –Analysis, Predictions, & Recommendation

(C) 2003 Multifacet Project, University of Wisconsin-Madison
Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing
Milo M.K. Martin, Daniel J. Sorin, Harold W. Cain, Mark D. Hill, and Mikko H. Lipasti
Computer Sciences Department and Department of Electrical and Computer Engineering, University of Wisconsin-Madison

Big Picture
Naïve value prediction can break concurrent systems
Microprocessors incorporate concurrency
  –Multithreading (SMT)
  –Multiprocessing (SMP, CMP)
  –Coherent I/O
Correctness defined by memory consistency model
  –Comparing predicted value to actual value not always OK
  –Different issues for different models
Violations can occur in practice
Solutions exist for detecting violations

Value Prediction
Predict the value of an instruction
  –Speculatively execute with this value
  –Later verify that prediction was correct
Example: Value predict a load that misses in cache
  –Execute instructions dependent on value-predicted load
  –Verify the predicted value when the load data arrives
Without concurrency: simple verification is OK
  –Compare actual value to predicted
Value prediction literature has ignored concurrency
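A minimal Python sketch of the predict, speculate, verify loop on this slide, for the case without concurrency. The function name `run_with_value_prediction` and its parameters are hypothetical and only illustrate the idea; they do not model any real pipeline.

```python
def run_with_value_prediction(predicted, fetch_actual, dependent):
    """Run `dependent` on a predicted load value, then verify the prediction.

    Returns (result, replayed). All names here are illustrative assumptions.
    """
    # Speculatively execute the dependent work while the miss is outstanding.
    speculative_result = dependent(predicted)
    # The load data arrives.
    actual = fetch_actual()
    # Simple verification: compare the actual value to the predicted one.
    if actual == predicted:
        return speculative_result, False
    # Misprediction: squash the speculative work and replay with the real value.
    return dependent(actual), True


# Correct prediction: the speculative result is kept.
print(run_with_value_prediction(42, lambda: 42, lambda v: v + 1))
# Wrong prediction: the dependent work is replayed.
print(run_with_value_prediction(0, lambda: 42, lambda v: v + 1))
```

With a single thread this comparison is sufficient, because nothing else can change memory between the prediction and the verification.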

Informal Example of Problem, part 1
Student #2 predicts grades are on bulletin board B
Based on prediction, assumes score is 60
[Figure: Bulletin Board B, “Grades for Class”, columns: Student ID, score]

Informal Example of Problem, part 2
Professor now posts actual grades for this class
  –Student #2 actually got a score of 80
Announces to students that grades are on board B
[Figure: Bulletin Board B, “Grades for Class”, columns: Student ID, score]

Informal Example of Problem, part 3
Student #2 sees prof’s announcement and says, “I made the right prediction (bulletin board B), and my score is 60”!
Actually, Student #2’s score is 80
What went wrong here?
  –Intuition: predicted value from future
Problem is concurrency
  –Interaction between student and professor
  –Just like multiple threads, processors, or devices
  –E.g., SMT, SMP, CMP

Linked List Example of Problem (initial state)
Linked list with single writer and single reader
No synchronization (e.g., locks) needed
[Figure: initial state of list. head = A; A.data = 42, A.next = null; uninitialized node B with B.data = 60, B.next = null]

Linked List Example of Problem (Writer)
Writer sets up node B and inserts it into list
[Figure: head = B; A.data = 42, A.next = null; B.data = 80, B.next = A]
Code For Writer Thread
  W1: store mem[B.data] <- 80      (setup node)
  W2: load reg0 <- mem[Head]       (setup node)
  W3: store mem[B.next] <- reg0    (setup node)
  W4: store mem[Head] <- B         (insert)

Linked List Example of Problem (Reader)
Reader cache misses on head and value predicts head=B.
Cache hits on B.data and reads 60.
Later “verifies” prediction of B.
Is this execution legal?
[Figure: head = ? (predicted head=B); A.data = 42, A.next = null; B.data = 60, B.next = null]
Code For Reader Thread
  R1: load reg1 <- mem[Head]    = B
  R2: load reg2 <- mem[reg1]    = 60

Why This Execution Violates SC
Sequential Consistency
  –Simplest memory consistency model
  –Must exist total order of all operations
  –Total order must respect program order at each processor
Our example execution has a cycle
  –No total order exists

Trying to Find a Total Order
What orderings are enforced in this example?
Code For Writer Thread
  W1: store mem[B.data] <- 80      (setup node)
  W2: load reg0 <- mem[Head]       (setup node)
  W3: store mem[B.next] <- reg0    (setup node)
  W4: store mem[Head] <- B         (insert)
Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R2: load reg2 <- mem[reg1]
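One way to approach the question on this slide is brute force: enumerate every interleaving of the writer and reader that respects program order, and collect what the reader can observe. The Python sketch below does that under stated assumptions: node addresses are modelled as the strings "A" and "B", `mem[reg1]` is taken to mean the data field of the node that `reg1` points to, and the initial values (A.data = 42, B.data = 60) are read off the initial-state figure. All function names are hypothetical.

```python
from itertools import combinations


def initial_memory():
    # Initial state of the list; node B is not yet initialized by the writer.
    return {"Head": "A", "A.data": 42, "A.next": None, "B.data": 60, "B.next": None}


# Writer thread (W1..W4) and reader thread (R1, R2) as operations on shared state.
def w1(mem, regs): mem["B.data"] = 80
def w2(mem, regs): regs["reg0"] = mem["Head"]
def w3(mem, regs): mem["B.next"] = regs["reg0"]
def w4(mem, regs): mem["Head"] = "B"
def r1(mem, regs): regs["reg1"] = mem["Head"]
def r2(mem, regs): regs["reg2"] = mem[regs["reg1"] + ".data"]


def sc_outcomes():
    """Return every (reg1, reg2) pair reachable under sequential consistency."""
    writer, reader = [w1, w2, w3, w4], [r1, r2]
    total = len(writer) + len(reader)
    outcomes = set()
    # Choosing which slots the reader occupies fixes one interleaving,
    # because each thread's own operations stay in program order.
    for reader_slots in combinations(range(total), len(reader)):
        mem, regs = initial_memory(), {}
        w_ops, r_ops = iter(writer), iter(reader)
        for slot in range(total):
            op = next(r_ops) if slot in reader_slots else next(w_ops)
            op(mem, regs)
        outcomes.add((regs["reg1"], regs["reg2"]))
    return outcomes


print(sorted(sc_outcomes()))
```

In this toy model the reader sees either the old list (A, 42) or the fully inserted node (B, 80). The pair (B, 60) produced by the value-predicted execution never appears.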

Program Order
Must enforce program order
Code For Writer Thread
  W1: store mem[B.data] <- 80      (setup node)
  W2: load reg0 <- mem[Head]       (setup node)
  W3: store mem[B.next] <- reg0    (setup node)
  W4: store mem[Head] <- B         (insert)
Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R2: load reg2 <- mem[reg1]

Data Order
If we predict that R1 returns the value B, we can violate SC
Code For Writer Thread
  W1: store mem[B.data] <- 80      (setup node)
  W2: load reg0 <- mem[Head]       (setup node)
  W3: store mem[B.next] <- reg0    (setup node)
  W4: store mem[Head] <- B         (insert)
Code For Reader Thread
  R1: load reg1 <- mem[Head]    = B
  R2: load reg2 <- mem[reg1]    = 60
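The cycle can also be checked mechanically. The sketch below, in Python, encodes the program-order edges plus the two data-order edges implied by the values the reader observed: R1 returns B, so W4 must precede R1, and R2 returns the old value 60, so R2 must precede W1. The edge lists and the helper `has_total_order` are illustrative assumptions; the check itself uses the standard-library `graphlib.TopologicalSorter`, which raises `CycleError` when no total order exists.

```python
from graphlib import CycleError, TopologicalSorter

# Each edge (x, y) means "x must come before y" in any legal total order.
PROGRAM_ORDER = [("W1", "W2"), ("W2", "W3"), ("W3", "W4"), ("R1", "R2")]
DATA_ORDER = [
    ("W4", "R1"),  # R1 read Head == B, the value stored by W4
    ("R2", "W1"),  # R2 read B.data == 60, the value from before W1 stored 80
]


def has_total_order(edges):
    """Return True if the ordering constraints admit some total order."""
    predecessors = {}
    for before, after in edges:
        predecessors.setdefault(before, set())
        predecessors.setdefault(after, set()).add(before)
    try:
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


print(has_total_order(PROGRAM_ORDER))               # program order alone
print(has_total_order(PROGRAM_ORDER + DATA_ORDER))  # with the observed values
```

The cycle is W1, W4, R1, R2 and back to W1, so the second call finds no total order.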

Value Prediction and Sequential Consistency
Key: value prediction reorders dependent operations
  –Specifically, read-to-read data dependence order
Execute dependent operations out of program order
Applies to almost all consistency models
  –Models that enforce data dependence order
Must detect when this happens and recover
Similar to other optimizations that complicate SC

How to Fix SC Implementations
Address-based detection of violations
  –Student watches board B between prediction and verification
  –Like existing techniques for out-of-order SC processors
  –Track stores from other threads
  –If address matches speculative load, possible violation
Value-based detection of violations
  –Student checks grade again at verification
  –Also an existing idea
  –Replay all speculative instructions at commit
  –Can be done with dynamic verification (e.g., DIVA)
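The two detection styles can be contrasted with a small Python sketch applied to the reader from the linked-list example. This is a toy model under stated assumptions, not a description of any real design: `SpeculativeLoad`, `observe_remote_store`, and `value_mismatches` are hypothetical names, and memory is a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class SpeculativeLoad:
    name: str
    address: str
    value_used: object      # value the speculative execution consumed
    conflict: bool = False  # set by address-based detection


def observe_remote_store(in_flight, address):
    """Address-based: a store from another thread marks matching loads."""
    for load in in_flight:
        if load.address == address:
            load.conflict = True  # possible violation, must recover


def value_mismatches(in_flight, memory):
    """Value-based: replay each load at commit and compare the values."""
    return [load.name for load in in_flight if memory[load.address] != load.value_used]


# Reader state: R1 was value predicted as B, R2 hit in the cache and read 60.
in_flight = [SpeculativeLoad("R1", "Head", "B"), SpeculativeLoad("R2", "B.data", 60)]
memory = {"Head": "A", "B.data": 60}

# The writer's stores W1 and W4 become visible before the reader commits.
for address, value in [("B.data", 80), ("Head", "B")]:
    memory[address] = value
    observe_remote_store(in_flight, address)

address_flagged = [load.name for load in in_flight if load.conflict]
value_flagged = value_mismatches(in_flight, memory)
print(address_flagged, value_flagged)
```

Comparing only R1's predicted value against memory would pass, since Head really is B at verification time. The address-based check conservatively flags both loads, while the value-based replay pinpoints R2, whose value 60 is stale.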

Relaxed Consistency Models
Relax some orderings between reads and writes
Allows HW/SW optimizations
Software must add memory barriers to get ordering
Intuition: should make value prediction easier
Our intuition is wrong …

Weakly Ordered Consistency Models
Relax orderings unless memory barrier between
Examples:
  –SPARC RMO
  –IA-64
  –PowerPC
  –Alpha
Subtle point that affects value prediction
  –Does model enforce data dependence order?

Relaxed Models that Enforce Data Dependence
Examples: SPARC RMO, PowerPC, and IA-64
Memory barrier orders W4 after W1, W2, W3
Code For Writer Thread
  W1: store mem[B.data] <- 80      (setup node)
  W2: load reg0 <- mem[Head]       (setup node)
  W3: store mem[B.next] <- reg0    (setup node)
  W3b: Memory Barrier
  W4: store mem[Head] <- B         (insert)
Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R2: load reg2 <- mem[reg1]

Violating Consistency Model
Simple value prediction can break RMO, PPC, IA-64
How? By relaxing dependence order between reads
Same issues as for SC and PC

Solutions to Problem
1. Don’t enforce dependence order (add memory barriers)
  –Changes architecture
  –Breaks backward compatibility
  –Not practical
2. Enforce SC or PC
  –Potential performance loss
3. More efficient solutions possible

Models that Don’t Enforce Data Dependence
Example: Alpha
Requires extra memory barrier (between R1 & R2)
Code For Writer Thread
  W1: store mem[B.data] <- 80      (setup node)
  W2: load reg0 <- mem[Head]       (setup node)
  W3: store mem[B.next] <- reg0    (setup node)
  W3b: Memory Barrier
  W4: store mem[Head] <- B         (insert)
Code For Reader Thread
  R1: load reg1 <- mem[Head]
  R1b: Memory Barrier
  R2: load reg2 <- mem[reg1]

Issues in Not Enforcing Data Dependence
Works correctly with value prediction
  –No detection mechanism necessary
  –Do not need to add any more memory barriers for VP
Additional memory barriers
  –Non-intuitive locations
  –Added burden on programmer
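A toy Python model can illustrate why value prediction is harmless when the architecture does not enforce data dependence order. The assumptions are deliberately strong and are not a model of Alpha itself: the writer's barrier is represented by keeping W1 before W4, W2 and W3 are omitted because they do not affect the reader's registers, and "no dependence order" is represented by letting R2 be performed before R1 with a predicted address of B. The function name `reader_outcomes` is hypothetical.

```python
from itertools import combinations


def reader_outcomes(reader_barrier):
    """Return the (reg1, reg2) pairs this toy model allows for the reader."""
    results = set()
    # r2_early=True: R2 is performed before R1 using the predicted address B.
    # A barrier between R1 and R2 (R1b) forbids that reordering.
    early_choices = [False] if reader_barrier else [False, True]
    for r2_early in early_choices:
        for reader_slots in combinations(range(4), 2):
            mem = {"Head": "A", "A.data": 42, "B.data": 60}
            regs = {}

            def w1(): mem["B.data"] = 80
            def w4(): mem["Head"] = "B"
            def r1(): regs["reg1"] = mem["Head"]
            def r2(): regs["reg2"] = mem[regs["reg1"] + ".data"]
            def r2_predicted(): regs["reg2"] = mem["B.data"]

            writer = iter([w1, w4])  # writer barrier keeps W1 before W4
            reader = iter([r2_predicted, r1] if r2_early else [r1, r2])
            for slot in range(4):
                op = next(reader) if slot in reader_slots else next(writer)
                op()
            # A mispredicted address would be squashed and replayed; those
            # executions are already covered by the in-order case.
            if r2_early and regs["reg1"] != "B":
                continue
            results.add((regs["reg1"], regs["reg2"]))
    return results


print(sorted(reader_outcomes(reader_barrier=True)))
print(sorted(reader_outcomes(reader_barrier=False)))
```

Without the reader-side barrier the model already permits (B, 60), so a value-predicted execution that produces it breaks nothing the programmer was promised. With the barrier R1b in place only (A, 42) and (B, 80) remain.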

Summary of Memory Model Issues
SC
Relaxed Models
  –PC: IA-32, SPARC TSO
  –Weakly Ordered Models
      Enforce Data Dependence: IA-64, SPARC RMO
      NOT Enforce Data Dependence: Alpha

Conclusions
Naïve value prediction can violate consistency
Subtle issues for each class of memory model
Solutions for SC & PC require detection mechanism
  –Use existing mechanisms for enhancing SC performance
Solutions for more relaxed memory models
  –Enforce stronger model


Commercial Memory Model Classes
Sequential Consistency (SC)
  –MIPS/SGI
  –HP PA-RISC
Processor Consistency (PC)
  –Relax write → read dependencies
  –Intel x86 (a.k.a., IA-32)
  –Sun TSO
Relaxed Consistency (RC)
  –Relax all dependencies, but add fences
  –DEC Alpha
  –IBM PowerPC
  –Sun RMO (no implementations)

With All Models, Hardware Can Use
  –Coherent Caches
  –Non-binding prefetches
  –Simultaneous & vertical multithreading
With Speculative Execution
  –Allow “expected” misses to prefetch
  –Speculatively perform all reads & writes
What’s different?

Performance Difference
RC/PC/SC can do same optimizations
But RC/PC can sometimes commit early
While SC can lose performance
  –Undoing execution on (suspected) model violation
  –Stalls due to full instruction windows, etc.
Performance over SC [Ranganathan et al. 1997]
  –11% for PC
  –20% for RC
  –Closer if SC uses their Speculative Retirement

Predictions & Recommendation
My Performance Gap Predictions
  + Longer (relative) memory latency
  - Larger caches, bigger windows, etc.
  - New inventions
My Recommendation
  –Implement SC (or PC)
  –Keep interface simple
  –Innovate in implementation


Revisiting Predictions
Evolutionary Predictions (happened, but on-balance made gap bigger)
  + Longer (relative) memory latency
  - Larger caches
  - Larger instruction windows
New Inventions
  –Run-ahead & Helper threads: wonderful prefetching
  –SMT commercialized: many threads per processor
  –Chip Multiprocessors (CMPs): many threads per chip
  –SC + ILP = RC?: can close gap
Relaxed HW memory model offers little more performance

SC + ILP = RC?, 1999
Challenge
  –Hill, however, argues that with current trends toward larger levels of on-chip integration, sophisticated microarchitectural innovation, and larger caches, the performance gap between memory models should eventually vanish.
Response
  –This paper confirms Hill’s conjecture by showing, for the first time, that an SC implementation can perform as well as an RC implementation if hardware provides enough support for speculation.
  –Deep history buffer & write speculative stores into cache
  –Filter table to detect conflicts on snoops

Commercial Memory Model Classes
Sequential Consistency (SC)
  –MIPS/SGI
  –HP PA-RISC
Processor Consistency (PC)
  –Relax write → read dependencies
  –Intel x86 (a.k.a., IA-32)
  –Sun TSO
Relaxed Consistency (RC)
  –Relax all dependencies, but add fences
  –DEC Alpha
  –IBM PowerPC
  –Sun RMO (no implementations)
  + Intel IPF (IA-64)

Current Analysis
Architectures changed mostly for business reasons
No one substantially changed model class
Clearly, all three classes work
  –E.g., generating fences not too bad

Current Options
Assume Relaxed HLL model → Three HW Model Options
Expose SC/PC & Implement SC/PC
  –Add SC/PC mechanisms & speculate! (somewhat complex)
  –HW implementers & verifiers know what correct is
Expose Relaxed & Implement Relaxed
  –Many HW implementers & verifiers don’t understand relaxed
  –More performance?
  –Deep speculation requires HW to pass fences
      Run-ahead & throw all away?
      Speculative execution with SC/PC-like mechanisms?
Expose Relaxed & Implement SC/PC
  –Implement fences as no-ops
  –Use SC/PC mechanisms, speculate!
  –HW implementers & verifiers know what correct is

Predictions & Recommendation
Predictions
  –Longer (relative) memory latency
  –Only partially compensated by caches, etc.
  –Will speculate further without larger windows (run-ahead)
  –Will need to speculate past synchronization & fences
  –Use CMPs to get many outstanding misses per chip
Recommendations (unrepentant)
  –Implement SC (or PC)
  –Keep interface simple
  –Innovate in implementation

Outline
Subroutine Call
  –Value Prediction & Memory Model Subtleties
Review Original Paper [Computer, Dec. 1998]
  –High- vs. Low-Level Memory Models
  –Commercial Memory Model Classes
  –Performance Similarities & Differences
  –Predictions & Recommendation
Revisit in 2003
  –Revisiting 1998 Predictions
  –“SC + ILP = RC?” Paper
  –Revisiting Commercial Memory Model Classes
  –Analysis, Predictions, & Recommendation