
1 CISC 879 : Advanced Parallel Programming Rahul Deore, Dept. of Computer & Information Sciences, University of Delaware Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors Blake A. Hechtman and Daniel J. Sorin

2 CISC 879 : Advanced Parallel Programming Outline
- Motivation
- Background
- Memory Consistency
- MTTOP Experimentation
- Conclusion
Some slides adapted from Blake et al. and S. Zuckerman et al. (Memory Consistency)

3 CISC 879 : Advanced Parallel Programming Motivation x = 1 r1 = y

4 CISC 879 : Advanced Parallel Programming Motivation y = 1 r2 = x

5 CISC 879 : Advanced Parallel Programming Motivation Motivating example (1)
Thread 1: x = 1 ; r1 = y
Thread 2: y = 1 ; r2 = x
Initially x = y = 0. Is it possible to have r1 = r2 = 0?
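This is the classic store-buffering litmus test. Below is a minimal, hedged C++11 sketch (my own illustration, not from the paper or the slides): with relaxed atomics the hardware or compiler may reorder each store past the following load, so r1 = r2 = 0 is an allowed outcome; with the default memory_order_seq_cst it is forbidden.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    void thread1() {
        // relaxed ordering: the store to x and the load of y may be reordered
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        std::printf("r1 = %d, r2 = %d\n", r1, r2);  // r1 = r2 = 0 is possible here
        return 0;
    }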

6 CISC 879 : Advanced Parallel Programming Motivation A similar motivating example (2), from "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" – Leslie Lamport, 1979 (the paper that defined Sequential Consistency)
Process 1: A := 1 ; if (B = 0) then critical section; A := 0 else something else
Process 2: B := 1 ; if (A = 0) then critical section; B := 0 else something else
Initially A = B = 0. Is mutual exclusion guaranteed?
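A hedged C++ sketch of this flag protocol (the names process1, process2, and the in_critical counter are illustrative, not from the slides). With the default sequentially consistent atomics, at most one thread can observe the other's flag as 0, so the assertion never fires; if the operations were relaxed so that each flag store could be reordered after the following load, both threads could read 0 and both would enter the critical section.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> A{0}, B{0};
    std::atomic<int> in_critical{0};  // number of threads currently in the critical section

    void process1() {
        A.store(1);                                  // seq_cst by default
        if (B.load() == 0) {
            assert(in_critical.fetch_add(1) == 0);   // critical section: must be alone here
            in_critical.fetch_sub(1);
            A.store(0);
        }                                            // else: something else
    }

    void process2() {
        B.store(1);
        if (A.load() == 0) {
            assert(in_critical.fetch_add(1) == 0);
            in_critical.fetch_sub(1);
            B.store(0);
        }
    }

    int main() {
        std::thread t1(process1), t2(process2);
        t1.join();
        t2.join();
        return 0;
    }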

7 CISC 879 : Advanced Parallel Programming Background What is Memory Consistency all about???

8 CISC 879 : Advanced Parallel Programming Memory Consistency Q. What happens when at least two concurrent memory operations arrive at the same memory location x?

9 CISC 879 : Advanced Parallel Programming Memory Consistency Q. What happens when at least two concurrent memory operations arrive at the same memory location x? Data Race?

10 CISC 879 : Advanced Parallel Programming Memory Consistency
Uniform Memory Consistency Models: strongest MCMs, weaker MCMs
Non-Uniform Memory Consistency Models: hardware-oriented MCMs, software- and programmer-oriented MCMs
Slides adapted from S. Zuckerman et al.

11 CISC 879 : Advanced Parallel Programming Atomic Consistency
A system is AC if:
- All memory operations are issued and performed in some total order
- There is a real-time constraint: each operation appears to take effect at some point during its actual execution
- Memory operations must follow program order
It is the strongest MCM that has been conceived, but it has never been implemented.

12 CISC 879 : Advanced Parallel Programming Sequential Consistency
A system is SC if:
- All memory operations appear to follow some total order
- Memory operations (appear to) follow program order
Lamport's definition of SC: "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."

13 CISC 879 : Advanced Parallel Programming Back to our example…
Thread 1: x = 1 ; r1 = y
Thread 2: y = 1 ; r2 = x
Initially x = y = 0. Is it possible to have r1 = r2 = 0?
Under SC: NO. r1 = 0 would require the load of y to precede the store y = 1, and r2 = 0 would require the load of x to precede the store x = 1; combined with each thread's program order this forms a cycle, so there is no total order in which both threads observe r1 = r2 = 0.

14 CISC 879 : Advanced Parallel Programming Coherence (cache consistency) Cache coherence is the consistency of shared resource data that ends up stored in multiple local caches Image Source: Wikipedia

15 CISC 879 : Advanced Parallel Programming Coherence (cache consistency)
Coherence is achieved if:
- For each memory location x, there is a total order of all the memory operations dealing with x
- Memory operations on x follow the program order
Note that coherence only constrains operations on the same location; it says nothing about the relative order of operations on different locations.

16 CISC 879 : Advanced Parallel Programming Back to our example…
Thread 1: x = 1 ; r1 = y
Thread 2: y = 1 ; r2 = x
Initially x = y = 0. Is it possible to have r1 = r2 = 0?
Under coherence alone: YES! Each thread's load targets a different location than its store, so coherence places no ordering between them. The execution r1 = y, x := 1 on Thread 1 and r2 = x, y := 1 on Thread 2 (each load effectively reordered before its store) yields r1 = r2 = 0.

17 CISC 879 : Advanced Parallel Programming High-level Overview

18 CISC 879 : Advanced Parallel Programming Some of my observations Image source: preshing.com

19 CISC 879 : Advanced Parallel Programming Some of my observations Image source: preshing.com Most of the examples assume a sequentially consistent memory model

20 CISC 879 : Advanced Parallel Programming Why this re-ordering??? Performance

21 CISC 879 : Advanced Parallel Programming MTTOPs
Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and used for general-purpose programming
Conventional wisdom favors weak consistency on MTTOPs
This paper implements SC, TSO and RMO on MTTOPs
Experiments show that strong consistency models are viable for MTTOPs
Slides adapted from Blake et al.

22 CISC 879 : Advanced Parallel Programming MTTOPs
Massively Threaded: 4-16 core clusters, 8-64 threads wide SIMD, 64-128 deep SMT → thousands of concurrent threads (even the low end, 4 clusters × 8-wide SIMD × 64-deep SMT, gives roughly 2,000 hardware thread contexts)
Throughput-Oriented: sacrifice latency for throughput; heavily banked caches and memories; many cores, each of which is simple

23 CISC 879 : Advanced Parallel Programming MTTOPs Examples (slide shows example chips, labeled only by core count: up to 16 cores; 64 and 72 cores; 2688 simple cores per chip; up to 61 cores)

24 CISC 879 : Advanced Parallel Programming MTTOPs Conventional Wisdom
- Highly parallel systems benefit from less ordering (graphics doesn't need ordering)
- Strong consistency seems likely to limit MLP (memory-level parallelism)
- Strong consistency is likely to suffer extra latencies
- Weak ordering helps CPUs; does it help MTTOPs?

25 CISC 879 : Advanced Parallel Programming Memory Models Experimented
- SC
- SC with write buffer
- Total Store Order (TSO)
- RC

26 CISC 879 : Advanced Parallel Programming MTTOP System Configuration

27 CISC 879 : Advanced Parallel Programming Differences: loads per store
- Prior work shows CPUs perform 2-4 loads per store
- Weak consistency reduces the impact of store latency on performance
- MTTOPs perform more loads per store → store latency optimizations will not be as critical to MTTOP performance
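To make the loads-per-store point concrete, here is a hedged, illustrative C++ sketch of a naive matrix-multiply loop (in the spirit of the matrix_mul benchmark listed later, but not the paper's actual code): each output element issues two loads per inner-loop iteration and only a single store, so the load/store ratio grows with the matrix dimension.

    #include <cstddef>
    #include <vector>

    // Illustrative only: naive C = A * B with square N x N matrices stored row-major.
    // The inner loop performs 2 loads (one from A, one from B) per iteration, while
    // the store to C happens once per output element, i.e. roughly 2*N loads per store.
    void matrix_mul(const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C, std::size_t N) {
        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < N; ++k) {
                    acc += A[i * N + k] * B[k * N + j];  // two loads per iteration
                }
                C[i * N + j] = acc;                      // one store per 2*N loads
            }
        }
    }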

28 CISC 879 : Advanced Parallel Programming Differences: outstanding L1 cache misses
- Weak consistency enables more outstanding L1 misses per thread
- MTTOPs already have more outstanding L1 cache misses across their many threads → the reordering within a thread enabled by weak consistency is less important for hiding memory latency

29 CISC 879 : Advanced Parallel Programming Results – Benchmarks used
Ported Rodinia benchmarks: bfs, hotspot, kmeans, and nn
Handwritten benchmarks: dijkstra, 2dconv, and matrix_mul

30 CISC 879 : Advanced Parallel Programming Results

31 CISC 879 : Advanced Parallel Programming Conclusion
- Strong consistency should not be ruled out for MTTOPs on the basis of performance
- Improving store performance with write buffers appears unnecessary
- Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)
- Conventional wisdom may be wrong about MTTOPs

32 CISC 879 : Advanced Parallel Programming THANK YOU QUESTIONS ?

