TESTING AND EXPOSING WEAK GPU MEMORY MODELS. MS Thesis Defense by Tyler Sorensen. Advisor: Ganesh Gopalakrishnan. May 30, 2014.


1 TESTING AND EXPOSING WEAK GPU MEMORY MODELS MS Thesis Defense by Tyler Sorensen Advisor: Ganesh Gopalakrishnan May 30, 2014

2 Joint Work with: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), Alastair Donaldson, John Wickerson (Imperial College London), Mark Batty (University of Cambridge) 2

3 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 3

4 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 4

5 GPU Background 5 Images from Wikipedia [16,17,18] GPU is a highly parallel co-processor Currently found in devices from tablets to top supercomputers (Titan) Not just used for visualization anymore!

6 GPU Programming Explicit hierarchical concurrency model 6 Thread hierarchy: Thread, Warp, CTA (Cooperative Thread Array), Kernel (GPU program) Memory hierarchy: Shared memory, Global memory
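To make the hierarchy concrete, here is a minimal CUDA sketch (function and buffer names are illustrative, not from the thesis) showing a per-CTA shared array and a device-wide global array:

// Minimal CUDA sketch of the hierarchy (illustrative names).
// Each CTA (thread block) has its own shared memory;
// global memory is visible to every CTA in the kernel.
__global__ void hierarchy_demo(int *global_buf) {
    __shared__ int shared_buf[256];            // per-CTA shared memory
    int tid  = threadIdx.x;                    // thread ID within the CTA
    int cta  = blockIdx.x;                     // CTA ID within the kernel
    int warp = tid / 32;                       // warps are groups of 32 threads

    shared_buf[tid] = tid;                     // visible only within this CTA
    global_buf[cta * blockDim.x + tid] = warp; // visible to all CTAs
}
// Example launch: 4 CTAs of 256 threads each:
//   hierarchy_demo<<<4, 256>>>(dev_ptr);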

7 GPU Programming 7

8 GPUs are SIMT (Single Instruction, Multiple Thread) NVIDIA GPUs may be programmed using CUDA or OpenCL 8

9 GPU Programming 9

10 Weak Memory Models 10 Consider the test known as Store Buffering (SB)

11 Weak Memory Models 11 Consider the test known as Store Buffering (SB) Initial State: x and y are memory locations

12 Weak Memory Models 12 Consider the test known as Store Buffering (SB) Thread IDs

13 Weak Memory Models 13 Consider the test known as Store Buffering (SB) Program: for each thread ID

14 Weak Memory Models 14 Consider the test known as Store Buffering (SB) Assertion: question about the final state of registers

15 Weak Memory Models 15 Consider the test known as Store Buffering (SB) Can this assertion be satisfied?


21 21 The assertion cannot be satisfied by any interleaving of the threads' instructions. This is known as sequential consistency (or SC) [1]
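For reference, the SB test has the following shape; this is a sketch with illustrative names written as CUDA device code (the thesis's tests are in PTX), assuming x and y start at 0:

// Store Buffering (SB), assuming x and y start at 0.
//   T0: x = 1; r0 = y        T1: y = 1; r1 = x
// Under SC at least one store is visible to the other thread's load,
// so r0 == 0 && r1 == 0 is forbidden; a weak model may allow it by
// reordering each store with the later load.
__device__ void sb_thread0(volatile int *x, volatile int *y, int *r0) {
    *x  = 1;      // store x
    *r0 = *y;     // load y
}
__device__ void sb_thread1(volatile int *x, volatile int *y, int *r1) {
    *y  = 1;      // store y
    *r1 = *x;     // load x
}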

22 Weak Memory Models Can we assume assertion will never pass? 22

23 Weak Memory Models Can we assume assertion will never pass? No! 23

24 Weak Memory Models Executing this test with the Litmus tool [2] on an Intel i7 x86 processor for many iterations, we get the following histogram of results: 24

25 Weak Memory Models What Happened? Architectures implement weak memory models where the hardware is allowed to re-order certain memory instructions. On x86 architectures, the hardware is allowed to re-order write instructions with program-order later read instructions [3] 25

26 GPU Memory Models What type of memory model do current GPUs implement? Documentation is sparse: CUDA has 1 page + 1 example [4]; PTX has 1 page + 0 examples [5]. No specifics about which instructions are allowed to be re-ordered. We need to know if we are to write correct GPU programs! 26

27 Our Approach Empirically explore the memory model implemented on deployed NVIDIA GPUs Achieved by developing a memory model testing tool for NVIDIA GPUs with specialized heuristics We analyze classic memory model properties and CUDA applications in this framework with unexpected results We test large families of tests on GPUs as a basis for modeling and bug hunting 27

28 Our Approach Disclaimer: Testing is not guaranteed to reveal all behaviors 28

29 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 29

30 Prior Work Testing Memory Models: Pioneered by Bill Collier with ARCHTEST in 1992 [6] TSOtool in 2004 [7] Litmus in 2011 [2] We extend this tool 30

31 Prior Work (GPU Memory Models) June 2013: Hower et al. proposed an SC-for-race-free memory model for GPUs [8] Sorensen et al. proposed an operational weak GPU memory model based on available documentation [9] 2014: Hower et al. proposed two SC-for-race-free memory models for GPUs, HRF-direct and HRF-indirect [10] It remains unclear what memory model deployed GPUs implement 31

32 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 32

33 Testing Framework GPU litmus test 33

34 Testing Framework GPU litmus test PTX instructions 34

35 Testing Framework GPU litmus test What memory region (shared or global) are x and y in? 35

36 Testing Framework GPU litmus test Are T0 and T1 in the same CTA? Or different CTAs? 36

37 Testing Framework We consider three different GPU configurations for tests: D-warp:S-cta-Shared: Different warp, Same CTA, targeting shared memory D-warp:S-cta-Global: Different warp, Same CTA, targeting global memory D-cta:S-ker-Global: Different CTA, Same kernel, targeting global memory 37
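As an illustration of these placements (thread-index arithmetic only, not the framework's actual code), the two testing threads might be chosen like this, assuming a warp size of 32:

// Illustrative placement of the two testing threads T0 and T1.
__global__ void placement_demo(void) {
    int tid = threadIdx.x;
    // D-warp:S-cta (shared or global memory): different warps, same CTA.
    if (blockIdx.x == 0 && tid == 0)  { /* T0's program */ }
    if (blockIdx.x == 0 && tid == 32) { /* T1's program */ }
    // D-cta:S-ker (global memory only): different CTAs, same kernel.
    // if (blockIdx.x == 0 && tid == 0) { /* T0's program */ }
    // if (blockIdx.x == 1 && tid == 0) { /* T1's program */ }
}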

38 Testing Framework Given a GPU Litmus test produce executable CUDA or OpenCL 38

39 Testing Framework Host (CPU) generated code 39

40 Testing Framework Host (CPU) generated code 40

41 Testing Framework Host (CPU) generated code 41

42 Testing Framework Host (CPU) generated code 42

43 Testing Framework Host (CPU) generated code 43

44 Testing Framework Host (CPU) generated code 44

45 Testing Framework Host (CPU) generated code 45

46 Testing Framework Kernel generated code 46

47 Testing Framework Kernel generated code 47

48 Testing Framework Kernel generated code 48

49 Testing Framework Kernel generated code 49

50 Testing Framework Kernel generated code 50

51 Testing Framework The basic framework shows NO weak behaviors. We develop heuristics (which we dub incantations) to encourage weak behaviors to appear 51

52 Testing Framework General bank conflict incantation Each access in test is exclusively one of: Optimal 52

53 Testing Framework General bank conflict incantation Each access in test is exclusively one of: Optimal Broadcast 53

54 Testing Framework General bank conflict incantation Each access in test is exclusively one of: Optimal Broadcast Bank Conflict 54

55 Testing Framework General Bank Conflict Heuristic Given this test: 55

56 Testing Framework General Bank Conflict Heuristic One possible general bank conflict scheme: Bank Conflict Optimal Broadcast 56

57 Testing Framework Two critical incantations (without them we observe no weak executions): General Bank Conflicts (shown previously) Memory Stress: All non-testing threads read/write to memory 57

58 Testing Framework Two extra incantations: Sync: testing threads synchronize before test Randomization: testing thread IDs are randomized 58
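As a sketch of the memory-stress incantation described above (array size, iteration count, and names are illustrative, not the framework's actual parameters):

// Memory-stress incantation sketch: non-testing threads repeatedly
// read and write a scratch region of global memory while the two
// testing threads run the litmus test.
__device__ void memory_stress(volatile int *scratch, int scratch_size, int iters) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) % scratch_size;
    for (int i = 0; i < iters; ++i) {
        scratch[idx] = i;                               // write
        idx = (idx + scratch[idx] + 1) % scratch_size;  // read, move to a new slot
    }
}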

59 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 59

60 Traditional Tests We show the results for these tests, which have been studied for CPUs in [3]: MP (Message Passing): can stale values be read in a handshake idiom? SB (Store Buffering): can stores be buffered after loads? LD (Load Delaying): can loads be delayed after stores? Results show running 100,000 iterations over 3 chips: Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell) 60

61 Message Passing 61 Tests how to implement a handshake idiom

62 Message Passing 62 Tests how to implement a handshake idiom Flag

63 Message Passing 63 Tests how to implement a handshake idiom Data

64 Message Passing 64 Tests how to implement a handshake idiom Stale Data

65 Message Passing 65

66 Message Passing 66 How do we disallow reading stale data? PTX provides two intra-device fences [5, p. 165]: membar.cta – gives ordering properties intra-CTA; membar.gl – gives ordering properties over the device

67 Message Passing 67 Test amended with a parameterizable fence
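One way to sketch the amended MP test from CUDA is with inline PTX for the fence; the locations, register names, and the choice of membar.gl below are illustrative, assuming x (data) and y (flag) start at 0:

// Message Passing (MP) with a parameterizable fence, sketched with inline PTX.
// T0 writes the data then the flag; T1 reads the flag then the data.
__device__ void mp_writer(volatile int *data, volatile int *flag) {
    *data = 1;                                  // write data
    asm volatile("membar.gl;" ::: "memory");    // fence (membar.cta for intra-CTA tests)
    *flag = 1;                                  // write flag
}
__device__ void mp_reader(volatile int *data, volatile int *flag, int *r0, int *r1) {
    *r0 = *flag;                                // read flag
    asm volatile("membar.gl;" ::: "memory");    // fence
    *r1 = *data;                                // read data
}
// Weak outcome under test: r0 == 1 && r1 == 0 (stale data).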

68 Message Passing 68

69 Message Passing 69

70 Message Passing 70

71 Store Buffering 71 Can stores be delayed after loads?

72 Store Buffering 72

73 Load Delaying 73 Can loads be delayed after stores?

74 Load Delaying 74

75 CoRR Test Coherence is SC per memory location [11, p. 14] Modern processors (ARM, POWER, x86) implement coherence All language models require coherence (C++11, OpenCL 2.0) A coherence violation has been observed, and confirmed as a bug, in ARM chips [3, 12] 75

76 CoRR Test Coherence of Read-Read test Can loads from the same location return stale values? 76
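A sketch of the CoRR shape, assuming x starts at 0 (names illustrative; the thesis's tests are written in PTX):

// Coherence of Read-Read (CoRR), assuming x starts at 0.
//   T0: x = 1        T1: r0 = x; r1 = x
// Coherence forbids r0 == 1 && r1 == 0 (the second load returning an
// older value than the first).
__device__ void corr_writer(volatile int *x) {
    *x = 1;                                     // store x
}
__device__ void corr_reader(volatile int *x, int *r0, int *r1) {
    *r0 = *x;                                   // first load of x
    *r1 = *x;                                   // second load of x
}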

77 CoRR Test 77

78 CoRR Test 78

79 CoRR Test 79

80 CoRR Test Coherence of Read-Read test Test amended with a parameterized fence 80

81 CoRR Test 81

82 CoRR Test 82

83 CoRR Test 83

84 Results Take Away Current GPUs implement observably weak memory models with scoped properties. Without formal docs, how can developers know what behaviors to rely on? This is biting developers even now (discussed next) 84

85 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 85

86 GPU Spin Locks Inter-CTA lock presented in the book CUDA By Example [13] 86

87 GPU Spin Locks Inter-CTA lock presented in the book CUDA By Example [13] 87

88 GPU Spin Locks Inter-CTA lock presented in the book CUDA By Example [13] 88

89 GPU Spin Locks Inter-CTA lock presented in the book CUDA By Example [13] 89

90 GPU Spin Locks Distilled to a litmus test ( y is mutex, x is data): 90

91 GPU Spin Locks Distilled to a litmus test ( y is mutex, x is data): 91 Initially Locked by T0

92 GPU Spin Locks Distilled to a litmus test ( y is mutex, x is data): 92 Unlock CS* *CS = Critical Section

93 GPU Spin Locks Distilled to a litmus test ( y is mutex, x is data): 93 lock CS* *CS = Critical Section

94 GPU Spin Locks Distilled to a litmus test ( y is mutex, x is data): 94 T1 Observes Stale Value *CS = Critical Section

95 GPU Spin Locks Distilled to a litmus test ( y is mutex, x is data): 95 *CS = Critical Section

96 GPU Spin Locks Do we observe stale data in the Critical Section? 96

97 GPU Spin Locks Do we observe stale data in the Critical Section? Yes! 97

98 GPU Spin Locks Spin lock test amended with fences 98

99 GPU Spin Locks Now test with fences: 99

100 GPU Spin Locks Now test with fences: Is membar.cta enough? 100

101 GPU Spin Locks Now test with fences: Is membar.cta enough? No! It is an inter-CTA lock! Is membar.gl enough? 101

102 GPU Spin Locks Now test with fences: Is membar.cta enough? No! It is an inter-CTA lock! Is membar.gl enough? Yes! 102

103 GPU Spin Lock More examples without fences, which have similar issues: Mutex in Efficient Synchronization Primitives for GPUs [14] Non-blocking GPU deque in GPU Computing Gems Jade Edition [15] GPU applications must use fences!!! 103
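For illustration, here is a minimal inter-CTA spin lock with device-scope fences added; this is a sketch in the spirit of the CUDA by Example lock, not the book's exact code. __threadfence() compiles to membar.gl, the fence the tests above show is needed:

// Inter-CTA spin lock sketch with device-scope fences.
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0) { }     // spin until the lock is acquired
    __threadfence();                            // fence: critical-section loads must not see stale data
}
__device__ void unlock(int *mutex) {
    __threadfence();                            // fence: make critical-section stores visible first
    atomicExch(mutex, 0);                       // release the lock
}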

104 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 104

105 Bulk Testing Daniel Poetzl (University of Oxford) is developing GPU extensions to DIY test generation [3] Test generation is based on critical cycles Used for validating models, finding bugs, gaining intuition about observable behaviors Image used with permission from [3] 105

106 Bulk Testing We have generated over 8000 tests across intra/inter-CTA interactions, targeting both shared and global memory Tests include memory barriers (e.g. membar.{cta,gl,sys}) and dependencies (data, address, and control) Tested 5 chips across 3 generations: GTX 540M (Fermi), Tesla C2075 (Fermi), GTX 660 (Kepler), GTX Titan (Kepler), GTX 750 Ti (Maxwell) 106

107 Roadmap Background and Approach Prior Work Testing Framework Results CUDA Spin Locks Bulk Testing Future Work and Conclusion 107

108 Future Work Test more complicated GPU configurations (e.g. both shared and global in the same test) Example: Intra-CTA Store Buffering (SB) test is observable on Maxwell only with mixed shared and global memory locations.

109 Future Work Axiomatic memory model in Herd [3] New scoped relations: Internal–CTA: Contains pairs of instructions that are in the same CTA Can easily compare model to observations Based on acyclic relations Image used with permission from [3] 109

110 Conclusion Current GPUs have observably weak memory models which are largely undocumented GPU programming is proceeding without adequate guidelines, which results in buggy code (development of reliable GPU code is impossible without specs) Rigorous documentation, testing, and verification of GPU programs based on formal tools is the way forward in terms of developing reliable GPU applications 110

111 References [1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs," IEEE Trans. Comput., Sep. 1979. [2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware," ser. TACAS'11. Springer-Verlag, 2011. [3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory," 2014, to appear in TOPLAS. [4] NVIDIA, "CUDA C Programming Guide, Version 6," 2014. [5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0," Feb. 2014. [6] W. W. Collier, Reasoning About Parallel Architectures. Prentice-Hall, Inc., 1992. [7] S. Hangal, D. Vahia, C. Manovit, and J.-Y. J. Lu, "TSOtool: A program for verifying memory systems using the memory consistency model," ser. ISCA '04. IEEE Computer Society, 2004.

112 References [8] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free," ser. MSPC'13. ACM, 2013. [9] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs," ser. ICS'13. ACM, 2013. [10] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," ser. ASPLOS'14. ACM, 2014. [11] D. J. Sorin, M. D. Hill, and D. A. Wood, A Primer on Memory Consistency and Cache Coherence, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2011. [12] ARM, "Cortex-A9 MPCore, programmer advice notice, read-after-read hazards." [13] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.

113 References [14] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs," CoRR, 2011. [15] W.-m. W. Hwu, GPU Computing Gems Jade Edition. Morgan Kaufmann Publishers Inc., 2011. [16] Wikipedia. [17] Wikipedia. [18] Wikipedia. 113

114 Acknowledgements Advisor: Ganesh Gopalakrishnan Committee: Zvonimir Rakamaric, Mary Hall UK Group: Jade Alglave (University College London), Daniel Poetzl (University of Oxford), Luc Maranget (Inria), John Wickerson, Alastair Donaldson (Imperial College London), Mark Batty (University of Cambridge) Mohammed for feedback on practice runs 114

115 Thank You 115

116 Prior Work (GPU Memory Models) June 2010: Feng and Xiao revisit their GPU device-wide synchronization method [?] to repair it with fences [?] Speaking about weak behaviors, they state: In practice, it is infinitesimally unlikely that this will ever happen given the amount of time that is spent spinning at the barrier, e.g., none of our thousands of experimental runs ever resulted in an incorrect answer. Furthermore, no existing literature has been able to show how to trigger this type of error. 116

117 Testing Framework Evaluate inter-CTA incantations using these tests: MP: checks if stale values can be read in a handshake idiom LD: checks if loads can be delayed after stores SB: checks if stores can be delayed after loads Results show average of running 100,000 iterations over 3 chips: Tesla C2075 (Fermi), GTX Titan (Kepler), and GTX 750 (Maxwell) 117

118 Inter-CTA interactions 118

119 Without Critical Incantations, No Weak Behaviors Are Observed Inter-CTA interactions 119

120 Inter-CTA interactions 120

121 Most Effective Incantations Inter-CTA interactions 121

122 Testing Framework Evaluate intra-CTA incantations using these tests*: MP-Global: Message Passing tests targeting the global memory region MP-Shared: Message Passing tests targeting the shared memory region 122 * The previous tests (LD, SB) are not observable intra-CTA

123 Intra-CTA interactions 123

124 Intra-CTA interactions Without Critical Incantations, No Weak Behaviors Are Observed 124

125 Intra-CTA interactions 125

126 Intra-CTA interactions Most Effective Incantations 126

127 Bulk Testing Invalidated GPU memory model from [?] Model disallows behaviors observed on hardware Gives too strong of orderings to load operations inter-CTA 127

128 Bulk Testing Invalidated GPU memory model from [?] Model disallows behaviors observed on hardware Gives too strong of orderings to load operations inter-CTA 128

129 GPU Hardware Multiple SMs (Streaming Multiprocessors) SMs contain CUDA Cores Each SM has an L1 cache All SMs share an L2 cache and DRAM Warp scheduler executes threads in groups of 32

130 GPU Hardware 130

131 GPU Programming to Hardware 131 Threads in same CTA are mapped to same SM Shared memory is in L1 (Maxwell is an Exception) Global memory is in DRAM and cached in L2 (Fermi is an Exception) Warp scheduler executes threads in groups of 32

132 Testing Framework 132

133 Testing Framework Initial value of shared memory locations 133

134 Testing Framework Thread IDs 134

135 Testing Framework Programs (written in NVIDIA PTX) 135

136 Testing Framework Assertion about final state of system 136

137 GPU Terminology We Use 137

