Presentation on theme: "1 GKLEE: Concolic Verification and Test Generation for GPUs Guodong Li 1,2, Peng Li 1, Geof Sawaya 1, Ganesh Gopalakrishnan 1, Indradeep Ghosh 2, Sreeranga."— Presentation transcript:

1 GKLEE: Concolic Verification and Test Generation for GPUs. Guodong Li 1,2, Peng Li 1, Geof Sawaya 1, Ganesh Gopalakrishnan 1, Indradeep Ghosh 2, Sreeranga P. Rajan 2 (1 University of Utah, 2 Fujitsu Labs of America). Feb. 2012

2 GPUs are widely used! About 40 of the top 500 machines are GPU based. Personal supercomputers used for scientific research (biology, physics, ...) are increasingly based on GPUs. (Images courtesy of AMD, Nvidia, Intel, and www.engadget.com.) In such application domains, it is important that GPU computations yield correct answers and are bug-free.

3-12 Existing GPU Testing Methods are Inadequate (built up over slides 3-12). Insufficient branch-coverage and interleaving-coverage, leading to – Missed data races (illustrated by one thread executing Write(a) while another executes Read(a)) – Missed deadlocks (illustrated by a __syncthreads() barrier that only some threads reach). Data races are a huge problem: testing is NEVER conclusive; one has to infer a data race's ill effects indirectly through corrupted values; even instrumented race checking gives results only for a specific platform, not for future validations, for example under a different warp scheduling, e.g. the change from the old Tesla to the new Fermi. Insufficient measurement of performance penalties due to – Warp divergence – Non-coalesced memory accesses – Bank conflicts.

13 Existing GPU Testing Methods are Inadequate. CUDA GDB debugger – manually debug the code and check for races and deadlocks. CUDA Profiler – reports numbers that are difficult to read – low coverage (i.e. not all possible inputs). GKLEE – a better tool for verification and testing – can address all the previously mentioned points – e.g. has found bugs in real SDK kernels previously thought to be bug-free – gives the root causes of the bugs.

14 Our Contributions. GKLEE: a Symbolic Virtual GPU for Verification, Analysis, and Test Generation. GKLEE reports Races, Deadlocks, Bank Conflicts, Non-Coalesced Accesses, Warp Divergences. GKLEE generates Tests to Run on GPU Hardware.

15 Architecture of GKLEE (diagram). A C++ GPU program (with symbolic inputs) and a GPU configuration are compiled by the LLVM-GCC compiler, with a CUDA syntax handler, into LLVM bytecode carrying the CUDA semantics. GKLEE (executor, scheduler, checker, test generator) runs this bytecode and produces statistics/bug reports plus test cases, which are compiled with NVCC and replayed on a real GPU.

16 Rest of the Talk. Simple CUDA example. Details of the Symbolic Virtual GPU. Analysis details: – Races, Deadlocks – Degree of Warp Divergence, Bank Conflicts, Non-Coalesced Accesses – Functional Correctness. Automatic Test Generation – Coverage-directed test-case reduction.

17 CUDA: a simple dialect of C++ with CUDA directives. Thread blocks / teams of SIMD "warps". Synchronization through barriers / atomics (GKLEE is being extended to handle atomics).
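As a concrete reminder of the two synchronization mechanisms mentioned here, the block-wise sum below is an illustrative sketch of my own (not from the talk); the 256-thread block size and the kernel name sum_block are assumptions:

__global__ void sum_block(const int *in, int *out) {
    __shared__ int partial[256];                // assumes blockDim.x == 256
    int t = threadIdx.x;
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                            // barrier: all loads are done
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();                        // barrier after each reduction step
    }
    if (t == 0) atomicAdd(out, partial[0]);     // atomic combine across blocks
}

Every thread in the block reaches each __syncthreads(), and the cross-block combination goes through atomicAdd, so neither of the bug patterns shown on the next slides arises here.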

18 Example: Increment Array Elements. Increment an N-element array A by scalar b; thread t0 updates A[0], t1 updates A[1], and so on (tid = 0, 1, ...).

__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        A[idx] = A[idx] + b;
}
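A minimal host-side launch for this kernel, as a sketch of my own (the array size, block size, and printed check are illustrative choices, not from the slide):

#include <cstdio>
#include <cuda_runtime.h>

// inc_gpu as on the slide, repeated so the sketch is self-contained.
__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) A[idx] = A[idx] + b;
}

int main() {
    const int N = 256, b = 3;
    int host[N];
    for (int i = 0; i < N; ++i) host[i] = i;

    int *dev;
    cudaMalloc(&dev, N * sizeof(int));
    cudaMemcpy(dev, host, N * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (N + threads - 1) / threads;    // enough threads to cover all N elements
    inc_gpu<<<blocks, threads>>>(dev, b, N);

    cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    std::printf("A[0]=%d A[255]=%d\n", host[0], host[255]);   // expect 3 and 258
    return 0;
}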

19 Illustration of Race. Increment an N-element vector A by scalar b, but each thread reads its neighbour's element:

__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        A[idx] = A[(idx - 1) % N] + b;
}

RACE! For example, with N = 64, thread t63 writes A[63] while thread t0 reads A[63].
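One way to remove this read/write race within a single thread block is to separate the read and write phases with a barrier; this is a sketch of my own rather than a fix from the talk. It assumes all participating threads are in one block (since __syncthreads() does not order threads in different blocks), and it uses (idx - 1 + N) % N so the index also wraps correctly for idx = 0:

__global__ void inc_gpu_racefree(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tmp = 0;
    if (idx < N)
        tmp = A[(idx - 1 + N) % N];   // every thread finishes its read...
    __syncthreads();                  // ...before any thread writes
    if (idx < N)
        A[idx] = tmp + b;
}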

20 Illustration of Deadlock. Increment an N-element vector A by scalar b, with the barrier placed inside the conditional:

__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        A[idx] = A[idx] + b;
        __syncthreads();
    }
}

DEADLOCK! Threads with idx < N wait at the barrier, but threads with idx ≥ N never reach it.
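The usual repair, again a sketch of my own rather than a slide from the talk, is to hoist the barrier out of the divergent branch so every thread in the block reaches it:

__global__ void inc_gpu_fixed(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        A[idx] = A[idx] + b;
    __syncthreads();   // reached by all threads, whether or not idx < N
}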

21 Example of a Race Found by GKLEE.

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
    const int threadPos = ((threadIdx.x & (~63)) >> 0) |
                          ((threadIdx.x & 15) << 2) |
                          ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
         pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
        unsigned data4 = d_Data[pos];
        ...
        addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
    }
    __syncthreads();
    ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
    s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}

"GKLEE: is there a race?"

22 Example of a Race Found by GKLEE (same histogram64Kernel as slide 21). GKLEE: Threads 5 and 13 have a WW (write-write) race when d_Data[5] = 0x04040404 and d_Data[13] = 0.

23 Example of Test Coverage due to GKLEE.

__shared__ unsigned shared[NUM];

__device__ inline void swap(unsigned &a, unsigned &b) {
    unsigned tmp = a;
    a = b;
    b = tmp;
}

__global__ void Bitonic_Sort(unsigned *values) {
    unsigned int tid = threadIdx.x;
    shared[tid] = values[tid];
    __syncthreads();
    for (unsigned k = 2; k <= blockDim.x; k *= 2)
        for (unsigned j = k / 2; j > 0; j /= 2) {
            unsigned ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                } else {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    values[tid] = shared[tid];
}

24 Example of Test Coverage due to GKLEE (same kernel). "How do we test this?"

25 Example of Test Coverage due to GKLEE (same kernel). Answer 1: "Random + "

26 Example of Test Coverage due to GKLEE (same kernel). Answer 2: Ask GKLEE. Here are 5 tests with 100% source code coverage and 79% average thread + barrier-interval coverage.

27 GKLEE: Symbolic Virtual GPU. (Diagram: the CUDA execution hierarchy, with the host launching Kernel 1 and Kernel 2 onto a device organized as grids of thread blocks, each block a 2-D array of threads.) GKLEE models a GPU in software – the virtual GPU represents the CUDA programming model (hence hides many hardware details) – similar to the CUDA emulator in this respect, but with many unique features – can simulate CPU + GPU (a virtual CPU plus a virtual GPU).

28 Concolic Execution on the Virtual GPU. Values in GKLEE can be CONCrete or symbOLIC (hence "concolic"). A value may be a complicated symbolic expression. Symbolic expressions are handled by constraint solvers, which determine satisfiability and give concrete values as evidence. Constraint solving has become 1,000x faster over the last 10 years.

29 Comparing Concrete and Symbolic Execution. Program: b = a * 2; c = a + b; if (c > 100) assert(0); With the concrete input a = 10, the run produces b = 20 and c = 30, so the assert branch is unreachable on this run. All values are concrete.

30 Comparing Concrete and Symbolic Execution. Program: b = a * 2; c = a + b; if (c > 100) assert(0); else ... With a symbolic input x ∈ (-∞, +∞) for a, we get b = 2x and c = 3x. The then-branch is reachable (e.g. x = 40), and the else-branch is also reachable (e.g. x = 30), where the path condition is 3x <= 100. The values can be concrete or symbolic.
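The same toy program written out with the path conditions as comments, a small illustration of my own of the two slides above:

#include <assert.h>

void toy(int a) {            // concretely a = 10; symbolically a = x, unconstrained
    int b = a * 2;           // b == 2x
    int c = a + b;           // c == 3x
    if (c > 100)             // path condition 3x > 100, satisfiable, e.g. x = 40
        assert(0);           // so symbolic execution reports the failing assert
                             // else-path condition 3x <= 100, satisfiable, e.g. x = 30
}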

31 GKLEE Works on LLVM Bytecode. CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC together with our CUDA syntax handler; the bytecode carries the CUDA syntax and semantics. Our online technical report contains a detailed description. GKLEE extends KLEE to handle CUDA features.
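Since GKLEE builds on KLEE, a test driver presumably marks its inputs symbolic the way KLEE drivers do, with klee_make_symbolic; the sketch below is an assumption-laden illustration of mine (the driver shape, sizes, and use of the standard CUDA runtime calls are my choices, not taken from the GKLEE distribution):

#include <klee/klee.h>
#include <cuda_runtime.h>

#define N 64

__global__ void inc_gpu(int *A, int b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) A[idx] = A[idx] + b;
}

int main() {
    int hostA[N];
    int b;
    klee_make_symbolic(hostA, sizeof(hostA), "A");   // symbolic array contents
    klee_make_symbolic(&b, sizeof(b), "b");          // symbolic increment

    int *devA;
    cudaMalloc(&devA, sizeof(hostA));
    cudaMemcpy(devA, hostA, sizeof(hostA), cudaMemcpyHostToDevice);
    inc_gpu<<<1, N>>>(devA, b, N);                   // executed on the virtual GPU
    cudaMemcpy(hostA, devA, sizeof(hostA), cudaMemcpyDeviceToHost);
    cudaFree(devA);
    return 0;
}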

32 Thread Scheduling: In General, an Exponential Number of Schedules! It is like shuffling decks of cards: more than 13 trillion shuffles exist for 5 decks with 5 cards each!! Likewise, more than 13 trillion schedules exist for 5 threads with 5 instructions each!! More precisely, 25! / (5!)^5.

33-34 GKLEE Avoids Examining Exponentially Many Schedules!! Instead of considering all schedules and all potential races, consider JUST THIS SINGLE CANONICAL SCHEDULE!! Folk theorem (proved in our paper): "We will find a race if there is any race"!!

35 Closer Look: Canonical Scheduling. Race-free operations can be exchanged, so any valid schedule can be permuted into another valid schedule, in particular the canonical one. A valid schedule: t2:a2: write y; t1:a1: read x; t1:a3: write x; t2:a4: write y; t2:a6: read y; t1:a5: read x. Another valid schedule (e.g. the canonical schedule): t1:a1: read x; t2:a2: write y; t1:a3: write x; t2:a4: write y; t1:a5: read x; t2:a6: read y. The scheduler: (1) applies the canonical schedule; (2) checks for races at the barriers; (3) if there is no race it continues, otherwise it reports the race and terminates.
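To make the barrier-time race check concrete, here is a small host-side C++ sketch of my own (not GKLEE code): the accesses each thread performs during a barrier interval are logged under the canonical schedule, and at the barrier any pair of accesses from different threads to the same address, at least one of which is a write, is reported as a race.

#include <cstdio>
#include <vector>

struct Access { int tid; unsigned addr; bool isWrite; };

static bool conflict(const Access &a, const Access &b) {
    return a.tid != b.tid && a.addr == b.addr && (a.isWrite || b.isWrite);
}

// Check all pairs of accesses logged in one barrier interval.
static bool checkBarrierInterval(const std::vector<Access> &log) {
    for (size_t i = 0; i < log.size(); ++i)
        for (size_t j = i + 1; j < log.size(); ++j)
            if (conflict(log[i], log[j])) {
                std::printf("race: t%d and t%d on address %u\n",
                            log[i].tid, log[j].tid, log[i].addr);
                return true;
            }
    return false;
}

int main() {
    // Canonical schedule: thread 0 runs to the barrier, then thread 1, and so on.
    // Here t0 reads A[63] and t63 writes A[63] in the same barrier interval,
    // which is the race from the inc_gpu example on slide 19.
    std::vector<Access> log = {
        {0, 63, false},    // t0 reads A[63]
        {63, 63, true},    // t63 writes A[63]
    };
    checkBarrierInterval(log);
    return 0;
}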

36 SIMD-aware Canonical Scheduling in GKLEE. SIMD/barrier-aware canonical scheduling within a warp/block. (Diagram: threads t1..t32 and t33..t64, i.e. two warps, each execute instructions 1..6 across two barrier intervals BI1 and BI2.) Record the accesses made under the canonical schedule, then check whether any of them conflict (e.g. touch the same address).

37-38 SIMD-aware Race Checking in GKLEE. Races are checked on the fly, in the canonical schedule. (Same diagram as slide 36.) Intra-warp races are checked within each warp's lockstep execution; inter-warp and inter-block races are checked across the threads of a barrier interval.

39 SDK Kernel Example: race checking (the same histogram64Kernel as slide 21). For two threads t1 and t2, the statements relevant to the race are: threadPos = ...; data = (data4 >> 26) & 0x3FU; s_Hist[threadPos + data*THREAD_N]++;

40 SDK Kernel Example: race checking. Read-write set under the canonical schedule: t1 writes s_Hist[(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64], ...; t2 writes s_Hist[(((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64], ... The race check is the satisfiability query: ∃ t1, t2, d_Data: (t1 ≠ t2) ∧ (((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64 == (((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64 ?

41 SDK Kernel Example: race checking. Same read-write set as slide 40: t1 and t2 each write s_Hist at threadPos plus ((d_Data >> 26) & 0x3FU) * 64. GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a write-write race.

42 Experimental Results, Part I (checking correctness and performance issues). The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks (1) well-synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) the required volatile keyword. Columns: Kernel; LoC; Race; Func. Correct.; #T; Bank Conflict (perf. impact; capability 1.x / 2.x); Coalesced Accesses (perf. impact; ≤1.1 / 2.x); Warp Divergence (perf. impact); Volatile Needed. Row values appear in that order; cells left blank on the slide are omitted.
Bitonic Sort: 30, yes, 4, 0%, 100%, 60%, no
Scalar Prod.: 30, yes, 64, 0%, 11%, 100%, yes
Matrix Mult: 61, yes, 64, 0%, 100%, 0%, no
Histogram64 th.: 69, WW, unknown, 32, 66%, 100%, 0%, yes
Reduction (7): 231, yes, 16, 0%, 100%, 16-83%, yes
Scan Best: 78, yes, 32, 71%, 100%, 71%, no
Scan Naïve: 28, yes, 32, 0%, 50%, 100%, 85%, yes
Scan Effi.: 60, yes, 32, 83%, 16%, 0%, 83%, no
Scan Large: 196, yes, 32, 71%, 100%, 71%, no
Radix Sort: 750, WW, unknown, 16, 3%, 0%, 100%, 5%, yes
Bisect Small: 1,000, ben., _, 16, 38%, 0%, 97%, 100%, 43%, yes
Bisect Large: 1,400, ben., _, 16, 15%, 0%, 99%, 100%, 53%, yes

43 Automatic Test Generation. GKLEE is guaranteed to explore all paths w.r.t. the given symbolic inputs. The path constraint at the end of each path is solved to generate a concrete test case. GKLEE supports many heuristic reduction techniques. (Diagram: each thread's branch decisions over conditions c1..c4 form a decision tree; for the combined execution t1+t2 the trees compose, and the conjunction of branch conditions along a combined path, e.g. c1 ∧ c2 ∧ c3 ∧ c4 ∧ ..., is solved to give a concrete test.)

44 SDK Example: comprehensive testing (the same bitonic-sort kernel as slide 23, here named BitonicKernel). GKLEE explores the branch decisions on the shared[] comparisons, e.g. shared[0] > shared[1] vs. shared[0] ≤ shared[1], shared[1] < shared[2] vs. shared[1] ≥ shared[2], and shared[0] > shared[2] vs. shared[0] ≤ shared[2]. Infeasible combinations are pruned; for example, shared[0] > shared[1] ∧ shared[1] ≥ shared[2] ∧ shared[0] ≤ shared[2] is unsatisfiable, since the first two conjuncts imply shared[0] > shared[2].

45 SDK Example: comprehensive verification. Functional correctness: the output array is sorted, i.e. values[0] ≤ values[1] ≤ ... ≤ values[n].
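A minimal sketch of my own showing how that post-condition could be asserted in a test driver once the kernel (or a replayed test) has produced its output; the function name and the plain assert are illustrative choices:

#include <assert.h>

// Check the functional-correctness property from the slide:
// values[0] <= values[1] <= ... <= values[num-1].
void check_sorted(const unsigned *values, int num) {
    for (int i = 0; i + 1 < num; ++i)
        assert(values[i] <= values[i + 1]);   // fails exactly when the output is unsorted
}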

46 Experimental Results, Part II (Automatic Test Generation). Coverage information about the generated tests for some CUDA kernels. Columns: Kernel; src. code coverage; Avg. Cov_t; Max. Cov_t; Avg. CovBI_t; Max. CovBI_t; Exec. time. Cells left blank on the slide are omitted.
Bitonic Sort: 100%/100%, 78%/76%, 100%/94%, 79%/66%, 90%/76%, 1s
Merge Sort: 100%/100%, 88%/70%, 100%/85%, 93%/86%, 100%/100%, 1.6s
Word Search: 100%/100%, 100%/81%, 100%/85%, 100%/97%, 100%/100%, 0.1s
Suffix Tree Match: 100%/90%, 55%/49%, 98%/66%, 55%/49%, 98%/83%, 31s
Histogram64: 100%/100%, 100%/75%, 100%/100%, 600s
Cov_t and CovBI_t measure bytecode coverage w.r.t. threads. No test reductions were used in generating this table. Exec. time is on a typical workstation.

47 Experimental Results, Part II (Coverage-Directed Test Reduction). Results after applying reduction. Heuristics RedTB and RedBI cut the paths according to the coverage information of Thread+Barrier and Barrier intervals respectively; basically, a path is pruned if it is unlikely to contribute new coverage. Columns: Kernel; No Reductions (#path, Avg. CovBI_t); RedTB (#path, Avg. CovBI_t); RedBI (#path, Avg. CovBI_t). Cells left blank on the slide are omitted.
Bitonic Sort: 28, 79%/66%; 5; 5, 79%/65%
Merge Sort: 34, 93%/86%; 4, 92%/84%; 4
Word Search: 8, 100%/97%; 2; 2, 94%/85%
Suffix Tree Match: 31, 55%/49%; 6; 6
Histogram64: 13, 100%/100%; 5; 5

48 Additional GKLEE Features. GKLEE employs an efficient memory organization. It employs many expression-evaluation optimizations: – simplify concolic expressions on the fly – dynamically cache results – apply dependency analysis before constraint solving – use manually optimized C/C++ libraries. GKLEE also handles all of the C++ syntax. GKLEE never generates false alarms.

49 Experimental Results, Part III (performance comparison of two tools). Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for the functional-correctness check. #T is the number of threads. GKLEE times are reported as GPU time (entire time); T.O means > 5 minutes. Columns: Kernel; #T = 4 (PUG, GKLEE); #T = 16 (PUG, GKLEE); #T = 64; #T = 256; #T = 1,024. Row values appear in that order; cells left blank or marked "_" on the slide appear as on the slide.
Simple Reduct.: 2.8, <0.1(<0.1), T.O, <0.1(<0.1), 0.2(0.3), 2.3(2.9)
Matrix Transp.: 1.9, <0.1(<0.1), T.O, <0.1(0.3), <0.1(3.2), <0.1(63), 0.9(T.O)
Bitonic Sort: 3.7, 0.9(1), T.O
Scan Large: _, <0.1(<0.1), _, 0.1(0.2), 1.6(3), 22(51)

50 Other Details. Diverged warp scheduling; intra-warp and inter-warp/inter-block race checking; textually-aligned barrier checking. Checking performance issues – warp divergence, bank conflicts, global memory coalescing. Path/test reduction techniques. Volatile declaration checking. Handling symbolic aliasing and pointers. Drivers for the kernels and replaying on the real GPU. Other results, e.g. on CUDA SDK 4.0 programs. CUDA's relaxed memory model and semantics.

51 Summary. GKLEE: a symbolic virtual GPU that – identifies correctness and performance issues – produces concrete tests with high code coverage – enables symbolic parallel debugging for CUDA programs – is useful for other CUDA applications (e.g. compiler-optimization verification, regression testing, etc.). The tool is open source and available at www.cs.utah.edu/fv/GKLEE, with a tutorial, manual, tech. report, liveDVD, etc. Future work: – parameterized verification (e.g. equivalence checking) – support for floating-point numbers – combination with runtime execution (on the real GPU).

52 Thank You! Questions? Obtain GKLEE from www.cs.utah.edu/fv/GKLEE

