Indranil Roy High Performance Computing (HPC) group


1 UPC-CHECK: A scalable tool for detecting run-time errors in Unified Parallel C
Indranil Roy, High Performance Computing (HPC) group

2 Segmentation error. Core dumped.

3 A good error message
Thread 0 encountered invalid arguments in function upc_all_broadcast at line 26 in file /home/jjc/ex1.upc.
Error: Parameter (sizeof(int) * sh_val) passes non-positive value of 0 to nbytes argument
Variable sh_val was declared at line 10 in file /home/jjc/ex1.upc.
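To make the shape of such a report concrete, here is a minimal sketch in plain C (not UPC) of a check on a byte-count argument. The function name check_nbytes_positive and its parameters are hypothetical and are not taken from UPC-CHECK; the message text only imitates the wording on the slide.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of UPC-CHECK: writes a report in the style
 * of the slide when the byte count passed to a collective is not positive.
 * Returns 1 if an error was written to buf, 0 if the argument is acceptable. */
int check_nbytes_positive(long nbytes, int thread, const char *func,
                          const char *expr, int line, const char *file,
                          char *buf, size_t buflen)
{
    assert(func && expr && file && buf && buflen > 0);
    if (nbytes > 0) {
        buf[0] = '\0';               /* nothing to report */
        return 0;
    }
    snprintf(buf, buflen,
             "Thread %d encountered invalid arguments in function %s "
             "at line %d in file %s. Error: Parameter (%s) passes "
             "non-positive value of %ld to nbytes argument",
             thread, func, line, file, expr, nbytes);
    return 1;
}
```

A source-to-source translator could emit a call like this immediately before the original call site, passing the source text of the argument expression as expr so that the report can quote it.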

4 Outline
Understanding Unified Parallel C
UPC-CHECK 1.0 tool
How does it work?
Usability
Error coverage and quality of error reports generated
Testing
Overheads
Scalability
Known limitations
Challenges in argument error detection
Deadlock detection algorithm
Demo

5 Understanding Unified Parallel C
Shared memory model Distributed memory model

6 Understanding Unified Parallel C
Distributed Shared Memory Model or Partitioned Global Address Space Model

7 UPC-CHECK v1.0
Source-to-source translator
Pre-compiler
Error handling: argument errors, deadlocks

8 UPC-CHECK: Usability
Portable: machine independent, compiler independent
Ease of use: easy to install (install_UPC-CHECK), easy to run
Freely available: wget CHECK/UPC-CHECK.tar.gz

9 UPC-CHECK 1.0: Usability
Usage
upc-check [compiler options] [--upccheck:flag [--upccheck:flag] ...] -c sourcefile.upc
-a|-d_argument_check disables argument checking (enabled by default)
-d|-d_deadlock_check disables deadlock checking (enabled by default)
-s|-e_track_func_call_stack enables tracing of function call stack (disabled by default)
-h|--h|-help prints help for UPC-CHECK
Just replace your compile command with upc-check.

10 Quality of error reports generated
Coyle, J., Hoekstra, J., Kraeva, M., Luecke, G. R., Kleiman, R., Srinivas, V., Tripathi, A., Weiss, O., Wehe, A., Xu, Y., Yahya, M. (2008). UPC Run-Time Error Detection Test Suite. Iowa State University, High Performance Computing Group.
A score of 5 is given for a detailed error message that will assist a programmer to fix the error.
A score of 4 is given for error messages with more information than a score of 3 and less than 5. This is tailored for each test.
A score of 3 is given for error messages with the correct error name, line number and the name of the file where the error occurred.
A score of 2 is given for error messages with the correct error name and line number where the error occurred but not the file name where the error occurred.
A score of 1 is given for error messages with the correct error name.
A score of 0 is given when the error was not detected.
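The rubric above can be read as a small decision procedure. The sketch below is one possible encoding in plain C; the struct report and its fields are hypothetical, and treating a report with a wrong error name as a score of 0 is an assumption, since the rubric does not cover that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what one error report contained. */
struct report {
    bool detected;      /* was any error reported at all? */
    bool correct_name;  /* correct error name */
    bool correct_line;  /* correct line number */
    bool correct_file;  /* correct file name */
    int  extra_detail;  /* 0 = none, 1 = some (score 4), 2 = enough to fix the error (score 5) */
};

/* Maps a report to the 0..5 scale described on the slide. */
int score_report(const struct report *r)
{
    assert(r != NULL);
    if (!r->detected || !r->correct_name) return 0; /* assumption: wrong name counts as undetected */
    if (!r->correct_line) return 1;
    if (!r->correct_file) return 2;
    if (r->extra_detail >= 2) return 5;
    if (r->extra_detail == 1) return 4;
    return 3;
}
```

Because the score of 4 is "tailored for each test", the extra_detail field stands in for a per-test judgment rather than something that can be computed mechanically.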

11 Run-time environments
Columns: Argument errors, Deadlocks
Cray: 0.38
Berkeley: 0.04, 0.58
HP: 0.36
GNU: 0.27
UPC-CHECK: 4.89, 5

12 UPC-CHECK 1.0: Testing
400 error test-cases
1800 false-positive cases
Additional testing for deadlocks
Testing across application programs

13 UPC-CHECK 1.0: Overhead
Base memory requirement: ~128 KB per thread
With every acquired or requested shared memory lock, the requirement goes up by around 256 B
While tracking the function call stack, with every level of nested function call, the memory requirement goes up by around 512 B
Increase of code section: ~100 lines of instrumentation per UPC operation, ~12000 lines from support files
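As a rough worked example of the figures above, the per-thread memory overhead can be estimated with simple arithmetic. The function below is a hypothetical sketch in plain C; the constants are the approximate values from the slide, so the result is an estimate, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate per-thread memory overhead in bytes, using the slide's figures:
 * ~128 KB base, ~256 B per acquired or requested lock, and ~512 B per nested
 * call level when function call stack tracking is enabled. */
size_t estimate_thread_memory_bytes(size_t locks, size_t call_depth, int track_stack)
{
    assert(track_stack == 0 || track_stack == 1);
    const size_t base_bytes = 128u * 1024u;
    const size_t per_lock = 256u;
    const size_t per_call_level = 512u;
    size_t total = base_bytes + per_lock * locks;
    if (track_stack)
        total += per_call_level * call_depth;
    return total;
}
```

For instance, 4 tracked locks and a call depth of 10 with stack tracking enabled give 131072 + 1024 + 5120 = 137216 bytes, or about 134 KB per thread.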

14 Efficiency overhead
Columns: benchmark, then Berkeley UPC (Original, Instrumented, Overhead), then Cray UPC (Original, Instrumented, Overhead). Rows with fewer than six values are incomplete.
CG-S 7.329 7.393 1.009 7.34 7.924 1.080
CG-W 7.554 7.613 1.008 7.576 8.344 1.101
CG-A 8.531 9.378 1.099 8.372 9.133 1.091
CG-B 73.619 74.222 56.376 63.239 1.122
CG-C 171.36 1.010 1.055
EP-S 8.048 8.581 1.066 5.319 5.307 0.998
EP-W 8.944 10.179 1.138 6.039 6.019 0.997
EP-A 19.71 25.193 1.278 14.755 14.743 0.999
EP-B 57.366 92.385 1.610 44.706 46.567 1.042
EP-C 1.654
FT-S 7.529 7.74 1.028 4.97 4.918 0.990
FT-W+ 7.651 7.68 1.004 5.135 5.151 1.003
FT-A*+ 15.34 14.312 0.933 9.173 9.084
FT-B*+ 83.981 77.339 0.921 50.621 50.613 1.000
FT-C*+ 0.000 1.095
IS-S 7.257 7.389 1.018 4.954 4.894 0.988
IS-W 7.409 7.441 5.099 5.006 0.982
IS-A 8.526 8.435 0.989 5.787 5.799 1.002
IS-B 12.038 12.115 1.006 8.604 8.54 0.993
IS-C*+ 25.397 25.69 1.012 19.662 21.655
MG-S 7.298 7.45 1.021 4.798 4.79
MG-W 7.631 7.499 0.983 5.239 5.815 1.110
MG-A*+ 11.32 10.73 0.948 6.979 1.725
MG-B*+ 19.083 19.1 1.001 11.718 16.88 1.441
MG-C*+ 68.33 1.575
Berkeley UPC: Maximum 1.654, Average 1.073
Cray UPC: Maximum 1.725, Average 1.099
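The Overhead column appears to be the instrumented time divided by the original time; for example, the complete CG-S Berkeley entries give 7.393 / 7.329 ≈ 1.009. The sketch below, in plain C with hypothetical function names, computes that ratio and the maximum and average over a set of rows.

```c
#include <assert.h>
#include <stddef.h>

/* Overhead as a ratio: values above 1.0 mean the instrumented run was slower. */
double overhead_ratio(double original, double instrumented)
{
    assert(original > 0.0);
    return instrumented / original;
}

/* Computes the maximum and the arithmetic mean of n overhead ratios. */
void summarize_overheads(const double *ratios, size_t n,
                         double *max_out, double *avg_out)
{
    assert(ratios && n > 0 && max_out && avg_out);
    double max = ratios[0];
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (ratios[i] > max)
            max = ratios[i];
        sum += ratios[i];
    }
    *max_out = max;
    *avg_out = sum / (double)n;
}
```

Ratios slightly below 1.0 in the table (for example 0.998) are consistent with run-to-run timing noise rather than a real speed-up, although the slides do not say so explicitly.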

15 UPC-CHECK 1.0: Scalability
CROW cluster, Cray compiler, Cray run-time environment, 128 threads
Columns: benchmark, Original, Instrumented, Slowdown. Rows with fewer than three values are incomplete.
CG-S 4.742 4.942 1.042
CG-W 15.664 15.708 1.003
CG-A 4.912 4.99 1.016
CG-B 54.183 54.239 1.001
CG-C 58.309 58.281 1.000
EP-S 1.145
EP-W 6.247 6.243 0.999
EP-A 1.417 1.427 1.007
EP-B 7.116 7.128 1.002
EP-C 11.19 11.17 0.998
FT-S DNR 0.000
FT-W
FT-A
FT-B 15.528 15.556
FT-C* 22.855 22.735 0.995
IS-S 3.541 3.594 1.015
IS-W 10.422 10.961 1.052
IS-A 3.56 3.658 1.028
IS-B 8.752 8.776
IS-C 10.089 10.073
MG-S
MG-W 8.288 8.308
MG-A
MG-B 9.293 9.341 1.005
MG-C 13.551 13.579
Maximum 1.052, Average 1.008

16 UPC-CHECK v1.0: Known limitations
UPC-CHECK will not test the single-valued requirement of upc_forall statements.
Since UPC-CHECK works on UPC source programs, it will be unable to handle any deadlocks which are created in a library that a user might be using.
UPC-CHECK should not be used for programs where the 'main' function lies within a header file.
Best effort will be made, but may lead to memory leaks at end of execution.

17 Challenges in checking argument errors
Engineering challenges
Exhaustiveness
Argument checks against multiple functions
Handling vector arguments
Dependency of one argument on another argument
Data-structures used
Displaying the errors

18 A novel Deadlock Detection Algorithm
Dynamic
Optimal: O(1) for deadlocks created by collective routines, O(n) for deadlocks created by locks
Distributed
Scalable

19 A few more terms: “collective” operations
“Collective” is a constraint placed on some language operations which requires evaluation of such operations to be matched across all threads. The behavior of collective operations is undefined unless all threads execute the same sequence of collective operations. “Single valued” refers to an operand to a collective operation, which has the same value on every thread. The behavior of the operation is otherwise undefined.

20 Central idea The collective requirement simply states a relative ordering property of calls to collective operations that must be maintained in the parallel execution trace for all executions of any legal program.
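The ordering property can be illustrated with an offline check over recorded traces: every thread must have executed the same sequence of collective operations. The sketch below is plain C with a hypothetical encoding (one integer id per collective call) and is only an illustration of the property, not the run-time algorithm the slides describe.

```c
#include <assert.h>
#include <stddef.h>

/* seq is a threads x len row-major array: seq[t * len + k] is the id of the
 * k-th collective operation executed by thread t (hypothetical encoding).
 * Returns the first position k at which some thread differs from thread 0,
 * or -1 if all threads executed the same sequence. */
long first_collective_mismatch(const int *seq, size_t threads, size_t len)
{
    assert(seq && threads > 0);
    for (size_t k = 0; k < len; k++) {
        for (size_t t = 1; t < threads; t++) {
            if (seq[t * len + k] != seq[k])
                return (long)k;
        }
    }
    return -1;
}
```

A trace comparison like this needs the complete history from every thread, which is why a run-time tool would instead want a check that each thread can perform locally as it reaches each collective call.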

21 (Figure: axes are threads and time)

22 Deadlocks in UPC
1. Not all threads are waiting at the same collective routine
(Figure: threads 1, 2, i, j, T-2, T-1 against time)

23 2. Some threads are waiting at the same collective routine when at least one of the threads has reached end-of-execution
(Figure: threads 1, 2, i, j, T-2, T-1 against time, with end-of-execution marked)
One of the threads at a collective routine is holding a lock that at least one of the threads is trying to acquire.
(Figure: threads 1, 2, i, j, T-2, T-1 against time)

24 5. Circular dependency for acquiring locks amongst threads
Definition: A thread i is dependent on another thread j if thread i is trying to acquire a lock held by thread j.
(Figure: threads 1, 2, i, j, T-2, T-1 against time)

25 Chain of dependency for acquiring locks leads to a thread which is waiting at a collective routine.
(Figure: threads 1, 2, i, j, T-2, T-1 against time)

26 Chain of dependency for acquiring locks leads to a thread which has reached end of execution.
(Figure: threads 1, 2, i, j, T-2, T-1 against time, with end-of-execution marked)
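The three lock-related cases (a circular dependency, a chain ending at a thread blocked in a collective routine, and a chain ending at a thread that has finished) can all be found by following the "is dependent on" relation from one thread. The sketch below is a plain C model with hypothetical names; it assumes each waiting thread waits on exactly one lock with exactly one holder, and it is not taken from the UPC-CHECK sources.

```c
#include <assert.h>
#include <stddef.h>

enum thread_status { RUNNING, WAIT_LOCK, AT_COLLECTIVE, ENDED };
enum chain_result  { NO_DEADLOCK, CYCLE, BLOCKED_ON_COLLECTIVE, BLOCKED_ON_ENDED };

/* status[t]   : what thread t is currently doing.
 * waits_on[t] : the thread holding the lock that t is trying to acquire
 *               (only meaningful when status[t] == WAIT_LOCK).
 * Follows the dependency chain starting at thread `start`. */
enum chain_result classify_chain(const enum thread_status *status,
                                 const size_t *waits_on,
                                 size_t threads, size_t start)
{
    assert(status && waits_on && start < threads);
    size_t cur = start;
    /* More than `threads` hops through waiting threads must revisit one. */
    for (size_t steps = 0; steps <= threads; steps++) {
        switch (status[cur]) {
        case RUNNING:
            return NO_DEADLOCK;      /* the holder can still release the lock */
        case AT_COLLECTIVE:
            return cur == start ? NO_DEADLOCK : BLOCKED_ON_COLLECTIVE;
        case ENDED:
            return cur == start ? NO_DEADLOCK : BLOCKED_ON_ENDED;
        case WAIT_LOCK:
            assert(waits_on[cur] < threads);
            cur = waits_on[cur];
            break;
        }
    }
    return CYCLE;
}
```

The walk visits at most one thread per hop, so its cost grows linearly with the length of the dependency chain, which is consistent with the O(n) bound the slides give for deadlocks created by locks.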

27 Algorithm: Get all the threads in the picture
(Figure: threads 1, 2, 3, i-1, i, i+1, i+2, j, T-1)

28 Validation method: A basic block
(Figure: two panels showing threads i-1 and i against time, each with a marker R)

29 Implementation: Algorithm 1
shared [1] deadlock_ctxt_t unified_deadlock_ctxt[THREADS];
(Figure, animated across slides 29 to 34: the state and desired_state fields of threads 1, i-1, i, i+1, T-1, with entries labelled "un" and "nk")

35 Atomicity and serialization of status checks
One centralized lock solution: efficiency hit, complete serialization
Decentralized lock solution: one lock per thread
shared [1] upc_lock_t upc_check_deadlock_detection_lock[THREADS];
(Figure: threads 1, 2, i, i+1, T-3, T-2, T-1)

36 Avoiding deadlocks created by the checks
(Figure: threads 1, 2, i, i+1, T-3, T-2, T-1)

37 Scheme 1 of acquiring locks
Even thread: lock[i] then lock[(i+1) % THREADS]
Odd thread: lock[(i+1) % THREADS] then lock[i]
(Figure: threads 1, 2, i, i+1, T-3, T-2, T-1; the legend distinguishes the first lock acquired from the second lock acquired)
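The parity rule of Scheme 1 can be written as a small function that returns the order in which thread i takes its two locks. This is a plain C sketch with hypothetical names (struct lock_order, scheme1_order); the indices refer to a per-thread lock array like the one declared on the earlier slide.

```c
#include <assert.h>
#include <stddef.h>

/* Indices into the per-thread lock array, in acquisition order. */
struct lock_order {
    size_t first;
    size_t second;
};

/* Scheme 1: even threads take their own lock first, odd threads take the
 * lock of thread (i + 1) % threads first. With a single thread the two
 * indices coincide, so a real implementation would need to take that lock
 * only once. */
struct lock_order scheme1_order(size_t i, size_t threads)
{
    assert(threads > 0 && i < threads);
    size_t own = i;
    size_t next = (i + 1) % threads;
    struct lock_order order;
    if (i % 2 == 0) {
        order.first = own;
        order.second = next;
    } else {
        order.first = next;
        order.second = own;
    }
    return order;
}
```

With 4 threads, thread 2 takes lock 2 then lock 3, while thread 3 takes lock 0 then lock 3. Alternating the order by parity means adjacent threads do not each hold their own lock while waiting for their neighbour's, which is the situation that would let the checks themselves form a ring of waiting threads.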

38 Scheme 1: Maximum latency of acquiring locks for even number of threads
(Figure: longest dependency chains when i is even and when i is odd)
Maximum latency is 3, i.e. O(1)

39 Maximum latency: when the total number of threads is odd
Maximum latency is 4, i.e. O(1)

40 Efficiency
The number of threads for which any thread has to wait before entering its critical section is O(1).
The number of remote memory accesses is O(1), as any thread i only accesses memory related to the state of thread i and thread (i+1)%THREADS.
Optimal!
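To show how a check can stay within the state of thread i and thread (i+1)%THREADS, here is a sequential plain C model. Only the field names state and desired_state and the neighbour-only access pattern come from the slides; the update rules, the OP_UNKNOWN marker, and the function name are assumptions made for illustration, and the locking that would make these accesses atomic is left out.

```c
#include <assert.h>
#include <stddef.h>

#define OP_UNKNOWN 0  /* hypothetical marker: no collective recorded yet */

struct ctxt {
    int state;          /* collective the thread is currently waiting at */
    int desired_state;  /* collective its predecessor expects it to reach */
};

/* Models thread i reaching collective operation op (op != OP_UNKNOWN).
 * Touches only ctx[i] and ctx[(i + 1) % threads].
 * Returns 1 if a mismatch is detected, 0 otherwise. */
int reach_collective(struct ctxt *ctx, size_t threads, size_t i, int op)
{
    assert(ctx && i < threads && op != OP_UNKNOWN);
    struct ctxt *me = &ctx[i];
    struct ctxt *next = &ctx[(i + 1) % threads];

    /* The predecessor may already have recorded what it expects of us. */
    if (me->desired_state != OP_UNKNOWN && me->desired_state != op)
        return 1;
    me->desired_state = OP_UNKNOWN;
    me->state = op;

    if (next->state == OP_UNKNOWN)
        next->desired_state = op;   /* successor not there yet: leave an expectation */
    else if (next->state != op)
        return 1;                   /* successor waits at a different collective */
    return 0;
}
```

In this model each call reads and writes a constant number of fields regardless of the number of threads. A complete version would also reset state when a thread leaves the collective, which this sketch omits.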

41 When a thread reaches a upc_lock
Track requested locks and acquired locks
Look out for cyclical hold-and-wait conditions
Look out for a chain of hold-and-wait conditions which leads to a thread blocked at a collective routine
If a thread has reached a collective routine, check if there is a request for a lock that the thread is holding
Look out for a chain of hold-and-wait conditions which leads to a thread which has reached end-of-execution
If a thread is exiting without freeing all locks held by it, then check if there is a request for a lock that the thread is holding

42 Papers
Coyle, J., Hoekstra, J., Kraeva, M., Luecke, G. R., Kleiman, R., Roy, I. (2009). UPC Compile-Time Error Detection Test Suite. Iowa State University, High Performance Computing Group.
Roy, I., Luecke, G. R., Coyle, J., Kraeva, M., Hoekstra, J. (2011). UPC-CHECK: A run-time error detection tool for programs written in UPC. Preprint.
Roy, I., Luecke, G. R., Coyle, J., Kraeva, M., Hoekstra, J. (2011). An O(1) algorithm to detect deadlocks in collective routines in the distributed shared memory model. Preprint.

43 Thank you

