UPC-CHECK Tutorial *
High Performance Computing Group: Glenn Luecke (director), James Coyle, James Hoekstra, Marina Kraeva and Indranil Roy
Iowa State University
Aug 30, 2011
* This work was supported by the United States Department of Defense and used resources of the Extreme Scale Systems Center at Oak Ridge National Laboratory.
UPC-CHECK Tutorial Outline
- UPC-CHECK design
- Current functionality of UPC-CHECK
- UPC-CHECK syntax
- How to use UPC-CHECK to find and correct program errors (6 examples)
- Efficiency of UPC-CHECK
- Scalability of UPC-CHECK
- Memory overhead of UPC-CHECK
UPC-CHECK Design
[Flow diagram] Original UPC program -> UPC-to-UPC translator -> UPC program with error checking. The UPC program with error checking and the UPC-CHECK support routines -> UPC compiler -> executable with error checking.
Current Functionality of UPC-CHECK
- Argument checking for UPC functions
- Deadlock detection
UPC-CHECK Syntax
Use upc-check the same way as your UPC compiler. For example, instead of
    upcc -O -T 3 a.upc r.o
issue:
    upc-check -O -T 3 a.upc r.o
In a Makefile, change UPC=upcc to UPC=upc-check.
Note: the -T compiler option must be used with the upc-check command, since ROSE currently requires that the number of threads be known at compile time. (UPC-CHECK uses the ROSE Toolkit from Lawrence Livermore National Laboratory to instrument UPC source code.)
Run-Time Errors Detected by UPC-CHECK
UPC-CHECK detects argument errors in UPC functions and deadlocks in UPC programs.
Limitations:
- UPC-CHECK does not test the single-valued requirement of upc_forall statements.
- Since UPC-CHECK works on UPC source programs, it cannot detect deadlocks within library functions.
- Currently, UPC-CHECK requires that programs do not define the 'main' function in a header file.
Quantifying the quality of a tool which detects UPC run-time errors
Iowa State University has a Test Suite that scores the ability of UPC compilers/tools to detect run-time errors: see http://rted.public.iastate.edu/UPC/
This Test Suite uses the following scoring system:
- A score of 5 is given for a detailed error message that will assist a programmer to quickly correct the error.
- A score of 4 is given for error messages with more information than a score of 3 and less than a score of 5.
- A score of 3 is given for error messages with the correct error name, line number and the name of the file where the error occurred.
- A score of 2 is given for error messages with the correct error name and line number where the error occurred, but not the name of the file where the error occurred.
- A score of 1 is given for error messages with the correct error name.
- A score of 0 is given when the error was not detected.
How UPC-CHECK compares
Results from ISU's test suite: http://rted.public.iastate.edu/UPC/RESULTS/result_table.html
UPC-CHECK gets the highest score for Deadlocks and the highest score for all but 3 tests in the Argument Errors section.

Compiler       Argument Errors   Deadlocks
UPC-CHECK           4.89            5.00
Berkeley UPC        0.04            0.58
Cray                0.38            0.00
HP                  0.00            0.36
GNU                 0.00            0.27
Additional Checks
While collecting the information necessary to instrument the Argument Errors and Deadlock checks, UPC-CHECK sometimes detects a different error (e.g. a collective routine within a upc_forall). Whichever error would occur first is reported with a meaningful error message.
Because of this, more categories in ISU's RTED_UPC test suite showed improvement when using UPC-CHECK; see: http://rted.public.iastate.edu/UPC/RESULTS/result_table.html
In addition, some errors are detected and reported at translation/compile time.
Examples illustrating how to use UPC-CHECK to find and correct program errors *
* All examples use the Berkeley UPC compiler.
Example 1: http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex1.upc
This program contains the call:
    upc_all_broadcast(arrA, arrB, sizeof(int)*sh_val, UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
and sh_val is declared as:
    static shared int sh_val;
However, the program does not initialize sh_val. The declaration means that sh_val has an initial value of zero, so the third argument (nbytes) of the broadcast call above is zero. This is not allowed by the UPC specification.
When issuing:
    upcc -T 4 -o ex1 ex1.upc; upcrun -n 4 ./ex1
the program executes without any error messages being issued.
When issuing:
    upc-check -T 4 -o ex1 ex1.upc; upcrun -n 4 ./ex1
the following message is issued:
    Thread 0 encountered invalid arguments in function upc_broadcast at line 26 in file /home/jjc/ex1.upc.
    Error: Parameter (((sizeof(int )) *(sh_val))) passes non-positive value of 0 to nbytes argument
    Variable sh_val was declared at line 10 in file /home/jjc/ex1.upc.
Correcting Example 1
Seeing that sizeof(int)*sh_val was zero, the programmer can conclude that sh_val still has the default value of zero from its declaration at line 10. (Static shared variables are initialized to zero according to the UPC specification.) Thus, an assignment to sh_val before line 26 is missing. Inserting the statement sh_val = BLOCK_SIZE; at line 16 fixes this error.
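The kind of check that catches this error can be sketched in plain C. This is only an illustration of the idea, not UPC-CHECK's actual instrumentation: the helper name check_nbytes, its parameters, and the message wording are all assumptions made for this sketch.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of UPC-CHECK): validates the value that
   would be passed as the nbytes argument of a broadcast-style call.
   Returns 0 if nbytes is positive; otherwise writes a diagnostic into
   msg and returns -1. expr is the source text of the argument. */
int check_nbytes(long nbytes, const char *expr, int line,
                 char *msg, size_t msg_len)
{
    if (nbytes > 0) {
        if (msg_len > 0)
            msg[0] = '\0';          /* no diagnostic needed */
        return 0;
    }
    snprintf(msg, msg_len,
             "line %d: parameter %s passes non-positive value of %ld "
             "to nbytes argument", line, expr, nbytes);
    return -1;
}
```

With sh_val left at zero, calling check_nbytes((long)sizeof(int) * 0, "sizeof(int)*sh_val", 26, msg, sizeof msg) returns -1 and fills msg; once sh_val is assigned a positive value the same call returns 0.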
Example 2: http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex2.upc
This program contains:
    numhints = 1;
    fd = upc_all_fopen("upcio1.txt", UPC_INDIVIDUAL_FP|UPC_WRONLY|UPC_CREATE, numhints, hints);
and the program does not allocate space for the structure hints.
When issuing:
    upcc -T 4 -o ex2 ex2.upc; upcrun -n 4 ./ex2
the following is printed from the printf in the program:
    File not open.
When issuing:
    upc-check -T 4 -o ex2 ex2.upc; upcrun -n 4 ./ex2
the following message is issued:
    Thread 0 encountered invalid arguments in function upc_all_fopen at line 13 in file /home/jjc/ex2.upc.
    Error: Parameter numhints passes non-zero value of 1 to 'numhints' argument while target of parameter (hints) passed to 'hints' argument is unallocated.
    Variable numhints was declared at line 7 in file /home/jjc/ex2.upc.
    Variable hints was declared at line 9 in file /home/jjc/ex2.upc
Correcting Example 2
The argument hints is not used unless numhints is positive. hints may be used to convey information about a file in hopes of more efficient I/O. Therefore, Example 2 can be corrected by either
1) setting numhints to 0, or
2) allocating hints and assigning correct values to it.
Example 3: http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex3.upc http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex3_s.upc
In this program, upc_barrier is not executed by all threads, which causes a deadlock. This error is difficult to find because the barrier is contained inside a function which is called from within an if block.
When issuing:
    upcc -T 4 -o ex3 ex3.upc ex3_s.upc; upcrun -n 4 ./ex3
a deadlock occurs and the upcrun command never returns.
When issuing:
    upc-check -T 4 -o ex3 ex3.upc ex3_s.upc; upcrun -n 4 ./ex3
the following message is issued:
    Runtime error: Deadlock condition detected: One or more threads have finished executing while other threads are waiting at a collective routine
    Status of threads
    =================
    Thread id:Status:Presently waiting at line number:of file
    --------------------------------------------------------
    0:waiting at upc_barrier: 7: /home/jjc/ex3_s.upc
    1:reached end of execution through: 39: /home/jjc/ex3.upc
    2:waiting at upc_barrier: 7: /home/jjc/ex3_s.upc
    3:waiting at upc_barrier: 7: /home/jjc/ex3_s.upc
Correcting Example 3
The upc_barrier is called from funcA. Of the three possible paths through the two nested if statements, two are present and reach a upc_barrier, but the third (else) path is missing. The error can be corrected by adding the missing else block and placing in it either a call to funcA or a upc_barrier statement.
Example 4: http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex4.upc http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex4_s.upc
In this program, not all threads call the UPC collective function upc_all_fsync.
When issuing:
    upcc -T 4 -o ex4 ex4.upc ex4_s.upc; upcrun -n 4 ./ex4
the upcrun command never completes.
When issuing:
    upc-check -T 4 -o ex4 ex4.upc ex4_s.upc; upcrun -n 4 ./ex4
the following message is issued:
    Runtime error: Deadlock condition detected: Different threads waiting at different collective routines
    Status of threads
    =================
    Thread id:Status:Presently waiting at line number:of file
    ---------------------------------------------------------
    0:waiting at upc_all_fsync on file pointer fd: 9: /home/jjc/ex4_s.upc
    1:waiting at upc_all_fclose on file pointer fd: 52: /home/jjc/ex4.upc
    2:waiting at upc_all_fsync on file pointer fd: 9: /home/jjc/ex4_s.upc
    3:waiting at upc_all_fsync on file pointer fd: 9: /home/jjc/ex4_s.upc
Correcting Example 4
This is another case where a UPC collective (here upc_all_fsync) is not called by all threads, as required. The error is detected when one set of threads executes upc_all_fsync while another set executes upc_all_fclose. Inserting an else clause containing the statement upc_all_fsync(fd); corrects the problem.
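The two deadlock reports seen in Examples 3 and 4 can be told apart from a snapshot of per-thread status. The plain C sketch below shows one way to classify such a snapshot. It is a deliberately simple centralized scan over all threads, so it is not the O(1) scheme the slides attribute to UPC-CHECK, and the type and function names are assumptions made for illustration. It also assumes every thread in the snapshot is either blocked in a collective or finished.

```c
#include <stddef.h>

/* Marker for a thread that has reached the end of execution.
   Collective routine ids are assumed to be non-negative. */
enum { FINISHED = -1 };

typedef enum {
    NO_DEADLOCK,             /* all threads agree                       */
    FINISHED_WHILE_WAITING,  /* Example 3: some finished, others wait   */
    DIFFERENT_COLLECTIVES    /* Example 4: waiting at different routines */
} verdict;

/* waiting_at[i] is the id of the collective routine thread i is blocked
   in, or FINISHED if thread i has exited. */
verdict classify(const int *waiting_at, size_t nthreads)
{
    int seen_finished = 0, seen_waiting = 0, mixed = 0;
    int first_id = FINISHED;

    for (size_t i = 0; i < nthreads; i++) {
        if (waiting_at[i] == FINISHED) {
            seen_finished = 1;
        } else if (!seen_waiting) {
            seen_waiting = 1;
            first_id = waiting_at[i];   /* remember the first routine seen */
        } else if (waiting_at[i] != first_id) {
            mixed = 1;                  /* a second, different routine */
        }
    }
    if (mixed)
        return DIFFERENT_COLLECTIVES;
    if (seen_finished && seen_waiting)
        return FINISHED_WHILE_WAITING;
    return NO_DEADLOCK;
}
```

Using made-up ids 1 for upc_barrier, 2 for upc_all_fsync and 3 for upc_all_fclose, the Example 3 snapshot {1, FINISHED, 1, 1} yields FINISHED_WHILE_WAITING and the Example 4 snapshot {2, 3, 2, 2} yields DIFFERENT_COLLECTIVES.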
Example 5: http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex5.upc
In this program, all of the threads call the UPC collective function upc_all_reduceI, but they pass different source arrays, which is not allowed by the UPC specification.
Without UPC-CHECK, when issuing:
    upcc -T 4 -o ex5 ex5.upc; upcrun -n 4 ./ex5
the following is printed:
    sumA=120
When issuing:
    upc-check -T 4 -o ex5 ex5.upc; upcrun -n 4 ./ex5
the following message is issued:
    Runtime error: Unspecified behavior condition detected, may lead to deadlock : One or more threads have different values for single_valued parameters.
    Status of threads
    =================
    Thread id:Status:Presently waiting at line number:of file
    ---------------------------------------------------------
    0:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
    1:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
    2:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
    3:waiting at upc_all_reduceI: 21: /home/jjc/ex5.upc
    Mismatch in parameter: src.
    Thread no.
    ===================================================================
    0:ptrA points to memory location 0x2b7dd810dff0. Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
    1:ptrA points to memory location 0x2b7dd810dfe0. Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
    2:ptrA points to memory location 0x2b7dd810dfc0. Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
    3:ptrA points to memory location 0x2b7dd810dfd0. Variable ptrA was declared at line 7 in file /home/jjc/ex5.upc.
Correcting Example 5
The error message on the previous slide reports that threads have different values for the src parameter of upc_all_reduceI: ptrA, declared at line 7 of file ex5.upc, points to a different memory location on each thread. Looking at the declaration, we see that ptrA is a private pointer-to-shared. Later in the code, ptrA is assigned the value returned by a call to upc_global_alloc. This function is not collective: if it is called by multiple threads, each calling thread gets a different allocation. Changing upc_global_alloc to upc_all_alloc corrects the problem, since ptrA will then have the same value on every thread.
Note that with the current version of the Berkeley UPC compiler, the value of sumA will be the same in either case, but this behavior is not guaranteed for the test above.
Example 6: http://hpcgroup.public.iastate.edu/UPC-CHECK/Ex/ex6.upc
Example 6 is the Dining Philosophers problem, a classic deadlock problem.
Without UPC-CHECK, when issuing:
    upcc -T 3 -o ex6 ex6.upc; upcrun -n 3 ./ex6
the output produced varies from run to run. For one run the following output was produced:
    philosopher # 0 got the left fork
    philosopher # 0 got the right fork
    philosopher # 0 got the left fork
    philosopher # 1 got the left fork
    philosopher # 2 got the left fork
The program then deadlocks and no additional output is issued.
When issuing:
    upc-check -T 3 -o ex6 ex6.upc; upcrun -n 3 ./ex6
the program exits after issuing the following message:
    Runtime error: Deadlock condition detected: Found cycle of hold-and-wait dependencies for acquiring locks:
    Thread 2 is waiting at upc_lock function at line 18 of file /home/jjc/ex6.upc to acquire lock forks[((MYTHREAD ) + 1) % 3] pointing to location 0x9f40. Lock forks[((MYTHREAD ) + 1) % 3] was already acquired as forks[MYTHREAD ] by thread 0 with 'upc_lock' at line 16 of file /home/jjc/ex6.upc.
    Thread 0 is waiting at upc_lock function at line 18 of file /home/jjc/ex6.upc to acquire lock forks[((MYTHREAD ) + 1) % 3] pointing to location 0x9f20. Lock forks[((MYTHREAD ) + 1) % 3] was already acquired as forks[MYTHREAD ] by thread 1 with 'upc_lock' at line 16 of file /home/jjc/ex6.upc.
    Thread 1 is waiting at upc_lock function at line 18 of file /home/jjc/ex6.upc to acquire lock forks[((MYTHREAD ) + 1) % 3] pointing to location 0x9f00. Lock forks[((MYTHREAD ) + 1) % 3] was already acquired as forks[MYTHREAD ] by thread 2 with 'upc_lock' at line 16 of file /home/jjc/ex6.upc.
Correcting Example 6
The error message on the previous slide shows where the deadlock occurs (line 18 of the indicated file), which locks are involved, which thread holds each lock, and which lock each thread is waiting on. The deadlock can be avoided by numbering the forks, picking up the even fork first and then the other fork, and putting them down in the reverse order.
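A closely related way to break the cycle is to impose one global order on the forks and always lock the lower-numbered fork first. Note that this is the standard lock-ordering variant, swapped in here for the even-fork rule on the slide because it is unambiguous when a philosopher sits between two even-numbered forks (as thread 2 does with 3 threads). The plain C sketch below only computes the acquisition order; the function name and signature are assumptions, and in the UPC program the two indices would be used with upc_lock on the forks array.

```c
/* Hypothetical helper: for philosopher id (0 <= id < n, n >= 2), report
   which fork index to lock first and which to lock second.
   Philosopher id sits between fork id and fork (id + 1) % n. */
void fork_order(int id, int n, int *first, int *second)
{
    int left = id;
    int right = (id + 1) % n;

    if (left < right) {         /* usual case: left fork has the lower index */
        *first = left;
        *second = right;
    } else {                    /* last philosopher: right fork is fork 0 */
        *first = right;
        *second = left;
    }
}
```

Because every thread acquires its lower-numbered fork before its higher-numbered one, a thread that holds a fork only ever waits for a fork with a larger index, so a cycle of hold-and-wait dependencies cannot form. Forks are released in the reverse order.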
Efficiency of UPC-CHECK
UPC-CHECK has been carefully designed to minimize the overhead when executing the instrumented UPC program. Using the UPC implementation of the NAS Parallel CG benchmark, we timed both the instrumented and non-instrumented executables using 4 threads for the smallest 3 benchmark sizes (S, A, and B). The Berkeley UPC compiler was also used for these runs. We see essentially zero overhead.

Wall time (sec.)
Size   No UPC-CHECK   With UPC-CHECK   Overhead
S          7.36            7.41           0.7%
A          9.06            9.12           0.7%
B         85.03           83.04          -2.3%
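The Overhead column is consistent with the relative difference between the two wall times. A minimal C sketch of that calculation follows; the function name is made up for illustration, and the formula is inferred from the table values rather than stated on the slide.

```c
/* Relative overhead of the instrumented run, in percent.
   A negative value means the instrumented run was faster. */
double overhead_percent(double base_sec, double instrumented_sec)
{
    return (instrumented_sec - base_sec) / base_sec * 100.0;
}
```

For size S this gives (7.41 - 7.36) / 7.36, roughly 0.7%; for size B it gives (83.04 - 85.03) / 85.03, roughly -2.3%, which is most plausibly run-to-run timing variation rather than a real speedup.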
Scalability of UPC-CHECK checks

Type of check                    Overhead (for T threads)
Argument checking                O(1)
Deadlocks: collective routines   O(1)
Deadlocks: UPC locks             O(L), L <= T

where L is the length of the longest hold-and-wait chain.
For a program that does not use upc_locks, the overhead of using UPC-CHECK does not depend on the number of threads. This is because all checking can be done using values local to a thread and its neighboring threads. The O(1) deadlock checking for collective routines will be described in a paper that is being prepared.
A program that uses upc_locks may have overhead that depends on the number of threads, because there may be a chain of lock dependencies (a deadlock) which spans all threads.
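The O(L) bound for locks can be made concrete with a small plain C sketch. It assumes a simplified model in which each blocked thread waits on exactly one lock and each lock has exactly one holder, so the hold-and-wait relation can be stored as one array entry per thread. The array layout and function name are assumptions for illustration and do not describe UPC-CHECK's internal data structures.

```c
/* waits_for[t] is the thread holding the lock that thread t is blocked
   on, or -1 if thread t is not blocked. Returns the length of the
   hold-and-wait cycle that passes through start, or 0 if following the
   chain from start does not lead back to it. At most nthreads links
   are followed, so the cost is bounded by the chain length L <= T. */
int cycle_length(const int *waits_for, int nthreads, int start)
{
    int t = waits_for[start];

    for (int steps = 1; steps <= nthreads && t != -1; steps++) {
        if (t == start)
            return steps;       /* chain closed on itself: deadlock */
        t = waits_for[t];       /* follow the next hold-and-wait link */
    }
    return 0;
}
```

For the Example 6 report (thread 2 waits on thread 0, thread 0 on thread 1, thread 1 on thread 2) the array is {1, 2, 0}, and cycle_length returns 3 from any starting thread, matching a chain that spans all three threads.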
Overhead on a Cray XT using the Cray compiler, 128 threads

NAS Benchmark   Original Program (sec.)   Instrumented Program (sec.)   Slowdown
CG-A                    4.912                      4.99                   1.02
CG-B                   54.183                     54.239                  1.00
CG-C                   58.309                     58.281                  1.00
EP-A                    1.417                      1.427                  1.01
EP-B                    7.116                      7.128                  1.00
EP-C                   11.19                      11.17                   1.00
IS-A                    3.56                       3.658                  1.03
IS-B                    8.752                      8.776                  1.00
IS-C                   10.089                     10.073                  1.00
Total                 159.528                    159.742                  1.00
Memory overhead of UPC-CHECK
The memory overhead per thread consists of three components:
1) Extra context variables allocated to support checks: approximately 128 KB.
2) Extra information about the call stack, if call stack tracking is requested: 1/2 KB per call level per thread.
3) Executable size: the support routines add less than 1 MB, and each UPC routine adds about 3.5 KB.
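These figures can be combined into a rough estimate. The plain C sketch below is only back-of-the-envelope arithmetic on the approximate numbers quoted above; the function names are made up, and the slide does not say whether "each UPC routine" means each routine in the program or each distinct UPC library routine used, so the count is left to the caller.

```c
/* Approximate per-thread run-time memory overhead in bytes.
   call_depth is the number of active call levels; pass
   track_call_stack = 0 if call stack tracking is not requested. */
long runtime_overhead_bytes(int call_depth, int track_call_stack)
{
    long bytes = 128L * 1024;           /* context variables, about 128 KB */

    if (track_call_stack)
        bytes += 512L * call_depth;     /* 1/2 KB per call level */
    return bytes;
}

/* Approximate upper bound on executable growth in bytes:
   under 1 MB of support routines plus about 3.5 KB per UPC routine. */
long executable_growth_bytes(int upc_routines)
{
    return 1024L * 1024 + 3584L * upc_routines;   /* 3.5 KB = 3584 bytes */
}
```

For example, with call stack tracking enabled and a call depth of 10, the estimate is 131072 + 5120 = 136192 bytes per thread.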