
Lawrence Livermore National Laboratory
Automated Extraction of Skeleton Apps from Apps
February 2012
Daniel Quinlan (LLNL), Matt Sottile (Galois), Aaron Tomb (Galois)
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
Operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344

2 What is a Skeleton and why you want one
- A skeleton is a reduced-size version of an application that focuses on one or more aspects of the behavior of the full original application. Examples include:
  - MPI usage, message passing patterns
  - memory traversal
  - I/O demands
- This is important for Exascale:
  - Provides inputs to simulators for evaluation of expected Exascale architectures and features (e.g. SST/macro)
  - Provides smaller applications for independent study
- A skeleton program will not get the same answer as the original application
- There is prior work in this area…
- I think we are the only ones with a distributed tool for this…

3 CoDesign Tool Flow: Automatic Generation of Skeletons for Rapid Analysis
[Figure: CoDesign tool flow diagram. This talk is about the arrows highlighted in the diagram.]

4 We can generate many skeletons from an App
- Many skeletons could be generated from a single application
- The process can work on full applications or smaller compact applications
[Figure: a single App with many files is reduced along Aspect A, Aspect B, ..., Aspect X into Skeleton A, Skeleton B, ..., Skeleton X: many skeleton apps, each with maybe many files.]

5 An Automated or Semi-Automated Process
- We treat this as a compiler research problem
- We are building tools to automate the generation of skeletons, but some questions are difficult to resolve
  - May require dynamic analysis to identify important values
  - May require some user annotations to define some behavior
- We start with the original application and transform it, modifying and removing code through an automated process; this is a source-to-source solution

6 We are using the ROSE Source-To-Source Compiler to support this work
[Figure: Source Code (Fortran/C/C++, OpenMP) -> ROSE Frontend -> ROSE IR -> Analyses/Transformations/Optimizations (control flow, control dependency, system dependency, sliced system dependency) -> Unparser -> Transformed Source Code. The ROSE-based Skeleton Generation Tool sits on this pipeline.]
Science & Technology: Computation Directorate

7 A Non-trivial problem to Automate
- Different aspects are related (they are not actually orthogonal)
  - Example: inter-message timings are a function of the computational work that an app does.
- Static analysis is not always precise, and dynamic analysis is not always complete
- Our research work focuses on using static analysis and formal methods to generate plausible, realistic skeletons.

8 Example of Automated Skeleton Code Generation: Before/After

Before:

    do {
        if (rank < size - 1)
            MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
        itcnt ++;
        diffnorm = 0.0;
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++) {
                xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                              xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                diffnorm += (xnew[i][j] - xlocal[i][j]) *
                            (xnew[i][j] - xlocal[i][j]);
            }
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++)
                xlocal[i][j] = xnew[i][j];
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
        gdiffnorm = sqrt( gdiffnorm );
        if (rank == 0)
            printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

After:

    do {
        if (rank < size - 1)
            MPI_Send( xlocal[maxn / size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
        itcnt ++;
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

9 Example of Automated Skeleton Code Generation: Larger example
- Source-to-source transformation
- Def-use analysis of variables leading to MPI calls
- Future work will explore use of:
  - System Dependence Graph (SDG)
  - Data flow framework and defined concepts of dead-code elimination
- Can be supplemented with dynamic information
- Can be applied to abstract things other than MPI use

Complete program containing the loop from the previous slide:

    #include <stdio.h>
    #include <math.h>
    #include "mpi.h"

    /* This example handles a 12 x 12 mesh, on 4 processors only. */
    #define maxn 12

    int main( argc, argv )
    int argc;
    char **argv;
    {
        int rank, size, i, j, itcnt;
        int i_first, i_last;
        MPI_Status status;
        double diffnorm, gdiffnorm;
        double xlocal[(12/4)+2][12];
        double xnew[(12/3)+2][12];

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        if (size != 4) MPI_Abort( MPI_COMM_WORLD, 1 );

        /* xlocal[][0] is lower ghostpoints, xlocal[][maxn+2] is upper */
        /* Note that top and bottom processes have one less row of interior points */
        i_first = 1;
        i_last = maxn/size;
        if (rank == 0) i_first++;
        if (rank == size - 1) i_last--;

        /* Fill the data as specified */
        for (i=1; i<=maxn/size; i++)
            for (j=0; j<maxn; j++)
                xlocal[i][j] = rank;
        for (j=0; j<maxn; j++) {
            xlocal[i_first-1][j] = -1;
            xlocal[i_last+1][j] = -1;
        }

        itcnt = 0;
        do {
            /* Send up unless I'm at the top, then receive from below */
            /* Note the use of xlocal[i] for &xlocal[i][0] */
            if (rank < size - 1)
                MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
            if (rank > 0)
                MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
            /* Send down unless I'm at the bottom */
            if (rank > 0)
                MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
            if (rank < size - 1)
                MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );

            /* Compute new values (but not on boundary) */
            itcnt ++;
            diffnorm = 0.0;
            for (i=i_first; i<=i_last; i++)
                for (j=1; j<maxn-1; j++) {
                    xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                                  xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                    diffnorm += (xnew[i][j] - xlocal[i][j]) *
                                (xnew[i][j] - xlocal[i][j]);
                }
            /* Only transfer the interior points */
            for (i=i_first; i<=i_last; i++)
                for (j=1; j<maxn-1; j++)
                    xlocal[i][j] = xnew[i][j];

            MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
            gdiffnorm = sqrt( gdiffnorm );
            if (rank == 0)
                printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
        } while (gdiffnorm > 1.0e-2 && itcnt < 100);

        MPI_Finalize( );
        return 0;
    }

Original Source Code: rank(int iteration)

    void rank( int iteration )
    {
        INT_TYPE i, k;
        INT_TYPE shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
        INT_TYPE key;
        INT_TYPE2 bucket_sum_accumulator, j, m;
        INT_TYPE local_bucket_sum_accumulator;
        INT_TYPE min_key_val, max_key_val;
        INT_TYPE *key_buff_ptr;

        TIMER_START( T_RANK );

        /* Iteration alteration of keys */
        if(my_rank == 0 )
        {
            key_array[iteration] = iteration;
            key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
        }

        /* Initialize */
        for( i=0; i<NUM_BUCKETS+TEST_ARRAY_SIZE; i++ )
        {
            bucket_size[i] = 0;
            bucket_size_totals[i] = 0;
            process_bucket_distrib_ptr1[i] = 0;
            process_bucket_distrib_ptr2[i] = 0;
        }

        /* Determine where the partial verify test keys are, load into */
        /* top of array bucket_size */
        for( i=0; i<TEST_ARRAY_SIZE; i++ )
            if( (test_index_array[i]/NUM_KEYS) == my_rank )
                bucket_size[NUM_BUCKETS+i] = key_array[test_index_array[i] % NUM_KEYS];

        /* Determine the number of keys in each bucket */
        for( i=0; i<NUM_KEYS; i++ )
            bucket_size[key_array[i] >> shift]++;

        /* Accumulative bucket sizes are the bucket pointers */
        bucket_ptrs[0] = 0;
        for( i=1; i< NUM_BUCKETS; i++ )
            bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];

        /* Sort into appropriate bucket */
        for( i=0; i<NUM_KEYS; i++ )
        {
            key = key_array[i];
            key_buff1[bucket_ptrs[key >> shift]++] = key;
        }

        TIMER_STOP( T_RANK );
        TIMER_START( T_RCOMM );

        /* Get the bucket size totals for the entire problem. These
           will be used to determine the redistribution of keys */
        MPI_Allreduce( bucket_size, bucket_size_totals, NUM_BUCKETS+TEST_ARRAY_SIZE,
                       MP_KEY_TYPE, MPI_SUM, MPI_COMM_WORLD );

        TIMER_STOP( T_RCOMM );
        TIMER_START( T_RANK );

        /* Determine Redistribution of keys: accumulate the bucket size totals
           till this number surpasses NUM_KEYS (which is the average number of keys
           per processor). Then all keys in these buckets go to processor 0.
           Continue accumulating again until surpassing 2*NUM_KEYS. All keys
           in these buckets go to processor 1, etc. This algorithm guarantees
           that all processors have work ranking; no processors are left idle.
           The optimum number of buckets, however, does not result in as high
           a degree of load balancing (as even a distribution of keys as is
           possible) as is obtained from increasing the number of buckets, but
           more buckets results in more computation per processor so that the
           optimum number of buckets turns out to be 1024 for machines tested.
           Note that process_bucket_distrib_ptr1 and ..._ptr2 hold the bucket
           number of first and last bucket which each processor will have after
           the redistribution is done. */
        bucket_sum_accumulator = 0;
        local_bucket_sum_accumulator = 0;
        send_displ[0] = 0;
        process_bucket_distrib_ptr1[0] = 0;
        for( i=0, j=0; i<NUM_BUCKETS; i++ )
        {
            bucket_sum_accumulator += bucket_size_totals[i];
            local_bucket_sum_accumulator += bucket_size[i];
            if( bucket_sum_accumulator >= (j+1)*NUM_KEYS )
            {
                send_count[j] = local_bucket_sum_accumulator;
                if( j != 0 )
                {
                    send_displ[j] = send_displ[j-1] + send_count[j-1];
                    process_bucket_distrib_ptr1[j] = process_bucket_distrib_ptr2[j-1]+1;
                }
                process_bucket_distrib_ptr2[j++] = i;
                local_bucket_sum_accumulator = 0;
            }
        }

        /* When NUM_PROCS approaching NUM_BUCKETS, it is highly possible
           that the last few processors don't get any buckets. So, we
           need to set counts properly in this case to avoid any fallouts. */
        while( j < comm_size )
        {
            send_count[j] = 0;
            process_bucket_distrib_ptr1[j] = 1;
            j++;
        }

        TIMER_STOP( T_RANK );
        TIMER_START( T_RCOMM );

        /* This is the redistribution section: first find out how many keys
           each processor will send to every other processor: */
        MPI_Alltoall( send_count, 1, MPI_INT, recv_count, 1, MPI_INT, MPI_COMM_WORLD );

        /* Determine the receive array displacements for the buckets */
        recv_displ[0] = 0;
        for( i=1; i<comm_size; i++ )
            recv_displ[i] = recv_displ[i-1] + recv_count[i-1];

        /* Now send the keys to respective processors */
        MPI_Alltoallv( key_buff1, send_count, send_displ, MP_KEY_TYPE,
                       key_buff2, recv_count, recv_displ, MP_KEY_TYPE,
                       MPI_COMM_WORLD );

        TIMER_STOP( T_RCOMM );
        TIMER_START( T_RANK );

        /* The starting and ending bucket numbers on each processor are
           multiplied by the interval size of the buckets to obtain the
           smallest possible min and greatest possible max value of any
           key on each processor */
        min_key_val = process_bucket_distrib_ptr1[my_rank] << shift;
        max_key_val = ((process_bucket_distrib_ptr2[my_rank] + 1) << shift)-1;

        /* Clear the work array */
        for( i=0; i<max_key_val-min_key_val+1; i++ )
            key_buff1[i] = 0;

        /* Determine the total number of keys on all other
           processors holding keys of lesser value */
        m = 0;
        for( k=0; k<my_rank; k++ )
            for( i= process_bucket_distrib_ptr1[k]; i<=process_bucket_distrib_ptr2[k]; i++ )
                m += bucket_size_totals[i]; /* m has total # of lesser keys */

        /* Determine total number of keys on this processor */
        j = 0;
        for( i= process_bucket_distrib_ptr1[my_rank]; i<=process_bucket_distrib_ptr2[my_rank]; i++ )
            j += bucket_size_totals[i]; /* j has total # of local keys */

        /* Ranking of all keys occurs in this section: */
        /* shift it backwards so no subtractions are necessary in loop */
        key_buff_ptr = key_buff1 - min_key_val;

        /* In this section, the keys themselves are used as their
           own indexes to determine how many of each there are: their
           individual population */
        for( i=0; i<j; i++ )
            key_buff_ptr[key_buff2[i]]++; /* Now they have individual key */
                                          /* population */

        /* To obtain ranks of each key, successively add the individual key
           population, not forgetting the total of lesser keys, m.
           NOTE: Since the total of lesser keys would be subtracted later
           in verification, it is no longer added to the first key population
           here, but still needed during the partial verify test. This is to
           ensure that 32-bit key_buff can still be used for class D. */
        /* key_buff_ptr[min_key_val] += m; */
        for( i=min_key_val; i<max_key_val; i++ )
            key_buff_ptr[i+1] += key_buff_ptr[i];

        /* This is the partial verify test section */
        /* Observe that test_rank_array vals are */
        /* shifted differently for different cases */
        for( i=0; i<TEST_ARRAY_SIZE; i++ )
        {
            k = bucket_size_totals[i+NUM_BUCKETS]; /* Keys were hidden here */
            if( min_key_val <= k && k <= max_key_val )
            {
                /* Add the total of lesser keys, m, here */
                INT_TYPE2 key_rank = key_buff_ptr[k-1] + m;
                int failed = 0;

                switch( CLASS )
                {
                    case 'S':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'W':
                        if( i < 2 )
                        {
                            if( key_rank != test_rank_array[i]+(iteration-2) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'A':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+(iteration-1) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-(iteration-1) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'B':
                        if( i == 1 || i == 2 || i == 4 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'C':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'D':
                        if( i < 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                }
                if( failed == 1 )
                    printf( "Failed partial verification: "
                            "iteration %d, processor %d, test key %d\n",
                            iteration, my_rank, (int)i );
            }
        }

        TIMER_STOP( T_RANK );

        /* Make copies of rank info for use by full_verify: these variables
           in rank are local; making them global slows down the code, probably
           since they cannot be made register by compiler */
        if( iteration == MAX_ITERATIONS )
        {
            key_buff_ptr_global = key_buff_ptr;
            total_local_keys = j;
            total_lesser_keys = 0; /* no longer set to 'm', see note above */
        }
    }

Generated Skeleton Code: rank(int iteration)

    void rank(int iteration)
    {
      INT_TYPE i;
      INT_TYPE k;
      INT_TYPE shift = (23 - 10);
      INT_TYPE key;
      INT_TYPE2 bucket_sum_accumulator;
      INT_TYPE2 j;
      INT_TYPE2 m;
      INT_TYPE local_bucket_sum_accumulator;
      INT_TYPE min_key_val;
      INT_TYPE max_key_val;
      INT_TYPE *key_buff_ptr;
    /* Get the bucket size totals for the entire problem. These
       will be used to determine the redistribution of keys */
      MPI_Allreduce(bucket_size,bucket_size_totals,((1 << 10) + 5),MPI_INT,MPI_SUM,MPI_COMM_WORLD);
    /* This is the redistribution section: first find out how many keys
       each processor will send to every other processor: */
      MPI_Alltoall(send_count,1,MPI_INT,recv_count,1,MPI_INT,MPI_COMM_WORLD);
    /* Now send the keys to respective processors */
      MPI_Alltoallv(key_buff1,send_count,send_displ,MPI_INT,key_buff2,recv_count,recv_displ,MPI_INT,MPI_COMM_WORLD);
    }

10 Static Analysis Drives Skeleton Generation
- First prototype: generate a skeleton representing message passing via static analysis (using the use-def analysis in ROSE)
- Basic concept, where MPI is the target aspect:
  - Identify message passing (MPI) operations.
  - Preserve MPI operations and the code that they depend on, removing superfluous code.
  - Aim to remove large blocks of computational code, replacing them with simpler surrogate code, to produce a skeleton of the app that contains the essential message passing structure without the actual work.
- Our research approach has been to explore four different forms of analysis to drive the skeleton generation:
  1) Use-def analysis (to generate a form of program slice); works on the AST directly, not directly using the inter-procedural control flow graph (CFG)
  2) Program slicing using ROSE's System Dependence Graph (SDG), which captures the def-use analysis and more on the inter-procedural control flow graph in ROSE
  3) A new Data-Flow Framework in ROSE; another form of analysis using the inter-procedural control flow graph in ROSE
  4) Connections to formal methods
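
The "preserve MPI operations and the code they depend on" idea can be sketched as a backward slice over use-def information. This is only a toy illustration, not the ROSE implementation: it assumes straight-line code with no branches or loops, encodes variables as bits, and the `Stmt` type and `slice_straight_line` function are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* One statement of straight-line code; each variable is one bit. */
typedef struct {
    unsigned defs;      /* variables this statement writes */
    unsigned uses;      /* variables this statement reads */
    int      is_target; /* nonzero if it calls the target API (e.g. MPI) */
} Stmt;

/*
 * Backward slice: walk from the last statement to the first, keeping
 * every target call and every statement that defines a variable still
 * needed by something already kept. keep[i] is set to 1 or 0.
 * Returns the number of statements kept.
 */
size_t slice_straight_line(const Stmt *stmts, size_t n, int *keep)
{
    unsigned needed = 0; /* variables required by statements kept so far */
    size_t kept = 0;

    assert(stmts != NULL && keep != NULL);
    for (size_t i = n; i-- > 0; ) {
        const Stmt *s = &stmts[i];
        keep[i] = s->is_target || (s->defs & needed) != 0;
        if (keep[i]) {
            /* This definition satisfies the need; its inputs become needed. */
            needed = (needed & ~s->defs) | s->uses;
            kept++;
        }
    }
    return kept;
}
```

For a four-statement program "set rank; compute xnew from xlocal; compute diff from xnew and xlocal; send xlocal to rank", only the first and last statements survive, which mirrors how the stencil computation disappears from the skeleton while the send and its inputs stay.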

11 Static Analysis: Program Slicing

    int returnMe (int me)
    {
        return me;
    }

    int main (int argc, char ** argv)
    {
        int a = 1;
        int b;
        returnMe(a);
        b = returnMe(a);
    #pragma SliceTarget
        return b;
    }

- System (Inter-procedural) Dependence Analysis
- A sequence of directed edges defines a slice
- Can be used for model extraction

12 Data Flow as an alternative approach to Drive Skeleton Generation
- Future work will explore the use of a new Data Flow Framework in ROSE to support the analysis required to generate skeletons
  - May be an easier way (for users) to specify aspects
  - It is related to slicing in that it uses the same inter-procedural control flow graph internally
- Each form of analysis (Use-def, SDG, and Data-Flow) is an orthogonal direction of work; all share the common infrastructure we have built for skeleton generation.
- The analysis and infrastructure are implemented using ROSE

13 A Generic API for Skeletonization
- Generalized skeletonization target APIs
  - Original work focused on skeletonizing relative to the MPI API.
  - Current code extended to allow skeletons against any API (e.g., Visualization and Data Analysis, I/O and Storage, use of domain-specific abstractions, etc.)
  - Important for building skeletons to probe different aspects of program behavior: I/O, message passing, threading, app-specific libraries
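
One way to picture "skeletons against any API" is a user-supplied description of which calls belong to the target API. The slides do not say how the tool represents a target API, so the prefix-matching scheme and the `is_target_call` helper below are assumptions made purely for illustration.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if the callee name starts with any prefix in the
 * (hypothetical) target-API description, 0 otherwise.
 * Empty prefixes are ignored so they cannot match everything.
 */
int is_target_call(const char *callee, const char *const *prefixes, size_t n)
{
    assert(callee != NULL && prefixes != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(prefixes[i]);
        if (len > 0 && strncmp(callee, prefixes[i], len) == 0)
            return 1;
    }
    return 0;
}
```

With the description `{ "MPI_" }` the call `MPI_Allreduce` is preserved and `sqrt` is a removal candidate; swapping in `{ "fopen", "fread", "fwrite", "fclose" }` would instead produce an I/O-focused skeleton from the same source.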

14 Annotation guided skeletonization
- Previous work focused on purely dependency-based slicing. This led to problems:
  - Removal of computational code could cause loops to cease to converge (iterate forever).
  - Branching patterns are no longer meaningful with the computational code gone.
- Annotations let the user guide skeletonization by adding semantics to the skeleton that are impossible or difficult to infer statically:
  - Loop iteration counts
  - Branching probabilities
  - Variable initialization values

15 Use of an Annotation Before/After

Before:

    int main() {
      int x = 0;
      int i;
      // execute exactly 10 times
      #pragma skel loopIterate 10
      for (i = 0; x < 100 ; i++) {
        if (x % 2)
          x += 5;
      }
      return x;
    }

After:

    int main() {
      int x = 0;
      int i;
      // execute exactly 10 times
      #pragma skel loopIterate 10
      int k = 0;
      for (i = 0; k < 10; k++) {
        {
          if ((x % 2) != 0)
            x += 5;
        }
        rose_label__1:
        i++;
      }
      return x;
    }
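
The slide above shows the loop-count annotation only. For the branching-probability annotation, the slides give neither the pragma syntax nor the generated code, so the following is a guess at what surrogate code could look like: a deterministic counter that takes a branch `num` times out of every `den` visits, avoiding any dependence on removed computation or on a random number generator. `SurrogateBranch` and both functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* State for a branch that should be taken num times out of every den visits. */
typedef struct {
    unsigned num;
    unsigned den;
    unsigned acc; /* running remainder, always < den between calls */
} SurrogateBranch;

SurrogateBranch surrogate_branch_init(unsigned num, unsigned den)
{
    SurrogateBranch b;
    assert(den > 0 && num <= den);
    b.num = num;
    b.den = den;
    b.acc = 0;
    return b;
}

/* Returns 1 if the branch is taken on this visit, 0 otherwise. */
int surrogate_branch_taken(SurrogateBranch *b)
{
    assert(b != NULL);
    b->acc += b->num;
    if (b->acc >= b->den) {
        b->acc -= b->den;
        return 1;
    }
    return 0;
}
```

A skeleton could then replace a data-dependent condition such as `if (x % 2)` with `if (surrogate_branch_taken(&b))`, where `b` was initialized from the annotated probability. Spreading the taken visits evenly is a design choice of this sketch; a real tool might prefer a seeded pseudo-random choice to avoid artificially regular patterns.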

16 User Work Flow for Skeletonization
[Figure: workflow diagram]
- Original Application Program + Dynamic Measurements of Program -> Annotated Application Program
- Annotated Application Program -> Skeleton Extraction Tool -> Skeleton Program
- Observe Behavior of Skeleton
  - Satisfactory behavior: keep skeleton
  - Unsatisfactory behavior: modify or add annotations to tune skeleton generator
    - Branch probabilities
    - Average loop iteration counts
    - Legitimate data values

17 Future work
- SDG version of analysis for skeletonization
- Using the new Data Flow framework in ROSE for skeletonization
- Galois will be working on adding formal-methods-based analysis to the skeleton generator to analyze regions of code to remove.
  - Floating point range analysis.
  - Symbolic execution.
- Formal methods will aim to answer questions to aid skeleton generation such as:
  - What range of values do we expect a complex computation to produce?
    - Allows us to automatically select surrogate values for populating data structures
    - Know when specific values are critical
  - Under specific input conditions, what code is reachable or not reachable?
    - Allows us to build skeletons for specific input circumstances, instead of generic skeletons
    - This is a connection to path feasibility analysis currently being developed in ROSE
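
As a rough idea of what floating point range analysis could answer, the sketch below uses plain interval arithmetic to bound the four-neighbor average from the earlier mesh example. It is not the planned Galois analysis: it ignores rounding error (a rigorous version would round bounds outward), and `Interval`, `iv_add`, `iv_scale`, and `jacobi_update_range` are names invented here.

```c
#include <assert.h>

/* Closed interval [lo, hi] of possible values. */
typedef struct {
    double lo;
    double hi;
} Interval;

/* Sum of two intervals: bounds add componentwise. */
Interval iv_add(Interval a, Interval b)
{
    Interval r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi;
    return r;
}

/* Scale by a non-negative constant, which preserves bound order. */
Interval iv_scale(Interval a, double k)
{
    Interval r;
    assert(k >= 0.0);
    r.lo = a.lo * k;
    r.hi = a.hi * k;
    return r;
}

/* Range of (up + down + left + right) / 4.0 when every cell lies in x. */
Interval jacobi_update_range(Interval x)
{
    Interval sum = iv_add(iv_add(x, x), iv_add(x, x));
    return iv_scale(sum, 0.25);
}
```

In the mesh example the cells start at the rank (0 to 3) or at -1 on the boundary rows, so the input range is [-1, 3], and the update provably stays inside [-1, 3]. A skeleton generator could use such a bound to pick surrogate values for `xlocal` without running the stencil.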

18 ROSE Compiler Design
[Figure: ROSE compiler architecture diagram. Labels shown:
- General purpose languages used within DOE: C & C++, Fortran (F77-F2003), Python, UPC 1.1, OpenMP 3.0, CUDA
- Front-End: AST Builder API
- Mid-End: High Level IRs (AST), IR Extension API (ROSETTA), High Level Analysis & Optimization Framework, Low Level IR (LLVM), Low Level Analysis & Optimization, Existing LLVM Analysis & Optimization
- Back-End: Unparser, LLVM Backend Code Generation, Exascale Vendor Compilers, Exascale Vendor Compiler Infrastructures
- Target: Exascale Architecture]
