Solution Approach
Step 1: Find the location of the head of the list: Head = n*(n-1)/2 - SUM_SUCC
Step 2: Select s random locations of X to split the list into s random sublists
Step 3: Using the standard sequential algorithm, compute the prefix sums of each sublist separately
Step 4: Compute the prefix sums of the list consisting exclusively of the splitters
Step 5: Update the prefix value of each element of the array X by using the prefix sum values computed in Step 4
(Figure labels: GPU, CPU)
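The head formula in Step 1 works because every index from 0 to n-1 except the head appears exactly once as a successor, so subtracting the sum of successors from n*(n-1)/2 leaves the head index. A minimal sketch follows; the function name `find_head`, the array name `succ`, and the convention that the tail's successor is stored as -1 (and skipped in the sum) are assumptions made for illustration, not details from the paper.

```python
def find_head(succ):
    """Return the index of the list head.

    succ[i] is the index of the node that follows node i.
    Assumption: the tail stores -1 and is left out of SUM_SUCC.
    """
    n = len(succ)
    sum_succ = sum(s for s in succ if s >= 0)
    # Sum of all indices 0..n-1 minus every index that has a predecessor.
    return n * (n - 1) // 2 - sum_succ


# List order 3 -> 0 -> 4 -> 1 -> 2, stored as a successor array.
succ = [4, 2, -1, 0, 1]
print(find_head(succ))  # 3
```

On the GPU the sum would be a parallel reduction; the arithmetic is the same.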
Problems Faced
The paper does not mention that, of the s random sublists generated, one must start at the head of the list. To handle this, I fixed the head of the first sublist to be the head of the list; the remaining splitters are chosen at random, as suggested in the paper.
A second problem arose in executing Steps 4 and 5. Because the sublists are random and not stored in list order, computing the prefix sums over the last elements of the sublists is itself a linked-list prefix sum problem. To solve it, we need an additional array that records which sublist follows each sublist.
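The two fixes described here can be sketched in a sequential form: the head is forced to be splitter 0, and an extra array records which sublist follows each one so that Step 4 can walk the sublists in list order. This is an illustrative sketch, not the GPU implementation; the names `sublist_prefix_sums`, `splitter_id`, `owner`, and `next_sublist`, the -1 tail marker, and the `seed` parameter are all assumptions.

```python
import random


def sublist_prefix_sums(succ, values, head, s, seed=0):
    """Prefix sums over a linked list using s sublists (sequential sketch)."""
    n = len(succ)
    rng = random.Random(seed)
    # Step 2: the head is always splitter 0; the rest are random nodes.
    others = [i for i in range(n) if i != head]
    splitters = [head] + rng.sample(others, min(s - 1, len(others)))
    splitter_id = [-1] * n
    for k, node in enumerate(splitters):
        splitter_id[node] = k

    # Step 3: walk each sublist until the tail or the next splitter.
    prefix = [0] * n
    owner = [0] * n                          # sublist each node belongs to
    total = [0] * len(splitters)             # sum of each sublist
    next_sublist = [-1] * len(splitters)     # sublist that follows sublist k
    for k, node in enumerate(splitters):
        running = 0
        while True:
            running += values[node]
            prefix[node] = running
            owner[node] = k
            nxt = succ[node]
            if nxt < 0 or splitter_id[nxt] >= 0:
                break
            node = nxt
        total[k] = running
        next_sublist[k] = splitter_id[nxt] if nxt >= 0 else -1

    # Step 4: sequential prefix sum over sublists, in list order.
    offset = [0] * len(splitters)
    k, acc = 0, 0
    while k >= 0:
        offset[k] = acc
        acc += total[k]
        k = next_sublist[k]

    # Step 5: add each sublist's offset to its nodes.
    return [prefix[i] + offset[owner[i]] for i in range(n)]


succ = [4, 2, -1, 0, 1]          # list order 3 -> 0 -> 4 -> 1 -> 2
print(sublist_prefix_sums(succ, [1] * 5, head=3, s=2))  # [2, 4, 5, 1, 3]
```

In a parallel version, each iteration of the Step 3 loop would be handled by its own thread, while the Step 4 walk stays sequential.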
Optimizations
The head is assumed to be unknown mainly to explore the impact of large caches: the initial step that determines the head of the list fills the cache with some of the input data, which makes the later steps run faster on processors with significant caches.
With high probability, every thread handles about the same number of nodes as any other thread, provided the number of sublists is at least lnp n and the number of processors p <, where n is the total number of nodes.
The number of sublists is chosen to balance the desirability of many sublists (for fine-grain data-parallel computation and load balancing) against the splitting and merging costs.
Optimizations
Step 4 computes the prefix sums sequentially rather than recursively, which removes a significant overhead.
Randomizing the positions of the splitters makes the overall procedure load balanced with high probability.
The number of sublists per thread is min(2*(size/(1<<20)), 32) for size > 1<<20. This value was found experimentally: beyond it, the gain from increasing the number of sublists is outweighed by the overhead of creating and joining them in the other stages of the algorithm.
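The sublists-per-thread rule can be written as a small helper, reading the threshold as 1 << 20 (that is, 2^20 elements), which is consistent with the 32 sublists per thread reported for the 64M list. The function name `sublists_per_thread`, the use of integer division, and the choice to raise an error for sizes at or below the threshold are assumptions; the slides do not say what happens for small lists.

```python
def sublists_per_thread(size):
    """Sublists per thread for a list of `size` nodes (size > 1 << 20)."""
    threshold = 1 << 20
    if size <= threshold:
        # Assumption: the rule is only stated for lists larger than 2^20.
        raise ValueError("rule is only defined for size > 1 << 20")
    return min(2 * (size // threshold), 32)


print(sublists_per_thread(64 * (1 << 20)))  # 32
print(sublists_per_thread(8 * (1 << 20)))   # 16
```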
Results
List size 64M, stride 1001, 32 sublists per thread.

Steps     Paper results (ms)   GTX 580 (ms)   C1060 (ms)
Step 1         11.538              1.634          3.105
Step 2          3.580              0.167          0.297
Step 3        225.137            243.657        454.588
Step 4         50.643              0.093          0.362