yaSpMV: Yet Another SpMV Framework on GPUs


1 yaSpMV: Yet Another SpMV Framework on GPUs
Shengen Yan, Chao Li, Yunquan Zhang, Huiyang Zhou. Thanks for the introduction. Good afternoon, everyone. The title of my talk is yaSpMV: Yet Another SpMV Framework on GPUs. I am Shengen Yan, a PhD student at the Chinese Academy of Sciences and a visiting student at NC State University. This is joint work with my classmate Chao Li, my advisor Yunquan Zhang at the Chinese Academy of Sciences, and my advisor Huiyang Zhou at NC State University.

2 Introduction Sparse Matrix-Vector Multiplication
SpMV is a very important linear algebra algorithm, and its serial implementation is quite simple:

    // A*x = y, where A is stored in the CSR format.
    for (i = 0; i < m; ++i) {
        double y0 = y[i];
        for (k = rowptr[i]; k < rowptr[i+1]; ++k)
            y0 += value[k] * x[column_index[k]];
        y[i] = y0;
    }

There has been a lot of work on optimizing SpMV on both CPUs and GPUs, and many formats have been proposed. Sparse matrix-vector multiplication (SpMV) is heavily used in many important application domains. As the slide shows, the serial implementation needs only a few lines of code. Previous research has mainly focused on proposing new sparse matrix formats.

3 Introduction Parallel implementation: two challenges Bandwidth
The upper bound of the flop:byte ratio is 0.25. Load imbalance: different numbers of non-zeros in different rows, which is worse on GPUs. Unlike the serial implementation, the parallel SpMV implementation is quite difficult. There are two challenges. The first is bandwidth: the matrix data and the corresponding index data have low reuse, as each non-zero element is used only once to compute the corresponding result. The second is that the typical way to implement SpMV on GPUs is row-based parallelization, which suffers from load imbalance because the non-zeros in a matrix may not be evenly distributed across rows. As shown in matrix A, there are three non-zeros in the first row and six non-zeros in the fourth row. The load imbalance problem is more severe on GPU architectures since the threads in a warp operate in a single-instruction multiple-data (SIMD) manner.

4 Executive Summary BCCOO format Customized efficient segmented scan/sum
addressing the bandwidth challenge. Customized efficient segmented scan/sum, addressing the load imbalance problem, very efficiently. Results (GTX 680): vs. CUSPARSE V5.0, up to 229% and 65% on average improvement; vs. clSpMV, up to 195% and 70% on average improvement. In our paper, we propose a new format and a corresponding efficient algorithm to address the two challenges above. The new format addresses the bandwidth challenge, and the corresponding algorithm efficiently addresses the load imbalance problem. Here are the experimental results on Kepler (GTX 680). Compared to CUSPARSE V5.0, we get up to 229% improvement and 65% on average. Compared to clSpMV, we get up to 195% improvement and 70% on average.

5 Outline Introduction Formats for SpMV
addressing the bandwidth challenge. Efficient Segmented Sum/Scan for SpMV. Auto-Tuning Framework. Experimentation. Conclusions. Here is the outline of the talk. I will first present our newly proposed format, which addresses the bandwidth challenge, and then show how the corresponding algorithm works. In the fourth section, I will describe our auto-tuning framework, and then I will explain our experimental results in detail. The last section gives the conclusions.

6 COO format 𝐴= 𝑉𝑎𝑙𝑢𝑒=[ ] 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ] 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥=[ ] Our format is derived from the COO format, so I will start with the COO format. I will use matrix A to show how our format is constructed from it. The COO format is a widely used format for sparse matrices: it stores the column and row indices explicitly for every non-zero in the matrix. As shown in the picture, matrix A can be represented with three arrays: a data value array, a row index array, and a column index array. COO format of matrix A

7 Blocked COO (BCOO) format
𝐴= Based on the COO format, we first propose a Blocked COO format, which we call the BCOO format. We divide the original matrix into blocks; as the dashed lines show, the matrix has been divided into eight 2x2 blocks. We then need only index the blocks that contain at least one non-zero.

8 BCOO format block size 2x2
Blocked COO (BCOO) format 𝐴= 𝑉𝑎𝑙𝑢𝑒= [ ] [ ] 𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ]𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥= [ ] 5 1 Here is the first non-zero block. Its blocked row index is 0 and its blocked column index is 1. 1 BCOO format block size 2x2

9 BCOO format block size 2x2
Blocked COO (BCOO) format 𝐴= 𝑉𝑎𝑙𝑢𝑒= [ ] [ ] 𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ]𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥= [ ] 3 0 5 1 6 9 4 0 Here is the second non-zero block. Its blocked row index is 0 and its blocked column index is 3. 1 3 BCOO format block size 2x2

10 BCOO format block size 2x2
Blocked COO (BCOO) format 𝐴= 𝑉𝑎𝑙𝑢𝑒= [ ] [ ] 𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ]𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥= [ ] 3 0 5 1 6 9 4 0 0 0 4 7 7 2 1 3 3 5 8 4 All the other non-zero blocks are handled the same way. From the figure, we can see that there are only five non-zero blocks. Although there are some zero fillings in the value array, the index data are significantly reduced compared to the original COO format. 1 1 1 1 3 2 3 BCOO format block size 2x2

11 Blocked compressed COO (BCCOO) format
Row Index Compression Ratio: 1/32 𝐴= Bit Integer 𝑉𝑎𝑙𝑢𝑒= [ ] [ ] 𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ] Difference value =[ ] Bit Flag (flipped)=[ ] 𝐵𝑙𝑜𝑐𝑘𝑒𝑑 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥= [ ] 3 0 5 1 6 9 4 0 0 0 4 7 7 2 1 3 3 5 8 4 1. Although the Blocked COO format can significantly reduce the index data, the block size and the number of zero fillings also matter: a smaller block size or more zero fillings means more memory consumption. 2. Our key extension to the COO format is to use a bit flag array to compress the row index array. To generate the bit flag for each non-zero block, we first apply a difference function to the row index array, subtracting each row index from the following one; this gives the difference value array. For the convenience of the next step, we then flip the difference values; the flipped values form our bit flag array, in which a 0 marks the last non-zero block of a row. The key difference between the row index array and the bit flag array is that the row index array needs an integer per block while the bit flag array needs only one bit per block, so the compression ratio here is 1/32. We call the resulting format the BCCOO format. 1 1 1 1 3 2 3 BCCOO format block size 2x2

12 Extensions of BCCOO format
Formats for SpMV. Extensions of the BCCOO format. BCCOO+ format: rearranges the non-zero blocks to relieve the irregular accesses to the vector. Column index compression: apply a difference function to the column index array. There are also some extensions of our BCCOO format; the details can be found in the paper.

13 BCCOO format of matrix B (Block size 1x1)
Example matrix. Assume there are 4 threads. 𝐵= 𝐵it Flag=[ ] 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥=[ ] 𝑉𝑎𝑙𝑢𝑒=[ ] Matrix B will be used in the following slides to explain our algorithm. For ease of description, we assume that the block size is 1x1 and that there are only four threads (one workgroup). Each thread is in charge of four non-zeros. BCCOO format of matrix B (Block size 1x1)

14 Auxiliary Information for SpMV: Result Entry
Getting the location, in the output array, of the first result generated by each thread; that is, computing the row index to which each thread's first result belongs. This only requires counting the number of 0s in the bit flag arrays of the previous threads. To make the computation of SpMV more convenient, we generate some auxiliary information from the bit flag array. Based on the number of non-zeros each thread processes, we compute the location in the output array of the first result generated by each thread. This auxiliary information can be computed with a scan operation on the bit flag array: each thread only needs to count the 0s in the bit flags of its previous threads. Here is the result entry array of matrix B. 𝐵it Flag=[ ] 𝑅𝑒𝑠𝑢𝑙𝑡 𝐸𝑛𝑡𝑟𝑦: 2 3 Thread 0 Thread 1 Thread 2 Thread 3

15 Outline Introduction Formats for SpMV
Efficient Segmented Sum/Scan for SpMV, addressing the load imbalance problem. Auto-Tuning Framework. Experimentation. Conclusions. In the following section I will introduce the corresponding algorithm, which addresses the load imbalance problem.

16 Even workload partition
No workload imbalance. Non-zero Blocks, workgroups: workgroup 0, workgroup 1, workgroup 2, workgroup 3. First, I will show how to partition the working set. The programmer can invoke hundreds of thousands of threads on a GPU, organized into many workgroups (thread blocks). In our algorithm, the input data are divided evenly among the workgroups: each workgroup gets a workgroup-level tile, and each workgroup-level tile is in turn divided evenly among the threads in the workgroup into thread-level tiles. From the slide, we can see that all the work is divided evenly among threads, so our algorithm is load balanced. threads T0 T1 T2 T3

17 Efficient Segmented Sum/Scan for SpMV
Three logical steps: (1) read the data and multiply it with the corresponding vector values; (2) perform a segmented sum/scan using the bit flag array from our BCCOO/BCCOO+ format; (3) combine the results and write them back to global memory. All three steps are implemented in one kernel. Given a sparse matrix stored in our BCCOO format, SpMV first reads the data value arrays and multiplies them with the vector values indexed by the column index array, then performs a segmented scan using the bit flag array, and finally writes the results back to global memory.

18 Step 1 Read the data and multiply with vector values
Ex: 4 Threads. 𝐵= 𝐵it Flag=[ ] 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥=[ ] 𝑉𝑎𝑙𝑢𝑒=[ ] 𝑅𝑒𝑠𝑢𝑙𝑡 𝐸𝑛𝑡𝑟𝑦:[ ] Here are matrix B and its corresponding BCCOO format. There are four threads. BCCOO format of matrix B

19 Step 1 Read the data and multiply with vector values
Ex: 4 Threads. Problem: B*x = ? Assume: x=[ ] 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥= 𝑉𝑒𝑐𝑡𝑜𝑟 𝑣𝑎𝑙𝑢𝑒=[ ] 𝐷𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒= 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒=[ ] 2 4 6 2 6 4 7 X Our problem is to multiply matrix B with vector x. We first read the column index array and the data value array from global memory. Then we gather the vector values according to the column index array. Finally, we multiply the vector values by the data values and store the results in an intermediate array. This completes the first step: read the data and multiply with the vector values. =

20 Step 2 Segmented sum/scan
Three types of rows in our algorithm: (1) all the non-zeros of the row are in the same thread — serial segmented sum/scan within the thread; (2) a row spans multiple threads — plus a parallel segmented sum/scan among threads; (3) a row spans multiple workgroups — plus cross-workgroup synchronization (details in the paper). The second step of our algorithm is the segmented sum/scan. For rows whose non-zeros all fall in one thread, we only need a serial segmented sum/scan on the intermediate array within that thread. Since we divide the workload evenly among threads, some rows may span multiple threads or even multiple workgroups. For a row that spans multiple threads, we add a parallel segmented sum/scan among threads to accumulate the intermediate values. For a row that spans multiple workgroups, we add cross-workgroup synchronization. The following slides show how to handle the first two types; for the third type, please see our paper.

21 Step 2 Segmented sum/scan
1) Serial segmented sum/scan in each thread. Problem: B*x = ? Assume: x=[ ] Serial segmented scan (Intermediate[-1]=0): Intermediate[i] = Intermediate[i-1] * BitFlag[i-1] + Intermediate[i] 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 𝐵it Flag=[ ] 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒[ ] We first do a serial segmented sum/scan within each thread. The formula is shown on the slide; it also explains why we flipped the difference values in our BCCOO format. In thread 0 there is no 0 bit flag, so we simply accumulate the values in the intermediate array. In thread 1, applying the formula yields 27, then 5, then 33, then 36. All the other threads are handled similarly. Scan Scan Scan Scan 27 5 33 36

22 Step 2 Segmented sum/scan
2) Generate the last partial sum of each thread and perform a parallel segmented scan among threads over these partial sums. 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 𝐵it Flag=[ ] 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 Partial Sums Head Flag (is there a '0' in the thread's bit flags?) Scan Scan Scan Scan For rows that span multiple threads, we perform a parallel segmented scan among threads. First, each thread copies its last intermediate value into an auxiliary array we call the partial sums array; if the last bit flag in a thread is 0, the corresponding partial sum is set to 0. Then, since this is a segmented scan, we generate a head flag for each thread by checking whether there is at least one '0' in the thread's bit flags. There is no '0' in thread 0's bit flags, so its head flag is 0. Thread 1 has two 0s, so its head flag is 1. Threads 2 and 3 each have one 0, so their head flags are 1. Finally, we perform the parallel segmented scan; the results are shown. Some rows may also span multiple workgroups; for how we accumulate partial sums across workgroups, please see our paper. 1 1 1 Parallel segmented scan

23 Step 2 Segmented sum/scan
2) Generate the last partial sum and perform a parallel segmented scan among threads over the last partial sums. 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 𝐵it Flag=[ ] 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 Partial Sums Head Flag (is there a '0' in the thread's bit flags?) 1 Scan Scan Scan Scan In this example, the output of the parallel segmented scan happens to equal its input. However, if a row spans three threads, the output differs from the input: if we flip this bit flag to 1, the head flag of thread 2 becomes 0, and the partial sum result of thread 2 becomes 135, different from the input 99. 1 1 1 Parallel segmented scan 135

24 Step 3 Results combination and write the results to global memory.
Problem: B*x = ? Assume: x=[ ] 𝐵it Flag=[ ] 𝐼𝑛𝑡𝑒𝑟𝑚𝑒𝑑𝑖𝑎𝑡𝑒 Partial Sums Combined results 27 33 56 97 + + + The third step of our algorithm generates the final results and writes them to global memory. We combine the intermediate values and the partial sums according to the bit flag array. For the first '0' bit flag in each thread (a 0 marks the end of a row), we add the corresponding intermediate value to the partial sum of the previous thread, then write the result to the output array at the position given by the result entry. For every '0' bit flag that is not the first in its thread, we write the corresponding intermediate value directly to the output array; its location is obtained from the result entry by simple addition. At this point we have the final results: the product of matrix B and vector x, as shown on the slide. 105 33 92 196 𝑅𝑒𝑠𝑢𝑙𝑡 𝐸𝑛𝑡𝑟𝑦 2 3 0+1 𝐅𝐢𝐧𝐚𝐥 𝐑𝐞𝐬𝐮𝐥𝐭 [ , , , ] 105 33 92 196

25 Auto-Tuning Framework
In order to generate the optimal kernel code, we employ auto-tuning to search for the best parameters. Average auto-tuning time: 13 seconds. Auto-tuning speed: about 1 million non-zeros per second. Tunable parameters: the table shows the many tunable parameters of our auto-tuning framework. To accelerate the auto-tuning phase, we optimized the framework; the average auto-tuning time in our experiments is 13 seconds, at a speed of about 1 million non-zeros per second. To further improve performance, we also added some fine-grained optimizations; for the details, please see our paper.

26 Experiments Experimental Methodology
We have implemented our proposed scheme in OpenCL and evaluated it on the GTX 480 and GTX 680 using 20 real-world matrices. Comparison libraries: CUSPARSE V5.0 (Nvidia's official SpMV library), CUSP (SC '09), and clSpMV (ICS '12). In the following slides, I will explain our experimental results in detail. We compared our scheme with three existing libraries: CUSPARSE V5.0 and CUSP, both from Nvidia, and clSpMV from Berkeley.

27 Used Matrices

Name             Size            NNZ     NNZ/Row
Dense            2K x 2K         4M      2000
Protein          36K x 36K       4.3M    119
FEM/Spheres      83K x 83K       6M      72
FEM/Cantilever   62K x 62K               65
Wind Tunnel      218K x 218K     11M     53
FEM/Harbor       47K x 47K       2.3M    59
QCD              49K x 49K       1.9M    39
FEM/Ship         141K x 141K     7.8M    28
Economics        207K x 207K     1.2M    6
Epidemiology     526K x 526K     2.1M    4
FEM/Accelerator  121K x 121K     2.6M    22
Circuit          171K x 171K     0.95M
Webbase          1M x 1M         3.1M    3
LP               4K x 1.1M       11.3M   2825
Circuit5M        5.56M x 5.56M   59.5M   11
eu-2005          863K x 863K     19.2M
Ga41As41H72      268K x 268K     18.4M   67
in-2004          1.38M x 1.38M   17M     12
mip1             66K x 66K       10.3M   152
Si41Ge41H72      186K x 186K     15M     81

Here is the list of matrices used; all of them appear in previous work. Used Matrices

28 Performance results on Kepler (GTX 680)
GFLOPS The blue bar is the performance of CUSPARSE on Kepler. Performance results on Kepler (GTX 680)

29 Performance results on Kepler (GTX 680)
GFLOPS The red bar is the performance of CUSP on Kepler. Performance results on Kepler (GTX 680)

30 Performance results on Kepler (GTX 680)
GFLOPS The green bar is the performance of the clSpMV Cocktail format on Kepler. Performance results on Kepler (GTX 680)

31 Performance results on Kepler (GTX 680)
GFLOPS The light blue bar is the performance of the clSpMV best single format on Kepler. Performance results on Kepler (GTX 680)

32 Average Performance Improvement:
GFLOPS Average Performance Improvement: 65% over CUSPARSE, 70% over clSpMV COCKTAIL, 88% over clSpMV best single, 150% over CUSP. The black bar is the performance of our proposed scheme on Kepler. Here is the average performance improvement. Performance results on Kepler (GTX 680)

33 Performance breakdown on Kepler (GTX680)
GFLOPS To show the performance contribution of each optimization, we collected a performance breakdown on Kepler. We start with the COO format; the blue bar is its performance. Performance breakdown on Kepler (GTX680)

34 Performance breakdown on Kepler (GTX680)
GFLOPS The red bar is the performance obtained by replacing the COO format with the BCCOO format. Performance breakdown on Kepler (GTX680)

35 Performance breakdown on Kepler (GTX680)
GFLOPS The green bar is the performance obtained by replacing the traditional segmented scan approach with our efficient segmented sum/scan algorithm. Performance breakdown on Kepler (GTX680)

36 Performance breakdown on Kepler (GTX680)
GFLOPS The light blue bar is the performance obtained by replacing global synchronization with adjacent synchronization. Performance breakdown on Kepler (GTX680)

37 Average Performance Improvement (vs. COO format): +BCCOO: 66%
+Efficient Segmented Sum/Scan: 192% +Adjacent Synchronization: 212% +Fine-Grain Optimizations: 257% GFLOPS The black bar is the performance after adding the fine-grained optimizations. Here is the average performance improvement from each optimization. Performance breakdown on Kepler (GTX680)

38 Average memory footprint consumption: vs. COO: 60% vs. ELL: 19%
Relative memory footprint. Average memory footprint consumption: vs. COO: 60%; vs. ELL: 19%; vs. Cocktail: 79%; vs. Best single: 69%. Here is the relative memory footprint of the different formats; the longest bar for each matrix marks the format that consumes the most memory. Although ELL is load balanced, it consumes the most memory, while our format consumes the least. Here is the average memory footprint of our format relative to the other formats. Relative memory footprint of different formats

39 Conclusions The BCCOO format
Addressed the memory bandwidth problem. The customized matrix-based segmented sum/scan algorithms addressed the workload imbalance problem, need to invoke only one kernel, and are very efficient thanks to many optimization approaches. Results (GTX 680): vs. CUSPARSE V5.0, up to 229% and 65% on average improvement; vs. clSpMV, up to 195% and 70% on average improvement. Code is available online. To conclude: we proposed a new format that addresses the memory bandwidth problem of SpMV, and a corresponding algorithm that not only addresses the workload imbalance problem but is also very efficient. Compared with CUSPARSE from Nvidia, we get up to 229% performance improvement and 65% on average; compared with clSpMV from Berkeley, we get up to 195% and 70% on average. Our code is available online.

40 Thanks & Questions. Thanks for your attention; I am glad to take questions.

41 COO format 𝐴= Our format is derived from the COO format, so I will start with the COO format. I will use matrix A to show how our format is constructed from it.

42 COO format 𝐴= 𝑉𝑎𝑙𝑢𝑒=[ ] 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ] 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥=[ ] 3 The COO format is a widely used format for sparse matrices; it stores the column and row indices explicitly for all non-zeros. Take the first non-zero, 3, from matrix A: its row index is 0 and its column index is 2. 2

43 COO format 𝐴= 𝑉𝑎𝑙𝑢𝑒=[ ] 𝑅𝑜𝑤 𝑖𝑛𝑑𝑒𝑥=[ ] 𝐶𝑜𝑙 𝑖𝑛𝑑𝑒𝑥=[ ] 3 6 Take the second non-zero, 6, from matrix A: its row index is 0 and its column index is 6. 2 6

44 Step 2 Segmented sum/scan
3) Accumulating partial sums across workgroups. Generate Partial Sums Generate Partial Sums Generate Partial Sums Generate Partial Sums P0 P1 P2 P3 Some rows may span multiple workgroups; to accumulate partial sums across workgroups, we adopt an adjacent synchronization mechanism. With this mechanism, synchronization occurs only between adjacent workgroups, which ensures that we need to invoke only a single kernel. A while loop before step 3 waits for the partial sum from the previous workgroup; once a workgroup receives it, it adds its own partial sum and passes the result on to the next workgroup. Step 3 Step 3 Step 3 Step 3 Using Adjacent Synchronization

45 Fine-grained optimizations
Texture memory for vector reads. Cut the adjacent synchronization chain as early as possible. Remove the parallel segmented scan when possible. If the number of columns is small enough, a short-type column index array may help decrease memory traffic. To further improve performance, we added several fine-grained optimizations. Texture memory for vector reads: since texture memory handles strided memory access well, we use it to store the vector values. Cutting the adjacent synchronization chain early: if there is at least one row stop in a workgroup, the workgroup can send its partial sum to the next workgroup without waiting for its predecessor's partial sums. Removing the parallel segmented scan: if every thread in the workgroup contains at least one row stop, the parallel segmented scan step can be removed; the parallel segmented scan in our example could be removed this way.

46 Average Performance Improvement:
GFLOPS Average Performance Improvement: 42% over CUSPARSE, 40% over clSpMV COCKTAIL, 60% over clSpMV best single, 74% over CUSP. Here is the performance of the five schemes on Fermi; the last five bars are the harmonic means of the five schemes. Performance results on Fermi (GTX480)

47 Absolute memory footprint consumption of COO,BCOO,BCCOO formats
Here is the absolute memory footprint consumption of the COO, BCOO, and BCCOO formats. Absolute memory footprint consumption of COO, BCOO, BCCOO formats

