
IIIT, Hyderabad Performance Primitives for Massive Multithreading P J Narayanan Centre for Visual Information Technology IIIT, Hyderabad.


1 IIIT, Hyderabad Performance Primitives for Massive Multithreading P J Narayanan Centre for Visual Information Technology IIIT, Hyderabad

2 Lessons from GPU Computing
Massively multithreaded: several thousands to millions of threads are needed for good performance.
Good performance depends on a lot:
– Resource utilization: shared memory, registers
– Memory access: locality, arithmetic intensity
Optimum point may change with architecture.
– Retuning is infeasible for every developer
Solution: use standard libraries or primitives.
– Implemented well, keeping the trade-offs in mind
– Used by everyone: build your algorithms using them

3 What are the primitives?
Standard data-parallel primitives:
– scan, reduce
– sort, split
But also:
– segmented split
– scatter, gather, data-copy
– transpose
Could have domain-specific primitives:
– Graph theory, numerical algorithms
– Computer vision, image processing
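To make the vocabulary concrete, here is a minimal sequential sketch of what several of these primitives compute. It only illustrates the semantics, not how a GPU library implements them; the function names (exclusive_scan, add_reduce, split_index, gather, scatter, transpose) are chosen for this example and are not taken from the slides or from any library.

```python
from functools import reduce
import operator


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def add_reduce(values):
    """Reduce with addition: a single sum over all elements."""
    return reduce(operator.add, values, 0)


def split_index(keys):
    """Stable split: indices that group equal keys together, in key order."""
    return sorted(range(len(keys)), key=keys.__getitem__)


def gather(src, index):
    """out[i] = src[index[i]]"""
    return [src[i] for i in index]


def scatter(src, index):
    """out[index[i]] = src[i]; index is assumed to be a permutation."""
    out = [None] * len(src)
    for i, dst in enumerate(index):
        out[dst] = src[i]
    return out


def transpose(rows):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*rows)]
```

In this sketch, scatter with a given index undoes gather with the same index, which is the property the rearrangement steps later in the talk rely on.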

4 Computing Using Primitives
A typical program will/should have 75-80% of the work done through such primitives.
Application developer writes glue kernels to connect and clean up the components.
– Code for this is simple and perhaps unchanging
– Even inefficient implementations are non-critical
Example: A program with running time T uses primitives for 75% of operations. A new architecture doubles the performance of the primitives.
New running time (with no speedup for the non-primitive part):
– 0.5 * (0.75 T) + 0.25 T = 0.625 T, instead of the ideal 0.5 T
– 0.6 T if 80% was using primitives and 0.55 T if 90%
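The arithmetic in this example can be checked with a small helper. This is only a sketch of the calculation on the slide; the function name new_running_time and its parameters are illustrative.

```python
def new_running_time(primitive_fraction, speedup=2.0, total_time=1.0):
    """Running time after only the primitive part gets faster.

    primitive_fraction: share of the original time spent in primitives (0..1)
    speedup: factor by which the primitives speed up on the new architecture
    total_time: original running time T (1.0 gives the answer as a fraction of T)
    """
    accelerated = primitive_fraction * total_time / speedup
    unchanged = (1.0 - primitive_fraction) * total_time
    return accelerated + unchanged


for fraction in (0.75, 0.80, 0.90):
    print(f"{fraction:.0%} in primitives -> {new_running_time(fraction):.3f} T")
```

The closer the primitive share gets to 100%, the closer the program gets to the ideal 0.5 T without any change to the application code.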

5 Primitive vs Library
Both are motivated by similar thinking: Reuse!
A primitive is typically an algorithmic step, which finds diverse use.
– Used as a low-level step of an algorithm
A library function provides an end-to-end functionality.
– Used to achieve a high-level functionality
– Could be a “primitive” at a sufficiently high level!
Use a library if available. It avoids development even using primitives!

6 K-Means Clustering
An iteration (with N vectors of d dimensions and K clusters):
– Each vector finds its distance to each cluster center: O(N K d) operations
– Each vector attaches itself to the closest centre and takes its label: O(N K) operations to find the minimum distance
– Compute the mean of each cluster, that is, of the vectors with the same label: O(N d) operations to find K means
GPU implementation of clustering of 128-dimensional SIFT vectors, a frequent problem in Computer Vision.
[Figure: iteration loop with stages Compute Distances, Assign New Labels, Recompute Cluster Means]
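As a reference point for the three stages above, here is a minimal sequential sketch of one K-Means iteration. It is not the GPU implementation described in the talk; the name kmeans_iteration is illustrative, and keeping the old center for an empty cluster is an assumption of this sketch, not something the slides specify.

```python
def kmeans_iteration(vectors, centers):
    """One K-Means iteration on lists of equal-length numeric vectors."""
    k, d = len(centers), len(centers[0])

    # Stages 1 and 2: O(N K d) distance evaluations, then O(N K) minimum search.
    labels = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
        labels.append(min(range(k), key=dists.__getitem__))

    # Stage 3: O(N d) accumulation of per-cluster sums, then divide by counts.
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for v, lab in zip(vectors, labels):
        counts[lab] += 1
        for j, x in enumerate(v):
            sums[lab][j] += x

    new_centers = []
    for c in range(k):
        if counts[c]:
            new_centers.append([s / counts[c] for s in sums[c]])
        else:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(list(centers[c]))
    return labels, new_centers
```

The accumulation in stage 3 is the part that is awkward on a GPU, because many vectors add into the same cluster sum at once; the later slides replace it with split, gather, transpose and a segmented reduction.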

7 SIFT Clustering
Problem: Cluster a few (4-8) million 128-dimensional SIFT vectors into a few (1-2) thousand clusters using K-Means.
Representation: row major. That is, the N components of each of the 128 dimensions are stored together, tightly (N rows of d each).
Given: initial cluster means (could be random vectors).
Output: K cluster means and N labels, one for each input vector, giving cluster membership.
Large amount of computation; well suited to a GPU-like architecture.

8 Data Representation
[Figure: input vectors in row major (rows 1..N, components 1..d); cluster centers in row major (rows 1..K, components 1..d); cluster labels (entries 1..N)]

9 Distance Computation
1. Loop over K clusters, loading c cluster centers to shared memory at a time.
2. A block of t threads loops over all d components of t input vectors, loading component v_i and accumulating (c_i – v_i)^2.
3. Write distances in a K x N array, with K distances for a vector stored consecutively.
Shared memory is used to the maximum and all memory accesses are perfectly coalesced.

10 After Distance Evaluations
[Figure: vector-to-cluster distance matrix (rows 1..K, columns 1..N)]

11 Finding Closest Center
We need to know the index of the centre that gave the minimum distance.
A block of t threads loads t distances for a particular centre.
Keep track of the minimum distance and the corresponding index across the K centers.
Write the index into a new labels array of length N.
All memory accesses are perfectly coalesced.

12 New Cluster Centers
The new labels are given in the input vector order.
Next step: Find the mean of all vectors with the same label. Find their sum first.
Rearrange input vectors so that vectors of each category are placed together.
Column major storage makes the memory accesses non-coalesced and inefficient. Rearrange and convert to row major.
Summing is easy thereafter!

13 Finding New Centers
1. gIndex = splitGatherIndex(newLabels)
2. dCopy = gather(inputVectors, gIndex)
3. temp = transpose(dCopy)
4. Perform a segmented add-reduce of temp with segments at label boundaries. Store the results in a d x K array newCenters.
5. inputVectors = transpose(dCopy)
6. centers = transpose(newCenters)
Now, input vectors are rearranged with new cluster centers. (Need to also keep track of a composition of gIndex values to maintain the connection to the input vectors.)
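The steps above can be sketched sequentially as follows. This is a plain-Python illustration of the data flow only, under several assumptions: the helper names are hypothetical, the division by the per-cluster count is added here to turn sums into means, and clusters with no members simply produce no row (so the result has one row per label that actually occurs).

```python
def split_gather_index(labels):
    """Stable ordering that groups equal labels together."""
    return sorted(range(len(labels)), key=labels.__getitem__)


def gather(rows, index):
    return [rows[i] for i in index]


def transpose(rows):
    return [list(col) for col in zip(*rows)]


def segmented_add_reduce(values, segment_labels):
    """One sum per run of equal consecutive labels."""
    sums, prev = [], object()  # sentinel that equals no label
    for v, lab in zip(values, segment_labels):
        if lab != prev:        # a label boundary starts a new segment
            sums.append(0.0)
            prev = lab
        sums[-1] += v
    return sums


def find_new_centers(input_vectors, new_labels):
    g_index = split_gather_index(new_labels)               # step 1
    d_copy = gather(input_vectors, g_index)                # step 2
    sorted_labels = gather(new_labels, g_index)
    temp = transpose(d_copy)                               # step 3: d x N
    sums = [segmented_add_reduce(row, sorted_labels) for row in temp]  # step 4
    counts = segmented_add_reduce([1] * len(sorted_labels), sorted_labels)
    means = [[s / c for s, c in zip(row, counts)] for row in sums]
    return transpose(means), d_copy, g_index               # step 6: K x d
```

After the transpose, each of the d rows holds one component of all vectors, already grouped by cluster, so every segmented sum walks contiguous memory.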

14 Input: input vectors, n, cluster centers, dim, k
Output: new membership array (n*1), new cluster centers (k*dim), Global Index (n*1)

15 Storage per Block
[Figure: four input vectors (rows 1..4, components 1..dim) and one center (components 1..dim) held in shared memory]
Four input vectors are loaded per block and their corresponding differences are stored in shared memory, which consumes 2*2048 bytes of memory; the center is also in shared memory. A tree-based addition is performed on the differences for each vector.

16 Algorithm Flow
– Perform distance evaluations between the input and the current centers to generate a new membership array
– Apply split sort on the membership array, sorting as per cluster center ids
– Create flags and perform a segmented scan to get the histogram for each cluster
– Rearrange data as per cluster ids
– Perform a transpose on the rearranged data for coalesced access
– Use CUDPP segmented scan on the rearranged data, followed by CUDPP compact to extract the summation
– Divide the summation by the histogram generated for each cluster to get the new cluster centers
– Update the Global Index
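The flag, segmented scan and compact steps in this flow can be illustrated with a small sequential sketch. The functions below are written for this example and are not the CUDPP API; they only show what each step produces. The choice of tail flags (marking the last element of each segment) as the compaction mask is an assumption about how the per-cluster totals are extracted.

```python
def head_flags(sorted_labels):
    """1 where a new cluster segment starts, 0 elsewhere."""
    return [1 if i == 0 or sorted_labels[i] != sorted_labels[i - 1] else 0
            for i in range(len(sorted_labels))]


def tail_flags(heads):
    """1 on the last element of each segment (assumed compaction mask)."""
    return heads[1:] + [1] if heads else []


def segmented_inclusive_scan(values, heads):
    """Running sum that restarts at every head flag."""
    out, running = [], 0
    for v, h in zip(values, heads):
        running = v if h else running + v
        out.append(running)
    return out


def compact(values, keep):
    """Keep only the elements whose mask entry is set."""
    return [v for v, k in zip(values, keep) if k]


def cluster_histogram(sorted_labels):
    """Number of vectors per cluster, from a scan over a row of ones."""
    heads = head_flags(sorted_labels)
    scanned = segmented_inclusive_scan([1] * len(sorted_labels), heads)
    return compact(scanned, tail_flags(heads))
```

Running the same scan and compact over one transposed component row instead of a row of ones yields the per-cluster sums, which are then divided by this histogram.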

17 Global Index
The Global Index is initialized by Global Index[i] = i.
After sorting the membership array, we have sorted_membership_index[], i.e. the order in which the vectors are supposed to be arranged.
The sorted membership index after split sort is used to get the global index: Global Index[sorted_membership_index[i]] = i
In the final Global Index, i is the actual vector id of the input vectors and Global Index[i] is the position of the i’th vector id in the final rearranged input data.
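A small sketch of this bookkeeping follows. The single-iteration rule is the one stated above; carrying it across iterations by composing with the previous Global Index is an assumption based on the earlier remark about keeping a composition of gather indices, and the function names are illustrative.

```python
def invert_permutation(sorted_index):
    """position[src] = i whenever sorted_index[i] == src."""
    position = [0] * len(sorted_index)
    for i, src in enumerate(sorted_index):
        position[src] = i
    return position


def update_global_index(global_index, sorted_membership_index):
    """Compose the previous Global Index with the latest rearrangement.

    global_index[v] is the current position of original vector v.
    sorted_membership_index[i] is the current position of the vector
    that moves to position i after this iteration's split sort.
    """
    new_position = invert_permutation(sorted_membership_index)
    return [new_position[p] for p in global_index]


# Two iterations on four vectors; data is physically rearranged each time.
original = ["v0", "v1", "v2", "v3"]
data = list(original)
global_index = list(range(len(data)))       # Global Index[i] = i
for labels in ([1, 0, 1, 0], [1, 1, 0, 0]):  # labels in current data order
    order = sorted(range(len(labels)), key=labels.__getitem__)
    data = [data[i] for i in order]
    global_index = update_global_index(global_index, order)
```

With the identity as the starting index, the first update reduces to Global Index[sorted_membership_index[i]] = i, as on the slide.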

18 Distance Evaluation
Sequential approach takes O(dim) steps.
Simple tree-based parallel approach:
– Takes O(log(dim)) steps to evaluate the net distance
– In a block, only 256/2^i threads are active during the i’th iteration of a distance evaluation
– Effectively performed on the shared memory
– Reduces the number of steps from O(dim) to O(log(dim))

19 Distance Evaluation
[Figure: tree-based addition in log 8 steps, shown iteration by iteration]

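20 Algorithm for Distance evaluation
Algorithm (Input: d_input, d_centers, dim, no_centers)
  min = infinity
  for i = 0 to no_centers - 1 do
    // one thread per component; center i is assumed stored row major
    shared[threadIdx.x] = (d_input[id] - d_centers[i*dim + threadIdx.x])^2
    __syncthreads()
    for j = dim/2 down to 1 do
      if (threadIdx.x < j) then
        // add the upper half of the active range onto the lower half
        shared[threadIdx.x] += shared[threadIdx.x + j]
      end if
      __syncthreads()
      j = j/2
    end of inner for loop
    if min > shared[0] then
      min = shared[0]
    end if
    __syncthreads()   // before shared[] is overwritten for the next center
  end of outer for loop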
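The inner loop of this algorithm can be simulated sequentially to check the indexing and the step count. The sketch below stands in for one thread block: the list plays the role of shared memory, and the inner Python loop over tid plays the role of the active threads in one iteration. It assumes the number of values is a power of two, as with dim = 128; the names tree_sum and squared_distance are illustrative.

```python
def tree_sum(values):
    """Tree-based addition; returns (sum, number of parallel steps)."""
    shared = list(values)
    n = len(shared)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    steps = 0
    j = n // 2
    while j > 0:
        # Threads 0..j-1 are active. Each reads from the upper half
        # (index >= j) and writes to the lower half (index < j), so no
        # thread reads a slot that another thread writes in this step.
        for tid in range(j):
            shared[tid] += shared[tid + j]
        j //= 2      # the number of active threads halves every iteration
        steps += 1
    return shared[0], steps


def squared_distance(vector, center):
    """Squared Euclidean distance via the tree-based addition above."""
    return tree_sum([(a - b) ** 2 for a, b in zip(vector, center)])[0]
```

For 128 components this takes 7 parallel steps instead of 127 sequential additions.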

21 Kernel Level Execution
[Figure: active threads per iteration, from the first to the final iteration, for thread ids 1..128 with dim = 128]
In every iteration the number of active threads reduces by a factor of 2.

22 Kernel Functions
– Distance: evaluates the distance between vectors (block 128,4, grid n/4p,p)
– Get_long_membership: creates a variable of type long consisting of the membership id and the corresponding vector id
– SplitSort: sorts the membership array as per cluster ids
– CUDPP Segmented Scan: scan operation on the sorted membership array
– Get_flag: generates flags for CUDPP operations (block 256,1)
– Gather_histogram: gathers the final values after the scan
– Rearrange_data: arranges the input as per cluster ids (block 128,4, grid n/4p,p)
– Transpose: performs a transpose on the rearranged data
– CUDPP Compact: extracts the summed-up center values

23 Rearranging data
[Figure: input vectors in row major (rows 1..N, components 1..d) and the rearranged vectors in row major, grouped by cluster id 1..k based on the sorted membership array, each row tagged with its vector id]

24 Center Evaluation
[Figure: rearranged input vectors (one row per vector id, components 1..dim) and the transposed vectors (rows 1..dim, one column per vector id)]
We may apply a segmented scan on the transposed vectors, which is a coalesced operation. Flag values can be obtained with the help of the histograms generated for each cluster.

25 Global Index
[Figure: Global Index array over vector ids 1..n, from the first to the final iteration]
Updating the Global Index array after every iteration: Global Index[membership_sorted_index[i]] = i

26 Why use Split Sort, Transpose?
– Evaluating new centers requires concurrent writes, which are not easily parallelizable
– Split sort sorts the membership array, grouping together the vector ids that belong to the same cluster
– This is helpful for rearranging the entire input vectors as per their clusters
– Transpose provides coalesced access for center evaluation using segmented scan

27 Issues
Most of the time is consumed by distance evaluations as the input size increases.
Input size and number of clusters largely control the performance.

28 Result
Kmeans++ to generate initial centers. Time taken to generate initial cluster centers:
Input size    Cluster centers    CPU (P4, 2.4 GHz)    GPU (GTX 280)
1,000         80                 4480 ms              12.177 ms
10,000        800                39341.2 ms           670.06 ms
1,00,000      8000               897326.5 ms          62547.035 ms
1 Million     80000              9943472.8 ms         126392.1 ms

29 Results
Variation with number of input vectors (128 dimension). Time taken per iteration to generate the new membership array and new cluster centers (excluding time for kmeans++):
Input size    Cluster centers    CPU (P4, 2.4 GHz)    GPU (GTX 280)
1,000         80                 370 ms               9.91 ms
10,000        800                82900 ms             487.3 ms
1,00,000      8000               679923.1 ms          36623.58 ms
1 Million     8000               5189450.4 ms         45789.29 ms

30 Result
Variation of cluster centers. N = 10000, Dimension = 128
Input size    Cluster centers    GPU (GTX 280)
1,00,000      500                2188.91 ms
1,00,000      1000               4486.37 ms
1,00,000      2000               9241.58 ms
1,00,000      4000               18419.71 ms
1,00,000      8000               36623.58 ms

31 Result
Variation with dimension of SIFT vector. N = 10000, Cluster centers = 8000
Input size    Dimension    GPU (GTX 280)
1,00,000      16           118.91 ms
1,00,000      32           997.3 ms
1,00,000      64           8623.58 ms
1,00,000      128          36623.58 ms

32 Result
Coalesced vs Non-Coalesced. Coalesced involves transpose followed by segmented scan, and non-coalesced involves gather followed by segmented scan.
Input size – Cluster centers    Non Coalesced    Coalesced
1,000 – 80                      0.043 ms         0.077 ms
10,000 – 800                    1.28 ms          0.217 ms
1,00,000 – 8000                 15.45 ms         1.955 ms
1,00,000 – 80000                83.24 ms         19.343 ms

33 Result
Membership vs New Centers. The membership generation consumes the major chunk of time.
Input size – Cluster centers    Membership     New centers
1,000 – 80                      4.23 ms        5.68 ms
10,000 – 800                    369.07 ms      118.23 ms
1,00,000 – 8000                 36465.2 ms     158.38 ms
10,00,000 – 8000                45559.48 ms    229.83 ms

