
1 CO-CLUSTERING USING CUDA

2 Co-Clustering Explained
- Problem:
  - Large binary matrix of samples (rows) and features (columns)
  - Which samples should be grouped together? Why?
  - What are the shared features?
- Co-clustering provides the "why" explicitly: correlated sample/feature pairs
- Example: row cluster: s1 and s3 are in a group; column cluster: the distinguishing features are 2, 3, and 5

3 Co-Clustering - Details
- Uses Information-Theoretic Co-clustering, as parallelized for the Hadoop architecture in: Papadimitriou et al., "Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining," ICDM 2008
- Partition the entire matrix into row groups and column groups
  - Minimize the length of the encoding of the resulting partitioned matrix
  - Competing code-length factors: the number of row and column groups vs. the homogeneity of the clusters
  - Iterate over the rows, rearranging and sub-partitioning to find a better encoding using a heuristic (see the loop sketch below)
  - Repeat for columns, then rows again, until a local optimum is found
  - Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
  - Credit: Chakrabarti et al., KDD 2004
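The alternating search described above fits in a few lines of host code. This is a minimal sketch; total_code_length, reassign_rows, and reassign_cols are hypothetical helpers standing in for the real steps, not the presenter's actual functions:

    // Minimal host-side sketch of the alternating minimization above.
    extern double total_code_length(void);  // encoding length of current partition
    extern void   reassign_rows(void);      // move each row to its best row group
    extern void   reassign_cols(void);      // move each column to its best column group

    void cocluster_inner_loop(void) {
        double cost = total_code_length();
        for (;;) {
            reassign_rows();
            reassign_cols();
            double new_cost = total_code_length();
            if (new_cost >= cost) break;    // no improvement: local optimum found
            cost = new_cost;
        }
    }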

4 Implementation - Basics
- Initial matrix generation: CPU
- Initial random row/column group assignment: CPU
- Memory structures are very simple: arrays of ints (see the sketch below)
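As a sketch of what "arrays of ints" might mean here (all names are assumed for illustration, not taken from the original code):

    // Hypothetical layout: flat int arrays, matrix stored row-major.
    int *h_matrix;    // n_rows * n_cols entries of 0/1 on the host
    int *d_matrix;    // device copy, filled with cudaMemcpy after cudaMalloc
    int *d_row_group; // n_rows ints: the group id assigned to each row
    int *d_col_group; // n_cols ints: the group id assigned to each column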

5 Implementation – Stats step 1
- Statistics calculations:
  - Calculate statistics for each row of each column group
  - The statistic is the number of 1s in a column group
  - Straightforward parallelization (each thread works on one row at a time), global memory (see the kernel sketch below)
[Figure: example matrix with row and column group labels; Stat(Row 3, ColumnGroup 3) = 1]
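A minimal kernel sketch of this step, assuming the flat-array layout above (names and signature are illustrative, not the presenter's code; stats is assumed zeroed before launch):

    // One thread per row; each thread counts the 1s in its row that fall
    // into each column group, writing a per-row table in global memory.
    __global__ void row_stats(const int *matrix, const int *col_group,
                              int *stats, int n_rows, int n_cols,
                              int n_col_groups) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        for (int c = 0; c < n_cols; ++c)
            if (matrix[row * n_cols + c])
                stats[row * n_col_groups + col_group[c]] += 1;  // no atomics:
                                                                // each thread owns one row
    }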

6 Room For Improvement
- Calculate row statistics with the histogram algorithm from the textbook (sketched below):
  - Block the columns
  - Assign one thread block to each column block
  - Compute shared-memory histograms within the block
  - Merge back to global memory when finished
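A sketch of the proposed shared-memory variant. Assumptions (mine, not the slide's): MAX_COL_GROUPS bounds the histogram size, and the kernel is launched with a 2D grid of (ceil(n_cols / blockDim.x), n_rows), one matrix row per blockIdx.y and one column tile per blockIdx.x:

    #define MAX_COL_GROUPS 256  // assumed bound so the histogram fits in shared memory

    __global__ void row_stats_shared(const int *matrix, const int *col_group,
                                     int *stats, int n_cols, int n_col_groups) {
        __shared__ int hist[MAX_COL_GROUPS];
        int row = blockIdx.y;                             // one matrix row per grid row
        int c   = blockIdx.x * blockDim.x + threadIdx.x;  // this block's column tile
        for (int g = threadIdx.x; g < n_col_groups; g += blockDim.x)
            hist[g] = 0;                                  // clear the shared histogram
        __syncthreads();
        if (c < n_cols && matrix[row * n_cols + c])
            atomicAdd(&hist[col_group[c]], 1);            // shared-memory atomic
        __syncthreads();
        for (int g = threadIdx.x; g < n_col_groups; g += blockDim.x)
            if (hist[g])                                  // merge back to global memory
                atomicAdd(&stats[row * n_col_groups + g], hist[g]);
    }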

7 Implementation – Stats step 2
- Calculate the cost for each row group of each column group
  - Essentially a reduce over the per-row data
  - Block the rows; assign each row block to a thread block
  - Use shared memory and atomics to build a histogram of all rows in a given row group
  - Merge the shared histogram with the global histogram for that row group
  - Iterate over all row groups (see the kernel sketch below)
[Figure: example matrix with row and column group labels; Stat(RowGroup 1, ColumnGroup 3) = 2]
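One way to realize this reduce as a kernel (a sketch with assumed names): the host iterates over row groups, launching once per group with n_col_groups * sizeof(int) bytes of dynamic shared memory:

    // Thread blocks tile the rows; rows belonging to target_group add their
    // per-row statistics into a shared histogram, which is then merged into
    // the global histogram for that row group.
    __global__ void group_stats(const int *row_stats, const int *row_group,
                                int *group_hist, int target_group,
                                int n_rows, int n_col_groups) {
        extern __shared__ int hist[];                 // n_col_groups ints
        for (int g = threadIdx.x; g < n_col_groups; g += blockDim.x)
            hist[g] = 0;
        __syncthreads();
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n_rows && row_group[row] == target_group)
            for (int g = 0; g < n_col_groups; ++g)    // add this row's stats
                atomicAdd(&hist[g], row_stats[row * n_col_groups + g]);
        __syncthreads();
        for (int g = threadIdx.x; g < n_col_groups; g += blockDim.x)
            atomicAdd(&group_hist[target_group * n_col_groups + g], hist[g]);
    }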

8 Implementation – Row/Col Group Optimization
- For each row, find the optimal group it could belong to
  - Parallelized straightforwardly: one row per thread, with a loop and stride to cover all rows (sketched below)
  - Each row's calculation goes through all row groups and determines the global cost of moving to each
- Move all rows to their optimal groups
- Recompute the statistics
- Repeat for column groups
- Continue alternating row/column groupings until convergence
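A sketch of the grid-stride ("loop and stride") version of this step. move_cost is a hypothetical __device__ helper standing in for the real code-length calculation, assumed defined elsewhere:

    #include <float.h>

    __device__ float move_cost(const int *row_stats, int row, int group,
                               int n_col_groups);     // assumed, defined elsewhere

    // One row per thread, striding so a fixed-size grid covers all rows.
    __global__ void pick_best_groups(const int *row_stats, int *best_group,
                                     int n_rows, int n_row_groups,
                                     int n_col_groups) {
        for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < n_rows;
             row += gridDim.x * blockDim.x) {
            float best = FLT_MAX;
            int   arg  = 0;
            for (int g = 0; g < n_row_groups; ++g) {  // try every row group
                float c = move_cost(row_stats, row, g, n_col_groups);
                if (c < best) { best = c; arg = g; }
            }
            best_group[row] = arg;                    // moves are applied after the kernel
        }
    }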

9 Room For Improvement
- The parallelization could be more sophisticated
  - Could block the rows and compute the cost of a row joining each row group in parallel
  - Use shared memory to identify the minimum cost (sketched below)
- In practice, this algorithm heavily favors a small number of row and column groups
  - The gain from this parallelization would therefore be small
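The slide suggests shared-memory atomics; a shared-memory tree reduction is another way to get the minimum, sketched here under my assumptions that the per-(row, group) costs are precomputed and that blockDim.x is 256, a power of two at least n_row_groups:

    #include <float.h>

    // One block per row; thread t scores candidate group t, then a tree
    // reduction in shared memory finds the cheapest group for the row.
    __global__ void min_cost_group(const float *cost, int *best,
                                   int n_row_groups) {
        __shared__ float c[256];   // assumes blockDim.x == 256
        __shared__ int   g[256];
        int t = threadIdx.x;
        c[t] = (t < n_row_groups) ? cost[blockIdx.x * n_row_groups + t] : FLT_MAX;
        g[t] = t;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
            if (t < s && c[t + s] < c[t]) { c[t] = c[t + s]; g[t] = g[t + s]; }
            __syncthreads();
        }
        if (t == 0) best[blockIdx.x] = g[0];             // cheapest group wins
    }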

10 Implementation – Outer Loop
- After a local minimum is found, change the initial number of row and column groups and retry
  - Change the number of row groups or the number of column groups, up or down
  - Continue changing the count in that direction until the cost fails to decrease
  - Try both directions in both dimensions before stopping
- The outer loop is performed on the CPU (sketched below)
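A host-side sketch of this outer loop; run_inner_loop is a hypothetical wrapper around the whole GPU inner loop above, returning the final cost for a given group-count pair:

    extern double run_inner_loop(int k_rows, int k_cols);  // assumed helper

    void outer_loop(int *k_rows, int *k_cols) {
        double best = run_inner_loop(*k_rows, *k_cols);
        for (int dim = 0; dim < 2; ++dim)            // rows, then columns
            for (int dir = -1; dir <= 1; dir += 2)   // down, then up
                for (;;) {
                    int kr = *k_rows + (dim == 0 ? dir : 0);
                    int kc = *k_cols + (dim == 1 ? dir : 0);
                    if (kr < 1 || kc < 1) break;
                    double cost = run_inner_loop(kr, kc);
                    if (cost >= best) break;         // stop when cost fails to decrease
                    best = cost; *k_rows = kr; *k_cols = kc;
                }
    }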

11 Room for Improvement
- The outer loop could parallelize the inner-loop work across different GPUs
  - Each GPU could explore a different dimension and direction in parallel (see below)
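With the CUDA runtime this could be as simple as selecting a device per search direction. A minimal sketch; explore_direction is hypothetical, and for true overlap each call would run in its own host thread or use asynchronous launches:

    #include <cuda_runtime.h>

    extern void explore_direction(int dim, int dir);  // assumed: one inner search

    void fan_out(void) {
        int n_dev = 0;
        cudaGetDeviceCount(&n_dev);
        for (int d = 0; d < n_dev; ++d) {
            cudaSetDevice(d);                         // route launches to GPU d
            explore_direction(d / 2, (d % 2) ? +1 : -1);  // one combination per GPU
        }
    }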

12 Implementation – CPU + Validation
- The CPU implementation performed all of the steps described earlier, but sequentially
- Validation
  - Used the CPU implementation of the statistics calculations to validate the GPU statistics calculations
  - The CPU and GPU log implementations differ, so cost calculations were validated by allowing a 5% tolerance between results (see the check below)
  - Did not have time to validate the overall algorithm or visualize its outputs to see whether the co-clusters produced were reasonable
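The 5% check might look like this (a sketch; names are illustrative):

    #include <math.h>

    // Returns nonzero when the GPU cost is within 5% of the CPU cost.
    int costs_match(double cpu_cost, double gpu_cost) {
        return fabs(gpu_cost - cpu_cost) <= 0.05 * fabs(cpu_cost);
    }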

13 Timing Measurements
- CPU implementation: time measured with clock_t / CLOCKS_PER_SEC
- GPU implementation: time measured with CUDA events
- Both mechanisms are shown below
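The two timing mechanisms as they are typically used (the body is a fragment; the elided work is a placeholder):

    #include <time.h>
    #include <cuda_runtime.h>

    void time_both(void) {
        // CPU implementation: coarse timing with clock()
        clock_t t0 = clock();
        // ... run the sequential implementation ...
        double cpu_sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        // GPU implementation: CUDA events bracket the kernel work
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        // ... kernel launches ...
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                  // wait for the stop event
        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, start, stop);  // elapsed milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        (void)cpu_sec; (void)gpu_ms;                 // report as needed
    }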

14 Development Lessons Learned
- Mixing CUDA and structured data is a bad idea
  - Even structs of arrays are impossible to deal with
  - Host-side pointer math on device pointers does not work
- The CUDA API has REALLY unfriendly error messages
  - Take care to do very, very little through that API
- __device__ variables declared globally must be passed to kernels
  - Runtime errors occur otherwise
- You can malloc and free memory in device code as of CUDA 3.2 (the allocations come from a heap in global memory; demonstrated below)
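For the last bullet, a minimal demonstration of the device-side heap, with per-thread malloc/free inside a kernel. Note my caveat that the allocation lives in global memory, and the feature requires compute capability 2.0+:

    __global__ void scratch_demo(int n) {
        int *buf = (int *)malloc(n * sizeof(int));  // from the device heap (global memory)
        if (buf != NULL) {
            for (int i = 0; i < n; ++i) buf[i] = i; // ... use the scratch space ...
            free(buf);                              // must be freed in device code too
        }
    }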

15 Development Lessons Learned (cont.)
- Visual Studio CUDA integration leaves a lot to be desired
  - Even with all optimizations removed, you still can't set breakpoints everywhere
  - Many variables show as freed
  - No real-time, in-editor compile errors
  - But Visual Studio does give nice auto-complete and go-to-definition navigation
- No CUDA linker, so separate files must be directly #include'd

16 Experiment - Environment
- Float.cs.drexel.edu
- CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
- GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz

17 Experiment - Description
- Sequential (CPU) and parallel (GPU) implementations tested on square matrices of order 100, 1000, and 10000
  - Larger matrices caused memory problems
- GPU tested with varying block and thread counts
  - Num blocks: 10, 100, 5000
  - Num threads: 10, 100, 1024 (max)
- The resulting co-clusters usually stayed in the 50-200 row/column-group range, regardless of matrix order
- Row and column groupings matter in the matrix-statistics calculation, since rows and columns are blocked by these groupings

18 Experiment Results
[Chart: timing results vs. num threads; data not recoverable from the transcript]

19 Experiment Results
- For a small number of blocks, 100-thread performance peaks when num_blocks * num_threads = matrix_order (e.g., 100 blocks of 100 threads gives 10,000 threads, one per row of the order-10000 matrix)
- I would expect this to be the optimal configuration when num_blocks ~= num_row_groups ~= num_col_groups
- Slowdown occurs when the matrix order exceeds the total number of threads, so more work must be done serially

20 Experiment - Results
[Chart: timing results vs. num threads; data not recoverable from the transcript]

21 Experiment Results
[Chart: timing results vs. num threads; data not recoverable from the transcript]

22 Experiment Results
- Interestingly, the maximum speedup was the same for all block counts
- Roughly speaking, as long as num_blocks * num_threads >= matrix order, the maximum speedup of ~70x is achieved
  - 10 threads never got there; perhaps due to block-scheduling overhead?
  - Possibly the cost of copying to shared memory for block processing was never recouped in the 10-thread case
- Maxing out the thread count is counter-productive on smaller matrices
  - Hypothesis: when the block count is excessive (as for small matrices), scheduling large blocks of threads that return immediately is costly

23 Experiment Results
[Chart: timing results vs. num threads; data not recoverable from the transcript]

24 Experiment Results
[Chart: timing results vs. num threads; data not recoverable from the transcript]

25 Experiment Results
[Chart: timing results vs. num threads; data not recoverable from the transcript]

26 Experiment Results
- Efficiency is consistently highest for smaller numbers of blocks and smaller numbers of threads within those blocks
- Hypothesis: the overhead of starting blocks and threads is high enough to yield diminishing returns as blocks and threads are added

