
CO-CLUSTERING USING CUDA

Co-Clustering Explained
 Problem:
  - Large binary matrix of samples (rows) and features (columns)
  - Which samples should be grouped together? Why? What are the shared features?
 Co-clustering gives you the "why" explicitly: a correlated sample/feature pair
  - Row cluster: s1 and s3 are in a group
  - Column cluster: the distinguishing features are 2, 3, and 5
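The slide's example figure did not survive transcription; as a purely hypothetical illustration (the data below is invented to match the description above), such a matrix might look like:

// Hypothetical 4x6 binary matrix: rows are samples s1..s4, columns are features 1..6.
// Samples s1 and s3 share 1s in features 2, 3, and 5, so one plausible co-clustering
// puts {s1, s3} in a row cluster whose distinguishing column cluster is {2, 3, 5}.
int example[4][6] = {
    /* s1 */ {0, 1, 1, 0, 1, 0},
    /* s2 */ {1, 0, 0, 1, 0, 1},
    /* s3 */ {0, 1, 1, 0, 1, 0},
    /* s4 */ {1, 0, 0, 0, 0, 1}
};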

Co-Clustering - Details
 Uses Information-Theoretic Co-clustering, as parallelized for the Hadoop architecture in: "Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining", Papadimitriou et al., ICDM 2008
 Partition the entire matrix into row groups and column groups
  - Minimize the length of the encoding of the resulting partitioned matrix
  - Competing code-length factors: the number of row groups and column groups vs. the homogeneity of the clusters
  - Iterate over the rows, rearranging and sub-partitioning to find a better encoding using a heuristic
  - Repeat for columns, then rows again, until a local optimum is found
 Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
 Credit: Chakrabarti et al., KDD 2004
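As a rough host-side sketch of this alternating refinement loop (the helper functions and names below are illustrative placeholders, not the paper's or the project's actual API):

// Hypothetical helpers standing in for the steps listed above.
double totalCodeLength(const int *matrix, const int *rowGroup, const int *colGroup);
void assignRowsToBestGroups(const int *matrix, int *rowGroup, const int *colGroup);
void assignColsToBestGroups(const int *matrix, const int *rowGroup, int *colGroup);

void refineUntilLocalOptimum(const int *matrix, int *rowGroup, int *colGroup) {
    double cost = totalCodeLength(matrix, rowGroup, colGroup);
    for (;;) {
        assignRowsToBestGroups(matrix, rowGroup, colGroup);  // move each row to the group that shrinks the encoding
        assignColsToBestGroups(matrix, rowGroup, colGroup);  // then do the same for the columns
        double newCost = totalCodeLength(matrix, rowGroup, colGroup);
        if (newCost >= cost) break;                          // no improvement: local optimum reached
        cost = newCost;
    }
}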

Implementation - Basics
 Initial matrix generation: CPU
 Initial random row/column group assignment: CPU
 Memory structures are very simple: arrays of ints
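A minimal sketch of that host-side initialization, assuming an n x m matrix stored row-major, k row groups, and l column groups, with rand() from <stdlib.h> (all names are illustrative and this runs inside the host setup code):

// Dense binary matrix plus flat int arrays holding the group assignments.
int *matrix   = (int *)malloc((size_t)n * m * sizeof(int));
int *rowGroup = (int *)malloc(n * sizeof(int));   // rowGroup[i] in [0, k)
int *colGroup = (int *)malloc(m * sizeof(int));   // colGroup[j] in [0, l)

for (int i = 0; i < n * m; i++) matrix[i]   = rand() % 2;   // random binary entries
for (int i = 0; i < n; i++)     rowGroup[i] = rand() % k;   // random initial row group
for (int j = 0; j < m; j++)     colGroup[j] = rand() % l;   // random initial column group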

Implementation – Stats step 1
 Statistics calculations:
  - Calculate a statistic for each row within each column group
  - The statistic is the number of 1s the row has in that column group
  - Straightforward parallelization (each thread works on one row at a time), using global memory
 (Figure: matrix blocked by row groups and column groups; example: Stat(Row 3, ColumnGroup 3) = 1)
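A minimal CUDA sketch of such a kernel, assuming the matrix, the column-group assignments, and the per-row statistics all live in global memory and that stats has been zeroed beforehand (identifiers are illustrative, not the project's actual code):

// One thread per row: count the 1s that fall into each column group.
// stats is laid out as stats[row * numColGroups + group].
__global__ void rowStatsKernel(const int *matrix, const int *colGroup, int *stats,
                               int numRows, int numCols, int numColGroups)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;
    for (int j = 0; j < numCols; j++)
        if (matrix[row * numCols + j])
            stats[row * numColGroups + colGroup[j]] += 1;   // no races: this row belongs to one thread
}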

Room For Improvement
 Calculate the row statistics using the histogram algorithm from the textbook:
  - Block the columns
  - Assign one thread block to each block of columns
  - Compute shared-memory histograms within each block
  - Merge back to global memory when finished
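A sketch of that shared-memory histogram variant (a generic version of the textbook pattern, not the project's exact kernel; launched with a 2D grid, one row per blockIdx.y, and numColGroups * sizeof(int) bytes of dynamic shared memory):

__global__ void rowStatsSharedKernel(const int *matrix, const int *colGroup, int *stats,
                                     int numCols, int numColGroups)
{
    extern __shared__ int localHist[];                      // one counter per column group
    int row = blockIdx.y;

    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        localHist[g] = 0;
    __syncthreads();

    int col = blockIdx.x * blockDim.x + threadIdx.x;        // this block's tile of columns
    if (col < numCols && matrix[row * numCols + col])
        atomicAdd(&localHist[colGroup[col]], 1);            // build the histogram in shared memory
    __syncthreads();

    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        atomicAdd(&stats[row * numColGroups + g], localHist[g]);   // merge back to global memory
}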

Implementation – Stats step 2
 Calculate the cost for each row group of each column group
  - Essentially a reduce over the per-row data
  - Block the rows and assign each block of rows to a thread block
  - Use shared memory and atomics to build a histogram of all rows in a given row group
  - Merge the shared histogram with the global histogram for that row group
  - Iterate over all row groups
 (Figure: matrix blocked by row groups and column groups; example: Stat(RowGroup 1, ColumnGroup 3) = 2)
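A rough sketch of that reduction, again with illustrative names; this version grid-strides over the rows and accumulates into a shared-memory histogram indexed by (row group, column group), which is a slight simplification of the per-row-group blocking described above:

// Reduce the per-row stats into per-row-group totals.
// groupStats is laid out as groupStats[rowGroupId * numColGroups + colGroupId];
// launch with numRowGroups * numColGroups * sizeof(int) bytes of dynamic shared memory.
__global__ void rowGroupStatsKernel(const int *rowStats, const int *rowGroup, int *groupStats,
                                    int numRows, int numRowGroups, int numColGroups)
{
    extern __shared__ int localHist[];
    int histSize = numRowGroups * numColGroups;

    for (int i = threadIdx.x; i < histSize; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < numRows;
         row += gridDim.x * blockDim.x)
        for (int g = 0; g < numColGroups; g++)
            atomicAdd(&localHist[rowGroup[row] * numColGroups + g],
                      rowStats[row * numColGroups + g]);
    __syncthreads();

    for (int i = threadIdx.x; i < histSize; i += blockDim.x)
        atomicAdd(&groupStats[i], localHist[i]);             // merge into the global totals
}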

Implementation – Row/Col Group Optimization
 For each row, find the optimal group it could belong to
  - Parallelized straightforwardly: one row per thread, with a loop and stride to cover all rows
  - Each row's calculation goes through all row groups and determines the global cost of moving the row to that group
 Move all rows to their optimal groups
 Recompute the statistics
 Repeat for column groups
 Continue alternating row/column groupings until convergence
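A hedged sketch of the per-row optimization kernel (costOfMove below is a stub standing in for the global code-length evaluation, which depends on the statistics described earlier):

// Stub: the real version would compute the encoding cost if 'row' joined 'group'.
__device__ double costOfMove(int row, int group) { return 0.0; }

// One row per thread, with a grid-stride loop: evaluate every row group and record the cheapest.
__global__ void bestRowGroupKernel(int *bestGroup, int numRows, int numRowGroups)
{
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < numRows;
         row += gridDim.x * blockDim.x)
    {
        int    best     = 0;
        double bestCost = costOfMove(row, 0);
        for (int g = 1; g < numRowGroups; g++) {
            double c = costOfMove(row, g);
            if (c < bestCost) { bestCost = c; best = g; }
        }
        bestGroup[row] = best;   // rows are actually moved, and statistics recomputed, after the kernel
    }
}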

Room For Improvement
 The parallelization could be more sophisticated
  - Could block the rows and compute the cost of each row joining each row group in parallel
  - Using shared-memory atomics to identify the minimum cost
 In practice, this algorithm heavily favors a small number of row and column groups
  - The gain from that parallelization would therefore be small

Implementation – Outer Loop
 After a local minimum is found, change the initial number of row and column groups and retry
  - Change the number of row groups or the number of column groups, up or down
  - Continue changing the number of row or column groups in that direction until the cost fails to decrease
  - Try both directions in both dimensions before stopping
 The outer loop is performed on the CPU
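A host-side sketch of that outer search; runCoClustering is a hypothetical stand-in for the inner GPU loop, returning the converged cost for a given number of row groups (k) and column groups (l):

// Hypothetical entry point for the inner (GPU) loop described on the earlier slides.
double runCoClustering(int numRowGroups, int numColGroups);

void outerSearch(int *k, int *l) {
    double best = runCoClustering(*k, *l);
    int deltas[4][2] = { {+1, 0}, {-1, 0}, {0, +1}, {0, -1} };   // both directions in both dimensions
    for (int d = 0; d < 4; d++) {
        for (;;) {
            int nk = *k + deltas[d][0], nl = *l + deltas[d][1];
            if (nk < 1 || nl < 1) break;
            double cost = runCoClustering(nk, nl);
            if (cost >= best) break;           // stop in this direction once the cost fails to decrease
            best = cost; *k = nk; *l = nl;
        }
    }
}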

Room for Improvement
 The outer loop could parallelize the inner-loop work across different GPUs
  - Each GPU could explore the different dimensions and directions in parallel

Implementation – CPU + Validation
 The CPU implementation performed all of the steps described earlier, but sequentially
 Validation:
  - Used the CPU implementation of the statistics calculations to validate the GPU stats calculations
  - The CPU and GPU implementations of log differ, so the cost calculations were validated by allowing a 5% tolerance between results
  - Did not have time to validate the overall algorithm or to visualize its outputs to check that the co-clusters produced were reasonable
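The 5% tolerance check amounts to a relative comparison along these lines (a sketch, not the project's exact validation code):

#include <math.h>
#include <stdbool.h>

// Accept the GPU cost if it is within 5% of the CPU reference value.
bool costsAgree(double cpuCost, double gpuCost) {
    return fabs(cpuCost - gpuCost) <= 0.05 * fabs(cpuCost);
}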

Timing Measurements
 In the CPU implementation, time was measured with clock_t / CLOCKS_PER_SEC
 In the GPU implementation, time was measured with CUDA events
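For reference, the two timing approaches look roughly like this (error checking omitted; the timed work is elided):

#include <time.h>
#include <cuda_runtime.h>

// CPU timing with clock():
clock_t t0 = clock();
/* ... run the sequential co-clustering implementation ... */
double cpuSeconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

// GPU timing with CUDA events:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
/* ... launch the co-clustering kernels ... */
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuMilliseconds = 0.0f;
cudaEventElapsedTime(&gpuMilliseconds, start, stop);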

Development Lessons Learned
 CUDA and structured data are a bad combination
  - Even structs of arrays are impossible to deal with
  - Host-side pointer math on device pointers does not work
 The CUDA API has REALLY unfriendly error messages
  - Take care to do very, very little through that API
 __device__ variables declared globally must be passed to kernels
  - Runtime errors otherwise
 You can malloc and free memory (from the device heap) in device code as of CUDA 3.2

Development Lessons Learned (cont.)
 Visual Studio's CUDA integration leaves a lot to be desired
  - Even with all optimizations removed, you still can't set breakpoints everywhere
  - Many variables show as freed
  - No real-time, in-editor compile errors
 But Visual Studio does give nice auto-complete and go-to-definition navigation
 No CUDA linker, so separate files must be directly #include'd

Experiment - Environment
 Float.cs.drexel.edu
 CPU: 4 quad-core Intel Xeon L5360 processors
 GPU: 2 NVIDIA GeForce GTX 580s

Experiment - Description
 The sequential (CPU) and parallel (GPU) implementations were tested on square matrices of order 100, 1000, and
  - Larger matrices caused memory problems
 The GPU was tested with varying block and thread counts
  - Number of blocks: 10, 100, 5000
  - Number of threads: 10, 100, 1024 (the maximum)
 The resulting co-clusters usually stayed within a small range of row/column group counts, regardless of matrix order
 Row and column groupings matter for the calculation of the matrix statistics, since rows and columns are blocked by these groupings

Experiment Results (chart not transcribed; x-axis: Num Threads)

Experiment Results
 For a small number of blocks, 100-thread performance peaks when num_blocks * num_threads = matrix_order
  - I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
 Slowdown occurs when the matrix order exceeds the total number of threads and more work must be done serially

Experiment - Results (chart not transcribed; x-axis: Num Threads)

Experiment Results (chart not transcribed; x-axis: Num Threads)

Experiment Results
 Interestingly, the maximum speedup was the same for all block counts
  - Roughly speaking, as long as num_blocks * num_threads >= matrix order, the maximum speedup of ~70 is achieved
  - 10 threads never got there; due to block-scheduling overhead? Possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case?
 Maxing out the thread count is counter-productive on smaller matrices
  - Hypothesis: when the block count is excessive (as it is for small matrices), scheduling large blocks of threads that return immediately is costly

Experiment Results (chart not transcribed; x-axis: Num Threads)

Experiment Results (chart not transcribed; x-axis: Num Threads)

Experiment Results (chart not transcribed; x-axis: Num Threads)

Experiment Results
 Efficiency is consistently highest for smaller numbers of blocks and smaller numbers of threads within those blocks
 Hypothesis: the overhead of starting blocks and threads must be high enough to yield diminishing returns as blocks and threads are added