
1 Scalable Data Clustering with GPUs – Andrew D. Pangborn, Thesis Defense, Rochester Institute of Technology, Computer Engineering Department, Friday, May 14th, 2010

2 Thesis Objectives
– Develop high performance parallel implementations of data clustering algorithms leveraging the computational power of GPUs and the CUDA framework
– Make clustering flow cytometry data sets practical on a single lab machine
– Use OpenMP and MPI for distributing work to multiple GPUs in a grid computing or commodity cluster environment

3 Outline
– Overview of the application domain
– GPU architecture, CUDA
– Parallel implementation
– Results
– Multi-GPU architecture
– More results

4 Data Clustering
– A form of unsupervised learning that groups similar objects into relatively homogeneous sets called clusters
– How do we define similarity between objects? Depends on the application domain and implementation
– Not to be confused with data classification, which assigns objects to predefined classes

5 Data Clustering Algorithms Clustering Taxonomy from “Data Clustering: A Review”, by Jain et al. [1]

6 Example: Iris Flower Data

7 Flow Cytometry
– Technology used by biologists and immunologists to study the physical and chemical characteristics of cells
– Example: measure T lymphocyte counts to monitor HIV infection [2]

8 Flow Cytometry
– Cells in a fluid pass through a laser
– Measure physical characteristics with scatter data
– Add fluorescently labeled antibodies to measure other aspects of the cells

9 Flow Cytometer

10 Flow Cytometry Data Sets
– Multiple measurements (dimensions) for each event: upwards of 6 scatter dimensions and 18 colors per experiment
– On the order of 10^5 – 10^6 events
– ~24 million values that must be clustered
– Lots of potential clusters
– Example: 10^6 events, 100 clusters, 24 dimensions
– C-means: O(NMD) = 2.4 x 10^9
– Expectation Maximization: O(NMD^2) = 5.7 x 10^10

11 Parallel Processing
– Clustering can take many hours on a single CPU
– Data growth is accelerating
– Performance gains of single-threaded applications are slowing down
– Fortunately, many data clustering algorithms lend themselves naturally to parallel processing

12 Multi-core
– Current trends:
– Adding more cores
– More SIMD: SSE3/AVX
– Application-specific extensions: VT-x, AES-NI
– Point-to-point interconnects, higher memory bandwidths

13 GPU Architecture Trends
– Figure based on Intel Larrabee presentation at Supercomputing 2009: CPUs and GPUs converging from fixed-function, through partially programmable, to fully programmable hardware (multi-threaded, multi-core, many-core; Intel Larrabee, NVIDIA CUDA)

14 GPU vs. CPU Peak Performance

15 Tesla GPU Architecture

16 Tesla Cores

17 GPGPU: General-Purpose computing on Graphics Processing Units
– Past: programmable shader languages (Cg, GLSL, HLSL); use textures to store data
– Present: multiple frameworks using traditional general-purpose systems and high-level languages

18 CUDA: Software Stack Image from [5]

19 CUDA: Program Flow
– Application start → search for CUDA devices → load data on host → allocate device memory → copy data to device → launch device kernels to process data → copy results from device to host memory
– (Diagram: host CPU and main memory connected to GPU cores and device memory over PCI-Express)
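A minimal host-side sketch of this flow; the kernel name `clusterKernel` and the placeholder work it does are illustrative stand-ins, not the thesis code:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Stand-in for the clustering kernels described later in the talk.
__global__ void clusterKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];   // placeholder work
}

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);               // search for CUDA devices
    if (deviceCount == 0) { fprintf(stderr, "no CUDA device found\n"); return 1; }
    cudaSetDevice(0);

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float* h_data   = (float*)malloc(bytes);        // load data on host (file I/O omitted)
    float* h_result = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data, *d_result;
    cudaMalloc(&d_data, bytes);                     // allocate device memory
    cudaMalloc(&d_result, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);      // copy data to device

    clusterKernel<<<(n + 511) / 512, 512>>>(d_data, d_result, n);   // launch device kernel
    cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);  // copy results to host

    cudaFree(d_data); cudaFree(d_result);
    free(h_data); free(h_result);
    return 0;
}
```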

20 CUDA: Streaming Multiprocessors Image from [3]

21 CUDA: Thread Model
– Kernel: a device function invoked by the host computer; launches a grid with multiple blocks, and multiple threads per block
– Blocks: independent tasks comprised of multiple threads; no synchronization between blocks
– SIMT (Single-Instruction Multiple-Thread): multiple threads executing the same instruction on different data (SIMD), can diverge if necessary
Image from [3]
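A small illustrative kernel (not from the thesis) showing the grid/block/thread indexing and a divergent branch that the SIMT model permits:

```cuda
// Each block is an independent task; threads compute a global index from block and thread IDs.
__global__ void scaleNonNegative(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) {
        // Threads within a warp may diverge here; the hardware serializes the two paths.
        if (x[i] >= 0.0f) x[i] *= s;
        else              x[i] = 0.0f;
    }
}
// Example launch: scaleNonNegative<<<(n + 511) / 512, 512>>>(d_x, n, 2.0f);
```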

22 CUDA: Memory Model Image from [3]

23 When is CUDA worthwhile?
– High computational density: worthwhile to transfer data to a separate device
– Both coarse-grained and fine-grained SIMD parallelism: lots of independent tasks (blocks) that don't require frequent synchronization, and within each block, lots of individual SIMD threads
– Contiguous memory access patterns
– Frequently/repeatedly used data small enough to fit in shared memory

24 C-means
– Minimizes squared error between data points and cluster centers using Euclidean distance
– Alternates between computing membership values and updating cluster centers
– Time complexity: O(N·M·D·I), where N = vectors, M = clusters, D = dimensions per vector, I = number of iterations
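For reference, the standard fuzzy c-means objective and update rules, with fuzziness parameter m; this is the textbook formulation, which the thesis may parameterize slightly differently:

```latex
% Objective minimized by c-means, with membership u_{ij} of event x_i in cluster c_j
J_m = \sum_{i=1}^{N} \sum_{j=1}^{M} u_{ij}^{\,m} \, \lVert x_i - c_j \rVert^2

% Membership update (computed from the distance matrix)
u_{ij} = \left( \sum_{k=1}^{M}
          \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}
        \right)^{-1}

% Center update (the numerator and denominator computed by the centers kernel)
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{\,m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{\,m}}
```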

25 C-means Example

26 C-means Program Flow
– Host: read input data, choose initial centers, copy data to GPU, copy centers to GPU; after each device pass, compute centers and test Error < ε (no → next iteration, yes → output results)
– Device kernels per iteration: Distances >> Memberships >> Center Numerators >> Center Denominators >>

27 C-means Distance Kernel
– Inputs: [D x N] data matrix, [M x D] centers matrix
– Output: [M x N] distance matrix
– Kernel grid: [N/512 x M] blocks, 512 threads per block
– Each block keeps its [1 x D] center in shared memory, used by all threads; the [D x N] data matrix is read from global memory
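A simplified sketch of such a distance kernel, assuming the [D x N] data layout and one center per block row as described above; the constant MAX_D and all names are illustrative, not the thesis code:

```cuda
#define THREADS_PER_BLOCK 512
#define MAX_D 24   // assumed upper bound on dimensions so a center fits in shared memory

// Grid: (N / THREADS_PER_BLOCK, M); each row of blocks handles one cluster center.
__global__ void cmeansDistanceKernel(const float* data,     // [D x N], row = dimension
                                     const float* centers,  // [M x D]
                                     float* distances,      // [M x N]
                                     int N, int D) {
    __shared__ float center[MAX_D];
    int cluster = blockIdx.y;
    int event   = blockIdx.x * blockDim.x + threadIdx.x;

    // All threads cooperatively load this block's [1 x D] center into shared memory.
    for (int d = threadIdx.x; d < D; d += blockDim.x)
        center[d] = centers[cluster * D + d];
    __syncthreads();

    if (event < N) {
        float sum = 0.0f;
        for (int d = 0; d < D; d++) {
            // Consecutive threads read consecutive events: coalesced global loads.
            float diff = data[d * N + event] - center[d];
            sum += diff * diff;
        }
        distances[cluster * N + event] = sqrtf(sum);   // Euclidean distance
    }
}
```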

28 C-means Membership Kernel
– Kernel grid: [N/512] blocks, 512 threads per block
– Transforms the [M x N] distance matrix into the [M x N] membership matrix (in place)
– Each thread makes two passes through the distance matrix: first to compute the sum, second to compute each membership
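A sketch of the two-pass, in-place membership computation, assuming the common fuzziness m = 2 so the exponent 2/(m−1) simplifies to 2; the thesis parameters and handling of zero distances may differ:

```cuda
// Grid: (N / 512) blocks, 512 threads; each thread owns one event (one column).
__global__ void cmeansMembershipKernel(float* matrix,  // [M x N]: distances in, memberships out
                                       int N, int M) {
    int event = blockIdx.x * blockDim.x + threadIdx.x;
    if (event >= N) return;

    // Pass 1: accumulate the normalization sum over all clusters for this event.
    float sum = 0.0f;
    for (int j = 0; j < M; j++) {
        float d = matrix[j * N + event];
        sum += 1.0f / (d * d);                 // assumes m = 2 and nonzero distances
    }
    // Pass 2: overwrite each distance with the corresponding membership value.
    for (int j = 0; j < M; j++) {
        float d = matrix[j * N + event];
        matrix[j * N + event] = (1.0f / (d * d)) / sum;
    }
}
```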

29 C-means Centers Kernel
– Kernel grid: [M/4 x D] blocks, 256 threads per block
– 256 threads cycle through the N events; each value gets re-used 4 times
– [4 x 256] partial sums kept in shared memory, reduced to [4 x 1] with a butterfly sum
– Reads the [D x N] data matrix and the [M x N] membership matrix
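A simplified, numerator-only sketch of the shared-memory partial-sum pattern. The thesis kernel handles 4 clusters per block and also produces denominators; here one (cluster, dimension) pair per block and a plain tree reduction are used for brevity, with m = 2 assumed:

```cuda
// Grid: (M, D); 256 threads per block. Each block accumulates the numerator
// sum_i u[j][i]^2 * data[d][i] for one cluster j and one dimension d.
__global__ void cmeansCenterNumerator(const float* data,        // [D x N]
                                      const float* memberships, // [M x N]
                                      float* numerators,        // [M x D]
                                      int N) {
    __shared__ float partial[256];
    int j = blockIdx.x, d = blockIdx.y, t = threadIdx.x;

    float sum = 0.0f;
    for (int i = t; i < N; i += blockDim.x) {   // threads stride through the N events
        float u = memberships[j * N + i];
        sum += u * u * data[d * N + i];
    }
    partial[t] = sum;
    __syncthreads();

    // Reduce the 256 partial sums in shared memory (the thesis uses a butterfly sum).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0) numerators[j * gridDim.y + d] = partial[0];   // gridDim.y == D
}
```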

30 Expectation Maximization with a Gaussian Mixture Model
– Data described by a mixture of M Gaussian distributions
– Each Gaussian has 3 parameters: a mixing weight, a mean vector, and a covariance matrix
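The mixture density being fit, in standard notation (mixing weights π, means μ, covariances Σ); the notation is generic rather than copied from the thesis. The quadratic form (x − μ)ᵀΣ⁻¹(x − μ) is what makes each likelihood evaluation O(D²):

```latex
p(x_i \mid \Theta) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j),
\qquad
\mathcal{N}(x \mid \mu, \Sigma) =
  \frac{1}{(2\pi)^{D/2} \, \lvert \Sigma \rvert^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```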

31 E-step
– Compute likelihoods based on current model parameters: O(NMD^2)
– Convert likelihoods into membership values: O(NM)

32 M-step Update model parameters
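The standard EM updates for a Gaussian mixture, stated generically; the responsibilities γ correspond to the membership matrix, and the per-cluster sums N_j, means, and covariances are the quantities produced by the device kernels in the program flow below:

```latex
% E-step: membership (responsibility) of cluster j for event x_i
\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
                   {\sum_{k=1}^{M} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}

% M-step: update weights, means, and covariances
N_j = \sum_{i=1}^{N} \gamma_{ij}, \qquad
\pi_j = \frac{N_j}{N}, \qquad
\mu_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} \, x_i, \qquad
\Sigma_j = \frac{1}{N_j} \sum_{i=1}^{N} \gamma_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^{\top}
```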

33 EM Program Flow
– Host: read input data and transpose, initialize models, copy data and models to GPU
– Device kernels per iteration: Likelihoods >>, Constants >>, Memberships >>, Covariance >>, Means >>, N >>
– Iterate until Δ likelihood < ε; if the desired number of clusters has not been reached, combine the 2 closest Gaussians and repeat, otherwise output results

34 EM: Likelihood Kernel
– Kernel grid: [M x 16] blocks, 512 threads per block
– Each block keeps its cluster's mean vector and covariance matrix in shared memory
– Each block reads all dimensions of an N/16 slice of events from the [D x N] data matrix in global memory and writes 1/16th of a row of the [M x N] likelihood matrix

35 EM: Covariance Kernel
– Kernel grid: [M/6 x D(D+1)/2] blocks, 256 threads per block
– Loop unrolled by 6 clusters per block (limited by resource constraints)
– [6 x 256] partial sums kept in shared memory, reduced to [6 x 1] with a butterfly sum
– Reads the [D x N] data matrix and the [M x N] membership matrix

36 Performance Tuning: Global Memory Coalescing
– 8000/9000 (compute 1.0/1.1) series devices: all 16 threads in a half-warp access consecutive elements, with the starting address aligned to 16*sizeof(element)
– GT200 (compute 1.2/1.3) series devices: 32B/64B/128B aligned transactions; minimize the number of transactions by accessing consecutive elements
Images from [4]
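Illustrative access patterns (not thesis code): the first kernel coalesces because consecutive threads touch consecutive addresses; the second, with an event-major [N x D] layout, strides by D and splits into many memory transactions on older hardware. This is the reason the data matrix is kept in [D x N] form and the EM input is transposed before upload:

```cuda
// Coalesced: data stored dimension-major [D x N]; thread i reads element i of one row.
__global__ void readCoalesced(const float* data, float* out, int N, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = data[d * N + i];     // half-warp addresses are consecutive
}

// Uncoalesced: data stored event-major [N x D]; consecutive threads are D floats apart.
__global__ void readStrided(const float* data, float* out, int N, int D, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = data[i * D + d];     // stride-D access, many transactions
}
```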

37 Performance Tuning: Partition Camping
– Global memory has 6 to 8 partitions (64-bit wide DRAM channels)
– Organized into 256-byte wide blocks
– Want an even distribution of accesses to the different channels

38 Performance Tuning: Occupancy
– Ratio of active warps to the maximum allowed by a multiprocessor
– Multiple active warps hide register dependencies and global memory access latencies
– Number of warps restricted by device resources: registers required per thread, shared memory per block, total number of threads per block

39 Performance Tuning Block Count – More small/simple blocks often better than larger blocks with loops

40 Performance Tuning: CUBLAS
– CUDA Basic Linear Algebra Subprograms
– SGEMM is a more scalable solution for some of the kernels
– Makes good use of shared memory
– Poor blocking on rectangular matrices
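A hedged sketch of replacing a reduction-style kernel with a single SGEMM call, using the modern handle-based CUBLAS API (the CUDA 2.3-era library used a slightly different, handle-free interface). Matrices are column-major as CUBLAS requires, names are illustrative, and the memberships are assumed to be already raised to the fuzziness power:

```cuda
#include <cublas_v2.h>

// Compute the center numerators as one matrix product:
// C (M x D) = A (M x N) * B (N x D), all column-major, alpha = 1, beta = 0.
void centerNumeratorsWithSgemm(cublasHandle_t handle,
                               const float* d_memberships, // M x N, column-major
                               const float* d_dataT,       // N x D, column-major
                               float* d_centersNum,        // M x D, column-major
                               int M, int N, int D) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, D, N,
                &alpha,
                d_memberships, M,   // lda = M
                d_dataT, N,         // ldb = N
                &beta,
                d_centersNum, M);   // ldc = M
}
```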

41 Testing Environments
– Oak: 2-GPU server @ RIT
– Tesla C870: G80 architecture, 128 cores, 76.8 GB/sec
– GTX260: GT200 architecture, 192 cores, 112 GB/sec
– CUDA 2.3, Ubuntu Server 8.04 LTS, GCC 4.2, OpenMP
– Lincoln, a TeraGrid HPC resource
– Cluster located in the National Center for Supercomputing Applications @ University of Illinois
– 192 nodes, each with 8 CPU cores and 16 GB of memory, connected with InfiniBand networking
– 92 Tesla S1070 accelerator units, for a total of 368 GPUs (2 per node); each has 240 cores, 102 GB/sec
– CUDA 2.3, RHEL4, GCC 4.2, OpenMP, MVAPICH2 for MPI

42 Results – Speedup (C-means, Expectation Maximization)

43 Results – Overhead

44 Comparisons to Prior Work: C-means
– Order of magnitude improvement over our previous publication [6]
– 1.04x to 4.64x improvement on data sizes provided in Anderson et al. [7]

45 Comparisons to Prior Work: Expectation Maximization
– 3.72x to 10.1x improvement over Kumar et al. [8]
– [8] only supports diagonal covariance; the thesis implementation is capable of full covariance
– Andrew Harp's EM implementation [9] did not provide raw execution times, only speedup, reporting 170x on a GTX285
– Using his CPU source code, raw CPU times were obtained on comparable hardware to compute speedup for the GTX260 with the implementation in this thesis
– Speedup of the thesis implementation over his CPU reference was 446x, effectively at least a 2.6x improvement

46 Multi-GPU Strategy: 3-tier parallel hierarchy

47 Changes from Single GPU
– Very little impact on GPU kernel implementations
– Normalization for some results (C-means cluster centers, EM M-step kernels) done on the host after collecting results from each GPU: very low overhead
– Majority of overhead is the initial distribution of input data and the final collection of event-level membership results to nodes in the cluster
– Asynchronous MPI sends from the host instead of each node reading the input file from the data store
– Need to transpose the membership matrix and then gather data from each node
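A hedged sketch of the three-tier pattern (MPI across nodes, OpenMP threads per node, one GPU per thread). The function `runClusteringOnDevice` is a placeholder for the single-GPU pipeline, sizes are arbitrary, and the reductions of partial results are only indicated in comments:

```cuda
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder for the single-GPU clustering pipeline described earlier.
static void runClusteringOnDevice(const float* events, int numEvents) {
    (void)events; (void)numEvents;   // real code would launch the CUDA kernels here
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int gpusPerNode = 0;
    cudaGetDeviceCount(&gpusPerNode);
    if (gpusPerNode == 0) gpusPerNode = 1;          // fall back so the sketch still runs

    // Tier 1 (MPI): the root scatters equal chunks of the event data to each node.
    const int totalEvents = 1 << 20, dims = 24;     // illustrative sizes
    int eventsPerNode = totalEvents / size;
    float* nodeData = (float*)malloc((size_t)eventsPerNode * dims * sizeof(float));
    float* allData  = (rank == 0) ? (float*)calloc((size_t)totalEvents * dims, sizeof(float)) : NULL;
    MPI_Scatter(allData, eventsPerNode * dims, MPI_FLOAT,
                nodeData, eventsPerNode * dims, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Tier 2 (OpenMP): one host thread per GPU on this node, each bound to one device.
    int eventsPerGpu = eventsPerNode / gpusPerNode;
    #pragma omp parallel num_threads(gpusPerNode)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);
        // Tier 3 (CUDA): kernels process this GPU's slice of the node's events.
        runClusteringOnDevice(nodeData + (size_t)tid * eventsPerGpu * dims, eventsPerGpu);
    }

    // Partial results (e.g., center numerators/denominators) would be combined across
    // nodes with MPI_Allreduce and normalized on the host, as described above.

    free(nodeData); free(allData);
    MPI_Finalize();
    return 0;
}
```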

48 Multi-GPU Strategy

49 Multi-GPU Analysis
– Fixed problem size analysis (i.e. Amdahl's Law, strong scaling, true speedup): kept input size and parameters the same, increased the number of nodes on the Lincoln cluster
– Time-constrained analysis (i.e. Gustafson's Law, weak scaling, scaled speedup): problem size changed such that execution time would remain the same with ideal speedup; execution time is proportional to N/p, so the problem size is scaled to N x p
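The two scaling models referenced here in their usual textbook form, with s the serial fraction and p the number of processors; these are standard statements rather than formulas quoted from the thesis:

```latex
% Fixed problem size (Amdahl's law): speedup is bounded by the serial fraction
S_{\text{fixed}}(p) = \frac{1}{\,s + \dfrac{1 - s}{p}\,}

% Time-constrained scaling (Gustafson's law): the problem grows with p
S_{\text{scaled}}(p) = s + (1 - s)\,p = p - s\,(p - 1)
```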

50 Fixed Problem Size Analysis

51 Overhead in Parallel Implementation

52 Time-Constrained Analysis

53 Synthetic Data Results (C-means result, EM Gaussian result)

54 Flow Cytometry Results
– Combination of mouse blood and human blood cells
– 21 dimensions: 6 scatter, 7 human, 7 mouse, 1 in common
– Attempt to distinguish mouse from human cells with clustering
– Data courtesy of Ernest Wang, Tim Mossman, James Cavenaugh, Iftekhar Naim, and others from the Flowgating group at the University of Rochester Center for Vaccine Biology and Immunology

55

56
– 99.3% of the 200,000 cells properly grouped into mouse or human clusters with a 50/50 mixture
– Reducing the mixture to 10,000 human and 100,000 mouse cells, 98.99% of all cells were still properly grouped
– With only 1% human cells, they start getting harder to distinguish: 62 (6.2%) of the human cells were grouped into predominantly mouse clusters, and 192 mouse cells were grouped into the predominantly human clusters

57 Conclusions
– Both C-means and Expectation Maximization with Gaussians have abundant data parallelism that maps well to massively parallel many-core GPU architectures: nearly 2 orders of magnitude improvement over single-threaded algorithms on comparable CPUs
– GPUs allow clustering of large data sets, like those found in flow cytometry, to be practical (a few minutes instead of a few hours) on a single lab machine
– GPU co-processors and the CUDA framework can be combined with traditional parallel programming techniques for efficient high performance computing: over 6000x speedup compared to a single CPU with only 64 server nodes and 128 GPUs
– Parallelization strategies used in this thesis are applicable to other clustering algorithms

58 Future Work
– Apply CUDA, OpenMP, and MPI to other clustering algorithms or other parts of the workflow: skewed t mixture model clustering; accelerated data preparation, visualization, and statistical inference between data sets
– Improvements to current implementations: a CUDA SGEMM with better performance on highly rectangular matrices could replace some kernels; try MAGMA and CULAtools, 3rd-party BLAS libraries for GPUs
– Investigate performance on new architectures and frameworks: NVIDIA Fermi and Intel Larrabee architectures; OpenCL, DirectCompute, PGI frameworks/compilers
– Improvements to the multi-node implementation: remove the master-slave paradigm for data distribution and final result collection (currently the root node needs enough memory to hold it all – not scalable to very large data sets); dynamic load balancing for heterogeneous environments; use CPU cores along with the GPU for processing portions of the data, instead of idling during kernels

59 Questions?

60 References
1. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
2. H. Shapiro, Practical Flow Cytometry. Wiley-Liss, New York, 2003.
3. NVIDIA, "NVIDIA CUDA Programming Guide 2.3". [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
4. NVIDIA, "NVIDIA CUDA C Programming Best Practices Guide". [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
5. NVIDIA, "NVIDIA CUDA Architecture Introduction & Overview". [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
6. J. Espenshade, A. Pangborn, G. von Laszewski, D. Roberts, and J. Cavenaugh, "Accelerating partitional algorithms for flow cytometry on GPUs," in Parallel and Distributed Processing with Applications, 2009 IEEE International Symposium on, Aug. 2009, pp. 226–233.
7. D. Anderson, R. Luke, and J. Keller, "Speedup of fuzzy clustering through stream processing on graphics processing units," Fuzzy Systems, IEEE Transactions on, vol. 16, no. 4, pp. 1101–1106, Aug. 2008.
8. N. Kumar, S. Satoor, and I. Buck, "Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA," in 11th IEEE International Conference on High Performance Computing and Communications (HPCC '09), 2009, pp. 103–109.
9. A. Harp, "EM of GMMs with GPU acceleration," May 2009. [Online]. Available: http://andrewharp.com/gmmcuda

