Generalized and Hybrid Fast-ICA Implementation using GPU

Presenter: [Titus Nanda Kumara]

Blind Source Separation (BSS)
A computer has no knowledge of the original signals or of how they were mixed, yet we need the original signals separately. This problem is called Blind Source Separation. The solution is given by Independent Component Analysis (ICA).

ICA in one picture
Assumptions:
We have two recordings to separate two sources.
All signals arrive at the same time (no delay difference between them).
The amplitude of the original signals can change, but the mixing factors remain the same (the singer and the saxophone do not move).
The mixing factors in the picture (0.4, 0.8, 0.9, 0.5) are unknown to the computer. Each ear hears a weighted sum of the two sources: the left ear (X1) hears 0.8 times the saxophone music plus a fraction of the voice, and the right ear (X2) hears 0.9 times the saxophone music plus a fraction of the voice.

Independent Component Analysis (ICA)
Problem: how to unmix a mixed signal (x) if we know neither the original sources (s) nor the mixing factors (A).
Solution: assume the mixture is linear and the sources are independent. The problem can then be written as x = As. If we have an estimate of A^{-1}, then A^{-1}x = A^{-1}As, so s = A^{-1}x.
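The algebra above can be checked numerically. The following is a minimal Python sketch, assuming a 2 x 2 mixing matrix whose values are chosen purely for illustration; in real ICA the matrix A is unknown and A^{-1} has to be estimated. The helper names `mix` and `invert_2x2` are hypothetical.

```python
def mix(A, s):
    """Return x = A s for a 2 x 2 matrix A and a 2 x p list of source rows s."""
    p = len(s[0])
    return [[A[i][0] * s[0][t] + A[i][1] * s[1][t] for t in range(p)]
            for i in range(2)]


def invert_2x2(A):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; sources cannot be recovered")
    return [[d / det, -b / det], [-c / det, a / det]]


# Illustrative values only (assumption): a known mixing matrix and two short sources.
A = [[0.8, 0.4], [0.9, 0.5]]
s = [[1.0, -2.0, 0.5, 3.0], [0.0, 1.5, -1.0, 2.0]]

x = mix(A, s)                        # what the two "ears" record: x = A s
s_recovered = mix(invert_2x2(A), x)  # s = A^{-1} x
```

Here A is known, so the inverse is exact. ICA addresses the harder case where only x is available.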

ICA is used in
Separating EEG signals for brain-computer interfaces and other medical or research purposes
Separating magnetoencephalography (MEG) data
Improving the quality of music or sound signals by eliminating cross-talk or noise
Finding hidden or fundamental factors in financial data such as background currency exchanges or stock market data
ICA is a highly compute-intensive algorithm. When the data size is large, it takes a considerable amount of time to run.

Fast-ICA
Suggested by Aapo Hyvärinen at Helsinki University of Technology in the late 1990s. Comparatively fast, accurate, and highly parallelizable. Matrix operations are used in most places, which makes it a good starting point for improving performance using a GPU.

GPUs for General Purpose Applications (GPGPU Computing)
GPGPU computing lets programmers program the GPU for whatever computation they need. What is so important about the GPU? A CPU has several cores running at around 4 GHz; a GPU has thousands of cores running at around 1 GHz. If the task is completely parallel, it can run hundreds or thousands of times faster on the GPU.

Improving performance of Fast-ICA
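Divide the algorithm into five sections: input reading, pre-processing, the Fast-ICA loop, post-processing, and output writing.
Execution time was measured for matrix sizes of 6 x 8192, 6 x …, 100 x 8192, and 100 x …. Pre-processing took 0.5%~1.6% of the execution time, the Fast-ICA loop 98%~99%, and post-processing 0.2%~0.3%.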
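A per-section breakdown of this kind can be obtained by timing each section and dividing by the total. The sketch below shows one way to do that in Python; the section workloads are placeholders, not the real Fast-ICA stages, and the names `profile_sections`, `shares`, and `busy` are hypothetical.

```python
import time


def profile_sections(sections):
    """Run each (name, callable) pair in order and return {name: seconds}."""
    timings = {}
    for name, run in sections:
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return timings


def shares(timings):
    """Convert absolute timings into percentages of the total run time."""
    total = sum(timings.values())
    return {name: 100.0 * t / total for name, t in timings.items()}


def busy(n):
    """Placeholder workload (assumption): stands in for a real section."""
    return lambda: sum(i * i for i in range(n))


sections = [
    ("input reading", busy(1_000)),
    ("pre-processing", busy(5_000)),
    ("Fast-ICA loop", busy(200_000)),
    ("post-processing", busy(2_000)),
    ("output writing", busy(1_000)),
]
report = shares(profile_sections(sections))
for name, percent in report.items():
    print(f"{name:16s} {percent:6.2f}%")
```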

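Amdahl's law
To improve the performance, we focused on the Fast-ICA loop, since the overall speedup is limited by the share of run time that is accelerated.
W matrices have size n x n (n is the number of sources). Z matrices have size n x p, with p >> n (p is the number of samples).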
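Amdahl's law bounds the overall speedup by the fraction of run time that is accelerated. The sketch below evaluates the formula for loop fractions of 98% and 99%; the section speedups of 10x and 100x are illustrative assumptions, not measurements from this work.

```python
def amdahl_speedup(accelerated_fraction, section_speedup):
    """Overall speedup when `accelerated_fraction` of the run time
    is made `section_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if section_speedup <= 0:
        raise ValueError("section speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / section_speedup)


for fraction in (0.98, 0.99):
    for k in (10.0, 100.0):  # assumed speedups of the loop itself
        overall = amdahl_speedup(fraction, k)
        print(f"loop share {fraction:.0%}, loop {k:.0f}x faster -> overall {overall:.1f}x")
    # Upper bound: the loop becomes infinitely fast.
    print(f"loop share {fraction:.0%}, upper bound {1.0 / (1.0 - fraction):.0f}x")
```

Accelerating a section that takes well under 2% of the run time could never yield more than about a 2% overall gain, which is why the loop is the natural target.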

Inside the Fast-ICA loop

Improving the contrast function
A custom kernel was written to apply a nonlinear function to each element of the matrix. This task is completely parallelizable.
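The element-wise structure can be illustrated on the CPU. The sketch below is plain Python, not the CUDA kernel from this work, and it assumes tanh as the nonlinearity (a common Fast-ICA choice; the slides do not name the function used). The helper names are hypothetical.

```python
import math


def g(u):
    """Nonlinearity applied to one element (tanh is an assumed choice)."""
    return math.tanh(u)


def g_prime(u):
    """Derivative of tanh, also needed by the Fast-ICA update."""
    t = math.tanh(u)
    return 1.0 - t * t


def apply_elementwise(f, M):
    """Apply f to every element independently. No output depends on any
    other element, so on a GPU each element can get its own thread."""
    return [[f(v) for v in row] for row in M]


Y = [[0.0, 1.0, -1.0], [2.0, -0.5, 0.25]]
G = apply_elementwise(g, Y)
G_prime = apply_elementwise(g_prime, Y)
```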

Only the contrast function is not enough
Data must be transferred between RAM and GPU memory over the PCI Express bus, which introduces a delay. This communication delay cancels out the speed gain. To hide the data transfer delay and gain performance, a large number of computations must happen on the GPU.
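A simple cost model shows why offloading a small amount of work does not pay off. The sketch below is an illustration under assumed numbers: the bandwidth, latency, GPU speedup, and CPU times are hypothetical, not measurements from this work. Only the 6 x 8192 matrix size comes from the slides.

```python
def gpu_total_time(cpu_time, gpu_speedup, bytes_moved,
                   bandwidth_bytes_per_s, latency_s):
    """Time on the GPU path: transfer cost plus accelerated compute."""
    transfer = latency_s + bytes_moved / bandwidth_bytes_per_s
    return transfer + cpu_time / gpu_speedup


def gpu_is_worthwhile(cpu_time, gpu_speedup, bytes_moved,
                      bandwidth_bytes_per_s=8e9, latency_s=1e-5):
    """True when the GPU path, including the copy, beats the CPU."""
    return gpu_total_time(cpu_time, gpu_speedup, bytes_moved,
                          bandwidth_bytes_per_s, latency_s) < cpu_time


# A 6 x 8192 matrix of doubles (8 bytes each).
bytes_moved = 6 * 8192 * 8

# Little work per transfer (assumed 50 microseconds of CPU time): copy dominates.
small_job = gpu_is_worthwhile(cpu_time=5e-5, gpu_speedup=50.0, bytes_moved=bytes_moved)

# Much more work on the same data (assumed 5 milliseconds): copy is amortized.
large_job = gpu_is_worthwhile(cpu_time=5e-3, gpu_speedup=50.0, bytes_moved=bytes_moved)
```

With these assumed numbers, the first case loses to the CPU despite a 50x faster kernel, while the second case wins because the same transfer now carries far more computation.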


Improve matrix operations using cuBLAS
cuBLAS is the CUDA implementation of the BLAS library. It is highly optimized; in most cases, custom kernels for matrix operations perform worse than cuBLAS routines.

Acceleration of the complete algorithm
Pre-processing: centering and whitening to remove the correlation among the rows of the input (culaDeviceDgesvd and custom kernels)
Fast-ICA loop:
Matrix multiplications and transformations (cublasDgemm and cublasDgeam)
Contrast function (custom kernels)
Eigen decomposition (culaDeviceDgeev)
Post-processing: matrix multiplication with cublasDgemm
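To make the loop operations listed above concrete, the sketch below gives plain-Python reference versions of what cublasDgemm (C = alpha*A*B + beta*C) and cublasDgeam (C = alpha*A + beta*B) compute, without their transpose options and without cuBLAS's column-major storage. It then combines them into one fixed-point update in the standard Fast-ICA form from the literature, W+ = (1/p) g(WZ) Z^T - diag(mean g'(WZ)) W. The tanh nonlinearity and the function names are assumptions; the decorrelation step that uses the eigen decomposition is omitted, and this is not the presented GPU code.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def gemm(alpha, A, B, beta, C):
    """Reference for C <- alpha * A * B + beta * C."""
    Bt = transpose(B)
    return [[alpha * sum(a * b for a, b in zip(row, col)) + beta * c
             for col, c in zip(Bt, c_row)]
            for row, c_row in zip(A, C)]


def geam(alpha, A, beta, B):
    """Reference for C <- alpha * A + beta * B."""
    return [[alpha * a + beta * b for a, b in zip(ra, rb)]
            for ra, rb in zip(A, B)]


def fastica_update(W, Z):
    """One fixed-point update of W (n x n) on whitened data Z (n x p)."""
    n, p = len(Z), len(Z[0])
    zeros_np = [[0.0] * p for _ in range(n)]
    zeros_nn = [[0.0] * n for _ in range(n)]
    Y = gemm(1.0, W, Z, 0.0, zeros_np)                      # Y = W Z
    G = [[math.tanh(y) for y in row] for row in Y]          # contrast function
    mean_gp = [sum(1.0 - math.tanh(y) ** 2 for y in row) / p for row in Y]
    GZt = gemm(1.0 / p, G, transpose(Z), 0.0, zeros_nn)     # (1/p) g(Y) Z^T
    DW = [[mean_gp[i] * W[i][j] for j in range(n)] for i in range(n)]
    return geam(1.0, GZt, -1.0, DW)                         # subtract diag(...) W


W = [[1.0, 0.0], [0.0, 1.0]]
Z = [[0.5, -1.0, 0.25, 1.5], [1.0, 0.0, -2.0, 0.5]]
W_next = fastica_update(W, Z)
```

Only the two matrix products touch all p samples, which is why they dominate the cost when p >> n.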

Running the full algorithm on the GPU
Running the full algorithm on the GPU is not always a good idea.

Switching between GPU and CPU
When CPU execution is faster, we can switch to the CPU. However, switching points must be chosen carefully because of the memory copy delay. The best choice depends heavily on the size of the input data.

Data size vs performance
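We tested a range of numbers of sources and numbers of samples (1024 – …). Each section was tested for all the combinations.
[Figure: regions where the CPU is better and where the GPU is better, pre-processing]

[Figure: regions where the CPU is better and where the GPU is better, ICA main loop]

[Figure: regions where the CPU is better and where the GPU is better]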

Switching between GPU and CPU
The switching points depend on the hardware, the data size, and the data transfer delay.
Option 1: The program can be profiled on the target hardware for all the data sizes to define boundaries.
Option 2: The program can decide where to switch based on previous iterations of the Fast-ICA loop.
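Option 2 can be sketched as a small dispatcher that measures both backends during the first iterations and then keeps the faster one. This is an illustration only: the class `HybridDispatcher`, the function `simulated_seconds`, and its cost constants are hypothetical and stand in for real timings, which would have to include the memory copy delay.

```python
class HybridDispatcher:
    """Pick "cpu" or "gpu" for a section from timings of earlier iterations."""

    def __init__(self, warmup=2):
        self.warmup = warmup
        self.history = {"cpu": [], "gpu": []}

    def choose(self):
        # Warm-up: measure each backend `warmup` times, CPU first.
        for backend in ("cpu", "gpu"):
            if len(self.history[backend]) < self.warmup:
                return backend
        # Afterwards: use the backend with the lower average time so far.
        average = {b: sum(t) / len(t) for b, t in self.history.items()}
        return min(average, key=average.get)

    def record(self, backend, seconds):
        self.history[backend].append(seconds)


def simulated_seconds(backend, n_samples):
    """Hypothetical cost model: the GPU pays a fixed transfer cost
    but processes each sample faster."""
    if backend == "cpu":
        return 1e-8 * n_samples
    return 2e-4 + 1e-10 * n_samples


def run_loop(n_samples, iterations=8):
    """Return the backend in use at the end of the simulated loop."""
    dispatcher = HybridDispatcher(warmup=2)
    backend = None
    for _ in range(iterations):
        backend = dispatcher.choose()
        dispatcher.record(backend, simulated_seconds(backend, n_samples))
    return backend


small_choice = run_loop(1024)
large_choice = run_loop(1_000_000)
```

Option 1 would replace the warm-up with a lookup table of boundaries measured offline for the target hardware.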

Conclusions
Fast-ICA can be executed efficiently on the GPU, but not in all cases. We cannot write a static program to handle all cases, because the relative performance of the CPU and GPU depends on the data size. The program should intelligently switch between GPU and CPU at appropriate locations to gain the maximum performance in all scenarios.

Thank you
