
CUBLAS Library
Template (Advanced Computing Technology and Applications)
Dr. Bo Yuan, E-mail: yuanb@sz.tsinghua.edu.cn


1 CUBLAS Library
Template (Advanced Computing Technology and Applications)
Dr. Bo Yuan, E-mail: yuanb@sz.tsinghua.edu.cn

2 What is the CUBLAS Library?
BLAS: Basic Linear Algebra Subprograms
    A library of routines for basic linear algebra
    Divided into three levels
    Implementations include MKL BLAS, CUBLAS, C++ AMP BLAS, ...
CUBLAS
    A high-level implementation of BLAS on top of the NVIDIA CUDA runtime
    Runs on a single GPU or multiple GPUs
    Supports CUDA streams

3 Three Levels of BLAS
Level 1
    This level contains vector operations of the form y ← αx + y
Level 2
    This level contains matrix-vector operations of the form y ← αAx + βy
Level 3
    This level contains matrix-matrix operations of the form C ← αAB + βC
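
As a rough illustration of the three levels, here is one representative operation per level written as plain CPU code in C: axpy (Level 1), gemv (Level 2) and gemm (Level 3), all on column-major data. The function names ref_saxpy, ref_sgemv and ref_sgemm are made up for this sketch; they are not CUBLAS routines and make no attempt at performance.

```c
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha*x + y, for n elements. */
void ref_saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha*A*x + beta*y.
   A is m x n, stored column-major with leading dimension lda >= m. */
void ref_sgemv(int m, int n, float alpha, const float *A, int lda,
               const float *x, float beta, float *y)
{
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += A[(size_t)j * lda + i] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Level 3 (matrix-matrix): C <- alpha*A*B + beta*C.
   A is m x k, B is k x n, C is m x n, all column-major. */
void ref_sgemm(int m, int n, int k, float alpha,
               const float *A, int lda, const float *B, int ldb,
               float beta, float *C, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[(size_t)p * lda + i] * B[(size_t)j * ldb + p];
            C[(size_t)j * ldc + i] = alpha * acc + beta * C[(size_t)j * ldc + i];
        }
    }
}
```

The amount of arithmetic per element of data grows from Level 1 to Level 3, which is one reason Level 3 routines tend to benefit most from a GPU.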

4 Why do we need CUBLAS?
CUBLAS
    Full support for all 152 standard BLAS routines
    Supports single-precision, double-precision, complex and double-complex data types
    Support for CUDA streams
    Fortran bindings
    Support for multiple GPUs and concurrent kernels
    Very efficient

5 Why do we need CUBLAS?

6 Getting Started
Basic preparation
    Install the CUDA Toolkit
    Include cublas_v2.h
    Link cublas.lib
Some basic tips
    Every CUBLAS function needs a handle
    CUBLAS functions must be called between cublasCreate() and cublasDestroy()
    Every CUBLAS function returns a cublasStatus_t that reports the state of execution
    Matrices use column-major storage
References
    CUDA Toolkit 5.0 CUBLAS Library.pdf
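
Column-major storage is the tip that most often surprises C programmers, so a small sketch may help. It uses the 1-based IDX2F macro that also appears in the example code of this deck, plus a 0-based variant IDX2C; the helper demo_fill is a made-up name for this illustration and runs entirely on the host.

```c
/* 1-based (Fortran-style) index into a column-major array with leading dimension ld */
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1))
/* 0-based (C-style) index into the same column-major layout */
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* Store the value 10*i + j at 1-based position (i, j) of an m x n
   column-major matrix, so that the memory order is easy to read off. */
void demo_fill(float *a, int m, int n)
{
    for (int j = 1; j <= n; j++)
        for (int i = 1; i <= m; i++)
            a[IDX2F(i, j, m)] = (float)(10 * i + j);
}
```

For a 3 x 2 matrix, demo_fill leaves the values 11, 21, 31, 12, 22, 32 in memory: the whole first column comes first, then the second column. A row-major C array would instead hold 11, 12, 21, 22, 31, 32.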

7 CUBLAS Data Types
cublasHandle_t
The cublasHandle_t type is a pointer to an opaque structure holding the CUBLAS library context. The context must be initialized using cublasCreate(), and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cublasDestroy().
cublasStatus_t
The cublasStatus_t type is used for function status returns. All CUBLAS library functions return their status, which can have the following values:
    CUBLAS_STATUS_SUCCESS
    CUBLAS_STATUS_NOT_INITIALIZED
    CUBLAS_STATUS_ALLOC_FAILED
    CUBLAS_STATUS_INVALID_VALUE
    CUBLAS_STATUS_ARCH_MISMATCH
    CUBLAS_STATUS_MAPPING_ERROR
    CUBLAS_STATUS_EXECUTION_FAILED
    CUBLAS_STATUS_INTERNAL_ERROR

8 CUBLAS Data Types
cublasOperation_t
The cublasOperation_t type indicates which operation needs to be performed with the dense matrix. Its values correspond to the Fortran characters 'N' or 'n' (non-transpose), 'T' or 't' (transpose) and 'C' or 'c' (conjugate transpose) that are often used as parameters to legacy BLAS implementations.
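
To make the effect of the operation parameter concrete, the sketch below reads element (i, j) of op(A) from a column-major array without ever building the transposed matrix. The enum demo_op_t and the function demo_op_at are hypothetical stand-ins written for this illustration; in CUBLAS itself the corresponding values are CUBLAS_OP_N, CUBLAS_OP_T and CUBLAS_OP_C. Only real data is handled, so the conjugate transpose would coincide with the plain transpose.

```c
#include <stddef.h>

/* Hypothetical stand-in for cublasOperation_t, restricted to real data. */
typedef enum { DEMO_OP_N, DEMO_OP_T } demo_op_t;

/* Return element (i, j), 0-based, of op(A).
   A is stored column-major with leading dimension lda.
   For DEMO_OP_T the row and column indices are simply swapped. */
float demo_op_at(demo_op_t op, const float *A, int lda, int i, int j)
{
    return (op == DEMO_OP_N) ? A[(size_t)j * lda + i]
                             : A[(size_t)i * lda + j];
}
```

The point of the design is that a transpose costs nothing in storage: the routine only changes how it walks through the same memory.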

9 CUBLAS Data Types
cublasFillMode_t
The cublasFillMode_t type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to the Fortran characters 'L' or 'l' (lower) and 'U' or 'u' (upper) that are often used as parameters to legacy BLAS implementations.
cublasSideMode_t
The cublasSideMode_t type indicates whether the dense matrix is on the left or right side in the matrix equation solved by a particular function. Its values correspond to the Fortran characters 'L' or 'l' (left) and 'R' or 'r' (right) that are often used as parameters to legacy BLAS implementations.
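
The fill mode is easiest to understand with a symmetric matrix of which only one triangle has been stored. The following CPU sketch computes y = A*x while reading only the triangle named by the fill mode; the other triangle may hold arbitrary values. The names demo_fill_t and demo_symv are invented for this illustration and are not part of CUBLAS.

```c
#include <stddef.h>

/* Hypothetical stand-in for cublasFillMode_t. */
typedef enum { DEMO_FILL_LOWER, DEMO_FILL_UPPER } demo_fill_t;

/* y <- A*x for a symmetric n x n matrix A stored column-major with
   leading dimension lda. Only the triangle selected by uplo is read. */
void demo_symv(demo_fill_t uplo, int n, const float *A, int lda,
               const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++) {
            int r = i, c = j;
            int in_lower = (r >= c);
            /* If (i, j) lies in the triangle that was not filled,
               read its mirror image (j, i) instead. */
            if ((uplo == DEMO_FILL_LOWER) != in_lower) {
                r = j;
                c = i;
            }
            acc += A[(size_t)c * lda + r] * x[j];
        }
        y[i] = acc;
    }
}
```

Passing the wrong fill mode makes the routine read the unfilled triangle, which typically shows up as silently wrong results rather than as an error status.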

10 CUBLAS Data Types
cublasPointerMode_t
The cublasPointerMode_t type indicates whether the scalar values are passed by reference on the host or on the device. If several scalar values are present in a function call, all of them must conform to the same pointer mode. The pointer mode can be set and retrieved using the cublasSetPointerMode() and cublasGetPointerMode() routines, respectively.
cublasAtomicsMode_t
The cublasAtomicsMode_t type indicates whether CUBLAS routines that have an alternate implementation using atomics may use it. The atomics mode can be set and queried using the cublasSetAtomicsMode() and cublasGetAtomicsMode() routines, respectively.

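11 Example Code
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>
    #include "cublas_v2.h"   // header required to call CUBLAS
    #define M 6
    #define N 5
    #define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))   // 1-based, column-major array index

    static __inline__ void modify(cublasHandle_t handle, float *m, int ldm, int n,
                                  int p, int q, float alpha, float beta)
    {
        // scale row p from column q to column n: n-q+1 elements with stride ldm
        // (the slide had n-p+1, which runs past the last column for p=2, q=3)
        cublasSscal(handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
        // scale column q from row p to row ldm: contiguous elements
        cublasSscal(handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
    }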

12 Example Code
    int main(void)
    {
        cudaError_t cudaStat;
        cublasStatus_t stat;
        cublasHandle_t handle;
        int i, j;
        float *devPtrA;
        float *a = 0;
        a = (float *)malloc(M * N * sizeof(*a));   // allocate the array on the host
        if (!a) {
            printf("host memory allocation failed");
            return EXIT_FAILURE;
        }

13 Example Code
        for (j = 1; j <= N; j++) {   // initialize the array
            for (i = 1; i <= M; i++)
                a[IDX2F(i,j,M)] = (float)((i-1)*M + j);
        }
        cudaStat = cudaMalloc((void **)&devPtrA, M * N * sizeof(*a));   // allocate device memory
        if (cudaStat != cudaSuccess) {
            // braces added: without them the return ran unconditionally
            printf("device memory allocation failed");
            return EXIT_FAILURE;
        }
        stat = cublasCreate(&handle);   // initialize the CUBLAS context

14 Example Code
        if (stat != CUBLAS_STATUS_SUCCESS) {   // compare against the CUBLAS status, not cudaSuccess
            printf("CUBLAS initialization failed\n");
            return EXIT_FAILURE;
        }
        stat = cublasSetMatrix(M, N, sizeof(*a), a, M, devPtrA, M);   // copy data from host to device
        if (stat != CUBLAS_STATUS_SUCCESS) {
            // braces added: the cleanup must only run on failure
            printf("data download failed");
            cudaFree(devPtrA);
            cublasDestroy(handle);
            return EXIT_FAILURE;
        }
        modify(handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
        stat = cublasGetMatrix(M, N, sizeof(*a), devPtrA, M, a, M);   // copy data from device to host

15 Example Code
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf("data upload failed");
            cudaFree(devPtrA);
            cublasDestroy(handle);
            return EXIT_FAILURE;
        }
        cudaFree(devPtrA);       // free device memory
        cublasDestroy(handle);   // shut down the CUBLAS context
        for (j = 1; j <= N; j++) {
            for (i = 1; i <= M; i++)
                printf("%7.0f", a[IDX2F(i,j,M)]);
            printf("\n");        // one output line per column of the matrix
        }
        free(a);                 // release the host array
        return EXIT_SUCCESS;
    }

16 Matrix Multiply
Use a level-3 function
Function introduction
    cublasStatus_t cublasSgemm(cublasHandle_t handle,
                               cublasOperation_t transa, cublasOperation_t transb,
                               int m, int n, int k,
                               const float *alpha,
                               const float *A, int lda,
                               const float *B, int ldb,
                               const float *beta,
                               float *C, int ldc)

17 Matrix Multiply

18 Matrix Multiply
    int MatrixMulbyCUBLAS(float *A, float *B, int HA, int WB, int WA, float *C)
    {
        float *d_A, *d_B, *d_C;
        CUDA_SAFE_CALL(cudaMalloc((void **)&d_A, WA*HA*sizeof(float)));
        CUDA_SAFE_CALL(cudaMalloc((void **)&d_B, WB*WA*sizeof(float)));
        CUDA_SAFE_CALL(cudaMalloc((void **)&d_C, WB*HA*sizeof(float)));
        CUDA_SAFE_CALL(cudaMemcpy(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice));
        CUDA_SAFE_CALL(cudaMemcpy(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice));
        cublasStatus_t status;
        cublasHandle_t handle;
        status = cublasCreate(&handle);
        if (status != CUBLAS_STATUS_SUCCESS) {
            printf("CUBLAS initialization error\n");
            return EXIT_FAILURE;
        }

19 Matrix Multiply
        int devID;
        cudaDeviceProp props;
        CUDA_SAFE_CALL(cudaGetDevice(&devID));
        CUDA_SAFE_CALL(cudaGetDeviceProperties(&props, devID));
        printf("Device %d: \"%s\" with Compute %d.%d capability\n",
               devID, props.name, props.major, props.minor);
        const float alpha = 1.0f;
        const float beta = 0.0f;
        // level 3 function
        // Assuming A, B and C are row-major on the host: CUBLAS is column-major,
        // so the operands are swapped to compute C^T = B^T * A^T, which is
        // exactly the row-major product C = A * B.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                    &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
        CUDA_SAFE_CALL(cudaMemcpy(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost));
        cublasDestroy(handle);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        return 0;
    }
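
The argument order in the cublasSgemm call (d_B before d_A, with dimensions WB, HA, WA) looks odd at first. A plausible reading is that the host matrices are row-major, and that the call relies on the identity (A*B)^T = B^T * A^T. The CPU sketch below reproduces the same argument pattern with a reference column-major gemm so the trick can be checked without a GPU. Both function names, demo_colmajor_sgemm and demo_rowmajor_matmul, are made up for this illustration.

```c
#include <stddef.h>

/* Reference column-major gemm without transposes:
   C <- alpha*A*B + beta*C, with A m x k, B k x n, C m x n. */
void demo_colmajor_sgemm(int m, int n, int k, float alpha,
                         const float *A, int lda, const float *B, int ldb,
                         float beta, float *C, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[(size_t)p * lda + i] * B[(size_t)j * ldb + p];
            C[(size_t)j * ldc + i] = alpha * acc + beta * C[(size_t)j * ldc + i];
        }
    }
}

/* Row-major product C = A*B, with A HA x WA, B WA x WB, C HA x WB.
   A row-major matrix read as column-major is its transpose, so asking
   the column-major routine for "B times A" yields C^T in column-major
   order, which is C in row-major order. */
void demo_rowmajor_matmul(const float *A, const float *B,
                          int HA, int WB, int WA, float *C)
{
    demo_colmajor_sgemm(WB, HA, WA, 1.0f, B, WB, A, WA, 0.0f, C, WB);
}
```

With A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8], [9, 10], [11, 12]], both stored row-major, demo_rowmajor_matmul writes 58, 64, 139, 154 into C, which is the row-major layout of A*B. No data is transposed or copied; only the roles of the operands change.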

20 The Result

21 Some New Features
The handle to the CUBLAS library context is initialized using the cublasCreate() function and is explicitly passed to every subsequent library function call. This gives the user more control over the library setup when using multiple host threads and multiple GPUs.
The scalars α and β can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This change allows library functions to execute asynchronously using streams even when α and β are generated by a previous kernel.

22 Some New Features
When a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only being returned by value on the host. This change allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.

23 Stream
Stream
    Concurrent execution between host and device
    Overlap of data transfer and kernel execution
        Available on devices of compute capability 1.1 or higher
        Hides data transfer time
Rules
    Functions in the same stream execute sequentially
    Functions in different streams may execute concurrently
References
    CUDA C Programming Guide.pdf

24 Parallelism with Streams
Create and set the stream to be used by each CUBLAS routine
    Users must call cudaStreamCreate() to create the streams.
    Users must call cublasSetStream() to set the stream to be used by each individual CUBLAS routine.
Use the asynchronous transfer function cudaMemcpyAsync()
The application can conceptually associate each stream with one task. To overlap the computation of different tasks, the user should create CUDA streams using cudaStreamCreate() and set the stream to be used by each individual CUBLAS library routine by calling cublasSetStream() just before calling the actual CUBLAS routine. The computation performed in separate streams is then overlapped automatically on the GPU when possible. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work.

25 Parallelism with Streams
    start = clock();
    for (int i = 0; i < nstreams; i++) {
        cudaMemcpy(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                    &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
        cudaMemcpy(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost);
    }
    end = clock();
    printf("GPU Without Stream time: %.2f s.\n", (double)(end-start)/CLOCKS_PER_SEC);

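26 Parallelism with Streams
    // For the copies to overlap with computation, A, B and C should be in
    // page-locked (pinned) host memory, e.g. allocated with cudaMallocHost().
    start = clock();
    for (int i = 0; i < nstreams; i++) {
        cudaMemcpyAsync(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        cudaMemcpyAsync(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        cublasSetStream(handle, streams[i]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                    &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
        // stream argument added: without it the copy goes to the default stream
        cudaMemcpyAsync(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();   // wait for all streams before stopping the timer
    end = clock();
    printf("GPU With Stream time: %.2f s.\n", (double)(end-start)/CLOCKS_PER_SEC);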

27 The Result

28 Review
    What is the core functionality of BLAS and CUBLAS?
    What are the advantages of CUBLAS?
    Why is the handle important in CUBLAS?
    How do we perform matrix multiplication using CUBLAS?
    How is a matrix stored in CUBLAS?
    How do we use CUBLAS with streams?
    What can we do with CUBLAS in our research?

