Algorithm Engineering „GPGPU“ Stefan Edelkamp. Graphics Processing Units  GPGPU = (GP)²U General Purpose Programming on the GPU  „Parallelism for the.


1 Algorithm Engineering „GPGPU“ Stefan Edelkamp

2 Graphics Processing Units  GPGPU = (GP)²U: general-purpose programming on the GPU. "Parallelism for the masses". Applications: Fourier transformation, model checking, bioinformatics; see CUDA Zone.

3 Programming the Graphics Processing Unit with CUDA

4 Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

5 Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

6 Cluster / Multicore / GPU  Cluster system: many independent systems, each with one (or more) processors, internal memory, and often an HDD. Communication runs over the network, which is slow compared to internal communication. There is no shared memory. [Diagram: three nodes, each with CPU, RAM, and HDD, connected by a switch]

7 Cluster / Multicore / GPU  Multicore systems: multiple CPUs share one RAM; external memory is on the HDD; communication runs over RAM. [Diagram: CPU1 to CPU4 attached to a common RAM and HDD]

8 Cluster / Multicore / GPU  System with a Graphics Processing Unit: many (240) parallel processing units. Hierarchical memory structure: RAM, Video RAM, Shared RAM. Communication: PCI bus. [Diagram: hard disk drive, CPU, and RAM on the host; graphics card with GPU, SRAM, and VRAM]

9 Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

10 Computing on the GPU  Hierarchical execution. Groups: executed sequentially. Threads: executed in parallel; lightweight (creation and switching are nearly free). One kernel function: executed by each thread. [Diagram: Group 0]
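The execution hierarchy on this slide can be made concrete with a minimal sketch, assuming a CUDA-capable device and nvcc. The names `count_threads` and `run_count` are hypothetical helpers, not part of the original slides. Every thread of every group runs the same kernel once and adds one to a shared counter, so the result should equal groups times threads per group.

```cuda
#include <cuda_runtime.h>

// Kernel: every thread of every group executes this same function once.
__global__ void count_threads(unsigned int *counter) {
    atomicAdd(counter, 1u);  // each thread contributes exactly one increment
}

// Hypothetical host helper: launches `groups` groups of `threads` threads
// each and returns how many kernel instances ran.
unsigned int run_count(int groups, int threads) {
    unsigned int *counter_d = nullptr;
    unsigned int result = 0;
    cudaMalloc((void **)&counter_d, sizeof(unsigned int));
    cudaMemset(counter_d, 0, sizeof(unsigned int));
    count_threads<<<groups, threads>>>(counter_d);
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(&result, counter_d, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(counter_d);
    return result;
}
```

Calling `run_count(4, 64)` from a host program launches 4 groups of 64 threads; the atomic increment is used here only to count threads and would be a bottleneck in real code.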

11 Computing on the GPU  Hierarchical memory. Video RAM: 1 GB; comparable to RAM. Shared RAM in the GPU: 16 KB; comparable to registers; parallel access by threads. [Diagram: graphics card with GPU, SRAM, and Video RAM]
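As an illustration of the two memory levels, the following sketch sums an array by letting each group first copy its slice from Video RAM into Shared RAM and then reduce it there. It assumes a CUDA-capable device; `BLOCK`, `block_sum`, and `gpu_sum` are hypothetical names chosen for this example, and the final addition of the per-group sums is done on the host for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // threads per group; 256 ints use 1 KB of the shared RAM

// Each group sums its slice of `in` inside shared RAM and writes one
// partial sum back to Video RAM.
__global__ void block_sum(const int *in, int *partial, int n) {
    __shared__ int buf[BLOCK];
    int tid = threadIdx.x;
    int id = blockDim.x * blockIdx.x + tid;
    buf[tid] = (id < n) ? in[id] : 0;  // load from VRAM into SRAM
    __syncthreads();
    // Tree reduction: halve the number of active threads in each step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0];
}

// Hypothetical host helper: returns the sum of all elements of `data`.
long long gpu_sum(const std::vector<int> &data) {
    int n = (int)data.size();
    if (n == 0) return 0;
    int groups = (n + BLOCK - 1) / BLOCK;
    int *in_d = nullptr, *partial_d = nullptr;
    cudaMalloc((void **)&in_d, n * sizeof(int));
    cudaMalloc((void **)&partial_d, groups * sizeof(int));
    cudaMemcpy(in_d, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    block_sum<<<groups, BLOCK>>>(in_d, partial_d, n);
    std::vector<int> partial(groups);
    cudaMemcpy(partial.data(), partial_d, groups * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(partial_d);
    long long total = 0;
    for (int p : partial) total += p;  // combine per-group results on the host
    return total;
}
```

The `__syncthreads()` calls sit outside the `if` so that every thread of the group reaches them, which the barrier requires.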

12 Example architecture: G200, e.g. in the GTX 280

13 Example problems

14 Ranking and unranking with parity

15 2-Bit BFS

16 1-Bit BFS

17 Sliding-tile puzzle

18 Some Results…

19 Further results …

20 Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  GPGPU languages  CUDA  Small Example

21 GPGPU Languages  RapidMind: supports multicore CPUs, ATI, NVIDIA, and Cell; C++ code is analysed and compiled for the target hardware. Accelerator (Microsoft): library for .NET languages. BrookGPU (Stanford University): supports ATI and NVIDIA; own language, a variant of ANSI C.

22 Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  Programming languages  CUDA  Small Example

23 CUDA  Programming language; similar to C; file suffix .cu; own compiler called nvcc; can be linked with C code.

24 CUDA  [Diagram: C++ code is compiled with GCC, CUDA code is compiled with nvcc; both are linked with ld into one executable]

25 CUDA  Additional variable types: dim3, int3, char3

26 CUDA  Different types of functions: __global__ (invoked from the host), __device__ (called from the device). Different types of variables: __device__ (located in VRAM), __shared__ (located in SRAM).
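A minimal sketch of these qualifiers working together, assuming a CUDA-capable device. The names `d_offset`, `square`, `square_plus_offset`, and `run_square_plus_offset` are hypothetical and only serve this illustration: a `__device__` variable in VRAM is set from the host, a `__device__` function is called from the kernel, and the `__global__` kernel is launched from the host.

```cuda
#include <cuda_runtime.h>
#include <vector>

// __device__ variable: located in VRAM, visible to all threads.
__device__ int d_offset;

// __device__ function: callable only from code running on the GPU.
__device__ int square(int x) { return x * x; }

// __global__ function: the kernel, invoked from the host.
__global__ void square_plus_offset(int *data, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) data[id] = square(data[id]) + d_offset;
}

// Hypothetical host helper: returns v[i]*v[i] + offset for every element.
std::vector<int> run_square_plus_offset(std::vector<int> v, int offset) {
    int n = (int)v.size();
    if (n == 0) return v;
    int *data_d = nullptr;
    // Copy the host value into the __device__ variable.
    cudaMemcpyToSymbol(d_offset, &offset, sizeof(int));
    cudaMalloc((void **)&data_d, n * sizeof(int));
    cudaMemcpy(data_d, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    square_plus_offset<<<(n + 127) / 128, 128>>>(data_d, n);
    cudaMemcpy(v.data(), data_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data_d);
    return v;
}
```

With the input `{1, 2, 3}` and an offset of 10, the helper should return `{11, 14, 19}`.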

27 CUDA  Calling the kernel function: name<<<dimGrid, dimBlock>>>(...). dimGrid: grid dimensions (groups). dimBlock: block dimensions (threads).
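To show how grid and block dimensions enter a launch, here is a small sketch with a two-dimensional configuration, assuming a CUDA-capable device. `fill_2d` and `run_fill_2d` are hypothetical names, and the 16x16 block shape is an arbitrary choice for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one matrix cell and stores the cell's linear index.
__global__ void fill_2d(int *m, int width, int height) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (col < width && row < height) m[row * width + col] = row * width + col;
}

// Hypothetical host helper: returns a width*height matrix in row-major
// order where every cell holds its own linear index.
std::vector<int> run_fill_2d(int width, int height) {
    if (width <= 0 || height <= 0) return std::vector<int>();
    std::vector<int> m(width * height);
    int *m_d = nullptr;
    cudaMalloc((void **)&m_d, m.size() * sizeof(int));
    dim3 dimBlock(16, 16, 1);                    // threads per group
    dim3 dimGrid((width + 15) / 16,              // groups, rounded up so
                 (height + 15) / 16, 1);         // the whole matrix is covered
    fill_2d<<<dimGrid, dimBlock>>>(m_d, width, height);
    cudaMemcpy(m.data(), m_d, m.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(m_d);
    return m;
}
```

Unused dimensions are set to 1, not 0, because a launch with a zero-sized dimension is invalid.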

28 CUDA  Memory handling: cudaMalloc(...) allocates VRAM; cudaMemcpy(...) copies memory; cudaFree(...) frees VRAM.
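The three calls can be exercised without any kernel. The sketch below, assuming a CUDA-capable device, copies a host array into VRAM and back and checks every return code. The helpers `ok` and `roundtrip` are hypothetical names introduced for this example.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: prints a message if a CUDA call failed.
static bool ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Copies `src` to VRAM and back into `dst`; returns true on success.
bool roundtrip(const std::vector<float> &src, std::vector<float> &dst) {
    size_t bytes = src.size() * sizeof(float);  // sizes are given in bytes
    dst.assign(src.size(), 0.0f);
    if (bytes == 0) return true;
    float *buf_d = nullptr;
    if (!ok(cudaMalloc((void **)&buf_d, bytes), "cudaMalloc")) return false;
    bool good =
        ok(cudaMemcpy(buf_d, src.data(), bytes, cudaMemcpyHostToDevice),
           "copy to device") &&
        ok(cudaMemcpy(dst.data(), buf_d, bytes, cudaMemcpyDeviceToHost),
           "copy to host");
    good = ok(cudaFree(buf_d), "cudaFree") && good;  // always release VRAM
    return good;
}
```

Note that `cudaMalloc` and `cudaMemcpy` take sizes in bytes, so element counts must be multiplied by `sizeof` of the element type.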

29 CUDA  Distinguishing threads: blockDim: number of threads per group. blockIdx: ID of the group (starting with 0). threadIdx: ID of the thread within its group (starting with 0). Global ID: id = blockDim.x*blockIdx.x+threadIdx.x
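The ID formula can be checked with a short sketch in which each thread records its own group and thread index at its global position. It assumes a CUDA-capable device; `where_am_i` and `run_where_am_i` are hypothetical names, and the built-in `int2` vector type is used to hold the pair.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stores (group ID, thread ID within the group) at its global ID.
__global__ void where_am_i(int2 *out, int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) out[id] = make_int2(blockIdx.x, threadIdx.x);
}

// Hypothetical host helper: launches enough groups to cover n elements.
std::vector<int2> run_where_am_i(int n, int threads_per_group) {
    if (n <= 0 || threads_per_group <= 0) return std::vector<int2>();
    std::vector<int2> out(n);
    int2 *out_d = nullptr;
    int groups = (n + threads_per_group - 1) / threads_per_group;
    cudaMalloc((void **)&out_d, n * sizeof(int2));
    where_am_i<<<groups, threads_per_group>>>(out_d, n);
    cudaMemcpy(out.data(), out_d, n * sizeof(int2), cudaMemcpyDeviceToHost);
    cudaFree(out_d);
    return out;
}
```

For `run_where_am_i(10, 4)` three groups are launched; element 5 should hold group 1, thread 1, and the last two threads of group 2 do nothing because of the `id < n` guard.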

30 Overview  Cluster / Multicore / GPU comparison  Computing on the GPU  Programming languages  CUDA  Small Example

31 CUDA

// CPU version
void inc(int *a, int b, int N) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}
int main() {
    ...
    inc(a, b, N);
}

// CUDA version
__global__ void inc(int *a, int b, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
        a[id] = a[id] + b;
}
int main() {
    ...
    int *a_d;
    cudaMalloc((void **)&a_d, N * sizeof(int));
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(blocksize, 1, 1);                        // unused dimensions are 1, not 0
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);   // round up to cover all N elements
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
}
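The slide stops at the kernel launch. A complete variant, sketched here under the assumption of a CUDA-capable device, also copies the result back to the host and frees the VRAM. The wrapper `inc_on_gpu` and its default block size of 256 are hypothetical additions; the kernel itself follows the slide.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel from the slide: each thread increments one array element by b.
__global__ void inc(int *a, int b, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N) a[id] = a[id] + b;
}

// Hypothetical host wrapper: returns a copy of `a` with b added to each entry.
std::vector<int> inc_on_gpu(std::vector<int> a, int b, int blocksize = 256) {
    int N = (int)a.size();
    if (N == 0 || blocksize <= 0) return a;
    int *a_d = nullptr;
    cudaMalloc((void **)&a_d, N * sizeof(int));
    cudaMemcpy(a_d, a.data(), N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(blocksize, 1, 1);
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);
    // Copy the result back; this call waits until the kernel has finished.
    cudaMemcpy(a.data(), a_d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    return a;
}
```

For example, `inc_on_gpu({1, 2, 3}, 5)` should return `{6, 7, 8}`. Error checking is omitted here to stay close to the slide.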

32 Real-world Example  LTL model checking: traversing an implicit graph G=(V,E). Vertices are called states; edges are represented by transitions. Duplicate removal is needed.

33 Real-world Example  External model checking: generate the graph with external BFS. Each BFS layer needs to be sorted. GPUs have proven to be fast at sorting.

34 Real-world Example  Challenges: millions of states in one layer; huge state size; fast access only in SRAM; elements need to be moved.

35 Real-world Example  Solutions. Gpuqsort: quicksort optimized for GPUs; intensive swapping in VRAM. Bitonic-based sorting: fast for subgroups; concatenating groups is slow.
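To illustrate why bitonic sorting is fast within a group, the following sketch sorts fixed-size chunks entirely in shared RAM, one group per chunk. It is a generic textbook bitonic sort, not the implementation used in the talk, and it assumes a CUDA-capable device. `SIZE`, `bitonic_sort_block`, and `sort_chunks` are hypothetical names. Merging the sorted chunks into one sequence, the slow step named on the slide, is deliberately left out.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

#define SIZE 256  // elements per group; must be a power of two

// One group sorts SIZE elements in shared RAM, one thread per element.
__global__ void bitonic_sort_block(int *data) {
    __shared__ int s[SIZE];
    int tid = threadIdx.x;
    s[tid] = data[blockIdx.x * SIZE + tid];
    __syncthreads();
    for (int k = 2; k <= SIZE; k *= 2) {          // size of bitonic sequences
        for (int j = k / 2; j > 0; j /= 2) {      // compare distance
            int partner = tid ^ j;
            if (partner > tid) {                  // one thread per pair swaps
                bool ascending = (tid & k) == 0;
                if ((s[tid] > s[partner]) == ascending) {
                    int t = s[tid];
                    s[tid] = s[partner];
                    s[partner] = t;
                }
            }
            __syncthreads();
        }
    }
    data[blockIdx.x * SIZE + tid] = s[tid];
}

// Hypothetical host helper: sorts each chunk of SIZE elements independently.
// The input is padded with INT_MAX to a multiple of SIZE; the padding ends
// up at the end of the last chunk and is cut off again.
std::vector<int> sort_chunks(std::vector<int> v) {
    size_t n = v.size();
    if (n == 0) return v;
    size_t padded = (n + SIZE - 1) / SIZE * SIZE;
    v.resize(padded, INT_MAX);
    int *data_d = nullptr;
    cudaMalloc((void **)&data_d, padded * sizeof(int));
    cudaMemcpy(data_d, v.data(), padded * sizeof(int), cudaMemcpyHostToDevice);
    bitonic_sort_block<<<(unsigned int)(padded / SIZE), SIZE>>>(data_d);
    cudaMemcpy(v.data(), data_d, padded * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data_d);
    v.resize(n);
    return v;
}
```

All compare-and-swap steps touch only shared RAM, and the groups never communicate, which is why the per-group phase is cheap while combining groups requires extra passes over VRAM.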

36 Real-world Example  Our solution: states S presorted by hash H(S); bucket sorted in SRAM by a group. [Diagram: VRAM and SRAM]

37 Real-world Example  Our solution: order given by (H(S), S)

38 Real-world Example  Results

39 Questions??? Programming the GPU

