
Name: Kaiyong Zhao. Supervisor: Dr. X.-W. Chu. Background & Related Work: Multiple-Precision Integer; GPU Computing & CUDA; Multiple-Precision Arithmetic.




1 Name: Kaiyong Zhao. Supervisor: Dr. X.-W. Chu

2 Background & Related Work (Multiple-Precision Integer; GPU Computing & CUDA). Multiple-Precision Arithmetic for CUDA (Multiple-Precision Arithmetics). Implementation on GPUs (Data Structure; Optimization of Data on CUDA; Example). Experimental Result.

3 Multiple-Precision Integer: 32-bit & 64-bit systems; multiple-precision integers. GPU Computing & CUDA: GPGPU; CUDA.

4 Base-10 Integer. Big Integer in the System: the base b is 2^32.
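The base-2^32 representation named on this slide can be sketched as follows: a value x is stored as limbs x_0, x_1, ... with x = x_0 + x_1*b + x_2*b^2 + ... and b = 2^32. This is a minimal host-side C++ sketch, not code from the presentation. The helper names `to_limbs` and `from_limbs` and the least-significant-limb-first order are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split a 64-bit value into base-2^32 limbs,
// least significant limb first (the limb order is an assumption).
std::vector<std::uint32_t> to_limbs(std::uint64_t x) {
    std::vector<std::uint32_t> limbs;
    do {
        limbs.push_back(static_cast<std::uint32_t>(x & 0xFFFFFFFFu));
        x >>= 32;
    } while (x != 0);
    return limbs;
}

// Hypothetical inverse: rebuild the value as the sum of limbs[i] * (2^32)^i.
// Only valid when the value fits in 64 bits, i.e. at most two limbs.
std::uint64_t from_limbs(const std::vector<std::uint32_t>& limbs) {
    assert(limbs.size() <= 2);
    std::uint64_t x = 0;
    for (std::size_t i = limbs.size(); i-- > 0;) {
        x = (x << 32) | limbs[i];
    }
    return x;
}
```

For example, the value 2^32 + 2 is stored as the two limbs 2 and 1. A real big integer simply uses more limbs than fit in a machine word.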

5 Computing Capability. Memory Bandwidth.

6 [Figure: GPU architecture block diagram. Labels: Host, Input Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue, Thread Processor, SP, TF, L1, L2, FB, Streaming Multiprocessor (SM), Streaming Processor (SP).]

7 CUDA: CPU + GPU C. Parallel computing model: Single Instruction, Multiple Thread (SIMT). All threads run the same function (thousands of threads in flight). Each core deals with different data. I/O latency is hidden by multithreading (thousands of threads). Speed up computing / I/O translation. I/O is coalesced into one transaction when the threads of a half-warp access neighboring data. 1 cycle@GPU vs. ~1000 cycles@CPU.

8 Background & Related Work. Multiple-Precision Arithmetic for CUDA (Multiple-Precision Arithmetics). Implementation on GPUs (Data Structure; Optimization of Data on CUDA; Example). Experimental Result.

9 1. Multiple-precision Comparison 2. Multiple-precision Addition 3. Multiple-precision Subtraction 4. Multiple-precision Modular Addition 5. Multiple-precision Modular Subtraction
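The first five operations on this slide can be sketched on base-2^32 limbs as below. This is a minimal host-side C++ sketch under stated assumptions, not the presentation's CUDA code: the names `Limbs`, `mp_compare`, `mp_add`, `mp_sub` and `mp_mod_add` are hypothetical, limbs are stored least significant first, and both operands are assumed to have the same number of limbs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Limbs = std::vector<std::uint32_t>;  // least significant limb first (assumption)

// Comparison: returns -1, 0 or 1 for a < b, a == b, a > b.
int mp_compare(const Limbs& a, const Limbs& b) {
    assert(a.size() == b.size());
    for (std::size_t i = a.size(); i-- > 0;) {  // scan from the most significant limb
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}

// Addition: sum = a + b (mod b^n); returns the final carry (0 or 1).
std::uint32_t mp_add(const Limbs& a, const Limbs& b, Limbs& sum) {
    assert(a.size() == b.size());
    Limbs out(a.size(), 0);
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t t = static_cast<std::uint64_t>(a[i]) + b[i] + carry;
        out[i] = static_cast<std::uint32_t>(t);  // low 32 bits
        carry = t >> 32;                         // high bit is the carry
    }
    sum = std::move(out);
    return static_cast<std::uint32_t>(carry);
}

// Subtraction: diff = a - b (mod b^n); returns the final borrow (1 if a < b).
std::uint32_t mp_sub(const Limbs& a, const Limbs& b, Limbs& diff) {
    assert(a.size() == b.size());
    Limbs out(a.size(), 0);
    std::uint64_t borrow = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t t = static_cast<std::uint64_t>(a[i]) - b[i] - borrow;
        out[i] = static_cast<std::uint32_t>(t);
        borrow = (t >> 32) & 1;  // 64-bit wraparound sets the high bits
    }
    diff = std::move(out);
    return static_cast<std::uint32_t>(borrow);
}

// Modular addition: (a + b) mod m, assuming a < m and b < m.
Limbs mp_mod_add(const Limbs& a, const Limbs& b, const Limbs& m) {
    Limbs s;
    std::uint32_t carry = mp_add(a, b, s);
    if (carry != 0 || mp_compare(s, m) >= 0) {
        Limbs r;
        mp_sub(s, m, r);  // when carry was set, the borrow cancels it
        return r;
    }
    return s;
}
```

Modular subtraction follows the same pattern: subtract, and add the modulus back if a borrow occurred.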

10 6. Multiple-precision Multiplication 7. Multiple-precision Division 8. Multiple-precision Montgomery Reduction 9. Multiple-precision Montgomery Multiplication 10. Barrett Modular Reduction Algorithm
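As an illustration of item 6, schoolbook multiple-precision multiplication on base-2^32 limbs can be sketched as below. This is a host-side C++ sketch under assumptions, not the presentation's implementation: the names `Limbs` and `mp_mul` are hypothetical and limbs are stored least significant first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Limbs = std::vector<std::uint32_t>;  // least significant limb first (assumption)

// Schoolbook multiplication: returns a * b with a.size() + b.size() limbs.
Limbs mp_mul(const Limbs& a, const Limbs& b) {
    assert(!a.empty() && !b.empty());
    Limbs w(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < b.size(); ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < a.size(); ++j) {
            // a[j] * b[i] + w[i + j] + carry cannot overflow 64 bits:
            // (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.
            std::uint64_t t =
                static_cast<std::uint64_t>(a[j]) * b[i] + w[i + j] + carry;
            w[i + j] = static_cast<std::uint32_t>(t);  // low 32 bits stay in place
            carry = t >> 32;                           // high 32 bits move up
        }
        w[i + a.size()] = static_cast<std::uint32_t>(carry);
    }
    return w;
}
```

The inner loop is a multiply-accumulate over one row of partial products, which is also the building block of the Montgomery operations listed in items 8 and 9.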

11 11. Multiple-precision Multiplicative Inversion 12. Multiple-precision Montgomery Exponentiation 13. Montgomery Multi-Exponentiation 14. Multiple-precision Modular Addition …
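To make the Montgomery operations concrete, here is a deliberately simplified sketch of Montgomery multiplication and left-to-right binary Montgomery exponentiation for a single 32-bit limb, with R = 2^32 and an odd modulus. The multiple-precision versions listed on the slides apply the same reduction limb by limb. All names (`Mont32`, `mont_setup`, `mont_mul`, `mont_pow`) are hypothetical, and this is not the presentation's code.

```cpp
#include <cassert>
#include <cstdint>

struct Mont32 {
    std::uint32_t m;          // odd modulus, m > 1
    std::uint32_t m_neg_inv;  // -m^{-1} mod 2^32
    std::uint32_t r_mod_m;    // R mod m, with R = 2^32
    std::uint32_t r2_mod_m;   // R^2 mod m
};

Mont32 mont_setup(std::uint32_t m) {
    assert(m > 1 && m % 2 == 1);
    // Newton iteration for m^{-1} mod 2^32; each step doubles the correct bits.
    std::uint32_t inv = m;
    for (int i = 0; i < 5; ++i) inv *= 2u - m * inv;
    std::uint64_t r = (std::uint64_t{1} << 32) % m;
    return Mont32{m, 0u - inv, static_cast<std::uint32_t>(r),
                  static_cast<std::uint32_t>((r * r) % m)};
}

// Montgomery product: returns a * b * R^{-1} mod m, for a < m and b < m.
std::uint32_t mont_mul(const Mont32& c, std::uint32_t a, std::uint32_t b) {
    std::uint64_t t = static_cast<std::uint64_t>(a) * b;
    std::uint32_t u = static_cast<std::uint32_t>(t) * c.m_neg_inv;
    std::uint64_t um = static_cast<std::uint64_t>(u) * c.m;
    // The low words of t and u*m sum to 0 mod 2^32, so they only
    // contribute a carry, which is 1 exactly when low(t) is nonzero.
    std::uint64_t s = (t >> 32) + (um >> 32) +
                      (static_cast<std::uint32_t>(t) != 0 ? 1u : 0u);
    return static_cast<std::uint32_t>(s >= c.m ? s - c.m : s);
}

// Left-to-right binary exponentiation in Montgomery form: x^e mod m.
std::uint32_t mont_pow(std::uint32_t x, std::uint32_t e, std::uint32_t m) {
    Mont32 c = mont_setup(m);
    std::uint32_t xm = mont_mul(c, x % m, c.r2_mod_m);  // x * R mod m
    std::uint32_t acc = c.r_mod_m;                      // 1 in Montgomery form
    for (int i = 31; i >= 0; --i) {
        acc = mont_mul(c, acc, acc);
        if ((e >> i) & 1u) acc = mont_mul(c, acc, xm);
    }
    return mont_mul(c, acc, 1u);  // convert back out of Montgomery form
}
```

The point of the Montgomery form is that every reduction is a shift by 32 bits plus at most one subtraction, with no division inside the exponentiation loop.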

12 Background & Related Work. Multiple-Precision Arithmetic for CUDA. Implementation on GPUs (Data Structure; Optimization of Data on CUDA; Example). Experimental Result.

13 Data Structure: two types of data structure. Constant Value: use cached constant memory for constants. Temp Value: use shared memory for temporary values. Balance Resource: balance the threads and memory. Example: data encoding.

14 C = vectorA * Matrix B % prime
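The computation on this slide can be sketched on the host as below. To keep the sketch short, each element is assumed to be a single 32-bit word and the matrix is assumed to be stored row-major, whereas the presentation targets multiple-precision elements. The function name `vec_mat_mod` and the parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes c[j] = (sum over i of a[i] * B[i][j]) mod prime.
// b holds the matrix B row-major with `cols` columns (layout is an assumption).
std::vector<std::uint32_t> vec_mat_mod(const std::vector<std::uint32_t>& a,
                                       const std::vector<std::uint32_t>& b,
                                       std::size_t cols,
                                       std::uint32_t prime) {
    assert(prime > 1);
    assert(b.size() == a.size() * cols);
    std::vector<std::uint32_t> c(cols, 0);
    // Each output column is independent of the others, so on a GPU
    // one thread per column j would be a natural mapping (assumption).
    for (std::size_t j = 0; j < cols; ++j) {
        std::uint64_t acc = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            std::uint64_t prod = static_cast<std::uint64_t>(a[i] % prime) *
                                 (b[i * cols + j] % prime);
            acc = (acc + prod % prime) % prime;  // stay below prime at each step
        }
        c[j] = static_cast<std::uint32_t>(acc);
    }
    return c;
}
```

For example, with a = (1, 2), B = ((3, 4), (5, 6)) and prime = 7, the result is (13 mod 7, 16 mod 7) = (6, 2).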

15 There is no cache for global memory on G80/G200. Constant memory & texture memory have a small cache. I/O latency is 400-600 clock cycles. This is the bottleneck, and the key to optimization!


19 Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment, of size equal to: 32 bytes if all threads access 8-bit words; 64 bytes if all threads access 16-bit words; 128 bytes if all threads access 32-bit or 64-bit words. This holds for any pattern of addresses requested by the half-warp, including patterns where multiple threads access the same address.

20 [Figure: Thread 0 through Thread 15 of a half-warp mapped to Address 0 through Address 252, spanning Segment 0 (128B) and Segment 1 (128B), with the label "Reduced to 32B".] Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.
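The segment rule above can be checked on the host with a small model. The sketch below counts how many distinct segments the 16 threads of a half-warp touch, which is used here only as a rough proxy for the number of memory transactions. It then compares two hypothetical ways of storing many big integers with 32-bit limbs: number-major (all limbs of one number are contiguous) and limb-major (limb i of every number is contiguous). Both layouts and all names are assumptions for illustration, not the data encoding used in the presentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Segment size in bytes for a given word size, following the slide's rule.
std::size_t segment_size(std::size_t word_bytes) {
    assert(word_bytes == 1 || word_bytes == 2 || word_bytes == 4 ||
           word_bytes == 8 || word_bytes == 16);
    if (word_bytes == 1) return 32;
    if (word_bytes == 2) return 64;
    return 128;
}

// Number of distinct segments touched by the given byte addresses.
std::size_t segments_touched(const std::vector<std::uint64_t>& addrs,
                             std::size_t word_bytes) {
    std::set<std::uint64_t> segs;
    for (std::uint64_t a : addrs) segs.insert(a / segment_size(word_bytes));
    return segs.size();
}

// Byte addresses read by threads 0..15 when thread t loads limb `limb`
// of number t, for the two hypothetical layouts.
std::vector<std::uint64_t> half_warp_addresses(bool limb_major,
                                               std::size_t limb,
                                               std::size_t limbs_per_number,
                                               std::size_t numbers) {
    assert(numbers >= 16 && limb < limbs_per_number);
    std::vector<std::uint64_t> addrs;
    for (std::size_t t = 0; t < 16; ++t) {
        std::size_t index = limb_major ? limb * numbers + t
                                       : t * limbs_per_number + limb;
        addrs.push_back(static_cast<std::uint64_t>(index) * 4);  // 32-bit limbs
    }
    return addrs;
}
```

With 32 limbs per number, the number-major layout places the 16 loads 128 bytes apart, so they fall in 16 different segments, while the limb-major layout places them in 64 contiguous bytes inside a single segment.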

21 C = vectorA * Matrix B % prime

22 Background & Related Work. Multiple-Precision Arithmetic for CUDA. Implementation on GPUs. Experimental Result


24 CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread) GPU: XFX GTX280, 1.24 GHz

25 C = vectorA * Matrix B % prime

26 CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread) GPU: XFX GTX280, 1.24 GHz

27 1. Multiple-Precision Arithmetic 2. GPU Computing & Optimization 3. Example & Result 4. Summary


