Name: Kaiyong Zhao
Supervisor: Dr. X.-W. Chu
Outline
- Background & Related Work: Multiple-Precision Integer; GPU Computing & CUDA
- Multiple-Precision Arithmetic for CUDA
- Multiple-Precision Arithmetic Implementation on GPUs: Data Structure; Optimization of Data on CUDA; Example
- Experimental Result
Background & Related Work
- Multiple-Precision Integer: 32-bit & 64-bit systems
- GPU Computing & CUDA: GPGPU, CUDA
Just as an ordinary number is written in base 10, a big integer on a 32-bit system is stored as an array of digits in base b = 2^32.
- Computing capability
- Memory bandwidth
[Figure: G80 architecture — host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull; an array of streaming multiprocessors (SMs), each built from streaming processors (SPs) with L1 cache and texture filtering (TF); L2 caches and framebuffer (FB) partitions.]
CUDA: a C-based parallel computing model spanning CPU + GPU.
- Single Instruction, Multiple Threads (SIMT): all threads run the same function (thousands of threads in flight), with each core working on different data.
- I/O latency is hidden by running many threads (more than 1000), speeding up both computing and I/O transactions.
- Coalescing: when the threads of a half-warp access neighboring data, their loads are combined into one memory transaction.
- 1 CPU thread vs. ~1000 GPU threads in flight.
Outline (Background & Related Work covered)
- Multiple-Precision Arithmetic for CUDA
- Multiple-Precision Arithmetic Implementation on GPUs: Data Structure; Optimization of Data on CUDA; Example
- Experimental Result
1. Multiple-precision Comparison
2. Multiple-precision Addition
3. Multiple-precision Subtraction
4. Multiple-precision Modular Addition
5. Multiple-precision Modular Subtraction
6. Multiple-precision Multiplication
7. Multiple-precision Division
8. Multiple-precision Montgomery Reduction
9. Multiple-precision Montgomery Multiplication
10. Barrett Modular Reduction Algorithm
11. Multiple-precision Multiplicative Inversion
12. Multiple-precision Montgomery Exponentiation
13. Montgomery Multi-Exponentiation
14. Multiple-precision Modular Addition …
Outline (Background & Related Work, Multiple-Precision Arithmetic for CUDA covered)
- Implementation on GPUs: Data Structure; Optimization of Data on CUDA; Example
- Experimental Result
Two types of data structure:
- Constant values: placed in constant memory to benefit from its cache.
- Temporary values: kept in shared memory.
Balance the number of threads against per-thread memory resources; data encoding; example.
Example: C = vectorA * MatrixB % prime
On G80/GT200 there is no cache for global memory, and constant and texture memory have only small caches. Global-memory I/O latency is hundreds of clock cycles. This is the bottleneck, and the key to optimization!
Global memory accesses by the threads of a half-warp can be coalesced when the words accessed by all threads lie in the same segment of size:
- 32 bytes if all threads access 8-bit words
- 64 bytes if all threads access 16-bit words
- 128 bytes if all threads access 32-bit or 64-bit words
Any pattern of addresses requested by the half-warp qualifies, including patterns where multiple threads access the same address.
[Figure: threads 0-15 of a half-warp reading 32-bit words whose addresses straddle the 128-byte boundary between segment 0 (128 B) and segment 1 (128 B); the transaction for the partially covered segment is reduced to 32 B.]
Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64- and 128-bit data.
Outline (Background & Related Work, Multiple-Precision Arithmetic for CUDA, Implementation on GPUs covered)
- Experimental Result
CPU: Intel® Core™ i7 CPU 2.80 GHz (single thread) GPU: XFX GTX280, 1.24 GHz
Benchmark: C = vectorA * MatrixB % prime
Summary
1. Multiple-Precision Arithmetic
2. GPU Computing & Optimization
3. Example & Result
4. Summary