OpenSSL acceleration using Graphics Processing Units


OpenSSL acceleration using Graphics Processing Units Pedro Miguel Costa Saraiva

Introduction
- Cryptography: the study of security techniques
- SSL: a set of rules governing authentication and encrypted client/server communication
- De facto standard for secure electronic communications
- Computationally intensive
- Large volumes of SSL traffic impact performance

Introduction
- GPU: a specialised processing unit designed to manipulate graphics
- Originally used solely for graphics calculations
- Recent developments enable its use for general purpose computing
- Massive computational power

Introduction: OpenSSL
- Open-source implementation of the SSL and TLS protocols
- Core library implements a variety of cryptographic functions
- Used intensively by an extremely large number of both open and proprietary applications

Introduction: Objectives
- Efficiently offload cryptographic operations onto a GPU
- Add GPU functionality to OpenSSL
- Lighten the load on the CPU

Introduction: Structure
- State of the art
- OpenSSL
- GPU
- Programming the GPU
- OpenCL
- CUDA
- OpenCL vs CUDA
- Main challenges
- Implementation
- Results
- Conclusion

State of the art: OpenSSL
- Commercial-grade, full-featured open source toolkit
- Divided into libssl and libcrypto
- Core library written in C
- Supports accelerator hardware via engines

State of the art: GPU
- Massive parallel processing power
- Roughly ten times the floating point capability of a high-end CPU
- Faster growth rate than CPUs

State of the art: GPU - Programming
- At the end of the 1990s, graphics cards could not be programmed
- Things changed in 2001 with the release of DirectX 8 and OpenGL
- Programmers had to express their computations in terms of textures, vertices and shader programs

State of the art: GPU - Programming
- 2006: NVIDIA created the CUDA framework; ATI created the CTM low-level framework
- 2008: NVIDIA and ATI joined the Khronos Group
- Development of an industry standard for hybrid computing
- OpenCL version 1.0 released in December 2008

State of the art: GPU - OpenCL
- Open, royalty-free standard for general purpose programming
- Supports CPUs, GPUs, and other types of processors
- Maintained by the non-profit consortium Khronos Group
- Adopted by Intel, AMD, NVIDIA, and ARM Holdings

State of the art: GPU - OpenCL
- API for coordinating parallel computation across different processors
- Cross-platform programming language, a subset of ISO C99
- Low performance on NVIDIA GPUs

State of the art: GPU - CUDA
- Proprietary hardware and software architecture
- Designed by NVIDIA
- Manages computations on a GPU
- API is programmed with "C for CUDA"
- Third-party wrappers available for other languages

State of the art: GPU - Main Challenges
- Well suited to extremely parallel problems
- Interaction between threads should be minimal
- Divergent execution paths are slow
- Limited memory
- Slow memory swapping
- Data-intensive operations are discouraged
- No file or standard I/O operations

Implementation: Structure
- OpenSSL
- AES
- RSA Key Generation
- RSA Cipher

Implementation: OpenSSL
- The ENGINE component supports alternative cryptography implementations
- Supports dynamic loading of external engines

Implementation: OpenSSL Engine
- A binding function defines the supported algorithms
- It registers pointers to the functions implementing those algorithms

Implementation: AES
- CBC mode encryption cannot be parallelised
- The previous ciphertext block is required to begin encryption of the next one
- CBC mode decryption can be parallelised
- All blocks are decrypted in parallel
- ECB mode can be parallelised
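
The dependency argument above can be illustrated with a small sketch. It is not the presentation's code: the block function below is a toy, insecure stand-in for AES (an assumption made so the example stays self-contained), and all function names are hypothetical. What it shows is that CBC encryption must walk the blocks in order, while each CBC decryption step reads only ciphertext that is already available, so the blocks can be processed in any order.

```python
BLOCK = 16  # AES block size in bytes


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


# Toy stand-in for the AES block function (NOT secure, assumption for the demo):
# XOR with the key, then rotate left by one byte. It only needs to be invertible.
def toy_encrypt_block(key, block):
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]


def toy_decrypt_block(key, block):
    unrotated = block[-1:] + block[:-1]  # rotate right by one byte
    return xor(unrotated, key)


def split_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def cbc_encrypt(key, iv, plaintext):
    # Serial by construction: block i needs ciphertext block i-1.
    prev, out = iv, []
    for block in split_blocks(plaintext):
        prev = toy_encrypt_block(key, xor(block, prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt_block(key, iv, cipher_blocks, i):
    # Block i depends only on ciphertext blocks i-1 and i, which are both
    # known up front, so each index could be handled by its own thread.
    prev = iv if i == 0 else cipher_blocks[i - 1]
    return xor(toy_decrypt_block(key, cipher_blocks[i]), prev)


def cbc_decrypt(key, iv, ciphertext, order=None):
    blocks = split_blocks(ciphertext)
    indices = range(len(blocks)) if order is None else order
    out = [b""] * len(blocks)
    for i in indices:  # any processing order yields the same plaintext
        out[i] = cbc_decrypt_block(key, iv, blocks, i)
    return b"".join(out)
```

Decrypting with `order=reversed(range(n))` gives the same plaintext as the natural order, which is the property a one-thread-per-block kernel relies on.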

Implementation: AES
- Initialisation
- Key expansion is performed on the CPU
- Cipher
- Initialises the GPU
- Allocates host and GPU memory for input and output data

Implementation: AES
- Cipher
- Input data is transferred to the GPU memory
- All data is transferred at once
- The GPU kernel is called
- Output data is transferred from the GPU memory

Implementation: AES
- GPU Kernel
- For CBC encryption, a single thread is launched
- It encrypts every block serially
- For CBC decryption and ECB operations, one thread is launched per block
- All blocks are processed in parallel
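
As a rough illustration of the thread mapping described above, the sketch below computes how many threads a launch would need for a given input. The function name, the mode strings, and the group size of 256 threads are assumptions for the example; the slides do not state the actual launch parameters.

```python
AES_BLOCK_BYTES = 16
THREADS_PER_GROUP = 256  # assumed group size, not taken from the slides


def launch_config(data_len, mode, encrypt):
    """Return (number of thread groups, threads per group) for one request."""
    if data_len % AES_BLOCK_BYTES != 0:
        raise ValueError("input must be padded to a multiple of 16 bytes")
    n_blocks = data_len // AES_BLOCK_BYTES
    if mode == "CBC" and encrypt:
        # Chained mode: one thread walks all blocks serially.
        return (1, 1)
    # CBC decryption and ECB: one thread per 16-byte block.
    groups = -(-n_blocks // THREADS_PER_GROUP)  # ceiling division
    return (groups, min(n_blocks, THREADS_PER_GROUP))
```

With these assumptions a 3 KB input maps to 192 threads in a single group, while 1 MiB of data needs 65,536 threads, which is where the parallel modes have enough work to occupy the GPU.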

Implementation: RSA Key Generation
- Generation function (CPU side)
- Calls the GPU to generate a large number of prime candidates
- No more numbers are generated until the initial pool is exhausted

Implementation: RSA Key Generation
- Generation function (GPU call)
- The GPU RNG is initialised
- Device memory is allocated
- A large number of threads is launched to generate prime BIGNUMs

Implementation: RSA Key Generation
- Generation function (GPU kernel)
- A random BIGNUM p is generated
- p is tested for primality with the Miller-Rabin probabilistic primality test
- BIGNUMs determined to be prime are written into global memory
- Each thread tests one BIGNUM

Implementation: RSA Key Generation
- Generation function (GPU call)
- Output data is copied back to the host
- Required implementing the entire OpenSSL BIGNUM library on the GPU
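
The key-generation flow described in the slides above (a batch of candidates, one Miller-Rabin test per thread, and a pool that is refilled only when empty) can be sketched on the CPU as follows. This is an illustrative model, not the presentation's code: `PrimePool`, `gpu_generate_primes`, the batch size of 512 and the 20 test rounds are all assumptions, and Python's `random` module stands in for the GPU RNG (it is not suitable for real key material).

```python
import random


def is_probable_prime(n, rounds=20, rng=random):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness: n is definitely composite
    return True  # no witness found: n is probably prime


def random_candidate(bits, rng):
    # Force the top bit (full length) and the low bit (odd number).
    return rng.getrandbits(bits) | (1 << (bits - 1)) | 1


def gpu_generate_primes(bits, n_threads, rng):
    """Stand-in for the GPU call: each 'thread' tests exactly one candidate."""
    candidates = [random_candidate(bits, rng) for _ in range(n_threads)]
    return [c for c in candidates if is_probable_prime(c, rng=rng)]


class PrimePool:
    """CPU-side pool: a new batch is requested only when the pool is empty."""

    def __init__(self, bits, n_threads=512, seed=None):
        self.bits = bits
        self.n_threads = n_threads
        self.rng = random.Random(seed)  # NOT cryptographically secure
        self.pool = []

    def next_prime(self):
        while not self.pool:
            self.pool = gpu_generate_primes(self.bits, self.n_threads, self.rng)
        return self.pool.pop()
```

Because one batch usually yields several primes, generating many keys amortises the cost of each GPU call, which is consistent with the slides reporting the benefit for multiple keys.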

Implementation: RSA Cipher
- BIGNUMs used in RSA must be broken down into small words
- Multiple threads can each process a word
- The Chinese Remainder Theorem splits a private-key operation into two exponentiations with half-size moduli
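
A minimal sketch of the Chinese Remainder Theorem split mentioned above, using the textbook recombination (Garner's formula). The tiny primes and the dictionary layout are assumptions chosen for readability; real keys use primes of 512 bits or more, and `pow(e, -1, m)` requires Python 3.8 or later.

```python
def make_crt_key(p, q, e):
    """Build a toy RSA key with the extra values needed for CRT decryption."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))  # private exponent
    return {
        "n": n, "e": e, "d": d, "p": p, "q": q,
        "dp": d % (p - 1),      # exponent used modulo p
        "dq": d % (q - 1),      # exponent used modulo q
        "qinv": pow(q, -1, p),  # q^-1 mod p, used to recombine
    }


def decrypt_plain(key, c):
    # One exponentiation with the full-size modulus n.
    return pow(c, key["d"], key["n"])


def decrypt_crt(key, c):
    # Two independent exponentiations with half-size moduli p and q.
    m1 = pow(c, key["dp"], key["p"])
    m2 = pow(c, key["dq"], key["q"])
    # Garner's recombination of the two residues.
    h = (key["qinv"] * (m1 - m2)) % key["p"]
    return m2 + h * key["q"]


key = make_crt_key(61, 53, 17)          # toy parameters, n = 3233
ciphertext = pow(65, key["e"], key["n"])
assert decrypt_crt(key, ciphertext) == decrypt_plain(key, ciphertext) == 65
```

The two half-size exponentiations do not depend on each other, so they can run concurrently before the cheap recombination step.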

Implementation: RSA Cipher
- Multi-precision algorithm
- A k-bit integer A is broken into s = k/64 words
- O(s) parallel implementation
- Runs s threads in two phases

Implementation: RSA Cipher
- The first phase accumulates s partial products in 2s steps
- Carries are accumulated in a separate array
- The second phase adds the carries to the intermediate result
- The worst case is s-1 iterations; usually only one or two are needed
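
The slides do not give the multi-precision algorithm itself, so the following is only one plausible reading of the two-phase scheme, simulated sequentially. The assumptions are 64-bit words (suggested by k/64), a product of two s-word operands, and hypothetical helper names. In phase 1 each of the s simulated threads adds the low and high halves of its partial product in two conflict-free steps per round, recording overflow in a separate carry array; phase 2 folds the carries back in until none remain.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1


def to_words(value, s):
    """Split an integer into s little-endian 64-bit words."""
    return [(value >> (WORD_BITS * i)) & MASK for i in range(s)]


def from_words(words):
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def add_word(acc, carry, k, value):
    # Add into word k; overflow is deferred to the carry array, not propagated.
    total = acc[k] + value
    acc[k] = total & MASK
    carry[k + 1] += total >> WORD_BITS


def multiply(a_words, b_words):
    s = len(a_words)
    acc = [0] * (2 * s)
    carry = [0] * (2 * s + 1)
    # Phase 1: s rounds of two steps each (2s steps in total).
    for i in range(s):
        products = [a_words[t] * b_words[i] for t in range(s)]
        for t in range(s):  # step 2i: thread t adds its low half to word t+i
            add_word(acc, carry, t + i, products[t] & MASK)
        for t in range(s):  # step 2i+1: thread t adds its high half to word t+i+1
            add_word(acc, carry, t + i + 1, products[t] >> WORD_BITS)
    # Phase 2: add the deferred carries until the carry array is empty.
    passes = 0
    while any(carry):
        pending, carry = carry, [0] * (2 * s + 1)
        for k in range(2 * s):  # one thread per word, no write conflicts
            add_word(acc, carry, k, pending[k])
        passes += 1
    return acc, passes
```

Within each step every simulated thread touches a distinct word and a distinct carry slot, which is why the inner loops could run in parallel without synchronising on shared locations.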

Results: Testing Framework
- Intel Core i7 950 CPU, 3.07 GHz
- NVIDIA GeForce GTX 580
- The stress tool was used for the heavy CPU load tests
- 300 threads looping on sqrt, malloc/free and sync

Results: AES – CBC Decryption
- Slower until the amount of data reaches 3 KB
- Up to 43 times faster

Results: AES – CBC Encryption
- Slower than the CPU
- Only 2.7% impact on CPU load

Results: AES – ECB Encryption
- Slower until the amount of data reaches 3 KB
- Up to 43 times faster

Results: AES – ECB Decryption
- Slower until the amount of data reaches 3 KB
- Up to 43 times faster

Results: RSA Key Generation
- Slower until the amount of data reaches 3 KB
- Up to 43 times faster

Results: RSA Key Generation – Heavy CPU load
- Slower until the amount of data reaches 3 KB
- Up to 43 times faster

Results: RSA Cipher
- Single message
- Slower until the amount of data reaches 3 KB
- Up to 43 times faster
- Single message, heavy CPU load
- Multiple messages (4096-bit)


Conclusion
- Significant performance boost for AES ECB and CBC decryption
- AES CBC encryption is slower, but significantly lighter on the CPU
- RSA key generation is significantly faster for multiple keys
- RSA cipher is significantly slower

Future Work
- AES CTR cipher mode
- OpenSSL implementation still unstable
- Manager to cache RSA requests for more effective use of the GPU