Faster File Matching Using GPGPUs. Deephan Mohan. Professor: Dr. John Cavazos, University of Delaware.


Faster File Matching Using GPGPUs. Deephan Mohan. Professor: Dr. John Cavazos, University of Delaware. Clarification about the title: only NVIDIA GPUs are supported (extension to others is in progress); partial file matching support is included. 11/15/2018 SAAHPC 2010

Presentation Outline: Introduction; The MD6 Algorithm; CUDA MD6; Experiments and Results; Conclusion.

Introduction

Introduction. File matching is indispensable in fields like forensics and information security, and its reliability rests on the robustness of the hashing algorithms used. Motivation: the advent of GPU computing enables faster hashing algorithms, and therefore faster file matching.

Faster file matching. Hashing algorithms: MD4, MD5, SHA-1, SHA-2 (SHA-256, SHA-512), Tiger, Whirlpool; used in integrity checking, checksum calculation, message authentication, etc. Existing file matching programs: SSDEEP, HASHDEEP, and many proprietary file matching tools.
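For reference, the digest-based file matching that these tools perform can be sketched on the CPU. This is a minimal illustration using Python's hashlib with SHA-256 as a stand-in digest; the function names and the notion of a predetermined hash set mirror the workflow described in these slides, not any specific tool's API.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Hash a file incrementally so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def match_files(paths, known_hashes):
    """Return the paths whose digest appears in a predetermined hash set."""
    return [p for p in paths if file_digest(p) in known_hashes]
```

A file matches exactly when its digest equals one of the known digests; with a cryptographic hash, a digest collision between distinct files is considered computationally infeasible.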

The MD6 Algorithm

Merkle Tree. Computation proceeds bottom-up: each leaf represents a data chunk, and each intermediate node represents a compression node.
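The bottom-up tree computation can be sketched sequentially. This is a simplified illustration: SHA-256 stands in for the MD6 compression function, and the 4-way fan-in (each node compressing up to four children) is an assumption for the sketch, chosen to match the 4-to-1 buffer aggregation shown later in these slides.

```python
import hashlib

def compress(children):
    """Stand-in for an MD6 compression node: hash the concatenated children."""
    return hashlib.sha256(b"".join(children)).digest()

def merkle_root(chunks, fanout=4):
    """Compute a Merkle root bottom-up: leaves are data chunks, and each
    interior node is the compression of a group of child values."""
    level = list(chunks)
    while len(level) > 1:
        level = [compress(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]
```

Because every interior node depends only on its own children, all compressions within one tree level are independent, which is exactly the parallelism the GPU implementation exploits.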

The MD6 Algorithm. MD6 inputs: M, the message to be hashed (mandatory); d, the desired message digest length in bits (mandatory); K, a key value (optional); r, the number of rounds (optional). MD6 compression: the MD6 word size is 8 bytes, and the MD6 buffer size is 64 words (512 bytes). Each buffer is preprocessed into an 89-word input block; the compression function f: W^89 -> W^16 produces a 16-word output, which is then post-processed. The final hash is exactly d bits in length.

CUDA MD6

CUDA MD6 Implementation. Step (i): the host buffers in the contents of the source file. Step (ii): allocate adequate memory on the device; if the file is too large, it is chunked. Step (iii): invoke the three kernels in sequence: md6_compress_block() (preprocessing module), md6_compress() (compression module), and md6_rewrite() (MD6 hash aggregation module). Step (iv): repeat step (iii) N+1 times to generate the final hash. Step (v): perform the hash comparison. Step (vi): store the hash in the hash database. Data is kept in GPU memory for the duration of the computation without offloading it to the host, which improves performance.

CUDA MD6 Kernels. md6_compress_block(): data preprocessing module, f: W^64 -> W^89, launched as <<<Grid, Threads>>> = <<<total number of buffers, 1>>>. md6_compress(): performs the MD6 compression, launched as <<<total number of buffers, 16>>>. md6_rewrite(): performs the MD6 hash aggregation, f: W^89 -> W^16, launched as <<<(total number of buffers)/4, 1>>>. The three kernels run sequentially.
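The host-side sequencing of these three kernels across tree levels can be sketched as a launch schedule. This is a hedged reconstruction, not the authors' code: the 4-to-1 buffer reduction per level follows the <<<buffers/4, 1>>> configuration above, and rounding up with ceil for partial groups is an assumption.

```python
from math import ceil

def md6_tree_schedule(num_buffers):
    """Return the (kernel, grid, threads) launch schedule for a file that
    occupies `num_buffers` MD6 buffers: each level preprocesses and
    compresses every buffer, then aggregates groups of four compressed
    outputs into the next level's buffers, until one root hash remains."""
    schedule = []
    while True:
        schedule.append(("md6_compress_block", num_buffers, 1))
        schedule.append(("md6_compress", num_buffers, 16))
        if num_buffers == 1:
            break  # root reached: its 16-word output is the final hash
        num_buffers = ceil(num_buffers / 4)
        schedule.append(("md6_rewrite", num_buffers, 1))
    return schedule
```

For 8 buffers this yields two full levels plus the root pass, consistent with the step-by-step execution pictures on the following slides.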

Preprocessing kernel. Transforms each MD6 buffer into an 89-word input block: 15 words of the constant vector Q (a vector composed of prime-derived constants), 8 words of the key K, the unique control words U and V forming the header, and the last 64 words holding the data chunk.
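The 89-word block layout just described can be assembled as follows. This is a sketch of the layout only: the Q values here are zero placeholders, not the actual constants from the MD6 specification, and the function name is hypothetical.

```python
def build_block(key8, u, v, chunk64, q15=None):
    """Assemble the 89-word MD6 compression input:
    15 constant words Q, 8 key words K, control words U and V,
    and 64 data words."""
    if q15 is None:
        q15 = [0] * 15  # placeholder constants; real MD6 specifies fixed Q values
    assert len(q15) == 15 and len(key8) == 8 and len(chunk64) == 64
    block = q15 + key8 + [u, v] + chunk64
    assert len(block) == 89  # 15 + 8 + 2 + 64
    return block
```

The 15 + 8 + 2 header words plus 64 data words account for the W^64 -> W^89 mapping performed by md6_compress_block().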

CUDA MD6 Compression Kernel

for each block do
    set index to blockID
    for each data[index] do
        set i to n + threadID    /* 16 steps */
        x = S[i-n] xor A[i-n] xor A[i-t0]
        x = x xor (A[i-t1] and A[i-t2]) xor (A[i-t3] and A[i-t4])
        x = x xor (x >> r[i-n])
        A[i] = x xor (x << l[i-n])
    exit CUDA block
exit CUDA kernel call
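The inner step of this kernel can be sketched on the CPU over 64-bit words. The feedback distance n = 89 and tap offsets t0..t4 = (17, 18, 21, 31, 67) follow the MD6 specification; the round constant and the two shift amounts passed in below are illustrative placeholders, since the real per-step values come from MD6's round-constant and shift tables.

```python
MASK = (1 << 64) - 1              # MD6 words are 64-bit
N = 89                            # feedback distance
T = (17, 18, 21, 31, 67)          # tap offsets t0..t4

def md6_step(A, i, S, r_shift, l_shift):
    """One MD6-style step: compute A[i] from five tapped words,
    a round constant S, and a right/left shift pair."""
    x = S ^ A[i - N] ^ A[i - T[0]]
    x ^= (A[i - T[1]] & A[i - T[2]]) ^ (A[i - T[3]] & A[i - T[4]])
    x ^= x >> r_shift
    A.append((x ^ (x << l_shift)) & MASK)
```

In the kernel, the 16 threads of a block each compute one of the 16 steps of a round; here the steps simply run in sequence.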

CUDA MD6 Compression optimizations: coalesced memory reads and writes; use of constant and shared memory within the kernel; unrolling of the compression function loop; preserving the integrity of the compression rounds; and sliding-window thread access.

CUDA MD6 Execution. Step 1: read in the data; call md6_compress_block() with <<<total buffers, threads>>> = <<<8, 1>>>. Step 2: compress the data; call md6_compress() with <<<total buffers, threads>>> = <<<8, 16>>>. (Diagram: a grid of thread blocks, one per buffer.)

CUDA MD6 Execution. Step 3: write the hash into the appropriate node; call md6_rewrite() with <<<total buffers, threads>>> = <<<2, 1>>>.

CUDA MD6 Execution (next tree level). Step 1: read in the data; call md6_compress_block() with <<<total buffers, threads>>> = <<<2, 1>>>. Step 2: compress the data; call md6_compress() with <<<total buffers, threads>>> = <<<2, 16>>>.

CUDA MD6 Execution. Step 3: write the hash into the appropriate node; call md6_rewrite() with <<<total buffers, threads>>> = <<<2, 1>>>. Final step: write out the final hash; end of the CUDA kernels.

CUDA MD6 for File Matching. Absolute file matching: the message digest is unique; the user can supply a predetermined set of hashes, which are compared against the GPU-generated hashes. File matching can be done in two modes: direct hashing (single files) and recursive hashing (an archive of files). Hashing larger files: larger files are broken down into data chunks; each chunk is hashed, and the chunk hashes are finally aggregated.
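The chunk-and-aggregate scheme for larger files can be sketched as follows. This is an illustration only, with SHA-256 standing in for MD6; the flat two-level aggregation here is a simplification of the multi-level tree described earlier, and the 512-byte chunk size mirrors the MD6 buffer size.

```python
import hashlib

CHUNK_SIZE = 512  # bytes per chunk; an MD6 buffer is 512 bytes

def hash_large(data, chunk_size=CHUNK_SIZE):
    """Split the input into fixed-size chunks, hash each chunk
    independently, then aggregate the chunk hashes into one digest."""
    chunk_hashes = [hashlib.sha256(data[i:i + chunk_size]).digest()
                    for i in range(0, max(len(data), 1), chunk_size)]
    return hashlib.sha256(b"".join(chunk_hashes)).hexdigest()
```

Because chunk hashes are independent, they can be computed in parallel on the GPU; only the final aggregation is sequential.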

Experiments and Results

Benchmarking platform. GPU: NVIDIA GeForce 8800 GTX (112 cores), with CUDA toolkit version 2.2. CPU: quad-core Intel Xeon E5335, running a sequential, iterative implementation of MD6.

Experiment 1: Executing CUDA MD6 on single files

Experiment 2: Executing CUDA MD6 on an archive of files

Experiment 3: Executing CUDA MD6 with varying buffer sizes

Number of compression rounds vs. speedup

Wall-clock time vs. kernel execution time

Conclusion. Speedups range from 2x to more than 250x. Performance degraders: host-to-device data transfer, device initialization, and idle threads. Faster hashing also depends on hash integrity. Speedup should scale with an increased number of GPU cores.

Questions…

Thank you!!!