A Comparison-Free Sorting Algorithm on CPUs


A Comparison-Free Sorting Algorithm on CPUs
Saleh Abdel-hafeez, Jordan University of Science and Technology (JUST), Jordan
Ann Gordon-Ross, University of Florida (UF), USA
Samer AbuBaker, Jordan University of Science and Technology (JUST), Jordan

Highlights
- Principle
- Example
- Potential Key Factors
- CPU Simulation
  - Single-threaded (no parallelism): C code (memory locality), execution-time simulations
  - Multi-threaded (parallelism): C code (atomics and semaphores vs. memory)
- Conclusions

Principle Example

Potential Key Factors
- Two representations of each key, binary and one-hot, with N = 2^K (fewer computations)
- Transpose memory mapping idea: reduce the one-hot storage from N x N to N x 1
- Improves locality (spatial and temporal)
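
The transcript does not preserve the slide's actual C code; the following is a minimal sketch of the N x 1 one-hot reduction under the stated assumptions (unsigned K-bit keys, N = 2^K), with illustrative names rather than the authors' own.

/*
 * Sketch of the N x 1 one-hot reduction (illustrative, not the authors'
 * code).  Instead of an N x N one-hot matrix (one N-bit row per key),
 * each K-bit key directly indexes a single N-entry vector of counters,
 * so the footprint stays N x 1.  Duplicate keys accumulate in one slot.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { K = 16, N = 1 << K };            /* assumes N = 2^K            */

static uint32_t onehot[N];              /* the N x 1 vector           */

void scatter(const uint16_t *keys, size_t n)
{
    memset(onehot, 0, sizeof onehot);
    for (size_t i = 0; i < n; i++)
        onehot[keys[i]]++;              /* key value == slot index    */
}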

CPU Single Thread

Loop 1 Time vs. Loop 2 Time (Memory Locality)
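
The two loops themselves are not in the transcript; as a hypothetical reconstruction under the same assumptions as the sketch above, Loop 1 scatters keys into the N x 1 vector with data-dependent, cache-unfriendly accesses, while Loop 2 sweeps the vector sequentially, so timing them separately exposes the memory-locality effect the slide compares.

/*
 * Hypothetical reconstruction of the two timed loops (illustrative, not
 * the slide's exact code).
 *   Loop 1: scatter keys into the N x 1 vector (random accesses).
 *   Loop 2: sequential sweep over the vector to emit sorted output.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { K = 16, N = 1 << K };

static uint32_t onehot[N];

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t n = (size_t)1 << 20;
    uint16_t *keys = malloc(n * sizeof *keys);
    uint16_t *out  = malloc(n * sizeof *out);
    for (size_t i = 0; i < n; i++)
        keys[i] = (uint16_t)rand();

    double t0 = seconds();
    for (size_t i = 0; i < n; i++)          /* Loop 1: scatter          */
        onehot[keys[i]]++;
    double t1 = seconds();

    size_t j = 0;
    for (uint32_t v = 0; v < N; v++)        /* Loop 2: sequential sweep */
        for (uint32_t c = onehot[v]; c > 0; c--)
            out[j++] = (uint16_t)v;
    double t2 = seconds();

    printf("loop1 %.6f s, loop2 %.6f s, sorted %zu keys\n",
           t1 - t0, t2 - t1, j);
    free(keys);
    free(out);
    return 0;
}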

Less Dependent on Input Distribution

CPU Single Thread (Time Simulation)

CPU Single Thread: Significant Points
- The fastest
- Minor effect from data type and distribution
- One-dimensional memory
- Fewer computations
- Easy to work with
- Less energy and power

Execution time vs. data size (N = 2^k):

Data size (2^k)   Comparison-free   Quicksort
2^7               6                 15
2^8               41                30
2^10              145               140
2^12              584               602
2^14              2317              2673
2^16              6839              11409
2^18              31414             47064
2^20              69519             148004
2^22              418684            456904
2^24              1828644           1842128
2^26              7654605           7859271
2^28              16689404          33662489
2^29              -                 68942299

CPU Multiple Threads (8 Threads on 4 Cores)
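
The multi-threaded C code is likewise not in the transcript; a hedged pthreads sketch of the "more memory" variant follows (names such as T, worker, and threaded_count are illustrative): each of the T threads scatters its slice of the input into a private N x 1 count vector, with no locks or atomics, and the vectors are summed after the join.

/*
 * Hypothetical pthreads sketch: per-thread private count vectors
 * (replicated memory, no contention), merged after all threads join.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { T = 8, K = 16, N = 1 << K };     /* 8 threads, N = 2^K slots   */

static uint32_t counts[T][N];           /* one private vector/thread  */

struct slice { const uint16_t *keys; size_t lo, hi; int id; };

static void *worker(void *arg)
{
    const struct slice *s = arg;
    uint32_t *my = counts[s->id];
    for (size_t i = s->lo; i < s->hi; i++)
        my[s->keys[i]]++;               /* private, so no contention  */
    return NULL;
}

void threaded_count(const uint16_t *keys, size_t n, uint32_t total[N])
{
    pthread_t tid[T];
    struct slice sl[T];
    memset(counts, 0, sizeof counts);
    for (int t = 0; t < T; t++) {
        sl[t] = (struct slice){ keys, n * t / T, n * (t + 1) / T, t };
        pthread_create(&tid[t], NULL, worker, &sl[t]);
    }
    for (int t = 0; t < T; t++)
        pthread_join(tid[t], NULL);
    for (uint32_t v = 0; v < N; v++) {  /* merge the T private vectors */
        total[v] = 0;
        for (int t = 0; t < T; t++)
            total[v] += counts[t][v];
    }
}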

CPU Multi-Threaded (Time)

Execution Time vs. Data Size

Data size (2^k)   8-thread      Non-threaded
2^7               345           6
2^8               333           41
2^10              363           145
2^12              386           584
2^14              1070          2317
2^16              2085          6839
2^18              7658          31414
2^20              17309         69519
2^22              58822         418684
2^24              234639        1828644
2^26              1084792       7654605
2^28              4411107       60934070
2^30              11969863      2.22E+08
2^32              32481103      8.12E+08
2^34              88139858      -

Memory Usage

Comparison with Parallel Sorting Algorithms
- Avoid mutual exclusion (memory blocking)
- Use more memory for the threaded version, or use atomics for less memory (see the sketch after this slide)

Execution time in seconds vs. data size (2^k):

Algorithm                             2^14             2^20    2^24    2^26
Comparison-free                       0.00107/0.0005   0.002   0.235   1.08
[1] 2011 Bitonic sort (CPU & GPU)     0.0012           0.076   1.97    2.23
[2] 2010 Intel radix sort (CPU)       0.0075           0.025   0.081   0.33
[3] 2009 NVIDIA radix sort (GPU)      0.008            0.031   0.12    0.27
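
The "use atomics for less memory" point above can be sketched the same way (again hypothetical, not the authors' code): every thread updates one shared N x 1 vector with C11 atomic increments, so the extra memory is N x 1 rather than T x N, at the cost of contention when many threads hit the same slot. The threads would be launched exactly as in the previous sketch.

/*
 * Hypothetical C11 atomics sketch: one shared N x 1 vector, updated by
 * all threads with relaxed atomic increments (less memory than the
 * per-thread copies, but possible contention on hot slots).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { K = 16, N = 1 << K };

static atomic_uint shared_counts[N];    /* single shared N x 1 vector */

struct slice { const uint16_t *keys; size_t lo, hi; };

static void *worker(void *arg)
{
    const struct slice *s = arg;
    for (size_t i = s->lo; i < s->hi; i++)
        atomic_fetch_add_explicit(&shared_counts[s->keys[i]], 1u,
                                  memory_order_relaxed);
    return NULL;
}

Which variant wins is exactly the trade-off the slide highlights: private vectors avoid mutual exclusion but cost T x N memory plus a merge pass, while the shared atomic vector keeps the footprint at N x 1.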

Conclusion
- The design is novel, not an incremental variation of other hybrid sorting algorithms (future work); the C code is clear and available
- Comparison-free, single-threaded: the fastest for data sizes < 2^16
- Comparison-free, multi-threaded:
  - CPU (simple 4-core): fastest at data sizes of 2^20
  - CPU (advanced multi-core): needs further investigation
  - GPU (simple and advanced): needs further investigation
- Uses less memory, and less energy use is expected