Parallel Programming Models


Parallel Programming Models
Monica Borra

Outline
- Shared memory models (revision)
- Comparison of shared memory models
- Distributed memory models
- Parallelism in Big Data technologies
- Conclusion

Shared Memory Models
- Multi-threaded: POSIX Threads (Pthreads), TBB, OpenMP
- Multi-processor: Cilk, ArBB, CUDA, Microsoft Parallel Patterns
- Compiler directives and library functions (PRAM-like)
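To make the compiler-directive style concrete, here is a minimal OpenMP sketch of a shared-memory parallel loop; the array size, values, and compile command are illustrative assumptions, not taken from the slides.

    /* saxpy_omp.c -- compile with: gcc -fopenmp saxpy_omp.c -o saxpy_omp */
    #include <stdio.h>

    int main(void) {
        enum { N = 1000000 };
        static float x[N], y[N];
        const float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* The directive asks the runtime to split the iterations across
           the available threads; the loop body itself is unchanged. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);  /* expect 4.0 */
        return 0;
    }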

Comparative Study I
Most commercially available general-purpose computers include hardware features that increase parallelism:
- Hyper-threading, multi-core, and ccNUMA architectures, which support general-purpose threading
- CPU vector instructions and GPUs, which support SIMD execution

Compared Models
Four parallel programming models were selected. Each exploits different hardware parallel features mentioned earlier, and they require different levels of programming skill:
- OpenMP, Intel TBB: parallel threads on multicore systems
- Intel ArBB: threads plus multicore SIMD features
- CUDA: SIMD GPU features
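For contrast with the directive-based example above, here is the same loop expressed with Intel TBB, a library-based model; this sketch assumes the classic tbb/ header layout, which newer oneTBB releases relocate.

    // saxpy_tbb.cpp -- compile with: g++ -std=c++11 saxpy_tbb.cpp -ltbb
    #include <cstdio>
    #include <vector>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    int main() {
        const size_t n = 1000000;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 2.0f;

        // TBB expresses the parallel loop as a library call: the range is
        // split recursively and each chunk runs as a task on a worker thread.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    y[i] = a * x[i] + y[i];
            });

        std::printf("y[0] = %f\n", y[0]);  // expect 4.0
        return 0;
    }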

CUDA
CUDA (Compute Unified Device Architecture) is a C/C++ programming model and API (Application Programming Interface) introduced by NVIDIA to let software developers write general-purpose applications that run on the massively parallel hardware of GPUs. GPUs are well suited to data-parallel applications, i.e., SIMD (Single Instruction, Multiple Data) workloads. Threads running in parallel can communicate through extremely fast on-chip shared memory.
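As a hedged illustration of those two points (data-parallel kernels plus fast shared memory for thread communication), here is a minimal CUDA sketch in which the threads of a block cooperate through __shared__ memory to sum their inputs; the sizes and the use of unified memory are illustrative choices, not taken from the slides.

    // block_sum.cu -- compile with: nvcc block_sum.cu -o block_sum
    #include <cstdio>

    // Each block sums 256 elements; its threads communicate through
    // on-chip __shared__ memory and synchronize with __syncthreads().
    __global__ void block_sum(const float *in, float *out) {
        __shared__ float buf[256];
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) buf[tid] += buf[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];
    }

    int main() {
        const int blocks = 16, threads = 256, n = blocks * threads;
        float *in, *out;  // unified memory keeps the host code short
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, blocks * sizeof(float));
        for (int i = 0; i < n; i++) in[i] = 1.0f;

        block_sum<<<blocks, threads>>>(in, out);
        cudaDeviceSynchronize();

        printf("sum of first block = %f\n", out[0]);  // expect 256.0
        cudaFree(in); cudaFree(out);
        return 0;
    }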

Evaluations: four benchmarks were used, matrix multiplication, simple 2D convolution, histogram computation, and Mandelbrot set computation, run on different underlying computer architectures.

Comparison between OpenMP/TBB and ArBB/CUDA for simple 2D convolution

Comparison Summary:
- OpenMP and TBB show much lower performance than ArBB and CUDA.
- TBB seems to perform worse than OpenMP on single-socket architectures, but the situation reverses on ccNUMA architectures, where TBB shows a significant improvement.
- ArBB performance tends to be comparable with CUDA performance in most cases, although it is normally lower.
- Hence, there is evidence that ArBB applications on a carefully designed top-range multicore, multi-socket architecture, which can take advantage of both TLP and SIMD features, may approach the performance of a top-range CUDA GPGPU.

Comparative Study II
- OpenMP, Pthreads, and Microsoft Parallel Patterns APIs
- Computation of matrix multiplication
- Performed on an Intel i5 processor
- Measured execution time and speedup
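The slides do not include the benchmark source; as a rough sketch of how the matrix-multiplication workload could be parallelized with raw Pthreads, where the matrix size, thread count, and row-banding scheme are illustrative assumptions, one approach is:

    /* matmul_pthreads.c -- compile with: gcc matmul_pthreads.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    #define N 256
    #define NTHREADS 4

    static double A[N][N], B[N][N], C[N][N];
    typedef struct { int row_begin, row_end; } band_t;

    /* Each worker computes a contiguous band of rows of C = A * B. */
    static void *worker(void *arg) {
        band_t *b = (band_t *)arg;
        for (int i = b->row_begin; i < b->row_end; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++) sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

        pthread_t tid[NTHREADS];
        band_t band[NTHREADS];
        int rows = N / NTHREADS;
        for (int t = 0; t < NTHREADS; t++) {
            band[t].row_begin = t * rows;
            band[t].row_end = (t == NTHREADS - 1) ? N : (t + 1) * rows;
            pthread_create(&tid[t], NULL, worker, &band[t]);
        }
        for (int t = 0; t < NTHREADS; t++) pthread_join(tid[t], NULL);

        printf("C[0][0] = %f\n", C[0][0]);  /* expect 2.0 * N = 512.0 */
        return 0;
    }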

Experimental Results:

Distributed Parallel Computing
- Cluster-based
- Message Passing Interface (MPI): the de facto standard; more advantageous when communication between the nodes is high; originally designed for HPC
- Apache Hadoop: parallel processing for Big Data; an implementation of the "MapReduce" programming model
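For readers unfamiliar with the MPI style, a minimal sketch of explicit message passing follows; the value being combined and the process count are illustrative, and only standard MPI calls are used.

    /* sum_mpi.c -- compile: mpicc sum_mpi.c -o sum_mpi ; run: mpirun -np 4 ./sum_mpi */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size, local, total = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = rank + 1;  /* each process contributes one value */

        /* Explicit communication: contributions are combined on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }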

Why is parallelism in Big Data important?
- Innumerable sources: RFID, sensors, social networking
- The three Vs: volume, velocity, and variety

Apache Hadoop
- A framework that allows distributed parallel processing of large data sets
- Batch-processes raw, unstructured data
- Highly reliable and scalable
- Consists of four modules: common utilities, storage (HDFS), resource management (YARN), and parallel processing (MapReduce)
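Hadoop itself exposes MapReduce through its Java API; the single-process C++ sketch below only illustrates what the map, shuffle, and reduce phases compute for a word count and is not Hadoop code (the input lines and helper names are made up for illustration).

    // wordcount_model.cpp -- compile with: g++ -std=c++11 wordcount_model.cpp
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // "Map" phase: each input record is turned into (key, value) pairs.
    std::vector<std::pair<std::string, int>> map_record(const std::string &line) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream words(line);
        std::string w;
        while (words >> w) out.push_back({w, 1});
        return out;
    }

    // "Reduce" phase: all values collected for one key are combined.
    int reduce_key(const std::vector<int> &values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    int main() {
        std::vector<std::string> input = {"big data needs parallelism",
                                          "big clusters process big data"};
        // Shuffle step: group intermediate pairs by key. The framework does
        // this across the cluster; a std::map stands in for it here.
        std::map<std::string, std::vector<int>> grouped;
        for (const auto &line : input)
            for (const auto &kv : map_record(line))
                grouped[kv.first].push_back(kv.second);

        for (const auto &kv : grouped)
            std::cout << kv.first << " " << reduce_key(kv.second) << "\n";
        return 0;
    }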

Case Study: Can we take advantage of MPI to overcome communication overhead in Big Data technologies?
Challenges:
1. Is it worth speeding up communication?
   a. What percentage of time is spent on communication alone?
   b. How do the achievable latency and peak bandwidth of point-to-point communication through MPI compare against Hadoop?
2. How difficult is it to adapt MPI to Hadoop, and what are the minimal extensions to the MPI standard? A pair of new MPI calls supporting Hadoop-style data communication specified via key-value pairs.
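The case study does not reproduce its measurement code; as a sketch of how the MPI side of challenge 1(b), point-to-point latency, could be probed using only standard MPI calls (the message size and iteration count are arbitrary choices, and the new DataMPI primitives are not shown), a simple ping-pong looks like:

    /* pingpong.c -- compile: mpicc pingpong.c -o pingpong ; run: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;
        char buf[1] = {0};  /* 1-byte message, so the time is mostly latency */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)  /* one-way latency is half the average round trip */
            printf("avg one-way latency: %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);

        MPI_Finalize();
        return 0;
    }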

Contributions of the case study:
- Abstracting the requirements of the communication model: a dichotomic, dynamic, data-centric bipartite model based on key-value pairs
- A novel design of DataMPI, a high-performance communication library
- Various benchmarks to demonstrate efficiency and ease of use

Contributions:

Comparison: DataMPI vs. Hadoop
- Several representative Big Data benchmarks: WordCount, TeraSort, K-means, Top-K, PageRank
- Compared on various parameters: efficiency, fault tolerance, ease of use

Comparisons for TeraSort: both Hadoop and DataMPI exhibit similar trends, but DataMPI shows better results in all cases.

Results:
- Efficiency: DataMPI speeds up varied Big Data workloads and improves job execution time by 31%-41%.
- Fault tolerance: DataMPI supports fault tolerance; evaluations show that DataMPI-FT can attain a 21% improvement over Hadoop.
- Scalability: DataMPI achieves scalability as high as Hadoop's, with a 40% performance improvement.
- Flexibility: the coding complexity of using DataMPI is on par with that of traditional Hadoop.

Conclusion:
- The efficiency of a shared-memory parallel programming model depends on the type of program and on how well the underlying hardware's parallel processing features are used.
- Extending MPI to highly computational problems such as Big Data mining is much more efficient than the traditional frameworks.
- Shared-memory models are easy to implement, but MPI gives the best results for more complex problems.

References
1. L. Sanchez, J. Fernandez, R. Sotomayor, J. D. Garcia, "A Comparative Evaluation of Parallel Programming Models for Shared-Memory Architectures", IEEE 10th International Symposium on Parallel and Distributed Processing with Applications, 2012, pp. 363-374.
2. M. Sharma, P. Soni, "Comparative Study of Parallel Programming Models to Compute Complex Algorithm", International Journal of Computer Applications, 2014, pp. 174-180.
3. Apache Hadoop, hadoop.apache.org
4. Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, Zhiwei Xu, "DataMPI: Extending MPI to Hadoop-like Big Data Computing", IEEE 28th International Parallel and Distributed Processing Symposium, 2014, pp. 829-838.
5. Lorin Hochstein, Victor R. Basili, Uzi Vishkin, John Gilbert, "A pilot study to compare programming effort for two parallel programming models", The Journal of Systems and Software, 2008.

Questions? Thank you!!