Compiling Several Classes of Communication Patterns on a Multithreaded Architecture Gagan Agrawal, Department of Computer and Information Sciences, Ohio State University (joint work with Rishi Kumar and Guang Gao at the University of Delaware)

Outline: Motivation; Problems Considered (near-neighbor communication, global reduction, producer-consumer pattern); Compiler Analysis and Cost Model; Code Generation for Combination of Patterns; Results; Conclusion

Motivation: Communication volume and efficiency have a major impact on the performance of parallel code. Multithreading can hide communication and synchronization costs by switching to a different ready thread. Compiler-generated multithreaded code can provide both ease of programming and good performance.

Issues and Approach: What multithreaded programming styles can be used for different types of programs? How can the compiler automatically generate these patterns? What kind of performance can be achieved? The focus is on three types of communication patterns: near-neighbor, global reduction, and producer-consumer.

Near-Neighbor Communication: Widely studied for distributed-memory compilation; implementations typically use collective communication. Communication volume and frequency can limit parallel performance. Example loop:

for (m ...)
  for (n ...)
    A(m,n) = f(B(m-1,n), B(m+1,n), B(m,n-1), B(m,n+1))
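A minimal C sketch of this 5-point near-neighbor update (the grid size, toy data, and the averaging choice of f are illustrative assumptions, not taken from the paper):

    #include <stdio.h>

    #define N 8   /* illustrative grid size */

    /* One sweep of the 5-point near-neighbor update: every interior point
       of A is recomputed from the four neighbors in B.  On a distributed-
       memory machine, the boundary rows/columns owned by neighboring
       processors must arrive before they can be read, which is the
       near-neighbor communication discussed here. */
    static void stencil_sweep(double A[N][N], double B[N][N]) {
        for (int m = 1; m < N - 1; m++)
            for (int n = 1; n < N - 1; n++)
                A[m][n] = 0.25 * (B[m - 1][n] + B[m + 1][n] +
                                  B[m][n - 1] + B[m][n + 1]);
    }

    int main(void) {
        double A[N][N] = {{0}}, B[N][N];
        for (int m = 0; m < N; m++)
            for (int n = 0; n < N; n++)
                B[m][n] = m + n;            /* toy data */
        stencil_sweep(A, B);
        printf("A[1][1] = %g\n", A[1][1]);  /* prints 2 for this data */
        return 0;
    }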

Multithreaded programming style: [Figure: each processor i runs several threads P_i^1 ... P_i^T on its portion of the data; Processors 1, 2, and 3 are shown.] Only the first and last threads on each processor need to wait for data.
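A small self-contained C sketch (row, processor, and thread counts are illustrative) of the row-band decomposition this style implies, printing which threads depend on remote data:

    #include <stdio.h>

    #define ROWS  12   /* illustrative global row count */
    #define PROCS  3   /* processors */
    #define T      2   /* threads per processor */

    /* Each processor owns a contiguous band of rows, subdivided among its
       T threads.  Only the first thread of a processor needs the halo row
       from the previous processor, and only the last thread needs the halo
       row from the next processor; the other threads can start at once. */
    int main(void) {
        int rows_per_proc   = ROWS / PROCS;
        int rows_per_thread = rows_per_proc / T;

        for (int p = 0; p < PROCS; p++)
            for (int t = 0; t < T; t++) {
                int first = p * rows_per_proc + t * rows_per_thread;
                int last  = first + rows_per_thread - 1;
                int waits_prev = (t == 0)     && (p > 0);
                int waits_next = (t == T - 1) && (p < PROCS - 1);
                printf("proc %d thread %d: rows %2d-%2d, waits prev=%d next=%d\n",
                       p, t, first, last, waits_prev, waits_next);
            }
        return 0;
    }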

Compilation and Performance Issues: How can the compiler automatically generate this code? What thread granularity should be chosen? A larger number of threads means more work can proceed without waiting for data from other processors; a smaller number of threads means less threading overhead.

Global Reduction: The main issue is avoiding collective communication phases by using multithreading. [Figure: building block of the reduction tree: data and synchronization flow inward for the Reduce step and outward for the Broadcast step.] Example of a sum reduction:

    real A(n);
    real sum = 0;
    for (j = 0; j < n; j++)
      S1: sum = sum + A(j);
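A minimal sketch of the pairwise combining behind such a reduction tree (plain C, sequential simulation with an illustrative number of per-thread partial sums; not the compiler-generated multithreaded code):

    #include <stdio.h>

    #define P 8   /* illustrative number of threads/processors */

    int main(void) {
        double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};  /* per-thread partial sums */

        /* Binary-tree combine: in step s, element i absorbs element i + s,
           halving the number of active contributors each step, so the
           global sum is formed in log2(P) pairwise steps rather than in a
           single all-to-one collective phase. */
        for (int s = 1; s < P; s *= 2)
            for (int i = 0; i + s < P; i += 2 * s)
                partial[i] += partial[i + s];

        printf("global sum = %g\n", partial[0]);  /* prints 36 */
        return 0;
    }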

Producer-Consumer Pattern: Example: sparse matrix-vector multiplication, A·v = q, where A is an N×N sparse matrix, v is an N×1 vector, and q is the result. Reduction implementations on p processors: the logarithmic tree approach is O(N log p); pipelined reduction in 2p distinct phases is O(N). Pipelined reduction eliminates the need for global barriers or synchronizations.
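One standard way to realize a pipelined O(N) reduction in roughly 2p neighbor-to-neighbor phases is a ring reduce-scatter followed by an allgather; the sequential C simulation below sketches that idea under assumed sizes and is a reconstruction of the general technique, not necessarily the paper's exact scheme:

    #include <stdio.h>

    #define P 4           /* illustrative number of processors */
    #define N 8           /* result-vector length, assumed divisible by P */
    #define B (N / P)     /* block size */

    int main(void) {
        double buf[P][N];  /* buf[i] = processor i's partial result vector */

        for (int i = 0; i < P; i++)
            for (int j = 0; j < N; j++)
                buf[i][j] = i + 1;          /* toy per-processor contributions */

        /* Reduce-scatter (P-1 phases): in phase k, processor i adds block
           ((i-k-1) mod P) taken from its ring predecessor into its buffer.
           Each processor only synchronizes with a neighbor, never globally. */
        for (int k = 0; k < P - 1; k++)
            for (int i = 0; i < P; i++) {
                int src = (i - 1 + P) % P;
                int blk = ((i - k - 1) % P + P) % P;
                for (int j = 0; j < B; j++)
                    buf[i][blk * B + j] += buf[src][blk * B + j];
            }

        /* Allgather (P-1 more phases, ~2P in total): processor i copies block
           ((i-k) mod P), which its predecessor has already completed. */
        for (int k = 0; k < P - 1; k++)
            for (int i = 0; i < P; i++) {
                int src = (i - 1 + P) % P;
                int blk = ((i - k) % P + P) % P;
                for (int j = 0; j < B; j++)
                    buf[i][blk * B + j] = buf[src][blk * B + j];
            }

        /* Every entry now equals 1 + 2 + ... + P = 10 on every processor. */
        printf("buf[0][0] = %g, buf[P-1][N-1] = %g\n",
               buf[0][0], buf[P - 1][N - 1]);
        return 0;
    }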

Pipelining of MVM Reduction [Figure: the matrix-vector multiplication reduction shown pipelined across processors.]

Cost Model for Thread Granularity: The total threading cost depends on the threading overhead and the time spent waiting for data over the network. For problem size N×N, p processors, and T threads per processor: granularity G = N/(pT); overhead of one thread creation = β; amount of computation in one row = C. Minimizing the total threading cost with respect to the number of threads T gives T = sqrt(2CN/(pβ)).
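A small C sketch of evaluating that formula (the values chosen for C, β, and p are placeholder assumptions, not measurements from the paper):

    #include <math.h>
    #include <stdio.h>

    /* Optimal threads per processor from the cost model:
       T = sqrt(2*C*N / (p*beta)), where C is the computation per row,
       beta the per-thread creation overhead (same time unit as C),
       N the problem size, and p the number of processors. */
    static double optimal_threads(double C, double beta, double N, double p) {
        return sqrt(2.0 * C * N / (p * beta));
    }

    int main(void) {
        double N    = 14000.0;   /* rows in CG Class A, as in the results */
        double p    = 8.0;       /* assumed processor count */
        double C    = 5.0e-6;    /* assumed computation time per row */
        double beta = 1.0e-4;    /* assumed thread-creation overhead */

        printf("suggested T per processor ~ %.0f\n",
               optimal_threads(C, beta, N, p));
        return 0;
    }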

Code Generation for Combination of Patterns: Point-to-point communication requires setting up the program so that each thread has the addresses of the threads it communicates with. The three patterns are the Binary Tree Pattern (global reductions), the Linear Chain Pattern (near-neighbor communication), and the Circular Ring Pattern (producer-consumer). Issues in combining patterns: communication setup, thread renumbering, and synchronization between different patterns.

Experimental Results: speedup of Conjugate Gradient, Class A (14,000 rows). [Chart]

Experimental Results: speedup of Tomcatv, problem size 256×256 elements. [Chart]

Comparison of Compiler-Generated and Hand-Written Code: CG, Class A. [Chart]

Comparison of Compiler-Generated and Hand-Written Code: Tomcatv, problem size 256×256 elements. [Chart]

Comparison of Compiler-Generated and Hand-Written Code: Jacobi. [Chart]

Conclusion: Compiler-generated codes for CG and Tomcatv provide good speedup and scalability, and exhibit good performance even for small problem sizes. In accordance with the cost model, Tomcatv, which has a higher amount of computation, gives better performance with a larger number of threads; similarly, the Jacobi benchmark performs well with a smaller number of threads because the problem contains less computation.