High Performance LU Factorization for Non-dedicated Clusters and the future Grid
Toshio Endo, Kenji Kaneda, Kenjiro Taura, Akinori Yonezawa (University of Tokyo)

Background
Computing nodes on clusters and the Grid are shared by multiple applications. To obtain good performance, HPC applications must cope with:
- background processes
- dynamically changing sets of available nodes
- large latencies on the Grid

Performance limiting factor: background processes
Other processes may run in the background (network daemons, interactive shells, etc.). Many typical applications are written in a synchronous style; in such applications, a delay on a single node degrades the overall performance.

Performance limiting factor: large latencies on the Grid
In future Grid environments, bandwidth will be sufficient for HPC applications, but large latencies will remain an obstacle. Synchronous applications suffer from latencies larger than 100 ms.

Available nodes change dynamically
Many HPC applications assume that the set of computing nodes is fixed. If applications support dynamically changing nodes, we can harness computing resources more efficiently!

Goal of this work
An LU factorization algorithm that:
- tolerates background processes and large latencies, by overlapping multiple iterations
- supports dynamically changing nodes, being written in the Phoenix model with a data mapping for dynamically changing nodes
The result: a fast HPC application for non-dedicated clusters and the Grid.

Outline of this talk
- The Phoenix model
- Our LU algorithm: overlapping multiple iterations; data mapping for dynamically changing nodes
- Performance of our LU and HPL
- Related work
- Summary

Phoenix model [Taura et al. 03]
A message passing model for dynamically changing environments. Its key concept is virtual nodes: messages are addressed to virtual nodes, which are mapped onto the physical nodes currently participating in the computation.
[Figure: virtual nodes mapped onto physical nodes]

Overview of our LU
Like typical implementations:
- based on message passing
- the matrix is decomposed into small blocks
- a block is updated by its owner node
Unlike typical implementations:
- asynchronous, data-driven style for overlapping multiple iterations
- cyclic-like data mapping for any (and dynamically changing) number of nodes
(Currently, pivoting is not performed.)

LU factorization (blocked, B x B blocks)

for (k = 0; k < B; k++) {
    A[k][k] = fact(A[k][k]);                    /* diagonal block */
    for (i = k+1; i < B; i++)
        A[i][k] = update_L(A[i][k], A[k][k]);   /* L part */
    for (j = k+1; j < B; j++)
        A[k][j] = update_U(A[k][j], A[k][k]);   /* U part */
    for (i = k+1; i < B; i++)                   /* trailing part */
        for (j = k+1; j < B; j++)
            A[i][j] = A[i][j] - A[i][k] * A[k][j];
}
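The per-block kernels above (fact, update_L, update_U and the trailing update) are where the floating-point work is done, and the experiments later in the talk use GOTO BLAS for them. As a minimal sketch, not the authors' code, the trailing-block update maps naturally onto a single DGEMM call; the CBLAS interface, the contiguous row-major NB x NB block layout, and the name update_trail_block are assumptions for illustration.

/* Sketch: A[i][j] -= A[i][k] * A[k][j] for one NB x NB block via DGEMM.
 * Assumes blocks are stored contiguously in row-major order. */
#include <cblas.h>

void update_trail_block(double *Aij, const double *Aik,
                        const double *Akj, int NB)
{
    /* Aij = (-1.0) * Aik * Akj + (1.0) * Aij */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                NB, NB, NB,
                -1.0, Aik, NB,
                      Akj, NB,
                 1.0, Aij, NB);
}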

Naïve implementation and its problem
Iterations are separated, so the implementation is not tolerant to latencies or background processes!
[Figure: number of executable tasks over time for the k-th, (k+1)-th and (k+2)-th iterations, broken down into diagonal, L, U and trailing tasks]

Latency hiding techniques
Overlapping iterations hides latencies: computation of the diagonal, L and U parts is advanced. If the computations of the trailing parts remain separated, only two adjacent iterations are overlapped, so there is room for further improvement.

Overlapping multiple iterations for more tolerance
We overlap multiple iterations by computing all blocks, including the trailing parts, asynchronously. A data-driven style and prioritized task scheduling are used.

Prioritized task scheduling
We assign a priority to the update task of each block: the k-th update of block A[i][j] has priority min(i-S, j-S, k), where a smaller number means a higher priority and S is the desired overlap depth. We can control the amount of overlap by changing the value of S. A sketch of this priority rule follows.
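A minimal sketch of the priority rule above, assuming ready tasks are kept in a queue ordered by the priority value; the Task structure, the comparator and order_ready_tasks are illustrative names, not the authors' implementation.

#include <stdlib.h>

/* Priority of the k-th update of block A[i][j]; a smaller value means the
 * task should run earlier. S is the desired overlap depth. */
static inline int task_priority(int i, int j, int k, int S)
{
    int p = k;
    if (i - S < p) p = i - S;
    if (j - S < p) p = j - S;
    return p;
}

/* A ready task; prio is filled with task_priority(i, j, k, S) when the
 * task's inputs become available. */
typedef struct { int i, j, k, prio; } Task;

/* qsort comparator: smaller priority values (higher priority) first. */
static int by_priority(const void *a, const void *b)
{
    return ((const Task *)a)->prio - ((const Task *)b)->prio;
}

/* Order the current ready queue; the highest-priority task ends up first. */
void order_ready_tasks(Task *tasks, size_t n)
{
    qsort(tasks, n, sizeof(Task), by_priority);
}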

Typical data mapping and its problem
Two-dimensional block-cyclic distribution gives good load balance and small communication, but the number of nodes must be fixed and factored into two small numbers. How to support dynamically changing nodes?
[Figure: a matrix distributed block-cyclically over processes P0-P5]

Our data mapping for dynamically changing nodes
The block rows and columns of the matrix are reordered by a random permutation that is common among all nodes, and the permuted matrix is then distributed in a cyclic-like fashion. A sketch of one possible mapping follows.
[Figure: an 8x8 block matrix before and after the random permutation]
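The transcript does not spell out the exact mapping function, so the following is only a sketch of one way a shared random permutation can yield a cyclic-like, node-count-independent assignment of blocks to virtual nodes; B, perm, init_permutation and block_owner are illustrative names, and using rand() assumes every node runs the same PRNG implementation.

#include <stdlib.h>

#define B 256                   /* number of block rows/columns (assumed) */

static int perm[B];             /* permutation shared by all nodes */

/* Fisher-Yates shuffle seeded identically on every node, so all nodes
 * agree on the permutation without communication. */
void init_permutation(unsigned int shared_seed)
{
    srand(shared_seed);
    for (int i = 0; i < B; i++) perm[i] = i;
    for (int i = B - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
}

/* Owner (virtual node id) of block (i, j): permute, then assign cyclically.
 * The formula works for any number of virtual nodes. */
int block_owner(int i, int j, int num_virtual_nodes)
{
    return (perm[i] * B + perm[j]) % num_virtual_nodes;
}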

Dynamically joining nodes
A new node sends a steal message to one of the existing nodes. The receiver abandons some of its virtual nodes and sends the corresponding blocks to the new node, which takes over those virtual nodes and blocks. For better load balance, the stealing process is repeated. The sketch below outlines the ownership bookkeeping.
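A minimal sketch of the bookkeeping behind this stealing step, under the assumption that each node records which physical node is responsible for each virtual node; the array and function below are illustrative and are not the Phoenix API or the authors' implementation. Block data would be transferred alongside the ownership change; only the bookkeeping is shown.

#define N_VNODES 1024

static int owner_of_vnode[N_VNODES];   /* physical node currently responsible */

/* The newcomer new_node steals roughly half of the virtual nodes currently
 * owned by victim; returns the number of virtual nodes (and hence blocks)
 * migrated. Repeating this against several victims evens out the load. */
int steal_from(int victim, int new_node)
{
    int seen = 0, moved = 0;
    for (int v = 0; v < N_VNODES; v++) {
        if (owner_of_vnode[v] == victim && (seen++ % 2 == 0)) {
            owner_of_vnode[v] = new_node;
            moved++;
        }
    }
    return moved;
}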

Experimental environments (1)
A 112-node IBM BladeCenter cluster: 70 nodes with dual 2.4 GHz Xeons and 42 nodes with dual 2.8 GHz Xeons. One CPU per node is used, so the slower 2.4 GHz CPUs determine the overall performance. Interconnect: Gigabit Ethernet.

Experimental environments (2)
High Performance Linpack (HPL) is by Petitet et al.; GOTO BLAS is by Kazushige Goto (UT-Austin).
Ours (S=0): don't overlap explicitly
Ours (S=1): overlap with an adjacent iteration
Ours (S=5): overlap multiple (5) iterations

Scalability
Ours (S=5) achieves 190 GFlops on 108 nodes, a 65-fold speedup. Matrix size N=61440, block size NB=240, overlap depth S=0 or 5.
[Figure: performance vs. number of nodes, with speedups of x72 and x65 annotated]

Tolerance to background processes (1)
We run LU/HPL together with background processes: 3 background processes per randomly chosen node. The background processes are short-lived; they move to other random nodes every 10 seconds.

Tolerance to background processes (2)
HPL slows down heavily; Ours (S=0) and Ours (S=1) also suffer. By overlapping multiple iterations (S=5), our LU becomes more tolerant! 108 nodes are used for computation.
[Figure: performance with and without background processes; annotated slowdowns include -16%, -36% and -26%]

Tolerance to large latencies (1)
We emulate a future Grid environment with high bandwidth and large latencies. The experiments are done on a cluster; large latencies (+0 ms, +200 ms, +500 ms) are emulated in software.

Tolerance to large latencies (2)
S=0 suffers a 28% slowdown. Overlapping iterations makes our LU more tolerant; both S=1 and S=5 work well. 108 nodes are used for computation.
[Figure: performance under added latency; annotated slowdowns include -28% and -19%]

Performance with joining nodes (1)
16 nodes at first, then 48 nodes are added dynamically (16 to 64 nodes).

Performance with joining nodes (2)
Flexibility in the number of nodes is useful for obtaining higher performance: the dynamic run is x1.9 faster. Compared with Fixed-64, Dynamic suffers from migration overhead, etc. N=30720, S=5.

Related work
Dyn-MPI [Weatherly et al. 03]: an extended MPI library that supports dynamically changing nodes.

                             Dyn-MPI                    Our approach
Redistribution method        Synchronous                Asynchronous
Distribution of 2D matrix    Only the first dimension   Arbitrary (left to the programmer)

Summary
An LU implementation suitable for non-dedicated clusters and the Grid:
- scalable
- supports dynamically changing nodes
- tolerates background processes and large latencies

Future work
- Perform pivoting: more data dependencies are introduced; is our LU still tolerant?
- Improve dynamic load balancing: choose better target nodes for stealing; take CPU speeds into account
- Apply our approach to other HPC applications, e.g. CFD applications

Thank you!

Typical task scheduling Each node updates blocks synchronously Not tolerant to background processes

Our task scheduling to tolerate delays (1) Each block is updated asynchronously Blocks may have different iteration numbers

Our task scheduling to tolerate delays (2)
We not only allow skew but create it explicitly, by introducing prioritized task scheduling: upper-left blocks are given higher priority. Target skew = 3 (similar to a pipeline depth).

Performance with joining processes (3)
[Figure annotations: the run suffers from migration when processes are added; good peak speed; longer tail at the end]