Mapping the LU Decomposition on a Many-Core Architecture: Challenges and Solutions Ioannis E. Venetis Department of Computer Engineering and Informatics University of Patras, Greece Guang R. Gao Department of Electrical and Computer Engineering University of Delaware, USA 18/5/2009 CF 2009, Ischia, Italy

LU Decomposition Assume that we need to solve the linear system A·x = b, where: A is a dense N×N matrix, x is the N×1 vector of values to be calculated and b is an N×1 vector of known values. Decompose matrix A into a lower triangular matrix L and an upper triangular matrix U, such that A = L·U.
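For reference, a minimal unblocked LU factorization sketch in C (Doolittle form, no pivoting). This is only an illustration of the decomposition itself, not the blocked Cyclops-64 code discussed later; the in-place, row-major layout and the absence of pivoting are simplifying assumptions.

```c
#include <stddef.h>

/* In-place LU factorization of an n-by-n matrix stored row-major in a[].
 * After the call, U occupies the upper triangle (including the diagonal)
 * and L the strict lower triangle (its unit diagonal is implicit).
 * No pivoting is performed, so a well-conditioned A is assumed. */
void lu_factorize(double *a, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];          /* multiplier l_ik */
            for (size_t j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
}
```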

LU Decomposition Solve two easy triangular systems: Lower triangular: L·(U·x) = b ⇒ U·x = b'. Upper triangular: U·x = b' ⇒ x = b''. Why did we choose LU? It is a well-studied algorithm; multiple variations have been proposed, each one more suitable for a specific architecture; and its behavior on traditional systems is well understood, which makes it easier to identify and understand the differences on many-core systems with local storage instead of a hardware-managed cache.
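A minimal sketch of the two triangular solves in C, continuing the factorization sketch above (names and layout are illustrative, not taken from the talk): forward substitution with the unit-diagonal L, then backward substitution with U.

```c
#include <stddef.h>

/* Solve L*(U*x) = b using the in-place LU factors produced by lu_factorize().
 * b is overwritten: first with b' (forward substitution, unit-diagonal L),
 * then with the solution x (backward substitution with U). */
void lu_solve(const double *a, double *b, size_t n)
{
    /* Forward substitution: L * b' = b */
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            b[i] -= a[i * n + j] * b[j];

    /* Backward substitution: U * x = b' */
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; j++)
            b[i] -= a[i * n + j] * b[j];
        b[i] /= a[i * n + i];
    }
}
```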

Classic Block-LU Algorithms (1/2) They share similar characteristics at the highest level: the initial matrix is partitioned into blocks of fixed size (usually square), each processed by one processor. The SPLASH-2 implementation targets shared-memory architectures; blocks should fit into the L1 data cache. High Performance Linpack targets mainly distributed-memory architectures; blocks are first distributed among the nodes, and the blocks within each node are further divided to fit into the cache.
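For intuition, a compact right-looking blocked LU in C that illustrates the common structure of these classic schemes. It is a generic sequential sketch without pivoting, not SPLASH-2 or HPL code; in real implementations the panel solves and the trailing update would be parallel BLAS calls and the blocks would be distributed among processors.

```c
#include <stddef.h>

/* Unblocked, in-place LU of the kb-by-kb diagonal block starting at a[k0*n+k0]. */
static void factor_diag(double *a, size_t n, size_t k0, size_t kb)
{
    for (size_t k = k0; k < k0 + kb; k++)
        for (size_t i = k + 1; i < k0 + kb; i++) {
            a[i*n + k] /= a[k*n + k];
            for (size_t j = k + 1; j < k0 + kb; j++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
        }
}

/* Right-looking blocked LU without pivoting; nb is the (fixed) block size. */
void blocked_lu(double *a, size_t n, size_t nb)
{
    for (size_t k0 = 0; k0 < n; k0 += nb) {
        size_t kb = (k0 + nb <= n) ? nb : n - k0;
        size_t t0 = k0 + kb;                 /* start of the trailing submatrix */

        factor_diag(a, n, k0, kb);

        /* Row panel: U12 = L11^{-1} * A12 (forward substitution, unit diagonal). */
        for (size_t j = t0; j < n; j++)
            for (size_t i = k0 + 1; i < t0; i++)
                for (size_t k = k0; k < i; k++)
                    a[i*n + j] -= a[i*n + k] * a[k*n + j];

        /* Column panel: L21 = A21 * U11^{-1}. */
        for (size_t i = t0; i < n; i++)
            for (size_t k = k0; k < t0; k++) {
                for (size_t p = k0; p < k; p++)
                    a[i*n + k] -= a[i*n + p] * a[p*n + k];
                a[i*n + k] /= a[k*n + k];
            }

        /* Trailing update: A22 -= L21 * U12. */
        for (size_t i = t0; i < n; i++)
            for (size_t j = t0; j < n; j++)
                for (size_t k = k0; k < t0; k++)
                    a[i*n + j] -= a[i*n + k] * a[k*n + j];
    }
}
```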

Classic Block-LU Algorithms (2/2) The data distribution is determined by the parameters of the memory subsystem. This creates imbalance whenever the number of blocks is not divisible by the number of processors, and the same holds during the processing of each block, where BLAS routines are used and a hardware-managed cache is assumed. Cache-based architectures have created a “cache-aware” programming consensus. Is this the best choice for many-core systems with local storage?
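A toy illustration of that imbalance (the numbers are hypothetical, not from the talk): when the cache-dictated block count does not divide the processor count, some processors inevitably receive extra work.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical example: a cache-derived block size yields 25 blocks,
     * to be processed by 4 processors. */
    int n_blocks = 25;
    int n_procs  = 4;

    int per_proc = n_blocks / n_procs;   /* 6 blocks for every processor */
    int leftover = n_blocks % n_procs;   /* 1 processor must take a 7th  */

    printf("%d processors get %d blocks, %d gets %d\n",
           n_procs - leftover, per_proc, leftover, per_proc + 1);
    return 0;
}
```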

The architecture of Cyclops-64

Implications on the LU Decomposition There is no cache on Cyclops-64, so how should the size of the blocks be determined? Our solution: according to the number of processors, which improves load balance. One drawback: some blocks, once processed, are never used again, which creates imbalance during the next step of LU. We therefore repartition the matrix.
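A hedged sketch of a processor-count-driven block-size choice. The exact rule used on Cyclops-64 is not spelled out in the slides; the `blocks_per_proc` factor and the ceiling division below are assumptions for illustration only.

```c
#include <stddef.h>

/* Choose a block size from the matrix order n and the number of processors p,
 * instead of from cache parameters. blocks_per_proc controls how many blocks
 * each processor receives per step (a tunable assumption, not a value from
 * the paper). */
size_t block_size_from_procs(size_t n, size_t p, size_t blocks_per_proc)
{
    size_t target_blocks = p * blocks_per_proc;           /* blocks along one dimension */
    size_t bs = (n + target_blocks - 1) / target_blocks;  /* ceiling division */
    return bs > 0 ? bs : 1;
}
```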

Dynamic Repartitioning Algorithm Traditionally serial. We parallelize it by applying the algorithm recursively, which improves load balancing. We combine work to reduce overhead and improve data transfers, and we repartition the remaining work to maintain work balance at each step.
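A schematic outline of the repartitioning idea, not the authors' algorithm: after every step of LU the trailing submatrix shrinks, so its blocking is recomputed from the processor count to keep each step balanced. The "one block per TU per step" rule and the elided numerical work are assumptions made only to keep the sketch short.

```c
#include <stddef.h>

/* Schematic outline only. The slides also mention combining work across steps
 * to reduce overhead and improve data transfers; that refinement is omitted. */
void lu_with_dynamic_repartitioning(double *a, size_t n, size_t p)
{
    size_t done = 0;                            /* order of the already-factored part */
    while (done < n) {
        size_t remaining = n - done;
        size_t bs = (remaining + p - 1) / p;    /* assumed rule: one block per TU per step */
        if (bs == 0) bs = 1;

        /* Factor the bs-wide panel at offset 'done' in parallel on the p TUs,
         * then update the trailing submatrix of order (remaining - bs)
         * (left abstract; see the blocked LU sketch earlier). */
        (void)a;                                /* the numerical work is elided */

        done += bs;
    }
}
```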

Dynamic Repartitioning vs. SPLASH-2 Performance comparison figures: on Cyclops-64 with a 700×700 matrix and on an Intel Xeon 3.0 GHz with a 4000×4000 matrix.

What should be our next step? (1/2) Performance improved, but it is still only at 2.8% of peak! Cyclops-64 has only local storage, so each request for data has to go to main memory. Our goal: minimize the number of loads and stores by moving data to the next level of high-speed storage, the register file (64 registers). We manually apply register tiling: instead of relying only on the static semantics of loops, as compilers do, we exploit our high-level knowledge of the algorithm.

What should be our next step? (2/2) How do we fit each block into 64 registers? We further divide each block into sub-blocks (tiles). Questions that arise: What is the optimal size of each tile? This must take into account how many registers the architecture has and the dependencies between tiles and blocks. What is the optimal sequence in which the tiles have to be traversed? We exhaustively analyze all possible ways to traverse the tiles.
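To make "register tiling" concrete, a hedged sketch of a 6×6 register-tiled update kernel in C. The tile sizes match the values derived later for Cyclops-64 (L1 = L2 = 6, L3 = 1, i.e. 36 + 6 + 6 = 48 live values); holding the accumulators in local scalars is how one would encourage a compiler to keep them in registers, and the code is an illustration rather than the paper's hand-tuned kernel.

```c
#include <stddef.h>

/* GEMM-like trailing update C -= A*B with a 6x6 tile of C held in registers
 * (L1 = L2 = 6, L3 = 1). n is the common leading dimension; m, nn and kk are
 * assumed to be multiples of the tile sizes to keep the sketch short. */
void update_tiled(double *c, const double *a, const double *b,
                  size_t m, size_t nn, size_t kk, size_t n)
{
    for (size_t i0 = 0; i0 < m; i0 += 6)
        for (size_t j0 = 0; j0 < nn; j0 += 6) {
            double acc[6][6];                       /* the register tile of C */
            for (size_t i = 0; i < 6; i++)
                for (size_t j = 0; j < 6; j++)
                    acc[i][j] = c[(i0 + i) * n + j0 + j];

            for (size_t k = 0; k < kk; k++) {       /* L3 = 1: one column/row per step */
                double av[6], bv[6];
                for (size_t i = 0; i < 6; i++) av[i] = a[(i0 + i) * n + k];
                for (size_t j = 0; j < 6; j++) bv[j] = b[k * n + j0 + j];
                for (size_t i = 0; i < 6; i++)      /* 12 loads feed 36 updates */
                    for (size_t j = 0; j < 6; j++)
                        acc[i][j] -= av[i] * bv[j];
            }

            for (size_t i = 0; i < 6; i++)          /* write the tile back once */
                for (size_t j = 0; j < 6; j++)
                    c[(i0 + i) * n + j0 + j] = acc[i][j];
        }
}
```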

Our solution We take a generic and systematic approach: assume the architecture has R registers and that sub-blocks from different blocks do not have the same size; identify all possible ways to perform the required calculations; calculate the number of loads and stores for each case; calculate the sub-block sizes that minimize the number of loads and stores; and use the best case in our implementation!
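A hedged sketch of the enumeration step, assuming the feasibility condition is that the three tiles (L1×L3, L3×L2 and L1×L2) must fit into the R registers together. The cost model 1/L1 + 1/L2 (operand-tile loads amortized over the L1×L2 results produced) is an illustrative stand-in, not the exact expressions derived in the paper, but it reproduces the tile sizes reported on the following slides.

```c
#include <stdio.h>

/* Enumerate all tile shapes that fit into R registers and keep the one with
 * the smallest per-element load cost under the illustrative model. */
int main(void)
{
    const int R = 48;                 /* registers available for data on C64 */
    int best1 = 1, best2 = 1, best3 = 1;
    double best_cost = 1e30;

    for (int l1 = 1; l1 <= R; l1++)
        for (int l2 = 1; l2 <= R; l2++)
            for (int l3 = 1; l3 <= R; l3++) {
                if (l1 * l2 + l1 * l3 + l2 * l3 > R)
                    continue;                      /* does not fit into the registers */
                double cost = 1.0 / l1 + 1.0 / l2; /* illustrative load model */
                if (cost < best_cost) {
                    best_cost = cost;
                    best1 = l1; best2 = l2; best3 = l3;
                }
            }

    printf("best tiles: L1=%d L2=%d L3=%d\n", best1, best2, best3);
    return 0;   /* prints: best tiles: L1=6 L2=6 L3=1 */
}
```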

Dividing blocks into tiles

First case (figure, with tile dimensions L1, L2, L3)

Second case (figure, with tile dimensions L1, L2, L3)

Third case (figure, with tile dimensions L1, L2, L3)

Minimizing the number of loads (1/2) Observations: the number of loads is minimized for larger L1 and L2, and L3 does not appear in it, so we set L3 = 1. Data that must fit into the registers: L1·L2 + L1·L3 + L2·L3 ≤ R.

Minimizing the number of loads (2/2) Calculate the optimal L1 from the register constraint. For Cyclops-64, R = 48, which gives L1 = 6, L2 = 6, L3 = 1 and 6 times fewer loads and stores! Similar results hold for all other blocks. Exploiting the “Load/Store Multiple” instructions, 6 times fewer load/store instructions are issued.
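For completeness, one way to recover these numbers from the register constraint above; this is a reconstruction assuming the symmetric choice L1 = L2, since the slide itself does not show the derivation.

```latex
% With L3 = 1 and L1 = L2, the constraint L1*L2 + L1*L3 + L2*L3 <= R becomes
% L1^2 + 2 L1 <= R, i.e. (L1 + 1)^2 <= R + 1, so
\[
L_1 = L_2 \le \sqrt{R+1} - 1, \qquad
R = 48 \;\Rightarrow\; L_1 = L_2 = \sqrt{49} - 1 = 6, \quad L_3 = 1 .
\]
```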

Actual layout of sub-blocks

Impact of each optimization on 156 TUs The input matrix is of size 1000×1000 and is assumed to reside in SRAM. The performance increase comes mainly from two sources: Dynamic Repartitioning and Register Tiling.

Instruction mix breakdown on 156 TUs The optimized version requires only 12% of the instructions of the original version. Loads and stores are reduced 28 times! Integer instructions are reduced 36 times! The time spent waiting for data from memory dropped from 31.4% to 4.7%.

Performance vs. matrix size The matrix is assumed to reside in SRAM; the simulator allows redefining the size of the SRAM. The implementation of C64 is at 65 nm, so it may be possible to have more SRAM per TU in the future.

Conclusions We presented a methodology to design algorithms for many-core architectures with local storage instead of a hardware-managed cache: distribute the work to improve load balance, rather than according to memory parameters, and apply application-aware register tiling, calculating the optimal tile sizes that minimize loads and stores. Is it applicable to other applications? Matrix multiplication, …

Questions?