Presentation on theme: "Ioannis E. Venetis Department of Computer Engineering and Informatics"— Presentation transcript:

1 Mapping the LU Decomposition on a Many-Core Architecture: Challenges and Solutions
Ioannis E. Venetis, Department of Computer Engineering and Informatics, University of Patras, Greece
Guang R. Gao, Department of Electrical and Computer Engineering, University of Delaware, USA
18/5/2009, CF 2009, Ischia, Italy

2 LU Decomposition Assume that we need to solve the linear system A·x = b, where:
A is a dense N×N matrix, x is the N×1 vector of unknowns to be calculated, and b is an N×1 vector of known values. Decompose matrix A into a lower triangular matrix L and an upper triangular matrix U, such that A = L·U.
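Written out, substituting the decomposition into the system gives the two triangular systems used on the next slide:

\[
  A x = b, \qquad A = L U \;\Longrightarrow\; L\,(U x) = b .
\]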

3 LU Decomposition Solve two easy linear systems:
Lower triangular: L·(U·x) = b ⇒ U·x = b'. Upper triangular: U·x = b' ⇒ x = b''. Why did we choose LU? It is a well-studied algorithm: multiple variations have been proposed, each more suitable for a specific architecture, and its behavior on traditional systems is well understood. That makes it easier to identify and understand the differences on many-core systems with local storage instead of a hardware-managed cache.
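As an illustration of these two solves (a minimal serial sketch, not the blocked or parallel version discussed in this talk), forward and backward substitution can be written as:

/* Solve L*y = b by forward substitution, then U*x = y by backward
 * substitution.  Assumes row-major N x N storage and nonzero diagonals. */
void forward_subst(const double *L, const double *b, double *y, int N) {
    for (int i = 0; i < N; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i*N + j] * y[j];
        y[i] = s / L[i*N + i];
    }
}

void backward_subst(const double *U, const double *y, double *x, int N) {
    for (int i = N - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < N; j++)
            s -= U[i*N + j] * x[j];
        x[i] = s / U[i*N + i];
    }
}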

4 Classic Block-LU Algorithms (1/2)
They share similar characteristics at the highest level: the initial matrix is partitioned into blocks of fixed size (usually square blocks), each processed by one processor. The SPLASH-2 implementation targets shared-memory architectures; blocks should fit into the L1 data cache. High Performance Linpack targets mainly distributed-memory architectures; blocks are first distributed among nodes, and blocks within each node are further divided to fit into cache.
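A hedged illustration of the cache-driven sizing rule: assuming 8-byte elements and a hypothetical 32 KB L1 data cache (the actual SPLASH-2 parameters are not given on the slide), the largest square block that fits is:

#include <math.h>
#include <stddef.h>

/* Largest square block of doubles that fits into an L1 data cache of
 * l1_bytes bytes (hypothetical sizing rule, for illustration only). */
int block_size_for_cache(size_t l1_bytes) {
    return (int)floor(sqrt((double)(l1_bytes / sizeof(double))));
}
/* Example: block_size_for_cache(32 * 1024) == 64, i.e. 64x64 blocks. */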

5 Classic Block-LU Algorithms (2/2)
Data distribution is determined by the parameters of the memory subsystem. This creates imbalance when the number of blocks is not divisible by the number of processors, and the same is true during processing of each block, where BLAS routines are used and a hardware-managed cache is assumed. Cache-based architectures have created a “cache-aware” programming consensus. Is this the best choice for many-core systems with local storage?

6 The architecture of Cyclops-64

7 Implications for the LU Decomposition
There is no cache on Cyclops-64, so how should the size of the blocks be determined? Our solution: according to the number of processors, which improves load balance. One drawback: some blocks, once processed, are never used again, which creates imbalance during the next step of LU. The remedy is to repartition the matrix.
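A sketch of the contrast with the cache-driven rule (one plausible reading; the authors' exact rule is not shown on the slide): the block size is derived from the number of thread units rather than from a cache size, so the blocks of the current submatrix can be handed out roughly one per processor.

#include <math.h>

/* Hypothetical processor-driven sizing: choose a grid with about one block
 * per thread unit and derive the block size from it. */
int block_size_for_procs(int n_remaining, int num_procs) {
    int blocks_per_dim = (int)ceil(sqrt((double)num_procs));
    return (n_remaining + blocks_per_dim - 1) / blocks_per_dim;  /* ceiling division */
}
/* Example: block_size_for_procs(1000, 156) == 77, a 13x13 grid of 169 blocks
 * for 156 thread units. */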

8 Dynamic Repartitioning Algorithm
The algorithm is traditionally serial. We parallelize it by applying the algorithm recursively, which improves load balancing; by combining work, to reduce overhead and improve data transfers; and by repartitioning the remaining work, to maintain work balance at each step.
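A runnable sketch of the repartitioning idea stated above (not the authors' implementation, and reusing the hypothetical processor-driven sizing rule from the previous slide): after each blocked step, the remaining trailing submatrix is partitioned again, so block sizes shrink with the remaining work and the block count stays matched to the processor count.

#include <math.h>
#include <stdio.h>

int main(void) {
    int N = 1000, P = 156;                            /* matrix order, thread units */
    for (int k = 0; k < N; ) {
        int remaining = N - k;
        int grid = (int)ceil(sqrt((double)P));        /* blocks per dimension (assumed rule) */
        int B = (remaining + grid - 1) / grid;        /* block size chosen for this step */
        printf("step at column %4d: remaining %4d, block size %3d\n", k, remaining, B);
        k += B;                                       /* the next step repartitions the rest */
    }
    return 0;
}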

9 Dynamic Repartitioning vs. SPLASH-2
(Performance comparison charts: Cyclops-64 with a 700×700 matrix; Intel Xeon 3.0 GHz with a 4000×4000 matrix.)

10 What should be our next step? (1/2)
Performance has improved, but is still only at 2.8% of peak! Cyclops-64 has only local storage, so each request for data has to go to main memory. Our goal: minimize the number of loads and stores by moving to the next level of high-speed storage, the register file (64 registers). We manually apply register tiling, not relying only on the static semantics of loops, as compilers do, but exploiting our high-level knowledge of the algorithm.
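A minimal illustration of register tiling (a generic 2x2 example, not the authors' Cyclops-64 kernel): the trailing update C -= A·B is computed two rows and two columns at a time, so the accumulators stay in registers across the whole inner loop and each element of C is loaded and stored only once.

/* Assumes row-major n x n matrices with n even, for brevity. */
void update_2x2_tiled(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < n; j += 2) {
            /* accumulators kept in registers for the whole k loop */
            double c00 = C[i*n + j],       c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j],   c11 = C[(i+1)*n + j + 1];
            for (int k = 0; k < n; k++) {
                double a0 = A[i*n + k],     a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j],     b1 = B[k*n + j + 1];
                c00 -= a0 * b0;  c01 -= a0 * b1;
                c10 -= a1 * b0;  c11 -= a1 * b1;
            }
            C[i*n + j]       = c00;  C[i*n + j + 1]     = c01;
            C[(i+1)*n + j]   = c10;  C[(i+1)*n + j + 1] = c11;
        }
    }
}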

11 What should be our next step? (2/2)
How can each block fit into 64 registers? Further divide each block into sub-blocks (tiles). Questions that arise: What is the optimal size of each tile, taking into account how many registers the architecture has and the dependencies between tiles and blocks? What is the optimal sequence in which the tiles have to be traversed? We exhaustively analyze all possible ways to traverse the tiles.

12 Our solution We take a generic and systematic approach
Assume our architecture has R registers and that sub-blocks from different blocks may have different sizes. We identify all possible ways to perform the required calculations, calculate the number of loads and stores for each case, calculate the size of each sub-block that minimizes the number of loads and stores, and use the best case in our implementation!

13 Dividing blocks into tiles

14 First case (figure: tile traversal with dimensions L1, L2, L3)

15 Second case (figure: tile traversal with dimensions L1, L2, L3)

16 Third case (figure: tile traversal with dimensions L1, L2, L3)

17 Minimizing the number of loads (1/2)
Observations: Loads are minimized for larger L1 and L2. L3 is not present ⇒ L3 = 1. Data that must fit into registers:
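The expression itself did not survive in this transcript. Assuming an L1×L2 tile of the updated block plus an L1×L3 and an L3×L2 tile of the two blocks it is updated from must be resident at once, the capacity constraint would read:

\[
  L_1 L_2 + L_1 L_3 + L_2 L_3 \le R .
\]

With the values reported on the next slide (L1 = L2 = 6, L3 = 1, R = 48) this holds with equality: 36 + 6 + 6 = 48.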

18 Minimizing the number of loads (2/2)
Calculate the optimal L1: for Cyclops-64, R = 48, giving L1 = 6, L2 = 6, L3 = 1, and 6 times fewer loads and stores! Similar results hold for all other blocks. Exploiting “Load/Store Multiple” instructions, 6 times fewer load/store instructions are issued.
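A small self-contained check of this choice, under the assumptions above (the capacity constraint L1·L2 + L1·L3 + L2·L3 ≤ R, L3 = 1, and the previous slide's observation that loads are minimized for larger L1 and L2, i.e. maximize L1·L2):

#include <stdio.h>

int main(void) {
    const int R = 48, L3 = 1;                 /* values from the slide */
    int best_l1 = 0, best_l2 = 0, best_area = 0;
    for (int l1 = 1; l1 <= R; l1++) {
        for (int l2 = 1; l2 <= R; l2++) {
            int regs = l1 * l2 + l1 * L3 + l2 * L3;   /* assumed capacity constraint */
            if (regs <= R && l1 * l2 > best_area) {
                best_area = l1 * l2;
                best_l1 = l1;
                best_l2 = l2;
            }
        }
    }
    /* Prints: best tile 6 x 6 x 1, using 48 registers */
    printf("best tile %d x %d x %d, using %d registers\n",
           best_l1, best_l2, L3, best_area + best_l1 + best_l2);
    return 0;
}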

19 Actual layout of sub-blocks

20 Impact of each optimization on 156 TUs (thread units)
The input matrix is of size 1000×1000 and is assumed to be in SRAM. The performance increase comes mainly from two sources: Dynamic Repartitioning and Register Tiling.

21 Instruction mix breakdown on 156 TUs
The optimized version requires only 12% of the instructions. Loads and stores are reduced 28 times! Integer instructions are reduced 36 times! Time spent waiting for data from memory dropped from 31.4% to 4.7%.

22 Performance vs. Matrix size
The matrix is assumed to be in SRAM; the simulator allows redefining the size of the SRAM. The implementation of C64 is at 65 nm, so it may be possible to have more SRAM per TU in the future.

23 Conclusions We presented a methodology to design algorithms for many-core architectures with local storage instead of a hardware-managed cache: distribute work to improve load balance (not according to memory parameters) and apply application-aware register tiling, calculating optimal tile sizes to minimize loads/stores. Is the methodology applicable to other applications? Matrix multiplication, …

24 Questions?

