Fractal Matrix Multiply. G. Bilardi – University of Padua, Dipartimento di Elettronica e Informatica. P. D'Alberto and A. Nicolau – University of California at Irvine, Information and Computer Science.

Presentation transcript:

Fractal Matrix Multiply
G. Bilardi – University of Padua, Dipartimento di Elettronica e Informatica
P. D'Alberto and A. Nicolau – University of California at Irvine, Information and Computer Science
[Title figure: the blocked product C += A · B, with C partitioned into C0, C1, C2, C3, A into A0, A1, A2, A3, and B into B0, B1, B2, B3.]

Talk Organization
– Motivations (alias "why is matrix multiply so popular?")
– Why did we jump into the project?
– Matrix multiply as it is usually done, and how we differ
– Our approach (the performance-related part): how we did it, experimental results
– Conclusions

Motivations
Matrix multiply as an example:
– Data reuse: every element is used in n multiplications
– Space requirements: sizes, layouts
Matrix multiply as a kernel:
– BLAS-3 (level-3 BLAS) applications, e.g. LU decomposition

Why did we jump into the project?
Matrix multiply is asymptotically optimal – alias cache-hierarchy oblivious:
– n³ multiplications and k·n³ misses (k <= 3), by Hong and Kung
So we can safely study different algorithms:
– Safely: we do not lose optimality
– Different algorithms: different computation orders
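For reference, the baseline these counts refer to is the standard triple-loop multiply, which performs n³ multiply-adds in whatever loop order is chosen; a minimal C sketch (the function and variable names are ours, not code from the talk):

```c
/* Naive C += A * B on n x n row-major matrices: exactly n^3 multiply-adds.
 * Changing the loop order changes the miss count, not the flop count. */
void mm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```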

Why did we jump into the project? (cont.)
Is optimal use of the caches the same as optimal performance?
– Not really
Performance also depends on:
– Register allocation, scheduling, layouts, recursion vs. no recursion, RISC vs. non-RISC architecture, compiler optimizations, etc.
We want performance:
– MFLOPS

Multiplication as it is usually done
1. Tiling for L1
– Reduction to a single, simple, common problem
– Then L2, L3, ...
2. Register allocation on the simple problem:
– Number of registers
– Non-RISC vs. RISC (Pentium vs. non-Pentium)
3. Scheduling by the compiler
4. Feedback, and start over again if necessary

ATLAS, for example:
– Tiles fixed in size
– Registers = tiles
– Tiles copied into a contiguous workspace
[Figure: blocked C, A, and B with fixed-size tiles.]
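A hedged sketch of this style of tiling, not ATLAS's actual code: the matrices are cut into fixed NB x NB tiles, each tile of A and B is copied into a contiguous workspace so the inner kernel streams unit-stride data, and that inner kernel is where register allocation and scheduling are applied. NB, copy_tile, and mm_tiled are our illustrative names, and boundary tiles are omitted.

```c
#include <string.h>

enum { NB = 32 };   /* illustrative L1 tile size, not a tuned value */

/* Copy the NB x NB tile with top-left corner (r, c) of an n x n
 * row-major matrix into a contiguous NB x NB buffer. */
static void copy_tile(int n, const double *M, int r, int c, double *buf)
{
    for (int i = 0; i < NB; i++)
        memcpy(buf + i * NB, M + (size_t)(r + i) * n + c, NB * sizeof(double));
}

/* Tiled C += A * B, with n assumed to be a multiple of NB for brevity. */
void mm_tiled(int n, const double *A, const double *B, double *C)
{
    double a[NB * NB], b[NB * NB];   /* contiguous workspace, one tile each */

    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB) {
            copy_tile(n, A, ii, kk, a);
            for (int jj = 0; jj < n; jj += NB) {
                copy_tile(n, B, kk, jj, b);
                /* Inner kernel on the two contiguous tiles; this is the
                 * piece that register allocation and scheduling target. */
                for (int i = 0; i < NB; i++)
                    for (int j = 0; j < NB; j++) {
                        double cij = C[(size_t)(ii + i) * n + (jj + j)];
                        for (int k = 0; k < NB; k++)
                            cij += a[i * NB + k] * b[k * NB + j];
                        C[(size_t)(ii + i) * n + (jj + j)] = cij;
                    }
            }
        }
}
```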

How do we differ from the others?
We present:
– A single recursive algorithm; the decomposition is a function of the problem size
– A recursive layout (fractal layout, alias Z-Morton)
– Register allocation tuned to the number of registers, for register-file-based architectures, generated automatically
– Optimization of the index computation and of the recursion
– Scheduling left to the compiler

Our approach: fractal layout (alias Z-Morton)
If A is a near-square matrix, then A0, A1, A2, A3 are near-square matrices, each about 1/4 the size of A, with A0 the largest.
Near square: |rows − columns| <= 1
[Figure: A split into quadrants A0, A1, A2, A3, laid out sequentially in memory.]
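A minimal sketch, under our own naming, of how the near-square split and the sequential quadrant offsets could be computed; the exact in-memory order of the quadrants is the one shown in the figure, and we simply assume the order A0, A1, A2, A3 here.

```c
/* Near-square split of an m x n block into quadrants.
 * A0 gets the larger halves, so it is the largest quadrant. */
typedef struct {
    int r0, r1;   /* rows of the top (A0, A1) and bottom (A2, A3) quadrants */
    int c0, c1;   /* columns of the left (A0, A2) and right (A1, A3) quadrants */
} split_t;

static split_t split(int m, int n)
{
    split_t s = { (m + 1) / 2, m / 2, (n + 1) / 2, n / 2 };
    return s;
}

/* Starting offsets of the four quadrants in the recursive layout,
 * assuming they are stored contiguously in the order A0, A1, A2, A3. */
static void quadrant_offsets(int m, int n, long off[4])
{
    split_t s = split(m, n);
    off[0] = 0;
    off[1] = off[0] + (long)s.r0 * s.c0;   /* after A0 */
    off[2] = off[1] + (long)s.r0 * s.c1;   /* after A1 */
    off[3] = off[2] + (long)s.r1 * s.c0;   /* after A2 */
}
```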

Our approach
1. A square problem is decomposed into 8 near-square problems, with each dimension between ⌊n/2⌋ and ⌈n/2⌉
2. Each sub-problem has its operands stored contiguously (thanks to the recursive layout)
3. A sub-problem is decomposed further if min(k, j, l) > 32
4. Otherwise it is solved directly:
– The operands are in row-major format
– Optimized at the register-file level, reusing common optimizations

Our approach, cont.: the type DAG
A recursion tree for a problem of size n has O(8 log n) different types.
The type determines the index computation for the sub-problems.
The types and the matrix offsets are determined once and stored in a tree-like structure, the "type DAG".
This reduces index computations by 30%, with moderate extra space.
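As a rough illustration only (the authors' actual data structure is not shown in the slides), a type-DAG node could record the sub-problem shape, the precomputed quadrant offsets of the three operands, and shared pointers to the child types, so the index arithmetic is done once per distinct type rather than once per recursive call:

```c
/* One node of a hypothetical "type DAG": a distinct sub-problem shape.
 * C is m x n, A is m x k, B is k x n; coff/aoff/boff hold the
 * precomputed quadrant offsets inside each operand, and child[i]
 * points to the (possibly shared) node of the i-th sub-problem. */
struct type_node {
    int m, n, k;
    long coff[4], aoff[4], boff[4];
    struct type_node *child[8];
};
```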

Recursive tree and type DAG
The eight sub-problems generated by one decomposition step:
C0 += A0·B0   C0 += A1·B2
C1 += A0·B1   C1 += A1·B3
C2 += A2·B0   C2 += A3·B2
C3 += A2·B1   C3 += A3·B3
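Putting the decomposition rule (stop when min(k, j, l) <= 32) and these eight sub-problems together, here is a hedged sketch of the recursion driver. The names mm_kernel and mm_fractal, and the assumption that quadrants are stored in the order 0, 1, 2, 3, are ours; the real code walks the type DAG instead of recomputing sizes and offsets at every call.

```c
/* Direct kernel for small row-major blocks: C (m x n) += A (m x k) * B (k x n).
 * Placeholder for the register-allocated kernel discussed on the next slides. */
static void mm_kernel(int m, int n, int k,
                      double *C, const double *A, const double *B)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

/* Recursive C += A * B on near-square operands stored in the fractal layout. */
void mm_fractal(int m, int n, int k,
                double *C, const double *A, const double *B)
{
    if (m <= 32 || n <= 32 || k <= 32) {   /* stop when min(m, n, k) <= 32 */
        mm_kernel(m, n, k, C, A, B);       /* leaves assumed row-major */
        return;
    }
    int m0 = (m + 1) / 2, m1 = m / 2;
    int n0 = (n + 1) / 2, n1 = n / 2;
    int k0 = (k + 1) / 2, k1 = k / 2;
    /* Quadrant start pointers in the recursive layout (stored 0, 1, 2, 3). */
    double       *C0 = C, *C1 = C0 + (long)m0 * n0, *C2 = C1 + (long)m0 * n1, *C3 = C2 + (long)m1 * n0;
    const double *A0 = A, *A1 = A0 + (long)m0 * k0, *A2 = A1 + (long)m0 * k1, *A3 = A2 + (long)m1 * k0;
    const double *B0 = B, *B1 = B0 + (long)k0 * n0, *B2 = B1 + (long)k0 * n1, *B3 = B2 + (long)k1 * n0;
    /* The eight sub-problems of the 2x2 block product. */
    mm_fractal(m0, n0, k0, C0, A0, B0);  mm_fractal(m0, n0, k1, C0, A1, B2);
    mm_fractal(m0, n1, k0, C1, A0, B1);  mm_fractal(m0, n1, k1, C1, A1, B3);
    mm_fractal(m1, n0, k0, C2, A2, B0);  mm_fractal(m1, n0, k1, C2, A3, B2);
    mm_fractal(m1, n1, k0, C3, A2, B1);  mm_fractal(m1, n1, k1, C3, A3, B3);
}
```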

Our approach, cont.: register allocation
When the recursion stops:
1. Sub-problems with min(k, j, l) <= 32 are computed directly
2. Sub-matrices smaller than 32 by 32 are stored in row-major order
3. Register allocation, in one of two flavors:
– Fractal register allocation
– C-tiling register allocation

Register allocation, fractal
We apply the recursive decomposition down to the register level:
– The registers are distributed in a balanced way among the three matrices
Advantage:
– The register file is treated as an L0 cache
Disadvantage:
– The computation is expressed as straight-line code, which causes code explosion
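To make the straight-line-code point concrete, here is what a fully unrolled block product looks like at the smallest scale: a 2 x 2 by 2 x 2 multiply with every operand held in a scalar. This example is ours, not the generated code, but the generated fractal code is of this flavor for larger blocks, which is exactly where the code-size explosion comes from.

```c
/* Fully unrolled 2x2 by 2x2 block product, row-major, all operands in scalars.
 * Straight-line code like this is what the fractal register allocation emits. */
static inline void mm2x2(double *C, const double *A, const double *B)
{
    double a0 = A[0], a1 = A[1], a2 = A[2], a3 = A[3];
    double b0 = B[0], b1 = B[1], b2 = B[2], b3 = B[3];
    C[0] += a0 * b0 + a1 * b2;
    C[1] += a0 * b1 + a1 * b3;
    C[2] += a2 * b0 + a3 * b2;
    C[3] += a2 * b1 + a3 * b3;
}
```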

Register allocation, C-tiling
The distribution of registers is not balanced:
– s² registers for C, s for A, and s for B (s² + 2s registers in total)
C is tiled further into s x s sub-squares; for each of them, the s x s tile of C is loaded into registers, then:
1. An s x 1 sliver of the A tile is loaded into registers
2. A 1 x s sliver of the B tile is loaded into registers
3. A scalar product updates the C tile
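A hedged sketch of the C-tiling kernel with s = 2, so s² = 4 accumulators for C plus s = 2 values each of A and B; the names and the row-major, even-dimension assumptions are ours.

```c
/* C-tiling register kernel, s = 2: C (m x n) += A (m x k) * B (k x n),
 * all row-major, m and n assumed even for brevity. */
void mm_ctile2(int m, int n, int k,
               double *C, const double *A, const double *B)
{
    for (int i = 0; i < m; i += 2)
        for (int j = 0; j < n; j += 2) {
            /* 2 x 2 tile of C kept in scalars (hopefully registers) */
            double c00 = C[i * n + j],       c01 = C[i * n + j + 1];
            double c10 = C[(i + 1) * n + j], c11 = C[(i + 1) * n + j + 1];
            for (int p = 0; p < k; p++) {
                double a0 = A[i * k + p], a1 = A[(i + 1) * k + p];  /* 2 x 1 of A */
                double b0 = B[p * n + j], b1 = B[p * n + j + 1];    /* 1 x 2 of B */
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j] = c00;       C[i * n + j + 1] = c01;
            C[(i + 1) * n + j] = c10; C[(i + 1) * n + j + 1] = c11;
        }
}
```

Each element of C is read and written once per (i, j) tile rather than once per k iteration, which is where the loads + stores saving over the fractal allocation comes from.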

C-tiling, cont.
Advantage: more efficient than the fractal allocation, reducing loads + stores
Disadvantage: the register file is treated differently from the other cache levels
[Figure: tiles of C, A, and B.]

Cache Performance ULTRA5

Cache Performance SPARC5

MFLOPS Performance Pentium II

MFLOPS Performance: R5K_ip32 and Ultra 2

Conclusions
Algorithms can exploit the cache hierarchy without taking cache parameters into account.
Performance is achieved by optimizing the recursion:
– Careful pruning
– Optimization of the index computation
We used the matrix multiply:
– For LU decomposition, further improving its performance

Thank you