A few words on locality and arrays
David Gregg, Department of Software Systems, Trinity College Dublin

Matrix problems

Matrices are very important for certain types of computationally intensive problems, mostly problems relating to the simulation of physical systems: wind tunnel simulation, nuclear weapons simulation, weather forecasting, flood mapping, etc.

Matrix Problems

Matrices are commonly used to store the coefficients of variables in large sets of simultaneous equations. Programs contain many operations on large matrices: matrix multiplication, matrix inversion, Gaussian elimination, etc.

Matrix Problems

Data sets for matrix problems are typically large: huge matrices with billions of elements. Simulation of physical systems can be done at different levels of detail, and we usually want as much detail as possible. The amount of detail is limited by processing speed and memory; faster computation allows more detail.

Matrix Problems

Many matrix problems can be parallelized extremely well, especially matrix multiplication, but also others. Historically, parallel computing was almost entirely focussed on executing large matrix problems in parallel.

Matrix Problems

Two common matrix problems are matrix by vector multiplication, y = A * x, where x and y are vectors and A is a matrix, and matrix by matrix multiplication, C = A * B, where A, B and C are matrices.

Matrix Vector Mul

// matrix vector mul: y = A * x
for ( i = 0; i < N; i++ ) {
  y[i] = 0.0;
  for ( j = 0; j < N; j++ ) {
    y[i] += x[j] * A[i][j];
  }
}

Interchanging the loops

// initialize y to zero
for ( j = 0; j < N; j++ )
  y[j] = 0.0;

// matrix vector mul with the loops interchanged:
// the inner loop now walks down a column of A (stride N),
// but each x[i] is loaded only once per outer iteration
for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < N; j++ ) {
    y[j] += x[i] * A[j][i];
  }
}

Original with caching

// cache y[i] in a local variable
for ( i = 0; i < N; i++ ) {
  sum = 0.0;
  for ( j = 0; j < N; j++ ) {
    sum += x[j] * A[i][j];
  }
  y[i] = sum;
}

Locality

A common pattern arises in matrix operations. The simple way to write the code is to scan over all the rows (or columns) with a single loop, but this can result in poor reuse of values in the arrays.
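To make the cost concrete, here is a small illustrative sketch (an assumed example for this transcript, not from the slides): summing an N x N array of doubles by rows versus by columns. In C, each row is contiguous in memory, so the row-order loop uses every element of each cache line it fetches, while the column-order loop jumps N doubles between consecutive accesses and can miss on almost every element.

#include <stdio.h>

#define N 1024
static double a[N][N];

// row-order scan: stride-1 accesses, good cache-line reuse
double sum_by_rows(void) {
  double sum = 0.0;
  for ( int i = 0; i < N; i++ )
    for ( int j = 0; j < N; j++ )
      sum += a[i][j];
  return sum;
}

// column-order scan: stride-N accesses, poor cache-line reuse
double sum_by_cols(void) {
  double sum = 0.0;
  for ( int j = 0; j < N; j++ )
    for ( int i = 0; i < N; i++ )
      sum += a[i][j];
  return sum;
}

int main(void) {
  printf("%f %f\n", sum_by_rows(), sum_by_cols());
  return 0;
}

Both functions compute the same result; only the order of memory accesses differs.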

Locality

Array elements are pulled into the cache when first used. If those elements will be reused by the algorithm, we want to reuse them from the cache and avoid reloading them from memory. Rewriting the code to reuse data usually involves reordering loop iterations or reordering data in memory; a sketch of the latter follows.
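The next slides reorder loop iterations. As a sketch of the data-reordering option (an assumed example, not from the slides, anticipating the matrix multiplication covered later), one can copy a matrix into its transpose once so that the inner product then walks both operands with stride 1. Following the slides' convention, a, b, bt and c are N x N arrays:

// assumed sketch: transpose b up front so the inner loop
// of the multiplication reads both operands contiguously
for ( j = 0; j < N; j++ )
  for ( k = 0; k < N; k++ )
    bt[j][k] = b[k][j];    // one-off O(N^2) copy

for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < N; j++ ) {
    sum = 0.0;
    for ( k = 0; k < N; k++ )
      sum += a[i][k] * bt[j][k];   // both rows: stride-1
    c[i][j] = sum;
  }
}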

Matrix Vector Mul

Halve the number of times x must be loaded:

for ( i = 0; i < N; i+=2 ) {
  sum0 = 0.0;
  sum1 = 0.0;
  for ( j = 0; j < N; j++ ) {
    sum0 += x[j] * A[i][j];
    sum1 += x[j] * A[i+1][j];
  }
  y[i] = sum0;
  y[i+1] = sum1;
}

Matrix Vector Mul

Quarter the number of times x must be loaded:

for ( i = 0; i < N; i+=4 ) {
  sum0 = 0.0;
  sum1 = 0.0;
  sum2 = 0.0;
  sum3 = 0.0;
  for ( j = 0; j < N; j++ ) {
    sum0 += x[j] * A[i][j];
    sum1 += x[j] * A[i+1][j];
    sum2 += x[j] * A[i+2][j];
    sum3 += x[j] * A[i+3][j];
  }
  y[i] = sum0;
  y[i+1] = sum1;
  y[i+2] = sum2;
  y[i+3] = sum3;
}

Matrix Vector Mul

Reduce the loading of x to cache by a factor of k:

for ( i = 0; i < N; i+=k ) {
  for ( temp = 0; temp < k; temp++ )
    sum[temp] = 0.0;
  for ( j = 0; j < N; j++ ) {
    for ( temp = 0; temp < k; temp++ )
      sum[temp] += x[j] * A[i+temp][j];
  }
  for ( temp = 0; temp < k; temp++ )
    y[i+temp] = sum[temp];
}

Matrix Vector Mul

Make it work when N is not a multiple of k:

for ( i = 0; i < N; i+=k ) {
  stop = (i + k < N) ? k : N - i;
  for ( temp = 0; temp < stop; temp++ )
    sum[temp] = 0.0;
  for ( j = 0; j < N; j++ ) {
    for ( temp = 0; temp < stop; temp++ )
      sum[temp] += x[j] * A[i+temp][j];
  }
  for ( temp = 0; temp < stop; temp++ )
    y[i+temp] = sum[temp];
}

Improving Locality

This technique can reduce cache misses on x a lot, i.e. to close to the minimum possible. However, the inner loop that does the "blocking" of the x vector accesses different elements of the y vector. If we keep increasing k (the "blocking factor") we will eventually get a lot of cache misses on accesses to the y vector.

Improving Locality

So how big should k be? The "optimal" value of k is one that keeps as much of x as feasible in cache. The obvious answer is as much as will fit in cache. However, the cache is also used by other data, and there are multiple levels of cache. Note also that blocking by k reduces the number of complete passes over x from N to N/k, eliminating a fraction 1 - 1/k of the reloads: with k = 4, we already achieve 75% of the possible benefit.

Improving Locality

The best choice of k is usually difficult to find analytically, so in practice auto-tuners are often used. An auto-tuner is a program that is used to tune another program: it tries out different values of constants that affect performance (compile, run, time, feedback, start again). Auto-tuners often use machine learning techniques, since there is a huge search space of possible solutions.
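A minimal sketch of the auto-tuning loop (assumed code for this transcript, not from the slides; production auto-tuners such as ATLAS search far larger spaces): run the blocked matrix-vector kernel from the earlier slides for several candidate blocking factors, time each one, and keep the fastest.

#include <stdio.h>
#include <time.h>

#define N 2048
#define KMAX 64

static double A[N][N], x[N], y[N], sum[KMAX];

// blocked matrix-vector multiply from the earlier slides,
// with the blocking factor k passed in as a parameter
static void matvec_blocked(int k) {
  for ( int i = 0; i < N; i += k ) {
    int stop = (i + k < N) ? k : N - i;
    for ( int t = 0; t < stop; t++ )
      sum[t] = 0.0;
    for ( int j = 0; j < N; j++ )
      for ( int t = 0; t < stop; t++ )
        sum[t] += x[j] * A[i+t][j];
    for ( int t = 0; t < stop; t++ )
      y[i+t] = sum[t];
  }
}

int main(void) {
  int candidates[] = { 1, 2, 4, 8, 16, 32 };
  int best_k = 1;
  double best_time = 1e30;
  for ( int c = 0; c < 6; c++ ) {
    clock_t start = clock();
    matvec_blocked(candidates[c]);
    double t = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("k = %2d: %.4f s\n", candidates[c], t);
    if ( t < best_time ) { best_time = t; best_k = candidates[c]; }
  }
  printf("best k = %d\n", best_k);
  return 0;
}

A real auto-tuner would average several runs per candidate and feed the results back into the next round of the search.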

Matrix Multiplication

Matrix multiplication (matmul) is similar to matrix vector multiplication: multiply two matrices to get a result which is a matrix. For N x N matrices, there will be O(N^3) operations using the straightforward algorithm. Matmul is more computationally intensive than matrix-vector multiplication, with a large ratio of computation to memory operations: O(N^3) arithmetic operations on only O(N^2) data.

Matrix Multiplication

for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < N; j++ ) {
    sum = 0;
    for ( k = 0; k < N; k++ ) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] = sum;
  }
}

Blocked Mat Mul

// init matrix c to all zeroes before starting;
// min(a, b) returns the smaller of a and b, e.g.
//   #define min(a, b) ((a) < (b) ? (a) : (b))
for ( jj = 0; jj < n; jj += block_size ) {
  for ( kk = 0; kk < n; kk += block_size ) {
    for ( i = 0; i < n; i++ ) {
      stopj = min(jj + block_size, n);
      for ( j = jj; j < stopj; j++ ) {
        sum = 0;
        stopk = min(kk + block_size, n);
        for ( k = kk; k < stopk; k++ ) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}
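How big should block_size be? A rough sizing heuristic (an assumed sketch, not from the slides): the data reused across outer iterations is the block_size x block_size block of b, so it should fit comfortably in the innermost cache, leaving room for the rows of a and c that stream past it. One might start from something like the following and then auto-tune, just as with k above.

#include <math.h>

// assumed heuristic: largest power-of-two block whose
// block_size x block_size tile of doubles fits in half
// the given cache
int pick_block_size(long cache_bytes) {
  long doubles = cache_bytes / 2 / (long)sizeof(double);
  int b = (int)sqrt((double)doubles);
  int p = 1;
  while ( p * 2 <= b )
    p *= 2;
  return p;
}

// e.g. pick_block_size(32 * 1024) == 32 for a 32 KB L1 data cache

Rounding down to a power of two is only a convenience for tidy loop bounds; the real best value depends on the machine and is exactly the kind of constant an auto-tuner would search for.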