
CacheMiner: Run-Time Cache Locality Exploitation on SMPs

[Title-slide figure: an SMP system in which several CPUs, each with an on-chip and an off-chip cache, are connected through an interconnection network to shared memory.]

Example: program transformations for cache locality: Tiling

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      A[i,j] = A[i,j] + B[i,k] * C[k,j]

For a matrix multiplication of 1000 x 1000 matrices, tiling restructures the loop nest so that the block of data touched at a time still fits in cache and is reused before being evicted.

[Figure: the data in A, B and C accessed per iteration block.]
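Tiling in code: a minimal sketch of a tiled version of this loop nest, assuming a tile size T chosen so that the blocks fit in cache (a generic illustration, not code from the CacheMiner paper):

#include <stddef.h>

#define N 1000
#define T 32   /* assumed tile size; chosen so three T x T blocks fit in cache */

/* Tiled A[i][j] += B[i][k] * C[k][j]: the outer loops walk T x T blocks so
 * that each block of A, B and C is reused while it is still cache-resident. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t jj = 0; jj < N; jj += T)
            for (size_t kk = 0; kk < N; kk += T)
                for (size_t i = ii; i < ii + T && i < N; i++)
                    for (size_t j = jj; j < jj + T && j < N; j++)
                        for (size_t k = kk; k < kk + T && k < N; k++)
                            A[i][j] += B[i][k] * C[k][j];
}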

But it’s hard for the compiler to analyse indirect accesses...

void myfunc(int source_arr[], int key_arr[], int result_arr[], int n)
{
    for (int i = 0; i < n; i++) {
        result_arr[i] += source_arr[key_arr[i]];   /* indirection! */
    }
}

The data access pattern of the function depends on the contents of key_arr[], so it cannot be determined at compile time, only at run time. Cacheminer is especially useful for such scenarios.

Targeted Model

for (i1 = lower_1; i1 < upper_1; i1++)
  for (i2 = lower_2; i2 < upper_2; i2++)
    for (i3 = lower_3; i3 < upper_3; i3++)
      ...
        for (ik = lower_k; ik < upper_k; ik++) {
          Task B = block of statements;
        }

k nested loops. Let B(t1, t2, ..., tk) denote the instance of task B in which t1, t2, ..., tk are the particular values of the loop variables i1, i2, ..., ik.

The tasks must be data independent of each other, i.e. for any two tasks B1 and B2:
  Out(B1) ∩ Out(B2) = ∅
  Out(B1) ∩ In(B2) = ∅
  In(B1) ∩ Out(B2) = ∅
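A hypothetical concrete instance of this model (the function and the computation are illustrative, not taken from the paper):

/* k = 2 nested loops; each iteration (i1, i2) is an independent task
 * B(i1, i2): it writes only result[i1 * m + i2] and reads only x[i1]
 * and y[i2], so the In/Out sets of distinct tasks do not intersect. */
void example_task_nest(int n, int m, double *result,
                       const double *x, const double *y)
{
    for (int i1 = 0; i1 < n; i1++)        /* lower_1 = 0, upper_1 = n */
        for (int i2 = 0; i2 < m; i2++) {  /* lower_2 = 0, upper_2 = m */
            /* Task B(i1, i2): block of statements */
            result[i1 * m + i2] = x[i1] * y[i2];
        }
}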

System Overview

C program -> Hint Addition (compiler) -> Access Pattern Estimation -> Task Grouping -> Task Partitioning -> Task Scheduling (run-time library)

Hint Addition: add calls to library functions which provide hints to the run-time system.
Access Pattern Estimation: use the hints to estimate the pattern of memory accesses.
Task Grouping: group tasks which access closely placed data together into bins.
Task Partitioning: partition the bins among the P processors to maximize data locality while preserving load sharing.
Task Scheduling: schedule the tasks on the processors and ensure overall load balance.
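The slides describe hint insertion only at a high level. The sketch below shows where such hints could be placed in the indirect-access example, using hypothetical function names (cm_declare_arrays, cm_task_hint) that are not the actual CacheMiner library interface:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical hint calls, stubbed out here so the sketch is self-contained. */
static void cm_declare_arrays(int n_arrays, const size_t sizes[], int n_procs)
{
    printf("hint: %d arrays, %d processors\n", n_arrays, n_procs);
    for (int a = 0; a < n_arrays; a++)
        printf("hint: array %d is %zu bytes\n", a, sizes[a]);
}

static void cm_task_hint(const void *a1, const void *a2)
{
    printf("hint: task footprint (%p, %p)\n", a1, a2);   /* run-time hint */
}

void myfunc_with_hints(int source_arr[], int key_arr[], int result_arr[], int n)
{
    size_t sizes[2] = { (size_t)n * sizeof(int), (size_t)n * sizeof(int) };
    cm_declare_arrays(2, sizes, 4);          /* compile-time hints: n arrays, sizes, p */

    for (int i = 0; i < n; i++) {
        cm_task_hint(&source_arr[key_arr[i]], &result_arr[i]);
        /* In the real system the task body would be deferred and executed
         * later by the scheduler; it is run inline here only for illustration. */
        result_arr[i] += source_arr[key_arr[i]];
    }
}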

Step 1: Estimating Memory Accesses

Assumption: task B accesses only chunks of elements in multiple arrays.

Four hints are provided to the module:
a. Number of arrays accessed: n (compile time)
b. Size in bytes of each array: vector (s1, s2, ..., sn) (compile time)
c. Number of processors: p (compile time)
d. Access footprint B(a1, a2, ..., an): the starting access address in each of the n arrays for task B (run time)

Each task can then be treated as a point B(a1, a2, ..., an) in n-dimensional space.
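As a sketch, the run-time hint for one task can be represented as the point below; the struct and helper are illustrative, not the library's actual types:

#include <stdint.h>

#define N_ARRAYS 2   /* hint (a): number of arrays accessed by each task */

/* A task's access footprint as a point in N_ARRAYS-dimensional space. */
typedef struct {
    uintptr_t coord[N_ARRAYS];   /* hint (d): starting access address per array */
} task_point;

static task_point make_task_point(const void *a1, const void *a2)
{
    task_point p = { { (uintptr_t)a1, (uintptr_t)a2 } };
    return p;
}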

Example: int P[100] and int Q[200]

Memory layout of P: size = 100 * sizeof(int) = 400 bytes, starting address &P[0] = 1000.
Memory layout of Q: size = 200 * sizeof(int) = 800 bytes, starting address &Q[0] = 100.

Each task B(x, y) is a point in the 2-dimensional grid, where
  x: starting access address in array 1 (P) for the task
  y: starting access address in array 2 (Q) for the task
e.g. B1(1000, 900) and B2(1000, 100).

[Figure: the two tasks plotted against the axes "access dimension in P" and "access dimension in Q".]

Step 2: Grouping Tasks by Locality

A. Shift to the origin: translate the task points so that each array's starting address maps to zero (in the example, subtract 1000 on the P axis and 100 on the Q axis, so B1(1000, 900) becomes (0, 800) and B2(1000, 100) becomes (0, 0)).

B. Shrink the dimensions by C/n, where C is the cache size and n the number of arrays. In the example n = 2 and C = 200, so each dimension is shrunk by 200/2 = 100; tasks whose shrunk coordinates coincide fall into the same bin, and the data touched by the tasks of one bin fits within roughly one cache's worth of memory.

[Figure: the shifted task points and the resulting grid of bins.]
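A minimal sketch of the shift-and-shrink step with the numbers from the P/Q example (the helper name is mine; the bin edge C/n = 100 comes from the slide):

#include <stdio.h>
#include <stdint.h>

#define N_ARRAYS   2
#define CACHE_SIZE 200                       /* C, from the example */
#define BIN_EDGE   (CACHE_SIZE / N_ARRAYS)   /* C/n = 100 */

/* Shift a task's footprint to the origin (subtract each array's base
 * address) and shrink by C/n to obtain its bin coordinates. */
static void task_to_bin(const uintptr_t footprint[N_ARRAYS],
                        const uintptr_t base[N_ARRAYS],
                        long bin[N_ARRAYS])
{
    for (int d = 0; d < N_ARRAYS; d++)
        bin[d] = (long)((footprint[d] - base[d]) / BIN_EDGE);
}

int main(void)
{
    uintptr_t base[N_ARRAYS] = { 1000, 100 };   /* &P[0] and &Q[0] from the example */
    uintptr_t b1[N_ARRAYS]   = { 1000, 900 };   /* task B1 */
    uintptr_t b2[N_ARRAYS]   = { 1000, 100 };   /* task B2 */
    long bin1[N_ARRAYS], bin2[N_ARRAYS];

    task_to_bin(b1, base, bin1);   /* -> bin (0, 8) */
    task_to_bin(b2, base, bin2);   /* -> bin (0, 0) */
    printf("B1 -> bin (%ld, %ld), B2 -> bin (%ld, %ld)\n",
           bin1[0], bin1[1], bin2[0], bin2[1]);
    return 0;
}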

Step 3: Partitioning Bins among P Processors

We need to form P groups of bins such that data sharing between the groups is minimized. The problem is NP-complete, so a heuristic is used to divide up the bin space:

i. Compute the prime factors of P and use them to cut the dimensions of the bin space, dividing a dimension into Rj chunks for each prime factor Rj.

Example: with 6 processors, 6 = 2 x 3, so the x dimension is divided into 2 parts and the y dimension into 3 parts, giving 2 x 3 = 6 distinct regions. All bins in one region are processed by the same processor.

[Figure: the bin space cut into 6 rectangular regions.]
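A sketch of this heuristic for a 2-dimensional bin space; the round-robin assignment of prime factors to dimensions is my assumption, since the slide only shows the 6 = 2 x 3 case:

#include <stdio.h>

#define DIMS 2   /* dimensions of the bin space (n = 2 arrays in the running example) */

/* Factor p into primes and use each factor to cut one dimension; extra
 * factors (when p has more than DIMS prime factors) are folded in
 * round-robin, which is an assumed policy. */
static void cuts_per_dimension(int p, int cuts[DIMS])
{
    for (int d = 0; d < DIMS; d++)
        cuts[d] = 1;

    int d = 0;
    for (int f = 2; p > 1; f++) {
        while (p % f == 0) {
            cuts[d % DIMS] *= f;
            d++;
            p /= f;
        }
    }
}

int main(void)
{
    int cuts[DIMS];
    cuts_per_dimension(6, cuts);   /* 6 = 2 x 3 */
    printf("x cut into %d parts, y cut into %d parts -> %d regions\n",
           cuts[0], cuts[1], cuts[0] * cuts[1]);
    return 0;
}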

Step 4: Adaptive Scheduling of Task Groups

[Figure: per-processor task lists of bins; each processor removes K bins at a time.]

Local scheduling: each processor processes bins from its own task list.
Global scheduling: when a processor finishes its own task list, it starts processing the task list of the most heavily loaded processor.
Adaptive control: a processor takes K bins at a time, and K changes with the number of bins remaining:

  K_i = max(p/2, K_{i-1} - 1)  if few bins remain in the task list (light load)
  K_i = min(2p, K_{i-1} + 1)   if many bins remain in the task list (heavy load)
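A sketch of the adaptive update of K as read from the formula above; the light-load threshold is an assumed parameter, since the slides do not specify one:

/* Adaptive control of the chunk size K: shrink toward p/2 when few bins
 * remain, grow toward 2p when many remain. LIGHT_LOAD_THRESHOLD is an
 * assumed cutoff, not taken from the slides. */
#define LIGHT_LOAD_THRESHOLD(p) (4 * (p))

static int next_chunk_size(int k_prev, int bins_remaining, int p)
{
    if (bins_remaining < LIGHT_LOAD_THRESHOLD(p)) {
        int k = k_prev - 1;                /* light load: finer granularity */
        return k > p / 2 ? k : p / 2;      /* K_i = max(p/2, K_{i-1} - 1) */
    } else {
        int k = k_prev + 1;                /* heavy load: coarser granularity */
        return k < 2 * p ? k : 2 * p;      /* K_i = min(2p, K_{i-1} + 1) */
    }
}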

Results

[Figure: performance comparison of manually optimized code and Cacheminer, for both static and dynamic access patterns.]

Summary

Cacheminer is a framework to exploit run-time cache locality on SMPs. It targets nested-loop structures that access a number of arrays, and is especially useful for indirect accesses whose data access pattern cannot be determined until run time.

Overall phases: C program -> Hint Addition (compiler) -> Access Pattern Estimation -> Task Grouping -> Task Partitioning -> Task Scheduling (run-time library).