1 Spring 2008 CSE 591 Compilers for Embedded Systems
Aviral Shrivastava, Department of Computer Science and Engineering, Arizona State University

2 Lecture 6: Scratch Pad Memories
Management and Data Mapping Techniques

3 Energy Efficiency
[Chart: energy efficiency in operations per watt (0.01–10 GOPS/W) versus technology node (1.0µ down to 0.07µ) for ASICs, reconfigurable computing, DSP-ASIPs, and general-purpose µPs, with an "Ambient Intelligence" target and a "poor design techniques" region]
Necessary to optimize; otherwise the price for flexibility cannot be paid! [H. de Man, Keynote, DATE'02; T. Claasen, ISSCC'99]

4 SPMs vs. Cache
Energy consumption in tags, comparators, and muxes is significant! [R. Banakar, S. Steinke, B.-S. Lee, 2001]

5 Advantages of Scratch Pads
- Area advantage: for the same area, we can fit around 34% more SPM than cache, since an SPM consists of just a memory array and address-decoding circuitry
- Less energy consumption per access: no tag memory and no comparators
- Performance comparable with cache
- Predictable WCET, as required for real-time embedded systems (RTES)

6 Systems with SPM
- Most ARM architectures have an on-chip SPM, termed Tightly-Coupled Memory (TCM)
- GPUs such as Nvidia's 8800 have a 16 KB SPM
- It is typical for a DSP to have scratch-pad RAM
- Embedded processors such as the Motorola MCore and TI TMS370C
- Commercial network processors, e.g., the Intel IXP
- The Cell Broadband Engine

7 Challenges in Using SPMs
- With an SPM, the application developer or the compiler has to explicitly move data between memories; in cache-based architectures, data mapping is transparent
- Binary compatibility: do the advantages translate to a different machine?

8 Data Allocation on SPM
Data classification: global data, stack data, heap data, application code
Mapping classification:
- Static: the mapping of data is decided at compile time and persists throughout execution
- Dynamic: the mapping of data is decided at compile time, but changes during execution
Analysis classification: profile-based analysis vs. compile-time analysis
Goal classification: minimize off-chip memory accesses, reduce energy consumption, achieve better performance

9 Global Data
Panda et al., "Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications"
- Map all scalars to SPM: they are very small in size
- Estimate conflicts between arrays:
  - VAC(u), Variable Access Count: number of accesses to elements of u
  - IAC(u), Interference Access Count: number of accesses to other arrays during the lifetime of u
  - Interference factor: IF(u) = IAC(u) + VAC(u)
- Loop Conflict Graph: nodes are arrays; the weight of edge (u, v) is the number of accesses to u and v within loops they share
- The more conflict an array suffers, the stronger the case for placing it in SPM
- Either the whole array goes to SPM or it does not (a greedy sketch of the placement follows)
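A minimal sketch of the placement step, assuming the conflict factors have already been gathered by profiling; the struct fields, helper names, and the greedy strategy are illustrative (Panda et al. solve the selection more carefully):

```c
#include <stdlib.h>

/* One candidate array with its profiled conflict factor. */
typedef struct {
    const char *name;
    int size_bytes;
    int conflict;   /* e.g., TCF(u) gathered by profiling */
    int in_spm;     /* output: 1 if placed in SPM */
} ArrayInfo;

static int by_conflict_desc(const void *a, const void *b) {
    return ((const ArrayInfo *)b)->conflict - ((const ArrayInfo *)a)->conflict;
}

/* Greedy placement: consider arrays in decreasing conflict order and
 * place each WHOLE array in the SPM if it still fits (arrays are never
 * split between SPM and DRAM). */
void place_arrays(ArrayInfo *arr, int n, int spm_bytes) {
    qsort(arr, n, sizeof *arr, by_conflict_desc);
    for (int i = 0; i < n; i++) {
        arr[i].in_spm = 0;
        if (arr[i].size_bytes <= spm_bytes) {
            arr[i].in_spm = 1;
            spm_bytes -= arr[i].size_bytes;
        }
    }
}
```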

10 ILP for Global Variables
- Memory units m_i, each with a power per access p_{m_i}
- Number of times each variable j is accessed: n_j
- Compute where each variable should be placed to minimize power consumption: the compiler decides where to place the global variables (an example formulation follows)
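One way to write this down as a 0/1 ILP (a sketch of the shape of the problem, not necessarily the paper's exact formulation; s_j, the size of variable j, is an assumption not named on the slide): with x_{ij} = 1 iff variable j is placed in memory unit m_i,

```latex
\min \sum_{j}\sum_{i} n_j \, p_{m_i} \, x_{ij}
\qquad \text{s.t.} \qquad
\sum_{i} x_{ij} = 1 \;\;\forall j,
\qquad
\sum_{j} s_j \, x_{ij} \le \mathrm{size}(m_i) \;\;\forall i,
\qquad
x_{ij} \in \{0,1\}.
```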

11 ILP for Code + Global Variables
- Inputs per function: the number of times it is executed, its size, and the energy savings (in terms of dynamic instruction count) if it is mapped to SPM
- Find where to map the functions to minimize energy, given the SPM size constraint (the shape of the problem is sketched below); this can also be done at the basic-block level
- Limitations: the mapping is completely static (what about dynamic copying of code and data to the scratch pad?) and the analysis is profile-based only; does it scale, and how could it be done with static analysis?
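With e_f the energy saving and s_f the size of function f, the fully static version reduces to a 0/1 knapsack (again a sketch of the problem shape under these assumptions):

```latex
\max \sum_{f} e_f \, y_f
\qquad \text{s.t.} \qquad
\sum_{f} s_f \, y_f \le S_{\mathrm{SPM}},
\qquad y_f \in \{0,1\}.
```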

12 Stack Data Management
- Unlike global variables, stack variables are dynamic: their addresses are defined at run time, and the amount of stack is not known statically
- Stack-frame granularity: keep only the most frequently used stack frames (found by profiling) in the scratch pad
- Variable granularity: keep only the most frequently accessed variables in the SPM
- Either way, we need to manage two stack pointers, as in the sketch below
[Figure: CPU with Foo()'s frame in the SPM and Bar()'s frame in DRAM]
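A minimal sketch of frame-level placement with two stack pointers: frames chosen by profiling go on a small SPM stack, everything else stays on the regular DRAM stack. The SPM region, its size, and the helper names are all hypothetical:

```c
#include <stddef.h>

#define SPM_STACK_SIZE 1024
static char spm_stack[SPM_STACK_SIZE];              /* stand-in for the SPM */
static char *spm_sp = spm_stack + SPM_STACK_SIZE;   /* second SP, grows down */

/* Compiler-inserted prologue for a hot function: try the SPM stack. */
void *spm_frame_alloc(size_t bytes) {
    if ((size_t)(spm_sp - spm_stack) < bytes)
        return NULL;          /* SPM stack full: fall back to the DRAM stack */
    spm_sp -= bytes;
    return spm_sp;
}

/* Matching compiler-inserted epilogue. */
void spm_frame_free(size_t bytes) {
    spm_sp += bytes;
}
```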

13 MMU-Based Stack Management
- Deny read permission to all stack space, so the first access to a page generates a page fault
- The fault handler resets the permission, brings the accessed page into the SPM, and modifies the address mapping accordingly (a sketch follows)
- Granularity: page-based
- Binary compatible!
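A sketch of what such a fault handler could look like; the platform hooks (spm_alloc_page, mmu_map, mmu_allow_access) are hypothetical names, not a real API:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical platform hooks: */
extern uint8_t *spm_alloc_page(void);  /* grab (or evict) an SPM page */
extern void mmu_map(uintptr_t vaddr, void *new_frame);
extern void mmu_allow_access(uintptr_t vaddr);

/* On first touch of a protected page: copy it into the SPM, redirect
 * the virtual-to-physical mapping to the SPM copy, and re-enable
 * access. Assumes the handler runs in a privileged mode where the
 * original frame is still readable. */
void spm_fault_handler(uintptr_t fault_addr) {
    uintptr_t vpage = fault_addr & ~(uintptr_t)(PAGE_SIZE - 1);
    uint8_t *spm = spm_alloc_page();
    memcpy(spm, (const void *)vpage, PAGE_SIZE);  /* bring page on demand */
    mmu_map(vpage, spm);                          /* remap to the SPM copy */
    mmu_allow_access(vpage);                      /* reset the permission */
}
```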

14 Heap Data Management
- Like stack variables, heap data is dynamically allocated
- Partition the program into regions (e.g., functions, loops) and find a time order between the regions
- Add instructions to transfer portions of the heap data to the scratch pad at region boundaries (a sketch follows)
- Open question: can this be done with static techniques?
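A sketch of the kind of compiler-inserted transfer code for one region; the SPM buffer, the 512-byte "hot portion", and the function names are illustrative, while the region structure and time order come from the analysis described above:

```c
#include <string.h>

#define HOT_BYTES 512
static char spm_heap_buf[HOT_BYTES];   /* stand-in for an SPM region */

/* At region entry: stage the hot portion of a heap object into SPM;
 * the region's accesses are rewritten to go through the returned ptr. */
void *heap_region_enter(void *heap_obj) {
    memcpy(spm_heap_buf, heap_obj, HOT_BYTES);
    return spm_heap_buf;
}

/* At region exit: write the possibly modified data back to the heap. */
void heap_region_exit(void *heap_obj) {
    memcpy(heap_obj, spm_heap_buf, HOT_BYTES);
}
```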

15 Leakage Energy Minimization in Multi-Bank Scratch Pads
- The leakage power of a bank is proportional to its size
- Partition data across banks according to access frequency; a bank can be put into a low-power mode, which reduces its leakage
- Partitioning data so as to increase the sleep time of banks therefore pays off (a simple cost model is sketched below)
- Exploring all possible page mappings yields > 60% energy reduction
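The objective can be summarized with a simple cost model (a sketch of the intuition, not the paper's exact model): for each bank b,

```latex
E_{\mathrm{leak}} \;\approx\; \sum_{b} \Big( P_{\mathrm{on}}(\mathrm{size}_b)\, t_{\mathrm{active}}(b) \;+\; P_{\mathrm{sleep}}(\mathrm{size}_b)\, t_{\mathrm{sleep}}(b) \Big),
```

so packing the frequently accessed pages into a few banks maximizes the sleep time of the remaining ones.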

16 SPM Management for MMU-Based Systems
- The SPM is split into pages
- MMU page-fault (abort) exceptions are used to intercept accesses to data that should be copied into the SPM on demand
- The physical address generated by the MMU is compared with the SPM base, and the access goes to the SPM if it falls within range (see the range-check sketch below)
- The system also has a direct-mapped cache
- SPM-manager calls are inserted for frequently accessed data pages
- How should data be mapped so as to minimize the page faults?
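A sketch of the range check described above; the base address and size are illustrative constants:

```c
#include <stdint.h>

#define SPM_BASE 0x40000000u   /* hypothetical SPM base address */
#define SPM_SIZE 0x4000u       /* hypothetical 16 KB SPM */

/* The physical address produced by the MMU is compared against the
 * SPM range: in range, the access goes to the SPM; otherwise it takes
 * the normal path through the direct-mapped cache to DRAM. */
static inline int goes_to_spm(uint32_t paddr) {
    return (paddr - SPM_BASE) < SPM_SIZE;   /* unsigned wrap handles paddr < base */
}
```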

17 Static Analysis
- Static analysis typically requires well-structured code, but that is not always the case: while-loops, pointers, and variable modifications all limit what static analysis can do
- Alternative: simulate the program to find the address function of each access, then match the address functions against several affine patterns: a, a·i, a·i + b, a·i + b·j + c (a sketch of the matching step follows)
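A sketch of the matching step: given the addresses one access produced across n loop iterations (recorded during simulation), test whether they fit the pattern a·i + b and recover a and b; two-index patterns such as a·i + b·j + c extend this with one more first difference:

```c
#include <stdint.h>

/* Returns 1 if addr[i] == a*i + b for all n samples, recovering a and b. */
int match_affine(const uint32_t *addr, int n, int32_t *a, int32_t *b) {
    if (n < 2)
        return 0;
    *b = (int32_t)addr[0];              /* i = 0 yields the constant term */
    *a = (int32_t)(addr[1] - addr[0]);  /* first difference is the stride */
    for (int i = 2; i < n; i++)
        if ((int32_t)addr[i] != (*a) * i + *b)
            return 0;                   /* not affine in this index */
    return 1;
}
```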

18 Data Reuse Concept
for i = 0 to 9
  for j = 0 to 1
    for k = 0 to 2
      for l = 0 to 2
        for m = 0 to 4
          ... = A[i*15 + k*5 + m]
[Figure: the access pattern of A (array index vs. time) and the resulting reuse tree between main memory and the processor: a 15-element buffer for the elements reused across j, feeding 5-element buffers for the elements reused across l]

19 Basic Idea for Data Reuse Analysis
for y = 0 to 4
  for x = 0 to 4
    for dy = 0 to 4
      for dx = 0 to 9
        ... = A[10y+dy, 10x+dx]
    for dy = 0 to 14
      for dx = 0 to 4
        ... = A[10y+dy+10, 10x+dx]
- The 25 elements accessed by both references are 'alive' only during 10 iterations
- The buffer size needed to keep all reused data is 25 × 10 = 250 elements
[Figure: the footprints of the two references over A (indices 0–70 on each axis), with the overlapped region shown at iterations 10, 15, and 20]

20 Basic Idea for Data Reuse Analysis
How do we find the buffer size in a systematic way? Find the elements accessed during the first iteration and assign distance numbers (same loop nest as above).
[Figure: the two footprints annotated with distance numbers 0, 5, and 10]
- Overlapped area = number of reused elements (5 × 5 = 25 in our case)
- Buffer size = max difference in distance numbers × overlap area: 10 × 25 = 250; partial reuse is possible with a smaller buffer

21 Reuse Graph
- The result of the data reuse analysis is a buffer hierarchy
- Each buffer can be mapped to physical memory, with many different possibilities for the mapping
- Example (same loop nest as on the previous slides):
[Figure: array A[] in off-chip main memory, feeding level-2 buffers (Buf 1, Buf 2) in Mb-scale on-chip RAM, which in turn feed level-1 buffers (Buf 1, Buf 2, Buf 3) in Kb-scale on-chip RAM]

22 Multiprocessor Data Reuse Analysis
Multiprocessor program over A[100]:
Proc 1:
for i1 = 0 to 9
  for j1 = 0 to 9
    for k1 = 0 to 9
      n1 += f(A[10*i1 + j1 + k1])
Proc 2:
for i2 = 0 to 9
  for j2 = 0 to 9
    for k2 = 0 to 9
      n2 += g(A[10*i2 + j2 + k2])
Architecture with shared memory:
[Figure: array A[] (100 elements) in main memory, with a shared 19-element buffer in the SPM serving Proc 1 and Proc 2]
- In iteration i of the outer loop, each processor touches elements 10i through 10i+18, hence the 19-element shared buffer
- Additional synchronization is required

23 Synchronization Models
Two schemes for updating the shared buffer (same loop nests as on the previous slide):
- Buffer update with barrier synchronization: in each outer-loop iteration, both processors reach a barrier, the buffer update (DMA) runs, and a second barrier releases them to compute
- Buffer update using a larger buffer: start the DMA for the next tile, compute on the current one, then a barrier followed by a wait for the DMA (a double-buffering sketch follows)
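A sketch of the second scheme (double buffering) for the 19-element shared buffer of the previous slide; dma_start, dma_wait, barrier, and compute_on are hypothetical platform primitives:

```c
#define TILE 19
static int bufA[TILE], bufB[TILE];      /* two SPM tiles */

/* Hypothetical platform primitives: */
extern void dma_start(int *dst, const int *src, int n); /* async copy   */
extern void dma_wait(void);                             /* copy is done */
extern void barrier(void);                              /* all procs    */
extern void compute_on(const int *buf, int n);          /* the f()/g() use */

void process_all(const int *A) {        /* A[] lives in main memory */
    int *cur = bufA, *next = bufB;
    dma_start(cur, A, TILE);            /* prefetch the tile for i = 0 */
    dma_wait();
    for (int i = 0; i < 10; i++) {
        if (i + 1 < 10)
            dma_start(next, A + 10 * (i + 1), TILE); /* fill the other buffer */
        compute_on(cur, TILE);          /* both processors use cur */
        barrier();                      /* everyone finished with cur */
        dma_wait();                     /* next tile has arrived */
        int *t = cur; cur = next; next = t;   /* swap the two buffers */
    }
}
```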

24 Multiprocessor Reuse Analysis: Example
Original multiprocessor program:
Proc 1:
for i1 = 0 to 9
  for j1 = 0 to 4
    for k1 = 0 to 8
      n1 += f(A[10*i1 + 2*j1 + k1])
Proc 2:
for i2 = 0 to 9
  for j2 = 0 to 4
    for k2 = 1 to 9
      n2 += g(A[10*i2 + 2*j2 + k2])
[Figure: the MP data reuse graph, with shared buffer nodes i: 18, j: 10, k: 2 and per-processor nodes i1: 17, i2: 17, j1: 9, j2: 9, and the reuse trees (buffer hierarchies) it induces under different synchronization choices: no sync (per-processor chains i1: 17 → j1: 9 and i2: 17 → j2: 9 from main memory), sync 1 (shared i: 18 feeding per-processor j1: 9 and j2: 9), sync 2 (shared i: 18 → shared j: 10), and sync 3 (shared i: 18 → j: 10 → k: 2)]

