
Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University.


1 Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University

2 Lecture 5: Scratch Pad Memories Motivation

3 Processor-Memory Performance Gap [Figure: performance over time ("Moore's Law"): µProc improves ~55%/year (2X/1.5yr), DRAM ~7%/year (2X/10yrs)] □ Huge processor-memory performance gap □ Cold start can take billions of cycles

4 More serious dimensions of the memory problem □ Energy □ Access times □ Applications are getting larger and larger … □ Sub-banking

5 Memory Impact on Performance □ Suppose a processor executes with an ideal CPI of 1.1, an instruction mix of 50% arith/logic, 30% load/store, 20% control, and that 10% of data memory operations miss with a 50-cycle miss penalty □ CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 (data mem ops/instr) × 0.10 (misses/data mem op) × 50 (cycles/miss)) = 1.1 cycles + 1.5 cycles = 2.6, so 58% of the time the processor is stalled waiting for memory! □ A 1% instruction miss rate would add an additional 0.5 to the CPI!
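The stall arithmetic on this slide can be reproduced in a few lines (same numbers as above; nothing here is new data):

```python
# Worked example from the slide: effect of data-cache misses on CPI.
# Mix: 50% arith/logic, 30% load/store, 20% control.
ideal_cpi = 1.1
ld_st_fraction = 0.30      # data memory ops per instruction
data_miss_rate = 0.10      # misses per data memory op
miss_penalty = 50          # stall cycles per miss

stall_cycles = ld_st_fraction * data_miss_rate * miss_penalty  # 1.5
cpi = ideal_cpi + stall_cycles                                 # 2.6
stalled_fraction = stall_cycles / cpi                          # ~0.58

# A 1% instruction-fetch miss rate (1 fetch per instruction) adds:
instr_miss_stalls = 1.0 * 0.01 * miss_penalty                  # 0.5 CPI
```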

6 The Memory Hierarchy Goal: Create an Illusion □ Fact: large memories are slow, and fast memories are small □ How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)? □ With hierarchy □ With parallelism

7 A Typical Memory Hierarchy
On-chip components: control, datapath, register file, ITLB/DTLB, instruction and data caches; second-level cache (SRAM/eDRAM); main memory (DRAM); secondary memory (disk)

Level:           RegFile   L1 caches   L2 (SRAM)   Main (DRAM)   Disk
Speed (cycles):  ½'s       1's         10's        100's         1,000's
Size (bytes):    100's     K's         10K's       M's           G's to T's
Cost:            highest ──────────────────────────────────────→ lowest

□ By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology

8 The memory system frequently consumes >50% of the energy used for processing (multi-processor with cache; uni-processor without caches) [M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, May 2007] [Segars 01, according to Vahid@ISSS01] [Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001]

9 Cache [Figure: internal cache organization, including decoder logic]

10 Energy Efficiency [H. de Man, Keynote, DATE'02; T. Claasen, ISSCC99] [Figure: operations per watt (0.01–10 GOPS/W) vs. technology generation (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ) for ASICs, reconfigurable computing, DSP-ASIPs, and µPs; poor design techniques lag behind; "Ambient Intelligence" as the target] □ Necessary to optimize; otherwise the price for flexibility cannot be paid!

11 Timing Predictability □ G.721, using a unified cache on the ARM7TDMI: the worst-case execution time (WCET) is larger than without the cache

12 Objectives for Memory System Design □(Average) Performance □Throughput □Latency □Energy consumption □Predictability, good worst case execution time bound (WCET) □Size □Cost □….

13 Scratch pad memories (SPM): fast, energy-efficient, timing-predictable □ SPMs are small, physically separate memories mapped into the address space; selection is by an appropriate address decoder (simple!) □ Small; no tag memory □ Example: ARM7TDMI cores, well known for low power consumption, with a scratch pad memory mapped at addresses 0–FFF.. [Figure: CPU with registers, SPM alongside L1 cache, L2 cache, and RAM in the address space]
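The "selection by an appropriate address decoder" idea can be sketched as follows (the address map below is illustrative, not from the slide; the slide only shows the SPM starting at address 0):

```python
# Minimal sketch of SPM selection by address decoding (illustrative
# 4 KB window at the bottom of the address space): a comparison on the
# address routes each access to the SPM or to the rest of memory, with
# no tag lookup involved.
SPM_BASE = 0x000
SPM_SIZE = 4 * 1024          # 4 KB scratch pad, addresses 0x000..0xFFF

def route(addr):
    """Return which memory services this address: 'SPM' or 'DRAM'."""
    if SPM_BASE <= addr < SPM_BASE + SPM_SIZE:
        return "SPM"         # fast, energy-efficient, no tag check
    return "DRAM"            # everything else goes off-chip
```

For example, `route(0x0ABC)` falls in the scratch pad window while `route(0x1000)` does not.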

14 Comparison of currents E.g.: ATMEL board with ARM7TDMI and ext. SRAM

15 Scratchpad vs. main memory Example: Atmel ARM evaluation board □ Energy reduction: 1/7.06 (> 86% savings) □ 100% predictable

16 Why not just use a cache ? Energy consumption in tags, comparators and muxes is significant. [R. Banakar, S. Steinke, B.-S. Lee, 2001]

17 Influence of the associativity

18 Systems with SPM □ Most ARM architectures have an on-chip SPM termed tightly-coupled memory (TCM) □ GPUs such as Nvidia's 8800 have a 16 KB SPM □ It is typical for a DSP to have scratch pad RAM □ Embedded processors like Motorola MCore, TI TMS370C □ Commercial network processors – Intel IXP □ And many more …

19 And for the Cell processor □ Same motivation: large memory latency, huge overhead for automatically managed caches □ Local SPE processors fetch instructions and data from local storage (LS, 256 kB) □ The LS is not designed as a cache; separate DMA transfers to and from main memory are required to fill and spill it

20 Advantages of Scratch Pads □ Area advantage – for the same area, we can fit around 34% more SPM than cache □ An SPM consists of just a memory array and address decoding circuitry □ Less energy consumption per access – absence of tag memory and comparators □ Performance comparable with cache □ Predictable WCET – required for real-time embedded systems (RTES)

21 Challenges in using SPMs □ With SPMs, the application developer or the compiler has to explicitly move data between memories □ Data mapping is transparent in cache-based architectures □ Binary compatible? □ Do the advantages translate to a different machine?

22 Data Allocation on SPM □ Techniques focus on mapping global data, stack data, and heap data □ Broadly, we can classify them as: □ Static – mapping of data decided at compile time and constant throughout execution □ Compile-time dynamic – mapping of data decided at compile time, but the data in the SPM changes during execution □ Goals: minimize off-chip memory accesses, reduce energy consumption, achieve better performance

23 Global Data □ Panda et al., "Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications" □ Map all scalars to SPM – they are very small in size □ Estimate conflicts between arrays: □ VAC(u): Variable Access Count – number of accesses to elements of u □ IAC(u): Interference Access Count – number of accesses to other arrays during the lifetime of u □ IF(u) = VAC(u) + IAC(u) □ Loop Conflict Graph: nodes are arrays; the weight of edge (u, v) is the number of accesses to u and v in the loop □ More conflict → SPM □ Either the whole array goes to the SPM or not
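The conflict-driven placement idea can be sketched as a simple greedy pass (this is a simplified illustration of the metrics above, not Panda et al.'s exact algorithm; the arrays, sizes, and access counts are made-up numbers):

```python
# Hedged sketch: prefer arrays with high interference per byte for the
# SPM, placing each array whole or not at all.
# VAC(u) = accesses to u; IAC(u) = accesses to other arrays during u's
# lifetime; IF(u) = VAC(u) + IAC(u).
arrays = {                   # name: (size_bytes, VAC, IAC) -- illustrative
    "A": (1024, 500, 900),
    "B": (2048, 300, 200),
    "C": ( 512, 400, 800),
}
SPM_CAPACITY = 2048

def interference_factor(vac, iac):
    return vac + iac

remaining = SPM_CAPACITY
in_spm = []
# Greedy: highest interference factor per byte first.
for name, (size, vac, iac) in sorted(
        arrays.items(),
        key=lambda kv: interference_factor(kv[1][1], kv[1][2]) / kv[1][0],
        reverse=True):
    if size <= remaining:    # whole array fits, or it stays off-chip
        in_spm.append(name)
        remaining -= size
```

With these numbers, C (highest conflict density) and A end up in the SPM while B stays off-chip.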

24 ILP Formulation □ ILP variables: □ For functions □ For basic blocks □ For global variables

25 ILP Formulation □ Objective: energy savings □ Size constraint □ Bonus terms: no need to jump to memory and back for consecutive BBs placed together
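If only the objective and the size constraint are kept (dropping the consecutive-BB jump terms), the 0/1 ILP above reduces to a knapsack problem, which a small dynamic program can solve exactly. This is a hedged sketch with made-up object sizes and energy savings, not the formulation from the lecture:

```python
# 0/1 knapsack view of static SPM allocation: choose objects
# (functions, basic blocks, globals) to maximize energy savings
# subject to the SPM size constraint.  Items are illustrative.
items = [                    # (size_bytes, energy_saving)
    (200, 40), (500, 100), (300, 50), (400, 95),
]
CAPACITY = 800               # SPM size in bytes (made-up)

def best_saving(items, cap):
    """Classic 0/1 knapsack DP over SPM capacity."""
    dp = [0] * (cap + 1)     # dp[c] = best saving using <= c bytes
    for size, gain in items:
        for c in range(cap, size - 1, -1):  # reverse: each item once
            dp[c] = max(dp[c], dp[c - size] + gain)
    return dp[cap]
```

Here the optimum picks the 500-byte and 300-byte objects for a saving of 150; a real formulation would also model the jump overhead between consecutive basic blocks, which the knapsack view ignores.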

