Outline
Introduction
Different Scratch Pad Memories
Cache and Scratch Pad for embedded applications

Memories in Embedded Systems
Each memory type has its own advantages; for good performance, memory accesses have to be fast.
[Diagram: CPU connected to Internal ROM, Internal SRAM, and External DRAM]

Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications

What is Scratchpad Memory?
Fast on-chip SRAM, abbreviated as SPM.
Two types of SPM allocation:
Static: the set of variables resident in SPM does not change at runtime.
Dynamic: the SPM contents change at runtime.
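
As a concrete illustration of a static allocation, one common way to pin a variable into SPM is a toolchain-specific section attribute. This is a minimal sketch assuming a GCC-style compiler and a linker script that maps a hypothetical ".spm" output section onto the on-chip SRAM address range:

    #include <stdint.h>

    /* Hypothetical section name; the linker script must map ".spm"
       onto the on-chip SRAM range for this placement to take effect. */
    static uint32_t hist[256] __attribute__((section(".spm")));

    uint32_t read_bin(int i) {
        return hist[i];   /* SPM access: single cycle, no miss possible */
    }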

Objective
Find a technique for efficiently exploiting on-chip SPM by partitioning the application's scalar and array variables between off-chip DRAM and on-chip SPM.
Minimize the total execution time of the application.

SPM and Cache
Similarities:
Both are connected to the same address and data buses.
Both have an access latency of 1 processor cycle.
Difference:
SPM guarantees single-cycle access time, while an access to cache is subject to misses.

Block Diagram of Embedded Processor Application

Division of Data Address Space between SRAM and DRAM

Example: Histogram Evaluation Code
Builds a histogram of 256 brightness levels for the pixels of an N x N image:

    char BrightnessLevel[512][512];
    int Hist[256];   /* elements initialized to 0 */
    ...
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            /* for each pixel (i, j) in the image */
            level = BrightnessLevel[i][j];
            Hist[level] = Hist[level] + 1;
        }

Problem Description
If the code is executed on a processor configured with a 1 KB data cache, performance is degraded by conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel.
Solution: selectively map to SPM those variables that cause the maximum number of conflicts in the data cache.
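
To see why the two arrays collide, consider which cache set an address maps to. A small sketch with illustrative parameters that are assumptions, not figures from the paper (a 1 KB direct-mapped cache with 16-byte lines): since Hist occupies exactly 1 KB, every line of BrightnessLevel maps onto a set also used by Hist, and the interleaved accesses in the loop repeatedly evict each other.

    #include <stdio.h>
    #include <stdint.h>

    enum { CACHE_BYTES = 1024, LINE_BYTES = 16 };   /* assumed geometry */

    static unsigned set_of(uintptr_t addr) {
        return (unsigned)((addr % CACHE_BYTES) / LINE_BYTES);
    }

    char BrightnessLevel[512][512];
    int Hist[256];   /* 256 * 4 bytes = 1 KB: spans every set of the cache */

    int main(void) {
        /* Two addresses touched in the same iteration can share a set. */
        printf("set of Hist[0]               = %u\n",
               set_of((uintptr_t)&Hist[0]));
        printf("set of BrightnessLevel[0][0] = %u\n",
               set_of((uintptr_t)&BrightnessLevel[0][0]));
        return 0;
    }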

Partitioning Strategy
Features affecting partitioning:
Scalar variables and constants
Size of arrays
Lifetimes of array variables
Access frequency of array variables
Conflicts in loops
Partitioning Algorithm

Features affecting partitioning
Scalar variables and constants: all scalar variables and scalar constants are mapped onto SPM.
Size of arrays: arrays that are larger than the SRAM are mapped onto off-chip memory.

Features affecting partitioning
Lifetime of an array variable: the period between its definition and its last use.
Just as scalar variables with disjoint lifetimes can share the same processor register, arrays with disjoint lifetimes can share the same memory space.

Features affecting partitioning
Intersecting Lifetimes, ILT(u): the number of array variables having a non-null intersection of lifetimes with u.
ILT(u) indicates how many other arrays u could interact with in the cache, so arrays with the highest ILT values are mapped into SPM, eliminating a large number of potential conflicts.

Features affecting partitioning
Access frequency of array variables:
Variable Access Count, VAC(u): the number of accesses to elements of u during its lifetime.
Interference Access Count, IAC(u): the number of accesses to other arrays during the lifetime of u.
Interference Factor: IF(u) = VAC(u) * IAC(u)
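
A minimal sketch of computing IF from profile counts, using the formula exactly as given on the slide; the profile numbers themselves are assumed inputs:

    /* Per-array profile data, gathered during a profiling run. */
    struct ArrayProfile {
        const char *name;
        long vac;   /* accesses to u during its lifetime            */
        long iac;   /* accesses to other arrays during u's lifetime */
    };

    long interference_factor(const struct ArrayProfile *u) {
        return u->vac * u->iac;   /* higher IF => more desirable in SPM */
    }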

Features affecting partitioning
Conflicts in loops. Example:

    for i = 0 to N-1
        access a[i]
        access b[i]
        access c[2i]
        access c[2i+1]
    end for

Loop Conflict Graph (LCG): nodes are arrays, and the edge weight e(u, v) is the sum over loops i = 1..p of k_i, where k_i is the total number of accesses to u and v in loop i.
Here the total number of accesses to a and c combined is (1+2)*N = 3N, so e(a,c) = 3N; e(b,c) = 3N; e(a,b) = 0.
[Figure: LCG with nodes a, b, c and edges of weight 3N from a to c and from b to c]

Features affecting partitioning
Loop Conflict Factor, LCF(u): the sum of the weights of the edges incident on node u, that is, LCF(u) = sum over v in LCG - {u} of e(u, v).
The higher the LCF, the more conflicts are likely for an array, and the more desirable it is to map that array to the SPM.
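
A small sketch of LCF as a row sum over an edge-weight matrix; the matrix contents are an assumed profiling result (for the loop above it would hold e[a][c] = e[b][c] = 3N and e[a][b] = 0):

    enum { NARRAYS = 3 };   /* a, b, c in the example above */

    long lcf(int u, const long e[NARRAYS][NARRAYS]) {
        long sum = 0;
        for (int v = 0; v < NARRAYS; v++)
            if (v != u)
                sum += e[u][v];   /* incident edge weights of node u */
        return sum;
    }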

Partitioning Strategy
Features affecting partitioning:
Scalar variables and constants
Size of arrays
Lifetimes of array variables
Access frequency of array variables
Conflicts in loops
Partitioning Algorithm

Partitioning Algorithm
An algorithm determines the mapping of each (scalar and array) program variable to SPM or DRAM/cache.
It first assigns scalar constants and variables to SPM.
Arrays that are larger than the SPM are mapped onto DRAM.

Partitioning Algorithm
For the remaining n arrays, the algorithm generates lifetime intervals and computes their LCF and IF values.
It sorts the 2n interval endpoints thus generated and traverses them in increasing order.
For each array u encountered, if there is sufficient SRAM space for u and for all arrays whose lifetimes intersect the lifetime interval of u and whose LCF and IF numbers are more critical, u is mapped to SPM; otherwise it goes to DRAM/cache.
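
For flavor, a deliberately simplified greedy sketch of this pass, not the paper's exact interval-point sweep: arrays are ranked by a criticality score (taking LCF + IF as the score is an assumption about how to combine the two metrics) and mapped to SPM while capacity remains.

    #include <stdlib.h>

    struct Arr { const char *name; long size, lcf, ifac; int in_spm; };

    static int by_criticality(const void *p, const void *q) {
        const struct Arr *x = p, *y = q;
        long cx = x->lcf + x->ifac, cy = y->lcf + y->ifac;
        return (cy > cx) - (cy < cx);          /* descending order */
    }

    void partition(struct Arr *a, int n, long spm_bytes) {
        qsort(a, n, sizeof *a, by_criticality);
        for (int i = 0; i < n; i++) {
            a[i].in_spm = (a[i].size <= spm_bytes);
            if (a[i].in_spm)
                spm_bytes -= a[i].size;        /* the rest go to DRAM/cache */
        }
    }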

Performance Details for Beamformer Example

Typical Applications
Dequant: de-quantization routine in an MPEG decoder application
IDCT: Inverse Discrete Cosine Transform
SOR: Successive Over-Relaxation algorithm
MatrixMult: matrix multiplication
FFT: Fast Fourier Transform
DHRC: Differential Heat Release Computation algorithm

Performance Comparison of Configurations A, B, C and D

Conclusion
Average improvement of 31.4% over configuration A (only SRAM)
Average improvement of 30.0% over configuration B (only cache)
Average improvement of 33.1% over configuration C (random partitioning)

Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems

Cache is one of the options for on-chip memory
[Diagram: CPU with Internal ROM and Cache on chip, External DRAM off chip]

Why All Embedded Systems Don't Have Cache Memory
The reasons include:
Increased on-chip area
Increased energy consumption
Increased cost
Hit latency and non-deterministic cache access times

A Method for Allocating Program Data to Non-Cached SRAM
Dynamic, i.e. the allocation changes at runtime
Compiler-decided transfers
Zero per-memory-instruction overhead, unlike software or hardware caching:
No software caching tags
No runtime checks
Highly predictable memory access times

Static Approach

    int a[100];
    int b[100];
    ...
    while (i < 100)
        ... a ...
    while (i < 100)
        ... b ...

[Diagram: an allocator assigns int a[100] and int b[100] between Internal SRAM and External DRAM once; the placement never changes at runtime]

Dynamic Approach

    int a[100];
    int b[100];
    ...
    while (i < 100)
        ... a ...
    while (i < 100)
        ... b ...

[Diagram: the allocator first holds int a[100] in Internal SRAM for the first loop, then swaps it out to External DRAM and brings in int b[100] for the second loop]
It is similar to caching, but under compiler control.

Compiler-Decided Dynamic Approach

    int a[100];
    int b[100];
    ...
    // a is in SRAM
    while (i < 100)
        ... a ...
    // copy a out to DRAM
    // copy b in to SRAM
    while (i < 100)
        ... b ...

Decide on the dynamic behavior statically.
Transfers have a cost, which must be minimized for greater benefit.
Accounts for changing program requirements at runtime.
The compiler manages and decides the transfers between SRAM and DRAM.
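
A minimal sketch of what the compiler-inserted transfers could look like at the C level, assuming spm_buf is placed in on-chip SRAM (e.g. via a linker section) while a and b have their home locations in DRAM:

    #include <string.h>
    enum { N = 100 };

    static int a[N], b[N];        /* home locations in external DRAM */
    static int spm_buf[N];        /* on-chip SPM space shared by a and b */

    void kernel(void) {
        memcpy(spm_buf, a, sizeof a);          /* bring a into SRAM     */
        for (int i = 0; i < N; i++)
            spm_buf[i] += 1;                   /* ...a... at SRAM speed */
        memcpy(a, spm_buf, sizeof a);          /* copy a out to DRAM    */

        memcpy(spm_buf, b, sizeof b);          /* copy b in to SRAM     */
        for (int i = 0; i < N; i++)
            spm_buf[i] *= 2;                   /* ...b...               */
        memcpy(b, spm_buf, sizeof b);
    }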

Approach
The method:
Uses profiling to estimate reuse.
Copies variables into SRAM when they are reused.
A cost model ensures that the benefit exceeds the cost.
Transfers data between on-chip and off-chip memory under compiler supervision.
The data allocation is known to the compiler at each point in the code.

Advantages
Benefits with no software translation overhead.
Predictable SRAM accesses, ensuring better real-time guarantees than hardware or software caching.
No more data transfers than caching.

Overview of Strategy
Divide the complete program into different regions.
At the starting point of each region:
    remove some variables from SRAM;
    copy some variables into SRAM from DRAM.

Some Important Questions
What are regions?
What to bring in to SRAM?
What to evict from SRAM?
The problem has an exponential number of candidate solutions (it is NP-complete).

Regions
A region is the code between successive program points; region boundaries coincide with changes in program behavior.
New regions start at:
the start of each procedure;
before the start of each loop;
before conditional statements containing loops or procedure calls.

What to Bring in to SRAM?
Bring in variables that are reused in the region, provided the cost of the transfer is recovered; these transfers reduce the overall memory access time.
The cost model accounts for:
profile-estimated reuse and the benefit from that reuse;
the detailed cost of the transfer, both the bring-in cost and the eviction cost.
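
A hedged sketch of the bring-in test; the latency and copy-cost parameters are illustrative assumptions, not figures from the paper:

    /* Bring u into SRAM only if the estimated benefit of serving its
       reuses from SRAM exceeds the cost of copying it in and out. */
    int worth_bringing_in(long reuse_count, long words,
                          int dram_latency, int sram_latency,
                          int copy_cost_per_word) {
        long benefit = reuse_count * (dram_latency - sram_latency);
        long cost = 2 * words * copy_cost_per_word;  /* bring-in + eviction */
        return benefit > cost;
    }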

What to Remove from SRAM?
Evict the data variables whose next use is furthest in the future.
This needs a concept of time order among the different code regions, which can be obtained by assigning timestamps to each of the nodes.
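
For illustration, a minimal victim-selection sketch, where next_use[] is assumed to hold each SPM-resident variable's next-use timestamp from the DPRG walk described next:

    /* Pick the resident variable whose next use is furthest away. */
    int pick_victim(const long next_use[], int n_resident) {
        int victim = 0;
        for (int v = 1; v < n_resident; v++)
            if (next_use[v] > next_use[victim])
                victim = v;
        return victim;
    }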

The Data-Program Relationship Graph
The DPRG is a new data structure that helps in identifying regions and marking timestamps.
It is essentially the program's call graph, appended with additional nodes for loops and variables.

Data-Program Relationship Graph
[Figure: DPRG for a program in which main calls Proc_A, Proc_B, and Proc_C, with a loop node and variable nodes a and b attached. Loops and procedures define regions; depth-first search order reveals execution-time order; "allocation-change points" sit at region changes.]

Time Stamps
The method associates a timestamp with every program point.
The timestamps form a total order among themselves, and the program points are reached at runtime in timestamp order.
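
A minimal sketch of stamping DPRG nodes in depth-first order; a real implementation must also handle back edges from loops and recursion, which this sketch assumes away:

    struct Node {
        long timestamp;
        int n_children;
        struct Node **children;
    };

    static long next_stamp = 0;

    /* DFS order approximates execution-time order over the DPRG. */
    void assign_timestamps(struct Node *n) {
        n->timestamp = next_stamp++;
        for (int i = 0; i < n->n_children; i++)
            assign_timestamps(n->children[i]);
    }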

Optimizations
There is no need to write unmodified or dead SRAM variables back to DRAM.
Data transfer code can be optimized to use DMA when it is available.
Data transfer code can be placed in special memory block copy procedures.
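
A sketch of the first optimization, skipping the DRAM write-back for unmodified or dead variables; the dirty/dead flags are assumed to be compiler-derived:

    #include <string.h>

    struct SpmVar { void *spm, *dram; long bytes; int dirty, dead; };

    void evict(struct SpmVar *v) {
        if (v->dirty && !v->dead)   /* write back only when it matters */
            memcpy(v->dram, v->spm, (size_t)v->bytes);
        v->dirty = 0;
    }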

Multiple Allocations due to Multiple Paths
The contents of SRAM could differ on different incoming paths to a node in the DPRG.
This problem can arise with:
loops;
conditional execution;
multiple calls to the same procedure.

Conditional Join Nodes
Favor the most frequent path: the consensus allocation is chosen assuming the incoming allocation from the most probable predecessor.
[Figure: join node with multiple incoming paths]

Procedure Join Nodes
Some program points have multiple timestamps; nodes with multiple timestamps are called join nodes, as they join multiple paths from main().
The strategy adopts different allocations for the different paths through the same code.

Offsets in SRAM
SRAM can become fragmented when variables are swapped out, so an intelligent offset mechanism is required.
This method places variables with similar lifetimes together, which yields larger free fragments when they are evicted together.
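
One plausible reading of that heuristic as code, a sketch that assigns SPM offsets in order of lifetime end so that variables dying together occupy adjacent offsets (the specific ordering key is an assumption):

    #include <stdlib.h>

    struct Var { long size, lifetime_end, offset; };

    static int by_death(const void *p, const void *q) {
        const struct Var *x = p, *y = q;
        return (x->lifetime_end > y->lifetime_end) -
               (x->lifetime_end < y->lifetime_end);
    }

    /* Variables with nearby death times get adjacent offsets, so their
       eviction frees one large contiguous fragment. */
    void assign_offsets(struct Var *v, int n) {
        qsort(v, n, sizeof *v, by_death);
        long off = 0;
        for (int i = 0; i < n; i++) {
            v[i].offset = off;
            off += v[i].size;
        }
    }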

Experimental Setup
Architecture: Motorola MCORE
Memory architecture: 2 levels of memory
SRAM size: estimated as 25% of the total data requirement
DRAM latency: 10 cycles
Compiler: GCC

Results

Conclusion
The designer has to choose the right mix of scratch pad and cache for performance advantages.

References
S. Udayakumaran and R. Barua. Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems.
P. R. Panda, N. D. Dutt, and A. Nicolau. Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications.
J. Pfrimmer, K. F. Li, and D. Rakhmatov. Balancing Scratch Pad and Cache in Embedded Systems for Power and Speed Performance.

Questions

Thank you