Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University.

Lecture 5: Scratch Pad Memories Motivation

Processor-Memory Performance Gap □“Moore’s Law”: µProc performance grows 55%/year (2X/1.5yr); DRAM only 7%/year (2X/10yrs) □Huge Processor-Memory Performance Gap □Cold start can take billions of cycles

More Serious Dimensions of the Memory Problem □Energy □Access times □Applications are getting larger and larger … □Sub-banking

Memory Performance Impact on Performance □Suppose a processor executes with □ideal CPI = 1.1 □50% arith/logic, 30% ld/st, 20% control and that 10% of data memory operations miss with a 50-cycle miss penalty □CPI = ideal CPI + average stalls per instruction = 1.1 (cycle) + ( 0.30 (datamemops/instr) x 0.10 (miss/datamemop) x 50 (cycle/miss) ) = 1.1 cycle + 1.5 cycle = 2.6, so 58% of the time (1.5/2.6) the processor is stalled waiting for memory! □A 1% instruction miss rate would add an additional 0.5 to the CPI (1.0 instr ref/instr x 0.01 x 50)!
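The stall arithmetic above can be sketched directly; the numbers are the ones from the slide's example, not measurements:

```python
# Stall-cycle arithmetic from the slide's example processor.
ideal_cpi = 1.1          # cycles per instruction with a perfect memory system
ld_st_frac = 0.30        # fraction of instructions that are loads/stores
data_miss_rate = 0.10    # fraction of data memory ops that miss
miss_penalty = 50        # cycles per miss

data_stalls = ld_st_frac * data_miss_rate * miss_penalty   # 1.5 cycles/instr
cpi = ideal_cpi + data_stalls                              # 2.6
stall_fraction = data_stalls / cpi                         # ~0.58

# Every instruction is fetched, so a 1% instruction miss rate adds
# 1.0 * 0.01 * 50 = 0.5 more cycles per instruction.
instr_stalls = 1.0 * 0.01 * miss_penalty
```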

The Memory Hierarchy Goal: Create an illusion □Fact: Large memories are slow and fast memories are small □How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)? □With hierarchy □With parallelism

A Typical Memory Hierarchy □On-chip components: Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache □Then: Second-Level Cache (SRAM), eDRAM, Main Memory (DRAM), Secondary Memory (Disk) □Speed (cycles): ½’s, 1’s, 10’s, 100’s, 1,000’s □Size (bytes): 100’s, K’s to 10K’s, M’s, G’s to T’s □Cost per byte: highest at the top, lowest at the bottom □By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology

The memory system frequently consumes >50% of the energy used for processing, both for a multi-processor with cache and for a uni-processor without caches. [M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, May 2007] [Segars 01, according to Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001]

Cache □Decoder logic [figure of cache organization omitted]

Energy Efficiency vs. Flexibility [H. de Man, Keynote, DATE‘02; T. Claasen, ISSCC99] [figure: Operations/Watt (GOPS/W) across technology nodes (1.0µ down to 0.07µ) for ASICs, reconfigurable computing, DSP-ASIPs, and µPs; the gap to the “Ambient Intelligence” requirement widens with flexibility and poor design techniques] □Necessary to optimize; otherwise the price for flexibility cannot be paid!

Timing Predictability □G.721: with a unified cache, the worst-case execution time (WCET) bound is larger than without a cache

Objectives for Memory System Design □(Average) Performance □Throughput □Latency □Energy consumption □Predictability, good worst case execution time bound (WCET) □Size □Cost □….

Scratch Pad Memories (SPM): Fast, Energy-Efficient, Timing-Predictable □SPMs are small, physically separate memories mapped into the address space (e.g., a region from 0 to FFF..); selection is by an appropriate address decoder (simple!), with no tag memory □Example: ARM7TDMI cores, well-known for low power consumption, with an on-chip scratch pad memory □Hierarchy: CPU with registers, SPM alongside the L1 cache, then L2 cache and RAM
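The “selection by address decoder” point can be made concrete with a toy model; the base address and size below are hypothetical, not from any particular board:

```python
# Minimal sketch of SPM selection by address decoding. Hypothetical map:
# the SPM occupies [0x0000, 0x1000); all other addresses go to main memory.
SPM_BASE, SPM_SIZE = 0x0000, 0x1000

def route(addr: int) -> str:
    """Return which memory services this address. A simple range check
    suffices; no tag lookup or comparison is needed, unlike a cache."""
    if SPM_BASE <= addr < SPM_BASE + SPM_SIZE:
        return "SPM"
    return "DRAM"
```

This range check is exactly what the hardware decoder does, which is why an SPM access costs no tag-array or comparator energy.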

Comparison of currents E.g.: ATMEL board with ARM7TDMI and ext. SRAM

Scratchpad vs. main memory □Example: Atmel ARM evaluation board □> 86% energy savings □The energy reduction is predictable

Why not just use a cache ? Energy consumption in tags, comparators and muxes is significant. [R. Banakar, S. Steinke, B.-S. Lee, 2001]

Influence of the associativity

Systems with SPM □Most of the ARM architectures have an on-chip SPM termed Tightly-Coupled Memory (TCM) □GPUs such as Nvidia’s 8800 have a 16KB SPM □It is typical for a DSP to have scratch pad RAM □Embedded processors like Motorola M-CORE, TI TMS370C □Commercial network processors – Intel IXP □And many more …

And for the Cell processor □Same motivation □Large main memory latency □Huge overhead for automatically managed caches □Local SPE processors fetch instructions and data from local storage, LS (256 kB) □LS is not designed as a cache: separate DMA transfers are required to fill and spill it

Advantages of Scratch Pads □Area advantage – for the same area, we can fit more SPM than cache (around 34% more) □An SPM consists of just a memory array & address-decoding circuitry □Less energy consumption per access □Absence of tag memory and comparators □Performance comparable with cache □Predictable WCET – required for real-time embedded systems (RTES)

Challenges in Using SPMs □With SPMs, the application developer or the compiler has to explicitly move data between memories □Data mapping is transparent in cache-based architectures □Is the binary compatible across memory configurations? □Do the advantages translate to a different machine?
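The explicit-management burden can be sketched as follows; the `spm_copy_in`/`spm_copy_out` helpers are hypothetical stand-ins for the DMA or copy routines a real platform would provide:

```python
# Sketch of explicit SPM data management. With a cache, the loop below
# would simply access dram["buf"] and the hardware would stage data
# transparently; with an SPM, the program (or compiler) must do it.
spm = {}                      # models the small on-chip SPM
dram = {"buf": [3, 1, 2]}     # models large off-chip memory

def spm_copy_in(name):        # explicit "fill" (a DMA transfer in practice)
    spm[name] = list(dram[name])

def spm_copy_out(name):       # explicit "spill" / write-back
    dram[name] = list(spm[name])

spm_copy_in("buf")
spm["buf"].sort()             # compute on the fast local copy
spm_copy_out("buf")
```

Forgetting either transfer silently produces stale data, which is exactly why this mapping problem is handed to the compiler.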

Data Allocation on SPM □Techniques focus on mapping □Global data □Stack data □Heap data □Broadly, we can classify techniques as □Static – the mapping of data is decided at compile time and remains constant throughout execution □Compile-time dynamic – the mapping is decided at compile time, but the data in the SPM changes during execution □Goals □To minimize off-chip memory accesses □To reduce energy consumption □To achieve better performance

Global Data □Panda et al., “Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications” □Map all scalars to SPM □Very small in size □Estimate conflicts between arrays □IAC(u): Interference Access Count – number of accesses to other arrays during the lifetime of u □VAC(u): Variable Access Count – number of accesses to elements of u □IF(u) = IAC(u) + VAC(u) □Loop Conflict Graph □Nodes are arrays □Edge weight of (u, v) is the number of accesses to u and v in the loop □More conflicts → map to SPM □Either the whole array goes to SPM or it does not
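A simplified sketch of this ranking, with toy access counts (the real technique also weighs loop-conflict-graph edges; the numbers and capacity here are made up for illustration):

```python
# Panda-style interference ranking, simplified. For each array u:
# VAC(u) = accesses to u; IAC(u) = accesses to other arrays during
# u's lifetime; IF(u) = IAC(u) + VAC(u). Higher IF -> better SPM candidate.
arrays = {            # name: (VAC, IAC, size_in_bytes) -- toy values
    "A": (1000, 400, 512),
    "B": (300, 900, 256),
    "C": (200, 100, 1024),
}
SPM_CAPACITY = 1024

ranked = sorted(arrays, key=lambda u: arrays[u][0] + arrays[u][1],
                reverse=True)

spm_set, used = [], 0
for u in ranked:                 # whole array goes to SPM or not at all
    size = arrays[u][2]
    if used + size <= SPM_CAPACITY:
        spm_set.append(u)
        used += size
```

Here A (IF = 1400) and B (IF = 1200) fit and are placed in the SPM; C (IF = 300) stays off-chip because it would overflow the remaining capacity.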

ILP Formulation □ILP variables □For functions □For basic blocks □For global variables

ILP Formulation □Objective: maximize energy savings □Size constraint: the selected objects must fit in the SPM □Placing consecutive basic blocks together avoids jumps to and back from off-chip memory
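Setting aside the consecutive-basic-block constraint, the core of this formulation reduces to a 0/1 knapsack: choose objects (functions, basic blocks, globals) to maximize total energy savings under the SPM size constraint. A sketch with made-up object names, sizes, and savings:

```python
# 0/1 knapsack view of the SPM selection problem (toy data).
items = [            # (name, size_in_bytes, energy_savings)
    ("f1", 400, 90),
    ("bb7", 200, 60),
    ("gA", 300, 50),
    ("bb9", 500, 70),
]
CAP = 700            # hypothetical SPM size

def best_selection(items, cap):
    # Classic DP over capacity; dp[c] = (best savings, chosen names).
    dp = [(0, ())] * (cap + 1)
    for name, size, gain in items:
        for c in range(cap, size - 1, -1):   # iterate downward: 0/1, not unbounded
            cand = (dp[c - size][0] + gain, dp[c - size][1] + (name,))
            if cand[0] > dp[c][0]:
                dp[c] = cand
    return dp[cap]

savings, chosen = best_selection(items, CAP)
```

An ILP solver handles the same objective and constraint declaratively, and also admits the extra linear constraints (e.g., for consecutive basic blocks) that this dynamic program does not model.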