
A Dynamic Code Mapping Technique for Scratchpad Memories in Embedded Systems
Amit Pabalkar
Compiler and Micro-architecture Lab, School of Computing and Informatics, Arizona State University
Master's Thesis Defense, October 2008

Agenda
Motivation
SPM Advantage
SPM Challenges
Previous Approach
Code Mapping Technique
Results
Continuing Effort

Motivation - The Power Trend
Cache consumes around 44% of total processor power.
Cache architectures cannot scale to many-core processors because of the performance degradation attributed to cache coherency.
Within the same process technology, a new processor design with 1.5x to 1.7x the performance consumes 2x to 3x the die area [1] and 2x to 2.5x the power [2].
For a given process technology with a fixed transistor budget, performance/power and performance/unit area scale with the number of cores.

Scratchpad Memory (SPM)
High-speed internal SRAM memory for the CPU.
SPM sits at the same level as the L1 caches in the memory hierarchy.
Directly mapped into the processor's address space.
Used for temporary storage of code and data in progress, giving single-cycle access to the CPU.

The SPM Advantage
40% less energy compared to a cache
▫ Absence of tag arrays, comparators and muxes
34% less area compared to a cache of the same size
▫ Simple hardware design (only a memory array and address-decoding circuitry)
Faster access to SPM than to a physically indexed and tagged cache
[Figure: a cache comprises a tag array, data array, tag comparators/muxes and an address decoder; an SPM needs only the memory array and address decoder.]

Challenges in Using SPMs
The application has to explicitly manage SPM contents
▫ Code/data mapping is transparent in cache-based architectures
Mapping challenges
▫ Partitioning the available SPM resource among different data
▫ Identifying data which will benefit from placement in SPM
▫ Minimizing data movement between SPM and external memory
▫ Optimal data allocation is an NP-complete problem
Binary compatibility
▫ The application is compiled for a specific SPM size
Sharing SPM in a multi-tasking environment
Completely automated solutions (read: compiler solutions) are needed.

Using SPM

Original Code:

    int global;
    FUNC2() {
        int a, b;
        global = a + b;
    }
    FUNC1() {
        FUNC2();
    }

SPM-Aware Code:

    int global;
    FUNC2() {
        int a, b;
        DSPM.fetch.dma(global)
        global = a + b;
        DSPM.writeback.dma(global)
    }
    FUNC1() {
        ISPM.overlay(FUNC2)
        FUNC2();
    }

Previous Work
Static techniques [3,4]: the contents of SPM do not change during program execution – less scope for energy reduction.
Profiling is widely used but has some drawbacks [3,4,5,6,7,8]
▫ The profile may depend heavily on the input data set
▫ Profiling an application as a pre-processing step may be infeasible for many large applications
▫ It can be a time-consuming, complicated task
ILP solutions do not scale well with problem size [3,5,6,8]
Some techniques demand architectural changes to the system [6,10]

Code Allocation on SPM
What to map?
▫ Segregation of code between cache and SPM
▫ Eliminates code whose penalty is greater than its profit
 - No benefit in architectures with a DMA engine
▫ Not an option in many architectures, e.g. the Cell
Where to map?
▫ The address on the SPM where a function will be mapped to, and fetched from, at runtime
▫ To use the SPM efficiently, it is divided into bins/regions and functions are mapped to regions
 - What are the sizes of the SPM regions?
 - What is the mapping of functions to regions?
▫ Solving the two problems independently leads to sub-optimal results
Our approach is a pure-software dynamic technique based on static analysis, addressing the 'where to map' issue. It simultaneously solves the region-sizing and function-to-region mapping sub-problems.

Problem Formulation
Input
▫ Set V = {v1, v2, …, vf} of functions
▫ Set S = {s1, s2, …, sf} of function sizes
▫ E_spm/access and E_cache/access
▫ E_mbst – energy per burst for the main memory
▫ E_ovm – energy consumed by an overlay manager instruction
Output
▫ Set {S1, S2, …, Sr} representing sizes of regions R = {R1, R2, …, Rr} such that Σ Sr ≤ SPM-SIZE
▫ Function-to-region mapping X[f,r] = 1 if function f is mapped to region r, such that Sf × X[f,r] ≤ Sr
Objective Function
▫ Minimize energy consumption
 - E_hit(vi) = nhit(vi) × (E_ovm + E_spm/access × si)
 - E_miss(vi) = nmiss(vi) × (E_ovm + E_spm/access × si + E_mbst × (si + sj) / N_mbst)
 - E_total = Σ (E_hit(vi) + E_miss(vi))
▫ Maximize runtime performance

Overview
[Compiler framework flowchart: the application goes through static analysis (GCCFG construction, weight assignment, interference graph); the SDRM heuristic/ILP produces a function-to-region mapping; the link phase emits an instrumented binary; cycle-accurate simulation produces energy and performance statistics.]

Limitations of the Call Graph
▫ No information on the relative ordering among nodes (call sequence)
▫ No information on the execution count of functions
[Figure: an example program (MAIN calling F1 and F2 inside loops; F2 conditionally calling F5; F5 calling F4 and F6) and its call graph over main and F1-F6.]

Global Call Control Flow Graph (GCCFG)
[Figure: the example program and its GCCFG, with F-nodes (functions), L-nodes (loops, loop factor 10) and I-nodes (conditionals); a recursion factor of 2 is used for recursive calls.]
Advantages
▫ Strict ordering among the nodes: the left child is called before the right child
▫ Control information is included (L-nodes and I-nodes)
▫ Node weights indicate execution counts of functions
▫ Recursive functions are identified

Interference Graph
Create the interference graph (I-Graph). The nodes of the I-Graph are functions (F-nodes from the GCCFG). There is an edge between two F-nodes if they interfere with each other. The edges are classified as Caller-Callee-no-loop, Caller-Callee-in-loop, Callee-Callee-no-loop, and Callee-Callee-in-loop.
Assign weights to the edges of the I-Graph:
▫ Caller-Callee-no-loop: cost[i,j] = (si + sj) × wj
▫ Caller-Callee-in-loop: cost[i,j] = (si + sj) × wj
▫ Callee-Callee-no-loop: cost[i,j] = (si + sj) × wk, where wk = MIN(wi, wj)
▫ Callee-Callee-in-loop: cost[i,j] = (si + sj) × wk, where wk = MIN(wi, wj)
[Figure: the example GCCFG, the resulting I-Graph over F1-F6 with weighted edges, and a table of function sizes: F1 = 2, F2 = 2, F3 = 3, F4 = 1, F5 = 4, F6 = 4.]

SDRM Heuristic
Suppose the SPM size is 7KB.
[Worked example: given the interference graph and the function sizes (F2 = 2, F3 = 3, F4 = 1, F6 = 4), the heuristic builds the region table step by step, comparing candidate placements such as R1 = {F2}, R2 = {F4}, R3 = {F6, F3} (total size 7, total cost 700) and keeping the mapping with the lowest total cost that fits.]

Flow Recap
[The compiler-framework flowchart from the Overview slide: static analysis (GCCFG, weight assignment, interference graph) feeds the SDRM heuristic/ILP; the resulting function-to-region mapping is linked into an instrumented binary and evaluated by cycle-accurate simulation for energy and performance statistics.]

Overlay Manager

    F1() {
        ISPM.overlay(F3)
        F3();
    }
    F3() {
        ISPM.overlay(F2)
        F2()
        …
        ISPM.return
    }

[Figure: the overlay table (per function: ID, region, VMA, LMA, size; e.g. F1 → region 0, VMA 0x30000, LMA 0xA00000, size 0x100) and the region table recording which function currently occupies each region, updated as the call chain main → F1 → F3 → F2 executes.]

Performance Degradation
The scratchpad overlay manager is mapped to cache.
The branch target table has to be cleared between function overlays to the same region.
Transfer of code from main memory to SPM is on demand:

    FUNC1( ) {
        computation
        …
        ISPM.overlay(FUNC2)
        FUNC2();
    }

Hoisting the overlay call above the computation overlaps the DMA transfer with useful work:

    FUNC1( ) {
        ISPM.overlay(FUNC2)
        computation
        …
        FUNC2();
    }

SDRM-prefetch
[Figure: the example GCCFG annotated with the computation available between each overlay point and the corresponding call (C1, C2, C3; Q = 10), and the resulting region tables for SDRM vs. SDRM-prefetch.]
Modified cost function:
▫ cost_p[vi, vj] = (si + sj) × min(wi, wj) × latency_cycles/byte − (Ci + Cj)
▫ cost[vi, vj] = cost_e[vi, vj] × cost_p[vi, vj]

Energy Model
E_TOTAL = E_SPM + E_I-CACHE + E_TOTAL-MEM
E_SPM = N_SPM × E_SPM-ACCESS
E_I-CACHE = E_IC-READ-ACCESS × (N_IC-HITS + N_IC-MISSES) + E_IC-WRITE-ACCESS × 8 × N_IC-MISSES
E_TOTAL-MEM = E_CACHE-MEM + E_DMA
E_CACHE-MEM = E_MBST × N_IC-MISSES
E_DMA = N_DMA-BLOCK × E_MBST × 4

Performance Model
chunks = (block_size + bus_width − 1) / bus_width    (64-bit bus)
mem_lat[0] = 18 cycles (first chunk)
mem_lat[1] = 2 cycles (each subsequent chunk)
total_lat = mem_lat[0] + mem_lat[1] × (chunks − 1)
latency_cycles/byte = total_lat / block_size

Results
Average energy reduction of 25.9% for SDRM.

Cache-Only vs. Split Architecture
[Figure: Architecture 1 has an X-byte on-chip instruction cache; Architecture 2 splits the same budget into an X/2-byte instruction cache plus an X/2-byte instruction SPM; both keep the same data cache.]
Average 35% energy reduction across all benchmarks.
Average 2.08% performance degradation.

Average performance improvement: 6%.
Average energy reduction: 32% (3% less).

Conclusion
By splitting an instruction cache into an equal-sized SPM and I-cache, a pure-software technique like SDRM will always result in energy savings.
There is a tradeoff between energy savings and performance improvement.
SPMs are the way to go for many-core architectures.

Continuing Effort
Improve the static analysis.
Investigate the effect of function outlining on the mapping.
Explore techniques to use and share SPM in multi-core and multi-tasking environments.

References
1. New Microarchitecture Challenges for the Coming Generations of CMOS Process Technologies. Micro32.
2. E. Grochowski, R. Ronen, J. Shen, H. Wang: Best of Both Latency and Throughput. IEEE International Conference on Computer Design (ICCD '04).
3. S. Steinke et al.: Assigning program and data objects to scratchpad memory for energy reduction.
4. F. Angiolini et al.: A post-compiler approach to scratchpad mapping of code.
5. B. Egger, S. L. Min et al.: A dynamic code placement technique for scratchpad memory using postpass optimization.
6. B. Egger et al.: Scratchpad memory management for portable systems with a memory management unit.
7. M. Verma et al.: Dynamic overlay of scratchpad memory for energy minimization.
8. M. Verma and P. Marwedel: Overlay techniques for scratchpad memories in low power embedded processors.
9. S. Steinke et al.: Reducing energy consumption by dynamic copying of instructions onto onchip memory.
10. S. Udayakumaran and R. Barua: Dynamic allocation for scratch-pad memory using compile-time decisions.

Research Papers
SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories
▫ International Conference on High Performance Computing, 2008 – first author
A Software Solution for Dynamic Stack Management on Scratch Pad Memory
▫ Asia and South Pacific Design Automation Conference, 2009 – co-author
A Dynamic Code Mapping Technique for Scratchpad Memories in Embedded Systems
▫ Submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Thank you!