Synthesis of Customized Loop Caches for Core-Based Embedded Systems
Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside


Synthesis of Customized Loop Caches for Core-Based Embedded Systems
Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship

2 Introduction
Opportunity to tune the microprocessor architecture to the program
[Figure: traditional pre-fabricated vs. core-based microprocessor architecture]

3 Introduction
Tunable parameters include:
–I-cache: size, associativity, replacement policy
–JPEG core: compression
–Buses: width, bus-invert/gray-code encoding

4 Introduction
Memory access can consume 50% of an embedded microprocessor's system power
–Caches tend to be power hungry
–M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
–ARM920T: caches consume half of total power (Segars 01)

5 Introduction
Advantageous to focus on the instruction fetching subsystem
[Figure: SoC block diagram with Processor, I$, D$, USB, Bridge, JPEG, CCDPP, and Mem]

6 Introduction
Techniques to reduce instruction fetch power
–Program compression
  Compress only a subset of frequently used instructions (Benini 1999)
  Compress procedures in a small cache (Kirovski 1997)
  Lookup-table based (Lekatsas 2000)
–Bus encoding
  Increment (Benini 1997)
  Bus-invert (Stan 1995)
  Binary/gray code (Mehta 1996)

7 Introduction
Techniques to reduce instruction fetch power (cont.)
–Efficient cache design
  Small buffers: victim, non-temporal, speculative, and penalty buffers to reduce miss rate (Bahar 1998)
  Memory array partitioning and variation in cache sizes (Ko 1995)
–Tiny caches
  Filter cache (Kin/Gupta/Mangione-Smith 1997)
  Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999)
  Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)

8 Cache Architectures – Filter Cache
Small L0 direct-mapped cache
Utilizes standard tag comparison and miss logic
Has low dynamic power
–Short internal bitlines
–Close to the microprocessor
Performance penalty of 21% due to high miss rate (Kin 1997)
[Figure: Processor, Filter cache (L0), L1 memory]
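The L0 lookup above can be sketched as a tiny direct-mapped model. This is an illustrative sketch, not the filter cache's actual implementation; the class and counter names are ours, and each line is assumed to hold one instruction.

```python
# Hypothetical sketch of a direct-mapped L0 filter cache lookup.
# Assumes word-addressed instruction fetches and one instruction per line.
class FilterCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # tag store; None = invalid line
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Return True on an L0 hit, False on a miss (with fill from L1)."""
        index = addr % self.num_lines    # direct-mapped index bits
        tag = addr // self.num_lines     # remaining address bits
        if self.tags[index] == tag:
            self.hits += 1               # low-power hit close to the core
            return True
        self.misses += 1                 # miss: fetch from L1 and fill L0
        self.tags[index] = tag
        return False
```

A high miss rate in such a small L0 is what produces the 21% performance penalty the slide cites: every miss adds an L1 access on the fetch path.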

9 Cache Architectures – Dynamically Loaded Loop Cache
Small tagless loop cache
Alternative location to fetch instructions
Dynamically fills the loop cache
–Triggered by any short backwards branch (sbb) instruction
Flexible variation
–Allows loops larger than the loop cache to be partially stored
[Figure: loop ending in "sbb -5"; iteration 1: detect sbb instruction, fetch from L1 memory; iteration 2: fill loop cache; iteration 3: fetch from loop cache]
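The detect/fill/fetch sequence in the figure can be modeled with a small state machine. This is a simplified sketch under our own assumptions (state names, one-instruction granularity, no partial storage from the flexible variation), not the authors' controller design.

```python
# Illustrative model of the sbb-triggered policy: iteration 1 detects the
# short backwards branch, iteration 2 fills the loop cache, and iteration 3
# onward fetches from it.
IDLE, FILL, ACTIVE = range(3)

def run_loop(loop_size, iterations, cache_size):
    """Return (fill_count, loop_cache_fetch_count) for one loop execution."""
    state, fills, fetches = IDLE, 0, 0
    for _ in range(iterations):
        for _ in range(loop_size):           # one pass over the loop body
            if state == FILL:
                fills += 1                   # instruction written into loop cache
            elif state == ACTIVE:
                fetches += 1                 # instruction fetched from loop cache
        # the taken sbb at the loop end drives the state transitions
        if state == IDLE and loop_size <= cache_size:
            state = FILL                     # next iteration fills the cache
        elif state == FILL:
            state = ACTIVE                   # cache now holds the whole loop
    return fills, fetches
```

For a 4-instruction loop run 5 times with a 32-entry cache, this sketch fills 4 entries on iteration 2 and serves the remaining 3 iterations (12 fetches) from the loop cache.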

10 Cache Architectures – Dynamically Loaded Loop Cache (cont.)
Limitations
–Does not support loops with control-of-flow changes (cofs)
–cofs terminate loop cache filling and fetching
–cofs include commonly found if-then-else statements
[Figure: loop containing "bne r1, r2, 3" before "sbb -5"; iteration 1: detect sbb instruction; iterations 2 and 3: fill loop cache, terminate at cof]

11 Cache Architectures – Preloaded Loop Cache
Small tagless loop cache
Alternative location to fetch instructions
Loop cache filled at compile time and remains fixed
–Supports loops with cofs
Fetch triggered by any short backwards branch
Start address variation
–Fetch begins on first loop iteration
[Figure: iteration 1: detect sbb instruction; iteration 2: check whether the loop is preloaded and, if so, fetch from the loop cache]
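The preloaded check can be sketched as a set-membership test on loop start addresses chosen at compile time; the class name, method names, and iteration convention below are illustrative assumptions, not the authors' hardware.

```python
# Minimal sketch of the preloaded loop cache's fetch decision: contents are
# fixed offline, so at runtime the controller only checks whether the sbb
# target is one of the preloaded loops.
class PreloadedLoopCache:
    def __init__(self, preloaded_starts, start_address_variation=False):
        self.preloaded = set(preloaded_starts)   # chosen at compile time
        self.start_address_variation = start_address_variation

    def fetch_from_loop_cache(self, target_addr, iteration):
        """True if this iteration's instructions come from the loop cache."""
        if target_addr not in self.preloaded:
            return False
        # base design: fetch only after the sbb is seen (iteration 2 onward);
        # start-address variation: fetch from the first iteration
        return self.start_address_variation or iteration >= 2
```

Because the contents never change at runtime, loops containing cofs pose no problem here, unlike the dynamically loaded design.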

12 Traditional Design
Traditional pre-fabricated IC
–Typically optimized for the best average case
–Intended to run well across a variety of programs
–A benchmark suite is used to determine which configuration performs best on average
[Figure: Processor, L1 memory, mux, with the cache configuration marked "?"]

13 Core-Based Design
–Known application
–Opportunity to tune the microprocessor architecture
Is it worth tuning the architecture to the application, or is the average case good enough?

14 Evaluation Framework – Candidate Cache Configurations

Type                                     Size       Number of loops / line size     Configurations
Original dynamically loaded loop cache   entries    n/a                             1-8
Flexible dynamically loaded loop cache   entries    n/a                             9-16
Preloaded loop cache (sa)                entries    2-3 loop address registers      17-32
Preloaded loop cache (sbb)               entries    2-6 loop address registers      33-72

15 Evaluation Framework – Motorola's Powerstone Benchmarks

Benchmark   # Instr Executed   Description
adpcm       63891              Voice Encoding
bcnt        1938               Bit Manipulation
binary      816                Binary Insertion
blit        22845              Graphics Application
brev        2377               Bit Reversal
compress                       Data Compression Program
crc         37650              Cyclic Redundancy Check
des                            Data Encryption Standard
engine                         Engine Controller
fir         16211              FIR Filtering
g3fax                          Group Three Fax Decode
insert      1942               Insertion Sort
jpeg                           JPEG Compression
summin                         Handwriting Recognition
ucbqsort                       U.C.B. Quick Sort
v                              Modem Encoding/Decoding

16 Tool Chain - Simulation
[Figure: simulation tool chain. LOOAN analyzes the program instruction trace and produces loop stats; packed loops and an exploration script drive lcsim over the trace for many configurations; lcsim emits loop cache stats, which lc power calc combines with technology info to produce loop cache power.]

17 Results - Averages
Configuration 11 (flexible, 32-entry, dynamically loaded loop cache)
–On average does well: 25% instruction fetch energy savings
Loop cache selection on a per-application basis
–Saves far more: 70% instruction fetch energy savings

18 Tool Chain - Simulation
[Figure: the simulation tool chain again, showing that lcsim must be re-run over the full program instruction trace for each of the many configurations explored.]

19 Tool Chain - Estimation
[Figure: estimation tool chain. LOOAN produces loop and function call statistics from the program instruction trace; a fast estimator evaluates closed-form equations over those statistics to produce loop cache stats, which lc power calc combines with technology info to produce loop cache power. The slow per-configuration simulation step is replaced by the estimator.]
What kind of statistics?
How can we use this information to model the various loop caches?

20 LOOAN
Loop hierarchy and function calls
How big are the loops?
How many times is each loop called?
Once a loop is called, how many times does it iterate?
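The kind of per-loop statistics listed above can be approximated from a flat instruction-address trace by watching for taken short backwards branches. This is our own rough sketch, not LOOAN itself; the function name, the 32-instruction offset threshold, and the trace encoding are assumptions.

```python
# Sketch of loop-statistics gathering from an instruction-address trace.
# A backwards jump of at most max_offset addresses is treated as a taken sbb.
def loop_stats(trace, max_offset=32):
    """Map loop start address -> (loop size, times the sbb was taken)."""
    stats = {}
    for prev, curr in zip(trace, trace[1:]):
        offset = prev - curr
        if 0 < offset <= max_offset:          # a short backwards branch was taken
            size, taken = stats.get(curr, (offset + 1, 0))
            stats[curr] = (size, taken + 1)
    return stats
```

Feeding a trace of a 4-instruction loop at address 10 executed for 3 iterations yields one loop of size 4 whose sbb was taken twice (the final, fall-through iteration takes no branch).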

21 Estimation – Original Dynamically Loaded Loop Cache
How many times do we fill the loop cache?

Loop whose only cof is the sbb:
  mov r5,r4 ... add r1,2 / sub r1,r2,3 ... sbb -5
  iter 1: detect sbb; iter 2: fill
  fills = # times loop called * loop size

Loop with a cof (here a bne) before the sbb:
  mov r5,r4 ... add r1,2 / sub r1,r2,3 / bne r1,r2,3 ... sbb -5
  iter 1: detect sbb; iter 2: fill, abort at cof; iter 3: fill, abort at cof
  if (cof != sbb)
    fills = # times loop called * (iterations per execution - 1) * offset to 1st cof
  else
    fills = # times loop called * loop size
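The two fill-count cases above can be written as one helper over LOOAN-style statistics. Parameter names are ours; the branch mirrors the cof != sbb vs. cof == sbb cases on the slide.

```python
# Sketch of the fill-count estimation for the original dynamically loaded
# loop cache. A loop whose only cof is the sbb fills completely once per
# execution; any earlier cof aborts filling at its offset, every iteration.
def estimate_fills(times_called, iters_per_exec, loop_size,
                   offset_to_first_cof=None):
    """Estimated number of loop cache fill operations."""
    if offset_to_first_cof is None:
        # cof == sbb: one full fill pass per loop execution
        return times_called * loop_size
    # cof != sbb: filling restarts and aborts at the cof each iteration
    # after the sbb has been detected
    return times_called * (iters_per_exec - 1) * offset_to_first_cof
```

For a 4-instruction loop called 5 times, this gives 20 fills when the sbb is the only cof; with a cof at offset 2 and 10 iterations per execution, it gives 5 * 9 * 2 = 90 aborted fill writes.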

22 Estimation – Original Dynamically Loaded Loop Cache
How many times do we fetch from the loop cache?

Loop whose only cof is the sbb:
  iter 1: detect sbb; iter 2: fill; iter 3: fetch from loop cache
  if (cof == sbb)
    fetches = # times loop called * (loop iterations - 2) * loop size

Loop with a cof before the sbb:
  iter 1: detect sbb; iter 2: fill, abort at cof; iter 3: fill, abort at cof
  Filling never completes, so nothing is fetched from the loop cache
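A companion sketch for the fetch count, under the same illustrative naming as the fill helper: fetching begins on the third iteration (detect, fill, then fetch), and a loop whose body contains a cof other than the sbb never fetches.

```python
# Sketch of the fetch-count estimation for the original dynamically loaded
# loop cache. Iterations 1 and 2 are spent detecting the sbb and filling;
# the remaining iterations are served from the loop cache.
def estimate_fetches(times_called, iters_per_exec, loop_size,
                     body_has_other_cof=False):
    """Estimated number of instruction fetches served by the loop cache."""
    if body_has_other_cof:
        return 0          # filling keeps aborting at the cof; nothing to fetch
    # max() guards against loops that iterate fewer than 3 times
    return times_called * max(iters_per_exec - 2, 0) * loop_size
```

A 4-instruction loop called 5 times with 10 iterations per execution yields 5 * 8 * 4 = 160 loop cache fetches; a single-iteration loop yields none.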

23 Estimation
Loop cache equations
–Each loop cache type is characterized by approximately 5 unique equations
–20 different equations in all

24 Estimation Results - Accuracy
Difference from simulation ranges from 0% to 16%
Average difference of 2%

25 Estimation Results - Fidelity
Does the estimation method preserve fidelity?
–summin shows the worst case: 10%
–On average, <1% difference in savings between the loop cache chosen via simulation and the loop cache chosen via estimation

26 Time Comparison
[Table: per-benchmark run times for the simulation tool chain (LOOAN, script generation, lcsim, lc power calc) vs. the estimation tool chain (LOOAN, estimator, lc power calc) for adpcm, compress, engine, g3fax, jpeg, summin, and v; more benchmarks in the paper. AVERAGE speedup: 67x.]
LOOAN is required for both methods; simulation was the bottleneck
Biggest example took only 30 minutes because these are small programs
Started looking at MediaBench, where simulation takes hours

27 Conclusion and Future Work
Important to tune the architecture to the program
Simulation methods are slow
–Presented an equation-based methodology that is faster than the previously used simulation-based methodology
–Accuracy/fidelity preserved
Future Work
–Expand the types of tiny caches considered
–Look at more benchmarks
  MediaBench: several hours (up to 48 hours) for our simulations
–Expand the hierarchy search

28 Thank you for your attention. Questions?