Tuning of Loop Cache Architectures to Programs in Embedded System Design
Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering

Presentation transcript:

Tuning of Loop Cache Architectures to Programs in Embedded System Design
Susan Cotterell and Frank Vahid*
Department of Computer Science and Engineering, University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship

2 Introduction
– Opportunity to tune the microprocessor architecture to the program
[Figure: traditional pre-fabricated vs. core-based microprocessor architecture]

3 Introduction
Examples of tunable parameters:
– I-cache: size, associativity, replacement policy
– JPEG accelerator: compression
– Buses: width, bus invert/gray code
[Figure: tunable parameters annotated on the core-based architecture (I$, buses, JPEG)]

4 Introduction
– Memory access can consume 50% of an embedded microprocessor system's power
– Caches tend to be power hungry
  - M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
  - ARM920T: caches consume half of total power (Segars 01)

5 Introduction
– Advantageous to focus on the instruction fetching subsystem
[Block diagram: processor with I$ and D$, bridge, USB, JPEG, CCDPP, and memory]

6 Introduction
Techniques to reduce instruction fetch power:
– Program compression
  - Compress only a subset of frequently used instructions (Benini 1999)
  - Compress procedures in a small cache (Kirovski 1997)
  - Lookup table based (Lekatsas 2000)
– Bus encoding (bus-invert is sketched below)
  - Increment (Benini 1997)
  - Bus-invert (Stan 1995)
  - Binary/gray code (Mehta 1996)
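Of the bus-encoding schemes listed, bus-invert is simple enough to sketch. A minimal illustration of the idea, assuming a 32-bit instruction bus; the constant and function names are illustrative, not from the slides:

BUS_WIDTH = 32                      # assumed bus width
MASK = (1 << BUS_WIDTH) - 1

def bus_invert(prev_word, word):
    """Return (word_to_send, invert_line) minimizing bus line toggles."""
    toggles = bin((prev_word ^ word) & MASK).count("1")
    if toggles > BUS_WIDTH // 2:    # more than half the lines would switch:
        return ~word & MASK, 1      # send the complement, assert invert line
    return word, 0                  # otherwise send the word as-is

The receiver undoes the inversion whenever the extra invert line is asserted, so at most half the bus lines ever toggle per transfer.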

7 Introduction
Techniques to reduce instruction fetch power (cont.):
– Efficient cache design
  - Small buffers (victim, non-temporal, speculative, and penalty) to reduce miss rate (Bahar 1998)
  - Memory array partitioning and variation in cache sizes (Ko 1995)
– Tiny caches
  - Filter cache (Kin/Gupta/Mangione-Smith 1997)
  - Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999)
  - Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)

8 Cache Architectures – Filter Cache
– Small L0 direct-mapped cache (lookup sketched below)
– Utilizes standard tag comparison and miss logic
– Has low dynamic power
  - Short internal bitlines
  - Close to the microprocessor
– Performance penalty of 21% due to high miss rate (Kin 1997)
[Figure: processor fetches through the filter cache (L0), which is backed by L1 memory]
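A minimal sketch of the filter cache's lookup path, assuming the direct-mapped organization stated above; the class and method names are invented for illustration:

class FilterCache:
    """Direct-mapped L0 cache sitting between the processor and L1."""

    def __init__(self, n_lines, line_bytes):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.tags = [None] * n_lines    # standard tag array

    def access(self, addr):
        """Return True on an L0 hit, filling the line from L1 on a miss."""
        line_addr = addr // self.line_bytes
        index = line_addr % self.n_lines
        tag = line_addr // self.n_lines
        if self.tags[index] == tag:
            return True                 # low-energy L0 hit
        self.tags[index] = tag          # fill from L1; this extra step
        return False                    # is the source of the miss penalty

Every fetch probes the small L0 first; the frequent misses of so small a cache are what produce the 21% performance penalty cited above.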

9 Cache Architectures – Dynamically Loaded Loop Cache
– Small tagless loop cache
– Alternative location to fetch instructions from
– Dynamically fills the loop cache (controller sketched below)
  - Filling is triggered by a short backwards branch (sbb) instruction
– Flexible variation
  - Allows loops larger than the loop cache to be partially stored
[Figure: example loop ending in "sbb -5"; iteration 1: detect sbb instruction, fetch from L1 memory; iteration 2: fill loop cache; iteration 3: fetch from loop cache]
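A rough state-machine sketch of the controller behavior described on this and the next slide; the interface and names are invented for illustration:

IDLE, FILL, FETCH = range(3)        # controller states

class DynamicLoopCache:
    def __init__(self, n_entries):
        self.n_entries = n_entries
        self.state = IDLE
        self.loop_start = None      # address of the loop's first instruction

    def step(self, pc, sbb_taken, other_cof, sbb_target=None):
        """Advance one fetch; return True if served from the loop cache."""
        if other_cof:
            self.state = IDLE       # any other change of flow aborts
        elif sbb_taken:
            if self.state == IDLE:  # iteration 1: sbb detected
                self.state = FILL   # fill during iteration 2
                self.loop_start = sbb_target
            elif self.state == FILL:
                self.state = FETCH  # iteration 3 onward: fetch from cache
        return (self.state == FETCH
                and self.loop_start is not None
                and 0 <= pc - self.loop_start < self.n_entries)

Note that the cache is tagless: the controller's state, not a tag compare, decides whether the fetch is redirected to the loop cache.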

10 Cache Architectures – Dynamically Loaded Loop Cache (cont.)
– Limitations
  - Does not support loops with control of flow changes (cofs)
  - cofs terminate loop cache filling and fetching
  - cofs include commonly found if-then-else statements
[Figure: loop containing "bne r1, r2, 3" before "sbb -5"; iteration 1: detect sbb instruction; iterations 2 and 3: fill loop cache, terminating at the cof each time; all fetches served from L1 memory]

11 Cache Architectures – Preloaded Loop Cache
– Small tagless loop cache
– Alternative location to fetch instructions from
– Loop cache is filled at compile time and remains fixed (lookup sketched below)
  - Supports loops with cofs
– Fetch triggered by a short backwards branch
– Start address (sa) variation
  - Fetch begins on the first loop iteration
[Figure: loop with a cof ("bne r1, r2, 3") ending in "sbb -5"; iteration 1: detect sbb instruction, fetch from L1 memory; iteration 2: check whether the loop is preloaded and, if so, fetch from the loop cache]
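A small sketch of the preloaded lookup, assuming the loop address registers hold the address ranges of the loops chosen offline; names are illustrative:

class PreloadedLoopCache:
    def __init__(self, loop_ranges):
        # loop_ranges: (start, end) address pairs chosen at compile time
        # by the loop selector, one pair per loop address register (LAR)
        self.lars = loop_ranges

    def hit(self, pc):
        """True if pc falls inside any preloaded loop (cofs included)."""
        return any(start <= pc <= end for start, end in self.lars)

Because the contents are fixed and the check is a simple range compare, cofs inside a loop no longer terminate fetching. In the sbb variant this check is armed only after a taken sbb, while the sa variant triggers on the loop's start address and so serves even the first iteration.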

12 Traditional Design
– Traditional pre-fabricated IC
  - Typically optimized for the best average case
  - Intended to run well across a variety of programs
  - A benchmark suite is used to determine which configuration to use
– On average, what is the best tiny cache configuration?
[Figure: processor with an unknown tiny cache ("?") between it and L1 memory]

13 Evaluation Framework – Candidate Cache Configurations

Type                                   | Size      | Number of loops / line size | Configuration
Original dynamically loaded loop cache | (entries) | n/a                         | 1-8
Flexible dynamically loaded loop cache | (entries) | n/a                         | 9-16
Preloaded loop cache (sa)              | (entries) | 2-3 loop address registers  |
Preloaded loop cache (sbb)             | (entries) | 2-6 loop address registers  |
Filter cache                           | (bytes)   | line size of 8 to 64 bytes  |
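The concrete size ranges and the remaining configuration numbers did not survive extraction, so the ranges in the sketch below are placeholders. One plausible way the numbered configuration space could be enumerated, mirroring the grouping in the table (configurations 1-8 original, 9-16 flexible, and so on):

from itertools import count

ENTRY_SIZES = [8, 16, 32, 64, 128, 256, 512, 1024]   # assumed range
LINE_SIZES = [8, 16, 32, 64]                          # bytes, from the table

def candidate_configs():
    cid = count(1)                        # configuration numbers 1, 2, ...
    for size in ENTRY_SIZES:
        yield next(cid), ("original dynamic", size)
    for size in ENTRY_SIZES:
        yield next(cid), ("flexible dynamic", size)
    for size in ENTRY_SIZES:
        for lars in (2, 3):
            yield next(cid), ("preloaded (sa)", size, lars)
    for size in ENTRY_SIZES:
        for lars in range(2, 7):
            yield next(cid), ("preloaded (sbb)", size, lars)
    for size in ENTRY_SIZES:
        for line in LINE_SIZES:
            yield next(cid), ("filter", size, line)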

14 Evaluation Framework – Motorola's Powerstone Benchmarks

Benchmark | Lines of C | # Instructions Executed | Description
adpcm     |            |                         | Voice Encoding
bcnt      | 90         | 1938                    | Bit Manipulation
binary    | 67         | 816                     | Binary Insertion
blit      |            |                         | Graphics Application
compress  |            |                         | Data Compression Program
crc       |            |                         | Cyclic Redundancy Check
des       |            |                         | Data Encryption Standard
engine    |            |                         | Engine Controller
fir       |            |                         | FIR Filtering
g3fax     |            |                         | Group Three Fax Decode
jpeg      |            |                         | JPEG Compression
summin    |            |                         | Handwriting Recognition
ucbqsort  |            |                         | U.C.B. Quick Sort
v42       |            |                         | Modem Encoding/Decoding

15 Simplified Tool Chain
Program instruction trace → loop selector (preloaded caches only) → lcsim → loop cache stats → lc power calculator (takes technology info) → loop cache power (the calculator step is sketched below)
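A plausible sketch of what the lc power calculator computes from the loop cache stats and the technology info; the per-access energy values are stand-ins, not numbers from the slides:

E_LOOP_CACHE = 0.1   # nJ per loop-cache access (assumed technology info)
E_L1_MEMORY = 1.0    # nJ per L1 memory access (assumed technology info)

def fetch_energy(lc_accesses, l1_accesses):
    """Total instruction-fetch energy for one simulated run."""
    return lc_accesses * E_LOOP_CACHE + l1_accesses * E_L1_MEMORY

def savings(lc_accesses, l1_accesses):
    """Fetch energy savings vs. fetching everything from L1 memory."""
    base = (lc_accesses + l1_accesses) * E_L1_MEMORY
    return 1 - fetch_energy(lc_accesses, l1_accesses) / base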

16 Best on Average
– Configuration 30: preloaded loop cache (sa), 512 entries, 3 loop address registers
  - 73% instruction fetch energy savings
– Configuration 105: filter cache, 1024 entries, line size of 32 bytes
  - 73% instruction fetch energy savings
[Chart: average energy savings per configuration, grouped as original, flexible, preloaded (sa), preloaded (sbb), and filter; configurations 30 and 105 marked]
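Given per-benchmark savings for every candidate, picking a best-on-average configuration reduces to an argmax over mean savings. A small sketch, with an assumed table layout (configuration id mapped to per-benchmark savings):

def best_on_average(savings_table):
    """savings_table: {config_id: {benchmark_name: savings_fraction}}."""
    def mean_savings(cfg):
        per_bench = savings_table[cfg].values()
        return sum(per_bench) / len(per_bench)
    return max(savings_table, key=mean_savings)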

17 Core Based Design
– The application is known
– Opportunity to tune the architecture
– Is it worth tuning the architecture to the application, or is the average case good enough?
[Figure: core-based microprocessor architecture]

18 Best on Average
– Both configurations perform well for some benchmarks, such as engine and summin
– However, both configurations perform below average for binary, v42, and others

19 Results – binary
– Config 30 yields 61% savings
– Config 105 yields 65% savings
– Config 31 (preloaded, 1024 entries, 2 loop address registers) yields 79% savings
[Chart: energy savings per configuration for binary, grouped as original, flexible, preloaded (sa), preloaded (sbb), and filter]

20 Results – v42
– Config 30 yields 58% savings
– Config 105 yields 23% savings
– Config 67 (preloaded, 512 entries, 6 loop address registers) yields 68% savings
[Chart: energy savings per configuration for v42, grouped as original, flexible, preloaded (sa), preloaded (sbb), and filter]

21 Results – averages
– Average case: best case 84%, Config 30 73%, Config 105 73%; improvement 11%
– adpcm: best case 68% (preloaded), Config 105 25%; improvement 43%
– v42: best case 68% (preloaded), Config 105 23%; improvement 45%
– blit: best case 96% (flexible), Config 30 87%; improvement 9%
– jpeg: best case 92% (filter), Config 30 69%; improvement 23%

22 Conclusion and Future Work
– Showed the benefits of tuning the tiny cache to a particular program
  - On average yields an additional 11% energy savings
  - Up to an additional 40% for some programs
– The environment is automated but requires several hours to find the best configuration
  - Current methodology is too slow
  - A faster method based on equations is described in an upcoming ICCAD 2002 paper