Synthesis of Customized Loop Caches for Core-Based Embedded Systems Susan Cotterell and Frank Vahid* Department of Computer Science and Engineering University.

Synthesis of Customized Loop Caches for Core-Based Embedded Systems Susan Cotterell and Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine This work was supported in part by the U.S. National Science Foundation and a U.S. Department of Education GAANN Fellowship

2 Introduction
Opportunity to tune the microprocessor architecture to the program
[Figure: traditional vs. core-based microprocessor architecture]

3 Introduction
I-cache
  –Size
  –Associativity
  –Replacement policy
JPEG
  –Compression
Buses
  –Width
  –Bus invert/gray code

4 Introduction Memory access can consume 50% of an embedded microprocessor’s system power –Caches tend to be power hungry M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99) ARM920T: caches consume half of total power (Segars 01)

5 Introduction Advantageous to focus on the instruction fetching subsystem Processor USB I$ D$ Bridge JPEGCCDPP4 Mem

6 Introduction
Techniques to reduce instruction fetch power
  –Program Compression
    Compress only a subset of frequently used instructions (Benini 1999)
    Compress procedures in a small cache (Kirovski 1997)
    Lookup table based (Lekatsas 2000)
  –Bus Encoding
    Increment (Benini 1997)
    Bus-invert (Stan 1995)
    Binary/gray code (Mehta 1996)

7 Introduction
Techniques to reduce instruction fetch power (cont.)
  –Efficient Cache Design
    Small buffers: victim, non-temporal, speculative, and penalty to reduce miss rate (Bahar 1998)
    Memory array partitioning and variation in cache sizes (Ko 1995)
  –Tiny Caches
    Filter cache (Kin/Gupta/Mangione-Smith 1997)
    Dynamically loaded tagless loop cache (Lee/Moyer/Arends 1999)
    Preloaded tagless loop cache (Gordon-Ross/Cotterell/Vahid 2002)

8 Cache Architectures – Filter Cache
Small L0 direct mapped cache
Utilizes standard tag comparison and miss logic
Has low dynamic power
  –Short internal bitlines
  –Close to the microprocessor
Performance penalty of 21% due to high miss rate (Kin 1997)
[Figure: Processor, Filter cache (L0), L1 memory]
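
The tag comparison and miss logic named on this slide can be illustrated with a toy direct-mapped cache model. This is only a sketch, not the design evaluated in Kin 1997: the class name `FilterCache`, the helper `run_loop`, the one-instruction-per-line organization, and the 8-line default size are all assumptions made for illustration.

```python
class FilterCache:
    """Toy direct-mapped L0 cache holding one instruction address per line.

    Hypothetical model for illustration; sizes and names are assumptions.
    """

    def __init__(self, num_lines=8):
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # None marks an invalid (empty) line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one instruction address; return True on a hit."""
        index = address % self.num_lines  # low-order part selects the line
        tag = address // self.num_lines   # remaining part is the stored tag
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: the instruction would come from L1 and replace the old line.
        self.tags[index] = tag
        self.misses += 1
        return False


def run_loop(cache, start, size, iterations):
    """Fetch a straight-line loop body `iterations` times through the cache."""
    for _ in range(iterations):
        for address in range(start, start + size):
            cache.access(address)
    return cache.hits, cache.misses
```

In this model a loop that fits in the cache misses only on its first pass, while a loop larger than the cache keeps evicting its own lines, which is one way a small L0 cache ends up with a high miss rate.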

9 Cache Architectures – Dynamically Loaded Loop Cache
Small tagless loop cache
Alternative location to fetch instructions
Dynamically fills the loop cache
  –Triggered by any short backwards branch (sbb) instruction
Flexible variation
  –Allows loops larger than the loop cache to be partially stored
Example loop: ... add r1,2 ... sbb -5
  Iteration 1: detect sbb instruction
  Iteration 2: fill loop cache
  Iteration 3: fetch from loop cache
[Figure: Processor, Dynamic loop cache, L1 memory, Mux]

10 Cache Architectures – Dynamically Loaded Loop Cache (cont.)
Limitations
  –Does not support loops with control of flow changes (cofs)
  –cofs terminate loop cache filling and fetching
  –cofs include commonly found if-then-else statements
Example loop: ... add r1,2 bne r1, r2, 3 ... sbb -5
  Iteration 1: detect sbb instruction
  Iteration 2: fill loop cache, terminate at cof
  Iteration 3: fill loop cache, terminate at cof
[Figure: Processor, Dynamic loop cache, L1 memory, Mux]
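
The detect/fill/fetch sequence described for the dynamically loaded loop cache can be modeled as a small state machine. The sketch below rests on simplifying assumptions that do not come from the slides: the loop fits in the loop cache, every iteration executes the whole loop body, and a fill aborted at a cof is retried on the next iteration. The function and parameter names are hypothetical.

```python
def simulate_dynamic_loop_cache(loop_size, iterations, first_cof_offset=None):
    """Count fetch activity for one execution of a loop.

    first_cof_offset is the number of instructions before the first internal
    cof, or None when the closing sbb is the only cof (assumed encoding).
    """
    counts = {"l1_fetches": 0, "fills": 0, "loop_cache_fetches": 0}
    state = "idle"
    for _ in range(iterations):
        if state == "idle":
            # First iteration runs from L1; the taken sbb at its end is detected.
            counts["l1_fetches"] += loop_size
            state = "fill"
        elif state == "fill":
            # Instructions still come from L1 while the loop cache is written.
            counts["l1_fetches"] += loop_size
            if first_cof_offset is None:
                counts["fills"] += loop_size
                state = "active"
            else:
                # Fill stops at the cof; the sbb triggers another fill attempt.
                counts["fills"] += first_cof_offset
        else:
            # Loop cache holds the whole loop, so L1 is not accessed.
            counts["loop_cache_fetches"] += loop_size
    return counts
```

For a 5-instruction loop that iterates 10 times, the model reaches the fetch state only when there is no internal cof; with a cof at offset 2 it keeps refilling and never fetches from the loop cache.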

11 Cache Architectures – Preloaded Loop Cache
Small tagless loop cache
Alternative location to fetch instructions
Loop cache filled at compile time and remains fixed
  –Supports loops with cof
Fetch triggered by any short backwards branch
Start address variation
  –Fetch begins on first loop iteration
Example loop: ... add r1,2 bne r1, r2, 3 ... sbb -5
  Iteration 1: detect sbb instruction
  Iteration 2: check to see if loop preloaded, if so fetch from cache
[Figure: Processor, Preloaded loop cache, L1 memory, Mux]
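
The difference between the sbb-triggered preloaded loop cache and its start address variation comes down to where the first iteration is fetched from. A minimal sketch of that difference follows, assuming every iteration executes the full loop body and that the variant labels `"sbb"` and `"sa"` map onto the two schemes on this slide; the function name and signature are hypothetical.

```python
def preloaded_fetch_counts(loop_size, iterations, preloaded, variant="sbb"):
    """Return (l1_fetches, loop_cache_fetches) for one execution of a loop."""
    total = loop_size * iterations
    if not preloaded or iterations == 0:
        # A loop that was not preloaded is always fetched from L1.
        return total, 0
    if variant == "sa":
        # Start address variation: fetching from the loop cache begins
        # on the first iteration.
        return 0, total
    if variant == "sbb":
        # The first iteration runs from L1 until the taken sbb is detected;
        # later iterations come from the loop cache.
        return loop_size, total - loop_size
    raise ValueError(f"unknown variant: {variant!r}")
```

Under these assumptions a 5-instruction preloaded loop running 10 iterations takes 5 instructions from L1 with the sbb trigger and none with the start address variation.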

12 Traditional Design
Traditional Pre-fabricated IC
  –Typically optimized for best average case
  –Intended to run well across a variety of programs
  –Benchmark suite is used to determine which configuration to use
[Figure: Processor, L1 memory, Mux, with the loop cache choice marked "?"]

13 Core Based Design
  –Know application
  –Opportunity to tune the architecture
Is it worth tuning the architecture to the application or is the average case good enough?
[Figure: microprocessor architecture]

14 Evaluation Framework – Candidate Cache Configurations
Type                                     Size            Number of loops / line size   Configuration
Original dynamically loaded loop cache   8-1024 entries  n/a                           1-8
Flexible dynamically loaded loop cache   8-1024 entries  n/a                           9-16
Preloaded loop cache (sa)                8-1024 entries  2-3 loop address registers    17-32
Preloaded loop cache (sbb)               8-1024 entries  2-6 loop address registers    33-72
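
The configuration numbering in this table is consistent with a simple enumeration of the design space, sketched below. Two assumptions are not stated on the slide: that "8-1024 entries" means the eight power-of-two sizes from 8 to 1024, and that size varies fastest within each cache type. The names `SIZES`, `FAMILIES`, and `enumerate_configs` are hypothetical.

```python
from itertools import product

# Assumption: "8-1024 entries" means 8, 16, 32, ..., 1024.
SIZES = [8 << i for i in range(8)]

# (cache type, loop address register counts); None where the table says n/a.
FAMILIES = [
    ("original dynamically loaded", [None]),
    ("flexible dynamically loaded", [None]),
    ("preloaded (sa)", [2, 3]),
    ("preloaded (sbb)", [2, 3, 4, 5, 6]),
]


def enumerate_configs():
    """Map configuration number -> (cache type, entries, loop address registers)."""
    configs = {}
    number = 1
    for family, register_counts in FAMILIES:
        # Assumed ordering inside a family: registers slowest, size fastest.
        for registers, size in product(register_counts, SIZES):
            configs[number] = (family, size, registers)
            number += 1
    return configs
```

With these assumptions the enumeration yields 72 configurations, and the type boundaries fall at 8, 16, 32, and 72 as in the table. The ordering inside the two preloaded families remains a guess.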

15 Evaluation Framework – Motorola's Powerstone Benchmarks
Benchmark   # Instr Executed   Description
adpcm       63891              Voice Encoding
bcnt        1938               Bit Manipulation
binary      816                Binary Insertion
blit        22845              Graphics Application
brev        2377               Bit Reversal
compress    138573             Data Compression Program
crc         37650              Cyclic Redundancy Check
des         122214             Data Encryption Standard
engine      410607             Engine Controller
fir         16211              FIR Filtering
g3fax       1128023            Group Three Fax Decode
insert      1942               Insertion Sort
jpeg        4594721            JPEG Compression
summin      1909787            Handwriting Recognition
ucbqsort    219978             U.C.B Quick Sort
v42         2442551            Modem Encoding/Decoding

16 Tool Chain - Simulation
[Figure: simulation tool chain. Labels: program instr trace, LOOAN, loop stats, packed loops & explr script, many configs., lcsim, loop cache stats, tech info, lc power calc, loop cache power]

17 Results - Averages
Configuration 11 (flexible/32 entry/dynamically loaded loop cache)
  –On average does well: 25% instruction fetch energy savings
Loop cache selection on a per application basis
  –Yields an additional 70% instruction fetch energy savings

18 Tool Chain - Simulation
[Figure: simulation tool chain as on slide 16, with the program instr trace and lcsim stage shown repeated. Labels: LOOAN, loop stats, packed loops & explr script, many configs., lcsim, loop cache stats, tech info, lc power calc, loop cache power]

19 Tool Chain - Estimation
[Figure: estimation tool chain. Labels: program instr trace, LOOAN, loop stats, packed loops, loop and function call statistics (func calls), estimator, loop cache stats, tech info, lc power calc, loop cache power; the estimator stage is marked "fast"]
What kind of statistics?
How can we use this information to model the various loop caches?

20 LOOAN
How big are the loops?
Loop hierarchy, function calls
Once the loop is called, how many times does it iterate?
How many times is the loop called?

21 Estimation – Original Dynamically Loaded Loop Cache
How many times do we fill the loop cache?
Loop without an internal cof: mov r5,r4 ... add r1,2 sub r1, r2, 3 ... sbb -5
  iter 1: detect sbb
  iter 2: fill
  if( loop size = 2)
    fills = # times loop called * loop size
Loop with an internal cof: mov r5, r4 ... add r1,2 sub r1, r2, 3 bne r1, r2, 3 ... sbb -5
  iter 1: detect sbb
  iter 2: fill, abort at cof
  iter 3: fill, abort at cof
  if( loop size = 2)
    if( cof != sbb)
      fills = # loop called * (iter per exec - 1) * offset to 1st cof
    else
      fills = # loop called * loop size
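
The two fill equations on this slide can be written as a single function. This is a sketch of the slide's formulas rather than the authors' estimator: the function and parameter names are hypothetical, an internal cof is encoded as a non-`None` offset, and the outer guard on loop size is left out because the transcript does not make its meaning clear.

```python
def estimate_fills(times_called, iters_per_exec, loop_size, first_cof_offset=None):
    """Estimate instructions written into the loop cache for one loop.

    first_cof_offset: instructions before the first internal cof, or None
    when the closing sbb is the only cof (assumed encoding).
    """
    if first_cof_offset is None:
        # cof == sbb: the whole loop is filled once per call of the loop.
        return times_called * loop_size
    # cof != sbb: every iteration after the first refills up to the cof.
    return times_called * (iters_per_exec - 1) * first_cof_offset
```

For a loop called 100 times with 10 iterations per call, a 5-instruction body fills 500 instructions without an internal cof, and 1800 when a cof sits at offset 2.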

22 Estimation - Original Dynamically Loaded Loop Cache
How many times do we fetch from the loop cache?
Loop without an internal cof: mov r5,r4 ... add r1,2 sub r1, r2, 3 ... sbb -5
  iter 1: detect sbb
  iter 2: fill
  iter 3: fetch from loop cache
  if( loop size = 3)
    fetch = # times loop called * (loop iter - 2) * loop size
Loop with an internal cof: mov r5, r4 ... add r1,2 sub r1, r2, 3 bne r1, r2, 3 ... sbb -5
  iter 1: detect sbb
  iter 2: fill, abort at cof
  iter 3: fill, abort at cof
  if( loop size = 3)
    if( cof == sbb)
      fetch = # times loop called * (loop iter - 2) * loop size
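
The fetch equation on this slide translates directly into code. In the sketch below, the zero result for loops with an internal cof follows from the slide giving no fetch term for that case; the clamp for loops with fewer than two iterations is an added assumption, and the names are hypothetical.

```python
def estimate_loop_cache_fetches(times_called, loop_iter, loop_size,
                                has_internal_cof=False):
    """Estimate instructions fetched from the loop cache for one loop."""
    if has_internal_cof:
        # Filling aborts at the cof every iteration, so fetching never starts.
        return 0
    # Iteration 1 detects the sbb and iteration 2 fills, so only the
    # remaining iterations are served from the loop cache.
    # Assumption: clamp at zero for loops that iterate fewer than twice.
    return times_called * max(loop_iter - 2, 0) * loop_size
```

A 5-instruction loop called 100 times with 10 iterations per call would fetch 100 * 8 * 5 = 4000 instructions from the loop cache under this equation.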

23 Estimation Loop Cache Equations –Each loop cache type is characterized by approximately 5 unique equations –20 different equations in all

24 Estimation Results - Accuracy Ranges from 0-16% difference Average 2% difference

25 Estimation Results - Fidelity Does the estimation method preserve the fidelity? –summin shows the worst case – 10% –On average <1% difference in savings between loop cache chosen via simulation vs. loop cache chosen via estimation

26 Time Comparison
                            Simulation Tool Chain (sec.)                               Estimation Tool Chain (sec.)
Benchmark  Num Instr Exec.  LOOAN  Script Gen  lcsim    lc power calc  Total sim time  LOOAN  Est.  lc power calc  Total est time  Speedup
adpcm      63891            0.31   0.01        32.15    0.01           32.48           0.31   0.16  0.01           0.48            68
compress   138573           0.85   0.01        82.50    0.01           83.37           0.85   0.14  0.01           1.00            83
engine     410607           2.12   0.02        214.99   0.01           217.14          2.12   0.08  0.01           2.21            98
g3fax      1128023          3.54   0.02        385.44   0.01           389.01          3.54   0.09  0.01           3.64            107
jpeg       4594721          17.57  0.01        1837.28  0.01           1854.87         17.57  0.12  0.01           17.7            105
summin     1909787          11.42  0.01        903.73   0.01           915.17          8.25   0.09  0.01           8.35            110
v42        2442551          12.07  0.01        1252.48  0.01           1264.57         12.27  0.12  0.01           12.40           102
more benchmarks in paper...
AVERAGE: 67
Notes
  –LOOAN required for both methods
  –Simulation was bottleneck
  –Biggest example only 30 minutes: small program
  –Started looking at MediaBench: simulation takes hours
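
The speedup column appears to be the total simulation time divided by the total estimation time, rounded to the nearest integer. That reading is an assumption; the sketch below applies it to the seven rows shown on this slide, with the totals copied from the table and hypothetical names `TOTAL_TIMES` and `speedups`.

```python
# (total simulation time, total estimation time) in seconds, from the slide.
TOTAL_TIMES = {
    "adpcm": (32.48, 0.48),
    "compress": (83.37, 1.00),
    "engine": (217.14, 2.21),
    "g3fax": (389.01, 3.64),
    "jpeg": (1854.87, 17.7),
    "summin": (915.17, 8.35),
    "v42": (1264.57, 12.40),
}


def speedups(total_times):
    """Return the simulation/estimation time ratio for each benchmark."""
    return {name: sim / est for name, (sim, est) in total_times.items()}


if __name__ == "__main__":
    for name, ratio in speedups(TOTAL_TIMES).items():
        print(f"{name:9s} {ratio:7.2f}")
```

Rounding each ratio reproduces the speedup values listed for these seven benchmarks. The reported average covers further benchmarks that appear only in the paper, so it cannot be recomputed from this slide alone.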

27 Conclusion and Future Work
Important to tune the architecture to the program
Simulation methods are slow
  –Presented an equation-based methodology that is faster than the simulation-based methodology previously used
  –Accuracy/fidelity preserved
Future Work
  –Expand types of tiny caches
  –Look at more benchmarks
    MediaBench: several hours (up to 48 hours) for our simulations
  –Expand hierarchy search

28 Thank you for your attention. Questions?
