
1 Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning. Frank Vahid, Associate Professor, Dept. of Computer Science and Engineering, University of California, Riverside. Also with the Center for Embedded Computer Systems at UC Irvine. http://www.cs.ucr.edu/~vahid

2 Trend Towards Pre-Fabricated Platforms: ASSPs. ASSP: application-specific standard product, a domain-specific pre-fabricated IC (e.g., a digital camera IC). ASIC: application-specific IC, a unique IC design (the count ignores quantity of the same IC). ASSP revenue now exceeds ASIC revenue, and ASSP design starts exceed ASIC design starts; ASIC design starts are decreasing due to the strong benefits of using pre-fabricated devices. Source: Gartner/Dataquest, September 2001.

3 Will High-End ICs Still Be Made? Yes, but the point is that mainstream designers likely won't be making them: such ICs will be very-high-volume or very-high-cost products. Platforms are one such high-volume product, and they need to be highly configurable to adapt to different applications and constraints. Custom high-end IC design is becoming out of reach of mainstream designers.

4 UCR Focus: Configurable Cache; Hardware/Software Partitioning.

5 UCR Focus: Configurable Cache; Hardware/Software Partitioning.

6 Configurable Cache: Why. [Diagram: a pre-fabricated platform IC (a pre-designed system-level architecture) containing a uP with L1 cache, DSP, JPEG decoder, peripherals, and FPGA.] Caches dominate processor power: on the ARM920T, caches consume half of total power (Segars '01); on the M*CORE, the unified cache consumes half of total power (Lee/Moyer/Arends '99).

7 Best Cache for Embedded Systems? Not clear: there is huge variety among popular embedded processors. What are the best associativity, line size, and total size?

8 Cache Associativity. Direct-mapped cache (1-way set associative): certain address bits "index" into the cache, and the remaining "tag" bits are compared against the stored tag; addresses that share an index conflict. Set-associative cache: multiple "ways", giving fewer index bits, more tag bits, and simultaneous tag comparisons. More expensive per access, but a better hit rate. [Diagram: addresses A, B, C, D mapping into a direct-mapped cache versus a 2-way set-associative cache.]
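The index/tag split described above can be sketched in a few lines. This is an illustrative sketch, not from the slides; the function name `split_address` and the 8-Kbyte, 32-byte-line parameters are assumptions chosen for the example.

```python
# Illustrative sketch (not from the slides): how a fixed-size cache's
# address splits into tag / index / offset fields, and how two addresses
# can conflict in a direct-mapped cache.
def split_address(addr, cache_bytes=8192, line_bytes=32, ways=1):
    """Split an address into (tag, index, offset) fields."""
    sets = cache_bytes // (line_bytes * ways)
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset

# Two addresses 8 KB apart share an index in a direct-mapped (1-way)
# cache, so each access evicts the other's line: a conflict.
print(split_address(0x0000, ways=1))  # tag 0, index 0
print(split_address(0x2000, ways=1))  # tag 1, same index 0: conflict
```

In a 2-way or 4-way cache the same two lines can coexist in the different ways of the shared set, which is exactly the hit-rate benefit the slide describes.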

9 Cache Associativity. Associativity reduces the miss rate, thus improving performance. But what is its impact on power and energy? (Energy = Power * Time.)

10 Associativity Is Costly. Associativity improves the hit rate, but at the cost of more power per access. Are the energy savings from reduced misses outweighed by the increased energy per access? [Charts: energy-per-access breakdown for an 8-Kbyte, 4-way set-associative cache, and energy per access for an 8-Kbyte cache, considering dynamic power only.]
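The trade-off on this slide can be captured with a first-order energy model. Every number below is made up for illustration (the slides' real data comes from simulation and measurement); the function `cache_energy` is a hypothetical helper.

```python
# First-order sketch of the hit-energy vs. miss-energy trade-off;
# all numbers are illustrative, not from the slides.
def cache_energy(accesses, miss_rate, e_hit, e_miss):
    # Every access pays the per-access (hit) energy; each miss adds a penalty.
    return accesses * e_hit + accesses * miss_rate * e_miss

accesses, e_miss = 1_000_000, 50.0                    # nJ per miss (made up)
direct = cache_energy(accesses, 0.02, 1.0, e_miss)    # cheap hits, more misses
four_way = cache_energy(accesses, 0.01, 2.0, e_miss)  # pricier hits, fewer misses
print(direct < four_way)  # with these numbers the 4-way's halved miss rate
                          # does NOT repay its doubled per-access energy
```

Flip the miss rates or miss penalty and the comparison reverses, which is why no single associativity wins across all programs.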

11 Associativity and Energy. The best-performing cache is not always the lowest-energy cache; in some cases its energy is significantly worse.

12 Associativity Dilemma. Direct-mapped cache: good hit rate on most examples and low power per access, but a poor hit rate on some examples, causing high power due to many misses. Four-way set-associative cache: good hit rate on nearly all examples, but high power per access; overkill for most examples, thus wasting energy. Dilemma: design for the average case or the worst case?

13 Associativity Dilemma. There is clearly no single best choice. Previous work: Albonesi proposed a configurable cache with a way-shutdown ability to save dynamic power; Motorola's M*CORE did likewise.

14 Our Solution: Way-Concatenatable Cache. The cache can be configured as 4, 2, or 1 way: ways can be concatenated, with an address bit selecting the way. [Diagram: an address bit steering a lookup between concatenated ways.]

15 Configurable Cache Design: Way Concatenation (4, 2, or 1 way). [Circuit diagram: configuration-register bits c0-c3 combine with address bits a11 and a12 in a small configuration circuit to select the way configuration; the address is split into tag (a31-a13), index (a12-a5), and line offset (a4-a0); the critical path runs through the data-array bitlines, sense amps, column mux, and output mux driver.] Small area and performance overhead.
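The bit-level arithmetic behind way concatenation can be sketched as follows. This is a hedged sketch of the general idea, assuming the 8-Kbyte, 32-byte-line geometry used elsewhere in the talk; the helper `field_widths` is hypothetical.

```python
# Sketch of the way-concatenation arithmetic: total size and line size
# are fixed, so halving the number of ways doubles the number of sets,
# moving one address bit from the tag field into the index field.
import math

def field_widths(cache_bytes=8192, line_bytes=32, ways=4, addr_bits=32):
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = int(math.log2(line_bytes))
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

for ways in (4, 2, 1):  # the three way-concatenate configurations
    print(ways, "way(s) ->", field_widths(ways=ways))
```

Going from 4 ways to 1 way grows the index by two bits, which is why only two address bits (a11 and a12 in the slide's diagram) need configurable routing.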

16 Way-Concatenation Experiments. Experiment: the Motorola PowerStone benchmark g3fax, considering dynamic power only (L1 access energy, CPU stall energy, and memory access energy). Way concatenation outperforms both the 4-way and direct-mapped caches, and is just as good as way shutdown.

17 Way-Concatenation Experiments. We considered 23 programs (PowerStone, MediaBench, and SPEC2000), dynamic power only (L1 access energy, CPU stall energy, memory access energy). Way concatenation is better than way shutdown (due to a smaller performance penalty), saves energy over a conventional 4-way cache, and also avoids the big penalties a 1-way cache incurs on some programs. (100% = conventional 4-way cache.)

18 Way-Concatenation Experiments. The best configuration varies per program, so the configuration must be tuned to a given program.

19 Normalized Execution Times. Way shutdown suffers a performance penalty, as does direct mapped. Way concatenation has almost no performance penalty, despite a critical path 3% longer than a conventional 4-way cache's.

20 Way Shutdown for Static Power Savings. Albonesi and Motorola used logic to gate the clock, which reduces dynamic power but not static (leakage) power. Way concatenation is clearly superior for reducing dynamic power, but shutting down ways is still useful for saving static power; for that we use another method (Agarwal's gated-Vdd DRG-cache). [Diagram: SRAM cell with a gated-Vdd control transistor between the cell and ground.]

21 Way Concatenation Plus Way Shutdown. Setting static power to 30% of dynamic power, way shutdown is now preferred in many examples, but way concatenation is still very helpful.

22 Configurable Line Size Too. The best line size also differs per example. Our cache can be configured for a line size of 16, 32, or 64 bytes; 64 is usually best, but 16 is much better in a couple of cases. (100% = conventional 4-way cache; csb = concatenate-plus-shutdown cache.)

23 Configurable Cache. A configurable cache with way concatenation, way shutdown, and a variable line size can save a lot of energy. It is well suited for configurable devices like Triscend's.

24 UCR Focus: Configurable Cache; Hardware/Software Partitioning.

25 Using On-Chip FPGA to Reduce Sw Energy. Hennessy/Patterson: "The best way to save power is to have less hardware" (p. 392). Actually, the best way is to have less ACTIVE hardware. Paradoxically, MORE hardware can actually REDUCE power, as long as overall activity is reduced. How?

26 Using On-Chip FPGA to Reduce Sw Energy. [Diagram: the pre-fabricated platform IC (uP, L1 cache, DSP, JPEG decoder, peripherals, FPGA), and a timeline of uP and FPGA active/idle intervals over a task period.] Move critical software loops to the FPGA. A loop executes in 1/10th the time; use the reclaimed time to power down the system for longer during the task period. Alternatively, slow down the microprocessor using voltage scaling.
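A back-of-the-envelope version of the timeline argument above can be written down directly. All numbers here are illustrative, not from the slides, and for simplicity the model ignores FPGA active power (a fuller model would charge the FPGA's active interval too).

```python
# Sketch of the power-down argument: shrinking active time within a
# fixed task period lets the system idle longer. Numbers are made up.
def task_energy(active_ms, period_ms, p_active_w, p_idle_w):
    idle_ms = period_ms - active_ms
    return active_ms * p_active_w + idle_ms * p_idle_w  # millijoules

period, p_active, p_idle = 10.0, 1.0, 0.05
sw_only = task_energy(8.0, period, p_active, p_idle)
# Suppose 6 of the 8 active ms were loop time, now 10x faster on the
# FPGA: active time drops to 2.0 + 0.6 = 2.6 ms.
with_fpga = task_energy(2.6, period, p_active, p_idle)
print(with_fpga < sw_only)  # less ACTIVE hardware time, less energy
```

This is the sense in which "more hardware" (the FPGA) reduces energy: total activity, and hence active-power time, goes down.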

27 The 90-10 Rule (or 80-20 Rule). Most software time is spent in a few small loops, e.g., in the MediaBench and NetBench benchmarks. This is known as the 90-10 rule: 10% of the code accounts for 90% of the execution time. Move those loops to the FPGA.
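The 90-10 rule bounds the achievable speedup via Amdahl's-law arithmetic. A minimal sketch, assuming (as the next slides do) that loops run 10x faster in hardware:

```python
# Amdahl's-law sketch: if a fraction of execution time is accelerated
# by a local speedup, the remainder limits the overall speedup.
def overall_speedup(frac_accelerated, local_speedup):
    return 1.0 / ((1.0 - frac_accelerated) + frac_accelerated / local_speedup)

# 90-10 rule with a 10x loop speedup: 0.1 + 0.9/10 = 0.19 of the
# original time remains, so the overall speedup is about 5.26x.
print(round(overall_speedup(0.9, 10.0), 2))
```

Even infinite loop speedup would cap the overall speedup at 10x here, since 10% of the time stays in software.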

28 Hardware/Software Partitioning Results. A speedup of 3.2 and energy savings of 34% were obtained with only 10,500 gates on average (simulation-based).

29 Analysis of Ideal Speedup. Each loop is 10x faster in hardware (an average based on observations). Notice the leveling off after the first couple of loops (due to the 90-10 rule): most of the speedup comes from the first few loops. Good for us: a moderate amount of FPGA gives most of the speedup. How much FPGA?
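The leveling-off effect can be reproduced with the same Amdahl-style arithmetic. The loop time fractions below are purely illustrative assumptions (largest loop first, summing to 0.9, as the 90-10 rule suggests); the function name `cumulative_speedup` is hypothetical.

```python
# Sketch of the leveling-off curve: cumulative speedup as loops are
# moved to hardware one at a time, each assumed 10x faster in hw.
def cumulative_speedup(loop_fracs, local_speedup=10.0):
    speedups, moved = [], 0.0
    for frac in loop_fracs:
        moved += frac
        speedups.append(1.0 / ((1.0 - moved) + moved / local_speedup))
    return speedups

# Illustrative loop time fractions, largest first (sum = 0.9):
fracs = [0.50, 0.20, 0.10, 0.05, 0.05]
print([round(s, 2) for s in cumulative_speedup(fracs)])
```

The gains per additional loop shrink quickly, which is why a moderate amount of FPGA captures most of the speedup.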

30 Speedup Gained with Relatively Few Gates. We manually created several partitioned versions of each benchmark. Most of the speedup is gained with the first 20,000 gates, surprisingly few. Stitt, Grattan, and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002. Stitt and Vahid, IEEE Design and Test, Dec. 2002. Villarreal, Suresh, Stitt, Vahid, and Najjar, Design Automation of Embedded Systems, 2002 (to appear).

31 Impact of Microprocessor/FPGA Clock Ratio. The previous data assumed equal clock frequencies; a faster microprocessor has a significant impact. We analyzed 1:1, 2:1, 3:1, 4:1, and 5:1 ratios, and are planning additional such analyses: memory bandwidth, power ratios, and more.

32 Software Improvements Using On-Chip Configurable Logic, Verified Through Physical Measurement. We performed physical measurements on Triscend A7 and E5 devices, with similar (even slightly better) results. [Photo: A7 IC and Triscend A7 development board.]

33 Other Research Directions: Tiny Caches. Impact of tiny caches on instruction-fetch power: filter caches, dynamic loop caches, preloaded loop caches. [Diagram: processor fetching through a mux from either a loop cache or the L1 cache/instruction memory.] Gordon-Ross, Cotterell, Vahid, Computer Architecture Letters, 2002. Gordon-Ross, Vahid, ICCD 2002. Cotterell, Vahid, ISSS 2002 and ICCAD 2002. Gordon-Ross, Cotterell, Vahid, IEEE TECS, 2002.

34 Other Research Directions: Platform-Based CAD. Use the physical platform to aid the search of the configuration space: configure the cache and the hw/sw partition, then configure, execute, and measure. Goal: define the best cooperation between desktop CAD and the platform. NSF grant, 2002-2005 (with N. Dutt at UC Irvine).

35 Other Research Directions: Dynamic Hw/Sw Partitioning. My favorite. Add an on-chip component that detects the most frequent software loops, decompiles a loop, performs compiler optimizations, synthesizes it to a netlist, places and routes the netlist onto the FPGA, and updates the software to call the FPGA: a self-improving IC. This can be invisible to the designer, appearing simply as an efficient processor. It can also dynamically tune the cache configuration. [Diagram: processor with I$ and D$, profiler, DMA, on-chip memory, and configurable logic.]

36 Current Researchers Working in Embedded Systems at UCR. Prof. Frank Vahid: 5 Ph.D. students, 2 M.S. Prof. Walid Najjar: 3 Ph.D. students, 1 M.S., working on hw/sw partitioning and on compiling C to FPGAs. Prof. Tom Payne: 1 Ph.D. student, working on compiling C to FPGAs. Prof. Jun Yang (new hire): working on low-power architectures (frequent-value detection). Prof. Harry Hsieh: 2 Ph.D. students, working on formal verification of system models. Prof. Sheldon Tan (new hire): 1 Ph.D. student, working on physical design and analog synthesis.

37 Conclusions. Highly configurable platforms have a bright future: cost equations just don't justify ASIC production as much as before, and Triscend parts are well situated (close collaboration desired). A configurable cache improves memory energy, and tuning to a particular program is CRUCIAL to low energy: way concatenation is effective at reducing dynamic power, way shutdown saves static power, and a variable line size reduces traffic; all must be tuned to a particular program. Configurable logic improves software energy without requiring excessive amounts of hardware. Many exciting avenues to investigate!

