1 ECE 260C – VLSI Advanced Topics Term paper presentation May 27, 2014 Keyuan Huang Ngoc Luong Low Power Processor Architectures and Software Optimization Techniques

2 Motivation
 ~10 billion mobile devices expected by 2018
 Moore's law is slowing down
 Power dissipation per gate remains unchanged
 How to reduce power?
 Circuit-level optimizations (DVFS, power gating, clock gating)
 Microarchitecture optimization techniques
 Compiler optimization techniques
[Chart: Global Mobile Devices and Connections Growth]
Trend: more innovation in architectural and software techniques to optimize power consumption

3 Low Power Architectures Overview
 Asynchronous Processors
 Eliminate the clock and use handshake protocols
 Save clock power, but at higher area cost
 Ex: SNAP, ARM996HS, Sun Sproull
 Application-Specific Instruction Set Processors (ASIPs)
 Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
 Combine basic instructions with custom instructions based on the application
 Ex: Tensilica's Xtensa, Altera's NIOS, Xilinx MicroBlaze, Sony's Cell, IRAM, Intel's EXOCHI
 Reconfigurable Instruction Set Processors
 Combine a fixed core with reconfigurable logic (FPGA)
 Lower NRE cost vs. ASIP
 Ex: Chimaera, GARP, PRISC, Warp, Tensilica's Stenos, OptimoDE, PICO
 No Instruction Set Computer (NISC)
 Build a custom datapath from the application code
 Compiler has low-level control of hardware resources
 Ex: WISHBONE system
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).

4 Conservation Cores (C-cores)
 Combine a GP processor with ASIP-like cores to reduce energy and energy-delay across a range of applications
 Broader range of applications compared to accelerators
 Reconfigurable via a patching mechanism
 Automatically synthesizable by a toolchain from C source code
 Energy consumption reduced by up to 16x for individual functions and 2.1x for whole applications
Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

5 C-core Organization
 Datapath (functional units, muxes, registers)
 Control unit (state machine)
 Cache interface (loads, stores)
 Scan chain (CPU interface)
Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

6 C-core Execution
 Compiler inserts stubs into the code at each c-core-compatible call site
 At run time the stub dispatches to the c-core if one is available; otherwise the GP processor executes the original code
 The c-core raises an exception when it finishes executing and returns the value to the CPU
Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

7 Patching Support
 Basic block mapping
 Control flow mapping
 Register mapping
 Patch generation
Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

8 Patching Example
 Configurable constants
 Generalized single-cycle datapath operators
 Control flow changes
Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

9 Results
 18 fully placed-and-routed c-cores vs. a MIPS baseline
 3.3x to 16x energy efficiency improvement
 Reduce system energy consumption by up to 47%
 Reduce energy-delay by up to 55% at the full-application level
 Even higher energy savings without patching support
Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.

10 Software Optimization Techniques
 The memory system consumes 1/10 to 1/4 of the power in portable computers
 System bus switching activity can be controlled by software
 ALU and FPU datapaths need good scheduling to avoid pipeline stalls
 Control logic and clock power are reduced by using the shortest possible program for the computation
K. Roy and M. C. Johnson, "Software Design for Low Power." NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.

11 General Categories of Software Optimization
 Minimizing memory accesses
 Minimize accesses needed by the algorithm
 Minimize total memory size needed by the algorithm
 Use multiple-word parallel loads, not single-word loads
 Optimal selection and sequencing of machine instructions
 Instruction packing
 Minimizing circuit state effects
 Operand swapping
K. Roy and M. C. Johnson, "Software Design for Low Power." NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.

12 Compiler Managed Partitioned Data Caches for Low Power (Rajiv Ravindran, Michael Chu, Scott Mahlke)
Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
 + Transparent to the user
 + Handle arbitrary instruction/data accesses
 - Limited program information
Software techniques: software-controlled scratch-pads, data/code reorganization
 + Whole-program information
 + Proactive
 - Conservative
Basic idea: compiler managed, hardware assisted, combining global program knowledge, proactive optimizations, and efficient execution
Ravindran, R., Chu, M., Mahlke, S.: "Compiler Managed Partitioned Data Caches for Low Power." In: LCTES 2007 (2007).

13 Traditional Cache Architecture
[Figure: address split into tag/set/offset fields; four ways of tag/data/LRU arrays compared against the tag; 4:1 mux selects the hit way]
Lookup: activate all ways on every access
Replacement: choose among all the ways
Disadvantages:
 - Fixed replacement policy
 - Set index ignores program locality
 - Set-associativity has high overhead
 - Activates multiple data/tag arrays per access

14 Partitioned Cache Architecture
[Figure: same tag/set/offset datapath over partitions P0-P3, but each ld/st instruction carries [Addr] [k-bit vector] [R/U]]
Lookup: restricted to the partitions specified in the bit-vector if 'R', else defaults to all partitions
Replacement: restricted to the partitions specified in the bit-vector
Advantages:
 + Improve performance by controlling replacement
 + Reduce cache access power by restricting the number of accesses

15 Partitioned Caches: Example
(a) Annotated code segment:

for (i = 0; i < N1; i++) {
  …
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

(b) Fused load/store instructions: ld1/st1, ld2/st2, ld3, ld4, ld5, ld6
(c) Trace consisting of array references, cache blocks, and loads/stores from the example
(d) Actual cache partition assignment for each instruction: ld1 [100], R (y); ld5 [010], R (w1/w2); ld3 [001], R (x), each restricted to its own partition

16 Compiler Controlled Data Partitioning
 Goal: place loads/stores into cache partitions
 Analyze the application's memory characteristics
 Cache requirements
 Number of partitions per load/store
 Predict conflicts
 Place loads/stores into different partitions
 Satisfy each instruction's caching needs
 Avoid conflicts; overlap where possible

17 Cache Analysis: Estimating Number of Partitions
[Figure: access trace X W1 Y Y repeated through the j-loop and X W2 Y Y through the k-loop, mapped to cache blocks B1 and B2; the highlighted reference M has reuse distance = 1]
 Use the reuse distance to compute the number of partitions
 Minimal partitions to avoid conflict/capacity misses
 Probabilistic hit-rate estimate

18 Cache Analysis: Estimating Number of Partitions
[Figure: probabilistic hit-rate curves over caches of 8 to 32 blocks for several reuse distances D]
 Avoid conflict/capacity misses for an instruction
 Estimate hit-rate from reuse distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., '99)
 In practice, compute energy metrics and pick the most energy-efficient configuration per instruction

19 Cache Analysis: Computing Interferences
 Avoid conflicts among temporally co-located references
 Model conflicts using an interference graph
[Figure: interference graph over references M1-M4, each with D = 1, built from the trace X W1 Y Y / X W2 Y Y]

20 Partition Assignment
 Placement phase can overlap references
 Compute the combined working set using the graph-theoretic notion of a clique
 For each clique, new D = Σ of the D of each node
 Combined D for all overlaps = max over all cliques
Example (references M1-M4, each with D = 1):
 Clique 1: M1, M2, M4 -> new reuse distance D = 3
 Clique 2: M1, M3, M4 -> new reuse distance D = 3
 Combined reuse distance = max(3, 3) = 3
[Figure: actual cache partition assignment for each instruction: ld1, st1, ld2, st2 (y); ld5, ld6 (w1/w2); ld3, ld4 (x), each group with its own partition and bit-vector ld1 [100] R, ld5 [010] R, ld3 [001] R]

21 Experimental Setup
 Trimaran compiler and simulator infrastructure
 ARM9 processor model
 Cache configurations:
 1 KB to 32 KB
 32-byte block size
 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
 MediaBench suite
 CACTI for cache energy modeling

22 Reduction in Tag & Data-Array Checks
[Chart: average way accesses (0-8) vs. cache size (1 KB to 32 KB, plus average) for 2-, 4-, and 8-partition caches]
25%, 30%, 36% access reduction on a 2-, 4-, 8-partition cache, respectively
Ravindran, R., Chu, M., Mahlke, S.: "Compiler Managed Partitioned Data Caches for Low Power." In: LCTES 2007 (2007).

23 Improvement in Fetch Energy
[Chart: percentage energy improvement (0-60%) per MediaBench benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and on average, for 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way on a 16 KB cache]
8%, 16%, 25% energy reduction on a 2-, 4-, 8-partition cache, respectively
Ravindran, R., Chu, M., Mahlke, S.: "Compiler Managed Partitioned Data Caches for Low Power." In: LCTES 2007 (2007).

24 Summary
 Maintain the advantages of a hardware cache
 Expose placement and lookup decisions to the compiler
 Avoid conflicts, eliminate redundancies
 Achieve higher performance and lower power consumption

25 Future Work
 Hybrid scratch-pads and caches
 Develop an advanced toolchain for newer technology nodes such as 28 nm
 Incorporate data-cache partitioning into the compiler of the ASIP toolchain

26 References
1. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
2. Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
3. Ravindran, R., Chu, M., Mahlke, S.: "Compiler Managed Partitioned Data Caches for Low Power." In: LCTES 2007 (2007).
4. Roy, K. and Johnson, M. C.: "Software Design for Low Power." NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.

