
1 A Novel Cache-Utilization Based Dynamic Voltage Frequency Scaling (DVFS) Mechanism for Reliability Enhancements
*Yen-Hao Chen, *Yi-Lun Tang, **Yi-Yu Liu, ***Allen C.-H. Wu, *TingTing Hwang
*National Tsing Hua University, Taiwan; **Yuan Ze University, Taiwan; ***Jiangnan University, China

Thank you, Mr. Chair. The title of my presentation is "A Novel Cache-Utilization Based Dynamic Voltage Frequency Scaling (DVFS) Mechanism for Reliability Enhancements".

2 Outline
- Background & motivation
  - Dynamic voltage frequency scaling (DVFS)
  - 7T/14T SRAM cache [1]
- Cache-utilization based voltage frequency scaling mechanism
  - Online control scheme
  - Tag copy operation
- Experimental results
- Conclusions

This is the outline. First, we discuss the background and the motivation. Then, we introduce the cache-utilization based voltage frequency scaling mechanism. Finally, the experimental results and conclusions. Let's start with the background and the motivation.

3 Dynamic Voltage Frequency Scaling
- Many modern processor designs support DVFS to reduce power consumption:
  Power ∝ Voltage² × Frequency
- The SRAM cache may become very unreliable under a near-threshold supply voltage
- Most DVFS frameworks are therefore limited by a lowest-supply-voltage constraint imposed by the reliability of the SRAM cache

The scaling relation is illustrated in the small sketch below.
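A minimal sketch of the dynamic-power relation, using the 0.8 V / 800 MHz and 0.55 V / 400 MHz operating points from the experimental setup on the backup slides; the proportionality constant and the function name are illustrative assumptions, not part of the paper.

```python
# Sketch of the DVFS dynamic-power relation, Power ∝ Voltage^2 × Frequency.
# The two operating points match the backup-slide configuration; the constant
# factor cancels out when comparing relative power.

def relative_dynamic_power(voltage_v: float, frequency_mhz: float) -> float:
    """Dynamic power up to a constant factor: P = C * V^2 * f."""
    return voltage_v ** 2 * frequency_mhz

nominal = relative_dynamic_power(0.8, 800.0)    # normal operating point
scaled = relative_dynamic_power(0.55, 400.0)    # near-threshold operating point

print(f"relative power at 0.55 V / 400 MHz: {scaled / nominal:.2f}x")
# -> roughly 0.24x, i.e. about a 76% dynamic-power reduction, which is why DVFS
#    is attractive despite the SRAM reliability problems at low voltage.
```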

4 7T/14T Reconfigurable SRAM [1]
- Trades off capacity for reliability under a near-threshold supply voltage
- Uses two additional transistors and a /CTRL line
- Connects two SRAM cells into a single robust one

[Figure: 7T/14T SRAM cell schematic with Bitline, /Bitline, Wordline 0, Wordline 1, and the /CTRL line]

A 7T/14T reconfigurable SRAM cache has been proposed. It trades off capacity for reliability enhancement under ultra-low supply voltage conditions. It connects two conventional 6T SRAM cells into a single robust SRAM cell using two additional transistors and a control line.

5 7T/14T Reconfigurable SRAM [1] (cont.)
Operates in three modes:
- Regular cache: normal mode
- Line-merged cache: high-speed mode and dependable low-power mode

                    Regular cache        Line-merged cache
  Mode              Normal               High-speed / Dependable low-power
  Cache capacity    Regular (4-way)      Half (2-way)
  Access latency    Regular (4 cycles)   Faster (3 cycles)
  Reliable voltage  Regular (0.75 V)     Better (0.55 V)

[Figure: 7T/14T SRAM cell schematic with Bitline, /Bitline, Wordline 0, Wordline 1, and the /CTRL line]

By controlling the control line and the wordlines, the 7T/14T SRAM cache can operate in three modes. First, the normal mode, which we call the regular cache: it turns off the two additional transistors so that the two conventional 6T SRAM cells operate independently. In the line-merged cache, the 7T/14T SRAM connects two SRAM cells together and halves the capacity to obtain either faster speed or better reliability, namely the high-speed mode or the dependable low-power mode.

6 Motivation
- But reducing cache capacity may...
  - Increase the cache miss rate
  - Result in performance and energy penalties
- This motivates us to...
  - Investigate the inter-relationship between the computational patterns in execution and the cache utilization
  - Devise an effective online control scheme that switches the operation modes of the 7T/14T SRAM cache to enhance reliability while minimizing performance and energy penalties

So, we can reduce the cache capacity to obtain better SRAM reliability. But reducing cache capacity may increase the cache miss rate and result in performance and energy penalties. These observations motivate us to investigate the inter-relationship between the computational patterns in execution and the cache utilization, and to devise an effective online control scheme that switches the 7T/14T SRAM cache operation modes for reliability enhancement with minimal performance and energy penalties.

7 Outline
- Background & motivation
  - Dynamic voltage frequency scaling (DVFS)
  - 7T/14T SRAM cache [1]
- Cache-utilization based voltage frequency scaling mechanism
  - Online control scheme
  - Tag copy operation
- Experimental results
- Conclusions

Now let's discuss the cache-utilization based voltage frequency scaling mechanism.

8 Online Control Scheme
The next operation mode is decided by the computational pattern of the last time period, as sketched below.

[Figure: execution timeline divided into time periods; the computational pattern observed while executing each period feeds the control scheme, which selects the operation mode (normal, high-speed, or dependable low-power) for the next period]
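A minimal sketch of the per-period control loop in Python. The hook names (apply_mode, read_counters, decide_mode), the counter interface, and the period granularity are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the online control scheme: at the end of each fixed time period,
# the counters gathered during that period choose the 7T/14T operation mode
# for the NEXT period.
NORMAL = "normal"
HIGH_SPEED = "high-speed"
DEPENDABLE_LOW_POWER = "dependable-low-power"

def control_loop(num_periods, apply_mode, read_counters, decide_mode):
    mode = NORMAL                      # start as a regular cache
    for _ in range(num_periods):
        apply_mode(mode)               # execute one time period in this mode
        stats = read_counters()        # e.g. MPKI and overhead-cycle counters
        mode = decide_mode(stats)      # last period's pattern picks next mode

# Tiny demo with stub hooks:
modes_applied = []
control_loop(3, modes_applied.append,
             lambda: {"mpki": 5.0, "overhead_cycles": -100},
             lambda s: HIGH_SPEED if s["overhead_cycles"] < 0 else NORMAL)
print(modes_applied)   # ['normal', 'high-speed', 'high-speed']
```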

9 The Proposed Control Scheme
Two metrics are used to decide the operation mode:
- Misses per kilo instructions (MPKI)
- Overhead cycles

Decision flow: if MPKI > threshold, enter the dependable low-power mode; otherwise, if the overhead cycles are < 0, enter the high-speed mode; else stay in the normal mode.

The 7T/14T SRAM cache has three operation modes: the normal mode, the high-speed mode, and the dependable low-power mode. We use two metrics to decide the operation mode: the misses per kilo instructions (MPKI) and the overhead cycles. A sketch of this decision appears below. Let's start with the first metric, MPKI.
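A minimal sketch of the two-metric decision; the default threshold value is an assumption for illustration, since the slides treat it as a tunable parameter.

```python
# Hedged sketch of the mode decision described on this slide.
def decide_mode(mpki: float, overhead_cycles: float,
                mpki_threshold: float = 10.0) -> str:
    # mpki_threshold = 10.0 is an assumed placeholder, not a value from the paper
    if mpki > mpki_threshold:
        return "dependable-low-power"   # frequency-insensitive, low locality
    if overhead_cycles < 0:
        return "high-speed"             # latency benefit outweighs capacity loss
    return "normal"

print(decide_mode(mpki=25.0, overhead_cycles=500))    # dependable-low-power
print(decide_mode(mpki=2.0, overhead_cycles=-60000))  # high-speed
```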

10 MPKI Metric
- High-MPKI applications can be switched to the dependable low-power mode for energy saving
  - Not sensitive to the CPU core frequency
  - Low cache utilization, i.e., low data locality

High-MPKI applications are not sensitive to the CPU core frequency and usually show low cache utilization, that is, not much data locality. In our experiments, low-MPKI applications show a severe performance penalty in the dependable low-power mode, whereas the effect on high-MPKI applications is relatively small. From these observations, we conclude that MPKI is a good metric for deciding whether to enter the dependable low-power mode. Next, we use the overhead cycles to decide whether to enter the high-speed mode or the normal mode.

11 Overhead Cycles Metric
The cycle difference between the normal and high-speed modes:
  Overhead_cycles = Cycle_cost − Cycle_benefit

- Cycle_cost: additional cycles caused by the capacity loss
    Cycle_cost = Δmiss_count × miss_penalty
  where Δmiss_count is the number of additional cache misses due to the capacity loss
- Cycle_benefit: cycles saved in the high-speed mode
    Cycle_benefit = inst_count_load/store × Δcycle
  where Δcycle is the cache-latency reduction in the high-speed mode

The overhead cycles are defined as the cycle difference between the normal mode and the high-speed mode, calculated as the cycle cost minus the cycle benefit. The cycle cost is the additional cycles caused by the capacity loss, calculated as Δmiss_count times the miss penalty. The cycle benefit is the cycles saved in the high-speed mode, calculated as the load/store instruction count times the latency reduction (Δcycle) of the high-speed mode. This completes our overall control scheme; a small numeric sketch follows. These quantities are easy to obtain in modern processors, except for Δmiss_count. In the next slides, we discuss how to obtain Δmiss_count by example.
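A worked sketch of the metric. The 20-cycle miss penalty matches the L2 latency from the backup-slide configuration and the 1-cycle Δcycle matches the 4-versus-3-cycle latency table; the counter values themselves are made up for illustration.

```python
# Sketch of the overhead-cycles computation. All inputs would come from
# hardware counters in the real design; the numbers below are illustrative.
def overhead_cycles(delta_miss_count: int, miss_penalty: int,
                    load_store_count: int, delta_cycle: int) -> int:
    cycle_cost = delta_miss_count * miss_penalty    # extra misses from halved capacity
    cycle_benefit = load_store_count * delta_cycle  # cycles saved per L1 access
    return cycle_cost - cycle_benefit

# Example: 2,000 extra misses at 20 cycles each, against 100,000 loads/stores
# each saving 1 cycle in high-speed mode.
print(overhead_cycles(2_000, 20, 100_000, 1))  # -60000 < 0 -> high-speed pays off
```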

12 Δmiss_count
Δmiss_count counts the accesses that hit in the regular cache but miss in the line-merged cache, i.e., the misses added by the capacity loss:
  Δmiss_count = Miss_count_half_sized − Miss_count_full_sized

[Figure: execution timeline; the control scheme uses the Δmiss_count of the last time period to pick the next operation mode]

Importantly, we have to obtain Δmiss_count regardless of the current operation mode, since it feeds the decision for the next time period.

13 Δmiss_count in Regular Cache
- Record the hit count in the LRU and lru blocks (the two least-recently-used ways)
- On a hit in the LRU or lru blocks (a cache hit): increase the miss counter
- On a miss: do not increase the miss counter

[Figure: a 4-way set holding blocks A, B, C, D ordered MRU, mru, lru, LRU; a hit on LRU block D raises Δmiss_count from 0 to 1 and promotes D to MRU]

Obtaining Δmiss_count in the regular cache is easy: we simply count hits on the lru blocks, as in the sketch below. For example, on a hit in the LRU block, we increase the miss counter and update the LRU sequence. By doing so, we obtain Δmiss_count while running as a regular cache.
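A minimal single-set sketch of this bookkeeping, assuming 4-way LRU where the two least-recently-used positions stand in for the capacity the half-sized (2-way) cache would lose; the data structures are illustrative, not the hardware counters.

```python
# Sketch of Δmiss_count bookkeeping in the full-sized (4-way) cache. A hit in
# either of the two least-recently-used positions of a set would have been a
# miss in the half-sized (2-way) cache, so each such hit bumps the counter.
def access(set_tags: list, tag: str, state: dict) -> bool:
    """set_tags is ordered MRU -> LRU (at most 4 entries). Returns True on hit."""
    if tag in set_tags:
        if set_tags.index(tag) >= 2:   # hit in the lru/LRU half of the set
            state["delta_miss"] += 1   # would have missed in the 2-way cache
        set_tags.remove(tag)
        set_tags.insert(0, tag)        # promote to MRU
        return True
    if len(set_tags) == 4:
        set_tags.pop()                 # evict the true LRU block
    set_tags.insert(0, tag)
    return False

state = {"delta_miss": 0}
tags = ["A", "B", "C", "D"]            # MRU -> LRU, as in the slide's example
access(tags, "D", state)               # hit on the LRU block D
print(state["delta_miss"])             # 1
```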

14 Δmiss_count in Line-Merged Cache
- The data array is reduced to half: only way 0 and way 1 hold data
- The full-sized tag array can still be used to maintain full-sized tag information

[Figure: the tag array still tracks four blocks A, B, C, D in MRU-to-LRU order, while the halved data array holds only the merged blocks in ways 0 and 1]

In the line-merged cache, the capacity of the data array is reduced to half; in this case, only way 0 and way 1 hold data. But we can still use the full-sized tag array to maintain the replacement-policy information of the full-sized cache and thus obtain Δmiss_count.

15 Δmiss_count in Line-Merged Cache (cont.)
- Simulate the full-sized cache replacement policy
- On a hit in the LRU or lru blocks (actually a cache miss, since the data is not resident):
  - Increase the miss counter
  - Evict the cache block
  - Copy the evicted tag to the appropriate location

[Figure: two accesses on the tag array; a hit on LRU tag D raises Δmiss_count from 0 to 1 (evict block B, fetch block D), while a later miss on tag E leaves Δmiss_count unchanged (evict block A, fetch block E); tag copies keep the full-sized LRU order intact]

For example, a hit in the LRU block is actually a cache miss, because the data is not in the data array. Besides increasing the miss counter, we copy the evicted tag into the appropriate location to maintain the replacement-policy information of the full-sized cache, and then update the LRU sequence. Similarly, on a miss, we copy the evicted tag to the appropriate location and update the LRU sequence. With the full-sized replacement-policy information, we can obtain Δmiss_count, as in the sketch below.
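A sketch of the same bookkeeping on the line-merged side, abstracting the ways into a "resident" set of at most two tags whose data is actually present. The physical tag-copy step is implicit in the list updates, and the victim-selection rule (evict the least-recently-used resident block) is an assumption.

```python
# Sketch of Δmiss_count in the line-merged mode: the full-sized tag list keeps
# simulating 4-way LRU while only two blocks per set hold data. A tag-array hit
# whose data is not resident is a real miss that the full-sized cache would
# have hit, so it increments Δmiss_count.
def access(tags: list, resident: set, tag: str, state: dict) -> bool:
    """tags is MRU -> LRU (at most 4); resident holds the <=2 tags with data."""
    in_tags = tag in tags
    if in_tags and tag in resident:
        tags.remove(tag); tags.insert(0, tag)   # true hit: promote to MRU
        return True
    if in_tags:
        state["delta_miss"] += 1                # full-sized cache would have hit
        tags.remove(tag)
    elif len(tags) == 4:
        victim_tag = tags.pop()                 # evict the full-size LRU tag
        resident.discard(victim_tag)            # drop its data if it was resident
    tags.insert(0, tag)                         # (re)insert at MRU
    if len(resident) == 2:                      # data array holds only 2 blocks
        victim = max(resident, key=tags.index)  # assumed: evict LRU resident block
        resident.discard(victim)
    resident.add(tag)                           # fetch the requested block's data
    return False

state = {"delta_miss": 0}
tags = ["A", "B", "C", "D"]            # full-sized LRU order, MRU -> LRU
resident = {"A", "B"}                  # only ways 0/1 hold data
access(tags, resident, "D", state)     # tag hit on LRU D, but data not resident
print(state["delta_miss"], sorted(resident))   # 1 ['A', 'D']
# Matches the slide's example: evict block B, fetch block D.
```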

16 Support for Tag Copy Operation
[Figure: modified tag-array schematic; the memory address splits into tag, index, and offset; the index drives the decoders of the four tag-array ways, the tag is compared against each way's output for hit detection, and a new tag buffer allows copying tag B from way 1 to way 3]

To perform the tag copy operation, we modify the cache architecture. In a conventional cache tag array, when a memory request arrives, the index is sent to the decoders and the tag is compared against the tags in the tag array to determine whether it is a hit, with the tag array updated if necessary. Now, if we want to copy tag B from way 1 to way 3, a tag buffer is required to temporarily store tag B before it is written to way 3. Finally, way 1 can be updated with the tag of the original memory request. This is the overall schematic; the copy sequence is sketched below.
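A small sketch of the three-step copy sequence the slide describes, modeling the tag array as four per-way lists and the new tag buffer as a single variable. The way choices follow the slide's example; the structure is illustrative, not RTL.

```python
# Sketch of the tag copy operation: buffer the evicted tag, write it to the
# destination way, then install the requesting address's tag in the source way.
def tag_copy(tag_array, set_index, src_way, dst_way, new_tag):
    tag_buffer = tag_array[src_way][set_index]   # 1) read evicted tag into buffer
    tag_array[dst_way][set_index] = tag_buffer   # 2) write it to the target way
    tag_array[src_way][set_index] = new_tag      # 3) update the source way with
                                                 #    the original request's tag

# Example: copy tag "B" from way 1 to way 3, then install a new tag "D".
tags = [["A"], ["B"], ["C"], ["X"]]              # one set, four ways
tag_copy(tags, 0, src_way=1, dst_way=3, new_tag="D")
print(tags)                                      # [['A'], ['D'], ['C'], ['B']]
```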

17 Overhead Analysis of Tag Copy Operation
- Small hardware area overhead: a tag buffer and two multiplexers
- No timing-delay overhead: not on the critical path
- One tag-access-cycle delay, performed in parallel with the data fetch from the next-level memory

The hardware overhead of our modification is relatively small: only a tag buffer and two multiplexers. There is no timing overhead, because the copy takes only one tag-access cycle and can be performed in parallel with fetching the data from the next level of the memory hierarchy.

18 Outline
- Background & motivation
  - Dynamic voltage frequency scaling (DVFS)
  - 7T/14T SRAM cache [1]
- Cache-utilization based voltage frequency scaling mechanism
  - Online control scheme
  - Tag copy operation
- Experimental results
- Conclusions

Next, the experimental results.

19 Configurations
- Reference: no DVFS, with a conventional 6T SRAM cache
- Online DVFS (0.75 V) [16]: a DVFS mechanism using a safe supply voltage
- Online DVFS (0.55 V): the same online DVFS using the near-threshold voltage
- CUB VFS (0.55 V): the proposed cache-utilization based voltage-frequency scaling mechanism with the 7T/14T SRAM cache using the near-threshold voltage

We evaluated four configurations: the reference configuration using a conventional 6T SRAM cache, the conventional online DVFS using a safe supply voltage, the conventional online DVFS using the near-threshold voltage, and finally our proposed cache-utilization based voltage-frequency scaling mechanism with the 7T/14T SRAM cache using the near-threshold voltage.

20 The Reliability Comparisons
CUB VFS can guarantee reliable operation when using the near-threshold supply voltage.

[Figure: reliability comparison [21]; online DVFS at 0.55 V falls in the unreliable region, while CUB VFS remains in the reliable region]

This is the reliability comparison. The conventional online DVFS can guarantee reliable operation under the safe supply voltage, but it becomes unreliable when using the near-threshold voltage. Our proposed mechanism, on the other hand, guarantees reliable operation under the near-threshold voltage.

21 The Performance and Energy Overheads
CUB VFS incurs no performance or energy overheads, and is even slightly better.

[Figure: normalized performance and energy comparison, annotated with the 2.6%, 5.1%, and 4.5% improvements discussed below]

Next, we examine the performance and energy overheads of our proposed mechanism. The results show that it incurs no performance or energy overhead, and is even slightly better: it outperforms the reference configuration by 2.6% and the online DVFS configuration by 5.1%, and it achieves 4.5% more energy saving than the online DVFS configuration.

22 Outline
- Background & motivation
  - Dynamic voltage frequency scaling (DVFS)
  - 7T/14T SRAM cache [1]
- Cache-utilization based voltage frequency scaling mechanism
  - Online control scheme
  - Tag copy operation
- Experimental results
- Conclusions

Finally, the conclusions.

23 Conclusions
- A cache-utilization based dynamic voltage frequency scaling mechanism for reliability enhancement is proposed
- It adopts the reconfigurable 7T/14T SRAM cache, greatly reducing the bit-error probability
- It effectively keeps operation in a reliable state while paying no performance or energy penalties

A cache-utilization based dynamic voltage frequency scaling mechanism for reliability enhancement is proposed. It adopts the reconfigurable 7T/14T SRAM cache and can greatly reduce the bit-error probability. It effectively keeps operation in a reliable condition without paying any performance or energy penalties.

24 Thank you for listening
That concludes my presentation. Thank you for listening.

25 Backup Slides

26 Examples of Modified Placement Policy

27 Experimental Environment
- GEMS simulator
- Quad-core system, Linux kernel
- 20 randomly mixed workloads from SPEC CPU 2006

                      Normal mode   Dependable low-power mode   High-speed mode
  Supply voltage      0.8 V         0.55 V                      0.8 V
  CPU frequency       800 MHz       400 MHz                     800 MHz
  L1 cache size       32 KB         16 KB                       16 KB
  L1 cache assoc.     8-way         4-way                       4-way
  L1 cache latency    4 cycles      3 cycles                    3 cycles
  L2 cache            256 KB, 8-way, 20 cycles (all modes)
  DRAM latency        300 cycles    150 cycles                  300 cycles

