
1 Garo Bournoutian and Alex Orailoglu, Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC'08), June 2008 (presented 2015/10/28)

2 Abstract
Today, embedded processors are expected to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate. This paper explores a new data cache design for modern high-performance embedded processors that dynamically improves execution time, power efficiency, and determinism within the system. The simulation results show significant improvements in cache miss ratio and power consumption of approximately 30% and 15%, respectively.

3 What's the Problem
Primary (L1) caches in embedded processors are direct-mapped for power efficiency
 However, direct-mapped caches are predisposed to thrashing
Hence, a cache design is required that will
 Improve performance, power efficiency, and determinism
 Minimize the area cost

4 Related Works
Cache optimization techniques for embedded processors:
 Filter caches [6]: reduce cache conflicts and cache pollution
 Victim cache [2]: retain data evicted from the L1 in a small associative buffer, providing extended associativity
 Dual data cache scheme [3]: distinguish spatial, temporal, and single-use memory references to improve cache utilization
 Application-specific cache partitioning [4]
 Pseudo-associative caches [5]: place blocks in a second associated line
 Way shutdown [7]: shut down cache ways adaptively per application
This paper: dynamically detect thrashing behavior and expand the selected sets of the data cache; the expanded cache lookup happens only when necessary, which increases power efficiency

5 Motivating Example
Illustrates why the selected sets need to be expanded dynamically
 Also shows the insufficiency of the victim cache
Example thrashing code
 B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R
[Figure: successive cache thrashing among Set-R, Set-S, and Set-Q]
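The slides do not reproduce the code itself, so the following is a minimal C sketch of the kind of loop that produces this pattern, assuming the six arrays happen to be laid out so that A/D, B/E, and C/F alias to the same direct-mapped sets (the array names match the slide; the size and layout are illustrative):

```c
/* Hypothetical thrashing kernel matching the slide's mapping:
 * assume the arrays are placed so that A and D alias to Set-R,
 * B and E to Set-S, and C and F to Set-Q in a direct-mapped cache. */
#define N 4096
static float A[N], B[N], C[N], D[N], E[N], F[N];

void kernel(void) {
    for (int i = 0; i < N; i++) {
        A[i] = B[i] + C[i];   /* fills Set-R, Set-S, Set-Q       */
        D[i] = E[i] + F[i];   /* evicts them again on every pass */
    }
}
```

Every iteration, the second statement evicts the lines the first statement just loaded, so each access misses, and the three uncorrelated conflict streams overwhelm a small victim cache.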

6 Motivating Example (cont.)
Cache trace of the example thrashing code
 B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R
[Figure: main cache plus a 2-entry victim cache replaying the sequence B, C, A, E, F, D. Each pass evicts the previous pass's lines from Set-S, Set-Q, and Set-R, and the uncorrelated evicted data pollutes the victim cache, which is too small to absorb the three conflict streams]

7 The Dynamically Expandable L1 Cache Architecture
Two cooperating mechanisms:
 1st: a circular recently-evicted-set list
 2nd: an expandable cache lookup

8 (1) Circular Recently-Evicted-Set List
A small circular list
 Keeps track of the indices of the most recently evicted sets
Goal: detect a probable thrashing set
Operation
 Look up the circular list only on a cache miss
  。 If the missed set is present in the list, enable the expand bit for that set: the current set is concluded to be in a thrashing state and should be dynamically expanded
The circular list is accessed and updated
 Only during a cache miss
  。 Hit timing is not affected
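A minimal C sketch of this mechanism, assuming a 256-set cache and a 5-entry list (the sizes come from the experimental setup later in the talk; the function name `note_miss` and the data-structure details are illustrative, not the paper's hardware):

```c
#define LIST_LEN 5            /* 5-entry recently-evicted-set list */
#define NUM_SETS 256

/* NUM_SETS is out of range for a set index, so it marks empty slots. */
static unsigned evicted[LIST_LEN] = { NUM_SETS, NUM_SETS, NUM_SETS,
                                      NUM_SETS, NUM_SETS };
static unsigned head;                    /* next slot to overwrite */
static unsigned char expand_bit[NUM_SETS];

/* Called only on a cache miss, so the hit path's timing is untouched. */
void note_miss(unsigned set) {
    /* A set that misses again while still on the list is probably
     * thrashing: enable its expand bit. */
    for (int i = 0; i < LIST_LEN; i++)
        if (evicted[i] == set)
            expand_bit[set] = 1;
    /* Record this eviction, overwriting the oldest entry. */
    evicted[head] = set;
    head = (head + 1) % LIST_LEN;
}
```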

9 (2) Expandable Cache Lookup
Goal: allow a set to re-lookup into a predefined secondary set (virtually doubling the associativity of a given set)
Operation
 The secondary set is determined by a fixed mapping function
  。 Flip the most significant bit of the set index (e.g., with a 2-bit index, "00"↔"10" and "01"↔"11")
 Besides the expand bit, each cache set keeps a toggle bit
  。 Selects whether the initial lookup targets the primary or the secondary set
  。 Enabled when a cache hit occurs on the secondary set
  。 Disabled when a cache hit occurs on the primary set
Lookup sequence, once the first mechanism has flagged a probable thrashing set:
 1st lookup misses and the expand bit is 1
 2nd lookup probes the predefined secondary set on the next cycle
  。 Found: cache hit with a one-cycle penalty
  。 Not found: full cache miss
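A C sketch of the lookup sequence for a 256-set cache (8-bit index). The tag array, valid-bit omission, and the exact toggle-update policy are inferred from the slide's description and should be read as an approximation of what would be combinational hardware:

```c
#define NUM_SETS 256
#define MSB_MASK (NUM_SETS >> 1)  /* 0x80: flips "0xxxxxxx" <-> "1xxxxxxx" */

static unsigned tag_of[NUM_SETS];        /* one tag per set; valid bits
                                            omitted for brevity          */
static unsigned char expand_bit[NUM_SETS];  /* set by the evicted-set list */
static unsigned char toggle_bit[NUM_SETS];  /* probe secondary set first?  */

/* Returns 1 on hit. A hit found by the second probe costs one extra cycle. */
int lookup(unsigned index, unsigned tag) {
    unsigned primary   = index;
    unsigned secondary = index ^ MSB_MASK;  /* fixed mapping: flip the MSB */
    unsigned first  = toggle_bit[index] ? secondary : primary;
    unsigned second = toggle_bit[index] ? primary   : secondary;

    if (tag_of[first] == tag) {
        toggle_bit[index] = (first == secondary); /* hit on secondary set:
                                                     enable the toggle bit */
        return 1;
    }
    if (!expand_bit[index])
        return 0;                   /* not expanded: ordinary miss        */
    if (tag_of[second] == tag) {    /* 2nd lookup on the next cycle       */
        toggle_bit[index] = (second == secondary);
        return 1;                   /* hit, with a one-cycle penalty      */
    }
    return 0;                       /* full cache miss in both sets       */
}
```

One nice property of the MSB-flip mapping is that it pairs each set with exactly one partner, so the expansion needs no extra storage for a mapping table.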

10 A Demonstrative Example
Cache trace of the proposed cache architecture
[Figure: the same sequence B, C, A, E, F, D. When E, F, and D miss, the circular list still holds Set-S, Set-Q, and Set-R, so their expand bits are set to 1 and the sets are expanded into Set-S', Set-Q', and Set-R'; from then on both lines of each conflicting pair stay resident]
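To make the walkthrough concrete, here is a self-contained toy C model combining both mechanisms on the slide's access pattern, scaled down to 8 sets and a 3-entry list; the fill policy for an expanded set (fill the secondary set, keep the primary line resident) is an assumption, so the point is the overall shape of the output rather than the exact policy:

```c
#include <stdio.h>

#define SETS     8          /* toy size; the paper uses 256 sets */
#define MSB      (SETS >> 1)
#define LIST_LEN 3
#define INVALID  -1

static int tag_of[SETS];
static int expand_bit[SETS], toggle_bit[SETS];
static int evicted[LIST_LEN];
static int head;

/* Returns 1 on hit; on a miss, updates the evicted-set list and fills
 * either the primary or (if expanded) the secondary set. */
static int access_cache(int index, int tag) {
    int first  = toggle_bit[index] ? index ^ MSB : index;
    int second = first ^ MSB;

    if (tag_of[first] == tag) {
        toggle_bit[index] = (first != index);
        return 1;
    }
    if (expand_bit[index] && tag_of[second] == tag) {
        toggle_bit[index] = (second != index);
        return 1;                          /* hit with one-cycle penalty */
    }
    /* Miss: thrashing detection via the circular list. */
    for (int i = 0; i < LIST_LEN; i++)
        if (evicted[i] == index) expand_bit[index] = 1;
    evicted[head] = index;
    head = (head + 1) % LIST_LEN;
    /* Assumed fill policy: an expanded set fills its secondary set,
     * keeping the primary line resident. */
    int victim = expand_bit[index] ? (index ^ MSB) : index;
    tag_of[victim] = tag;
    toggle_bit[index] = (victim != index);
    return 0;
}

int main(void) {
    /* B,E -> set 3 (Set-S); C,F -> set 1 (Set-Q); A,D -> set 2 (Set-R) */
    const int  set_of[] = { 3, 1, 2, 3, 1, 2 };
    const int  tag_in[] = { 0, 0, 0, 1, 1, 1 };   /* B C A E F D */
    const char name[]   = "BCAEFD";

    for (int i = 0; i < SETS; i++)     tag_of[i]  = INVALID;
    for (int i = 0; i < LIST_LEN; i++) evicted[i] = INVALID;

    for (int round = 0; round < 2; round++)
        for (int i = 0; i < 6; i++)
            printf("%c: %s\n", name[i],
                   access_cache(set_of[i], tag_in[i]) ? "hit" : "miss");
    return 0;
}
```

Running this prints six misses for the first pass over B, C, A, E, F, D and six hits for the second pass; a plain direct-mapped cache of the same size would miss on every access.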

11 Experimental Setup
Use the SimpleScalar toolset [8] for performance evaluation
 Two baseline configurations
  。 256-set, direct-mapped L1 data cache with a 32-byte line size
  。 256-set, 4-way set-associative L1 data cache with a 32-byte line size
Use CACTI [10] to evaluate power efficiency
 Assume an L1/L2 power ratio of 20
  。 Accessing data in the L2 costs 20 times as much power as accessing it in the L1
Benchmarks
 7 representative programs from the SPEC CPU2000 suite [9]
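The slides give only the ratio, but under that assumption a simple first-order model (not the paper's exact CACTI-based methodology) shows why miss reduction translates into power savings: relative energy per data access ≈ E_L1 + m × (20 × E_L1) = (1 + 20m) × E_L1, where m is the miss ratio. For example, cutting m from 0.10 to 0.07 reduces relative access energy from 3.0 × E_L1 to 2.4 × E_L1, a 20% saving.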

12 Performance Improvement - Direct-Mapped Cache
Criterion: miss rate reduction
 Miss rate improvement of the proposed implementation over the baseline
  。 Arithmetic mean: 30.75%
[Chart: per-benchmark miss rate improvement over the baseline, comparing an 8-entry victim cache against the proposed design with a 5-entry recently-evicted-set list]

13 Performance Improvement - 4-Way Set-Associative Cache
Criterion: miss rate reduction
 Miss rate improvement of the proposed implementation over the baseline
  。 Arithmetic mean: 26.74%
[Chart: per-benchmark miss rate improvement over the baseline, comparing a 64-entry victim cache against the proposed design with an 8-entry recently-evicted-set list]
The miss rate reduction is significant for both direct-mapped and set-associative caches.

14 Power Improvement - Direct-Mapped Cache
Power reduction of the proposed implementation
 Average: 15.73%
[Chart: per-benchmark power reduction; the proposed design consistently provides a power reduction]

15 Power Improvement - 4-Way Set-Associative Cache
However, the power reduction varies across the benchmarks, and a few exceptions show higher power costs
 The average was still an improvement of 4.19%
[Chart: per-benchmark power change, with some benchmarks showing higher power costs]

16 Conclusions
This paper proposed a dynamically expandable data cache architecture
 Composed of two main mechanisms
  。 A circular recently-evicted-set list: detects a probable thrashing set
  。 An expandable cache lookup: virtually increases the associativity of a given set
Experimental results show that the proposed technique
 Significantly reduces cache misses and power consumption
  。 For both direct-mapped and set-associative caches

17 Comments on This Paper
The related works are not strongly connected to one another
The reported power improvements are too coarse
 They do not account for the extra power consumed by the support circuitry
Results for different lengths of the circular list are not shown

