Chapter 12 Caches Optimization Technique in Embedded System (ARM)

1 Chapter 12 Caches Optimization Technique in Embedded System (ARM)

2 Introduction
Cache: a small, fast array of memory placed between the processor core and main memory that stores portions of recently referenced main memory. The word cache is a French word meaning “a concealed place for storage”.
Write buffer: often used with a cache, a write buffer is a very small first-in-first-out (FIFO) memory placed between the processor core and main memory. Its purpose is to free the processor core and cache memory from the slow write time associated with writing to main memory.

3 12.1 The Memory Hierarchy and Cache Memory

4 Memory Hierarchy

5 Cache & Write Buffer

6 Overview
Cache, write buffer
N-way set associative
ARM940T’s cache
Test a cache
Cache clean & flush (or invalidate)
Cache lockdown

7 12.1.1 Caches and Memory Management Units
Virtual vs. physical cache
Virtual cache: ARM7 through ARM10, StrongARM, XScale
Physical cache: ARM11
Locality of reference: the cache makes use of locality of reference in both time and space. References that repeat close together in time show temporal locality; references to nearby addresses show spatial locality.

8 Logical and Physical Caches

9 12.2 Cache Architecture

10 Two Bus Architecture
ARM uses two bus architectures in its cached cores: Von Neumann and Harvard. A processor core using the Von Neumann architecture has a single cache for both instructions and data; this type is known as a unified cache. The Harvard architecture has separate instruction and data buses to improve overall system performance, each with its own cache; this type is known as a split cache.

11 12.2.1 Basic Architecture of a Cache Memory
Three main parts in a cache: Directory store, Status information, and Data section

12 Cache Organization
Directory: the cache must know where the information stored in a cache line originates from in main memory. The directory entry is known as a cache-tag.
Data section: stores the data read from main memory. The “size of a cache” is defined as the amount of actual code and data the cache can store from main memory.
Status bits: the two common bits are valid and dirty.

13 12.2.2 Basic Operations of a Cache Controller
The Cache Controller is hardware that copies code and data from main memory to cache memory automatically. The cache controller intercepts read and write memory requests before passing them on to the memory controller. It processes a request by dividing the address of the request into three fields: tag, set index, data index.

14 Access Cache
First, use the “set index” to locate the cache line within cache memory. Then check the “valid” bit of the line and compare its cache-tag to the “tag” field. It is a cache “hit” if both the status check and the comparison succeed, or a “miss” if either fails. On a cache miss, the controller copies an entire cache line from main memory and provides the requested code or data to the processor. On a cache hit, the controller supplies the code or data directly from the cache to the processor (selected by the “data index”).
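The lookup steps above can be sketched as a small host-side model in C. This is only an illustration: the names `line_t`, `addr_fields_t`, `split_address` and `is_hit` are hypothetical, and the model keeps one line per set index (a direct-mapped layout) with the line size and set count passed as log2 values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one line's directory and status entries. */
typedef struct {
    uint32_t tag;   /* cache-tag: upper address bits of the cached line */
    bool valid;     /* status bit: the line holds live data */
} line_t;

/* The three fields the cache controller extracts from an address. */
typedef struct {
    uint32_t tag;
    uint32_t set_index;
    uint32_t data_index;
} addr_fields_t;

/* Split an address; line_bits = log2(bytes per line), set_bits = log2(sets). */
static addr_fields_t split_address(uint32_t addr, unsigned line_bits, unsigned set_bits)
{
    assert(line_bits + set_bits < 32);
    addr_fields_t f;
    f.data_index = addr & ((1u << line_bits) - 1u);
    f.set_index  = (addr >> line_bits) & ((1u << set_bits) - 1u);
    f.tag        = addr >> (line_bits + set_bits);
    return f;
}

/* A hit needs both the valid check and the tag comparison to succeed. */
static bool is_hit(const line_t *lines, uint32_t addr, unsigned line_bits, unsigned set_bits)
{
    addr_fields_t f = split_address(addr, line_bits, set_bits);
    const line_t *l = &lines[f.set_index];
    return l->valid && l->tag == f.tag;
}
```

With 16-byte lines and 256 sets (`line_bits = 4`, `set_bits = 8`), address `0x12345` splits into tag `0x12`, set index `0x34` and data index `0x5`.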

15 12.2.3 Direct Mapped Cache
In a direct-mapped cache, each addressed location in main memory maps to a single location in cache memory. Since main memory is much larger than cache memory, many addresses in main memory map to the same single location in cache memory.
Data streaming: during a cache line fill, the cache controller may forward the data being loaded to the core at the same time it is copying it to the cache.
Thrashing: direct-mapped caches are subject to high levels of thrashing, a software battle for the same location in cache memory. The result of thrashing is the repeated loading and eviction of a cache line.
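Thrashing is easy to reproduce in a toy model. The sketch below assumes a 4 KB direct-mapped cache with 16-byte lines (256 lines); the geometry and the names `dm_cache_t`, `dm_access` and `dm_count_misses` are illustrative choices, not taken from any particular core.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DM_LINE_BITS 4u                 /* assumed: 16-byte lines */
#define DM_SET_BITS  8u                 /* assumed: 256 lines, 4 KB in total */
#define DM_LINES     (1u << DM_SET_BITS)

typedef struct {
    uint32_t tag[DM_LINES];
    bool valid[DM_LINES];
} dm_cache_t;

/* Access one address; returns true on a miss (the line fill evicts the old occupant). */
static bool dm_access(dm_cache_t *c, uint32_t addr)
{
    uint32_t idx = (addr >> DM_LINE_BITS) & (DM_LINES - 1u);
    uint32_t tag = addr >> (DM_LINE_BITS + DM_SET_BITS);
    if (c->valid[idx] && c->tag[idx] == tag) {
        return false;                   /* hit */
    }
    c->valid[idx] = true;               /* miss: load the line, evicting whatever was there */
    c->tag[idx] = tag;
    return true;
}

/* Count the misses for a sequence of accesses, starting from an empty cache. */
static unsigned dm_count_misses(const uint32_t *addrs, size_t n)
{
    dm_cache_t c = {0};
    unsigned misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (dm_access(&c, addrs[i])) {
            misses++;
        }
    }
    return misses;
}
```

Alternating between `0x1000` and `0x2000`, which share line index 0 in this geometry, misses on every access; alternating between `0x1000` and `0x2010` misses only on the first touch of each line.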

16 Set Associativity
Some caches include an additional feature to reduce the frequency of thrashing. This structural design feature divides the cache memory into smaller equal units, called ways. The cache lines with the same set index are said to be in the same set. The set of cache lines pointed to by the set index is set associative. Compared with a direct-mapped cache of the same size, the cache-tag field is larger and the set index is smaller.

17 Increasing Set Associativity
As the associativity of a cache goes up, the probability of thrashing goes down. The ideal goal would be to maximize the set associativity of a cache by designing it so that any main memory location maps to any cache line, which is known as a fully associative cache. However, as the associativity increases, so does the complexity of the hardware that supports it. One method hardware designers use to increase the set associativity of a cache is a content addressable memory (CAM).

18 CAM
A CAM works in the opposite way to a RAM. Where a RAM produces data when given an address value, a CAM produces an address if a given data value exists in the memory. Using a CAM allows many more cache-tags to be compared simultaneously. A CAM uses a set of comparators to compare the input tag address with the cache-tag stored in each valid cache line.
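The "data in, address out" behaviour of a CAM can be modelled in a few lines of C. This is a software stand-in only: real hardware compares all entries in parallel, whereas the loop below is sequential, and the names `cam_t` and `cam_lookup` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAM_ENTRIES 64                  /* one entry per way, as in a 64-way cache */

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    bool valid[CAM_ENTRIES];
} cam_t;

/* Returns the index (the "address") of the valid entry holding tag, or -1 if none does. */
static int cam_lookup(const cam_t *cam, uint32_t tag)
{
    for (int i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == tag) {
            return i;
        }
    }
    return -1;
}
```

Note that an entry whose stored tag happens to match but whose valid bit is clear is not reported, which mirrors the comparison against valid cache lines only.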

19 CAM in ARM920T/940T
Using a CAM to locate cache-tags is the design choice ARM made in their ARM920T and ARM940T processor cores. The caches in the ARM920T and ARM940T are 64-way set associative. The tag portion of a requested address is used as an input to the four CAMs, which simultaneously compare the input tag with all cache-tags stored in the 64 ways.

20 12.2.5 Write Buffers
Write buffer: a small, fast FIFO holding data that the processor would normally write to main memory. It reduces the processor time taken to write small blocks of sequential data to main memory.
The efficiency of the write buffer depends on the ratio of main memory writes to the number of instructions executed. Over a given time interval, if the number of writes to main memory is low or sufficiently spaced between other processing instructions, the write buffer will rarely fill.
The write buffer also improves cache performance during cache line evictions.
Data in the write buffer cannot be read until it has been written to main memory. This is one of the reasons that the FIFO depth is usually quite small.
Write merging (or coalescing) (ARM10): a new write is merged into an existing buffer entry if the new value and the old one target the same address, or if the new data and the old data fit into the same memory block.

21 12.3 Cache Policy

22 Three Cache Policies
There are three policies that determine the operation of a cache: write policy, replacement policy, and allocation policy.

23 12.3.1 Writeback vs. Writethrough
Writethrough: on a cache hit on write, the cache controller writes to both cache and main memory, ensuring that the cache and main memory stay coherent at all times, but it is slower than writeback.
Writeback: the cache controller writes to valid cache data memory and not to main memory. Consequently, valid cache lines and main memory may contain different data. The line data is written back to main memory when the line is evicted. A writeback cache must use one or more dirty bits.
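The cost difference between the two write policies can be seen by counting main memory writes for a single cache line. The sketch below is a deliberately minimal model: it tracks one line, counts a write-back eviction as one (line-sized) memory write, and uses hypothetical names throughout.

```c
#include <stdbool.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy_t;

typedef struct {
    write_policy_t policy;
    bool dirty;             /* only meaningful for write-back */
    unsigned mem_writes;    /* writes that reached main memory */
} wp_line_t;

/* A processor write that hits in the cache. */
static void wp_write_hit(wp_line_t *l)
{
    if (l->policy == WRITE_THROUGH) {
        l->mem_writes++;    /* cache and main memory updated together */
    } else {
        l->dirty = true;    /* main memory is now stale */
    }
}

/* Evicting the line: a dirty write-back line must be copied out first. */
static void wp_evict(wp_line_t *l)
{
    if (l->dirty) {
        l->mem_writes++;
        l->dirty = false;
    }
}

/* Main memory writes caused by n write hits followed by one eviction. */
static unsigned wp_mem_writes(write_policy_t policy, unsigned n)
{
    wp_line_t l = { policy, false, 0 };
    for (unsigned i = 0; i < n; i++) {
        wp_write_hit(&l);
    }
    wp_evict(&l);
    return l.mem_writes;
}
```

Ten write hits to the same line cost ten memory writes under writethrough but a single line write at eviction under writeback, which is where the dirty bit earns its keep.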

24 12.3.2 Cache Line Replacement Policy
The replacement policy is the strategy implemented in a cache controller to select the next victim. ARM cached cores support two replacement policies:
Round-robin: the victim counter increments sequentially.
Pseudorandom: uses a non-sequential incrementing victim counter.
(When the counter reaches a maximum value, it is reset to a defined base value.)
Most ARM cores support both policies. The round-robin replacement policy has great predictability, which is desirable in an embedded system. However, a round-robin replacement policy is subject to large changes in performance given small changes in memory access. To show this change in performance, see Example 12.1.

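25 Case: ARM940T’s Cache
L1 cache: 4 KB total size, 64 ways, 16 bytes per line.
Address fields: TAG, idx (set index), BS (byte select); MVA = (PID + VA).
[Figure: ways 00 to 63, each holding four lines selected by idx 00, 01, 10, 11 (byte offsets 0, 16, 32, 48 within the way); a 64:1 compare-and-select on the tag followed by a 16:1 select.]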
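The field widths on this slide follow from the three numbers given (4 KB, 64 ways, 16 bytes per line). A minimal sketch of that arithmetic, assuming 32-bit addresses and power-of-two sizes; `geometry_t`, `log2u` and `cache_geometry` are hypothetical names:

```c
#include <assert.h>

typedef struct {
    unsigned sets;        /* lines per way */
    unsigned line_bits;   /* width of the byte-select field */
    unsigned set_bits;    /* width of the set index field */
    unsigned tag_bits;    /* width of the tag field */
} geometry_t;

/* floor(log2(x)) for x >= 1 */
static unsigned log2u(unsigned x)
{
    unsigned n = 0;
    while (x > 1u) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Derive the address field widths of a set-associative cache (32-bit addresses assumed). */
static geometry_t cache_geometry(unsigned cache_bytes, unsigned ways, unsigned line_bytes)
{
    assert(ways != 0 && line_bytes != 0);
    assert(cache_bytes % (ways * line_bytes) == 0);
    geometry_t g;
    g.sets = cache_bytes / (ways * line_bytes);
    g.line_bits = log2u(line_bytes);
    g.set_bits = log2u(g.sets);
    g.tag_bits = 32u - g.line_bits - g.set_bits;
    return g;
}
```

For `cache_geometry(4096, 64, 16)` this gives 4 sets, a 4-bit byte select, a 2-bit set index and a 26-bit tag. One way is therefore 64 bytes (0x40), so addresses that differ by a multiple of 0x40 share the same set index and compete for the 64 ways of one set.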

26 Example 12.1

27 Example 12.1
int readSet(unsigned int times, unsigned int numset)
{
    int setcount, value;                      // registers ?
    volatile int *newstart;
    volatile int *start = (int *)0x20000;     // why it’s 0x20000 ?
    __asm
    {
    timesloop:
        MOV newstart, start
        MOV setcount, numset                  // test: numset = 64 or 65
    setloop:
        LDR value, [newstart, #0];
        ADD newstart, newstart, #0x40;        // 0x40 stride keeps the set index unchanged
        SUBS setcount, setcount, #1;
        BNE setloop;
        SUBS times, times, #1;
        BNE timesloop;
    }
    return value;
}
[Figure: contents of one set, written as line/pass. With numset = 64, way 00 holds 0/1, 0/2; way 01 holds 1/1, 1/2; way 02 holds 2/1, 2/2; ... way 63 holds 63/1, 63/2, so every line stays in its way. With numset = 65 under round-robin, way 00 holds 0/1, 64/1, 63/2; way 01 holds 1/1, 0/2, 64/2; way 02 holds 2/1, 1/2, 0/3; ... way 63 holds 63/1, 62/2, 61/3, so every access evicts a line that is about to be needed.]

28 Test Result
Given (in ADS1.2): times = 0x10000 (2^16), 50 MHz, 100 ns (non-sequential), 50 ns (sequential)
Results:
Round Robin test size = 64
Round Robin enabled = seconds
Random enabled = seconds
Round Robin test size = 65
Round Robin enabled = seconds
Random enabled = seconds
Analysis:
numset = 64: Ta = 2^16 * Texe + 64 * Tm = 510,000,000 ns
numset = 65: Tb = 2^16 * (Texe + 65 * Tm) = 2,560,000,000 ns
Tb - Ta = (2^16 * 65 - 64) * Tm = 2,050,000,000 ns
so Tm = 481 ns per line, and Tm / 4 words per line = 120 ns per access.
This is an extreme example, but it does show a difference between using a round-robin policy and a random replacement policy.
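The miss pattern behind these timings can be reproduced with a host-side model of a single 64-way set. The sketch below only counts misses and says nothing about real timing; the xorshift generator is an arbitrary stand-in for the core's pseudorandom victim counter, and all names are hypothetical.

```c
#include <stdint.h>

#define SIM_WAYS 64u    /* ways in the modelled set, as on the ARM940T */

typedef enum { SIM_ROUND_ROBIN, SIM_PSEUDO_RANDOM } sim_policy_t;

/* xorshift32: a simple stand-in for the hardware's pseudorandom victim selection. */
static uint32_t sim_next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Read numset distinct lines that share one set index, repeated times passes; return the misses. */
static unsigned long sim_count_misses(sim_policy_t policy, unsigned numset, unsigned times)
{
    uint32_t line_in_way[SIM_WAYS] = {0};
    int valid[SIM_WAYS] = {0};
    unsigned victim = 0;
    uint32_t rng = 0x12345678u;
    unsigned long misses = 0;

    for (unsigned t = 0; t < times; t++) {
        for (uint32_t line = 0; line < numset; line++) {
            int hit = 0;
            for (unsigned w = 0; w < SIM_WAYS; w++) {
                if (valid[w] && line_in_way[w] == line) {
                    hit = 1;
                    break;
                }
            }
            if (hit) {
                continue;
            }
            misses++;
            unsigned w;
            if (policy == SIM_ROUND_ROBIN) {
                w = victim;                           /* evict the way the counter points at */
                victim = (victim + 1u) % SIM_WAYS;
            } else {
                w = sim_next_random(&rng) >> 26;      /* top 6 bits: a way number 0..63 */
            }
            line_in_way[w] = line;
            valid[w] = 1;
        }
    }
    return misses;
}
```

In this model, round-robin with 64 lines misses only during the first pass (64 misses in total), while with 65 lines every single access misses, because each line is evicted just before it is read again. The pseudorandom policy breaks that pattern, so some of the 65 lines survive from one pass to the next.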

29 12.4 CP15 & Cache
The cache is controlled via CP15’s registers.
Primary CP15 registers c7 and c9 control the setup and operation of the cache.
Secondary CP15 registers:
CP15:c7 registers are write-only; they clean (write dirty lines back) and flush (just invalidate) the cache.
CP15:c9 registers define the victim pointer base address.

30 Coprocessor Instructions
Instruction format: MRC|MCR cp, opcode1, Rd, Cn, Cm, opcode2
opcode1, opcode2: executed by the coprocessor
Cn: major register
Cm: minor register
Example: MRC p15, 0, r1, c1, c0, 0
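To see how the six operands are packed, the sketch below builds the 32-bit ARM-state encoding of an unconditional MRC or MCR. The bit layout is the classic ARM one (condition in bits 31:28, opcode1 in 23:21, the load bit in 20, Cn in 19:16, Rd in 15:12, the coprocessor number in 11:8, opcode2 in 7:5, Cm in 3:0); the function name `encode_cp_transfer` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Encode "MRC|MCR p<cp>, opc1, Rd, Cn, Cm, opc2" with the AL (always) condition.
 * to_arm != 0 selects MRC (coprocessor to ARM register), 0 selects MCR.
 */
static uint32_t encode_cp_transfer(int to_arm, unsigned cp, unsigned opc1, unsigned rd,
                                   unsigned cn, unsigned cm, unsigned opc2)
{
    assert(cp < 16 && opc1 < 8 && rd < 16 && cn < 16 && cm < 16 && opc2 < 8);
    return 0xEE000010u                  /* cond = AL, bits 27:24 = 1110, bit 4 = 1 */
         | ((uint32_t)opc1 << 21)
         | ((to_arm ? 1u : 0u) << 20)
         | ((uint32_t)cn << 16)
         | ((uint32_t)rd << 12)
         | ((uint32_t)cp << 8)
         | ((uint32_t)opc2 << 5)
         | (uint32_t)cm;
}
```

The slide's example `MRC p15, 0, r1, c1, c0, 0` encodes as `0xEE111F10`, and `MCR p15, 0, r0, c7, c6, 0` (flush D-cache) as `0xEE070F16`.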

31 12.5 CP15:c7 – Clean/Flush Cache
CP15:c7 is a write-only register, sometimes also used for prefetch buffers and the BTB.
Instruction format: MCR p15, 0, <Rd>, c7, <CRm>, <op2>
CRm:
5 = flush I-cache, 6 = flush D-cache, 7 = flush both
10 = clean D-cache, 11 = clean unified
14 = clean & invalidate D-cache, 15 = clean & invalidate unified
op2 (index method):
0 = SBZ (whole cache)
1 = MVA
2 = set/index
3 = test-clean (only for ARM926EJ-S/ARM1026EJ-S)

32 Cache Operations
Clean/Write-back/Copy-back: applied to a write-back D-cache;
Flush/Invalidate: just invalidates line(s), without cleaning them;
Prefetch: loads memory line(s) into the cache;
Drain Write Buffer: stops the ARM from executing further until the write buffer is empty;
Wait for Interrupt: puts the ARM into a low-power state until an interrupt occurs;
Prefetch buffer: IMPLEMENTATION DEFINED;
Branch Target Cache
Data Value in <Rd>

33 Self-Modifying Code (SMC)
The cache may also need cleaning or flushing before the execution of self-modifying code in a split cache. The need to clean or flush arises from two possible conditions.
First, the self-modifying code may be held in the D-cache and therefore be unavailable to load from main memory as an instruction.
Second, existing instructions in the I-cache may mask new instructions that have already been written to main memory.
So, after self-modifying code has been written:
the D-cache should be cleaned, so that the new code is present in main memory;
the I-cache should be flushed (invalidated), to prevent hits on the stale instructions.

34 12.5.1 Flushing ARM Cached Cores
For example, flushing the D-cache:
MCR p15, 0, Rd, c7, c6, 0
Note: Rd should be zero.

35 12.5.2 Cleaning ARM Cached Cores
To clean a cache is to issue commands that force the cache controller to write all dirty D-cache lines out to main memory.

36 Cleaning the D-Cache
There are three methods used to clean the D-cache:
Clean a certain line via c7f {way & set};
Use the test-clean instruction;
Clean a dedicated block/line via MVA.

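37 Notes: Cache Parameters
CSIZE: log2 of the cache size (cache size = 2^CSIZE bytes)
CLINE: log2 of the line size (line size = 2^CLINE bytes)
NWAY: number of ways
Command fields (c7f, c9f):
I7SET: ‘SET’ offset in CP15:c7
I7WAY: ‘WAY’ offset in CP15:c7
I9WAY: ‘WAY’ offset in CP15:c9
Two others can be calculated from these:
SWAY: bytes per way
NSET: lines per way
Note that the macro on slide 40 shifts by NSET (#1<<(NSET+I7SET)), so NSET is used there as a log2 value, like CSIZE and CLINE.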
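A small sketch of how the derived parameters and a c7 format word fit together. It assumes that every parameter is stored as a log2 value (so NWAY = 6 means 64 ways), which is my reading of how the shifts in the cleaning macro use them rather than something the slide states, and it assumes I7SET = 4 and I7WAY = 26 for a 4 KB, 64-way, 16-byte-line cache such as the ARM940T's. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cache description with every field as a log2 value (assumption, see lead-in). */
typedef struct {
    unsigned csize;   /* log2(cache size in bytes) */
    unsigned cline;   /* log2(line size in bytes) */
    unsigned nway;    /* log2(number of ways) */
} cache_log2_t;

/* SWAY: log2(bytes per way) */
static unsigned sway_log2(cache_log2_t c)
{
    assert(c.csize >= c.nway);
    return c.csize - c.nway;
}

/* NSET: log2(lines per way) */
static unsigned nset_log2(cache_log2_t c)
{
    assert(c.csize >= c.nway + c.cline);
    return c.csize - c.nway - c.cline;
}

/* Build a c7 format word that selects one {way, set} pair. */
static uint32_t make_c7f(unsigned way, unsigned set, unsigned i7way, unsigned i7set)
{
    return ((uint32_t)way << i7way) | ((uint32_t)set << i7set);
}
```

With `{12, 4, 6}` this gives SWAY = 6 (64 bytes per way) and NSET = 2 (4 lines per way), and `make_c7f(63, 3, 26, 4)` selects the last line of the last way as `0xFC000030`.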

38 12.5.4 Index line via {way, set}

39 List of c7 format (c7f) in {Way, Set}

40 c7f RN 0 ; cp15:c7 register format
    MACRO
    CACHECLEANBYWAY $op
    MOV c7f, #0                       ; create c7 format, starting at way 0, set 0
5
    IF "$op" = "Dclean"
    MCR p15, 0, c7f, c7, c10, 2       ; clean D-cline
    ENDIF
    IF "$op" = "Dcleanflush"
    MCR p15, 0, c7f, c7, c14, 2       ; cleanflush D-cline
    ENDIF
    ADD c7f, c7f, #1<<I7SET           ; +1 set index
    TST c7f, #1<<(NSET+I7SET)         ; test index overflow
    BEQ %BT5
    BIC c7f, c7f, #1<<(NSET+I7SET)    ; clear index overflow
    ADDS c7f, c7f, #1<<I7WAY          ; +1 victim pointer
    BCC %BT5                          ; test way overflow
    MEND

cleanDCache
    CACHECLEANBYWAY Dclean
    MOV pc, lr
cleanFlushDCache
    CACHECLEANBYWAY Dcleanflush
    MOV pc, lr
cleanFlushCache
    CACHECLEANBYWAY Dcleanflush
    MCR p15, 0, r0, c7, c5, 0         ; flush I-cache
    MOV pc, lr
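The control flow of the macro is easier to follow in C. The sketch below mirrors its two nested counters on the host: the set index is stepped until it overflows into bit NSET+I7SET, then the way field is stepped until the addition carries out of bit 31. The callback stands in for the MCR instruction; `walk_all_lines` and its parameter names are hypothetical, and NSET is taken as a log2 value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Visit every {way, set} combination in c7 format, calling op (if not NULL) for each.
 * Returns the number of cache lines visited.
 */
static unsigned walk_all_lines(unsigned nset_log2, unsigned i7set, unsigned i7way,
                               void (*op)(uint32_t c7f))
{
    uint32_t c7f = 0;                               /* MOV c7f, #0 */
    unsigned visited = 0;

    for (;;) {
        if (op != NULL) {
            op(c7f);                                /* MCR p15, 0, c7f, c7, c10/c14, 2 */
        }
        visited++;

        c7f += 1u << i7set;                         /* +1 set index */
        if ((c7f & (1u << (nset_log2 + i7set))) == 0) {
            continue;                               /* BEQ %BT5: more sets in this way */
        }
        c7f &= ~(1u << (nset_log2 + i7set));        /* clear index overflow */

        uint32_t before = c7f;
        c7f += 1u << i7way;                         /* +1 way (victim pointer) */
        if (c7f < before) {
            break;                                  /* carry out of bit 31: all ways done */
        }
    }
    return visited;
}
```

For a 4 KB, 64-way cache with 16-byte lines (NSET = 2, I7SET = 4, I7WAY = 26) the walk visits 4 x 64 = 256 lines, which is the whole cache.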

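41 12.5.5 Cleaning the D-Cache using Test-Clean command
Test-clean searches for the first “dirty” cache line and cleans it by transferring its contents to main memory.
; ARM926EJ-S, ARM1026EJ-S
cleanDCache
    MRC p15, 0, r15, c7, c10, 3   ; test and clean; updates the condition flags
    BNE cleanDCache               ; test Z flag: repeat until no dirty line remains
    MOV pc, lr
Note: r15 (pc) is given as the destination register of the MRC, so the result goes to the condition flags rather than to a general register.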

42 12.5.6 Cleaning the D-Cache in Intel XScale and StrongARM
The Intel XScale and StrongARM processors use a third method to clean their D-cache: a command that allocates a line in the D-cache without doing a line fill. The command sets the valid bit and fills the directory entry with the cache-tag provided in the <Rd> register. No data is transferred from main memory, so the data is not initialized until it is written to by the processor. Allocating a line evicts the line it replaces, writing it back if it is dirty, which is what makes the command usable for cleaning.

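43 12.5.7 Invalid & Clean line via MVA
MVA format: bits (4:0) SBZ
Flush-Clean (addr, size)
addr RN 0
size RN 1                             ; assumed register assignment
nl   RN 2                             ; assumed register assignment
    BIC addr, addr, #(1<<CLINE)-1     ; clip addr to a cache line boundary
    MOV nl, size, lsr #CLINE          ; bytes to lines (CLINE = log2 of line size); size must be at least one line
10
    MCR p15, 0, addr, c7, c5, 1       ; flush (invalidate) I-cache line by MVA
    ADD addr, addr, #1<<CLINE         ; next cache line
    SUBS nl, nl, #1
    BNE %BT10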
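One detail worth checking in a region loop like this is the line count. Shifting the size right by CLINE truncates, so a region that is not line-aligned, or is shorter than one line, is not fully covered (and a count of zero would make the SUBS/BNE loop wrap around). The sketch below computes the first line and the number of lines that actually overlap a region; it is a host-side illustration with hypothetical names and assumes `addr + size` does not wrap past the top of the address space.

```c
#include <stdint.h>

typedef struct {
    uint32_t first_line;   /* address of the first cache line touching the region */
    uint32_t line_count;   /* number of cache lines touching the region */
} line_range_t;

/* cline = log2(line size in bytes), as in the slides' CLINE parameter. */
static line_range_t lines_covering(uint32_t addr, uint32_t size, unsigned cline)
{
    uint32_t mask = (1u << cline) - 1u;
    line_range_t r;
    r.first_line = addr & ~mask;                     /* same effect as the BIC */
    if (size == 0) {
        r.line_count = 0;
    } else {
        uint32_t last_line = (addr + size - 1u) & ~mask;
        r.line_count = ((last_line - r.first_line) >> cline) + 1u;
    }
    return r;
}
```

For a 16-byte region starting at `0x1008` with 16-byte lines, two lines are touched (`0x1000` and `0x1010`), whereas `size >> CLINE` alone would give one.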

44 Commands to Flush & Clean a single line via MVA or PA

45 c7f: clean & flush a line

46 12.6 Cache Lockdown

47 Cache Lockdown
Cache lockdown is a feature that enables a program to load time-critical code and data into cache and mark it as exempt from eviction.
Lock unit: WAY (rather than LINE or others)
Code that is a candidate for locking: the IVT, ISRs, critical algorithms
Data to be locked: frequently used global variables
Q: How-to ?

48 Methods of Cache Lockdown

49 12.6.1 Procedures for Lockdown
Example
int globalData[16];
unsigned int *vectortable = (unsigned int *)0x0;
int vectorCodeSize = 212; // IVT + IRQ handler
state = disable_interrupt();
enableCache();
flushCache();
wayIndex = lockDcache(globalData, sizeof(globalData));
wayIndex = lockIcache(vectortable, vectorCodeSize);
enable_interrupt(state);

50 12.6.2 Lockdown Base by “Victim Counter”
In the round-robin method, the entry (or way) to be evicted is the one pointed to by the “victim counter”. The victim counter is reset to the “victim reset value” when it increments beyond the number of ways in the core. Either the code cache or the data cache can be controlled through the victim pointer: point it at a way, then prefetch the code or data into the cache.
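The effect of a raised reset value can be shown with a small counter model: once the base is set above zero, the counter never again selects the ways below it, so whatever was loaded there stays resident. This is a sketch with hypothetical names (`victim_t`, `victim_lock_below`, `victim_next`), not a description of any specific core's CP15:c9 layout.

```c
#include <assert.h>

typedef struct {
    unsigned base;      /* victim reset value: first way still open to eviction */
    unsigned counter;   /* way that will be evicted next */
    unsigned ways;      /* number of ways in the cache */
} victim_t;

/* Start a round-robin counter with ways 0..base-1 locked. */
static victim_t victim_lock_below(unsigned base, unsigned ways)
{
    assert(base < ways);
    victim_t v = { base, base, ways };
    return v;
}

/* Return the way to evict now and advance; wrap back to the base, not to zero. */
static unsigned victim_next(victim_t *v)
{
    unsigned way = v->counter;
    v->counter++;
    if (v->counter >= v->ways) {
        v->counter = v->base;
    }
    return way;
}
```

With 4 of 64 ways locked, successive calls return 4, 5, ..., 63, 4, 5, ... and never touch ways 0 to 3.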

51 Lock-down Commands Commands that lock data in cache by referencing its way.

52 How to lock down
Principles for cache lockdown:
1. Ensure that no processor exceptions can occur during the execution of this procedure (by disabling interrupts). If for some reason this is not possible, all code and data used by any exception handlers that can get called must be treated as code and data used by this procedure for the purpose of steps 2 and 3.
2. If an instruction cache or a unified cache is being locked down, ensure that all the code executed by this procedure is in an uncachable area of memory.
3. If a data cache or a unified cache is being locked down, ensure that all data used by the following code is in an uncachable area of memory, apart from the data which is to be locked down.
4. Ensure that the data/instructions that are to be locked down are in a cachable area of memory.
5. Ensure that the data/instructions that are to be locked down are not already in the cache, using cache clean and/or invalidate instructions as appropriate.
Ref: ARM Manual DDI0100E.pdf pp B5-20

53 How to Lockdown (1)
First, ensure that the code or data to be locked is not already in the cache, e.g., by invalidating the D-cache or I-cache:
MCR p15, 0, Rd, c7, c5, 0 ; Invalidate ICache
MCR p15, 0, Rd, c7, c5, 1 ; Invalidate ICache via MVA
MCR p15, 0, Rd, c7, c6, 0 ; Invalidate DCache
MCR p15, 0, Rd, c7, c6, 1 ; Invalidate DCache via MVA
or by cleaning them (for a writeback D-cache):
MCR p15, 0, Rd, c7, c10, 1 ; Clean DCache via MVA
MCR p15, 0, Rd, c7, c10, 2 ; Clean DCache via Index

54 How to lockdown (2)
Then, set the victim counter: write CP15:c9 to force the victim pointer to a specific line, using either C9F_a (without L-bit) or C9F_b (with L-bit).
Finally, load the code and data into the cache with a software routine. For data, use LDR; for instructions, use instruction prefetching (MCR C7:c13).
NOTE: It is important that the code and data to be locked in cache do not exist elsewhere in the cache. Also, enable the MMU to ensure that any TLB misses while loading instructions or data cause a page table walk.

55 12.7 Cache & Software Performance
Here are a few simple rules to help write code that takes advantage of cache architecture.
Use the cache and write buffer to improve the AMAT.
Be careful NOT to cache or write-buffer memory-mapped I/O.
Spatial locality: organize commonly used data in order and within one cache line.
Make the code as small as possible.
Searching a linked list will degrade the performance of the cache.
The cache is not the only thing that affects performance (see chapters 5 and 6).
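The slides use AMAT without defining it. A common definition of average memory access time is the hit time plus the miss rate times the miss penalty, and the sketch below simply evaluates that formula. The function name and the numbers in the usage note are illustrative assumptions, not measurements from the slides.

```c
#include <assert.h>

/*
 * Average memory access time, using the common definition
 *   AMAT = hit_time + miss_rate * miss_penalty
 * Times are in nanoseconds; miss_rate is a fraction between 0 and 1.
 */
static double amat_ns(double hit_time_ns, double miss_rate, double miss_penalty_ns)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time_ns + miss_rate * miss_penalty_ns;
}
```

For example, with an assumed 20 ns hit time and a 480 ns line fill, a 5% miss rate gives about 44 ns per access, while a thrashing loop with a 100% miss rate pays the full 500 ns every time. The rules above all aim at lowering the miss rate term.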

56 Summary of cache
The cache is the layer between the core and main memory (MM).
The write buffer is a FIFO between the core and MM.
Rule of locality
Line: the unit of operation
Virtual vs. physical cache
Cache clean, flush, lockdown
