
1 Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching by: Josefin Hallberg, Tuva Palm and Mats Brorsson Presented by: Lena Salman

2 Introduction Pointer-based data structures are usually allocated at arbitrary locations in memory, so they rarely achieve good locality and suffer higher miss rates. This work compares a software approach against a hardware approach.

3 Software approach Two techniques: cache-conscious allocation, which is by far the most efficient, and software prefetch, which is better suited for automation and easier to implement in compilers. Combining cache-conscious allocation with software prefetch does not add significantly to performance.

4 Hardware approach Hardware techniques include calculating and prefetching pointers, calculating pointer dependencies, and predicting what to evict from the cache. General HW prefetch is more likely to pollute the cache. Note that all the hardware strategies take advantage of the increased locality of cache-consciously allocated data.

5 Prefetching and cache-conscious allocation Each should complement the other's weakness: cache-conscious allocation reduces the prefetch overhead of fetching blocks with partially unwanted data, while prefetching should reduce the cache misses and miss latencies between nodes.

6 Cache-conscious allocation Gives an excellent improvement in execution time. Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size). Attempts to co-allocate data in the same cache line, so that nodes referenced after each other reside on the same cache line.

7 Allocation to improve locality

8 Cache-conscious allocation Attempts to allocate related data in the same cache line, so better locality can be achieved and cache performance improves through a reduction of misses.

9 ccmalloc() Performs the cache-conscious allocation of memory. Takes an extra argument: a pointer to a data structure that is likely to be referenced together with the newly allocated one.

#ifdef CCMALLOC
    child = ccmalloc(sizeof(struct node), parent);
#else
    child = malloc(sizeof(struct node));
#endif

10 ccmalloc() Takes a pointer to data that is likely to be referenced close in time to the newly allocated structure. It falls back on the standard malloc() when allocating a new cc-block or when the requested data is larger than a cc-block; otherwise it allocates in an empty slot of the existing cc-block.
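As an illustration only, a minimal toy version of this decision logic in C might look as follows. The slides do not describe the internals of ccmalloc(), so the cc-block header layout, the single-block bookkeeping, and the lack of alignment handling are all simplifying assumptions.

#include <stdlib.h>

#define CC_BLOCK_SIZE 256          /* cc-block size used in the study */

struct cc_block {                  /* assumed header layout, not from the paper */
    size_t used;                   /* bytes handed out so far */
    char   data[CC_BLOCK_SIZE];    /* payload area */
};

/* The toy version only remembers the most recently created cc-block. */
static struct cc_block *last_block = NULL;

void *ccmalloc(size_t size, void *near)
{
    if (size > CC_BLOCK_SIZE)      /* too large to co-allocate: plain malloc() */
        return malloc(size);

    /* Co-allocate only if 'near' lives in the remembered block and a slot is free. */
    struct cc_block *b = last_block;
    int near_in_b = b && (char *)near >= b->data
                      && (char *)near <  b->data + CC_BLOCK_SIZE;

    if (near_in_b && b->used + size <= CC_BLOCK_SIZE) {
        void *slot = b->data + b->used;   /* empty slot of the cc-block */
        b->used += size;
        return slot;
    }

    /* No suitable slot: start a new cc-block with the standard malloc(). */
    b = malloc(sizeof *b);
    if (!b)
        return NULL;
    b->used    = size;
    last_block = b;
    return b->data;
}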

11 cc-blocks = cache-conscious blocks Require cache lines large enough to contain more than one pointer structure. The bigger the blocks, the lower the miss rate, provided the allocation is smart. The cc-block size can be set in software, independently of the HW cache line size. In this study the cc-block size is 256 B and the hardware cache line size ranges from 16 B to 256 B.

12 Prefetch Prefetching reduces the cost of a cache miss. It can be controlled by software and/or hardware: software prefetch costs extra instructions, while hardware prefetch adds hardware complexity.

13 Software-controlled prefetch Implemented by including a prefetch instruction in the instruction set. Prefetches should be inserted well ahead of the actual reference, according to the prefetch algorithm. This study uses the greedy algorithm by Mowry et al.

14 Software prefetch: greedy algorithm When a node is referenced, all of its children are prefetched. Without extra calculation this can only be done for the children, not for the grandchildren. It is easier to control and optimize, and the risk of polluting the cache decreases, since only needed lines are prefetched.
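For illustration, a greedy prefetch inserted into a binary-tree sum (in the spirit of the treeadd benchmark) could look like the sketch below. The node type and the use of GCC's __builtin_prefetch are assumptions for the sketch; in the study the prefetch instructions are inserted following Mowry et al.'s greedy algorithm.

#include <stddef.h>

struct node {
    int          value;
    struct node *left, *right;
};

long sum_tree(struct node *n)
{
    if (n == NULL)
        return 0;

    /* Greedy step: as soon as this node is referenced, prefetch all of
     * its children so they are (hopefully) cached when the recursive
     * calls reach them. Prefetching a NULL child is harmless. */
    __builtin_prefetch(n->left);
    __builtin_prefetch(n->right);

    return n->value + sum_tree(n->left) + sum_tree(n->right);
}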

15 Software greedy prefetch

16 Hardware-controlled prefetch Depending on the algorithm used, prefetching can be triggered when a miss occurs, when a hint is given by the programmer through an instruction, or unconditionally on certain types of data.

17 Hardware prefetch Techniques used: prefetch-on-miss and tagged prefetch. Both attempt to exploit spatial locality and do NOT analyze data access patterns.

18 Prefetch-on-miss Prefetches the next sequential line i+1 when detecting a miss on line i. (Diagram: a miss on line i triggers a prefetch of line i+1.)

19 Tagged prefetch Each prefetched line is marked with a tag bit. When a tagged (prefetched) line i is referenced for the first time, line i+1 is prefetched, even though no miss has occurred. This has been shown to be efficient when memory accesses are fairly sequential. A sketch of both hardware policies follows.
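The sketch below models both policies (slides 18 and 19) on a toy direct-mapped cache. The line size, the number of sets, and the lookup/install helpers are assumptions made purely for illustration; they are not the simulator configuration from the paper.

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64        /* assumed line size, for the sketch only     */
#define NUM_SETS  1024      /* toy direct-mapped cache: one line per set  */

struct line { uint64_t id; bool valid; bool tag; };  /* tag = prefetched, not yet referenced */
static struct line cache[NUM_SETS];

static struct line *lookup(uint64_t line_id)
{
    struct line *l = &cache[line_id % NUM_SETS];
    return (l->valid && l->id == line_id) ? l : NULL;
}

static void install(uint64_t line_id, bool prefetched)
{
    struct line *l = &cache[line_id % NUM_SETS];
    l->id = line_id;
    l->valid = true;
    l->tag = prefetched;
}

/* Prefetch-on-miss (slide 18): a demand miss on line i also brings in line i+1. */
void access_prefetch_on_miss(uint64_t addr)
{
    uint64_t i = addr / LINE_SIZE;
    if (!lookup(i)) {
        install(i, false);       /* demand fetch of the missing line i     */
        install(i + 1, true);    /* prefetch of the next sequential line   */
    }
}

/* Tagged prefetch (slide 19): in addition, the first reference to a tagged
 * (prefetched) line triggers a prefetch of the next line, without any miss. */
void access_tagged(uint64_t addr)
{
    uint64_t i = addr / LINE_SIZE;
    struct line *l = lookup(i);
    if (!l) {
        install(i, false);
        install(i + 1, true);
    } else if (l->tag) {
        l->tag = false;          /* line is now demand-referenced          */
        install(i + 1, true);    /* keep the prefetch chain going          */
    }
}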

20 Prefetch-on-miss for ccmalloc() HW prefetch can be combined with ccmalloc() by introducing a hint containing the address of the beginning of the cache-consciously allocated block.

21 Prefetch-one-cc on miss Prefetches the next cache line after detecting a cache miss in a cache-consciously allocated block.

22 Prefetch-all-cc on miss Decides dynamically how many lines to prefetch, depending on where in the cc-block the missing cache line is located: all the remaining cache lines of the cc-block are prefetched, starting from the address causing the miss. A sketch of both cc-based policies is given below.
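A minimal sketch of the two cc-based policies (slides 21 and 22), under the assumptions that cc-blocks are 256 B and aligned on their own size, and with a hypothetical issue_prefetch() call standing in for whatever mechanism the memory system uses to launch a prefetch:

#include <stdint.h>

#define LINE_SIZE     64     /* assumed hardware cache-line size        */
#define CC_BLOCK_SIZE 256    /* cc-block size used in the study         */

/* Placeholder only; not an interface from the paper. */
static void issue_prefetch(uint64_t addr) { (void)addr; }

/* Prefetch-one-cc (slide 21): one next line after a miss in a cc-block. */
void prefetch_one_cc(uint64_t miss_addr)
{
    issue_prefetch(miss_addr + LINE_SIZE);
}

/* Prefetch-all-cc (slide 22): every remaining line of the cc-block after
 * the missing one (the missing line itself is demand-fetched). Assumes
 * cc-blocks are aligned on CC_BLOCK_SIZE boundaries. */
void prefetch_all_cc(uint64_t miss_addr)
{
    uint64_t block_end = (miss_addr / CC_BLOCK_SIZE + 1) * CC_BLOCK_SIZE;

    for (uint64_t a = miss_addr + LINE_SIZE; a < block_end; a += LINE_SIZE)
        issue_prefetch(a);
}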

23 Experimental framework A MIPS-like, out-of-order processor simulator with memory latency equal to a 50 ns random access time. Benchmarks: health simulates the Colombian health-care system; mst creates a graph and computes its minimum spanning tree; perimeter computes the perimeter of an image; treeadd computes a recursive sum of values in a tree.

24 More about the benchmarks: In health, elements are moved between lists during execution, and there is more computation per data element. mst originally used a locality-optimizing allocation procedure, which makes the effect of ccmalloc() hardly noticeable. In perimeter, data is allocated in an order similar to the access order, which already gives a locality optimization. treeadd performs computation between the nodes of a balanced binary tree.

25 Results: Execution time

26 Stalls: A memory stall occurs when an instruction waits a cycle because the oldest instruction waiting to be retired is a load/store instruction. An FU stall occurs when the oldest such instruction is not a load/store instruction. A fetch stall occurs when there is no instruction waiting to be retired. Prefetching is likely to help when memory stalls are dominant.

27-30 Graphs (figure-only slides)

31 Cache performance: SW Miss rates are improved by most strategies. The increased spatial locality with ccmalloc() reduces cache misses (less pollution). Software prefetch shows some decrease in misses, but prefetches a lot of unused data. The combination of the software techniques achieves the lowest miss rates.

32 Cache performance: cache lines The larger the cache lines, the more effective ccmalloc() becomes. HW prefetch alone, however, tends to pollute the cache with unwanted data, while SW prefetch alone tends to fetch data already present in the cache.

33 Cache performance: SW prefetch achieves higher precision. HW prefetch alone performs poorly and is more sensitive to the cache line size than SW prefetch.

34 Cache performance: SW prefetch with ccmalloc() Increases the fraction of prefetched cache lines that are actually used, thanks to the increased spatial locality. However, it also results in attempts to prefetch lines that are already in the cache.

35 Cache performance: HW prefetch with ccmalloc() The HW strategies give a greater improvement together with cache-conscious allocation than on their own; prefetch-on-miss and tagged prefetch show much the same results. Still, a large number of prefetched lines remain unused. The number of unused lines decreases with larger cache lines, due to spatial locality and a reduced need to prefetch.

36 Conclusions: Cache-conscious allocation with ccmalloc() remains the most effective technique. It helps overcome the drawbacks of large cache lines and creates the locality necessary for prefetching. The larger the cache line, the less prominent the prefetch strategy becomes.

37 Conclusions 2: Adding HW prefetch to cache-conscious allocation gives no prominent gain; ccmalloc() alone seems to be enough. However, ccmalloc() can be used to overcome the negative effects of next-line prefetch. HW prefetch performs better than SW prefetch.

38 Conclusions 3: When a compiler can use profiling information to optimize memory allocation in a cache-conscious manner, that is preferable. However, when profiling is too expensive, applications will likely still benefit from general prefetch support.

39 The endddd!!! You can tell me, I can take it... What's up doc???

40 Lena Salman, 28.06.2004

