Software Methods to Increase Data Cache Performance
Presented by Philip Marshall

Outline
- Introduction
- Example: Multiple Vector Additions
- Example: Linked List
- Example: Binary Tree
- Conclusion

Introduction
- Cache hit time is critical to system performance
  - It often determines a processor's clock period
  - Cache controllers must be kept as simple as possible
- The miss rate of a cache can be decreased if we know something about the access patterns
- If software restructures code to use better access patterns, or hints at how the cache can best be used, performance improves

Introduction
Various methods can be used:
- Loop Fusion – combine multiple loops that access the same elements
- Array Merge – combine multiple arrays to increase spatial locality
- Cache Prefetch – ask for values to be loaded into the cache in advance
- Cache Bypass – prevent certain accesses from allocating in the cache

Vector Addition – Base Code

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N];
int s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++)
    s1[i] = a[i] + b[i];

for (int i = 0; i < SIZE_N; i++)
    s2[i] = a[i] + c[i];

Vector Addition – Base Code
Assumptions:
- A perfect instruction cache
- Conflict data misses are ignored
- A cache line size of 4 words
- Write miss penalties can be hidden
First loop:
- a, b: 256 misses each (every 4th access misses)
Second loop:
- a, c: 256 misses each, unless the cache is large enough to hold the entire a and b arrays
Total: 1024 misses

Vector Addition – Loop Fusion

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}

Vector Addition – Loop Fusion
- a, b, c: 256 misses each
- 768 total misses
- Are there always loops that can be combined? (A counterexample is sketched below.)
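Fusion is only legal when the loops have compatible bounds and no dependence running from one loop to the other. A minimal sketch (my example, not from the slides) of a pair of loops that cannot be fused:

#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N];
int s1[SIZE_N], s2[SIZE_N];

/* The second loop reads s1[SIZE_N - 1 - i], which for most i is an
 * element the first loop has not yet written at the same iteration.
 * Merging the two bodies into one loop would read stale s1 values
 * and change the result, so these loops cannot be fused. */
for (int i = 0; i < SIZE_N; i++)
    s1[i] = a[i] + b[i];

for (int i = 0; i < SIZE_N; i++)
    s2[i] = s1[SIZE_N - 1 - i] + c[i];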

Vector Addition – Array Merge

#define SIZE_N 1024

struct vectors_type {
    int a;
    int b;
    int c;
};

int s1[SIZE_N], s2[SIZE_N];
struct vectors_type vectors[SIZE_N];

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = vectors[i].a + vectors[i].b;
    s2[i] = vectors[i].a + vectors[i].c;
}

Vector Addition – Array Merge
- 3072 accesses, every 4th one misses: 768 misses
- May not be a viable optimization in all cases:
  - If we have a large set of vectors and want to be able to add any two
  - Dynamic memory allocation
  - What if we only want to traverse one vector? (See the sketch below.)
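A minimal sketch of that last drawback (my example, not from the slides), using the merged vectors[] array from the previous code slide: when only one field is traversed, every line fetched also carries b and c values that are never used. With 4-byte ints and 4-word (16-byte) lines, the 12-byte structs spread the a fields over roughly 768 lines, versus roughly 256 lines for a separate int a[SIZE_N] array, so about three times as many misses.

/* Traverses only the 'a' field of the merged layout.  Each cache line
 * fetched also contains unused b and c values, wasting the spatial
 * locality that a plain int a[SIZE_N] array would provide. */
long sum = 0;
for (int i = 0; i < SIZE_N; i++)
    sum += vectors[i].a;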

Vector Addition – Prefetch
- Speculatively load data into the cache before we need it
- Useful if we know which data we will need far enough in advance
- Assume a prefetch is useful if we know the address 10 iterations in advance
- Assume prefetching past the end of an array is non-faulting

Vector Addition – Prefetch

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
    prefetch(a[i+10]);   /* hint: bring a[i+10] into the cache */
    prefetch(b[i+10]);
    prefetch(c[i+10]);
}
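The prefetch() used above is a pseudo-intrinsic, not part of standard C. One possible mapping for the array examples, assuming GCC or Clang (my assumption, not part of the original slides), is the __builtin_prefetch builtin:

/* Possible realization of the slide's prefetch() pseudo-intrinsic:
 * __builtin_prefetch takes an address plus optional read/write and
 * temporal-locality hints, and the generated prefetch instruction is
 * non-faulting even if the address runs past the end of the array. */
#define prefetch(x) __builtin_prefetch(&(x), 0, 3)

Note that the later pointer-chasing slides pass a pointer value to prefetch() and intend to fetch the node it points to; there the mapping would be __builtin_prefetch(ptr) on the pointer itself rather than &(x).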

Vector Addition – Prefetch
- Only 30 misses
- 3072 prefetch instructions issued
- Does the cost outweigh the benefit?
  - 768 – 30 = 738 fewer misses
  - The miss cost only needs to be about 4.2 cycles for prefetch to be worthwhile (3072 extra instructions / 738 avoided misses ≈ 4.2, assuming one cycle per prefetch instruction)
  - Multiple-issue processors can help hide the cost of issuing prefetches
- Improves performance even if we are only adding 2 vectors

Vector Addition – Prefetch
- Do we want a special load instruction that prefetches several blocks ahead?
  - Reduces instruction count
  - Works for sequential access, but what if we want to prefetch from non-contiguous locations? (A per-line prefetch sketch follows.)
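Short of a new instruction, the instruction-count concern can also be addressed in software by issuing one prefetch per cache line rather than one per element. A sketch under the deck's assumptions (4-word lines, at least 10 iterations of lead time); the per-line striding is my addition, not part of the original slides:

/* Issue prefetches only when i is a multiple of the line size,
 * cutting the prefetch count from 3072 to 768 for the fused loop. */
#define LINE_WORDS 4
#define PF_DIST    12    /* multiple of LINE_WORDS, at least 10 iterations ahead */

for (int i = 0; i < SIZE_N; i++) {
    if ((i & (LINE_WORDS - 1)) == 0) {
        prefetch(a[i + PF_DIST]);
        prefetch(b[i + PF_DIST]);
        prefetch(c[i + PF_DIST]);
    }
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}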

Vector Addition – Cache Bypass
Assume a fully associative cache with 2 lines, each 4 words

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}

- Assume write non-allocate
- Very worst case: the cache always misses (4096 misses)
- If we use LRU and write our assembly so that a is always in cache: 2048 misses for b[i] and c[i], plus 256 misses for a[i] = 2304 total
- If we use non-caching reads for c[i]: 1024 misses for c[i], and a[i] and b[i] take 256 misses each: 1536 total (a sketch of a bypassing read follows)
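What a "non-caching read" looks like depends on the ISA; the slides simply assume one exists. A minimal sketch with a hypothetical load_nocache() intrinsic (both the name and the intrinsic are my assumptions; on real hardware this might be a non-temporal load instruction or an access to an uncacheable mapping):

/* Hypothetical intrinsic: loads a word without allocating a cache line. */
extern int load_nocache(const int *addr);

for (int i = 0; i < SIZE_N; i++) {
    /* a and b are read with normal, caching loads */
    s1[i] = a[i] + b[i];
    /* c is read with the bypassing load, so it can never evict a or b */
    s2[i] = a[i] + load_nocache(&c[i]);
}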

Linked List
- Suppose we are sequentially traversing a linked list
- We can prefetch the next several items
- Recomputing the prefetch address each time could be expensive (it requires multiple memory accesses to chase the pointers)
- Use 2 pointers: one for the traversal, one running ahead for prefetch

Linked List – Base Code

struct linked {
    int data;
    struct linked *next;
};

struct linked *start;
struct linked *temp = start;
int a[SIZE], index = 0;

while (temp->next) {
    a[index++] = temp->data;
    temp = temp->next;
}

Linked List – Prefetch

struct linked {
    int data;
    struct linked *next;
};

struct linked *start;
struct linked *temp = start, *temp2 = start;
int a[SIZE], index = 0;

/* Run temp2 10 nodes ahead of temp */
for (int i = 0; i < 10; i++)
    temp2 = temp2->next;

while (temp->next) {
    a[index++] = temp->data;
    temp = temp->next;
    if (temp2->next) {
        temp2 = temp2->next;
        prefetch(temp2->next);   /* hint: fetch a node several iterations ahead */
    }
}

Linked List – Prefetch
- Instead of every element potentially missing the cache, only the first 10 do
- If prefetches take longer to complete, the prefetch distance must grow, and more cache space is needed to hold the prefetched nodes in flight (see the parameterized sketch below)
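A restatement of the same traversal (mine, not from the slides) with the prefetch distance pulled out as a constant, to make the latency/space trade-off explicit: a longer miss latency calls for a larger PREFETCH_DISTANCE, and the cache must then hold that many prefetched nodes at once.

/* PREFETCH_DISTANCE nodes are kept "in flight" ahead of the consumer.
 * Larger values tolerate longer miss latencies but tie up more cache lines. */
#define PREFETCH_DISTANCE 10

struct linked *temp = start, *ahead = start;
int a[SIZE], index = 0;

for (int i = 0; i < PREFETCH_DISTANCE && ahead->next; i++)
    ahead = ahead->next;

while (temp->next) {
    a[index++] = temp->data;
    temp = temp->next;
    if (ahead->next) {
        ahead = ahead->next;
        prefetch(ahead->next);
    }
}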

Binary Tree
- Suppose we are traversing a binary tree where we can't easily predict which branch we'll take next. Is prefetch useful?
- We can speculatively prefetch all possible next values
  - How far down the tree?
  - Cache pollution
- It may be valuable to speculatively fetch the next two possible nodes if we can do useful work until the prefetch completes (i.e., if it takes enough cycles to determine which branch to take)

Binary Tree

struct node {
    int data;
    struct node *left, *right;
};

struct node *top;
struct node *temp = top;
int search_value, found = 0;

do {
    if (temp->left)
        prefetch(temp->left);    /* speculatively fetch both children */
    if (temp->right)
        prefetch(temp->right);
    temp = next_node(temp, search_value, &found);
} while (!found);
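next_node is not defined on the slides; a minimal sketch of what it might look like for an ordered binary search tree (my assumption about the deck's intent) is:

/* Hypothetical helper, not part of the original slides: one step of a
 * binary-search-tree lookup.  Sets *found when the value is located at
 * the current node, or when the search runs off the tree, so the
 * caller's do/while loop terminates; otherwise it returns the child to
 * visit next, which the caller has already prefetched. */
struct node *next_node(struct node *n, int search_value, int *found)
{
    if (n->data == search_value) {
        *found = 1;                       /* value located */
        return n;
    }
    struct node *child = (search_value < n->data) ? n->left : n->right;
    if (child == NULL) {
        *found = 1;                       /* search finished without a match */
        return n;
    }
    return child;
}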

Conclusion
- Some methods improve contrived cases, but are they always useful?
  - Loop fusion
  - Array merge
- Prefetch works well for predictable access patterns
  - Dynamic memory and pointers?
  - Is prefetch worthwhile for large block sizes and random accesses to small elements?

Conclusion
- Cache miss time, measured in clock cycles, is increasing
  - This requires prefetching farther ahead, and therefore larger caches
- Software methods are static
  - Low implementation cost
  - Potentially pipeline-independent