Memory Optimizations Research at UNT. Krishna Kavi, Professor; Director of NSF Industry/University Cooperative Center for Net-Centric Software and Systems.

1 Memory Optimizations Research at UNT
Krishna Kavi, Professor
Director of NSF Industry/University Cooperative Center for Net-Centric Software and Systems (Net-Centric IUCRC)
Computer Science and Engineering, The University of North Texas, Denton, Texas 76203, USA

2 Motivation
- The memory subsystem plays a key role in achieving performance on multi-core processors
- The memory subsystem accounts for a significant portion of the energy consumed
- Pin limitations limit bandwidth to off-chip memories
- Shared caches may have non-uniform access behaviors
- Shared caches may encounter inter-core conflicts and coherency misses
- Different data types exhibit different locality and reuse behaviors
- Different applications need different memory optimizations

3 Our Research Focus
- Cache memory optimizations
  - software and hardware solutions
  - primarily at L-1, with some ideas at L-2
- Memory management
  - intelligent allocation and user-defined layouts
  - hardware-supported allocation and garbage collection

4 Non-Uniformity of Cache Accesses
- Accesses to cache sets are non-uniform
- Some sets are accessed 100,000 times more often than other sets
- Heavily accessed sets cause more misses while some sets are not used
[Figure: Non-Uniform Cache Accesses for Parser]

5 Non-Uniformity of Cache Accesses
- But not all applications exhibit “bad” access behavior
- Need different solutions for different applications
[Figure: Non-Uniform Cache Accesses for Selected Benchmarks]

6 Improving Uniformity of Cache Accesses
Possible solutions:
- Using fully associative caches with perfect replacement policies
- Selecting optimal addressing schemes
- Dynamically re-mapping addresses to new cache lines
- Partitioning caches into smaller portions, each partition used by a different data object
- Using multiple address decoders
- Static or dynamic data mapping and relocation

7 Associative Caches Improve Uniformity
[Figure panels: Direct Mapped Cache; 16-Way Associative Cache]

8 Data Memory Characteristics
Different object types exhibit different access behaviors:
- Arrays exhibit spatial localities
- Linked lists and pointer data types are difficult to pre-fetch
- Static and scalar data may exhibit temporal localities
Custom memory allocators and custom run-time support can be used to improve locality of dynamically allocated objects:
- Pool allocators (U of Illinois)
- Regular expressions to improve on pool allocators (Korea)
- Profiling and reallocating objects (UNT)
- Hardware support for intelligent memory management (UNT and Iowa State)

9 ABC’s of Cache Memories
Multiple levels of memory form the memory hierarchy:
- CPU and registers
- L1 instruction cache and L1 data cache
- L2 cache (combined data and instructions)
- DRAM (main memory)
- Disk

10 ABC’s of Cache Memories
Consider a direct-mapped cache. An address can only be placed in one fixed cache line, the one specified by the 6-bit line number field of the address.
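The direct-mapped lookup described on this slide can be sketched in C roughly as follows. Only the 6-bit line number comes from the slide; the 32-byte block size (5 offset bits), the 32-bit address width, and every identifier below are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 5u                 /* assumed: 32-byte blocks */
#define LINE_BITS   6u                 /* from the slide: 6-bit line number */
#define NUM_LINES   (1u << LINE_BITS)  /* 64 lines */

typedef struct {
    bool     valid;
    uint32_t tag;
} dm_line_t;

/* Line number: the 6 bits just above the byte offset. */
static uint32_t dm_line_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
}

/* Tag: all address bits above the line number. */
static uint32_t dm_tag(uint32_t addr)
{
    return addr >> (OFFSET_BITS + LINE_BITS);
}

/* Returns true on a hit. On a miss, the single candidate line is refilled,
 * evicting whatever block was there before. */
static bool dm_access(dm_line_t cache[NUM_LINES], uint32_t addr)
{
    dm_line_t *line = &cache[dm_line_index(addr)];
    if (line->valid && line->tag == dm_tag(addr))
        return true;
    line->valid = true;
    line->tag = dm_tag(addr);
    return false;
}
```

Under these assumptions, addresses 0x0 and 0x800 share line 0 and keep evicting each other, which is the conflict behavior the later slides try to reduce.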

11 ABC’s of Cache Memories
Consider a 2-way set associative cache. An address is located in a fixed set of the cache, but it can occupy either of the 2 lines of that set. We extend this idea to 4-way, 8-way, ..., fully associative caches.

12 ABC’s of Cache Memories
Consider a fully associative cache.
- An address can be located in any line; equivalently, there is only one set in the cache
- Very expensive, since we need to compare the address tag with each line tag
- Also needs a good replacement strategy
- Can lead to more uniform access to cache lines
[Figure: address divided into Tag and Byte offset]

13 Programmable Associativity
Can we provide higher associativity only when we need it?
Consider a simple idea: heavily accessed cache lines will be provided with alternate locations, as indicated by a “partner index”.
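A rough C sketch of the partner-index idea follows. It is a simplification written for this transcript, not the design from the slides or the cited papers: the policy that decides which lines are "heavily accessed" and which partner they get is left to the caller, each line stores the full block number so it can hold a block from another index, and all names and sizes are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PA_LINES      64u
#define PA_NO_PARTNER 0xFFFFFFFFu

typedef struct {
    bool     valid;
    uint32_t block;    /* full block number, so a line can hold a "foreign" block */
    uint32_t partner;  /* alternate location for this index, or PA_NO_PARTNER */
} pa_line_t;

static void pa_init(pa_line_t c[PA_LINES])
{
    for (uint32_t i = 0; i < PA_LINES; i++) {
        c[i].valid = false;
        c[i].block = 0;
        c[i].partner = PA_NO_PARTNER;
    }
}

/* Probe the home line first, then the partner line if one is assigned. */
static bool pa_lookup(const pa_line_t c[PA_LINES], uint32_t block)
{
    uint32_t home = block % PA_LINES;
    if (c[home].valid && c[home].block == block)
        return true;
    uint32_t alt = c[home].partner;
    return alt != PA_NO_PARTNER && c[alt].valid && c[alt].block == block;
}

/* Fill after a miss: if the home line is occupied and has a partner,
 * place the block in the partner line; otherwise (re)use the home line. */
static void pa_fill(pa_line_t c[PA_LINES], uint32_t block)
{
    uint32_t home = block % PA_LINES;
    uint32_t alt = c[home].partner;
    uint32_t dest = (c[home].valid && alt != PA_NO_PARTNER) ? alt : home;
    c[dest].valid = true;
    c[dest].block = block;
}
```

With no partner assigned, blocks 5 and 69 evict each other from line 5; once line 5 is given a partner (say line 40), both can be resident at the same time.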

14 Programmable Associativity
Peir’s adaptive cache uses two tables:
- Set-reference History Table (SHT): tracks heavily used cache lines
- Out-of-position directory (OUT): tracks alternate locations
Zhang’s programmable associativity (B-Cache):
- The cache index is divided into programmable and non-programmable (NPI) indexes
- The NPI facilitates varying associativities
[Peir 98] J. Peir, Y. Lee, and W. Hsu, “Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology,” In Proc. of the 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1998, pp. 240–250.
[Zhang 06] C. Zhang, “Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches,” ISCA, June 2006, pp. 155–166.

15 Programmable Associativity

16 Programmable Associativity

17 Programmable Associativity

18 Multiple Decoders
[Figure: address fields Tag, Set Index, and Byte offset for each decoder; cache array with Tag and Data]
Different decoders may use different associativities.

19 Multiple Decoders
But how do we select the index bits?

20 Index Selection Techniques
Different approaches have been studied:
- Givargis quality bits
- XOR some tag bits with the index bits
- Add a multiple of the tag to the index
- Use prime modulo
[Givargis 03] T. Givargis, “Improved Indexing for Cache Miss Reduction in Embedded Systems,” In Proc. of Design Automation Conference, 2003.
[Kharbutli 04] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee, “Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses,” Proc. Int’l Symp. High Performance Computer Architecture, 2004.
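Three of the indexing schemes listed above (XOR, adding a multiple of the tag, and prime modulo) can be sketched as small C functions. This is an illustrative sketch, not code from the cited papers: the 64-set cache, the choice of 61 as the prime, and all function names are assumptions, and the Givargis quality-bit selection is not shown because it depends on a profiled trace.

```c
#include <stdint.h>

#define IDX_BITS  6u
#define IDX_SETS  (1u << IDX_BITS)  /* assumed: 64 sets */
#define IDX_PRIME 61u               /* largest prime below 64 */

/* Conventional indexing: low-order bits of the block address. */
static uint32_t index_modulo(uint32_t block)
{
    return block & (IDX_SETS - 1u);
}

/* XOR the low tag bits into the index bits. */
static uint32_t index_xor(uint32_t block)
{
    uint32_t index = block & (IDX_SETS - 1u);
    uint32_t tag = block >> IDX_BITS;
    return (index ^ tag) & (IDX_SETS - 1u);
}

/* Add an odd multiple of the tag to the index. */
static uint32_t index_odd_multiplier(uint32_t block, uint32_t odd_mult)
{
    uint32_t index = block & (IDX_SETS - 1u);
    uint32_t tag = block >> IDX_BITS;
    return (index + odd_mult * tag) & (IDX_SETS - 1u);
}

/* Prime modulo: sets IDX_PRIME .. IDX_SETS-1 are never used. */
static uint32_t index_prime_modulo(uint32_t block)
{
    return block % IDX_PRIME;
}
```

Block numbers 0, 64, 128, and 192 all collide in set 0 under conventional indexing; each of the other three functions spreads them over four different sets. The prime-modulo variant trades a few unused sets for that spread.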

21 Index Selection Techniques

22 Multiple Decoders
- Odd multiplier method
- Different multipliers for each thread

23 Multiple Decoders
- Here we split the cache into segments, one per thread
- But we used adaptive cache techniques to “donate” underutilized sets to other threads

24 Other Cache Memory Research at UNT
- Use of a single data cache can lead to unnecessary cache misses
- Arrays exhibit higher spatial localities, while scalars may exhibit higher temporal localities
- They may benefit from different cache organizations (associativity, block size)
- If we use separate instruction and data caches, why not different data caches, either statically or dynamically partitioned?
- And if separate array and scalar caches are included, how can we further improve their performance?
- Optimize the sizes of array and scalar caches for each application

25 Reconfigurable Caches
[Figure labels: CPU, Array Cache, Scalar Cache, Secondary Cache, Main Memory]

26 Percentage reduction of power, area and cycles for data cache
- Conventional cache configuration: 8KB direct-mapped data cache, 32KB 4-way unified level-2 cache
- Scalar cache configuration: variable size, direct mapped, with a 2-line victim cache
- Array cache configuration: variable size, direct mapped

27 Summarizing
For instruction cache:
- 85% (average 62%) reduction in cache size
- 72% (average 37%) reduction in cache access time
- 75% (average 47%) reduction in energy consumption
For data cache:
- 78% (average 49%) reduction in cache size
- 36% (average 21%) reduction in cache access time
- 67% (average 52%) reduction in energy consumption
when compared with an 8KB L-1 instruction cache and an 8KB L-1 unified data cache with a 32KB level-2 cache

28 Generalization
- Why not extend array/scalar split caches to more than 2 partitions?
  - Each partition is customized to a specific object type
- Partitioning can be achieved using multiple decoders with a single cache resource (virtual partitioning)
- Reconfigurable partitions are possible with programmable decoders
  - Each decoder accesses a portion of the cache
  - Either physically restrict a decoder to a segment of the cache, or virtually limit the number of lines it can access
- Scratchpad memories can be viewed as cache partitions
  - Dedicate a segment of the cache for scratchpad

29 Scratch Pad Memories
- They are viewed as compiler-controlled memories: as fast as L-1 caches, but not managed as caches
- The compiler decides which data will reside in scratch pad memory (SPM)
- A new paper from Maryland proposes a way of compiling programs for scratch pad memories of unknown size
  - Only stack data (static and global variables) are placed in SPM
  - The compiler views the stack as two stacks: a potential SPM data stack and a DRAM data stack

30 Current and Future Research
- Extensive study of using multiple decoders
  - Separate decoders for different data structures: partitioning of L-1 caches
  - Separate decoders for different threads and cores at L-2 or last-level caches
    - minimize conflicts
    - minimize coherency-related misses
    - minimize loss due to non-uniform memory access delays
- Investigate additional indexing or programmable associativity ideas
- Cooperative L-2 caches using adaptive caches

31 Program Analysis Tool
- We need tools to profile and analyze
  - data layout at various levels of the memory hierarchy
  - data access patterns
- Existing tools (Valgrind, Pin) do not provide fine-grained information
- We want to relate each memory access back to source-level constructs
  - source variable name, function/thread that caused the access

32 Gleipnir
- Our tool is built on top of Valgrind
- It can be used with any architecture that is supported by Valgrind: x86, PPC, MIPS and ARM

33 Gleipnir

34 Gleipnir

35 Gleipnir

36 Gleipnir
How can we use Gleipnir? To explore different data layouts and their impact on cache accesses.

37 Gleipnir: Standard layout

38 Gleipnir: Tiled matrices

39 Gleipnir: Matrices A and C combined

40 Further Research
Restructuring memory allocation (currently in progress):
- Analyze cache set conflicts and relate them to data objects
- Modify data placement of these objects
- Reorder variables, include dummy variables, …
Restructure code to improve data access patterns (SLO tool):
- Loop fusion: combine loops that use the same data
- Loop tiling: split loops into smaller loops to limit the data accessed
- Similar techniques to ensure that “common” data resides in L-2 (shared caches)
- Similar techniques so that data is transferred to GPUs infrequently

41 Code Refactoring: Loop Tiling Idea
Too much data accessed in the loop
double sum(…) { … for(int i=0; i
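The slide's own code is cut off in this transcript, so the following is a separate, hypothetical illustration of loop tiling rather than a reconstruction of it. It uses a row-major matrix multiply; the tile size of 32 and all names are assumptions, and the tile size would normally be tuned to the cache being targeted.

```c
#include <stddef.h>

#define TILE 32  /* assumed tile size; tune so the working tiles fit in L1 */

static size_t min_sz(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* C += A * B for n x n row-major matrices.
 * The three outer loops walk over tiles; the three inner loops stay inside
 * one tile of each matrix, so the data touched between reuses is bounded by
 * the tile size instead of by n. */
static void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < min_sz(ii + TILE, n); i++)
                    for (size_t k = kk; k < min_sz(kk + TILE, n); k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + TILE, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

The result is the same as the untiled triple loop; only the order in which elements are visited changes.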

42 Code Refactoring: Loop Fusion Idea
double inproduct(…) { … for(int i=0; i
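Because the slide's code is truncated here, the example below is a hypothetical stand-in that shows what loop fusion looks like in general. The two statistics computed (a sum and an inner product) and all names are assumptions chosen for illustration.

```c
#include <stddef.h>

typedef struct {
    double sum_x;   /* sum of x[i] */
    double dot_xy;  /* inner product of x and y */
} stats_t;

/* Unfused: x is streamed through the cache twice. */
static stats_t stats_unfused(const double *x, const double *y, size_t n)
{
    stats_t s = {0.0, 0.0};
    for (size_t i = 0; i < n; i++)
        s.sum_x += x[i];
    for (size_t i = 0; i < n; i++)
        s.dot_xy += x[i] * y[i];
    return s;
}

/* Fused: both computations use x[i] while it is still in a register or in
 * L1, so x is read from memory only once. */
static stats_t stats_fused(const double *x, const double *y, size_t n)
{
    stats_t s = {0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        s.sum_x += x[i];
        s.dot_xy += x[i] * y[i];
    }
    return s;
}
```

Both versions return the same values; the fused one roughly halves the traffic for x when the array does not fit in the cache.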

43 SLO Tool
double inproduct(…) { … for(int i=0; i

44 Extensions Planned
Key factors influencing code and data refactoring:
- Reuse distance: reducing the distance improves data utilization
  - Can be used with CPU-GPU configurations
  - Fuse loops so that all computations using the “same” data are grouped
- Conflict sets and conflict distances
  - The set of variables that fall to the same cache line (or group of lines)
  - Conflict between pairs of conflicting variables
  - Increase conflict distance
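The slide does not define reuse distance. The sketch below assumes the common definition: the number of distinct blocks touched between two consecutive accesses to the same block. It is a simple quadratic scan meant only to make the metric concrete; the trace format and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reuse distance of the access at trace[pos]: the number of distinct blocks
 * touched since the previous access to the same block.
 * Returns -1 if trace[pos] is the first access to that block. */
static long reuse_distance(const uint32_t *trace, size_t pos)
{
    /* Walk backwards to find the previous use; it ends up at index prev-1. */
    size_t prev = pos;
    while (prev > 0 && trace[prev - 1] != trace[pos])
        prev--;
    if (prev == 0)
        return -1;

    /* Count distinct blocks in trace[prev .. pos-1]. */
    long distinct = 0;
    for (size_t i = prev; i < pos; i++) {
        int seen = 0;
        for (size_t j = prev; j < i; j++) {
            if (trace[j] == trace[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

For a trace of blocks 1, 2, 3, 2, 1, the final access to block 1 has two distinct blocks (2 and 3) between it and its previous use. Fusing loops that touch the same data shortens exactly this kind of gap.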

45 Further Research
We are currently investigating several of these ideas:
- Using architectural simulators like Simics, explore multiple decoders with multiple threads, cores, or different data types
- Further extend Gleipnir; explore using Gleipnir with compilers and with other tools like SLO; evaluate the effectiveness of custom allocators
- Some hardware implementations of memory management using FPGAs
And we welcome collaborations.

46 The End
Questions? More information and papers at

47 Custom Memory Allocators
Consider a typical pointer-chasing program:
node { int key; … data; /* complex data part */ node *next; }
We will explore two possibilities:
- pool allocation
- split structures

48 Custom Memory Allocators: Pool Allocator (Illinois)
[Figure labels: Heap; Data type A; Data type B]
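To make the pool idea concrete, here is a minimal hand-written fixed-size pool in C, with one pool per data type. It is not the compiler-driven pool allocation from Illinois that the slide refers to; the interface, the names, and the assumption that objects need no stricter alignment than a pointer are all choices made for this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct pool {
    unsigned char *slab;      /* one contiguous block holding every object */
    void          *free_list; /* singly linked list threaded through free slots */
    size_t         obj_size;  /* slot size after rounding */
} pool_t;

/* Create a pool of `count` slots. Returns 0 on success, -1 on failure. */
static int pool_init(pool_t *p, size_t obj_size, size_t count)
{
    /* Each free slot stores a pointer, so round the slot size up to a
     * multiple of the pointer size. */
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);
    obj_size = (obj_size + sizeof(void *) - 1) / sizeof(void *) * sizeof(void *);

    p->slab = malloc(obj_size * count);
    if (p->slab == NULL)
        return -1;
    p->obj_size = obj_size;
    p->free_list = NULL;
    /* Push slots in reverse so allocations come out in address order. */
    for (size_t i = count; i-- > 0; ) {
        void *slot = p->slab + i * obj_size;
        *(void **)slot = p->free_list;
        p->free_list = slot;
    }
    return 0;
}

/* Pop a slot from the free list; returns NULL when the pool is exhausted. */
static void *pool_alloc(pool_t *p)
{
    void *slot = p->free_list;
    if (slot != NULL)
        p->free_list = *(void **)slot;
    return slot;
}

/* Push a slot back onto the free list. */
static void pool_free(pool_t *p, void *obj)
{
    *(void **)obj = p->free_list;
    p->free_list = obj;
}

static void pool_destroy(pool_t *p)
{
    free(p->slab);
    p->slab = NULL;
    p->free_list = NULL;
}
```

Giving data type A and data type B their own pools keeps objects of the same type adjacent in memory, instead of interleaved in one general heap.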

49 Custom Memory Allocators: Further Optimization
Consider a typical pointer-chasing program:
node { int key; … data; /* complex data part */ node *next; }
The data part is accessed only if the key matches:
while (..) { if (h->key == k) return h->data; h = h->next; }
Consider a different definition of the data:
node { int key; node *next; data_node *data_ptr; }
[Figure: a chain of nodes, each holding key, *next, and *data_ptr; Data_node]
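A compilable version of the split-structure idea might look like the sketch below. The 256-byte payload and the function name `find` are assumptions; the slide leaves the data part unspecified.

```c
#include <stddef.h>

/* Cold part: only touched when the key matches. Payload size is assumed. */
typedef struct data_node {
    char payload[256];
} data_node;

/* Hot part: small enough that several nodes share one cache line. */
typedef struct node {
    int          key;
    struct node *next;
    data_node   *data_ptr;
} node;

/* Walks only the hot nodes; the cold data is dereferenced by the caller,
 * and only for the node whose key matched. */
static data_node *find(node *h, int k)
{
    while (h != NULL) {
        if (h->key == k)
            return h->data_ptr;
        h = h->next;
    }
    return NULL;
}
```

The traversal now pulls in only keys and pointers, so each cache line fetched during the search carries more useful entries than when the complex data part is embedded in every node.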

50 Custom Memory Allocators: Profiling (UNT)
- Using data profiling, “flatten” dynamic data into consecutive blocks
- Make linked lists look like arrays!
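One way to picture "flattening" is the C sketch below, which copies a linked list into a single contiguous block in traversal order. It skips the profiling step entirely and is not the UNT implementation; the node type and function name are hypothetical.

```c
#include <stdlib.h>

typedef struct lnode {
    int           key;
    struct lnode *next;
} lnode;

/* Copies the list into one contiguous array, in traversal order, and
 * relinks the copies so the result is still a valid linked list.
 * Stores the node count in *out_len. Returns NULL for an empty list or on
 * allocation failure. The original nodes are left untouched. */
static lnode *flatten(const lnode *head, size_t *out_len)
{
    size_t n = 0;
    for (const lnode *p = head; p != NULL; p = p->next)
        n++;
    *out_len = n;
    if (n == 0)
        return NULL;

    lnode *flat = malloc(n * sizeof *flat);
    if (flat == NULL)
        return NULL;

    size_t i = 0;
    for (const lnode *p = head; p != NULL; p = p->next, i++) {
        flat[i].key = p->key;
        flat[i].next = (i + 1 < n) ? &flat[i + 1] : NULL;
    }
    return flat;
}
```

After flattening, following `next` visits consecutive addresses, so the list gains the spatial locality (and prefetchability) of an array. The caller releases the whole list with a single `free` on the returned pointer.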

51 Cache Based Side-Channel Attacks
- Encryption algorithms use keys (or blocks of the key) as indexes into tables containing constants used in the algorithm
- By observing which table entries caused cache misses, an attacker can find the address of the table entry and then the value of the key that was used
Two solutions:
1. Lock cache lines (so they cannot be displaced) when using encryption
2. Use a random replacement policy in selecting which line of a set is replaced
Z. Wang and R. Lee, “New cache designs for thwarting software cache based side channel attacks,” ISCA 2007.

52 Offloading Memory Management Functions
1. Dynamic memory management is the management of main memory for use by programs during runtime
2. Dynamic memory management accounts for a significant amount of execution time: 42% for 197.parser (from the SPEC 2000 benchmarks)
3. If the CPU is performing memory management, the CPU cache will perform poorly due to switching between user functions and memory management functions
4. If we have separate hardware and a separate cache for memory management, CPU cache performance can be improved dramatically

53 Offloading Memory Management Functions
[Figure labels: CPU, BIU, Instruction Cache, Data Cache, Interface (De-All Completion, Allocation Ready), Memory Processor, MP Inst. Cache, MP Data Cache, Second Level Cache, System Bus]

54 Improved Performance
- Object-oriented and linked-data-structure applications exhibit poor locality
- Cache pollution is caused by memory management functions
- Memory management functions do not use user data caches
- On average, about 40% of cache misses are eliminated
- The memory manager does not need large data caches

55 Improved Execution Performance
[Table columns: Name of Benchmark; % of cycles spent on malloc; Number of instructions in conventional architecture; Number of instructions in separated hardware implementation; % performance increase due to separate hardware implementation; % performance increase due to fastest separate hardware implementation]
[Table rows: 255.vortex, gzip, parser, espresso, Cfrac, bisort (values not recoverable)]

56 Other Uses of Hardware Memory Manager
- Dynamic relocation of objects to improve localities
  - The hardware manager can track object usage and relocate objects without the CPU’s knowledge
- New and innovative allocation and garbage collection methods
  - Estranged Buddy Allocator
  - Contaminated Garbage Collector
  - Predictive allocation to achieve “one-cycle” allocation
- Allocator bookkeeping data kept separate from objects
