
1 Trading Cache Hit Rate for Memory Performance Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli The Pennsylvania State University

2 Summary
Problem: Most data locality optimizations target cache locality exclusively, but "row-buffer locality" is also important. The problem is especially challenging in the case of irregular programs (sparse data).
Proposal: A compiler-runtime cooperative data layout optimization that improves row-buffer locality in irregular programs, yielding a ~17% improvement in overall application performance.

3 Outline: Background, Motivation, Conservative Layout, Fine-grain Layout, Related Work, Evaluation, Conclusion

4 DRAM Organization (figure: processor and memory controller (MC) connected over channels to DIMMs; each DIMM holds DRAM chips organized into ranks and banks, with a row buffer per bank; row-buffer locality arises when consecutive accesses hit the currently open row)

5 Irregular Programs
Real X(num_nodes), Y(num_edges);
Integer IA(num_edges, 2);
for (t = 1; t < T; t++) {
  /* If it is time to update the interaction list ... */
  for (i = 0; i < num_edges; i++) {
    X(IA(i, 1)) = X(IA(i, 1)) + Y(i);
    X(IA(i, 2)) = X(IA(i, 2)) - Y(i);
  }
}

6 Inspector/Executor Model
/* Executor */
Real X(num_nodes), Y(num_edges);
Real X'(num_nodes), Y'(num_edges);
Integer IA(num_edges, 2);
for (t = 1; t < T; t++) {
  X', Y' = Trans(X, Y);
  for (i = 0; i < num_edges; i++) {
    X'(IA(i, 1)) = X'(IA(i, 1)) + Y'(i);
    X'(IA(i, 2)) = X'(IA(i, 2)) - Y'(i);
  }
}
/* Inspector */
Trans(X, Y):
  for (i = 0; i < num_edges; i++) {
    /* data reordering algorithms */
  }
  return (X', Y')
The inspector/executor model is typically used for identifying parallelism or improving cache locality.

7 Outline: Background, Motivation, Conservative Layout, Fine-grain Layout, Related Work, Evaluation, Conclusion

8 Row-buffer Locality
Prior works that target irregular applications focus exclusively on improving cache locality; there have been no efforts to improve row-buffer locality.
Typical latencies (based on an AMD architecture):
– Last Level Cache (LLC) hit = 28 cycles
– Row-buffer hit = 90 cycles
– Row-buffer miss = 350 cycles
Application performance is dictated not only by the cache hit rate, but also by the row-buffer hit rate.

9 Example
Layout (b) eliminates the row-buffer miss caused by accessing 'y', assuming the move will not cause any additional cache misses.
Layout (c) eliminates the row-buffer misses caused by accessing 'v', even at the cost of an additional cache miss.

10 Outline: Background, Motivation, Conservative Layout, Fine-grain Layout, Related Work, Evaluation, Conclusion

11 Notations
Seq: the sequence of data elements obtained by traversing the index array
α_x: the access to a particular data element x in Seq
time(α_x): the "logical time stamp" of x in Seq
β_x: the memory block where data element x resides
α′_x: the most recent access to β_x before α_x
Caches(β_x): the set of cache blocks to which β_x can be mapped in a k-way set-associative cache

12 Definition
Block Distance: Given Caches(β_x) = Caches(β_y), the block distance between α_x and α_y, denoted Δ(α_y, α_x), is the number of distinct memory blocks that are mapped to Caches(β_x) and accessed during the time period between time(α_x) and time(α_y).

13 Lemma

14 Conservative Layout
Objective: increase the row-buffer hit rate without affecting cache performance.
Algorithm:
1. Identifying the locality sets
2. Constructing the interference graph
3. Assigning rows in memory

15 1. Identifying the Locality Sets

16 2. Constructing the Interference Graph
Each node represents a locality set. If α_x and α_y are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then an edge is added between the locality sets of x and y.
– The weight on this edge is the total number of such (α_x, α_y) pairs.

17 3. Assigning Rows in Memory
Sort the edges in the interference graph in decreasing order of weight, then greedily assign the same row to the locality sets connected by the highest-weight edge.

18 Outline: Background, Motivation, Conservative Layout, Fine-grain Layout, Related Work, Evaluation, Conclusion

19 Fine-grain Layout

20 Algorithm
1. Constructing the Interference Graph
2. Constructing the Locality Graphs
3. Finding Partitions
4. Assigning Rows in Memory

21 1. Constructing the Interference Graph
Each node in the interference graph represents a data element. If α_x and α_y are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then we add an edge between x and y.
– The weight on the edge is the number of such (α_x, α_y) pairs.

22 2. Constructing the Locality Graphs

23 3. Finding Partitions

24 4. Assigning Rows in Memory: each partition is assigned to a memory block in a row.

25 Example

26 Related Work
Inspector/Executor model
– Typically used for parallelism (Lawrence Rauchwerger [2]) and cache locality (Chen Ding [1])
– We use it to improve row-buffer locality, and our approach is complementary to theirs
Row-buffer locality
– Compiler approach: Mary W. Hall [3]
– Hardware approach: Al Davis [4]
– Our work specifically targets irregular applications

27 Outline: Background, Motivation, Conservative Layout, Fine-grain Layout, Related Work, Evaluation, Conclusion

28 Evaluation
Platform (modeled in GEM5):
  CPU:    12 cores; 2.6 GHz; 4 memory controllers
  Caches: 64KB per-core L1 (3 cycles); 512KB per-core L2 (12 cycles); 12MB per-socket shared L3 (28 cycles)
  Memory: DDR3-1866; 8 banks per channel; 8KB row-buffers
Benchmarks:
  Name    | Input Size | L3 Miss Rate | RB Miss Rate
  PSST    | 427.6 MB   | 18.1%        | 29.6%
  PaSTiX  | 511.6 MB   | 24.3%        | 41.7%
  SSIF    | 129.3 MB   | 13.7%        | 24.4%
  PPS     | 738.2 MB   | 21.4%        | 33.1%
  REACT   | 1.2 GB     | 28.6%        | 46.9%

29 Simulation Results (chart; labeled improvements: 6%, 15%, 27%, 12%, 17%)


31 Conclusion
Exploiting row-buffer locality is critical for application performance.
We proposed two compiler-directed data layout organizations aimed at improving row-buffer locality in irregular applications:
– Conservative layout: without affecting cache performance
– Fine-grain layout: trading cache performance for row-buffer locality

32 Thank You. Questions?

33 References
1. "Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time", ICPP 2012
2. "Sensitivity Analysis for Automatic Parallelization on Multi-Cores", ICS 2007
3. "A Compiler Algorithm for Exploiting Page-Mode Memory Access in Embedded DRAM Devices", MSP 2002
4. "Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement", ASPLOS 2010

34 BACKUP SLIDES

35 Results with an AMD-based System

36 Memory Scheduling

