The Hardness of Cache Conscious Data Placement. Erez Petrank, Technion; Dror Rawitz, Caesarea Rothschild Institute. Appeared in the 29th ACM Conference on Principles of Programming Languages.


The Hardness of Cache Conscious Data Placement Erez Petrank, Technion Dror Rawitz, Caesarea Rothschild Institute Appeared in the 29th ACM Conference on Principles of Programming Languages, Portland, Oregon, January 16, 2002

2 Agenda Background and motivation. The problem of cache-conscious data/code placement is extremely difficult in various models. Positive matching results (weak…). Some proof techniques and details. Conclusion.

3 Computers Today Memory speed falls behind processor speed, and the gap is still increasing. Solution: use a fast cache between memory and CPU. Implication: a program's cache behavior has a significant impact on its efficiency. (Diagram: CPU, cache, memory.)

4 Cache Structure A large memory divided into blocks. A small cache of k blocks. Memory blocks are mapped to cache blocks, e.g., by a modulus function (direct mapping). Cache hit: the accessed block is in the cache. Cache miss: the required block must be read into the cache from memory.
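The direct-mapped model on this slide is easy to make concrete. Below is a minimal sketch of a miss counter for it; the function name and the placement-as-dict representation are our own illustration, not from the talk. A placement assigns each object a memory block number, and a memory block maps to cache slot `block mod k`.

```python
def count_misses_direct(accesses, placement, k):
    """Count misses on a direct-mapped cache with k blocks.

    accesses:  sequence of object names.
    placement: dict mapping each object to a memory block number.
    A memory block maps to cache slot (block mod k); a hit occurs
    when the accessed object's block is already resident there.
    """
    cache = [None] * k          # cache[i] = memory block currently in slot i
    misses = 0
    for obj in accesses:
        block = placement[obj]
        slot = block % k        # direct mapping via modulus
        if cache[slot] != block:
            misses += 1         # miss: load the block, evicting the old one
            cache[slot] = block
    return misses
```

With k = 2, placing two alternately accessed objects in blocks 0 and 2 makes them collide on slot 0 and miss on every access, while blocks 0 and 1 give only the two compulsory misses; this is exactly the sensitivity to placement the talk is about.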

5 What can we do to improve program cache behavior? Arrange code/data to minimize cache misses. Write cache-conscious programs. In this work we concentrate on the first.

6 How do we place data (or code) optimally? Step 1: Discover future accesses to the data. Step 2: Find a placement of the data that minimizes the number of cache misses. Step 3: Rearrange the data in memory. Step 4: Run the program. Some “minor” problems: in Step 1, we cannot tell the future; in Step 2, we don't know how to do that.

7 Step 1: Discover future accesses to data Static analysis. Profiling. Runtime monitoring. This work: Even if future accesses are known exactly, Step 2 (placing data optimally) is extremely difficult.

8 The Problem Input: a set of objects O = {o_1,…,o_m} and a sequence of accesses σ = (σ_1,…,σ_n), e.g., σ = (o_1, o_3, o_7, o_1, o_2, o_1, o_3, o_4, o_1). Solution: a placement f : O → N. Measure: the number of misses. We want: a placement of o_1,…,o_m in memory that obtains the minimum number of cache misses (over all possible placements).

9 Our Results Can we (efficiently) find an optimal placement? No, unless P=NP.

10 Our Results Can we (efficiently) find an “almost” optimal placement? Almost = #misses ≤ twice the optimum. No, unless P=NP. Can we (efficiently) find a “fairly” optimal placement? Fairly = #misses ≤ 100 times the optimum. No, unless P=NP.

11 Our Results Can we (efficiently) find a “reasonable” placement? Reasonable = #misses ≤ log(n) times the optimum. No, unless P=NP. Can we (efficiently) find an “acceptable” placement? Acceptable = #misses ≤ n^0.99 times the optimum. No, unless P=NP.

12 The Main Theorem Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement which is within a factor of n^(1-ε) from the optimum, then P=NP. (The theorem holds for caches with more than 2 blocks.)

13 Extension to t-way Associative Caches A t-way associative cache has t·k blocks: k sets with t blocks per set. A memory block is mapped to a set; inside a set, a replacement protocol decides evictions. Theorem 2: the same hardness holds for t-way associative cache systems.
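The set-associative model can be sketched in the same style as the direct-mapped one. This is an illustrative simulator (names are ours) that uses LRU as the in-set replacement protocol; block b maps to set b mod k.

```python
from collections import OrderedDict

def count_misses_set_assoc(accesses, placement, k, t):
    """Count misses on a t-way set-associative cache with k sets
    and LRU replacement within each set.

    placement maps objects to memory block numbers; block b is
    mapped to set (b mod k), which holds at most t blocks.
    """
    sets = [OrderedDict() for _ in range(k)]  # insertion order = LRU order
    misses = 0
    for obj in accesses:
        block = placement[obj]
        s = sets[block % k]
        if block in s:
            s.move_to_end(block)              # hit: mark most recently used
        else:
            misses += 1
            if len(s) == t:
                s.popitem(last=False)         # evict least recently used
            s[block] = True
    return misses
```

For example, with a single 2-way set (k = 1, t = 2), three distinct objects accessed round-robin thrash the set, while two objects fit and incur only compulsory misses.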

14 The Result is “Robust” It holds for a variety of models, e.g.: the mapping of memory blocks to the cache is not by modulus; the replacement policy is not standard; object sizes are fixed (or they are not); objects must be aligned to cache blocks (or not); etc.

15 More Difficulties: Pairwise Information A practical problem: the access sequence σ is long, and processing it is costly. Solution in previous work: keep relations between pairs of objects. E.g., for each pair, how beneficial it is to put them in the same memory block; or, for each pair, how many misses would be caused by mapping the two to the same cache block.

16 Pairwise Information is Lossy Conclusion: even when given unrestricted time, finding an optimal pairwise placement is a bad idea (in the worst case). Theorem 3: There exists a sequence σ such that #misses(f) ≥ (k-3) · #misses(f*), where f is the optimal pairwise placement and f* is the optimal placement.

17 Pairwise Information: Hardness Result Theorem 4: Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement which is within a factor of n^(1-ε) from the optimum with respect to pairwise information, then P=NP. The proof is similar to the direct-mapping case.

18 A Simple Observation Input: objects O = {o_1,…,o_m} and an access sequence σ = (σ_1,…,σ_n). Any placement yields at most n cache misses, and at least 1 cache miss. Therefore, any placement is within a factor of n from the optimum. (Recall: a solution within n^(1-ε) is not possible.)

19 What about positive results? In light of the lower bound, not much can be done in general. Yet… Theorem 5: There exists a polynomial-time approximation algorithm that outputs a placement (always) within a factor of n/(c·log n) from the optimal placement, for any constant c. Compare: impossible: n^(1-ε); possible: n/(c·log n).

20 Other Problems Our problem: inapproximable, with an n/(c·log n)-approximation algorithm. Famous problems with similar results: minimum graph coloring, maximum clique.

21 Implications We cannot hope to find an algorithm that always gives a good placement; we must use heuristics. We cannot estimate the potential benefit of rearranging data in memory for cache behavior; we can only check what a proposed heuristic does on common benchmarks.

22 Some Proof Ideas (simplest case: direct mapping) Theorem 1: Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement which is within a factor of n^(1-ε) from the optimum, then P=NP. Proof: we show that if such an algorithm exists, then we can decide, for any given graph G, whether G is k-colorable.

23 The k-colorability Problem Problem: Given G=(V,E), is G k-colorable? Known to be NP-complete for k>2.

24 Translating a Graph G into a Cache Question
  Graph                →  Cache question
  Color                →  Cache line
  Vertex v_i           →  Object o_i
  Edge e = (v_i,v_j)   →  Subsequence σ_e = (o_i,o_j)^M (i.e., M repetitions of (o_i,o_j))
  Coloring             →  Placement

25 Translating a Graph G into a Cache Question A vertex v_i is represented by an object o_i: O_G = { o_i : v_i ∈ V }. Let ℓ = O(1/ε). Each edge (v_i,v_j) is represented by |E|^ℓ repetitions of the two objects o_i, o_j.
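The construction of the access sequence σ_G can be sketched directly from the slide. This is our own illustration: it takes the edge list and the repetition count M (in the proof, M = |E|^ℓ with ℓ = O(1/ε); any M works for the sketch) and emits M alternations of the two endpoint objects per edge, so that two objects sharing a cache line alternate-miss M times.

```python
def reduction_sequence(edges, M):
    """Build the access sequence sigma_G for the reduction sketch.

    edges: list of (i, j) vertex pairs; each edge contributes M
    repetitions of the subsequence (o_i, o_j). Vertices double as
    object names here for simplicity.
    """
    seq = []
    for (i, j) in edges:
        seq.extend([i, j] * M)   # M alternations of the edge's two endpoints
    return seq
```

If a placement "colors" the two endpoints with the same cache line, the M alternations cause roughly 2M misses; with different lines, only the compulsory misses remain, which is what separates the k-colorable and non-k-colorable cases.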

26 Examples: G_1 (not 3-colorable), G_2 (3-colorable). (Figures omitted.)

27 Properties of the Translation Length of σ: n = O(|E|^(ℓ+1)). Case I: G is k-colorable. Then Opt(O_G, σ_G) = O(|E|) = O(n^(1/(ℓ+1))) = O(n^(ε/2)). Case II: G is not k-colorable. Then Opt(O_G, σ_G) = Ω(|E|^ℓ) = Ω(n^(ℓ/(ℓ+1))) = Ω(n^(1-ε/2)). Thus, an algorithm that provides a placement within n^(1-ε) of the optimum can distinguish between the two cases!

28 Replacement Protocol Relevant for t-way caches: which object is removed when a set is full? We need a replacement policy (protocol). Examples: LRU, FIFO, …

29 Replacement Protocol Problem: a replacement protocol may behave badly, e.g., recently used objects are removed (most-recently-used replacement), in which case any placement is good. Solution: only “reasonable” caches are considered. Problem: how do we define reasonable?

30 Sensible Caches Let σ = (o_1,…,o_q) with o_i ≠ o_j for all i ≠ j, such that no more than t objects are mapped to the same cache set. A cache is C-sensible if σ causes at most q + C misses when accessed after any σ′, i.e., O(1) misses after the first access to σ. E.g., LRU is 0-sensible.

31 Pseudo-LRU Used in the Intel® Pentium® (4-way associative). Each set has two pairs, an LRU-block flag for each pair, and an LRU-pair flag. Replacement protocol: the LRU block in the LRU pair is removed. Pseudo-LRU is 0-sensible.
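The flag scheme on this slide can be sketched for one 4-way set. This is an illustrative model of tree pseudo-LRU following the slide's description (an LRU-pair flag plus one LRU-block flag per pair), not a claim about the exact Pentium implementation; the class and method names are ours.

```python
class PseudoLRUSet:
    """One 4-way set with tree pseudo-LRU replacement.

    Ways 0,1 form pair 0; ways 2,3 form pair 1. The victim on a
    full-set miss is the LRU block of the LRU pair.
    """
    def __init__(self):
        self.ways = [None] * 4        # resident memory blocks
        self.pair_flag = 0            # which pair is the LRU candidate
        self.block_flag = [0, 0]      # per pair: which block is LRU

    def _touch(self, way):
        pair, blk = divmod(way, 2)
        self.pair_flag = 1 - pair     # the other pair becomes LRU candidate
        self.block_flag[pair] = 1 - blk

    def access(self, block):
        """Access a block; return True on a miss, False on a hit."""
        if block in self.ways:        # hit: just update the flags
            self._touch(self.ways.index(block))
            return False
        # Miss: fill an empty way if any, else evict the pseudo-LRU victim.
        way = (self.ways.index(None) if None in self.ways
               else 2 * self.pair_flag + self.block_flag[self.pair_flag])
        self.ways[way] = block
        self._touch(way)
        return True
```

Accessing four distinct blocks fills the set with four compulsory misses, after which re-accessing any of them hits, which is the behavior the 0-sensibility claim relies on.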

32 t-way Associative Caches: Some Proof Ideas Theorem 2: Let ε be any real number, 0 < ε < 1. If there is a polynomial-time algorithm that finds a placement which is within a factor of n^(1-ε) from the optimum, then P=NP. Proof: if such an algorithm exists, then we can decide for any given graph G whether G is k-colorable. Main idea: a sensible replacement protocol is “forced” to behave as a direct-mapped cache; we do this by using dummy objects.

33 How do we construct the approximation algorithm? If there are ≥ c·log n objects, then Opt ≥ c·log n, so any placement is an (n/(c·log n))-approximation. Otherwise, there are < c·log n objects in σ; in this case, we can find an optimal placement by examining all possible placements.
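The second branch of the algorithm, exhaustive search when there are few objects, can be sketched as follows. This is our own illustration: a placement is modeled as a direct assignment of objects to cache lines (with objects aligned to blocks, only the line matters), and every assignment is tried. With m < c·log n objects and constant k, the k^m assignments number n^(c·log k), which is polynomial.

```python
from itertools import product

def optimal_placement(accesses, k):
    """Brute-force the optimal placement for a direct-mapped cache
    with k lines, feasible when the number of distinct objects is
    small (the < c*log n case of the approximation algorithm).

    Returns (placement, misses) with the fewest misses found.
    """
    objs = sorted(set(accesses))
    best = None, float('inf')
    for lines in product(range(k), repeat=len(objs)):
        placement = dict(zip(objs, lines))
        cache = [None] * k
        misses = 0
        for o in accesses:
            slot = placement[o]        # object o lives on cache line `slot`
            if cache[slot] != o:
                misses += 1
                cache[slot] = o
        if misses < best[1]:
            best = placement, misses
    return best
```

For the alternating sequence (o_1,o_2)^M on a 2-line cache, the search finds the placement that separates the two objects and achieves only the two compulsory misses.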

34 Unrestricted Alignments and Non-Uniform-Size Objects Two simplifying assumptions: uniform-size objects; aligned objects. Can we get the same results without them? Lower bound? Upper bound?

35 Problem with Unrestricted Alignments A cache with 3 blocks; 2 objects o_1, o_2, each of size 1.5 blocks. The sequence: (o_1,o_2)^M. Restricted alignment: O(M) misses. Unrestricted alignment: 3 misses (if they are placed consecutively in memory).

36 Unrestricted Alignments (uniform instances) Aligned version of a placement: f → g. Direct mapping: #misses(g) ≤ #misses(f), so the lower bounds carry over. Associative caches: #misses(g) = O(#misses(f)) + O(|E|) when (O,σ) = H(G), where H denotes the reduction.

37 Conclusion Computing the best placement of data in memory (w.r.t. reducing cache misses) is extremely difficult; we cannot even get close (if P ≠ NP). There exists a matching (weak) positive result. Implications: using heuristics cannot be avoided, and we cannot hope to evaluate the potential benefit.

38 An Open Question: Can we classify programs for which the problem becomes simpler?

39 The end