1 Advancement of Buffer Management Research and Development in Computer and Data Systems
Xiaodong Zhang, The Ohio State University

2 Numbers Everyone Should Know (Jeff Dean, Google)
L1 cache reference: 0.5 ns
Branch mis-predict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 25 ns
Main memory reference: 100 ns
Compress 1K bytes with Zippy: 3,000 ns
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within data center: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from disk: 20,000,000 ns
Send one packet from CA to Europe: 150,000,000 ns

3 Replacement Algorithms in Data Storage Management
A replacement algorithm decides which data entry to evict when the data storage is full.
Objective: keep the data that will be reused and replace the data that will not.
The decision is critical: a miss means an increasingly long delay.
Replacement is used in all memory-capable digital systems: small buffers (cell phones, Web browsers, set-top boxes, ...) and large buffers (virtual memory, I/O buffers, databases, ...).
A simple concept, but hard to optimize: more than 40 years of tireless algorithmic and system efforts, yet LRU-like algorithms and implementations still have serious limitations.

4 Least Recently Used (LRU) Replacement
LRU is the most commonly used replacement policy for data management.
Blocks are ordered in an LRU stack: they enter from the top (MRU position) and leave from the bottom (LRU position); the stack is long, and the bottom is the only exit.
Recency is the distance from a block to the top of the LRU stack (e.g., the recency of block 2 in the figure is 2).
Upon a hit (e.g., to block 2), the block is moved to the top of the stack.

5 Least Recently Used (LRU) Replacement
Upon a miss (e.g., to block 6), the block is loaded from disk and placed on the stack top, and the block at the stack bottom (block 1 in the figure) is evicted.
In summary: upon a hit, move the block to the top; upon a miss, evict the block at the bottom and put the missed block on the top. A minimal sketch of these two operations follows.
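A minimal sketch of the LRU stack described on these two slides, assuming Python's OrderedDict as the stack; class and method names are illustrative, not from the talk.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU stack: the most-recently-used end is the 'top' of the stack."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.stack = OrderedDict()          # ordered from LRU (front) to MRU (end)

    def access(self, block):
        if block in self.stack:             # hit: move the block to the stack top
            self.stack.move_to_end(block)
            return True
        if len(self.stack) >= self.capacity:
            self.stack.popitem(last=False)  # miss on a full stack: evict the bottom block
        self.stack[block] = True            # place the missed block on the top
        return False
```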

6 LRU is a Classical Problem in Theory and Systems
First LRU paper: L. Belady, IBM Systems Journal, 1966.
Analyses of LRU algorithms: Aho, Denning & Ullman (JACM, 1971); Rivest (CACM, 1976); Sleator & Tarjan (CACM, 1985); Knuth (J. Algorithms, 1985); Karp et al. (J. Algorithms, 1991).
Many papers in systems and databases: ASPLOS, ISCA, SIGMETRICS, SIGMOD, VLDB, USENIX, ...

7 The Problem of LRU: Inability to Deal with Certain Access Patterns
File scanning: one-time accessed data evict to-be-reused data (cache pollution). This is a common access pattern (50% of the data in NCAR traces is accessed only once), yet the LRU stack holds such blocks until they reach the bottom.
Loop-like accesses: a loop over k+1 blocks misses on every access with an LRU stack of size k (see the sketch below).
Accesses with different frequencies (mixed workloads): frequently accessed data can be replaced by infrequently accessed data.
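The loop pathology is easy to reproduce. A self-contained sketch (the loop size k and the number of iterations are made-up example values):

```python
from collections import OrderedDict

def lru_hit_count(trace, size):
    """Count the hits of a plain LRU cache of `size` blocks over `trace`."""
    cache, hits = OrderedDict(), 0
    for b in trace:
        if b in cache:
            hits += 1
            cache.move_to_end(b)            # hit: move the block to the top
        else:
            if len(cache) >= size:
                cache.popitem(last=False)   # miss: evict the bottom block
            cache[b] = True
    return hits

k = 4
trace = list(range(k + 1)) * 10             # loop over k+1 blocks, repeated 10 times
print(lru_hit_count(trace, k))              # 0: each block is evicted just before its reuse
```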

8 Why Flawed LRU is so Powerful in Practice
What is the major flaw? The assumption that "recently used will be reused" is not always right: the prediction is based on a single, simple metric (recency), so some blocks are cached too long and some are evicted too early.
Why is it so widely used? It works well for accesses that follow the LRU assumption, and it needs only a simple data structure to implement.

9 Challenges of Addressing the LRU Problem
Two types of efforts have been made: detect specific access patterns and handle them case by case, or learn insights into accesses with complex algorithms; most published papers could not be turned into reality.
Two critical goals: fundamentally address the LRU problem while retaining LRU's merits (low overhead and its basic assumption).
The goals are achieved by a set of three papers: the LIRS algorithm (SIGMETRICS'02), CLOCK-Pro, a system implementation (USENIX'05), and BP-Wrapper, lock-contention-free assurance (ICDE'09).

10 Outline
The LIRS algorithm: how the LRU problem is fundamentally addressed, and how a data structure with low complexity is built.
CLOCK-Pro: turning the LIRS algorithm into system reality.
BP-Wrapper: freeing lock contention so that LIRS and other algorithms can be implemented without approximation.
What would we do for multicore processors?
Research impact in daily computing operations.

11 Recency vs Reuse Distance
Recency: the distance from a block's last reference to the current time (equivalently, its distance from the top of the LRU stack).
Reuse distance (inter-reference recency): the number of other distinct blocks accessed between two consecutive references to the block; this is deeper and more useful information.
LRU uses recency to quantify locality, but a block's recency keeps changing over time, so a single sampled recency value says little about locality strength. For example, a block that is regularly re-accessed after every 4 other distinct blocks has stable locality, yet its recency cycles from 0 to 4; which of the five values describes its locality strength?
Not just any recency value, but the recency at which the block is actually re-accessed, should be used to quantify locality; this quantity is the IRR.

12 Recency vs Reuse Distance
Inter-reference recency (IRR): the number of other unique blocks accessed between two consecutive references to a block. In the example access stream, block 3 is re-referenced after two other distinct blocks (4 and 5), so its IRR is 2, even though its recency changes between the two references.
IRR is the recency the block had when it was last accessed; measuring it needs an extra stack to help, which increases complexity. A small sketch of the computation follows.
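A small sketch of how IRR (reuse distance) is computed over an access trace; the function name and the example trace are illustrative, not from the talk.

```python
def irr_trace(trace):
    """IRR of each access: the number of distinct *other* blocks referenced
    since the previous access to the same block (None on a first access)."""
    last_seen, irrs = {}, []
    for i, block in enumerate(trace):
        if block in last_seen:
            irrs.append(len(set(trace[last_seen[block] + 1:i]) - {block}))
        else:
            irrs.append(None)
        last_seen[block] = i
    return irrs

# Block 3 is re-referenced after blocks 4 and 5 are touched, so its IRR is 2,
# even though its recency keeps changing between the two references.
print(irr_trace([3, 4, 4, 5, 3]))   # [None, None, 0, None, 2]
```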

13 Diverse Locality Patterns on an Access Map
Access map of the multi2 trace, collected on a machine running multiple programs together. The x axis is virtual time (one unit per block reference); the y axis is the logical block number; each point is one reference to a block.
The references exhibit diverse access patterns: some blocks are intensively re-accessed (strong locality), some are re-accessed regularly (loop-like accesses), and some are accessed only once.

14 What Blocks does LRU Cache (measured by IRR)?
IRR map of the MULTI2 trace: the x axis is virtual time, the y axis is the IRR (reuse distance in blocks) of each reference; the lower a point, the stronger its locality.
LRU holds frequently accessed blocks with "absolutely" strong locality, but it also holds one-time accessed blocks (zero locality) and is likely to replace other blocks with relatively strong locality.
The IRR map makes LRU's performance easy to interpret: with a cache of 1000 blocks, only the references below the cache-size line are hits. For a workload with weak locality (few references in the low area), LRU has a low hit ratio; in the extreme case where all references are above the line, the hit ratio is zero.

15 LIRS: Only Cache Blocks with Low Reuse Distances
LIRS holds the blocks with the strongest locality, ranked by reuse distance (IRR).
Because LRU does not quantify locality strength correctly, it can cache many weak-locality blocks and leave the cache under-utilized. Using IRR instead, the goal is to dynamically maintain the set of blocks with the strongest locality, with the set size equal to the cache size, and cache exactly those blocks.
Picture a curve on the IRR map that covers the strongest-locality references, chosen so that the number of distinct blocks below it equals the cache size. The curve adapts to the access pattern: when locality becomes weak (more points in the upper area) it climbs up; when locality becomes strong it slips down. At any time the 1000 strongest-locality blocks are selected for caching and the cache is fully utilized.
All references covered by the LRU line are also covered by this curve, so an algorithm caching the blocks under the curve has a hit ratio at least as high as LRU's, and potentially much higher for weak-locality references. The LIRS algorithm realizes this idea.

16 Basic Ideas of LIRS (SIGMETRICS’02)
LIRS: Low Inter-reference Recency Set.
Low-IRR blocks are kept in the buffer cache; high-IRR blocks are the candidates for replacement.
Two stacks are maintained: a large LRU stack contains the low-IRR (LIR) resident blocks and also records resident and non-resident high-IRR (HIR) blocks; a small LRU stack contains the resident high-IRR blocks. IRRs are measured by the two stacks.
Upon a hit to a resident high-IRR block in the small stack: if the block also has a record in the large stack, its new IRR is low, so it becomes a low-IRR block and moves to the top of the large stack; otherwise it remains high-IRR and is moved to the top of the small stack.
When a block is promoted and the LIR set is already full, the low-IRR block at the bottom of the large stack becomes a high-IRR block and goes to the small stack. A simplified two-stack sketch follows.
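A highly simplified sketch of the two-stack structure described above (the large LIRS stack S and the small stack Q of resident HIR blocks). This is an illustration of the idea under the assumption Lhirs >= 1, not the paper's reference implementation; class and variable names are invented, and corner cases of the full algorithm are omitted.

```python
from collections import OrderedDict

class SimpleLIRS:
    """Sketch of LIRS: S is the large stack (LIR blocks plus records of
    resident/non-resident HIR blocks), Q holds the resident HIR blocks."""
    def __init__(self, llirs, lhirs):
        self.llirs, self.lhirs = llirs, lhirs
        self.S = OrderedDict()          # stack order: bottom (front) -> top (end)
        self.Q = OrderedDict()
        self.is_lir = {}                # block -> True if it currently has LIR status
        self.resident = set()
        self.nlir = 0

    def _prune(self):
        # Remove HIR records from the bottom of S until a LIR block sits there.
        while self.S and not self.is_lir.get(next(iter(self.S)), False):
            self.S.popitem(last=False)

    def access(self, b):
        hit = b in self.resident
        if not hit and len(self.resident) >= self.llirs + self.lhirs:
            victim, _ = self.Q.popitem(last=False)   # evict the resident HIR at Q's bottom
            self.resident.discard(victim)            # its record may stay in S (non-resident)
        self.resident.add(b)

        if self.is_lir.get(b, False):                # hit on a LIR block
            self.S.move_to_end(b)
            self._prune()
        elif b in self.S:                            # HIR block with a record in S: promote
            self.S.move_to_end(b)
            self.is_lir[b] = True
            self.nlir += 1
            self.Q.pop(b, None)
            bottom, _ = self.S.popitem(last=False)   # demote the LIR block at S's bottom
            self.is_lir[bottom] = False
            self.nlir -= 1
            self.Q[bottom] = True
            self._prune()
        else:                                        # no record in S
            self.S[b] = True                         # push a new record on top of S
            if self.nlir < self.llirs:               # warm-up: fill the LIR set first
                self.is_lir[b] = True
                self.nlir += 1
            else:                                    # stays HIR, resident in Q
                self.is_lir[b] = False
                self.Q[b] = True
                self.Q.move_to_end(b)
        return hit
```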

17 Low Complexity of LIRS
Both recencies and IRRs are recorded implicitly in the two stacks: the block at the bottom of the LIRS stack has the maximum recency, and a block has low IRR if it can be found in both stacks, so no explicit comparisons or measurements are needed.
Complexity of LIRS = complexity of LRU = O(1), despite the additional block movements between the two stacks and the pruning operations.

18 Data Structure: Keep LIR Blocks in Cache
Blocks are divided into two sets: low-IRR (LIR) blocks and high-IRR (HIR) blocks. The LIR block set occupies Llirs blocks of the physical cache, the resident HIR blocks occupy Lhirs blocks, and the cache size is L = Llirs + Lhirs.

19 LIRS Operations
Example configuration: cache size L = 5, Llir = 3, Lhir = 2; blocks are marked LIR or HIR, resident or non-resident.
Initialization: all referenced blocks are given LIR status until the LIR block set is full; resident HIR blocks are placed in a small LRU stack (Q).
Three cases are then handled: accessing an LIR block (a hit), accessing a resident HIR block (a hit), and accessing a non-resident HIR block (a miss).

20 Access an LIR Block (a Hit)
Block 4, an LIR block, is accessed (a hit): move it to the top of stack S.

21 Access an LIR Block (a Hit)
Block 8, another LIR block, is accessed: move it to the top of stack S. HIR blocks may now sit at the bottom, so stack pruning is performed to make sure an LIR block is at the bottom.

22 Access an LIR block (a Hit)
(Figure: the resulting state of the stacks after the hit and the pruning.)

23 Access a Resident HIR Block (a Hit)
Block 3, a resident HIR block, is accessed (a hit): move it to the top of stack S. Because it already had a record in S (its new IRR is less than Rmax), it becomes an LIR block; accordingly, the LIR block 1 at the bottom of S is demoted to an HIR block and enters stack Q. Stack pruning is then performed.

24 Access a Resident HIR Block (a Hit)
(Figure: the stacks after block 3 is promoted and block 1 is demoted.)

25 Access a Resident HIR Block (a Hit)
(Figure: the stacks after the pruning that follows the promotion.)

26 Access a Resident HIR Block (a Hit)
Block 5 is then accessed: it remains an HIR block because it had no record in stack S.

27 Access a Non-Resident HIR block (a Miss)
Block 7, a non-resident HIR block, is accessed: a miss. A free buffer is needed, so the resident HIR block at the bottom of stack Q is replaced. Block 7 is placed at the top of stack S as an HIR block and also added to Q.

28 Access a Non-Resident HIR block (a Miss)
Block 9 is accessed: another miss. This time the block at the bottom of stack Q, block 5, is replaced and leaves Q, but it remains in the LIRS stack as a non-resident HIR block.

29 Access a Non-Resident HIR block (a Miss)
Block 5 is accessed again: it becomes an LIR block because it still has a record in stack S.

30 Access a Non-Resident HIR block (a Miss)
The overhead of these stack operations in LIRS is almost as low as that of LRU.

31 A Simplified Finite Automata for LRU
Upon a block access, the operations on the LRU stack are: hit (place the block on the top), miss (fetch the data and place the block on the top), and block eviction (remove the block at the bottom).

32 A Simplified Finite Automata for LIRS
Upon a block access, operations occur on the LIR stack and on the HIR stack. A hit on the LIR stack is handled within it (followed by pruning). A hit on the HIR stack, or a miss on a block that still has a record, triggers promotion to the LIR stack and the demotion of an LIR block to the HIR stack. A miss with no record adds a record for the new resident HIR block on the HIR stack. Block evictions are made from the HIR stack.

34 How LIRS addresses the LRU problem
File scanning: one-time accessed blocks are replaced promptly, due to their high IRRs.
Loop-like accesses: a section of the loop data is protected in the low-IRR stack; misses happen only among the high-IRR blocks.
Accesses with distinct frequencies: frequently accessed blocks with short reuse distances are NOT replaced, thanks to dynamic status changes.

35 Performance Evaluation
Trace-driven simulation on different access patterns shows that LIRS outperforms existing replacement algorithms in almost all cases.
The performance of LIRS is not sensitive to its only parameter, Lhirs.
Performance is not affected even when the LIRS stack size is bounded.
The time/space overhead is as low as that of LRU.
LRU is a special case of LIRS (without recording resident and non-resident HIR blocks in the large stack).

36 Looping Pattern: postgres (Time-space map)

37 Looping Pattern: postgres (IRR Map)
IRR map of the postgres trace, with the stack-size lines of LRU and LIRS for a 500-block cache. While LRU covers only the references below its fixed stack-size line, LIRS adaptively changes its stack size to cover the currently strong-locality blocks: when locality becomes weak (more references in the upper IRR area) its size goes up; otherwise it goes down. This adaptation aims to identify and cache the 500 blocks with the currently strongest locality.

38 Looping Pattern: postgres (Hit Rates)

39 Two Technical Issues to Turn it into Reality
High implementation overhead: for each data access, a set of operations defined by the replacement algorithm (e.g., LRU or LIRS) must be performed. This is not affordable in real systems (OS kernels, buffer caches, ...), so an approximation with reduced operations is required in practice.
High lock-contention cost: for concurrent accesses, the stack(s) must be locked for each operation, and lock contention limits the scalability of the system.
CLOCK-Pro and BP-Wrapper address these two issues, respectively.

40 Only Approximations can be Implemented in OS
The dynamic stack operations of LRU and LIRS cause computing overhead, so OS kernels cannot adopt them directly; an approximation reduces the overhead at the cost of lower accuracy.
The CLOCK algorithm, an approximation of LRU, was first implemented in the Multics system in 1968 at MIT by Corbató (1990 Turing Award laureate).
Objective: a LIRS approximation for OS kernels.

41 Basic Operations of CLOCK Replacement
All the resident pages are placed around a circular list, like a clock; a clock hand turns in the clockwise direction to search for victim pages. Each page is associated with a reference bit indicating whether the page has been accessed.
Upon a hit: the reference bit is set to 1, automatically by the hardware; no algorithm operations are needed.

42 Basic CLOCK Replacement (on a sequence of two misses)
Upon a miss: starting from the currently pointed page, the clock hand moves until it reaches a page with reference bit 0, which is evicted; each "1" page passed on the way is given a second chance and has its bit reset to 0.
The missed page is inserted at the hand position with its reference bit initialized to 0.
CLOCK simulates LRU replacement very well, and its hit ratios are very close to LRU's. A minimal sketch of the hit and miss handling follows.
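A minimal sketch of the CLOCK mechanism from these two slides (hit: set the reference bit; miss: sweep the hand, give "1" pages a second chance, evict the first "0" page, insert the new page with its bit at 0). Names are illustrative, not from any particular kernel.

```python
class Clock:
    """Minimal CLOCK sketch: a circular buffer of [page, reference_bit] frames."""
    def __init__(self, nframes):
        self.frames = [None] * nframes      # each slot is [page, ref_bit] or None
        self.where = {}                     # page -> slot index
        self.hand = 0

    def access(self, page):
        if page in self.where:              # HIT: just set the reference bit
            self.frames[self.where[page]][1] = 1
            return True
        while True:                         # MISS: sweep the clock hand
            slot = self.frames[self.hand]
            if slot is None or slot[1] == 0:
                if slot is not None:
                    del self.where[slot[0]]         # evict the "0" page under the hand
                self.frames[self.hand] = [page, 0]  # insert the missed page, bit = 0
                self.where[page] = self.hand
                self.hand = (self.hand + 1) % len(self.frames)
                return False
            slot[1] = 0                     # second chance: reset "1" to "0" and move on
            self.hand = (self.hand + 1) % len(self.frames)
```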

43 Unbalanced R&D on LRU versus CLOCK
LRU-related work: FBR (1990, SIGMETRICS), LRU-2 (1993, SIGMOD), 2Q (1994, VLDB), SEQ (1997, SIGMETRICS), LRFU (1999, OSDI), EELRU (1999, SIGMETRICS), MQ (2001, USENIX), LIRS (2002, SIGMETRICS), ARC (2003, FAST, IBM patent).
CLOCK-related work: CLOCK (1968, Corbató), GCLOCK (1978, ACM TODS), CAR (2004, FAST, IBM patent), CLOCK-Pro (2005, USENIX).
Because replacement algorithms are important and LRU's performance problems are well known, a large number of new algorithms have been proposed, but almost all of them target LRU; the list for CLOCK is much shorter. The very stringent low-cost requirement poses a big challenge for inventing new VM replacement algorithms. Recently IBM researchers proposed the clock-based CAR, and we proposed CLOCK-Pro.

44 Basic Ideas of CLOCK-Pro
CLOCK-Pro is an approximation of LIRS built on the CLOCK infrastructure, following the same principle as LIRS.
Pages are categorized into two groups, cold pages and hot pages, based on their reuse distances (IRRs).
There are three hands: HAND-hot for hot pages, HAND-cold for cold pages, and HAND-test for running a reuse-distance test on a block.
The allocation of memory pages between hot pages (Mhot) and cold pages (Mcold) is adjusted adaptively, with M = Mhot + Mcold.
All hot pages are resident (= LIR blocks); some cold pages are also resident (= resident HIR blocks); recently replaced pages are tracked (= non-resident HIR blocks).

45 CLOCK-Pro (USENIX'05)
Pages on the clock are hot, cold resident, or cold non-resident; all hands move in the clockwise direction.
HAND-hot finds a hot page to be demoted into a cold page.
HAND-cold finds a cold page for replacement.
HAND-test (1) determines whether a cold page should be promoted to hot and (2) removes non-resident cold pages from the clock; each cold page is given a test period to test its reuse distance.
A resident cold page exists for one of two reasons: it is a fresh replacement (a first access), or it was demoted from a hot page.

46

47

48 Concurrency Management in Buffer Management
The buffer pool (in DRAM) keeps hot pages between page accesses and the hard disk; maximizing the hit ratio is the key.
The hit ratio is largely determined by the effectiveness of the replacement algorithm (LRU-k, 2Q, LIRS, ARC, ...), which decides which pages are kept and which are evicted.
For concurrent accesses to the buffer cache, the replacement management sits inside a critical section: a lock (latch) is required to serialize the update after each page request.

49 Accurate Algorithms and Their Approximations
Accurate algorithms: LRU, LIRS, ARC, ...; their approximations: CLOCK (for LRU), CLOCK-Pro (for LIRS), CAR (for ARC).
CLOCK sets the reference bit to 1 without a lock on a page hit; lock synchronization is used only for misses.
The clock approximations reduce lock contention at the price of lower hit ratios.

50 History of Buffer Pool's Caching Management in PostgreSQL
LRU (suffered lock contention moderately, due to low concurrency at the time).
LRU-k (hit ratio outperformed LRU, but lock contention became more serious).
2004: ARC/CAR were implemented, but quickly removed due to an IBM patent.
2005: 2Q was implemented (hit ratios were further improved, but lock contention was high).
2006 to now: CLOCK (an approximation of LRU; lock contention is reduced, but the hit ratio is the lowest of all the previous ones).

51 Trade-offs between Hit Ratios and Low Lock Contention
LRU-k, 2Q, LIRS, ARC, SEQ, ... aim for high hit ratios, but require lock synchronization to modify data structures and update page metadata.
CLOCK, CLOCK-Pro, and CAR aim for high scalability with little lock synchronization, but the clock-based approximations lower hit ratios compared with the original algorithms, the transformation can be difficult and demands great effort, and some algorithms have no clock-based approximation at all.
Our goal: to have both.

52 Reducing Lock Contention by Batching Requests
One batch queue per thread: on a page hit, the page request is fulfilled by fetching the page directly, and the access is only recorded in the thread's batch queue.
The accumulated access history is later committed to the replacement algorithm (which modifies its data structures) as a set of replacement operations under a single lock acquisition. A sketch of this batching follows.
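A sketch of the batching idea, not the actual BP-Wrapper code: each thread records page hits in a private queue and only takes the replacement algorithm's lock once per batch. The batch size and `policy.record_hit` are hypothetical, standing in for "modify data structures".

```python
import threading

class BatchedPolicyWrapper:
    """Per-thread batching of page-hit bookkeeping (BP-Wrapper-style sketch)."""
    BATCH_SIZE = 64

    def __init__(self, policy):
        self.policy = policy                # hypothetical object with record_hit(page)
        self.lock = threading.Lock()
        self.tls = threading.local()

    def on_page_hit(self, page):
        if not hasattr(self.tls, "queue"):
            self.tls.queue = []
        q = self.tls.queue
        q.append(page)                      # record the access without taking the lock
        if len(q) >= self.BATCH_SIZE:       # commit the whole batch under one acquisition
            with self.lock:
                for p in q:
                    self.policy.record_hit(p)
            q.clear()
```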

53 Reducing Lock Holding Time by Prefetching
Without prefetching, a thread stalls on data-cache misses while holding the lock, lengthening the critical section for other threads.
With prefetching, the thread reads the data that will be accessed in the critical section by the replacement algorithm immediately before the lock is requested, so the cache-miss stalls occur outside the critical section and the lock holding time shrinks.
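A structural sketch of the prefetching idea: touch the metadata the replacement algorithm will need before acquiring the lock, so the likely cache misses are taken outside the critical section. All names here are hypothetical, and Python only illustrates the code structure, not actual CPU-cache behavior.

```python
import threading

def record_access(page_meta, policy, lock: threading.Lock):
    # Pre-read ("warm up") the fields the replacement algorithm will touch,
    # BEFORE the lock is requested; any cache-miss stall happens out here.
    _ = (page_meta.block_id, page_meta.ref_count)   # hypothetical fields
    with lock:
        # Short critical section: the data is likely already cached.
        policy.update_on_access(page_meta)          # hypothetical method
```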

54 Lock Contention Reduction by BP-Wrapper (ICDE’09)
Lock contention: a lock cannot be obtained without blocking.
Metric: the number of lock acquisitions (contentions) per million page accesses.
BP-Wrapper reduces lock contention by over 7000 times.

55 Impact of LIRS in Academic Community
LIRS is a benchmark against which replacement algorithms are compared.
Reuse distance was first used in replacement-algorithm design.
A paper in SIGMETRICS'05 confirmed that LIRS outperforms all the other replacement algorithms.
LIRS has become a topic taught in graduate and undergraduate classes on OS, performance evaluation, and databases at many US universities.
The LIRS paper (SIGMETRICS'02) is highly and continuously cited.
The Linux memory-management group has established an Internet forum on advanced replacement for LIRS.

56 LIRS has been adopted in MySQL
MySQL is the most widely used relational database, with 11 million installations in the world; the busiest Internet services use MySQL to maintain databases for high-volume Web sites: Google, YouTube, Wikipedia, Facebook, Taobao, ...
LIRS manages the buffer pool of MySQL; the adoption is in the most recent version (5.1), released in November 2008.

57

58

59 Infinispan (Java-based open-source software)
The data grid forms a huge in-memory cache that is managed using LIRS; BP-Wrapper is used to keep it free of lock contention.

60 ConcurrentLinkedHashMap as a Software Cache
A linked-list structure (a Java class) whose elements are linked and managed using the LIRS replacement policy; BP-Wrapper keeps it free of lock contention.

61 LIRS in Management of Big Data
LIRS has been adopted in GridGain software, a Java-based open-source middleware for real-time big-data processing and analytics (www.gridgain.com); LIRS makes the replacement decisions for the in-memory data grid. Over 500 products and organizations use GridGain software daily: Sony, Cisco, Canon, Johnson & Johnson, Deutsche Bank, ...
LIRS has also been adopted in SYSTAP's storage management for big-data scale-out storage systems (www.bigdata.com).

62 LIRS in Functional Programming Language: Clojure
Clojure is a dynamic programming language that targets the Java Virtual Machine (http://clojure.org); it is a dialect of Lisp, functional, designed for concurrency, and used by many organizations.
LIRS is a member of the Clojure library as LIRSCache.

63 LIRS Principle in Hardware Caches
A hardware cache-replacement implementation based on Re-Reference Interval Prediction (RRIP), presented at ISCA'10 by Intel.
Two bits are added to each cache line to measure reuse distance in static and dynamic ways.
Performance gains are up to 4-10%, but the hardware cost may not be affordable in practice.

64 Impact of Clock-Pro in OS and Other Systems
CLOCK-Pro has been adopted in FreeBSD/NetBSD (open-source Unix systems).
Two patches bring it to the Linux kernel: the clock-pro patches by Rik van Riel and PeterZClockPro2 by Peter Zijlstra.
CLOCK-Pro has also been patched into Apache Derby (a relational database) and OpenLDAP (directory accesses).

65 Impact of Multicore Processors in Computer Systems
Dell Precision GX620, purchased in 2004: a single core with small L1 caches (8 KB L1D + 12 KB L1I) and a 512 KB-2 MB L2 cache, 256 MB of memory, and a disk.
Dell Precision 1500, purchased in 2009 at a similar price: multiple cores, each with private L1 and L2 caches, a shared 8 MB L3 cache, 8 GB of memory, and a disk.

66 Performance Issues w/ the Multicore Architecture
Slow data accesses to memory and disks continue to be the major bottlenecks, and almost all the CPUs in Top-500 supercomputers are multicores.
Cache contention and pollution: conflict cache misses among multiple threads can significantly degrade performance.
Memory bus congestion: bandwidth is limited as the number of cores increases.
"Disk wall": data-intensive applications also demand high throughput from disks.

67 Throughput = Concurrency/Latency
Multicore cannot deliver the expected performance as it scales: measured reality falls short of the ideal.
Since throughput = concurrency / latency, performance is improved by exploiting parallelism (raising concurrency) and by exploiting locality (lowering latency); see the worked example below.
Sources: Sandia National Laboratories; "The Troubles with Multicores", David Patterson, IEEE Spectrum, July 2010; "Finding the Door in the Memory Wall", Erik Hagersten, HPCwire, March 2009; "Multicore Is Bad News For Supercomputers", Samuel K. Moore, IEEE Spectrum, November 2008.
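A tiny worked example of the slide's formula, with made-up numbers (100 ns average memory latency): each additional unit of concurrency, or any latency reduction from better locality, raises throughput proportionally.

```python
latency = 100e-9                                # seconds per memory request (hypothetical)
for concurrency in (1, 4, 10):
    throughput = concurrency / latency          # requests per second
    print(concurrency, f"{throughput:.1e} req/s")
# 1 1.0e+07 req/s, 4 4.0e+07 req/s, 10 1.0e+08 req/s
```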

68 Challenges of Managing LLC in Multi-cores
Recent theoretical results about LLC management in multicores: for a single core, an optimal offline replacement algorithm exists and online LRU is k-competitive (k is the cache size); for multicores, finding an optimal offline schedule is NP-complete, and cache partitioning among threads is an optimal solution in theory.
System challenges in practice: the LLC lacks the hardware mechanisms needed to control inter-thread cache contention, it shares the same design as single-core caches, and system software has limited information and methods to effectively control cache contention.

69 OS Cache Partitioning in Multi-cores (HPCA’08)
Physically indexed caches are divided into multiple regions (colors), and all cache lines of a physical page are cached in one of those regions.
A virtual address consists of a virtual page number and a page offset (with 4 KB pages, the offset is 12 bits). OS-controlled address translation maps the virtual page number to a physical page number; the offset is unchanged. The physical address indexes the cache as tag, set index, and block offset.
The bits shared by the physical page number and the set index are the page-color bits; they select which cache region a page falls into. The OS can therefore control the color of a virtual page through address mapping, by choosing a physical page with a specific value in its page-color bits. A small sketch of the color computation follows.
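A sketch of how the page-color bits fall out of the address layout described above, using hypothetical cache geometry (4 KB pages, 64-byte lines, a 4 MB 16-way physically indexed cache); the real numbers depend on the machine.

```python
PAGE_SHIFT = 12                                          # 4 KB pages
LINE_SHIFT = 6                                           # 64-byte cache lines
ASSOC      = 16
CACHE_SIZE = 4 * 1024 * 1024                             # 4 MB physically indexed cache
NUM_SETS   = CACHE_SIZE // ((1 << LINE_SHIFT) * ASSOC)   # 4096 sets
SET_BITS   = NUM_SETS.bit_length() - 1                   # 12 set-index bits

# The page-color bits are the set-index bits that lie above the page offset.
COLOR_BITS = LINE_SHIFT + SET_BITS - PAGE_SHIFT          # 6 bits -> 64 colors
NUM_COLORS = 1 << COLOR_BITS

def page_color(physical_page_number):
    """Color = the overlap between the physical page number and the cache set index."""
    return physical_page_number & (NUM_COLORS - 1)

# The OS partitions the LLC by giving a process only physical pages whose
# color falls in that process's assigned subset of the NUM_COLORS colors.
print(NUM_COLORS, page_color(0x12345))   # 64, 5
```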

70 Shared LLC can be partitioned into multiple regions
Physical pages are grouped into bins based on their page colors; pages in the same bin are cached in the same cache region (color) of the physically indexed cache.
If the pages of each process are restricted to a subset of the bins, each process uses only part of the cache, so the shared cache is partitioned between the two processes through OS address mapping. We call this restriction co-partitioning: the main memory space is partitioned along with the cache.

71 Implementations in Linux and its Impact
Static partitioning predetermines the amount of cache blocks allocated to each running process at the beginning of its execution; it is implemented by enhanced page coloring through OS page address mapping.
Dynamic cache partitioning adjusts cache allocations among processes dynamically, changing each process's cache usage through OS page address re-mapping (page re-coloring) to follow the programs' dynamic behaviors.
Current status of the system facility: open source in Linux kernels; adopted as a software solution by Intel SSG in May 2010; used in applications on Intel platforms, e.g., automation.

72 Final Remarks: Why LIRS-related Efforts Make the Difference?
Caching the most deserving data blocks: using reuse distance as the ruler approaches the optimal, while 2Q, LRU-k, ARC, and others can still cache undeserving blocks.
LIRS with its two stacks yields constant-time operations, O(1): consistent with LRU, but recording much more useful information.
CLOCK-Pro turns LIRS into reality in production systems; no other algorithm except ARC has an approximation version.
BP-Wrapper ensures lock-contention-free buffer management in DBMSs.
OS cache partitioning applies the LIRS principle to the LLC in multicores: protect strong-locality data and control weak-locality data.

73 Acknowledgement to Co-authors and Sponsors
Song Jiang, Ph.D. '04 at William and Mary, faculty at Wayne State University.
Feng Chen, Ph.D. '10, Intel Labs (Oregon).
Xiaoning Ding, Ph.D. '10, Intel Labs (Pittsburgh).
Qingda Lu, Ph.D. '09, Intel (Oregon).
Jiang Lin, Ph.D. '08 at Iowa State, AMD.
Zhao Zhang, Ph.D. '02 at William and Mary, faculty at Iowa State.
P. Sadayappan, Ohio State.
Continuous support from the National Science Foundation.

74 CSE 788, Winter Quarter 2011: Principle of Locality in Design and Implementation of Computer and Distributed Systems
Exploiting locality at different levels of computer systems; challenges of algorithm design and implementation; readings of both classical and new papers; a proposal- and project-based class.
Much high-quality research started from this class and was published in FAST, HPCA, Micro, PODC, PACT, SIGMETRICS, USENIX, and VLDB. You are welcome to take the class next quarter.

75 Thank You!
Xiaodong Zhang: zhang@cse.ohio-state.edu

