ULC: An Unified Placement and Replacement Protocol in Multi-level Storage Systems Song Jiang and Xiaodong Zhang College of William and Mary.


Multi-Level Buffer Caching in Distributed Systems [Figure: clients connected over a network to a front-tier server, an end-tier server, and a disk array.]

Challenges to Improve Hierarchy Performance (1) Can the hit rate of hierarchical caches achieve the hit rate of a single first-level cache whose size equals the aggregate size of the hierarchy? (2) Can we make the caches close to clients contribute more to the hit rate? [Figure: per-level hit rates of LRU caches at levels L1 to L4.]

Reason I: Weakened Locality at Low Level Caches Low level caches hold the misses from their upper level buffer caches, and the hits have high latency. The requests with strong locality have been filtered by the high level buffer caches close to clients.

An Existing Solution: Re-designing Low Level Cache Replacement Algorithms Multi-Queue Replacement (MQ) [USENIX01]: to overcome weak locality, MQ uses frequency-based replacement. Once a block is accessed, it is promoted to a higher queue. Periodically, blocks in each queue are checked, and low-frequency blocks are demoted to lower queues. [Figure: queues Q0, Q1, ..., Qn and Qout.]
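The promote-on-access and periodic-demotion behaviour described for MQ can be sketched roughly as follows. This is a simplified illustration, not the published algorithm: the class name `SimpleMQ`, the queue index floor(log2(frequency)), the `life_time` expiry counter, and the omission of the Qout history queue are all assumptions of this sketch.

```python
import math
from collections import OrderedDict


class SimpleMQ:
    """Simplified multi-queue cache (hypothetical sketch, Qout omitted)."""

    def __init__(self, capacity, num_queues=4, life_time=8):
        self.capacity = capacity
        self.life_time = life_time
        self.now = 0
        # Each queue is LRU-ordered: first item is the LRU end.
        # Values are (frequency, expire_time) pairs.
        self.queues = [OrderedDict() for _ in range(num_queues)]

    def _queue_index(self, freq):
        # Assumption: a block with frequency f lives in queue floor(log2(f)).
        return min(int(math.log2(freq)), len(self.queues) - 1)

    def _find(self, block):
        for q in self.queues:
            if block in q:
                return q
        return None

    def _evict(self):
        # Evict the LRU block of the lowest non-empty queue.
        for q in self.queues:
            if q:
                q.popitem(last=False)
                return

    def _adjust(self):
        # Periodic check: demote the LRU block of each queue if it expired.
        for k in range(1, len(self.queues)):
            q = self.queues[k]
            if q:
                block, (freq, expire) = next(iter(q.items()))
                if expire < self.now:
                    del q[block]
                    self.queues[k - 1][block] = (freq, self.now + self.life_time)

    def access(self, block):
        """Reference a block; return True on a hit, False on a miss."""
        self.now += 1
        q = self._find(block)
        hit = q is not None
        if hit:
            freq, _ = q.pop(block)
            freq += 1  # a higher frequency may promote the block
        else:
            if sum(len(x) for x in self.queues) >= self.capacity:
                self._evict()
            freq = 1
        self.queues[self._queue_index(freq)][block] = (freq, self.now + self.life_time)
        self._adjust()
        return hit


mq = SimpleMQ(capacity=2, life_time=100)
for b in ["a", "a", "b", "c"]:
    mq.access(b)
```

In this run the twice-accessed block "a" sits in a higher queue, so the one-time block "b" is evicted when "c" arrives, whereas plain LRU of the same size would have evicted "a".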

Drawbacks of MQ Replacement (1) It inherits the weakness of frequency-based algorithms: it is not responsive to access pattern changes. (2) It contains workload-sensitive parameters. (3) It cannot fully exploit the locality knowledge inherent in applications (accurate information is in the high level caches). Motivation: conduct locality analysis at the clients, where the original requests are generated.

Reason II: Undiscerning Redundancy among Levels of Buffer Caches [Figure: snapshots, taken every 1000 references, of the blocks held in the client cache, in the server cache, and in both caches.]

Another Existing Solution: Extending an Existing Replacement Algorithm into a Unified Replacement For example: Unified LRU (uniLRU) [USENIX02]. [Figure: L1 LRU stack at the client and L2 LRU stack at the server, with demotion from L1 to L2.]

Drawbacks of Unified LRU 1) High level caches are not well utilized; 2) Large demotion overhead. [Figure: L1 and L2 LRU stacks; all the hits go to a single position in the L2 stack.]
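Both drawbacks can be reproduced with a small two-level model. The sketch below assumes an exclusive scheme in which every block evicted from the L1 LRU stack is demoted to the MRU end of the L2 stack and an L2 hit moves the block back to L1; the class name `UniLRU` and the looping access pattern are hypothetical choices for illustration only.

```python
from collections import OrderedDict


class UniLRU:
    """Two-level unified LRU model (hypothetical sketch)."""

    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()  # client cache, first item = LRU end
        self.l2 = OrderedDict()  # server cache, first item = LRU end
        self.l1_size = l1_size
        self.l2_size = l2_size
        self.demotions = 0

    def _demote(self, victim):
        # Every L1 eviction costs one block transfer to the server.
        self.demotions += 1
        if len(self.l2) >= self.l2_size:
            self.l2.popitem(last=False)
        self.l2[victim] = True

    def access(self, block):
        """Return where the block was found: 'L1', 'L2', or 'disk'."""
        if block in self.l1:
            self.l1.move_to_end(block)
            return "L1"
        if block in self.l2:
            del self.l2[block]  # exclusive: the block leaves L2
            where = "L2"
        else:
            where = "disk"
        if len(self.l1) >= self.l1_size:
            victim, _ = self.l1.popitem(last=False)
            self._demote(victim)
        self.l1[block] = True
        return where


cache = UniLRU(l1_size=2, l2_size=2)
# A loop over 4 blocks fits in the aggregate cache but not in L1 alone.
results = [cache.access(b) for b in [0, 1, 2, 3, 0, 1, 2, 3]]
```

With this looping pattern every hit lands in L2 and almost every reference triggers a demotion, which is the behaviour the slide criticizes.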

Our Approach: Unified Level-aware Caching (ULC) Blocks with weak locality are placed in the low level buffer caches: (1) Locality is analyzed at the client. (2) The analysis results are used to direct the placement of blocks in the hierarchy. Redundancy among the levels of buffer caches is minimized by unified replacement based on client information. [Figure: locality strength mapped to cache levels.]

Quantifying Locality Strength Locality strength is characterized by Next Access Distance (NAD). NAD is currently unknown, so it is quantitatively predicted from Last Access Distance (LAD) and Recency (R): LAD-R = max(LAD, R). Advantages of LAD-R over R: 1) it does not change until the next reference of the block; 2) accurate quantification. [Figure: unified LRU stack marking a block's last access position, current position, and next access position, with the distances LAD, R, LAD-R, and NAD.]
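A minimal sketch of how LAD, R, and LAD-R = max(LAD, R) could be tracked on one unbounded LRU stack is shown below. The class name `LadrTracker`, the zero-based stack positions, and the use of infinity as the LAD of a block seen for the first time are assumptions of this illustration; the linear-time list operations are for clarity, not efficiency.

```python
class LadrTracker:
    """Tracks LAD and recency on a single LRU stack (hypothetical sketch)."""

    def __init__(self):
        self.stack = []  # index 0 is the most recently used block
        self.lad = {}    # block -> stack position at its last access

    def access(self, block):
        if block in self.lad:
            pos = self.stack.index(block)
            self.lad[block] = pos  # recency at access time becomes the new LAD
            self.stack.pop(pos)
        else:
            self.lad[block] = float("inf")  # assumption: no history yet
        self.stack.insert(0, block)

    def recency(self, block):
        # R: number of distinct blocks referenced since this block's last access.
        return self.stack.index(block)

    def lad_r(self, block):
        return max(self.lad[block], self.recency(block))


t = LadrTracker()
for b in ["a", "b", "c", "a"]:
    t.access(b)
lad_r_now = t.lad_r("a")    # LAD = 2, R = 0
for b in ["d", "e", "f"]:
    t.access(b)
lad_r_later = t.lad_r("a")  # LAD = 2, R = 3
```

Between two references to a block its LAD stays fixed while R grows, so LAD-R only rises once R exceeds LAD.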

Multi-Level Buffer Caching Protocol: Unified and Level-Aware Caching (ULC) ULC, running on the first-level client, dynamically ranks the accessed blocks according to their LAD-R values. Based on the ranking results, blocks are cached (placed) at levels L1, L2, …, accordingly. Low level caches take actions such as caching or replacing according to the instructions from clients.

LAD-R Based Block Caching Arranging the block layout exactly according to the LAD-R ranking is expensive (at least O(log n)). Efficient two-phase LAD-R based caching (O(1)): 1) LAD determines block placement at the time of retrieval (R = 0); 2) R is used for block replacement after a block is cached.

LAD-R Based Placement and Replacement The LRU position at which a block is accessed determines its placement. The current LRU position determines its replacement. [Figure: L1 LRU stack and L2 LRU stack.]
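The placement half of the two-phase scheme can be illustrated by mapping the stack position at which a block is accessed (its LAD) to a cache level. The sketch assumes zero-based positions and fixed per-level capacities measured in blocks; the function name `level_for_lad` is hypothetical, and the `None` result for positions beyond the aggregate cache size is a placeholder, since the slides do not say how such blocks are handled.

```python
from bisect import bisect_right
from itertools import accumulate


def level_for_lad(lad, level_sizes):
    """Return the 1-based level whose stack segment contains position `lad`.

    level_sizes[k] is the capacity, in blocks, of level k+1 (assumed values).
    Returns None when `lad` lies beyond the aggregate size of all levels.
    """
    bounds = list(accumulate(level_sizes))  # e.g. [2, 2, 2] -> [2, 4, 6]
    idx = bisect_right(bounds, lad)
    return idx + 1 if idx < len(bounds) else None


sizes = [2, 2, 2]
placements = [level_for_lad(p, sizes) for p in range(7)]
```

A block accessed near the top of the stack is placed in L1, while one accessed deeper is placed in a lower level, without keeping the whole hierarchy sorted by LAD-R.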

ULC Data Structure Recency status (R1, R2, R3) is determined by recency. Level status (L1, L2, L3) is determined by LAD. The placement of a block is determined by its level status. The yardstick block (Y1, Y2, Y3) is the one for replacement at the corresponding level. [Figure: uniLRU stack annotated with yardsticks, recency status, and level status.]

Two Operations in the ULC Protocol Two request messages from the client to the low level caches: 1. Retrieve (b, i, j) (i ≥ j): retrieve block b from level Li, and cache it at level Lj when it passes level Lj on its route to level L1. 2. Demote (b, i, j) (i < j): demote block b from level Li into level Lj.
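The two messages can be modelled as moves between per-level block sets. The sketch below treats i >= j for Retrieve and i < j for Demote as preconditions, assumes exclusive caching (a retrieved block leaves level Li), and lets a level number one past the last cache stand for the disk; the class name `Hierarchy` and these modelling choices are assumptions, not details given on the slide.

```python
class Hierarchy:
    """Block sets for cache levels 1..n (hypothetical sketch)."""

    def __init__(self, num_levels):
        self.levels = {k: set() for k in range(1, num_levels + 1)}

    def retrieve(self, b, i, j):
        """Fetch block b from level i and cache it at level j on its way to L1."""
        if i < j:
            raise ValueError("Retrieve requires i >= j")
        if i in self.levels:           # i == num_levels + 1 means the disk
            self.levels[i].discard(b)  # assumption: exclusive caching
        self.levels[j].add(b)

    def demote(self, b, i, j):
        """Move block b from level i down to level j."""
        if i >= j:
            raise ValueError("Demote requires i < j")
        self.levels[i].remove(b)
        self.levels[j].add(b)


h = Hierarchy(3)
h.levels[3].add(11)
h.levels[2].add(6)
h.retrieve(11, 3, 2)  # block 11 moves from L3 up to L2
h.demote(6, 2, 3)     # block 6 moves from L2 down to L3
```

The two calls at the end use the same block numbers and levels as the Retrieve (11, 3, 2) and Demote (6, 2, 3) messages in the deck's example.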

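Example of Retrieve (b, from, to): block 11, with level status L3, is accessed at recency status R2, so its level status changes from L3 to L2 and the client issues Retrieve (11, 3, 2). [Figure: uniLRU stack with yardsticks Y1, Y2, Y3 and level status L1, L2, L3.]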

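Example, continued: Retrieve (11, 3, 2) is followed by Demote (6, 2, 3), which demotes block 6 from L2 into L3. [Figure: uniLRU stack with yardsticks Y1, Y2, Y3 and level status L1, L2, L3, showing blocks 11 and 6.]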

ULC with Multiple Clients [Figure: per-client stacks at Client 1 and Client 2 with yardsticks Y1 and Y2 and with L1, L2, and Lout blocks, plus a Global_LRU stack at the server.]

Performance Evaluation: Workload Traces RANDOM: spatially uniform distribution of references (synthetic trace). ZIPF: highly skewed reference distribution (synthetic trace). HTTPD: collected on a 7-node parallel web server (HP). DEV1: collected in an office environment for over 15 consecutive days (HP). TPCC1: the I/O trace of the TPC-C database benchmark (IBM DB2).

Performance on a 3-level Structure Block size: 8KB. Block transfer time between the client and the server buffer caches: 1ms. Block transfer time between the server buffer cache and its RAM cache on disk: 0.2ms. Block transfer time between the disk RAM cache and the disk: 10ms. Cache size: 100MB each.
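To see how these transfer times turn per-level hit ratios into an average access time, a rough calculation can be sketched as follows. It assumes an L1 hit costs nothing, that a request served at a lower level pays every link on the path back to the client, and it ignores demotion cost; the function name and the hit ratios in the example are hypothetical, not results from the evaluation.

```python
def average_access_time(hit_ratios, link_times_ms):
    """Average access time in ms for a cache hierarchy backed by a disk.

    hit_ratios[k]    : fraction of requests served by level k+1 (assumed values)
    link_times_ms[k] : block transfer time between level k+1 and the next
                       lower level (the last entry is the link to the disk)
    """
    if len(hit_ratios) != len(link_times_ms):
        raise ValueError("one link time is needed per cache level")
    miss_ratio = 1.0 - sum(hit_ratios)
    total = 0.0
    path_cost = 0.0  # cost of reaching the current level from the client
    for hit, link in zip(hit_ratios, link_times_ms):
        total += hit * path_cost
        path_cost += link
    return total + miss_ratio * path_cost  # misses go all the way to the disk


# Link times from the 3-level setup; hit ratios are made up for illustration.
t_avg = average_access_time([0.5, 0.2, 0.1], [1.0, 0.2, 10.0])
```

With these made-up ratios the disk misses dominate: 0.2 * 11.2 ms accounts for 2.24 ms of the roughly 2.56 ms total, which is why the hit distribution across levels matters as much as the aggregate hit ratio.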

1) Compared with indLRU, ULC significantly increases hit ratios; 2) Compared with uniLRU, ULC provides a better hit distribution.

1) indLRU has a high miss penalty; 2) uniLRU has a high demotion cost.

Performance on a Multi-client Structure httpd: collected on a 7-node parallel web server. openmail: collected on 6 HP 9000 K580 servers running the HP OpenMail application. db2: collected on an 8-node IBM SP2 system running an IBM DB2 database. Block size: 8KB. Block transfer time between the clients and the server: 1ms. Block transfer time between the server buffer cache and the disk: 10ms. Cache size: 100MB each (except for workload tpcc1, which is 50MB).

The effect of cache pollution in MQ

Large demotion cost in uniLRU

Summary We propose an effective way to quantify locality in multi-level caches. We design an efficient block placement and replacement protocol (ULC). ULC makes the layout of cached blocks in the hierarchy match their locality. Experiments show that ULC significantly outperforms existing schemes.