
Prefetching with Adaptive Cache Culling for Striped Disk Arrays
Sung Hoon Baek and Kyu Ho Park
Korea Advanced Institute of Science and Technology (KAIST), School of Electrical Engineering and Computer Science
USENIX Annual Technical Conference

Introduction: Our Work

Disk Prefetching Schemes
 Accurate Prediction
  1. Offline Prefetching
  2. History-Based Prefetching
  3. Application-Hint-Based Prefetching
  High overhead, impractical
 Sequential Prediction
  4. Sequential Prefetching: the most widely used, but never beneficial to non-sequential accesses
 Our Scheme
  Goal: beneficial to non-sequential reads as well as sequential reads, and as practical as sequential prefetching
  Approach: low prefetch cost at the expense of prediction accuracy; considers both prefetch buffer management and prefetching
  For striped disk arrays: RAID-0, RAID-5, RAID-6, SSD, etc.

Prior Work
 Buffer Management for Prefetched Data
  Related work: TIP [1,2]
  Deterministic cost estimation makes errors
  Scan overhead: searching for the least-valuable block is O(N)
 Adaptive Strip Prefetching: the proposed scheme
  A practical scheme with low overhead: O(1)
  Inspired by ARC and SARC, which manage cached data
  A more analytical method, for prefetched data
  Specialized for RAID

[1] R. H. Patterson et al., "Informed Prefetching and Caching," ACM SOSP, Dec. 1995
[2] A. Tomkins et al., "Informed Multiprocess Prefetching and Caching," ACM Int'l Conf. on Measurement and Modeling of Computer Systems, June 1997

Prior Works vs. Our Work
 Prior works
  (1) Buffer management for prefetched data (TIP [1]): O(N)
  (2) Adaptive cache management (ARC [2], SARC): for cached data
 Our work
  (1) A new prefetching scheme: non-sequential reads as well as sequential reads, very practical, for RAID
  (2) Prefetch buffer management: a similar goal to TIP but for prefetched data with a more analytical method and O(1) overhead; a similar method to ARC/SARC; resolves bad cache utilization
  (3) An online cost estimator, tightly integrated with (1) and (2)

[1] R. H. Patterson et al., "Informed Prefetching and Caching," ACM SOSP, Dec. 1995
[2] Megiddo and Modha, "ARC: A Self-Tuning, Low Overhead Replacement Cache," USENIX FAST, 2003

RAID Layout
Figure: the strip/stripe layout of a striped disk array. Problem: independency of the member disks.

My Work: Adaptive Strip Prefetching (ASP)
 Strip Prefetching
  Reads all blocks of a strip (segment prefetching with segment = strip)
  Drawback: bad cache utilization; unused data pollutes the cache
 Adaptive Cache Culling
  Buffer management for prefetched data
  Differential feedback
 Online Prefetch Cost Estimation

Strip Prefetching
Non-sequential reads may or may not benefit from strip prefetching (SP). However, most non-sequential reads in real workloads exhibit spatial locality, unlike random reads over a huge workspace, so in many cases SP provides a performance gain. For random reads over a huge workspace, SP is deactivated by the online disk simulator.
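As a concrete illustration, strip prefetching simply rounds a missed block down to its strip boundary and reads the whole strip. This is a minimal sketch, not the authors' driver code; `read_blocks` and the block-granular interface are assumptions:

```python
def strip_prefetch(lba, strip_size_blocks, read_blocks):
    """On a miss at logical block `lba`, read the entire strip that
    contains it (strip prefetching). `read_blocks(start, n)` stands in
    for an assumed low-level read primitive."""
    strip_start = (lba // strip_size_blocks) * strip_size_blocks
    return read_blocks(strip_start, strip_size_blocks)

# With 64-block strips, a miss at block 70 fetches blocks 64..127.
data = strip_prefetch(70, 64, lambda start, n: list(range(start, start + n)))
```

In the full scheme, this whole-strip read replaces the single-block read only while the online cost estimator keeps SP active.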

Best Segment Size for Segment Prefetching?
Experiment: 200 threads perform random reads of a fixed request size on three UltraSCSI disks (15k rpm). The figure plots bandwidth against prefetch size for several request-size/strip-size combinations (128 KiB/128 KiB, 256 KiB/128 KiB, 384 KiB/128 KiB, 256 KiB/256 KiB, 128 KiB/256 KiB) and segment sizes of half, one, two, and three strips.

My Work: Adaptive Strip Prefetching (ASP)
 Strip Prefetching
  Drawback: bad cache utilization; useless data pollutes the cache
 Adaptive Cache Culling (prefetch buffer management)
  Mitigates the disadvantage of strip prefetching
  Buffer management for prefetched data: culls uselessly prefetched data
  Maximizes the total hit rate = prefetch hit rate + cache hit rate, under a given cache management
  Uses a differential feedback (an automatic manner)
  Prefetch hit: a request on a prefetched block; cache hit: a request on a cached block
 Online Prefetch Cost Estimation

Block States in Adaptive Strip Prefetching
Figure: the block states in the upstream and the downstream.

Basic Operations of ASP (1/2)
Figure: a new strip cache is added to the upstream (N_U: the number of upstream strip caches, a variable); culling moves strip caches from the upstream to the downstream; free block caches are taken from the downstream. Legend: empty block, prefetched block, cached block, strip cache.

Basic Operations of ASP (2/2)
Figure: cache hits and cache misses in the upstream and the downstream; a cache miss triggers strip prefetching. N_U: the maximum number of upstream strip caches, an adaptively controlled variable.

Cache Replacement Policy
Figure: a global LRU list orders all blocks from the MRU end to the LRU end. Without ASP, the cache replacement policy evicts from the LRU end; with ASP, the prefetch buffer management culls instead, and hits in the global bottom are counted.
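The eviction-versus-culling distinction can be sketched as follows. This is an illustrative model, not the ASP implementation: a single global LRU list in which victim selection first culls prefetched-but-never-hit blocks and falls back to plain LRU eviction only when none remain (the class and method names are invented for the sketch):

```python
from collections import OrderedDict

class PrefetchAwareLRU:
    """Sketch of a global LRU list whose victim selection prefers culling:
    prefetched-but-never-hit blocks are reclaimed before ordinary cached
    blocks are evicted from the LRU end. Strip-level bookkeeping is
    simplified to per-block states for illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()        # key -> 'prefetched' | 'cached'

    def access(self, key):
        if key in self.blocks:
            self.blocks[key] = 'cached'    # any hit promotes the block
            self.blocks.move_to_end(key)   # move to the MRU end
            return True
        return False

    def insert(self, key, state='prefetched'):
        while len(self.blocks) >= self.capacity:
            self._reclaim()
        self.blocks[key] = state

    def _reclaim(self):
        # Culling: take the LRU-most prefetched (never hit) block first.
        for k, s in self.blocks.items():   # iterates LRU -> MRU
            if s == 'prefetched':
                del self.blocks[k]
                return
        self.blocks.popitem(last=False)    # else: plain LRU eviction
```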

N_U vs. Hit Rate
 Prefetch hit: a hit on a prefetched block
 ΔP: partial prefetch hit rate (the hit rate on prefetched blocks)
 ΔC: partial cache hit rate (the hit rate on cached blocks)
Figure: the hit rate at each list position for N_U = 9 vs. N_U = 7; reducing N_U yields additional cached blocks, adding cache hit rate while reducing prefetch hit rate.

Total Hit Rate vs. N_U (1/2)
 Find the optimal N_U that maximizes the total hit rate
 Feedback control: N_U ← N_U + s × slope

Total Hit Rate vs. N_U (2/2)
 If the total hit rate is a monotonically increasing function of N_U (slope ≥ 0):
  N_U ← min(N_U + s × slope, N_U^max), which drives N_U to its maximum value
 If it is a monotonically decreasing function of N_U (slope ≤ 0):
  N_U ← max(N_U + s × slope, N_U^min), which drives N_U toward zero

Derivative vs. Marginal Utility
Figure: the marginal utility (inspired by SARC) measures the effect of an additional allocation beyond the original upstream bottom, whereas our method uses the derivative at the current allocation.

Differential Feedback
Figure: culling moves strip caches from the upstream to the downstream; the upstream bottom (U_b) and the global bottom (G_b) are monitored.
 ΔP: the number of prefetch hits in U_b during a time interval
 ΔC: the number of cache hits in G_b during a time interval
 Proportional control
 Further work: PID (proportional-integral-derivative) control
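Using these definitions of ΔP and ΔC, one proportional-control step could look like the sketch below. The update rule N_U ← N_U + s·(ΔP - α·ΔC) with clamping is an interpretation of the feedback diagram and the min/max rules on the preceding slides, not the authors' exact code:

```python
def update_upstream_size(n_u, dP, dC, alpha, s, n_u_min, n_u_max):
    """One interval of the differential feedback (sketch).

    dP:    prefetch hits observed in the upstream bottom U_b this interval.
    dC:    cache hits observed in the global bottom G_b this interval.
    alpha: coefficient scaling the two hit counts against each other.
    s:     proportional gain; the result is clamped to [n_u_min, n_u_max].
    """
    slope = dP - alpha * dC            # positive: growing the upstream helps
    return max(n_u_min, min(n_u_max, n_u + s * slope))
```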

Differential Feedback Diagram
Figure: the workload drives the cache with strip prefetching; ΔP and α-scaled -ΔC are summed, passed through a delay and a zero-order hold (ZOH), scaled by s, and accumulated into N_U. This feedback maximizes the total hit rate under a given buffer management and resolves the disadvantage of strip prefetching.

Initial Condition
 N_a ← cache size / strip size
 Initialization: N_U ← N_a; there is no downstream, and the upstream bottom and the global bottom overlap
 No feedback while N_U + N_D ≤ N_a
 Strip prefetching is forced while N_U + N_D ≤ N_a

Ghosts
 Eviction from the downstream turns a strip cache into a ghost
 Past cached block: a block that was a cached block before its strip became a ghost
 Culling does not evict past cached blocks or cached blocks
Figure: upstream, downstream, and ghosts, with eviction, culling, and a cache miss on a ghost strip.

Which strips become ghost strips?
 Our goal: easy implementation
 RAID drivers manage destage caches in terms of stripes; a stripe cache includes its strip caches
 Example:
  1. Stripe 2 has live strip caches for strip 2A and strip 2B
  2. Strip 2A is evicted, so it becomes a ghost
  3. Strip 2B is evicted, so both are completely removed
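The stripe-level bookkeeping in the example above can be sketched as a small state machine (a simplification; the names and the set-based representation are assumptions):

```python
class StripeCache:
    """Sketch of ghost-strip bookkeeping. A stripe keeps its strip
    caches; an evicted strip is remembered as a ghost until every strip
    of the stripe has been evicted, at which point the whole stripe
    entry is dropped."""

    def __init__(self, strips):
        self.live = set(strips)    # strips with resident data
        self.ghost = set()         # evicted strips remembered as ghosts

    def evict_strip(self, strip):
        self.live.discard(strip)
        self.ghost.add(strip)
        if not self.live:          # the last live strip is gone:
            self.ghost.clear()     # remove the stripe completely
            return 'stripe-removed'
        return 'ghost'
```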

Online Cost Estimation (1/2)
 The differential feedback resolves the disadvantage of strip prefetching, but strip prefetching is still not beneficial to random reads
  Random reads cause rare prefetch hits and cache hits
 The online cost estimation
  Investigates which choice is better: strip prefetching or no prefetching
  Activates/deactivates strip prefetching accordingly

Online Cost Estimation (2/2)
 Low overhead: O(1) complexity
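The slides give no internals, so the following is only a plausible O(1) sketch of such an estimator: keep running estimates of the per-request disk cost with and without strip prefetching (here via an exponentially weighted moving average, which is an assumption) and keep SP active while it is the cheaper choice:

```python
class OnlineCostEstimator:
    """O(1)-per-request sketch: track smoothed estimates of the disk
    cost with strip prefetching (SP) and without it, and activate SP
    only while it is estimated to be cheaper. The EWMA form and the
    smoothing factor are illustrative assumptions."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.cost_sp = 0.0      # estimated per-request cost with SP
        self.cost_nop = 0.0     # estimated per-request cost without prefetching

    def record(self, sp_cost_sample, nop_cost_sample):
        # Constant-time update: no scanning of cached blocks.
        b = self.beta
        self.cost_sp = b * self.cost_sp + (1 - b) * sp_cost_sample
        self.cost_nop = b * self.cost_nop + (1 - b) * nop_cost_sample

    def sp_active(self):
        return self.cost_sp <= self.cost_nop
```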

Evaluation
 Implemented a RAID-5 driver in Linux
 Five SCSI-320 disks (15k rpm, 73 GB); dual Xeon 3.0 GHz; 1 GB of memory
 Combinations evaluated: ASP+MSP, ASP+SEQP, MSP+SEQP, ASP+MSP+SEQP
 SEQP: the sequential prefetching of Linux; SEQPX: SEQP with X KiB of prefetching size
 SP: strip prefetching; ASP: adaptive strip prefetching
 Measurement: six repetitions, low deviation

PCMark
 Over-provisioned memory
 General application usage: Word, WinZip, PowerCrypt, antivirus, Winamp, WMP, Internet, etc.
 Figure: up to a 2.2-times gain

Dbench
 Dbench: a realistic workload, like a file server
 Figure: gains of 30% and up to 2.2 times

Tiobench: Decision Correctness
 Random reads: extremely low cache/prefetch hit rates, so the feedback does not work
 The online cost estimator therefore makes the decision: no prefetching

Maximum Latency & CPU Load
 Tiobench (random read)
 Figures: maximum latency; CPU load / throughput

IOZone: Independency
 IOZone benchmark: concurrent sequential reads
 Figure: configurations including SEQP show independency loss, configurations including MSP show parallelism loss, and one combination is the best

IOZone: Stride/Reverse Read
 Stride read and reverse read
 Figure: configurations including ASP versus sequential prefetching; up to a 40-times gain

TPC Benchmark H
 TPC-H: a business-oriented database server benchmark (DBMS: MySQL)
 Stride reads and non-sequential reads
 Gains of ASP+MSP over SEQP128 per query: 27%, 134%, 721%, 24%, 41%, 199%, 52%, 27%, 37%, 20%, 73%, 141%

Real Scenarios
 cscope: C source file indexing of the kernel source
  cscope1: excluding object files
  cscope2: including object files
 glimpse: text file indexing (/usr/share/doc) for cross-reference
 link: linking kernel object codes
 Figure: gains of 116%, 10%, 107%, and 44%

Linux Booting
 Figure: a 30% gain

Summary
 Beneficial to non-sequential reads as well as sequential reads
  Database queries, building search indices, linking, booting, file servers, general application usage
 Prefetch buffer management (differential feedback) resolves the bad cache utilization of strip prefetching
 Online disk cost simulation resolves the bad prefetch cost of strip prefetching
 Practical, low overhead, and a great performance gain for practical RAID systems

Q&A

Step Response
Figure: N_U over time, showing the desired N_U, the real N_U produced by the feedback control, a realistic N_U, and the initial N_U.

Backup Slides

Massive Stripe Prefetching (MSP)
 Our prior work, for parallelism
 Adaptive strip prefetching (ASP) is good for large numbers of concurrent I/Os but loses parallelism for small numbers of concurrent I/Os
 MSP resolves this parallelism loss
  Activated for a small number of concurrent sequential reads
  Prefetches multiple stripes, achieving perfect parallelism across the disks

The Prefetching Size and Time of MSP
Proposed scheme, for parallelism.
Figure: prefetch size versus the amount of sequential accesses in a file, for SEQP, MSP, and MSP+SEQP; MSP prefetches are aligned to the stripe (stripe size).

The Coefficient α
α is chosen so that the amount of memory in the increased region of the upstream equals the amount of memory in the reduced region of the downstream.

Further Work
 Optimal s, or dynamically controlling s
 Optimal size of the upstream bottom, |U_b|
 An ideal derivative suffers great errors and is impractical