

Counter Stacks: Storage Workload Analysis via Streaming Algorithms Nick Harvey University of British Columbia and Coho Data Joint work with Zachary Drudi, Stephen Ingram, Jake Wires, Andy Warfield

Caching
What data should we keep in fast memory?
[Diagram: a fast, low-capacity memory in front of a slow, high-capacity memory.]

Caching, Historically
Registers, RAM, disk.
Belady, 1966: FIFO, RAND, MIN. Denning, 1968: LRU.

Caching, Modern
Registers, L1/L2/L3, RAM, disk, SSD, cloud storage, proxies, CDNs; policies range from associative maps and LRU to consistent hashing. LRU dates from 1968, but since then CPUs have become >1000x faster while disk latency is <10x better, so cache misses are increasingly costly.

Challenge: Provisioning
How much cache should you buy to support your workload?

Challenge: Virtualization
Modern servers are heavily virtualized. How should we allocate the physical cache among virtual servers to improve overall performance? What is the "marginal benefit" of giving a server more cache?

Understanding Workloads
Understanding workloads better can help:
– administrators make provisioning decisions
– software make allocation decisions
Storing a trace is costly: GBs per day. Analyzing and distilling traces is a challenge.

Hit Rate Curve
Fix a particular workload and caching policy. If the cache size were x, what would the hit rate be? HRCs are useful for choosing an appropriate cache size.
[Figure: hit rate vs. cache size (GB) for the MSR Cambridge "TS" trace under the LRU policy. Annotations mark the "elbow"/"knee" and the "working set"; beyond it there is not much marginal benefit to a bigger cache.]

Hit Rate Curve
Real-world HRCs need not be concave or smooth, so "marginal benefit" is meaningless and the "working set" is a fallacy.
[Figure: hit rate vs. cache size (GB) for the MSR Cambridge "Web" trace under the LRU policy; no identifiable "elbow," "knee," or "working set."]

LRU Caching
Policy: an LRU cache of size x always contains the x most recently requested distinct symbols.
Example trace: A B C A D A B …
Three distinct symbols appear between the two requests for B (B's "reuse distance"). If the cache size is >3, then B will still be in the cache during the second request for B; that is, the second request for B is a hit for cache size x if x > 3.
Inclusive: larger caches always include the contents of smaller caches.

Mattson's Algorithm
Maintain an LRU cache of size n; this simulates caches of every size x ≤ n. Keep a list of all blocks, sorted by most recent request time. The reuse distance of a request is its position in that list. If the distance is d, the request is a hit for all cache sizes ≥ d. The hit rate curve is the CDF of the reuse distances.
[Diagram: the LRU-ordered list evolving over the trace A B C A D A B.]
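To make the algorithm concrete, here is a minimal Python sketch (the function names are mine, not the authors' implementation). It maintains the LRU-ordered list directly, so each request costs O(n) time:

```python
# A minimal sketch of Mattson's algorithm: keep all blocks ordered by
# most recent request; a request's reuse distance is its 1-based position
# in that list (infinite on a block's first request).
def mattson_reuse_distances(trace):
    lru = []                        # blocks ordered by most recent request
    dists = []
    for b in trace:
        if b in lru:
            d = lru.index(b) + 1    # 1-based position = reuse distance
            lru.remove(b)
        else:
            d = float("inf")        # first request: a miss at every size
        dists.append(d)
        lru.insert(0, b)
    return dists

# The hit rate curve is the CDF of reuse distances: a request with
# distance d is a hit for every cache size >= d.
def hit_rate_curve(dists, max_size):
    m = len(dists)
    return [sum(1 for d in dists if d <= x) / m for x in range(1, max_size + 1)]
```

On the trace A B C A D A B, the distances come out as ∞, ∞, ∞, 3, ∞, 2, 4; the second B has distance 4, so it is a hit exactly for cache sizes ≥ 4, matching the x > 3 condition on the previous slide.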

Faster Mattson [Bennett-Kruskal 1975, Olken 1981, Almasi et al. 2001, …]
Maintain a table mapping each block to the time of its last request. The number of blocks whose last request time is ≥ t equals the number of distinct blocks seen since time t. This count can be maintained in O(log n) time per request with a balanced tree, so the HRC can be computed in O(m log n) time, where n = # blocks and m = length of the trace. The space, however, is Θ(n).
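As a concrete (hypothetical) rendering of this idea, the sketch below keeps the last-request-time table and uses a Fenwick (binary indexed) tree over request times in place of the balanced tree; the names are my own, and the tree as written takes one slot per request, i.e., O(m) words rather than the Θ(n) quoted above:

```python
# Sketch of the O(m log n)-time variant: a 1 in the Fenwick tree marks a
# time slot that is currently some block's most recent request, so the
# number of 1s at positions >= t is the number of distinct blocks since t.
class Fenwick:
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def add(self, i, v):            # point update at 1-based index i
        while i < len(self.t):
            self.t[i] += v
            i += i & -i
    def prefix(self, i):            # sum of entries 1..i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def reuse_distances_fast(trace):
    last = {}                       # block -> time of its last request
    bit = Fenwick(len(trace))
    dists = []
    for t, b in enumerate(trace, 1):
        if b in last:
            lo = last[b]
            # distinct blocks since b's last request = active slots lo..t-1
            d = bit.prefix(t - 1) - bit.prefix(lo - 1)
            bit.add(lo, -1)         # slot lo is no longer b's last request
        else:
            d = float("inf")
        dists.append(d)
        bit.add(t, 1)
        last[b] = t
    return dists
```

This returns the same position-based reuse distances as the direct list simulation, in O(log m) time per request.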

Is linear space OK?
A modern disk is 8 TB, divided into 4 KB blocks ⇒ n = 2 billion. The problem is worse in multi-disk arrays (e.g., a 60 TB JBOD) ⇒ n = 15 billion. If the algorithm for improving memory usage itself consumes 15 GB of RAM, that's counterproductive!

Is linear space OK?
We ran an optimized C implementation of Mattson on the MSR-Cambridge traces of 13 live servers over 1 week. The trace file is 20 GB in size: 2.3B requests, 750M blocks (3 TB). Processing time: 1 hour. RAM usage: 92 GB.
Lesson: we cannot afford linear space to process storage workloads.
Question: can we estimate HRCs in sublinear space?

Quadratic Space
Trace: A B C A D A B …
Maintain one set per request: the items seen since the first request, the items seen since the second request, and so on. The reuse distance of a request is the size of the oldest set that grows when its item is added. The hit rate curve is the CDF of the reuse distances.
[Diagram: the nested sets for this trace; the second A has reuse distance 2, the third A has reuse distance 1, and the second B has reuse distance 3.]

Quadratic Space
For t = 1, …, m:
  Receive request b_t.
  Find the minimum j such that b_t is not in the j-th set.
  Let v_j be the cardinality of the j-th set.
  Record a hit at reuse distance v_j.
  Insert b_t into all previous sets.
[Diagram: for the second request for B in the trace A B C A D A B, j = 3 and v_j = 3.]
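A direct Python rendering of this procedure (a sketch; the function name is mine). Note the convention: here the reuse distance counts the distinct blocks seen strictly between consecutive requests, one less than the list-position convention used on the Mattson slides:

```python
# Quadratic-space formulation: one set per request, each tracking the
# distinct blocks seen since that request. The sets are nested, so the
# oldest set that grows is the first one not containing the new block.
def reuse_distances_sets(trace):
    sets = []                       # sets[j] = distinct blocks since request j+1
    dists = []
    for b in trace:
        if not any(b in s for s in sets):
            d = float("inf")        # b was never requested before
        else:
            # size of the oldest set that grows when b is inserted
            # (0 if b was the immediately preceding request)
            d = next((len(s) for s in sets if b not in s), 0)
        dists.append(d)
        for s in sets:              # insert b into all previous sets
            s.add(b)
        sets.append({b})            # new set for items from this request on
    return dists
```

This uses up to m sets of up to n blocks each, hence the quadratic space of the slide's title.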

More Abstract Version
For t = 1, …, m:
  Let v_j be the cardinality of the j-th set.
  Receive request b_t.
  Let δ_j be the change in the j-th set's cardinality when adding b_t.
  For j = 2, …, t: record (δ_j − δ_{j−1}) hits at reuse distance v_j.
  Insert b_t into all previous sets.
How should we represent these sets? A hash table?

Random Set Data Structures
                 Hash Table   Bloom Filter   F0 Estimator
  Insert         Yes          Yes            Yes
  Delete         Yes          No             No
  Member?        Yes          Yes*           No
  Cardinality?   Yes          No             Yes*
  Space (bits)   Θ(n log n)   Θ(n)           O(log n)
* allowing some error
F0 estimators are also known as "HyperLogLog," "probabilistic counters," or "distinct element estimators."

Subquadratic Space
Trace: A B C A D A B …
Replace each set ("items seen since the first request," "items seen since the second request," …) with an F0 estimator. The reuse distance is the size of the oldest set that grows (a cardinality query); the hit rate curve is the CDF of the reuse distances.
For t = 1, …, m:
  Let v_j be the value of the j-th F0 estimator.
  Receive request b_t.
  Let δ_j be the change in the j-th F0 estimator when adding b_t.
  For j = 2, …, t: record (δ_j − δ_{j−1}) hits at reuse distance v_j.

Towards Sublinear Space
Note that an earlier F0 estimator represents a superset of any later one (set_1 ⊇ set_2 ⊇ set_3 ⊇ …). Can this be leveraged to achieve sublinear space?

F0 Estimation [Flajolet-Martin '83, Alon-Matias-Szegedy '99, …, Kane-Nelson-Woodruff '10]
Operations: Insert(x); Cardinality(), with (1+ε) multiplicative error.
Space: log(n)/ε² bits; Θ(ε⁻² + log n) is optimal.
The data structure is a matrix of bits with log n rows and ε⁻² columns.

F0 Estimation
Operations: Insert(x), Cardinality().
[Animation: the trace A B C A D A B is inserted into the matrix of log n rows and ε⁻² columns. For each item, a uniform hash function h selects a column, a geometric hash function g selects a row, and the corresponding bit is set to 1.]

F0 Estimation
Suppose we insert n distinct elements. The number of 1s in a column is the maximum of ≈ nε² geometric random variables, so it is ≈ log(nε²). Averaging over all columns gives a concentrated estimate of log(nε²); exponentiating and scaling gives a concentrated estimate of n.
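As an illustration, here is a toy estimator in the older Flajolet-Martin probabilistic-counting style, not the optimal structure cited above; the class name, the 64x32 matrix dimensions (standing in for ε⁻² columns and log n rows), and the bias-correction constant PHI ≈ 0.77351 are assumptions of this sketch:

```python
import hashlib

class F0Estimator:
    """Toy F0 estimator: h (uniform) picks a column, g (geometric) picks
    a row, and the corresponding bit is set."""
    PHI = 0.77351                   # Flajolet-Martin bias-correction constant

    def __init__(self, cols=64, rows=32, seed=0):
        self.cols, self.rows, self.seed = cols, rows, seed
        self.M = [[0] * rows for _ in range(cols)]

    def _hash(self, x):
        v = int.from_bytes(
            hashlib.sha256(f"{self.seed}:{x}".encode()).digest(), "big")
        col = v % self.cols         # uniform hash h
        rest = v >> 20              # independent bits for the geometric hash g
        row = 0
        while (rest & 1) == 0 and row < self.rows - 1:
            row += 1                # row = number of trailing zero bits
            rest >>= 1
        return col, row

    def insert(self, x):
        col, row = self._hash(x)
        self.M[col][row] = 1

    def cardinality(self):
        # Per column, find the lowest unset row; averaging over columns
        # estimates the log of the per-column load, then exponentiate
        # and scale back up.
        total = 0
        for bitmap in self.M:
            r = 0
            while r < self.rows and bitmap[r]:
                r += 1
            total += r
        return int(self.cols / self.PHI * 2 ** (total / self.cols))
```

With 64 columns the estimate is typically within roughly 10% of the true count; the bit matrix itself is only 64 x 32 bits, independent of n.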

F0 Estimation for a Chain
Operations: Insert(x); Cardinality(t), which estimates the # of distinct elements since the t-th insert.
Space: log(n)/ε² words. The matrix again has log n rows and ε⁻² columns, but each entry is now a word rather than a bit.

F0 Estimation for a Chain
[Animation: the trace A B C A D A B is inserted. As before, the uniform hash h selects a column and the geometric hash g selects a row, but each matrix entry now records the time of the most recent insert that hashed to it.]

F0 Estimation for a Chain
The {0,1}-matrix consisting of all entries ≥ t is the same as the matrix of an F0 estimator that started at time t. So, for any t, we can estimate the # of distinct elements since time t.
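A toy Python version of the chained estimator (a sketch with assumed names and parameters, mirroring the probabilistic-counting layout above): each cell stores the time of the most recent insert hashing to it, and Cardinality(t) thresholds the matrix at t before estimating as usual:

```python
import hashlib

class ChainF0Estimator:
    """Toy chained F0 estimator: thresholding the time-stamped matrix at t
    recovers the {0,1}-matrix of an estimator started at time t."""
    PHI = 0.77351                   # Flajolet-Martin bias-correction constant

    def __init__(self, cols=64, rows=32, seed=0):
        self.cols, self.rows, self.seed = cols, rows, seed
        self.M = [[0] * rows for _ in range(cols)]   # 0 = never set
        self.time = 0

    def _hash(self, x):
        v = int.from_bytes(
            hashlib.sha256(f"{self.seed}:{x}".encode()).digest(), "big")
        col = v % self.cols         # uniform hash h
        rest = v >> 20
        row = 0
        while (rest & 1) == 0 and row < self.rows - 1:
            row += 1                # geometric hash g: trailing zero bits
            rest >>= 1
        return col, row

    def insert(self, x):
        self.time += 1
        col, row = self._hash(x)
        self.M[col][row] = self.time    # overwrite with the newer time

    def cardinality(self, t=1):
        # Count a cell as set iff its entry is >= t, then estimate exactly
        # as the plain probabilistic counter would.
        total = 0
        for bitmap in self.M:
            r = 0
            while r < self.rows and bitmap[r] >= t:
                r += 1
            total += r
        return int(self.cols / self.PHI * 2 ** (total / self.cols))
```

One structure thus answers "how many distinct elements since time t" for every t at once, which is exactly what the chain of nested estimators needed.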

Theorem: Let n = B·W, where n = # distinct blocks, m = # requests, B = # bins, and W = the width of each bin. Let C : [n] → [0,1] be the true HRC and Ĉ : [n] → [0,1] the estimated HRC. Using O(B²·log(n)·log(m)/ε²) words of space, we can guarantee
  C((j−1)·W) − ε ≤ Ĉ(j·W) ≤ C(j·W) + ε   for all j = 1, …, B.
That is, the estimate has vertical error at most ε and horizontal error at most W.
[Figure: the true curve C and the estimate Ĉ over B bins of width W, with Ĉ(j·W) sandwiched between C((j−1)·W) − ε and C(j·W) + ε.]

Experiments: MSR-Cambridge traces of 13 live servers over 1 week. The trace file is 20 GB in size: 2.3B requests, 750M blocks.
Optimized C implementation of Mattson's algorithm:
– Processing time: ~1 hour
– RAM usage: ~92 GB
Java implementation of our algorithm:
– Processing time: 17 minutes (2M requests per second)
– RAM usage: 80 MB (mostly the garbage collector)

Experiments: MSR-Cambridge traces of 13 live servers over 1 week. This trace file has m = 2.3B requests, n = 750M blocks.
[Figure: estimated HRCs for this trace, comparing counter stacks against a previously proposed heuristic.]

Experiments: MSR-Cambridge traces of 13 live servers over 1 week. This trace file has m = 585M requests, n = 62M blocks.
[Figure: estimated HRCs for this trace, comparing counter stacks against a previously proposed heuristic.]

Experiments: MSR-Cambridge traces of 13 live servers over 1 week. This trace file has m = 75M requests, n = 20M blocks.
[Figure: estimated HRCs for this trace, comparing counter stacks against a previously proposed heuristic.]

Conclusions
Workload analysis by measuring uniqueness over time. The notion of a "working set" can be replaced by the "hit rate curve." HRCs can be estimated in sublinear space, quickly and accurately. On some real-world data sets, our accuracy is noticeably better than that of heuristics proposed in the literature.

Open Questions
Does the algorithm use an optimal amount of space? Can it be improved to O(B·log(n)·log(m)/ε²) words of space?
We did not discuss runtime. Can we get runtime independent of B and ε?
We are taking differences of F0 estimators by subtraction. This seems crude. Is there a better approach?
Streaming has been used in networks, databases, etc. To date it has not been used much in storage; there are potentially more uses here.