Deconstructing Commodity Storage Clusters Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Univ. of Wisconsin - Madison.

Presentation transcript:

Deconstructing Commodity Storage Clusters
Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau – Univ. of Wisconsin - Madison
Jiri Schindler – EMC Corporation

2 Storage system
• Storage system
  – Important component of large-scale systems
  – Multi-billion dollar industry
• Often comprised of high-end storage servers
  – A big box with lots of disks inside
• The simple question
  – How does a storage server work?
  – Simple but hard to answer – storage subsystem designs are closed

3 Why do we need to know?
• Better modeling
  – How the system behaves under different workloads
  – Example in the storage industry: capacity models for capacity planning
  – A model is limited if the available information is limited
• Product validation
  – Validate what the product specs say
  – Performance numbers alone cannot confirm them
• Critical evaluation of design and implementation choices
  – Control what is occurring inside

4 Traditionally a black box
• Highly customized and proprietary hardware and OS
  – Hitachi Lightning, NetApp Filers, EMC Symmetrix
  – EMC Symmetrix: disk/cache manager, proprietary OS
• Internal information is hidden behind standard interfaces
(Diagram: a client sends requests into an opaque storage system and sees only acknowledgements come back.)

5 Modern graybox storage systems
• Cluster of commodity PCs running a commodity OS
  – Google FS cluster, HP FAB, EMC Centera
• Advantages of commodity storage clusters
  – Direct internal observation – visible probe points
  – Leverage existing standardized tools
(Diagram: a client sends a request to the storage system; internal messages such as "Update DB" are visible inside the cluster.)

6 Intra-box Techniques
• Two "intra-box" techniques
  – Observation
  – System perturbation
• Two components of analysis
  – Deduce the structure of the main communication protocol
    – Object Read and Write protocol
  – Internal policy decisions
    – Caching, prefetching, write buffering, load balancing, etc.

7 Goal and EMC Author
• Objectives
  – Show the feasibility of deconstructing commodity storage clusters with no source code
  – Results achieved without EMC assistance
• EMC author
  – Evaluate the correctness of our findings
  – Give insights behind their design decisions

8 Outline
• Introduction
• EMC Centera Overview
  – Intra-box tools
• Deducing Protocol
  – Observation and delay perturbation
• Inferring Policies
  – System perturbation
• Conclusion

9 Centera Topology
(Diagram: clients connect over the WAN to two access nodes (AN 1, AN 2), which connect over the LAN to six storage nodes (SN 1–SN 6).)

10 (Diagram: node software stacks – the client runs the Client SDK over TCP; access nodes run the Centera software on Linux over TCP/UDP; storage nodes run the Centera software on a commodity OS (Linux) with ReiserFS, an IDE driver, and TCP/UDP. Clients reach access nodes over the WAN; access nodes reach storage nodes over the LAN.)

11 Probe Points – Observation
• Internal probe points
  – Trace traffic using standardized tools
    – tcpdump: trace network traffic
    – Pseudo device driver: trace disk traffic
(Diagram: tcpdump sits on the network paths between client, access nodes, and storage nodes; the pseudo device driver sits between ReiserFS and the IDE driver inside each storage node.)
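To make the observation probe point concrete, below is a minimal sketch of a user-level network probe written against libpcap; it prints a timestamp and length for every captured packet (message sizes are what later identify individual protocol messages). The interface name "eth0" is an assumption, and the authors simply ran tcpdump at these probe points, so this is only an illustration of the same idea.

/*
 * Minimal network probe sketch (the role tcpdump plays in the paper).
 * Assumption: capturing on interface "eth0"; in the real setup this
 * runs on the access and storage nodes.
 *
 * Build: cc net_probe.c -lpcap -o net_probe
 */
#include <pcap.h>
#include <stdio.h>
#include <stdlib.h>

/* Print a timestamp and length for every captured packet. */
static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes)
{
    (void)user;
    (void)bytes;
    printf("%ld.%06ld len=%u\n",
           (long)h->ts.tv_sec, (long)h->ts.tv_usec, h->len);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_live("eth0", 128 /* snaplen */, 1 /* promisc */,
                               1000 /* read timeout (ms) */, errbuf);
    if (!p) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return EXIT_FAILURE;
    }
    pcap_loop(p, -1, on_packet, NULL);   /* capture until interrupted */
    pcap_close(p);
    return 0;
}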

12 Probe Points – Perturbation
• Perturbing the system at probe points
  – Modified NistNet: delay particular messages
  – Pseudo device driver: delay disk I/O traffic
  – Additional load
    – CPU load: high-priority while loop (a sketch follows below)
    – Disk load: file copy
(Diagram: the same node stacks as before, now with the modified NistNet and the pseudo device driver inserted as delay points on a storage node, plus "Add CPU Load: while(1) {..}" and "Add Disk Load: cp fX fY" running as user-level processes.)
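The CPU-load perturbation is simple enough to show directly. The sketch below is one plausible reading of the "high priority while loop": switch the process to a real-time scheduling class and spin. The priority value is an assumption; the slides only say that a high-priority while(1) loop was used.

/*
 * CPU-load perturbation sketch: a busy loop at real-time priority so it
 * steals cycles from the Centera software on a storage node.
 * Assumption: SCHED_FIFO priority 50 (requires root).
 */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    for (;;)
        ;   /* the high-priority while(1) loop: the load itself */
}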

13 Outline
• Introduction
• EMC Centera Overview
• Deducing Protocol
  – Observation and delay perturbation
• Inferring Policies
  – System perturbation
• Conclusion

14 Understanding the protocol
• Understanding the Read/Write protocol
  – Read and Write implementations in big distributed storage systems are not simple
  – Deconstruct the protocol structure
    – Which pieces are involved?
    – Where is data sent?
    – Is data reliably stored, mirrored, striped?

15 Observing the Write Protocol
• Deconstruct the protocol using passive observation
  – Run a series of write workloads
  – Observe network and disk traffic
  – Correlation tools: convert the traces into a protocol structure (a toy sketch follows below)
(Diagram: the client issues write() to the EMC Centera cluster; the request flows through the access nodes to the storage nodes.)
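The correlation tools themselves are not shown on the slides, so the following is only a toy sketch of the basic step they must perform: merging a timestamped network trace and a timestamped disk trace into one ordered timeline, from which the message/disk-write ordering can be read off. The "<seconds> <event text>" input format is an assumption made for illustration.

/*
 * Toy trace-correlation sketch: merge two already-sorted trace files
 * (network and disk) by timestamp into one interleaved timeline.
 * Assumption: each line is "<seconds> <event text>".
 *
 * Usage: ./merge net.trace disk.trace
 */
#include <stdio.h>
#include <stdlib.h>

struct ev { double t; char text[256]; int ok; };

/* Read the next event from one trace; ok=0 at end of file. */
static void next_ev(FILE *f, const char *tag, struct ev *e)
{
    char line[256];
    e->ok = 0;
    if (fgets(line, sizeof line, f)) {
        char body[200] = "";
        if (sscanf(line, "%lf %199[^\n]", &e->t, body) >= 1) {
            snprintf(e->text, sizeof e->text, "[%s] %s", tag, body);
            e->ok = 1;
        }
    }
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s net.trace disk.trace\n", argv[0]);
        return 1;
    }
    FILE *net = fopen(argv[1], "r"), *disk = fopen(argv[2], "r");
    if (!net || !disk) { perror("fopen"); return 1; }

    struct ev a, b;
    next_ev(net, "net", &a);
    next_ev(disk, "disk", &b);

    /* Two-way merge on timestamp: the interleaved output shows which
     * network messages and disk writes happen in what order. */
    while (a.ok || b.ok) {
        if (a.ok && (!b.ok || a.t <= b.t)) {
            printf("%.6f %s\n", a.t, a.text);
            next_ev(net, "net", &a);
        } else {
            printf("%.6f %s\n", b.t, b.text);
            next_ev(disk, "disk", &b);
        }
    }
    fclose(net);
    fclose(disk);
    return 0;
}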

16 Observation Results
• Object Write protocol findings
  – Phase 1: Write request establishment
  – Phase 2: Data transfer
  – Phase 3: Disk write, notify other SNs, commit
  – Phase 4: Series of acknowledgements
• General properties
  – The primary SN handles generation of the 2nd copy
  – Two new TCP connections per object write
(Timing diagram: client, access node, primary SN, and secondary SN exchange the write request, request ack, TCP setup, data transfer, write-commits, transfer ack, write complete, and software ACKs over time.)

17 Resolving Dependencies
• Cannot conclude dependencies from observation only
  – B after A != B depends on A
  – Must delay A and see whether B is delayed
• From observation only: the primary commit appears to depend on the secondary commit and the synchronous disk write
• Conclude causality by delaying:
  – disk write traffic, and
  – the secondary commit
(Diagram: AN, primary SN, and secondary SN exchanging the secondary commit (sc) and primary commit (pc).)

18 Delaying a Particular Message
• Need to delay a particular message
  – Leverage packet sizes
  – Modify NistNet
    – Delay a specific message, not the whole link
    – Example: delay sc (90 bytes)
(Diagram: messages of 90 and 299 bytes flow between client, access node, primary SN, and secondary SN; inside the primary SN's Linux stack, the modified NistNet checks each incoming packet and, if its size is 90 bytes, places it in a delay queue.)
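The real mechanism is a modified in-kernel NistNet module, which is too large to reproduce here. As a user-level stand-in, the sketch below shows the same size-match-then-delay idea with a tiny TCP relay: all traffic is forwarded unchanged except that any chunk whose size equals a target (e.g., the 90-byte secondary commit) is held briefly before forwarding. The addresses, the delay value, and the assumption that the targeted message arrives as a single read are all illustrative.

/*
 * User-level stand-in for "delay a specific message, not the link".
 * A tiny TCP relay forwards traffic unchanged, except that any chunk of
 * exactly TARGET_SIZE bytes (e.g., the 90-byte secondary commit) is held
 * for DELAY_MS before being forwarded. Assumptions: one connection, one
 * message per read(), illustrative delay value.
 *
 * Usage: ./delay_relay <listen_port> <dst_ip> <dst_port>
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#define TARGET_SIZE 90      /* size of the message to single out */
#define DELAY_MS    100     /* injected delay */

/* Copy one chunk from one socket to the other, delaying the target. */
static int forward(int from, int to)
{
    char buf[4096];
    ssize_t n = read(from, buf, sizeof buf);
    if (n <= 0)
        return -1;
    if (n == TARGET_SIZE)
        usleep(DELAY_MS * 1000);        /* hold the targeted message */
    return write(to, buf, n) == n ? 0 : -1;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <listen_port> <dst_ip> <dst_port>\n", argv[0]);
        return 1;
    }

    /* Accept one downstream connection. */
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in la = { .sin_family = AF_INET };
    la.sin_addr.s_addr = htonl(INADDR_ANY);
    la.sin_port = htons(atoi(argv[1]));
    int one = 1;
    setsockopt(ls, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    if (bind(ls, (struct sockaddr *)&la, sizeof la) || listen(ls, 1)) {
        perror("listen");
        return 1;
    }
    int cs = accept(ls, NULL, NULL);
    if (cs < 0) { perror("accept"); return 1; }

    /* Connect upstream to the real destination. */
    int us = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in ua = { .sin_family = AF_INET };
    ua.sin_port = htons(atoi(argv[3]));
    inet_pton(AF_INET, argv[2], &ua.sin_addr);
    if (connect(us, (struct sockaddr *)&ua, sizeof ua)) {
        perror("connect");
        return 1;
    }

    /* Shuttle bytes in both directions until either side closes. */
    for (;;) {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(cs, &rd);
        FD_SET(us, &rd);
        if (select((cs > us ? cs : us) + 1, &rd, NULL, NULL, NULL) < 0)
            break;
        if (FD_ISSET(cs, &rd) && forward(cs, us) < 0)
            break;
        if (FD_ISSET(us, &rd) && forward(us, cs) < 0)
            break;
    }
    return 0;
}

In the real setup the delay is injected transparently on the receiving node, so no re-routing through a relay is needed; the relay only makes the size-matching logic easy to show with standard sockets.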

19 Delaying the secondary commit
• Resolving the first dependency
  – Delay the secondary commit → the primary commit is also delayed
  – The primary commit depends on the receipt of the secondary commit
(Diagram: AN, primary SN, and secondary SN; when the secondary commit is delayed, the primary commit shifts by the same delay.)

20 Delaying disk I/O traffic
• Delay disk writes at the primary storage node
(Diagram: on the primary SN, the pseudo device driver between ReiserFS and the IDE driver checks each disk request and, if it is a WRITE, places it in a delay queue; the primary commit is delayed along with the disk write.)
• From observation and delay: the primary commit depends on the secondary-commit message and the synchronous disk write
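The delay itself lives in a kernel pseudo device driver sitting between ReiserFS and the IDE driver, which is too involved to sketch here. As a rough user-level approximation of the same perturbation, the fragment below is an LD_PRELOAD library that adds latency to every write() call of one process; the delay constant and the decision to slow all writes are assumptions.

/*
 * User-level approximation of the disk-write delay: an LD_PRELOAD
 * interposer that adds latency to every write() made by the traced
 * process. The in-kernel pseudo device driver used in the paper delays
 * WRITE requests below the filesystem instead.
 *
 * Build: cc -shared -fPIC delay_write.c -o delay_write.so -ldl
 * Use:   LD_PRELOAD=./delay_write.so <program>
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

#define WRITE_DELAY_US 20000   /* assumed extra latency per write */

ssize_t write(int fd, const void *buf, size_t count)
{
    /* Look up the real write() once. */
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    usleep(WRITE_DELAY_US);    /* the injected delay */
    return real_write(fd, buf, count);
}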

21 Ability to analyze internal designs
• Intra-box techniques: observation and perturbation by delay
  – Able to deduce the Object Write protocol
  – Give the ability to analyze internal design decisions
• Serial vs. parallel
  – The primary SN generates the 2nd copy (serial) vs. the AN writing both the 1st and 2nd copies (parallel)
  – EMC Centera: write throughput is more important
  – Decreasing the load on access nodes increases write throughput
• New TCP connections (internally) per object write
  – vs. using persistent connections to remove TCP setup cost
  – Prefer simplicity – no need to manage persistent connections for all requests
(Diagram: serial chain Client → AN → SN1 → SN2 vs. parallel fan-out from the AN to SN1 and SN2.)

22 Outline
• Introduction
• EMC Centera Overview
• Deducing Protocol
• Inferring Policies
  – Various system perturbations
• Conclusion

23 Inferring internal policies
• Write policies
  – Level of replication, load balancing, caching/buffering
• Read policies
  – Caching, prefetching, load balancing
• Try to infer
  – Is a particular policy implemented?
  – At which level is it implemented?
    – Example: read caching at the client, access node, or storage node?

24 System Perturbation
• Perturb the system
  – Delay and extra load
• 4 common load-balancing factors:
  – CPU load: high-priority while loop
  – Disk load: background file copy (a sketch follows below)
  – Active TCP connections
  – Network delay
(Diagram: the client issues write() to an access node, which must choose among storage nodes SN 1, SN 2, SN 3, …; some nodes carry extra CPU load, active TCP connections, or added network delay.)
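The disk-load perturbation is just a background file copy ("cp fX fY" on the slide). For completeness, here is a hedged sketch of the same load generator in C: it repeatedly copies a large file and fsync()s the destination so the storage node's disk stays busy. File names and buffer size are assumptions.

/*
 * Disk-load perturbation sketch: keep a storage node's disk busy by
 * copying a large file in a loop (the slides simply show "cp fX fY").
 *
 * Usage: ./disk_load <src> <dst>
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    char buf[1 << 16];
    for (;;) {                                   /* keep the disk busy */
        int in = open(argv[1], O_RDONLY);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)
            write(out, buf, n);

        fsync(out);                              /* force it to disk */
        close(in);
        close(out);
    }
}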

25 Write Load Balancing
• What factors determine which storage nodes are selected?
• Experiment:
  – Observe which primary storage nodes are selected
  – Without load: writes are balanced
  – With load: writes skew toward unloaded nodes
(Diagram: an AN choosing between sn#1 and sn#2, first with both unloaded, then with sn#2 loaded.)

26 Write Load Balancing Results
(Chart: share of writes going to sn#1 vs. sn#2 with no perturbation, and with additional CPU load, disk load, extra TCP connections, or incoming network delay applied to sn#1.)

27 Summary of findings
• Write policies
  – Replication: two copies on two nodes attached to different power sources (reliability)
  – Load balancing: based on CPU usage (a locally observable status); network status is not incorporated
  – Write buffering: storage nodes write synchronously
• Read policies
  – Caching: storage node only (commodity filesystem); the access node and client do not cache
  – Prefetching: storage node only (commodity filesystem); the access node and client do not prefetch
  – Load balancing: not implemented in this earlier version; still reads from busy nodes
• EMC Centera: simplicity and reliability

28 Conclusion
• Intra-box techniques:
  – Observe and perturb
  – Deconstruct the protocol and infer policies
  – No access to source code
• Power of probe points
  – More places to observe
  – Ability to control the system
• Systems built with more externally visible probe points
  – Systems more readily understood, analyzed, and debugged
  – Higher-performing, more robust, and more reliable computer systems

29 Questions?