Building a Parallel File System Simulator. E. Molina-Estolano, C. Maltzahn, et al., UC Santa Cruz. Published in Journal of Physics: Conference Series, 2009.

Abstract This paper introduces IMPIOUS (Imprecisely Modeling Parallel I/O is Usually Successful), a parallel file system simulator. The authors show that IMPIOUS can reproduce the performance effects of different workloads on a parallel file system (validated against PanFS), and that its simulation runtime is acceptable for large-scale simulations (e.g., thousands of clients and 512 OSDs) on an ordinary single-core machine.

Introducing IMPIOUS IMPIOUS is a trace-driven parallel file system simulator. It uses an abstract simulation model of object-based parallel file systems, capturing just enough detail to reproduce the important effects observed in real systems. The goals are: a simple, scalable, and general simulation model; a framework for rapid hypothesis testing; and a simulator available to the open source community.
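
To make the trace-driven approach concrete, here is a minimal sketch, purely illustrative rather than the IMPIOUS implementation, of replaying a parallel I/O trace through simulated clients (the trace format and field names are assumptions):

```python
import csv

def replay_trace(path, clients):
    """Replay a parallel I/O trace: each record names a client process and an
    I/O operation, and the corresponding simulated client issues it against
    the simulated file system. (Trace format and field names are hypothetical.)"""
    with open(path) as f:
        for rec in csv.DictReader(f):
            client = clients[int(rec["client_id"])]
            client.issue(op=rec["op"],            # "read" or "write"
                         offset=int(rec["offset"]),
                         size=int(rec["size"]))
```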

Abstract model of PFSs Typical PFSs consist of the same subsystems: OSDs – run their own local file systems and export a flat namespace of objects. Metadata management – implements the file namespace and all metadata operations. Clients – provide the file system interface to applications.

Abstract model of PFSs The key characteristics distinguishing particular PFSs are: Data placement strategy – maps requests to OSDs using a particular striping strategy. Resource locking protocol – determines which disks or OSDs are locked for a particular client access. Redundancy strategy – determines the kind of redundancy and which subsystem is responsible for maintaining it. Client buffer cache – specifies how communication between clients and OSDs is managed.
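
As an illustration of the data placement characteristic, here is a minimal sketch (not the authors' code, and not the placement used by any particular PFS) of round-robin striping that maps a file byte offset to an OSD index and an object offset, assuming a fixed stripe unit:

```python
def place(offset, stripe_unit, num_osds):
    """Map a file byte offset to (osd_index, object_offset) under simple
    round-robin striping. Illustrative only."""
    stripe_index = offset // stripe_unit      # which stripe unit the byte falls in
    osd_index = stripe_index % num_osds       # round-robin across OSDs
    stripe_row = stripe_index // num_osds     # how many full rounds precede it
    object_offset = stripe_row * stripe_unit + (offset % stripe_unit)
    return osd_index, object_offset

# Example: byte offset 200 KB with a 64 KB stripe unit across 8 OSDs
print(place(200 * 1024, 64 * 1024, 8))        # -> (3, 8192): OSD 3, object offset 8 KB
```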

The Major Components of IMPIOUS

The clients are driven by parallel I/O traces. File system-specific plugins determine the fundamental behaviors: the data placement strategy, replication strategy, locking discipline, and client cache strategy. Clients communicate with OSDs to perform I/O. Each simulated OSD consists of an instance of NaiveFS, an idealized per-OSD file system, backed by a disk simulation – either DiskSim or a simple disk model.
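
A minimal sketch of what such a file system-specific plugin interface might look like; the class and method names are hypothetical, since the paper does not give IMPIOUS's actual API:

```python
from abc import ABC, abstractmethod

class PFSPlugin(ABC):
    """Hypothetical per-file-system plugin: one subclass per PFS
    (e.g. PVFS, PanFS, Ceph) supplies the behaviors IMPIOUS varies."""

    @abstractmethod
    def place(self, file_id, offset, size):
        """Return the list of (osd, object_offset, length) pieces for a request."""

    @abstractmethod
    def replicas(self, osd):
        """Return the OSDs holding redundant copies for this OSD, if any."""

    @abstractmethod
    def lock(self, client, pieces):
        """Apply the locking discipline before the pieces are issued."""

    @abstractmethod
    def cache_write(self, client, request):
        """Decide whether the client buffer cache absorbs or flushes the write."""
```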

Current status Models have been implemented for PVFS, PanFS, and Ceph; a Lustre model is in progress. The validation results below are based on PanFS.

Validation 1 – Write Alignment vs. Bandwidth Traces used: PLFS maps from LANL, an N-1 checkpoint pattern. For the real experiments: 512 client processes and a PanFS storage system with 56 OSDs, both on LANL's Roadrunner test cluster; the PatternIO benchmark runs on the clients to collect the results.

Validation 1 – Real Results The PatternIO results fall into three regimes by write size: writes smaller than the 4K page size, writes that are a multiple of the 4K page size, and writes that are a multiple of the 64K stripe size. The stripe-aligned writes achieve the highest bandwidth.
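
A small sketch of the alignment classification implied above; the 4 KB page size and 64 KB stripe size come from the slide, while the function itself is illustrative:

```python
PAGE = 4 * 1024      # PanFS page size from the slide (4 KB)
STRIPE = 64 * 1024   # stripe unit from the slide (64 KB)

def alignment_class(write_size):
    """Bucket a write size into the regimes seen in the PatternIO results."""
    if write_size % STRIPE == 0:
        return "stripe-aligned (highest bandwidth)"
    if write_size % PAGE == 0:
        return "page-aligned"
    if write_size < PAGE:
        return "sub-page"
    return "unaligned"

for s in (1024, 4096, 40 * 1024, 64 * 1024, 128 * 1024):
    print(s, alignment_class(s))
```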

Validation 1 – Comparison with IMPIOUS

Overall, the relative results from IMPIOUS are useful for analysis: IMPIOUS models the trends accurately. With DiskSim, the IMPIOUS results do not match the absolute values because the authors did not have a DiskSim model matching the performance of the real disks. The simple disk model was developed because DiskSim consumes too much of the simulation's runtime. With the simple disk model, the absolute throughputs are considerably lower; the results are not yet very precise, and the authors are refining the model.
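
The paper does not spell out the simple disk model's internals; the following is a minimal sketch of the kind of model that can stand in for DiskSim when absolute accuracy matters less than speed, with all parameter values invented for illustration:

```python
def service_time(size_bytes, sequential,
                 seek_s=0.008, rot_half_rev_s=0.004, xfer_bps=80e6):
    """Crude per-request disk service time: positioning cost (skipped for
    sequential requests) plus transfer time. Parameter values are made up
    for illustration, not taken from the paper."""
    positioning = 0.0 if sequential else seek_s + rot_half_rev_s
    return positioning + size_bytes / xfer_bps

# A 47 KB random write vs. the same write issued sequentially
print(service_time(47 * 1024, sequential=False))  # ~0.0126 s
print(service_time(47 * 1024, sequential=True))   # ~0.0006 s
```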

Validation 2 – Performance effects of N-1 checkpointing Due to the locking required by an N-1 checkpoint pattern, performance degrades significantly when multiple clients write to the same file. Real experiment: run on LANL's Roadrunner supercomputer with 512 OSDs running PanFS. Simulation: the simple disk model on a single core of a 2 GHz Opteron machine; runs took 12 minutes to 3 hours. In this workload, a number of clients generate an N-1 strided pattern of 47 KB writes.
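
For reference, a small sketch of how an N-1 strided checkpoint pattern lays writes out in the single shared file; the 47 KB write size is taken from the slide, and the formula is the standard strided layout rather than code from the paper:

```python
WRITE = 47 * 1024  # per-write size from the slide

def n_to_1_strided_offset(client, step, num_clients, write_size=WRITE):
    """Offset of a client's `step`-th write in an N-1 strided pattern:
    all N clients write round `step` into one shared file, interleaved
    by rank, so neighboring clients' writes land next to each other and
    contend for the same stripe/lock regions."""
    return (step * num_clients + client) * write_size

# Client 3's first three write offsets with 512 clients sharing the file
print([n_to_1_strided_offset(3, k, 512) for k in range(3)])
```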

Validation 2 – Performance effects of N-1 checkpointing The stair-step behavior arises because the Roadrunner cluster is partitioned into sub-clusters, each supporting 720 processors and each receiving an equal fraction of the total I/O bandwidth. IMPIOUS correctly predicts the relative ordering of the two curves and the linear growth of the N-1 throughput.
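
A hypothetical illustration of that stair-step, assuming client processes fill one 720-processor sub-cluster before spilling into the next (the per-sub-cluster bandwidth value is a free parameter, not a number from the paper):

```python
import math

def available_bandwidth(num_clients, per_subcluster_bw, procs_per_subcluster=720):
    """Aggregate I/O bandwidth available to the clients rises in steps of one
    sub-cluster's share, since each engaged sub-cluster contributes an equal
    fraction of the total. Purely illustrative of the stair-step shape."""
    subclusters_used = math.ceil(num_clients / procs_per_subcluster)
    return subclusters_used * per_subcluster_bw
```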

Conclusions Given two different sets of experimental results from checkpointing workloads on PanFS, IMPIOUS is able to simulate the important performance effects of those workloads. The wall-clock runtime of IMPIOUS is also encouraging: for experimental runs consuming 2 minutes across a 512-node storage cluster, the slowest simulations took 3 hours on a single core.

Our PFS Simulator

[Architecture diagram: client traces drive simulated clients; a metadata server applies the striping strategy; data servers each run a local FS with a disk queue and a scheduling algorithm, backed by DiskSim instances attached over socket connections; all components communicate through a simulated network built in OMNeT++.]

Existing Problems Simulation time – DiskSim and the communication between DiskSim and OMNeT++ are the runtime bottleneck of our simulator; planned remedies are to optimize the synchronization mechanism, to choose a more efficient communication channel than TCP, and to implement a “simple disk model”. We also lack disk models for modern disks – the DIXtrac program is very limited in modeling them.
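
To show why the channel matters, here is a hedged sketch of the per-request round trip to an external DiskSim process over TCP; the message format and port are invented for illustration, not the simulator's actual protocol:

```python
import json
import socket

def connect_disksim(host="localhost", port=9000):
    """Connect to a (hypothetical) external DiskSim process over TCP."""
    sock = socket.create_connection((host, port))
    return sock, sock.makefile("r")

def disksim_service_time(sock, reply_stream, lba, size, is_read):
    """One simulated I/O costs one blocking socket round trip to DiskSim.
    This per-request handshake is the bottleneck referred to above; batching
    requests or using a cheaper channel would amortize it."""
    msg = json.dumps({"lba": lba, "size": size, "read": is_read}) + "\n"
    sock.sendall(msg.encode())
    return json.loads(reply_stream.readline())["service_time"]
```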

Thank you!