
Slide 1/23 FRESCO: An Open Failure & Resource Usage Data Repository for Dependability Research and Practice
Saurabh Bagchi, Carol Song (Purdue University); Ravi Iyer, Zbigniew Kalbarczyk (University of Illinois at Urbana-Champaign); Nathan DeBardeleben (Los Alamos)
Presentation available at: engineering.purdue.edu/dcsl

Slide 2/23 Roadmap
Motivation
–Why do we need an open data repository?
–What are our plans?
Context
–Large computing clusters at universities
–Demographics of the cluster users
–Challenges in supporting user needs
Insights from analysis of Purdue's cluster
–Cluster environment
–The data set
–Analysis and results
–Current status of the repository
–The next steps

Slide 3/23 Motivation
Dependability has become a requisite property for the computing systems that we rely on.
Dependable system design should be based on the real failure modes of systems.
BUT there does not exist any open failure data repository today for any recent computing infrastructure that is
–large enough,
–diverse enough, and
–with enough information about the infrastructure and
–the applications that run on it.

Slide 4/23 So what do we do about it?
1. Stop bemoaning the lack of publicly available dependability datasets and start building one
2. But who will want to share such data publicly?
3. Start with local IT organizations: Purdue and UIUC
4. Get NSF backing – some of the IT clusters were set up with NSF funding
5. Collect an initial dataset for a small time window, covering:
–Applications and libraries
–Resource usage – node and job level
–Health information – node and job level
6. Release the dataset after heuristic-based anonymization

Slide 5/23 National Science Foundation Context
Planning grant from the National Science Foundation (NSF), 06/14–06/15, $100K
–Computer and Information Science and Engineering (CISE) Directorate
–Computing Research Infrastructure (CRI) Program
–Deliverable: data collection tools in place on Purdue's IT infrastructure; requirements gathering
Regular grant from NSF: 3 years, started 07/15, $1.2M
–Involves UIUC with their Blue Waters cluster
–Delivered: first release of the dataset, with a DOI
–Deliverable: a large set of diverse data from the PI/co-PIs' institutions plus others; a large set of users; a large set of analytical tools made available by the community

Slide 6/23 University Compute Cluster Context
Computing clusters at universities are not uncommon.
Users have varying levels of expertise
–Writing their own job scripts
–Using scripts like a black box
Varying user needs
–High computational power: analysis of large structures (Civil, Aerospace engineering)
–High Lustre bandwidth for file operations: working with multiple large databases/files (Genomics)
–High network bandwidth: parallel processing applications

Slide 7/23 Goals of Data Analysis & Related Efforts
Goals for data analysis of cluster resource usage and failures
–Improve cluster availability
–Minimize maintenance cost
–``Customer satisfaction''
Several synergistic efforts
–XDMoD (U at Buffalo): NSF-funded project to audit utilization of the XSEDE cyberinfrastructure by providing a wide range of metrics on XSEDE resources, including resource utilization, resource performance, and impact on scholarship
–XALT (U of Chicago, UT Austin): NSF-funded project to collect and understand job-level information about the libraries and executables that jobs use
–[Past project] Computer Failure Data Repository (CFDR) (CMU, U of Toronto)

Slide 8/23 Purdue Cluster Details and Initial Dataset

Slide 9/23 Cluster Details
Purdue's cluster is called Conte
–580 homogeneous nodes
–Each node contains two 8-core Intel Xeon E Sandy Bridge processors running at 2.6 GHz
–Two Xeon Phi 5110P accelerator cards per node, each with 60 cores
–Memory: 64 GB of DDR3 1.6 GHz RAM
–40 Gbps FDR10 InfiniBand interconnect, along with IP
–Lustre file system, 2 GB/s
–RHEL 6.6
–PBS-based job scheduling using Torque
Conte is a ``Community'' cluster

Slide 10/23 Cluster Details
Scheduling in Conte:
–Each job requests a certain time duration, a number of nodes, and in some cases an amount of memory
–When a job exceeds its specified time limit, it is killed
–Jobs are also killed by out-of-memory (OOM) killer scripts if they exhaust available physical memory and swap space
Node sharing:
–By default, only a single job is scheduled on an entire node, giving it dedicated access to all the resources
–However, a user can enable sharing via a configuration option in the job submission script
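The two kill conditions above can be sketched as a simple check. This is only an illustration with hypothetical field names; the actual Torque scheduler and OOM-killer scripts implement the policy differently:

```python
# Sketch of the two kill conditions described on this slide: walltime
# overrun and memory exhaustion. Field names are hypothetical.

def kill_reason(job, node_mem_bytes, swap_bytes):
    """Return why a job would be killed, or None if it may keep running."""
    if job["elapsed_secs"] > job["requested_walltime_secs"]:
        return "walltime"  # exceeded the time limit given at submission
    if job["mem_used_bytes"] > node_mem_bytes + swap_bytes:
        return "oom"  # exhausted physical memory plus swap space
    return None

job = {"elapsed_secs": 7200, "requested_walltime_secs": 3600,
       "mem_used_bytes": 10 * 2**30}
print(kill_reason(job, node_mem_bytes=64 * 2**30, swap_bytes=16 * 2**30))
# → walltime
```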

Slide 11/23 Conte's Data Set
Data set spans Oct '14 – Mar '15 (6 months)
–~500k jobs (489,971 jobs)
–306 unique users
Per-job data
–Accounting logs from the PBS scheduler: job owner details; start/end times; resources requested/used
–List of shared libraries used (via lsof)
Node-wise performance data
–Collected using TACC Stats: Lustre, InfiniBand, virtual memory, memory, and more
Syslog messages
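Torque accounting logs of the kind listed above are semicolon-delimited records followed by key=value attributes. A minimal parsing sketch (the sample line is invented for illustration; real records carry many more keys):

```python
# Sketch of parsing one PBS/Torque-style accounting-log record.
# Format: "timestamp;record type;job id;key=value key=value ...".

def parse_record(line):
    """Split an accounting line into a flat dict of its fields."""
    timestamp, rec_type, job_id, attrs = line.split(";", 3)
    fields = dict(kv.split("=", 1) for kv in attrs.split())
    return {"time": timestamp, "type": rec_type, "job": job_id, **fields}

line = ("10/14/2014 11:39:21;E;12345.conte;user=u42 "
        "Resource_List.walltime=04:00:00 resources_used.walltime=00:23:10")
rec = parse_record(line)
print(rec["type"], rec["resources_used.walltime"])
# → E 00:23:10
```

Record type "E" marks a job-end record, which is where the resources requested ("Resource_List.*") and used ("resources_used.*") appear together.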

Slide 12/23 System Usage Repository
Objectives
–Enable systems research in dependability that relies on system usage and failure records from large-scale systems
–Provide synchronized workload information, system usage information, some user information, and hardware status information
–Provide this repository for a diversity of workloads and a diversity of computing systems
–Provide common analytic tools so that dependability-related questions can be asked of the data
URL:

Slide 13/23 Questions for the Repository
What is the resource utilization (CPU, memory, network, storage) of jobs with a certain characteristic?
–The characteristic could be the application domain, the libraries being used, or parallel vs. serial (and if parallel, how many cores)
–We especially want to know when resource utilization is anomalously high
What is the profile of users submitting jobs to the cluster?
–Are they asking for too much or too little resource in the submission script?
–Are they making appropriate use of the parallelism?
–What kinds of problem tickets do they submit, and how many rounds are needed to resolve them?
URL:

Slide 14/23 Current Status of the Repository
Workload traces from Conte
–Accounting information (Torque logs)
–TACC Stats performance data
–User documentation
Privacy
–Anonymize machine-specific information
–Anonymize user/group identifiers
–The library list is not shared, for privacy reasons
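One plausible way to anonymize the user/group identifiers mentioned above is keyed hashing, which yields stable pseudonyms so that records can still be joined across datasets after release. The slides do not state the actual scheme used; the key below is hypothetical:

```python
# Sketch of pseudonymizing identifiers before a public data release.
# SECRET_KEY is a hypothetical site-local secret, never published.
import hashlib
import hmac

SECRET_KEY = b"site-local secret"

def anonymize(identifier):
    """Deterministic pseudonym for a user/group/host identifier."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return "u" + digest.hexdigest()[:8]

# Same input -> same pseudonym, so joins across the accounting and
# performance datasets survive anonymization.
print(anonymize("alice") == anonymize("alice"))  # → True
print(anonymize("alice") != anonymize("bob"))    # → True
```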

Slide 15/23 Insights from Data Analysis
–A few libraries (not pre-installed) are highly popular
–45% of jobs use less than 10% of their requested time
–70% of jobs use less than 50% of their requested memory
–20% of jobs showed memory thrashing
–The memory-thrashing behavior of jobs that share a node and jobs that do not is exactly opposite
–A few jobs place very high demands on I/O and network resources

Slide 16/23 Hot Libraries
Extract all dynamically linked libraries being used by the applications
1. Out of a total of 3,629 unique libraries, some are used much more often than others
2. Each of the top 50 libraries is used by more than 80% of the users
Figure: sorted histogram of the top 500 libraries as used by the jobs
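Ranking library popularity from per-job library lists reduces to counting. A sketch with toy data (the real study extracts each job's list with lsof):

```python
# Sketch of the "hot libraries" ranking: count how many jobs loaded
# each shared library. Job and library names here are hypothetical.
from collections import Counter

jobs = {  # job id -> set of shared libraries the job loaded
    "j1": {"libmkl.so", "libmpi.so", "libc.so"},
    "j2": {"libmkl.so", "libc.so"},
    "j3": {"libhdf5.so", "libmpi.so", "libc.so"},
}

usage = Counter(lib for libs in jobs.values() for lib in libs)
print(usage.most_common(1))
# → [('libc.so', 3)]
```

Sorting the resulting counts gives exactly the kind of histogram shown on this slide; grouping by job owner instead of job id gives the per-user popularity behind insight 2.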

Slide 17/23 Resource Request Patterns
1. Almost 45% of jobs actually used less than 10% of their requested time
2. But during busy periods the scheduler gives higher priority to shorter jobs
Figure: percentage of the user-requested time actually used by the jobs
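The requested-vs-used walltime metric behind this slide can be sketched as follows (toy numbers; the real inputs come from the Resource_List and resources_used fields of the accounting logs):

```python
# Sketch of the "fraction of requested time actually used" analysis.
# Job tuples are hypothetical (requested seconds, used seconds).

def fraction_used(requested_secs, used_secs):
    return used_secs / requested_secs

jobs = [(14400, 900), (3600, 3500), (7200, 400)]
short = sum(1 for req, used in jobs if fraction_used(req, used) < 0.10)
print(f"{short}/{len(jobs)} jobs used <10% of requested time")
# → 2/3 jobs used <10% of requested time
```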

Slide 18/23 Shared vs. Non-shared Jobs
Jobs that share a node have higher thrashing compared to jobs that do not share the node
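The slides do not state the exact thrashing-detection rule, but a common heuristic is a sustained high rate of major page faults, which TACC Stats-style virtual-memory counters expose. A sketch under that assumption (threshold and numbers are hypothetical):

```python
# Sketch of a thrashing heuristic: mean major-page-fault rate above
# a threshold. Counter deltas and the threshold are hypothetical.

def is_thrashing(major_fault_deltas, interval_secs, threshold_per_sec=100.0):
    """Flag a job whose mean major-fault rate exceeds the threshold."""
    rates = [d / interval_secs for d in major_fault_deltas]
    return sum(rates) / len(rates) > threshold_per_sec

# Deltas of a pgmajfault-style counter over successive 10 s samples.
print(is_thrashing([5000, 8000, 12000], interval_secs=10))  # → True
print(is_thrashing([10, 0, 3], interval_secs=10))           # → False
```

Comparing this flag across jobs that share a node and jobs that do not is what yields the contrast stated on this slide.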

Slide 19/23 Project Plans

Slide 20/23 Immediate Next Steps
Continued data collection on Purdue clusters
Analysis of Blue Waters logs
–Analyze the workloads along similar lines as Conte
–Investigate the similarities and differences in the results
–Identify user behavior and workload characteristics for better cluster management
Close the loop
–Implement remediation measures, e.g., move some jobs to a different cluster, or increase the memory asked for
–Check the effect of the remediation measures

Slide 21/23 A Wish List for Types of Data
Accounting information
–Job owner ID, group ID
–Start/end times
–Resources requested/used: memory, CPUs, walltime, etc.
Performance statistics
–I/O usage (Lustre, disk)
–Network usage (IP, InfiniBand)
–Memory usage
–Virtual memory statistics
Library list per job
–Shared objects used by the job

Slide 22/23 A Wish List for Types of Data
User tickets
–Problem and resolution
Failure resolution reports
–Failure description
–Root cause identification
–Issue resolution

Slide 23/23 Conclusion
It is important to analyze how resources are being utilized by users
–For scheduler tuning, resource provisioning, and educating users
It is important to look at workload information together with failure events
–Workload affects the kinds of hardware-software failures that are triggered
An open repository enables researchers to perform different kinds of analyses to enhance system dependability
Contributors
–Purdue: Suhas Javagal, Subrata Mitra, Chris Thompson, Stephen Harrell, Chuck Schwarz
–UIUC/NCSA: Saurabh Jha, Joseph Fullop, Jeremy Enos, Fei Deng, Jin Hao
–DOE: Todd Gamblin, Ignacio Laguna, Dong Ahn

Slide 24/23 Thank you