Idle virtual machine detection in FermiCloud Giovanni Franzini September 21, 2012 Scientific Computing Division Grid and Cloud Computing Department.

Slides:



Advertisements
Similar presentations
Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.
Advertisements

Slide 2-1 Copyright © 2004 Pearson Education, Inc. Operating Systems: A Modern Perspective, Chapter 2 Using the Operating System 2.
SLA-Oriented Resource Provisioning for Cloud Computing
WHAT IS AN OPERATING SYSTEM? An interface between users and hardware - an environment "architecture ” Allows convenient usage; hides the tedious stuff.
Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.
Xrootd and clouds Doug Benjamin Duke University. Introduction Cloud computing is here to stay – likely more than just Hype (Gartner Research Hype Cycle.
Operating Systems Operating system is the “executive manager” of all hardware and software.
1 Characterization of Software Aging Effects in Elastic Storage Mechanisms for Private Clouds Rubens Matos, Jean Araujo, Vandi Alves and Paulo Maciel Presenter:
Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.
Introduction CSCI 444/544 Operating Systems Fall 2008.
CHEP 2012 – New York City 1.  LHC Delivers bunch crossing at 40MHz  LHCb reduces the rate with a two level trigger system: ◦ First Level (L0) – Hardware.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Bugnion et al. Presented by: Ahmed Wafa.
A Grid Resource Broker Supporting Advance Reservations and Benchmark- Based Resource Selection Erik Elmroth and Johan Tordsson Reporter : S.Y.Chen.
Phones OFF Please Operating System Introduction Parminder Singh Kang Home:
Understanding Operating Systems 1 Overview Introduction Operating System Components Machine Hardware Types of Operating Systems Brief History of Operating.
Cs238 Lecture 3 Operating System Structures Dr. Alan R. Davis.
OS Fall ’ 02 Performance Evaluation Operating Systems Fall 2002.
November 1, 2004Introduction to Computer Security ©2004 Matt Bishop Slide #29-1 Chapter 33: Virtual Machines Virtual Machine Structure Virtual Machine.
The Origin of the VM/370 Time-sharing system Presented by Niranjan Soundararajan.
Minerva Infrastructure Meeting – October 04, 2011.
Measuring zSeries System Performance Dr. Chu J. Jong School of Information Technology Illinois State University 06/11/2012 Sponsored in part by Deer &
A User Experience-based Cloud Service Redeployment Mechanism KANG Yu.
Chapter 3 Operating Systems Introduction to CS 1 st Semester, 2015 Sanghyun Park.
STRATEGIES INVOLVED IN REMOTE COMPUTATION
Objectives To provide a grand tour of the major operating systems components To provide coverage of basic computer system organization.
How to Resolve Bottlenecks and Optimize your Virtual Environment Chris Chesley, Sr. Systems Engineer
Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?
Operating System Review September 10, 2012Introduction to Computer Security ©2004 Matt Bishop Slide #1-1.
OSG Public Storage and iRODS
CS 1308 Computer Literacy and the Internet. Introduction  Von Neumann computer  “Naked machine”  Hardware without any helpful user-oriented features.
November , 2009SERVICE COMPUTATION 2009 Analysis of Energy Efficiency in Clouds H. AbdelSalamK. Maly R. MukkamalaM. Zubair Department.
ICOM Noack Operating Systems - Administrivia Prontuario - Please time-share and ask questions Info is in my homepage amadeus/~noack/ Make bookmark.
OS provide a user-friendly environment and manage resources of the computer system. Operating systems manage: –Processes –Memory –Storage –I/O subsystem.
Computing and the Web Operating Systems. Overview n What is an Operating System n Booting the Computer n User Interfaces n Files and File Management n.
Contact Information Office: 225 Neville Hall Office Hours: Monday and Wednesday 12:00-1:00 and by appointment.
BaBar MC production BaBar MC production software VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question:
Recall: Three I/O Methods Synchronous: Wait for I/O operation to complete. Asynchronous: Post I/O request and switch to other work. DMA (Direct Memory.
CHEP'07 September D0 data reprocessing on OSG Authors Andrew Baranovski (Fermilab) for B. Abbot, M. Diesburg, G. Garzoglio, T. Kurca, P. Mhashilkar.
SAMGrid as a Stakeholder of FermiGrid Valeria Bartsch Computing Division Fermilab.
Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 3: Operating-System Structures System Components Operating System Services.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
Using Virtual Servers for the CERN Windows infrastructure Emmanuel Ormancey, Alberto Pace CERN, Information Technology Department.
Operating Systems David Goldschmidt, Ph.D. Computer Science The College of Saint Rose CIS 432.
OPERATING SYSTEMS CS 3530 Summer 2014 Systems with Multi-programming Chapter 4.
Operating System Principles And Multitasking
1: Operating Systems Overview 1 Jerry Breecher Fall, 2004 CLARK UNIVERSITY CS215 OPERATING SYSTEMS OVERVIEW.
Trusted Virtual Machine Images a step towards Cloud Computing for HEP? Tony Cass on behalf of the HEPiX Virtualisation Working Group October 19 th 2010.
June 30 - July 2, 2009AIMS 2009 Towards Energy Efficient Change Management in A Cloud Computing Environment: A Pro-Active Approach H. AbdelSalamK. Maly.
Grid and Cloud Computing Alessandro Usai SWITCH Sergio Maffioletti Grid Computing Competence Centre - UZH/GC3
NTU Cloud 2010/05/30. System Diagram Architecture Gluster File System – Provide a distributed shared file system for migration NFS – A Prototype Image.
CSE 451: Operating Systems Winter 2015 Module 25 Virtual Machine Monitors Mark Zbikowski Allen Center 476 © 2013 Gribble, Lazowska,
Major OS Components CS 416: Operating Systems Design, Spring 2001 Department of Computer Science Rutgers University
Capacity Planning in a Virtual Environment Chris Chesley, Sr. Systems Engineer
Auxiliary services Web page Secrets repository RSV Nagios Monitoring Ganglia NIS server Syslog Forward FermiCloud: A private cloud to support Fermilab.
Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
Processes and threads.
2. OPERATING SYSTEM 2.1 Operating System Function
Chapter 1: A Tour of Computer Systems
Introduction to Operating System (OS)
Chapter 2: System Structures
Cloud Computing Architecture
Chapter 2: Operating-System Structures
Transparent Contribution of Memory
Introduction to Operating Systems
Wide Area Workload Management Work Package DATAGRID project
CSE 451: Operating Systems Autumn Module 24 Virtual Machine Monitors
OS Components and Structure
Chapter 2: Operating-System Structures
Transparent Contribution of Memory
Presentation transcript:

Idle virtual machine detection in FermiCloud Giovanni Franzini September 21, 2012 Scientific Computing Division Grid and Cloud Computing Department

FermiCloud & FermiGrid ● Cloud Computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet). ● A Cloud creates Virtual Machines (VM) upon request. A VM mimics a physical computer in all functionalities (e.g. it is accessible remotely using ssh connections, etc.). ● Here at Fermilab, the Grid and Cloud Computing department manages two distributed systems: – FermiCloud A private Cloud providing Infrastructure-as-a- Service for the Fermilab scientific stakeholders to manage dynamically allocated services, interactive and batch processing. – FermiGrid A distributed campus infrastructure that manages statically allocated compute and storage resources for batch processing.

Problem Statement ● FermiCloud and FermiGrid can share resources, so it is important to optimize their individual utilization to maximize available computing cycles. ● My work focussed on identifying and minimizing idle virtual machines on FermiCloud. An idle VM is an existing machine that it not currently providing computing services. ● If the resources occupied by idle Virtual Machines on FermiCloud were reclaimed, they could be used by FermiGrid to process scientific data. ● Problem: How can we identify idle VMs in FermiCloud?

Project overview ● Find a way to identify idle VMs – Investigate the reuse of existing workload management systems for compute intensive jobs, such as Condor. – Develop a software product to address the requirements of the problem. ● Develop a mechanism to – Inform the FermiCloud management system (OpenNebula) about the idleness of the machine. – Suspend idle VMs. – Replace the suspended VMs with FermiGrid “worker nodes” to execute jobs in queue.

Identify an Idle VM ● In order to identify an idle VM, some components and activities of the machine must be monitored periodically, especially: – CPU idleness – Keyboard / Mouse usage – Network – I/O – Virtual Memory ● In Unix, some useful information can be found in special files (like /proc/uptime ), or can be obtained using particular commands (such as vmstat ) offered by the shell. ● Starting from these data, we can create several indexes to detect the VM status.

Identify an Idle VM For monitoring the VMs, six indexes were defined: 1)CPU Idle Percentage Based on /proc/uptime file. 2)Keyboard / pseudo-terminal idle time (from Condor code) Based on utmp file. 3)Bytes tx and rx by the network interface 4)Iowait ticks Percentage of time the CPU is idle and there is at least one I/O in progress (local disk or remotely mounted disk, NFS). 5)Context switches 6)Memory Paged in/out

Computing the Indexes ● CPU idle example. ● At step n: Take times from /proc/uptime → total(n), idle(n) delta_idle(n) = idle(n) - idle(n-1) delta_total(n) = total(n) - total(n-1) idle_perc(n) = delta_idle(n) / delta_idle(n) avg_idle(n) = (1-α)*idle_perc(n) + α*avg_idle(n-1) ● avg_idle(n ) is an EMA (Exponential Moving Average). It keeps track of the old values of the index, giving more importance to new values (α < ½). ● The sample time for all the indexes is 60 seconds.

Index in Action: CPU Idle Prime Number Test (T = 5s)

Index in Action: Network Activity Ping Test (T = 10s)

Indexes & Idleness Rules ● Rules based on index values can define if a VM is currently idle. These rules are evaluated periodically (e.g. every hour). ● An example of rule could be: vm_status = ( CPU_idle > 0.98 && pty_idle > 2*60 ) ? IDLE : NOT_IDLE ● In order to define these rules, “idle” thresholds for the defined indexes must be found. ● I considered two ways to define these thresholds: – experimentally (24 hours VM usage tests); – using ANFIS (Adaptive Network Fuzzy Interference System).

24 Hours VM Test: CPU Idle * according to the tester report Not idle*

24 Hours VM Test: Keyboard / pty Idle Time Not idle* * according to the tester report

24 Hours VM Test iowait * according to the tester report Not idle*

24 Hours VM Test Memory Paging in/out * according to the tester report Not idle*

Idleness rule ● An average of the indexes values recorded while the VM was idle, was performed. This gave us an idea of typical values for the indexes, when the VM is idle. The “idle” thresholds were defined starting from these averages. ● These values show an high standard deviation. Different VMs so may have very different values for the same index. ● Idleness rule used for the final tests is shown in the next slide ( th_index_A is the idle threshold for index_A ). CPU idle Keyboard/ pty idle time Bytes tx per period Bytes rx per period Iowait Context switches per period Paging in per period Paging out per period Avg Std.Dev ~

Idleness rule SOMEONE_LOGGED = ( keyboard_pty != -1 ); KEYBOARD_USED = ( SOMEONE_LOGGED && keyboard_pty < th_keyboard_pty ); CPU_IDLE = ( cpu_idle > th_cpu_idle ); IOWAIT_IDLE = ( iowait < th_iowait ); PI_IDLE = ( paging_in < th_paging_in ); rule_A = ( SOMEONE_LOGGED && !KEYBOARD_USED && CPU_IDLE && IOWAIT_IDLE && PI_IDLE ); rule_B = ( !SOMEONE_LOGGED && CPU_IDLE && IOWAIT_IDLE && PI_IDLE ); vm_status = ( rule_A || rule_B ) ? IDLE : NOT_IDLE;

Final test results ● During the final test, the detector identified idle VMs with 90% accuracy. ● This is only a first encouraging result. Deeper analysis and improvements of the idle detector are needed. ● In particular: – The final test results come from the anaysis of a very little pool of VMs (30). The majority of them were always idle, only a few of them were actually used during the test. – Idle thresholds must be improved, processing data coming from VMs providing different services (servers, worker nodes, etc.). – New ways to use the recorded indexes values may be investigated (use of different averages, recording of the last 60 minute values and processing of these data, etc.).

Conclusions ● Goal: optimize the usage of computing resources at Fermilab by detecting idle Virtual Machines (VM). ● Idle FermiCloud resources can be reclaimed and used by FermiGrid to run data processing jobs. ● I have implemented six indexes to expose the activity of a VM. These indexes are sampled periodically. ● An “idleness” rule was defined, to properly identify an idle VM. ● Running 24 hour tests: friendly users mark down when their VM is used, while my software record index values and define the VM status. ● A first prototype of the detector was created, with an accuracy of 90%.