glideinWMS in the Cloud
Anthony Tiradani and the glideinWMS Team

glideinWMS (review of basic principles)

Pilot-based WMS that creates an on-demand, dynamically sized overlay Condor batch system to address the complex needs of VOs in running application workflows.

Components:
- WMS Collector
- Glidein Factory
- User Pool Collector
- User Scheduler
- VO Frontend

The Factory knows about the sites and how to submit glideins to them. The VO Frontend knows about the user job details. The WMS Collector acts as a dashboard for communication between the Factory and the VO Frontend.

glideinWMS Worldwide (Grid)

CMS glideinWMS Worldwide (Grid): ~60,000 concurrent jobs! [Plot of concurrent jobs, broken down by CMS Frontend 1 and CMS Frontend 2]

Cloud Definition (for this presentation, anyway)

Cloud in this presentation refers to the EC2 Infrastructure as a Service (IaaS) computing model. This does not limit the Cloud to Amazon's service:
- Almost all alternative cloud service providers offer the EC2 Query API interface
- Almost all alternative cloud software provides an EC2 Query API interface (e.g. Eucalyptus, OpenStack, etc.)

Why Use the Cloud?

- Cloud offers a choice of OS and system libraries: no need to ask for specific versions to be installed on grid site worker nodes
- Can't get enough time on the Grid (deadline looming)? Get guaranteed resources in the cloud; Amazon will be happy to take your money
- You can have privileged user access to the VM
- You can have custom software installed for your jobs
- You do not need to set up an infrastructure for your resources (worker nodes)

glideinWMS: Grid vs. Cloud

[Architecture diagrams. In both cases the VO infrastructure (Condor Scheduler holding the user jobs, VO Frontend, Condor Central Manager) and the glideinWMS Glidein Factory / WMS Pool are the same. On a Grid site, the glidein runs on a worker node and starts a Condor Startd; on a Cloud site, a glideinWMS VM boots and its bootstrap starts the Condor Startd.]

Cloud Support in glideinWMS

- glideinWMS development version 3.0 has been released with support for EC2 universe submissions (download: oc.prd/download.html)
- Requires a properly configured virtual machine image to be uploaded and registered with the target cloud infrastructure (more later if desired, time permitting)
- Cloud credentials must now be presented along with a valid grid/VOMS proxy
- A different economic model is in effect; for now, it is up to the VO administrators to monitor usage
- From the end user's viewpoint, a "Cloud pilot" is indistinguishable from a "Grid pilot"
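To make this concrete, here is a minimal sketch of such a request written as an HTCondor grid-universe submit file. The image ID, paths, and endpoint below are hypothetical placeholders, and attribute details may vary between HTCondor versions:

    universe              = grid
    # EC2 service endpoint (Amazon shown; any EC2 Query API endpoint works)
    grid_resource         = ec2 https://ec2.amazonaws.com/
    # Files containing the cloud credentials (access key id / secret key)
    ec2_access_key_id     = /path/to/access_key
    ec2_secret_access_key = /path/to/secret_key
    # Image that must already be uploaded and registered with the cloud
    ec2_ami_id            = ami-00000000
    ec2_instance_type     = m1.small
    # Where HTCondor writes the SSH key pair it generates for this request
    ec2_keypair_file      = /path/to/ssh_key_pair.pem
    # Data handed to the VM at boot, read by the glidein bootstrap
    ec2_user_data         = -cluster 1 -subcluster 0
    # For EC2 jobs the executable is used only as a label here, nothing is transferred
    executable            = glidein_pilot
    queue

In glideinWMS itself the Factory fills in the equivalent attributes from environment variables, as shown in the User Data Handling slide below.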

glideinWMS/Condor Decisions

Several decisions were made along the way that impacted the current release:
- Condor changed from using Amazon's SOAP interface to using the EC2 Query API
- SSH key management
- User Data handling

SOAP API

Condor originally used the SOAP API:
- It worked fine for the EC2-based cloud; Eucalyptus, not so much
  - Eucalyptus only supported very specific WSDL versions per Eucalyptus version
  - WSDL version support was not well documented (unless you count code as documentation)
  - amazon_gahp had to be recompiled whenever a new WSDL was needed or desired, which was too often
- It required certificates for communication
  - This meant that the CA for each target cloud had to be available to Condor
  - Not much of a problem for Amazon, but it can be difficult for more obscure clouds
  - It can also be difficult depending on how Condor was installed and how CAs are managed for the site

EC2 Query API

Condor switched to the EC2 Query API:
- Nimbus, OpenStack, and others were coming out with support for the EC2 Query API
- Eucalyptus works with the EC2 Query API
- It is becoming the standard Cloud API for all vendors
- ec2_gahp only has to be recompiled when support for a new method is desired
- It does not require certificates; it uses an Access Key and a Secret Key
- It has worked quite well for us
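Because the endpoint is just a URL and the credentials are an access/secret key pair, retargeting a submission to another EC2-compatible cloud is, in principle, only a matter of changing the service URL. A sketch in HTCondor submit syntax, where the Eucalyptus and OpenStack endpoints are hypothetical examples rather than real services:

    # The target cloud is selected by the EC2 Query API endpoint URL, e.g.:
    #   grid_resource = ec2 https://ec2.amazonaws.com/                           (Amazon)
    #   grid_resource = ec2 https://cloud.example.edu:8773/services/Eucalyptus   (hypothetical Eucalyptus)
    #   grid_resource = ec2 https://openstack.example.edu:8773/services/Cloud    (hypothetical OpenStack)
    grid_resource         = ec2 https://ec2.amazonaws.com/
    # Authentication is just two small key files; no CA bundle to install or track
    ec2_access_key_id     = /path/to/access_key
    ec2_secret_access_key = /path/to/secret_key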

SSH Key Management

- The glideinWMS Factory serves multiple communities
- glideinWMS does not perform file management for job submission (e.g. no file spooling)
- Condor creates an SSH key for each VM request submitted
  - Past reasons for doing this include idempotency on Amazon EC2 (no longer necessary)
  - However, it is very convenient for multi-community support
  - Should a key be compromised, only one VM is compromised

User Data Handling

Condor JDL snippet:

    ec2_ami_id            = $ENV(AMI_ID)
    ec2_instance_type     = $ENV(INSTANCE_TYPE)
    ec2_access_key_id     = $ENV(ACCESS_KEY_FILE)
    ec2_secret_access_key = $ENV(SECRET_KEY_FILE)
    ec2_keypair_file      = $ENV(CREDENTIAL_DIR)/ssh_key_pair.$(Cluster).$(Process).pem
    ec2_user_data         = $ENV(USER_DATA)#### -cluster $(Cluster) -subcluster $(Process)

To avoid having to write out a JDL per submission, almost all dynamic data is passed as environment variables. The "tricky" part was how to pass the Cluster and Process to the pilot (these are used to help debug problems).

Future Work

- Set new pilot submission limits:
  - Limit by budget
  - Provide a VO Frontend directive: "Kill all cloud pilots, we are going to go over budget soon"
- Improve accounting for cloud pilots
- Retrieve logs from "stuck" cloud pilots
  - Might include a new feature request to Condor
- Bullet-proof virtual machine management
  - Condor will start and stop the virtual machines reliably, but what happens if the pilot stops working?
- Release an RPM for configuring the Cloud image for glideinWMS

Questions?

Acknowledgments

glideinWMS is an open-source Fermilab project in collaboration with UC San Diego, the Information Sciences Institute, UW-Madison and the Condor team, and many others. We welcome any future contributors: github.com/holzman/glideinWMS

Fermilab is operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy. This work was also supported by the US National Science Foundation under Grants No. OCI (STCI) and PHY (CMS Maintenance & Operations), and by the US Department of Energy under Grant No. DE-FC02-06ER41436, subcontract No. 647F290 (OSG).

Extra Slide - Cloud Image Challenges

Who "owns" the image?
- Who creates it?
- Who patches it?
- Who gets called when something goes wrong?
Where does the image reside?
- It must be pre-staged to the cloud infrastructure
- Under which account is the image stored?
How do you create the image?
- What tools do you use to create and deploy the image?
- How do you keep track of the image? (image catalog)

Extra Slide - Cloud Image Challenges (cont.)

What "extra" resources does the cloud provide?
- How much memory is allowed?
- How much "instance storage" is given? This is a real issue for CMS
How do you debug problems with your image?
- Without knowing an admin for the cloud, or having a working example image, it can be frustrating to build a working image