Elastic Computing Resource Management Based on HTCondor

Slides:



Advertisements
Similar presentations
SLA-Oriented Resource Provisioning for Cloud Computing
Advertisements

1 October 2013 APF Summary Oct 2013 John Hover John Hover.
INFSO-RI An On-Demand Dynamic Virtualization Manager Øyvind Valen-Sendstad CERN – IT/GD, ETICS Virtual Node bootstrapper.
Nikolay Tomitov Technical Trainer SoftAcad.bg.  What are Amazon Web services (AWS) ?  What’s cool when developing with AWS ?  Architecture of AWS 
Cloud Computing Why is it called the cloud?.
SCD FIFE Workshop - GlideinWMS Overview GlideinWMS Overview FIFE Workshop (June 04, 2013) - Parag Mhashilkar Why GlideinWMS? GlideinWMS Architecture Summary.
CONDOR DAGMan and Pegasus Selim Kalayci Florida International University 07/28/2009 Note: Slides are compiled from various TeraGrid Documentations.
 Cloud computing  Workflow  Workflow lifecycle  Workflow design  Workflow tools : xcp, eucalyptus, open nebula.
Building service testbeds on FIRE D5.2.5 Virtual Cluster on Federated Cloud Demonstration Kit August 2012 Version 1.0 Copyright © 2012 CESGA. All rights.
Department of Computer Science Engineering SRM University
Nimbus & OpenNebula Young Suk Moon. Nimbus - Intro Open source toolkit Provides virtual workspace service (Infrastructure as a Service) A client uses.
How to Resolve Bottlenecks and Optimize your Virtual Environment Chris Chesley, Sr. Systems Engineer
Connecting OurGrid & GridSAM A Short Overview. Content Goals OurGrid: architecture overview OurGrid: short overview GridSAM: short overview GridSAM: example.
BESIII physical offline data analysis on virtualization platform Qiulan Huang Computing Center, IHEP,CAS CHEP 2015.
BESIII distributed computing and VMDIRAC
Grid Appliance – On the Design of Self-Organizing, Decentralized Grids David Wolinsky, Arjun Prakash, and Renato Figueiredo ACIS Lab at the University.
Presented by: Sanketh Beerabbi University of Central Florida COP Cloud Computing.
YAN, Tian On behalf of distributed computing group Institute of High Energy Physics (IHEP), CAS, China CHEP-2015, Apr th, OIST, Okinawa.
1 Evolution of OSG to support virtualization and multi-core applications (Perspective of a Condor Guy) Dan Bradley University of Wisconsin Workshop on.
BOSCO Architecture Derek Weitzel University of Nebraska – Lincoln.
Power-Aware Scheduling of Virtual Machines in DVFS-enabled Clusters
Tool Integration with Data and Computation Grid GWE - “Grid Wizard Enterprise”
Scheduling in HPC Resource Management System: Queuing vs. Planning Matthias Hovestadt, Odej Kao, Alex Keller, and Achim Streit 2003 Job Scheduling Strategies.
Integrating Computing Resources on Multiple Grid-enabled Job Scheduling Systems Through a Grid RPC System Yoshihiro Nakajima, Mitsuhisa Sato, Yoshiaki.
GLIDEINWMS - PARAG MHASHILKAR Department Meeting, August 07, 2013.
Workload management, virtualisation, clouds & multicore Andrew Lahiff.
Scaling the CERN OpenStack cloud Stefano Zilli On behalf of CERN Cloud Infrastructure Team 2.
Tool Integration with Data and Computation Grid “Grid Wizard 2”
OpenStack Chances and Practice at IHEP Haibo, Li Computing Center, the Institute of High Energy Physics, CAS, China 2012/10/15.
CSF. © Platform Computing Inc CSF – Community Scheduler Framework Not a Platform product Contributed enhancement to The Globus Toolkit Standards.
INFN OCCI implementation on Grid Infrastructure Michele Orrù INFN-CNAF OGF27, 13/10/ M.Orrù (INFN-CNAF) INFN OCCI implementation on Grid Infrastructure.
Virtual Cluster Computing in IHEPCloud Haibo Li, Yaodong Cheng, Jingyan Shi, Tao Cui Computer Center, IHEP HEPIX Spring 2016.
Breaking the frontiers of the Grid R. Graciani EGI TF 2012.
StratusLab is co-funded by the European Community’s Seventh Framework Programme (Capacities) Grant Agreement INFSO-RI Demonstration StratusLab First.
Claudio Grandi INFN Bologna Virtual Pools for Interactive Analysis and Software Development through an Integrated Cloud Environment Claudio Grandi (INFN.
DIRAC Distributed Computing Services A. Tsaregorodtsev, CPPM-IN2P3-CNRS FCPPL Meeting, 29 March 2013, Nanjing.
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
Enabling Grids for E-sciencE Claudio Cherubino INFN DGAS (Distributed Grid Accounting System)
CernVM and Volunteer Computing Ivan D Reid Brunel University London Laurence Field CERN.
Towards Self Adaptable Security Monitoring in IaaS clouds Anna Giannakou Advisors: Christine Morin, Jean-Louis Pazat, Louis Rilling.
Scientific Data Processing Portal and Heterogeneous Computing Resources at NRC “Kurchatov Institute” V. Aulov, D. Drizhuk, A. Klimentov, R. Mashinistov,
Distributed computing and Cloud Shandong University (JiNan) BESIII CGEM Cloud computing Summer School July 18~ July 23, 2016 Xiaomei Zhang 1.
New Paradigms: Clouds, Virtualization and Co.
Dynamic Extension of the INFN Tier-1 on external resources
The advances in IHEP Cloud facility
OpenPBS – Distributed Workload Management System
Introduction to Distributed Platforms
Dynamic Deployment of VO Specific Condor Scheduler using GT4
Integration of Openstack Cloud Resources in BES III Computing Cluster
Outline Expand via Flocking Grid Universe in HTCondor ("Condor-G")
ATLAS Cloud Operations
Yaodong CHENG Computing Center, IHEP, CAS 2016 Fall HEPiX Workshop
GWE Core Grid Wizard Enterprise (
StratusLab Final Periodic Review
StratusLab Final Periodic Review
DIRAC services.
Ops Manager API, Puppet and OpenStack – Fully automated orchestration from scratch! MongoDB World 2016.
How to enable computing
“The HEP Cloud Facility: elastic computing for High Energy Physics” Outline Computing Facilities for HEP need to evolve to address the new needs of the.
SCD Cloud at STFC By Alexander Dibbo.
PES Lessons learned from large scale LSF scalability tests
FCT Follow-up Meeting 31 March, 2017 Fernando Meireles
Discussions on group meeting
The Scheduling Strategy and Experience of IHEP HTCondor Cluster
Management of Virtual Execution Environments 3 June 2008
Replication Middleware for Cloud Based Storage Service
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Module 01 ETICS Overview ETICS Online Tutorials
What’s New in Work Queue
Building an Elastic Batch System with Private and Public Clouds
Presentation transcript:

Elastic Computing Resource Management Based on HTCondor Qiulan Huang On behalf of IHEP Computer Center CHEP2016 San Francisco,2016-10-12

Elastic Computing Resource Management Based on HTCondor Outline Background VCondor: Elastic Computing Resource Management System Based on HTCondor Running Status and Performance Analysis in IHEPCloud Summary 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Traditional Architecture of HEP Computing Cluster Cluster Resource Provided by Independent HEP Application Multiple Independent Application vs Multiple Independent Computing Queue 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Existing Trouble in Traditional HEP Cluster Computing Computing queues don’t share resource Different queues have different peak times of resource usage Massive jobs with a little resource or vice versa The overall resource utilization is low Virtual Computing Cluster Elastic computing resource management Elastic resource pool, scale up and down dynamically according to HEP Application jobs to improve resource utilization 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Openstack/OpenNebula Run jobs on Virtual Computing Cluster A virtual layer is constructed between the physical machines and the RMS (resource management system such as HTCondor or PBS) In the case of free resources in the pool, a job queue can use more than it owns WLCG Grid RMS VCondor/VPBS Virtual machines Openstack/OpenNebula Dedicated SGE working physical nodes VMM VMM VMM VMM Physical machines 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Resource pool expansion Condor Job Submit Node Openstack OpenNebula SN Job Compute Job 7 WN Condor Worker Node Compute Job Transfer VM Virtual Machine VCondor Workflow 5 VCondor 8 VM Creation 6 4 VM SN 1 3 VMQuota Job LHAASO 9 2 SN Job Job Job Job Job Job 11 10 WN WN JUNO Central Manager Job Job CONDOR POOL OF JOBS CONDOR POOL OF NODES 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Elastic Computing Resource Management Based on HTCondor Resource pool shrink Condor Job Submit Node Compute Job Openstack OpenNebula SN Job 7 WN Condor Worker Node Compute Job Transfer VM Virtual Machine VCondor Workflow 5 VCondor 8 VM Deletion 6 4 VM 3 VMQuota SN LHAASO 9 1 SN 2 Job Job Job Job Job 10 WN WN JUNO Central Manager 11 CONDOR POOL OF JOBS CONDOR POOL OF NODES 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Elastic Computing Resource Management Based on HTCondor VCondor’s Components Job submission (1)vcondor: JobMonitor: query and record job information and queue length changes NodeManager: use openstack api to create and destroy virtual machines DAEMON: Main module, periodically executed (2)VMQuota: Computing resource share management system https://github.com/hep-gnu/VCondor.git 2016/10/12 Elastic Computing Resource Management Based on HTCondor

(1) JobMonitor: Analyze resource requirements IHEPCloud: Virtual computing cluster 28 computing nodes,672 CPU cores HTCondor version:8.2.5 Provide support for IHEP experiment such as LHAASO, JUNO, BES, CEPC accelerator design 2016/10/12 Elastic Computing Resource Management Based on HTCondor

(2)NodeManager: VM creation and deletion VM Class —Store vm properties, such as hostname, job queue name, idle or busy Icluster Abstract Class —Implementation of the interface support a variety of cloud computing platforms, like Openstack, OpenNebula, AWS EC2 Openstack Cluster Class —Communicate with the controller Nova through a REST client to create or destroy VMs on Openstack ResourcePool Class —Provide support for resource management across multiple cloud platforms Openstack VM Manager Create/Delete VM VM info <object> <object> REST Client Post,Put,Delete Get REST API Nova Keystone Neutron …… OpenStack VM VM VM VM 2016/10/12 Elastic Computing Resource Management Based on HTCondor

(3)VMQuota: Resource share management Set up virtual computing queues for each HEP application, like LHAASO and JUNO Manage the upper thresholds, lower thresholds, and resource reservation time for each queue Queue Lower thresholds Upper thresholds Available resource Resource reservation time(seconds) LHAASO 100 400 200 600 JUNO 300 vcondor [{“ResID”:”juno”}] VMQuota Sending Requests Socket Socket Sending Replies [{“ResID”:”juno”,”MIN”:100,”AVAILABLE”:50}] VMQuota 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Dynamic Scheduling Effect Upper thresholds Jobs queuing, automatically adding virtual computing nodes Lower thresholds LHAASO Resource Pool: Automatically Scale up and down on demand 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Elastic Computing Resource Management Based on HTCondor VCondor:How to use Download VCondor from https://github.com/hep-gnu/VCondor.git Make sure HTCondor and Openstack or OpenNebula are well configured Setup a VM Image with Condor installed Setup a VM Template with Image in the above The VCondor configuration file allows you to configure most of its functionality, open it up to get a usable installation Start VCondor and submit jobs, resource pool scale up and down dynamically 2016/10/12 Elastic Computing Resource Management Based on HTCondor

Elastic Computing Resource Management Based on HTCondor Summary VCondor enables elastic resource management Has begun trial operation in IHEPCloud Current experiment: LHAASO, JUNO Provide support for HEP application resource sharing plan Future Achieve preemptive scheduling based on HTCondor Improve scheduling algorithm further: 目前的调度算法是FCFS,比较简单 Welcome to Contact us ! E-mail: chyd@ihep.ac.cn lihaibo@ihep.ac.cn chengzj@ihep.ac.cn 2016/10/12 Elastic Computing Resource Management Based on HTCondor