Virtual Cluster Computing in IHEPCloud
Haibo Li, Yaodong Cheng, Jingyan Shi, Tao Cui
Computing Center, IHEP
HEPiX Spring 2016



Contents
– Background
– Overview of IHEPCloud
– Virtual Cluster Computing Status
– Future Work

Background
Low resource utilization rate
– Utilization of computing resources is less than 60% on average
Computing resources are not shared
– Each experiment, such as BESIII and YBJ, has its own dedicated computing machines
Computing requirements have peaks
The penalty of running jobs on virtual machines is acceptably small

Virtual machine performance test (1)
BES simulation job
– Same number of jobs run on a physical machine and on VMs, each VM running one job
– Number of VMs on one physical machine (24 cores): 1, 12, 24

Job options used:
//This file is generated from "/bes3fs/offline/data/job/665/tmp109/mctest/test1/mctest1" by the program!
RealizationSvc.InitEvtID=103265;
#include "$OFFLINEEVENTLOOPMGRROOT/share/OfflineEventLoopMgr_Option.txt"
#include "$KKMCROOT/share/jobOptions_KKMC.txt"
KKMC.CMSEnergy=3.686;
KKMC.BeamEnergySpread= ;
KKMC.NumberOfEventPrinted=1;
KKMC.GeneratePsiPrime=true;
#include "$BESEVTGENROOT/share/BesEvtGen.txt"
BesRndmGenSvc.RndmSeed=724432;
#include "$BESSIMROOT/share/G4Svc_BesSim.txt"
#include "$CALIBSVCROOT/share/calibConfig_sim.txt"
RealizationSvc.RunIdList={-8997};
#include "$ROOTIOROOT/share/jobOptions_Digi2Root.txt"
RootCnvSvc.digiRootOutputFile="/scratchfs/bes/offline/shijytest/tmptestfile/mctest_10.rtraw";
MessageSvc.OutputLevel=6;
ApplicationMgr.EvtMax=4900;

Experiment environment
– Virtual machine: 1 CPU core, 2 GB memory
– Physical machine: 24 CPU cores, 16 GB memory

Experiment result (running-time penalty on VM vs. physical machine)
– 1 job: about 3% (measured slowdown 3.3%)
– 12 jobs: 2.7%
– 24 jobs: about 2% (measured slowdown 2.2%)

Virtual machine performance test (2)
BES reconstruction job
– Same number of jobs run on a physical machine and on VMs, each VM running one job
– Number of VMs on one physical machine (24 cores): 1, 12, 24

Job options used:
DatabaseSvc.ReuseConnection=false;
MessageSvc.OutputLevel=6;
RawDataInputSvc.InputFiles={"/bes3fs/offline/data/job/700/test/virtual/rec-run_4.raw"};
EventPreSelect.WriteDst=true;
EventPreSelect.WriteRec=false;
EventPreSelect.SelectBhabha=false;
EventPreSelect.SelectDimu=false;
EventPreSelect.SelectHadron=false;
EventPreSelect.SelectDiphoton=false;
WriteDst.digiRootOutputFile="/besfs/offline/data/700-1/test/dst/virtual/rec-run_4.dst";
EventCnvSvc.digiRootOutputFile="/besfs/offline/data/700-1/test/tmp/virtual/rec-run_4.tmp";
ApplicationMgr.EvtMax= ;

Experiment environment
– Virtual machine: 1 CPU core, 2 GB memory
– Physical machine: 24 CPU cores, 16 GB memory

Experiment result (running-time penalty on VM vs. physical machine)
– 1 job: about 3% (measured slowdown 3.6%)
– 12 jobs: 4.2%
– 24 jobs: 16.3%
Network I/O consumption causes high IOWait with 24 jobs

Overview of IHEPCloud

IHEPCloud Introduction
Provides cloud services for IHEP users and computing demands
Based on OpenStack
– Launched in May 2014
– Has gone through 4 rolling upgrades
– Currently based on the Kilo release
– The Virtual Computing Cluster is one use scenario within IHEPCloud

IHEPCloud Status
1 controller, 29 computing nodes, ~720 CPU cores
Job queues managed by HTCondor and Torque
Currently supports LHAASO, JUNO and CEPC
(Diagram: control node, computing nodes and VMs)

Virtual Cluster Computing Status

Features
Easy to use for different experiments
Provides dynamic virtual resources on demand
Transparent to users
Two types of clusters in IHEPCloud
– Static virtual cluster
– Dynamic virtual cluster

Architecture of Virtual Computing Cluster
(Diagram: job queues for BES, JUNO, LHAASO and others are handled by a resource scheduler and a resource management layer (HTCondor, Torque, ...), which provisions virtual machines for each experiment on top of the physical machines.)

Key Technology
– Resource pool management
– Dynamic scheduling

Resource pool management
Different experiments have different resource queues
Resource quota management
– Each resource queue (e.g. LHAASO, JUNO) is described by a minimum, a maximum, the currently available resources, and a reserve time in seconds
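
To make the quota idea concrete, a minimal Python sketch is shown below; the field names follow the quota attributes above, but the numbers and the helper function are purely illustrative assumptions, not the production code.

    # Hypothetical per-experiment quota table; the values are made up for illustration.
    RESOURCE_QUOTA = {
        "LHAASO": {"min": 20, "max": 200, "available": 150, "reserve_time": 600},
        "JUNO":   {"min": 10, "max": 100, "available": 80,  "reserve_time": 600},
    }

    def vms_allowed(experiment, running_vms, requested):
        """How many new VMs an experiment may still start without exceeding its quota."""
        quota = RESOURCE_QUOTA[experiment]
        headroom = min(quota["max"] - running_vms, quota["available"])
        return max(0, min(requested, headroom))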

Dynamic scheduling
Supports different batch systems
– Torque, HTCondor
Dynamic VM supply
– Virtual machines are created and destroyed according to application demand
Fair-share algorithm
– Guarantees resources are fairly distributed among experiments
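
The fair-share step can be illustrated with a small sketch: free slots are handed out round-robin so that every experiment with pending demand receives an equal share, capped at what it actually asked for. This is an assumption about the algorithm's general shape, not the actual IHEP implementation.

    def fair_share(free_slots, demand):
        """demand: {experiment: VMs wanted}; returns {experiment: VMs granted}."""
        grant = {exp: 0 for exp in demand}
        pending = {exp: n for exp, n in demand.items() if n > 0}
        while free_slots > 0 and pending:
            for exp in list(pending):          # one slot per experiment per round
                if free_slots == 0:
                    break
                grant[exp] += 1
                pending[exp] -= 1
                free_slots -= 1
                if pending[exp] == 0:
                    del pending[exp]
        return grant

    # Example: 10 free slots shared between three experiments.
    print(fair_share(10, {"LHAASO": 8, "JUNO": 3, "CEPC": 1}))
    # -> {'LHAASO': 6, 'JUNO': 3, 'CEPC': 1}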

Openstack API list
A set of OpenStack API wrappers written in Python to control the VMs:

    class CloudControl:
        def get_active_vm(self):                        # return the list of active VMs
        def get_instance_from_ip(self, ip):             # return the instance id for an IP address
        def create_vm(self, sname, imageid, flavorid):  # create a new VM
        def delete_vm(self, serverid):                  # destroy a VM
        def get_vmstat(self, serverid):                 # get a VM's status
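
As a rough idea of what sits behind these methods, the sketch below builds them on the python-novaclient library; the credentials, API version and implementation details are assumptions for illustration, not the authors' actual code.

    # Minimal sketch of CloudControl on top of python-novaclient (assumed setup).
    from novaclient import client

    class CloudControl:
        def __init__(self, username, password, project, auth_url):
            # Nova compute API client, version 2
            self.nova = client.Client("2", username, password, project, auth_url=auth_url)

        def get_active_vm(self):
            """Return the list of VMs currently in ACTIVE state."""
            return [s for s in self.nova.servers.list() if s.status == "ACTIVE"]

        def get_instance_from_ip(self, ip):
            """Return the instance id of the VM that owns the given IP address."""
            for server in self.nova.servers.list():
                for addrs in server.addresses.values():
                    if any(a.get("addr") == ip for a in addrs):
                        return server.id
            return None

        def create_vm(self, sname, imageid, flavorid):
            """Boot a new VM from an image id and a flavor id."""
            return self.nova.servers.create(name=sname, image=imageid, flavor=flavorid)

        def delete_vm(self, serverid):
            """Destroy a VM by its server id."""
            self.nova.servers.delete(serverid)

        def get_vmstat(self, serverid):
            """Return the current status string (ACTIVE, BUILD, ERROR, ...) of a VM."""
            return self.nova.servers.get(serverid).status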

Dynamic Virtual Computing
– Virtual PBS (VPBS)
– Virtual Condor (VCondor)

VPBS
Integrated with Torque
Pull-push mode
– When a job arrives, the Matcher asks for the appropriate virtual machines
– When a VM starts, it requests jobs
JobAgent in the VM
– A pilot running inside the virtual machine that pulls jobs and monitors the VM status

VPBS Structure
(Diagram: a Torque server, Matcher, Vmcontrol and Job Management on the service side; VMs in IHEPCloud run a jobagent and are created through the OpenStack API.)
Workflow:
1. qsub job.sh -q
2. The Matcher requests the job information
3. The Matcher asks Vmcontrol for VM resources
4. Vmcontrol starts VMs with the OpenStack API
5. The jobagent in the VM pulls the job
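
A pilot of this kind could look roughly like the following loop; the job-manager endpoints, payloads and URL are hypothetical placeholders, since the real VPBS protocol is not shown in the slides.

    # Illustrative jobagent sketch: poll for work, run it, report VM status.
    import socket
    import subprocess
    import time

    import requests

    JOB_SERVER = "http://jobmanager.example.ihep.ac.cn"   # placeholder endpoint

    def report_status(state):
        # Tell the VM-control service what this VM is doing (idle/busy).
        requests.post(JOB_SERVER + "/vmstatus",
                      json={"host": socket.gethostname(), "state": state})

    def main():
        report_status("idle")
        while True:
            resp = requests.get(JOB_SERVER + "/nextjob",
                                params={"host": socket.gethostname()})
            job = resp.json()
            if not job:                  # no work available: wait and poll again
                time.sleep(30)
                continue
            report_status("busy")
            subprocess.run(["/bin/sh", job["script"]], check=False)   # run the pulled job
            report_status("idle")

    if __name__ == "__main__":
        main()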

VCondor
Uses HTCondor as the scheduler
Push mode
– Once a VM starts, it joins the HTCondor pool automatically
How does a VM join the right experiment pool?
– Access group: which group the user belongs to
– Resource group: which resources a group can use
– A mapping table connects them
(Example mapping table with columns User, Access group, Resource group; entries include users biby and mall, access groups lhaaso and lhaasorun, and resource group LHAASO.)
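
The two-level lookup can be expressed as a pair of tables, as in the small sketch below; the user and group names come from the example on the slide, but how they pair up is an assumption made for illustration.

    # user -> access group (illustrative pairing)
    ACCESS_GROUP = {"biby": "lhaaso", "mall": "lhaasorun"}
    # access group -> resource group (experiment pool)
    RESOURCE_GROUP = {"lhaaso": "LHAASO", "lhaasorun": "LHAASO"}

    def pool_for_user(user):
        """Decide which experiment pool a user's VMs should join."""
        return RESOURCE_GROUP[ACCESS_GROUP[user]]

    print(pool_for_user("biby"))   # -> LHAASO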

VCondor structure
(Diagram: HTCondor with the Condor scheduler and VM quota management on the service side; VMs in IHEPCloud run condor daemons and are created through the OpenStack API.)
Workflow:
1. condor_submit job.sh
2. Jobs are scheduled to the Condor scheduler
3. The scheduler asks VM quota management for the number of available VMs
4. VM quota management checks the VM running status
5. VMs are started with the OpenStack API
6. Jobs are pushed to the VMs
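
Steps 3 to 5 (checking idle demand and starting VMs) could be prototyped as below, reusing the CloudControl wrapper sketched earlier together with the HTCondor Python bindings; the image and flavor ids, the quota structure and the AcctGroup-based constraint are assumptions for illustration.

    # Sketch only: scale up one experiment's VMs based on its idle HTCondor jobs.
    import htcondor

    IMAGE_ID = "sl6-worker"      # hypothetical image id
    FLAVOR_ID = "1cpu-2gb"       # hypothetical flavor id

    def idle_jobs(accounting_group):
        """Count idle jobs (JobStatus == 1) queued under a given accounting group."""
        schedd = htcondor.Schedd()
        return len(schedd.query('JobStatus == 1 && AcctGroup == "%s"' % accounting_group))

    def scale_up(cloud, experiment, quota):
        """Start as many new VMs as idle demand requires, within the experiment's quota."""
        running = len(cloud.get_active_vm())              # all active VMs (simplification)
        wanted = min(idle_jobs(experiment), quota[experiment]["max"] - running)
        for i in range(max(0, wanted)):
            cloud.create_vm("%s-worker-%d" % (experiment, i), IMAGE_ID, FLAVOR_ID)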

Running Status
Supports three experiments
– VPBS, CEPC: ~1600 jobs, 12300 hours a week
– VCondor, LHAASO: ~1700 jobs, 23500 hours a week
– VCondor, JUNO: ~ jobs (mainly short jobs, 152 s on average), 1924 hours a week
(Diagram: the virtual computing cluster provides VPBS and VCondor resources to CEPC, JUNO, LHAASO and BES.)

JUNO running performance: physical vs. virtual machine
Physical machine
– CPU usage: average 95.4%, peak 99.9%
– System consumption: average 2.1%, peak 16.7%
Virtual machine
– CPU usage: average 94.4%, peak 99.7%
– System consumption: average 1.7%, peak 5.9%

(Plots: VCondor job statistics for LHAASO (static cluster), VCondor job statistics for JUNO (dynamic cluster), and VPBS job statistics for CEPC.)

Future work
Extend to OpenStack cells and multiple controllers
– When the number of VM instances reaches a certain scale, managing the cloud becomes complex
Accounting
– Job accounting
– Virtual resource accounting
Job scheduling policy
– Resource preemption and job deletion policies

Thank you!