Making HTCondor Energy Efficient by identifying miscreant jobs Stephen McGough, Matthew Forshaw & Clive Gerrard Newcastle University Stuart Wheater Arjuna.

Presentation transcript:

Making HTCondor Energy Efficient by identifying miscreant jobs Stephen McGough, Matthew Forshaw & Clive Gerrard Newcastle University Stuart Wheater Arjuna Technologies Limited

Motivation | Policy and Simulation | Conclusion
Task lifecycle
[Figure: task timelines labelled ‘Good’ and ‘Bad’. Sequence shown along the time axis: HTC user submits task, resource selection, interactive user logs in, task eviction, resource selection, computer reboot, resource selection, interactive user logs in, ..., resource selection, task eviction, task completion.]
‘Miscreant’: a task that has been evicted, but we don’t know whether it is good or bad.

Motivation
We have run a high-throughput cluster for ~6 years
  – Allowing many researchers to perform more work, more quickly
The University has a strong desire to reduce energy consumption and CO2 production
  – Currently powering down computers and buying low-power PCs
  – “If a computer is not ‘working’ it should be powered down”
Can we go further to reduce wasted energy?
  – Reduce the time computers spend running work which does not complete
  – Prevent resubmission of ‘bad’ jobs
  – Reduce the number of resubmissions for ‘good’ jobs
Aims
  – Investigate policies for reducing energy consumption
  – Determine the impact on high-throughput users

Can we fix the number of retries?
~57 years of computing time during 2010
~39 years of wasted time
  – ~27 years for ‘bad’ tasks: average 45 retries, max 1946 retries
  – ~12 years for ‘good’ tasks: average 1.38 retries, max 360 retries
100% ‘good’ task completion -> 360 retries
Still wastes ~13 years on ‘bad’ tasks
  – 95% ‘good’ task completion -> 3 retries: 9,808 good tasks killed (3.32%)
  – 99% ‘good’ task completion -> 6 retries: 2,022 good tasks killed (0.68%)
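
A minimal sketch of how a retry cap could be read off a distribution of retries per ‘good’ task. The function names and the sample data are hypothetical and are not taken from the trace logs; the percentile rule is an assumption about how the 95% and 99% figures might be derived.

```python
def retry_cap(retries_per_good_task, completion_percent):
    """Smallest cap n such that at least completion_percent % of the
    good tasks needed n or fewer retries (hypothetical helper)."""
    if not retries_per_good_task:
        raise ValueError("need at least one task")
    ordered = sorted(retries_per_good_task)
    # Integer ceiling of completion_percent * len / 100, avoiding float error.
    must_finish = -(-completion_percent * len(ordered) // 100)
    return ordered[max(must_finish, 1) - 1]


def killed_fraction(retries_per_good_task, cap):
    """Fraction of good tasks that would be killed by the given cap."""
    killed = sum(1 for r in retries_per_good_task if r > cap)
    return killed / len(retries_per_good_task)


# Made-up sample of 100 good tasks: most need no retry, one needs 360.
sample = [0] * 70 + [1] * 15 + [2] * 7 + [3] * 4 + [6] * 3 + [360]

for target in (95, 99, 100):
    cap = retry_cap(sample, target)
    print(target, cap, killed_fraction(sample, cap))
```

Because retry counts are integers, the fraction killed at a given cap can be smaller than the nominal 5% or 1%, which matches the shape of the figures on the slide.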

Can we make tasks short enough?
Make tasks short enough to reduce miscreants
Average idle interval
  – 371 minutes
But to ensure availability of intervals
  – 95%: need to reduce time limit to 2 minutes
  – 99%: need to reduce time limit to 1 minute
Impractical to make tasks this short
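
One plausible reading of the availability figures is: the time limit is the longest task length that still fits inside the required share of idle intervals. The sketch below assumes that reading; the function name and the interval data are invented for illustration.

```python
def time_limit_for_availability(idle_minutes, percent_available):
    """Longest task length (minutes) that fits inside at least
    percent_available % of the idle intervals (hypothetical helper)."""
    if not idle_minutes:
        raise ValueError("need at least one idle interval")
    longest_first = sorted(idle_minutes, reverse=True)
    # Integer ceiling of percent_available * len / 100.
    need = -(-percent_available * len(longest_first) // 100)
    return longest_first[max(need, 1) - 1]


# Made-up, heavy-tailed sample of 100 idle intervals (minutes).
intervals = [1] * 3 + [2] * 5 + [30] * 32 + [400] * 40 + [1500] * 20

print(sum(intervals) / len(intervals))             # mean is large
print(time_limit_for_availability(intervals, 95))  # limit is tiny
print(time_limit_for_availability(intervals, 99))
```

A long mean interval and a very short high-percentile limit can coexist because a small number of very short idle intervals dominates the tail.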

Cluster Simulation
High-level simulation of Condor
  – Trace logs from a twelve-month period are used as input
      User logins / logouts (computer used)
      Condor job submission times (‘good’/‘bad’ and duration)
[Figure: computer states: Active (User / Condor), Idle, Sleep]

n reallocation policies
N1(n): Abandon task if deallocated n times.
N2(n): Abandon task if deallocated n times, ignoring interactive users.
N3(n): Abandon task if deallocated n times, ignoring planned machine reboots.
C1: Tasks allocated to resources at random, favouring awake resources.
C2: Target less-used computers (longer idle times).
C3: Tasks are allocated to computers in clusters with the least amount of time used by interactive users.
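
The N policies can be sketched as a single counting rule. This assumes that "ignoring" means deallocations with that cause are simply not counted towards n; the `Cause` enum and `should_abandon` function are hypothetical names, not part of HTCondor.

```python
from enum import Enum


class Cause(Enum):
    """Why a task was deallocated (hypothetical categories)."""
    INTERACTIVE_USER = "interactive_user"
    REBOOT = "reboot"
    OTHER = "other"


# Causes that each policy leaves out of the count (assumed interpretation).
IGNORED = {
    "N1": set(),
    "N2": {Cause.INTERACTIVE_USER},
    "N3": {Cause.REBOOT},
}


def should_abandon(policy, n, deallocations):
    """True once the counted deallocations reach the limit n."""
    counted = sum(1 for cause in deallocations if cause not in IGNORED[policy])
    return counted >= n


history = [Cause.INTERACTIVE_USER, Cause.REBOOT, Cause.INTERACTIVE_USER]
print(should_abandon("N1", 3, history))  # all three count
print(should_abandon("N2", 3, history))  # only the reboot counts
print(should_abandon("N3", 2, history))  # the two user logins count
```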

Accrued Time Policies
Impose a limit on cumulative execution time for a task.
A1(t): Abandon if accrued time > t and task deallocated.
A2(t): As A1, discounting interactive users.
A3(t): As A1, discounting reboots.
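
A minimal sketch of the accrued-time check, evaluated at the moment a task is deallocated. It assumes that "discounting" means execution time from runs ended by that cause is left out of the total; the `Run` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One execution attempt of a task (hypothetical record)."""
    minutes: float
    ended_by: str  # "interactive_user", "reboot" or "other"


# Cause whose runs are left out of the accrued total (assumed interpretation).
DISCOUNTED = {"A1": None, "A2": "interactive_user", "A3": "reboot"}


def accrued_minutes(policy, runs):
    """Cumulative execution time counted under the given policy."""
    return sum(r.minutes for r in runs if r.ended_by != DISCOUNTED[policy])


def abandon_on_deallocation(policy, t, runs):
    """Call when the task has just been deallocated: abandon if accrued time > t."""
    return accrued_minutes(policy, runs) > t


runs = [Run(120, "interactive_user"), Run(300, "reboot"), Run(60, "interactive_user")]
for policy in ("A1", "A2", "A3"):
    print(policy, accrued_minutes(policy, runs), abandon_on_deallocation(policy, 240, runs))
```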

Individual Time Policy
Impose a limit on individual execution time for a task.
  – Nightly reboots bound this to 24 hours.
  – What is the impact of lowering this?
I1(t): Abandon if individual time > t.
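
To illustrate the question of lowering the limit, the sketch below sweeps I1(t) over a few values of t and reports the share of ‘good’ tasks that could no longer finish in a single run. The task durations are invented and the function name is hypothetical.

```python
def i1_killed_share(good_task_hours, limits_hours):
    """For each limit t, the fraction of good tasks whose single
    uninterrupted run needs more than t hours (hypothetical helper)."""
    total = len(good_task_hours)
    return {
        t: sum(1 for hours in good_task_hours if hours > t) / total
        for t in limits_hours
    }


# Made-up run lengths (hours) for eight good tasks.
durations = [0.5, 1, 2, 3, 5, 8, 12, 20]

impact = i1_killed_share(durations, [24, 12, 6, 1])
for limit, share in impact.items():
    print(limit, share)
```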

Dedicated Resources
D1(m,d): Miscreant tasks are permitted to continue executing on a dedicated set of m resources (without interactive users or reboots), with a maximum duration d.
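
A rough sketch of D1(m,d) as a small queue over m dedicated machines. It assumes all miscreant tasks are already waiting, are served in order, and are never interrupted on the dedicated set; the function name and the numbers are hypothetical.

```python
import heapq


def dedicated_pool(remaining_hours, m, d):
    """Run miscreant tasks on m dedicated machines, each for at most d hours.

    Returns (completed task indices, abandoned task indices, makespan in hours).
    """
    if m < 1:
        raise ValueError("need at least one dedicated machine")
    free_at = [0.0] * m  # time at which each dedicated machine becomes free
    heapq.heapify(free_at)
    completed, abandoned = [], []
    for task_id, need in enumerate(remaining_hours):
        start = heapq.heappop(free_at)  # earliest available machine
        heapq.heappush(free_at, start + min(need, d))
        if need <= d:
            completed.append(task_id)
        else:
            abandoned.append(task_id)  # hit the maximum duration d
    return completed, abandoned, max(free_at)


done, dropped, makespan = dedicated_pool([2, 10, 3, 30, 1], m=2, d=12)
print(done, dropped, makespan)
```

The cap d bounds how long a ‘bad’ task can occupy a dedicated machine, while m controls how long ‘good’ miscreants wait for a slot.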

Conclusion
Simple policies can be used to reduce the effect of miscreant tasks in a multi-use cycle-stealing cluster.
  – N2 (total evictions ignoring users)
      Order-of-magnitude reduction in energy consumption
  – Reduce the amount of effort wasted on tasks that will never complete
Policies may be combined to achieve further improvements.
  – Adding in dedicated computers

Questions?
More info: McGough, A. Stephen; Forshaw, Matthew; Gerrard, Clive; Wheater, Stuart. Reducing the Number of Miscreant Tasks Executions in a Multi-use Cluster. Cloud and Green Computing (CGC), 2012.