Valencia Cluster Status
Gang Qin, Nov. 25, 2011


New Items

- condor & proof monitoring
  - Service Availability Monitoring (SAM): every condor slave in the cluster receives a test job every hour. The results are merged into the web monitoring page, and an alarm mail is sent if any of them fails.
  - Similar idea for proof.
  - SAM jobs have no priority.
  - They add system load even when the system load is already quite high.
- NFS failing on some WNs
  - Some jobs will fail directly.
  - A common problem with NFS, usually fixed by crond.
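The hourly test-and-alarm scheme described above could be sketched roughly as follows. This is only an illustration, not the cluster's actual monitoring code: the `TestResult` record, the function names, and the node names in the usage note are hypothetical, and sending the mail itself is left out.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    node: str     # worker node name (hypothetical naming)
    passed: bool  # True if the hourly test job succeeded on this node


def summarize(results):
    """Merge per-node test results into one summary for a monitoring page."""
    failed = sorted(r.node for r in results if not r.passed)
    return {
        "total": len(results),
        "failed": failed,
        "alarm": bool(failed),  # raise an alarm if any node failed
    }


def alarm_message(summary):
    """Return the alarm mail body, or None when every node passed."""
    if not summary["alarm"]:
        return None
    return "SAM test failed on: " + ", ".join(summary["failed"])
```

For example, `alarm_message(summarize([TestResult("wn01", True), TestResult("wn02", False)]))` yields a message naming only `wn02`, while a run with no failures yields `None` and no mail would be sent.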

Items with improvement

- condor upgrade on the valtical cluster
  - condor x86_64 has been installed on all machines in the valtical cluster, and the twiki has been updated as well. Users do not need any special environment settings to run condor commands.
  - The configuration files for the condor master and slaves are different; they will be unified in a future scheduled maintenance.
- Optimization of the crontab that restarts the xrootd & proofd services
  - Deployed to all machines in the valtical cluster.
- High CPU overload (>100) on Valtical00 (NFS server)
  - Caused by xrootd: around 50% of the xrootd data is stored on this machine (12 TB).
  - Possible solutions:
    - Data rebalancing between data servers, which means adding more disks to other WNs. This requires changing the chassis; Carlos ordered one and it arrived today. Further tests will be organized.
    - File size regulation: currently the size of xrootd files in the cluster ranges from ~20 MB to ~1 GB. The general idea is that disk I/O benefits from larger files; tests to be done.
    - Adding a RAID controller at the beginning? (not possible now)
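To make the data rebalancing idea concrete, here is a minimal sketch that computes how much data each server would have to give up or receive for an even split. It assumes equal disk capacity on every server, which is not stated in the slides, and the server names other than valtical00 and all numbers except the 12 TB figure are hypothetical.

```python
def rebalance_plan(used_tb):
    """Return, per server, the TB to move off (+) or onto (-) it
    so that every server ends up with the same share of the data.

    Assumption: all servers have enough free disk for an even split.
    """
    target = sum(used_tb.values()) / len(used_tb)
    return {name: round(used - target, 2) for name, used in used_tb.items()}


# Hypothetical layout: valtical00 holds 12 TB, two other WNs hold 6 TB each.
plan = rebalance_plan({"valtical00": 12.0, "wn_a": 6.0, "wn_b": 6.0})
```

In this made-up layout the plan moves 4 TB off valtical00 and 2 TB onto each of the other two servers.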

Load Balancing

- Balance data importing and proof jobs
  - When data is imported to the cluster with xrdcp, proof jobs become very slow or sometimes crash.
  - Coordinate the data importing and proof job running times?
    - Data importing before 9:00 and after 20:00?
    - Send mail to the mailing list when data importing starts and ends?
- Load balance between Condor & Proof in the cluster
  - Prevent the condor daemon on a client from starting when the non-condor CPU load is > 0.3 (further tests needed).
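The two proposals above reduce to simple checks, sketched here under stated assumptions. The function names are hypothetical, the load figures are assumed to be supplied by the caller, and treating exactly 9:00 as outside the window and exactly 20:00 as inside it is a guess, since the slides do not say.

```python
from datetime import time

IMPORT_ENDS_MORNING = time(9, 0)     # imports allowed before 9:00
IMPORT_STARTS_EVENING = time(20, 0)  # and again from 20:00 (boundary assumed)
NON_CONDOR_LOAD_LIMIT = 0.3          # threshold proposed in the slides


def import_allowed(now):
    """True if a data import may run at wall-clock time `now`."""
    return now < IMPORT_ENDS_MORNING or now >= IMPORT_STARTS_EVENING


def condor_may_start(total_load, condor_load):
    """True if condor may start: the load not caused by condor is at most 0.3."""
    non_condor_load = max(total_load - condor_load, 0.0)
    return non_condor_load <= NON_CONDOR_LOAD_LIMIT
```

With these definitions an import at 12:00 is refused while one at 21:30 is allowed, and a node whose load comes mostly from proof would keep condor from starting.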

Pending Items

- Evaluate filesystem migration from XRootd to EOS
  - To be done.
- Find the cause of the regular IOwait problems on the NFS share
  - The problem is not in the NFS service, but we can still do some NFS optimization.
  - nfsd number adjustment: 8 is fine.
  - Linux kernel optimization: no big improvement observed in a quick check; longer tests to be done.
- Better use of NFS?
  - The disk I/O situation will be even worse when xrootd is accessing files on the NFS server.
  - Separate WN, NFS & UI with limited machines?

Finished old items

- Revive valtical15 as an SLC5 workstation
  - Done; it now provides NFS service to the whole cluster (/data2, /data3, /data4).

Thank you