Valencia Cluster status: Gang Qin, Nov. 25, 2011


New Items
- condor & proof monitoring: Service Availability Monitoring (SAM)
  - Every condor slave in the cluster receives a test job every hour; the results are merged into the web monitoring page, and an alarm mail is sent if any of them fails.
  - Similar idea for proof.
  - SAM jobs have no priority; they add to the system load even when it is already quite high.
- NFS failing on some WNs
  - Some jobs fail directly.
  - A common problem with NFS, usually fixed by crond.
(2)
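The merge-and-alarm step of the SAM scheme above could look roughly like the following. This is a minimal sketch, not the cluster's actual monitoring code: the `TestResult` record, the node names, and the message wording are all hypothetical, and it covers neither the job submission nor the mail delivery.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    node: str      # hypothetical worker node name
    passed: bool   # whether the hourly test job succeeded on that node


def summarize(results):
    """Merge per-node results into a status table plus a sorted list of failed nodes."""
    status = {r.node: ("OK" if r.passed else "FAILED") for r in results}
    failed = sorted(node for node, state in status.items() if state == "FAILED")
    return status, failed


def alarm_message(failed):
    """Return the alarm mail body, or None when every node passed."""
    if not failed:
        return None
    return "SAM test failed on: " + ", ".join(failed)
```

The status table would feed the web monitoring page, and a non-`None` message would be handed to whatever mail mechanism the site uses.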

Items with improvement
- condor upgrade on valtical cluster
  - condor x86_64 has been installed on all machines in the valtical cluster and the twiki has been updated; users no longer need any special environment settings to run condor commands.
  - The configuration files for the condor master and slaves differ; they will be unified in a future scheduled maintenance.
- Optimization of the crontab that restarts the xrootd & proofd services
  - Deployed to all machines in the valtical cluster.
- High CPU overload (>100) on Valtical00 (NFS server)
  - Caused by xrootd: around 50% of the xrootd data is stored on this machine (12TB).
  - Possible solutions:
    - Data rebalance between data servers. This means adding more disks to other WNs, which requires changing the chassis; Carlos has ordered one and it arrived today. Further tests will be organized.
    - Filesize regulation: currently the size of xrootd files in the cluster ranges from ~20M to ~1G. The general idea is that disk I/O benefits from larger files; tests to be done.
    - Adding a RAID controller at the beginning? (not possible now)
(3)
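To get a feel for what a data rebalance would involve, the arithmetic can be sketched as below. This is an illustration only: it assumes an even split across servers is the goal and ignores disk capacity. The 12TB on valtical00 comes from the slide; the other server names and sizes in the usage note are hypothetical, chosen so that valtical00 holds about half of the total.

```python
def rebalance_plan(used_tb):
    """Return, per server, the TB to move off (positive) or onto (negative)
    that server so that every server ends up with an equal share."""
    target = sum(used_tb.values()) / len(used_tb)
    return {server: round(tb - target, 2) for server, tb in used_tb.items()}
```

For example, with `{"valtical00": 12.0, "wn_a": 4.0, "wn_b": 4.0, "wn_c": 4.0}` the even share is 6TB per server, so 6TB would have to leave valtical00 and each of the other three would take 2TB.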

Load Balancing
- Balance data importing and proof jobs
  - When data is imported to the cluster with xrdcp, proof jobs become very slow or sometimes crash.
  - Coordinate the data importing and proof job running times?
    - Import data before 9:00 and after 20:00?
    - Send mail to the mailing list when data importing starts and ends?
- Load balance between Condor & Proof in the cluster
  - Prevent the condor daemon on a client from starting when the non-condor CPU load is > 0.3 (further tests needed).
(4)
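The two proposals above reduce to simple checks, sketched here under stated assumptions. The 9:00 and 20:00 boundaries and the 0.3 threshold are taken from the slide; the function names are hypothetical, and the non-condor load is assumed to be the total load average minus the load attributed to condor jobs. How such a check would be wired into the condor configuration or an import script is left open.

```python
from datetime import time

IMPORT_END_MORNING = time(9, 0)     # imports allowed before 9:00
IMPORT_START_EVENING = time(20, 0)  # and again from 20:00
MAX_NON_CONDOR_LOAD = 0.3           # threshold proposed on the slide


def import_allowed(now):
    """True when a data import may run at the given time of day."""
    return now < IMPORT_END_MORNING or now >= IMPORT_START_EVENING


def condor_may_start(total_load, condor_load):
    """True when the load not caused by condor is at or below the threshold."""
    non_condor_load = max(total_load - condor_load, 0.0)
    return non_condor_load <= MAX_NON_CONDOR_LOAD
```

A node with a total load of 0.5 and no condor jobs running would therefore be refused, while a node whose load comes almost entirely from condor itself would still accept work.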

Pending Items
- Evaluate filesystem migration from XRootd to EOS
  - To be done.
- Find cause of regular IOwait problems in NFS share
  - The problem is not in the NFS service itself, but some NFS optimization is still possible:
    - Nfsd number adjustment: 8 is fine.
    - Linux kernel optimization: no big improvement observed in a quick check; longer-term tests to be done.
  - Better use of NFS? The disk I/O situation gets even worse when xrootd is accessing files on the NFS server.
  - Separate WN, NFS & UI with limited machines?
(5)

Finished old items
- Revive valtical15 as SLC5 workstation
  - Done; it now provides NFS service to the whole cluster (/data2, /data3, /data4).
(6)

Thank you