Florida Tech Grid Cluster. P. Ford², X. Fave¹, M. Hohlmann¹ - High Energy Physics Group. ¹Department of Physics and Space Sciences, ²Department of Electrical & Computer Engineering

History: Original conception in 2004 with an FIT ACITC grant. Received over 30 more low-end systems from UF. Basic cluster software became operational. Purchased high-end servers and designed the new cluster. Established the cluster on the Open Science Grid. Upgraded and added systems. Registered as a CMS Tier 3 site.

Current Status: OS: Rocks V (CentOS 5.0). Job Manager: Condor. Grid Middleware: OSG 1.2, Berkeley Storage Manager (BeStMan) i7.p3, Physics Experiment Data Exports (PhEDEx). Contributed over 400,000 wall hours to the CMS experiment; over 1.3M wall hours total. Fully compliant on OSG Resource Service Validation (RSV) and CMS Site Availability Monitoring (SAM) tests.

System Architecture [diagram: Compute Element (CE), Storage Element (SE), NAS (nas-0-0), and compute nodes compute-1-X / compute-2-X]

Hardware: CE/Frontend: 8 Intel Xeon E5410, 16GB RAM, RAID5. NAS0: 4 CPUs, 8GB RAM, 9.6TB RAID6 array. SE: 8 CPUs, 64GB RAM, 1TB RAID5. 20 compute nodes: 8 CPUs and 16GB RAM each; 160 total batch slots. Gigabit networking, Cisco Express at the core. 2x 208V 5kVA UPS for the nodes, 1x 120V 3kVA UPS for critical systems.

Hardware [photo: Olin Physical Science High Bay]

Rocks OS: A huge software package for clusters (e.g. 411, dev tools, Apache, autofs, Ganglia). Allows customization through "Rolls" and appliances. Configuration is stored in MySQL. Customizable appliances auto-install nodes and run post-install scripts.

Storage: Set up XFS on the NAS partition, mounted on all machines. The NAS stores all user and grid data and streams it over NFS. The Storage Element is the gateway for grid storage on the NAS array.

Condor Batch Job Manager: A batch job system that distributes workflow jobs to compute nodes. Distributed computing, NOT parallel. Users submit jobs to a queue and the system finds places to process them. Great for grid computing; the most-used batch system in OSG/CMS. Supports "universes" - Vanilla, Standard, Grid...
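To make the submission model concrete, here is a minimal sketch (not taken from the slides) that writes a vanilla-universe submit description and hands it to condor_submit; the executable name, file names, and job count are hypothetical.

```python
# Minimal sketch (not from the slides): queue a vanilla-universe Condor job.
# The executable name, file names, and job count are hypothetical.
import subprocess
import textwrap

submit_description = textwrap.dedent("""\
    universe   = vanilla
    executable = analyze.sh
    arguments  = run_$(Process)
    output     = analyze.$(Cluster).$(Process).out
    error      = analyze.$(Cluster).$(Process).err
    log        = analyze.$(Cluster).log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 10
    """)

with open("analyze.sub", "w") as handle:
    handle.write(submit_description)

# condor_submit places the 10 jobs in the schedd's queue; the negotiator
# later matches them to free batch slots advertised by the startd daemons.
subprocess.run(["condor_submit", "analyze.sub"], check=True)
```

After submission, condor_q shows the queued jobs and condor_status shows the slots available in the pool.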

Personal Condor / Central Manager [diagram: master, collector, negotiator, startd, schedd]. Master: manages all daemons. Negotiator: the "matchmaker" between idle jobs and pool nodes. Collector: directory service for all daemons; daemons send it ClassAd updates periodically. Startd: runs on each "execute" node. Schedd: runs on a "submit" host and creates a "shadow" process on that host; allows manipulation of the job queue.

Typical Condor setup [diagram: Central Manager running master, collector, negotiator, and schedd; workstations running master, startd, and schedd; cluster nodes running master and startd]

Condor Priority: User priority is managed by a half-life algorithm with configurable parameters. The system does not kick off running jobs; a resource claim is freed as soon as the job finishes. This enforces fair use AND allows vanilla jobs to finish. Optimized for grid computing.
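The gist of the half-life scheme can be sketched as follows; this is an illustration of the idea rather than Condor's exact accounting, and the update interval and slot counts are invented.

```python
# Sketch of the half-life idea behind Condor user priorities: priority tracks
# recent slot usage and decays toward current usage with a configurable
# half-life (PRIORITY_HALFLIFE, one day by default). This is an illustration,
# not Condor's exact accounting; the interval and slot counts are invented.
HALF_LIFE = 86400.0       # seconds (the default PRIORITY_HALFLIFE)
UPDATE_INTERVAL = 300.0   # seconds between accounting updates (illustrative)

def update_priority(previous: float, slots_in_use: float) -> float:
    """Blend the old priority with current usage via exponential decay."""
    decay = 0.5 ** (UPDATE_INTERVAL / HALF_LIFE)
    return decay * previous + (1.0 - decay) * slots_in_use

# A user who had 160 slots sees their priority relax back toward zero once
# their jobs finish, restoring their share of future matches over time.
priority = 160.0
for _ in range(12):   # one hour of idle accounting updates
    priority = update_priority(priority, slots_in_use=0.0)
print(f"priority after one idle hour: {priority:.1f}")
```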

Grid Middleware [diagram; source: OSG Twiki documentation]

OSG Middleware: OSG middleware is installed and updated via the Virtual Data Toolkit (VDT). Site configuration was complex before the 1.0 release; it is simpler now. Provides the Globus framework and security via a Certificate Authority. Low maintenance: Resource Service Validation (RSV) provides a snapshot of the site. The Grid User Management System (GUMS) handles mapping of grid certificates to local users.
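GUMS resolves certificate DNs to local accounts dynamically from VO membership data; the static grid-mapfile format it can also generate for other services illustrates what that mapping looks like. The sketch below parses such a file; the DNs and account names are invented.

```python
# Sketch of the DN -> local-account mapping that GUMS provides to the site.
# The format shown is the classic grid-mapfile (a quoted certificate DN
# followed by a local UNIX account); all entries here are fictitious.
import shlex

EXAMPLE_MAPFILE = '''
"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 123456" uscms01
"/DC=org/DC=doegrids/OU=People/CN=John Gridadmin 654321" osgadmin
'''

def parse_grid_mapfile(text: str) -> dict:
    """Return a {certificate DN: local account} dictionary."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)
        mapping[dn] = account
    return mapping

for dn, account in parse_grid_mapfile(EXAMPLE_MAPFILE).items():
    print(f"{account:10s} <- {dn}")
```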

BeStMan Storage: Berkeley Storage Manager: the SE runs the basic gateway configuration - a short config, but hard to get working. Not nearly as difficult as dCache - BeStMan is a good replacement for small to medium sites. Allows grid users to transfer data to and from designated storage via an LFN, e.g. srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=/bestman/BeStMan/cms...
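As an illustration, a transfer into the SE might look like the sketch below. It assumes the srmcp SRM client and a valid grid proxy are available; the local file and the destination file name appended to the endpoint above are placeholders.

```python
# Sketch of a grid transfer into the BeStMan SE. Assumes the srmcp SRM client
# is installed and a valid grid proxy exists (e.g. from voms-proxy-init).
# The local file and the destination file name are placeholders.
import subprocess

source = "file:///home/user/results.root"
destination = (
    "srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server"
    "?SFN=/bestman/BeStMan/cms/user/results.root"
)

# The -2 flag selects the SRM v2 protocol spoken by BeStMan.
subprocess.run(["srmcp", "-2", source, destination], check=True)
```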

WLCG: The Large Hadron Collider is expected to produce 15 PB/year, and the Compact Muon Solenoid detector will be a large part of this. The Worldwide LHC Computing Grid (WLCG) handles the data and interfaces with sites in OSG, EGEE (European), etc. Tier 0 - CERN; Tier 1 - Fermilab; closest Tier 2 - UFlorida; Tier 3 - us! Tier 3 sites are not officially part of the CMS computing group (i.e. no funding), but are very important for dataset storage and analysis.

T2/T3 sites in the US [map showing Tier 2 and Tier 3 sites across the United States]

Cumulative Hours for FIT on OSG

Local Usage Trends: Over 400,000 cumulative hours for CMS. Over 900,000 cumulative hours by local users. Total of 1.3 million CPU hours utilized.

Tier-3 Sites: The Tier 3 role is not yet completely defined. Consensus: T3 sites give scientists a framework for collaboration (via transfer of datasets) and also provide compute resources. Sites are tested regularly by RSV and Site Availability Monitoring (SAM) tests, and OSG site information is published to CMS. FIT is one of the largest Tier 3 sites.

RSV & SAM Results

PhEDEx: Physics Experiment Data Exports - the final milestone for our site. Physics datasets can be downloaded from other sites or exported to other sites. All relevant datasets are catalogued in the CMS Data Bookkeeping System (DBS), which keeps track of the locations of datasets on the grid. A central web interface allows dataset copy and deletion requests.

Demo: DBS Data Discovery query (dbsInst=cms_dbs_ph_analysis_02): find dataset where site like *FLTECH* and dataset.status like VALID*

CMS Remote Analysis Builder (CRAB): A universal method for experimental data processing. Automates the analysis workflow, i.e. status tracking and resubmissions. Datasets can be exported to the Data Discovery page. Used extensively locally in our muon tomography simulations.

Network Performance: Changed to a default 64 kB block size across NFS. RAID array change to fix write caching. Increased kernel memory allocation for TCP. Improvements in both network and grid transfer rates. dd copy tests across the network: reading improved from 2.24 to 2.26 GB/s; writing improved from 7.56 MB/s.

[Tables: dd on the frontend, before and after tuning - block size 64k, write rate (MB/s), read rate (GB/s). Iperf on the frontend, before and after tuning - TCP server, TCP client, UDP jitter and loss (Mbit/s).]
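For reference, the dd-style throughput check can be reproduced with something like the sketch below; the file path and transfer size are placeholders, and the exact commands behind the slide's numbers are not recorded in the transcript.

```python
# Sketch of a dd-style throughput check over the NFS mount, using the 64 kB
# block size discussed above. The path and transfer size are placeholders;
# the exact commands behind the slide's numbers are not in the transcript.
import subprocess
import time

TEST_FILE = "/mnt/nas/ddtest.bin"  # placeholder path on the NFS-mounted NAS
BLOCK_SIZE = "64k"
COUNT = 16384                      # 16384 * 64 kB = 1 GiB of test data

def timed(cmd):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

write_s = timed(["dd", "if=/dev/zero", "of=" + TEST_FILE,
                 "bs=" + BLOCK_SIZE, "count=" + str(COUNT), "conv=fsync"])
read_s = timed(["dd", "if=" + TEST_FILE, "of=/dev/null",
                "bs=" + BLOCK_SIZE])

total_mb = 1024.0  # MiB written and then read back
print(f"write: {total_mb / write_s:.1f} MB/s, read: {total_mb / read_s:.1f} MB/s")
```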