Florida Tier2 Site Report
USCMS Tier2 Workshop, Livingston, LA, March 3, 2009
Presented by Yu Fu for the University of Florida Tier2 Team (Paul Avery, Dimitri Bourilkov, Yu Fu, Bockjoo Kim, Yujun Wu)

Outline
Site Status
–Hardware
–Software
–Network
FY2009 Plan
Experience with Metrics
Other Issues

Hardware Status
Computing Resources:
–UFlorida-PG (merged from the previous UFlorida-PG and UFlorida-IHEPA): 126 worknodes, 504 cores (slots)
84 * dual dual-core Opteron + 42 * dual dual-core Opteron worknodes, 6 GB RAM and 2 x 250 (500) GB HDD per node
794 kSI2k, RAM: 1.5 GB/slot
Older than 3 years and out of warranty; considering upgrading or replacing them in the future.
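As a quick sanity check (not from the slide itself, just arithmetic on its numbers), the slot count and the RAM per slot follow directly from the node inventory quoted above:

```python
# Back-of-the-envelope check of the UFlorida-PG figures quoted above.
nodes = 84 + 42              # dual dual-core Opteron worknodes
cores_per_node = 2 * 2       # two sockets, two cores each
ram_per_node_gb = 6

print(nodes * cores_per_node)              # 504 cores (slots)
print(ram_per_node_gb / cores_per_node)    # 1.5 GB of RAM per slot
```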

Hardware Status
–UFlorida-HPC: 530 worknodes, 2568 cores (slots)
112 * dual quad-core Xeon worknodes plus dual dual-core Opteron worknodes
5240 kSI2k, RAM: 4 GB/slot, 2 GB/slot or 1 GB/slot (varies by node)
Managed by the UF HPC Center; the Florida Tier2 invested in it in three phases.
The Tier2's official share/quota is 900 slots (35% of the total slots), and the Tier2 can use more slots on an opportunistic basis. The actual average Tier2 usage is ~50%.
Tier2's dedicated SpecInt: 1836 kSI2k (see the cross-check below).
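The dedicated-share numbers quoted above are consistent with a straight 35% cut of the full cluster; a small sketch of that arithmetic, using only the slide's own figures:

```python
# Tier2 quota as a 35% share of the full UFlorida-HPC cluster.
total_slots = 2568
total_ksi2k = 5240
tier2_share = 0.35           # official Tier2 quota, per the slide

print(round(total_slots * tier2_share))   # ~899, i.e. the 900-slot quota
print(round(total_ksi2k * tier2_share))   # ~1834 kSI2k, close to the quoted 1836 kSI2k
```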

HPC cluster usage over the last month

Hardware Status
–Interactive analysis cluster for CMS: 5 nodes + 1 NIS server + 1 NFS server
1 * dual quad-core Xeon plus dual dual-core Opteron nodes, 2 GB RAM/core, 18 TB total disk.
–Total Florida CMS Tier2 dedicated computing power (grid only, not including the interactive analysis cluster): 2.63M SpecInt2000, 1404 batch slots (cores); see the tally below.
–Have fulfilled the 2009 milestone of 1.5M SpecInt2000.
–Still considering acquiring more computing power in FY09.
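The site-wide totals above are simply the sum of the two grid clusters; a minimal tally using the figures from the previous slides:

```python
# Total dedicated grid computing power of the Florida CMS Tier2.
pg_ksi2k, pg_slots = 794, 504        # UFlorida-PG (merged cluster)
hpc_ksi2k, hpc_slots = 1836, 900     # Tier2's dedicated share of UFlorida-HPC

print(pg_ksi2k + hpc_ksi2k)   # 2630 kSI2k, i.e. ~2.63M SpecInt2000
print(pg_slots + hpc_slots)   # 1404 batch slots
```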

Hardware Status
Storage Resources:
–Data RAID: gatoraid1, gatoraid2, storing CMS software, $DATA, $APP, etc. 3ware controllers with SATA drives, mounted via NFS. Very reliable.
–Resilient dCache: 2 x 250 (500) GB SATA drives on each worknode. Acceptable reliability, a few failures.
–Non-resilient RAID dCache: FibreChannel RAID (pool03, pool04, pool05, pool06) + 3ware-based SATA RAID (pool01, pool02), with 10GbE or bonded 1GbE network. Very reliable.
–HPC Lustre storage, accessible both directly and via dCache.
–20 GridFTP doors: 20 x 1 Gbps

Hardware Status
–Total dCache Storage:

Resource      Raw       Usable    Usable (after 1/2 factor for resilient)
pool01        9.6TB     6.9TB     6.9TB
pool02        9.6TB     6.9TB     6.9TB
pool03                  13.9TB    13.9TB
pool04                  13.9TB    13.9TB
pool05                  47.2TB    47.2TB
pool06                  55.2TB    55.2TB
worknodes     93.0TB    71.1TB    35.6TB
HPC Lustre    35TB      30TB      30TB (space actually used by T2)
Total         309TB     245TB     210TB

(Hard drives in the UFlorida-HPC worknodes are not counted, since they are not deployed in the dCache system.)
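The right-most column can be reproduced by halving the resilient worknode space, since resilient dCache keeps two copies of every file. A minimal sketch using only the usable figures from the table above:

```python
# Effective dCache space after the 1/2 factor for the resilient worknode pools.
usable_tb = {
    "pool01": 6.9, "pool02": 6.9,
    "pool03": 13.9, "pool04": 13.9,
    "pool05": 47.2, "pool06": 55.2,
    "worknodes": 71.1,       # resilient pools: every file is stored twice
    "HPC Lustre": 30.0,      # space actually used by the Tier2
}
resilient = {"worknodes"}

effective = sum(v / 2 if pool in resilient else v for pool, v in usable_tb.items())
print(round(effective, 1))   # ~209.6 TB, i.e. the ~210 TB quoted in the table
```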

Hardware Status
–Total raw storage is 309TB; the real usable Tier2 dCache space is 210TB, so there is still a gap to the 400TB target for 2009.
–Planning to deploy 200TB of new RAID in 2009.

Software Status
Most systems run 64-bit SLC4; some have migrated to SLC5.
OSG
Condor (UFlorida-PG and UFlorida-IHEPA) and PBS (UFlorida-HPC) batch systems
dCache
PhEDEx
Squid 3.0
GUMS
……
All resources are managed with a customized 64-bit ROCKS 4; all RPMs and kernels are upgraded to the current SLC4 versions. Preparing ROCKS 5.

Network Status
Cisco 6509 switch
–All 9 slots populated
–2 blades of 4 x 10 GigE ports each
–6 blades of 48 x 1 GigE ports each
20 Gbps uplink to the campus research network
20 Gbps to UF HPC
10 Gbps to UltraLight via FLR and NLR
Florida Tier2's own domain and DNS
All nodes, including worknodes, are on public IPs, directly connected to the outside world without NAT.
UFlorida-HPC and HPC Lustre use InfiniBand.
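A quick tally of the switch configuration described above (only arithmetic on the slide's numbers; the guess that the ninth slot holds the supervisor module is ours):

```python
# Port inventory of the Cisco 6509 described above.
tengig_ports = 2 * 4     # 2 blades with 4 x 10 GigE ports each
gige_ports = 6 * 48      # 6 blades with 48 x 1 GigE ports each

print(tengig_ports, gige_ports)   # 8 x 10 GigE and 288 x 1 GigE ports
# 8 line-card blades plus (presumably) a supervisor module fill the 9 slots.
```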

FY09 Hardware Deployment Plan
Computing resources
–Have already met the 1.5M SI2k milestone; still considering adding more computing power.
–Investigating two options:
Purchase a new cluster: power, cooling and network impact.
Upgrade the present UFlorida-PG cluster to dual quad-core: no additional power and cooling impact, may be more cost-effective, but re-used old parts may be unreliable.
–No official SpecInt2000 numbers available for new processors?

FY09 Hardware Deployment Plan
Storage resources
–Will purchase 200TB of new RAID.
–To avoid network bottlenecks, we don't want to put too much disk behind a single I/O node.
–Considering 4U 24-drive servers with internal hardware RAID controllers.
–Waiting until 2TB drives become reasonably available with enterprise-level stability: 1TB drives require more servers and will soon be made obsolete by the 2TB ones (see the estimate below).
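To make the 1TB-versus-2TB argument concrete, here is a rough server-count estimate for the planned purchase (a sketch only; it ignores RAID and filesystem overhead):

```python
# Rough number of 4U 24-drive servers needed to reach 200 TB of raw disk.
target_tb = 200
drives_per_server = 24

for drive_tb in (1, 2):
    servers = -(-target_tb // (drives_per_server * drive_tb))   # ceiling division
    print(f"{drive_tb} TB drives -> {servers} servers (raw capacity only)")
# 1 TB drives need roughly twice as many servers, and hence I/O nodes, as 2 TB drives.
```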

Experience with Metrics
SAM: excellent

Experience with Metrics
RVS: good

Experience with Metrics
JobRobot: OK
–We had three (now two) different clusters.
–Limited available slots on the small clusters – merging may help.
–Proxy expiration problems (see the sketch below).
–Ambiguous errors due to gLite job submission.
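For the proxy-expiration problems, a simple watchdog along the following lines could warn before a submission proxy runs out. This is only a sketch, not the site's actual tooling; it assumes a standard VOMS client (voms-proxy-info) is installed and a valid proxy already exists:

```python
# Warn when the grid proxy used for automated job submission is close to expiring.
import subprocess

def proxy_seconds_left():
    # "voms-proxy-info -timeleft" prints the remaining proxy lifetime in seconds.
    result = subprocess.run(["voms-proxy-info", "-timeleft"],
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())

if __name__ == "__main__":
    left = proxy_seconds_left()
    if left < 6 * 3600:   # less than six hours left
        print("WARNING: proxy expires in %.1f hours, renew it now" % (left / 3600))
```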

Experience with Metrics
We have our own monitors watching the SAM, RVS and JobRobot metrics systems. They notify us of problems instantly by e-mail, which lets us fix problems as quickly as possible. The alarm e-mails also help us decide which tools we need to develop and improve to better monitor, diagnose and fix problems. The tools developed around these alarm e-mails have proved to be very useful.
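A minimal sketch of the kind of alarm e-mail described here; the check, the addresses and the assumption of a local SMTP relay are ours, and only Python's standard smtplib is used:

```python
# Minimal alarm mailer: send an e-mail when a monitored metric reports a failure.
import smtplib
from email.message import EmailMessage

def send_alarm(subject, body,
               sender="tier2-monitor@example.edu",        # hypothetical addresses
               recipients=("tier2-admins@example.edu",)):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:               # assumes a local mail relay
        smtp.send_message(msg)

# Example: a (hypothetical) failed SAM check triggering an alarm.
sam_ok = False
if not sam_ok:
    send_alarm("SAM critical test failed", "Check the CE/SE SAM results and logs.")
```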

Experience with Metrics
We also have various other monitoring systems: Ganglia, Nagios, and systems that monitor all aspects of the hardware and software as well as services such as PhEDEx, dCache, Squid, the status of dataset transfers, etc.
Operations support helps only if a useful solution, suggestion or hint is provided. We often find we have to understand the details of what operations support is doing ourselves, and this can take quite some time.

Other Issues
Merging UFlorida-PG and UFlorida-IHEPA.
Working on a prototype combined Condor-PBS gatekeeper based on random selection (illustrated in the sketch below).
Overloaded gatekeepers – planning to upgrade the hardware.
What is the best way to deal with out-of-warranty old machines?
–They become unstable and unreliable.
–Parts can be expensive and hard to find.
–Significantly lower performance and capacity than new machines, and less energy-efficient.
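The "random selection" idea behind the prototype combined gatekeeper can be illustrated as follows. This is only a sketch of the concept, not the actual prototype; weighting the choice by each cluster's slot count is our assumption:

```python
# Illustrative random routing of incoming grid jobs to the Condor or PBS cluster,
# weighted by the number of batch slots behind each one.
import random
from collections import Counter

BACKENDS = {"condor": 504, "pbs": 900}   # UFlorida-PG vs the Tier2 share of UFlorida-HPC

def pick_backend():
    names, weights = zip(*BACKENDS.items())
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    print(Counter(pick_backend() for _ in range(10000)))   # roughly a 36% / 64% split
```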

Other Issues
–Question: continue to maintain, de-commission, or upgrade?
–An example: we considered upgrading the UFlorida-PG gatekeeper's memory to 16GB, but it turned out that 8 x 2GB of registered ECC DDR memory would cost more than $1000. We finally decided to upgrade the motherboard, processors and memory instead; the total cost was less than $1000, and we got:
A new motherboard with a better chipset
New, more efficient heatpipe-type heatsinks
Dual dual-core processors -> faster dual quad-core processors
4GB DDR RAM -> 16GB DDR2 RAM (faster than 16GB of DDR)
–Bottom line: it can be more cost-effective to get a new machine, or upgrade to effectively a new machine, than to fix an old one.