Computing at CDF ➢ Introduction ➢ Computing requirements ➢ Central Analysis Farm ➢ Conclusions. Frank Wurthwein, MIT/FNAL-CD, for the CDF Collaboration

CDF in a Nutshell ➢ CDF + D0 experiments analyze pp̄ collisions from the Tevatron at Fermilab ➢ Tevatron is the highest-energy collider in the world (1.96 TeV) until the LHC ➢ Run I (1992-1996) was a huge success → 200+ papers (top quark discovery, ...) ➢ Run II (March 2001-) upgrades for luminosity (×10) + energy (~10%) → expect integrated luminosity 20× (Run IIa) and 150× (Run IIb) that of Run I. Run II physics goals: ➢ Search for the Higgs boson ➢ Top quark properties (m_t, σ_tot, ...) ➢ Electroweak (m_W, Γ_W, ZZ, ...) ➢ Search for new physics (e.g. SUSY) ➢ QCD at large Q² (jets, α_s, ...) ➢ CKM tests in b hadron decays

CDF Run II Collaboration. Goal: provide computing resources for 200+ collaborators doing analysis simultaneously, every day!

CDF DAQ/Analysis Flow [diagram]: 7 MHz beam crossing, 0.75 million channels → Level 1/Level 2 triggers (300 Hz) → Level 3 Trigger (75 Hz, 20 MB/s) → Production Farm (reconstruction) → Robotic Tape Storage ↔ Central Analysis Facility (CAF) ↔ User Desktops (data + MC read/write).
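A quick consistency check of the quoted Level 3 output bandwidth, using the ~250 kB/event raw-data size given on the data/software characteristics slide further below (a back-of-the-envelope sketch, not from the original slides):

```python
# Consistency check of the Level 3 output bandwidth quoted in the DAQ flow
# (75 Hz at 20 MB/s), using the ~250 kB/event raw-data size quoted on the
# data/software characteristics slide.
l3_rate_hz = 75        # Level 3 accept rate
raw_event_kb = 250     # approximate raw event size, kB/event

bandwidth_mb_s = l3_rate_hz * raw_event_kb / 1000.0
print(f"L3 output: {bandwidth_mb_s:.0f} MB/s")   # ~19 MB/s, consistent with ~20 MB/s
```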

Reconstruction Farms. Data reconstruction + validation, Monte Carlo generation. 154 dual P3's (equivalent to ... 1 GHz machines). Job management: ➢ Batch system → FBSNG, developed at FNAL ➢ Single executable, validated offline. [plot: 150 million events]

Data Handling. Data archived using STK 9940 drives and a tape robot. Enstore: network-attached tape system developed at FNAL → provides the interface layer for staging data from tape. [plot: 5 TB/day, 100 TB] Today: 176 TB on tape.

Database Usage at CDF. Oracle DB: metadata + calibrations. DB hardware: ➢ 2 Sun E4500 duals ➢ 1 Linux quad. Presently evaluating: ➢ MySQL ➢ Replication to remote sites ➢ Oracle9 streams, failover, load balance

Data/Software Characteristics. Data characteristics: ➢ Root I/O, sequential, for raw data: ~250 kB/event ➢ Root I/O, multi-branch, for reco data: ... kB/event ➢ 'Standard' ntuple: 5-10 kB/event ➢ Typical Run IIa secondary dataset size: 10^7 events. Analysis software: ➢ Typical analysis job: 5 Hz on a 1 GHz P3 → a few MB/s ➢ CPU-bound rather than I/O-bound (Fast Ethernet)

Computing Requirements. Requirements set by the goal: 200 simultaneous users, each analyzing a secondary data set (10^7 events) in a day. Need ~700 TB of disk and ~5 THz of CPU by end of FY'05: ➢ need lots of disk → need cheap disk → IDE RAID ➢ need lots of CPU → commodity CPU → dual Intel/AMD
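The CPU figure follows from the numbers above; a back-of-the-envelope sketch (the 5 Hz per 1 GHz rate is taken from the previous slide):

```python
# Back-of-the-envelope estimate of the aggregate CPU requirement, assuming
# 200 concurrent users each analyzing 1e7 events per day at the ~5 Hz per
# 1 GHz P3 rate quoted on the previous slide.
users = 200
events_per_user = 1e7        # secondary dataset size
seconds_per_day = 86400
events_per_s_per_ghz = 5.0   # analysis rate per GHz of CPU

total_rate = users * events_per_user / seconds_per_day   # ~23,000 events/s
cpu_ghz = total_rate / events_per_s_per_ghz              # ~4,600 GHz

print(f"aggregate rate: {total_rate:,.0f} events/s")
print(f"CPU needed:     {cpu_ghz / 1000:.1f} THz")        # ~5 THz
```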

Past CAF Computing Model. Very expensive to expand and maintain. Bottom line: not enough 'bang for the buck'.

Design Challenges
➢ Develop/debug on a remote desktop → code management & rootd
➢ Send binary & 'sandbox' for execution on the CAF → kerberized gatekeeper
➢ No user accounts on the cluster, BUT user access to scratch space with quotas → creative use of Kerberos

CAF Architecture. Users are able to: ➢ submit jobs ➢ monitor job progress ➢ retrieve output, from 'any desktop' in the world.

CAF Milestones
➢ 11/01: Start of CAF design
➢ 2/25/02: CAF prototype (protoCAF) assembled
➢ 3/6/02: Fully-functional prototype system (>99% job success)
➢ 4/25/02: ProtoCAF integrated into Stage 1 system
➢ 5/30/02: Production Stage 1 CAF for the collaboration
Design → production system in 6 months!

CAF Stage 1 Hardware [diagram]: Worker Nodes, File Servers, Code Server, Linux 8-ways (interactive).

Stage 1 Hardware: Workers. Workers (132 CPUs, 1U+2U rackmount): ➢ 16 2U dual Athlon 1.6 GHz / 512 MB RAM ➢ 50 1U/2U dual P3 1.26 GHz / 2 GB RAM ➢ Fast Ethernet (11 MB/s) / 80 GB job scratch each

Stage 1 Hardware: Servers. Servers (35 TB total, 16 4U rackmount): ➢ 2.2 TB usable IDE RAID50, hot-swap ➢ dual P3 1.4 GHz / 2 GB RAM ➢ SysKonnect 9843 Gigabit Ethernet card

File Server Performance. Server/client performance: up to 200 MB/s local reads, 70 MB/s over NFS. Data integrity tests: md5sum of local reads/writes under heavy load; BER read/write = .../... Cooling tests: temperature profile of disks taken with an IR gun after extended disk thrashing. [plot: 60 MB/s, 70 MB/s]
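A minimal sketch of that kind of md5-based read/write integrity check (file path and size are hypothetical; the actual CDF test procedure and load generator are not reproduced here):

```python
# Minimal sketch of an md5-based read/write integrity test, in the spirit of
# the checks described above. File path and size are hypothetical, and the
# real tests ran under sustained heavy I/O load, which is not reproduced here.
import hashlib
import os

def md5_of(path, chunk=1 << 20):
    """Stream a file through MD5 in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_test_file(path, size_mb=256):
    """Write pseudo-random data and return the MD5 of what was written."""
    h = hashlib.md5()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            block = os.urandom(1 << 20)
            h.update(block)
            f.write(block)
    return h.hexdigest()

if __name__ == "__main__":
    path = "/scratch/integrity_test.dat"    # hypothetical scratch location
    written = write_test_file(path)
    read_back = md5_of(path)
    print("OK" if written == read_back else "BIT ERROR DETECTED")
```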

Stage 2 Hardware. Worker nodes: 238 dual Athlon MP 2000+, 1U rackmount → 1 THz of CPU power. File servers: 76 systems, 4U rackmount, dual redundant power supplies; 14 WD 180 GB disks in 2 RAID5 arrays on 3ware controllers; WD 40 GB in RAID1 on 3ware; GigE SysKonnect 9843; dual P3 1.4 GHz → 150 TB disk cache.

Stage 1 Data Access. Static files on disk: NFS-mounted to worker nodes; remote file access via rootd. Dynamic disk cache: dCache in front of the Enstore robot.
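For illustration, ROOT can open files through rootd ('root://') and dCache ('dcap://') URLs; a minimal PyROOT sketch with hypothetical server names and paths:

```python
# Minimal PyROOT sketch of remote file access through rootd and dCache URLs.
# Server names and file paths are hypothetical; requires a ROOT build with
# the corresponding network and dCache plugins.
import ROOT

# Static disk: read through a rootd daemon on a CAF file server.
f1 = ROOT.TFile.Open("root://fileserver.example.fnal.gov//cdf/data/run123.root")

# Dynamic cache: read through a dCache door sitting in front of Enstore.
f2 = ROOT.TFile.Open("dcap://dcachedoor.example.fnal.gov:22125/pnfs/cdf/run123.root")

for f in (f1, f2):
    if f and not f.IsZombie():
        f.ls()       # list the file contents
        f.Close()
```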

Problems & Issues LCCWS'02 Resource overloading: ➢ DB meltdown  dedicated replica, startup delays ➢ Rcp overload  replaced with fcp ➢ Rootd overload  replaced with NFS,dCache ➢ File server overload  scatter data randomly System issues: ➢ Memory problems  improved burn-in for next time ➢ Bit error during rcp  checksum after copy dCache filesystem issues  xfs & direct I/O

Lessons Learned ➢ Expertise in FNAL-CD is essential. ➢ Well-organized code management is crucial. ➢ Independent commissioning of data handling and job processing → 3 ways of getting data to the application.

CAF: User Perspective. Job related: ➢ submit jobs ➢ check progress of a job ➢ kill a job. Remote file system access: ➢ 'ls' in the job's 'relative path' ➢ 'ls' in a CAF node's absolute path ➢ 'tail' of any file in the job's 'relative path'

CAF Software [diagram]

CAF User Interface ➢ Compile, build, debug the analysis job on the 'desktop' ➢ Fill in the appropriate fields & submit the job ➢ Retrieve output using kerberized FTP tools... or write output directly to the 'desktop'! [GUI fields: section integer range, user exe+tcl directory, output destination]
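The GUI fields above suggest the shape of a job description; the following is a purely hypothetical sketch (the CafJob class and its field names are invented for illustration, not the actual CAF client):

```python
# Purely illustrative sketch of the information a CAF submission collects,
# based on the GUI fields listed above. The CafJob class and its field names
# are invented; the real CAF client is not reproduced here.
from dataclasses import dataclass

@dataclass
class CafJob:
    exe_and_tcl_dir: str      # directory holding the user binary + tcl steering
    sections: range           # integer range, one section per parallel slice
    process_type: str         # queue length class: test / short / medium / long
    output_destination: str   # where kerberized FTP delivers the output

job = CafJob(
    exe_and_tcl_dir="~/myanalysis",            # hypothetical sandbox directory
    sections=range(1, 12),                     # 11 sections
    process_type="medium",                     # 6-hour queue
    output_destination="mydesktop.example.edu:/data/out",
)
print(job)
```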

Web Monitoring of User Queues. Each user has a different queue. Process type sets the job length: test: 5 mins, short: 2 hrs, medium: 6 hrs, long: 2 days. This example: 1 job → 11 sections (+ 1 additional section, created automatically, for job cleanup).
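An illustrative sketch of the process-type limits and the job-to-sections expansion described above (names and structure are assumptions; the real CAF scheduler is not reproduced here):

```python
# Illustrative sketch of the process-type limits and the job-to-sections
# expansion described above. Names and structure are assumptions; the real
# CAF scheduler is not reproduced here.
from datetime import timedelta

PROCESS_TYPES = {                 # job-length classes from the slide
    "test":   timedelta(minutes=5),
    "short":  timedelta(hours=2),
    "medium": timedelta(hours=6),
    "long":   timedelta(days=2),
}

def expand_sections(section_range):
    """One job expands into its user sections plus one automatic cleanup section."""
    return [f"section_{i}" for i in section_range] + ["cleanup"]

print(PROCESS_TYPES["medium"])                  # 6:00:00
print(len(expand_sections(range(1, 12))))       # 11 user sections + 1 cleanup = 12
```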

Monitoring jobs in your queue [screenshot]

Monitoring sections of your job [screenshot]

CAF Utilization. CAF in active use by the CDF collaboration: ➢ 300 CAF users (queues) to date ➢ Several dozen simultaneous users in a typical 24 hr period

CAF System Monitoring: Round Robin Database (RRD)
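For context, RRD-based monitoring typically works as sketched below, using the standard rrdtool command-line tool (the database name, step, and data-source layout are assumptions, not the CAF's actual configuration):

```python
# Sketch of feeding a CPU-utilization gauge into an RRD using the standard
# rrdtool command-line tool. The file name, step, and data-source layout are
# assumptions for illustration, not the CAF's actual configuration.
import subprocess

def create_rrd(path="caf_cpu.rrd"):
    # 5-minute step, one 0-100% gauge, keep one day of 5-minute averages.
    subprocess.run([
        "rrdtool", "create", path, "--step", "300",
        "DS:cpu:GAUGE:600:0:100",
        "RRA:AVERAGE:0.5:1:288",
    ], check=True)

def record_cpu(percent, path="caf_cpu.rrd"):
    # 'N' stamps the sample with the current time.
    subprocess.run(["rrdtool", "update", path, f"N:{percent}"], check=True)

if __name__ == "__main__":
    create_rrd()
    record_cpu(80)   # e.g. the ~80% average CPU utilization quoted later
```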

CPU Utilization [plots: day, week, 3 months]. CAF utilization steadily rising since it was opened to the collaboration. Provided a 10-fold increase in analysis resources for last summer's physics conferences. Need for more CPU for the winter.

[plots, 1 week and 3 months: file server and worker node aggregate I/O, 4-8 TB/day; data processing average I/O; ~80% CPU utilization]

Work in Progress ➢ Stage 2 upgrade: 1 THz CPU & 150 TB disk ➢ SAM → framework for global data handling/distribution ➢ ''DCAF'' → remote ''replicas'' of the CAF ➢ Central login at FNAL

CAF Summary. Distributed desktop-to-farm computing model. Production system under heavy use: ➢ Single farm at FNAL: 4-8 TB/day processed by user applications; average CPU utilization of 80% ➢ Many users all over the world: 300 total users; typically 30 users per day share 130 CPUs; regularly several 1000 jobs queued ➢ Connected to tape via a large cache ➢ Currently upgrading to 1 THz & 150 TB

CDF Summary. Variety of computing systems deployed: ➢ Single-application farms: online & offline ➢ Multiple-application farm: user analysis farm ➢ Expecting 1.7 petabyte tape archive by FY05 ➢ Expecting 700 TB disk cache by FY05 ➢ Expecting 5 THz of CPU by FY05 ➢ Oracle DB cluster with load balancing & failover for metadata.