CMS Computing Model Simulation
Stephen Gowdy, FNAL
9th Feb 2015

- Want to look at different computing models
  - To use caching
  - Where to place caches
  - How large the caches need to be
- Discussion with others about possibly collaborating
- Writing a basic Python simulation
  - Could change to C++ if better performance is needed

- Event-driven discrete simulation
  - Each job is an event
- Takes account of the slots available at sites
- Allows for perfect transfers between sites
- Can check the limit on a site's internal bandwidth
  - Information on this is not available in SiteDB
- Code is in …
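
A minimal sketch of what the event-driven loop might look like, assuming a Job with start/end/runTime attributes and a fixed number of batch slots (names follow the data-structures slide further down); this is illustrative only, not the actual Simulation.py code.

import heapq
from types import SimpleNamespace

# Minimal event-driven loop: each job completion is an event on a
# time-ordered heap. The logic is a sketch under assumed attribute names.
def run(cores, jobs):
    events = []              # min-heap of (finish time, tie-breaker, job)
    pending = list(jobs)     # jobs still waiting for a free slot
    clock = 0.0
    while pending or events:
        # fill free slots with queued jobs
        while pending and len(events) < cores:
            job = pending.pop(0)
            job.start = clock
            job.end = clock + job.runTime
            heapq.heappush(events, (job.end, id(job), job))
        # advance the clock to the next job completion
        clock, _, finished = heapq.heappop(events)
    return clock             # total simulated wall-clock time

jobs = [SimpleNamespace(runTime=t, start=0.0, end=0.0) for t in (3.0, 5.0, 2.0)]
print(run(cores=2, jobs=jobs))   # prints 5.0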

- Flat files are read to load site, network, job and file information
  - Set up sites and links
  - Next set up the catalogue of data
- Read in simulation parameters for CPU efficiency and the remote-read penalty
- Start processing jobs in sequence (see the sketch below)
  - Use the list of jobs from the dashboard to feed the simulation
  - See how the model performs when processing the current jobs
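
As an illustration of the first two setup steps, a sketch of reading flat files into site and catalogue structures; the file formats, column names and helper functions here are assumptions for the example and are likely not the real input format read by Simulation.py.

import csv, io

# Hypothetical flat-file contents for the site and catalogue setup steps.
SITES_TXT = """name,disk_TB,slots,bandwidth_GBs
T1_US_FNAL,10000,5000,20
T2_US_MIT,1000,1000,20
"""

FILES_TXT = """lfn,size_GB,sites
/store/data/a.root,2.0,T1_US_FNAL;T2_US_MIT
/store/data/b.root,1.5,T1_US_FNAL
"""

def read_sites(text):
    # one dict per site: disk, batch slots, assumed internal bandwidth
    return {row["name"]: {"disk": float(row["disk_TB"]),
                          "slots": int(row["slots"]),
                          "bandwidth": float(row["bandwidth_GBs"])}
            for row in csv.DictReader(io.StringIO(text))}

def read_catalogue(text):
    # catalogue maps each lfn to the sites holding it; files keeps (lfn, size)
    catalogue, files = {}, []
    for row in csv.DictReader(io.StringIO(text)):
        catalogue[row["lfn"]] = row["sites"].split(";")
        files.append((row["lfn"], float(row["size_GB"])))
    return catalogue, files

sites = read_sites(SITES_TXT)
catalogue, files = read_catalogue(FILES_TXT)
print("Read in %d sites." % len(sites))
print("Read in %d files." % len(files))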

Data structures used in the simulation (from the class diagram):

- Job: site, cpuTime, inputData, fractionRead, start, end, runTime, dataReadTime, dataReadCPUHit, theStore
- Site: name, disk, bandwidth, network [[site, bandwidth, quality, latency] … ], batch
- Batch: qjobs [ Job ], rjobs [ Job ], djobs [ Job ], cores, bandwidth
- EventStore: catalogue {lfn: [site … ]}, files [(lfn, size) … ]
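
The same four structures written out as Python dataclasses, with types guessed from the attribute names in the diagram; the real classes in the simulation may differ.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Job:
    site: str
    cpuTime: float
    inputData: List[str]
    fractionRead: float
    start: float = 0.0
    end: float = 0.0
    runTime: float = 0.0
    dataReadTime: float = 0.0
    dataReadCPUHit: float = 0.0
    theStore: "EventStore" = None

@dataclass
class Site:
    name: str
    disk: float
    bandwidth: float
    # per-destination links: [[site, bandwidth, quality, latency], ...]
    network: List[list] = field(default_factory=list)
    batch: "Batch" = None

@dataclass
class Batch:
    cores: int
    bandwidth: float
    qjobs: List[Job] = field(default_factory=list)  # queued jobs
    rjobs: List[Job] = field(default_factory=list)  # running jobs
    djobs: List[Job] = field(default_factory=list)  # done jobs

@dataclass
class EventStore:
    catalogue: Dict[str, List[str]] = field(default_factory=dict)  # {lfn: [site, ...]}
    files: List[Tuple[str, float]] = field(default_factory=list)   # [(lfn, size), ...]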

- Site information is extracted from the SiteDB pledge database
  - Uses the information for 2014, the most recent update
  - If a site has no pledge, just assume 10TB of disk and 100 slots
  - The default for Tier-2s is larger, so this should probably be updated
  - No internal bandwidth information, so assume 20GB/s at all sites
- Recently started considering only US Tier-1 and Tier-2 sites
  - Sizes taken by hand from REBUS (could probably be automated as well)
  - Vanderbilt is assumed to be the same as the others
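
A small sketch of how those defaults might be applied when building a site; the 10TB / 100 slots / 20GB/s numbers are from this slide, while the dictionary keys and the function itself are hypothetical.

DEFAULT_DISK_TB = 10
DEFAULT_SLOTS = 100
INTERNAL_BANDWIDTH_GBS = 20   # not in SiteDB, assumed for every site

def site_resources(pledge):
    # pledge: per-site dict from SiteDB for 2014, possibly missing entries
    return {
        "disk_TB": pledge.get("disk_TB", DEFAULT_DISK_TB),
        "slots": pledge.get("slots", DEFAULT_SLOTS),
        "internal_bandwidth_GBs": INTERNAL_BANDWIDTH_GBS,
    }

print(site_resources({}))                                # unpledged site -> defaults
print(site_resources({"disk_TB": 4000, "slots": 5000}))  # pledged site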

- Per-job fields: site, start time, wall clock, CPU time, files read, percentage of each file read
  - The latter isn't available from the dashboard
  - Possible to get it from xrootd monitoring, but how to link the information?
  - Just use the xrootd information statistically?
- Job information was extracted from the dashboard
  - From 8pm on 22nd September until midnight
  - About 4% of jobs have no site information (discarded)
  - About 1% have no CPU time (use the wall clock instead)
  - About 2% have no start time (assume the job ran for its CPU time ending at the end time; see the sketch below)
- Will compare the simulated wall clock with the actual one as a check of simulation quality
- Compare the overall simulated wall clock time to compare different scenarios
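
A sketch of those clean-up rules applied to dashboard job records; the percentages and fallbacks come from the slide, while the dictionary keys ("site", "cpu", "wall", "start", "end") are assumptions.

def clean_jobs(records):
    jobs = []
    for rec in records:
        if not rec.get("site"):        # ~4%: no site information -> discard
            continue
        if not rec.get("cpu"):         # ~1%: no CPU time -> use the wall clock
            rec["cpu"] = rec["wall"]
        if not rec.get("start"):       # ~2%: no start time -> end minus CPU time
            rec["start"] = rec["end"] - rec["cpu"]
        jobs.append(rec)
    return jobs

example = [{"site": "T2_US_MIT", "wall": 3600, "cpu": 0, "start": 0, "end": 100000}]
print(clean_jobs(example))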

- Extract the network mesh from PhEDEx
  - Using the links interface
  - Also get reliability information
  - If not present, 99% is assumed
  - No actual transfer rate information is available for the links
  - Use what is available to get a number between 1GB/s and 10GB/s; not at all accurate, default 1GB/s (see the sketch below)
- Extract file location information from PhEDEx
  - No historical information is available
  - When updating the job information, the file locations also need to be updated
  - Only get information on the files used by the jobs
  - 740 of the 8939 looked like they read data remotely (but some of these will be due to stale PhEDEx info)
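
An illustrative link record built from the PhEDEx links call: the 99% reliability default and the 1-10GB/s clamp with a 1GB/s default follow the slide, while the field names and the function itself are assumptions for the example.

def make_link(src, dst, rate_guess_GBs=None, reliability=None):
    if rate_guess_GBs is None:
        rate = 1.0                                   # default 1GB/s
    else:
        rate = min(max(rate_guess_GBs, 1.0), 10.0)   # clamp to 1-10GB/s
    return {"from": src, "to": dst,
            "bandwidth_GBs": rate,                   # very rough estimate
            "quality": 0.99 if reliability is None else reliability}

print(make_link("T1_US_FNAL_Disk", "T2_US_Nebraska"))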

Startup output when only using US T1 and T2 sites:

$ python python/Simulation.py
Read in 9 sites.
Read in 72 network links.
Read in 9982 files.
Read in 6728 locations.
Read in 3 latency bins.
Read in 10 job efficiency slots.
About to read and simulate 2611 jobs...
…

- A caching strategy needs to be added later
  - Including cache cleaning when a cache is getting full
  - A cache hierarchy
- Currently the simulation allows either no transfers or transfers; transfers can also be discarded (see the sketch below)
  - It won't transfer if there is no space available at a site
- Implement different models
  - With the new version of xrootd a job can read a file while it is still being transferred
  - The actual current model of reading remotely when data is not present locally should be added
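
A sketch of those transfer behaviours as an enum plus a space check; the names are illustrative, and "discard" is read here as dropping the transferred copy after the job, which may not be exactly what the simulation does.

from enum import Enum

class TransferMode(Enum):
    NONE = "no transfers"        # jobs read remotely (or cannot run)
    KEEP = "transfer and keep"   # copy missing files before the job runs
    DISCARD = "transfer, then drop the copy afterwards"

def can_transfer(free_disk_GB, file_size_GB, mode):
    # Never transfer if there is no space available at the destination site.
    if mode is TransferMode.NONE:
        return False
    return free_disk_GB >= file_size_GB

print(can_transfer(50.0, 2.0, TransferMode.KEEP))    # True
print(can_transfer(1.0, 2.0, TransferMode.DISCARD))  # False, no space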

- Ran the standard set of 2293 US jobs
- With all data for a job transferred in serial, the total wall clock time is ~86.4Ms
  - 249 jobs need to transfer at least one file, taking a total of 1263s
- Enabling remote read increases the total time to ~87.5Ms
  - This only affects jobs that don't have their data locally
  - Needs to be updated to reflect actual transfer times
  - Currently idealised: the whole bandwidth is used for every transfer
- Enabling parallel transfers (i.e. only counting the longest transfer per job) reduces the time (see the toy example below)
  - 248 jobs need to transfer a file, taking a total of 641s
- Fairly large variations due to random numbers, so the code was converted to use fixed seeds
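
A toy illustration of why counting only the longest transfer per job (parallel transfers) adds less time than summing the transfers in series; the file sizes and the 1GB/s rate are made up for the example.

files_GB = [2.0, 1.5, 0.5]     # files one job has to fetch
rate_GBs = 1.0

serial_penalty = sum(size / rate_GBs for size in files_GB)    # 4.0s added to the job
parallel_penalty = max(size / rate_GBs for size in files_GB)  # 2.0s added to the job
print(serial_penalty, parallel_penalty)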

- Put all disk at the Tier-1
  - This increases the total job time to 99.3Ms
- Add realistic transfer times
- Reallocate some disk space to CPU
- Increase the load on the system until it is full