Condor Grid Infrastructure for Caltech CMS Production on Alliance Resources

Vladimir Litvin, Harvey Newman (Caltech CMS)
Scott Koranda, Bruce Loftis, John Towns (NCSA)
Miron Livny, Peter Couvares, Todd Tannenbaum, Jaime Frey (Wisconsin)

CMS Physics

– The CMS detector at the LHC will probe fundamental forces in our Universe and search for the yet-undetected Higgs boson
– Detector expected to come online 2006

CMS Physics

ENORMOUS Data Challenges

– One sec of CMS running will equal a data volume equivalent to 10,000 Encyclopaedia Britannicas
– Data rate handled by the CMS event builder (~500 Gbit/s) will be equivalent to the amount of data currently exchanged by the world's telecom networks
– Number of processors in the CMS event filter will equal the number of workstations at CERN today (~4000)

Leveraging Alliance Grid Resources

– The Caltech CMS group is using Alliance Grid resources today for detector simulation and data processing prototyping
– Even during this simulation and prototyping phase the computational and data challenges are substantial

Challenges of a CMS Run

CMS run naturally divided into two phases:
– Monte Carlo detector response simulation
   – 100's of jobs per run, each generating ~1 GB
   – all data passed to next phase and archived
– reconstruct physics from simulated data
   – 100's of jobs per run
   – jobs coupled via Objectivity database access
   – ~100 GB data archived

Specific challenges:
– each run generates ~100 GB of data to be moved and archived
– many, many runs necessary
– simulation & reconstruction jobs at different sites
– large human effort starting & monitoring jobs, moving data

Meeting the Challenge With Globus and Condor

Globus:
– middleware deployed across entire Alliance Grid
– remote access to computational resources
– dependable, robust, automated data transfer

Condor:
– strong fault tolerance including checkpointing and migration
– job scheduling across multiple resources
– layered over Globus as "personal batch system" for the Grid

CMS Run on the Alliance Grid

– Caltech CMS staff prepares input files on local workstation
– Pushes "one button" to launch master Condor job
– Input files transferred by master Condor job to Wisconsin Condor pool (~700 CPUs) using Globus GASS file transfer

[Diagram: Caltech workstation → master Condor job running at Caltech → input files via Globus GASS → WI Condor pool]

CMS Run on the Alliance Grid

– Master Condor job at Caltech launches secondary Condor job on Wisconsin pool
– Secondary Condor job launches 100 Monte Carlo jobs on Wisconsin pool
   – each runs 12-24 hours
   – each generates ~1 GB data
   – Condor handles checkpointing & migration
   – no staff intervention

[Diagram: master Condor job running at Caltech → secondary Condor job on WI pool → 100 Monte Carlo jobs on Wisconsin Condor pool]
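
The individual Monte Carlo submit files are not shown in the deck; the following is a minimal sketch of what one of them might look like, assuming a standard-universe (i.e. checkpointable) executable. The executable and file names are illustrative assumptions, not taken from the slides.

   # Hedged sketch of a per-job submit description, roughly what one of the
   # hg_90_sim_NNN.cdr DAG node files (see the DAGMan slide below) might hold.
   # The standard universe is what provides the checkpointing and migration
   # mentioned above; the binary would have to be relinked with condor_compile.
   universe   = standard
   executable = cmsim_hg_90
   input      = hg_90_sim_632.in
   output     = hg_90_sim_632.out
   error      = hg_90_sim_632.err
   log        = hg_90_sim_632.log
   queue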

CMS Run on the Alliance Grid

– When each Monte Carlo job completes, data automatically transferred to UniTree at NCSA
   – each file ~1 GB
   – transferred using Globus-enabled FTP client "gsiftp"
   – NCSA UniTree runs Globus-enabled FTP server
   – authentication to FTP server on user's behalf using digital certificate

[Diagram: 100 Monte Carlo jobs on Wisconsin Condor pool → 100 data files transferred via gsiftp, ~1 GB each → NCSA UniTree with Globus-enabled FTP server]
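
The slides name the mechanism (GSI-authenticated FTP into UniTree) but not the commands. Below is a hedged sketch of the kind of transfer a post-job script could perform with the standard Globus tools; the host name and paths are invented for illustration.

   # Obtain a short-lived proxy from the user's digital certificate,
   # then push one ~1 GB output file to the Globus-enabled FTP front end.
   grid-proxy-init
   globus-url-copy \
       file:///scratch/cms/hg_90_sim_632.fz \
       gsiftp://unitree.ncsa.uiuc.edu/cms/prod2001/hg_90_sim_632.fz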

CMS Run on the Alliance Grid

– When all Monte Carlo jobs complete, secondary Condor reports to master Condor at Caltech
– Master Condor at Caltech launches job to stage data from NCSA UniTree to NCSA Linux cluster
   – job launched via Globus jobmanager on cluster
   – data transferred using Globus-enabled FTP
   – authentication on user's behalf using digital certificate

[Diagram: secondary Condor job on WI pool reports complete to master Condor job running at Caltech; master starts job via Globus jobmanager on NCSA Linux cluster; gsiftp fetches data from UniTree]
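
The staging step is again a Condor-G job in the globus universe, this time pointed at the NCSA cluster's jobmanager. A rough sketch of such a submit description follows; the gatekeeper contact string, jobmanager type, and script name are assumptions.

   # Hypothetical submit file for the UniTree-to-cluster staging job.
   # The globusscheduler contact and the staging script name are invented.
   universe        = globus
   globusscheduler = linux-cluster.ncsa.uiuc.edu/jobmanager-pbs
   executable      = CMS/stage_from_unitree.csh
   output          = CMS/stage.out
   error           = CMS/stage.err
   log             = CMS/condor.log
   notification    = always
   queue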

CMS Run on the Alliance Grid

– Master Condor at Caltech launches physics reconstruction jobs on NCSA Linux cluster
   – jobs launched via Globus jobmanager on cluster
   – master Condor continually monitors jobs and logs progress locally at Caltech
   – no user intervention required
   – authentication on user's behalf using digital certificate

[Diagram: master Condor job running at Caltech starts reconstruction jobs via Globus jobmanager on NCSA Linux cluster]
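
Monitoring on the Caltech side would typically use the ordinary Condor tools; a usage sketch (the log file name is the one from the submit file on the "Condor Details" slide):

   # Check the state of the Condor-G jobs in the local queue...
   condor_q
   # ...and follow the user log the master job keeps at Caltech.
   tail -f CMS/condor.log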

CMS Run on the Alliance Grid

– When reconstruction jobs complete, data automatically archived to NCSA UniTree
   – data transferred using Globus-enabled FTP
– After data transferred, run is complete and master Condor at Caltech e-mails notification to staff

[Diagram: data files transferred via gsiftp from NCSA Linux cluster to UniTree for archiving]

Production Data

– 7 signal data sets (… events each) have been simulated and reconstructed without pileup
– Large QCD background data set (1M events) has been simulated through this system
– Data has been stored in both NCSA UniTree and Caltech HPSS

Condor Details for Experts

Use Condor-G:
– Condor + Globus
– allows Condor to submit jobs to a remote host via a Globus jobmanager
– any Globus-enabled host reachable (with authorization)
– Condor jobs run in the "Globus" universe
– use familiar Condor classads for submitting jobs

   universe        = globus
   globusscheduler = beak.cs.wisc.edu/jobmanager-condor-INTEL-LINUX
   environment     = CONDOR_UNIVERSE=scheduler
   executable      = CMS/condor_dagman_run
   arguments       = -f -t -l . -Lockfile cms.lock -Condorlog cms.log -Dag cms.dag -Rescue cms.rescue
   input           = CMS/hg_90.tar.gz
   remote_initialdir = Prod2001
   output          = CMS/hg_90.out
   error           = CMS/hg_90.err
   log             = CMS/condor.log
   notification    = always
   queue
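
Reading this submit file (an interpretation, not stated on the slide): the arguments line is condor_dagman's own command line, so the remote "secondary Condor job" is DAGMan itself, and CONDOR_UNIVERSE=scheduler in the environment presumably tells it to behave as it would in the local scheduler universe. The master job would be started from the Caltech workstation in the usual way; the submit-file name below is assumed.

   # Submit the master Condor-G job and watch it in the local queue.
   condor_submit cms_master.sub
   condor_q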

Condor Details for Experts

Exploit Condor DAGMan:
– DAG = directed acyclic graph
– submission of Condor jobs based on dependencies
– job B runs only after job A completes, job D runs only after job C completes, job E only after A, B, C & D complete, ...
– includes both pre- and post-job script execution for data staging, cleanup, or the like

   Job jobA_632 Prod2000/hg_90_gen_632.cdr
   Job jobB_632 Prod2000/hg_90_sim_632.cdr
   Script pre jobA_632 Prod2000/pre_632.csh
   Script post jobB_632 Prod2000/post_632.csh
   PARENT jobA_632 CHILD jobB_632

   Job jobA_633 Prod2000/hg_90_gen_633.cdr
   Job jobB_633 Prod2000/hg_90_sim_633.cdr
   Script pre jobA_633 Prod2000/pre_633.csh
   Script post jobB_633 Prod2000/post_633.csh
   PARENT jobA_633 CHILD jobB_633
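
A DAG like the fragment above is normally driven by handing the .dag file to DAGMan; when run by hand rather than through the Condor-G wrapper on the previous slide, the usual command is condor_submit_dag (file name taken from that slide):

   # Generates a DAGMan submit description and submits it to the local schedd
   condor_submit_dag cms.dag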

Future Directions

Include Alliance LosLobos Linux cluster at AHPCC in two ways:
– add path so that physics reconstruction jobs may run on the LosLobos cluster at AHPCC in addition to the NCSA cluster
– allow Monte Carlo jobs at Wisconsin to "glide in" to LosLobos
Add pileup datasets

[Diagram: master Condor job running at Caltech → secondary Condor job on WI pool → 75 Monte Carlo jobs on Wisconsin Condor pool + 25 Monte Carlo jobs on LosLobos via Condor glide-in]
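
Glide-in means starting Condor daemons on LosLobos nodes (via the Globus gatekeeper) so that they temporarily join the Wisconsin pool and can run its jobs. Condor shipped a condor_glidein helper for this; the options and the gatekeeper contact string below are assumptions, shown only as a rough sketch.

   # Heavily hedged sketch: add 25 glide-in slots on LosLobos so that jobs
   # scheduled from the Wisconsin pool can run there.
   condor_glidein -count 25 loslobos.ahpcc.unm.edu/jobmanager-pbs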