The life of an OSG job: a peek behind the scenes (OSG Summer School 2011), by Igor Sfiligoi


OSG Summer School 2011 (UCSD)
A peek behind the scenes: The life of an OSG job
by Igor Sfiligoi, University of California San Diego

Summary of past lessons
● HTC is maximizing CPU use over long periods
  ● And getting lots of computation done
● DHTC is HTC over many sites
  ● Using an overlay system makes life easier
● Each compute site is independent
  ● Local authentication and authorization
  ● Local HTC systems
● Cloud is similar to Grid in many ways

The Open Science Grid (1)
● OSG is an umbrella organization
  ● Does not write software
  ● Does not own compute resources
● OSG negotiates between affected parties and sets standards (partial list):
  ● X.509+VOMS for authentication and authorization
  ● Globus GT2 for Grid CE technology
  ● A standard software distribution

The Open Science Grid (2)
● OSG runs common services (partial list)
  ● Troubleshooting
  ● Information services
  ● Accounting services
  ● A glidein factory

Running on OSG
● The easiest way is to join an existing OSG VO and use the instructions they have
  ● Most VOs are quite mature at this point, with good procedures in place
● The 2nd easiest thing is to install a Condor overlay pool and hook it up to the OSG glidein factory
  ● Factory admins will do a large fraction of the Grid-related tasks for you
● But there is also the direct submission path
  ● You may want to do something new

Are we done for the day?
"I just want to do my science. I will take the easy route and never submit directly to OSG."
"But then it is up to me to fix your screw ups!"
"Good. This is the spirit."
"You still should learn."

The life of an OSG job
Using OSG directly
Because knowing the details helps you make better decisions

OSG job basics
● Always use Condor-G
  ● Direct use of Globus client tools is not scalable
● Know how to discover which sites support your VO
  ● Not everybody will let you in
● Know what to do when things go wrong
  ● Within a large (D)HTC system, something will occasionally go wrong!

Using an OSG CE
● We have seen how to use Condor-G this morning
gt2 itbv-ce-pbs.uchicago.edu/jobmanager-pbs
  ● gt2: Technology
  ● itbv-ce-pbs.uchicago.edu: Hardware IP address
  ● jobmanager-pbs: Local HTC
There are a few more knobs, but we can ignore them
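
To make the three parts of the resource string concrete, here is a minimal Python sketch that assembles a Condor-G submit description from them. The function name `condor_g_submit`, the executable `my_job.sh`, and the output file names are illustrative assumptions, not part of the original slides; only the `gt2 itbv-ce-pbs.uchicago.edu/jobmanager-pbs` string comes from the talk.

```python
def condor_g_submit(executable, ce_host, jobmanager, technology="gt2"):
    """Return the text of a Condor-G submit description for a single job.

    technology: Grid CE technology (gt2 in the slide)
    ce_host:    host name of the CE
    jobmanager: jobmanager that feeds the site's local HTC system
    """
    grid_resource = f"{technology} {ce_host}/{jobmanager}"
    lines = [
        "universe = grid",
        f"grid_resource = {grid_resource}",
        f"executable = {executable}",
        "output = job.out",   # hypothetical file names
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # my_job.sh is a placeholder executable name
    print(condor_g_submit("my_job.sh", "itbv-ce-pbs.uchicago.edu", "jobmanager-pbs"))
```

The printed text could be saved to a file and handed to `condor_submit`; the "few more knobs" mentioned on the slide are left out on purpose.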

The jobmanager
● Globus by default uses jobmanager-fork
  ● Jobs run directly on the CE node
● Users must explicitly specify the proper jobmanager to get into the HTC system
● jobmanager-fork is useful just for basic testing
  ● e.g. to check whether my proxy is authorized
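
A small guard against landing on the CE node by accident can be sketched as follows. This assumes the contact string has the `host/jobmanager-name` shape shown earlier; the helper names `split_contact` and `runs_on_ce_node` are hypothetical.

```python
DEFAULT_JOBMANAGER = "jobmanager-fork"  # what Globus uses when none is given


def split_contact(contact):
    """Split 'host/jobmanager-x' into (host, jobmanager).

    A contact string without a jobmanager part falls back to the default.
    """
    host, _, jobmanager = contact.partition("/")
    if not jobmanager:
        jobmanager = DEFAULT_JOBMANAGER
    return host, jobmanager


def runs_on_ce_node(contact):
    """True if the job would run directly on the CE node (fork jobmanager)."""
    return split_contact(contact)[1] == DEFAULT_JOBMANAGER


if __name__ == "__main__":
    for contact in ("itbv-ce-pbs.uchicago.edu",
                    "itbv-ce-pbs.uchicago.edu/jobmanager-pbs"):
        print(contact, "-> runs on CE node:", runs_on_ce_node(contact))
```

Running such a check before submitting real work keeps production jobs out of the fork jobmanager, which the slide reserves for basic testing.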

Finding sites to use
● You should start with a couple of friendly sites
  ● They will tell you how to talk to them
  ● They will help you debug the initial problems
● But when you want to go bigger, you need an information system that will tell you what is out there
  ● OSG provides a BDII information system

OSG BDII
● LDAP based
● The data is structured using the GLUE schema
● It will tell you which sites (claim to) support you, plus
  ● CE URL
  ● Site description
More in the hands-on session.
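
As a rough illustration of what such a lookup involves, the Python sketch below builds an LDAP filter for "CEs that claim to support my VO" and pulls the CE contact strings out of LDIF output. The attribute names (`GlueCEAccessControlBaseRule`, `GlueCEInfoContactString`) and the `GlueCE` object class are what I understand the GLUE 1.x LDAP rendering to use; treat them, the `ldapsearch` invocation, and all helper names as assumptions to verify against your BDII. The BDII URL and base DN are deliberately left as parameters.

```python
def bdii_filter(vo):
    """LDAP filter for CEs whose access rules mention the given VO."""
    return f"(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:{vo}))"


def ldapsearch_command(vo, bdii_url, base_dn):
    """Argument list for an anonymous OpenLDAP ldapsearch query."""
    return ["ldapsearch", "-x", "-LLL", "-H", bdii_url, "-b", base_dn,
            bdii_filter(vo), "GlueCEInfoContactString"]


def contact_strings(ldif_text):
    """Extract CE contact strings from LDIF text.

    Simplification: wrapped (continued) lines and base64-encoded
    values ('attr:: ...') are ignored.
    """
    prefix = "GlueCEInfoContactString:"
    found = []
    for line in ldif_text.splitlines():
        if line.startswith(prefix) and not line.startswith(prefix + ":"):
            found.append(line[len(prefix):].strip())
    return found


if __name__ == "__main__":
    # Hypothetical LDIF fragment, shaped like a BDII answer
    sample = (
        "dn: GlueCEUniqueID=example,mds-vo-name=local,o=grid\n"
        "GlueCEInfoContactString: itbv-ce-pbs.uchicago.edu/jobmanager-pbs\n"
    )
    print(contact_strings(sample))
```

Remember the "(claim to)" on the slide: an entry in the BDII says what the site advertises, not that your job will actually be accepted.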

Job errors
● Many possible error sources (partial list)
  ● Authentication/authorization
  ● Jobs never start
  ● Jobs fail without any output coming back
  ● Wrong OS
  ● Missing libraries (or other files)

Auth errors
● Possible causes:
  ● Site is not interested in supporting you
  ● Misconfigured site
  ● Expired proxy
● Difficult to debug
  ● First rule of security is to give the attacker as little info as possible
  ● Even if the “attacker” is a legitimate user!
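
Of the causes above, the expired proxy is the one you can rule out yourself before submitting. The Python sketch below assumes you have obtained the remaining proxy lifetime in seconds as text (for example from a proxy-info tool with a time-left option; which tool and flag is an assumption to check on your system). The function names and the one-hour safety margin are hypothetical.

```python
def parse_timeleft(text):
    """Turn a 'seconds remaining' string into a non-negative integer."""
    return max(int(text.strip()), 0)


def proxy_covers_job(timeleft_s, expected_runtime_s, safety_margin_s=3600):
    """True if the proxy should outlive the job, queue time not included."""
    return timeleft_s >= expected_runtime_s + safety_margin_s


if __name__ == "__main__":
    timeleft = parse_timeleft("43200\n")      # 12 hours left
    print(proxy_covers_job(timeleft, 8 * 3600))  # 8-hour job
```

The margin only covers run time; time spent waiting in a site's queue has to be budgeted separately, since it is hard to predict.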

Jobs never start
● Could be a legitimate situation
  ● Other users just have higher priority than you!
  ● Not completely unusual when you are an opportunistic user
● But can be a site problem
  ● Misconfiguration
  ● CE “forgets” about your job
Difficult to tell

Jobs failing without output
● Black hole effect
  ● Typically a broken worker node (HW problems, misconfiguration, etc.)
  ● Can “eat” hundreds of jobs before being detected (and it may not be easy to detect!)
● Pilot paradigm helps here
  ● Little damage if pilots are “eaten”
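
One simple heuristic for spotting a black hole is to count, per worker node, the jobs that failed almost immediately. The Python sketch below assumes you keep records of `(node, runtime in seconds, succeeded)` for finished jobs; the record layout, the thresholds, and the node names are all hypothetical.

```python
from collections import Counter


def suspected_black_holes(records, max_runtime_s=60, min_failures=5):
    """Return the nodes with at least min_failures quick failures.

    records: iterable of (node, runtime_s, succeeded) tuples
    """
    quick_failures = Counter(
        node
        for node, runtime_s, succeeded in records
        if not succeeded and runtime_s <= max_runtime_s
    )
    return sorted(node for node, count in quick_failures.items()
                  if count >= min_failures)


if __name__ == "__main__":
    records = [("wn01", 3, False)] * 6 + [("wn02", 4000, True)] * 6
    print(suspected_black_holes(records))
```

A node flagged this way is only a suspect; a burst of genuinely bad jobs looks the same, so the site admin still has to confirm.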

Wrong OS
● You may compile for Red Hat Linux 5, but land on Ubuntu (or even Windows)
● Could be your fault
  ● Site clearly advertised it was a Windows site
● But could be a site problem
  ● Mistakenly re-installed a worker node from the wrong CD
● Pilot paradigm again can help
  ● Pilot setup not site controlled
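
A job wrapper (or a pilot) can check the platform before starting the real payload, so a mismatch fails fast with a clear message. This Python sketch uses only the standard `platform` module; the function name and the optional override arguments (there so the check can be exercised without changing machines) are illustrative assumptions. It checks the OS family and CPU architecture, not the specific distribution.

```python
import platform


def platform_matches(expected_system="Linux", expected_machine=None,
                     system=None, machine=None):
    """True if this node runs the expected OS family and architecture.

    system/machine default to what the current node reports.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if system != expected_system:
        return False
    return expected_machine is None or machine == expected_machine


if __name__ == "__main__":
    if not platform_matches("Linux"):
        print("Wrong OS:", platform.system(), platform.machine())
```

Telling Red Hat apart from Ubuntu needs a distribution-level check on top of this, which is left out here.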

Missing libs/files
● Sites don't advertise what files you will find on a worker node
  ● At best you can make a good guess
● Particularly problematic the first time you use a site
  ● Or if you have not used it for a while
● But there are also the broken/misconfigured nodes
● Once again, pilots can help
  ● Discover and publish what files are available
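
The "discover and publish" idea can be sketched as a probe that runs on the worker node before the payload and reports which required files and commands are present. The Python below uses only the standard library; the function names and the example path and command are hypothetical, and how the report gets published is left open.

```python
import os
import shutil


def probe_node(required_paths, required_commands):
    """Map each required path or command to True if found on this node."""
    report = {}
    for path in required_paths:
        report[path] = os.path.exists(path)
    for command in required_commands:
        report[command] = shutil.which(command) is not None
    return report


def missing(report):
    """Names from the report that were not found."""
    return sorted(name for name, found in report.items() if not found)


if __name__ == "__main__":
    # Placeholder requirements; replace with what your job really needs
    report = probe_node(["/usr/lib"], ["python3"])
    print("missing:", missing(report))
```

A job that finds something missing can exit early with the list of missing items, which is far easier to debug than a payload crash.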

Troubleshooting
● Most of the time, you cannot fix the problem yourself
  ● Some help from the site admin will be needed
● Too many sites to know all admins
● Use the GOC (Grid Operations Center)
  ● They will route your request to the right people

Get your hands dirty
● This is all the theory I want you to know
● Exercise time
  ● Feel free to ask questions

Copyright statement
● This presentation contains images copyrighted by ToonClipart.com
● These images have been licensed to Igor Sfiligoi for use in his presentations
● Any other use of them is prohibited