Presentation is loading. Please wait.

Presentation is loading. Please wait.

Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi.

Similar presentations


Presentation on theme: "Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi."— Presentation transcript:

1 Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi

2 Arlington, Dec 7th 2006 Glidein Based WMS 2 Why use pilots on the Grid? ● Grid based on a loose aggregation of resources ● Users have no say on how the resources are managed ● The probability of at least one resource being broken at any given time is very high Pilots can verify the resource before pulling a user job ● Difficult to have complete and reliable information about the resources ● Managing priorities inside the resource pools difficult if not impossible Pilots can do full matchmaking based on complete info

3 Arlington, Dec 7th 2006 Glidein Based WMS 3 Why using Condor for pilots? ● Condor is a very widely used system – Mature and well known ● Has an active development team –... that listen to their users – A new condor release every few months ● Condor architecture distributed by design – Started as a project to gather unused CPU cycles – A perfect fit for the Grid world

4 Arlington, Dec 7th 2006 Glidein Based WMS 4 What are the “glideins”? ● Glideins are just properly configured, regular Condor daemons submitted as batch jobs ● From the user point of view, they look like regular Condor worker nodes Grid site Collector Schedd User jobs Gatekeeper Running Grid jobs Dedicated node Startd User job Startd User job StartdStartd StartdStartd 1a 2a 3a 4a 1b 2b 3b 4b 5b 6b Negotiator

5 Arlington, Dec 7th 2006 Glidein Based WMS 5 How do we create a WMS out of it? ● Glideins should be submitted on demand – The WMS must know about the characteristics of the user jobs ● Different Grid sites may need different configurations – The WMS must know the details of the Grid sites

6 Arlington, Dec 7th 2006 Glidein Based WMS 6 Dividi et impera ● Split the task between different processes Collector VO frontend Request glideins VO frontend Glidein Factory Grid site Glidein Factory Grid site Submit glideins Submit glideins Publish available glidein entries Publish available glidein entries Request glideins VO frontend Publish available glidein entries Request glideins Job Schedd Grid site Submit glideins Glidein Factory VO Frontend knows about user jobs Glidein Factory knows about Grid site(s) Job Schedd

7 Arlington, Dec 7th 2006 Glidein Based WMS 7 Dividi et impera (2) ● VO frontend – User/VO specific – A reasonably generic implementation provided for simple use cases ● Customizable – Looks only for a known subset of jobs – Can process custom attributes ● Glidein Factory – Completely generic – The default implementation should suit most tasks – The same factory could in principle serve several VOs

8 Arlington, Dec 7th 2006 Glidein Based WMS 8 Comunication between processes ● Using standard Condor Class-Ads – The Collector defines the WMS – This Collector is usually different than the glidein Collector Collector VO frontend Publish available glidein entries Request glideins Job Schedd Grid site Submit glideins Glidein Factory 1a 2 3 1b Startd Collector 4 6 5 Negotiator

9 Arlington, Dec 7th 2006 Glidein Based WMS 9 Glidein factory ClassAd MyType=”glidefactory” Name=”entry@factory” FactoryName=”factory” GlideinName=”entry” Attribute1=”...”... AttributeN=”...” GlideinParamXXX=”val1”... GlideinParamYYY=”valN” Published classad Attributes describe the glidein like: ARCH=”INTEL”, MaxHours=72, Site=”Florida” Due to Condor limitations, define also GlideinMyType Parameters set glidein parameter defaults like: CONDOR_HOST=”UNDEFINED”,SEC_DEFAULT_ENCRYPTION=OPTIONAL MinDisk=16G, CheckFilesExist=”/tmp/CMS,$DATA/OSG”

10 Arlington, Dec 7th 2006 Glidein Based WMS 10 Frontend ClassAd MyType=”glideclient” Name=”reqX@client” ClientName=”client” ReqName=”reqX” ReqGlidein=”entry@factory” ReqIdleGlideins=nr ReqMaxRun=nr ReqMaxSubmitXHour=nr GlideinParamWWW=”val1”... GlideinParamZZZ=”valY” Published classad GlideParamXXX must match the names published by the factory Due to Condor limitations, define also GlideinMyType Trying to keep a steady stream of glideins starting Limits on the amount of glideins submitted envisioned but not yet implemented

11 Arlington, Dec 7th 2006 Glidein Based WMS 11 Implementation details

12 Arlington, Dec 7th 2006 Glidein Based WMS 12 Frontend logic Query Schedd(s) Query WMS Collector Match and count ● Essentially a Matchmaker ● Will match user job attributes to factory attributes ● A scaled back count is then published as a request Jobs Attributes Factories Attributes Nr jobs x Factory Publish requests

13 Arlington, Dec 7th 2006 Glidein Based WMS 13 Frontend in the big picture Schedd User jobs VO Collector VO Frontend WMS Collector Frontend requests Factory Ads Grid site Startd User job Glidein Factory Grid site 2b 2a 3 6 7 8 9 4 1 5

14 Arlington, Dec 7th 2006 Glidein Based WMS 14 Glidein Factory Logic ● Essentially a publish-read-submit loop – Blindly execute orders – Security implemented at Collector level Grid site Collector Gatekeeper Non-glide Grid jobs Startd StartdStartd 6 1 Frontend requests Factory ads Glidein Factory Schedd Startd jobs 2 4 Grid site Gatekeepe r StartdStartd Grid site Gatekeepe r StartdStartd 3 5 7

15 Arlington, Dec 7th 2006 Glidein Based WMS 15 Glidein Factory Internals ● Made of multiple Entry Points – Each Entry Point an independent process – Implements the publish-read-submit loop ● Entry Points do the real job – The Glidein Factory is just a logical object – Management tools that act on the whole factory (i.e. several entry points at a time) envisioned, but not yet implemented

16 Arlington, Dec 7th 2006 Glidein Based WMS 16 Entry points at a glance Factory Daemon Schedd Startd jobs /mydir/glidein_e1 job.cond or glidein_s tartup.sh params.c fg condor_submit Grid site Gatekeepe r StartdStartd Grid site Gatekeepe r StartdStartd Grid site Gatekeepe r StartdStartd /mydir/glidein_e2 job.cond or glidein_s tartup.sh params.c fg /mydir/glidein_e3 job.cond or glidein_s tartup.sh params.c fg Factory Daemon Frontend requests Collector Entry Point Ad Frontend requests Entry Point Ad Frontend requests Entry Point Ad The status of the Entry Point is held on disk One Factory Daemon per Entry Point

17 Arlington, Dec 7th 2006 Glidein Based WMS 17 Glidein implementation ● Dummy startup script – Just loads other files and execute the ones marked as executable ● File transfer implemented using HTTP – Easy cacheable, standard tools available (Squid) – Proven to scale, widely used in Industry ● All sensitive file transfers signed (sha1) – Prevent tampering, as HTTP travels in clear

18 Arlington, Dec 7th 2006 Glidein Based WMS 18 Glidein implementation (2) ● Standard sanity checks provided – Disk space constraints – Node blacklisting ● Generic Condor configure and startup script provided, too ● Factory admins can easily add their own customization scripts (both for checks and configs) – Allowing Frontends to add custom scripts envisioned, but not yet implemented

19 Arlington, Dec 7th 2006 Glidein Based WMS 19 Glidein at a glance /mydir/glidein_e1 job.cond or glidein_s tartup.sh params.c fg condor_submit /var/www/html/glidein_e1 condor_b in.tgz user_setu p.sh condor_s tartup.sh condor_c onfig nodes_bl acklist files_list. lst signatur e.sha1 Schedd Glidein jobs Grid site Gatekeeper StartdStartd Worker Node Web Server Web Cache glidein_startup load file list execute scripts Erro rs? Startd Yes No Check the running environment, configure dynamic options or fail if problems found This batch slot would not be able to run a user job load files Only the startup script transferred by Globus. All files are signed to prevent tampering. Factory Daemon

20 Arlington, Dec 7th 2006 Glidein Based WMS 20 Advanced topics

21 Arlington, Dec 7th 2006 Glidein Based WMS 21 Security ● Traffic between the Collector and Schedd, and the Grid sites cannot be trusted – WAN is insecure by definition ● Strong authentication needs to be used The startd has access to the service X509 proxy, so that is used for authentication Supported natively by Condor

22 Arlington, Dec 7th 2006 Glidein Based WMS 22 Grid site Private networks and firewalls ● Condor developed with LAN in mind – All daemons require bi-directional networking ● Default mode fails with firewalls and private networks – Incoming traffic on WNs almost never allowed Schedd User jobs VO Collector Startd User job 1 2 3 9

23 Arlington, Dec 7th 2006 Glidein Based WMS 23 Using Condor GCB ● Adding Condor GCB to the mix solves the problem – Startd keeps a permanent TCP connection with GCB – Only outgoing connectivity used from WN ● Adds complexity to the system, but is needed – One GCB server can handle ~1k glideins Grid site Schedd User jobs VO Collector Startd User job 2 3 4 9 GCB 1 5

24 Arlington, Dec 7th 2006 Glidein Based WMS 24 Integration with Grid-local security infrastructure ● Default pilot operation mode breaks Grid security rules – The real user job is pulled in without local consent – Grid site knows only about the glideins ● The default mode also a security risk for the VO – User jobs run under the same UID as the condor daemons Users have access to the service X509 proxy Gatekeeper User queue startd User Job GUMS Glidein UID Glidein X509 Glide in User Job User X509 Glidein UID Worker node Grid site Glidein X509

25 Arlington, Dec 7th 2006 Glidein Based WMS 25 Condor integration with gLExec ● gLExec started to be deployed on OSG WNs – Will change UID based on X509 proxy ● Condor natively supports it as of v6.9.0 Glidein factory have the necessary config script ● Solves both problems at once Gatekeeper User queue Pilot User Job gLExec GUMS User UID Pilot User Job User DN User UID Worker node Grid site Glidein X509 User X509 Glidein UID Glidein X509 Glidein UID

26 Arlington, Dec 7th 2006 Glidein Based WMS 26 Future work ● Need to implement several pieces that have been mentioned before – Waiting for some real users to bump up priorities – Current set satisfies the needs of CMS test harness ● With real users will certainly come new requests – I do not dare forecast what they might be ● As a wish, I would not mind if Condor natively implemented most of the functionality

27 Arlington, Dec 7th 2006 Glidein Based WMS 27 An example of possible future work ● Hierarchies of glidein factories Collector VO frontend Publish available glidein entries Request glideins Job Schedd Grid site Glidein Factory 1a 3 4 2b Startd Collector 6 8 7 Negotiator Collector Submit glideins Glidein Factory 5a Glidein Factory Grid site Submit glideins Request glideins Publish available glidein entries Publish available glidein entries 2a 1b 5b


Download ppt "Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi."

Similar presentations


Ads by Google