
Condor on WAN D. Bortolotti - INFN Bologna T. Ferrari - INFN Cnaf A.Ghiselli - INFN Cnaf P.Mazzanti - INFN Bologna F. Prelz - INFN Milano F.Semeria - INFN.


1 Condor on WAN
D. Bortolotti - INFN Bologna
T. Ferrari - INFN CNAF
A. Ghiselli - INFN CNAF
P. Mazzanti - INFN Bologna
F. Prelz - INFN Milano
F. Semeria - INFN Bologna
M. Sgaravatto - INFN Padova
C. Vistoli - INFN CNAF
CHEP 2000, Padova, 7-11 February 2000

2 Introduction
Massimo Sgaravatto, INFN Padova
A High Throughput Computing (HTC) system is needed
– to meet the requirements of INFN users
– to exploit the large CPU capacity distributed across all INFN sites under distributed ownership
– Candidate: Condor
Condor's philosophy and characteristics meet INFN requirements
Condor on WAN project: a collaboration with the Condor Team, University of Wisconsin - Madison

3 Condor
Supporting High Throughput Computing in large, distributively owned environments
– Harnesses the power of non-dedicated resources
Distinguishing features:
– Checkpointing
– Remote I/O
– ClassAds

4 Test phase
Implementation of an experimental Condor WAN pool
Objectives:
– Verify reliability and robustness on WAN
– Verify suitability to INFN requirements

5 Test phase: results
Very good performance for CPU-intensive jobs
Less efficient CPU usage for I/O-intensive jobs
→ Uniform file system
→ Caching, dedicated file systems, ...
Necessary to:
– be able to guarantee priorities on resource usage for specific applications
– guarantee the overall efficiency of the system → adequate placement of checkpoint servers

6 Implementation phase
Characteristics of the INFN Condor pool:
Single pool
– To optimize CPU usage of all INFN hosts
Sub-pools
– To define policies/priorities on resource usage
Checkpoint domains
– To guarantee the performance and the efficiency of the system
– To reduce network traffic for checkpointing activity

7 Sub-pool
Collaboration machines (i.e. workstations belonging to the same research group) configured to prioritize the jobs of collaboration users
– Local to a single INFN site
– Distributed between different sites
Possibility to define different policies on resource usage
– Example:
    High priority: Condor jobs of a specific research group
    Middle priority: Condor jobs of local (same site) users
    Low priority: Condor jobs of remote users
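
The three-level example above can be sketched in Python as a ranking function evaluated from the machine's point of view. This is only an illustration of the ordering, not the configuration used in the INFN pool: the job records, the group and site labels (`groupA`, `siteX`, ...) and the numeric levels are all hypothetical.

```python
# Illustrative only: job records, group/site labels and levels are hypothetical.
HIGH, MIDDLE, LOW = 2, 1, 0


def machine_rank(job, owner_group, machine_site):
    """Rank a job as seen by a sub-pool machine (larger is preferred)."""
    if job["group"] == owner_group:
        return HIGH    # Condor jobs of the owning research group
    if job["site"] == machine_site:
        return MIDDLE  # Condor jobs of local (same site) users
    return LOW         # Condor jobs of remote users


jobs = [
    {"id": "remote", "group": "groupB", "site": "siteY"},
    {"id": "local", "group": "groupB", "site": "siteX"},
    {"id": "collab", "group": "groupA", "site": "siteY"},
]

# A machine owned by groupA and located at siteX orders the waiting jobs.
ordered = sorted(
    jobs,
    key=lambda job: machine_rank(job, "groupA", "siteX"),
    reverse=True,
)
print([job["id"] for job in ordered])
```

Note that a collaboration job is preferred even when it is submitted from another site, which matches the case of a sub-pool distributed between different sites.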

8 WAN checkpoint needs
Checkpoint completed in a short time (even for “huge” checkpoint files)
– to let the owner access the machine without delay
– to increase the probability of a successful checkpoint (without losing the ckpt file because of a network timeout)
Limit and control network traffic
Checkpoint policies must not reduce job computing throughput
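
To see why checkpoint time matters on a WAN, a rough estimate of the transfer time is the file size divided by the available bandwidth. The sketch below assumes an idealized link (no protocol overhead, no competing traffic) and counts 1 MB as 8 Mbit; the file size, bandwidths and deadline are made-up values, not measurements from the INFN pool.

```python
def transfer_seconds(size_mbytes, bandwidth_mbps):
    """Idealized time to ship a checkpoint file over a link."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mbytes * 8 / bandwidth_mbps


def fits_deadline(size_mbytes, bandwidth_mbps, deadline_s):
    """True if the checkpoint would finish before a (hypothetical) timeout."""
    return transfer_seconds(size_mbytes, bandwidth_mbps) <= deadline_s


# Hypothetical 100 MB checkpoint file on a slow and on a fast path.
slow = transfer_seconds(100, 2)
fast = transfer_seconds(100, 100)
print(slow, fast)
print(fits_deadline(100, 2, 60), fits_deadline(100, 100, 60))
```

Under these assumptions the same file takes 50 times longer on the slow path, which is the motivation for keeping the checkpoint server close, in network terms, to the executing machine.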

9 Checkpoint domains
Solution: checkpoint domains
– Pool partitioned into checkpoint domains (a dedicated ckpt server for each domain)
– A checkpoint domain is defined according to:
    Presence of a sufficiently large CPU capacity
    Presence of a set of machines with efficient network connectivity
    Sub-pools

10 INFN Condor Pool on WAN: checkpoint domains
[Figure: map of the GARR-B topology (155 Mbps, ATM based) showing network access points (PoP) and main transport nodes: TORINO, PADOVA, BARI, PALERMO, FIRENZE, PAVIA, MILANO, GENOVA, NAPOLI, CAGLIARI, TRIESTE, ROMA, ROMA2, PISA, L’AQUILA, CATANIA, BOLOGNA, UDINE, TRENTO, PERUGIA, LNF, LNGS, SASSARI, LECCE, LNS, LNL, SALERNO, COSENZA, S.Piero, FERRARA, PARMA, CNAF; external links labelled USA, 155Mbps, T3 and EsNet. The Central Manager and the default CKPT domain are at CNAF; each CKPT domain is marked with its number of hosts. Scale annotation: → 500-1000 machines; 6 ckpt servers → 25 ckpt servers.]

11 Network as resource
In a distributed environment the network is a resource
– Bandwidth between the executing machine and the checkpoint server is a ClassAd attribute, dynamically updated
The job selects a machine taking into account its checkpoint characteristics as well
The job checkpoint policy is defined in the job submit file

12 Job policies/1
Nearest policy
– The job prefers machines in the same checkpoint domain, always selecting the one with the highest bandwidth to the ckpt server.
    rank = (CkptServer =?= LastCkptServer) * 100 + CkptBW
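
As a rough illustration of how the nearest-policy rank orders candidate machines, the expression can be mimicked in Python. This is a sketch, not Condor's matchmaker: it assumes that the result of `=?=` counts as 1 or 0 in the arithmetic, and the machine names, server names and bandwidth values are hypothetical.

```python
def nearest_rank(machine, last_ckpt_server):
    """Mimic: rank = (CkptServer =?= LastCkptServer) * 100 + CkptBW."""
    same_server = machine["CkptServer"] == last_ckpt_server
    return int(same_server) * 100 + machine["CkptBW"]


# Hypothetical machine ads; CkptBW is the bandwidth to the machine's ckpt server.
machines = [
    {"Name": "m1", "CkptServer": "ckpt-a", "CkptBW": 10},
    {"Name": "m2", "CkptServer": "ckpt-b", "CkptBW": 40},
    {"Name": "m3", "CkptServer": "ckpt-a", "CkptBW": 4},
]

# The job last checkpointed on ckpt-a, so m1 and m3 get the +100 bonus.
best_after_ckpt = max(machines, key=lambda m: nearest_rank(m, "ckpt-a"))

# A job that has never checkpointed has no LastCkptServer: only CkptBW counts.
best_first_run = max(machines, key=lambda m: nearest_rank(m, None))

print(best_after_ckpt["Name"], best_first_run["Name"])
```

The bonus of 100 expresses a preference, not a constraint: a machine in another domain whose CkptBW exceeds a local machine's by more than 100 would still be ranked higher.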

13 Job policies/2
At least N-Mbps policy
– The job prefers machines in the same checkpoint domain, always selecting the one with the highest bandwidth to the ckpt server; the bandwidth must be greater than N.
    rank = (CkptServer =?= LastCkptServer) * 100 + CkptBW
    requirements = CkptBW > N
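
The difference from the nearest policy is that `requirements` removes machines from consideration before `rank` orders the rest. A minimal Python sketch of that two-step evaluation, with hypothetical machine ads and threshold values:

```python
def at_least_n(machines, last_ckpt_server, n_mbps):
    """Filter on 'CkptBW > N', then order by the nearest-policy rank."""
    eligible = [m for m in machines if m["CkptBW"] > n_mbps]  # requirements
    return sorted(
        eligible,
        key=lambda m: (m["CkptServer"] == last_ckpt_server) * 100 + m["CkptBW"],
        reverse=True,  # highest rank first
    )


# Hypothetical machine ads.
machines = [
    {"Name": "m1", "CkptServer": "ckpt-a", "CkptBW": 10},
    {"Name": "m2", "CkptServer": "ckpt-b", "CkptBW": 40},
    {"Name": "m3", "CkptServer": "ckpt-a", "CkptBW": 4},
    {"Name": "m4", "CkptServer": "ckpt-a", "CkptBW": 25},
]

order_n8 = [m["Name"] for m in at_least_n(machines, "ckpt-a", 8)]
order_n50 = [m["Name"] for m in at_least_n(machines, "ckpt-a", 50)]
print(order_n8, order_n50)
```

With N = 8 the slow machine m3 is excluded even though it is in the job's own domain; with N = 50 no machine qualifies and, in this sketch, the job would simply have nowhere to run.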

14 Job policies/3
Fixed policy
– The job only selects machines in the same checkpoint domain, always selecting the one with the highest bandwidth to the checkpoint server: a job can’t move between checkpoint domains (suitable for very large jobs).
    requirements = (CkptServer =?= LastCkptServer || LastCkptServer =?= UNDEFINED)
    rank = CkptBW
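
The role of the `UNDEFINED` clause is easiest to see by following a job from its first run. In the Python sketch below, `None` stands in for `UNDEFINED`; the machine ads and server names are hypothetical, and the code only imitates the two expressions rather than Condor's actual evaluation.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def fixed_requirements(machine, last_ckpt_server):
    """Mimic: CkptServer =?= LastCkptServer || LastCkptServer =?= UNDEFINED."""
    return (
        machine["CkptServer"] == last_ckpt_server
        or last_ckpt_server is UNDEFINED
    )


def pick_machine(machines, last_ckpt_server):
    """Among machines passing requirements, take the highest rank (CkptBW)."""
    eligible = [m for m in machines if fixed_requirements(m, last_ckpt_server)]
    return max(eligible, key=lambda m: m["CkptBW"], default=None)


# Hypothetical machine ads.
machines = [
    {"Name": "m1", "CkptServer": "ckpt-a", "CkptBW": 10},
    {"Name": "m2", "CkptServer": "ckpt-b", "CkptBW": 40},
    {"Name": "m3", "CkptServer": "ckpt-a", "CkptBW": 4},
]

first = pick_machine(machines, UNDEFINED)            # first run: any domain
later = pick_machine(machines, first["CkptServer"])  # afterwards: same domain
stuck = pick_machine(machines, "ckpt-z")             # domain with no machines
print(first["Name"], later["Name"], stuck)
```

On its first run the job may start anywhere, so it lands on the machine with the best bandwidth; from then on only machines served by the same checkpoint server are acceptable.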

15 Checkpointing: next step
Distributed dynamic checkpointing
– Pool machines select the “best” checkpoint server (from a network view)
– Association between execution machine and checkpoint server dynamically decided

16 Distributed dynamic checkpointing
– Network Manager
    Creates and keeps up to date the Network ClassAds between pool machines and checkpoint servers
    Network ClassAds are used by pool machines to select the “closest” checkpoint server
[Diagram labels: Central Manager, Submitting Machine, Executing Machine; N = Negotiator, NM = Network Manager, C = Collector, CA = Customer Agent, RA = Resource Agent]
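
One way to picture the selection step is a table of network ads, one per (machine, checkpoint server) pair, from which each machine picks the server with the highest measured bandwidth. The Python sketch below is only a guess at the shape of such data: the attribute names (`Machine`, `CkptServer`, `Bandwidth`), the host names and the values are hypothetical and are not taken from the Network Manager itself.

```python
def closest_server(machine, ads):
    """Return the checkpoint server with the highest bandwidth to `machine`."""
    candidates = [ad for ad in ads if ad["Machine"] == machine]
    if not candidates:
        return None
    return max(candidates, key=lambda ad: ad["Bandwidth"])["CkptServer"]


def update_ad(ads, machine, server, bandwidth):
    """Record a new measurement, replacing the previous one if present."""
    for ad in ads:
        if ad["Machine"] == machine and ad["CkptServer"] == server:
            ad["Bandwidth"] = bandwidth
            return
    ads.append({"Machine": machine, "CkptServer": server, "Bandwidth": bandwidth})


# Hypothetical network ads kept up to date by a network manager.
network_ads = [
    {"Machine": "exec1", "CkptServer": "ckpt-a", "Bandwidth": 2},
    {"Machine": "exec1", "CkptServer": "ckpt-b", "Bandwidth": 30},
    {"Machine": "exec2", "CkptServer": "ckpt-a", "Bandwidth": 50},
]

before = closest_server("exec1", network_ads)
update_ad(network_ads, "exec1", "ckpt-a", 80)  # new measurement arrives
after = closest_server("exec1", network_ads)
print(before, after)
```

Because the choice is recomputed from the current ads, the association between an execution machine and its checkpoint server can change as network conditions change, which is the "dynamically decided" part of the proposal.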

17 INFN Condor pool usage
Different kinds of applications
Condor users are happy: high computing throughput is achieved
Total allocation time for Condor jobs since February 1999: more than 36 years

19 Example

20 Conclusions
Efficiency and robustness of the Condor pool on WAN have been verified
Single pool → Efficient usage of all resources
Network as resource → Optimization of checkpoint operations
Sub-pools → Policies on resource usage
http://www.infn.it/condor

