
Slide 1: Co-Scheduling CPU and Storage Using Condor and SRMs
Alex Romosan, Derek Wright, Ekow Otoo, Doron Rotem, Arie Shoshani (Guidance: Doug Olson)
Lawrence Berkeley National Laboratory
Presenter: Arie Shoshani

Slide 2: Problem: Running Jobs on the Grid
Grid architecture needs to include components for dynamic reservation and scheduling of:
- Compute resources: Condor (startd)
- Storage resources: Storage Resource Managers (SRMs)
- Network resources: quality of service in routers
Also need to coordinate:
- The co-scheduling of resources (here, compute and storage resources only)
- The execution of the co-scheduled resources:
  - Get data (files) onto the execution nodes
  - Start jobs on nodes that already have the right data on them
  - Recover from failures
  - Balance the use of nodes
  - Overall optimization: replicate "hot" files

Slide 3: General Analysis Scenario
[Diagram: at the client's site, a logical query goes to a Request Interpreter, which consults a metadata catalog, a replica catalog, and the Network Weather Service to produce a set of logical files; request planning turns this into an execution plan (a DAG) with site-specific files. A Request Executer then sends requests for data placement and remote computation to Sites 1..N, each of which runs a Compute Resource Manager and a Storage Resource Manager over compute engines and disk caches, backed by an MSS; result files flow back over the network.]

Slide 4: Simpler Problem: Run Jobs on Multi-Node Uniform Clusters
Optimize parallel analysis jobs on the cluster:
- Jobs are partitioned into tasks: Job_i: [C_i, {F_ij}, O_i] -> {C_i, F_ij, O_ij} (a toy decomposition is sketched below)
- Currently using LSF
- Currently files are NFS-mounted, which is a bottleneck
We want to run tasks independently on each node, and to send tasks to where the files are.
This is a very important problem for HENP applications.
[Diagram: a master node and four worker nodes, backed by HPSS.]
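As a minimal sketch of this decomposition, the fan-out of one job into per-file tasks might look like the following; the Task structure and the output-naming scheme are illustrative assumptions, not the system's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Task:
    code: str         # C_i: the analysis code, shared by all tasks of the job
    input_file: str   # F_ij: one input file
    output_file: str  # O_ij: the partial output produced from F_ij

def decompose(code, input_files, output_prefix):
    """Split Job_i = [C_i, {F_ij}, O_i] into independent tasks {C_i, F_ij, O_ij},
    one per input file, so each task can run on whichever node holds its file."""
    return [Task(code, f, f"{output_prefix}.part{j}")
            for j, f in enumerate(input_files)]

tasks = decompose("analyze", ["run1.root", "run2.root", "run3.root"], "hist")
```

The per-task outputs O_ij would later be merged into the job output O_i; the merging question is revisited on slide 17.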

Slide 5: SRM Is a Service
SRM functionality:
- Manage space
  - Negotiate and assign space to users
  - Manage the "lifetime" of spaces
- Manage files on behalf of a user
  - Pin files in storage until they are released
  - Manage the "lifetime" of files
  - Manage the action taken when pins expire (depends on file type)
- Manage file sharing
  - Policies on what should reside on a storage resource at any one time
  - Policies on what to evict when space is needed
- Get files from remote locations when necessary
  - Purpose: to simplify the client's task
- Manage multi-file requests
  - A brokering function: queue file requests, pre-stage when possible
- Provide grid access to/from mass storage systems
  - HPSS (LBNL, ORNL, BNL), Enstore (Fermilab), JasMINE (JLab), CASTOR (CERN), MSS (NCAR), ...
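A toy model of the space and pinning behavior described above, assuming an LRU eviction policy (real SRM policies are configurable; this class is purely illustrative):

```python
import time

class DiskCache:
    """Toy SRM-style space manager: files are pinned while in use, and only
    unpinned ("cold") files may be evicted when space is needed."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.files = {}  # name -> [size, pin_count, last_used]

    def pin(self, name):
        self.files[name][1] += 1
        self.files[name][2] = time.time()

    def release(self, name):
        self.files[name][1] = max(0, self.files[name][1] - 1)

    def used(self):
        return sum(size for size, _, _ in self.files.values())

    def add(self, name, size):
        while self.used() + size > self.capacity:
            # Evict the least-recently-used unpinned file.
            victims = [(ts, n) for n, (s, pins, ts) in self.files.items() if pins == 0]
            if not victims:
                raise RuntimeError("no space: all resident files are pinned")
            del self.files[min(victims)[1]]
        self.files[name] = [size, 0, time.time()]
```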

Slide 6: Types of SRMs
Types of storage resource managers:
- Disk Resource Manager (DRM): manages one or more disk resources
- Tape Resource Manager (TRM): manages access to a tertiary storage system (e.g., HPSS)
- Hierarchical Resource Manager (HRM = TRM + DRM): an SRM that stages files from tertiary storage into its disk cache
SRMs and file transfers:
- SRMs DO NOT perform file transfers themselves
- SRMs DO invoke a file transfer service when needed (GridFTP, FTP, HTTP, ...); TRM: from/to the MSS, DRM: from/to the network
- SRMs DO monitor transfers and recover from failures
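The HRM = TRM + DRM relationship can be pictured as a class hierarchy; the methods below are invented for illustration and are not the SRM interface.

```python
class DRM:
    """Disk Resource Manager: manages one or more disk caches."""
    def get(self, path):
        return f"disk:{path}"          # serve the file from local disk

class TRM:
    """Tape Resource Manager: fronts a tertiary store such as HPSS."""
    def stage(self, path):
        print(f"staging {path} from tape")

class HRM(TRM, DRM):
    """Hierarchical Resource Manager (TRM + DRM): stages tape files
    into its own disk cache, then serves them from disk."""
    def get(self, path):
        self.stage(path)               # tape -> disk cache
        return DRM.get(self, path)     # then serve from disk

print(HRM().get("/hpss/star/run1.root"))
```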

Slide 7: Uniformity of Interface -> Compatibility of SRMs
[Diagram: users/applications and grid middleware talk to a single SRM interface, behind which sit different back ends: Enstore, JASMine, dCache, CASTOR, and plain disk caches.]

Slide 8: SRMs Used in STAR for Robust Multi-File Replication
[Diagram: an HRM-Client command-line interface, runnable from anywhere, issues an HRM-COPY request (thousands of files) from BNL to LBNL. The HRM at BNL stages files from tape and performs the reads; the HRM at LBNL issues SRM-GET one file at a time, pulls each file over the network with GridFTP GET (pull mode), and performs the writes and archiving.]
The system gets the list of files and then recovers from staging failures, file transfer failures, and archiving failures.
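A minimal sketch of the per-file recovery loop implied above, assuming hypothetical stage/transfer/archive callables; the real HRMs retry each phase independently rather than restarting the whole multi-file request.

```python
import time

def replicate(files, stage, transfer, archive, max_retries=3, backoff=30):
    """Copy each file through three recoverable phases: stage from tape at the
    source, transfer over the network, archive at the destination. A failure
    in any phase retries that phase alone, not the whole multi-file request."""
    for f in files:
        for phase in (stage, transfer, archive):
            for attempt in range(max_retries):
                try:
                    phase(f)
                    break
                except IOError:
                    time.sleep(backoff * (attempt + 1))  # back off, then retry
            else:
                raise RuntimeError(f"{phase.__name__} failed for {f}")
```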

Slide 9: File Movement Functionality: srmGet, srmPut, srmReplicate
[Diagram: srmGet/srmPut run between a client and an SRM, with the client doing FTP-get (pull) or FTP-put (push); srmReplicate runs between an SRM and another site (SRM or non-SRM), with the SRM itself doing FTP-get (pull) or FTP-put (push).]

Slide 10: SRM Methods
- File movement: srm(Prepare)Get, srm(Prepare)Put, srmReplicate
- Lifetime management: srmReleaseFiles, srmPutDone, srmExtendFileLifeTime
- Terminate/resume: srmAbortRequest, srmAbortFile, srmSuspendRequest, srmResumeRequest
- Space management: srmReserveSpace, srmReleaseSpace, srmUpdateSpace, srmCompactSpace, srmGetCurrentSpace
- File-type management: srmChangeFileType
- Status/metadata: srmGetRequestStatus, srmGetFileStatus, srmGetRequestSummary, srmGetRequestID, srmGetFilesMetaData, srmGetSpaceMetaData
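A hedged sketch of how a client might string these methods together into one request lifecycle; the `srm` client object, argument shapes, and status strings are assumptions for illustration, not the actual SRM interface specification.

```python
import time

def fetch_files(srm, surls, space_bytes):
    """Illustrative lifecycle: reserve space, issue a multi-file get, poll the
    request status, then release the files and the space."""
    space = srm.srmReserveSpace(space_bytes)
    request = srm.srmGet(surls, space)
    try:
        while True:
            status = srm.srmGetRequestStatus(request)
            if status == "DONE":
                break
            if status == "FAILED":
                srm.srmAbortRequest(request)
                raise RuntimeError("srmGet request failed")
            time.sleep(5)  # poll periodically rather than busy-wait
        return [srm.srmGetFileStatus(request, surl) for surl in surls]
    finally:
        srm.srmReleaseFiles(request)
        srm.srmReleaseSpace(space)
```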

Slide 11: Simpler Problem: Run Jobs on Multi-Node Uniform Clusters
Optimize parallel analysis on the cluster:
- Minimize movement of files between cluster nodes
- Use the nodes in the cluster as evenly as possible
- Automatic replication of "hot" files
- Automatic management of disk space
- Automatic removal of cold files (automatic garbage collection)
Use:
- DRMs for disk management on each node: space and content (files)
- An HRM for staging files from HPSS
- Condor for job scheduling on each node: startd to run jobs and monitor progress
- Condor matchmaking to match slots and files
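As a toy illustration of matching tasks to slots on file residency (real Condor matchmaking uses the ClassAd language and the Negotiator, not Python dicts):

```python
machine_ads = [
    {"Name": "slot1@n1", "Files": {"run1.root", "run2.root"}},
    {"Name": "slot1@n2", "Files": {"run3.root"}},
]
job_ads = [
    {"Id": 1, "NeedsFile": "run2.root"},
    {"Id": 2, "NeedsFile": "run3.root"},
]

# Match each task to a slot whose DRM already holds its input file.
for job in job_ads:
    match = next((m for m in machine_ads if job["NeedsFile"] in m["Files"]), None)
    print(job["Id"], "->", match["Name"] if match else "unmatched")
```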

Slide 12: Architecture
[Diagram: a head node runs the JDD (Job Decomposition Daemon), the FSD (File Scheduling Daemon), the schedd, the Collector, and the Negotiator; each worker node runs a DRM and a startd; an HRM connects the cluster to HPSS.]

Slide 13: Detailed Actions (JDD)
The JDD partitions jobs into tasks:
- Job_i: [C_i, {F_ij}, O_i] -> {C_i, F_ij, O_ij}
- The JDD constructs two lists: S(j), the set of tasks ("jobs" in Condor-speak), and S(f), the set of files requested (it also keeps reference counts on files)
The JDD probes all DRMs:
- For the files they have
- For missing files, it can schedule requests to the HRM
The JDD schedules all missing files (see the sketch below):
- Simple algorithm: schedule round-robin across the nodes
- Simply send the request to each DRM; the DRM evicts files if needed and gets the file from the HRM
The JDD sends each startd the list of files it needs:
- The startd checks with its DRM which of the needed files it has, and constructs a class-ad that lists only the relevant files
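A minimal sketch of the round-robin step; the data structures (task id -> file, node -> resident files) are invented for illustration, not the JDD's actual representation.

```python
from itertools import cycle
from collections import Counter

def schedule_missing(tasks, node_files, nodes):
    """Round-robin placement of missing files, following the JDD's simple
    algorithm. tasks: task id -> input file. node_files: node -> set of
    files its DRM already holds."""
    requested = Counter(tasks.values())                  # S(f) with reference counts
    resident = set().union(*node_files.values())
    missing = [f for f in requested if f not in resident]
    assignment = {}
    for f, node in zip(missing, cycle(nodes)):           # round-robin over the nodes
        assignment[f] = node                             # node's DRM will pull f from the HRM
    return assignment

nodes = ["n1", "n2", "n3"]
node_files = {"n1": {"a.root"}, "n2": set(), "n3": {"b.root"}}
tasks = {1: "a.root", 2: "b.root", 3: "c.root", 4: "d.root"}
print(schedule_missing(tasks, node_files, nodes))        # {'c.root': 'n1', 'd.root': 'n2'}
```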

Slide 14: Detailed Actions (FSD)
The FSD queues all tasks with Condor.
The FSD periodically checks the status of tasks with Condor:
- If a task is stuck, it may choose to replicate the file (this is where a smart algorithm is needed)
- File replication can be done from a neighboring node or from the HRM
When a startd runs a task, it asks its DRM to pin the file, runs the task, and then releases the file (sketched below).
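A sketch of the startd's pin-run-release sequence; `execute` is a stand-in for the actual task launcher, and `drm` is any object with pin/release methods (e.g., the DiskCache sketch from slide 5).

```python
from contextlib import contextmanager

@contextmanager
def pinned(drm, path):
    """Hold a DRM pin for the duration of a task, releasing it even on failure."""
    drm.pin(path)
    try:
        yield path
    finally:
        drm.release(path)

def run_task(drm, task, execute):
    # startd-side flow: pin the input file, run the task, release the pin.
    with pinned(drm, task.input_file):
        return execute(task)
```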

Slide 15: Architecture (Continued)
[Same diagram as slide 12, with one step highlighted: the JDD generates the list of missing files.]

Slide 16: Need to Develop
- A mechanism for the startd to communicate with the DRM (recently added to startd)
- A mechanism to check the status of tasks
- A mechanism to detect that a task has finished and notify the JDD
- A mechanism to detect that a job is done and notify the client
- The JDD itself
- The FSD itself

Slide 17: Open Questions (1)
What if a file was removed by a DRM?
- If the DRM does not find the file on its disk, the task gets rescheduled
- Note: usually, only "cold" files are removed
- Should DRMs notify the JDD when they remove a file?
How do we deal with output files and the merging of outputs?
- DRMs need to be able to schedule durable space
- Moving files off the compute node is the responsibility of the user (code)
- Perhaps moving files to their final destination should be a service of this system

Slide 18: Open Questions (2)
Is it best to process as many files on a single system as possible?
- E.g., one system holds all 4 files, but the same files are also spread across 4 different systems; which placement is better?
- Conjecture: if the overhead of splitting a job is small, then the split is optimized by matchmaking
What if file bundles are needed?
- A file bundle is a set of files that must be processed together
- This requires more sophisticated class-ads
- How should bundles be replicated?

Slide 19: Detailed Activities
Development work:
- Design of the JDD and FSD modules
- Development of the software components
- Use of a real experimental cluster (8 + 1 nodes); install Condor and SRMs
Development of an optimization algorithm:
- The problem is represented as a bipartite graph
- Solved using network-flow analysis techniques

Slide 20: Optimizing File Replication on the Cluster (D. Rotem)*
Jobs can be assigned to servers subject to the following constraints (a feasibility check is sketched below):
1. Availability of computation slots on the server; usually these correspond to CPUs
2. The file(s) needed by the job must be resident on the server's disk
3. Sufficient disk space for storing the job's output
4. Sufficient RAM
Goal: maximize the number of jobs assigned to servers while minimizing file-replication costs.
* Article in preparation
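A sketch of the per-assignment feasibility test implied by constraints 1-4; the Server and Job fields are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Server:
    free_slots: int
    files: set       # files resident on the server's disk
    free_disk: int   # bytes available for job output
    free_ram: int

@dataclass
class Job:
    files: set       # input files the job needs resident
    out_bytes: int   # disk space its output will require
    ram: int

def can_assign(job, server):
    """Constraints 1-4: a free slot, resident input files, room for the
    output, and enough RAM."""
    return (server.free_slots > 0
            and job.files <= server.files
            and server.free_disk >= job.out_bytes
            and server.free_ram >= job.ram)
```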

Slide 21: Bipartite Graph of Files and Servers
- An arc between an f-node and an s-node exists if the file is stored on that server
- The number on an f-node is the number of jobs that want to process that file
- The number on an s-node is the number of available slots on that server

Slide 22: File Replication Converted to a Network-Flow Problem
1. The total maximum number of jobs that can be assigned to the servers corresponds to the maximum flow in this network.
2. By the well-known max-flow min-cut theorem, this is also equal to the capacity of a minimum cut (shown in bold edges), where a cut is a set of edges that disconnects the source from the sink.
In the slide's example the maximum flow is 11, with the minimum cut shown in bold; the capacity on each arc is the minimum of the numbers on its two endpoint nodes. A worked instance follows below.
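To make the formulation concrete, here is a self-contained sketch: a source feeds each file node with the number of jobs wanting that file, a file-server edge exists where the file is resident (capacity: the minimum of the two node numbers), and each server drains into a sink with its free-slot count. The tiny instance and the Edmonds-Karp implementation are illustrative, not the slide's actual graph, though this instance also happens to have a maximum flow of 11.

```python
from collections import deque, defaultdict

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                  # BFS for an augmenting path
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t                               # walk back from sink to source
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)        # bottleneck capacity
        for u, v in path:
            cap[u][v] -= push                         # consume forward capacity...
            cap[v][u] += push                         # ...and open residual capacity
        flow += push

# Source -> file (capacity: jobs wanting the file); file -> server (where
# resident, capacity: min of the two node numbers); server -> sink (free slots).
cap = defaultdict(lambda: defaultdict(int))
demand = {"f1": 4, "f2": 5, "f3": 3}
slots = {"s1": 6, "s2": 5}
resident = [("f1", "s1"), ("f2", "s1"), ("f2", "s2"), ("f3", "s2")]
for f, d in demand.items():
    cap["src"][f] = d
for sv, k in slots.items():
    cap[sv]["sink"] = k
for f, sv in resident:
    cap[f][sv] = min(demand[f], slots[sv])
print(max_flow(cap, "src", "sink"))                   # -> 11 assignable jobs
```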

Slide 23: Improving the Flow by Adding an Edge
- Adding one edge, which represents a file replication, improves the maximum flow to 13 in the slide's example.
- Problem: find a subset of edges of minimum total cost that maximizes the flow between the source and the sink.

Slide 24: Solution
- Problem: finding a set of edges of minimum cost that maximizes the flow (MaxFlowFixedCost)
- This problem is (strongly) NP-complete
- We use an approximation algorithm, called Continuous Maximum Flow Improvement (C-MaxFlowImp), that finds a suboptimal solution in polynomial time using linear-programming techniques
- We can show that the solution is bounded relative to the optimal one
- This will be implemented as part of the FSD
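C-MaxFlowImp itself is only named here (the article is in preparation), so the sketch below is explicitly NOT that algorithm: it is a naive greedy stand-in that reuses `max_flow` from the slide-22 sketch and repeatedly adds whichever candidate replication edge yields the best flow gain per unit cost. The budget model and all names are assumptions.

```python
import copy

def flow_value(cap):
    # max_flow (slide-22 sketch) mutates its input, so evaluate on a copy;
    # "src"/"sink" are the node names used there.
    return max_flow(copy.deepcopy(cap), "src", "sink")

def greedy_replicate(cap, candidates, budget):
    """Naive greedy stand-in, NOT the authors' C-MaxFlowImp: repeatedly add
    the candidate replication edge with the best flow-gain / cost ratio.
    candidates: list of (file, server, cost, capacity); costs assumed positive."""
    chosen, current = [], flow_value(cap)
    while True:
        best, best_ratio = None, 0.0
        for f, s, cost, extra in candidates:
            if cost > budget:
                continue
            trial = copy.deepcopy(cap)
            trial[f][s] += extra                 # tentatively replicate file f to server s
            ratio = (flow_value(trial) - current) / cost
            if ratio > best_ratio:
                best, best_ratio = (f, s, cost, extra), ratio
        if best is None:
            return chosen, current               # no affordable edge improves the flow
        f, s, cost, extra = best
        cap[f][s] += extra                       # commit the replication
        budget -= cost
        current = flow_value(cap)
        candidates = [e for e in candidates if e != best]
        chosen.append(best)
```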

Slide 25: Conclusions
- Combining compute and file resources in class-ads is a useful concept: it lets us take advantage of the matchmaker
- Using DRMs to manage space and its content provides:
  - Information for class-ads
  - Automatic garbage collection
  - Automatic staging of missing files from HPSS through the HRM
- Minimizing the number of files in class-ads is the key to efficiency: get only the needed files from the DRM
- Optimization can be done externally to Condor by file-replication algorithms; the network-flow analogy provides a good theoretical foundation
- Interaction between Condor and SRMs goes through existing APIs; only small enhancements were needed in startd and the DRMs
- We believe the results can be extended to the Grid, but the cost of replication will vary greatly there, so the algorithms need to be extended

