
1 Design of an Active Storage Cluster File System for DAG Workflows
Patrick Donnelly and Douglas Thain
University of Notre Dame
November 18th, 2013, DISCS-2013

2 Task-based Workflow Engines

part1 part2 part3: input.data split.py
    ./split.py input.data
out1: part1 mysim.exe
    ./mysim.exe part1 > out1
out2: part2 mysim.exe
    ./mysim.exe part2 > out2
out3: part3 mysim.exe
    ./mysim.exe part3 > out3
result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result

Works on: Work Queue, SGE, Condor, Local.
Systems similar to Makeflow: Pegasus, Condor's DAGMan, Dryad.
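
To make the execution model concrete, here is a minimal sketch, in Python, of what a task-based engine does with such a file: parse each rule into (targets, sources, command) and run a rule once all of its sources exist. The names and structure are illustrative only, not Makeflow's implementation.

import os
import subprocess

# Each rule mirrors one Makeflow rule above: (targets, sources, command).
rules = [
    (["part1", "part2", "part3"], ["input.data", "split.py"], "./split.py input.data"),
    (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 > out1"),
    (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 > out2"),
    (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 > out3"),
    (["result"], ["out1", "out2", "out3", "join.py"], "./join.py out1 out2 out3 > result"),
]

def run_dag(rules):
    pending = list(rules)
    while pending:
        # A rule is ready once every one of its source files exists.
        ready = [r for r in pending if all(os.path.exists(s) for s in r[1])]
        if not ready:
            raise RuntimeError("missing inputs for the remaining rules")
        for rule in ready:
            subprocess.run(rule[2], shell=True, check=True)
            pending.remove(rule)

A real engine dispatches the ready rules in parallel to workers; this serial loop only illustrates the dependency ordering.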

3 Today's (DAG-structured) Workflows

4 Big Data is "Hard"
Scenario A: a master node distributes a 1 TB dataset over the network to each worker.
Scenario B: a workflow management system (WMS) runs tasks on a cloud and/or grid, with the data served by a distributed file system (DFS).

5 Data Size Increases? Turn to a DFS
Distributing task dependencies over the network becomes costly as data sizes grow; many datasets are too large to be distributed from a master node.
What is used in practice? NFS, AFS, Ceph, PVFS, GPFS: generic POSIX-compliant cluster file systems.
Problem: data-locality is hard for the workflow to achieve.
– Contributing factor: these file systems offer no interface for locating the storage nodes that hold a file's data.
Problem: parallel applications accessing the same data are a denial-of-service waiting to happen (the herd effect).
– Contributing factor: the requirement to maintain POSIX semantics.

6 Other Options? Specialized Workflow Abstractions

7 Map-Reduce
(Figure source: developers.google.com)

8 Distributed File Systems: Specialized
The Hadoop Distributed File System is a specialized cluster file system for executing Map-Reduce. Writing a file through its C API:

#include "hdfs.h"

hdfsFS fs = hdfsConnect("default", 0);  /* connect to the default NameNode */
const char *writePath = "/tmp/hello.txt";
hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
tSize nwritten = hdfsWrite(fs, writeFile, "hello\n", 6);  /* write 6 bytes */
hdfsCloseFile(fs, writeFile);

9 HDFS architecture (figure source: hadoop.apache.org)

10 Task-based Workflows on Hadoop?
Whole-file access by a single task is inefficient.
Hadoop job execution is not built for single tasks with single-file dependencies.

11 DAG Execution on Hadoop
Makeflow reads w.makeflow and submits each rule through its batch-job interface (Work Queue, Hadoop, ...). A task becomes a map-only Hadoop job, e.g. Job 1234 with Map: ./split.py input.data, Map input: input.data, and no Reduce; the job passes through the Hadoop NameNode and runs on a DataNode.

12 Hadoop Job Throughput
Source: Albrecht et al., Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids, SWEET'12.

13 Summary: Running Workflows on Large Datasets is Hard
Today, users have two solutions:
Use a generic POSIX distributed file system.
– Problem: data-locality is hard for workflow managers to achieve.
– Problem: parallel applications accessing the same data are a denial-of-service waiting to happen.
Use a specialized file system that executes a specific workflow abstraction.
– Problem: users must rewrite applications to fit the workflow pattern (abstraction).
– Problem: task-based workflows run inefficiently.

14 Observations on DAG-Structured Workflows
1. Scientific workflows often re-use large datasets across multiple workflows.
2. Metadata interactions occur at task start and end.
3. Tasks consume whole files.

15 Cluster File System Overview
We have designed Confuga, an active storage cluster file system purpose-built for running task-based workflows.
Distinguishing features:
– Data-locality-aware scheduling with multiple dependencies.
– Drop-in replacement for other compute engines.
– Consistency maintained at task boundaries.

16 Confuga: An Active Storage Cluster File System
A single metadata server (MDS) with multiple storage nodes S1, S2, S3.
Replica Manager (RM): tracks replicas at file granularity:
F1: S1, S2
F2: S1, S3
F3: S2
Namespace Manager (NM): a regular directory hierarchy in which files point to file identifiers:
/
|__ readme.txt --> F1
|__ users/
    |__ patrick/
        |__ blast.db --> F2
        |__ blast --> F3

17 Replica Manager
Files are indexed using content-addressable storage: a file's identifier is the checksum of its contents (e.g. SHA1 abcdef123456789, with a replica on s2.cluster.nd.edu).
Tasks:
– Ensure sufficient replication of files, restriping the cluster as necessary.
– Garbage-collect extra, unneeded replicas.
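
To illustrate the scheme, here is a minimal Python sketch, with hypothetical names rather than Confuga's actual code: a file's identifier is the SHA-1 of its contents, and the replica manager tops up under-replicated files and trims extras.

import hashlib

def file_id(path):
    # The file identifier is the checksum of the file's contents.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

class ReplicaManager:
    def __init__(self, nodes, goal=2):
        self.nodes = nodes      # e.g. ["s1", "s2", "s3"]
        self.goal = goal        # desired replication factor
        self.replicas = {}      # file id -> set of nodes holding a copy

    def ensure_replication(self, fid, copy):
        # 'copy(src, dst, fid)' stands in for a node-to-node transfer;
        # assumes at least one replica of fid already exists somewhere.
        holders = self.replicas.setdefault(fid, set())
        for node in self.nodes:
            if len(holders) >= self.goal:
                break
            if node not in holders:
                copy(next(iter(holders)), node, fid)
                holders.add(node)

    def garbage_collect(self, fid):
        # Drop replicas beyond the goal; a real system would also check
        # that no running task still depends on the doomed copy.
        holders = self.replicas.get(fid, set())
        while len(holders) > self.goal:
            holders.pop()

Because identifiers are content checksums, identical files dedupe to one object, and any replica can be verified by re-hashing it.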

18 Namespace Manager
Maintains a mirror file system layout on the head node.
– Regular files hold file identifiers (checksums).
Serves as the global synchronization point for file system updates.
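
A minimal sketch of the idea, in Python with hypothetical names (the real implementation lives inside the Chirp server): the namespace is an ordinary directory tree whose regular files contain checksums, and each binding is published with an atomic rename, which is what makes the head node a clean synchronization point.

import os
import tempfile

def bind(root, logical_path, fid):
    # Publish 'logical_path -> fid' in the mirror tree. rename() is atomic,
    # so a reader sees either the old binding or the new one, never a torn write.
    path = os.path.join(root, logical_path.lstrip("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        f.write(fid)
    os.rename(tmp, path)

def lookup(root, logical_path):
    # Resolve a logical path to its current file identifier.
    with open(os.path.join(root, logical_path.lstrip("/"))) as f:
        return f.read().strip()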

19 Job Scheduler
A job description names the command and maps sandbox names to logical paths:
command: "blast blast.db > out"
inputs: { blast: "/users/patrick/blast", blast.db: "/users/patrick/blast.db" }
outputs: { out: "/users/patrick/out" }
Step 1: the client submits the job to the head node.
Step 2: the head node copies F3 to S3, co-locating all of the task's dependencies.
Step 3: the head node assigns task T1 to S3.
Step 4: S3 executes T1.
Step 5: S3 reports the result of T1.
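
A sketch of locality-aware placement under these assumptions, with hypothetical names (Confuga's real scheduler is more involved): score each storage node by how many of the task's inputs it already holds, pick the best one, and transfer only the missing replicas before dispatch.

def schedule(inputs, replicas, nodes, transfer):
    # inputs: file ids the task needs; replicas: file id -> nodes holding it.
    # Choose the storage node that already holds the most dependencies.
    best = max(nodes, key=lambda n: sum(1 for fid in inputs if n in replicas[fid]))
    for fid in inputs:
        if best not in replicas[fid]:
            transfer(next(iter(replicas[fid])), best, fid)  # pull the missing file
            replicas[fid].add(best)
    return best  # every input is now resident; run the task here

A production scheduler would weight candidates by file size and current load rather than by replica count alone, since one resident 1 TB dataset outweighs many small files.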

20 Job Scheduler: Task Namespace
Context-free execution; atomic, side-effect-free commits.
Task 1:
command: "blast blast.db > out"
namespace: blast.db -> F2, blast -> F3, out -> OUTPUT
On the storage node, inputs appear as symlinks into the local object store:
drwxr-x--- 2 user users 4K 8:00 .
lrwxrwxrwx 1 user users 49 8:00 blast.db -> ../store/F2
lrwxrwxrwx 1 user users 49 8:00 blast -> ../store/F3
-rw-r----- 1 user users 0 8:00 out
$ ./blast blast.db > out
Task 1 result:
exit status: 0
namespace: out -> Fout (the new output's identifier, now resident on S3)
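
Putting the pieces together, a hedged sketch of task execution on a storage node (hypothetical helper names; file_id is the checksum helper from the replica-manager sketch above): link the inputs out of the content-addressed store, run the command, then commit each output back into the store under its checksum.

import os
import shutil
import subprocess

def run_task(store, sandbox, command, inputs, outputs):
    # inputs: sandbox name -> file id already present in 'store';
    # outputs: sandbox names whose contents are committed on exit.
    os.makedirs(sandbox, exist_ok=True)
    for name, fid in inputs.items():
        # read-on-exec: inputs are bound as links into the local object store.
        os.symlink(os.path.join(store, fid), os.path.join(sandbox, name))
    status = subprocess.run(command, shell=True, cwd=sandbox).returncode
    committed = {}
    for name in outputs:
        # commit-on-exit: each output enters the store under its checksum,
        # making the commit atomic and side-effect-free from the task's view.
        fid = file_id(os.path.join(sandbox, name))
        shutil.move(os.path.join(sandbox, name), os.path.join(store, fid))
        committed[name] = fid
    return status, committed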

21 Exploiting DAG-structured Workflow Semantics
POSIX: each open, write, and close issued by the user machine is a separate round trip to the storage node.
AFS (commit-on-close): writes are buffered at the client and sent with a single flush + close.
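
To put rough numbers on the difference: a task that opens one output file, issues 100 writes, and closes it costs on the order of 102 server round trips under strict POSIX semantics, but only a single flush-plus-close exchange under commit-on-close.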

22 read-on-exec / commit-on-exit
Confuga goes a step further: a task's inputs are bound when it is dispatched (read-on-exec) and its outputs are committed when it exits (commit-on-exit); the open/read/close and open/write/close traffic in between stays local to the storage node.
– Eliminates inter-task synchronization requirements.
– Batches metadata operations.

23 Why Confuga?
Integrates cleanly with current DAG workflows:
– The task namespace encapsulates data dependencies.
– Writes and reads resolve at task boundaries.
A global namespace allows data sharing and workflow checkpointing.
Tasks can express multiple dependencies.
Unnecessary metadata interactions are minimized.

24 Feature Comparison
Solutions compared: workflows on a generic DFS, Hadoop, and Confuga.
Dimensions: data-locality, metadata scaling, large-file support, application as abstraction, and task-based workflows.

25 Implementation: Confuga Using Chirp
Why Chirp?
– Most of Confuga can be implemented using a (slightly modified) remote file server.
– It interoperates with existing distributed computation tools.
In the Chirp stack, a user application goes through libchirp or FUSE, over network RPC, to a Chirp server that enforces ACLs and policy on top of the local file system.

26 Extending Chirp
Standard Chirp: clients (libchirp, FUSE) -> Chirp RPC -> quota and ACL enforcement -> local FS.
Confuga storage node: the standard stack extended with job execution.
Confuga head node: the standard stack extended with the job scheduler, namespace manager, and replica manager; clients address the whole Confuga file system through libchirp.

27 Concluding Thoughts
Smart adaptation to workflow semantics allows the file system to reduce metadata operations and to minimize cluster synchronization steps.
The task namespace is explicit in the job description, allowing the file system to schedule tasks near multiple dependencies.

28 Questions?
Patrick Donnelly (PDONNEL3@ND.EDU)
Douglas Thain (DTHAIN@ND.EDU)
Have a challenging distributed systems problem? Visit our lab at http://www.nd.edu/~ccl/
Source code: http://www.github.com/cooperative-computing-lab/cctools

29 Why content-addressable storage?

30 Integrating Chirp with Makeflow
Before: Makeflow works against the local file system and ships ./exe, ./data/a.db, and ./data/b.db to SGE, Condor, Work Queue, etc. via submit/put/get.
After: Makeflow stats and submits directly against a Chirp server that stores the files and enforces ACLs and policy.
– Requires Makeflow to abstract access to the workflow namespace.
– Requires Chirp to support a job submission interface.
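
The job description itself might look like the following sketch, a Python rendering of the description shown on slide 19; the field names are illustrative, not Chirp's exact submission protocol.

job = {
    "command": "blast blast.db > out",
    "inputs": {"blast": "/users/patrick/blast", "blast.db": "/users/patrick/blast.db"},
    "outputs": {"out": "/users/patrick/out"},
}

# The interface Makeflow needs then reduces to roughly three calls:
#   job_id = submit(server, job); result = wait(server, job_id); kill(server, job_id)
# with Makeflow invoking these instead of executing tasks locally.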

31 Rearchitect Chirp's "multi" Interface
Multi-volume management in Chirp today:
– Files are striped round-robin across a static set of nodes.
– No replication.
– Locating a file requires traversing the namespace on the head node (e.g. ./volume/hosts, and ./volume/root/users/pdonnel3/blast --> S1:/abcd, with the data stored as ./abcd on S1).
– Access is not provided by Chirp itself; clients go through the multi library.

32 Changes to Chirp
Two services make up the Confuga back-end file system:
– Replica Manager
– Namespace Manager

33 Publications
Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop. IEEE CloudCom 2010.
Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids. SWEET at ACM SIGMOD 2012.
Fine-Grained Access Control in the Chirp Distributed File System. IEEE CCGrid 2012.

34 Map-Reduce
The HDFS architecture is influenced by Map-Reduce:
– Block-oriented; no whole-file access.
– No multiple-file access.
(Figure source: yahoo.com)

