1 CSCI5570 Large Scale Data Processing Systems
Distributed Data Analytics Systems James Cheng CSE, CUHK Slide Ack.: adapted from the slides at projects/dryadlinq/default.aspx

2 Dryad: distributed data-parallel programs from sequential building blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly Microsoft Research Silicon Valley EuroSys’07

3 Dryad goals General-purpose execution engine for coarse-grained data-parallel applications. Concentrates on throughput, not latency. Assumes a private data center. Users write simple programs, and the execution engine automatically manages scheduling, distribution, fault tolerance, etc.

4 Outline Computational model Dryad architecture Some case studies

5 A typical data-intensive query
var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user.EndsWith(@"\ulfar")
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;
Ulfar’s most frequently visited web pages

6 Steps in the query var logentries = from line in logs where !line.StartsWith("#") select new LogEntry(line); var user = from access in logentries where select access; var accesses = from access in user group access by access.page into pages select new UserPageCount("ulfar", pages.Key, pages.Count()); var htmAccesses = from access in accesses where access.page.EndsWith(".htm") orderby access.count descending Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object. Go through logentries and keep only entries that are accesses by ulfar. Group ulfar’s accesses according to what page they correspond to. For each page, count the occurrences. Sort the pages ulfar has accessed according to access frequency. 6

7 Serial execution For each line in logs, do…
(The same query, executed serially.) For each entry in logentries, do… Sort the entries in user by page, then iterate over the sorted list, counting the occurrences of each page as you go. Re-sort the entries in accesses by page frequency.

8 Parallel execution
The same query as above, now executed in parallel over partitions of the data.

9 How does Dryad fit in?
Programmers write sequential programs with no thread creation or locking, using “SQL-like” queries (LINQ). Dryad represents an application program as a distributed execution graph: computational “vertices” connected by communication “channels” (i.e., edges). Dryad runs the application by executing the vertices on a set of available computers, with the vertices communicating through files, TCP pipes, and shared-memory FIFOs. File: the producer writes to disk (on its local computer) and the consumer reads from disk. TCP pipe: data is transferred over the network, with no disk access. Shared-memory FIFO: data stays in main memory, used when producer and consumer run within the same process.

10 Job = Directed Acyclic Graph
A Dryad job is composed of a collection of processing vertices (processes) that communicate with each other through channels (files, pipes, or shared memory). The vertices and channels must always compose into a directed acyclic graph, with input vertices at one end and output vertices at the other.
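To make the model concrete, here is a minimal C# sketch of such a job graph with typed channels and an acyclicity check. The types and names (Vertex, Channel, JobGraph) are illustrative only and are not Dryad's actual API.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Build a tiny two-stage job and check that it is a DAG.
    var g = new JobGraph();
    var input = g.AddVertex("input");
    var parse = g.AddVertex("parse");
    var count = g.AddVertex("count");
    g.Connect(input, parse, ChannelType.File);
    g.Connect(parse, count, ChannelType.SharedMemoryFifo);
    Console.WriteLine($"acyclic: {g.IsAcyclic()}");

    // Illustrative types only; Dryad's real interface is different.
    enum ChannelType { File, TcpPipe, SharedMemoryFifo }
    record Vertex(string Name);
    record Channel(Vertex Producer, Vertex Consumer, ChannelType Type);

    class JobGraph
    {
        public List<Vertex> Vertices { get; } = new();
        public List<Channel> Channels { get; } = new();

        public Vertex AddVertex(string name) { var v = new Vertex(name); Vertices.Add(v); return v; }
        public void Connect(Vertex from, Vertex to, ChannelType type) => Channels.Add(new Channel(from, to, type));

        // The job is valid only if vertices and channels compose into a directed acyclic graph.
        public bool IsAcyclic()
        {
            var indegree = Vertices.ToDictionary(v => v, _ => 0);
            foreach (var c in Channels) indegree[c.Consumer]++;
            var ready = new Queue<Vertex>(Vertices.Where(v => indegree[v] == 0));
            int visited = 0;
            while (ready.Count > 0)
            {
                var v = ready.Dequeue(); visited++;
                foreach (var c in Channels.Where(ch => ch.Producer == v))
                    if (--indegree[c.Consumer] == 0) ready.Enqueue(c.Consumer);
            }
            return visited == Vertices.Count;   // every vertex reached in topological order => no cycle
        }
    }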

11 Graph Description Language
A lower-level programming model than SQL Programmers can specify an arbitrary DAG to describe an application’s communication patterns, and express the data transport mechanisms (files, TCP pipes, and shared-memory FIFOs) between the computation vertices

12 Graph Description Language
Cloning individual vertices (n times) using the ^ operator: As = A^n

13 Graph Description Language
Pointwise composition using the >= operator: As >= Bs (each A vertex is connected to the corresponding B vertex)

14 Graph Description Language
Complete bipartite composition using the >> operator: As >> Bs (every A vertex is connected to every B vertex)

15 Graph Description Language
Merging two graphs using the || operator: (B >= C) || (B >= D) (vertices that appear in both graphs, here B, are merged)

16 Graph Description Language
A bypass operation: each A vertex outputs a summary of its input to C, and C aggregates all its inputs and forwards the global statistics to every B: E = (As >= C >= Bs). To produce F (different from E), each A also sends its data directly to each B (by pointwise composition), and each B then uses the statistics received from C to process A’s data: F = E || (As >= Bs). The toy sketch below illustrates these operators.
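The following toy C# sketch illustrates the semantics of the four operators described on the last few slides, using ordinary methods (Clone, Pointwise, Bipartite, Merge) in place of ^, >=, >> and ||. It is an illustration only, not Dryad's real graph-builder interface.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Toy stand-ins for Dryad's ^, >=, >> and || graph operators (illustrative only).
    var As = Graph.Clone("A", 3);                            // As = A^3
    var C  = Graph.Clone("C", 1);
    var Bs = Graph.Clone("B", 3);
    var E  = Graph.Pointwise(Graph.Pointwise(As, C), Bs);    // E = (As >= C >= Bs)
    var F  = Graph.Merge(E, Graph.Pointwise(As, Bs));        // F = E || (As >= Bs)
    Console.WriteLine(string.Join(", ", F.Edges.Select(e => $"{e.from}->{e.to}")));

    class Graph
    {
        public List<string> Inputs = new(), Outputs = new();
        public HashSet<(string from, string to)> Edges = new();

        // A^n: n copies of a single vertex.
        public static Graph Clone(string name, int n)
        {
            var g = new Graph();
            for (int i = 0; i < n; i++) { g.Inputs.Add($"{name}{i}"); g.Outputs.Add($"{name}{i}"); }
            return g;
        }

        // As >= Bs: connect outputs of a to inputs of b pointwise (round-robin if the counts differ).
        public static Graph Pointwise(Graph a, Graph b)
        {
            var g = Combine(a, b);
            int m = Math.Max(a.Outputs.Count, b.Inputs.Count);
            for (int i = 0; i < m; i++)
                g.Edges.Add((a.Outputs[i % a.Outputs.Count], b.Inputs[i % b.Inputs.Count]));
            return g;
        }

        // As >> Bs: connect every output of a to every input of b (complete bipartite).
        public static Graph Bipartite(Graph a, Graph b)
        {
            var g = Combine(a, b);
            foreach (var o in a.Outputs) foreach (var i in b.Inputs) g.Edges.Add((o, i));
            return g;
        }

        // G1 || G2: union of the two graphs; vertices with the same name are merged.
        public static Graph Merge(Graph a, Graph b) => new Graph
        {
            Inputs  = a.Inputs.Union(b.Inputs).ToList(),
            Outputs = a.Outputs.Union(b.Outputs).ToList(),
            Edges   = a.Edges.Union(b.Edges).ToHashSet(),
        };

        static Graph Combine(Graph a, Graph b) => new Graph
        {
            Inputs  = a.Inputs.ToList(),
            Outputs = b.Outputs.ToList(),
            Edges   = a.Edges.Union(b.Edges).ToHashSet(),
        };
    }

Running this prints the edges of F from the bypass example: every A feeds C, C feeds every B, and each A also feeds its corresponding B directly.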

17 Graph Description Language
A lower-level programming model than SQL Programmers can specify an arbitrary DAG to describe an application’s communication patterns, and express the data transport mechanisms (files, TCP pipes, and shared-memory FIFOs) between the computation vertices In order to get the best performance from a native Dryad application, programmers need to understand the structure of the computation and the organization and properties of the system resources Simpler, higher-level programming models can be built upon Dryad

18 Outline Computational model Dryad architecture Some case studies

19 Dryad System Organization
Job Manager (JM): constructs a job’s execution graph and schedules the work across the available resources. It is the centralized coordinating process; it performs no computation and does not sit on the data path, so it is not a bottleneck. Name Server (NS): keeps track of the position of each computer in the cluster. Daemons (D): a daemon runs on each computer and acts as a proxy through which the JM communicates with the remote vertices and monitors the state of the computation and how much data has been read and written on their channels.

20 Runtime
When all of a vertex’s input channels become ready, a new execution record is created for the vertex and placed in a scheduling queue. The JM consults the NS to discover the list of available computers, and schedules the queued vertices as computers become available, using the daemons as proxies. When an execution record is paired with an available computer, the remote daemon is instructed to run the specified vertex, and during execution the JM receives periodic status updates from the vertex. The job completes when every vertex has completed, and fails otherwise. Scheduling follows a topological sort of the graph; when vertices fail, parts of the subgraph may need to be re-executed because the inputs they require may no longer be available, so the vertices that generated those inputs may have to be re-run. The JM attempts to re-execute a minimal part of the graph to recompute the missing data.
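As a rough illustration of this scheduling loop, here is a hypothetical C# sketch (none of these types exist in Dryad): a vertex enters the scheduling queue once its inputs are complete, and the JM pairs queued vertices with computers obtained from the name server, using the daemons to launch them.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Tiny demo: a three-vertex chain scheduled across two computers.
    var jm = new JobManager(new()
    {
        ["read"] = new(), ["parse"] = new() { "read" }, ["count"] = new() { "parse" }
    });
    jm.Run(() => new[] { "m1", "m2" },
           (computer, vertex) => Console.WriteLine($"{vertex} -> {computer}"));

    // Hypothetical types sketching the behaviour described above, not Dryad's code.
    class JobManager
    {
        readonly Dictionary<string, List<string>> inputsOf;   // vertex -> vertices it reads from
        readonly HashSet<string> completed = new();
        readonly Queue<string> schedulingQueue = new();        // execution records awaiting a computer

        public JobManager(Dictionary<string, List<string>> inputsOf) => this.inputsOf = inputsOf;

        public void Run(Func<IEnumerable<string>> nameServer,  // NS: lists the available computers
                        Action<string, string> daemonRun)      // D: runs a vertex on a remote computer
        {
            EnqueueReadyVertices();
            while (completed.Count < inputsOf.Count)
            {
                foreach (var computer in nameServer())          // consult the name server
                {
                    if (schedulingQueue.Count == 0) break;
                    var vertex = schedulingQueue.Dequeue();     // pair the record with this computer
                    daemonRun(computer, vertex);                // the daemon acts as the JM's proxy
                    completed.Add(vertex);                      // assume success; failures handled separately
                }
                EnqueueReadyVertices();
            }
        }

        // A vertex becomes ready once every one of its input channels (producers) has completed.
        void EnqueueReadyVertices()
        {
            foreach (var v in inputsOf.Keys)
                if (!completed.Contains(v) && !schedulingQueue.Contains(v) &&
                    inputsOf[v].All(completed.Contains))
                    schedulingQueue.Enqueue(v);
        }
    }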

21 Fault Tolerance If A fails, run it again
If A’s inputs are gone, run the upstream vertices again (recursively). If A is slow, run another copy elsewhere and use the output of whichever copy finishes first.
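A minimal sketch of the recursive re-execution rule, with hypothetical names; speculative re-execution of slow vertices is a separate mechanism and is not shown.

    using System;
    using System.Collections.Generic;

    // Illustrative only (hypothetical names): re-run a failed vertex, recursively
    // re-running any upstream vertices whose outputs are no longer available.
    var recovery = new Recovery
    {
        Producers = new() { ["C"] = new() { "B" }, ["B"] = new() { "A" }, ["A"] = new() },
        AvailableOutputs = new() { "A" }            // A's output survived on disk; B's did not
    };
    recovery.ReExecute("C");                        // re-runs B first, then C

    class Recovery
    {
        public Dictionary<string, List<string>> Producers { get; init; } = new();  // vertex -> upstream vertices
        public HashSet<string> AvailableOutputs { get; init; } = new();            // outputs still readable

        public void ReExecute(string vertex)
        {
            foreach (var upstream in Producers.GetValueOrDefault(vertex, new List<string>()))
                if (!AvailableOutputs.Contains(upstream))
                    ReExecute(upstream);            // its inputs are gone: re-run the producers first
            Console.WriteLine($"re-running {vertex}");
            AvailableOutputs.Add(vertex);           // this vertex's output is available again
        }
    }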

22 Outline Computational model Dryad architecture Some case studies

23 Query histogram computation
Input: a log file (n partitions). Extract the queries from the log partitions, re-partition them by hash of the query (into k buckets), and compute a histogram within each bucket.
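For reference, here is a single-machine C# sketch of the same three steps on toy data (the log format and query-extraction rule are invented for illustration); in Dryad each step becomes a set of vertices spread over the n log partitions and k buckets.

    using System;
    using System.Linq;

    // Single-machine sketch of the histogram query: extract, hash-partition, count.
    string[] logLines = { "GET /search?q=dryad", "GET /search?q=linq", "GET /search?q=dryad" };
    int k = 4;                                              // number of hash buckets

    // 1. Extract the query from each log line (toy extraction rule).
    var queries = logLines.Select(line => line.Substring(line.IndexOf("q=") + 2));

    // 2. Re-partition the queries by hash into k buckets.
    var buckets = queries.GroupBy(q => Math.Abs(q.GetHashCode()) % k);

    // 3. Compute the histogram within each bucket (a given query always lands in one bucket).
    foreach (var bucket in buckets)
        foreach (var entry in bucket.GroupBy(q => q))
            Console.WriteLine($"bucket {bucket.Key}: {entry.Key} x {entry.Count()}");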

24 Naïve histogram topology
(Diagram.) Legend: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort. Each of the n Q vertices (one per log partition) is P►S►D; each of the k R vertices is MS►C.

25 Efficient histogram topology
(Diagram.) Legend: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort, M = non-deterministic merge. Each Q' is M►P►S►C, each T is MS►C►D, and each of the k R vertices is MS►C.

26 (Diagram: the refined histogram topology drawn out; legend as on the previous slide — each Q’ is M►P►S►C, each T is MS►C►D, each R is MS►C.)

27 (Diagram, same legend as above.) Inputs are grouped into subsets that are close in network topology (e.g. on the same computer or in the same rack), and each subset is merged by one Q’ vertex; a sketch of this grouping follows.
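A small sketch of this grouping rule, assuming hypothetical rack metadata for each input partition: partitions on the same rack are merged by one Q' vertex, so that ideally only locally aggregated data leaves the rack.

    using System;
    using System.Linq;

    // Illustrative grouping of input partitions by locality (hypothetical rack metadata).
    var inputs = new[] {
        new Partition("log-0", "m01", "rack1"), new Partition("log-1", "m02", "rack1"),
        new Partition("log-2", "m17", "rack2"), new Partition("log-3", "m18", "rack2"),
    };

    // One Q' vertex per rack: it merges (M) the partitions that are close to it.
    foreach (var group in inputs.GroupBy(p => p.Rack))
        Console.WriteLine($"Q' on {group.Key}: merges {string.Join(", ", group.Select(p => p.Name))}");

    record Partition(string Name, string Computer, string Rack);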

28 (Diagram, same legend as above.) The downstream vertex (T) is replicated so that the Q’ outputs can be processed in parallel.

32 Final histogram refinement
(Diagram annotations: stage-by-stage counts of 99,713, 10,405, 450 and 217 vertices/partitions, with data volumes of 10.2 TB, 154 GB, 118 GB and 33.4 GB.) Job totals: 1,800 computers, 43,171 vertices, 11,072 processes, 11.5 minutes.

33 Optimizing Dryad applications
General-purpose refinement rules: processes are formed from subgraphs, computations are re-arranged, and channel I/O types are changed. The application code is not modified; the system makes the optimization choices. High-level front ends (e.g. a SQL query planner) hide this from the user.
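As one illustration of such a refinement (an assumed rule, not Dryad's actual planner), the channel type between two vertices can be chosen from their placement without modifying the application code:

    using System;

    // Illustrative refinement rule: pick a cheaper channel type when endpoints are co-located.
    var producer = new Placement("parse", "m01", 42);
    var consumer = new Placement("count", "m01", 42);
    Console.WriteLine(ChooseChannel(producer, consumer));   // SharedMemoryFifo

    static ChannelType ChooseChannel(Placement p, Placement c) =>
        p.Computer == c.Computer && p.ProcessId == c.ProcessId
            ? ChannelType.SharedMemoryFifo      // same process: keep the data in memory
            : p.Computer == c.Computer
                ? ChannelType.File              // same computer: write and read a local file
                : ChannelType.TcpPipe;          // different computers: send over the network

    enum ChannelType { File, TcpPipe, SharedMemoryFifo }
    record Placement(string Vertex, string Computer, int ProcessId);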

