The DryadLINQ Approach to Distributed Data-Parallel Computing

Name: The DryadLINQ Approach to Distributed Data-Parallel Computing
Uploaded: 2017-08-23T07:02:21+00:00
Duration: PTM22S0
Channel: Richard Eyles
Description: The DryadLINQ Approach to Distributed Data-Parallel Computing

The DryadLINQ Approach to Distributed Data-Parallel Computing
Yuan Yu Microsoft Research Silicon Valley

Distributed Data-Parallel Computing
Dryad talk: the execution layer How to reliably and efficiently execute distributed data-parallel programs on a compute cluster? This talk: the programming model How to write distributed data-parallel programs for a compute cluster?

The Programming Model Sequential, single machine programming abstraction Same program runs on single-core, multi-core, or cluster Preserve the existing programming environments Modern programming languages (C# and Java) are very good Expressive language and data model Strong static typing, GC, generics, … Modern IDEs (Visual Studio and Eclipse) are very good Great debugging and library support Legacy code could be easily reused

Dryad and DryadLINQ DryadLINQ provides automatic query plan generation Dryad provides automatic distributed execution

Outline Programming model DryadLINQ Applications
Discussions and conclusion

LINQ Microsoft’s Language INtegrated Query
Available in .NET3.5 and Visual Studio 2008 A set of operators to manipulate datasets in .NET Support traditional relational operators Select, Join, GroupBy, Aggregate, etc. Integrated into .NET programming languages Programs can invoke operators Operators can invoke arbitrary .NET functions Data model Data elements are strongly typed .NET objects Much more expressive than relational tables For example, nested data structures

LINQ provider interface
LINQ Framework PLINQ Local machine .Net program (C#, VB, F#, etc) Execution engines Query Objects LINQ-to-SQL DryadLINQ LINQ-to-Obj LINQ provider interface Scalability Single-core Multi-core Cluster

A Simple LINQ Query IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies where baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd orderby baby.Year ascending select baby;

A Simple PLINQ Query IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies.AsParallel() where baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd orderby baby.Year ascending select baby;

A Simple DryadLINQ Query
PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>(“BabyInfo.pt”); var results = from baby in babies where baby.Name == queryName && baby.State == queryState && baby.Year >= yearStart && baby.Year <= yearEnd orderby baby.Year ascending select baby;

DryadLINQ Data Model .Net objects Partition Partitioned Table
Partitioned table exposes metadata information type, partition, compression scheme, serialization, etc.

Demo It is just programming
The same familiar programming languages, development tools, libraries, etc.

K-means Execution Graph
ac cc C1 ac ac P1 P2 ac cc C2 P3 ac ac ac cc C3 ac ac

K-means in DryadLINQ public class Vector { public double[] entries; [Associative] public static Vector operator +(Vector v1, Vector v2) { … } public static Vector operator -(Vector v1, Vector v2) { … } public double Norm2() { …} } public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers) { return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c); } public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers) { return vectors.GroupBy(v => NearestCenter(v, centers)) .Select(group => group.Aggregate((x,y) => x + y) / group.Count()); var vectors = PartitionedTable.Get<Vector>("dfs://vectors.pt"); var centers = vectors.Take(100); for (int i = 0; i < 10; i++) { centers = Step(vectors, centers); centers.ToPartitionedTable<Vector>(“dfs://centers.pt”); K-means clustering.

PageRank Execution Graph
cc N01 ae D N11 cc ae D N12 N02 ae D cc N13 N03 cc E1 ae D N21 cc ae D N22 E2 ae D cc N23 E3 cc ae D N31 cc ae D N32 ae D cc N33

PageRank in DryadLINQ public static IQueryable<Rank> Step(IQueryable<Page> pages, IQueryable<Rank> ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank); // re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum()); } public struct Page { public UInt64 name; public Int64 degree; public UInt64[] links; public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; } public Rank[] Disperse(Rank rank) { Rank[] ranks = new Rank[links.Length]; double score = rank.rank / this.degree; for (int i = 0; i < ranks.Length; i++) { ranks[i] = new Rank(this.links[i], score); } return ranks; public struct Rank { public double rank; public Rank(UInt64 n, double r) { name = n; rank = r; } var pages = PartitionedTable.Get<Page>(“dfs://pages.pt”); var ranks = pages.Select(page => new Rank(page.name, 1.0)); // repeat the iterative computation several times for (int iter = 0; iter < n; iter++) { ranks = Step(pages, ranks); } ranks.ToPartitionedTable<Rank>(“dfs://ranks.pt”);

MapReduce in DryadLINQ
MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs { var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs } // Ex: Count the frequencies of words in a book MapReduce(book, line => line.Split(' '), w => w.ToLower(), g => g.Count())

DryadLINQ System Architecture
Client machine Cluster DryadLINQ Dryad .NET program Distributed query plan Query Expr Invoke Query plan Vertex code Input Tables LINQ query Dryad Execution Output Table .Net Objects Results foreach (11) Output Tables

DryadLINQ Distributed execution plan generation Vertex runtime
Static optimizations: pipelining, eager aggregation, etc. Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc. Vertex runtime Single machine (multi-core) implementation of LINQ Vertex code that runs on vertices Data serialization code Callback code for runtime dynamic optimizations Automatically distributed to cluster machines

A Simple Example Count the frequencies of words in a book
var map = book.SelectMany(line => line.Split(' ')); var group = map.GroupBy(w => w.ToLower()); var result = group.Select(g => g.Count());

Naïve Execution Plan ..... … line.Split(' ') M M M D D D g.Count() G G
Map M M M map D D D Distribute Merge MG MG MG g.Count() G G … G GroupBy reduce R R R Reduce X X X Consumer

Execution Plan Using Partial Aggregation
M M M Map g.Count() G1 G1 G1 GroupBy map IR IR IR InitialReduce D D D Distribute g.Sum() Merge MG MG GroupBy G2 G2 aggregation tree Combine C C Merge MG MG GroupBy G3 G3 g.Sum() reduce FinalReduce F F X X Consumer

Challenge: Program Analysis Support
The main sources of difficulty Complicated data model User-defined functions all over the places Requires sophisticated static program analysis at byte-code level Mainly some form of flow analysis Possible with modern programming languages and runtimes, such as C#/CLR

Inferring Dataset Property
Useful for query optimizations Hash(y => y.a+y.b) Select(x => new { A=x.a+x.b, B=f(x.c) } Hash(z => z.A) GroupBy(x => x.A) No data partitioning at GroupBy

Inferring Dataset Property
Useful for query optimizations Hash(y => y.a+y.b) Select(x => Foo(x) } Hash(x => x.A)? Need IL-level data-flow analysis of Foo GroupBy(x => x.A)

Caching Query Results A cluster-wide caching service to support
Reuse of common subqueries Incremental computations Cache: [Key  Value] Key is <q, d>, Value is the result of q(d) Key requires a pretty good reachability analysis For correctness, Key must include “everything” reachable from q For performance, Key should only contain things q depends on

More Static Analysis Purity checking Metadata validity checking
All the functions called in DryadLINQ queries must be side-effect free DryadLINQ (and PLINQ) doesn’t enforce it Metadata validity checking Partitioned table’s metadata contains the record type, partition scheme, serialization functions, … Need to determine if the metadata is valid DryadLINQ doesn’t fully enforce it Static enforcement of program properties for security and privacy mechanisms

Examples of DryadLINQ Applications
Data mining Analysis of service logs for network security Analysis of Windows Watson/SQM data Cluster monitoring and performance analysis Graph analysis Accelerated Page-Rank computation Road network shortest-path preprocessing Image processing Image indexing Decision tree training Epitome computation Simulation light flow simulations for next-generation display research Monte-Carlo simulations for mobile data eScience Machine learning platform for health solutions Astrophysics simulation

Decision Tree Training Mihai Budiu, Jamie Shotton et al
Learn a decision tree to classify pixels in a large set of images label image Decision Tree Machine learning 1M images x 10,000 pixels x 2,000 features x 221 tree nodes Complexity >1020 objects

Sample Execution plan Initial empty tree Image inputs, partitioned
Read, preprocess Redistribute Histogram Regroup histograms on node Compute new tree layer Compute new tree Broadcast new tree Final tree

Application Details Workflow = 37 DryadLINQ jobs
12 hours running time on 235 machines More than 100,000 processes More than 100 days of CPU time Recovers from several failures daily 34,000 lines of .NET code

Windows SQM Data Analysis
Michal Strehovsky, Sivarudrappa Mahesh et al

The Language Integration Approach
Single unified programming environment Unified data model and programming language Direct access to IDE and libraries Different from SQL, HIVE, Pig Latin Multiple layers of languages and data models Works out very well, but requires good programming language supports LINQ extensibility: custom operators/providers .NET reflection, dynamic code generation, …

Combining with PLINQ Query subquery
DryadLINQ subquery PLINQ The combination of PLINQ and DryadLINQ delivers computation to every core in the cluster

Acyclic Dataflow Graph
Acyclic dataflow graph provides a very powerful computation model Easy target for higher-level programming abstractions such as DryadLINQ Easy expression of many data-parallel optimizations We designed Dryad to be general and flexible Programmability is less of a concern Used primarily to support higher-level programming abstractions No major changes made to Dryad in order to support DryadLINQ

Expectation Maximization (Gaussians)
Generated by DryadLINQ 3 iterations shown

Decoupling of Dryad and DryadLINQ
Separation of concerns Dryad layer concerns scheduling and fault-tolerance DryadLINQ layer concerns the programming model and the parallelization of programs Result: powerful and expressive execution engine and programming model Different from the MapReduce/Hadoop approach A single abstraction for both programming model and execution engine Result: very simple, but very restricted execution engine and language

Cluster Services (Azure, HPC, or Cosmos)
Software Stack …… Machine Learning Image Processing Graph Analysis Data Mining eScience Applications DryadLINQ Dryad CIFS/NTFS SQL Servers Azure DFS Cosmos DFS Cluster Services (Azure, HPC, or Cosmos) Windows Server Windows Server Windows Server Windows Server

Conclusion Single unified programming environment
Unified data model and programming language Direct access to IDE and libraries An open and extensible system Many LINQ providers out there Existing ones: LINQ-to-XML, LINQ-to-SQL, PLINQ, … Very easy to write one for your app domain Dryad/DryadLINQ scales out all of them!

Availability Freely available for academic use
DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc. Will be available soon for commercial use Free, but no product support

The DryadLINQ Approach to Distributed Data-Parallel Computing

Similar presentations

Presentation on theme: "The DryadLINQ Approach to Distributed Data-Parallel Computing"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

The DryadLINQ Approach to Distributed Data-Parallel Computing

Similar presentations

Presentation on theme: "The DryadLINQ Approach to Distributed Data-Parallel Computing"— Presentation transcript:

Similar presentations

About project

Feedback