Presentation is loading. Please wait.

Presentation is loading. Please wait.

Distributed Data-Parallel Computing Using a High-Level Programming Language Yuan Yu Michael Isard Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey,

Similar presentations


Presentation on theme: "Distributed Data-Parallel Computing Using a High-Level Programming Language Yuan Yu Michael Isard Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey,"— Presentation transcript:

1 Distributed Data-Parallel Computing Using a High-Level Programming Language Yuan Yu Michael Isard Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey, Úlfar Erlingsson, Dennis Fetterly, Pradeep Kumar Gunda Microsoft Research Silicon Valley

2 The Goal of The Talk From the invitation: “to expose SIGMOD members to work going on in other "parallel" fields of Computer Science that potentially have deep implications for the information management research community.”

3 The 2008 Claremont Report “Designing systems that embrace non-relational data model, rather than shoehorning them into tables; “… the techniques behind parallel and distributed databases---partitioned dataflow and cost-based query optimization---should extend to new environments. “… need to pay attention to the softer issues that capture the hearts and minds of programmers (such as attractive syntax, typing and modularity, development tools, …) “… database research must look beyond its traditional boundaries and find allies throughout computing.”

4 Distributed Data-Parallel Computing Research problem: How to write distributed data-parallel programs for a compute cluster? The DryadLINQ programming model – Sequential, single machine programming abstraction – Same program runs on single-core, multi-core, or cluster – Familiar programming languages – Familiar development environment

5 Dryad and DryadLINQ DryadLINQ provides automatic query plan generation Dryad provides automatic distributed execution

6 Outline Programming model Dryad and DryadLINQ overview Lessons Conclusions

7 LINQ Microsoft’s Language INtegrated Query – Available in.NET3.5 and Visual Studio 2008 A set of operators to manipulate datasets in.NET – Support traditional relational operators Select, Join, GroupBy, Aggregate, etc. – Integrated into.NET programming languages Programs can invoke operators Operators can invoke arbitrary.NET functions Data model – Data elements are strongly typed.NET objects – Much more expressive than relational tables For example, nested data structures

8 DryadLINQ Data Model Partition Partitioned Table.Net objects Partitioned table exposes metadata information – type, partition, compression scheme, serialization, etc.

9 Demo Preserve an existing programming model – The same familiar programming languages, development tools, libraries, etc.

10 An Example: PageRank Ranks web pages using the hyperlink structure, propagating scores along the links Each iteration can be expressed as a SQL query 1.Join pages with ranks 2.Distribute ranks on outgoing edges 3.GroupBy edge destination 4.Aggregate into ranks 5.Repeat

11 One PageRank Step in DryadLINQ // one step of pagerank: dispersing and re-accumulating rank public static IQueryable PRStep(IQueryable pages, IQueryable ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank); // re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum()); }

12 The Complete PageRank Program var pages = PartitionedTable.Get (“dfs://pages.txt”); var ranks = pages.Select(page => new Rank(page.name, 1.0)); // repeat the iterative computation several times for (int iter = 0; iter < n; iter++) { ranks = PRStep(pages, ranks); } ranks.ToPartitionedTable (“dfs://outputranks.txt”); public struct Page { public UInt64 name; public Int64 degree; public UInt64[] links; public Page(UInt64 n, Int64 d, UInt64[] l) { name = n; degree = d; links = l; } public Rank[] Disperse(Rank rank) { Rank[] ranks = new Rank[links.Length]; double score = rank.rank / this.degree; for (int i = 0; i < ranks.Length; i++) { ranks[i] = new Rank(this.links[i], score); } return ranks; } } public struct Rank { public UInt64 name; public double rank; public Rank(UInt64 n, double r) { name = n; rank = r; } } public static IQueryable PRStep(IQueryable pages, IQueryable ranks) { // join pages with ranks, and disperse updates var updates = from page in pages join rank in ranks on page.name equals rank.name select page.Disperse(rank); // re-accumulate. return from list in updates from rank in list group rank.rank by rank.name into g select new Rank(g.Key, g.Sum()); }

13 Multi-Iteration PageRank pagesranks Iteration 1 Iteration 2 Iteration 3 Memory FIFO

14 Dryad System Architecture Files, TCP, FIFO Job 1 data plane control plane PD V VV job manager cluster Job 1 : v 11, v 12, … Job 2 : v 21, v 22, … Job 3 : … scheduler New jobs

15 Dryad Provides a general, flexible execution layer – Dataflow graph as the computation model Can be modified by runtime optimizations – Higher language layer supplies graph, vertex code, serialization code, hints for data locality, … Automatically handles distributed execution – Distributes code, routes data – Schedules processes on machines near data – Masks failures in cluster and network – Fair scheduling of concurrent jobs

16 Immutable Input Assumes that inputs are immutable – High performance: Scales out to shared-nothing clusters made up of thousands of machines – Significantly simplifies the design/implementation Simple fault-tolerant story No need to handle complex transaction and synchronization – Good for processing largely static datasets – Not suitable for fine-grain, frequent updates

17 Dryad DryadLINQ System Architecture DryadLINQ Client machine (11) Distributed query plan.NET program Query Expr Cluster Output Tables Results Input Tables Invoke Query plan Output Table Dryad Execution.Net Objects ToTable foreach Vertex code

18 DryadLINQ Distributed execution plan generation – Static optimizations: pipelining, eager aggregation, etc. – Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc. Vertex runtime – Single machine (multi-core) implementation of LINQ – Vertex code that runs on vertices – Channel serialization code – Callback code for runtime dynamic optimizations – Automatically distributed to cluster machines

19 Lessons Acyclic dataflow graph is a powerful computation model Language integration is amazingly successful Leverage decades of database research Decoupling of Dryad and DryadLINQ worked out well

20 Acyclic Dataflow Graph Acyclic dataflow graph provides a very powerful computation model – Easy target for higher-level programming abstractions such as DryadLINQ – Easy expression of many data-parallel optimizations We designed Dryad to be general and flexible – Programmability is less of a concern – Used primarily to support higher-level programming abstractions – We haven’t modified Dryad in order to support DryadLINQ

21 Expectation Maximization (Gaussians) 21 Generated by DryadLINQ 3 iterations shown

22 The Language Integration Approach Single unified programming environment – Unified data model and programming language – Direct access to IDE and libraries Simpler than SQL programming – As easy for simple queries – Easier to use for even moderately complex queries No embedded languages Requires good programming language supports – LINQ extensibility: custom operators/providers –.NET reflection, dynamic code generation, …

23 LINQ Framework PLINQ Local machine.Net program (C#, VB, F#, etc) Execution engines Query Objects LINQ-to-SQL DryadLINQ LINQ-to-XML LINQ provider interface Scalability Single-core Multi-core Cluster Extremely open and extensible

24 Combining with PLINQ 24 Query DryadLINQ PLINQ subquery The combination of PLINQ and DryadLINQ delivers computation to every core in the cluster

25 Leverage Database Research Example: MapReduce written in DryadLINQ MapReduce(source, // sequence of Ts mapper, // T -> Ms keySelector, // M -> K reducer) // (K, Ms) -> Rs { var map = source.SelectMany(mapper); var group = map.GroupBy(keySelector); var result = group.SelectMany(reducer); return result; // sequence of Rs }

26 But, Not So Easy The main sources of difficulty – Much more complicated data model – User-defined functions all over the places Requires sophisticated program analysis techniques – Possible with modern programming languages and runtimes, such as C#/CLR

27 Decoupling of Dryad and DryadLINQ Separation of concerns – Dryad layer concerns scheduling and fault-tolerance – DryadLINQ layer concerns the programming model and the parallelization of programs – Result: powerful and expressive execution engine and programming model Different from the MapReduce/Hadoop approach – A single abstraction for both programming model and execution engine – Result: very simple, but very restricted execution engine and language

28 Image Processing Cosmos DFSSQL Servers Software Stack 28 Windows Server Cluster Services (Azure, HPC, or Cosmos) Azure DFS Dryad DryadLINQ Windows Server Other Languages CIFS/NTFS Machine Learning Graph Analysis Data Mining Applications … Other Applications

29 Availability Freely available for academic use – Dryad in binary, DryadLINQ in source – Will release Dryad source in the future Coming soon to Microsoft commercial partners – Free, but no product support

30 Conclusions Goal: Use a compute cluster as if it is a single computer – Dryad/DryadLINQ represent a significant step Requires close collaborations across many fields of computing, including – Distributed systems – Distributed and parallel databases – Programming language design and analysis

31 Dryad/DryadLINQ Papers 1.Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (EuroSys’07)Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks 2.DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (OSDI’08)DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language 3.Distributed Data-Parallel Computing Using a High-Level Programming Language (SIGMOD’09)Distributed Data-Parallel Computing Using a High-Level Programming Language 4.Quincy: Fair scheduling for distributed computing clusters (SOSP’09)Quincy: Fair scheduling for distributed computing clusters 5.Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations (SOSP’09)Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations


Download ppt "Distributed Data-Parallel Computing Using a High-Level Programming Language Yuan Yu Michael Isard Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey,"

Similar presentations


Ads by Google