
1 From LINQ to DryadLINQ
Michael Isard
Workshop on Data-Intensive Scientific Computing Using DryadLINQ

2 Overview
From sequential code to parallel execution
Dryad fundamentals
Simple program example, plan for practicals

3 Distributed computation
Single computer, shared memory
– All objects are always available for read and write
Cluster of workstations
– Each computer sees only a subset of the objects
– Writes on one computer must be explicitly shared with the others
The system automatically handles this complexity
– But it needs some help

4 Data-parallel computation
LINQ is a high-level, declarative specification: the same action applied to an entire collection of objects
set.Select(x => f(x))
– Compute f(x) on each x in set, independently
set.GroupBy(x => key(x))
– Group records by unique keys, independently
set.OrderBy(x => key(x))
– Sort the whole set (the system chooses how)
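For reference, here is a minimal sketch of these three operators in ordinary LINQ-to-Objects (not DryadLINQ); the collection and the key selectors are illustrative, not taken from the talk.

    using System;
    using System.Linq;

    class LinqBasics
    {
        static void Main()
        {
            var set = new[] { 3, 1, 4, 1, 5, 9, 2, 6 };

            var squared  = set.Select(x => x * x);     // apply f(x) to each element, independently
            var byParity = set.GroupBy(x => x % 2);    // one group per distinct key
            var sorted   = set.OrderBy(x => x);        // total order over the whole set

            Console.WriteLine(string.Join(",", sorted));   // 1,1,2,3,4,5,6,9
        }
    }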

5 Distributed cluster computing
The dataset is stored on the local disks of the cluster.
[diagram: the collection set divided into partitions set.0 … set.7]

6 Distributed cluster computing
The dataset is stored on the local disks of the cluster.
[diagram: partitions set.0 … set.7 placed on individual cluster machines]

7 Simple distributed computation
var set2 = set.Select(x => f(x))
[diagram: input collection set and output collection set2]

8 Simple distributed computation
var set2 = set.Select(x => f(x))
[diagram: input partitions set.0 … set.7 and output partitions set2.0 … set2.7]

9 Simple distributed computation
var set2 = set.Select(x => f(x))
[diagram: f applied to each partition set.0 … set.7 independently, producing set2.0 … set2.7]

10 Simple distributed computation
var set2 = set.Select(x => f(x))
[diagram: as slide 9, with every partition processed in parallel]
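As a hedged illustration of why Select parallelizes this way, the following LINQ-to-Objects sketch splits a set into eight partitions, applies f to each partition independently, and checks that the concatenated result matches applying f to the whole set; the partition count and the choice of f are made up for the example.

    using System;
    using System.Linq;

    class PerPartitionSelect
    {
        static void Main()
        {
            var set = Enumerable.Range(0, 16).ToArray();
            Func<int, int> f = x => 2 * x + 1;

            // Split into eight partitions, as in the set.0 … set.7 diagram.
            var partitions = Enumerable.Range(0, 8)
                .Select(i => set.Where((_, idx) => idx % 8 == i).ToArray());

            // Apply f to every partition independently (each would run on its own computer).
            var set2 = partitions.SelectMany(part => part.Select(f));

            Console.WriteLine(set2.OrderBy(x => x)
                                  .SequenceEqual(set.Select(f).OrderBy(x => x)));   // True
        }
    }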

11 Directed acyclic graph
Computation reads and writes along the edges
The graph exposes parallelism via independence
Goals of the DryadLINQ optimizer:
– Extract parallelism (find independent work)
– Control data skew (balance work across nodes)
– Limit cross-computer data transfer

12 Distributed grouping
var groups = set.GroupBy(x => x.key)
set is a collection of records, each with a key
We don't know in advance which keys are present
– Or in which partitions they appear
First, reorganize the data
– All records with the same key end up on the same computer
Then the final grouping can be done in parallel

13 Distributed grouping
var groups = set.GroupBy(x => x.key)
[diagram: set → hash partition by key → group locally → groups]

14 Distributed grouping
var groups = set.GroupBy(x => x.key)
[diagram: records with the same key collected on one computer, then grouped locally into groups]
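A hedged sketch of the same two-stage plan in ordinary LINQ-to-Objects: records are first hash-partitioned by key, so every record with a given key lands in one partition, and each partition is then grouped locally; the records and the partition count are illustrative.

    using System;
    using System.Linq;

    class DistributedGroupBy
    {
        static void Main()
        {
            var set = new[] { "a", "c", "a", "d", "d", "b", "b", "a" };
            const int parts = 4;

            // Stage 1: hash partition by key (the cross-computer data movement).
            var partitions = set.GroupBy(x => Math.Abs(x.GetHashCode()) % parts);

            // Stage 2: group locally inside each partition; because all records with
            // the same key share a partition, each key appears in exactly one group.
            var groups = partitions.SelectMany(part => part.GroupBy(x => x));

            foreach (var g in groups)
                Console.WriteLine("{0}: {1}", g.Key, g.Count());
        }
    }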

15 Distributed sorting
var sorted = set.OrderBy(x => x.key)
[diagram: set → sample → compute histogram → range partition by key → sort locally → sorted]

16 Distributed sorting
var sorted = set.OrderBy(x => x.key)
[diagram: the sampled histogram yields key ranges such as [1,1] and [2,100]; records are range-partitioned by key]

17 Distributed sorting
var sorted = set.OrderBy(x => x.key)
[diagram: each range partition is sorted locally, and the partitions together form the fully sorted output]
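A hedged sketch of this sorting plan in ordinary LINQ-to-Objects: sample the data to pick range boundaries (the histogram step), range-partition every record, sort each partition locally, and concatenate the partitions in range order; the sample rate and the use of only two ranges are illustrative.

    using System;
    using System.Linq;

    class DistributedOrderBy
    {
        static void Main()
        {
            var rng = new Random(0);
            var set = Enumerable.Range(0, 1000).Select(_ => rng.Next(1, 101)).ToArray();

            // Sample, then choose a boundary key from the sample's median.
            var sample = set.Where((_, i) => i % 10 == 0).OrderBy(x => x).ToArray();
            int boundary = sample[sample.Length / 2];

            // Range partition by key, then sort each partition locally.
            var low  = set.Where(x => x <  boundary).OrderBy(x => x);
            var high = set.Where(x => x >= boundary).OrderBy(x => x);

            // Concatenating the sorted ranges gives a totally sorted result.
            Console.WriteLine(low.Concat(high).SequenceEqual(set.OrderBy(x => x)));   // True
        }
    }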

18 Additional optimizations
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: set → hash partition by key → group locally → count → histogram]

19 Additional optimizations
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: records hash-partitioned by key, so all a's, b's, and d's reach the same computers]

20 Additional optimizations
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: each computer groups and counts its keys, producing the histogram a,6 b,6 d,4]

21
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: set → group locally → combine counts (e.g. a,2 b,2 per partition) → hash partition by key → group locally → combine counts → histogram]

22
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: the per-partition counts such as a,2 b,2 are hash-partitioned by key, so only small count records cross the network]

23
var histogram = set.GroupBy(x => x.key).Select(x => {x.Key, x.Count()})
[diagram: the partial counts for each key are combined into the final histogram a,6 b,6 d,4]
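A hedged sketch of this combiner optimization in ordinary LINQ-to-Objects: each input partition is first reduced to local (key, count) pairs, so only the small partial counts would cross the network, and the partial counts are then combined per key; the input partitions below are illustrative, chosen to reproduce the a,6 b,6 d,4 result on the slide.

    using System;
    using System.Linq;

    class DistributedHistogram
    {
        static void Main()
        {
            var partitions = new[]
            {
                new[] { "a", "b", "b", "a" },
                new[] { "a", "d", "a", "d" },
                new[] { "b", "d", "b", "d" },
                new[] { "a", "b", "b", "a" },
            };

            // Local combine: group and count inside each partition, next to the data.
            var localCounts = partitions.SelectMany(
                part => part.GroupBy(x => x)
                            .Select(g => new { g.Key, Count = g.Count() }));

            // After the hash partition: combine the partial counts for each key.
            var histogram = localCounts.GroupBy(c => c.Key)
                                       .Select(g => new { g.Key, Count = g.Sum(c => c.Count) });

            foreach (var h in histogram)
                Console.WriteLine("{0},{1}", h.Key, h.Count);   // a,6  b,6  d,4
        }
    }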

24 What Dryad does
Abstracts cluster resources
– The set of computers, network topology, etc.
Schedules the DAG: chooses which cluster computers to use
– Fairly among competing jobs
– So that computation runs close to its data
Recovers from transient failures
– Reruns computations after a machine or network fault
– Runs speculative duplicates of slow computations

25 Resources are virtualized
Each graph node is a process
– Writes its outputs to disk
– Reads its inputs from upstream nodes' output files
The graph is generally larger than the cluster
– e.g. a 1 TB input at 250 MB per partition is about 4,000 parts
The cluster is shared
– Don't size the program for an exact cluster
– Use whatever share of resources is available

26 What controls parallelism
Initially, parallelism is determined by the partitioning of the inputs
After a data reorganization, the system or the user decides

27 DryadLINQ-specific operators
set = PartitionedTable.Get<T>(uri)
set.ToPartitionedTable(uri)
set.HashPartition(x => f(x), numberOfParts)
set.AssumeHashPartition(x => f(x))
[Associative] f(x) { … }
RangePartition(…), Apply(…), Fork(…)
[Decomposable], [Homomorphic], [Resource]
Field mappings, multiple partitioned tables, …
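A hedged usage sketch of a few of these operators: the angle brackets on the original slide were stripped by the web page, so the generic parameter and the LineRecord record type are assumptions, and the key selector, partition count, and output URI are illustrative rather than taken from the talk.

    using System.Linq;
    using LinqToDryad;

    class OperatorSketch
    {
        static void Run()
        {
            // Read a partitioned table from the cluster store.
            var set = PartitionedTable.Get<LineRecord>(@"tidyfs://datasets/Count/inputfile1.pt");

            // Explicitly repartition the records by a key before further processing
            // (key selector and partition count are made up for the example).
            var repartitioned = set.HashPartition(x => x.GetHashCode(), 100);

            // Materialize the result as a new partitioned table (output URI is made up).
            repartitioned.ToPartitionedTable(@"tidyfs://datasets/Count/output.pt");
        }
    }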

28
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            // The generic type parameters were stripped by the web page; LineRecord
            // (the record type used in the workshop samples) is assumed here.
            PartitionedTable<LineRecord> table = PartitionedTable.Get<LineRecord>(inputUri);
            Console.WriteLine("Lines: {0}", table.Count());
            Console.ReadKey();
        }
    }
}

29 Form into groups
9 groups, one MSRI member per group
Try to pick a common interest for the project later

30
sherwood-246 — sherwood-253, sherwood-255
Samples (Count, Points, Robots): d:\dryad\data\Workshop\DryadLINQ\samples
Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe
TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe

