Presentation is loading. Please wait.

Presentation is loading. Please wait.

Microsoft DryadLINQ --Jinling Li. What’s DryadLINQ? A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. [1]

Similar presentations


Presentation on theme: "Microsoft DryadLINQ --Jinling Li. What’s DryadLINQ? A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. [1]"— Presentation transcript:

1 Microsoft DryadLINQ --Jinling Li

2 What’s DryadLINQ? A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. [1] Data-Parallel Computing (large data example)large data example High-Level Language DryadLINQ=Dryad+LINQ Dryad LINQ

3 Example Astronomers use the Sloan Digital Sky Survey to investigate problems such as the distribution of “dark matter” around distant galaxies. The current data set—SDSS Data Release 7— covers more than a quarter of the sky and contains more than 50 TB of data representing 357 million unique objects. [1]

4 DryadLINQ=Dryad+LINQ Figure source: [1]

5 Outline Dryad LINQ DryadLINQ DryadLINQ in Machine Learning Strengths and Weaknesses

6 Dryad Microsoft Dryad is a high-performance, general-purpose distributed computing engine that handles some of the most difficult aspects of cluster-based distributed computing. [2] Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. [2] A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming. [2]

7 Dryad System Architecture The job manager contains the application-specific code to construct the job’s communication graph along with library code to schedule the work across the available resources. [2] The name server is used to enumerate all the available computers. The name server also exposes the position of each computer within the network topology. [2] Figure source:[2]

8 Dryad System Architecture The job manager (JM) consults the name server (NS) to discover the list of available computers. It maintains the job graph and schedules running vertices (V) as computers become available using the daemon (D) as a proxy. [2] The first time a vertex (V) is executed on a computer its binary is sent from the job manager to the daemon and subsequently it is executed from a cache. [2] Figure source: [2]

9 Dryad Computational Model The basic computational model for Dryad is the directed- acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. [3] Figure source: [3]

10 Software Stack Dryad is mostly used as middleware below a high-level language layer and low-level internal cluster infrastructure. [3] Figure source:[3]

11 Below Dryad Below Dryad is a cluster-management system that supports some low-level actions like starting a process on a remote computer, and one or more distributed storage systems that support partitioned files. [4] Most of the Dryad development has been done on top of a Microsoft internal cluster infrastructure called Cosmos that was developed by the Bing product group. [4]

12 Outline Dryad Introduction Computational Model System Architecture Software Stack LINQ DryadLINQ DryadLINQ in Machine Learning Strengths and Weaknesses

13 LINQ LINQ=Language Integrated Queries A Microsoft.NET Framework component that adds native data querying capabilities to.NET languages Comprises a set of operators to manipulate collections of.Net objects and bridges the gap between the world of objects and the world of data. [5] Figure source: [5]

14 LINQ LINQ adds high level declarative data manipulation to many of the.NET programming languages, including C#, Visual Basic and F#. [5] LINQ datasets are.NET collections. Technically, a.NET collection of values of type T is a data type which implements the predefined interface IEnumerable [5] Another interface IQueryable represents a query (i.e., a computation) that can produce a collection with elements of type T. [5]

15 LINQ operators OperationExampleResult WhereC.Where(x=>x>3)(4,5) SelectC.Select(x=>x+1)(2,3,4,5,6) AggregateC.Aggregate((x,y)=>x+y)15 GroupByC.GroupBy(x=>x%2)((1,3,5),(2,4)) OrderByC.OrdrBy(x=>-x)(5,4,3,2,1) SelectManyC.SelectMany(x=>Factors(x))(1,1,2,1,3,1,2,4,1,5) JoinC.Join(C, x=>x, x=>x-4, (x,y)=>x+y)(6) Examples using LINQ operators on collection C={1,2,3,4,5}. Factors is a user defined function. Table source: [5]

16 A Simple Example: Word Count Count word frequency in a set of documents [6] var documents = GetDocuments(); var words = documents.SelectMany (document => document.Words); var groups = words.GroupBy(word=>word); var counts = groups.Select (group => new WordCount(group.Key, group.Count()));

17 Outline Dryad LINQ Introduction Interfaces Operators DryadLINQ DryadLINQ in Machine Learning Strengths and Weaknesses

18 DryadLINQ DryadLINQ bridges the gap between Dryad and LINQ layer. DryadLINQ translates programs written in LINQ into Dryad job execution plans that can be executed on a cluster by Dryad, and transparently returns the results to the host application [5]. Figure source: [5]

19 Example: LINQ operators Figure source: [5]

20 2D Piping The Dryad job execution plans generated by DryadLINQ are composable: the output of one graph can become the input of another one. In fact, this is exactly how complex LINQ queries are translated: each operator is translated to a graph independently, and the graphs are then concatenated. [7] Figure source: [7]

21 A Simple Example: Word Count Count word frequency in a set of documents [6] : var documents = GetDocuments(); var words = documents.SelectMany (document => document.Words); var groups = words.GroupBy(word=>word); var counts = groups.Select (group => new WordCount(group.Key, group.Count()));

22 Word Count in DryadLINQ Count word frequency in a set of documents [6] : var documents = DryadLinq.GetTable (“file://docs.txt”); var words = documents.SelectMany (document => document.Words); var groups = words.GroupBy(word=>word); var counts = groups.Select (group => new WordCount(group.Key, group.Count()));

23 Distributed Execution of Word Count Figure source: [6]

24 Another Example: extract Ulfar’s favorite web pages from many web log files

25 DryadLINQ Execution Overview Figure source: [2]

26 DryadLINQ=Dryad+LINQ Figure source: [8]

27 Outline Dryad LINQ DryadLINQ Introduction DryadLINQ = Compiles LINQ to Dryad LINQ operators and other examples DryadLINQ in Machine Learning Strengths and Weaknesses

28 DryadLINQ in Machine Learning

29 Real-life Application: XBox Figure source: [9]

30 Example: k-means [5] K-means in LINQ

31 K-means in DryadLINQ [5] How to implement the GroupBy operation at the heart of the k-means aggregation? DryadLINQ generates a job execution plan that uses two-level aggregation: each computer builds local groups with the local data and only sends the aggregated information about these groups to the next stage; the next stage computes the actual centroid. [5]

32 K-means in DryadLINQ Figure source: [5]

33 Example: Decision Tree [5] Represent a decision tree with a dictionary that maps tree node indices (integer values) to attribute indices in the attribute array The most common algorithm to induce a decision tree starts from an empty tree and a set of records with class labels and attributes with values. The algorithm repeatedly extends the tree by grouping records by their current location under the partial tree, and for each such group determining the attribute resulting in the greatest reduction in conditional entropy. records.GroupBy(record => TreeWalk(record, tree)).Select(group => FindBestAttribute(group));

34 Decision Tree Induction in DryadLINQ [5]

35 Decision Tree Induction in DryadLINQ Each iteration through the loop invokes a query returning the list of attribute indices that are best for each of the leaves in the old tree. the tree variable is updated on the client computer, and retransmitted to the cluster by DryadLINQ with each iteration. [ 5]

36 Decision Tree Induction in DryadLINQ [5] Figure source: [5]

37 Example: Singular Value Decomposition [5] The Singular Value Decomposition (SVD) lies at the heart of several large scale data analyses: principal components analysis, collaborative filtering, image segmentation, among many others. [ 5] The SVD of a n*m matrix A is a decomposition such that U and V are both orthonormal And is a diagonal matrix with non-negative entries. [ 5] Example

38 Singular Value Decomposition in DryadLINQ [5]

39

40 Singular Value Decomposition in DryadLINQ Figure source: [5]

41 Outline Dryad LINQ DryadLINQ DryadLINQ in Machine Learning K-means Decision Tree Singular Value Decomposition Strengths and Weaknesses

42 Strengths & Weaknesses DryadLINQ has the following features [8] : Declarative programming Automatic parallelization Integration with Visual Studio Integration with.Net Job graph optimizations Conciseness

43 Strengths & Weaknesses While DryadLINQ is a great tool to program clusters, there is a price to pay too for the convenience that it provides [5]. Efficiency managed code (C#) is not always as efficient as native code Debugging the experience of debugging a cluster program remains more painful than debugging a single-computer program Transparency In most cases one needs to have some understanding of the operation of the compiler and particularly of the job execution plans generated to avoid egregious mistakes

44 You can have it! Dryad+DryadLINQ available for download Academic license Commercial evaluation license Runs on Windows HPC platform

45 Conclusion Figure source: [1]

46 Thank you!

47 Reference [1] Microsoft Research webpage: Dryad and DryadLINQ for Data Intensive Research http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx [2] DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language Yuan Yu et al. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language [3] Microsoft Research webpage: Dryad http://research.microsoft.com/en- us/projects/dryad/ [4] Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis FetterlyEuropean Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks [5] Large-Scale Machine Learning using DryadLINQ, chapter in Scaling Up Machine Learning, Frank McSherry, Yuan Yu, Mihai Budiu, Michael Isard, and Dennis Fetterly, Cambridge University Press, December 2011 [6] DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Presentation by Yuan Yu at OSDI, December, 2008DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing [7] Cluster Computing with DryadLINQ Presentation by Mihai Budiu at Palo Alto Research Center CSL Colloquium, Palo Alto, CA May 8, 2008Cluster Computing with DryadLINQ [8] Microsoft Research webpage: DryadLINQ http://research.microsoft.com/en- us/projects/dryadlinq/ [9] A Machine-Learning toolking in DryadLINQ Presentation slides in PowerPoint by Mihai Budiu and Kannan Achan.A Machine-Learning toolking in DryadLINQ

48 Parallel Runtimes – DryadLINQ vs. Hadoop

49 Expirement by Indiana University Bloomington


Download ppt "Microsoft DryadLINQ --Jinling Li. What’s DryadLINQ? A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. [1]"

Similar presentations


Ads by Google