Microsoft DryadLINQ
Jinling Li

Presentation transcript:

What’s DryadLINQ?
A system for general-purpose distributed data-parallel computing using a high-level language. [1]
Data-parallel computing (see the large data example)
High-level language
DryadLINQ = Dryad + LINQ

Example
Astronomers use the Sloan Digital Sky Survey to investigate problems such as the distribution of “dark matter” around distant galaxies. The current data set, SDSS Data Release 7, covers more than a quarter of the sky and contains more than 50 TB of data representing 357 million unique objects. [1]

DryadLINQ=Dryad+LINQ Figure source: [1]

Outline: Dryad; LINQ; DryadLINQ; DryadLINQ in Machine Learning; Strengths and Weaknesses

Dryad
Microsoft Dryad is a high-performance, general-purpose distributed computing engine that handles some of the most difficult aspects of cluster-based distributed computing. [2]
Dryad is an infrastructure that allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. [2]
A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming. [2]

Dryad System Architecture
The job manager contains the application-specific code to construct the job’s communication graph, along with library code to schedule the work across the available resources. [2]
The name server is used to enumerate all the available computers. It also exposes the position of each computer within the network topology. [2]
Figure source: [2]

Dryad System Architecture
The job manager (JM) consults the name server (NS) to discover the list of available computers. It maintains the job graph and schedules running vertices (V) as computers become available, using the daemon (D) as a proxy. [2]
The first time a vertex (V) is executed on a computer, its binary is sent from the job manager to the daemon; on subsequent executions it is run from a cache. [2]
Figure source: [2]

Dryad Computational Model
The basic computational model for Dryad is the directed acyclic graph (DAG). Each node in the graph is a computation, and each edge in the graph is a stream of data traveling in the direction of the edge. [3]
Figure source: [3]
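The model above can be mimicked in a single process. The following is a minimal C# sketch under stated assumptions, not Dryad's API: Vertex, JobGraph, Run and BuildExample are hypothetical names, and in-memory lists stand in for the data streams on the edges. Each vertex is evaluated only after all of its inputs, which is what the acyclic structure makes possible.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical single-process model of a job graph; not a Dryad API.
public sealed class Vertex
{
    public string Name { get; }
    // The computation at this node: consumes one stream per incoming edge.
    public Func<IReadOnlyList<IEnumerable<int>>, IEnumerable<int>> Compute { get; }
    public List<Vertex> Inputs { get; } = new List<Vertex>();

    public Vertex(string name,
                  Func<IReadOnlyList<IEnumerable<int>>, IEnumerable<int>> compute,
                  params Vertex[] inputs)
    {
        Name = name;
        Compute = compute;
        Inputs.AddRange(inputs);
    }
}

public static class JobGraph
{
    // Evaluates v after all of its inputs, caching finished vertices so that
    // a vertex shared by several consumers runs only once.
    public static List<int> Run(Vertex v, Dictionary<Vertex, List<int>> done = null)
    {
        if (done == null) done = new Dictionary<Vertex, List<int>>();
        if (done.TryGetValue(v, out var cached)) return cached;

        var inputs = v.Inputs.Select(i => (IEnumerable<int>)Run(i, done)).ToList();
        var output = v.Compute(inputs).ToList();
        done[v] = output;
        return output;
    }

    // Two input partitions, one squaring vertex per partition, one summing vertex.
    public static Vertex BuildExample()
    {
        var readA = new Vertex("readA", _ => new[] { 1, 2, 3 });
        var readB = new Vertex("readB", _ => new[] { 4, 5 });
        var squareA = new Vertex("squareA", ins => ins[0].Select(x => x * x), readA);
        var squareB = new Vertex("squareB", ins => ins[0].Select(x => x * x), readB);
        return new Vertex("sum", ins => new[] { ins.SelectMany(s => s).Sum() }, squareA, squareB);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Run(BuildExample()))); // 55
    }
}
```

The two squaring vertices do not depend on each other, so a scheduler would be free to run them on different machines; this sketch simply runs them one after the other.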

Software Stack
Dryad is mostly used as middleware: it sits below a high-level language layer and above the low-level internal cluster infrastructure. [3]
Figure source: [3]

Below Dryad
Below Dryad is a cluster-management system that supports some low-level actions, such as starting a process on a remote computer, and one or more distributed storage systems that support partitioned files. [4]
Most of the Dryad development has been done on top of a Microsoft internal cluster infrastructure called Cosmos, which was developed by the Bing product group. [4]

Outline: Dryad (Introduction, Computational Model, System Architecture, Software Stack); LINQ; DryadLINQ; DryadLINQ in Machine Learning; Strengths and Weaknesses

LINQ
LINQ = Language Integrated Query
A Microsoft .NET Framework component that adds native data querying capabilities to .NET languages.
It comprises a set of operators to manipulate collections of .NET objects and bridges the gap between the world of objects and the world of data. [5]
Figure source: [5]

LINQ
LINQ adds high-level declarative data manipulation to many of the .NET programming languages, including C#, Visual Basic and F#. [5]
LINQ datasets are .NET collections. Technically, a .NET collection of values of type T is a data type that implements the predefined interface IEnumerable<T>. [5]
Another interface, IQueryable<T>, represents a query (i.e., a computation) that can produce a collection with elements of type T. [5]
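The difference between the two interfaces can be seen with the standard in-memory LINQ provider. The sketch below uses only calls from System.Linq; the class and method names (QueryInterfaces, Demo) are made up for illustration. It wraps a collection as a query, shows that the query carries a description of the computation (an expression tree), and then runs it to produce a collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class QueryInterfaces
{
    public static (string Operator, List<int> Results) Demo()
    {
        // A collection: values that can simply be enumerated.
        IEnumerable<int> collection = new List<int> { 1, 2, 3, 4, 5 };

        // A query: a computation that can produce a collection when executed.
        IQueryable<int> query = collection.AsQueryable().Where(x => x > 3);

        // The query exposes its own structure; the outermost node is the Where call.
        var call = (MethodCallExpression)query.Expression;

        // ToList() executes the query.
        return (call.Method.Name, query.ToList());
    }

    public static void Main()
    {
        var (op, results) = Demo();
        Console.WriteLine(op);                        // Where
        Console.WriteLine(string.Join(",", results)); // 4,5
    }
}
```

Because a query is inspectable data rather than opaque code, a provider can analyze it before anything runs, which is the kind of hook a system that translates LINQ programs into execution plans needs.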

LINQ operators
Operation    Example                                        Result
Where        C.Where(x => x > 3)                            (4,5)
Select       C.Select(x => x + 1)                           (2,3,4,5,6)
Aggregate    C.Aggregate((x, y) => x + y)                   15
GroupBy      C.GroupBy(x => x % 2)                          ((1,3,5),(2,4))
OrderBy      C.OrderBy(x => -x)                             (5,4,3,2,1)
SelectMany   C.SelectMany(x => Factors(x))                  (1,1,2,1,3,1,2,4,1,5)
Join         C.Join(C, x => x, x => x - 4, (x, y) => x + y) (6)
Examples using LINQ operators on collection C = {1,2,3,4,5}. Factors is a user-defined function. Table source: [5]
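The rows of the table can be reproduced with ordinary LINQ to Objects. In the sketch below, LinqOperators and Show are illustrative names, and the body of Factors is an assumption: the source only says it is user-defined, and returning the positive divisors of x in ascending order is one definition that matches the SelectMany row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqOperators
{
    // Assumed definition: the positive divisors of x, in ascending order.
    public static IEnumerable<int> Factors(int x) =>
        Enumerable.Range(1, x).Where(d => x % d == 0);

    // Formats a sequence the way the table does, e.g. (4,5).
    public static string Show(IEnumerable<int> xs) => "(" + string.Join(",", xs) + ")";

    public static void Main()
    {
        int[] C = { 1, 2, 3, 4, 5 };

        Console.WriteLine(Show(C.Where(x => x > 3)));              // (4,5)
        Console.WriteLine(Show(C.Select(x => x + 1)));             // (2,3,4,5,6)
        Console.WriteLine(C.Aggregate((x, y) => x + y));           // 15
        Console.WriteLine("(" + string.Join(",",
            C.GroupBy(x => x % 2).Select(g => Show(g))) + ")");    // ((1,3,5),(2,4))
        Console.WriteLine(Show(C.OrderBy(x => -x)));               // (5,4,3,2,1)
        Console.WriteLine(Show(C.SelectMany(x => Factors(x))));    // (1,1,2,1,3,1,2,4,1,5)
        // Join pairs an outer x with an inner y whenever x == y - 4; only (1, 5) matches.
        Console.WriteLine(Show(C.Join(C, x => x, x => x - 4, (x, y) => x + y))); // (6)
    }
}
```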

A Simple Example: Word Count
Count word frequency in a set of documents [6]:
var documents = GetDocuments();
var words = documents.SelectMany(document => document.Words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(group => new WordCount(group.Key, group.Count()));
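To run the word-count query on a single machine, the pieces the slide leaves undefined have to be filled in. In this sketch the Document class, its Words property (splitting on spaces), the WordCount class and the sample data returned by GetDocuments are all assumptions made for illustration; only the three query lines come from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape of a document; the slide only shows that it has a Words member.
public sealed class Document
{
    public string Text { get; }
    public Document(string text) { Text = text; }

    public IEnumerable<string> Words =>
        Text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}

// Assumed result type: a word and the number of times it occurs.
public sealed class WordCount
{
    public string Word { get; }
    public int Count { get; }
    public WordCount(string word, int count) { Word = word; Count = count; }
}

public static class WordCountExample
{
    // Stand-in data source; a real job would read from storage.
    public static IEnumerable<Document> GetDocuments() => new[]
    {
        new Document("the quick brown fox"),
        new Document("the lazy dog and the fox"),
    };

    public static List<WordCount> Count(IEnumerable<Document> documents)
    {
        var words = documents.SelectMany(document => document.Words);
        var groups = words.GroupBy(word => word);
        var counts = groups.Select(group => new WordCount(group.Key, group.Count()));
        return counts.ToList();
    }

    public static void Main()
    {
        foreach (var wc in Count(GetDocuments()))
            Console.WriteLine($"{wc.Word}: {wc.Count}");
    }
}
```

With the sample data, "the" is counted three times and "fox" twice; every other word appears once.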

Outline: Dryad; LINQ (Introduction, Interfaces, Operators); DryadLINQ; DryadLINQ in Machine Learning; Strengths and Weaknesses

DryadLINQ
DryadLINQ bridges the gap between the Dryad and LINQ layers. It translates programs written in LINQ into Dryad job execution plans that can be executed on a cluster by Dryad, and transparently returns the results to the host application. [5]
Figure source: [5]

Example: LINQ operators Figure source: [5]

2D Piping
The Dryad job execution plans generated by DryadLINQ are composable: the output of one graph can become the input of another one. In fact, this is exactly how complex LINQ queries are translated: each operator is translated to a graph independently, and the graphs are then concatenated. [7]
Figure source: [7]

A Simple Example: Word Count
Count word frequency in a set of documents [6]:
var documents = GetDocuments();
var words = documents.SelectMany(document => document.Words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(group => new WordCount(group.Key, group.Count()));

Word Count in DryadLINQ
Count word frequency in a set of documents [6]:
var documents = DryadLinq.GetTable("file://docs.txt");
var words = documents.SelectMany(document => document.Words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(group => new WordCount(group.Key, group.Count()));

Distributed Execution of Word Count Figure source: [6]

Another Example: extract Ulfar’s favorite web pages from many web log files

DryadLINQ Execution Overview Figure source: [2]

DryadLINQ=Dryad+LINQ Figure source: [8]

Outline: Dryad; LINQ; DryadLINQ (Introduction, DryadLINQ = Compiles LINQ to Dryad, LINQ operators and other examples); DryadLINQ in Machine Learning; Strengths and Weaknesses

DryadLINQ in Machine Learning

Real-life Application: XBox Figure source: [9]

Example: k-means [5] K-means in LINQ

K-means in DryadLINQ [5]
How to implement the GroupBy operation at the heart of the k-means aggregation?
DryadLINQ generates a job execution plan that uses two-level aggregation: each computer builds local groups with the local data and only sends the aggregated information about these groups to the next stage; the next stage computes the actual centroid. [5]
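The two-level aggregation described above can be imitated on one machine by treating each inner sequence as the data held by one computer. This is a hand-written sketch of the idea, not the plan DryadLINQ generates: TwoLevelKMeans, Nearest, LocalAggregate and Step are hypothetical names, and points are one-dimensional to keep the arithmetic visible. Stage 1 reduces every partition to one (sum, count) pair per cluster; stage 2 merges those pairs and divides to obtain the new centroids.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoLevelKMeans
{
    // Index of the centroid closest to point p (ties go to the lower index).
    public static int Nearest(double p, IReadOnlyList<double> centroids) =>
        Enumerable.Range(0, centroids.Count)
                  .OrderBy(i => Math.Abs(p - centroids[i]))
                  .First();

    // Stage 1, run per partition: group local points by nearest centroid and
    // emit only the aggregated (sum, count) for each local group.
    public static IEnumerable<(int Cluster, double Sum, int Count)> LocalAggregate(
        IEnumerable<double> partition, IReadOnlyList<double> centroids) =>
        partition.GroupBy(p => Nearest(p, centroids))
                 .Select(g => (g.Key, g.Sum(), g.Count()));

    // Stage 2: merge the partial aggregates from all partitions and compute
    // the actual centroid of each cluster.
    public static Dictionary<int, double> Step(
        IEnumerable<IEnumerable<double>> partitions, IReadOnlyList<double> centroids) =>
        partitions.SelectMany(part => LocalAggregate(part, centroids))
                  .GroupBy(a => a.Cluster)
                  .ToDictionary(g => g.Key,
                                g => g.Sum(a => a.Sum) / g.Sum(a => a.Count));

    public static void Main()
    {
        var partitions = new[]
        {
            new[] { 1.0, 2.0, 9.0 },   // data on "computer" 1
            new[] { 3.0, 10.0, 11.0 }, // data on "computer" 2
        };
        var updated = Step(partitions, new[] { 0.0, 10.0 });
        Console.WriteLine($"{updated[0]} {updated[1]}"); // 2 10
    }
}
```

In this example each partition sends two small tuples to the second stage instead of three points; with realistic partition sizes the saving in transferred data is what makes the two-level plan worthwhile.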

K-means in DryadLINQ Figure source: [5]

Example: Decision Tree [5]
Represent a decision tree with a dictionary that maps tree node indices (integer values) to attribute indices in the attribute array.
The most common algorithm to induce a decision tree starts from an empty tree and a set of records with class labels and attributes with values.
The algorithm repeatedly extends the tree by grouping records by their current location under the partial tree, and for each such group determining the attribute resulting in the greatest reduction in conditional entropy.
records.GroupBy(record => TreeWalk(record, tree)).Select(group => FindBestAttribute(group));

Decision Tree Induction in DryadLINQ [5]

Decision Tree Induction in DryadLINQ
Each iteration through the loop invokes a query returning the list of attribute indices that are best for each of the leaves in the old tree. The tree variable is updated on the client computer and retransmitted to the cluster by DryadLINQ with each iteration. [5]

Decision Tree Induction in DryadLINQ [5] Figure source: [5]

Example: Singular Value Decomposition [5]
The Singular Value Decomposition (SVD) lies at the heart of several large-scale data analyses: principal components analysis, collaborative filtering, and image segmentation, among many others. [5]
The SVD of an $n \times m$ matrix $A$ is a decomposition $A = U \Sigma V^{T}$ such that $U$ and $V$ are both orthonormal and $\Sigma$ is a diagonal matrix with non-negative entries. [5]
Example
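As a small worked instance of the standard form $A = U \Sigma V^{T}$, chosen here for illustration and not taken from the slides, consider a $2 \times 2$ matrix that rotates and stretches the plane:

```latex
\[
A =
\begin{pmatrix} 0 & -2 \\ 1 & 0 \end{pmatrix}
=
\underbrace{\begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}}_{U}
\underbrace{\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}}_{\Sigma}
\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}_{V^{T}}
\]
% Check: \Sigma V^{T} = \begin{pmatrix} 0 & 2 \\ 1 & 0 \end{pmatrix},
% and multiplying by U on the left negates the first row, which gives A.
```

Here $U$ and $V$ have orthonormal columns, and the diagonal of $\Sigma$ holds the singular values 2 and 1 in decreasing order.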

Singular Value Decomposition in DryadLINQ [5]

Singular Value Decomposition in DryadLINQ Figure source: [5]

Outline: Dryad; LINQ; DryadLINQ; DryadLINQ in Machine Learning (K-means, Decision Tree, Singular Value Decomposition); Strengths and Weaknesses

Strengths & Weaknesses
DryadLINQ has the following features [8]:
Declarative programming
Automatic parallelization
Integration with Visual Studio
Integration with .NET
Job graph optimizations
Conciseness

Strengths & Weaknesses
While DryadLINQ is a great tool to program clusters, there is a price to pay for the convenience that it provides [5]:
Efficiency: managed code (C#) is not always as efficient as native code.
Debugging: the experience of debugging a cluster program remains more painful than debugging a single-computer program.
Transparency: in most cases one needs to have some understanding of the operation of the compiler, and particularly of the job execution plans generated, to avoid egregious mistakes.

You can have it!
Dryad + DryadLINQ are available for download:
Academic license
Commercial evaluation license
Runs on the Windows HPC platform

Conclusion Figure source: [1]

Thank you!

References
[1] Microsoft Research webpage: Dryad and DryadLINQ for Data Intensive Research.
[2] Yuan Yu et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008.
[3] Microsoft Research webpage: Dryad. us/projects/dryad/
[4] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007.
[5] Frank McSherry, Yuan Yu, Mihai Budiu, Michael Isard, and Dennis Fetterly. Large-Scale Machine Learning using DryadLINQ. Chapter in Scaling Up Machine Learning, Cambridge University Press, December 2011.
[6] Yuan Yu. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing. Presentation at OSDI, December 2008.
[7] Mihai Budiu. Cluster Computing with DryadLINQ. Presentation at Palo Alto Research Center CSL Colloquium, Palo Alto, CA, May 8, 2008.
[8] Microsoft Research webpage: DryadLINQ. us/projects/dryadlinq/
[9] Mihai Budiu and Kannan Achan. A Machine-Learning toolkit in DryadLINQ. Presentation slides in PowerPoint.

Parallel Runtimes – DryadLINQ vs. Hadoop

Experiment by Indiana University Bloomington