Presentation transcript:

Distributed Data-Parallel Computing Using a High-Level Programming Language
Yuan Yu, Michael Isard
Joint work with: Andrew Birrell, Mihai Budiu, Jon Currey, Úlfar Erlingsson, Dennis Fetterly, Pradeep Kumar Gunda
Microsoft Research Silicon Valley

The Goal of the Talk
From the invitation: “to expose SIGMOD members to work going on in other 'parallel' fields of Computer Science that potentially have deep implications for the information management research community.”

The 2008 Claremont Report
“Designing systems that embrace non-relational data models, rather than shoehorning them into tables.”
“… the techniques behind parallel and distributed databases---partitioned dataflow and cost-based query optimization---should extend to new environments.”
“… need to pay attention to the softer issues that capture the hearts and minds of programmers (such as attractive syntax, typing and modularity, development tools, …).”
“… database research must look beyond its traditional boundaries and find allies throughout computing.”

Distributed Data-Parallel Computing
Research problem: how to write distributed data-parallel programs for a compute cluster?
The DryadLINQ programming model
– Sequential, single-machine programming abstraction
– Same program runs on single-core, multi-core, or cluster
– Familiar programming languages
– Familiar development environment

Dryad and DryadLINQ
DryadLINQ provides automatic query plan generation
Dryad provides automatic distributed execution

Outline
Programming model
Dryad and DryadLINQ overview
Lessons
Conclusions

LINQ
Microsoft's Language INtegrated Query
– Available in .NET 3.5 and Visual Studio 2008
A set of operators to manipulate datasets in .NET
– Supports traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
– Integrated into .NET programming languages: programs can invoke operators, and operators can invoke arbitrary .NET functions
Data model
– Data elements are strongly typed .NET objects
– Much more expressive than relational tables, e.g. nested data structures
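As a minimal single-machine illustration (the Order type and the Discount helper are invented, not from the talk): relational-style operators are ordinary method calls over strongly typed objects, and the query freely invokes an arbitrary .NET function.

using System;
using System.Collections.Generic;
using System.Linq;

public class Order
{
    public string Customer;
    public double Amount;
}

public static class LinqSketch
{
    // An arbitrary .NET function invoked from inside the query.
    public static double Discount(double amount)
    {
        return amount > 100.0 ? amount * 0.9 : amount;
    }

    public static void Main()
    {
        var orders = new List<Order> {
            new Order { Customer = "alice", Amount = 120.0 },
            new Order { Customer = "bob",   Amount = 40.0  },
            new Order { Customer = "alice", Amount = 75.0  }
        };

        // Relational-style operators (GroupBy, Select, Sum) written directly in C#.
        var totals = orders
            .GroupBy(o => o.Customer)
            .Select(g => new { Customer = g.Key, Total = g.Sum(o => Discount(o.Amount)) });

        foreach (var t in totals)
            Console.WriteLine("{0}: {1}", t.Customer, t.Total);
    }
}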

DryadLINQ Data Model
[Figure: .NET objects grouped into partitions, which together form a partitioned table]
A partitioned table exposes metadata information – type, partition, compression scheme, serialization, etc.
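A small sketch of what this looks like in code, reusing the Page record and the dfs:// path from the PageRank program shown later in the talk (the Where filter is an invented illustration): elements are ordinary .NET values, including nested data such as the links array.

// Sketch only: read a partitioned table of strongly typed .NET records
// (Page as defined in the PageRank program below, including its nested
// UInt64[] links array), then query it with ordinary LINQ.
var pages = PartitionedTable.Get<Page>("dfs://pages.txt");
var hubs = pages.Where(p => p.links.Length > 1000);   // invented filter, still plain LINQ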

Demo
Preserve an existing programming model
– The same familiar programming languages, development tools, libraries, etc.

An Example: PageRank
Ranks web pages using the hyperlink structure, propagating scores along the links
Each iteration can be expressed as a SQL query:
1. Join pages with ranks
2. Distribute ranks on outgoing edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat

One PageRank Step in DryadLINQ

// one step of pagerank: dispersing and re-accumulating rank
public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}

The Complete PageRank Program

var pages = PartitionedTable.Get<Page>("dfs://pages.txt");
var ranks = pages.Select(page => new Rank(page.name, 1.0));

// repeat the iterative computation several times
for (int iter = 0; iter < n; iter++)
{
    ranks = PRStep(pages, ranks);
}
ranks.ToPartitionedTable<Rank>("dfs://outputranks.txt");

public struct Page
{
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;

    public Page(UInt64 n, Int64 d, UInt64[] l)
    {
        name = n; degree = d; links = l;
    }

    public Rank[] Disperse(Rank rank)
    {
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++)
        {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public struct Rank
{
    public UInt64 name;
    public double rank;

    public Rank(UInt64 n, double r)
    {
        name = n; rank = r;
    }
}

public static IQueryable<Rank> PRStep(IQueryable<Page> pages, IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);

    // re-accumulate
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}

Multi-Iteration PageRank
[Figure: dataflow graph for three PageRank iterations; the pages and ranks inputs feed iteration 1, whose output ranks feed iteration 2 and then iteration 3, connected by in-memory FIFO channels]

Dryad System Architecture
[Figure: Dryad system architecture. A job manager per job holds the dataflow graph (Job 1: v11, v12, …; Job 2: v21, v22, …) and works with a cluster-wide scheduler to place vertices (V) on cluster machines via a per-machine daemon (PD). The control plane links the job manager, scheduler, and daemons; the data plane moves data between vertices over files, TCP, or FIFOs. New jobs enter through the scheduler.]

Dryad
Provides a general, flexible execution layer
– Dataflow graph as the computation model; can be modified by runtime optimizations
– Higher language layer supplies graph, vertex code, serialization code, hints for data locality, …
Automatically handles distributed execution
– Distributes code, routes data
– Schedules processes on machines near data
– Masks failures in cluster and network
– Fair scheduling of concurrent jobs
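The "dataflow graph as the computation model" point can be made concrete with a purely hypothetical sketch of the information a higher-level layer hands to the execution layer. This is not Dryad's actual interface, just the shape of a job description: vertices carrying code, and directed edges (kept acyclic) saying which outputs feed which inputs over which kind of channel.

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration only; NOT Dryad's real API.
class Vertex
{
    public string Name;                      // e.g. "map", "hash-partition", "reduce"
    public Action<Stream[], Stream[]> Run;   // vertex code: reads inputs, writes outputs
}

class Edge
{
    public Vertex Source;
    public Vertex Destination;
    public string Channel;                   // "file", "tcp", or "fifo"
}

class JobGraph
{
    public List<Vertex> Vertices = new List<Vertex>();
    public List<Edge> Edges = new List<Edge>();   // must remain acyclic
}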

Immutable Input
Assumes that inputs are immutable
– High performance: scales out to shared-nothing clusters of thousands of machines
– Significantly simplifies the design/implementation: simple fault-tolerance story, no need to handle complex transactions and synchronization
– Good for processing largely static datasets
– Not suitable for fine-grained, frequent updates

DryadLINQ System Architecture
[Figure: on the client machine, a .NET program builds a query expression over input tables; DryadLINQ turns it into a distributed query plan plus vertex code and invokes Dryad, which executes the plan on the cluster and writes output tables; results come back to the client as .NET objects via ToTable/foreach]

DryadLINQ
Distributed execution plan generation
– Static optimizations: pipelining, eager aggregation, etc.
– Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc.
Vertex runtime
– Single-machine (multi-core) implementation of LINQ
– Vertex code that runs on vertices
– Channel serialization code
– Callback code for runtime dynamic optimizations
– Automatically distributed to cluster machines
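Eager (partial) aggregation can be pictured with plain LINQ-to-Objects over an explicit list of partitions: aggregate within each partition first, then combine the much smaller partial results. This is a hand-written sketch of the idea only (the SumRanks helper and the partition layout are invented), not the code DryadLINQ generates.

using System.Collections.Generic;
using System.Linq;

static class AggregationSketch
{
    // Rank is the struct from the PageRank example (name, rank).
    public static IEnumerable<KeyValuePair<ulong, double>> SumRanks(
        List<List<Rank>> partitions)
    {
        // Step 1: aggregate within each partition, where the data lives.
        var partials = partitions.Select(part =>
            part.GroupBy(r => r.name)
                .Select(g => new KeyValuePair<ulong, double>(g.Key, g.Sum(r => r.rank))));

        // Step 2: combine the (much smaller) partial sums into final totals.
        return partials
            .SelectMany(p => p)
            .GroupBy(kv => kv.Key)
            .Select(g => new KeyValuePair<ulong, double>(g.Key, g.Sum(kv => kv.Value)));
    }
}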

Lessons
Acyclic dataflow graph is a powerful computation model
Language integration is amazingly successful
Leverage decades of database research
Decoupling of Dryad and DryadLINQ worked out well

Acyclic Dataflow Graph
Acyclic dataflow graph provides a very powerful computation model
– Easy target for higher-level programming abstractions such as DryadLINQ
– Easy expression of many data-parallel optimizations
We designed Dryad to be general and flexible
– Programmability is less of a concern
– Used primarily to support higher-level programming abstractions
– We haven't modified Dryad in order to support DryadLINQ

Expectation Maximization (Gaussians)
[Figure: execution plan generated by DryadLINQ; 3 iterations shown]

The Language Integration Approach
Single unified programming environment
– Unified data model and programming language
– Direct access to IDE and libraries
Simpler than SQL programming
– As easy for simple queries
– Easier to use for even moderately complex queries
– No embedded languages
Requires good programming language support
– LINQ extensibility: custom operators/providers (see the sketch below)
– .NET reflection, dynamic code generation, …
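As one small illustration of the extensibility point (the CountWhere operator and its use are invented, not from the talk): a "custom operator" can be just an extension method on IQueryable<T> that composes existing operators, so it works unchanged with any provider, including DryadLINQ.

using System;
using System.Linq;
using System.Linq.Expressions;

public static class MyOperators
{
    // Count the elements that satisfy a predicate.
    public static int CountWhere<T>(
        this IQueryable<T> source, Expression<Func<T, bool>> predicate)
    {
        return source.Where(predicate).Count();
    }
}

// Illustrative use (pages as in the PageRank example):
// int dangling = pages.CountWhere(p => p.degree == 0);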

LINQ Framework
[Figure: a .NET program (C#, VB, F#, etc.) on the local machine issues queries over objects; through the LINQ provider interface, different execution engines plug in (PLINQ, LINQ-to-SQL, LINQ-to-XML, DryadLINQ), scaling from single-core to multi-core to cluster]
Extremely open and extensible

Combining with PLINQ
[Figure: DryadLINQ executes the cluster-level query; within each vertex, the subquery runs under PLINQ]
The combination of PLINQ and DryadLINQ delivers computation to every core in the cluster
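A sketch of the kind of combination the figure describes (the PartialSums helper and its role as vertex code are invented; AsParallel() is PLINQ's actual entry point): inside the function a vertex applies to its partition, AsParallel() spreads the per-partition work across the machine's cores.

using System.Collections.Generic;
using System.Linq;

static class VertexSketch
{
    // Runs over one partition of Rank records (Rank as in the PageRank example).
    public static Dictionary<ulong, double> PartialSums(IEnumerable<Rank> partition)
    {
        return partition
            .AsParallel()                        // fan the per-partition work across local cores
            .GroupBy(r => r.name)
            .ToDictionary(g => g.Key, g => g.Sum(r => r.rank));
    }
}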

Leverage Database Research
Example: MapReduce written in DryadLINQ

MapReduce(source,       // sequence of Ts
          mapper,       // T -> Ms
          keySelector,  // M -> K
          reducer)      // (K, Ms) -> Rs
{
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result;      // sequence of Rs
}
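For concreteness, here is one way the schematic MapReduce above might be written out with explicit generic types, together with an invented word-count use; the signature details and the lambdas are illustrative assumptions, not taken from the talk.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class MapReduceSketch
{
    public static IQueryable<R> MapReduce<T, M, K, R>(
        IQueryable<T> source,
        Expression<Func<T, IEnumerable<M>>> mapper,
        Expression<Func<M, K>> keySelector,
        Expression<Func<IGrouping<K, M>, IEnumerable<R>>> reducer)
    {
        var map = source.SelectMany(mapper);       // apply the mapper to every input
        var group = map.GroupBy(keySelector);      // group intermediate records by key
        var result = group.SelectMany(reducer);    // reduce each group to output records
        return result;
    }
}

// Hypothetical use: word count over lines of text (lines : IQueryable<string>).
// var counts = MapReduceSketch.MapReduce(
//     lines,
//     line => line.Split(' '),                                          // mapper: line -> words
//     word => word,                                                     // key: the word itself
//     g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) });  // reducer: (word, count)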

But, Not So Easy
The main sources of difficulty:
– A much more complicated data model
– User-defined functions all over the place
Requires sophisticated program analysis techniques
– Possible with modern programming languages and runtimes, such as C#/CLR

Decoupling of Dryad and DryadLINQ
Separation of concerns
– The Dryad layer handles scheduling and fault tolerance
– The DryadLINQ layer handles the programming model and the parallelization of programs
– Result: a powerful, expressive execution engine and programming model
Different from the MapReduce/Hadoop approach
– A single abstraction for both the programming model and the execution engine
– Result: a very simple, but very restricted, execution engine and language

Software Stack
[Figure: applications (image processing, machine learning, graph analysis, data mining, and others) built on DryadLINQ or other languages, running on Dryad; Dryad runs on cluster services (Azure, HPC, or Cosmos) over Windows Server, with storage from Azure DFS, Cosmos DFS, SQL Servers, or CIFS/NTFS]

Availability
Freely available for academic use
– Dryad in binary, DryadLINQ in source
– Will release Dryad source in the future
Coming soon to Microsoft commercial partners
– Free, but no product support

Conclusions
Goal: use a compute cluster as if it were a single computer
– Dryad/DryadLINQ represent a significant step
Requires close collaboration across many fields of computing, including
– Distributed systems
– Distributed and parallel databases
– Programming language design and analysis

Dryad/DryadLINQ Papers
1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (EuroSys’07)
2. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (OSDI’08)
3. Distributed Data-Parallel Computing Using a High-Level Programming Language (SIGMOD’09)
4. Quincy: Fair Scheduling for Distributed Computing Clusters (SOSP’09)
5. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations (SOSP’09)