
1 The Limitation of MapReduce: A Probing Case and a Lightweight Solution
Zhiqiang Ma, Lin Gu
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
CLOUD COMPUTING 2010, November 21-26, 2010, Lisbon, Portugal

2 MapReduce
MapReduce: a parallel computing framework for large-scale data processing
Successfully used in datacenters comprising commodity computers
A fundamental piece of software in the Google architecture for many years
An open-source variant, Hadoop, is widely used in solving data-intensive problems

3 Introduction to MapReduce
Map and Reduce are higher-order functions
Map: apply an operation to all elements in a list
Reduce: like "fold"; aggregate the elements of a list
Example: with m: x -> x^2, Map turns [1, 2, 3, 4, 5] into [1, 4, 9, 16, 25]; with r: +, Reduce folds the squares from the initial value 0 through 1, 5, 14, 30 to the final value 55, computing 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 55
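The squares example above can be sketched directly with Python's built-in higher-order functions (a minimal illustration of the Map/fold idea, not the MapReduce API):

```python
from functools import reduce

xs = [1, 2, 3, 4, 5]

# Map: apply m(x) = x^2 to every element of the list
squares = list(map(lambda x: x * x, xs))

# Reduce: fold with r = +, starting from the initial value 0;
# the running values are 0, 1, 5, 14, 30, 55
total = reduce(lambda acc, x: acc + x, squares, 0)
```

Running this yields `squares == [1, 4, 9, 16, 25]` and `total == 55`, matching the slide's diagram.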

4 Introduction to MapReduce
Massively parallel processing made simple
Example: word count
Map: parse a document and generate <word, 1> pairs
Reduce: receive all pairs for a specific word, and count

Map:
  // D is a document
  for each word w in D:
    output <w, 1>

Reduce:
  for key w:
    count = 0
    for each input item:
      count = count + 1
    output <w, count>
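The word-count pseudocode above can be made concrete in plain Python; here the grouping ("shuffle") step that a real MapReduce runtime performs between the two phases is simulated with a dictionary:

```python
from collections import defaultdict

def map_doc(doc):
    # Map: parse a document D and emit a <word, 1> pair per occurrence
    for word in doc.split():
        yield (word, 1)

def reduce_word(word, items):
    # Reduce: receive all pairs for one word and count them
    count = 0
    for _ in items:
        count = count + 1
    return (word, count)

def word_count(docs):
    # Stand-in for the shuffle phase: group intermediate pairs by key
    groups = defaultdict(list)
    for doc in docs:
        for word, one in map_doc(doc):
            groups[word].append(one)
    return dict(reduce_word(w, items) for w, items in groups.items())
```

For example, `word_count(["a b a", "b c"])` returns `{"a": 2, "b": 2, "c": 1}`.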

5 Thoughts on MapReduce
MapReduce provides an easy-to-use framework for parallel programming. But is it good for general programs running in datacenters?

6 Our work
Analyze MapReduce's design and use a case study to probe the limitation
Design a new parallelization framework: MRlite
Evaluate the new framework's performance
Goal: design a general parallelization framework and programming paradigm for cloud computing

7 Thoughts on MapReduce
Originally designed for processing large static data sets
No significant dependence among data items
Throughput over latency
Large-data parallelism over small, maybe ephemeral parallelization opportunities

8 The limitation of MapReduce
One-way scalability:
Allows a program to scale up to process very large data sets
Constrains the program's ability to process moderate-size data items
Limits the applicability of MapReduce
Difficult to handle dynamic, interactive and semantic-rich applications

9 A case study on MapReduce
Distributed compilation ("make -j N")
Very useful in development environments
Code (data) has dependence
Abundant parallelization opportunities
A "typical" application, but a hard case for MapReduce
(Figure: Linux kernel build dependency graph with targets such as init/version.o, driver/built-in.o, mm/built-in.o, vmlinux-init, vmlinux-main and kallsyms.o feeding into vmlinux)
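The dependency graph in the figure is exactly what "make -j N" exploits: any targets whose prerequisites are already built can compile concurrently. A small sketch (a hypothetical helper, not part of mrcc, with illustrative dependency edges) that derives the concurrent "waves" from such a graph:

```python
def parallel_waves(deps):
    # deps: target -> set of prerequisite targets, like a Makefile graph.
    # Returns successive batches of targets whose prerequisites are all
    # satisfied, i.e. the groups "make -j N" could build in parallel.
    done, waves = set(), []
    pending = dict(deps)
    while pending:
        ready = sorted(t for t, d in pending.items() if d <= done)
        if not ready:
            raise ValueError("cyclic dependency")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del pending[t]
    return waves

# Illustrative (simplified) slice of the kernel build graph from the figure
kernel_deps = {
    "init/version.o": set(),
    "kallsyms.o": set(),
    "vmlinux-main": set(),
    "vmlinux-init": {"init/version.o"},
    "vmlinux": {"vmlinux-init", "vmlinux-main", "kallsyms.o"},
}
waves = parallel_waves(kernel_deps)
```

Here the first wave holds the three independent targets, then vmlinux-init, and finally vmlinux, which depends on everything else.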

10 A case study: mrcc
Develop a distributed compiler using the MapReduce model
How to extract the parallelizable components in a relatively complex data flow?
mrcc: a distributed compilation system
The workload is parallelizable but data-dependence constrained
Explores parallelism using the MapReduce model

11 mrcc
Multiple machines are available to MapReduce for parallel compilation
A master instructs multiple slaves ("map workers") to compile source files

12 Design of mrcc
"make -j N" explores parallelism among compiling source files
MapReduce jobs are submitted for compiling source files
The map task compiles an individual file

13 Experiment: mrcc over Hadoop
MapReduce implementation: Hadoop 0.20.2
Testbed: 10 nodes available to Hadoop for parallel execution, connected by 1 Gbps Ethernet
Workload: compiling the Linux kernel, ImageMagick, and Xen tools

14 Result and observation
The compilation using mrcc on 10 nodes is 2~11 times slower than sequential compilation on one node.

Project      | Time for gcc (sequential compilation) (min) | Time for mrcc/Hadoop (min)
Linux kernel | 49                                          | 151
ImageMagick  | 5                                           | 11
Xen tools    | 2                                           | 24

Where does the slowdown come from?
Per-source-file overheads: putting source files into HDFS takes >2 s, starting a Hadoop job >20 s, retrieving object files >2 s
Network communication overhead for data transportation and replication
Tasking overhead

15 mrcc: Distributed Compilation
Is there sufficient parallelism to exploit? Yes: "distcc" serves as the baseline
One-way scalability in the (MapReduce) design and (Hadoop) implementation
MapReduce is not designed for compiling. We use this case to show some of its limitations.

16 Parallelization framework
MapReduce/Hadoop is inefficient for general programming
Cloud computing needs a general parallelization framework!
Handle applications with complex logic, data dependence, frequent updates, etc.
39% of Facebook's MapReduce workloads have only 1 Map [Zaharia 2010]
Easy to use and high performance

[Zaharia 2010] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EuroSys '10, 2010.

17 Lightweight solution: MRlite
A lightweight parallelization framework following the MapReduce paradigm
Parallelization can be invoked when needed
Able to scale "up" like MapReduce, and scale "down" to process moderate-size data
Low latency and massive parallelism
Small run-time system overhead
A general parallelization framework and programming paradigm for cloud computing

18 Architecture of MRlite
The MRlite client library, linked together with the application, accepts calls from the app and submits jobs to the master
The MRlite master accepts jobs from clients and schedules them to execute on slaves
Slaves are distributed nodes that accept tasks from the master and execute them
High-speed distributed storage holds intermediate files
(Figure: application, MRlite client, MRlite master/scheduler and slaves, connected by command flow, with data flow through the high-speed distributed storage)

19 Design
Parallelization is invoked when needed
An application can request parallel execution an arbitrary number of times
The program's natural logic flow is integrated with parallelism
This removes one important limitation
Facility outlives utility: use and reuse threads for master and slaves
Memory is the "first class" medium: avoid touching hard drives
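The "invoked when needed" idea can be illustrated with a Python thread pool standing in for MRlite's master and slaves (this shows the paradigm only, not MRlite's actual C API): the program follows its natural logic flow and requests parallel execution at several points during execution.

```python
from concurrent.futures import ThreadPoolExecutor

def process(items):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # First parallel phase, requested mid-program
        squares = list(pool.map(lambda x: x * x, items))
        # Ordinary sequential step in the program's natural logic flow
        total = sum(squares)
        # Second parallel phase; parallelism is requested again as needed
        return list(pool.map(lambda s: s + total, squares))
```

For instance, `process([1, 2, 3])` squares the inputs to [1, 4, 9], sums them to 14 sequentially, and then shifts each square in parallel, returning [15, 18, 23].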

20 Design
Programming interface: provides a simple API that allows programs to invoke parallel processing during execution
Data handling: a network file system that stores files in memory; no replication for intermediate files; applications are responsible for retrieving output files
Latency control: jobs and tasks have timeout limits

21 Implementation
Implemented in C as Linux applications
Distributed file storage: implemented with NFS in memory, mounted from all nodes; stores intermediate files
A specially designed distributed in-memory network file system may further improve performance (future work)
There is no limitation on the choice of programming languages

22 Evaluation
Re-implement mrcc on MRlite
Porting mrcc is not difficult because MRlite can handle a "superset" of the MapReduce workloads
Testbed and workload: use the same testbed and workload to compare MRlite's performance with MapReduce/Hadoop's

23 Result
The compilation of the three projects using mrcc on MRlite is much faster than compilation on one node. The speedup is at least 2, and the best speedup reaches 6.

24 MRlite vs. Hadoop
The average speedup of MRlite is more than 12 times better than that of Hadoop. The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty in handling.

Project     | Speedup on Hadoop | Speedup on MRlite | MRlite vs. Hadoop
Linux       | 0.32              | 5.8               | 17.9
ImageMagick | 0.48              | 6.2               | 13.0
Xen tools   | 0.09              | 2.0               | 22.0
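The last column of the table can be cross-checked from the first two; small deviations come from rounding in the published per-project speedups:

```python
rows = {
    # project: (speedup on Hadoop, speedup on MRlite, reported ratio)
    "Linux":       (0.32, 5.8, 17.9),
    "ImageMagick": (0.48, 6.2, 13.0),
    "Xen tools":   (0.09, 2.0, 22.0),
}
for project, (hadoop, mrlite, reported) in rows.items():
    ratio = mrlite / hadoop
    # each recomputed ratio agrees with the slide to within rounding
    assert abs(ratio - reported) < 0.3, project

# the average MRlite-vs-Hadoop ratio is indeed more than 12x
assert sum(r[2] for r in rows.values()) / len(rows) > 12
```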

25 Conclusion
Cloud computing needs a general programming framework
Cloud computing shall not be a platform to run just simple OLAP applications; it is important to support complex computation and even OLTP on large data sets
We use the distributed compilation case (mrcc) to probe the one-way scalability limitation of MapReduce
We design MRlite: a general parallelization framework for cloud computing
Handles applications with complex logic flow and data dependencies
Mitigates the one-way scalability problem
Able to handle all MapReduce tasks with comparable (if not better) performance

26 Conclusion
Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
MRlite respects applications' natural logic flow and data dependencies
This modularization of parallelization capability away from application logic enables MRlite to integrate GPGPU processing very easily (future work)

27 Thank you!

