MapReduce use case - Giuseppe Andronico, INFN Sez. CT & Consorzio COMETA


1 MapReduce use case
Giuseppe Andronico, INFN Sez. CT & Consorzio COMETA
Workshop "Grids vs. Clouds", Beijing

2 Introduction
Classical data-intensive and data-parallel applications have a huge amount of data to be analyzed. The analysis can be done by:
- splitting the data into a great number of small chunks
- applying the analysis procedure to each chunk independently
- applying a specific procedure to the set of partial results to obtain the final result
In this case the Map & Reduce method can be used, and it fits cloud computing nicely.
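Below is a minimal, single-machine Java sketch of this split/apply/combine pattern, using a toy sum as the per-chunk analysis. The class and variable names are illustrative, and this is plain Java, not a MapReduce framework:

    import java.util.stream.IntStream;

    public class SplitApplyCombine {
        public static void main(String[] args) {
            int[] data = IntStream.rangeClosed(1, 1_000_000).toArray();
            int chunkSize = 100_000;
            int chunks = (data.length + chunkSize - 1) / chunkSize;

            // Split the input into chunks, analyze each chunk independently
            // (here the "analysis" is just a sum), then combine the partial
            // results into the final answer.
            long total = IntStream.range(0, chunks)
                    .parallel()
                    .mapToLong(i -> {
                        long partial = 0;
                        int end = Math.min((i + 1) * chunkSize, data.length);
                        for (int j = i * chunkSize; j < end; j++) {
                            partial += data[j];
                        }
                        return partial;
                    })
                    .sum(); // combining step over the per-chunk results

            System.out.println("total = " + total); // 500000500000
        }
    }

The final .sum() plays the role of the specific procedure applied to the set of partial results.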

3 What is MapReduce?
- A simple data-parallel programming model designed for scalability and fault-tolerance
- Pioneered by Google, which processes 20 petabytes of data per day with it
- Popularized by the open-source Hadoop project
- Used at Yahoo!, Facebook, Amazon, …

4 What is MapReduce used for?
At Google:
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation
At Yahoo!:
- "Web map" powering Yahoo! Search
- Spam detection for Yahoo! Mail
At Facebook:
- Data mining
- Ad optimization
- Spam detection

5 What is MapReduce used for?
In research:
- Astronomical image analysis (Washington)
- Bioinformatics (Maryland)
- Analyzing Wikipedia conflicts (PARC)
- Natural language processing (CMU)
- Particle physics (Nebraska)
- Ocean climate simulation (Washington)

6 Example
Given: a snapshot of the Internet.
To find: the top N pages with the highest number of incoming links.
Also given: infinite computing and storage resources.
Goal: optimize the computation, and make it simple enough that anybody who knows a programming language can do it.

7 MapReduce
- Automatic parallelization & distribution
- Fault-tolerant
- Provides status and monitoring tools
- Clean abstraction for programmers

8 Programming Model
Borrows from functional programming. Users implement an interface of two functions:
map(in_key, in_value) -> list of (out_key, intermediate_value)
reduce(out_key, list of intermediate_value) -> list of out_value
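As a sketch, this two-function contract could be written in Java as the following pair of interfaces. The names (MapFunction, ReduceFunction) are hypothetical illustrations, not Hadoop's actual API:

    import java.util.List;
    import java.util.Map;

    // Illustrative only: the generic contract a MapReduce user implements.
    interface MapFunction<InKey, InValue, OutKey, InterValue> {
        // map(in_key, in_value) -> list of (out_key, intermediate_value)
        List<Map.Entry<OutKey, InterValue>> map(InKey inKey, InValue inValue);
    }

    interface ReduceFunction<OutKey, InterValue, OutValue> {
        // reduce(out_key, list of intermediate_value) -> list of out_value
        List<OutValue> reduce(OutKey outKey, List<InterValue> intermediateValues);
    }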

9 Map
Records from the data source (lines of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g. (filename, line).
From each input pair, map() produces one or more intermediate <key, value> pairs: an output key together with an intermediate value.

10 Reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

11 MapReduce
[Diagram slide; figure not captured in this transcript]

12 Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets
- reduce() functions also run in parallel, each working on a different output key
- All values are processed independently
- Bottleneck: the reduce phase can't start until the map phase is completely finished
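The same structure can be mimicked on one machine with Java parallel streams; this toy word count is only an analogy for the distributed case, and all names are illustrative:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ParallelismSketch {
        public static void main(String[] args) {
            List<String> lines = List.of("the quick brown fox", "the lazy dog", "the fox");

            // Map phase: every line is tokenized independently and in parallel.
            // Grouping by key plays the role of the shuffle, and counting per
            // key is the reduce phase; no count is final until all lines have
            // been mapped, mirroring the map/reduce barrier above.
            Map<String, Long> counts = lines.parallelStream()
                    .flatMap(line -> Arrays.stream(line.split("\\s+"))) // map: emit words
                    .collect(Collectors.groupingByConcurrent(
                            word -> word,            // shuffle: group by output key
                            Collectors.counting())); // reduce: sum per key

            counts.forEach((word, count) -> System.out.println(word + "\t" + count));
        }
    }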

13 Example: Count word occurrences
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));

14 Example vs. Actual Source Code
The example above is written in pseudo-code; the actual implementation is in Java, using Hadoop. True code is somewhat more involved (it defines how the input keys/values are divided up and accessed, etc.).
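For reference, here is a complete Java/Hadoop version in the style of the standard Hadoop WordCount tutorial example (org.apache.hadoop.mapreduce API):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map(document offset, line) -> list of (word, 1)
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);   // EmitIntermediate(w, 1)
                }
            }
        }

        // reduce(word, [1, 1, ...]) -> (word, total count)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);      // Emit(word, sum)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A job like this is typically submitted with something like: hadoop jar wordcount.jar WordCount <input path> <output path>.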

15 Locality
The master program divides up tasks based on the location of the data: it tries to schedule a map() task on the same machine as the physical file data, or at least on the same rack.
map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.

16 Fault Tolerance
The master detects worker failures and:
- re-executes completed & in-progress map() tasks (completed map output lives on the failed worker's local disk, so it is lost with the worker)
- re-executes in-progress reduce() tasks
The master also notices when particular input key/value pairs cause crashes in map(), and skips those records on re-execution. Effect: it can work around bugs in third-party libraries!

17 Optimizations
No reduce can start until the map phase is complete, so a single slow disk controller can rate-limit the whole process. The master therefore redundantly executes "slow-moving" map tasks as backup tasks, and uses the result of whichever copy finishes first.
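Hadoop implements this idea as "speculative execution". A small, illustrative driver sketch that sets the relevant job properties explicitly (they are on by default; property names as in the Hadoop 2.x+ mapreduce API, and the class name is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeExecutionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Redundant "backup" attempts of straggler tasks: the first
            // attempt to finish wins and the others are killed. This is on
            // by default; it is set explicitly here only for illustration.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);

            Job job = Job.getInstance(conf, "job with backup tasks");
            System.out.println("speculative execution enabled for: " + job.getJobName());
        }
    }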

18 Sort
[Figure: sort benchmark results for three runs: normal, no backup tasks, and with processes killed; M = R = 4000]
- Backup tasks reduce job completion time a lot!
- The system deals well with failures

19 Optimizations
"Combiner" functions can run on the same machine as a mapper. This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth.
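In the Hadoop WordCount example shown after slide 14, the reducer can double as the combiner, because summing counts is associative and commutative; one extra line in that driver's main() enables it:

    // Added to WordCount.main() from the earlier example: run a mini-reduce
    // on each map machine so only partial sums cross the network.
    job.setCombinerClass(IntSumReducer.class);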

20 Some Applications
Distributed grep:
- Map: emits a line if it matches the supplied pattern
- Reduce: copies the intermediate data to the output
Count of URL access frequency:
- Map: processes web logs and outputs <URL, 1>
- Reduce: emits <URL, total count>
Reverse web-link graph:
- Map: processes web pages and outputs <target, source> for each link
- Reduce: emits <target, list(source)>
A Hadoop-style sketch of the last pair follows.
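The class names and the input format (one "source target" link per line) below are assumptions made for this illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReverseWebLinkGraph {

        // map: for each hyperlink, emit <target, source>
        public static class InvertMapper extends Mapper<Object, Text, Text, Text> {
            private final Text source = new Text();
            private final Text target = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] link = value.toString().split("\\s+");
                if (link.length == 2) {
                    source.set(link[0]);
                    target.set(link[1]);
                    context.write(target, source);
                }
            }
        }

        // reduce: emit <target, list(source)> as a comma-separated list
        public static class SourceListReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text target, Iterable<Text> sources, Context context)
                    throws IOException, InterruptedException {
                StringBuilder list = new StringBuilder();
                for (Text s : sources) {
                    if (list.length() > 0) {
                        list.append(',');
                    }
                    list.append(s);
                }
                context.write(target, new Text(list.toString()));
            }
        }
    }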

21 MapReduce Conclusions
- MapReduce has proven to be a useful abstraction
- It greatly simplifies large-scale computations at Google
- The functional programming paradigm can be applied to large-scale applications
- Fun to use: focus on the problem, let the library deal with the messy details

22 What is Hadoop?
At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose. GFS is not open source. Doug Cutting and Yahoo! re-implemented the published GFS design and called the result the Hadoop Distributed File System (HDFS). The software framework that supports HDFS, MapReduce and the other related components is called the Hadoop project, or simply Hadoop. It is open source and distributed by Apache.

23 Resources
- "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat (OSDI 2004)
- Google papers and systems: GFS, BigTable, Sawzall, Chubby, Protocol Buffers
- Google MapReduce Lecture Series
- HDFS Architecture

