1 MAPREDUCE PRESENTED BY: KATIE WOODS & JORDAN HOWELL

2 TEAM MEMBERS: Katie Woods: covered Sections 1. Introduction, 3. Implementation, 6. Experience, 7. Related Work. Jordan Howell: covered Sections 2. Programming Model, 4. Refinements, 5. Performance, Conclusion

3 OVERVIEW What is MapReduce? Programming model (Sections 2.1-6.1) Related work Conclusion How to run MapReduce on OpenStack References What we went over Questions/Comments

4 WHAT IS MAPREDUCE? Originally created by Google Used to query large data-sets Extracts relations from unstructured data Can draw from many disparate data sources

5 2. PROGRAMMING MODEL Two parts: Map() and Reduce() The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to Reduce() via an iterator Reduce() merges the values to form a possibly smaller set of values Typically zero or one output value is produced per Reduce invocation
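The Map()/Reduce() pairing on this slide can be sketched in Python. This is a hypothetical word-count example, not Google's actual C++ API; the in-memory grouping stands in for the shuffle step the real library performs between the two phases:

```python
from collections import defaultdict

def map_fn(document):
    """User-defined Map(): emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    """User-defined Reduce(): merge all counts for one intermediate key."""
    return sum(values)

def map_reduce(documents):
    # The library's job: group intermediate values by intermediate key...
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # ...then pass each key's values to Reduce() via an iterator.
    return {key: reduce_fn(key, iter(values)) for key, values in groups.items()}

print(map_reduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

Each Reduce call here produces exactly one output value, matching the "zero or one output value per invocation" pattern described above.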

6 3. IMPLEMENTATION Many different implementations are possible; the right choice depends on the environment. For example, Google's setup: typically dual x86 processors, 2-4 GB of memory, 100 Mb/s or 1 Gb/s networking hardware, hundreds or thousands of machines per cluster, and inexpensive IDE disks attached directly to the machines.

7 3.1 EXECUTION Input data is partitioned into M splits; the intermediate key space is partitioned into R pieces. 1. The library splits the input files into 16-64 MB pieces, then starts up many copies of the program on a cluster of machines. 2. The master assigns each worker a map task or a reduce task. 3. A map worker reads its input split. 4. Periodically, buffered pairs are written to local disk. 5. A reduce worker makes remote procedure calls to retrieve data from the map workers' local disks. 6. Results are written to the final output file for that reduce partition. 7. The master wakes the user program and returns all output files.
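The seven steps above can be walked through in a toy, single-process simulation. M and R here are assumed small values, and a byte-sum stands in for the library's partitioning hash; the real system uses 16-64 MB file splits and remote procedure calls between machines:

```python
from collections import defaultdict

def run(records, map_fn, reduce_fn, M=3, R=2):
    # Step 1: partition the input into M splits.
    splits = [records[i::M] for i in range(M)]
    # Steps 2-4: each "map worker" processes one split and buffers its
    # intermediate pairs into R regions (the per-partition local files).
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for record in split:
            for k, v in map_fn(record):
                regions[sum(k.encode()) % R][k].append(v)
    # Steps 5-6: each "reduce worker" reads one region, sorts it by key,
    # and writes one output "file" per reduce partition.
    return [{k: reduce_fn(k, vs) for k, vs in sorted(region.items())}
            for region in regions]

outputs = run(["a b", "b c"],
              lambda rec: [(w, 1) for w in rec.split()],
              lambda k, vs: sum(vs))
# outputs is a list of R per-partition result "files"
```

Note that, as in step 6, the final result is R separate output files rather than one merged file.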

8 3.2 MASTER DATA STRUCTURES The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed). The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks.

9 3.3 FAULT TOLERANCE Since the MapReduce library is designed to process very large amounts of data, it must tolerate machine failures gracefully. Three cases are covered: worker failure, master failure, and semantics in the presence of failures.

10 3.4 LOCALITY Because network bandwidth is relatively scarce, input data is stored on the local disks of the machines. Files are split into 64 MB blocks, and each block is typically replicated on three machines. The master tries to schedule a map task on a machine that contains a replica of the corresponding input data; otherwise, it schedules the task near a machine containing that data. When running large operations, most input data is read locally and consumes no network bandwidth.

11 3.5 TASK GRANULARITY The map and reduce phases are split into pieces: M pieces for the map phase and R pieces for the reduce phase. M and R should be much larger than the number of worker machines; this helps with dynamic load balancing and speeds recovery when a worker fails. R is usually constrained by users, since each reduce task produces a separate output file. M is chosen so that each split holds between 16 MB and 64 MB of input data. Google often uses M = 200,000 map pieces, R = 5,000 reduce pieces, and 2,000 worker machines.
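The choice of M above follows directly from the split-size rule. A quick illustrative calculation (the 10 TiB input size is an assumption chosen for the example, not a figure from the paper):

```python
def num_map_tasks(input_bytes, split_bytes=64 * 2**20):
    """M is picked so each map task reads one 16-64 MB split of the input.
    Ceiling division: one map task per split."""
    return -(-input_bytes // split_bytes)

# A 10 TiB input at 64 MiB per split needs 163,840 map tasks, which is
# consistent with M = 200,000 being a typical value for large jobs.
print(num_map_tasks(10 * 2**40))  # 163840
```

The same arithmetic explains why M is "much larger than the number of worker machines": 163,840 tasks spread over 2,000 workers gives each machine roughly 80 tasks to pick up as it becomes free.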

12 3.6 BACKUP TASKS "Straggler" machines can greatly lengthen total computation time. Stragglers can arise for many different reasons, but they can be alleviated: the master schedules backup executions of the remaining in-progress tasks when the operation is close to finishing, and a task is marked as complete when either the primary or the backup execution completes. Backup-task overhead has been tuned to no more than a few percent; one example task takes 44% longer when the mechanism is disabled.

13 4. REFINEMENTS General algorithms fit most needs User defined extensions have been found useful

14 4.1 PARTITIONING FUNCTION Users can specify the number of reduce tasks to run (R), and the partitioning function can be redefined over the intermediate keys. The default function is hash(key) mod R, which results in fairly well-balanced partitions. Sometimes we want to group output together, such as grouping web data by domain; then we can redefine the partitioning function to use hash(Hostname(urlkey)) mod R.
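The two partitioning schemes above can be sketched as follows. `stable_hash` is a hypothetical stand-in for the library's hash function (Python's built-in `hash()` is randomized across runs, so an MD5 prefix is used instead):

```python
import hashlib
from urllib.parse import urlparse

def stable_hash(s):
    """Deterministic 64-bit hash (stand-in for the library's hash)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def default_partition(key, R):
    """Default scheme: hash(key) mod R."""
    return stable_hash(key) % R

def host_partition(urlkey, R):
    """Custom scheme: hash(Hostname(urlkey)) mod R, so that all URLs
    from the same host land in the same reduce partition."""
    return stable_hash(urlparse(urlkey).hostname) % R

# Two pages from the same host always go to the same partition:
a = host_partition("http://example.com/page1", 5)
b = host_partition("http://example.com/page2", 5)
print(a == b)  # True
```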

15 4.2 ORDERING GUARANTEES Within each partition, intermediate key/value pairs are always processed in increasing order Makes it easy to generate a sorted output file per partition This supports efficient lookup of random keys

16 4.3 COMBINER FUNCTION There is sometimes significant repetition in the intermediate keys. This merging is usually handled in the Reduce function, but sometimes we want to partially combine values in the map task first. The combiner function can yield significant performance gains.
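A hypothetical word-count sketch of the combiner idea: the same merging logic as the reducer is applied to a single map task's output before any data crosses the network, so repeated keys collapse locally:

```python
from collections import Counter

def map_fn(document):
    """Emit (word, 1) for every word; repeated words repeat the key."""
    return [(word, 1) for word in document.split()]

def combine(pairs):
    """Combiner: partially merge one map task's output by key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

pairs = map_fn("the the the cat")
combined = combine(pairs)
print(len(pairs), "->", len(combined))  # 4 pairs shrink to 2
```

The win is purely in data volume: four intermediate pairs become two before being shipped to the reduce workers, and the final reduce output is unchanged.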

17 4.4 INPUT AND OUTPUT TYPES MapReduce can take data in several different formats. The way the data is organized for input greatly affects the output. Adding support for a new data type only requires users to implement a new reader interface.

18 4.5 SIDE EFFECTS Sometimes we want to produce auxiliary files as additional outputs from the Map or Reduce operators. Users are responsible for these files, and the tasks producing them should be deterministic. This restriction has never been an issue in practice.

19 4.6 SKIPPING BAD RECORDS Sometimes there are bugs in user code. The usual course of action is to fix the bug, but sometimes that is not feasible, and it can be acceptable to ignore a few records. MapReduce provides an optional mode of execution that detects which records cause deterministic crashes and skips them.
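The skip mode can be sketched as a retry loop around the user's function. This is an assumed simplification: the real library detects repeated crashes on the same record via signal handlers and the master, across worker re-executions, rather than in one loop:

```python
def run_with_skip(records, map_fn, max_attempts=2):
    """Retry each record up to max_attempts times, then skip it."""
    output, skipped = [], []
    for record in records:
        for _ in range(max_attempts):
            try:
                output.extend(map_fn(record))
                break
            except Exception:
                continue
        else:
            # Crashed on every attempt: deterministic failure, skip it.
            skipped.append(record)
    return output, skipped

def buggy_map(record):
    """Hypothetical user code that crashes on one malformed record."""
    if record == "corrupt":
        raise ValueError("user-code bug")
    return [(record, 1)]

out, skipped = run_with_skip(["a", "corrupt", "b"], buggy_map)
print(skipped)  # ['corrupt']
```

A transient failure would succeed on the retry; only records that crash every attempt are dropped, which is why the mechanism targets deterministic bugs.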

20 4.7 LOCAL EXECUTION Debugging problems in Map or Reduce functions can be tricky, since decisions are made dynamically by the master. To facilitate debugging, profiling, and small-scale testing, an alternative implementation runs all of the work sequentially on the local machine, and controls are provided to limit execution to particular map tasks.

21 4.8 STATUS INFORMATION The master runs an internal HTTP server and exports a set of status pages. These show the progress of the computation and contain links to the standard error and standard output files. Users use this data to predict how long the computation will take. The top-level status page shows which workers have failed and which map and reduce tasks they were processing when they failed.

22 4.9 COUNTERS MapReduce provides a counter facility to count occurrences of various events Some counter values are automatically maintained by the MapReduce library Users have found the counter facility useful for sanity checking the behavior of operations
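A minimal sketch of the counter facility (the class and counter names here are hypothetical, not the real C++ API): user code bumps named counters inside Map(), and the master aggregates the values, which is what makes them useful for sanity checking:

```python
from collections import Counter

class Counters:
    """Toy stand-in for the library's counter facility."""
    def __init__(self):
        self.values = Counter()

    def increment(self, name, amount=1):
        self.values[name] += amount

counters = Counters()

def map_fn(word):
    # Sanity-check counter: how many inputs were fully uppercase?
    if word == word.upper():
        counters.increment("uppercase-words")
    return [(word.lower(), 1)]

for w in ["FOO", "bar", "BAZ"]:
    map_fn(w)
print(counters.values["uppercase-words"])  # 2
```

After a run, comparing such a counter against the expected input statistics is exactly the kind of behavioral sanity check the slide describes.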

23 5 PERFORMANCE This section measures the performance of MapReduce on two computations, Grep and Sort. These programs represent a large subset of real programs that MapReduce users have created

24 5.1 CLUSTER CONFIGURATION Cluster of ≈ 1800 machines. Two 2 GHz Intel Xeon processors with Hyper-Threading. 4 GB of memory. Two 160 GB IDE (Integrated Drive Electronics) disks. Gigabit Ethernet link. Arranged in a two-level tree-shaped switched network, with ≈ 100-200 Gbps of aggregate bandwidth available at the root. All machines are in the same hosting facility, so the round-trip time between any pair is less than a millisecond. Of the 4 GB of memory, approximately 1-1.5 GB was reserved by other tasks. Programs were run on a weekend afternoon, when the CPUs, disks, and network were mostly idle.

25 5.2 GREP Grep scans through 10^10 100-byte records, looking for matches to a rare 3-character pattern. The pattern occurs in 92,337 records. The input gets split up into ≈ 64 MB pieces, and the output is stored in one file.
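A toy version of the grep computation: the map function emits a record only when it matches the pattern ("xyz" is an assumed example; the paper only says the pattern is rare and 3 characters), and the reduce step is effectively the identity, passing matches through to the output:

```python
import re

PATTERN = re.compile("xyz")  # assumed rare 3-character pattern

def grep_map(record):
    """Emit the record only if it matches the pattern."""
    if PATTERN.search(record):
        yield (record, 1)

records = ["aaa", "axyzb", "bbb", "xyz"]
matches = [key for rec in records for key, _ in grep_map(rec)]
print(matches)  # ['axyzb', 'xyz']
```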

26 5.3 SORT The sort program sorts 10^10 100-byte records; it is modeled after the TeraSort benchmark, and the whole program is less than 50 lines. Like Grep, the input is split up into 64 MB pieces. The sorted output is partitioned into 4000 files; the partitioning function uses the initial bytes of the key to segregate the output into one of the 4000 pieces.

27 SORT CONTINUED

28 5.4 EFFECT OF BACKUP TASKS With backup tasks disabled, 5 reduce tasks are still not completed after 960 seconds. The run takes 1283 seconds to finish completely, an increase of 44% in elapsed time.

29 5.5 MACHINE FAILURES 200 worker processes were intentionally killed. The run still finishes, in 933 seconds, a 5% increase over the normal execution time.

30 6 EXPERIENCE Extraction of data for popular queries Google Zeitgeist Extracting properties of web pages Geographical locations of web pages for localized search Clustering problems for Google News and Froogle products Large-scale machine learning problems and graph computations

31 6.1 LARGE SCALE INDEXING The production indexing system produces the data structures used for searches; it was completely rewritten with MapReduce. The crawler gathers approx. 20 terabytes of data, and the indexing process runs as a sequence of 5-10 MapReduce operations.

32 6.1 CONTINUED Indexing code is simpler: 3800 lines of C++ reduced to 700 with MapReduce. Improved performance: unrelated computations are separated, avoiding extra passes over the data. Easier to operate: MapReduce handles machine failures, slow machines, and networking hiccups without operator intervention.

33 7 RELATED WORK MapReduce can be viewed as a simplification of many systems' programming models, made adaptable and scalable. It works off a restricted model, pushing computation to the workers that store the data locally. Its backup-task mechanism is similar to the Charlotte System's eager scheduling, with the added ability to skip bad records that cause repeated failures.

34 CONCLUSION MapReduce has been used successfully at Google for many different purposes. It is easy for programmers to use, even without a background in distributed and parallel systems. It has been applied to sorting, data mining, machine learning, and many other problems.

35 LESSONS LEARNED Restricting the programming model makes it easy to parallelize and distribute computations, and to make those computations fault-tolerant. Network bandwidth is precious; optimizations exist to save it. Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.

36 MAP REDUCE WITH OPENSTACK Cloud Computing with Data Analytics

37 COMMON DEPLOYMENT Swift storage joined to a MapReduce cluster. A scalable storage model is needed to handle large amounts of data, because annual data growth is around 60%.

38 BEGINNER DEPLOYMENT Cloudera offers a MapReduce distribution, which lets companies implement big data gradually.

39 ADVANCED DEPLOYMENT Offers the best flexibility, scalability, and autonomy. Build the cloud first, then add Swift and Nova. Quantum should be added for network segmentation.

40 GOOGLE FILE SYSTEM Lets a third party handle storage while the company focuses on computational processing.

41 NEED TO KNOW Let employees grow with the technology. Without planning, a deployment will most likely fail.

42 WHAT WE WENT OVER What MapReduce is Programming model: types Implementation: execution, master data structures, fault tolerance, locality, task granularity, and backup tasks Refinements: partitioning function, ordering guarantees, combiner functions, input & output types, side-effects, skipping bad records, local execution, status information, and counters Performance: cluster configuration, Grep, Sort, effect of backup tasks, and machine failures Experience Related work Conclusion How MapReduce is run with OpenStack

43 QUESTIONS/COMMENTS

44 REFERENCES

