2 Why Parallelism?
- Data size is increasing
- Single-node architecture is reaching its limit
  - Scan 100 TB at 100 MB/s = 12 days
- Standard/commodity, affordable architecture emerging
  - Cluster of commodity Linux nodes
  - Gigabit Ethernet interconnect
Note: Where did the "gain" come from? Parallel disk access!
3 Design Goals for Parallelism
- Scalability to large data volumes:
  - Scan 100 TB at 100 MB/s = 12 days
  - Scan on a 1000-node cluster = 16 minutes!
- Cost-efficiency:
  - Commodity nodes (cheap, but unreliable)
  - Commodity network
  - Automatic fault tolerance (fewer admins)
  - Easy to use (fewer programmers)
Note: Where did the "gain" come from? Parallel disk access!
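The scan times above can be checked with a quick back-of-envelope calculation, assuming roughly 100 MB/s of sequential read bandwidth per node:

```python
# Back-of-envelope check of the slide's scan times.
TB = 10**12
MB = 10**6

data = 100 * TB          # 100 TB to scan
rate = 100 * MB          # ~100 MB/s sequential read on one node (assumption)

one_node_days = data / rate / 86400            # seconds -> days
cluster_minutes = data / (rate * 1000) / 60    # 1000 nodes reading in parallel

print(round(one_node_days, 1))     # ~11.6 days, i.e. "12 days"
print(round(cluster_minutes, 1))   # ~16.7 minutes, i.e. "16 minutes"
```

The 1000x speedup comes entirely from the 1000 disks scanning in parallel, which is the point of the slide.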
4 What is MapReduce?
- Data-parallel programming model for clusters of commodity machines
- Pioneered by Google
  - Processes 20 PB of data per day
- Popularized by the open-source Hadoop project
  - Used by Yahoo!, Facebook, Amazon, ...
- In 2014, Google dumps MapReduce
5 What is MapReduce used for?
- At Google:
  - Index building for Google Search
  - Article clustering for Google News
  - Statistical machine translation
- At Yahoo!:
  - Index building for Yahoo! Search
  - Spam detection for Yahoo! Mail
- At Facebook:
  - Data mining
  - Ad optimization
  - Spam detection
- In research:
  - Analyzing Wikipedia conflicts (PARC)
  - Natural language processing (CMU)
  - Bioinformatics (Maryland)
  - Astronomical image analysis (Washington)
  - Ocean climate simulation (Washington)
  - Graph OLAP (NUS)
  - ...
  - <Your application>
6 Challenges
- Cheap nodes fail, especially if you have many
- Commodity network = low bandwidth
- Programming distributed systems is hard
7 Typical Hadoop Cluster
- Aggregation switch; rack switches
- 40 nodes/rack, 1000-4000 nodes in cluster
- 1 Gbps bandwidth within rack, 8 Gbps out of rack
- Node specs (Yahoo! terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
Note: In 2011 it was guesstimated that Google had 1M machines.
8 Hadoop Components
- Distributed file system (HDFS)
  - Single namespace for the entire cluster
  - Replicates data 3x for fault tolerance
- MapReduce implementation
  - Executes user jobs specified as "map" and "reduce" functions
  - Manages work distribution & fault tolerance
9 Hadoop Distributed File System
- Files are BIG (100s of GB - TB)
- Typical usage patterns:
  - Append-only; data are rarely updated in place
  - Reads common
- Optimized for large files and sequential reads (why??)
- Files split into large blocks (e.g., 64 MB), called chunks
- Blocks replicated (usually 3 times) across several datanodes (also called chunk or slave nodes)
- Chunk nodes are compute nodes too
[Diagram: Namenode maps File1 to blocks 1-4, which are replicated across the datanodes]
Notes:
- BIG - what about small files? Should you still use Hadoop?
- Append-only implies no ordering of data, so reads must sequentially scan the whole file - which is why HDFS is optimized for sequential reads.
10 Hadoop Distributed File System
- Single namenode (master node) stores metadata (file names, block locations, etc.)
  - May be replicated also
- Client library for file access:
  - Talks to the master to find chunk servers
  - Connects directly to chunk servers to access data
- The master node is not a bottleneck
- Computation is done at the chunk nodes (close to the data)
[Diagram: Namenode maps File1 to blocks 1-4, replicated across the datanodes]
Note: Hadoop follows the master/slave architecture described in the original MapReduce paper. Since Hadoop consists of two parts - data storage (HDFS) and the processing engine (MapReduce) - there are two types of master node: for HDFS, the master is the namenode and the slave is the datanode; for MapReduce in Hadoop, the master is the jobtracker and the slave the tasktracker.
11 MapReduce Programming Model
- Data type: key-value records
  - File = a bag of (key, value) records
- Map function: (Kin, Vin) -> list(Kinter, Vinter)
  - Takes a key-value pair and outputs a set of key-value pairs, e.g., key is the line number, value is a single line of the file
  - There is one Map call for every (kin, vin) pair
- Reduce function: (Kinter, list(Vinter)) -> list(Kout, Vout)
  - All values Vinter with the same key Kinter are reduced together and processed in Vinter order
  - There is one Reduce call per unique key Kinter
Q: What is the restriction on the size of (key, value) pairs?
A: Each pair must be small enough to fit on a single machine - a document or an image is fine, but a 1 PB value is probably not.
12 Example: Word Count Execution
Input (one line per map task):
  "the quick brown fox" | "the fox ate the mouse" | "how now brown cow"
Map outputs:
  (the,1) (quick,1) (brown,1) (fox,1) | (the,1) (fox,1) (ate,1) (the,1) (mouse,1) | (how,1) (now,1) (brown,1) (cow,1)
Sort & Shuffle groups values by key, then Reduce sums them:
  (ate,1) (brown,2) (cow,1) (fox,2) (how,1) (mouse,1) (now,1) (quick,1) (the,3)
Note: Users only need to write a map and a reduce function! The number of instances to mount, and how they are mounted, is left to the execution engine.
13 Example: Word Count

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
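The pseudocode above can be run end-to-end with a minimal local simulation of the map, shuffle, and reduce phases. This is a sketch, not the Hadoop API: `run_mapreduce` is a hypothetical helper that stands in for the execution engine.

```python
from itertools import groupby

def mapper(line):
    # Map: emit (word, 1) for every word on the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce: sum all counts for one key.
    yield (key, sum(values))

def run_mapreduce(lines, mapper, reducer):
    # Map phase: apply mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle & sort: group all values by key.
    pairs.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(reducer(key, [v for _, v in group]))
    return dict(out)

counts = run_mapreduce(
    ["the quick brown fox", "the fox ate the mouse", "how now brown cow"],
    mapper, reducer)
print(counts["the"], counts["brown"], counts["fox"])  # 3 2 2
```

In real Hadoop the shuffle happens over the network between mapper and reducer machines; here a single sort-and-group stands in for it.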
14 Can the Word Count algorithm be improved?
Some questions to consider:
- How many key-value pairs are emitted?
- What if the same word appears in the same line/document?
- What is the overhead?
- How could you reduce that number?
15 An Optimization: The Combiner
- A combiner is a local aggregation function for repeated keys produced by the same map
- For associative ops like sum, count, max
- Decreases the size of intermediate data
Example: local counting for Word Count:

def combiner(key, values):
    output(key, sum(values))

NOTE: For Word Count, this turns out to be exactly what the Reducer does!
16 Word Count with Combiner
Input -> Map & Combine -> Shuffle & Sort -> Reduce -> Output
The map for "the fox ate the mouse" now emits (the,2) locally instead of two separate (the,1) pairs; the other maps are unchanged. The final reduce output is the same:
  (ate,1) (brown,2) (cow,1) (fox,2) (how,1) (mouse,1) (now,1) (quick,1) (the,3)
17 Care in Using a Combiner
- Usually the same as the reducer - but only if the operation is associative and commutative, e.g., sum
- How about average?
- How about median?
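The average question has a concrete answer: averaging per-map averages is wrong, because average is not associative. The fix is to have the combiner emit partial (sum, count) pairs and let the reducer divide once. A small sketch, with made-up per-mapper value lists:

```python
# Two mappers see different numbers of values for the same key.
m1 = [1, 2, 3]        # values seen by mapper 1
m2 = [10]             # values seen by mapper 2

# WRONG: combiner computes an average, reducer averages the averages.
wrong = (sum(m1) / len(m1) + sum(m2) / len(m2)) / 2   # avg of avgs

# RIGHT: combiner emits partial (sum, count); reducer divides once.
total = sum(m1) + sum(m2)
n = len(m1) + len(m2)
right = total / n

print(wrong)   # 6.0  (biased toward the smaller group)
print(right)   # 4.0  (true mean of [1, 2, 3, 10])
```

Median is worse still: there is no fixed-size partial result that lets medians of partitions be merged into a global median, so a simple combiner does not apply at all.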
18 The Complete WordCount Program
[Code figure: the Map function, the Reduce function, and the driver that invokes Map, Combiner & Reduce, run as a MapReduce job]
19 Example 2: Search
- Input: (lineNumber, line) records
- Output: lines matching a given pattern
- Map: if line matches pattern: output(line)
- Reduce: identity function
- Alternative? job.setNumReduceTasks(0);
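A minimal sketch of the search mapper, in the same local style as the word-count example (the pattern and input lines are made up for illustration). Since the reduce is the identity, the mapper output is already the answer, which is what setting zero reduce tasks exploits:

```python
import re

pattern = re.compile(r"fox")   # assumed search pattern for this example

def mapper(line):
    # Emit the whole line iff it matches the pattern.
    # Reduce is identity - or skipped entirely, as with
    # job.setNumReduceTasks(0) in Hadoop.
    if pattern.search(line):
        yield line

lines = ["the quick brown fox", "how now brown cow"]
matches = [m for line in lines for m in mapper(line)]
print(matches)   # ['the quick brown fox']
```

With zero reducers, Hadoop writes the map outputs directly to HDFS, saving the entire shuffle and sort for a job that does not need it.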
20 Example 3: Inverted Index
Source documents:
  hamlet.txt: "to be or not to be"
  12th.txt:   "be not afraid of greatness"
Map outputs:
  (to, hamlet.txt) (be, hamlet.txt) (or, hamlet.txt) (not, hamlet.txt)
  (be, 12th.txt) (not, 12th.txt) (afraid, 12th.txt) (of, 12th.txt) (greatness, 12th.txt)
Inverted index:
  afraid: (12th.txt)
  be: (12th.txt, hamlet.txt)
  greatness: (12th.txt)
  not: (12th.txt, hamlet.txt)
  of: (12th.txt)
  or: (hamlet.txt)
  to: (hamlet.txt)
21 Example 3: Inverted Index
- Input: (filename, text) records
- Output: list of files containing each word
- Map:
    for word in text.split():
        output(word, filename)
- Combine: merge filenames for each word
- Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))
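The inverted-index pipeline can be sketched locally on the two example documents from the previous slide. The shuffle is simulated with a dictionary; the reducer also deduplicates, since a word may appear several times in one file:

```python
from collections import defaultdict

def mapper(filename, text):
    # Map: emit (word, filename) for each word occurrence.
    for word in text.split():
        yield (word, filename)

def reducer(word, filenames):
    # Reduce: deduplicate and sort for a stable posting list.
    yield (word, sorted(set(filenames)))

docs = {"hamlet.txt": "to be or not to be",
        "12th.txt": "be not afraid of greatness"}

# Shuffle: group filenames by word.
shuffled = defaultdict(list)
for fname, text in docs.items():
    for word, f in mapper(fname, text):
        shuffled[word].append(f)

index = dict(kv for w, fs in shuffled.items() for kv in reducer(w, fs))
print(index["be"])    # ['12th.txt', 'hamlet.txt']
print(index["or"])    # ['hamlet.txt']
```

A real combiner would do the same deduplication per map task, shrinking the intermediate data before the shuffle.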
22 Example 4: Most Popular Words
- Input: (filename, text) records
- Output: the 100 words occurring in the most files
- Two-stage solution:
  - Job 1: create an inverted index, giving (word, list(file)) records
  - Job 2: map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
- Optimizations:
  - Map to (word, 1) instead of (word, file) in Job 1
  - Estimate the count distribution in advance by sampling
Note: Input is (fname, word). Map produces (w1,fn1), (w2,fn1), (w3,fn1), ..., (w1,fn2), ...; Reduce produces (w1,count), (w2,count), ...; then a MapReduce sort. So two MapReduce jobs are needed.
23 Note
- #mappers, #reducers, and #physical nodes may not be equal
- For different problems, the Map and Reduce functions differ, but the workflow is the same:
  - Read a lot of data
  - Map: extract something you care about
  - Sort and Shuffle
  - Reduce: aggregate, summarize, filter, or transform
  - Write the result
24 Refinement: Partition Function
- Want to control how keys get partitioned
  - Inputs to map tasks are created by contiguous splits of the input file
  - Reduce needs to ensure that records with the same intermediate key end up at the same worker
- System uses a default partition function: hash(key) mod R
- Sometimes useful to override the hash function:
  - E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
  - E.g., how about sorting?
25 Example 5: Sort
- Input: (key, value) records
- Output: same records, sorted by key
- Map: identity function. Why?
- Reduce: identity function. Why?
- Trick: pick a partitioning function h such that k1 < k2 implies h(k1) < h(k2)
[Diagram: maps emit (ant, bee), (cow, pig), (sheep, yak), (zebra, aardvark, elephant); Reduce 1 handles keys in [A-M] (aardvark, ant, bee, cow, elephant), Reduce 2 handles [N-Z] (pig, sheep, yak, zebra)]
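The trick can be demonstrated locally with the two-reducer [A-M]/[N-Z] split from the diagram (the split point and record list are just the slide's example). Because the partitioner is order-preserving, each reducer's locally sorted output concatenates into a globally sorted result:

```python
def h(key):
    # Order-preserving 2-reducer split from the slide: [A-M] -> 0, [N-Z] -> 1.
    return 0 if key[0].lower() <= "m" else 1

records = ["zebra", "cow", "ant", "pig", "bee", "aardvark",
           "elephant", "sheep", "yak"]

# Map is the identity; the partitioner routes each key to a reducer.
parts = {0: [], 1: []}
for k in records:
    parts[h(k)].append(k)

# Each reducer receives its keys in sorted order (the framework sorts them);
# concatenating the partitions gives a globally sorted output.
out = sorted(parts[0]) + sorted(parts[1])
print(out[0], out[-1])          # aardvark zebra
print(out == sorted(records))   # True
```

This is why map and reduce are both identity functions: the sort itself is done entirely by the shuffle phase, steered by the partitioner.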
26 Job Configuration Parameters
- 190+ parameters in Hadoop
- Set manually, or defaults are used
27 Not all tasks fit the MapReduce model
- Consider a data set of n observations and k variables
  - e.g., k different stock symbols or indices (say k = 10,000) and n observations representing stock price signals (up/down) measured at n different times
- Problem: find very high correlations (ideally with time lags, to be able to make a profit) - e.g., if Google is up today, Facebook is likely to be up tomorrow
- Need to compute k * (k-1) / 2 correlations to solve this problem
- Cannot simply split the 10,000 stock symbols into 1,000 clusters of 10 symbols each and then use MapReduce:
  - The vast majority of the correlations involve one stock symbol in one cluster and another in a different cluster
  - These cross-cluster computations make MapReduce useless in this case
  - The same issue arises if you replace "correlation" with any other function f computed on two variables rather than one
- MapReduce is good for "embarrassingly parallel" jobs. What else is it not good for? Real-time processing ...
NOTE: Make sure you pick the right tool!
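The numbers make the point concrete. With k = 10,000 symbols, the naive 1,000-cluster split covers only a vanishing fraction of the required pairs:

```python
# Pairwise correlations needed for k variables: k*(k-1)/2.
k = 10_000
pairs = k * (k - 1) // 2
print(pairs)                     # 49995000 total pairs

# Naive split: 1000 clusters of 10 symbols covers only within-cluster pairs.
within = 1000 * (10 * 9 // 2)    # C(10,2) pairs per cluster
cross = pairs - within
print(within, cross)             # 45000 covered, 49950000 cross-cluster
```

Over 99.9% of the pairs cross cluster boundaries, which is exactly the inter-key dependence that a single map-side partition cannot express.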
28 High-Level Languages on top of Hadoop
- MapReduce is great, as many algorithms can be expressed by a series of MR jobs
- But it's low-level: you must think about keys, values, partitioning, etc.
- Can we capture common "job patterns"?
29 Pig
- Started at Yahoo! Research
- Now runs about 30% of Yahoo!'s jobs
- Example problem: you have user data in one file and website data in another, and you need to find the top 5 most visited pages by users aged 18-25
- Features:
  - Expresses sequences of MapReduce jobs
  - Data model: nested "bags" of items
  - Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  - Easy to plug in Java functions
  - Pig Pen dev. env. for Eclipse
[Dataflow: Load Users / Load Pages -> Filter by age -> Join on name -> Group on url -> Count clicks -> Order by clicks -> Take top 5]
30 Ease of Translation
Notice how naturally the components of the job translate into Pig Latin:

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';

Pig compiles this into three MapReduce jobs: Job 1 covers the loads, filter, and join; Job 2 the group and count; Job 3 the order-by and top-5 steps.
Note: No need to think about how many MapReduce jobs this decomposes into, or about connecting data flows between MapReduce jobs, or even that this is taking place on top of MapReduce at all.
31 Hive
- Developed at Facebook; used for the majority of Facebook jobs
- "Relational database" built on Hadoop
  - Maintains a list of table schemas
  - SQL-like query language (HQL)
  - Can call Hadoop Streaming scripts from HQL
  - Supports table partitioning, clustering, complex data types, some optimizations
32 Creating a Hive Table

CREATE TABLE page_views(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)  -- break the table into separate files
STORED AS SEQUENCEFILE;

Sample Query - find all page views coming from xyz.com on March 31st:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= ' '
  AND page_views.date <= ' '
  AND page_views.referrer_url like '%xyz.com';

Hive reads only the relevant partitions instead of scanning the entire table.
33 Announcement
- Project teams: submit the list of members of your team ASAP
- Mail to
34 MapReduce Environment
The MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Performing the "group-by-key" step in the Sort and Shuffle phase
- Handling machine failures
- Managing required inter-machine communication
35 MapReduce Execution Details
- A single master controls job execution on multiple slaves, as well as user scheduling
- Mappers are preferentially placed on the same node or same rack as their input block
  - Push computation to the data; minimize network use
- Mappers save outputs to local disk rather than pushing directly to reducers
  - Allows recovery if a reducer crashes
- Outputs of reducers are stored in HDFS
  - May be replicated
Note: Saving map outputs to local disk also "allows having more reducers than nodes" - not sure why; it seems this should also be possible even if data were being pushed.
36 Coordinator: Master Node
The master node takes care of coordination:
- Task status: (idle, in-progress, completed)
- Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer
- The master pushes this info to reducers
- The master pings workers periodically to detect failures
[Execution overview diagram: input files are split (split 0 - split 4); (1) the user program submits the job to the master; (2) the master schedules map and reduce tasks on workers; (3) map workers read their splits and (4) write intermediate files to local disk; (5) mappers report map output locations to the master, which (6) forwards reducer input locations; (7) reduce workers remotely read the intermediate files and (8) write the output files (output file 0, output file 1)]
42 Two Additional Notes
- Barrier between map and reduce phases
  - Reduce cannot start processing until all maps have completed
  - But we can begin copying intermediate data as soon as a mapper completes
- Keys arrive at each reducer in sorted order
  - No enforced ordering across reducers
43 How Many Map and Reduce Tasks?
- M map tasks, R reduce tasks
- Rule of thumb:
  - Make M much larger than the number of nodes in the cluster
  - One DFS chunk per map is common
  - Improves dynamic load balancing and speeds up recovery from worker failures
- Usually R is smaller than M
  - Because output is spread across R files
- Set via JobConf's conf.setNumMapTasks(int num)
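The one-chunk-per-map rule makes M large automatically. A quick sketch, assuming the 100 TB input from the earlier slides and a classic 64 MB HDFS block size:

```python
# Rule of thumb in action: one map task per DFS chunk.
TB, MB = 10**12, 10**6

input_size = 100 * TB    # total input (from the earlier scan example)
chunk = 64 * MB          # assumed HDFS block size

M = input_size // chunk  # number of map tasks
print(M)                 # 1562500
```

Over 1.5 million map tasks on a cluster of a few thousand nodes means each node runs hundreds of tasks in sequence, so a slow or failed node only costs a small slice of the job rather than 1/N of it.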
44 Fault Tolerance in MapReduce
1. If a task crashes:
- Retry on another node
  - Okay for map because it has no dependencies
  - Okay for reduce because map outputs are on disk
- If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)
Q: How do parallel systems handle fault tolerance traditionally? Redo everything, or checkpointing.
Note: For this and the other fault-tolerance features to work, your map and reduce tasks must be side-effect-free.
45 Fault Tolerance in MapReduce
2. If a node crashes:
- Relaunch its current tasks on other nodes
- Relaunch any maps the node previously ran
  - Why? Necessary because their output files were lost along with the crashed node
46 Fault Tolerance in MapReduce
3. If a task is running slowly (a straggler):
- Launch a second copy of the task on another node
- Take the output of whichever copy finishes first, and kill the other one
- Critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
47 Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
- Automatic division of a job into tasks
- Automatic placement of computation near data
- Automatic load balancing
- Recovery from failures & stragglers
The user focuses on the application, not on the complexities of distributed computing.
48 Conclusions
- MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance
- Principal philosophies:
  - Make it scale, so you can throw hardware at problems
  - Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)
- Hive and Pig further simplify programming
- MapReduce is not suitable for all problems, but when it works, it may save you a lot of time
49 Resources
- Hadoop: http://hadoop.apache.org/core/
- Hadoop docs:
- Pig:
- Hive:
- Hadoop video tutorials from Cloudera:
50 Acknowledgements
This lecture is adapted from slides from multiple sources, including those from:
- RAD Lab, Berkeley
- Stanford
- Duke
- Maryland
- ...
52 Word Count Histogram
Counting occurrences of different-length words in documents: what is the Map task and what is the Reduce task?
- Map task: emit counts for each category of word length.
- Reduce task: combine the counts for each category of word length.
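This variant is word count with the word length as the key. A minimal sketch in the same local style as the earlier examples, using a made-up input line:

```python
from collections import Counter

def mapper(line):
    # Map: emit (word_length, 1) for every word.
    for word in line.split():
        yield (len(word), 1)

def reducer(pairs):
    # Reduce: sum the counts per word-length category.
    hist = Counter()
    for length, n in pairs:
        hist[length] += n
    return dict(hist)

lines = ["to be or not to be"]
hist = reducer(kv for line in lines for kv in mapper(line))
print(hist)   # {2: 5, 3: 1}
```

The only change from plain word count is the key: len(word) instead of the word itself, so all words of one length reduce together.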
53 Which of these are effects of reducing the number of key-value pairs that the Map function outputs? Select all that apply (related to the combiner?)
- The amount of work done in the Map phase is reduced because the output set is smaller.
- The amount of work done in the Reduce phase is reduced because the input set is smaller. (T)
- The amount of data that travels from machines executing Map tasks across the network to machines executing Reduce tasks is reduced. (T)
- Reducing the number of key-value pairs output by the Map function does not have any effect.
54 Schematic of Parallel Processing
- A big data file is split into smaller chunks and distributed to k nodes
- f is a function that operates on each data item
- The result is a distributed set of answers