Advanced topics on Mapreduce with Hadoop Jiaheng Lu Department of Computer Science Renmin University of China www.jiahenglu.net.


1 Advanced Topics on MapReduce with Hadoop. Jiaheng Lu, Department of Computer Science, Renmin University of China

2 Outline: Brief Review; Chaining MapReduce Jobs; Join in MapReduce; Bloom Filter

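3 Brief Review: MapReduce is a parallel programming framework based on divide and merge. [Figure: the input data is divided into split0, split1 and split2; map tasks (mappers) process the splits; the shuffle routes their output to the reduce tasks (reducers), which write output0 and output1.]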
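The divide-and-merge flow can be sketched in a single process. The following is a minimal in-memory model, assuming a word-count workload; the class `MiniMapReduce` and its methods are hypothetical names used for illustration and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of map -> shuffle -> reduce; not Hadoop code.
class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the value list of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    // Driver: map every split, then shuffle and reduce the combined map output.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        return reduce(shuffle(mapped));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c", "a")));  // prints {a=3, b=2, c=1}
    }
}
```

In a real cluster each split is mapped on a different node and the shuffle moves data over the network; the sketch only shows the data flow between the three phases.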

4 Chaining MapReduce jobs: Chaining in a sequence; Chaining with complex dependency; Chaining preprocessing and postprocessing steps

5 Chaining in a sequence: Simple and straightforward. Patterns: [MAP | REDUCE]+ and MAP+ | REDUCE | MAP*. The output of one job is the input to the next, similar to pipes: Job1 -> Job2 -> Job3.
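The pipe-like behaviour can be sketched without a cluster. In this minimal model each "job" is a function from a list of records to a list of records; the class `SequentialChain` and the two sample jobs are assumptions made for illustration, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical model: each "job" maps a list of input records to a list of output records.
class SequentialChain {
    // Run jobs in order; the output of one job is the input of the next.
    static List<String> runChain(List<String> input, List<Function<List<String>, List<String>>> jobs) {
        List<String> current = input;
        for (Function<List<String>, List<String>> job : jobs) {
            current = job.apply(current);
        }
        return current;
    }

    // Job1: count occurrences of each word, emitting "word<TAB>count" records.
    static List<String> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        counts.forEach((w, c) -> out.add(w + "\t" + c));
        return out;
    }

    // Job2: keep only the words that occur at least twice.
    static List<String> keepFrequent(List<String> records) {
        List<String> out = new ArrayList<>();
        for (String r : records) {
            if (Integer.parseInt(r.split("\t")[1]) >= 2) out.add(r);
        }
        return out;
    }

    // Job1 -> Job2
    static List<String> countThenFilter(List<String> input) {
        return runChain(input, List.of(SequentialChain::countWords, SequentialChain::keepFrequent));
    }
}
```

With Hadoop the same effect is obtained by setting the input path of each job to the output path of the previous one and running the jobs one after another in the driver.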

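6 Configuration of a chained job (old org.apache.hadoop.mapred API; ChainMapper is in org.apache.hadoop.mapred.lib):

    Configuration conf = getConf();
    JobConf job = new JobConf(conf);
    job.setJobName("ChainJob");
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, in);    // in: input path(s)
    FileOutputFormat.setOutputPath(job, out);  // out: output path

    // Add Map1 as the first mapper in the chain, with its own local configuration.
    // Class arguments: input key, input value, output key, output value.
    // true: pass key/value pairs to the next stage by value.
    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, Map1.class, LongWritable.class, Text.class,
                          Text.class, Text.class, true, map1Conf);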

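7 Chaining with complex dependency: Jobs are not always chained in a linear fashion. Use the addDependingJob() method to add dependency information: x.addDependingJob(y) means that job x will not start until job y has finished. [Figure: dependency graph of Job1, Job2 and Job3.]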
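The dependency rule can be illustrated with a small scheduler. This is a sketch under stated assumptions: `SimpleJob` and `SimpleJobControl` are hypothetical stand-ins, not Hadoop's own job-control classes, and the example graph (Job3 waiting for both Job1 and Job2) is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a job with prerequisites; not Hadoop's own class.
class SimpleJob {
    final String name;
    final List<SimpleJob> dependsOn = new ArrayList<>();

    SimpleJob(String name) {
        this.name = name;
    }

    // x.addDependingJob(y): x must wait until y has finished.
    void addDependingJob(SimpleJob prerequisite) {
        dependsOn.add(prerequisite);
    }
}

class SimpleJobControl {
    // Repeatedly run every job whose prerequisites have all finished.
    static List<String> run(List<SimpleJob> jobs) {
        List<String> order = new ArrayList<>();
        Set<SimpleJob> finished = new HashSet<>();
        while (finished.size() < jobs.size()) {
            boolean progressed = false;
            for (SimpleJob job : jobs) {
                if (!finished.contains(job) && finished.containsAll(job.dependsOn)) {
                    order.add(job.name);  // a real driver would submit the job here
                    finished.add(job);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or unsatisfiable dependencies");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        SimpleJob job1 = new SimpleJob("Job1");
        SimpleJob job2 = new SimpleJob("Job2");
        SimpleJob job3 = new SimpleJob("Job3");
        job3.addDependingJob(job1);
        job3.addDependingJob(job2);
        System.out.println(run(List.of(job3, job1, job2)));  // prints [Job1, Job2, Job3]
    }
}
```

The sketch runs jobs one at a time; a real controller can run independent jobs such as Job1 and Job2 concurrently.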

8 Chaining preprocessing and postprocessing steps: Example: removing stop words in information retrieval. Approaches: (1) run each step as a separate job, which is inefficient; (2) chain the steps into a single job using ChainMapper.addMapper() and ChainReducer.setReducer(), following the pattern Map+ | Reduce | Map*.
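The Map+ | Reduce | Map* pattern can be modelled in one process as follows. This is only a sketch: the class `ChainedJob`, the sample stop-word list and the output format are assumptions for illustration, and no Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical single-process model of one chained job: Map+ | Reduce | Map*.
class ChainedJob {
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of");  // assumed sample list

    // Pre-processing mapper 1: normalise case.
    static List<String> lowerCase(String word) {
        return List.of(word.toLowerCase());
    }

    // Pre-processing mapper 2: drop stop words by emitting nothing for them.
    static List<String> removeStopWords(String word) {
        return STOP_WORDS.contains(word) ? List.of() : List.of(word);
    }

    // Feed each record through the mappers in order, with no intermediate files.
    static List<String> applyMappers(List<String> records, List<Function<String, List<String>>> mappers) {
        List<String> current = records;
        for (Function<String, List<String>> mapper : mappers) {
            List<String> next = new ArrayList<>();
            for (String r : current) next.addAll(mapper.apply(r));
            current = next;
        }
        return current;
    }

    // Reducer: count occurrences of each surviving word.
    static Map<String, Integer> reduce(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // Post-processing mapper: format each (word, count) pair as an output line.
    static List<String> format(Map<String, Integer> counts) {
        List<String> out = new ArrayList<>();
        counts.forEach((w, c) -> out.add(w + "=" + c));
        return out;
    }

    static List<String> run(List<String> words) {
        List<String> mapped = applyMappers(words, List.of(ChainedJob::lowerCase, ChainedJob::removeStopWords));
        return format(reduce(mapped));
    }
}
```

Calling `ChainedJob.run(List.of("The", "cat", "of", "the", "Cat"))` yields the single line `cat=2`, because both stop-word removal and counting happen inside one job.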

9 Join in MapReduce: (1) Reduce-side join. (2) Broadcast join. (3) Map-side filtering and reduce-side join, where the filter is a given key, a range from a dataset (broadcast), or a Bloom filter.

10 Reduce-side join: Map emits the join key as the output key and the record, tagged with its data source, as the value. Reduce does a full cross-product of the values from the different sources and outputs the combined results.

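11 Example: reduce-side join of table x and table y on column a

table x          table y
a  b             a  c
1  ab            1  b
1  cd            2  d
4  ef            4  c

map() output (key = join key, value = source tag + record):
key  value       key  value
1    x ab        1    y b
1    x cd        2    y d
4    x ef        4    y c

shuffle() output:
key  value list
1    x ab, x cd, y b
2    y d
4    x ef, y c

reduce() output:
a  b   c
1  ab  b
1  cd  b
4  ef  c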
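The reduce-side join of tables x and y can be sketched in plain Java. This is a minimal in-memory model, not Hadoop code; the class `ReduceSideJoin`, the helper class `Tagged` and the comma-separated output format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a reduce-side join; not Hadoop code.
class ReduceSideJoin {
    // A map output value: the record payload tagged with its source table.
    static final class Tagged {
        final String source;
        final String value;

        Tagged(String source, String value) {
            this.source = source;
            this.value = value;
        }
    }

    // Map + shuffle: key each row by its join key and tag it with its table name.
    static void mapTable(String source, String[][] rows, Map<String, List<Tagged>> shuffled) {
        for (String[] row : rows) {
            shuffled.computeIfAbsent(row[0], k -> new ArrayList<>()).add(new Tagged(source, row[1]));
        }
    }

    // Reduce: for one join key, emit the cross-product of x values and y values.
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> xs = new ArrayList<>();
        List<String> ys = new ArrayList<>();
        for (Tagged t : values) {
            (t.source.equals("x") ? xs : ys).add(t.value);
        }
        List<String> out = new ArrayList<>();
        for (String x : xs) {
            for (String y : ys) {
                out.add(key + "," + x + "," + y);
            }
        }
        return out;
    }

    static List<String> join(String[][] tableX, String[][] tableY) {
        Map<String, List<Tagged>> shuffled = new TreeMap<>();
        mapTable("x", tableX, shuffled);
        mapTable("y", tableY, shuffled);
        List<String> out = new ArrayList<>();
        shuffled.forEach((key, values) -> out.addAll(reduce(key, values)));
        return out;
    }
}
```

For x = {(1, ab), (1, cd), (4, ef)} and y = {(1, b), (2, d), (4, c)}, `join` returns `1,ab,b`, `1,cd,b` and `4,ef,c`. Key 2 produces nothing because it has no x-side value.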

12 Broadcast join (replicated join): Broadcast the smaller table to every map task and do the join in map(). The table is distributed using the distributed cache: DistributedCache.addCacheFile().
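The map-side logic of a broadcast join can be sketched as a hash lookup. This is a minimal model, assuming the smaller table fits in memory; the class `BroadcastJoin` and its methods are hypothetical, and the distributed cache itself is not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a broadcast (replicated) join; not Hadoop code.
class BroadcastJoin {
    // Setup step: load the small table into memory, as each map task
    // would do from its local copy of the broadcast file.
    static Map<String, List<String>> loadSmallTable(String[][] smallTable) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : smallTable) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return index;
    }

    // Map step: join one row of the large table by probing the in-memory table.
    static List<String> map(String[] bigRow, Map<String, List<String>> small) {
        List<String> out = new ArrayList<>();
        for (String match : small.getOrDefault(bigRow[0], List.of())) {
            out.add(bigRow[0] + "," + bigRow[1] + "," + match);
        }
        return out;
    }

    // Driver: no shuffle or reduce phase is needed.
    static List<String> join(String[][] bigTable, String[][] smallTable) {
        Map<String, List<String>> small = loadSmallTable(smallTable);
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            out.addAll(map(row, small));
        }
        return out;
    }
}
```

Because every map task holds the whole smaller table, the join finishes in the map phase and the large table is never shuffled.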

13 Map-side filtering and reduce-side join: The join key is the student ID from the info table. Generate an IDs file from info and broadcast it for the join. What if the IDs file cannot be stored in memory? Use a Bloom filter.

14 A Bloom Filter: Introduction; Implementation of Bloom filter; Use in MapReduce join

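15 Introduction to Bloom Filter: A space-efficient data structure of constant size for testing whether an element is in a set. Operations: add() inserts an element; contains() tests whether an element is present. There are no false negatives, and only a small probability of false positives.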
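The "small probability of false positives" can be quantified with the standard approximation below. It assumes k independent, uniformly distributed hash functions; the symbols m (number of bits in the array) and n (number of elements added) are introduced here for illustration and do not appear in the slides.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\,\ln 2
```

For example, with m/n = 10 bits per element and k = 7, p is roughly 0.8%.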

16 Implementation of Bloom filter: Use a bit array. To add an element, generate k indexes and set those k bits to 1. To test an element, generate k indexes: if all k bits are 1, return true; if not all are 1, return false.
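The add and test steps can be sketched with `java.util.BitSet`. This is a minimal sketch, not a production filter: the class name `SimpleBloomFilter` is hypothetical, and the way the k indexes are derived (double hashing built from `String.hashCode()`) is an assumption chosen for brevity, since the slides do not specify the hash functions.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch. The hashing scheme below is an assumption,
// not something taken from the slides.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the array
    private final int numHashes;  // k, the number of indexes per element

    SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Generate the k bit indexes for an element.
    private int[] indexes(String element) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1;  // non-zero step between successive probes
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    // Add: set the k bits to 1.
    void add(String element) {
        for (int i : indexes(element)) {
            bits.set(i);
        }
    }

    // Test: true only if all k bits are 1 (the answer may be a false positive).
    boolean contains(String element) {
        for (int i : indexes(element)) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

An element that was added always tests true, because its k bits are never cleared; this is why the structure has no false negatives.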

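17 Example: (1) Initial state: all bits are 0. (2) add x (0,2,6): set bits 0, 2 and 6. (3) add y (0,3,9): set bits 0, 3 and 9. (4) contains m (1,3,9): bit 1 is 0, so the result is false (×). (5) contains n (0,2,9): all three bits are 1, so the result is true (√), a false positive because n was never added.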
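The example can be replayed directly on a bit array by passing the index triples from the slide instead of computing hashes. The class `BloomExample` is a hypothetical name, and the array size of 10 is an assumption: the slide only shows indexes between 0 and 9.

```java
import java.util.BitSet;

// Replays the slide's example with explicit indexes instead of hash functions.
class BloomExample {
    // Add: set every given index to 1.
    static void add(BitSet bits, int... indexes) {
        for (int i : indexes) {
            bits.set(i);
        }
    }

    // Contains: true only if every given index is 1.
    static boolean contains(BitSet bits, int... indexes) {
        for (int i : indexes) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10);  // assumed size
        add(bits, 0, 2, 6);            // add x
        add(bits, 0, 3, 9);            // add y
        System.out.println(contains(bits, 1, 3, 9));  // m: prints false, bit 1 is 0
        System.out.println(contains(bits, 0, 2, 9));  // n: prints true, a false positive
    }
}
```

The false positive for n arises because bits 0 and 2 were set by x and bit 9 by y, so n's three indexes are all covered even though n itself was never added.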

18 Use in MapReduce join: A separate subjob creates the Bloom filter. Broadcast the Bloom filter and use it in map() of the join job to drop useless records, then do the join in reduce.
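The three steps can be sketched end to end in one process. This is a simplified model under stated assumptions: the class `BloomFilterJoin`, the filter size, the number of hashes and the hashing scheme are all hypothetical choices, and the reduce step is reduced to an exact hash lookup rather than a full shuffle.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a Bloom-filtered join; not Hadoop code.
class BloomFilterJoin {
    static final int BITS = 1 << 16;  // assumed filter size
    static final int HASHES = 3;      // assumed k

    // i-th bit index for a key (assumed double-hashing scheme).
    static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, BITS);
    }

    // Subjob 1: build a Bloom filter over the join keys of the smaller table.
    static BitSet buildFilter(String[][] smallTable) {
        BitSet filter = new BitSet(BITS);
        for (String[] row : smallTable) {
            for (int i = 0; i < HASHES; i++) {
                filter.set(index(row[0], i));
            }
        }
        return filter;
    }

    static boolean mightContain(BitSet filter, String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Map step of the join job: drop big-table rows whose key cannot match.
    static List<String[]> mapFilter(String[][] bigTable, BitSet filter) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : bigTable) {
            if (mightContain(filter, row[0])) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Reduce step: exact join, which also discards any false positives.
    static List<String> reduceJoin(List<String[]> keptBigRows, String[][] smallTable) {
        Map<String, List<String>> small = new HashMap<>();
        for (String[] row : smallTable) {
            small.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] row : keptBigRows) {
            for (String v : small.getOrDefault(row[0], List.of())) {
                out.add(row[0] + "," + row[1] + "," + v);
            }
        }
        return out;
    }

    static List<String> join(String[][] bigTable, String[][] smallTable) {
        return reduceJoin(mapFilter(bigTable, buildFilter(smallTable)), smallTable);
    }
}
```

A false positive in the filter only lets an extra record through to the reduce step, where the exact key comparison removes it, so the join result stays correct while most non-matching records are never shuffled.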

19 References: Chuck Lam, "Hadoop in Action"; Jairam Chandar, "Join Algorithms using Map/Reduce"

20 THANK YOU

21 Hadoop
