MapReduce design patterns Chapter 5: Join Patterns 2015. 6. 4 G201449021 진다인.


Contents
I. Introduction
II. A Refresher on Joins
III. Reduce Side Join
IV. Replicated Join
V. Composite Join
VI. Cartesian Product

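Introduction
It is rare to have all of your data in one huge data set.
Why do we need joins when working with data? The data usually lives in separate sets, for example:
- User information
- Log data: user activity logs streamed from a website
- Paid services: payment information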

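Introduction
Why do we need joins when working with data?
- Analyzing several data sets together can reveal interesting relationships that no single data set shows.
- With join patterns, you can build a richer data set out of several smaller ones and extract the information you want from it.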

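Introduction: Joins in SQL and in the MapReduce framework
- In SQL, a join is a simple command, and the database engine takes care of all the complicated and tedious work.
- In the MapReduce framework, a join is not as simple as in SQL.
  - Nature of the framework: it processes one key/value pair at a time, each coming from a single input.
  - A join works on data sets with different structures: to process a join correctly, MapReduce has to know which data set each record came from.
  - There are several join patterns: the algorithm and pattern must be chosen and tuned with resources such as network bandwidth in mind.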

A Refresher on Joins: Inner Join (figure)

A Refresher on Joins: Left Outer Join (figure)

A Refresher on Joins: Right Outer Join (figure)

A Refresher on Joins: Full Outer Join (figure)

A Refresher on Joins: Antijoin (figure)

A Refresher on Joins: Cartesian Product (figure)

A Refresher on Joins: Cartesian Product, continued (figure)
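The join types named in the refresher (apart from the Cartesian product) can be illustrated with a small plain-Java sketch. It is not taken from the slides: the class name `JoinRefresher`, the `key:left|right` output format, and the simplification that each key appears at most once per table are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class JoinRefresher {
    /** Joins two key -> value tables; each output row is "key:left|right". */
    static List<String> join(Map<Integer, String> a, Map<Integer, String> b, String type) {
        // Visit every key that occurs in either table, in ascending order
        TreeSet<Integer> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        List<String> rows = new ArrayList<>();
        for (Integer k : keys) {
            boolean inA = a.containsKey(k);
            boolean inB = b.containsKey(k);
            boolean keep;
            switch (type) {
                case "inner":      keep = inA && inB; break; // key present on both sides
                case "leftouter":  keep = inA;        break; // every row of A
                case "rightouter": keep = inB;        break; // every row of B
                case "fullouter":  keep = true;       break; // every row of A and B
                case "anti":       keep = inA ^ inB;  break; // rows with no partner
                default: throw new IllegalArgumentException("Unknown join type: " + type);
            }
            if (keep) {
                rows.add(k + ":" + (inA ? a.get(k) : "null") + "|" + (inB ? b.get(k) : "null"));
            }
        }
        return rows;
    }
}
```

With users keyed 3, 4, 5 and comments keyed 3, 5, 8, the inner join keeps keys 3 and 5, while the antijoin keeps only keys 4 and 8.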

Reduce Side Join: Pattern description
- Takes the longest to run of all the join patterns
- Easy to implement
- Supports every join operation
Intent
- Join large data sets on a foreign key
Applicability
- The data sets are so large that there is no alternative
- Any join operation has to be handled flexibly

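Reduce Side Join: Performance analysis
- Heavy load on the cluster network
  - The foreign key is extracted from every input record and sent along with the record: there is no pre-filtering
  - A very large amount of data goes through the shuffle and sort phase
- Uses more reducers than a typical analytics job
- If another approach can do the job, use that approach instead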

Reduce Side Join: Structure (figure)

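Reduce Side Join: Structure
Mapper: reads every record of each data set and extracts the foreign key. The output value is tagged with a unique identifier of its data set (A/B).
- Output key: the foreign key
- Output value: the entire input record
Reducer: collects the input values into temporary lists, one per data set, and runs the join operation the user configured.
- Inner join: emits output only when all lists are non-empty
- Outer join: depending on the type, an empty list is still joined with the non-empty list
- Antijoin: tests that exactly one list is empty and emits the records of the other list
Combiner optimization: not possible, because the join happens in the reduce phase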
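The mapper and reducer roles described above can be imitated without Hadoop. The following is only a sketch under stated assumptions: records are toy strings of the form `key,payload`, a `TreeMap` stands in for shuffle and sort, and `ReduceSideJoinSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoinSketch {
    /** "Map" step: emit (foreign key, tag + record) into the simulated shuffle. */
    static void map(String record, char tag, Map<String, List<String>> shuffle) {
        // Assumed toy format: the foreign key is everything before the first comma
        String foreignKey = record.substring(0, record.indexOf(','));
        shuffle.computeIfAbsent(foreignKey, k -> new ArrayList<>()).add(tag + record);
    }

    /** "Reduce" step for an inner join: bin values by tag, then pair every A with every B. */
    static List<String> reduceInner(List<String> taggedValues) {
        List<String> listA = new ArrayList<>();
        List<String> listB = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.charAt(0) == 'A') {
                listA.add(v.substring(1)); // strip the tag
            } else if (v.charAt(0) == 'B') {
                listB.add(v.substring(1));
            }
        }
        List<String> out = new ArrayList<>();
        for (String a : listA) {
            for (String b : listB) {
                out.add(a + "\t" + b);
            }
        }
        return out;
    }

    /** Runs both steps over two in-memory data sets. */
    static List<String> innerJoin(List<String> dataSetA, List<String> dataSetB) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // groups and sorts by key
        for (String r : dataSetA) map(r, 'A', shuffle);
        for (String r : dataSetB) map(r, 'B', shuffle);
        List<String> out = new ArrayList<>();
        for (List<String> group : shuffle.values()) {
            out.addAll(reduceInner(group));
        }
        return out;
    }
}
```

Because every record of both inputs passes through the simulated shuffle, the sketch also shows why this pattern puts so much load on the network in a real cluster.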

Reduce Side Join: Example
Uses the Stack Overflow user and comment data sets.
Problem: given user information and comment data, join the two so that you can tell who wrote which comment.
User data (sample row, truncated):
    ... Hi, I'm not really a person. I'm a background process that helps keep this site clean! I do things like Randomly poke old unanswered questions every hour so they get some attention Own community questions and answers so nobody gets unnecessary reputation from them Own downvotes on spam/evil posts that get permanently deleted Own suggested edits from anonymous users Remove abandoned questions " Location="on the server farm" LastAccessDate="2014-04-17T00:17:22.260" DisplayName="Community" CreationDate="2014-04-17T00:17:22.260" Reputation="1" Id="-1"/>
Comment data (sample not shown)

Reduce Side Join: Example, Mapper #1 (UserJoinMapper)

    public static class UserJoinMapper extends Mapper<Object, Text, Text, Text> {

        private Text outkey = new Text();
        private Text outvalue = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the input string into a map of attribute names to values
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());

            // A user row identifies the user by its "Id" attribute
            String userId = parsed.get("Id");
            if (userId == null) {
                return;
            }

            // The foreign join key is the user ID
            outkey.set(userId);

            // Flag this record as data set A for the reducer, then output
            outvalue.set("A" + value.toString());
            context.write(outkey, outvalue);
        }
    }

Reduce Side Join: Example, Mapper #2 (CommentJoinMapper)

    public static class CommentJoinMapper extends Mapper<Object, Text, Text, Text> {

        private Text outkey = new Text();
        private Text outvalue = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the input string into a map of attribute names to values
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());

            // A comment row refers to its author through the "UserId" attribute
            String userId = parsed.get("UserId");
            if (userId == null) {
                return;
            }

            // The foreign join key is the user ID
            outkey.set(userId);

            // Flag this record as data set B for the reducer, then output
            outvalue.set("B" + value.toString());
            context.write(outkey, outvalue);
        }
    }

Reduce Side Join: Example, Reducer

    // listA, listB, and joinType are fields of the reducer class
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Clear our lists
        listA.clear();
        listB.clear();

        // Iterate through all our values, binning each record based on
        // what it was tagged with. Make sure to remove the tag!
        for (Text t : values) {
            if (t.charAt(0) == 'A') {
                listA.add(new Text(t.toString().substring(1)));
            } else if (t.charAt(0) == 'B') {
                listB.add(new Text(t.toString().substring(1)));
            }
        }

        // Execute our join logic now that the lists are filled
        executeJoinLogic(context);
    }

Reduce Side Join: Example, Reducer (inner join and left outer join)

    private void executeJoinLogic(Context context)
            throws IOException, InterruptedException {
        if (joinType.equalsIgnoreCase("inner")) {
            // Inner join: if both lists are not empty, join A with B
            if (!listA.isEmpty() && !listB.isEmpty()) {
                for (Text A : listA) {
                    for (Text B : listB) {
                        context.write(A, B);
                    }
                }
            }
        } else if (joinType.equalsIgnoreCase("leftouter")) {
            // Left outer join: for each entry in A,
            for (Text A : listA) {
                // If list B is not empty, join A and B
                if (!listB.isEmpty()) {
                    for (Text B : listB) {
                        context.write(A, B);
                    }
                } else {
                    // Else, output A by itself
                    context.write(A, new Text(""));
                }
            }
        }
        // The if/else chain continues with the remaining join types

Reduce Side Join: Example, Reducer (right outer join)

        else if (joinType.equalsIgnoreCase("rightouter")) {
            // For each entry in B,
            for (Text B : listB) {
                // If list A is not empty, join A and B
                if (!listA.isEmpty()) {
                    for (Text A : listA) {
                        context.write(A, B);
                    }
                } else {
                    // Else, output B by itself
                    context.write(new Text(""), B);
                }
            }
        }

Reduce Side Join: Example, Reducer (full outer join)

        else if (joinType.equalsIgnoreCase("fullouter")) {
            // If list A is not empty
            if (!listA.isEmpty()) {
                // For each entry in A
                for (Text A : listA) {
                    // If list B is not empty, join A with B
                    if (!listB.isEmpty()) {
                        for (Text B : listB) {
                            context.write(A, B);
                        }
                    } else {
                        // Else, output A by itself
                        context.write(A, new Text(""));
                    }
                }
            } else {
                // If list A is empty, just output B
                for (Text B : listB) {
                    context.write(new Text(""), B);
                }
            }
        }

Reduce Side Join: Example, Reducer (antijoin)

        else if (joinType.equalsIgnoreCase("anti")) {
            // If list A is empty and B is not, or vice versa
            if (listA.isEmpty() ^ listB.isEmpty()) {
                // Iterate both A and B with empty values.
                // The XOR check above guarantees that exactly one of these
                // lists is empty, so only one of the loops produces output.
                for (Text A : listA) {
                    context.write(A, new Text(""));
                }
                for (Text B : listB) {
                    context.write(new Text(""), B);
                }
            }
        } else {
            throw new RuntimeException(
                    "Join type not set to inner, leftouter, rightouter, fullouter, or anti");
        }
    }

Reduce Side Join: Example, Driver main

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 4) {
        System.err.println("Usage: ReduceSideJoin <user data> <comment data> <out> "
                + "[inner|leftouter|rightouter|fullouter|anti]");
        System.exit(1);
    }

    String joinType = otherArgs[3];
    if (!(joinType.equalsIgnoreCase("inner")
            || joinType.equalsIgnoreCase("leftouter")
            || joinType.equalsIgnoreCase("rightouter")
            || joinType.equalsIgnoreCase("fullouter")
            || joinType.equalsIgnoreCase("anti"))) {
        System.err.println(
                "Join type not set to inner, leftouter, rightouter, fullouter, or anti");
        System.exit(2);
    }

    Job job = new Job(conf, "Reduce Side Join");

    // Configure the join type
    job.getConfiguration().set("join.type", joinType);
    job.setJarByClass(ReduceSideJoinDriver.class);

Reduce Side Join: Example, Driver main (continued)

    // Use multiple inputs to set which input uses what mapper.
    // This keeps the parsing of each data set separate from a logical
    // standpoint.
    // However, this version of Hadoop has not upgraded MultipleInputs
    // to the mapreduce package, so we have to use the deprecated API.
    // Future releases have this in the "mapreduce" package.
    MultipleInputs.addInputPath(job, new Path(otherArgs[0]),
            TextInputFormat.class, UserJoinMapper.class);
    MultipleInputs.addInputPath(job, new Path(otherArgs[1]),
            TextInputFormat.class, CommentJoinMapper.class);

    job.setReducerClass(UserJoinReducer.class);
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 3);

Reduce Side Join: Result (figure)

Replicated Join: Pattern description
- A join between one large data set and several small data sets
- Runs in the map phase with no reducer (map-only)
- The small data sets are loaded into memory during map task setup
  - This is limited by the size of the JVM heap
Intent
- Avoid moving the data to the reduce phase
Applicability
- Inner join / left outer join
- Every data set except the large one is small enough to fit comfortably in the main memory available to each map task

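Replicated Join: Performance analysis
- In terms of run time, this is the fastest join because there is no reducer
- The amount of data that can be held safely in the JVM is limited, and it depends on how much memory is assigned to each map and reduce task
  - Run several experiments before the real implementation to decide how much memory to assign
- (Caution) The size of the data in memory ≠ the number of bytes needed to store it on disk
  - Java object overhead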

Replicated Join: Structure
- Mapper: reads all files from the distributed cache and stores them in an in-memory lookup table
- No combiner, partitioner, or reducer: map-only
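The lookup-table idea can be shown in a few lines of plain Java. This is a sketch, not the Hadoop mapper: the constructor stands in for `setup()`, the `map` method stands in for the mapper's `map()`, and the class name `ReplicatedJoinSketch` and the tab-separated output are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class ReplicatedJoinSketch {
    private final Map<String, String> userIdToInfo = new HashMap<>();
    private final boolean leftOuter;

    /** Plays the role of setup(): load the small data set into an in-memory lookup table. */
    ReplicatedJoinSketch(Map<String, String> smallDataSet, boolean leftOuter) {
        userIdToInfo.putAll(smallDataSet);
        this.leftOuter = leftOuter;
    }

    /**
     * Plays the role of map(): joins one record of the large data set.
     * Returns null when the record has no match and the join is an inner join.
     */
    String map(String userId, String comment) {
        String info = userIdToInfo.get(userId);
        if (info != null) {
            return comment + "\t" + info;  // match found: emit the joined row
        }
        return leftOuter ? comment + "\t" : null; // left outer keeps unmatched rows
    }
}
```

Only the large data set is streamed; the small one is consulted purely through hash lookups, which is why no shuffle or reducer is needed.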

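Replicated Join: Example (replicated user comment)
Problem: given user data and comment data, enrich the comments by joining the user information onto them.
- DistributedCache sends the file to every map task
  - The user data is stored in the DistributedCache
- The data that is read is stored directly in memory
- Instead of filtering out records that will not join and sending the rest to the reduce phase, the join is done directly in the map phase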

Replicated Join: Example (replicated user comment), Mapper code

    public static class ReplicatedJoinMapper extends Mapper<Object, Text, Text, Text> {

        private HashMap<String, String> userIdToInfo = new HashMap<String, String>();
        private Text outvalue = new Text();
        private String joinType = null;

        @Override
        public void setup(Context context) throws IOException, InterruptedException {
            try {
                Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (files == null || files.length == 0) {
                    throw new RuntimeException(
                            "User information is not set in DistributedCache");
                }

                // Read all files in the DistributedCache
                for (Path p : files) {
                    BufferedReader rdr = new BufferedReader(
                            new InputStreamReader(
                                    new GZIPInputStream(
                                            new FileInputStream(new File(p.toString())))));
                    // setup() continues on the next slide

Replicated Join: Example (replicated user comment), Mapper code (continued)

                    String line;
                    // For each record in the user file
                    while ((line = rdr.readLine()) != null) {
                        // Get the user ID for this record
                        Map<String, String> parsed = MRDPUtils.transformXmlToMap(line);
                        String userId = parsed.get("Id");
                        if (userId != null) {
                            // Map the user ID to the record
                            userIdToInfo.put(userId, line);
                        }
                    }
                    // Release the file handle once the file is loaded
                    rdr.close();
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }

            // Get the join type
            joinType = context.getConfiguration().get("join.type");
        }

Replicated Join: Example (replicated user comment), Mapper code (continued)

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the input string into a map of attribute names to values
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            String userId = parsed.get("UserId");
            if (userId == null) {
                return;
            }

            String userInformation = userIdToInfo.get(userId);

            // If the user information is not null, then output
            if (userInformation != null) {
                outvalue.set(userInformation);
                context.write(value, outvalue);
            } else if (joinType.equalsIgnoreCase("leftouter")) {
                // If we are doing a left outer join, output the record
                // with an empty value
                context.write(value, new Text(""));
            }
        }
    }

Replicated Join: Example (replicated user comment), Driver code

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 4) {
            System.err.println(
                    "Usage: ReplicatedJoin <user data> <comment data> <out> [inner|leftouter]");
            System.exit(1);
        }

        String joinType = otherArgs[3];
        if (!(joinType.equalsIgnoreCase("inner")
                || joinType.equalsIgnoreCase("leftouter"))) {
            System.err.println("Join type not set to inner or leftouter");
            System.exit(2);
        }

        // Configure the join type
        Job job = new Job(conf, "Replicated Join");
        job.getConfiguration().set("join.type", joinType);
        job.setJarByClass(ReplicatedJoinDriver.class);

Replicated Join: Example (replicated user comment), Driver code (continued)

        job.setMapperClass(ReplicatedJoinMapper.class);
        job.setNumReduceTasks(0);

        TextInputFormat.setInputPaths(job, new Path(otherArgs[1]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[2]));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Configure the DistributedCache
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(),
                job.getConfiguration());
        DistributedCache.setLocalFiles(job.getConfiguration(), otherArgs[0]);

        System.exit(job.waitForCompletion(true) ? 0 : 3);
    }

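Composite Join: Pattern description
- A specialized join that can run in the map phase over large, formatted inputs
Intent
- Skip the shuffle and sort and reduce phases
  - The input has to be very well prepared
Applicability
- Inner join / full outer join
- All data sets are very large
- Every data set can be read with the foreign key as the mapper's input key
- All data sets have the same number of partitions
- Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set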

Composite Join: Structure (partitions)
Partition X of data sets A and B contains the same foreign keys. Those foreign keys exist only in partition X and in no other partition.
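One way to obtain such a layout is to partition every data set with the same deterministic rule. The sketch below is an illustration only: `CoPartitionSketch`, `partitionOf`, and the hash-modulo rule are assumptions, not something the slides prescribe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class CoPartitionSketch {
    /** The same rule for every data set: a given key always maps to the same partition number. */
    static int partitionOf(String foreignKey, int numPartitions) {
        return Math.floorMod(foreignKey.hashCode(), numPartitions);
    }

    /** Splits the keys into numPartitions partitions, each kept sorted by key. */
    static List<TreeSet<String>> partition(List<String> keys, int numPartitions) {
        List<TreeSet<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeSet<>());
        }
        for (String k : keys) {
            parts.get(partitionOf(k, numPartitions)).add(k);
        }
        return parts;
    }
}
```

If data sets A and B are both partitioned this way with the same partition count, a key found in partition X of A can only appear in partition X of B.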

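Composite Join: Structure
- The driver code does most of the work during job configuration
  - It sets the input format used to parse the input data sets and the join type to run
- Mapper: takes the two values from the input tuple and writes them to the file system
- No combiner, partitioner, or reducer: map-only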
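Sorted, co-partitioned inputs can be joined in one pass without a shuffle. The sketch below shows the general merge-join idea on one pair of partitions; it is not the internals of CompositeInputFormat. It assumes each row is a `{key, value}` pair, that keys are unique within a partition, and that keys sort as strings; `MergeJoinSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeJoinSketch {
    /** Inner-joins two partitions that are already sorted by key; each row is {key, value}. */
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;            // left key is smaller: it has no partner, skip it
            } else if (cmp > 0) {
                j++;            // right key is smaller: it has no partner, skip it
            } else {
                // Keys match: emit key, left value, right value
                out.add(left.get(i)[0] + "\t" + left.get(i)[1] + "\t" + right.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }
}
```

Each input is read exactly once, which is what makes the pattern attractive when every data set is too large to hold in memory.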

Composite Join: Example
Problem: given comment data, join the user information onto it so that both are shown together.

Composite Join: Example, Driver code

    public static void main(String[] args) throws Exception {
        // JobConf(String) expects the path of a configuration file,
        // so the job name is set explicitly instead
        JobConf conf = new JobConf();
        conf.setJobName("CompositeJoin");
        conf.setJarByClass(CompositeJoinDriver.class);

        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 4) {
            System.err.println(
                    "Usage: CompositeJoin <user data> <comment data> <out> [inner|outer]");
            System.exit(1);
        }

        Path userPath = new Path(otherArgs[0]);
        Path commentPath = new Path(otherArgs[1]);
        Path outputDir = new Path(otherArgs[2]);
        String joinType = otherArgs[3];
        if (!(joinType.equalsIgnoreCase("inner")
                || joinType.equalsIgnoreCase("outer"))) {
            System.err.println("Join type not set to inner or outer");
            System.exit(2);
        }

Composite Join: Example, Driver code (continued)

        conf.setMapperClass(CompositeMapper.class);
        conf.setNumReduceTasks(0);

        // Set the input format class to a CompositeInputFormat class.
        // The CompositeInputFormat will parse all of our input files and
        // output records to our mapper.
        conf.setInputFormat(CompositeInputFormat.class);

        // The composite input format join expression sets how the records
        // are going to be read in, and in what input format.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(joinType,
                KeyValueTextInputFormat.class, userPath, commentPath));

        TextOutputFormat.setOutputPath(conf, outputDir);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            Thread.sleep(1000);
        }

        System.exit(job.isSuccessful() ? 0 : 2);
    }

Composite Join: Example, Mapper code

    public static class CompositeMapper extends MapReduceBase
            implements Mapper<Text, TupleWritable, Text, Text> {

        @Override
        public void map(Text key, TupleWritable value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Get the first two elements in the tuple and output them
            output.collect((Text) value.get(0), (Text) value.get(1));
        }
    }

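Cartesian Product: Pattern description
- Compares and analyzes every input record against every other record
- Very time-consuming
Intent
- Pair every record with every record of the other data set and compare them
Applicability
- You want to analyze the relationships between all pairs of records
- There is no constraint on execution time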

Cartesian Product: Performance analysis (figure)

Cartesian Product: Structure
- Map-only job
- During job setup and configuration, one input split is created for every pair in the cross product of the input splits
- Each record reader then takes its two input splits (one from the left data set, one from the right) and produces their cross product
- The record reader passes the results to the mapper class, and the mapper simply writes them to a file
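The split pairing can be pictured with plain strings standing in for input splits. This is a minimal sketch; `SplitPairingSketch` and `pairSplits` are hypothetical names, and real splits are `InputSplit` objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPairingSketch {
    /** Pairs every left split with every right split; each pair feeds one map task. */
    static List<String[]> pairSplits(List<String> leftSplits, List<String> rightSplits) {
        List<String[]> pairs = new ArrayList<>(leftSplits.size() * rightSplits.size());
        for (String left : leftSplits) {
            for (String right : rightSplits) {
                pairs.add(new String[] {left, right});
            }
        }
        return pairs;
    }
}
```

The number of map tasks grows as the product of the two split counts, which is one reason the pattern is so expensive.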

Cartesian Product: Example
Problem: using the Stack Overflow comment data, identify comments that are similar to each other based on the words they have in common.

Cartesian Product: Example, Input format

    public static class CartesianInputFormat extends FileInputFormat {

        public static final String LEFT_INPUT_FORMAT = "cart.left.inputformat";
        public static final String LEFT_INPUT_PATH = "cart.left.path";
        public static final String RIGHT_INPUT_FORMAT = "cart.right.inputformat";
        public static final String RIGHT_INPUT_PATH = "cart.right.path";

        // Record the input format class and path of the left data set
        public static void setLeftInputInfo(JobConf conf,
                Class<? extends FileInputFormat> inputFormat, String inputPath) {
            conf.set(LEFT_INPUT_FORMAT, inputFormat.getCanonicalName());
            conf.set(LEFT_INPUT_PATH, inputPath);
        }

        // Record the input format class and path of the right data set
        public static void setRightInputInfo(JobConf job,
                Class<? extends FileInputFormat> inputFormat, String inputPath) {
            job.set(RIGHT_INPUT_FORMAT, inputFormat.getCanonicalName());
            job.set(RIGHT_INPUT_PATH, inputPath);
        }

Cartesian Product: Example, Input format (getSplits)
getSplits: takes the input splits of the left and right data sets, then creates and returns (number of left splits × number of right splits) composite splits.

        @Override
        public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
            try {
                // Get the input splits from both the left and right data sets
                InputSplit[] leftSplits = getInputSplits(conf,
                        conf.get(LEFT_INPUT_FORMAT), conf.get(LEFT_INPUT_PATH), numSplits);
                InputSplit[] rightSplits = getInputSplits(conf,
                        conf.get(RIGHT_INPUT_FORMAT), conf.get(RIGHT_INPUT_PATH), numSplits);

                // Create our CompositeInputSplits, size equal to
                // left.length * right.length
                CompositeInputSplit[] returnSplits =
                        new CompositeInputSplit[leftSplits.length * rightSplits.length];

                int i = 0;
                // For each of the left input splits
                for (InputSplit left : leftSplits) {
                    // For each of the right input splits
                    for (InputSplit right : rightSplits) {
                        // Create a new composite input split composed of the two
                        returnSplits[i] = new CompositeInputSplit(2);
                        returnSplits[i].add(left);
                        returnSplits[i].add(right);
                        ++i;
                    }
                }

                // Return the composite splits
                LOG.info("Total splits to process: " + returnSplits.length);
                return returnSplits;
            } catch (ClassNotFoundException e) {
                // getInputSplits could not load one of the configured input formats
                throw new IOException(e);
            }
        }

Cartesian Product: Example, Input format (getRecordReader and getInputSplits)
getRecordReader: creates and returns a new instance of the Cartesian record reader.
getInputSplits: creates an instance of the given input format, sets its input path, and returns the input splits of that one data set. getSplits calls it once for the left data set and once for the right.

        @Override
        public RecordReader getRecordReader(InputSplit split, JobConf conf,
                Reporter reporter) throws IOException {
            // Create a new instance of the Cartesian record reader
            return new CartesianRecordReader((CompositeInputSplit) split, conf, reporter);
        }

        private InputSplit[] getInputSplits(JobConf conf, String inputFormatClass,
                String inputPath, int numSplits)
                throws ClassNotFoundException, IOException {
            // Create a new instance of the input format
            FileInputFormat inputFormat = (FileInputFormat) ReflectionUtils.newInstance(
                    Class.forName(inputFormatClass), conf);

            // Set the input path for this data set
            inputFormat.setInputPaths(conf, inputPath);

            // Get the input splits of this data set
            return inputFormat.getSplits(conf, numSplits);
        }
    }

Cartesian Product: Example, Record reader

    public static class CartesianRecordReader<K1, V1, K2, V2>
            implements RecordReader<Text, Text> {

        // Record readers to get key/value pairs
        private RecordReader leftRR = null, rightRR = null;

        // Store configuration to re-create the right record reader
        private FileInputFormat rightFIF;
        private JobConf rightConf;
        private InputSplit rightIS;
        private Reporter rightReporter;

        // Helper variables
        private K1 lkey;
        private V1 lvalue;
        private K2 rkey;
        private V2 rvalue;
        private boolean goToNextLeft = true, alldone = false;

Cartesian Product: Example, Record reader (constructor)
Constructor: creates the reader objects for the left and right records and uses those readers to create the key/value objects for the left and right input splits.

        public CartesianRecordReader(CompositeInputSplit split, JobConf conf,
                Reporter reporter) throws IOException {
            this.rightConf = conf;
            this.rightIS = split.get(1);
            this.rightReporter = reporter;

            try {
                // Create left record reader
                FileInputFormat leftFIF = (FileInputFormat) ReflectionUtils.newInstance(
                        Class.forName(conf.get(CartesianInputFormat.LEFT_INPUT_FORMAT)), conf);
                leftRR = leftFIF.getRecordReader(split.get(0), conf, reporter);

                // Create right record reader
                rightFIF = (FileInputFormat) ReflectionUtils.newInstance(
                        Class.forName(conf.get(CartesianInputFormat.RIGHT_INPUT_FORMAT)), conf);
                rightRR = rightFIF.getRecordReader(rightIS, rightConf, rightReporter);
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
                throw new IOException(e);
            }

            // Create key/value pairs for parsing
            lkey = (K1) this.leftRR.createKey();
            lvalue = (V1) this.leftRR.createValue();
            rkey = (K2) this.rightRR.createKey();
            rvalue = (V2) this.rightRR.createValue();
        }

Cartesian Product: Example, Record reader (next)
next:
1. On the first call, reads the first record from the left data set to create the mapper input key
2. On the following calls, keeps reading records through the right record reader until it reports that there are no more records to process
3. Once the right record reader is exhausted, resets it and repeats the same process for the next record of the left data set

        @Override
        public boolean next(Text key, Text value) throws IOException {
            do {
                // If we are to go to the next left key/value pair
                if (goToNextLeft) {
                    // Read the next key/value pair; false means no more pairs
                    if (!leftRR.next(lkey, lvalue)) {
                        // If no more, then this task is nearly finished
                        alldone = true;
                        break;
                    } else {
                        // If we aren't done, set the value to the key and
                        // set our flags
                        key.set(lvalue.toString());
                        goToNextLeft = alldone = false;

                        // Reset the right record reader
                        this.rightRR = this.rightFIF.getRecordReader(
                                this.rightIS, this.rightConf, this.rightReporter);
                    }
                }
                // next() continues on the next slide

Cartesian Product: Example, Record reader (next, continued)

                // Read the next key/value pair from the right data set
                if (rightRR.next(rkey, rvalue)) {
                    // If success, set the value
                    value.set(rvalue.toString());
                } else {
                    // Otherwise, this right data set is complete
                    // and we should go to the next left pair
                    goToNextLeft = true;
                }

                // This loop will continue if we finished reading key/value
                // pairs from the right data set
            } while (goToNextLeft);

            // Return true if a key/value pair was read, false otherwise
            return !alldone;
        }
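The control flow of `next` is a nested loop with a resettable inner reader. A minimal plain-Java sketch of the same idea, assuming in-memory lists stand in for the two input splits (the class name `CrossProductIterator` is hypothetical):

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class CrossProductIterator implements Iterator<String[]> {
    private final Iterator<String> left;
    private final List<String> rightSource; // kept so the right side can be re-read
    private Iterator<String> right;
    private String currentLeft;             // null once every pair has been produced

    CrossProductIterator(List<String> leftSplit, List<String> rightSplit) {
        this.left = leftSplit.iterator();
        this.rightSource = rightSplit;
        this.right = rightSplit.iterator();
        // Read the first left record up front; an empty left split yields no pairs
        this.currentLeft = left.hasNext() ? left.next() : null;
    }

    @Override
    public boolean hasNext() {
        // When the right side runs out, advance the left side and reset the right "reader"
        while (currentLeft != null && !right.hasNext()) {
            if (!left.hasNext() || rightSource.isEmpty()) {
                currentLeft = null;
                break;
            }
            currentLeft = left.next();
            right = rightSource.iterator();
        }
        return currentLeft != null;
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Pair the current left record with the next right record
        return new String[] {currentLeft, right.next()};
    }
}
```

As in the record reader, the right side is re-read once per left record, so the work per split pair is proportional to the product of the two record counts.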

Cartesian Product: Example, Driver code

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        long start = System.currentTimeMillis();

        // JobConf(String) expects the path of a configuration file,
        // so the job name is set explicitly instead
        JobConf conf = new JobConf();
        conf.setJobName("Cartesian Product");

        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: CartesianProduct <comment data> <out>");
            System.exit(1);
        }

        // Configure the job
        conf.setJarByClass(CartesianProduct.class);
        conf.setMapperClass(CartesianMapper.class);
        conf.setNumReduceTasks(0);
        conf.setInputFormat(CartesianInputFormat.class);

        // Both sides read the same comment data, so every comment is
        // paired with every other comment
        CartesianInputFormat.setLeftInputInfo(conf, TextInputFormat.class, otherArgs[0]);
        CartesianInputFormat.setRightInputInfo(conf, TextInputFormat.class, otherArgs[0]);

        TextOutputFormat.setOutputPath(conf, new Path(otherArgs[1]));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            Thread.sleep(1000);
        }

        long finish = System.currentTimeMillis();
        System.out.println("Time in ms: " + (finish - start));
        System.exit(job.isSuccessful() ? 0 : 2);
    }

Cartesian Product: Example, Mapper code

    public static class CartesianMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        private Text outkey = new Text();

        @Override
        public void map(Text key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // If the two comments are not equal
            if (!key.toString().equals(value.toString())) {
                String[] leftTokens = key.toString().split("\\s");
                String[] rightTokens = value.toString().split("\\s");

                HashSet<String> leftSet = new HashSet<String>(Arrays.asList(leftTokens));
                HashSet<String> rightSet = new HashSet<String>(Arrays.asList(rightTokens));

                // Count the distinct words that appear in both comments
                int sameWordCount = 0;
                StringBuilder words = new StringBuilder();
                for (String s : leftSet) {
                    if (rightSet.contains(s)) {
                        words.append(s + ",");
                        ++sameWordCount;
                    }
                }

                // Output the pair only if it shares more than two words
                if (sameWordCount > 2) {
                    outkey.set(words + "\t" + key);
                    output.collect(outkey, value);
                }
            }
        }
    }
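The similarity rule used by the mapper can be tried out on its own. The sketch below mirrors that rule in plain Java; `CommentSimilaritySketch`, `sharedWords`, and `similar` are hypothetical names, and the sample sentences in the note are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class CommentSimilaritySketch {
    /** Returns the distinct whitespace-separated tokens that appear in both comments. */
    static Set<String> sharedWords(String left, String right) {
        Set<String> shared = new TreeSet<>(Arrays.asList(left.split("\\s")));
        shared.retainAll(new HashSet<>(Arrays.asList(right.split("\\s"))));
        return shared;
    }

    /** Mirrors the mapper's rule: two different comments sharing more than two distinct tokens. */
    static boolean similar(String left, String right) {
        return !left.equals(right) && sharedWords(left, right).size() > 2;
    }
}
```

For example, "the quick brown fox" and "a quick brown fox jumps" share three tokens and count as similar, while "hello world" and "hello there" share only one and do not.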

END
