
1 Before we start, please download:
VirtualBox:
– https://www.virtualbox.org/
The Hortonworks Data Platform:
– http://d1ozir9xe74yyw.cloudfront.net/2.1/virtualbox/Hortonworks_Sandbox_2.1.ova
– https://files.ifi.uzh.ch/ddis/teaching/DS15/Hortonworks_Sandbox_2.1.ova
Assignment file:
– https://files.ifi.uzh.ch/ddis/teaching/DS15/Assignment2_files.zip

2 Overview
– Map/Reduce overview
– Local debug setup
– Pseudo-distributed setup
– The assignment tasks

3 Map/Reduce Overview
What is Map/Reduce:
– MapReduce is a software framework for processing large data sets in a distributed fashion across several machines.
What is HDFS:
– The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
– It relaxes some requirements of traditional distributed file systems but still ensures high availability.

4 Map/Reduce Overview

5 [Diagram: Job1, Job2, Job3, Job4, Job5]
– Input and output are always (key, value) pairs.
– The keys and values need to be assigned carefully.
– The program is moved to the data, rather than the data to the program.
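
To make the (key, value) model concrete, here is a minimal sketch in plain Java that imitates the three phases of a job on one machine. It does not use the Hadoop API; the class name `KeyValueFlowSketch`, its methods, and the sample lines are all hypothetical and chosen only for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine imitation of map -> shuffle -> reduce (no Hadoop involved).
class KeyValueFlowSketch {

    /** Map phase: turn one input line into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all values that share the same key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: collapse the values of one key into a single value. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three phases in order and returns one (key, value) pair per distinct word. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to be or not", "to be")));
    }
}
```

The choice of key decides which values end up in the same reduce call, which is why the key assignment matters: keying by word yields one count per word, whereas a single constant key would yield one global count.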

6 At IFI
Hadoop cluster (80+ machines): http://tentacle.ifi.uzh.ch:8088/cluster/nodes
If you are using the labs, please:
– Always log off.
– Never turn off any iMacs.
– Never unplug the network or power cables.
Typical workflow:
– Develop your app locally.
– Upload it to the submission host.
– Submit it from the submission host.

7 Debug Setup
– Set up the IDE event logger, if you haven't done so already.
– Open a new workspace (suggested).
– Include the related jars! As long as they are included, you can run and debug your application locally. You can even set a breakpoint.
– Write your Java application.
– Set up the run arguments.
– Press the run button.

8 Mapper

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Strip punctuation; hyphens become spaces so compound words are split.
        String line = value.toString();
        line = line.replaceAll(",", "");
        line = line.replaceAll("\\.", "");
        line = line.replaceAll("-", " ");
        line = line.replaceAll("\"", "");
        StringTokenizer tokenizer = new StringTokenizer(line);
        // The first token is the movie ID; consume it so it is not counted as a word.
        int movieId = Integer.parseInt(tokenizer.nextToken());
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken().toLowerCase();
            // Every word is emitted under the same key, "1".
            context.write(new Text("1"), new IntWritable(1));
        }
    }

Only one key is used here: "1".

9 Reducer

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Each value stands for one word, so counting the values counts the words.
        for (IntWritable val : values) {
            sum++;
        }
        context.write(new IntWritable(sum), NullWritable.get());
    }

For the only key, "1", count the number of its values.
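
Before running the job, it can help to check the expected total for a few lines by hand. The sketch below applies the same cleaning steps as the mapper on slide 8 in plain Java, without any Hadoop classes. The class name `LocalWordCountCheck`, its methods, and the sample lines are hypothetical. Unlike the mapper, it returns 0 for an empty line instead of failing, and it skips the first token without parsing it as an integer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical local check of the word-count logic; not part of the assignment files.
class LocalWordCountCheck {

    /** Cleans one line like the slide-8 mapper and counts the tokens after the movie ID. */
    static int countWords(String line) {
        line = line.replaceAll(",", "");
        line = line.replaceAll("\\.", "");
        line = line.replaceAll("-", " ");
        line = line.replaceAll("\"", "");
        StringTokenizer tokenizer = new StringTokenizer(line);
        if (!tokenizer.hasMoreTokens()) {
            return 0; // empty line: nothing to count (an added guard, not in the slide code)
        }
        tokenizer.nextToken(); // skip the movie ID
        return tokenizer.countTokens(); // remaining tokens are the words
    }

    /** Sums the per-line counts, which is what the single-key reducer produces. */
    static int countAll(List<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += countWords(line);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed "<movie ID><tab><summary>" layout.
        List<String> sample = Arrays.asList(
                "42\tA well-known hero, \"Max\", returns.",
                "7\tThe end.");
        System.out.println(countAll(sample)); // prints 8
    }
}
```

In the first sample line, "well-known" becomes two words once the hyphen is replaced by a space, so that line contributes 6 words and the second contributes 2.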

10 Pseudo-distributed Setup
The Hortonworks Sandbox is a pseudo-distributed cluster running HDFS, Hadoop, Pig, and other big data tools. Once your virtual machine has loaded it, you can assume all these services are running on the server.
– Import the virtual image.
– Use a browser to access the server.
– Use the command line to access the server.
– Upload your jar.
– Submit your jar.

11 Overview

12 Hortonworks Data Platform
Connect to server:
– ssh hue@127.0.0.1 -p 2222
– hadoop
Browser-based interface:
– File browser
– Job browser
– Pig

13 Assignment 2
There are 4 tasks:
1: Count the total number of words in plot_summaries.txt. (Almost done for you.)
2.1: Get the top 10 most frequently used words from plot_summaries.txt, in descending order.
2.2: Same as 2.1, but excluding stop words.
3: Join movie.metadata.tsv and character.metadata.tsv to find the actors/actresses in each movie.
4: Build an inverted index of the words for the movies in plot_summaries.txt.
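
For readers unfamiliar with the term in task 4, an inverted index maps each word to the set of documents that contain it. The following is a minimal single-machine sketch in plain Java, not a MapReduce solution. The class name `InvertedIndexSketch`, the `build` method, the tokenization rule (split on non-word characters), and the sample summaries are assumptions made for illustration.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of what an inverted index contains (no Hadoop involved).
class InvertedIndexSketch {

    /** Maps each lower-cased word to the sorted set of movie IDs whose summary contains it. */
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> summaries) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> entry : summaries.entrySet()) {
            for (String word : entry.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // split can yield an empty first element
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> summaries = new TreeMap<>();
        summaries.put(1, "A hero returns");   // made-up summary
        summaries.put(2, "The hero leaves");  // made-up summary
        // prints {a=[1], hero=[1, 2], leaves=[2], returns=[1], the=[2]}
        System.out.println(build(summaries));
    }
}
```

In a MapReduce setting, the word would naturally serve as the key and the movie ID as the value, so that all IDs for one word arrive at the same reduce call.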

14 Logging tool
Please also submit your log file… and your jar file, of course.

15 Next week
Microsoft Azure + Assignment 1 solution + Lecture

