Before we start, please download: VirtualBox: – https://www.virtualbox.org/ https://www.virtualbox.org/ The Hortonworks Data Platform: –

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce.
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
MapReduce With a heavy debt to: Google Map Reduce OSDI 2004 slides code.google.com.
Intro to Map-Reduce Feb 21, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
MapReduce.
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Parallel and Distributed Computing: MapReduce Alona Fyshe.
CS246 TA Session: Hadoop Tutorial Peyman kazemian 1/11/2011.
Intro to Map-Reduce Feb 4, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
Hadoop: Nuts and Bolts Data-Intensive Information Processing Applications ― Session #2 Jimmy Lin University of Maryland Tuesday, February 2, 2010 This.
Poly Hadoop CSC 550 May 22, 2007 Scott Griffin Daniel Jackson Alexander Sideropoulos Anton Snisarenko.
An Introduction to MapReduce: Abstractions and Beyond! -by- Timothy Carlstrom Joshua Dick Gerard Dwan Eric Griffel Zachary Kleinfeld Peter Lucia Evan May.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 3: Mapreduce and Hadoop All slides © IG.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Introduction to Google MapReduce WING Group Meeting 13 Oct 2006 Hendra Setiawan.
Reproducible Environment for Scientific Applications (Lab session) Tak-Lon (Stephen) Wu.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
大规模数据处理 / 云计算 Lecture 3 – Hadoop Environment 彭波 北京大学信息科学技术学院 4/23/2011 This work is licensed under a Creative Commons.
Actores y Actrices. Peligro Please be careful! IMDb (I assume you all know?)
MapReduce Programming Yue-Shan Chang. split 0 split 1 split 2 split 3 split 4 worker Master User Program output file 0 output file 1 (1) fork (2) assign.
Lecture 3-1 Computer Science 425 Distributed Systems CS 425 / CSE 424 / ECE 428 Fall 2012 Indranil Gupta (Indy) Sep 4, 2012 Lecture 3 Cloud Computing -
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago Materials:
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
HAMS Technologies 1
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
MapReduce How to painlessly process terabytes of data.
Hadoop Introduction Wang Xiaobo Outline Install hadoop HDFS MapReduce WordCount Analyzing Compile image data TeleNav Confidential.
An Introduction to HDInsight June 27 th,
Big Data for Relational Practitioners Len Wyatt Program Manager Microsoft Corporation DBI225.
Hadoop as a Service Boston Azure / Microsoft DevBoston 07-Feb-2012 Copyright (c) 2011, Bill Wilder – Use allowed under Creative Commons license
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Writing a MapReduce Program 1. Agenda  How to use the Hadoop API to write a MapReduce program in Java  How to use the Streaming API to write Mappers.
Programming in Hadoop Guangda HU Huayang GUO
Map-Reduce Big Data, Map-Reduce, Apache Hadoop SoftUni Team Technical Trainers Software University
Big Data,Map-Reduce, Hadoop. Presentation Overview What is Big Data? What is map-reduce? input/output data types why is it useful and where is it used?
Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team Modified by R. Cook.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.
Big Data Infrastructure Week 2: MapReduce Algorithm Design (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
CSE 548 Advanced Computer Network Security Trust in MobiCloud using Hadoop Framework Updates Sayan Kole Jaya Chakladar Group No: 1.
Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Data-Intensive Computing with MapReduce Jimmy Lin University of Maryland Thursday, January 31, 2013 Session 2: Hadoop Nuts and Bolts This work is licensed.
Airlinecount CSCE 587 Spring Preliminary steps in the VM First: log in to vm Ex: ssh vm-hadoop-XX.cse.sc.edu -p222 Where: XX is the vm number assigned.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
HADOOP Priyanshu Jha A.D.Dilip 6 th IT. Map Reduce patented[1] software framework introduced by Google to support distributed computing on large data.
Hadoop&Hbase Developed Using JAVA USE NETBEANS IDE.
BIG DATA/ Hadoop Interview Questions.
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Sort in MapReduce. MapReduce Block 1 Block 2 Block 3 Block 4 Block 5 Map Reduce Output 1 Output 2 Shuffle/Sort.
Introduction to Google MapReduce
Central Florida Business Intelligence User Group
Airlinecount CSCE 587 Fall 2017.
MIT 802 Introduction to Data Platforms and Sources Lecture 2
인공지능연구실 이남기 ( ) 유비쿼터스 응용시스템: 실습 가이드 인공지능연구실 이남기 ( )
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
Lecture 18 (Hadoop: Programming Examples)
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
MIT 802 Introduction to Data Platforms and Sources Lecture 2
Presentation transcript:

Before we start, please download: VirtualBox: – The Hortonworks Data Platform: – Hortonworks_Sandbox_2.1.ova Hortonworks_Sandbox_2.1.ova – ks_Sandbox_2.1.ova ks_Sandbox_2.1.ova Assignment file: – t2_files.zip t2_files.zip

Overview Map/Reduce overview Local debug setup Pseudo-distributed setup The assignment tasks

Map/Reduce Overview What is Map/Reduce: – MapReduce is a software framework for processing large data sets in a distributed fashion over several machines. What is HDFS: – The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. – Relax some requirements of traditional distributed FS, but ensure high availability.

Map/Reduce Overview

Job1 Job2 Job3 Job4 Job5 Input and output are always in (key, value) pairs. Need to be carefully assigned. Moving program to data.

At IFI Hadoop Cluster (80+ machines) /cluster/nodes /cluster/nodes If you are using the labs, please: Always log off. Never turn off any iMacs. Never unplug the network or the power cable Typical workflow: Develop your app in local Upload it to the submission host Submit it with the submission host

Debug Setup Set up the IDE event logger, if you haven’t done it. Open a new work space (suggested) Include related jars! As long as you include the related jars, you will be able to run and debug it in local. You can even setup a break point. Write your java application. Set up run arguments. Press the run button.

Mapper public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); line = line.replaceAll(",", ""); line = line.replaceAll("\\.", ""); line = line.replaceAll("-", " "); line = line.replaceAll("\"", ""); StringTokenizer tokenizer = new StringTokenizer(line); int movieId = Integer.parseInt(tokenizer.nextToken()); while (tokenizer.hasMoreTokens()) { String word = tokenizer.nextToken().toLowerCase(); context.write(new Text("1"), new IntWritable(1)); } Only one key here “1”

Reducer public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum++; } context.write(new IntWritable(sum), NullWritable.get()); } For the only key “1”, count the number of its values

Pseudo-distributed Setup The HortonWorks Sandbox is a pseudo-distributed cluster running HDFS, Hadoop, pig and other big data tools. Once it is loaded by your virtual machine, you can assume all these services are running on the server. Import the virtual image. Using a browser to access the server Using command line to access the server Upload your jar Submit your jar

Overview

Hortonworks Data Platform Connect to server – ssh -p 2222 – hadoop Browser-based interface – File browser – Job browser – Pig

Assignment 2 There are 4 tasks: 1: Count the total number of words in plot_summaries.txt. (Almost done for you) 2.1: Get the top 10 most frequently used words from plot_summaries.txt in descending order. 2.2: Same with 2.1, but excluding stop words. 3: Join movie.metadata.tsv and character.metadata.tsv to find actors/actresses in each movie. 4: Build an inverted index of the words for the movies in plot_summaries.txt.

Logging tool Please also submit your log file… And your jar file, of course.

Next week Microsoft Azure + Assignment 1 solution + Lecture