Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.



• Introduction
• Hadoop
• MapReduce
• Working With Hadoop
• Environment
• MapReduce Programming
• Summary

• Hadoop is a software framework
• Users write their own programs on top of it
• Like a super-library
• For distributed applications
• Built-in solutions
• Solutions depend on this framework
• Inspired by Google's MapReduce and Google File System (GFS) papers

• Who uses Hadoop
• A9.com – Amazon
  ▪ Amazon's product search indices
• Adobe
  ▪ 30 nodes running HDFS, Hadoop and HBase
• Baidu
  ▪ Handles about 3,000 TB per week
• Facebook
  ▪ Stores copies of internal log and dimension data sources
• Last.fm, LinkedIn, IBM, Yahoo!, Google…

• Hadoop Common
• HDFS
• MapReduce
• ZooKeeper

• Connections to the IR book
• Ch. 4 Index construction
  ▪ Distributed indexing (4.4)
• Ch. 20 Web crawling and indexes
  ▪ Distributed crawler (20.2)
  ▪ Distributed indexing (20.3)

• MapReduce is a software framework
• For distributed computing
• Massive amounts of data
• Simple processing requirements
• Portability across a variety of platforms
  ▪ Clusters
  ▪ CMP/SMP
  ▪ GPGPU
• Introduced by Google

[Figure cited from "MapReduce: Simplified Data Processing on Large Clusters"]

• Map
  ▪ Map(k1, v1) -> list(k2, v2)
• Reduce
  ▪ Reduce(k2, list(v2)) -> list(v3)
• Hadoop MapReduce
  ▪ (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
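The signatures above can be modelled in a single process to see how the types line up. The following is a minimal sketch in plain Java, not the Hadoop API: the class `MapReduceSketch`, the `run` helper, the `invertedIndex` example and the sample documents `d1` and `d2` are all hypothetical names chosen for illustration. It groups the intermediate values by key between the two phases and uses a toy inverted index (term to document names) as the workload.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.BiFunction;

/** Single-process model of Map(k1,v1) -> list(k2,v2) and Reduce(k2, list(v2)) -> list(v3). */
final class MapReduceSketch {

    /** Runs map over every input pair, groups values by k2, then reduces each group. */
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Grouping step: collect every v2 under its k2, with keys kept sorted.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> out : map.apply(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce step: one call per distinct k2.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    /** Toy inverted index: map emits (term, docName); reduce de-duplicates and sorts the names. */
    static Map<String, List<String>> invertedIndex(List<Map.Entry<String, String>> docs) {
        return MapReduceSketch.<String, String, String, String, String>run(
            docs,
            (name, text) -> {
                List<Map.Entry<String, String>> postings = new ArrayList<>();
                for (String term : text.split("\\s+")) {
                    postings.add(new SimpleEntry<>(term, name));
                }
                return postings;
            },
            (term, names) -> new ArrayList<>(new TreeSet<>(names)));
    }

    /** Hypothetical sample input: (document name, document text). */
    static List<Map.Entry<String, String>> sampleDocs() {
        List<Map.Entry<String, String>> docs = new ArrayList<>();
        docs.add(new SimpleEntry<>("d1", "hadoop stores data"));
        docs.add(new SimpleEntry<>("d2", "hadoop processes data"));
        return docs;
    }

    public static void main(String[] args) {
        // prints {data=[d1, d2], hadoop=[d1, d2], processes=[d2], stores=[d1]}
        System.out.println(invertedIndex(sampleDocs()));
    }
}
```

In a real cluster the grouping step is the shuffle that the framework performs between map tasks and reduce tasks; here it is just a sorted in-memory map.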

• Source
  $ cat file01
  Hello World Bye World
  $ cat file02
  Hello Hadoop Goodbye Hadoop
  $

• Map Output
  ▪ For file01: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  ▪ For file02: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

• Reduce Output
  ▪ <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
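The WordCount data flow for the two source files can be traced without a cluster. This is a minimal sketch in plain Java, assuming one map task per file and a combiner that sums counts locally before the reduce; the class `WordCountTrace` and its method names are hypothetical and are not part of Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Traces WordCount over file01 and file02 in memory (illustration only, not the Hadoop API). */
final class WordCountTrace {

    /** Map plus combine for one file: emit (word, 1) per token, then sum locally per word. */
    static Map<String, Integer> mapAndCombine(String contents) {
        Map<String, Integer> local = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(contents);
        while (tokenizer.hasMoreTokens()) {
            local.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return local;
    }

    /** Reduce: add up the partial counts that each map task produced for a word. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapOutputs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : mapOutputs) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> out1 = mapAndCombine("Hello World Bye World");
        Map<String, Integer> out2 = mapAndCombine("Hello Hadoop Goodbye Hadoop");
        System.out.println("file01: " + out1); // file01: {Bye=1, Hello=1, World=2}
        System.out.println("file02: " + out2); // file02: {Goodbye=1, Hadoop=2, Hello=1}
        // reduce: {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println("reduce: " + reduce(Arrays.asList(out1, out2)));
    }
}
```

Because word counting is a sum, the same addition works both as the local combine and as the final reduce.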

• More input
• More mappers
• Combiner Function after Map
• More reducers
• Partition Function before Reduce
• Focus on Map & Reduce
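With more than one reducer, a partition function decides which reducer receives each intermediate key. A common choice is hash-then-modulo, which is also the approach taken by Hadoop's default hash partitioner. The sketch below is plain Java under that assumption; the class `PartitionSketch` and the choice of three reducers are hypothetical.

```java
/** Hash-mod partitioning of intermediate keys across reducers (illustration only). */
final class PartitionSketch {

    /** Returns a reducer index in [0, numReducers) for the given key. */
    static int partition(String key, int numReducers) {
        // Clearing the sign bit keeps the index non-negative even if hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 3; // assumed number of reduce tasks
        for (String word : new String[] {"Hello", "World", "Bye", "Hadoop", "Goodbye"}) {
            System.out.println(word + " -> reducer " + partition(word, numReducers));
        }
    }
}
```

Every occurrence of a key hashes to the same index, so all partial counts for a word arrive at one reducer regardless of which mapper produced them.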

• Hadoop in Java (C++)
• Runs in 3 modes
  ▪ Local (Standalone) Mode
  ▪ Pseudo-Distributed Mode
  ▪ Fully-Distributed Mode
• Our instance on the IBM cloud is set up in Pseudo-Distributed Mode

• Process
  1. Start Hadoop service
  2. Prepare input
  3. Write your MapReduce program
  4. Compile your program
  5. Run your application with Hadoop

• Start Hadoop service
  $ bin/hadoop namenode -format
  $ bin/start-all.sh
• Initialize filesystem
  $ bin/hadoop fs -put localdir hinputdir
• You can also use -get, -rm, -cat with fs

• Compile your program & create jar
  $ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
  $ jar -cvf wordcount.jar -C wordcount_classes/ .
• Run your application with Hadoop
  $ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Cited from Wikipedia

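// Nested inside the WordCount class. Needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (word, 1) for every whitespace-separated token in the input line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}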

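// Nested inside the WordCount class. Needs java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Sums the counts collected for one word and emits (word, total).
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}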

• Configurations & Main class
• Leave the rest of the work to the Hadoop MapReduce framework

• Hadoop
• Introduction
• Connections to the IR book
• MapReduce
• Overview
• E.g. WordCount
• Environment configuration
• Writing your MapReduce application

• Hadoop Project
• MapReduce in Hadoop
• MapReduce: Simplified Data Processing on Large Clusters (Communications of the ACM)
• Hadoop Single-Node Setup
• Who uses Hadoop