Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.

Xiaokui Shu (subx@vt.edu), Ron Cohen (roncohen@vt.edu)

• Introduction
• Hadoop
• MapReduce
• Working With Hadoop
• Environment
• MapReduce Programming
• Summary

• Hadoop is a software framework
• Users write their own programs on top of it
• Like a super-library
• For distributed applications
• Built-in solutions
• Solutions that depend on this framework
• Inspired by Google's MapReduce and Google File System (GFS) papers

• Who uses Hadoop
• A9.com – Amazon
  ▪ Amazon's product search indices
• Adobe
  ▪ 30 nodes running HDFS, Hadoop and HBase
• Baidu
  ▪ handles about 3,000 TB per week
• Facebook
  ▪ stores copies of internal log and dimension data sources
• Last.fm, LinkedIn, IBM, Yahoo!, Google…

• Hadoop Common
• HDFS
• MapReduce
• ZooKeeper

• Connections to the IR book
• Ch.4 Index construction
  ▪ Distributed indexing (4.4)
• Ch.20 Web crawling and indexes
  ▪ Distributed crawler (20.2)
  ▪ Distributed indexing (20.3)

• MapReduce is a software framework
• For distributed computing
• Massive amounts of data
• Simple processing requirements
• Portability across a variety of platforms
  ▪ Clusters
  ▪ CMP/SMP
  ▪ GPGPU
• Introduced by Google

[Figure cited from "MapReduce: Simplified Data Processing on Large Clusters"]

• Map
  Map(k1, v1) -> list(k2, v2)
• Reduce
  Reduce(k2, list(v2)) -> list(v3)
• Hadoop MapReduce
  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

• Source
  $ cat file01
  Hello World Bye World
  $ cat file02
  Hello Hadoop Goodbye Hadoop

• Map Output
• For File01
• For File02

• Reduce Output
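
The map and reduce outputs for the two source files can be reproduced with a small in-memory sketch. This is plain Java, not Hadoop code: the class name `WordCountSketch` and its methods are hypothetical, and the shuffle between map and reduce is imitated with a sorted `TreeMap`. It only illustrates the data flow, under the assumption that words are split on whitespace.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration of WordCount; it does not use Hadoop.
final class WordCountSketch {

    // Map step: emit (word, 1) for every whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce step: sum all partial counts collected for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group values by key (sorted), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String file01 = "Hello World Bye World";
        String file02 = "Hello Hadoop Goodbye Hadoop";
        System.out.println(map(file01)); // [Hello=1, World=1, Bye=1, World=1]
        System.out.println(map(file02)); // [Hello=1, Hadoop=1, Goodbye=1, Hadoop=1]
        System.out.println(run(Arrays.asList(file01, file02)));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Each mapper sees only its own file; the reducer sees every count for a given word, which is why World and Hadoop end up with 2.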

• More input
• More mappers
• Combiner Function after Map
• More reducers
• Partition Function before Reduce
• Focus on Map & Reduce
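
A rough sketch of what the combiner and partition functions do, again in plain Java rather than Hadoop code. The class `CombinePartitionSketch` and its method names are hypothetical. The combiner here is assumed to be the same summing logic as the reducer, applied to one mapper's output, and the partition function is assumed to be a simple hash-modulo scheme (similar in spirit to a default hash partitioner); real jobs may configure either differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of combine and partition; it does not use Hadoop.
final class CombinePartitionSketch {

    // Combiner: pre-aggregate one mapper's (word, 1) pairs locally,
    // so fewer pairs have to travel to the reducers.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : mapOutputKeys) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    // Partition: pick the reducer for a key from its hash code.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Keys emitted by the mapper for file01: Hello World Bye World
        Map<String, Integer> combined = combine(Arrays.asList("Hello", "World", "Bye", "World"));
        System.out.println(combined); // {Bye=1, Hello=1, World=2}

        // Every occurrence of a key lands on the same reducer.
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Because the partition depends only on the key, all counts for one word reach the same reducer no matter which mapper produced them.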

• Hadoop is programmed in Java (C++ is also an option)
• Runs in 3 modes
• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode
• Our instance on the IBM cloud is set up in Pseudo-Distributed Mode

• Process
  1. Start Hadoop service
  2. Prepare input
  3. Write your MapReduce program
  4. Compile your program
  5. Run your application with Hadoop

• Start Hadoop service
  $ bin/hadoop namenode -format
  $ bin/start-all.sh
• Initialize filesystem
  $ bin/hadoop fs -put localdir hinputdir
• You can also use -get, -rm, -cat with fs

• Compile your program & create jar
  $ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
  $ jar -cvf wordcount.jar -C wordcount_classes/ .
• Run your application with Hadoop
  $ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir

void map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int result = 0;
    for each pc in partialCounts:
        result += ParseInt(pc);
    Emit(AsString(result));

Cited from Wikipedia

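// Nested inside the WordCount class. Needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.* imported at the top of WordCount.java.
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // key: byte offset of the line in the input; value: the line itself
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit (word, 1) for every whitespace-separated token
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}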

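// Nested inside the WordCount class. Needs java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.* imported at the top of WordCount.java.
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // key: a word; values: all counts emitted for that word
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Emit (word, total count)
        output.collect(key, new IntWritable(sum));
    }
}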

• Configurations & Main class
• Leave the rest of the work to the Hadoop MapReduce framework

• Hadoop
• Introduction
• Connections to the IR book
• MapReduce
• Overview
• E.g. WordCount
• Environment configuration
• Writing your MapReduce application

• Hadoop Project
  http://hadoop.apache.org/
• MapReduce in Hadoop
  http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
• MapReduce: Simplified Data Processing on Large Clusters
  http://portal.acm.org/citation.cfm?id=1327452.1327492&coll=GUIDE&dl=&idx=J79&part=magazine&WantType=Magazines&title=Communications%20of%20the%20ACM
• Hadoop Single-Node Setup
  http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• Who uses Hadoop
  http://wiki.apache.org/hadoop/PoweredBy
