
1 Jian Wang. Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das, Yahoo! Inc., Bangalore & Apache Software Foundation

2 • Need to process 10TB datasets
◦ On 1 node, scanning @ 50MB/s takes 2.3 days (10TB ÷ 50MB/s = 200,000s)
◦ On a 1000-node cluster, scanning @ 50MB/s takes 3.3 minutes
• Need an efficient, reliable, and usable framework
◦ Google File System (GFS) paper
◦ Google's MapReduce paper

3 • Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ The current block replication factor is 3 (configurable)
◦ HDFS cannot be directly mounted by an existing operating system
• Once you use the DFS (put something in it), relative paths resolve from /user/{your user id}, e.g. if your id is jwang30, your “home dir” is /user/jwang30 (see the illustration below)
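For example, copying a file with a relative destination lands in your DFS home directory (the local file name here is only illustrative):
◦ bin/hadoop dfs -put notes.txt notes.txt (stored as /user/jwang30/notes.txt)
◦ bin/hadoop dfs -ls (lists /user/jwang30)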

4 • Master-Slave Architecture
• Master (irkm-1) runs the HDFS Namenode and the MapReduce Jobtracker
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and tasktracker status, re-executes tasks upon failure
• Slaves (irkm-1 to irkm-6) run HDFS Datanodes and MapReduce Tasktrackers
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output


6 • Hadoop is locally “installed” on each machine
◦ Version 0.19.2
◦ Installed in /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)

7 • If this is the first time you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format
• Most commands follow the same pattern:
◦ bin/hadoop <command> [options]
◦ If you just type bin/hadoop, you get a list of all possible commands (including undocumented ones)

8 • hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
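A typical session combining a few of these (directory and file names are only illustrative):
◦ bin/hadoop dfs -mkdir input
◦ bin/hadoop dfs -put mydata.txt input
◦ bin/hadoop dfs -ls input
◦ bin/hadoop dfs -cat input/mydata.txt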

9 • bin/start-all.sh – starts all slave nodes and the master node
• bin/stop-all.sh – stops all slave nodes and the master node
• Run jps to check the status
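On a node that runs both master and slave daemons (as irkm-1 does here), jps output should look something like the following; the PIDs are of course illustrative:
12345 NameNode
12402 SecondaryNameNode
12516 JobTracker
12621 DataNode
12703 TaskTracker
12850 Jps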

10 • Log in to irkm-1
• rm -fr /tmp/hadoop/$userID
• cd /home/tmp/hadoop
• bin/hadoop dfs -ls
• bin/hadoop dfs -copyFromLocal example example
• After that:
• bin/hadoop dfs -ls


14 • Mapper.py
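The code on this slide was not transcribed; the following is a minimal sketch of a Hadoop Streaming word-count mapper, assuming the standard word-count example:

#!/usr/bin/env python
# mapper.py - word-count mapper for Hadoop Streaming (a sketch).
# Streaming feeds each input record to this script as a line on stdin.
import sys

for line in sys.stdin:
    # Emit "word<TAB>1" for every whitespace-separated token.
    for word in line.strip().split():
        print("%s\t1" % word)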

15 • Reducer.py
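Again a sketch, assuming the standard word-count example. Streaming sorts the mapper output by key before it reaches the reducer, so all counts for a given word arrive on consecutive lines:

#!/usr/bin/env python
# reducer.py - word-count reducer for Hadoop Streaming (a sketch).
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # Each input line is "word<TAB>count", sorted by word.
    word, count = line.strip().split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = count

# Flush the final word.
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))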

16 • bin/hadoop dfs -ls
• bin/hadoop dfs -copyFromLocal example example
• bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py -input example -output java-output
◦ (-file ships a local script with the job so it is available on every node; -mapper and -reducer name the commands to run)
• bin/hadoop dfs -cat java-output/part-00000
• bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
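Because Streaming just pipes records through the scripts, you can sanity-check them locally before submitting the job (the input file name is illustrative):
◦ cat example.txt | python mapper.py | sort | python reducer.py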

17 • Hadoop job tracker
◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
• Hadoop task tracker
◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
• Hadoop dfs checker
◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp


