
1 Cloud Computing Project NSYSU Sec. 1 Demo

2 Outline
 Our system's architecture
 Flow chart of the Hadoop job (web crawler) running on the Hadoop cluster
 –Basic setup
 –Flow chart
 Comparison of the crawler's efficiency on Hadoop clusters with different numbers of nodes

3 Architecture
 Hardware
 –2 ASUS servers, Intel Xeon X3330 2.66 GHz, 1 TB HD, 3 GB RAM (master, slave1)
 –1 PC, Intel Core 2 Quad Q6600 2.40 GHz, 500 GB HD, 4 GB RAM (slave2)
 Software
 –CentOS 5.3
 –Hadoop 0.20.1

4 Architecture
 Machine 01: master (x.x.x.1) – NameNode, JobTracker, DataNode, TaskTracker
 Machine 02: slave1 (x.x.x.2) – DataNode, TaskTracker
 Machine 03: slave2 (x.x.x.3) – DataNode, TaskTracker
The administrator monitors the cluster through http://x.x.x.1:50070 (HDFS) and http://x.x.x.1:50030 (job admin); the user submits the job to the master.
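As a minimal sketch, this topology would be written down in Hadoop 0.20's conf/ directory roughly as follows (the host names match the diagram above; the exact file contents are an assumption, not taken from the slides):

    ## conf/masters -- host that runs the secondary NameNode helper
    master

    ## conf/slaves -- hosts that run a DataNode and TaskTracker
    ## (master doubles as a worker node, as in the diagram)
    master
    slave1
    slave2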

5 HDFS (screenshot of the HDFS web console, http://x.x.x.1:50070)

6 HDFS (second screenshot of the HDFS web console, http://x.x.x.1:50070)

7 Job admin (screenshot of the JobTracker web console, http://x.x.x.1:50030)

8 Job admin (second screenshot of the JobTracker web console, http://x.x.x.1:50030)

9 Job admin (third screenshot of the JobTracker web console, http://x.x.x.1:50030)

10 Basic setup (Hadoop)
1. Set up passwordless communication between the nodes over the SSH protocol
2. Install Java
3. Export the Java path (and any other paths needed) in {hadoop dir}/conf/hadoop-env.sh
4. Name the NameNode and JobTracker hosts in {hadoop dir}/conf/hadoop-site.xml (a sketch follows below)
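A minimal sketch of steps 1–4, run from {hadoop dir} on the master (the Java path, port numbers, and user name are assumptions; adjust to your installation):

    # 1. Passwordless SSH from the master to every node
    ssh-keygen -t rsa -P ""
    ssh-copy-id user@slave1        # repeat for slave2 and the master itself
    # 2. (install Java through your distribution's packages)
    # 3. Export the Java path in conf/hadoop-env.sh
    echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> conf/hadoop-env.sh
    # 4. Name the NameNode and JobTracker hosts in conf/hadoop-site.xml, e.g.
    #    <property><name>fs.default.name</name><value>hdfs://master:9000</value></property>
    #    <property><name>mapred.job.tracker</name><value>master:9001</value></property>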

11 Basic setup (Hadoop)
5. Set up the masters file and the slaves file
6. Format HDFS (Hadoop Distributed File System)
7. Start Hadoop
8. Check Hadoop: HDFS at http://{namenode ip}:50070, job admin at http://{jobtracker ip}:50030
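Steps 6–8 map onto the standard Hadoop 0.20 commands, run from {hadoop dir} on the master (a sketch of the usual procedure, not the slides' exact commands):

    # 6. Format HDFS once, before the first start
    bin/hadoop namenode -format
    # 7. Start the HDFS and MapReduce daemons on every host listed in conf/slaves
    bin/start-all.sh
    # 8. Then browse http://{namenode ip}:50070 and http://{jobtracker ip}:50030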

12 Basic setup (crawler)
1. Check your web robot agent file
2. Set up the URL filter file
3. Set your seed URL file, by manual input or from a packaged URL list
(Some detailed setup steps are omitted here; a sketch follows below.)
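The slides do not name the crawler, but the workflow (robot agent file, URL filter, seed URLs, fetch lists) matches Apache Nutch, so here is a sketch assuming Nutch on this cluster; the agent name is invented for illustration:

    # 1. Web robot agent: set http.agent.name in conf/nutch-site.xml, e.g.
    #    <property><name>http.agent.name</name><value>it-lab-crawler</value></property>
    # 2. URL filter: edit conf/crawl-urlfilter.txt (or regex-urlfilter.txt),
    #    e.g. the pattern "+." accepts every URL
    # 3. Put the seed URL file into a directory on HDFS
    bin/hadoop fs -mkdir urls
    bin/hadoop fs -put seeds.txt urls/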

13 Flow chart
1. The user supplies the seed URLs and runs the crawl command as a Hadoop job.
2. The job's fragments are assigned to each TaskTracker, which goes and fetches the web data (map & reduce).
3. The results are stored in the output directory on HDFS: link log, new fetch list, doc. data, and fetch log.
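Under the Nutch assumption above, the "run crawl command" step would look like this (the depth and topN values are illustrative, not the ones used in the demo):

    # Submit the crawl as a Hadoop MapReduce job
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000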

14 Hadoop cluster – 1 node
 Machine 01: master (x.x.x.1) – NameNode, JobTracker, DataNode, TaskTracker

15 Hadoop cluster – 2 nodes
 Machine 01: master (x.x.x.1) – NameNode, JobTracker, DataNode, TaskTracker
 Machine 02: slave1 (x.x.x.2) – DataNode, TaskTracker

16 Hadoop cluster – 3 nodes
 Machine 01: master (x.x.x.1) – NameNode, JobTracker, DataNode, TaskTracker
 Machine 02: slave1 (x.x.x.2) – DataNode, TaskTracker
 Machine 03: slave2 (x.x.x.3) – DataNode, TaskTracker

17 URL set
 Get the URL package from http://dmoz.org/
 Select one URL out of every 500, ending up with around 10000 URLs
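Nutch ships a DMOZ parser that performs exactly this kind of subsampling; a sketch of how the 1-in-500 selection could be produced (assuming the Nutch tooling, which the slides do not confirm):

    # Fetch the DMOZ RDF dump and keep roughly one URL in every 500
    wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
    gunzip content.rdf.u8.gz
    bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 500 > seeds.txt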

18 Crawler input (seeds.txt)
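The seed file itself is plain text with one URL per line; the URLs below are placeholders, the real file held the roughly 9199 URLs sampled from DMOZ:

    http://www.example.com/
    http://www.example.org/
    ...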

19 Crawler output
 Output to HDFS
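Under the Nutch assumption, the output directory on HDFS can be inspected with the standard filesystem shell; the directory names follow Nutch's layout and line up with the flow chart's link log, fetch list, and doc. data:

    bin/hadoop fs -ls crawl
    #   crawl/crawldb   -- per-URL fetch status (fetch/link logs)
    #   crawl/linkdb    -- inverted link database
    #   crawl/segments  -- fetched page content, one subdirectory per fetch round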

20 Speed comparison
Hadoop job elapsed time (9199 URLs):
 –1 worker node: 1888 seconds
 –2 worker nodes: 1679 seconds
 –3 worker nodes: 1628 seconds
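From these numbers, the speedup over a single node is 1888/1679 ≈ 1.12 with two nodes and 1888/1628 ≈ 1.16 with three, well below linear; a plausible reading, not analyzed in the slides, is that fetch time is bounded by network latency and crawl politeness delays rather than by cluster compute.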

21 Thanks for your attention!!

