Nutch in a Nutshell. Presented by Liew Guo Min and Zhao Jin.

1 Nutch in a Nutshell
  Presented by Liew Guo Min and Zhao Jin

2 Outline
  - Recap
  - Special features
  - Running Nutch in a distributed environment (with demo)
  - Q&A
  - Discussion

3 Recap
  - Complete web search engine
      - Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
  - Java based, open source
  - Features:
      - Customizable
      - Extensible
      - Distributed

4 Nutch as a crawler
  [Diagram. Components: Initial URLs, Injector, CrawlDB, Generator, Segment, Fetcher, Web (webpages/files), Parser, CrawlDBTool. Arrow labels: generate, get, read/write, update.]
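
To make the data flow between the components named on this slide concrete, here is a minimal sketch in Python. It is a toy simulation, not Nutch code: the in-memory `FAKE_WEB`, the `CrawlDb` class and all function names are assumptions made up for illustration, and the real CrawlDB and segments are on-disk structures.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the Web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("page a", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://a.example/"]),
    "http://c.example/": ("page c", []),
}

@dataclass
class CrawlDb:
    # URL -> status, either "unfetched" or "fetched"
    status: dict = field(default_factory=dict)

def inject(db, seed_urls):
    """Injector: add the initial URLs to the CrawlDB."""
    for url in seed_urls:
        db.status.setdefault(url, "unfetched")

def generate(db):
    """Generator: read the CrawlDB and create a segment with a fetch list."""
    fetch_list = sorted(u for u, s in db.status.items() if s == "unfetched")
    return {"fetch_list": fetch_list, "content": {}, "outlinks": {}}

def fetch(segment, web):
    """Fetcher: get each page on the fetch list and store it in the segment."""
    for url in segment["fetch_list"]:
        if url in web:
            segment["content"][url] = web[url]

def parse(segment):
    """Parser: read the fetched content and write the outlinks back."""
    for url, (_text, links) in segment["content"].items():
        segment["outlinks"][url] = list(links)

def update(db, segment):
    """CrawlDB update: mark fetched URLs and add newly discovered ones."""
    for url in segment["content"]:
        db.status[url] = "fetched"
    for links in segment["outlinks"].values():
        for link in links:
            db.status.setdefault(link, "unfetched")

def crawl(seed_urls, web, depth):
    """Run inject once, then generate/fetch/parse/update for `depth` rounds."""
    db = CrawlDb()
    inject(db, seed_urls)
    for _ in range(depth):
        segment = generate(db)
        if not segment["fetch_list"]:
            break
        fetch(segment, web)
        parse(segment)
        update(db, segment)
    return db
```

Calling `crawl(["http://a.example/"], FAKE_WEB, depth=1)` fetches only the seed page and leaves the two discovered links as unfetched; a second round would fetch those.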

5 Special Features: Extensible (Plugin system)
  - Most of the essential functionalities of Nutch are implemented as plugins
  - Three layers
      - Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
      - Extensions: the interfaces to be implemented for the extension points
      - Plugins: the actual implementation
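
The three layers can be illustrated with a small, language-neutral sketch. The Python below is not the Nutch plugin API (Nutch itself is Java); `PluginRegistry`, `Parser` and the two parser classes are hypothetical names chosen to mirror the slide: a named extension point, an interface for it, and plugins that implement the interface.

```python
from abc import ABC, abstractmethod

class Parser(ABC):
    """Hypothetical interface to be implemented for the 'parser' extension point."""

    @abstractmethod
    def parse(self, content: str) -> str:
        ...

class PlainTextParser(Parser):
    """Hypothetical plugin: an actual implementation of the interface."""

    def parse(self, content: str) -> str:
        return content.strip()

class ShoutingParser(Parser):
    """A second hypothetical plugin for the same extension point."""

    def parse(self, content: str) -> str:
        return content.strip().upper()

class PluginRegistry:
    """Keeps track of extension points and the plugins registered for them."""

    def __init__(self):
        self._interfaces = {}   # extension point name -> required interface
        self._plugins = {}      # extension point name -> {key: plugin instance}

    def add_extension_point(self, name, interface):
        self._interfaces[name] = interface
        self._plugins[name] = {}

    def register(self, point, key, plugin):
        # A plugin is accepted only if it implements the point's interface.
        if not isinstance(plugin, self._interfaces[point]):
            raise TypeError(f"{type(plugin).__name__} does not implement {point}")
        self._plugins[point][key] = plugin

    def get(self, point, key):
        return self._plugins[point][key]
```

The host application only ever talks to the registry and the interface, which is what lets new plugins be added without changing the core.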

6 Special Features: Extensible (Plugin system)
  - Anyone can write a plugin
      - Write the code
      - Prepare the metadata files
          - plugin.xml: what has been extended by what
          - build.xml: how ant can build your source code
      - Ask Nutch to include your plugin in conf/nutch-site.xml
      - Tell ant to build your plugin in src/plugin/build.xml
      - More details @ http://wiki.apache.org/nutch/PluginCentral

7 Special Features: Extensible (Plugin system)
  - To use a plugin
      - Make sure you have modified nutch-site.xml to include the plugin
      - Then, either
          - Nutch calls it automatically when needed, or
          - You write code that looks it up by its class name and then uses it
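
Including a plugin in nutch-site.xml is done through the `plugin.includes` property, whose value is a regular expression over plugin ids. The sketch below only imitates that selection step in Python; the pattern value, the list of available ids and the plugin id `parse-myformat` are hypothetical, and it assumes the pattern has to match the whole id.

```python
import re

def enabled_plugins(plugin_includes, available_ids):
    """Return the plugin ids selected by a plugin.includes-style regex.

    Assumption: a plugin is enabled only if the pattern matches its whole id.
    """
    pattern = re.compile(plugin_includes)
    return [pid for pid in available_ids if pattern.fullmatch(pid)]

# Hypothetical value of plugin.includes, extended with our own plugin id.
PLUGIN_INCLUDES = "protocol-http|parse-(text|html)|parse-myformat"

# Hypothetical ids of the plugins found under the plugins directory.
AVAILABLE = [
    "protocol-http",
    "protocol-ftp",
    "parse-text",
    "parse-html",
    "parse-pdf",
    "parse-myformat",
]

ENABLED = enabled_plugins(PLUGIN_INCLUDES, AVAILABLE)
```

With these values, `protocol-ftp` and `parse-pdf` stay disabled because no alternative in the pattern matches them.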

8 Special Features: Distributed (Hadoop)
  - MapReduce (diagram)
      - A framework for distributed programming
      - Map: process the splits of data to get intermediate results and the keys that indicate what should be put together later
      - Reduce: process the intermediate results with the same key and output the final result
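
The Map and Reduce roles described above can be sketched in a few lines of single-process Python. This is not Hadoop's API: `map_reduce`, `map_sizes`, `reduce_total` and the sample records are all made up for illustration, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict

def map_reduce(splits, map_fn, reduce_fn):
    """Run map over every split, group values by key, then reduce each group."""
    intermediate = defaultdict(list)
    for split in splits:                       # map phase
        for key, value in map_fn(split):
            intermediate[key].append(value)    # same key -> put together
    return {key: reduce_fn(key, values)        # reduce phase
            for key, values in sorted(intermediate.items())}

# Hypothetical input: each split is a list of (host, page size in bytes) records.
SPLITS = [
    [("a.example", 10), ("b.example", 5)],
    [("a.example", 20)],
]

def map_sizes(split):
    """Emit one (host, size) pair per record; the host is the grouping key."""
    for host, size in split:
        yield host, size

def reduce_total(host, sizes):
    """Combine all sizes seen for one host into a single total."""
    return sum(sizes)

TOTALS = map_reduce(SPLITS, map_sizes, reduce_total)
```

Here the key chosen in the map step (the host) decides which intermediate results meet again in the same reduce call.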

9 Special Features: Distributed (Hadoop)
  - MapReduce in Nutch
      - Example 1: Parsing
          - Input: files from fetch
          - Map(url, content): by calling parser plugins
          - Reduce is identity
      - Example 2: Dumping a segment
          - Input: [...], etc. files from segment
          - Map is identity
          - Reduce(url, value*): by simply concatenating the text representation of values
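
Both examples have one trivial phase, which a small sketch makes visible. The Python below is an illustration only: `run_job`, `toy_parser` and the sample records are assumptions, the tag-stripping regex merely stands in for a real parser plugin, and the actual key/value types Nutch uses are not reproduced here.

```python
import re
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Apply map_fn to each (key, value) record, group by key, apply reduce_fn."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def identity_map(key, value):
    yield key, value

def identity_reduce(key, values):
    for value in values:
        yield key, value

def toy_parser(content):
    """Hypothetical stand-in for a parser plugin: strip anything tag-like."""
    return re.sub(r"<[^>]+>", "", content).strip()

def parse_map(url, content):
    """Example 1: all the work happens in map; reduce is identity."""
    yield url, toy_parser(content)

def dump_reduce(url, values):
    """Example 2: map is identity; reduce concatenates the text of the values."""
    yield url, "\n".join(str(v) for v in values)

# Hypothetical fetched pages and segment records.
FETCHED = [("http://a.example/", "<p>hello</p>"), ("http://b.example/", "<b>bye</b>")]
SEGMENT = [("http://a.example/", "status: fetched"), ("http://a.example/", "text: hello")]

PARSED = run_job(FETCHED, parse_map, identity_reduce)
DUMPED = run_job(SEGMENT, identity_map, dump_reduce)
```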

10 Special Features: Distributed (Hadoop)
  - Distributed file system
      - Write-once-read-many coherence model
          - High throughput
      - Master/slave
          - Simple architecture
          - Single point of failure
      - Transparent
          - Access via Java API
      - More info @ http://lucene.apache.org/hadoop/hdfs_design.html
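
As a rough mental model of the points above, the following toy class keeps all metadata in one master structure and the data blocks in several slave stores, and refuses to overwrite an existing file. It is a sketch under simplifying assumptions (in-memory storage, no replication, round-robin block placement); `ToyDfs` and its methods are invented names, not the Hadoop API.

```python
class ToyDfs:
    """Toy model: one master holding metadata, several slaves holding blocks."""

    def __init__(self, n_datanodes=3, block_size=4):
        self.block_size = block_size
        # Master metadata: path -> list of (datanode index, block id).
        self.namenode = {}
        # Slaves: one dict per datanode, block id -> bytes.
        self.datanodes = [{} for _ in range(n_datanodes)]

    def create(self, path, data):
        """Write a file exactly once; a second write to the same path fails."""
        if path in self.namenode:
            raise FileExistsError(path)
        locations = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            node = index % len(self.datanodes)      # round-robin placement
            block_id = f"{path}#{index}"
            self.datanodes[node][block_id] = data[offset:offset + self.block_size]
            locations.append((node, block_id))
        self.namenode[path] = locations

    def read(self, path):
        """Read many times: ask the master where the blocks are, then fetch them."""
        return b"".join(self.datanodes[node][block_id]
                        for node, block_id in self.namenode[path])
```

Every read goes through `self.namenode` first, which is why losing the master makes all files unreachable even though the blocks still exist on the slaves.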

11 Running Nutch in a distributed environment
  - MapReduce: in hadoop-site.xml
      - Specify job tracker host & port
          - mapred.job.tracker
      - Specify task numbers
          - mapred.map.tasks
          - mapred.reduce.tasks
      - Specify location for temporary files
          - mapred.local.dir
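
For readers who have not seen the file, hadoop-site.xml is a `<configuration>` element containing `<property>` entries with a `<name>` and a `<value>`. The sketch below generates such a file from Python for the four properties named on this slide. The host name, port, task counts and directory are placeholder assumptions, not recommended settings.

```python
import xml.etree.ElementTree as ET

# Placeholder values; only the property names come from the slide.
MAPRED_SETTINGS = {
    "mapred.job.tracker": "master.example:9001",    # hypothetical host:port
    "mapred.map.tasks": 4,                          # hypothetical task count
    "mapred.reduce.tasks": 2,                       # hypothetical task count
    "mapred.local.dir": "/tmp/hadoop/mapred/local", # hypothetical directory
}

def build_site_xml(properties):
    """Render a dict of settings as hadoop-site.xml style text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):   # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

SITE_XML = build_site_xml(MAPRED_SETTINGS)
```

Writing `SITE_XML` to a file gives a starting point that can then be edited by hand for the actual cluster.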

12 Running Nutch in a distributed environment
  - DFS: in hadoop-site.xml
      - Specify namenode host, port & directory
          - fs.default.name
          - dfs.name.dir
      - Specify location for files on each datanode
          - dfs.data.dir
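
A quick way to avoid a failed demo is to check that the three DFS properties are actually set before starting the daemons. The sketch below reads hadoop-site.xml style text and reports which of them are missing. The sample values are placeholders (the exact form of `fs.default.name` depends on the Hadoop version), and `read_properties` and `missing_dfs_settings` are helper names invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content; only the property names come from the slide.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>master.example:9000</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/data</value>
  </property>
</configuration>"""

REQUIRED_DFS_PROPERTIES = ("fs.default.name", "dfs.name.dir", "dfs.data.dir")

def read_properties(xml_text):
    """Return {name: value} for every <property> in the configuration."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}

def missing_dfs_settings(properties):
    """List the required DFS properties that are absent or empty."""
    return [name for name in REQUIRED_DFS_PROPERTIES if not properties.get(name)]

PROPERTIES = read_properties(SAMPLE_SITE_XML)
MISSING = missing_dfs_settings(PROPERTIES)
```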

13 Demo time!

14 Q&A

15 Discussion

16 Exercises
  - Hands-on exercises
      - Install Nutch, crawl a few webpages using the crawl command, and search the result using the GUI
      - Repeat the crawling process without using the crawl command
      - Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
          - Crawl only webpages and PDFs, nothing else
          - Crawl the files on your hard disk
          - Crawl but do not parse
      - (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state

17 Reference
  - http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
  - http://lucene.apache.org/hadoop/ -- Hadoop homepage
  - http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
  - http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
  - http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
  - http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together

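18 Excursion: MapReduce
  - Problem
      - Find the number of occurrences of "cat" in a file
      - What if the file is 20 GB large? Why not do it with more computers?
  - Solution
      [Diagram: the file is divided into Split 1 and Split 2; PC1 and PC2 each count one split (200 and 300); PC1 adds the partial counts to get 500.]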
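
The idea on this slide, counting each split separately and then adding the partial counts, can be tried on one machine. In the sketch below, threads stand in for the two PCs; the splitting by lines, the function names and the sample text are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def count_word(lines, word):
    """Count whole-word occurrences of `word` in a list of lines."""
    return sum(line.split().count(word) for line in lines)

def split_lines(lines, n_splits):
    """Cut the input into at most n_splits chunks of roughly equal size."""
    size = max(1, -(-len(lines) // n_splits))   # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def distributed_count(lines, word, n_workers=2):
    """Let each worker count one split, then add up the partial counts."""
    splits = split_lines(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(lambda split: count_word(split, word), splits))
    return partial, sum(partial)

# Hypothetical four-line "file".
LINES = ["cat dog cat", "dog", "cat", "bird cat"]
PARTIAL, TOTAL = distributed_count(LINES, "cat", n_workers=2)
```

Splitting on line boundaries sidesteps the question of a word being cut in half at a split boundary, which a byte-based split of a real 20 GB file would have to handle.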

19 Excursion: MapReduce
  - Problem
      - Find the number of occurrences of both "cat" and "dog" in a very large file
  - Solution
      [Diagram. Stages: Input files, Map, Intermediate files, Sort/Group, Reduce, Output files.
       Map: PC1 processes Split 1 and emits cat: 200, dog: 250; PC2 processes Split 2 and emits cat: 300, dog: 250.
       Sort/Group: cat: 200, 300 and dog: 250, 250.
       Reduce: PC1 outputs cat: 500; PC2 outputs dog: 500.]
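
The three stages in the diagram can be reproduced with the slide's own numbers. The Python below is a sketch: the two splits are synthetic word lists built to contain exactly the counts shown on the slide, and the function names are invented for this example.

```python
from collections import Counter, defaultdict

WORDS = ("cat", "dog")

def map_split(words):
    """Map: count the words of interest within one split."""
    counts = Counter(w for w in words if w in WORDS)
    return [(w, counts[w]) for w in WORDS]

def sort_group(mapped):
    """Sort/Group: collect the partial counts that share the same key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(groups):
    """Reduce: add up the partial counts of each key."""
    return {key: sum(values) for key, values in groups.items()}

# Synthetic splits matching the numbers on the slide.
SPLIT_1 = ["cat"] * 200 + ["dog"] * 250
SPLIT_2 = ["cat"] * 300 + ["dog"] * 250

MAPPED = [map_split(SPLIT_1), map_split(SPLIT_2)]
GROUPED = sort_group(MAPPED)
RESULT = reduce_counts(GROUPED)
```

Because grouping is by key, the reduce step for "cat" and the one for "dog" are independent and can run on different machines, as PC1 and PC2 do in the diagram.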

20 Excursion: MapReduce
  - Generalized Framework
      [Diagram. Stages: Input files, Map, Intermediate files, Sort/Group, Reduce, Output files. A Master coordinates the Workers.
       Input: Split 1, Split 2, Split 3, Split 4.
       Map workers emit: k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6.
       After Sort/Group: k1:v1,v3; k2:v4,v5; k3:v2; k4:v6.
       Reduce workers write Output 1, Output 2, Output 3.]
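
One detail the generalized diagram implies is that every key must be routed to exactly one reduce worker, so that each output file holds complete groups. The sketch below shows one common way to do that, hashing the key. It is an assumption-laden illustration: the CRC32-based `partition` function, the `master` function and the distribution of pairs over the splits are made up here, and only the key/value labels come from the slide.

```python
import zlib
from collections import defaultdict

def partition(key, n_reducers):
    """Pick the reduce worker for a key (CRC32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

def master(splits, map_fn, reduce_fn, n_reducers):
    """Coordinate the job: map every split, route pairs by key, reduce per worker."""
    # One set of intermediate groups per reduce worker.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for split in splits:
        for key, value in map_fn(split):
            buckets[partition(key, n_reducers)][key].append(value)
    # Each reduce worker produces its own output.
    return [{key: reduce_fn(key, values) for key, values in sorted(bucket.items())}
            for bucket in buckets]

# Key/value labels from the slide; their assignment to splits is hypothetical.
SPLITS = [
    [("k1", "v1"), ("k3", "v2")],
    [("k1", "v3"), ("k2", "v4")],
    [("k2", "v5")],
    [("k4", "v6")],
]

def emit_records(split):
    """Map function for this sketch: pass the records through unchanged."""
    return iter(split)

def join_values(key, values):
    """Reduce function for this sketch: concatenate the grouped values."""
    return ",".join(values)

OUTPUTS = master(SPLITS, emit_records, join_values, n_reducers=3)
```

Which keys land in which of the three outputs depends on the hash, but every key appears in exactly one output together with all of its values.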

