Nutch in a Nutshell Presented by Liew Guo Min Zhao Jin.

Outline
- Recap
- Special features
- Running Nutch in a distributed environment (with demo)
- Q&A
- Discussion

Recap
- Complete web search engine
  - Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java based, open source
- Features:
  - Customizable
  - Extensible
  - Distributed

Nutch as a crawler
[Diagram: crawl workflow connecting Initial URLs, Injector, CrawlDB, Generator, Segment, Fetcher, Web (webpages/files), Parser, and CrawlDBTool. Edge labels: generate, get, update, read/write.]

Special Features
- Extensible (Plugin system)
  - Most of the essential functionalities of Nutch are implemented as plugins
  - Three layers
    - Extension points
      - What can be extended: Protocol, Parser, ScoringFilter, etc.
    - Extensions
      - The interfaces to be implemented for the extension points
    - Plugins
      - The actual implementation

Special Features
- Extensible (Plugin system)
  - Anyone can write a plugin
    - Write the code
    - Prepare metadata files
      - plugin.xml: what has been extended by what
      - build.xml: how ant can build your source code
    - Ask Nutch to include your plugin in conf/nutch-site.xml
    - Tell ant to build your plugin in src/plugin/build.xml
  - More details @ http://wiki.apache.org/nutch/PluginCentral

Special Features
- Extensible (Plugin system)
  - To use a plugin
    - Make sure you have modified nutch-site.xml to include the plugin
    - Then, either
      - Nutch automatically calls it when needed, or
      - You write code that looks it up by its class name and then uses it
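
The plugin idea on these slides can be sketched in a few lines of Python. This is only an analogy, not Nutch's Java plugin API: `REGISTRY`, `extends`, `load_extension`, and `UpperCaseParser` are hypothetical names invented for the illustration. The decorator plays the role of plugin.xml (recording what extends what), and the `included` set stands in for the plugin list in nutch-site.xml.

```python
from abc import ABC, abstractmethod

# Hypothetical registry: extension point name -> {class name: class}.
REGISTRY = {}


class Parser(ABC):
    """Stand-in for the contract that a parser extension must fulfil."""

    @abstractmethod
    def parse(self, url, content):
        """Return a dict describing the parsed document."""


def extends(point):
    """Class decorator that records which extension point a class extends."""
    def decorator(cls):
        REGISTRY.setdefault(point, {})[cls.__name__] = cls
        return cls
    return decorator


@extends("Parser")
class UpperCaseParser(Parser):
    """Toy plugin: the actual implementation behind the extension point."""

    def parse(self, url, content):
        return {"url": url, "text": content.upper()}


def load_extension(point, classname, included):
    """Look a plugin up by class name, but only if the configuration includes it."""
    if classname not in included:
        raise LookupError(f"{classname} is not included in the configuration")
    return REGISTRY[point][classname]()


# The set below stands in for the plugin list in the site configuration.
included = {"UpperCaseParser"}
parser = load_extension("Parser", "UpperCaseParser", included)
result = parser.parse("http://example.com/", "hello nutch")
```

Looking the class up by name at run time is what lets a new implementation be added without touching the calling code.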

Special Features
- Distributed (Hadoop)
  - MapReduce (for diagrams, see "Excursion: MapReduce")
    - A framework for distributed programming
    - Map: process the splits of data to get intermediate results, together with the keys that indicate which results should be put together later
    - Reduce: process the intermediate results that share the same key and output the final result

Special Features
- Distributed (Hadoop)
  - MapReduce in Nutch
    - Example 1: Parsing
      - Input: files from fetch
      - Map(url, content): performed by calling parser plugins
      - Reduce is identity
    - Example 2: Dumping a segment
      - Input: [...], etc. files from segment
      - Map is identity
      - Reduce(url, value*): performed by simply concatenating the text representation of the values
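
The two examples above differ only in which half of the job does the work. The sketch below runs both on a tiny in-memory engine. It is a minimal illustration, not Nutch or Hadoop code: `run_mapreduce`, `parse_map`, `dump_reduce`, and the sample records are assumptions, and whitespace normalization merely stands in for a real parser plugin.

```python
from collections import defaultdict


def identity_map(key, value):
    """Pass each record through unchanged."""
    yield key, value


def identity_reduce(key, values):
    """Emit every grouped value unchanged."""
    for value in values:
        yield key, value


def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny single-process engine: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def parse_map(url, content):
    """Stand-in for a parser plugin: collapse runs of whitespace."""
    yield url, " ".join(content.split())


def dump_reduce(url, values):
    """Concatenate the text representation of all values for one URL."""
    yield url, "\n".join(str(value) for value in values)


# Example 1: parsing (the map does the work, the reduce is identity).
fetched = [("http://a.example/", "  Hello   world "),
           ("http://b.example/", "Nutch\n")]
parsed = run_mapreduce(fetched, parse_map, identity_reduce)

# Example 2: dumping (the map is identity, the reduce concatenates).
segment = [("http://a.example/", "content: Hello world"),
           ("http://b.example/", "content: Nutch"),
           ("http://a.example/", "status: fetched")]
dumped = run_mapreduce(segment, identity_map, dump_reduce)
```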

Special Features
- Distributed (Hadoop)
  - Distributed file system
    - Write-once-read-many coherence model
      - High throughput
    - Master/slave
      - Simple architecture
      - Single point of failure
    - Transparent
      - Access via Java API
    - More info @ http://lucene.apache.org/hadoop/hdfs_design.html

Running Nutch in a distributed environment
- MapReduce
  - In hadoop-site.xml
    - Specify job tracker host & port
      - mapred.job.tracker
    - Specify task numbers
      - mapred.map.tasks
      - mapred.reduce.tasks
    - Specify location for temporary files
      - mapred.local.dir

Running Nutch in a distributed environment
- DFS
  - In hadoop-site.xml
    - Specify namenode host, port & directory
      - fs.default.name
      - dfs.name.dir
    - Specify location for files on each datanode
      - dfs.data.dir
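
As a rough illustration, the properties named on the two configuration slides can be assembled into a hadoop-site.xml document with a short Python script. The property names come from the slides; every value below (host names, ports, task counts, directories) is a hypothetical placeholder, and `build_hadoop_site` is a helper invented for this sketch.

```python
import xml.etree.ElementTree as ET


def build_hadoop_site(properties):
    """Build a <configuration> element with one <property> per setting."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return root


# All values are placeholders chosen for illustration only.
settings = {
    "fs.default.name": "master.example:9000",       # namenode host and port
    "dfs.name.dir": "/data/nutch/dfs/name",         # namenode directory
    "dfs.data.dir": "/data/nutch/dfs/data",         # datanode file location
    "mapred.job.tracker": "master.example:9001",    # job tracker host and port
    "mapred.map.tasks": 2,                          # number of map tasks
    "mapred.reduce.tasks": 2,                       # number of reduce tasks
    "mapred.local.dir": "/data/nutch/mapred/local"  # temporary files
}

root = build_hadoop_site(settings)
if hasattr(ET, "indent"):  # pretty-printing helper, available from Python 3.9
    ET.indent(root)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating the file from one dictionary keeps the master and slave machines consistent, since the same settings are usually copied to every node.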

Demo time!

Discussion

Exercises
- Hands-on exercises
  - Install Nutch, crawl a few webpages using the crawl command, and search the result using the GUI
  - Repeat the crawling process without using the crawl command
  - Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
    - Crawl only webpages and PDFs, nothing else
    - Crawl the files on your hard disk
    - Crawl but do not parse
  - (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state

Reference
- http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
- http://lucene.apache.org/hadoop/ -- Hadoop homepage
- http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
- http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
- http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together

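Excursion: MapReduce
- Problem
  - Find the number of occurrences of “cat” in a file
  - What if the file is 20 GB large? Why not do it with more computers?
- Solution
  [Diagram: the file is divided into Split 1 and Split 2. PC1 and PC2 each count one split, giving 200 and 300. PC1 then adds the partial counts to get 500.]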
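
The split-and-add idea can be tried on a single machine. The sketch below is a minimal illustration in which threads stand in for the PCs on the slide; `count_word`, `split_lines`, `distributed_count`, and the sample lines are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_word(lines, word):
    """Count whole-word occurrences of `word` in a list of lines."""
    return sum(line.split().count(word) for line in lines)


def split_lines(lines, parts):
    """Cut the input into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def distributed_count(lines, word, workers=2):
    """Count each split on its own worker, then add up the partial counts."""
    splits = split_lines(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda chunk: count_word(chunk, word), splits))
    return partial, sum(partial)


lines = ["cat dog cat", "dog", "cat", "bird cat cat"]
partial, total = distributed_count(lines, "cat")
print(partial, total)  # prints: [2, 3] 5
```

The final addition is cheap compared with scanning the file, which is why spreading the scan over more machines pays off.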

Excursion: MapReduce
- Problem
  - Find the number of occurrences of both “cat” and “dog” in a very large file
- Solution
  [Diagram, stages from left to right: Input Files, Map, Intermediate files, Sort/Group, Reduce, Output files. The file is divided into Split 1 and Split 2. Map: PC1 emits cat: 200, dog: 250 and PC2 emits cat: 300, dog: 250. Sort/Group: cat: 200, 300 and dog: 250, 250. Reduce: PC1 outputs cat: 500 and PC2 outputs dog: 500.]
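
With two words to count, the key decides which partial results end up together. The following sketch walks through the same three stages (map, sort/group, reduce) on toy data; the function names and the sample splits are assumptions made for the illustration, and everything runs in one process.

```python
from collections import Counter, defaultdict


def map_split(lines, words):
    """Map: count each word of interest within one split."""
    counts = Counter(tok for line in lines for tok in line.split() if tok in words)
    return [(word, counts[word]) for word in sorted(words)]


def sort_group(intermediate):
    """Sort/Group: collect the values that share a key, ordered by key."""
    grouped = defaultdict(list)
    for pairs in intermediate:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_sum(key, values):
    """Reduce: add up all partial counts for one key."""
    return key, sum(values)


words = {"cat", "dog"}
split1 = ["cat dog", "cat dog dog"]
split2 = ["cat cat cat", "dog bird"]

intermediate = [map_split(split1, words), map_split(split2, words)]
grouped = sort_group(intermediate)
result = dict(reduce_sum(key, values) for key, values in grouped.items())
print(result)  # prints: {'cat': 5, 'dog': 4}
```

Because each key is reduced independently, the "cat" and "dog" totals could be computed on different machines, as in the diagram.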

Excursion: MapReduce
- Generalized Framework
  [Diagram, stages from left to right: Input Files, Map, Intermediate files, Sort/Group, Reduce, Output files. A Master coordinates the Workers. Map workers read Split 1 to Split 4 and emit k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6. Sort/Group yields k1:v1,v3; k2:v4,v5; k3:v2; k4:v6. Reduce workers write Output 1, Output 2, and Output 3.]
