1 Nutch in a Nutshell (part I)
Presented by Liew Guo Min and Zhao Jin

2 Outline
- Overview
- Nutch as a web crawler
- Nutch as a complete web search engine
- Special features
- Installation/Usage (with demo)
- Exercises

3 Overview
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features:
  - Customizable
  - Extensible (next meeting)
  - Distributed (next meeting)

4 Nutch as a crawler
[Diagram: the crawl loop. The Injector writes the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates fetch lists (segments); the Fetcher gets webpages/files from the Web and writes them into the segment; the Parser parses the fetched content; the CrawlDBTool updates the CrawlDB with the results.]

5 Nutch as a complete web search engine
[Diagram: the Indexer (Lucene) builds the index from the segments, the CrawlDB, and the LinkDB; the Searcher (Lucene) answers queries against the index and serves results through the GUI (Tomcat).]

6 Special Features
- Customizable
  - Configuration files (XML)
    - Required user parameters (a configuration sketch follows): http.agent.name, http.agent.description, http.agent.url, http.agent.email
    - Adjustable parameters for every component, e.g. for the fetcher: threads-per-host, threads-per-ip
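A minimal conf/nutch-site.xml sketch setting the four required user parameters; the agent values shown are illustrative placeholders, not defaults:

    <?xml version="1.0"?>
    <!-- Overrides for conf/nutch-default.xml. Only the required
         http.agent.* properties are set; values are placeholders. -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>A crawler used for a class exercise</value>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.com/crawler.html</value>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler@example.com</value>
      </property>
    </configuration>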

7 Special Features (continued)
- URL filters (text file)
  - Regular expressions to filter URLs during crawling, e.g. (a fuller file sketch follows):
    - To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
- Plugin information (XML)
  - The metadata of the plugins (more details next week)
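Putting the two example patterns together, a sketch of conf/crawl-urlfilter.txt; rules are tried top-down and the first match wins, so the final catch-all line rejects anything not accepted earlier:

    # skip files with these suffixes
    -\.(gif|exe|zip|ico)$
    # accept hosts in the apache.org domain
    +^http://([a-z0-9]*\.)*apache.org/
    # skip everything else
    -.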

8 Installation & Usage
- Installation
  - Software needed:
    - Nutch release
    - Java
    - Apache Tomcat (for the GUI)
    - Cygwin (for Windows)

9 Installation & Usage
- Usage (a command sketch follows)
  - Crawling
    - Initial URLs (text file or DMOZ file)
    - Required parameters (conf/nutch-site.xml)
    - URL filters (conf/crawl-urlfilter.txt)
  - Indexing
    - Automatic
  - Searching
    - Location of files (WAR file, index)
    - The Tomcat server
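A sketch of the corresponding commands for a Nutch 0.8-style layout; the directory names, depth, and topN values are illustrative:

    # crawl the seed URLs: depth = link levels to follow,
    # topN = maximum pages fetched per level
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # deploy the search GUI: copy the Nutch WAR into Tomcat, then point
    # the searcher.dir property in the webapp's nutch-site.xml at ./crawl
    cp nutch-0.8.war $CATALINA_HOME/webapps/ROOT.war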

10 Demo time!

11 Exercises
Questions:
- What needs to be done before starting a crawl job with Nutch?
- What are the ways to tell Nutch what to crawl and what not? What can you do if you are the owner of a website?
- Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- What do you think are good crawling behaviors?
- Do you think an open-source search engine like Nutch makes it easier for spammers to manipulate the search index ranking?
- What are the advantages of using Nutch instead of commercial search engines?

12 Answers
What needs to be done before starting a crawl job with Nutch? (A shell sketch follows.)
- Set the CLASSPATH to include the Lucene core
- Set the JAVA_HOME path
- Create a folder containing the URLs to be crawled
- Amend the crawl-urlfilter file
- Amend the nutch-site.xml file to include the required user parameters
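As a rough shell sketch of that checklist (all paths and file names are illustrative):

    export JAVA_HOME=/usr/lib/jvm/java            # illustrative path
    export CLASSPATH=$CLASSPATH:lucene-core.jar   # illustrative jar name

    # folder with the seed URLs, one per line
    mkdir urls
    echo "http://lucene.apache.org/nutch/" > urls/seeds.txt

    # then edit conf/crawl-urlfilter.txt and conf/nutch-site.xml as above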

13 What are the ways to tell Nutch what to crawl and what not?
- URL filters
- Crawl depth
- The scoring function for URLs
What can you do if you are the owner of a website?
- Web server administrators: use the Robot Exclusion Protocol by adding rules to the site's /robots.txt (example below)
- HTML authors: add the Robots META tag (example below)
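For illustration, the two mechanisms look like this; the disallowed path is a placeholder:

    # /robots.txt (Robot Exclusion Protocol): applies to all robots
    User-agent: *
    Disallow: /private/

    <!-- Robots META tag, placed in a page's <head> -->
    <meta name="robots" content="noindex,nofollow">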

14 Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- To ensure accountability (although tracing is still possible without them)
What do you think are good crawling behaviors?
- Be accountable
- Test locally
- Don't hog resources
- Stay with it
- Share results

15 Do you think an open-source search engine like Nutch makes it easier for spammers to manipulate the search index ranking?
- To some extent, but one can always modify Nutch to minimize the effect.
What are the advantages of using Nutch instead of commercial search engines?
- Open source
- Transparent
- You can define what is returned in searches and how the index ranking works

16 Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful (a configuration sketch for the first job follows):
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
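For the first job, a sketch of the kind of changes involved, assuming Nutch 0.8-style plugin names and an abbreviated default plugin list: keep skipping binary suffixes in crawl-urlfilter.txt but leave pdf off the skip list, and add parse-pdf to plugin.includes in conf/nutch-site.xml:

    <!-- enable the PDF parser alongside the usual plugins
         (list abbreviated; parse-pdf is the addition) -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
    </property>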

17 Q&A?

18 Next Meeting
- Special features:
  - Extensible
  - Distributed
- Feedback and discussion

19 References
- http://lucene.apache.org/nutch/ -- official website
- http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take with a grain of salt)
- http://lucene.apache.org/nutch/release/ -- Nutch source code
- www.nutchinstall.blogspot.com -- installation guide
- http://www.robotstxt.org/wc/robots.html -- the Web Robots Pages

20 Thank you!

