1 Nutch in a Nutshell (part I)
Presented by Liew Guo Min and Zhao Jin

2 Outline
- Overview
- Nutch as a web crawler
- Nutch as a complete web search engine
- Special features
- Installation/Usage (with demo)
- Exercises

3 Overview
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features:
  - Customizable
  - Extensible (next meeting)
  - Distributed (next meeting)

4 Nutch as a crawler
[Diagram: the crawl loop. The Injector writes the initial URLs into the CrawlDB; the Generator reads the CrawlDB and generates fetch lists (segments); the Fetcher gets webpages/files from the Web and writes them into the segment; the Parser parses the fetched content; the CrawlDBTool updates the CrawlDB with the results.]

5 Nutch as a complete web search engine
[Diagram: the Indexer (Lucene) builds the index from the segments, the CrawlDB, and the LinkDB; the Searcher (Lucene) answers queries against the index and serves results through the GUI (Tomcat).]

6 Special Features
- Customizable
  - Configuration files (XML)
    - Required user parameters (a configuration sketch follows): http.agent.name, http.agent.description, http.agent.url, http.agent.email
    - Adjustable parameters for every component, e.g. for the fetcher: threads-per-host, threads-per-ip
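A minimal conf/nutch-site.xml sketch setting the four required user parameters; the agent values shown are illustrative placeholders, not defaults:

    <?xml version="1.0"?>
    <!-- Overrides for conf/nutch-default.xml. Only the required
         http.agent.* properties are set; values are placeholders. -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>A crawler used for a class exercise</value>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.com/crawler.html</value>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler@example.com</value>
      </property>
    </configuration>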

7 Special Features (continued)
- URL filters (text file)
  - Regular expressions to filter URLs during crawling, e.g. (a fuller file sketch follows):
    - To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
- Plugin information (XML)
  - The metadata of the plugins (more details next week)
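Putting the two example patterns together, a sketch of conf/crawl-urlfilter.txt; rules are tried top-down and the first match wins, so the final catch-all line rejects anything not accepted earlier:

    # skip files with these suffixes
    -\.(gif|exe|zip|ico)$
    # accept hosts in the apache.org domain
    +^http://([a-z0-9]*\.)*apache.org/
    # skip everything else
    -.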

8 Installation & Usage
- Installation
  - Software needed:
    - Nutch release
    - Java
    - Apache Tomcat (for the GUI)
    - Cygwin (for Windows)

9 Installation & Usage
- Usage (a command sketch follows)
  - Crawling
    - Initial URLs (text file or DMOZ file)
    - Required parameters (conf/nutch-site.xml)
    - URL filters (conf/crawl-urlfilter.txt)
  - Indexing
    - Automatic
  - Searching
    - Location of files (WAR file, index)
    - The Tomcat server
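A sketch of the corresponding commands for a Nutch 0.8-style layout; the directory names, depth, and topN values are illustrative:

    # crawl the seed URLs: depth = link levels to follow,
    # topN = maximum pages fetched per level
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # deploy the search GUI: copy the Nutch WAR into Tomcat, then point
    # the searcher.dir property in the webapp's nutch-site.xml at ./crawl
    cp nutch-0.8.war $CATALINA_HOME/webapps/ROOT.war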

10 Demo time!

11 Exercises
Questions:
- What needs to be done before starting a crawl job with Nutch?
- What are the ways to tell Nutch what to crawl and what not? What can you do if you are the owner of a website?
- Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- What do you think are good crawling behaviors?
- Do you think an open-source search engine like Nutch makes it easier for spammers to manipulate the search index ranking?
- What are the advantages of using Nutch instead of commercial search engines?

12 Answers
What needs to be done before starting a crawl job with Nutch? (A shell sketch follows.)
- Set the CLASSPATH to include the Lucene core
- Set the JAVA_HOME path
- Create a folder containing the URLs to be crawled
- Amend the crawl-urlfilter file
- Amend the nutch-site.xml file to include the required user parameters
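As a rough shell sketch of that checklist (all paths and file names are illustrative):

    export JAVA_HOME=/usr/lib/jvm/java            # illustrative path
    export CLASSPATH=$CLASSPATH:lucene-core.jar   # illustrative jar name

    # folder with the seed URLs, one per line
    mkdir urls
    echo "http://lucene.apache.org/nutch/" > urls/seeds.txt

    # then edit conf/crawl-urlfilter.txt and conf/nutch-site.xml as above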

13 What are the ways to tell Nutch what to crawl and what not?
- URL filters
- Crawl depth
- The scoring function for URLs
What can you do if you are the owner of a website?
- Web server administrators: use the Robot Exclusion Protocol by adding rules to the site's /robots.txt (example below)
- HTML authors: add the Robots META tag (example below)
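For illustration, the two mechanisms look like this; the disallowed path is a placeholder:

    # /robots.txt (Robot Exclusion Protocol): applies to all robots
    User-agent: *
    Disallow: /private/

    <!-- Robots META tag, placed in a page's <head> -->
    <meta name="robots" content="noindex,nofollow">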

14 Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- To ensure accountability (although tracing is still possible without them)
What do you think are good crawling behaviors?
- Be accountable
- Test locally
- Don't hog resources
- Stay with it
- Share results

15 Do you think an open-source search engine like Nutch makes it easier for spammers to manipulate the search index ranking?
- To some extent, but one can always modify Nutch to minimize the effect.
What are the advantages of using Nutch instead of commercial search engines?
- Open source
- Transparent
- You can define what is returned in searches and how the index ranking works

16 Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful (a configuration sketch for the first job follows):
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
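For the first job, a sketch of the kind of changes involved, assuming Nutch 0.8-style plugin names and an abbreviated default plugin list: keep skipping binary suffixes in crawl-urlfilter.txt but leave pdf off the skip list, and add parse-pdf to plugin.includes in conf/nutch-site.xml:

    <!-- enable the PDF parser alongside the usual plugins
         (list abbreviated; parse-pdf is the addition) -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
    </property>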

17 Q&A?

18 Next Meeting
- Special features:
  - Extensible
  - Distributed
- Feedback and discussion

19 References
- http://lucene.apache.org/nutch/ -- official website
- http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take with a grain of salt)
- http://lucene.apache.org/nutch/release/ -- Nutch source code
- www.nutchinstall.blogspot.com -- installation guide
- http://www.robotstxt.org/wc/robots.html -- the Web Robots Pages

20 Thank you!

