Nutch in a Nutshell (part I)
Presented by Liew Guo Min and Zhao Jin
Outline
- Overview
- Nutch as a web crawler
- Nutch as a complete web search engine
- Special features
- Installation/Usage (with demo)
- Exercises
Overview
- Complete web search engine: Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
- Java-based, open source
- Features: customizable, extensible (next meeting), distributed (next meeting)
Nutch as a crawler
[Architecture diagram: the Injector reads the initial URLs and writes them into the CrawlDB; the Generator reads the CrawlDB and generates a segment (fetch list); the Fetcher gets the webpages/files from the web; the Parser parses them; the CrawlDBTool updates the CrawlDB with the results.]
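The crawl cycle in the diagram maps onto Nutch's command-line tools roughly as follows. This is a sketch for a Nutch 0.8-era release; exact sub-command names and paths vary between versions, and the `crawl/` and `urls` directory names here are placeholders:

```shell
bin/nutch inject crawl/crawldb urls               # Injector: seed the CrawlDB with the initial URLs
bin/nutch generate crawl/crawldb crawl/segments   # Generator: create a segment (fetch list)
bin/nutch fetch crawl/segments/<segment>          # Fetcher: download the webpages/files (parsing by default)
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>   # CrawlDBTool: update the CrawlDB with new links
```

Repeating the generate/fetch/updatedb steps deepens the crawl by one level each round.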
Nutch as a complete web search engine
[Architecture diagram: the Indexer (Lucene) builds an index from the segments, the CrawlDB, and the LinkDB; the Searcher (Lucene) answers queries against the index through the GUI, which runs on Tomcat.]
Special Features: Customizable
- Configuration files (XML)
  - Required user parameters: http.agent.name, http.agent.description, http.agent.url, http.agent.
  - Adjustable parameters for every component, e.g. for the fetcher: threads-per-host, threads-per-ip
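A minimal conf/nutch-site.xml illustrating the required agent properties might look like this (the values are placeholders, not ones from these slides):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>   <!-- placeholder agent name -->
  </property>
  <property>
    <name>http.agent.description</name>
    <value>A test crawler for coursework</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler</value>
  </property>
</configuration>
```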
Special Features: Customizable (cont.)
- URL filters (text file): regular expressions to filter URLs during crawling
  - E.g. to ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
  - To accept hosts in a certain domain: +^
- Plugin information (XML): the metadata of the plugins (more details next week)
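The filter rules above can be understood as an ordered list of +/- prefixed regular expressions where the first match wins. This hypothetical Python sketch mimics that behavior; the example.com accept rule is an illustrative assumption, since the slide's domain pattern is truncated:

```python
import re

# Sketch of Nutch-style URL filter rules: each rule is a (+/-) sign plus a
# regular expression; the first rule whose pattern matches decides the URL's fate.
RULES = [
    ("-", re.compile(r"\.(gif|exe|zip|ico)$")),                 # reject certain suffixes
    ("+", re.compile(r"^http://([a-z0-9]*\.)*example\.com/")),  # accept one domain (assumed)
]

def accepts(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject by default, as Nutch's regex filter does

print(accepts("http://www.example.com/index.html"))  # True
print(accepts("http://www.example.com/logo.gif"))    # False
```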
Installation & Usage: Installation
- Software needed:
  - Nutch release
  - Java
  - Apache Tomcat (for the GUI)
  - Cygwin (for Windows)
Installation & Usage: Usage
- Crawling
  - Initial URLs (text file or DMOZ file)
  - Required parameters (conf/nutch-site.xml)
  - URL filters (conf/crawl-urlfilter.txt)
- Indexing: automatic
- Searching
  - Location of files (WAR file, index)
  - The Tomcat server
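Once the initial URLs, required parameters, and URL filters are in place, the simplest way to run everything is the one-shot crawl command. This is a sketch with flag names as found in Nutch 0.7/0.8-era releases; the directory names and numbers are placeholders:

```shell
# crawl the seeds in urls/, store results under crawl/,
# go 3 links deep, and fetch at most 50 top-scoring pages per level
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```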
Demo time!
Exercises
Questions:
- What are the things that need to be done before starting a crawl job with Nutch?
- What are the ways to tell Nutch what to crawl and what not to crawl? What can you do if you are the owner of a website?
- Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- What do you think are good crawling behaviors?
- Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking?
- What are the advantages of using Nutch instead of commercial search engines?
Answers
Q: What are the things that need to be done before starting a crawl job with Nutch?
- Set the CLASSPATH to the Lucene core
- Set the JAVA_HOME path
- Create a folder containing the URLs to be crawled
- Amend the crawl-urlfilter file
- Amend the nutch-site.xml file to include the user parameters
Q: What are the ways to tell Nutch what to crawl and what not to crawl?
- URL filters
- Depth of crawling
- Scoring function for URLs
Q: What can you do if you are the owner of a website?
- Web server administrators: use the Robots Exclusion Protocol by adding rules to /robots.txt
- HTML authors: add the Robots META tag
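For instance, a site owner could block a crawler (identified by its http.agent.name) with an entry like this in /robots.txt; the agent name here is a placeholder:

```
User-agent: MyNutchCrawler
Disallow: /
```

An HTML author without server access can instead add `<meta name="robots" content="noindex,nofollow">` to an individual page's head.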
Q: Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
- To ensure accountability (although tracing is still possible without them)
Q: What do you think are good crawling behaviors?
- Be accountable
- Test locally
- Don't hog resources
- Stay with it
- Share results
Q: Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate the search index ranking?
- True, but one can always make changes in Nutch to minimize the effect.
Q: What are the advantages of using Nutch instead of commercial search engines?
- Open source
- Transparent
- Able to define what is returned in searches and how the index ranking works
Exercises
Hands-on exercises:
- Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
- Repeat the crawling process without using the crawl command
- Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
  - Crawl only webpages and PDFs, but nothing else
  - Crawl the files on your hard disk
  - Crawl but do not parse
- (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
Q&A?
Next Meeting
- Special features: extensible, distributed
- Feedback and discussion
References
- Official website
- Nutch wiki (seriously outdated; take with a grain of salt)
- Nutch source code
- Installation guide
- The web robot pages
Thank you!