1 Nutch Tutorial
IST 516 Fall 2010
Dongwon Lee, Ph.D. Wonhong Nam, Ph.D.

2 What is Nutch?
Apache has open-source solutions for the two main components of a search engine:
  Crawler: Nutch
  Indexer: Lucene → Solr → Lucene/Solr (merged in 2010)
A project headed by Doug Cutting
  Goal: an open-source search engine scalable enough to index the entire web (billions of pages)
Nutch includes a Java crawler, an HTML parser, the Lucene search/index library, and lots more

3 Features of Nutch
Robot crawler that can work through a proxy
Host inclusion via grep-style patterns; exclusion by host names and suffixes
Continuous indexing
FTP indexing with a login option
Index logging options
Flexible query parsing
Includes a link-analysis module (mainly for multi-site search)
Includes approximately fifteen relevance/quality adjustment options
Caches the original page for display

4 Workflow of Nutch
There are two paths through a search engine: the index path and the query path
The index path shows how the index gets filled with documents
Documents are fed to an analyzer, which transforms them into appropriately weighted terms (scores) and passes them to the IndexWriter

5 Connection Steps
For security reasons, the ist516 server is only accessible from IST's VLabs
First, log in to IST's VLabs environment
Second, from VLabs, log in to the ist516 server

6 Connecting to VLabs
From the Windows/Mac remote-desktop client, log in to VLabs using your PSU ID/PWD
Note: enter the user name in the form "UP\PSU-ID", as shown below

7 Connecting to ist516.ist.psu.edu
A UNIX server, ist516.ist.psu.edu, is prepared for project #2
It can be accessed via the SSH protocol only
If an SSH client is not pre-installed, download one; the client also provides a "File Transfer" tool
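For those who prefer a plain command line, a standard ssh client reaches the same server (a sketch; team-ID stands for whatever account name your team was given):

> ssh team-ID@ist516.ist.psu.edu    # log in with the provided team ID/PWD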

8 Connecting to ist516.ist.psu.edu
If an SSH client is pre-installed in VLabs, use it
Choose "Quick connect" and use the provided team ID/PWD

9 ist516.ist.psu.edu
Tomcat (Apache's servlet container) and Nutch are already installed on the server
Nutch lives under each team's home directory (e.g., /home/team-ID/nutch-1.0)
Modify the files under nutch-1.0/conf to change the behavior of Nutch as you wish
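A quick way to see what can be tuned (a sketch; the exact set of files may differ slightly between Nutch versions):

> cd /home/team-ID/nutch-1.0
> ls conf
# among others: nutch-default.xml (all default settings), nutch-site.xml (your overrides),
# and crawl-urlfilter.txt (the URL filter used by the crawl command)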

10 Running Tomcat and Nutch
To start or stop the Tomcat server, all you need to do is type: start-tomcat or stop-tomcat
To run Nutch, at the command line just type: nutch, or provide various parameters: nutch [parameters]
The server has most of the typical UNIX software installed, including:
  wget: to download things by URL
  nano: a small editor that Windows users may find useful/familiar
  Emacs: a full-fledged, powerful UNIX editor
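A minimal sketch of a first session (start-tomcat/stop-tomcat are the wrappers described above; readdb is a standard Nutch sub-command, shown purely as an illustration of the nutch [parameters] form):

> start-tomcat                              # bring up the Tomcat server
> nutch                                     # with no parameters, prints the list of available sub-commands
> nutch readdb crawl.test/crawldb -stats    # example: statistics for the crawl db built later in this tutorial
> stop-tomcat                               # shut Tomcat down again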

11 Crawling in Nutch
There are two approaches to crawling:
  Intranet crawling, with the crawl command
  Whole-web crawling, with much greater control, using the lower-level inject, generate, fetch, and updatedb commands
Intranet crawling is more suitable for a small-scale project

12 1. Intranet Crawling
Create a text file, say urlfile.txt, containing some seed URLs
Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl
  E.g., if you wish to limit the crawl to the pike.psu.edu domain, the +^ filter line should name that domain (see the sketch below); this will include any URL in the pike.psu.edu domain
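For reference, a sketch of the two files (the seed URL is hypothetical, and the filter regex follows the form used in the stock crawl-urlfilter.txt shipped with Nutch rather than anything copied from the original slide):

> cat urlfile.txt
http://pike.psu.edu/

# the relevant line in conf/crawl-urlfilter.txt, with MY.DOMAIN.NAME replaced:
+^http://([a-z0-9]*\.)*pike.psu.edu/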

13 1. Intranet Crawling
Edit the file conf/nutch-site.xml accordingly
At a minimum, insert the following property and fill in a proper value:

<property>
  <name>http.agent.name</name>
  <value>YOUR-CRAWLER-NAME-HERE</value>
  <description></description>
</property>

14 1. Intranet Crawling
Use the crawl command for crawling. Its options include:
  -dir: names the directory to put the crawl in
  -depth: indicates the link depth from the root page that should be crawled
  -delay: determines the number of seconds between accesses to each host
  -threads: determines the number of threads that will fetch in parallel
E.g., a typical call might be:
> nutch crawl urlfile.txt -dir crawl.test -depth 3 >& log
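When the command finishes, the -dir directory holds the crawl data; a quick sanity check might look like this (a sketch; the sub-directory layout shown is the usual Nutch 0.9/1.0 one and may vary by version):

> ls crawl.test
crawldb  index  indexes  linkdb  segments
> tail log          # the redirected output shows fetch progress and any errors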

15 1. Intranet Crawling
The indexer uses the downloaded content to generate an inverted index of all terms and all pages
The document set is divided into a set of index segments, each of which is fed to a single searcher process
Each searcher also draws on the Web content fetched earlier, so it can serve a cached copy of any Web page

16 2. Internet Crawling
Whole-web crawling needs more steps than intranet crawling (one cycle is sketched below)
Explore it for your project #2
Refer to the official Nutch tutorial listed on the References slide
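Purely as an illustration of the lower-level commands named on slide 11, one fetch cycle of a whole-web crawl roughly follows this pattern (a sketch based on the standard Nutch 1.0 tutorial; the directory names are assumptions, and the indexing/merging steps are omitted):

> mkdir urls                                      # directory holding the seed-URL file(s)
> nutch inject crawl/crawldb urls                 # seed the crawl db with the URL list
> nutch generate crawl/crawldb crawl/segments     # generate a fetch list in a new segment
> s=`ls -d crawl/segments/* | tail -1`            # pick up the newest segment directory
> nutch fetch $s                                  # fetch the pages on the fetch list
> nutch updatedb crawl/crawldb $s                 # fold the fetched pages and new links back into the crawl db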

17 3. Searching
Tomcat is installed, and each group has its own webapp directory, which holds the Nutch war file
To search, put the Nutch war file into your servlet container:
> cp ~/nutch-1.0/nutch*.war ~/tomcat/webapps/ROOT.war
Go to the directory that your crawler created and start the Tomcat server:
> cd crawl.test
> start-tomcat
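Starting Tomcat from inside the crawl directory is what lets the webapp find the index. As an alternative (a sketch; the file location assumes the ROOT.war has been unpacked by Tomcat), the crawl location can be set explicitly via Nutch's searcher.dir property in the webapp's own nutch-site.xml:

> nano ~/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
#   set the searcher.dir property to /home/team-ID/crawl.test
> stop-tomcat
> start-tomcat      # restart Tomcat so the change takes effect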

18 3. Searching
Connect your browser to your team's search URL, where ? is your group number (e.g., Team 1)
To access this URL, students need to log in to VLabs first and access it from there: vlabs.up.ist.psu.edu + PSU ID/PWD
Refer to the VLabs Tutorial for more details

19 3. Searching

20 Editing Nutch Look
To change the look & feel of the search interface:
  search.html is automatically generated, so do not edit it directly
  Instead, change the XML source files:
    ~/nutch-1.0/src/web/pages/en/search.xml
    ~/nutch-1.0/src/web/pages/en/about.xml
    ~/nutch-1.0/src/web/pages/en/help.xml
For more details on how to edit the Nutch look, see the referenced guide
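A cautious way to experiment (a sketch; whether the webapp then has to be regenerated and redeployed depends on the build setup, so check the referenced guide):

> cd ~/nutch-1.0/src/web/pages/en
> cp search.xml search.xml.bak     # keep a backup of the original page source
> nano search.xml                  # edit the page content/markup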

21 References
Apache's Official Nutch Tutorial
Peter Wang's Nutch Tutorial
IST 441's Nutch Tutorial: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf

