

Lucene & Nutch — presentation transcript

1 Lucene & Nutch
Lucene
 Project name
 Started as a text indexing engine
Nutch
 A complete web search engine, including crawling, indexing, and searching
 Indexes 100M+ pages, crawls >10M pages/day
 Provides a distributed architecture
Written in Java
 Ports to other languages are works in progress

2 Lucene
Open source search project
 http://lucene.apache.org
Index & search local files
 Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
 Extract the files
 Build an index for a directory:
    java org.apache.lucene.demo.IndexFiles dir_path
 Try a search at the command line:
    java org.apache.lucene.demo.SearchFiles
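A hedged sketch of those two demo runs with the classpath spelled out; the jar names (lucene-core-2.2.0.jar, lucene-demos-2.2.0.jar) are assumptions that may vary slightly by release:

  # Unpack the release
  tar xzf lucene-2.2.0.tar.gz
  cd lucene-2.2.0

  # Index every file under dir_path; the index is written to ./index
  java -cp lucene-core-2.2.0.jar:lucene-demos-2.2.0.jar \
      org.apache.lucene.demo.IndexFiles dir_path

  # Interactive search over the index just built
  java -cp lucene-core-2.2.0.jar:lucene-demos-2.2.0.jar \
      org.apache.lucene.demo.SearchFiles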

3 Deploy Lucene
Copy luceneweb.war to your {tomcat-home}/webapps
Browse to http://localhost:8080/luceneweb
 Tomcat will deploy the web app
 Edit webapps/luceneweb/configuration.jsp
    Point "indexLocation" to your indexes
Search at http://localhost:8080/luceneweb
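The same steps as a short shell sketch, assuming Tomcat is installed at $CATALINA_HOME (a placeholder for your {tomcat-home}):

  # Copy the webapp; Tomcat expands the .war on startup
  cp luceneweb.war $CATALINA_HOME/webapps/
  $CATALINA_HOME/bin/catalina.sh start

  # Then edit webapps/luceneweb/configuration.jsp so that
  # indexLocation points at the index directory built earlier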

4 Nutch
A complete search engine
 http://lucene.apache.org/nutch/release/
Modes
 Intranet/local search
 Internet search
Usage
 Crawl
 Index
 Search

5 Intranet Search
Configuration
 Input URLs: create a directory and a seed file
    $ mkdir urls
    $ echo http://www.cs.ucsb.edu > urls/ucsb
 Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu
 Edit conf/nutch-site.xml (both edits are sketched below)
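A minimal sketch of those two edits, assuming the stock Nutch 0.9 config layout; the agent name "MyTestCrawler" is a placeholder you should replace with your own crawler's name:

  # conf/crawl-urlfilter.txt -- change only the MY.DOMAIN.NAME line, e.g.:
  +^http://([a-z0-9]*\.)*cs.ucsb.edu/

  # conf/nutch-site.xml -- Nutch refuses to fetch without an agent name:
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>MyTestCrawler</value>
    </property>
  </configuration>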

6 Intranet: Running the Crawl
Crawl options include:
 -dir dir names the directory to put the crawl in
 -threads threads determines the number of threads that fetch in parallel
 -depth depth indicates the link depth from the root page that should be crawled
 -topN N determines the maximum number of pages retrieved at each level, up to the depth
E.g.
 $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

7 Intranet Search
Deploy the Nutch war file
 rm -rf TOMCAT_DIR/webapps/ROOT*
 cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
The webapp finds indexes in ./crawl, relative to where you start Tomcat
 TOMCAT_DIR/bin/catalina.sh start
Search at http://localhost:8080/
CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080

8 Internet Crawling
Concepts
 crawldb: all URL info
 linkdb: list of known incoming links for each URL
 segments: each is a set of URLs that are fetched as a unit
 indexes: Lucene-format indexes
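For orientation, the resulting on-disk layout looks roughly like this; the crawl directory name and the timestamped segment names below are illustrative, not real output:

  $ ls crawl
  crawldb/  indexes/  linkdb/  segments/
  $ ls crawl/segments
  20070601121000/  20070602093000/    # one directory per fetch round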

9 Internet Crawling
Process (one full cycle is sketched below)
 1. Get seed URLs
 2. Fetch
 3. Update crawl DB
 4. Compute top URLs, go to step 2
 5. Create index
 6. Deploy
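The cycle sketched as a shell script; the kids/* directory names, the three rounds, the -topN value, and the final dedup step are assumptions pieced together from the following slides and the standard Nutch 0.9 workflow:

  #!/bin/sh
  # Step 1: seed the crawl db (dmoz/urls is built as on the next slide)
  bin/nutch inject kids/crawldb dmoz

  # Steps 2-4: generate a fetchlist, fetch it, fold the results back in
  for round in 1 2 3; do
    bin/nutch generate kids/crawldb kids/segments -topN 50000
    s1=`ls -d kids/segments/2* | tail -1`      # newest segment
    bin/nutch fetch $s1
    bin/nutch updatedb kids/crawldb $s1
  done

  # Step 5: build the link db and the Lucene indexes, then drop duplicates
  bin/nutch invertlinks kids/linkdb kids/segments/*
  bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
  bin/nutch dedup kids/indexes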

10 Seed URLs
URLs from the DMOZ Open Directory
 wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
 gunzip content.rdf.u8.gz
 mkdir dmoz
 bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
Kids-search URLs from ask.com
Inject the URLs
 bin/nutch inject kids/crawldb 67k-url/
Edit conf/nutch-site.xml (as on slide 5)

11 Fetch
Generate a fetchlist from the database
 $ bin/nutch generate kids/crawldb kids/segments
Save the name of the fetchlist in variable s1
 s1=`ls -d kids/segments/2* | tail -1`
Run the fetcher on this segment
 bin/nutch fetch $s1

12 Update Crawl DB and Re-fetch
Update the crawl db with the results of the fetch
 bin/nutch updatedb kids/crawldb $s1
Generate a fetchlist of the top-scoring 50K pages
 bin/nutch generate kids/crawldb kids/segments -topN 50000
Re-fetch
 s1=`ls -d kids/segments/2* | tail -1`
 bin/nutch fetch $s1

13 Index, Deploy, and Search
Invert the links (build the link database)
 bin/nutch invertlinks kids/linkdb kids/segments/*
Index the segments
 bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
Deploy & search
 Same as in intranet search
 Demo of 1M pages (570K + 500K)

14 Issues
The default re-crawl cycle is 30 days for all URLs
Duplicates are pages that share the same URL or the same MD5 hash of page content
The JavaScript parser uses regular expressions to extract URL literals from code

