Lucene & Nutch

Lucene
- Project name; started as a text indexing engine.

Nutch
- A complete web search engine: crawling, indexing, and searching.
- Indexes 100M+ pages, crawls more than 10M pages/day.
- Provides a distributed architecture.
- Written in Java; ports to other languages are work in progress.
Lucene

- Open source search project: http://lucene.apache.org
- Index & search local files:
  - Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
  - Extract the files.
  - Build an index for a directory: java org.apache.lucene.demo.IndexFiles dir_path
  - Try a search at the command line: java org.apache.lucene.demo.SearchFiles (full command sketch below)
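A minimal end-to-end sketch of the demo, assuming the 2.2.0 tarball was extracted to ./lucene-2.2.0 and that the jar names match that release (lucene-core-2.2.0.jar, lucene-demos-2.2.0.jar); ~/my_docs is a placeholder directory of text/HTML files:

  $ cd lucene-2.2.0
  # Index the directory; the demo writes the index to ./index
  $ java -cp lucene-core-2.2.0.jar:lucene-demos-2.2.0.jar org.apache.lucene.demo.IndexFiles ~/my_docs
  # Interactive search over ./index; type a query at the prompt
  $ java -cp lucene-core-2.2.0.jar:lucene-demos-2.2.0.jar org.apache.lucene.demo.SearchFiles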
Deploy Lucene

- Copy luceneweb.war to your {tomcat-home}/webapps.
- Browse to http://localhost:8080/luceneweb; Tomcat will deploy the web app.
- Edit webapps/luceneweb/configuration.jsp and point "indexLocation" to your indexes.
- Search at http://localhost:8080/luceneweb.
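The same steps as commands, assuming Tomcat lives under $TOMCAT_HOME and the index built above is at /path/to/index (both placeholders):

  $ cp luceneweb.war $TOMCAT_HOME/webapps/
  $ $TOMCAT_HOME/bin/catalina.sh start        # Tomcat expands the war to webapps/luceneweb/
  # Once deployed, point the webapp at your index and reload the page:
  $ vi $TOMCAT_HOME/webapps/luceneweb/configuration.jsp    # set indexLocation to /path/to/index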
Nutch

- A complete search engine: http://lucene.apache.org/nutch/release/
- Modes: intranet/local search, Internet search
- Usage: crawl, index, search
Intranet Search Configuration

- Input URLs: create a directory and a seed file
  $ mkdir urls
  $ echo http://www.cs.ucsb.edu > urls/ucsb
- Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu
- Edit conf/nutch-site.xml (see the sketch below)
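A sketch of both edits, based on the stock Nutch 0.9 conf files (verify the exact filter line and property names against your release); the agent name is a hypothetical value:

  # conf/crawl-urlfilter.txt: accept only pages under cs.ucsb.edu
  +^http://([a-z0-9]*\.)*cs.ucsb.edu/

  <!-- conf/nutch-site.xml, inside <configuration>: the fetcher refuses to run without an agent name -->
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>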
Intranet: Running the Crawl

Crawl options include:
- -dir dir: names the directory to put the crawl in.
- -threads threads: determines the number of threads that will fetch in parallel.
- -depth depth: indicates the link depth from the root page that should be crawled.
- -topN N: determines the maximum number of pages retrieved at each level, up to the depth.

Example:
  $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
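After the crawl finishes, the crawl database can be sanity-checked with the readdb tool; a quick check, assuming the -dir crawl layout from the example above:

  $ bin/nutch readdb crawl/crawldb -stats     # prints URL totals broken down by fetch status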
Intranet Search

Deploy the Nutch war file:
  $ rm -rf TOMCAT_DIR/webapps/ROOT*
  $ cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
The webapp finds indexes in ./crawl, relative to where you start Tomcat:
  $ TOMCAT_DIR/bin/catalina.sh start
Search at http://localhost:8080/
CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080
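Because the webapp resolves ./crawl against Tomcat's working directory, start Tomcat from the directory that contains the crawl output. A sketch, with ~/nutch-0.9 and TOMCAT_DIR as placeholder paths:

  $ cd ~/nutch-0.9                             # this directory contains ./crawl
  $ rm -rf TOMCAT_DIR/webapps/ROOT*
  $ cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
  $ TOMCAT_DIR/bin/catalina.sh start
  # then search at http://localhost:8080/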
Internet Crawling Concept

- crawldb: all URL info
- linkdb: list of known links to each URL
- segments: each is a set of URLs that are fetched as a unit
- indexes: Lucene-format indexes
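On disk these are ordinary directories. A sketch of the layout for a crawl named kids (directory names are the Nutch defaults; the timestamped segment name is only illustrative):

  kids/
    crawldb/            # status and metadata for every known URL
    linkdb/             # known inlinks for each URL
    segments/
      20070815102030/   # one generate/fetch round (example timestamp)
    indexes/            # Lucene-format indexes built from the segments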
Internet Crawling Process

1. Get seed URLs
2. Fetch
3. Update crawl DB
4. Compute top URLs, go to step 2
5. Create index
6. Deploy
Seed URLs

URLs from the DMOZ Open Directory:
  $ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
  $ gunzip content.rdf.u8.gz
  $ mkdir dmoz
  $ bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
Kids-search URLs from ask.com.
Inject the URLs:
  $ bin/nutch inject kids/crawldb 67k-url/
Edit conf/nutch-site.xml (see the http.agent.name sketch above).
Fetch

Generate a fetchlist from the database:
  $ bin/nutch generate kids/crawldb kids/segments
Save the name of the new segment (which holds the fetchlist) in variable s1:
  $ s1=`ls -d kids/segments/2* | tail -1`
Run the fetcher on this segment:
  $ bin/nutch fetch $s1
Update Crawl DB and Re-fetch

Update the crawl db with the results of the fetch:
  $ bin/nutch updatedb kids/crawldb $s1
Generate a fetchlist of the top-scoring 50K pages:
  $ bin/nutch generate kids/crawldb kids/segments -topN 50000
Re-fetch:
  $ s1=`ls -d kids/segments/2* | tail -1`
  $ bin/nutch fetch $s1
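Steps 2-4 of the process repeat until the crawl is broad and deep enough. A sketch of the generate/fetch/updatedb cycle as a loop over the kids/ layout above (three rounds is an arbitrary choice):

  for i in 1 2 3; do
    bin/nutch generate kids/crawldb kids/segments -topN 50000
    s1=`ls -d kids/segments/2* | tail -1`      # newest segment
    bin/nutch fetch $s1
    bin/nutch updatedb kids/crawldb $s1
  done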
Index, Deploy, and Search

Invert the links to build the link database:
  $ bin/nutch invertlinks kids/linkdb kids/segments/*
Index the segments:
  $ bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
Deploy & search: same as in intranet search.
Demo of 1M pages (570K + 500K).
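Before deploying, duplicates can be removed and the per-segment indexes merged into one; a sketch using the standard Nutch 0.9 index tools on the kids/ layout:

  $ bin/nutch dedup kids/indexes               # drop documents with a duplicate URL or content hash
  $ bin/nutch merge kids/index kids/indexes    # merge the per-segment indexes into kids/index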
Issues

- The default crawling (re-fetch) cycle is 30 days for all URLs (see the sketch below).
- Duplicates are pages with the same URL or the same MD5 hash of the page content.
- The JavaScript parser uses regular expressions to extract URL literals from code.
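The re-fetch interval is configurable in conf/nutch-site.xml. A sketch assuming the Nutch 0.9 property name (later releases renamed it and changed the units, so check nutch-default.xml in your release):

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>   <!-- days between re-fetches; 30 is the shipped default -->
  </property>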