Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006.


1 Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006

2 Why use Nutch? Front-end to large collections of documents Demonstrate research without writing lots of extra code

3 Outline Nutch - information retrieval –Pros & Cons –Crawling the Local Filesystem –How Nutch Works –Indexing a Database –Query Filters: Searching with Nutch

4 Nutch Open source search engine Written in Java Built on top of Apache Lucene

5 Advantages of Nutch Scalable –Index local host or entire Internet Portable –Runs anywhere with Java Flexible –Plugin system + API Code pretty easy to read & work with Better than implementing it yourself!

6 Disadvantages of Nutch Documentation still somewhat lacking Not yet fully mature No GUI Odd Tomcat setup Several “gotchas”

7 Crawling the Local Filesystem Step 1: Create list of files to index
file_list:
    /data0/projects/clairlib/CLAIR/aleClairlib.pl
    /data0/projects/clairlib/CLAIR/buildALE.pl
    /data0/projects/clairlib/CLAIR/get_cosine_example.pl
    /data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
    /data0/projects/clairlib/CLAIR/makeCorpus.pl
    /data0/projects/clairlib/CLAIR/normalize_cosines.pl
    /data0/projects/clairlib/CLAIR/queryALE.pl
    /data0/projects/clairlib/CLAIR/testCluster.pl
    /data0/projects/clairlib/CLAIR/testCorpusDownload.pl
    /data0/projects/clairlib/CLAIR/testDocument.pl
    /data0/projects/clairlib/CLAIR/testDocumentPair.pl
    /data0/projects/clairlib/CLAIR/testIP.pl
    /data0/projects/clairlib/CLAIR/testUtil.pl
    /data0/projects/clairlib/CLAIR/testWebSearch.pl
    /data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
    /data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
    /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
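A list like this does not have to be typed by hand. A minimal sketch in Python, assuming the files to index are the Perl scripts under a single directory tree. The function names `build_file_list` and `write_file_list` and the `.pl` suffix filter are illustrative choices, not part of Nutch:

```python
import os

def build_file_list(root, suffix=".pl"):
    """Return sorted absolute paths of files under root that end with suffix."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)

def write_file_list(root, out_path, suffix=".pl"):
    """Write one path per line to out_path and return how many were written."""
    paths = build_file_list(root, suffix)
    with open(out_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)
```

For example, `write_file_list("/data0/projects/clairlib/CLAIR", "file_list")` would produce a file in the same one-path-per-line layout as the slide.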

8 Crawling the Local Filesystem Step 2: Edit Configuration –crawl-urlfilter.txt Very restrictive by default Must allow file: URLs

9 crawl-urlfilter.txt default
    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.
    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    # skip everything else
    -.

10 crawl-urlfilter.txt (modified to allow file: URLs)
    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.
    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
    # allow everything else
    +.
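The first-match rule described in the file's comments can be modelled in a few lines of Python. This is a sketch of the semantics only, not Nutch's own filter code; it assumes unanchored (search-style) matching, which patterns such as `-[?*!@=]` imply, and the names `load_rules` and `accepts` are made up for illustration:

```python
import re

def load_rules(text):
    """Parse filter text into (accept, regex) pairs, skipping comments and blanks."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching pattern decides; a URL matching nothing is ignored."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

# Shortened versions of the two rule sets, for illustration only.
DEFAULT_RULES = load_rules(r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip everything else
-.
""")

LOCAL_RULES = load_rules(r"""
# skip image and other suffixes we can't yet parse
-\.(gif|jpg|png|zip)$
# allow everything else
+.
""")
```

Under the default rules a `file:` URL is rejected by the first pattern; under the local rules it falls through to `+.` and is accepted unless its suffix is on the skip list.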

11 Crawling the Local Filesystem Step 3: Edit Configuration –nutch-site.xml (overrides nutch-default.xml) Enable protocol-file plugin and parse plugins
    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
      <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
    </property>

12 Crawling the Local Filesystem Step 4: Run the crawl –bin/nutch crawl myurls Step 5: Start Tomcat –GOTCHA: must start in the crawl directory! –Or edit WEB-INF/classes/nutch-site.xml
    <property>
      <name>searcher.dir</name>
      <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
    </property>

13 Modifying the Results Page Just customize search.jsp! For example, display an external ‘citations’ link instead of ‘anchors’ [JSP snippet lost in extraction]

14 How Nutch Works Protocol plugin: URL -> Protocol.getProtocolOutput -> Content (byte[] content, String contentType, URL url, Properties metadata)

15 How Nutch Works Parsing plugins: URL -> Protocol.getProtocolOutput -> Content (byte[] content, String contentType, URL url, Properties metadata) -> Parser.getParse -> Parse (String text, ParseData data), where ParseData holds Properties metadata, Outlink[] outlinks, String title, ParseStatus status
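The data flow on these two slides can be mimicked with a toy model in Python. The class fields follow the names on the slides; everything else (the in-memory `store`, the plain-text-only parser, outlink detection by prefix) is a simplifying assumption and not how the Java plugins are implemented:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Content:
    """Fields as named on the slide: raw bytes plus type, URL and metadata."""
    content: bytes
    content_type: str
    url: str
    metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class ParseData:
    title: str
    outlinks: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)
    status: str = "success"

@dataclass
class Parse:
    text: str
    data: ParseData

def get_protocol_output(url, store):
    """Stand-in for a protocol plugin: look the URL up in an in-memory dict."""
    return Content(content=store[url], content_type="text/plain", url=url)

def get_parse(content):
    """Stand-in for a parse plugin: decode bytes, use the first line as title."""
    text = content.content.decode("utf-8")
    lines = text.splitlines()
    title = lines[0] if lines else ""
    # Treat any whitespace-separated token that looks like a URL as an outlink.
    outlinks = [w for w in text.split() if w.startswith(("http://", "file:"))]
    data = ParseData(title=title, outlinks=outlinks,
                     metadata=dict(content.metadata))
    return Parse(text=text, data=data)
```

The point of the split is that the protocol step only knows how to turn a URL into bytes, and the parse step only knows how to turn bytes into text, title and outlinks, so either can be replaced independently.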

16 Indexing a Database Need to write a new plugin Luckily interface is pretty simple Much less tightly coupled than full-text search inside database

17 Indexing a Database Approach –Get the text out –Generate a 1:1 mapping from URLs to documents in the database

18 Indexing a Database Protocol plugin –Replaces default ‘http’ plugin –Converts http request to database request
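The idea of turning a URL request into a database request can be sketched in Python with an in-memory SQLite database. The URL scheme (`http://dbhost/docs/<id>`), the `documents` table and all function names here are hypothetical, chosen only to show the 1:1 mapping between URLs and rows; the real plugin would be written in Java against the Nutch plugin API:

```python
import sqlite3
from urllib.parse import urlparse

def url_to_doc_id(url):
    """Map a URL like http://dbhost/docs/42 to document id 42 (hypothetical scheme)."""
    path = urlparse(url).path
    prefix = "/docs/"
    if not path.startswith(prefix):
        raise ValueError("URL not handled by this sketch: %s" % url)
    return int(path[len(prefix):])

def fetch_document(conn, url):
    """Answer the 'HTTP request' from the database; return (title, body) or None."""
    cursor = conn.execute(
        "SELECT title, body FROM documents WHERE id = ?", (url_to_doc_id(url),)
    )
    return cursor.fetchone()

def make_demo_db():
    """Build an in-memory database with a hypothetical 'documents' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [(1, "First", "text of the first document"),
         (42, "Answer", "text of document forty-two")],
    )
    return conn
```

Because each URL maps to exactly one row, the rest of the pipeline can treat database records like ordinary fetched pages.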

19 Indexing a Database Parse plugin –Replaces text or HTML parser –Protocol plugin gets the text and metadata, so don’t need to do much here

20 Indexing a Database Configuration - plugin.xml

21 Indexing a Database Configuration - nutch-site.xml –Add correct plugin Make sure Nutch can find plugin –$NUTCH_HOME/plugins

22 Improving the Plugin Configuration via XML Determine which database to use for what URLs Automatically ‘crawl’ database Pass unknown URLs to default plugin

23 Searching with Nutch Parse query - NutchAnalysis Filter query - QueryFilters Pass to Lucene - IndexSearcher –Optimization/caching - LuceneQueryOptimizer –Translate hits from Lucene back to Nutch

24 Query Filter Nutch Query -> QueryFilter.filter() -> Lucene Query

25 Date Query Filter Date query filter restricts by date

26 Basic Query Filter Boosts weight of particular fields Manipulates phrases
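As a rough illustration of what "boosting fields" and "manipulating phrases" mean, the following Python sketch expands each user term into a required clause that searches several fields with different weights, written in Lucene's query-string syntax (`field:term^boost`). The field names and boost values are assumptions for the example, not Nutch's defaults, and the real filter builds Lucene query objects in Java rather than strings:

```python
import shlex

# Illustrative field boosts; these names and numbers are assumptions.
FIELD_BOOSTS = [("url", 4.0), ("anchor", 2.0), ("content", 1.0)]

def parse_query(text):
    """Split a query into terms, keeping double-quoted phrases together."""
    return shlex.split(text)

def basic_filter(terms, boosts=FIELD_BOOSTS):
    """Expand each term into a required clause that searches every boosted field."""
    clauses = []
    for term in terms:
        # Multi-word terms are phrases and must stay quoted in the output.
        quoted = '"%s"' % term if " " in term else term
        alternatives = " ".join(
            "%s:%s^%.1f" % (name, quoted, boost) for name, boost in boosts
        )
        clauses.append("+(%s)" % alternatives)
    return " ".join(clauses)
```

With these settings, a match in the URL counts four times as much as a match in the body text, which is the kind of weighting a basic query filter applies before the query reaches Lucene.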

27 Additional Query Filters Could implement relevance feedback in this framework Manual relevance feedback –could add morelike:somedocument operator Automatic relevance feedback - extend BasicQueryFilter

28 Additional Capabilities Distributed searching –Nutch Distributed File System MapReduce a la Google More

29 Nutch Distributed Filesystem Write-once Stream-oriented (append-only, sequential read) Distributed, transparent, replicated, fault-tolerant Distribute index and content

30 MapReduce Distributed processing technique Idea from functional programming

31 Map Apply same operation to several data items Example (Python):
    def getDocument(docid):
        """ fetch document with given docid from database """
        # do some stuff...
        return document
    docids = [1, 2, 3, 4, 5]
    documents = map(getDocument, docids)
Mapping for individual items is independent - distributable!

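32 Reduce Combine results of map operation Simple example - sum of squares
    from functools import reduce  # reduce is no longer a builtin in Python 3
    measurements = [4, 2, 6, 9]
    def add(x, y):
        return x + y
    def square(x):
        return x ** 2  # ** is exponentiation; ^ would be bitwise XOR
    result = reduce(add, map(square, measurements))  # 16 + 4 + 36 + 81 = 137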

33 MapReduce in Nutch Can be used to distribute crawling, indexing, etc.
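To make the connection to indexing concrete, here is a single-process Python sketch that builds a small inverted index in map/reduce style. It only illustrates the shape of the computation; the function names are made up, and a distributed implementation would run the map calls on different machines and group the pairs by term before reducing:

```python
from collections import defaultdict
from functools import reduce

def map_document(item):
    """Map step: emit (term, doc_id) pairs for one document, independently of the rest."""
    doc_id, text = item
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]

def merge_postings(index, pair):
    """Reduce step: fold one (term, doc_id) pair into the growing index."""
    term, doc_id = pair
    index[term].append(doc_id)
    return index

def build_index(documents):
    """Map every document, then reduce all emitted pairs into posting lists."""
    mapped = map(map_document, sorted(documents.items()))
    pairs = (pair for doc_pairs in mapped for pair in doc_pairs)
    index = reduce(merge_postings, pairs, defaultdict(list))
    return dict(index)
```

Each call to `map_document` touches only its own document, which is what makes the map phase easy to spread across machines.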

34 Conclusions Nutch is –featureful –flexible –extensible –scalable Get started with nutch: http://lucene.apache.org/nutch Sample plugins and code samples: http://umich.edu/~aelkiss/nutch

