Presentation is loading. Please wait.

Presentation is loading. Please wait.

Nutch Search Engine Tool. Nutch overview A full-fledged web search engine Functionalities of Nutch  Internet and Intranet crawling  Parsing different.

Similar presentations


Presentation on theme: "Nutch Search Engine Tool. Nutch overview A full-fledged web search engine Functionalities of Nutch  Internet and Intranet crawling  Parsing different."— Presentation transcript:

1 Nutch Search Engine Tool

2 Nutch overview A full-fledged web search engine Functionalities of Nutch  Internet and Intranet crawling  Parsing different document formats (PDF, HTML, XML, JS, DOC,PPT etc.)‏  Web interface for querying the index  Management of Recrawls

3 Nutch Architecture 4 main components  Crawler  Web Database (WebDB, LinkDB, segments)‏  Indexer  Searcher Crawler and Searcher are highly decoupled enabling independent scaling Highly modular, Plugin based architechture

4 Nutch Architecture Doug Cutting, "Nutch: Open Source Web Search", 22 May 2004, WWW2004, New York

5 Steps in a Crawl+Index cycle  Create a new WebDB (admin db -create).  Inject root URLs into the WebDB (inject).  Generate a fetchlist from the WebDB in a new segment (generate).  Fetch content from URLs in the fetchlist (fetch).  Update the WebDB with links from fetched pages (updatedb).  Repeat steps 3-5 until the required depth is reached.  Update segments with scores and links from the WebDB (updatesegs).  Index the fetched pages (index).  Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).  Merge the indexes into a single index for searching (merge).

6 Crawling (cont.) Can effectively crawl upto ~100M pages Crawl Statistics on KReSIT site (it.iitb)  Took 153 mins for a deep crawl (depth = 10)  Crawled 4171 documents  Size of crawl on disk 168MB  Size of index ~25MB

7 Web Database (WebDB)‏ Persistent data structure for mirroring the structure and properties of the web graph being crawled The WebDB stores two types of entities  Pages  Links Optimised for frequent updation

8 Crawl Structure of it.iitb

9 Page DB Page Database  used for fetch scheduling  Contains: pages indexed and sorted by MD5 and URL outlinks, fetch information, page score A set of APIs are provided to perform the various operations

10 Sample data of PageDB Page 1: Version: 4 URL: http://keaton/tinysite/A.html ID: fb8b9f0792e449cda72a9670b4ce833a Next fetch: Thu Nov 24 11:13:35 GMT 2005 Retries since fetch: 0 Retry interval: 30 days Num outlinks: 1 Score: 1.0 NextScore: 1.0 Page 2: Version: 4 URL: http://keaton/tinysite/B.html ID: 404db2bd139307b0e1b696d3a1a772b4 Next fetch: Thu Nov 24 11:13:37 GMT 2005 Retries since fetch: 0 Retry interval: 30 days Num outlinks: 3 Score: 1.0 NextScore: 1.0

11 Link DB Link Database  Contains: links sorted by MD5 links sorted by URL  Represents full link graph.  Stores anchor text associated with each link  Used for: Link analysis; Anchor text indexing.

12 Segments Collection of pages fetched and indexed by the crawler in a single run One segment dir for each crawl-fetch-update cycle at a particular depth Contains raw text and parsed data of the files crawled Used to return the cached copy of a page and in snippet generation in results page

13 Segments segread tool gives a useful summary of all segments. (Parsed, Started, Finished, Dir) It can also be used to dump the segment data in raw text format. The dump switch gives the following details  Fetcher Output:. Entries that go into the WebDB  Content: Raw content including http-headers and other meta-data. stored cached copy of a page  ParseData & ParseText: appropriate parser plugin by looking at the Raw content, is used to generate this data

14 Nutch API

15 Plugins Provide extensions to extension-points Each extension point defines an interface that must be implemented by extension Some core extension points  IndexingFilter: add meta-data to indexed fields  Parser: to parse a new type of document  NutchAnalyzer: language specific analyzers

16 References Nutch Docs: http://lucene.apache.org/nutch/ Nutch Wiki: http://wiki.apache.org/nutch/ Prasad Pingali, CLIA consortium, Nutch Workshop, 2007 Tom White, Introduction to Nutch, java.net website (http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch- 1.html)


Download ppt "Nutch Search Engine Tool. Nutch overview A full-fledged web search engine Functionalities of Nutch  Internet and Intranet crawling  Parsing different."

Similar presentations


Ads by Google