Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2010.

1 Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2010

2 May-20-10 CS572-Summer2010 CAM-2 Outline
What is Nutch?
  – Motivation
  – Architecture
  – What currently exists?
  – How I got involved
Deploying Nutch on NASA’s Planetary Data System (PDS)
  – Free text “Google-like” search of the PDS catalog
  – Architecture/Implementation

3 What is Nutch?
The brainchild of Doug Cutting
  – Research/programmer guru who has worked at several high-profile research labs (Yahoo, Bell Labs)
Nutch builds upon Cutting’s lower-level text indexing library and API, called Lucene
Nutch provides crawling, protocol, parsing, and content management services on top of the indexing capability provided by Lucene

4 Motivation
Observation: Web search is a commodity
  – Why can’t it be provided freely?
Allows tweaking of typically “hidden” ranking algorithms
Allows developers to focus less on the infrastructure (well known since Brin & Page’s paper) and more on providing value-added capabilities

5 Motivation
Value-added capabilities
  – Improving fetching speed
  – Parsing and handling of the hundreds of different content types available on the internet
  – Handling different protocols for obtaining content
  – Better ranking algorithms (OPIC, PageRank)
More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
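To make the "better ranking algorithms" bullet concrete, here is a minimal sketch of PageRank by power iteration. It is a textbook formulation, not Nutch's scoring code; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page `toy_web` graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Illustrative only: names and defaults are assumptions, not Nutch's API.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page first receives its share of the "random jump".
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly over its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web graph: "d" has no inlinks, "c" has the most.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the page with the most incoming links ranks highest and the page nobody links to ranks lowest, which is the intuition a scoring plugin would build on.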

6 Nutch’s Architecture
Nutch Core facilities
  – Parsing
  – Indexing
  – Crawling
  – Content Management
  – Querying
  – Plugin Framework
Nutch’s extension points
  – Scoring, Parsing, Indexing, Querying, URLFiltering
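The idea of extension points served by a plugin framework can be sketched in a few lines. This is a conceptual illustration in Python, not Nutch's actual (Java) plugin API: `PluginRegistry`, `RegexURLFilter`, and `apply_url_filters` are hypothetical names, and only the extension point name "URLFiltering" comes from the slide.

```python
import re
from collections import defaultdict


class PluginRegistry:
    """Hypothetical registry mapping extension point names to extensions."""

    def __init__(self):
        self._extensions = defaultdict(list)

    def register(self, extension_point, extension):
        # A plugin contributes an extension to a named extension point.
        self._extensions[extension_point].append(extension)

    def extensions(self, extension_point):
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self._extensions.get(extension_point, []))


class RegexURLFilter:
    """Hypothetical URL filter: returns the URL to keep it, None to drop it."""

    def __init__(self, deny_pattern):
        self._deny = re.compile(deny_pattern)

    def filter(self, url):
        return None if self._deny.search(url) else url


def apply_url_filters(registry, url):
    """Run a URL through every registered filter; stop at the first rejection."""
    for url_filter in registry.extensions("URLFiltering"):
        url = url_filter.filter(url)
        if url is None:
            return None
    return url


registry = PluginRegistry()
registry.register("URLFiltering", RegexURLFilter(r"\.(gif|jpg|png)$"))
```

With this setup, `apply_url_filters(registry, "http://example.org/index.html")` returns the URL unchanged, while a URL ending in `.jpg` is dropped. The core code only knows the extension point's name, so new behaviour is added by registering another extension rather than by editing the core.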

7 Nutch’s Architecture
Maps to the search engine architecture proposed by Brin & Page

8 What Currently Exists?
Version 0.6.x
  – First easily deployable version
Version 0.7.x
  – Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system
Version 0.8.x
  – Completely new underlying architecture based on Hadoop
  – Parse plugins framework, multi-valued metadata container
  – Parser Factory enhancement
Version 0.9.x
  – Major bug fixes
  – Hadoop and Lucene library upgrades
Version 1.0
  – Flexible filter framework
  – Flexible scoring
  – Initial integration with Tika
  – Full search engine functionality and capabilities, in production at large scale (Internet Archive)
Version 1.1
  – For the full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt

9 What Doesn’t?
Plenty!
Bug fixes (> 200 issues in JIRA right now with no resolution)
Nutch 2.0 architecture
  – http://search-lucene.com/m/gbrBF1RMWk9
  – Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

10 How I got involved
In this very class!
  – Okay, well, it used to be called CS599, but you get the picture
Started out by contributing an RSS parsing plugin
  – My final project in 599
Moved on from there to
  – NUTCH-88, redesign of the parsing framework
  – NUTCH-139, Metadata container support
  – NUTCH-210, Web Context application file
  – And various other bug fixes and contributions here and there
  – Mailing list support
  – Wiki support
Became committer in October 2006
Helped spin Nutch into an Apache TLP, March 2010; Nutch PMC member

11 Real world application of Nutch
I work at NASA’s Jet Propulsion Laboratory
NASA’s Planetary Data System
  – NASA’s archive for all planetary science data collected by missions over the past 30 years
  – Collected 20 TB over the past 30 years
      – Increasing to over 200 TB in the next 3 years!
  – Built up a catalog of all data collected
Where does Nutch fit in?

12 Where does Nutch fit into the PDS?
PDS Management Council decided they want “Google-like” search of the PDS catalog
Our plan: use Nutch to implement this capability for PDS

13 PDS Google-like Search Architecture
[Architecture diagram. Labels: Search Engine Architecture (e.g. Nutch, Google), PDS Catalog, PDS-D, Existing PDS Query, Indexer, Index, Lucene, Crawler, PDS Extract, Parser, PDS Parser, pds.war, Tomcat Web Server, Catalog Metadata]

14 Approach
Export PDS catalog datasets in RDF format (flat files)
Use Nutch to crawl the RDF files
  – protocol-file plugin in Nutch
Wrote our own parse-pds plugin
  – Parse the RDF files, and then extract the metadata
Wrote our own index-pds plugin
  – Index the fields that we want from the parsed metadata
Wrote our own query-pds plugin
  – Search the index on the fields that we want
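The parse, index, and query steps above can be sketched end to end. This is a conceptual Python illustration, not the parse-pds, index-pds, or query-pds plugin code (those are Java plugins whose source is not shown here). The `pds` namespace URI, the `title` and `target` fields, the sample records, and every function name are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDS_NS = "http://example.org/pds#"  # hypothetical namespace, not the real PDS one

# Made-up sample of what an exported catalog file might look like.
SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:pds="{PDS_NS}">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <pds:title>Mars surface images</pds:title>
    <pds:target>Mars</pds:target>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <pds:title>Saturn ring occultation profiles</pds:title>
    <pds:target>Saturn</pds:target>
  </rdf:Description>
</rdf:RDF>"""


def parse_rdf(text):
    """Parse step: extract one metadata dict per rdf:Description element."""
    records = []
    root = ET.fromstring(text)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        meta = {"id": desc.get(f"{{{RDF_NS}}}about")}
        for child in desc:
            field = child.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            meta[field] = (child.text or "").strip()
        records.append(meta)
    return records


def build_index(records, fields):
    """Index step: map (field, lowercased token) to the set of matching ids."""
    index = defaultdict(set)
    for rec in records:
        for field in fields:
            for token in rec.get(field, "").lower().split():
                index[(field, token)].add(rec["id"])
    return index


def search(index, field, term):
    """Query step: look up a single term within a single indexed field."""
    return sorted(index.get((field, term.lower()), set()))


records = parse_rdf(SAMPLE_RDF)
index = build_index(records, fields=["title", "target"])
```

Here `search(index, "target", "mars")` returns the id of the first sample dataset. Choosing which fields go into `build_index` corresponds to the "index the fields that we want" step, and restricting `search` to a field corresponds to the field-specific querying described on the slide.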

15 Search Interface

16 Results

17 Lessons Learned
Nutch currently isn’t exactly simple to deploy or configure
  – There is much discussion on the mailing lists that refers to “magic configuration” properties that aren’t intuitive
Nutch documentation is currently…lacking
If you know how to use Nutch, then it is extremely easy to use, and a time-saver
Active participation in the mailing lists and wiki is necessary to use Nutch

18 Good News
Nutch is here to stay
  – Only open source implementation for commodity web search
  – If you want to start your own Google++, Nutch is a great place to start
Participation is welcome
  – Look what happened to me (student -> committer)
  – Plenty of areas to improve (including documentation)

19 Your Class Project
It’s probably a good idea to at least take a look at Nutch, whether you use it or not
You can see how a real implementation of the theory described in class operates
  – Implemented in pure Java (1.5)
Add/extend capabilities within Nutch
  – Help finish plugging Nutch into HBase
  – Configure Nutch using Spring
  – Fully integrate Nutch and Solr
  – Fix *important* bugs
  – Add more scoring algorithm implementations

20 Wrapup
Thanks for your attention!
Nutch home page:
  – http://nutch.apache.org
Mailing lists
  – dev@nutch.apache.org (developer’s list)
  – user@nutch.apache.org (user’s list)

