Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll.


1 Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll

2 Lucid Imagination, Inc. The How Many Game
How many of you:
o Have taken a class in Information Retrieval (IR)?
o Are doing work/research in IR?
o Have heard of or are using Lucene?
o Have heard of or are using Solr?
o Are doing work on core IR algorithms such as compression techniques or scoring?
o Are doing UI/Application work/research as they relate to search?

3 Topics
Brief Bio
Search 101 (skip?)
What is:
o Apache Lucene
o Apache Solr
What can they do?
o Features and functionality
o Intangibles
What’s new in Lucene and Solr?
o How can they help my research/work/____?

4 Brief Bio
Apache Lucene/Solr Committer
Apache Mahout co-founder
o Scalable Machine Learning
Co-founder of Lucid Imagination
o http://www.lucidimagination.com
Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy
Co-author of the upcoming “Taming Text” (Manning Publications)
o http://www.manning.com/ingersoll

5 Search 101
Search tools are designed for dealing with fuzzy data/questions
o Work well with structured and unstructured data
o Perform well when dealing with large volumes of data
o Many apps don’t need the limits that databases place on content
o Search fits well alongside a DB too
Given a user’s information need (query), find and, optionally, score content relevant to that need
o Many different ways to solve this problem, each with tradeoffs
What does “relevant” mean?

6 Search 101
Relevance
o Vector Space Model (VSM) for relevance
o Common across many search engines
o Apache Lucene is a highly optimized implementation of the VSM
Indexing
o Finds and maps terms and documents
o Conceptually similar to a book index
o At the heart of fast search/retrieve
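
The two ideas on this slide, a book-style index from terms to documents and VSM relevance, can be sketched in a few lines of Python. This is a toy illustration only, not Lucene's implementation: the corpus, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; ids and text are made up for illustration.
DOCS = {
    "d1": "lucene is a search library",
    "d2": "solr is a search server built on lucene",
    "d3": "mahout is a machine learning library",
}

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, n_docs):
    """Smoothed inverse document frequency (an assumed formula)."""
    return math.log(n_docs / (1 + len(index.get(term, {})))) + 1.0

def search(index, docs, query):
    """Rank matching documents by cosine similarity of TF-IDF vectors."""
    n_docs = len(docs)
    q_weights = {t: tf * idf(index, t, n_docs)
                 for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Squared length of every document vector, accumulated term by term.
    doc_norms = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, term, n_docs)
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += (tf * weight) ** 2

    # Dot product: only the postings of the query terms are visited.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_w * tf * idf(index, term, n_docs)

    ranked = [(doc_id, dot / (math.sqrt(doc_norms[doc_id]) * q_norm))
              for doc_id, dot in dots.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

index = build_index(DOCS)
for doc_id, score in search(index, DOCS, "lucene search"):
    print(doc_id, round(score, 3))
```

Only documents that share a term with the query are ever scored, which is why the index sits at the heart of fast retrieval.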

7 Apache Lucene in a Nutshell
http://lucene.apache.org/java
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
o Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet

8 Lucene Basics
Content is modeled via Documents and Fields
o Content can be text, integers, floats, dates, custom
o Analysis can be employed to alter content before indexing
Searches are supported through a wide range of Query options
o Keyword
o Terms
o Phrases
o Wildcards
o Many, many more
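
To make the Document/Field model and the query types concrete, here is a small Python sketch. It does not use or mirror the Lucene API; the documents, field names, and the regex-based `analyze` function are hypothetical stand-ins that only show the concepts.

```python
import re
from fnmatch import fnmatchcase

def analyze(text):
    """Toy analysis chain: split on non-alphanumerics, then lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

# Hypothetical documents: each is a set of named fields (text and integer).
DOCS = [
    {"id": 1, "title": "Introduction to Open Source Search", "pages": 20},
    {"id": 2, "title": "Open Relevance Project", "pages": 5},
    {"id": 3, "title": "Scalable Machine Learning", "pages": 12},
]

def term_query(docs, field, term):
    """Match documents whose analyzed field contains the exact term."""
    return [d["id"] for d in docs if term in analyze(d[field])]

def phrase_query(docs, field, phrase):
    """Match documents where the phrase's terms appear consecutively."""
    words = analyze(phrase)
    if not words:
        return []
    hits = []
    for d in docs:
        toks = analyze(d[field])
        windows = (toks[i:i + len(words)]
                   for i in range(len(toks) - len(words) + 1))
        if any(window == words for window in windows):
            hits.append(d["id"])
    return hits

def wildcard_query(docs, field, pattern):
    """Match documents where any term fits a shell-style pattern."""
    return [d["id"] for d in docs
            if any(fnmatchcase(tok, pattern) for tok in analyze(d[field]))]

def range_query(docs, field, low, high):
    """Match documents whose numeric field lies in [low, high]."""
    return [d["id"] for d in docs if low <= d[field] <= high]

print(term_query(DOCS, "title", "open"))
print(phrase_query(DOCS, "title", "Open Source"))
print(wildcard_query(DOCS, "title", "rel*"))
print(range_query(DOCS, "pages", 10, 20))
```

Note that the same analysis is applied to the phrase and to the field content, so "Open Source" matches regardless of case. A real engine would consult an index rather than scan every document.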

9 Apache Solr in a Nutshell
http://lucene.apache.org/solr
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
o Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are configuration tasks in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices
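
Because Solr is reached over HTTP, a search with faceting is just a URL. The sketch below builds such a URL with the standard library and makes no network call. The `q`, `rows`, `wt`, `facet`, and `facet.field` parameters are standard Solr request parameters. The helper name, the base URL, and the field names `cat` and `manu` are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr /select URL; facet_fields adds one facet.field per entry."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"

# Assumed local Solr instance and hypothetical field names.
url = build_select_url("http://localhost:8983/solr", "ipod",
                       facet_fields=["cat", "manu"])
print(url)
```

Any HTTP client in any language can fetch the resulting URL, which is why the slide lists so many client options.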

10 A small sampling of Lucene/Solr-Powered Sites
Buy.com

11 Features and Functionality

12 Quick Solr/Lucene Demo
Pre-reqs:
o Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
o svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
o cd solr-trunk/solr/
o ant example
o cd example
o java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
o cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
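
The demo posts XML files to Solr. Those files use Solr's XML update format: an `<add>` element wrapping `<doc>` elements, each made of named `<field>` elements. The Python sketch below generates such a message. The helper name and the sample document are hypothetical, and the fields must exist in whatever schema the target Solr instance uses.

```python
import xml.etree.ElementTree as ET

def to_solr_add_xml(docs):
    """Serialize dicts into a Solr <add> update message.

    List or tuple values become repeated <field> elements (multi-valued).
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")

# Hypothetical document; field names are assumptions, not a required schema.
sample = [{"id": "doc1", "name": "Search & Discovery", "cat": ["book", "ir"]}]
print(to_solr_add_xml(sample))
```

Saving the output to a file in `exampledocs` would let the `post.jar` step from the demo send it, provided the field names match the schema.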

13 Other Features
Data Import Handler
o Database, Mail, RSS, etc.
Rich document support via Apache Tika
o PDF, MS Office, Images, etc.
Replication for high query volume
Distributed search for large indexes
o Production systems with 1B+ documents
Configurable Analysis chain and other extension points
o Total control over tokenization, stemming, etc.

14 Intangibles
Open Source
Flexible, non-restrictive license
o Apache License v2 (non-viral)
o “Do what you want with the software, just don’t claim you wrote it”
Large community willing to help
o Great place to learn about real world IR systems
Many books and other documentation
o Lucene in Action by Hatcher, McCandless and Gospodnetic

15 What’s New?
https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
Codecs
o Pluggable Index Formats
o Provide different index compression techniques
Stats to enable alternate scoring approaches
o BM25, Language Modeling, etc.
o More work to be done here
Faster
o Java Strings are slow; convert to use byte arrays
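
As an illustration of why extra index statistics matter, here is a minimal BM25 scorer in Python. It needs document frequency, document length, and average document length, which are exactly the kind of stats an alternate scoring model depends on. This is one common textbook formulation with assumed defaults (`k1=1.2`, `b=0.75`), not the code in Lucene trunk, and the tokenized documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical pre-tokenized documents.
docs = [
    ["lucene", "search"],
    ["lucene", "lucene", "scoring", "models", "search", "index"],
    ["mahout", "learning"],
]
print(bm25_scores(["lucene"], docs))
```

In this toy data the short first document outscores the longer second one for the query "lucene" even though the second mentions the term twice, which shows the length normalization at work.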

16 Other New Items
Many new Analyzers (tokenizers, etc.)
o Richer Language support (Hindi, Indonesian, Arabic, …)
Richer Geospatial (Local) Search capabilities
o Score, filter, sort by distance
o http://wiki.apache.org/solr/SpatialSearch
Results Grouping
o Group Related Results
o http://wiki.apache.org/solr/FieldCollapsing
More Faceting Capabilities
o Pivot
o New underlying algorithms
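
The "filter, sort by distance" idea can be sketched independently of Solr. The Python below computes great-circle distance with the haversine formula, drops documents beyond a radius, and sorts the rest nearest first. The function names, the mean Earth radius of 6371 km, and the sample coordinates are assumptions for illustration. Solr's spatial support has its own syntax and implementation, described at the wiki link above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(docs, lat, lon, max_km):
    """Filter docs to those within max_km, then sort by distance (nearest first)."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d["id"]) for d in docs]
    return [(doc_id, dist) for dist, doc_id in sorted(hits) if dist <= max_km]

# Hypothetical documents with made-up coordinates.
PLACES = [
    {"id": "a", "lat": 0.0, "lon": 1.0},
    {"id": "b", "lat": 0.0, "lon": 0.1},
    {"id": "c", "lat": 10.0, "lon": 10.0},
]
for doc_id, dist in nearby(PLACES, 0.0, 0.0, max_km=200):
    print(doc_id, round(dist, 1))
```

Scoring by distance would follow the same pattern, for example by folding a function of the computed distance into the relevance score.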

17 How can Lucene/Solr help me?
Everyone
o Fast indexing/search times mean less time waiting for jobs to complete
o Completely Open (source, community)
o Free to use, modify, etc.
o Large community ready and willing to help
User Experience Researchers
o Rapid UI prototyping
o Total control of results and facets
o Easy to set up and use with little to no programming required
IR Researchers
o Flexible Indexing models (trunk)
o Flexible Relevance models via functions and other mechanisms
o Extendable
Job Seekers
o Google Summer of Code
o Other Internships (see me)
o Real programming skills that are highly valued in industry
o Publicly visible, demonstrable skills

18 Job Trends
http://www.indeed.com

19 Other Things that Can Help
Nutch
o Crawling
o http://nutch.apache.org
Mahout
o Machine learning (clustering, classification, others)
o http://mahout.apache.org
OpenNLP
o Part of Speech, Parsers, Named Entity Recognition
o http://incubator.apache.org/opennlp
Open Relevance Project
o Relevance Judgments
o http://lucene.apache.org/openrelevance

20 Resources
http://lucene.apache.org
http://www.lucidimagination.com
{java-user|solr-user}@lucene.apache.org
@gsingers
http://www.slideshare.net/gsingers
grant@lucidimagination.com

