Apache Solr Beyond The Box Chris Hostetter 2008-11-05


1 Apache Solr Beyond The Box Chris Hostetter 2008-11-05 http://people.apache.org/~hossman/apachecon2008us/ http://lucene.apache.org/solr/

2 Why Are We Here? Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions

3 What, How, Where, Who, When, Why?

4 What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
    Customizable Request Handlers And Response Writers
    Data Schema With Dynamic Fields And Unique Keys
    Analyzers Created At Runtime From Tokenizers And TokenFilters

5 What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
    Index Consistency
    Data Replication
    Cache Management

6 How It Started

7 When/Why To Write A Plugin “X can be done more efficiently closer to the data.” OR “To force X for all clients.”

8 Solr Internals In A Nutshell

9 50,000' View
[Diagram labels: HTTP, SolrDispatchFilter, Java, EmbeddedSolrServer, CoreContainer, SolrCore, SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter]

10 MVC-ish
● SolrRequestHandler... A Controller
    handleRequest(SolrQueryRequest, SolrQueryResponse)
● SolrQueryRequest... An Event (++)
    Input Parameters
    List of ContentStreams
    Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse... Model
    Tree of "Simple" Objects and DocLists
● ResponseWriter... View
    write(Writer, SolrQueryRequest, SolrQueryResponse)

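11 Hello World

    // Package names as of the Solr 1.x releases current when this talk was given.
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;

    public class HelloWorld extends RequestHandlerBase {
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) {
        // Read the request parameters; getInt returns null when "age" is absent.
        String name = req.getParams().get("name");
        Integer age = req.getParams().getInt("age");
        // Anything added to the response is rendered by the selected response writer.
        rsp.add("greeting", "Hello " + name);
        rsp.add("yourage", age);
      }
      // Metadata shown in the admin interface.
      public String getVersion() { return "$Revision:$"; }
      public String getSource() { return "$Id:$"; }
      public String getSourceId() { return "$URL:$"; }
      public String getDescription() { return "Says Hello"; }
    }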

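12 Hello World Output

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <str name="greeting">Hello Hoss</str>
      <int name="yourage">32</int>
    </response>

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json

    {
      "responseHeader":{ "status":0, "QTime":1 },
      "greeting":"Hello Hoss",
      "yourage":32
    }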

13 Types Of Plugins
● SolrRequestHandler
    SearchComponent
    QParserPlugin
    ValueSourceParser
● SolrHighlighter
    SolrFragmenter
    SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
    TokenizerFactory
    TokenFilterFactory
● FieldType
● SolrCache
    CacheRegenerator
● SolrEventListener
● UpdateHandler
Legend: Italics = Only One Per SolrCore; Color = Likelihood Of Needing To Write Your Own

14 Real World Examples

15 Tibetan And Himalayan Digital Library Tools

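16 Tsheg Analysis Factories

    import java.io.Reader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.solr.analysis.BaseTokenFilterFactory;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    // Each public class lives in its own source file.
    public class TshegBarTokenizerFactory extends BaseTokenizerFactory {
      // Called by Solr to start tokenizing a field value.
      public TokenStream create(Reader input) {
        return new TshegBarTokenizer(input);
      }
    }

    public class EdgeTshegTrimmerFactory extends BaseTokenFilterFactory {
      // Wraps the upstream token stream with the trimming filter.
      public TokenStream create(TokenStream input) {
        return new EdgeTshegTrimmer(input);
      }
    }

17 DFLL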
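The slide shows only the factories, not what TshegBarTokenizer or EdgeTshegTrimmer do internally. As a rough illustration of the kind of text handling the names suggest, here is a minimal plain-Java sketch that splits text on the Tibetan tsheg mark (U+0F0B) and trims tshegs from token edges. The class and method names are hypothetical, and this is an assumption about the general idea, not the THDL implementation and not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for tsheg-aware analysis; not the THDL classes. */
final class TshegSplitSketch {
    /** U+0F0B, the Tibetan intersyllabic tsheg mark. */
    static final char TSHEG = '\u0F0B';

    /** Strips leading and trailing tshegs from one token. */
    static String trimEdgeTshegs(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == TSHEG) start++;
        while (end > start && token.charAt(end - 1) == TSHEG) end--;
        return token.substring(start, end);
    }

    /** Splits text on tshegs and whitespace, dropping empty pieces. */
    static List<String> tshegBars(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == TSHEG || Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }
}
```

In a real plugin this logic would live inside a Lucene Tokenizer and TokenFilter so that the factories on the slide can hand them to Solr's analysis chain.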

18 DFLL: Faceted Browsing

19 DFLL Category Metadata
● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
    Facet ID and Label: 500016 == “OS Provided”
    Facet Display Info: Count vs. Alphabetical, etc...
    Ordered List of Constraints
    ● Constraint ID and Label: 111536 == “Apple OS X”
    ● Constraint Query: os:("OSX10.1" "OSX10.2"...)

20 DfllHandler Pseudo-Code

    Document catMetaDoc = searcher.getFirstMatch(catDocId)
    Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
    m = m.clone()
    DocListAndSet results = searcher.getDocListAndSet(m.catQuery,...)
    response.add("products", results.docList)
    foreach (Facet f : m) {
      foreach (Constraint c : f) {
        c.setCount(searcher.numDocs(c.query, results.docSet))
      }
    }
    response.add("metadata", m.asSimpleObjects())
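The counting step in the pseudo-code, numDocs(c.query, results.docSet), amounts to measuring how many documents in the base result set also match each constraint. The following is a minimal sketch of that idea using plain java.util sets of document ids in place of Solr's DocSet. FacetCountSketch and its methods are hypothetical names, and the sketch assumes the matches for each constraint are already available as a set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of constraint counting; plain sets stand in for DocSets. */
final class FacetCountSketch {
    /** Counts the ids present in both sets (the size of the intersection). */
    static int intersectionSize(Set<Integer> a, Set<Integer> b) {
        // Iterate the smaller set and probe the larger one.
        Set<Integer> small = a.size() <= b.size() ? a : b;
        Set<Integer> large = small == a ? b : a;
        int n = 0;
        for (Integer id : small) {
            if (large.contains(id)) n++;
        }
        return n;
    }

    /** For each constraint, counts how many base results also match it. */
    static Map<String, Integer> countConstraints(
            Set<Integer> baseResults, Map<String, Set<Integer>> constraintMatches) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> e : constraintMatches.entrySet()) {
            counts.put(e.getKey(), intersectionSize(baseResults, e.getValue()));
        }
        return counts;
    }
}
```

Because each count is an intersection against the same base set, the per-constraint sets are natural candidates for caching, which is the kind of work Solr's caches are there to absorb.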

21 Conceptual Picture
[Diagram] getDocListAndSet(Query, Query[], Sort, offset, n)
Inputs shown: tablet_form:[* TO *], os:("OSX10.1" "OSX10.2"...), memory:[1GB TO *], price asc
Query Response: DocList (section of ordered results) and DocSet (unordered set of all results)
numDocs() constraint queries shown: proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo
Counts shown: 594, 382, 247, 689, 104, 92, 75

22 DFLL Response
[XML response; markup lost in extraction. Surviving values: 0, 1, 88, OS provided, 111536, Apple Mac OS X, 50, 1]

23 DfllCacheRegenerator
SolrCore “Auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).

    public interface CacheRegenerator {
      public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                    SolrCache newCache, SolrCache oldCache,
                                    Object oldKey, Object oldVal) throws IOException;
    }

24 DataImportHandler
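To make the auto-warming idea concrete, here is a minimal sketch in plain Java: keys from the old cache are re-computed against the new index and stored in a fresh cache. AutowarmSketch, the Map-based caches, and the computeOnNewIndex function are all hypothetical stand-ins. Real Solr caches choose which entries to warm and in what order according to their own configuration, which this sketch does not model.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical model of cache auto-warming; Maps stand in for SolrCaches. */
final class AutowarmSketch {
    /**
     * Re-computes up to maxItems of the old cache's keys against the new
     * index and returns the freshly populated cache. Old values are ignored,
     * since they were computed against the previous index version.
     */
    static <K, V> Map<K, V> autowarm(
            Map<K, V> oldCache, Function<K, V> computeOnNewIndex, int maxItems) {
        Map<K, V> newCache = new LinkedHashMap<>();
        for (K key : oldCache.keySet()) {
            if (newCache.size() >= maxItems) break;
            newCache.put(key, computeOnNewIndex.apply(key));
        }
        return newCache;
    }
}
```

A regenerator such as the one on this slide plays the role of computeOnNewIndex: given an old key (and optionally the old value), it decides how to rebuild the entry using the new searcher.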

25 DataImportHandler
Builds and incrementally updates indexes based on configured SQL or XPath queries.

    <entity name="item" pk="ID"
            query="select * from ITEM"
            deltaQuery="select ID... where ITEMDATE > '${dataimporter.last_index_time}'">
      ...
      <entity name="f" pk="ITEMID"
              query="select DESC from FEATURE where ITEMID='${item.ID}'"
              deltaQuery="select ITEMID from FEATURE where UPDATEDATE > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
        ...
      </entity>
    </entity>

26 DataImportHandler Plugins
● DataSource
    FileDataSource
    HttpDataSource
    JdbcDataSource
● EntityProcessor
    FileListEntityProcessor
    SqlEntityProcessor
    ● CachedSqlEntityProcessor
    XPathEntityProcessor
● Transformer
    DateFormatTransformer
    NumberFormatTransformer
    RegexTransformer
    ScriptTransformer
    TemplateTransformer
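The nested entity queries refer to values from the parent row and from the importer through ${...} placeholders such as ${item.ID}. The sketch below shows one simple way such placeholders could be resolved, using a regular expression over the query template. TemplateSketch and resolve are hypothetical names; this is an illustration of the substitution idea only, not DataImportHandler's own resolver.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical placeholder resolver; not DataImportHandler's implementation. */
final class TemplateSketch {
    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value; unknown names are left untouched. */
    static String resolve(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = value != null ? value : m.group(0);
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, resolving the child query from the slide with item.ID mapped to 42 yields select DESC from FEATURE where ITEMID='42'.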

27 LocalSolr

28

29 LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields
[Configuration snippet; markup lost in extraction. Surviving values: lat, lng, 9, 17]
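The idea of indexing grid boxes of several sizes can be sketched as follows. This is a deliberately simplified, hypothetical scheme: at tier t the world is cut into 2^t by 2^t equal cells in latitude/longitude space, and each document gets one box id per tier. The field naming (tier_N), the box-id formula, and reading the slide's 9 and 17 as a tier range are all assumptions; LocalSolr's actual projection and ids are not shown in the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical grid-tier scheme; not LocalSolr's actual Cartesian tier math. */
final class TierSketch {
    /** Box id at a tier: row-major index into a 2^tier by 2^tier grid. */
    static long boxId(double lat, double lon, int tier) {
        long cells = 1L << tier;
        long col = clamp((long) Math.floor((lon + 180.0) / 360.0 * cells), cells);
        long row = clamp((long) Math.floor((lat + 90.0) / 180.0 * cells), cells);
        return row * cells + col;
    }

    /** Keeps edge coordinates (lat 90, lon 180) inside the last cell. */
    private static long clamp(long index, long cells) {
        return Math.max(0L, Math.min(cells - 1, index));
    }

    /** One field per tier, from coarse boxes (low tier) to fine boxes (high tier). */
    static Map<String, Long> tierFields(double lat, double lon, int startTier, int endTier) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (int t = startTier; t <= endTier; t++) {
            fields.put("tier_" + t, boxId(lat, lon, t));
        }
        return fields;
    }
}
```

At query time, a search radius can then be covered by a handful of box ids at a suitably sized tier, so candidate documents are found with ordinary term lookups before any exact distance is computed.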

30 LocalSolr Cartesian Tiers

31 LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points

    <searchComponent name="geoquery" class="....LocalSolrQueryComponent" />
    geoquery...

32 GuardianComponent

33 GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than The Query
● Increase Precision At The Expense Of Recall

q = Dance Party
    Dance Party (1995)
    Dance Party (2005) (V)
    Dance Party, USA (2006)
    Workout Party... Let's Dance! (2004) (V)
    Shrek in the Swamp Karaoke Dance Party (2001) (V)

34 Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
    Pick MAX_LEN Based On Number Of Query Clauses
    Re-analyze Stored “title” Field
    Eliminate Any Results With More Than MAX_LEN Tokens In “title”

35 Alternate Approach
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
    Subclass Your Favorite QParser
    Pick MAX_LEN Based On Number Of Query Clauses From Super
    Add +titleLen:[* TO MAX_LEN] Clause To Query

36 Testing Your Plugins
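The post-processing step can be illustrated without any Solr classes. The sketch below counts tokens in each title and drops titles longer than a limit derived from the query. LengthGuardSketch is a hypothetical name, the tokenizer is a crude split on non-alphanumeric characters rather than a re-run of the field's analyzer, and the rule MAX_LEN = 2 x (number of query clauses) is an assumption made only for this example; the slides do not say how MAX_LEN is chosen.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the length guard; not the GuardianComponent source. */
final class LengthGuardSketch {
    /** Counts runs of letters/digits; a crude stand-in for real analysis. */
    static int tokenCount(String text) {
        int n = 0;
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) n++;
        }
        return n;
    }

    /** Assumed rule for this sketch only: allow titles up to twice the query length. */
    static int maxLen(int queryClauses) {
        return 2 * queryClauses;
    }

    /** Keeps only the titles whose token count does not exceed the limit. */
    static List<String> guard(String query, List<String> titles) {
        int limit = maxLen(tokenCount(query));
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (tokenCount(title) <= limit) kept.add(title);
        }
        return kept;
    }
}
```

With the query Dance Party and the five example titles from the previous slide, this assumed rule keeps the three short titles and drops the two long ones, trading recall for precision as the goal slide describes.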

37 AbstractSolrTestCase

    public class YourTest extends AbstractSolrTestCase {
      ...
      public void testSomeStuff() throws Exception {
        // Index two documents, then commit so they become searchable.
        assertU(adoc("id", "7", "description", "Travel Guide",
                     "title", "Paris in 10 Days"));
        assertU(adoc("id", "42", "description", "Cool Book",
                     "title", "Hitch Hiker's Guide to the Galaxy"));
        assertU(commit());
        // Both docs match "guide"; the title match (boost 2) is expected to rank first.
        assertQ("multi qf",
                req("q", "guide", "qt", "dismax", "qf", "title^2 description^1"),
                "//*[@numFound='2']",
                "//result/doc[1]/int[@name='id'][.='42']",
                "//result/doc[2]/int[@name='id'][.='7']");
      }
    }

38 Questions? http://lucene.apache.org/solr/

