Lucene Near Realtime Search Jason Rutherglen & Jake Mannix LinkedIn 6/3/2009 SOLR/Lucene Users Group San Francisco
What is NRT? Search on documents nearly as fast as they are indexed. Delete documents in a way that is immediate and IO-efficient. Good for things like Twitter and other apps that require realtime searching (Social 2.0).
Today? Users expect to search their data immediately after updating it (Web/Social 2.0 apps). Search engines are designed to perform efficient batch indexing (not realtime). Batch indexing is slow, and updates take a while to be searchable.
NRT in Lucene: Uses core Lucene code to make existing batch indexing nearly realtime. Required retrofitting of some of the core implementation. Details are hidden. Hopefully really easy for developers to use.
LUCENE-1314: IndexReader.clone is like reopen. However, it performs a copy-on-write of norms and deletes. Used by LUCENE-1516 to keep deletes in RAM (rather than flush them to disk).
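The copy-on-write idea can be sketched in a few lines. This is a toy model, not Lucene's implementation: `CowDeletesSketch` and `ReaderView` are hypothetical names, and a plain `java.util.BitSet` stands in for the deleted-docs structure. A clone shares the bit set with its source until one of them records a delete.

```java
import java.util.BitSet;

// Toy model of copy-on-write deletes; these are not Lucene's classes.
class CowDeletesSketch {

    /** Hypothetical stand-in for a reader's view of deleted documents. */
    static final class ReaderView {
        private BitSet deleted;   // may be shared with other views
        private boolean shared;   // true while another view may reference the same BitSet

        ReaderView(int maxDoc) {
            this.deleted = new BitSet(maxDoc);
            this.shared = false;
        }

        private ReaderView(BitSet sharedDeleted) {
            this.deleted = sharedDeleted;
            this.shared = true;
        }

        /** Cheap clone: both views share the bit set until one of them writes. */
        ReaderView cloneView() {
            this.shared = true;
            return new ReaderView(deleted);
        }

        /** Copy the bit set on the first write so other views keep their snapshot. */
        void delete(int docId) {
            if (shared) {
                deleted = (BitSet) deleted.clone();
                shared = false;
            }
            deleted.set(docId);
        }

        boolean isDeleted(int docId) {
            return deleted.get(docId);
        }
    }

    public static void main(String[] args) {
        ReaderView original = new ReaderView(8);
        original.delete(1);

        ReaderView clone = original.cloneView();
        clone.delete(5); // triggers the copy; the original is untouched

        System.out.println("original sees doc 5 deleted: " + original.isDeleted(5));
        System.out.println("clone sees doc 5 deleted: " + clone.isDeleted(5));
    }
}
```

The boolean flag keeps the sketch short at the cost of an occasional unnecessary copy; a reference count on the shared structure would avoid that.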
LUCENE-1516: Adds ability to obtain an IndexReader from IndexWriter. Efficient in-RAM deletes. Call IndexWriter.getReader instead of IndexReader.reopen. All updating, deleting, reopening, and flushing details are hidden from the user. Will be in Lucene 2.9.
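To make the calling pattern concrete without depending on a Lucene build, here is a toy model of the semantics. `ToyWriter` and `ToyReader` are hypothetical classes, not Lucene's; the assumption illustrated is that each reader obtained from the writer is a point-in-time snapshot that already includes buffered adds and deletes, and that a fresh call is needed to see later changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: ToyWriter and ToyReader are hypothetical, not Lucene classes.
class NrtWriterSketch {

    /** Immutable point-in-time view of the documents. */
    static final class ToyReader {
        private final List<String> docs;

        ToyReader(List<String> docs) {
            this.docs = docs;
        }

        int numDocs() {
            return docs.size();
        }

        boolean contains(String doc) {
            return docs.contains(doc);
        }
    }

    /** Buffers changes in memory and hands out snapshots on request. */
    static final class ToyWriter {
        private final List<String> buffered = new ArrayList<>();

        void addDocument(String doc) {
            buffered.add(doc);
        }

        void deleteDocument(String doc) {
            buffered.remove(doc);
        }

        /** Snapshot of everything buffered so far; no commit or disk flush involved. */
        ToyReader getReader() {
            return new ToyReader(Collections.unmodifiableList(new ArrayList<>(buffered)));
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument("first post");

        ToyReader before = writer.getReader();
        writer.addDocument("second post");
        ToyReader after = writer.getReader();

        System.out.println("before: " + before.numDocs() + " doc(s)");
        System.out.println("after: " + after.numDocs() + " doc(s)");
    }
}
```

The toy copies the whole list per snapshot for simplicity; the point of the real feature is that unchanged segments are shared between readers rather than copied.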
LUCENE-1313: Near Realtime Search. Makes IW.getReader faster. New segments are flushed to an IndexWriter-internal RAMDirectory. Could increase overall indexing performance because there's no pause while the RAM buffer is being written to disk. Will be in Lucene 2.9?
LUCENE-1483: Searches on field caches at the segment level. Means faster field cache loading and more efficient memory usage. Good for realtime because field cache loading is less of a bottleneck and less RAM is used. Will be in Lucene 2.9.
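Why segment-level caches help realtime search can be shown with a toy cache keyed by segment. The classes below are hypothetical stand-ins, not Lucene's FieldCache: after a reopen that adds one small segment, only that segment's values have to be loaded, while the entries for unchanged segments are reused.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a per-segment field cache; names are hypothetical.
class SegmentFieldCacheSketch {

    /** Stand-in for an immutable index segment with one numeric field. */
    static final class Segment {
        final String name;
        final int[] fieldValues; // pretend these live on disk

        Segment(String name, int... fieldValues) {
            this.name = name;
            this.fieldValues = fieldValues;
        }
    }

    /** Cache keyed by segment, so unchanged segments are never reloaded. */
    static final class PerSegmentCache {
        private final Map<String, int[]> cache = new HashMap<>();
        int loads = 0; // counts simulated loads from disk

        int[] get(Segment segment) {
            return cache.computeIfAbsent(segment.name, name -> {
                loads++;
                return segment.fieldValues.clone();
            });
        }

        /** Make sure every segment of the current reader is cached. */
        void warm(List<Segment> segments) {
            for (Segment segment : segments) {
                get(segment);
            }
        }
    }

    public static void main(String[] args) {
        Segment s0 = new Segment("_0", 3, 1, 2);
        Segment s1 = new Segment("_1", 9, 8);
        PerSegmentCache cache = new PerSegmentCache();

        cache.warm(Arrays.asList(s0, s1));
        System.out.println("loads after first open: " + cache.loads);

        // A reopen adds one small segment; only that segment is loaded.
        Segment s2 = new Segment("_2", 7);
        cache.warm(Arrays.asList(s0, s1, s2));
        System.out.println("loads after reopen: " + cache.loads);
    }
}
```

A cache keyed on the whole index would instead rebuild every entry on each reopen, which is the bottleneck the slide refers to.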
LUCENE-1526: Optimize copy-on-write. When we're doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates. So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?). Lucene 3.0?
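One possible shape for incremental copy-on-write is a paged bit set in which a write copies only the page it touches. This is an illustrative sketch, not the approach taken in the LUCENE-1526 patch; `PagedBits` and the page size of 1024 bits are assumptions made for the example.

```java
import java.util.Arrays;

// Sketch of incremental copy-on-write; PagedBits is a hypothetical class.
class PagedCowBitsSketch {

    static final class PagedBits {
        static final int PAGE_BITS = 1024;               // arbitrary page size for the sketch
        private static final int WORDS_PER_PAGE = PAGE_BITS / 64;

        private final long[][] pages;   // pages may be shared between clones
        private final boolean[] owned;  // true if this instance may mutate the page in place
        int pagesCopied = 0;            // how many pages this instance had to copy

        PagedBits(int numBits) {
            int numPages = (numBits + PAGE_BITS - 1) / PAGE_BITS;
            pages = new long[numPages][WORDS_PER_PAGE];
            owned = new boolean[numPages];
            Arrays.fill(owned, true);
        }

        private PagedBits(long[][] sharedPages) {
            pages = sharedPages.clone();        // copies page references only, not page contents
            owned = new boolean[pages.length];  // all false: every page is shared
        }

        /** Cheap clone: afterwards both instances copy a page before writing to it. */
        PagedBits cloneShared() {
            Arrays.fill(owned, false);
            return new PagedBits(pages);
        }

        void set(int bit) {
            int page = bit / PAGE_BITS;
            if (!owned[page]) {
                pages[page] = pages[page].clone(); // copy just this page
                owned[page] = true;
                pagesCopied++;
            }
            int offset = bit % PAGE_BITS;
            pages[page][offset >>> 6] |= 1L << (offset & 63);
        }

        boolean get(int bit) {
            int offset = bit % PAGE_BITS;
            return (pages[bit / PAGE_BITS][offset >>> 6] & (1L << (offset & 63))) != 0;
        }
    }

    public static void main(String[] args) {
        PagedBits deletes = new PagedBits(1_000_000);
        deletes.set(5);

        PagedBits clone = deletes.cloneShared();
        clone.set(700_000); // copies one 1024-bit page instead of the whole structure

        System.out.println("pages copied by clone: " + clone.pagesCopied);
        System.out.println("original sees bit 700000: " + deletes.get(700_000));
    }
}
```

The trade-off is an extra level of indirection on every read in exchange for clone cost that scales with the number of modified pages rather than with the size of the index.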
LUCENE-1231: Column stride fields will make field cache loading faster because data will be loaded sequentially from disk. Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next). Lucene 3.0?
Future of Lucene NRT: LUCENE-1292, a realtime parallel untokenized field index (for tags). Pulsing: store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading. Replication. More benchmarks.
LinkedIn Open Source Projects: Bobo, a facet library that counts using custom field caches (http://code.google.com/p/bobo-browse/). Zoie, realtime search on top of Lucene (http://code.google.com/p/zoie/). Voldemort, distributed key-value storage (http://project-voldemort.com/).
BoboBrowse facet features: MultiSelect. Runtime-defined facets (query-based, etc.). Fast (custom field-cache based). Custom facet types: hierarchical (/a/b/c), range, multivalued.
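Field-cache-based facet counting generally comes down to an array lookup per hit. The sketch below is a generic illustration of that technique, not Bobo's code: `docToOrd` is an assumed in-memory array mapping each document id to the ordinal of its facet value, and `countFacets` is a hypothetical helper.

```java
// Generic sketch of facet counting over an in-memory ordinal array; not Bobo's API.
class FacetCountSketch {

    /**
     * Count hits per facet value.
     *
     * @param docToOrd  for each doc id, the ordinal of its facet value
     * @param numValues number of distinct facet values
     * @param hitDocIds doc ids matched by the query
     */
    static int[] countFacets(int[] docToOrd, int numValues, int[] hitDocIds) {
        int[] counts = new int[numValues];
        for (int doc : hitDocIds) {
            counts[docToOrd[doc]]++; // one array read and one increment per hit
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] values = {"blue", "green", "red"};
        int[] docToOrd = {0, 2, 2, 1, 0, 2}; // doc 0 is blue, doc 1 is red, ...
        int[] hits = {1, 2, 4, 5};

        int[] counts = countFacets(docToOrd, values.length, hits);
        for (int i = 0; i < values.length; i++) {
            System.out.println(values[i] + ": " + counts[i]);
        }
    }
}
```

Multivalued and hierarchical facets need a richer mapping than one ordinal per document, which is presumably where the custom facet types come in.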
Zoie realtime features: No modifications to core Lucene. Multiple read/write: RAMDir + FSDir. IndexReader on (small) RAMDir opened per request: instantly realtime. IndexReaderDecorator for custom Reader. Transparent indexing: implement StreamDataProvider, then inject.
Next Steps: Help work on the patches? https://issues.apache.org/jira/browse/LUCENE LinkedIn is hiring. Contact: email@example.com or firstname.lastname@example.org@gmail.com