Lucene Performance
Grant Ingersoll
November 16, 2007, Atlanta, GA

Overview
- Defining Performance
- Basics
- Indexing
  - Parameters
  - Threading
- Search
- Document Retrieval
- Search Quality

Defining Performance
- Many factors go into assessing Lucene (and search) performance
- Speed
- Quality of results (subjective)
  - Precision: # of relevant documents retrieved out of # of documents retrieved
  - Recall: # of relevant documents retrieved out of total # of relevant documents (see the worked example below)
- Size of index
  - Compression rate
- Other factors:
  - Local vs. distributed
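
To make precision and recall concrete, here is a minimal worked example; the counts (10 documents retrieved, 7 of them relevant, 14 relevant documents in the whole collection) are invented for illustration:

    public class PrecisionRecall {
        public static void main(String[] args) {
            int retrieved = 10;        // documents the engine returned
            int relevantRetrieved = 7; // returned documents that are actually relevant
            int totalRelevant = 14;    // relevant documents in the entire collection
            double precision = (double) relevantRetrieved / retrieved;  // 7/10 = 0.70
            double recall = (double) relevantRetrieved / totalRelevant; // 7/14 = 0.50
            System.out.println("precision=" + precision + ", recall=" + recall);
        }
    }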

Basics
- Use the latest version of Lucene
  - Lucene 2.3/trunk has many performance improvements over prior versions
- Consider Solr
  - Solr employs many Lucene best practices
- contrib/benchmark can help assess many aspects of performance, including speed, precision, and recall
  - Its task-based approach makes it easy to extend
- Sanity-check your needs
- Profile to identify bottlenecks

Indexing Factors
- Lucene indexes Documents into memory
- At certain points, memory is flushed to the on-disk index representation (called a segment)
- Segments are periodically merged
- Lucene's internal indexing model is changing and (drastically) improving performance

IndexWriter Factors (see the configuration sketch below)
- setMaxBufferedDocs controls the minimum # of docs buffered in memory before they are flushed
  - Larger == faster, but more RAM
- setMergeFactor controls how often segments are merged
  - Smaller == less RAM, better for large # of updates
  - Larger == faster, better for batch indexing
- setMaxFieldLength controls the # of terms indexed from a document
- setUseCompoundFile controls the file format Lucene uses; turning off the compound file format is faster, but you could run out of file descriptors
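
A sketch of how these knobs are set on a 2.2-era IndexWriter; the index path and the specific values are placeholders, not recommendations:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexingConfig {
        public static void main(String[] args) throws IOException {
            Directory dir = FSDirectory.getDirectory("/path/to/index"); // hypothetical path
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.setMaxBufferedDocs(1000);   // larger == faster, but more RAM
            writer.setMergeFactor(20);         // larger == faster for batch indexing
            writer.setMaxFieldLength(50000);   // cap on terms indexed per document
            writer.setUseCompoundFile(false);  // faster, but watch file descriptor limits
            // ... writer.addDocument(...) calls go here ...
            writer.close();
        }
    }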

Lucene 2.3 IndexWriter Changes
- setRAMBufferSizeMB
  - New model that automatically controls the indexing factors based on the amount of memory in use
  - Obsoletes setMaxBufferedDocs and setMergeFactor
- Takes stored fields and term vectors out of the merge process
- Turn off auto-commit if there are stored fields and term vectors
- Provides a significant performance increase
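
Continuing the sketch above in 2.3 style; the 64 MB figure is an arbitrary illustration:

    // Lucene 2.3: flush based on RAM consumption rather than document count
    writer.setRAMBufferSizeMB(64.0);
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH); // let RAM usage drive flushing
    // in this era, auto-commit is a constructor argument; to turn it off:
    // IndexWriter writer = new IndexWriter(dir, false, new StandardAnalyzer(), true);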

Analysis
- An Analyzer is a Tokenizer plus one or more TokenFilters
- The more complicated the analysis, the slower the indexing
  - Many applications could use a simpler Analyzer than StandardAnalyzer
  - StandardTokenizer is faster in 2.3 (making StandardAnalyzer faster as well)
- Reuse in 2.3:
  - Reuse Token, Document, and Field instances
  - Use the char[] API on Token instead of the String API (sketched below)
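
A sketch of the 2.3 reuse APIs; the analyzer choice, field name, and text are made up for illustration:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    public class ReuseDemo {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new WhitespaceAnalyzer(); // simpler (and cheaper) than StandardAnalyzer
            TokenStream stream = analyzer.tokenStream("body", new StringReader("some example text"));
            Token reusable = new Token(); // one Token instance, reused for every term
            for (Token t = stream.next(reusable); t != null; t = stream.next(reusable)) {
                // char[] API: read the term without allocating a String per token
                System.out.println(new String(t.termBuffer(), 0, t.termLength()));
            }
        }
    }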

Thread Safety
- Use a single IndexWriter for the duration of indexing
- Share the IndexWriter between threads
- Parallel indexing:
  - Index to separate Directory instances
  - Merge when done with IndexWriter.addIndexes() (see the sketch below)
  - Distribute and collect
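
A sketch of the merge step, assuming each indexing thread has already built its own partial index (the Directory arguments stand in for those per-thread indexes):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class ParallelMerge {
        public static void merge(Directory mainDir, Directory[] partials) throws IOException {
            IndexWriter writer = new IndexWriter(mainDir, new StandardAnalyzer(), true);
            writer.addIndexes(partials); // merges the per-thread indexes into mainDir
            writer.close();
        }
    }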

Other Indexing Factors
- NFS
  - There have been some improvements lately, but "proceed with caution"
  - Not as good as a local filesystem
- Replication
  - Index locally, then use rsync to replicate copies of the index to other servers
  - Have I mentioned Solr?

Benchmarking Indexing
- contrib/benchmark
- Try out different algorithms between Lucene 2.2 and trunk (2.3)
  - contrib/benchmark/conf: indexing.alg, indexing-multithreaded.alg
- Test machine: Mac Pro, 2 x 2GHz Dual-Core Xeon, 4 GB RAM
- Run with: ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results

                    Records/Sec   Avg. Mem
    Trunk           2,122         52M
    Trunk-mt (4)    3,680         57M

Search Performance
- Many factors influence search speed:
  - Query type and size, analysis, # of term occurrences, index size, index optimization, index type
  - The "known enemies" (covered below)
- Search quality also has many factors:
  - Query formulation, synonyms, analysis, etc.
  - How do you judge quality?

Query Types
- Some queries in Lucene get rewritten into simpler queries:
  - WildcardQuery rewrites to a BooleanQuery of all the terms that satisfy the wildcard
    - a* -> abe, apple, an, and, array, ...
  - Likewise with RangeQuery, especially with date ranges
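
A sketch of that expansion through the public API, assuming an already-open IndexReader:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    public class RewriteDemo {
        public static Query expand(IndexReader reader) throws IOException {
            Query wildcard = new WildcardQuery(new Term("title", "a*"));
            // rewrite() turns the wildcard into a BooleanQuery over every matching term;
            // a very broad pattern can throw BooleanQuery.TooManyClauses
            return wildcard.rewrite(reader);
        }
    }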

Query Size
- Stopword removal can help reduce query size
- Choose expansions carefully
- Consider searching over fewer fields
- When doing relevance feedback, don't use the whole document; focus on the most important terms

Index Factors for Search
- Size:
  - More unique terms means more to search
  - Stopword removal and stemming can help reduce index size
  - Not a linear factor, due to index compression
- Type:
  - RAMDirectory if the index is small enough (sketched below)
  - MMapDirectory may perform better
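
A sketch of loading a small on-disk index entirely into RAM; the path is a placeholder:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamIndex {
        public static IndexSearcher open() throws IOException {
            Directory fsDir = FSDirectory.getDirectory("/path/to/index"); // hypothetical path
            Directory ramDir = new RAMDirectory(fsDir); // copies the whole index into memory
            return new IndexSearcher(ramDir);
        }
    }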

Search Speed Tips
- IndexSearcher
  - Thread-safe, so share it
  - Open once and use it for as long as possible
- Cache Filters when appropriate
- Optimize the index if you have the time
- Warm up your Searcher by sending it some preliminary queries before making it live (see the sketch below)
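
A minimal warm-up sketch; the field name and query terms are placeholders for representative production queries:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.Directory;

    public class Warmup {
        public static IndexSearcher openAndWarm(Directory dir) throws IOException {
            IndexSearcher searcher = new IndexSearcher(dir); // open once, share across threads
            // fire a few representative queries so caches and internal
            // structures are primed before the searcher goes live
            searcher.search(new TermQuery(new Term("body", "lucene")));
            searcher.search(new TermQuery(new Term("body", "performance")));
            return searcher;
        }
    }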

Known Enemies
- CPU, memory, and I/O are all known enemies of performance
  - Can't live without them, either!
- Profile, run benchmarks, look at garbage collection policies, etc.
- Check your needs:
  - Do you need wildcards?
  - Do you need so many Fields?

Document Retrieval
- Common search scenario:
  - Many small Fields containing info about the Document
  - One or two big Fields storing content
  - Run the search, display the small Fields to the user
  - The user picks one result to view its content

FieldSelector
- Gives the developer greater control over how a Document is loaded
  - Load, Lazy Load, No Load, Load and Break, Size, etc.
- In the previous scenario, lazily load the large Fields (sketched below)
- Makes it easier to store the original content without a performance penalty
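
A sketch of lazily loading one large stored field; the "content" field name is made up:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.FieldSelectorResult;
    import org.apache.lucene.index.IndexReader;

    public class LazyLoad {
        public static Document load(IndexReader reader, int docId) throws IOException {
            FieldSelector selector = new FieldSelector() {
                public FieldSelectorResult accept(String fieldName) {
                    // defer reading the big field until it is actually requested
                    return "content".equals(fieldName)
                            ? FieldSelectorResult.LAZY_LOAD
                            : FieldSelectorResult.LOAD;
                }
            };
            return reader.document(docId, selector);
        }
    }

The lazily loaded value is only read from disk when the field is actually asked for, e.g. via doc.getFieldable("content").stringValue().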

Quality Queries
- Evaluating search quality is difficult and subjective
- Lucene provides good out-of-the-box quality by most accounts
- You can evaluate using TREC or other experiments, but these risk over-tuning
- Unfortunately, judging quality is a labor-intensive task

Quality Experiments
- Needs:
  - A standard collection of docs: easy
  - A set of queries:
    - Query logs
    - Developed in-house
    - TREC and other conferences
  - A set of relevance judgments:
    - Labor-intensive to produce
    - Log analysis can estimate which results are relevant based on clicks, etc.

Query Formulation
- Invest the time to determine the proper analysis for the fields you are searching (see the sketch below):
  - Case-sensitive search
  - Punctuation handling
  - Strict matching
- Stopword policy
  - Stopwords can be useful
- Operator choice
- Synonym choices
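
One way to encode those per-field analysis decisions is PerFieldAnalyzerWrapper; a hedged sketch, with invented field names:

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class FieldAnalysis {
        public static PerFieldAnalyzerWrapper build() {
            // default: full tokenization for free-text fields
            PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            // strict matching (case and punctuation preserved) for an identifier field
            analyzer.addAnalyzer("sku", new KeywordAnalyzer());
            return analyzer;
        }
    }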

Effective Scoring
- The Similarity class provides a callback mechanism for controlling how some Lucene scoring factors count towards the score
  - tf(), idf(), coord()
- Experiment with different length normalization factors (see the sketch below)
  - You may find Lucene is overemphasizing shorter or longer documents
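
A sketch of a DefaultSimilarity subclass that flattens length normalization; the fourth-root curve is an invented example, not a recommendation:

    import org.apache.lucene.search.DefaultSimilarity;

    public class FlatterLengthSimilarity extends DefaultSimilarity {
        // DefaultSimilarity returns 1/sqrt(numTerms); this dampens
        // the penalty on longer documents
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(Math.sqrt(numTerms)));
        }
    }

Install it on both the IndexWriter and the Searcher via setSimilarity() so index-time norms and query-time scoring agree; changing lengthNorm requires re-indexing, since norms are written at index time.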

Effective Scoring
- You can also implement your own Query class
  - Ask on the java-user mailing list first; someone else may have already done it
- Go beyond the obvious:
  - The org.apache.lucene.search.function package provides a means for using the values of Fields to change scores
    - Geographic scoring, user ratings, others
  - Payloads (stay tuned for the next presentation)

Resources
- Talk available at:
- Mailing list –
- Lucene In Action –