Advanced Indexing Techniques with Apache Lucene - Payloads, by Michael Busch


1 Advanced Indexing Techniques with Apache Lucene - Payloads
Michael Busch

2 Agenda
Part 1: Inverted Index 101
– Posting Lists
– Stored Fields vs. Payloads
Part 2: Use cases for Payloads
– BoostingTermQuery
– Simple facet counting

3 Lucene's data structures
[Diagram: the Inverted Index serves search and yields results; the Store serves retrieval of stored fields for the hits]

4 c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Query: not
String comparison slow!
Solution: Inverted index

5 c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Query: not
Inverted index: be, important, is, not, or, questioning, stop, to, the, thing
[Diagram: each term points to a posting list of Document IDs]
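The lookup on this slide can be sketched in a few lines of plain Java. This is a toy illustration and not Lucene's implementation: the class name `ToyInvertedIndex`, the tokenization rule (lower-case, split on non-letters) and the document IDs 0 and 1 are all assumptions made for the example.

```java
import java.util.*;

/** Toy inverted index (not Lucene): maps each lower-cased term to the IDs of the documents containing it. */
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits the text on non-letters and records docId under every term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the IDs of all documents containing the term; one map lookup, no text scanning. */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "To be or not to be.");
        index.add(1, "The important thing is not to stop questioning.");
        System.out.println(index.search("not")); // both documents contain "not"
    }
}
```

The point of the structure is that a query touches only the posting list of the queried term instead of comparing strings against every document.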

6 c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Query: "not to"
Inverted index: be, important, is, not, or, questioning, stop, to, the, thing
[Diagram: each term points to a posting list of Document IDs]

7 c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Query: "not to"
Inverted index: be, important, is, not, or, questioning, stop, to, the, thing
[Diagram: each posting holds a Document ID and Positions; one position value shown is 3]
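Why positions make the phrase query "not to" answerable can be sketched as follows. Again this is a toy model with assumed names (`ToyPositionalIndex`, `phrase`), 0-based positions and a simplistic tokenizer, and it only handles two-word phrases; it is not how Lucene evaluates phrase queries internally.

```java
import java.util.*;

/** Toy positional index (not Lucene): term -> (docId -> positions of the term in that document). */
final class ToyPositionalIndex {
    private final Map<String, SortedMap<Integer, List<Integer>>> postings = new HashMap<>();

    /** Records the 0-based position of every term occurrence in the document. */
    void add(int docId, String text) {
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Returns IDs of documents in which `first` is immediately followed by `second`. */
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        SortedMap<Integer, List<Integer>> a = postings.get(first);
        SortedMap<Integer, List<Integer>> b = postings.get(second);
        if (a == null || b == null) return hits;
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) continue; // second term absent from this doc
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) { // adjacent occurrence found
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}
```

With the two sample documents indexed as IDs 0 and 1, `phrase("not", "to")` matches both, while `phrase("be", "not")` matches neither, even though both words occur in document 0.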

8 c:\docs\shakespeare.txt: To be or not to be.
c:\docs\einstein.txt: The important thing is not to stop questioning.
Inverted index with Payloads: be, important, is, not, or, questioning, stop, to, the, thing
[Diagram: each posting holds a Document ID, Positions and Payloads; one entry shown has position 4 and payload B]

9 So far…
String comparison is slow
An inverted index is used to accelerate search
Positions are stored in posting lists to allow phrase searches
Payloads are stored in posting lists to attach arbitrary data to each position
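As a rough mental model of the last point, a posting list entry can be pictured as a (document, position, payload) triple. The sketch below uses hypothetical names (`ToyPayloadPostings`, `Occurrence`) and an in-memory list; Lucene's on-disk encoding is considerably more compact than this.

```java
import java.util.*;

/** Toy posting lists (not Lucene) whose entries carry a byte[] payload per occurrence. */
final class ToyPayloadPostings {
    /** One occurrence of a term: where it occurred and the bytes attached to that occurrence. */
    static final class Occurrence {
        final int docId;
        final int position;
        final byte[] payload; // arbitrary application data; empty if none

        Occurrence(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload.clone();
        }
    }

    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    /** Appends one occurrence of the term together with its payload. */
    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Occurrence(docId, position, payload));
    }

    /** Returns all recorded occurrences of the term in insertion order. */
    List<Occurrence> occurrences(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```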

10 Lucene's data structures
[Diagram: the Inverted Index serves search and yields results; the Store serves retrieval of stored fields for the hits]

11 Store
Field 1: title
Field 2: content
Field 3: hashvalue
Documents:
D0: F1 F2 F3
D1: F1 F2 F3
D2: F1 F2 F3

12 Store
D0: F1 F2 F3
D1: F1 F2 F3
D2: F1 F2 F3
Optimized for random access
Document-locality

13 Store
D0: F1 F2 F3
D1: F1 F2 F3
D2: F1 F2 F3
Posting list with Payloads
[Diagram: a posting list for F3 with columns Document IDs, Positions and Payloads; visible entries: D0, D1; positions 0, 0, 0; payloads X, X, X]
Optimized for scanning and skipping
Space-efficient encoding

14 Agenda
Part 1: Inverted Index 101
– Posting Lists
– Stored Fields vs. Payloads
Part 2: Use cases for Payloads
– BoostingTermQuery
– Simple facet counting

15 Payloads - API
org.apache.lucene.analysis.Token
    void setPayload(Payload payload)
org.apache.lucene.index.Payload
    Payload(byte[] data)
    Payload(byte[] data, int offset, int length)

16 Payloads - API
org.apache.lucene.index.TermPositions
    boolean next();
    int doc();
    int freq();
    int nextPosition();
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset);

17 Example: BoostingTermQuery
Use case: Score certain occurrences of a term higher than others
E.g.: Query: 'warning'
doc1: "HURRICANE WARNING"
doc2: "The Warning Label Generator is a fun way to generate your own warning labels!" (www.warninglabelgenerator.com)

18 Example: BoostingTermQuery
Analyzer:
    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    …
    if (isBold) {
        // attach the boost as a one-byte payload to this occurrence
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    …
    return token;

19 Example: BoostingTermQuery
Similarity:
    Similarity boostingSimilarity = new DefaultSimilarity() {
        public float scorePayload(byte[] payload, int offset, int length) {
            if (length == 1) return payload[offset];  // the single payload byte is the boost
            return 1;  // no usable payload: neutral factor
        }
    };
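The effect of such a payload factor on the 'warning' example can be illustrated without Lucene. The sketch below makes two loud assumptions: that the occurrence in doc1 was indexed as bold with the boost byte 5, and that a document's score is simply the sum of the per-occurrence factors. The second is a simplification chosen for illustration and is not BoostingTermQuery's actual scoring formula; the names `PayloadScoreSketch` and `score` are hypothetical.

```java
/** Sketch of payload-based boosting; the summing rule is an assumption, not Lucene's formula. */
final class PayloadScoreSketch {
    /** Mirrors the slide's scorePayload: a one-byte payload is the factor, anything else is neutral. */
    static float scorePayload(byte[] payload, int offset, int length) {
        return length == 1 ? payload[offset] : 1.0f;
    }

    /** Hypothetical document score: the sum of the factors of all occurrences of the query term. */
    static float score(byte[][] occurrencePayloads) {
        float sum = 0f;
        for (byte[] payload : occurrencePayloads) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[][] doc1 = { {5} };     // one bold occurrence of "warning" (assumed)
        byte[][] doc2 = { {}, {} };  // two plain occurrences, no payload
        System.out.println(score(doc1) > score(doc2)); // true: 5.0 versus 2.0
    }
}
```

Under these assumptions the single emphasized occurrence outweighs two plain ones, which is the ranking the use case asks for.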

20 Example: BoostingTermQuery
BoostingTermQuery:
    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);

21 Example from java-user: Unique Doc Ids
Use case:
Store a unique document id (UID) that maps to a row in a database table
Retrieve the UID at search time to influence matching/scoring
FieldCache takes too long to load

22 Example from java-user: Unique Doc Ids
Solution:
Index one special term for each document, e.g. ID:UID
Index one occurrence for each document
Store the UID in the Payload of the occurrence

23 Example from java-user: Unique Doc Ids
For indexing: TokenStream
    class SinglePayloadTokenStream extends TokenStream {
        boolean done = false;
        public void setUID(int uid) {...}
        public Token next() throws IOException {
            if (done) return null;  // emit exactly one token per document
            Token token = new Token("UID", 0, 0);  // offsets are irrelevant for this term
            // intToBytes is a hypothetical helper (not on the slide): Payload takes a byte[], not an int
            token.setPayload(new Payload(intToBytes(uid)));
            done = true;
            return token;
        }
    }

24 Example from java-user: Unique Doc Ids
For retrieving: TermPositions
    public int[] getCachedUIDs(IndexReader reader) throws IOException {
        int[] cache = new int[reader.maxDoc()];
        TermPositions tp = reader.termPositions(new Term("ID", "UID"));
        byte[] buffer = new byte[4];
        while (tp.next()) {        // iterate over docs
            tp.nextPosition();     // only one pos per doc
            tp.getPayload(buffer, 0);
            cache[tp.doc()] = bytesToInt(buffer);  // bytesToInt: helper not shown on the slide
        }
        return cache;
    }
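The slides call `bytesToInt` without showing it. One plausible way to write it, together with the matching encoder for the indexing side, is sketched below. The class name `UidCodec`, the name `intToBytes` and the choice of a 4-byte big-endian encoding are assumptions; any fixed encoding works as long as indexing and retrieval agree on it.

```java
import java.nio.ByteBuffer;

/** Hypothetical helpers for storing an int UID in a 4-byte payload (big-endian). */
final class UidCodec {
    /** Encodes the value as 4 bytes, most significant byte first. */
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Decodes the first 4 bytes of the buffer back into an int. */
    static int bytesToInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes, 0, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] payload = intToBytes(123456789);
        System.out.println(bytesToInt(payload)); // round-trips to 123456789
    }
}
```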

25 Example from java-user: Unique Doc Ids
Performance: Load UIDs for 2M docs into memory
FieldCache: 16.5 s
Payloads: 430 ms

26 Example: (Very) Simple facet counting
Use case:
Collection with docs from different sources
Show top-n results from each source instead of top-n results from the entire collection

27 Example: (Very) Simple facet counting
Analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            return new TokenStream() {
                boolean done = false;
                public Token next() {
                    if (done) return null;  // one token per document
                    Token token = new Token(…);
                    // the payload carries a hash of the document's source URL
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        // handling of other fields is not shown on the slide
    }

28 Example: (Very) Simple facet counting
HitCollector:
Use different PriorityQueues for different sites
Instead of returning top-n results of the whole data set, return top-n results per site
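The collector idea can be sketched independently of Lucene's HitCollector class. In the sketch below the site hash is passed in directly, whereas in the approach from the slides it would be read from the payload of the `_facet` term; the names `PerSiteTopN`, `collect` and `topDocs` are made up for the example.

```java
import java.util.*;

/** Sketch of per-site top-n collection: one bounded min-heap per site (not Lucene's HitCollector). */
final class PerSiteTopN {
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(h -> h.score);

    private final int n;
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) {
        this.n = n;
    }

    /** Called once per matching document; keeps only the n best hits of each site. */
    void collect(int siteHash, int docId, float score) {
        PriorityQueue<Hit> queue = queues.computeIfAbsent(siteHash, s -> new PriorityQueue<>(BY_SCORE));
        queue.add(new Hit(docId, score));
        if (queue.size() > n) {
            queue.poll(); // drop the lowest-scoring hit of this site
        }
    }

    /** Returns the doc IDs kept for a site, best score first. */
    List<Integer> topDocs(int siteHash) {
        List<Hit> hits = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue != null) hits.addAll(queue);
        hits.sort(BY_SCORE.reversed());
        List<Integer> docIds = new ArrayList<>();
        for (Hit hit : hits) docIds.add(hit.docId);
        return docIds;
    }
}
```

Keeping a min-heap of size n per site means each collected hit costs O(log n), and memory stays proportional to the number of sites times n rather than to the number of matches.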

29 Example: (Very) Simple facet counting
Summary:
In this example: facet (site) used for scoring, but extendable for facet counting
Good performance due to locality of facet values

30 Example: Efficient Numeric Search
Use case:
Find documents that have a numeric value in a specific range, e.g. all docs with a date >2006 and <2007
Currently in Lucene: RangeQuery
Store all values in the dictionary
Query expansion

31 Example: Efficient Numeric Search
[Diagram: a dictionary with one term per date (01/01/…, …/02/…, …/04/…, …/30/2006), each pointing to its own posting list]
Query: [01/05/2006 TO 11/25/2006]
Problem: A large number of posting lists have to be processed

32 Example: Efficient Numeric Search
Idea: Index a special term, e.g. 'numeric:date', and store the actual value in a Payload for each doc
Problem: Posting list can become very big -> entire list has to be processed
Solution: Hybrid approach

33 Example: Efficient Numeric Search
Dictionary: date:01/2006, date:02/2006, …, date:12/2006
Posting lists: Document IDs, Positions, Payloads
Store day in payload
Store position where date occurred
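The hybrid layout can be sketched with one bucket per month and the day of month kept alongside each posting, standing in for the payload. This is an illustrative model only: the class `MonthBucketIndex`, the use of `java.time` types and the month-sized chunk are assumptions, and the position component from the slide is omitted.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.*;

/** Sketch of the hybrid approach: one posting list per month, the day stored with each posting. */
final class MonthBucketIndex {
    /** One posting: the document and the day of month, which plays the role of the payload. */
    static final class Posting {
        final int docId;
        final int day;

        Posting(int docId, int day) {
            this.docId = docId;
            this.day = day;
        }
    }

    private final SortedMap<YearMonth, List<Posting>> buckets = new TreeMap<>();

    /** Files the document under its month bucket and keeps the day as per-posting data. */
    void add(int docId, LocalDate date) {
        buckets.computeIfAbsent(YearMonth.from(date), m -> new ArrayList<>())
               .add(new Posting(docId, date.getDayOfMonth()));
    }

    /** Returns IDs of documents whose date lies in [from, to], both ends inclusive. */
    SortedSet<Integer> range(LocalDate from, LocalDate to) {
        SortedSet<Integer> hits = new TreeSet<>();
        if (from.isAfter(to)) return hits;
        YearMonth first = YearMonth.from(from);
        YearMonth last = YearMonth.from(to);
        // Only the month buckets overlapping the range are visited.
        for (Map.Entry<YearMonth, List<Posting>> entry
                : buckets.subMap(first, last.plusMonths(1)).entrySet()) {
            YearMonth month = entry.getKey();
            for (Posting p : entry.getValue()) {
                // The day only matters in the two boundary months.
                boolean afterStart = !month.equals(first) || p.day >= from.getDayOfMonth();
                boolean beforeEnd = !month.equals(last) || p.day <= to.getDayOfMonth();
                if (afterStart && beforeEnd) hits.add(p.docId);
            }
        }
        return hits;
    }
}
```

For the query [01/05/2006 TO 11/25/2006] this visits at most eleven month buckets rather than one posting list per distinct date, and the day values are compared only in January and November.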

34 Example: Efficient Numeric Search
Tradeoff between number of posting lists to process and size of posting lists
Significant speedup possible with good choice of chunk size

35 Conclusion
Payloads offer great flexibility
Payloads are stored very space-efficiently
Sophisticated data structures enable efficient skipping over payloads
Payloads should be used whenever special data is required for finding hits and scoring

36 Outlook
Finalize API (currently Beta)
Add more out-of-the-box query types
Per-document Payloads – updateable
FieldCache implementation that uses Payloads

37 Questions?

