1 Advanced Indexing Techniques with Apache Lucene - Payloads
Michael Busch

2 Agenda
Part 1: Inverted Index 101
    Posting Lists
    Stored Fields vs. Payloads
Part 2: Use cases for Payloads
    BoostingTermQuery
    Simple facet counting

3 Lucene's data structures
[Diagram: search -> Inverted Index -> Hits -> retrieve stored fields (Store) -> Results]

4 String comparison slow!
Query: not
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
String comparison is slow!
Solution: Inverted index

5 Inverted index
Query: not
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
Dictionary terms: be, important, is, not, or, questioning, stop, to, the, thing
[Figure: each term points to a posting list of the Document IDs that contain it.]
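
As a rough illustration of the idea on this slide, the following Java sketch builds a toy inverted index that maps each term to the sorted set of IDs of the documents containing it. The class name, the tokenizer (lowercase, split on non-letters) and the assignment of ID 0 to einstein.txt and ID 1 to shakespeare.txt are assumptions made for this example; it is not Lucene code.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of document IDs (not Lucene's implementation). */
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits the text into lowercase terms and records docId for each of them. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue; // split() can yield an empty leading token
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Looks the term up in the dictionary; no document text is scanned. */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs != null ? docs : Collections.<Integer>emptySortedSet();
    }

    /** Builds the index for the two example documents from the slide. */
    static TinyInvertedIndex example() {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(0, "The important thing is not to stop questioning."); // assumed doc ID 0
        index.add(1, "To be or not to be.");                             // assumed doc ID 1
        return index;
    }
}
```

With this sketch, a lookup such as `TinyInvertedIndex.example().search("not")` is a single dictionary access instead of a string comparison against every document.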

6 Inverted index
Query: "not to"
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
Dictionary terms: be, important, is, not, or, questioning, stop, to, the, thing
[Figure: each term points to a posting list of the Document IDs that contain it.]

7 Inverted index
Query: "not to"
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
Dictionary terms: be, important, is, not, or, questioning, stop, to, the, thing
[Figure: each posting now holds a Document ID plus the Positions at which the term occurs in that document.]
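
To make the role of positions concrete, here is a minimal Java sketch of a positional index and a two-term phrase lookup. It is an illustration under stated assumptions, not Lucene's implementation: the class and method names are invented, positions are counted from 0, and the doc IDs (0 for einstein.txt, 1 for shakespeare.txt) are assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional index: term -> (docId -> positions of the term in that document). */
class TinyPositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Records every term of the text together with its 0-based position. */
    void add(int docId, String text) {
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Returns the IDs of documents in which `first` is immediately followed by `second`. */
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> a = postings.get(first);
        Map<Integer, List<Integer>> b = postings.get(second);
        if (a == null || b == null) return hits; // one of the terms is not indexed at all
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) continue; // second term missing in this doc
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) { // adjacent positions => phrase match
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    /** Builds the index for the two example documents from the slide. */
    static TinyPositionalIndex example() {
        TinyPositionalIndex index = new TinyPositionalIndex();
        index.add(0, "The important thing is not to stop questioning."); // assumed doc ID 0
        index.add(1, "To be or not to be.");                             // assumed doc ID 1
        return index;
    }
}
```

Document IDs alone would only show that both documents contain "not" and "to"; the stored positions are what let the lookup confirm that the two terms are adjacent.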

8 Inverted index with Payloads
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
Dictionary terms: be, important, is, not, or, questioning, stop, to, the, thing
[Figure: each posting holds a Document ID and Positions; a Payload can be attached to each position.]

9 So far...
String comparison is slow
An inverted index is used to accelerate search
Positions are stored in posting lists to allow phrase searches
Payloads are stored in posting lists to attach arbitrary data to each position
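
The last point can be pictured with a small Java sketch of a posting list whose entries carry an optional byte payload per position. All names here are hypothetical, and the single byte 'B' is only an assumed marker for illustration; Lucene's actual on-disk encoding is different and far more compact.

```java
import java.util.ArrayList;
import java.util.List;

/** One occurrence of a term: document, position, and arbitrary payload bytes. */
class Occurrence {
    final int docId;
    final int position;
    final byte[] payload; // empty when nothing was attached to this position

    Occurrence(int docId, int position, byte[] payload) {
        this.docId = docId;
        this.position = position;
        this.payload = payload;
    }
}

/** Toy posting list for a single term (not Lucene's data structure). */
class PayloadPostingList {
    private final List<Occurrence> occurrences = new ArrayList<>();

    /** Adds an occurrence; the payload bytes are optional. */
    void add(int docId, int position, byte... payload) {
        occurrences.add(new Occurrence(docId, position, payload));
    }

    /** Returns the payload stored at (docId, position), or null if there is no such occurrence. */
    byte[] payloadAt(int docId, int position) {
        for (Occurrence o : occurrences) {
            if (o.docId == docId && o.position == position) return o.payload;
        }
        return null;
    }
}
```

For example, `list.add(1, 3, (byte) 'B')` stores one byte next to position 3 of document 1, while `list.add(0, 4)` stores an occurrence without any payload.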

10 Lucene's data structures
[Diagram: search -> Inverted Index -> Hits -> retrieve stored fields (Store) -> Results]

11 Store
Field 1: title
Field 2: content
Field 3: hashvalue
Documents: D0, D1, D2, each with fields F1, F2, F3

12 Store
Layout: D0 [F1 F2 F3], D1 [F1 F2 F3], D2 [F1 F2 F3]
Optimized for random access
Document-locality

13 Store
Posting list with Payloads
[Figure: a posting list with Document IDs, Positions and Payloads (entries for D0, D1 carrying F3), shown next to the Store layout D0 F1 F2 F3, D1 F1 F2 F3, D2 F1 F2 F3.]
Optimized for scanning and skipping
Space-efficient encoding

14 Agenda
Part 1: Inverted Index 101
    Posting Lists
    Stored Fields vs. Payloads
Part 2: Use cases for Payloads
    BoostingTermQuery
    Simple facet counting

15 Payloads - API
org.apache.lucene.analysis.Token
    void setPayload(Payload payload)
org.apache.lucene.index.Payload
    Payload(byte[] data)
    Payload(byte[] data, int offset, int length)

16 Payloads - API
org.apache.lucene.index.TermPositions
    boolean next();
    int doc();
    int freq();
    int nextPosition();
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset);

17 Example: BoostingTermQuery
Use case: Score certain occurrences of a term higher than others
E.g.:
    Query: 'warning'
    doc1: "HURRICANE WARNING"
    doc2: "The Warning Label Generator is a fun way to generate your own warning labels!" (www.warninglabelgenerator.com)

18 Example: BoostingTermQuery
Analyzer:
    final byte BoldBoost = 5;
    Token token = new Token(…);
    if (isBold) {
        // attach the boost as a one-byte payload to this occurrence
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    return token;

19 Example: BoostingTermQuery
Similarity:
    Similarity boostingSimilarity = new DefaultSimilarity() {
        public float scorePayload(byte[] payload, int offset, int length) {
            if (length == 1) return payload[offset];
            return 1.0f; // no usable payload: neutral factor
        }
    };

20 Example: BoostingTermQuery
BoostingTermQuery:
    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    Hits hits = searcher.search(btq);

21 Example from java-user: Unique Doc Ids
Use case:
    Store a unique document id (UID) that maps to a row in a database table
    Retrieve UID at search time to influence matching/scoring
    FieldCache takes too long to load

22 Example from java-user: Unique Doc Ids
Solution:
    Index one special term for each document, e.g. ID:UID
    Index one occurrence for each document
    Store UID in the Payload of the occurrence

23 Example from java-user: Unique Doc Ids
For indexing: TokenStream
    class SinglePayloadTokenStream extends TokenStream {
        boolean done = false;
        public void setUID(int uid) {...}
        public Token next() throws IOException {
            if (done) return null;
            Token token = new Token("UID", 0, 0);
            // Payload takes a byte[]; intToBytes is a helper not shown on the slide
            token.setPayload(new Payload(intToBytes(uid)));
            done = true;
            return token;
        }
    }

24 Example from java-user: Unique Doc Ids
For retrieving: TermPositions
    public int[] getCachedUIDs(IndexReader reader) throws IOException {
        int[] cache = new int[reader.maxDoc()];
        TermPositions tp = reader.termPositions(new Term("ID", "UID"));
        byte[] buffer = new byte[4];
        while (tp.next()) {           // iterate over docs
            tp.nextPosition();        // only one pos per doc
            tp.getPayload(buffer, 0);
            cache[tp.doc()] = bytesToInt(buffer);
        }
        return cache;
    }
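
The slides call bytesToInt without showing it. One possible pair of helpers is sketched below; the class name UidCodec and the intToBytes counterpart are assumptions, and the byte order (most significant byte first) is just the default of java.nio.ByteBuffer, not something the slides specify. Whatever encoding is chosen, indexing and retrieval have to agree on it.

```java
import java.nio.ByteBuffer;

/** Hypothetical helpers for packing an int UID into a 4-byte payload and back. */
final class UidCodec {
    private UidCodec() {}

    /** Encodes the value as 4 bytes, most significant byte first (ByteBuffer's default order). */
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Decodes the first 4 bytes of the buffer; mirrors intToBytes. */
    static int bytesToInt(byte[] buffer) {
        return ByteBuffer.wrap(buffer).getInt();
    }
}
```

A fixed 4-byte encoding matches the `new byte[4]` buffer used when reading the payloads back.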

25 Example from java-user: Unique Doc Ids
Performance: Load UIDs for 2M docs into memory
    FieldCache: 16.5 s
    Payloads: ms

26 Example: (Very) Simple facet counting
Use case:
    Collection with docs from different sources
    Show top-n results from each source instead of top-n results from entire collection

27 Example: (Very) Simple facet counting
Analyzer:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            return new TokenStream() {
                boolean done = false;
                public Token next() {
                    if (done) return null;
                    Token token = new Token(…);
                    // one token per document, carrying the hash of the source as payload
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        // analysis of all other fields is not shown on the slide
    }

28 Example: (Very) Simple facet counting
HitCollector:
    Use different PriorityQueues for different sites
    Instead of returning top-n results of the whole data set, return top-n results per site
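
The collector logic described on this slide could look roughly like the following plain-Java sketch. It does not extend Lucene's HitCollector; the class PerSiteTopN, its Hit type and the siteHash parameter (standing in for the value that would be read from the payload) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Keeps the n best-scoring hits separately for each site (facet value). */
class PerSiteTopN {
    /** A scored hit; a plain class rather than any Lucene type. */
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Lowest score first, so the head of each queue is the weakest hit kept so far.
    private static final Comparator<Hit> WORST_FIRST = (a, b) -> Float.compare(a.score, b.score);

    private final int n;
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) { this.n = n; }

    /** Adds a hit to the queue of its site and evicts the weakest hit if the queue overflows. */
    void collect(int docId, float score, int siteHash) {
        PriorityQueue<Hit> queue =
                queues.computeIfAbsent(siteHash, s -> new PriorityQueue<Hit>(WORST_FIRST));
        queue.add(new Hit(docId, score));
        if (queue.size() > n) queue.poll();
    }

    /** Returns the doc IDs kept for a site, best score first. */
    List<Integer> top(int siteHash) {
        List<Integer> ids = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(siteHash);
        if (queue == null) return ids;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(WORST_FIRST.reversed());
        for (Hit h : hits) ids.add(h.docId);
        return ids;
    }
}
```

Each queue is bounded at n entries, so memory grows with the number of sites rather than with the number of hits.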

29 Example: (Very) Simple facet counting
Summary
    In this example the facet (site) is used for scoring, but the approach can be extended to facet counting
    Good performance due to locality of facet values

30 Example: Efficient Numeric Search
Use case: Find documents that have a numeric value in a specific range, e.g. all docs with a date >2006 and <2007
Currently in Lucene: RangeQuery
    Store all values in the dictionary
    Query expansion

31 Example: Efficient Numeric Search
Dictionary terms, one posting list each: 01/01/2006, 01/02/2006, 01/04/2006, ..., 12/30/2006
Query: [01/05/2006 TO 11/25/2006]
Problem: A large number of posting lists have to be processed

32 Example: Efficient Numeric Search
Idea: Index a special term, e.g. 'numeric:date', and store the actual value in a Payload for each doc
Problem: The posting list can become very big -> the entire list has to be processed
Solution: Hybrid approach

33 Example: Efficient Numeric Search
Dictionary terms, one posting list each: date:01/2006, date:02/2006, ..., date:12/2006
Store day in payload
Store position where date occurred
[Figure: posting lists with Document IDs, Positions and Payloads]

34 Example: Efficient Numeric Search
Tradeoff between number of posting lists to process and size of posting lists
Significant speedup possible with good choice of chunk size
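
The hybrid approach can be sketched in plain Java as follows, assuming a chunk size of one month: each month is one dictionary term, and the day of month plays the role of the payload. The class MonthChunkedDateIndex and its methods are hypothetical and do not use Lucene; the sketch only shows which lists a range query has to touch.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy hybrid date index: one posting list per month, day of month kept as the payload. */
class MonthChunkedDateIndex {
    // month term -> postings, each posting stored as {docId, dayOfMonth}
    private final TreeMap<YearMonth, List<int[]>> postings = new TreeMap<>();
    private int listsVisited; // number of posting lists touched by the last query

    void add(int docId, LocalDate date) {
        postings.computeIfAbsent(YearMonth.from(date), m -> new ArrayList<>())
                .add(new int[] {docId, date.getDayOfMonth()});
    }

    /** Returns doc IDs with from <= date <= to, visiting only the month lists in range. */
    List<Integer> range(LocalDate from, LocalDate to) {
        List<Integer> hits = new ArrayList<>();
        listsVisited = 0;
        if (from.isAfter(to)) return hits;
        YearMonth first = YearMonth.from(from);
        YearMonth last = YearMonth.from(to);
        for (Map.Entry<YearMonth, List<int[]>> e
                : postings.subMap(first, true, last, true).entrySet()) {
            listsVisited++;
            YearMonth month = e.getKey();
            // Months strictly inside the range match entirely; only boundary months need the payload.
            boolean boundary = month.equals(first) || month.equals(last);
            for (int[] posting : e.getValue()) {
                LocalDate date = month.atDay(posting[1]);
                if (!boundary || (!date.isBefore(from) && !date.isAfter(to))) {
                    hits.add(posting[0]);
                }
            }
        }
        return hits;
    }

    int listsVisited() { return listsVisited; }
}
```

With day-granular terms the same query would have to process one posting list per distinct day in the range; with a single 'numeric:date' term it would have to scan one very long list. Month-sized chunks sit between those two extremes, which is the tradeoff the slide describes.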

35 Conclusion
Payloads offer great flexibility
Payloads are stored very space-efficiently
Sophisticated data structures enable efficient skipping over payloads
Payloads should be used whenever special data is required for finding hits and scoring

36 Outlook
Finalize API (currently Beta)
Add more out-of-the-box query types
Per-document Payloads (updateable)
FieldCache implementation that uses Payloads

37 Questions?

