Big, Bigger, Biggest. Large scale issues: Phrase queries and common words; OCR. Tom Burton West.


1 Lucid Imagination, Inc. – http://www.lucidimagination.com
Big, Bigger, Biggest
Large scale issues: Phrase queries and common words; OCR
Tom Burton West, Hathi Trust Project

2 Hathi Trust Large Scale Search Challenges
Goal: Design a system for full-text search that will scale from 5 million to 20 million volumes at a reasonable cost.
Challenges:
- Must scale to 20 million full-text volumes
- Very long documents compared to most large-scale search applications
- Multilingual collection
- OCR quality varies

3 Index Size, Caching, and Memory
- Our documents average about 300 pages, which is about 700 KB of OCR text.
- Our 5 million document index is between 2 and 3 terabytes.
- About 300 GB per million documents.
- A large index means disk I/O is the bottleneck.
- Tradeoff: JVM memory vs. OS memory
  - Solr uses OS memory (disk I/O caching) for caching of postings.
  - Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming).
- Fitting the entire index in memory is not feasible with a terabyte-size index.

4 Response time varies with query
Average: 673 Median: 91 90th: 328 99th: 7,504
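
Summary figures like the ones on this slide can be derived from a query log. The following is a minimal sketch, assuming latencies are available as a list of milliseconds and using the nearest-rank definition of a percentile (the slide does not say which definition was used). The function names and the sample data are hypothetical.

```python
import math
import statistics

def nearest_rank_percentile(sorted_values, pct):
    """Return the value at the given percentile using the nearest-rank method."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarize(latencies_ms):
    """Compute average, median, 90th and 99th percentile of query latencies."""
    values = sorted(latencies_ms)
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "p90": nearest_rank_percentile(values, 90),
        "p99": nearest_rank_percentile(values, 99),
    }

# Hypothetical log: mostly fast queries plus a few very slow ones.
sample = [10] * 90 + [100] * 9 + [10000]
summary = summarize(sample)
print(summary)
```

With a long-tailed distribution like this, a handful of slow queries pulls the average far above the median, which is the same pattern the slide reports.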

5 Slowest 5% of queries
- The slowest 5% of queries took about 1 second or longer.
- The slowest 1% of queries took between 10 seconds and 2 minutes.
- The slowest 0.5% of queries took between 30 seconds and 2 minutes.
- These queries affect the response time of other queries:
  - Cache pollution
  - Contention for resources
- The slowest queries are phrase queries containing common words.

6 Query processing
- Phrase queries use the position index (Boolean queries do not).
- The position index accounts for 85% of index size.
- The position list for a common word such as "the" can be many GB in size.
- This causes a lot of disk I/O.
  - Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query.
  - I/O from phrase queries containing common words pollutes the cache.
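
The difference between the two query types can be illustrated with a toy in-memory positional index. This is a sketch only, not how Lucene stores or reads its index: the function names and the two sample documents are made up for illustration. It shows that a Boolean AND needs only the document lists, while a phrase match must also walk the position lists of every term.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]} (a toy positional index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Documents containing every term; only document ids are consulted."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def phrase(index, terms):
    """Documents where the terms occur consecutively; positions must be read."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + k in index[t][doc_id] for k, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits

# Hypothetical two-document collection.
docs = {
    1: "the lives and literature of the beat generation",
    2: "the generation of the beat and the lives of literature",
}
index = build_index(docs)
print(boolean_and(index, ["the", "beat", "generation"]))
print(phrase(index, ["the", "beat", "generation"]))
```

Both documents satisfy the Boolean query, but only the first contains the phrase, and deciding that requires reading the positions of "the", which on a real collection is the expensive part.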

7 Slow Queries
- Slowest test query: "the lives and literature of the beat generation" took 2 minutes.
- 4 MB of data read for the Boolean query.
- 9,000+ MB read for the phrase query.

WORD         NUMBER OF    POSTINGS LIST   TOTAL TERM OCCURRENCES   POSITION LIST
             DOCUMENTS    (SIZE MB)       (MILLIONS)               (SIZE MB)
the          800,000      0.8             4,351                    4,351
of           892,000      0.89            2,795                    2,795
and          769,000      0.77            1,870                    1,870
literature   435,000      0.44            9                        9
generation   414,000      0.41            5                        5
lives        432,000      0.43            5                        5
beat         278,000      0.28            1                        1
TOTAL                     4.02                                     9,036
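
The two totals on this slide can be checked by summing the per-word sizes. The sketch below uses the postings-list and position-list sizes as read from the slide's table; the variable names are illustrative only.

```python
# Sizes in MB, as read from the table on the slide.
postings_mb = {"the": 0.8, "of": 0.89, "and": 0.77, "literature": 0.44,
               "generation": 0.41, "lives": 0.43, "beat": 0.28}
position_mb = {"the": 4351, "of": 2795, "and": 1870, "literature": 9,
               "generation": 5, "lives": 5, "beat": 1}

# A Boolean query reads only the postings lists.
boolean_read_mb = round(sum(postings_mb.values()), 2)

# A phrase query additionally reads the position lists.
position_total_mb = sum(position_mb.values())

# Share of the position data that belongs to the three common words.
common_share = sum(position_mb[w] for w in ("the", "of", "and")) / position_total_mb

print(boolean_read_mb, position_total_mb, round(common_share, 3))
```

Nearly all of the data read for the phrase query comes from the position lists of "the", "of", and "and".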

8 Why not use Stop Words?
- The word "the" occurs more than 4 billion times in our 1 million document index.
- Removing stop words ("the", "of", etc.) is not desirable for our use cases.
- We couldn't search for many phrases:
  - "to be or not to be"
  - "the who"
  - "man in the moon" vs. "man on the moon"
- Stop words in one language are content words in another language:
  - German stop words "war" and "die" are content words in English.
  - English stop words "is" and "by" are content words ("ice" and "village") in Swedish.
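
The effect of stop word removal on the example phrases can be sketched in a few lines. The stop list here is a small assumed set chosen for illustration, not the list of any particular analyzer, and `remove_stop_words` is a hypothetical helper.

```python
# Assumed stop list, for illustration only.
STOP_WORDS = {"the", "of", "to", "be", "or", "not", "in", "on", "is", "by"}

def remove_stop_words(query):
    """Lowercase, split on whitespace, and drop any token in the stop list."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

for q in ("to be or not to be", "the who",
          "man in the moon", "man on the moon"):
    print(repr(q), "->", remove_stop_words(q))
```

Under this assumed list, "to be or not to be" loses every token, "the who" is reduced to a single word, and the two "moon" phrases become indistinguishable.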

9 CommonGrams
- Ported the Nutch CommonGrams algorithm to Solr.
- Create bigrams selectively for any two-word sequence containing common terms.
- Slowest query: "The lives and literature of the beat generation"
- Resulting tokens: "the-lives" "lives-and" "and-literature" "literature-of" "of-the" "the-beat" "generation"
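
The query-side tokenization shown on this slide can be approximated as follows. This is a simplified sketch of the idea, not the Nutch or Solr implementation: the common-word list is an assumption, the `-` separator follows the slide, and the function name is hypothetical. A bigram is emitted for every adjacent pair in which at least one word is common, and a single word is kept only when it is not part of any bigram.

```python
# Assumed common-word list, for illustration only.
COMMON_WORDS = {"the", "of", "and"}

def common_grams_query(tokens, common=COMMON_WORDS, sep="-"):
    """Turn a phrase into bigrams around common words, keeping uncovered words."""
    out = []
    n = len(tokens)
    for i, tok in enumerate(tokens):
        # Does this token form a bigram with its right-hand neighbour?
        joins_next = i + 1 < n and (tok in common or tokens[i + 1] in common)
        # Is this token already covered by a bigram with its left-hand neighbour?
        joins_prev = i > 0 and (tokens[i - 1] in common or tok in common)
        if joins_next:
            out.append(tok + sep + tokens[i + 1])
        elif not joins_prev:
            out.append(tok)
    return out

query = "the lives and literature of the beat generation"
print(common_grams_query(query.split()))
```

For the slide's query this yields the same seven tokens listed above: "beat" disappears as a separate token because it is covered by "the-beat", while "generation" survives because neither it nor "beat" is a common word.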

10 Standard index vs. CommonGrams

Standard Index
WORD         TOTAL OCCURRENCES IN CORPUS (MILLIONS)   NUMBER OF DOCS (THOUSANDS)
the          2,013                                    386
of           1,299                                    440
and          855                                      376
literature   4                                        210
lives        2                                        194
generation   2                                        199
beat         0.6                                      130
TOTAL        4,176

Common Grams
TERM             TOTAL OCCURRENCES IN CORPUS (MILLIONS)   NUMBER OF DOCS (THOUSANDS)
of-the           446                                      396
generation       2.42                                     262
the-lives        0.36                                     128
literature-of    0.35                                     103
lives-and        0.25                                     115
and-literature   0.24                                     77
the-beat         0.06                                     26
TOTAL            450

11 Comparison of Response time (ms)

                 AVERAGE   MEDIAN   90th   99th    SLOWEST QUERY
Standard Index   459       32       146    6,784   120,595
Common Grams     68        3        71     2,226   7,800
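
To make the comparison easier to read, the ratio between the two rows can be computed per column. The sketch below assumes the values are 459, 32, 146, 6,784 and 120,595 ms for the standard index and 68, 3, 71, 2,226 and 7,800 ms for CommonGrams, which is how the slide's table reads; the dictionary names are illustrative.

```python
# Response times in milliseconds, as read from the slide's table.
standard = {"average": 459, "median": 32, "90th": 146,
            "99th": 6784, "slowest": 120595}
common_grams = {"average": 68, "median": 3, "90th": 71,
                "99th": 2226, "slowest": 7800}

# How many times faster CommonGrams is for each statistic.
speedup = {k: round(standard[k] / common_grams[k], 1) for k in standard}

for name, factor in speedup.items():
    print(f"{name}: {factor}x")
```

On these figures the largest gain is on the slowest query, which drops from about two minutes to under eight seconds.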

12 Other issues
Analyze your slowest queries
- We analyzed the slowest queries from our query logs and discovered additional common words to add to our list.
- We used the Solr Admin panel to run the slowest queries from our logs with the debug flag checked.
- We discovered that words such as "l'art" were being split into two-token phrase queries.
- We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.

13 Other issues
We broke Solr … temporarily
- Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms.
- Solr/Lucene index size was limited to 2.1 billion unique terms.
- Patched: now it's 274 billion.
- Dirty OCR is difficult to remove without removing good words.
- Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
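
The slide does not say where the two limits come from. One plausible reading, offered here as an assumption rather than something the slide states, is that the old limit is the largest signed 32-bit integer and the new one is that range multiplied by a term index interval of 128. The arithmetic is at least consistent with the rounded figures on the slide:

```python
# Largest value of a signed 32-bit integer.
INT32_MAX = 2**31 - 1

# Assumed term index interval (one index entry per 128 terms); this is an
# assumption used to explain the slide's figure, not something it states.
TERM_INDEX_INTERVAL = 128

old_limit = INT32_MAX
new_limit = 2**31 * TERM_INDEX_INTERVAL

print(f"{old_limit:,}")   # about 2.1 billion
print(f"{new_limit:,}")   # about 274 billion
```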

