Lucid Imagination, Inc. – Big, Bigger, Biggest
Large-scale issues: phrase queries and common words; OCR
Tom Burton West, Hathi Trust Project
Hathi Trust Large Scale Search Challenges
Goal: design a system for full-text search that will scale to 5 million to 20 million volumes at a reasonable cost.
Challenges: the system must scale to 20 million full-text volumes; documents are very long compared to most large-scale search applications; the collection is multilingual; OCR quality varies.
Index Size, Caching, and Memory
Our documents average about 300 pages, which is about 700 KB of OCR text. Our 5 million document index is between 2 and 3 terabytes. About 300 GB per million documents. A large index means disk I/O is the bottleneck. There is a tradeoff between JVM memory and OS memory: Solr uses OS memory (the disk I/O cache) for caching postings. The memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming). Fitting the entire index in memory is not feasible with a terabyte-size index.
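To get a feel for the 20 million volume target, the sizes quoted above can be projected linearly. This is only a back-of-the-envelope sketch: it assumes index size grows linearly with document count and uses decimal terabytes (1 TB = 1000 GB), and the function name is made up for illustration. It runs the projection twice, once with the quoted 300 GB per million documents and once with the rate implied by "2 to 3 TB for 5 million documents".

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def projected_index_tb(million_docs, gb_per_million):
    """Linear projection of index size in TB (linear growth is an assumption)."""
    return million_docs * gb_per_million / GB_PER_TB


# Rate quoted on the slide.
stated_rate_gb = 300

# Rates implied by "2 to 3 TB for 5 million documents".
implied_low_gb = 2 * GB_PER_TB / 5
implied_high_gb = 3 * GB_PER_TB / 5

for label, rate in [
    ("stated rate", stated_rate_gb),
    ("implied rate (low)", implied_low_gb),
    ("implied rate (high)", implied_high_gb),
]:
    print(f"{label}: {rate:.0f} GB per million docs -> "
          f"{projected_index_tb(20, rate):.1f} TB at 20 million docs")
```

Under either rate the projected index is several terabytes, which supports the point that holding the whole index in memory is not a realistic option.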
Response time varies with query
Average: 673; Median: 91; 90th: th: 7,504
Slowest 5% of queries
The slowest 5% of queries took about 1 second or longer. The slowest 1% of queries took between 10 seconds and 2 minutes. The slowest 0.5% of queries took between 30 seconds and 2 minutes. These queries affect the response time of other queries through cache pollution and contention for resources. The slowest queries are phrase queries containing common words.
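Summary figures like these can be derived from a list of per-query response times. The following is a minimal sketch, not the tooling used for the talk: the function names and the sample timings are invented for illustration, and it assumes the times have already been extracted from the query log in milliseconds.

```python
import statistics


def summarize(times_ms):
    """Return average, median, and 90th/99th percentiles of response times."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(times_ms, n=100, method="inclusive")
    return {
        "average": statistics.fmean(times_ms),
        "median": statistics.median(times_ms),
        "p90": cuts[89],
        "p99": cuts[98],
    }


def slowest(times_ms, fraction):
    """Return the slowest `fraction` of queries (at least one), slowest first."""
    count = max(1, int(len(times_ms) * fraction))
    return sorted(times_ms, reverse=True)[:count]


# Made-up timings: mostly fast queries with a long tail of slow ones.
sample_ms = [40, 55, 60, 75, 90, 95, 110, 150, 300, 120000]
print(summarize(sample_ms))
print(slowest(sample_ms, 0.05))
```

A large gap between the average and the median, as on the slides, is the usual sign that a small number of very slow queries dominate the total time.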
Query processing
Phrase queries use the position index (Boolean queries do not). The position index accounts for 85% of index size. The position list for a common word such as "the" can be many GB in size, which causes lots of disk I/O. Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query. I/O from phrase queries containing common words pollutes the cache.
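The difference between the two query types can be seen in a toy positional index. This is an illustrative sketch only, not how Lucene stores or reads its postings; the function names and the two sample documents are invented. The Boolean query only needs to know which documents contain each term, while the phrase query also has to walk the position lists.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [positions]} for whitespace-tokenized text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Documents containing every term; positions are never consulted."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase(index, terms):
    """Documents where the terms occur consecutively; reads position lists."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(
            all(start + k in index[terms[k]][doc_id] for k in range(1, len(terms)))
            for start in starts
        ):
            hits.add(doc_id)
    return hits


docs = {1: "the beat generation", 2: "generation of the beat"}
index = build_index(docs)
print(boolean_and(index, ["beat", "generation"]))
print(phrase(index, ["beat", "generation"]))
```

For a word like "the", the position list has one entry per occurrence in the whole corpus, which is why the phrase path is so much more expensive than the Boolean path.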
Slow Queries
Slowest test query: "the lives and literature of the beat generation" took 2 minutes. 4 MB of data was read for the Boolean query; 9,000+ MB was read for the phrase query.
WORD, NUMBER OF DOCUMENTS, POSTINGS LIST (SIZE MB), TOTAL TERM OCCURRENCES (MILLIONS), POSITION LIST (SIZE MB)
the 800, ,351
of 892, ,795
and 769, ,870
literature 435,
generation 414,
lives 432,
beat 278,
TOTAL ,036
Why not use Stop Words?
The word "the" occurs more than 4 billion times in our 1 million document index. Removing stop words ("the", "of", etc.) is not desirable for our use cases. We couldn't search for many phrases: "to be or not to be", "the who", "man in the moon" vs. "man on the moon". Stop words in one language are content words in another language: the German stop words "war" and "die" are content words in English, and the English stop words "is" and "by" are content words ("ice" and "village") in Swedish.
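The effect of stop word removal on these phrases is easy to reproduce. The sketch below uses a small hypothetical stop list chosen for illustration (it is not Lucene's or any library's default list), and the function name is invented.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "of", "to", "be", "or", "not", "in", "on", "who"}


def remove_stop_words(phrase, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [tok for tok in phrase.lower().split() if tok not in stop_words]


for query in ["to be or not to be", "the who",
              "man in the moon", "man on the moon"]:
    print(repr(query), "->", remove_stop_words(query))
```

With this list, the first two phrases lose every token, and the last two become indistinguishable from each other.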
CommonGrams
We ported the Nutch CommonGrams algorithm to Solr. It creates bigrams selectively for any two-word sequence containing common terms.
Slowest query: "The lives and literature of the beat generation" becomes: the-lives lives-and and-literature literature-of of-the the-beat generation
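The token list on this slide can be reproduced with a short sketch of the idea. This is not the Nutch or Solr code: the common word list, the "-" separator (taken from the slide), and the function names are assumptions for illustration. The index side is assumed to keep the original words alongside the bigrams, while the query side emits a bigram wherever one exists and keeps a single word only when no bigram covers it.

```python
# Hypothetical common word list; a real list would be built from the corpus.
COMMON_WORDS = {"the", "and", "of"}


def _starts_bigram(tokens, i, common):
    """True if tokens[i] and tokens[i + 1] form a pair with a common word."""
    return i + 1 < len(tokens) and (tokens[i] in common or tokens[i + 1] in common)


def index_tokens(tokens, common=COMMON_WORDS, sep="-"):
    """Index side: keep every word and add a bigram for each qualifying pair."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if _starts_bigram(tokens, i, common):
            out.append(tok + sep + tokens[i + 1])
    return out


def query_tokens(tokens, common=COMMON_WORDS, sep="-"):
    """Query side: prefer bigrams; keep a word only if no bigram covers it."""
    out = []
    covered = False  # was this token the second half of the previous bigram?
    for i, tok in enumerate(tokens):
        starts = _starts_bigram(tokens, i, common)
        if starts:
            out.append(tok + sep + tokens[i + 1])
        elif not covered:
            out.append(tok)
        covered = starts
    return out


query = "the lives and literature of the beat generation".split()
print(" ".join(query_tokens(query)))
```

The phrase query then runs against much rarer terms such as "of-the" instead of "the" and "of" on their own, so far shorter position lists have to be read.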
Standard index vs. CommonGrams
Standard Index
WORD, TOTAL OCCURRENCES IN CORPUS (MILLIONS), NUMBER OF DOCS (THOUSANDS)
the 2,
of 1,
and
literature 4210
lives 2194
generation 2199
beat
TOTAL 4,176
Common Grams
TERM, TOTAL OCCURRENCES IN CORPUS (MILLIONS), NUMBER OF DOCS (THOUSANDS)
of-the
generation
the-lives
literature-of
lives-and
and-literature
the-beat
TOTAL 450
Comparison of Response time (ms)
AVERAGE, MEDIAN, 90th, 99th, SLOWEST QUERY
Standard Index ,784120,595
Common Grams ,2267,800
Other issues
Analyze your slowest queries. We analyzed the slowest queries from our query logs and discovered additional common words to be added to our list. We used the Solr Admin panel to run our slowest queries from our logs with the debug flag checked. We discovered that words such as "l'art" were being split into two-token phrase queries. We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
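The splitting described above can be reproduced with a toy analyzer. The sketch assumes the affected words contain an apostrophe (for example l'art) and that the analyzer breaks tokens on punctuation; the regular expression and function names are invented for illustration and are not taken from Solr.

```python
import re


def naive_analyze(text):
    """Lowercase and keep runs of letters, breaking on any punctuation."""
    return re.findall(r"[^\W\d_]+", text.lower())


def becomes_phrase_query(word):
    """A single query word that yields several tokens is run as a phrase."""
    return len(naive_analyze(word)) > 1


for word in ["l'art", "literature"]:
    print(word, "->", naive_analyze(word), becomes_phrase_query(word))
```

When one of the resulting tokens is itself a very common term, the user's single-word query silently turns into an expensive phrase query.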
Other issues
We broke Solr … temporarily. Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms. The Solr/Lucene index size was limited to 2.1 billion unique terms. Patched: now it's 274 billion. Dirty OCR is difficult to remove without removing good words. Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
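The two limits quoted above are consistent with a signed 32-bit counter and with the same counter multiplied by 128. Reading the factor of 128 as a term index interval is an assumption made here, not something the slide states, so the arithmetic below is only a plausibility check.

```python
INT32_MAX = 2**31 - 1      # largest signed 32-bit integer
ASSUMED_INTERVAL = 128     # assumption: one index entry per 128 terms

old_limit = INT32_MAX
new_limit = INT32_MAX * ASSUMED_INTERVAL
unique_terms = 2_400_000_000  # "over 2.4 billion unique terms"

print(f"old limit: {old_limit:,}")
print(f"new limit: {new_limit:,}")
print("fits under old limit:", unique_terms <= old_limit)
print("fits under new limit:", unique_terms <= new_limit)
```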