Lucid Imagination, Inc. – Big, Bigger, Biggest. Large-scale issues: phrase queries and common words; OCR. Tom Burton-West.

Presentation transcript:

Lucid Imagination, Inc.
Big, Bigger, Biggest
Large-scale issues: phrase queries and common words; OCR
Tom Burton-West, HathiTrust Project

Hathi Trust Large Scale Search Challenges
- Goal: design a full-text search system that will scale from 5 million to 20 million volumes at a reasonable cost.
- Challenges:
  - Must scale to 20 million full-text volumes
  - Very long documents compared to most large-scale search applications
  - Multilingual collection
  - OCR quality varies

Index Size, Caching, and Memory
- Our documents average about 300 pages, which is about 700 KB of OCR text.
- Our 5 million document index is between 2 and 3 terabytes.
- About 300 GB per million documents.
- A large index means disk I/O is the bottleneck.
- Tradeoff: JVM memory vs. OS memory
  - Solr uses OS memory (the disk I/O cache) for caching postings.
  - Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming).
- Fitting the entire index in memory is not feasible with a terabyte-size index.

Response time varies with query
- Average: 673
- Median: 91
- 90th percentile: …
- 99th percentile: 7,504
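The gap between the average and the median above is what a long tail of slow queries produces. As a minimal sketch of how such summary figures might be computed from a query log, the following uses made-up response times; the `summarize` helper and the sample values are hypothetical and are not taken from the presentation.

```python
import statistics

def summarize(times_ms):
    """Return the average, median, 90th and 99th percentile of response times."""
    ordered = sorted(times_ms)
    # quantiles(n=100) returns 99 cut points: index 89 is the 90th
    # percentile and index 98 is the 99th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": cuts[89],
        "p99": cuts[98],
    }

# Made-up response times: mostly fast, with a slow tail.
sample = [40, 55, 60, 75, 91, 110, 150, 400, 1200, 9000]
stats = summarize(sample)
print(stats)
```

A few very slow queries pull the average far above the median, which is why the percentiles are more informative than the average alone.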

Slowest 5% of queries
- The slowest 5% of queries took about 1 second or longer.
- The slowest 1% of queries took between 10 seconds and 2 minutes.
- The slowest 0.5% of queries took between 30 seconds and 2 minutes.
- These queries affect the response time of other queries:
  - Cache pollution
  - Contention for resources
- The slowest queries are phrase queries containing common words.

Query processing
- Phrase queries use the position index (Boolean queries do not).
- The position index accounts for 85% of index size.
- The position list for a common word such as "the" can be many GB in size.
- This causes lots of disk I/O.
- Solr depends on the operating system's disk cache to reduce disk I/O for words that occur in more than one query.
- I/O from phrase queries containing common words pollutes the cache.
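To illustrate why phrase queries touch far more data than Boolean queries, here is a toy in-memory positional index. It is only a sketch: the function names and sample documents are hypothetical, and Lucene's on-disk structures are considerably more elaborate.

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Documents containing every term; only document ids are consulted."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def phrase(index, terms):
    """Documents containing the terms as a consecutive phrase.

    Unlike boolean_and, this must read the position list of every term.
    """
    hits = set()
    for doc_id in boolean_and(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

docs = {
    1: "the beat generation",
    2: "the generation that beat the odds",
}
index = build_index(docs)
print(boolean_and(index, ["beat", "generation"]))  # both documents match
print(phrase(index, ["beat", "generation"]))       # only document 1 matches
```

For a word like "the", the position list has one entry per occurrence, so the phrase path reads vastly more data than the Boolean path.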

Slow Queries
- Slowest test query: "the lives and literature of the beat generation" took 2 minutes.
- 4 MB of data read for the Boolean query.
- 9,000+ MB read for the phrase query.

WORD       | NUMBER OF DOCUMENTS | POSTINGS LIST (SIZE MB) | TOTAL TERM OCCURRENCES (MILLIONS) | POSITION LIST (SIZE MB)
the        | 800,…               | …                       | …                                 | …,351
of         | 892,…               | …                       | …                                 | …,795
and        | 769,…               | …                       | …                                 | …,870
literature | 435,…               | …                       | …                                 | …
generation | 414,…               | …                       | …                                 | …
lives      | 432,…               | …                       | …                                 | …
beat       | 278,…               | …                       | …                                 | …
TOTAL      | …                   | …                       | …                                 | …,036

Why not use Stop Words?
- The word "the" occurs more than 4 billion times in our 1 million document index.
- Removing stop words ("the", "of", etc.) is not desirable for our use cases.
- We couldn't search for many phrases:
  - "to be or not to be"
  - "the who"
  - "man in the moon" vs. "man on the moon"
- Stop words in one language are content words in another language:
  - The German stop words "war" and "die" are content words in English.
  - The English stop words "is" and "by" are content words ("ice" and "village") in Swedish.

CommonGrams
- Ported the Nutch CommonGrams algorithm to Solr.
- Create bigrams selectively for any two-word sequence containing common terms.
- Slowest query: "the lives and literature of the beat generation"
  - Query tokens: "the-lives" "lives-and" "and-literature" "literature-of" "of-the" "the-beat" "generation"
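The bigram selection can be sketched in a few lines. This is an illustrative reimplementation, not the Nutch or Solr code. The common-word list, the "-" separator (chosen to match the slide), and the function names are assumptions. The index side keeps every single word and adds a bigram wherever either neighbor is common. The query side emits a single word only when no bigram covers it.

```python
COMMON = {"the", "of", "and"}  # hypothetical common-word list

def index_tokens(words, common=COMMON, sep="-"):
    """Index side: every word, plus a bigram for each pair with a common word."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 < len(words) and (word in common or words[i + 1] in common):
            out.append(word + sep + words[i + 1])
    return out

def query_tokens(words, common=COMMON, sep="-"):
    """Query side: bigrams where possible, single words only if uncovered."""
    def has_bigram(i):
        # True when words[i] and words[i + 1] form a common-word bigram.
        return 0 <= i < len(words) - 1 and (
            words[i] in common or words[i + 1] in common
        )

    out = []
    for i, word in enumerate(words):
        if has_bigram(i):
            out.append(word + sep + words[i + 1])
        elif not has_bigram(i - 1):
            out.append(word)
    return out

query = "the lives and literature of the beat generation".split()
print(query_tokens(query))
```

With this word list, the query side yields the same seven tokens shown on the slide: each common word is looked up only as part of a much rarer bigram, so its huge position list is never read.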

Standard index vs. CommonGrams

Standard Index
WORD       | TOTAL OCCURRENCES IN CORPUS (MILLIONS) | NUMBER OF DOCS (THOUSANDS)
the        | 2,…                                    | …
of         | 1,…                                    | …
and        | …                                      | …
literature | 4                                      | 210
lives      | 2                                      | 194
generation | 2                                      | 199
beat       | …                                      | …
TOTAL      | 4,176                                  |

Common Grams
TERM           | TOTAL OCCURRENCES IN CORPUS (MILLIONS) | NUMBER OF DOCS (THOUSANDS)
of-the         | …                                      | …
generation     | …                                      | …
the-lives      | …                                      | …
literature-of  | …                                      | …
lives-and      | …                                      | …
and-literature | …                                      | …
the-beat       | …                                      | …
TOTAL          | 450                                    |

Comparison of Response time (ms)

               | AVERAGE | MEDIAN | 90th | 99th  | SLOWEST QUERY
Standard Index | …       | …      | …    | …,784 | 120,595
Common Grams   | …       | …      | …    | …,226 | 7,800

Other issues: Analyze your slowest queries
- We analyzed the slowest queries from our query logs and discovered additional common words to add to our list.
- We used the Solr Admin panel to run the slowest queries from our logs with the debug flag checked.
- We discovered that words such as "l'art" were being split into two-token phrase queries.
- We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
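The token-splitting problem is easy to reproduce outside Solr. The sketch below is not the analyzer used in the project. It contrasts two hypothetical regex tokenizers: one breaks on every punctuation mark, the other keeps an apostrophe inside a word.

```python
import re

def split_on_punctuation(text):
    """Hypothetical tokenizer that treats every apostrophe as a word break."""
    return re.findall(r"\w+", text.lower())

def keep_apostrophes(text):
    """Hypothetical tokenizer that keeps apostrophes between word characters."""
    return re.findall(r"\w+(?:'\w+)*", text.lower())

print(split_on_punctuation("l'art"))  # two tokens, so the query becomes a phrase query
print(keep_apostrophes("l'art"))      # one token, so the query stays a term query
```

With the first tokenizer, a single-word query silently turns into a two-token phrase query and pays the position-list cost described earlier.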

Other issues: We broke Solr … temporarily
- Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms.
- The Solr/Lucene index was limited to 2.1 billion unique terms.
- Patched: the limit is now 274 billion.
- Dirty OCR is difficult to remove without also removing good words.
- Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
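The slide does not say where the two limits come from. One reading that fits both figures is an assumption here, not something the presentation states: the old limit is the largest signed 32-bit integer, and the patched limit is that value multiplied by a term index interval of 128. The arithmetic can be checked directly:

```python
INT_MAX = 2**31 - 1          # largest signed 32-bit integer
TERM_INDEX_INTERVAL = 128    # assumed interval; not stated on the slide

old_limit = INT_MAX
new_limit = INT_MAX * TERM_INDEX_INTERVAL

print(f"{old_limit:,}")  # about 2.1 billion
print(f"{new_limit:,}")  # about 274 billion
```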