Qinqing Gan, Torsten Suel: Improved Techniques for Result Caching in Web Search Engines. Presenter: Arghyadip ● Konark
Summary: Result caching in web search engines. (1) Query result caching improves the query processing performance of search engines. (2) It increases the effective throughput of the entire search engine system. (3) Discussion of various weighted, unweighted, and hybrid query result caching techniques. (4) Performance evaluation.
Query Processing: The main challenge for query processing is the significant size of the index data for a query. Need to optimize to scale with users and data. Caching is one such optimization. Result caching: has the query occurred before? List caching: has the index data for a term been accessed before?
Related Work: A number of subsequent papers on result caching (cache hits only): Baeza-Yates et al. (SPIRE 2003, 2007, SIGIR 2003); Fagni et al. (TOIS 2006); Lempel/Moran (WWW 2003); Saraiva et al. (SIGIR 2001); Xie/O'Hallaron (Infocom 2002). Fagni et al. propose hybrid methods that combine a dynamic cache with a more static cache. Baeza-Yates et al. (SPIRE 2007) use some features for a cache admission policy.
Caching Basics: LRU: least recently used. LFU: least frequently used. Both can be implemented using basic data structures. The score is defined as the time of the last occurrence of the same query in LRU, or the frequency of the query in LFU. Evict the query with the smallest score. Recency (LRU) vs. frequency (LFU). Various hybrids combine two or more policies.
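The two eviction policies can be sketched with standard containers. This is a minimal illustration, not code from the paper: the class names `LRUCache` and `LFUCache` and their `get`/`put` methods are hypothetical, and the LFU variant assumes frequency is counted over all requests seen so far, including misses.

```python
from collections import Counter, OrderedDict


class LRUCache:
    """Evicts the query whose last occurrence is the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> result, least recent first

    def get(self, query):
        if query not in self.entries:
            return None  # cache miss
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query, result):
        if query in self.entries:
            self.entries.move_to_end(query)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used
        self.entries[query] = result


class LFUCache:
    """Evicts the cached query with the lowest observed frequency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}
        self.freq = Counter()  # assumption: counts every request, hit or miss

    def get(self, query):
        self.freq[query] += 1
        return self.entries.get(query)

    def put(self, query, result):
        if query not in self.entries and len(self.entries) >= self.capacity:
            # Linear scan for the smallest score; a heap would scale better.
            victim = min(self.entries, key=lambda q: self.freq[q])
            del self.entries[victim]
        self.entries[query] = result
```

A caller would typically call `get` first and, on a `None` result, compute the answer and `put` it.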
SDC (Static and Dynamic Caching): [Figure: cache divided into an LFU part and an LRU part, with alpha = 0.7.] Fagni et al. (TOIS 2006)
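A rough sketch of the static-plus-dynamic idea follows. It rests on several assumptions that the slide does not confirm: the static part is filled once with the most frequent queries of a training log, the dynamic part is managed by LRU, and alpha = 0.7 is taken to be the static share of the cache. The names `SDCCache`, `lookup`, and `static_fraction` are hypothetical.

```python
from collections import Counter, OrderedDict

_MISSING = object()  # marks a static slot whose result is not computed yet


class SDCCache:
    def __init__(self, capacity, training_queries, static_fraction=0.7):
        # Assumption: static_fraction plays the role of alpha on the slide.
        static_size = round(capacity * static_fraction)
        self.dynamic_size = capacity - static_size
        top = Counter(training_queries).most_common(static_size)
        self.static = {query: _MISSING for query, _ in top}
        self.dynamic = OrderedDict()  # LRU order, least recent first

    def lookup(self, query, compute):
        """Return (result, was_hit); compute(query) is called on a miss."""
        if query in self.static:
            if self.static[query] is _MISSING:
                self.static[query] = compute(query)
                return self.static[query], False
            return self.static[query], True
        if query in self.dynamic:
            self.dynamic.move_to_end(query)
            return self.dynamic[query], True
        result = compute(query)
        if self.dynamic_size > 0:
            if len(self.dynamic) >= self.dynamic_size:
                self.dynamic.popitem(last=False)  # evict least recently used
            self.dynamic[query] = result
        return result, False
```

Queries in the static part are never evicted, so frequent queries survive bursts of one-off queries that churn through the dynamic part.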
Characteristics of Queries (AOL Query Log): Query frequencies follow a Zipf distribution. While a few queries are quite frequent, most queries occur only once or a few times. [Figure: double logarithmic scale.]
Characteristics of Queries: Query traces exhibit some amount of burstiness, i.e., repeated occurrences of a query tend to fall close together in time. A significant part of this burstiness is due to the same user reissuing a query to the engine. With an assumed query arrival rate of 132 queries per minute, most repeated queries recur within a few minutes to an hour.
Only Cache Hits? Query result fails. Frequent admission and eviction occur.
Idea: Study result caching as a weighted caching problem (hit ratio, cost saving). Hybrid algorithms for weighted caching.
Weighted Caching: Assume all cache entries have the same size. Standard caching: all entries also have the same cost. Weighted caching: entries have different costs. Result caching: some queries are more expensive to recompute than others. In fact, costs are highly skewed. Expensive results should be kept longer.
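The difference between counting hits and counting saved cost can be illustrated with a small helper. The definitions used here are an assumption, not taken from the slides: hit ratio is the fraction of queries served from the cache, and cost saving is the fraction of total processing cost avoided by those hits. The function name and the toy trace are hypothetical.

```python
def hit_ratio_and_cost_saving(trace):
    """trace: non-empty iterable of (was_hit, cost) pairs, one per query."""
    hits = total = 0
    saved = total_cost = 0.0
    for was_hit, cost in trace:
        total += 1
        total_cost += cost
        if was_hit:
            hits += 1
            saved += cost  # a hit avoids recomputing this query
    return hits / total, saved / total_cost


# Toy trace: three cheap hits and one expensive miss.
toy_trace = [(True, 1), (True, 1), (True, 1), (False, 17)]
hit_ratio, cost_saving = hit_ratio_and_cost_saving(toy_trace)
```

On this toy trace the hit ratio is 0.75 while the cost saving is only 0.15, which is why a policy tuned for hits alone can look better than it is when costs are skewed.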
Weighted Caching Algorithms: LFU_w: evict the entry with the smallest value of past frequency * cost (weighted version of LFU). Landlord: on insertion, give the entry a deadline equal to its cost; evict the entry with the smallest deadline, and deduct this deadline from all other deadlines in the cache. Weighted version of LRU (Young, Cao/Irani 1998). SDC_w: combination of LFU_w and Landlord.
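The Landlord rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: all entries have the same size, and on a cache hit the deadline is reset to the entry's cost (the slide does not say what happens on a hit). The class name `LandlordCache` and the method `access` are hypothetical.

```python
class LandlordCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.deadline = {}  # query -> remaining deadline
        self.cost = {}      # query -> recomputation cost

    def access(self, query, cost):
        """Return True on a hit; on a miss, insert the query and return False."""
        if query in self.deadline:
            # Assumption: a hit renews the deadline to the full cost.
            self.deadline[query] = self.cost[query]
            return True
        if len(self.deadline) >= self.capacity:
            victim = min(self.deadline, key=self.deadline.get)
            charge = self.deadline.pop(victim)
            del self.cost[victim]
            # Deduct the evicted deadline from every remaining entry.
            for q in self.deadline:
                self.deadline[q] -= charge
        self.deadline[query] = cost
        self.cost[query] = cost
        return False
```

For example, with capacity 2 and the access sequence cheap (cost 1), costly (cost 10), x (cost 1), y (cost 1), the costly entry is still cached at the end, whereas plain LRU would have evicted it. Deducting from all entries is linear per eviction; a production version would more likely keep a global offset and a priority queue.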
New Hybrid Algorithms: SDC, lru_lfu, landlord_lfu_w
Weighted Caching and Power Laws: Problem with weighted caching under high skew. Suppose q_1 has occurred once and has cost 10, and q_2 has occurred 10 times and has cost 1. LFU_w gives both the same priority (1 * 10 = 10 * 1). Is that right? Lottery analogy: multiple rounds, one winner per round. Some people buy more tickets than others, but each person buys the same number each week. Given the past history, guess the future winners. Suppose ticket sales are Zipfian.
Weighted Caching and Power Laws: Compare: smoothing techniques in language models. Three solutions: (1) Good-Turing estimator; (2) estimator derived from the power law; (3) pragmatic: fit correction factors from real data. The last solution subsumes the others.
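To make the first option concrete, here is a sketch of the basic Good-Turing adjustment, r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct queries seen exactly r times. It is not necessarily the estimator used in the paper. The fallback to the raw count when N_{r+1} is zero is a simplifying assumption, and the toy counts are hypothetical.

```python
from collections import Counter


def good_turing_counts(query_counts):
    """Map each observed frequency r to r* = (r + 1) * N_{r+1} / N_r."""
    n = Counter(query_counts.values())  # n[r] = number of queries seen r times
    adjusted = {}
    for r in n:
        if n.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[r] = float(r)  # assumption: no data for r + 1, keep r
    return adjusted


# Toy history: six queries seen once, two seen twice, one seen ten times.
toy_counts = {"q1": 1, "a": 1, "b": 1, "c": 1, "d": 1, "e": 1,
              "f": 2, "g": 2, "q2": 10}
adjusted = good_turing_counts(toy_counts)

# LFU_w priority with the corrected frequency: adjusted frequency * cost.
priority_q1 = adjusted[toy_counts["q1"]] * 10  # seen once, cost 10
priority_q2 = adjusted[toy_counts["q2"]] * 1   # seen ten times, cost 1
```

With these toy counts, the adjusted frequency of a query seen once drops below 1, so q_1 no longer ties with q_2 as it does under plain LFU_w.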
Dataset and Evaluations: 2006 AOL query log with 36 million queries. 4 GB of data collected as HTML pages from Quora. The Lemur search engine has no support for result caching. Plan to develop weighted LRU, LFU, and SDC result caching on top of Lemur. Compare the performance of all the above caching variants with different weights assigned to hit ratio and load. Evaluate which weight metric works best.