
1 Predictive Parallelization: Taming Tail Latencies in Web Search. Myeongjae Jeon, Saehoon Kim, Seung-won Hwang, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner. Microsoft Research, POSTECH, Rice University.

2 Performance of Web Search
1) Query response time: answer users quickly (e.g., within 300 ms).
2) Response quality (relevance): provide highly relevant web pages; quality improves with the resources and time consumed.
Focus: improving response time without compromising quality.

3 Background: Query Processing Stages
Pipeline: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response, under a latency SLA (for example, 300 ms).
- Doc. index search: 100s - 1000s of good matching docs
- 2nd-phase ranking: 10s of the best matching docs
- Snippet generator: a few sentences for each doc
Focus: Stage 1 (index search).

4 Goal
Speed up index search (stage 1) without compromising result quality, in order to:
- Improve user experience
- Serve a larger index
- Afford a more sophisticated 2nd phase
(Pipeline: Query → Doc. index search → 2nd-phase ranking → Snippet generator → Response; for example, a 300 ms latency SLA.)

5 How Index Search Works
- Partition all web pages across index servers (massively parallel).
- Distribute query processing across the partitions (embarrassingly parallel).
- An aggregator collects the top-k relevant pages from the index servers.
Problem: a single slow server makes the entire cluster slow.
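The aggregation step above can be sketched as follows. This is a minimal illustration with hypothetical names and toy scores, not Bing's implementation: each index server returns its local top-k (score, doc_id) hits, and the aggregator merges them into a global top-k.

```python
import heapq

def aggregate_top_k(per_server_results, k):
    """Merge each server's local top-k (score, doc_id) hits into a
    global top-k, keeping the k highest-scoring documents overall.
    Note: the cluster's response time is the max over servers, which
    is why one slow server delays the whole result."""
    return heapq.nlargest(k, (hit for hits in per_server_results for hit in hits))

# Three index servers, each returning its local top-2 (score, doc_id):
servers = [
    [(0.92, "d1"), (0.40, "d2")],
    [(0.88, "d3"), (0.85, "d4")],
    [(0.15, "d5"), (0.10, "d6")],
]
print(aggregate_top_k(servers, 2))  # [(0.92, 'd1'), (0.88, 'd3')]
```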

6 Observation
A query is processed on every index server, so the response time is determined by the slowest server. We therefore need to reduce its tail latencies.

7 Examples
- Fast response: every index server replies to the aggregator quickly.
- Slow response: one long query (an outlier) holds back the aggregated result.
Terminating a long query in the middle of processing yields a fast response, but the result quality drops.

8 Parallelism for Tail Reduction
Opportunity: available idle cores; CPU-intensive workloads.
Challenge: tail queries are few, but their latencies are very long.
Latency breakdown for the 99%tile: Network 4.26 ms; Queueing 0.15 ms; I/O 4.70 ms; the rest is CPU.
Latency distribution: 50%tile 7.83 ms (x1); 75%tile 12.51 ms (x1.6); 95%tile 57.15 ms (x7.3); 99%tile (x26.1).

9 Query Parallelism for Tail Reduction
1. Opportunity: ~30% CPU utilization, so idle cores are available.
2. Few long queries.
3. Computationally intensive workload.
Table: latency breakdown for the 99%tile (Bing index server): Network 4.26 ms; Queueing 0.15 ms; I/O 4.70 ms; the rest is CPU.
Table: latency distribution: 50%tile 7.83 ms (x1); 75%tile 12.51 ms (x1.6); 95%tile 57.15 ms (x7.3); 99%tile (x26.1).
(A 99%tile latency of L ms means that 99% of requests have latency ≤ L ms.)
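Percentile latencies like those in the table follow directly from the definition in the parenthetical above. A minimal nearest-rank sketch on synthetic latencies (not the Bing trace):

```python
def percentile(samples, p):
    """Nearest-rank p-th percentile: the smallest sample such that
    at least p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100), at least 1
    return ordered[rank - 1]

latencies_ms = [5, 6, 7, 8, 9, 10, 12, 15, 60, 210]  # synthetic sample
print(percentile(latencies_ms, 50))  # 9
print(percentile(latencies_ms, 99))  # 210
```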

10 Predictive Parallelism for Tail Reduction
- Short queries: many, and parallelization gives almost no speedup.
- Long queries: few, and parallelization gives good speedup.

11 Predictive Parallelization Workflow
An incoming query first passes through an execution time predictor, which predicts the query's sequential execution time with high accuracy before the query reaches the index server.

12 Predictive Parallelization Workflow
Using the predicted time, a resource manager selectively parallelizes long queries: predicted-short queries run sequentially on the index server, while predicted-long queries run in parallel.
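The selective policy can be sketched as a toy dispatch function. Only the 80 ms cutoff comes from the talk; the function names and the fan-out of 4 workers are hypothetical:

```python
LONG_QUERY_THRESHOLD_MS = 80  # the talk's cutoff for a "long" query

def dispatch(query, predicted_ms):
    """Hypothetical resource-manager policy: run predicted-long
    queries with several worker threads on idle cores, and keep
    predicted-short queries sequential to avoid parallelism overhead."""
    if predicted_ms > LONG_QUERY_THRESHOLD_MS:
        return ("parallel", 4)    # fan the query out over idle cores
    return ("sequential", 1)      # one core; most queries land here

print(dispatch("rare tail query", predicted_ms=150))  # ('parallel', 4)
print(dispatch("common query", predicted_ms=8))       # ('sequential', 1)
```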

13 Predictive Parallelization: Focus of Today's Talk
1. Predictor: identifying long queries through machine learning.
2. Parallelization: executing long queries with high efficiency.

14 Brief Overview of Predictor
- Accuracy: high recall to guarantee the 99%tile reduction. In our workload, 4% of queries run for more than 80 ms, so at least 3% of all queries must be identified (75% recall).
- Cost: low prediction overhead (0.75 ms or less) and low misprediction cost (high precision).
Existing approaches have lower accuracy and higher cost.

15 Accuracy: Predicting Early Termination
Only a limited portion of the inverted index contributes to the top-k relevant results, and that portion depends on the keyword (more precisely, on its score distribution). In the inverted index for "SIGIR", documents are sorted by static rank from highest to lowest; processing evaluates only a prefix of the list, and the remaining documents are not evaluated.
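Rank-ordered early termination can be illustrated with a toy scan. This is a sketch, not the engine's actual algorithm: it assumes a hypothetical posting format in which each entry carries a precomputed upper bound on the scores of all later entries, so the scan can stop once the current k-th best score beats that bound.

```python
import heapq

def topk_early_termination(postings, k):
    """Scan postings (sorted by static rank) as (upper_bound, score)
    pairs, where upper_bound caps the score of this and all later
    postings. Stop once the k-th best score so far meets the bound,
    so the tail of the list is never evaluated."""
    heap = []        # min-heap holding the k best scores seen so far
    evaluated = 0
    for upper_bound, score in postings:
        if len(heap) == k and heap and heap[0] >= upper_bound:
            break    # no unevaluated posting can enter the top-k
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True), evaluated

postings = [(10, 9), (9, 8), (8, 3), (5, 4), (2, 1)]  # toy data
print(topk_early_termination(postings, 2))  # ([9, 8], 2)
```

Only 2 of the 5 postings are evaluated here; how early the scan stops depends on the score distribution, which is exactly why execution time varies per keyword.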

16 Space of Features
Term features [Macdonald et al., SIGIR 12]:
- IDF, NumPostings
- Score statistics (arithmetic, geometric, and harmonic means; max; variance; gradient)
Query features:
- NumTerms (before and after rewriting)
- Relaxed
- Language

17 New Features: Query
Queries in modern search engines carry rich clues, e.g., annotations such as rank=BM25F, enablefresh=1, partialmatch=1, language=en, location=us, ... alongside the query text "SIGIR (Queensland or QLD)".

18 Space of Features (recap)
Term features [Macdonald et al., SIGIR 12]:
- IDF, NumPostings
- Score statistics (arithmetic, geometric, and harmonic means; max; variance; gradient)
Query features:
- NumTerms (before and after rewriting)
- Relaxed
- Language

19 Features by Category
Term features (14): AMeanScore, GMeanScore, HMeanScore, MaxScore, EMaxScore, VarScore, NumPostings, GAvgMaxima, MaxNumPostings, In5%Max, NumThres, ProK, IDF
Query features (6): English, NumAugTerm, Complexity, RelaxCount, NumBefore, NumAfter
All features are cached to ensure responsiveness (avoiding disk access). Term features require a 4.47 GB memory footprint (for 100M terms).
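For illustration, the mean-style score features over a term's posting scores might be computed as follows. This is a sketch with hypothetical function names; the exact feature definitions are those of the paper, not this code.

```python
import math
from statistics import mean, pvariance

def term_score_features(scores):
    """Compute the arithmetic, geometric, and harmonic means, the max,
    and the variance of a term's posting-list scores (assumed > 0),
    mirroring the AMeanScore/GMeanScore/HMeanScore/MaxScore/VarScore
    entries in the feature list above."""
    n = len(scores)
    return {
        "AMeanScore": mean(scores),
        "GMeanScore": math.exp(sum(math.log(s) for s in scores) / n),
        "HMeanScore": n / sum(1.0 / s for s in scores),
        "MaxScore": max(scores),
        "VarScore": pvariance(scores),
    }

feats = term_score_features([2.0, 4.0, 8.0])
print(round(feats["GMeanScore"], 3))  # 4.0 (cube root of 2*4*8)
```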

20 Feature Analysis and Selection
Per-feature accuracy gain from the boosted regression tree suggests that a cheaper feature subset may suffice.

21 Efficiency: Is a cheaper feature subset possible?

22 Prediction Performance
- Query features are important.
- Using cheap features is advantageous: IDF from the keyword features plus the query features incurs much smaller overhead (90+% less) with accuracy similar to using all features.
At the 80 ms threshold, with Precision = |A∩P|/|P| and Recall = |A∩P|/|A| (A = actual long queries, P = predicted long queries):
- Keyword features: high cost
- All features: high cost
- Cheap features: low cost
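Precision and recall as defined on the slide follow directly from the two sets; a minimal computation on toy data:

```python
def precision_recall(actual_long, predicted_long):
    """Precision = |A ∩ P| / |P| and Recall = |A ∩ P| / |A|, where
    A = actual long queries and P = predicted long queries."""
    a, p = set(actual_long), set(predicted_long)
    hits = len(a & p)
    return hits / len(p), hits / len(a)

# Toy data: 4 truly long queries; the predictor flags 5 queries,
# 3 of which are genuinely long.
A = {"q1", "q2", "q3", "q4"}
P = {"q1", "q2", "q3", "q7", "q8"}
prec, rec = precision_recall(A, P)
print(prec, rec)  # 0.6 0.75
```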

23 Algorithms

24 Accuracy of Algorithms
Summary:
- 80% of long queries (> 80 ms) are identified.
- 0.6% of short queries are mispredicted.
- Prediction takes 0.55 ms, with low memory overhead.

25 Predictive Parallelism

26 99%tile Response Time
- Outperforms "parallelize all".
- 50% throughput increase.

27 Performance: Response Time

28 Response Time

29 Related Work
Search query parallelism:
- Fixed parallelization [Frachtenberg, WWWJ 09]
- Adaptive parallelization using system load only [Raman et al., PLDI 11]
→ High overhead due to parallelizing all queries.
Execution time prediction:
- Keyword-specific features only [Macdonald et al., SIGIR 12]
→ Lower accuracy and high memory overhead for our target problem.

30 Future Work
- Misprediction: dynamic adaptation; prediction confidence.
- Diverse workloads: analytics, graph processing, ...

31 Your query to Bing is now parallelized if predicted to be long. Thank you!

