
Adaptive Parallelism for Web Search
Myeongjae Jeon, Rice University
In collaboration with Yuxiong He (MSR), Sameh Elnikety (MSR), Alan L. Cox (Rice), and Scott Rixner (Rice)

Performance of Web Search
1) Query response time
– Answer users quickly
– Reduce both mean and high-percentile latency
2) Response quality (relevance)
– Return highly relevant web pages
– Quality improves with the resources and time consumed

How Microsoft Bing Works
– All web pages are partitioned across index servers
– Distributed query processing (embarrassingly parallel)
– An aggregator collects the top K relevant pages from the index servers
(Diagram: a query fans out to the partitioned index servers; each returns its top K pages to the aggregator)

Our Work
Multicore index server
– Multiple queries executed concurrently
– Each query executed sequentially
Optimization opportunity
– Low CPU utilization
– Many idle cores
Contributions
– Parallelize a single query
– Reduce response times
(Diagram: an index server running queries 1–3 concurrently)

Outline
1. Query parallelism – run a query with multiple threads
2. Adaptive parallelism – select the degree of parallelism per query
3. Evaluation

Query Processing and Early Termination
– Web documents in the inverted index are sorted by static rank, highest first
– Query processing evaluates documents in rank order and terminates early
– The unevaluated tail is unlikely to contribute to the top K relevant results
(Diagram: inverted index for "EuroSys" over Doc 1 … Doc N sorted by static rank, split into a processed prefix and an unevaluated tail)
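The early-termination idea above can be sketched as a rank-ordered scan with a fixed evaluation budget. This is a minimal illustration, not Bing's implementation: the `score` function and the fixed `budget` cutoff are assumptions, since the slides do not give the actual termination criterion.

```python
import heapq

def top_k_with_early_termination(docs, score, k, budget):
    """Scan docs (sorted by static rank, highest first), keep the
    top k by relevance score, and stop after `budget` evaluations:
    the low-rank tail is unlikely to change the top k."""
    heap = []  # min-heap of (score, doc); smallest retained score on top
    for doc in docs[:budget]:          # early termination point
        s = score(doc)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc))
    return sorted(heap, reverse=True)  # best-scoring documents first
```

Because documents are visited in static-rank order, cutting the scan short discards exactly the documents least likely to reach the top k.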

Assigning Data to Threads
Purpose: keep parallel data processing similar to sequential execution
Key approach: threads dynamically alternate over small chunks of the rank-sorted documents
(Diagram: the sorted document list, highest rank to lowest, divided into small chunks claimed alternately by threads T1 and T2)
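A minimal sketch of this chunking scheme, under assumptions not stated in the slides (the shared-counter mechanism and chunk size are illustrative): each thread repeatedly claims the next small chunk of the rank-ordered list, so the set of processed documents stays close to what sequential execution would evaluate.

```python
import itertools
import threading

def process_in_chunks(docs, num_threads, chunk_size, process):
    """Threads repeatedly claim the next small chunk of the
    rank-ordered document list from a shared counter, so documents
    are consumed in near-sequential (static-rank) order."""
    next_chunk = itertools.count()  # shared chunk counter; next() is atomic under the CPython GIL
    def worker():
        while True:
            start = next(next_chunk) * chunk_size
            if start >= len(docs):
                return                      # no chunks left
            for doc in docs[start:start + chunk_size]:
                process(doc)
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Small chunks keep the threads interleaved near the front of the list, which matters when combined with early termination: the prefix that gets processed is nearly the same prefix a sequential scan would process.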

Share Thread Information Using a Global Heap
– Share information across threads to reduce wasted execution
– Each thread keeps a local top-k heap and updates a global top-K heap asynchronously
– Batched updates keep synchronization overhead small
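The global-heap idea can be sketched as follows; class and method names are hypothetical, and the lock-free threshold read is an assumption about how "small sync overhead" might be achieved, not a detail from the slides.

```python
import heapq
import threading

class GlobalTopK:
    """Global top-K heap shared by worker threads. Each thread keeps a
    local top-k and synchronizes only in batches, so lock overhead
    stays small; the global k-th-best score lets threads skip
    documents that cannot enter the top K."""
    def __init__(self, k):
        self.k = k
        self.heap = []          # min-heap of scores (the global top K)
        self.lock = threading.Lock()

    def threshold(self):
        # Read without the lock: a slightly stale threshold only
        # delays pruning; it never drops a true top-K result.
        return self.heap[0] if len(self.heap) >= self.k else float("-inf")

    def merge(self, local_scores):
        # Batched synchronization: one lock acquisition per batch
        # of local results, instead of one per document.
        with self.lock:
            for s in local_scores:
                if len(self.heap) < self.k:
                    heapq.heappush(self.heap, s)
                elif s > self.heap[0]:
                    heapq.heapreplace(self.heap, s)
```

A thread would call `threshold()` cheaply while scoring documents and `merge()` only once per processed batch, which is where the reduction in wasted execution comes from.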

Outline
1. Query parallelism – run a query with multiple threads
2. Adaptive parallelism – select the degree of parallelism per query
3. Evaluation

Key Ideas
System load
– Parallelize a query at light load
– Execute a query sequentially at heavy load
Speedup
– Parallelize a query more aggressively when parallelism yields a better speedup

No Load vs. Heavy Load
(Diagram: with no load, a single query spreads across all cores; under heavy load, queries 1–6 each run sequentially on their own core)

No Speedup vs. Linear Speedup
(Diagram: with no speedup, extra threads T1–T6 do not shorten a query; with linear speedup, six threads finish it six times faster)

Speedup in Reality
In practice, speedup is mostly sublinear: neither zero nor linear.

Adaptive Algorithm
Decide the parallelism degree at runtime: pick the degree p that minimizes response time

  min_p ( T_p + K · T_p · p / N )

K: system load (queue length)
T_p: execution time with parallelism degree p
N: number of cores
The first term is the query's own execution time; the second is the latency impact on the K waiting queries.
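The selection rule can be sketched directly from the formula. Here `exec_time` stands in for a predicted T_p; the slides do not say how Bing estimates it, so the sublinear-speedup profile in the example below is a hypothetical assumption.

```python
def choose_degree(exec_time, queue_len, num_cores, degrees):
    """Pick the parallelism degree p minimizing
        T_p + K * T_p * p / N
    i.e. the query's own execution time plus the extra delay it
    imposes on the K queued queries by holding p of the N cores
    busy for T_p."""
    def cost(p):
        t = exec_time(p)                          # T_p
        return t + queue_len * t * p / num_cores  # own time + impact on queue
    return min(degrees, key=cost)
```

With an empty queue the rule parallelizes aggressively; as the queue grows, the second term dominates and the rule falls back toward sequential execution, matching the key ideas above.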

Experimental Setup
Machine
– Two 6-core Xeon processors (2.27 GHz)
– 32 GB memory, 22 GB dedicated to caching
– 90 GB web index on SSD
Workload
– 100K Bing user queries
Experimental system
– Index server
– Client replays the captured queries with Poisson arrivals at varying rates (queries per second)

Mean Response Time: Fixed Parallelism
(Plot: mean response time vs. load for fixed parallelism degrees)
No fixed degree of parallelism performs well across all loads.

Mean Response Time: Adaptive Parallelism
(Plot: mean response time vs. load, adaptive vs. fixed degrees)
Adaptive parallelism is lower than all fixed degrees.

Mean Response Time: Adaptive Parallelism
(Plot: distribution of parallelism degrees selected at each load)
Adaptive parallelism may select any degree among all possible options; degrees are utilized unevenly to produce the best performance.

Mean Response Time: Adaptive Parallelism
(Plot: mean response time vs. load, adaptive vs. sequential execution)
In the interesting load range, mean response time is 47% lower than sequential execution.

95th-Percentile Response Time
(Plot: 95th-percentile response time vs. load, adaptive vs. sequential execution)
52% lower than sequential execution, with similar improvements in the 99th-percentile response time.

More Experimental Results
Quality of responses
– Equivalent to sequential execution
Importance of speedup
– Compared with using system load alone, using both system load and speedup gives up to 17% better mean response time

Conclusion
1. Effectively parallelize the search server
– Parallelize a single query with multiple threads
– Adaptively select the parallelism degree per query
2. Improved response time
– 47% on average, 52% at the 95th percentile
– Same quality of responses
3. In Bing next week!