Resource Revocation in Mesos
Stephen Twigg and Huy Vo

Motivation

Problem: DRF (Dominant Resource Fairness) can ensure that undersubscribed frameworks receive priority, but only if there are free resources. Furthermore, it does not restrict how much of the system a framework can claim. This makes it possible for speculative or malicious frameworks to lock up the system indefinitely.

Solution: Resource revocation! Forcibly kill young tasks of greedy frameworks to ensure a fairer distribution of system resources.

Why Revoke? Impact

+ Provide stronger guarantees to frameworks
+ Lower latency for new frameworks to start work
+ Reward well-characterized frameworks
+ Restrict impact of rogue frameworks
+ Better fairness amongst frameworks
- System goodput could suffer
- More burden on frameworks to monitor resources and handle killed tasks

[Figure panels: Stock Mesos; With Revocation]
Sample scenario where two frameworks each must execute two tasks on a system with two slots. Revocation ensures better latency and goodput fairness between the two frameworks.

Apache Mesos

Mesos diagram courtesy of

Mesos sits as a cluster management agent, moderating the deployment of framework tasks onto cluster nodes.
Uses a generous offer-response API that allows frameworks to hand-select the nodes they need for jobs.
Employs dynamic DRF to fairly allocate resources.
Developed in C++ on top of Libprocess.
Used in production for clusters at UC Berkeley and Twitter.

Implementation Details

Resource Revocation

1. The API is extended so that a framework may now establish a guaranteed share request with Mesos to cover some minimum SLA.
2. Occasionally, the Mesos scheduler tallies all resources needed to cover the unmet guaranteed shares of constituent frameworks.
3. If not enough unused resources are available to meet that tally, Mesos revokes from frameworks that are over their guarantees until the tally is met.

Offer Revocation

Mesos does not make simultaneous offers and previously gave frameworks unrestricted time to consider offers.
With offer revocation, Mesos recovers these resources after an expanding offer timeout.

Evaluation

Framework one is a greedy framework modeling data analysis tasks (e.g. web crawling), capable of using the entire system by launching many 3-minute tasks. Framework two is a more constrained framework modeling realtime tasks: it generates a 30-second task every 16 seconds but desires low latency.

Scenario A: FW2 enters 150 sec after FW1 starts
Scenario B: FW2 enters 30 sec after FW1 starts

Revocation provides consistent goodput
Revocation ensures lower latency
Revocation minimally impacts system goodput
Revocation better isolates frameworks

Revocation causes a goodput drop from two sources: ‘revocation loss’, due to premature termination of running tasks, and ‘guaranteed loss’, caused by denying greedy frameworks access to unused guaranteed share. The actual goodput loss is reasonable even in worst-case scenarios.

Without revocation, work builds up in the starved framework, causing latency spikes and requiring the framework to grab extra resources in order to recover.

Without revocation, frameworks are susceptible to losing their place in the system if offer timing is slightly off, causing bizarre, periodic latency spikes.

Latencies and goodput of the realtime framework, when run on Mesos with revocation enabled and sharing with greedy frameworks, are nearly indistinguishable from those observed when it runs on the cluster alone.

[Figure: Framework 1 (data analysis) goodput in all scenarios]
[Figure: Framework 2 (realtime) goodput in all scenarios]

Conclusions and Future Work

Revocation clearly provided a net gain for the system by improving latency guarantees for incoming frameworks and preventing frameworks from losing their grasp of the system. System goodput losses were minimal, and revocation presented frameworks with more consistent goodput over time and, surprisingly, better isolation.
Future explorations will test Mesos with revocation against more realistic workloads in a larger cluster. More revocation schemes are to be considered, including voluntary revocation, less aggressive revocation, evaluating and rewarding well-behaved frameworks using an ML algorithm, and allowing frameworks to give specific resource wants in addition to general needs.
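
Illustrative sketch

The three steps listed under Resource Revocation (register guaranteed shares, tally unmet guarantees, revoke from frameworks over their guarantees) can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the authors' C++ implementation and not the Mesos API: the names Task, Framework, unmet_guarantee and plan_revocations are hypothetical, resources are reduced to abstract "slots" as in the two-slot sample scenario, and the youngest-first victim order is taken from the poster's remark about killing young tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    slots: int          # resources held, in abstract slots (assumption)
    started_at: float   # launch time in seconds


@dataclass
class Framework:
    name: str
    guarantee: int      # guaranteed share in slots (step 1)
    demand: int         # slots the framework currently wants
    tasks: list = field(default_factory=list)

    @property
    def used(self):
        return sum(t.slots for t in self.tasks)


def unmet_guarantee(fw):
    """Slots the framework wants, is guaranteed, and does not yet hold."""
    return max(0, min(fw.guarantee, fw.demand) - fw.used)


def plan_revocations(frameworks, total_slots):
    """Return (framework name, task id) pairs to kill, youngest tasks first."""
    # Step 2: tally the resources needed to cover all unmet guarantees.
    tally = sum(unmet_guarantee(fw) for fw in frameworks)
    free = total_slots - sum(fw.used for fw in frameworks)
    shortfall = tally - free
    if shortfall <= 0:
        return []  # unused resources already cover the tally

    # Step 3: revoke only from frameworks holding more than their guarantee.
    excess = {fw.name: fw.used - fw.guarantee for fw in frameworks}
    candidates = sorted(
        ((task, fw) for fw in frameworks if excess[fw.name] > 0
         for task in fw.tasks),
        key=lambda pair: pair[0].started_at,
        reverse=True,  # youngest task first, so the least work is lost
    )
    victims = []
    for task, fw in candidates:
        if shortfall <= 0:
            break
        if task.slots > excess[fw.name]:
            continue  # killing this would push fw below its own guarantee
        victims.append((fw.name, task.task_id))
        excess[fw.name] -= task.slots
        shortfall -= task.slots
    return victims


# Two-slot scenario: the greedy framework holds both slots when FW2 arrives.
fw1 = Framework("fw1", guarantee=1, demand=2,
                tasks=[Task("t1", 1, 0.0), Task("t2", 1, 10.0)])
fw2 = Framework("fw2", guarantee=1, demand=1)
print(plan_revocations([fw1, fw2], total_slots=2))  # [('fw1', 't2')]
```

In this model the greedy framework loses only its most recently started task and keeps its own guaranteed slot, which mirrors the trade-off described above: a small "revocation loss" in exchange for the realtime framework starting without waiting for a free slot.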