Presented by Xianze Zhan & Chenying He CS 525 University of Illinois

Is that Fair?

HUG: Multi-Resource Fairness for Correlated and Elastic Demands
Presented by Xianze Zhan CS 525 University of Illinois

Motivation
Multi-resource fairness: optimal performance guarantees, maximize utilization
Shared bandwidth! Two tenants with 5 VMs

Three Requirements
General allocation model; inter-tenant network sharing requirements
1. Isolation Guarantee
- minimum bandwidth guarantees proportional to correlation vectors; used for worst-case performance estimation
2. High Utilization
- work conservation, which ensures that either a link is fully utilized or the demands of all flows traversing the link have been satisfied
3. Proportionality
- payment ↔ bandwidth allocation

Single-Resource Max-Min Fairness
Tenant A asks for half of the bandwidth; Tenant B asks for all of it.
Progress (M): demand-normalized allocation
Isolation Guarantee = min(M_A, ..., M_n)
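
The single-link example can be checked with a minimal sketch, assuming a link of capacity 1 and the standard water-filling form of max-min fairness. The function and variable names are illustrative and do not come from the slides.

```python
from fractions import Fraction

def max_min_share(capacity, demands):
    """Water-filling: split what is left equally among unsatisfied tenants."""
    alloc = {t: Fraction(0) for t in demands}
    remaining = Fraction(capacity)
    active = set(demands)
    while active and remaining > 0:
        share = remaining / len(active)
        # Tenants whose whole demand fits inside the equal share are satisfied.
        satisfied = {t for t in active if demands[t] - alloc[t] <= share}
        if not satisfied:
            # Nobody can be satisfied: everyone left gets the equal share.
            for t in active:
                alloc[t] += share
            remaining = Fraction(0)
            break
        for t in satisfied:
            remaining -= demands[t] - alloc[t]
            alloc[t] = demands[t]
        active -= satisfied
    return alloc

# Tenant A asks for half of the link, tenant B asks for all of it.
demands = {"A": Fraction(1, 2), "B": Fraction(1)}
alloc = max_min_share(1, demands)
# Progress M_i is the demand-normalized allocation.
progress = {t: alloc[t] / demands[t] for t in demands}
isolation_guarantee = min(progress.values())
```

Under these assumptions both tenants receive half of the link, so A's progress is 1, B's progress is ½, and the isolation guarantee is ½.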

Single Resource
Easy to achieve both optimal isolation guarantee and work conservation!

Multi-Resource
Demands across links are correlated.
Tenants can lie to gain more bandwidth.
Two tenants, two links: dA = <½, 1>, dB = <1, ⅙>

Multi-Resource Max-Min (PS-P)
Elastic demands (at least): both Tenant A and Tenant B ask for all the bandwidth on L1 and L2

Only suboptimal Isolation Guarantee! What is the optimal IG solution?

Dominant Resource Fairness(DRF)
Correlated demands (exactly): dA = <½, 1>, dB = <1, ⅙>

Optimal IG but low utilization! Not work-conserving!
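
A minimal sketch of the DRF allocation for the two-link example, assuming unit-capacity links and demand vectors normalized so that each tenant's dominant entry is 1. The helper name `drf_progress` is illustrative.

```python
from fractions import Fraction as F

def drf_progress(demands, capacity=F(1)):
    """Largest equal progress M with sum_i M * d_i[l] <= capacity on every link l."""
    num_links = len(next(iter(demands.values())))
    load = [sum(d[l] for d in demands.values()) for l in range(num_links)]
    return capacity / max(load)

demands = {"A": (F(1, 2), F(1)), "B": (F(1), F(1, 6))}
M = drf_progress(demands)
# Correlated demands are served exactly: tenant i gets M * d_i on each link.
alloc = {t: tuple(M * x for x in d) for t, d in demands.items()}
utilization = [sum(a[l] for a in alloc.values()) for l in range(2)]
```

With these inputs M = ⅔, A receives <⅓, ⅔> and B receives <⅔, 1/9>. Link 1 is full, but link 2 is only 7/9 utilized, which is the lack of work conservation the slide points out.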

Optimal IG but low Utilization! Critical: Prisoner’s Dilemma!

Prisoner’s Dilemma
dA = <½, 1>, dB = <1, ⅙>
Tenant A lies and reports dA = <1, 1>

HUG
Achieves the optimal isolation guarantee with the highest utilization
Strategy-proof: restricts gaining spare resources by lying

HUG
Two stages:
1. Use DRF to ensure isolation guarantees
- strategy-proof
2. Calculate upper bounds that restrict spare resource usage, to maximize utilization
- higher utilization while remaining strategy-proof

HUG
Elastic and correlated demands: dA = <½, 1>, dB = <1, ⅙>
Upper bound: ⅔ for both tenants
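
The two stages can be sketched as follows, assuming unit-capacity links. Stage 1 reuses the DRF-style equal progress M. Stage 2 is an assumption of this sketch: spare capacity on each link is split equally among tenants, with each tenant's per-link allocation capped at M (the ⅔ upper bound above). The slides do not spell out how spare capacity is divided, so the function `hug` is a hypothetical illustration rather than the paper's exact algorithm.

```python
from fractions import Fraction as F

def hug(demands, capacity=F(1)):
    tenants = list(demands)
    num_links = len(demands[tenants[0]])
    # Stage 1: DRF-style equal progress M (the isolation guarantee).
    M = capacity / max(sum(demands[t][l] for t in tenants)
                       for l in range(num_links))
    alloc = {t: [M * demands[t][l] for l in range(num_links)] for t in tenants}
    # Stage 2 (assumed rule): share spare capacity per link, capped at M.
    for l in range(num_links):
        spare = capacity - sum(alloc[t][l] for t in tenants)
        # Serve tenants with the least headroom first so leftovers flow on.
        order = sorted(tenants, key=lambda t: M - alloc[t][l])
        for i, t in enumerate(order):
            share = spare / (len(order) - i)
            extra = min(share, M - alloc[t][l])
            alloc[t][l] += extra
            spare -= extra
    return M, alloc

M, alloc = hug({"A": (F(1, 2), F(1)), "B": (F(1), F(1, 6))})
```

For the example demands this yields M = ⅔, A = <⅓, ⅔> and B = <⅔, ⅓>: both links are fully used, and no tenant exceeds ⅔ on any link.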

HUG Evaluation
EC2 deployments with 100 machines
Facebook production trace from 3000/3200-machine Facebook cluster
Focused on bandwidth and isolation guarantee evaluation

Isolation Guarantee

Bandwidth Utilization

Slowdown Evaluation

Take Away
HUG provides the optimal isolation guarantee with the highest utilization
Strategy-proof
Short-term optimal fairness
May affect payment model
HUG can be beneficial to all resource types in multi-resource allocation

Discussion
Single point of failure?
Yes. The paper does not address any fault-tolerance mechanisms (decentralization is hinted at in the future work section).
Where/when do tenants calculate their correlation vectors? How fast is it?
Tenants would periodically update their correlation vectors through a public API. It takes milliseconds to compute new allocations.

Discussion
Scalability is misleading?
The centralized resource allocator might be a bottleneck (although it scales well). The claim of scaling to 100k machines holds only for 100,000 emulated machines, which is very misleading: they only sent the same message 1000 times to 100 different machines.

Open Discussion
Decentralized approach pros and cons?
Long-term fairness guarantees?
Optimal work conservation & high IG vs. optimal IG & high work conservation

FairRide: Near-Optimal, Fair Cache Sharing
Presented by Chenying He CS 525 University of Illinois

Introduction - Caching in multi-tenant environments
Global: single memory pool, agnostic of users or applications
Isolation: static allocation of memory among multiple users; possible under-utilization, no sharing
Sharing: dynamic allocation of memory among users, with one copy of shared files

Desired properties of allocation policies
Isolation Guarantee: no user should receive less cache space than she would have had if the cache space were statically and equally divided among all users.
Strategy Proofness: a user cannot improve her allocation or cache performance at the expense of other users by gaming the system.
Pareto Efficiency: the system should be efficient in that it is not possible to increase one user’s cache allocation without lowering the allocation of some other user.

Brief Summary Isolation Guarantee Strategy Proofness Pareto Efficiency
global ✘ ✔ max-min fairness

Problems of current schemes
Isolation schemes: inefficient utilization due to
1. users not fully utilizing their allocated cache
2. multiple copies of shared files
Global schemes (LRU, etc.):
- unfavorable to users who read data at long intervals
- possible abuse of the system

Fairness: a problem
An issue in both global and max-min fairness systems!

Goal - What we want
Unfortunately, no policy can achieve all three properties… There’s a strong trade-off between Pareto efficiency and strategy-proofness.
We want something like this:
Isolation Guarantee Strategy Proofness Pareto Efficiency
blablabla ✔ ?

Further explanation - Max-min Fairness
Max-min fairness: maximize the minimum allocation across all users.
Basic idea: when a user accesses a file f, the system checks whether there is enough space to cache it. If not, it repeatedly evicts files of the user with the largest cache allocation until there is enough room for f.
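
The eviction loop described above can be sketched as follows. This is a hedged illustration, not FairRide's or Tachyon's code: the `cache_file` function, the dictionary layout, and the choice to evict a victim's files in insertion order are all assumptions made for the example.

```python
def cache_file(cache, capacity, user, name, size):
    """cache maps user -> {file name: size}; returns True if the file is cached.

    Assumption: a victim's files are evicted in insertion order, since the
    slide does not say which of the victim's files goes first.
    """
    def used():
        return sum(sum(files.values()) for files in cache.values())

    def alloc(u):
        return sum(cache[u].values())

    if size > capacity:
        return False  # the file can never fit
    cache.setdefault(user, {})
    while used() + size > capacity:
        # Evict from the user that currently holds the largest allocation.
        victim = max(cache, key=alloc)
        oldest = next(iter(cache[victim]))
        del cache[victim][oldest]
    cache[user][name] = size
    return True

cache = {"alice": {"a1": 4, "a2": 3}, "bob": {"b1": 3}}
ok = cache_file(cache, 10, "bob", "b2", 3)
```

In this run the cache is full, alice holds the largest allocation (7 of 10 units), so her file a1 is evicted to make room for bob's new file.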

Further explanation - Shared Files Caching
Multiple users may share the same files.
f_{i,j}: file j cached on behalf of user i
k_j: number of users that have requested the caching of file j
alloc_i: total cache size allocated to user i
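
One way to compute alloc_i from these definitions is sketched below. The charging rule is an assumption of this sketch: the size of a shared file j is split evenly among the k_j users that requested it. The function name `allocations` and the sample file sizes are illustrative.

```python
from fractions import Fraction

def allocations(file_sizes, requests):
    """requests maps user i -> set of files j cached on behalf of i (f_{i,j})."""
    # k_j: number of users that requested caching of file j.
    k = {}
    for files in requests.values():
        for j in files:
            k[j] = k.get(j, 0) + 1
    # Assumed charging rule: each requester is charged size_j / k_j.
    return {i: sum(Fraction(file_sizes[j], k[j]) for j in files)
            for i, files in requests.items()}

sizes = {"f1": 4, "f2": 2}
alloc = allocations(sizes, {"alice": {"f1", "f2"}, "bob": {"f1"}})
```

Here f1 is shared by two users, so each is charged 2 units for it; alice is additionally charged the full 2 units of f2.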

Further explanation - Cheating
Free-riding: spuriously accessing files

Blocking comes to the rescue

FairRide - max-min fairness w/ probabilistic blocking
Expected delaying
Blocking probability is 1/(n_j+1), where n_j is the number of other users caching the file.
Hit rate for malicious user Bob if blocking is applied: (5 + 10*1/(1+1)) / (5+10) = 66.7% < 83.3%
Thus blocking discourages cheating!
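
The hit-rate arithmetic can be reproduced with a short sketch. It assumes that 5 and 10 are Bob's access rates to files he caches himself and to files he free-rides on, respectively; that reading, and the function names, are assumptions of this example rather than statements from the slide.

```python
from fractions import Fraction

def blocking_probability(n_other):
    """Chance of blocking a non-caching user when n_other users cache the file."""
    return Fraction(1, n_other + 1)

def expected_hit_rate(own_accesses, free_ride_accesses, n_other):
    """Expected hit rate when free-ride accesses succeed only if not blocked."""
    p_pass = 1 - blocking_probability(n_other)
    hits = own_accesses + free_ride_accesses * p_pass
    return hits / (own_accesses + free_ride_accesses)

# One other user caches the shared file, so Bob is blocked half of the time.
rate = expected_hit_rate(5, 10, n_other=1)
```

With one other user caching the file, the blocking probability is ½ and the expected hit rate is 10/15 = ⅔, which is below the 83.3% quoted on the slide for the honest case.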

Analysis
Revisit: with file sharing, no cache allocation policy can satisfy all three of the following properties: strategy-proofness, isolation-guarantee, and Pareto-efficiency (SIP theorem).
FairRide achieves isolation-guarantee.
Cache total: the amount of cache a user accesses
Alloc: allocation budget

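Analysis - cont’d
FairRide is strategy-proof.
P(a user i can access a file j without caching it) = n_j/(n_j+1)
benefit of caching file j = freq_j * 1/(n_j+1)
cost of caching file j = 1/(n_j+1) (for a unit-size file)
benefit-cost ratio = freq_j
Caching decisions are therefore based on actual access frequencies rather than on cheating!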

Analysis - cont’d
FairRide uses lower-bound blocking probabilities for achieving strategy-proofness.
Suppose a user has two files, f_j and f_k, with access frequencies freq_j and freq_k. Let p_j and p_k be the blocking probabilities if the user chooses not to cache the files. Then the benefit-cost ratios for the two files are freq_j*p_j*(n_j+1) and freq_k*p_k*(n_k+1). For these ratios to rank files by their actual access frequencies, p_j*(n_j+1) must be the same for every file. When n_j is 0, p_j must be 1, so that constant is 1. Thus p_j = 1/(n_j+1).

Implementation
FairRide is implemented in Tachyon.
Users and Shares
- Each application has a FairRide client ID.
- Shares for each user can be configured.
Pluggable Policy
- Supports LRU and LFU.
- Supports pinfile(fileId) API.

Implementation - cont’d
Delaying
- Sleep the thread before giving a data buffer to the Tachyon client.
- Delay time:
  BW_disk: premeasured disk bandwidth
  size(buffer): size of the data buffer sent to the client
Node-based Policy Enforcement
- Allocation policies are enforced independently at each node.
- No global coordination.

Evaluation
Figure 5: cheating and blocking
Figure 6: benchmarks with multiple workloads
- Does sharing the cache improve performance significantly (vs. isolation)?
- Can FairRide prevent cheating with small efficiency loss (vs. best case)?
- Does cheating degrade system performance significantly (vs. max-min)?
Figure 7: many users

Evaluation 15% Facebook workload

Evaluation
Reduction in median job completion time
Global: based on users’ global usage
Naive global: chooses from only blocks on that node
Optimized global: chooses from any user blocks in the cluster

Takeaway
FairRide provides both isolation-guarantee and strategy-proofness.
FairRide is within 4% of total efficiency in some of the production workloads (assuming users don’t cheat).

Q & A
How to decide how long the expected delay is? Is it possible to make it optimal? What if it takes network or other conditions into consideration?
Delay simulates that of probabilistic blocking. It’s node-based, so there is no global coordination.
Why choose isolation-guarantee and strategy-proofness rather than Pareto-efficiency and isolation-guarantee, or Pareto-efficiency and strategy-proofness?
There’s a strong tradeoff between Pareto-efficiency and strategy-proofness. Max-min fairness already offers Pareto-efficiency and isolation-guarantee. FairRide is designed to avoid unfair situations that resemble a prisoner’s dilemma. A small cache loss can cause a large performance decrease, and a more balanced memory allocation is better in general.

Q & A
What happens when some users leave (jobs finish)? Is the memory dynamically reallocated to the rest of the processes?
Shares for each user can be configured. Eventually the allocation will converge to the configuration.
Blocking/delaying strategies prevent cheating but come at the cost of higher reply times. Can there be a better strategy where ‘good’ nodes are not blocked?
There’s no way to distinguish a cheating user from a well-behaved user. They may have similar patterns.

Q & A
To cheat, malicious users seem to need information, for example that the files they access are shared by other users. Is that normal in practice?
Yes. They can probe the system to find out which files have been cached by other users.
Still didn’t get the distributed aspect of the cache. Is one file cached in multiple places or broken up among many nodes?
In the node-based version, it’s cached in the node’s single memory pool. For naive global and optimized global, the paper did not specify how shared files are handled.

FairRide is implemented on Tachyon. What about other caching systems?
Delaying (probabilistic blocking) disincentivizes cheating by adding an artificial delay to file access time. However, users with genuine workloads similar to those of cheaters may also suffer from these delays.
Degraded performance for files shared across a large number of users.
FairRide policies don’t allow cache locality to be fully exploited.
Do FairRide policies scale beyond 20 users? What are the overheads?

Initial assignment of caches to a large set of users may not be effective. Some probabilistic strategy could yield better results when warming up the cache.
Unlike the HUG paper, they did not consider a cooperative environment. FairRide is only better when users cheat, and it comes at the cost of Pareto-efficiency.
Although the expected performance is
The randomized algorithm is efficient for implementing strategy-proofness, but it is not that stable.
This is a relatively new paper. Has it been adopted in any large-scale system/service? (just curious)
