A Cloud Data Center Optimization Approach using Dynamic Data Interchanges Prof. Stephan Robert http://www.stephan-robert.ch University of Applied Sciences of Western Switzerland IEEE CloudNet San Francisco November 2013
Motivation and background Distributed datacenters in the Cloud have become a popular way to increase data availability and reduce costs. Cloud storage has received a lot of attention with a view to reducing costs: – Minimizing infrastructure and running costs – Allocation of data servers to customers – Geo-optimization (looking at where customers are located to decide where to place datacenters)
Datacenter optimization Research areas on optimizing datacenter operations: – Energy and power management – Cost benefit analysis – Cloud networks versus Grids – Geo-distribution of cloud centers – Multi-level caching
Motivation and background (cont.) We consider the operational situation when we have decided on the datacenter locations. Is there any other optimization we can perform? Problem we examine: – Data locality: users not always near the data -> higher costs – Situation can change over time: we can decide to place our data near the users now, but there is no guarantee this will not change in the future
Principal idea We consider a model for actively moving data closer to the current users. When needed, we move data from one server to a temporary (cache) area in a different server. In the near future, when users request this particular data, we can serve them from the local cache.
Benefits Benefit of copying (caching) data to a local server: – We correct the mismatch between where the data is and where the users are. – We only copy once (cost), read many (benefit). – We train the algorithm by using a history of requests to determine the relative frequency of items being requested (in an efficient way, as the number can be very large).
Model We consider a combinatorial optimization model to determine the best placement of the data. This model tells us whether we need to copy data from one datacenter to another, in anticipation of user requests. The optimization aim is to minimize the total expected cost of serving future user data requests. The optimization constraints are the cache size capacities. The model accounts for: – The cost of copying data between datacenters – The relative cost/benefit of delivering the data from a remote vs. a local server – The likelihood that particular data will be requested in particular locations in the near future
Model (cont.) – Indicator variable: 1 if object i is obtained from datacenter d – Constraint: each object must be available in at least one datacenter – Constraint: the cache size Z of each datacenter must not be exceeded – Expected cost of retrieving object i from datacenter d – Cost of copying object i from its default datacenter to another datacenter d – Probability that object i will be requested by user u
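The quantities listed on this slide can be assembled into an assignment-style integer program. The following is only a sketch of one formulation consistent with those descriptions, not necessarily the exact model of the paper. Apart from i, d, u and Z, all symbols are assumptions introduced here: x_{id} for the indicator, E_{id} for the expected retrieval cost, C_{id} for the copy cost, p_{iu} for the request probability, c_{ud} for the cost of serving user u from datacenter d, s_i for the size of object i, and d_0(i) for the default datacenter of object i.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i}\sum_{d} \left(E_{id} + C_{id}\right) x_{id} \\
\text{s.t.}\quad & \sum_{d} x_{id} \ge 1 && \forall i \\
& \sum_{i:\, d_0(i) \ne d} s_i\, x_{id} \le Z && \forall d \\
& x_{id} \in \{0,1\} && \forall i, d \\
\text{with}\quad & E_{id} = \sum_{u} p_{iu}\, c_{ud}, \qquad C_{i\,d_0(i)} = 0 .
\end{aligned}
```

Under these assumptions, copying object i to datacenter d pays off only when E_{id} + C_{id} is smaller than the expected cost E_{i d_0(i)} of serving it from its default location, and the capacity constraint limits how many such copies each cache can hold.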
Operational aspects Firstly, we must obtain a historical log of requests, including who requested what, where the file was located, and the file size. We use this information to calculate the access probabilities in the model (in practice, using HBase/Hadoop in a distributed manner). The costs in the model have to be decided based on the architecture etc. (e.g. the relative benefit of using a local server versus a remote one for a particular user). Periodically (e.g. daily) we run the algorithm to determine any data duplication that is beneficial. (Of course, the network must be aware of the local copies and know to use them.)
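The step of turning the request log into access probabilities can be sketched as a relative-frequency count per user. This is a minimal in-memory illustration, not the HBase/Hadoop pipeline mentioned on the slide. The record layout `(user, object_id, location, size)` and the function name `access_probabilities` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def access_probabilities(log):
    """Estimate, for each user, the relative frequency of each requested object.

    `log` is an iterable of (user, object_id, location, size) records.
    This record layout is hypothetical; only user and object_id are used here.
    """
    counts = defaultdict(Counter)
    for user, object_id, _location, _size in log:
        counts[user][object_id] += 1

    probabilities = {}
    for user, per_object in counts.items():
        total = sum(per_object.values())
        probabilities[user] = {obj: n / total for obj, n in per_object.items()}
    return probabilities


# Tiny made-up log, for illustration only.
example_log = [
    ("u1", "a", "dc0", 120),
    ("u1", "a", "dc0", 120),
    ("u1", "b", "dc1", 300),
    ("u2", "b", "dc1", 300),
]
p = access_probabilities(example_log)
```

Here `p["u1"]` holds the estimated request probabilities for user `u1`, which can then play the role of the probabilities p in the model.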
Computational experimentation Experiments were carried out in a simulation environment (no real-life implementation at this stage). We compared the cost of obtaining the data directly against the cost when our optimization model is used to rearrange the data periodically. Performance was consistent for 3, 5, and 10 datacenters.
Computational experimentation Setup: – N datacenters located on a circle – Users placed at random inside the circle – Costs linked to the distance – Data object requests generated from a Zipf distribution (independently for each user) – First half of the data used to train the algorithm (historic access log), second half used for the simulation
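The setup described on this slide can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the simulator used in the study: the unit radius, the Zipf exponent of 1.0, the use of plain Euclidean distance as the cost, the number of requests per user, and all function names are choices made for this example.

```python
import math
import random


def make_datacenters(n, radius=1.0):
    """Place n datacenters evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def random_user(rng, radius=1.0):
    """Draw a point uniformly at random inside the circle."""
    r = radius * math.sqrt(rng.random())  # sqrt gives uniform density over the disk
    theta = 2 * math.pi * rng.random()
    return (r * math.cos(theta), r * math.sin(theta))


def zipf_requests(rng, n_objects, n_requests, exponent=1.0):
    """Sample object indices with probability proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_objects + 1)]
    return rng.choices(range(n_objects), weights=weights, k=n_requests)


def distance_cost(user, datacenter):
    """Assumed cost model: cost equals Euclidean distance."""
    return math.dist(user, datacenter)


rng = random.Random(0)
datacenters = make_datacenters(3)
users = [random_user(rng) for _ in range(20)]
# Requests are generated independently for each user.
requests = {u: zipf_requests(rng, n_objects=100, n_requests=50) for u in range(len(users))}
# First half trains the model (historic log), second half drives the simulation.
train = {u: r[: len(r) // 2] for u, r in requests.items()}
held_out = {u: r[len(r) // 2:] for u, r in requests.items()}
```

The `train` half would feed the probability estimates, while the `held_out` half is replayed to compare the default and optimized placements.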
Simulation results
Data centers | Users | Cache size | Objects (problem size) | Cost (default) | Cost (optimized) | % cost improvement
3 | 20 | Small (1500) | Med (500) | 2290 | 1870 | 26%
3 | 500 | Large (3000) | Small (100) | 57412 | 40832 | 29%
3 | 1000 | Small (1500) | Med (500) | 113374 | 82852 | 27%
5 | 100 | Large (3000) | Large (1000) | 12160 | 7433 | 39%
5 | 500 | Small (1500) | Med (500) | 57375 | 44136 | 23%
5 | 1000 | Large (3000) | Small (100) | 115646 | 80146 | 31%
10 | 20 | Large (3000) | Large (1000) | 2266 | 1832 | 19%
10 | 100 | Small (1500) | Med (500) | 11533 | 8791 | 24%
10 | 1000 | Large (3000) | Large (1000) | 113574 | 85424 | 25%
Promising results with ~20% cost reduction on average. Full results appear in the proceedings paper.
Practicalities – is the idea feasible in a real system? More complexities but also easy solutions: – Time criticality: no need to run on the live system; object locations can be optimized overnight (periodic dynamic reconfiguration) – Metadata storage: need to store object access frequencies to calculate the probabilities p. We implemented a metadata store in HBase on a Hadoop cluster. -> Conclusion: feasible and easy
Complexity issues The optimization problem is hard to solve (NP-hard). – Can keep the input size small: we only need to consider the most popular objects. – Currently developing a fast heuristic algorithm based on knapsack methods Standard problems of data – Other complexities: legal issues of moving data across countries (if personal data are involved)
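To give a flavour of what a knapsack-based heuristic for filling one datacenter's cache could look like, here is a generic greedy sketch that ranks candidate objects by net benefit per unit of size. It is not the heuristic under development by the authors, whose details are not given in the slides. The function name `greedy_cache_fill`, the candidate tuple layout, and the numbers in the example are all assumptions.

```python
def greedy_cache_fill(candidates, capacity):
    """Greedy knapsack-style selection of objects to copy into one cache.

    `candidates` is a list of (object_id, size, net_benefit) tuples, where
    net_benefit is assumed to be the expected saving from local delivery
    minus the one-off copy cost. Objects with no positive benefit are skipped.
    Returns the chosen object ids and the cache space they use.
    """
    worthwhile = [c for c in candidates if c[1] > 0 and c[2] > 0]
    # Highest benefit per unit of cache space first.
    worthwhile.sort(key=lambda c: c[2] / c[1], reverse=True)

    chosen, used = [], 0
    for object_id, size, _benefit in worthwhile:
        if used + size <= capacity:
            chosen.append(object_id)
            used += size
    return chosen, used


# Made-up candidates for illustration: (object_id, size, net_benefit).
candidates = [("a", 400, 80.0), ("b", 700, 70.0), ("c", 300, 45.0), ("d", 500, -5.0)]
chosen, used = greedy_cache_fill(candidates, capacity=1000)
```

A greedy ratio rule like this runs in O(n log n) time but gives no optimality guarantee, and restricting `candidates` to the most popular objects keeps n small, as suggested on the slide.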