Hierarchical I/O Scheduling for Collective I/O


1 Hierarchical I/O Scheduling for Collective I/O
Jialin Liu, Yong Chen, Yu Zhuang. Data-Intensive Scalable Computing Lab (DISCL), Department of Computer Science, Texas Tech University. [Thank you for coming to this talk.] In this talk, I will present a recent research study on hierarchical I/O scheduling for collective I/O.

2 Outline Background; Motivations; Hierarchical I/O (HIO) Scheduling;
Theoretical Analysis; Implementation and Evaluations; Conclusion and Future Work

3 Background More and more scientific applications are data-intensive.
In climate science, researchers desire finer resolution, from 1000 km down to 10 km. In combustion simulation, the data size exceeds the terascale. Today's scientific applications tend to be data-intensive: large amounts of data are generated or collected from scientific applications and instruments. For example, the left figure shows a climate simulation; scientists desire finer resolution than before, which results in more data to simulate. The right figure shows a 3D combustion simulation, which usually generates several terabytes of data, with the majority of the computation focused on 2D planes. Climate simulation (NCAR); combustion simulation (CCSE).

4 Background Poor I/O performance is a critical issue
The I/O bottleneck tends to get worse in the big data era. Low sustained system performance. Collective I/O is widely used: cooperation among processes; optimizes non-contiguous I/O requests; largely reduces I/O latency. As the data volume keeps increasing, poor I/O performance has been identified as a critical cause of the low sustained performance of parallel systems. How to improve I/O performance when processing such datasets is a major issue. Collective I/O is widely used to optimize non-contiguous I/O requests in parallel applications. Large numbers of small, non-contiguous I/O requests can cause long I/O latency, which degrades application performance. In collective I/O, the concurrent processes cooperate by sharing each other's access information before performing the read/write; the small non-contiguous requests can then be aggregated into large contiguous ones, and I/O performance is greatly improved.

5 Background Two-phase collective I/O is the most popular implementation of collective I/O. In the first phase, several processes are assigned as aggregators; in the second phase, the aggregators shuffle the data among all processes. [Figure: processes p0, p1, p2 shown in the initial state, the I/O phase, and the shuffle phase.] Two-phase collective I/O is the most common implementation of collective I/O. It is designed to optimize non-contiguous I/O requests: the processes cooperate to form a small number of I/O aggregators, and each I/O aggregator performs the I/O on behalf of all processes. In the figure there are three I/O aggregators; in the first phase, the aggregated requests are serviced for each aggregator, and in the second phase, each aggregator shuffles the data among all the processes.
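To make the two-phase mechanism concrete, here is a minimal sketch of a collective read from the application's point of view, assuming an illustrative file name and per-process slice size; ROMIO performs the aggregation and shuffle underneath MPI_File_read_at_all.

/* Minimal sketch of a collective read with MPI-IO (illustrative file name
 * and sizes). MPI_File_read_at_all triggers ROMIO's two-phase collective
 * I/O: a subset of processes act as aggregators, read large contiguous
 * ranges in the I/O phase, then shuffle data to the owning processes. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "dataset.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Each process reads its own slice; across ranks the accesses
     * interleave in the file, which is the case collective I/O optimizes. */
    const int count = 1 << 20;              /* 1 Mi doubles per process */
    double *buf = malloc(count * sizeof(double));
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);

    MPI_File_read_at_all(fh, offset, buf, count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}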

6 Motivation Multiple MPI applications running
I/O requests queue up on each storage node. Interruptions occur. The service order is random and the execution time varies. Collective I/O is a technique to optimize a single application's I/O; such optimization does not consider interruption from other applications and other processes, yet in current and future extreme-scale HPC systems this high concurrency is not negligible. With such interruption, the service order of the aggregators is random, and one application's aggregators can have different waiting times. To reduce the average execution time and improve performance, there should be an optimized, scheduled service order.

7 Motivation Different aggregators have different shuffle costs
[Figure: an interrupted collective read. App i has five processes; requests 0-4 from p0-p4 are aggregated onto aggregators p0, p2, and p4, which queue behind aggregators px of another application App x on storage nodes node0-node2.] Suppose multiple MPI applications are running. App i is one running application with five processes; the numbers 0-4 refer to I/O requests from p0-p4, respectively. These I/O requests are aggregated to form a two-phase collective I/O. Three processes, p0, p2, and p4, are picked as aggregators. From the aggregated requests we can see that p0 is responsible for p0-p3, the aggregator p2 is also responsible for p0-p3, but p4 accesses only its own data; this file-domain allocation is determined by the requests' offsets. [Click the mouse.] This is our first observation: different aggregators have different shuffle costs. Only p0 and p2 will shuffle, while p4 accesses its own data only, so no shuffle is needed. Let's see how the aggregators are served on each storage node. [Click the mouse to see the animation.] From the animation we can see that, due to interruption by other applications, the aggregators from one application may not be served at the same time. For example, application i's aggregator p4 is served first on node2, p0 is served second on node1, and p2 is served last on node0. p0, p2, and p4 are from the same application, but they are served at different times due to the interruptions on different nodes. [Click the mouse to see the second observation.] This is our second observation. Recall that after the aggregators finish their service on the storage nodes, there is still a shuffle phase in which the data must be redistributed; the application is not done until both phases are finished. Interrupted collective read: different aggregators have different shuffle costs, and aggregators may not be served at the same time.

8 Motivation Why does the service order matter?
Execution time = I/O phase cost + shuffle phase cost. Based on the observations that different aggregators have different shuffle costs and that one application's aggregators may not be served at the same time, let's see why the service order matters for the application's performance. We assume the read cost and the shuffle cost are the same for the same volume of data. We also assume that every aggregator has the same request size, whether from the same or a different application. The left figure shows a service order in which p4 is served first, then p0 and p2. When p0 finishes its I/O phase, which runs from 6t to 12t, p0 returns to the compute node to do the shuffle; at the same time, p2 starts its service on the storage nodes, which is why p0's shuffle phase and p2's I/O phase overlap. [Click the mouse.] The right figure shows another possible order: p0 first, then p2 and p4. Recall that p4 does not need to shuffle, so p4 only spends time in the I/O phase and need not participate in the shuffle phase. Fig. (a) results in 24t, while Fig. (b) takes only 18t. Comparing the two service orders, we can see that p0 and p2 should be scheduled first and p4 can be delayed. The difference is that p0 and p2 are slower than p4 due to their shuffle cost, and the benefit comes from scheduling the slow aggregators first. Two different service orders: in Fig. (a), p4 is serviced and returns first, and the execution time is 24t; in Fig. (b), p0 is serviced first and p4 is serviced last, and the cost is 18t. The benefit comes from scheduling the 'slow aggregator' first.
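The timing arithmetic behind the two figures can be written out explicitly. This is a sketch assuming each I/O phase and each shuffle phase costs 6t, which is consistent with the 24t and 18t totals quoted above:

% Order (a): p4, p0, p2 -- the slow aggregators finish last, so the final
% shuffle cannot overlap with anything:
T_a = \underbrace{6t}_{p_4\ \text{I/O}} + \underbrace{6t}_{p_0\ \text{I/O}}
    + \underbrace{6t}_{p_2\ \text{I/O}} + \underbrace{6t}_{p_2\ \text{shuffle}} = 24t
% Order (b): p0, p2, p4 -- each shuffle overlaps its successor's I/O phase,
% and p4 has no shuffle at all:
T_b = \underbrace{6t}_{p_0\ \text{I/O}} + \underbrace{6t}_{p_2\ \text{I/O}}
    + \underbrace{6t}_{p_4\ \text{I/O}} = 18t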

9 Our Idea To optimize the collective I/O performance
Schedule the "slow aggregator" first: Hierarchical I/O Scheduling (HIO). To reduce the average execution time, perform the scheduling across all applications. We propose hierarchical I/O scheduling to optimize the performance of collective I/O. The fundamental idea is to rearrange the aggregators on each node such that the "slow aggregator" is scheduled first. The goal of this work is to reduce the average execution time for *all* concurrent applications.

10 HIO Algorithm: Shuffle Cost
The aggregators have different shuffle costs; the higher the shuffle cost, the slower the aggregator. [Figure: example of process assignment in a three-node multi-core system.] To know which aggregator is slower, we first need to calculate the shuffle cost. We define the "slower aggregator" as the aggregator with the higher shuffle cost. During the shuffle phase, the communication pattern includes both inter-node and intra-node communication. We assume there is at most one aggregator per node. In the equation, the cost equals the message size m divided by the bandwidth B, plus the latency r, with separate inter-node and intra-node cost terms.
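As a hedged sketch of the cost model described here (the subscripted latencies and bandwidths are our notation, not the slide's):

% Each message of size m costs a latency term r plus a bandwidth term m/B;
% inter-node and intra-node transfers see different (r, B):
T_{\mathrm{inter}} = r_{\mathrm{net}} + \frac{m}{B_{\mathrm{net}}}, \qquad
T_{\mathrm{intra}} = r_{\mathrm{mem}} + \frac{m}{B_{\mathrm{mem}}}
% An aggregator j's total shuffle cost then sums over the processes it
% exchanges data with, split by pattern:
S_j = \sum_{k \in \mathrm{inter}(j)} \left( r_{\mathrm{net}} + \frac{m_{jk}}{B_{\mathrm{net}}} \right)
    + \sum_{k \in \mathrm{intra}(j)} \left( r_{\mathrm{mem}} + \frac{m_{jk}}{B_{\mathrm{mem}}} \right)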

11 HIO Algorithm: Acceptable Delay
Communication pattern detection: inter-node or intra-node communication pattern. Acceptable Delay (AD): how much time one aggregator can be delayed, relative to the finish of the slowest aggregator, without sacrificing the average performance. Saturate the AD and schedule service for other concurrent applications. (Notation: a_id is the aggregator rank, p_id is a non-aggregator rank, n_c is the number of cores per node.) To distinguish inter-node from intra-node communication, we assume the process manager Hydra uses its default process-mapping strategy. We can then perform a modular operation between the aggregator rank and the number of cores per node: if the aggregator and the non-aggregator lie on the same node, the communication is intra-node; otherwise it is inter-node. From the motivation example, we saw that aggregators with lower shuffle cost can be scheduled later; that is, some aggregators can be delayed without reducing performance, and the delay time can be used for other applications. We utilize this delay time in the scheduling algorithm. [Click the mouse to see the acceptable delay.] We therefore introduce the Acceptable Delay (AD) to support our approach. An aggregator's AD is how much time it can be delayed, before the slowest aggregator finishes, without sacrificing the average performance. Because the aggregators have various ADs, we do not have to service all the aggregators of one application at the same time: we can saturate the AD and use the saved time to service other concurrent applications. (Notation: AD_i is the i-th aggregator's AD, T_i is the i-th aggregator's cost.)
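A small C sketch of both ideas, assuming Hydra's default mapping places n_c consecutive ranks on each node; the helper names are illustrative:

/* Intra-node if the aggregator and the non-aggregator map to the same
 * node under block mapping (consecutive ranks share a node). */
#include <stddef.h>

static int is_intra_node(int a_id, int p_id, int n_c)
{
    return (a_id / n_c) == (p_id / n_c);
}

/* AD_i = T_max - T_i: how long aggregator i can be delayed before the
 * slowest aggregator of the same application finishes, where T_i is the
 * aggregator's predicted cost (shuffle cost included). */
static void compute_acceptable_delay(const double *T, double *AD, size_t n_agg)
{
    double t_max = T[0];
    for (size_t i = 1; i < n_agg; i++)
        if (T[i] > t_max)
            t_max = T[i];
    for (size_t i = 0; i < n_agg; i++)
        AD[i] = t_max - T[i];
}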

12 HIO Algorithm: Time Windows
Divide the I/O queue into sub-sequences. The earlier requests from one application should be served earlier. The aggregators within a time window are all from different applications (see the sketch after this paragraph). Since multiple applications run simultaneously, and each application can issue multiple execution instances over time, we use time windows to avoid starvation and maintain fairness. Usually the window size is fixed; in our design we also divide the I/O queue into sub-queues, but with varying sizes. The earlier requests from one application are served earlier, and the aggregators within a time window are all from different applications, so that one application's later collective I/O operations are never scheduled ahead of its earlier ones. Only the requests within the same execution instance are scheduled/reordered. [Figure: requests from App0-App4 in queues divided by time windows.]
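One plausible reading of the window rule in code: a window grows until it would contain a second request from the same application. The types and bookkeeping here are illustrative:

/* Marks window boundaries: win_id[i] is the window of the i-th queued
 * request, given each request's application id (assumed < MAX_APPS). */
#include <string.h>

#define MAX_APPS 64

static void assign_time_windows(const int *app_id, int *win_id, int n_req)
{
    char seen[MAX_APPS];
    int w = 0;
    memset(seen, 0, sizeof(seen));
    for (int i = 0; i < n_req; i++) {
        if (seen[app_id[i]]) {          /* app already in this window: */
            w++;                        /* close it and open the next  */
            memset(seen, 0, sizeof(seen));
        }
        seen[app_id[i]] = 1;
        win_id[i] = w;
    }
}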

13 HIO Algorithm: Scheduling
HIO: Hierarchical I/O Scheduling. Scheduling is performed on the server side; the shuffle cost, AD, and read cost are calculated on the client side at the MPI-IO layer. HIO algorithm: Divide-Sort-Compare. Step 1: divide the I/O queue into segments on each node. Step 2: sort the requests in each window by relative AD or by AppID. Step 3: compare each request's AD with its successor's read cost. We call our algorithm *hierarchical* I/O scheduling because we perform the scheduling on the *server* side while doing the preparatory work on the client side. On the client side's MPI-IO layer, we predict the shuffle cost, AD, and read cost. The HIO algorithm has three steps: Divide, Sort, and Compare. In the first step, we divide the I/O queues on each storage node by applying the time window. In the second step, we sort the requests in each window by relative acceptable delay or by application ID, shortest first. In the third step, we compare each request's acceptable delay with its successor's read cost to determine the optimal order.

14 HIO Algorithm: Scheduling
Input/output and pseudocode. Sort; then, if Agg_j.AD > Agg_{j+1}.read, swap Agg_j and Agg_{j+1} and update the AD. This slide shows more details of the HIO algorithm. The input includes the number of storage nodes, the number of applications, and the aggregators' acceptable delays, read costs, and relative acceptable delays. The output is an optimized service order for the concurrent MPI applications on each node. The pseudocode shows the HIO algorithm; we designed a threshold to determine whether to sort by acceptable delay or by application ID. We apply theoretical analysis to the different sorts and compare our algorithm with previous server-side I/O scheduling. In the compare step, we compare each request's AD with its successor's read cost: if the aggregator's AD is larger than its successor's read cost, the earlier aggregator can be delayed until the later aggregator finishes its read. Here "earlier" and "later" refer to aggregators from different applications.
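A hedged C sketch of the sort-and-compare pass within one window, following the pseudocode above; the struct layout, comparator, and AD update are our assumptions:

/* Sort the window (Step 2), then swap adjacent aggregators whenever the
 * earlier one's acceptable delay exceeds the later one's read cost,
 * updating the consumed AD after each swap (Step 3). */
#include <stdlib.h>

struct agg_req {
    int    app_id;
    double ad;      /* acceptable delay      */
    double read;    /* read (I/O phase) cost */
};

static int by_ad(const void *a, const void *b)
{
    double x = ((const struct agg_req *)a)->ad;
    double y = ((const struct agg_req *)b)->ad;
    return (x > y) - (x < y);           /* shortest AD first */
}

static void hio_schedule_window(struct agg_req *q, int n)
{
    qsort(q, n, sizeof(*q), by_ad);     /* Step 2: sort within the window */

    for (int j = 0; j + 1 < n; j++) {   /* Step 3: compare with successor */
        if (q[j].ad > q[j + 1].read) {
            /* q[j] can wait until q[j+1] finishes reading: swap them. */
            struct agg_req tmp = q[j];
            q[j] = q[j + 1];
            q[j + 1] = tmp;
            /* The delayed aggregator has consumed part of its AD. */
            q[j + 1].ad -= q[j].read;
        }
    }
}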

15 Theoretical Analysis m applications and n storage nodes.
Each request costs t on the server side and s_ij in the shuffle phase. The I/O phase cost is in {t, 2t, 3t, ..., mt}; the density is g(x) and the probability distribution function is G(x) = (x/m)^n. [Table: execution time of NIO, SIO, and HIO, plus the differences SIO-NIO and HIO-SIO.] We perform this theoretical analysis to see the potential of the HIO algorithm. In this model, we assume every application has one request on each node. Suppose each request needs time t to finish service on the server side and s_ij to finish the shuffle phase (s_ij is the j-th aggregator's shuffle cost for the i-th application). The longest finish time of the requests across all nodes determines the application's completion time on the server side, which can be {t, 2t, 3t, ..., mt}. The application's execution time is the sum of its server-side completion time and its maximum client-side shuffle cost. With normal collective I/O scheduling (NIO), the applications' server-side completion times (the I/O phase cost) share the same distribution: the density is g(x) and the PDF is G(x) = (x/m)^n. Therefore the execution time of each application is about mt plus its maximum shuffle cost. With server-side I/O scheduling (SIO), in which the same application's requests are serviced at the same time on all nodes, the average execution time is reduced to about half of the non-scheduled execution time, as shown in the table (2nd row). For the potential of HIO scheduling, we analyze the best case and the worst case separately. The best case requires two conditions: first, each application's slowest aggregator arrives at a different node; second, the slowest aggregator dominates the application's execution time. For the worst case, either the first condition or the second condition is not satisfied. If the first condition is not met, the applications' slowest aggregators arrive at the same node. If the second condition is not met, the slowest aggregator's shuffle cost is close to zero; with HIO scheduling, the initial order is then sorted by application ID, meaning the same application is serviced at the same time. Finally, we obtain the range of average execution times shown in the table (3rd row). Comparing the three scheduling methods, we clearly see the performance gain of HIO over SIO and NIO (last two rows).
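A hedged LaTeX reconstruction from the quantities defined above; the exact closed forms in the slide's table are not reproduced here, so the NIO/SIO expressions below are inferred from the transcript:

% Server-side completion time X has CDF G(x) = (x/m)^n over {t, 2t, ..., mt},
% so its expectation approaches mt as the number of nodes n grows:
\mathbb{E}[X] = \sum_{k=1}^{m} kt \left[ \left(\tfrac{k}{m}\right)^{n}
              - \left(\tfrac{k-1}{m}\right)^{n} \right] \;\xrightarrow{\,n\to\infty\,}\; mt
% Hence, per the transcript:
T_{\mathrm{NIO}} \approx mt + \max_j s_{ij}, \qquad
T_{\mathrm{SIO}} \approx \tfrac{m+1}{2}\,t + \max_j s_{ij} \quad (\text{about half of NIO})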

16 Implementation Predict the shuffle cost in the MPI-IO layer
Client (MPI-IO): ADIOI_Calc_file_domains, ADIOI_Calc_my/others_req; PVFS hint: shuffle cost. Server (PVFS): PINT_req_sched_post, HIO. We modified this driver to integrate the shuffle-cost analysis and pass the cost to the PVFS server-side scheduler as a hint. When an application calls the collective read function ADIOI_Read_and_exch in ad_read_coll.c under src/mpi/romio/adio/common, the shuffle cost is calculated after the aggregators are allocated, i.e., after ADIOI_Calc_file_domains; the message size m is calculated with ADIOI_Calc_my_req and ADIOI_Calc_others_req. The calculated shuffle cost is stored in a variable of PVFS-hint type, and the hint is passed to the file servers along with the I/O requests. On the PVFS server side, we implemented the HIO algorithm in the request-scheduling function PINT_req_sched_post(). The original function only enqueues incoming requests at the tail of the queue, whereas the HIO algorithm first divides the waiting queue into several sub-sequences and performs the scheduling within each sub-queue, following the scheduling algorithm discussed in Section III. In summary: predict the shuffle cost in the MPI-IO layer; pass the shuffle cost as a hint to the PVFS server side; modify the PVFS request scheduler; integrate HIO with the original scheduling algorithm.
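For illustration, a hedged sketch of how a predicted shuffle cost could travel as an MPI-IO hint. The hint key "pvfs2_shuffle_cost" is hypothetical, and the actual implementation sets the hint inside ROMIO's ADIO driver (ad_read_coll.c) rather than in application code as shown here:

/* Hypothetical hint key; the real patch stores the cost in a PVFS-hint
 * variable inside ADIOI_Read_and_exch, not at the application level. */
#include <mpi.h>
#include <stdio.h>

void open_with_shuffle_hint(MPI_Comm comm, const char *path,
                            double shuffle_cost, MPI_File *fh)
{
    char value[32];
    MPI_Info info;

    /* Encode the predicted shuffle cost as a string hint value. */
    snprintf(value, sizeof(value), "%.6f", shuffle_cost);

    MPI_Info_create(&info);
    MPI_Info_set(info, "pvfs2_shuffle_cost", value);  /* hypothetical key */

    /* The hint travels to the file servers along with the I/O requests. */
    MPI_File_open(comm, path, MPI_MODE_RDONLY, info, fh);
    MPI_Info_free(&info);
}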

17 Results and Analysis I/O request size: 16 MB. Number of processes: 50. Concurrent applications: 6, 12, 24, and 48. Storage nodes: 6. Test 1 (left): we evaluate HIO with different numbers of concurrent applications. The histogram shows that HIO reduces the interruption cost significantly by leveraging the shuffle cost and scheduling all concurrent applications together. The average execution time was decreased by up to 34.1% compared with normal I/O scheduling (NIO) and by up to 15.2% compared with server-side I/O scheduling (SIO); as the number of concurrent applications increases, the performance gain grows. Test 2 (right): we also conducted experiments varying the request size. As reported in Figure (b), the I/O request size was set to 64 KB, 1 MB, 5 MB, and 10 MB, respectively; the number of concurrent applications was set to six, and the number of aggregators was configured as six as well. HIO achieves greater speedup in average execution time: the performance gain of HIO scheduling increased from 6.8% to 18.3% in terms of the execution-time reduction rate.

18 Results and Analysis Normal collective I/O does not scale well with increasing system size; HIO performs better than SIO. 20 queries per dataset run concurrently: "select temperature from dataset where 10 < temperature < 31". The average query response time improved by up to 59.8%. Test 1 (Fig. a): we varied the number of storage nodes to evaluate the performance of HIO. The number of aggregators in each application equals the number of storage nodes, so that each application accesses all storage nodes. The request size was set to 15 MB, with equal request sizes across aggregators. As the number of storage nodes increases, HIO outperforms the other scheduling methods: with more storage nodes, there are more chances that one application's acceptable delay can be utilized by other applications, so the average execution time of all concurrent applications is reduced. Test 2 (Fig. b): we also evaluated HIO in a query system, to see how HIO works with real applications. In this system, we specify a range query and use collective I/O to retrieve the data. We run multiple queries simultaneously; without HIO, the interruption among queries causes long latency on the storage nodes. HIO achieves good performance by scheduling the I/O with the shuffle costs taken into account; the response time is reduced by up to 59.8%.

19 Conclusion and Future Work
The performance of collective I/O is degraded by the increasing shuffle cost caused by highly concurrent accesses and interruptions; this problem becomes ever more critical as applications grow highly data-intensive. This approach is the first to consider the shuffle cost involved in collective I/O, and HIO has strong potential for improving I/O performance in the big data era. Theoretical analysis and experiments confirm that HIO improves the performance of collective I/O. In the future, we will continue improving parallel I/O at large scale, including exascale systems. Previous collective I/O scheduling considered only the server-side cost, but an MPI application does not finish until all phases of the I/O are completed. This approach is not only the first work to consider the client-side shuffle cost, but also a potential solution to the increasing shuffle cost at large scale and in the big data era. In the future, we will further research improving I/O performance and verify the HIO idea at even larger scales.

20 Please visit our website: http://discl.cs.ttu.edu
Thank You. ACKNOWLEDGEMENT: This research is sponsored in part by the National Science Foundation under grant CNS and the Texas Tech University startup grant. The authors are thankful to Yanlong Yin of Illinois Institute of Technology and Wei-Keng Liao of Northwestern University for their constructive and thoughtful suggestions toward this study. We acknowledge the High Performance Computing Center (HPCC) at Texas Tech University for providing resources that have contributed to the research results reported within this paper. We are especially thankful to Dr. Sadaf Alam for presenting this study on our behalf. Thank you all for your time.

21 Q&A

22 Backup Slides-Experiment Setup
Platform: a 16-node Linux testbed with one PowerEdge R515 rack server node and 15 PowerEdge R415 nodes, 32 processors and 128 cores in total. Benchmarks: MPI-IO-Test, and FASM, a query system; the datasets are from climate science, in NetCDF format. Comparison: server-side I/O scheduling (SIO) and normal collective I/O (NIO). [Explain server-IO briefly.] The server-IO method schedules the same application's aggregators into the same order, so that all the aggregators within one application can be served at the same time. We compare our algorithm with normal collective I/O and server-IO scheduling.

23 Backup Slides-Theoretical Analysis
m applications and n storage nodes. Each request costs t on the server side and s_ij in the shuffle phase. The I/O phase cost is in {t, 2t, 3t, ..., mt}; the density is g(x) and the probability distribution function is G(x) = (x/m)^n. We perform this theoretical analysis to see the potential of the HIO algorithm. On each node, we assume every application has one request. Suppose each request needs time t to finish service on the server side and s_ij to finish the shuffle phase (s_ij is the j-th aggregator's shuffle cost for the i-th application). The longest finish time of the requests across all nodes determines the application's completion time on the server side, which can be {t, 2t, 3t, ..., mt}. The application's execution time is the sum of the server-side completion time and the maximum client-side shuffle cost. With normal collective I/O scheduling, the applications' server-side completion times (the I/O phase cost) share the same distribution: the density is g(x) and the PDF is G(x) = (x/m)^n. Therefore the execution time of each application is about mt plus the maximum shuffle cost. With server-IO scheduling, in which the same application's requests are serviced at the same time on all nodes, the average execution time is reduced to about half of the non-scheduled execution time. Equation 1: execution time of normal I/O scheduling (NIO). Equation 2: execution time of server-IO scheduling (SIO).

24 Backup Slides-Theoretical Analysis
Best and worst cases. Equation 3: best case of HIO. Equation 4: worst case of HIO. HIO achieves better scheduling performance, especially as the shuffle cost keeps increasing due to highly concurrent accesses from large-scale HPC systems and/or big-data retrieval and analysis problems. For the potential of HIO scheduling, we analyze the best case and the worst case separately. The best case requires two conditions: first, each application's slowest aggregator arrives at a different node; second, the slowest aggregator dominates the application's execution time, as shown in Equation 3. For the worst case, either the first condition or the second condition is not satisfied. If the first condition is not met, the applications' slowest aggregators arrive at the same node, and the total average cost is Equation 4. If the second condition is not met, the slowest aggregator's shuffle cost is close to zero; with HIO scheduling, the initial order is then sorted by application ID, meaning the same application is serviced at the same time, and the average execution time is the same as Equation 2. The execution-time reduction from applying server-IO is shown in Equation 5, and from Equation 6 we can see that HIO can further reduce the server-IO time by (m-1)t/2. Equation 5: reduction by applying SIO. Equation 6: further reduction by applying HIO.
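A hedged reconstruction of the bounds described above; the slide's equation images are not reproduced, so these forms are inferred from the surrounding definitions and should be read as a sketch, not as the paper's exact results:

% Best case (Eq. 3): each application's slowest, dominating aggregator lands
% on a distinct node and is scheduled first there, so applications pipeline:
T_{\mathrm{HIO}}^{\mathrm{best}} \approx t + \max_j s_{ij}
% Worst case (Eq. 4): shuffle costs vanish (or the slow aggregators collide
% on one node), degenerating to the SIO bound:
T_{\mathrm{HIO}}^{\mathrm{worst}} \approx \tfrac{m+1}{2}\,t + \max_j s_{ij}
% Reductions (Eqs. 5-6): SIO saves roughly (m-1)t/2 over NIO, and HIO can
% save up to a further (m-1)t/2 over SIO in the best case.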

