1 Synthesizing Representative I/O Workloads for TPC-H
J. Zhang*, A. Sivasubramaniam*, H. Franke, N. Gautam*, Y. Zhang, S. Nagar
* Pennsylvania State University; IBM T.J. Watson Research Center; Rutgers University

2 Outline
Motivation
Related Work
Methodology
–Arrival Time
–Access Pattern
–Request Sizes
Accuracy of synthetic traces
Concluding Remarks

3 Motivation
I/O subsystems are critical for commercial services and production environments.
Workloads drawn from real applications are essential for system design and evaluation.
TPC-H is a decision-support workload for business enterprises.

4 Disadvantages of Traces
Not easily obtainable
Can be very large
Difficult to get statistical confidence
Very difficult to change workload behavior
Do not isolate the influence of a single parameter
On the other hand, a deeper understanding of the workload can:
Help generate a synthetic workload
Help in system design itself

5 What do we need to synthesize?
Inter-arrival times (temporal behavior) of disk block requests
Access pattern (spatial behavior) of the blocks being referenced
Size (volume) of each I/O request
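For concreteness, each trace record can be viewed as a (time, location, size) triple. Below is a minimal sketch of such a record; the class and field names are ours, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class IORequest:
    """One disk I/O request from the trace (illustrative field names)."""
    arrival_time: float   # seconds since the start of the trace (temporal behavior)
    start_sector: int     # first sector referenced (spatial behavior)
    size_blocks: int      # request size in blocks (volume)

# A trace, real or synthetic, is then an ordered sequence of such requests.
example_trace = [IORequest(0.000, 1, 64), IORequest(0.004, 5, 64)]
```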

6 Related work
Scientific application I/O behavior
–Time-series models for arrivals
–Sequentiality/Markov models for access pattern
Commercial/production workloads
–Self-similar arrival patterns
–Sequentiality in TPC-H/TPC-D
No prior complete synthesis of all three attributes for TPC-H

7 Our TPC-H Workload
Trace Collection Platform
–IBM Netfinity 8-way SMP with 2.5 GB memory and 15 disks
–Linux 2.4.17
–DB2 UDB EE V7.2
TPC-H Configuration
–Power run of 22 queries
–Tables partitioned across the disks
–30 GB dataset

8 Validation
Flow: identify the characteristics of the original I/O traces and generate synthetic traces from them; both the original and the synthetic traces are run through Disksim 2.0 and their response time CDFs are compared.
Metrics:
–RMS: root-mean-square error of the differences between the two CDF curves
–nRMS: RMS/m, where m is the average response time for the original trace
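A minimal sketch of how these two metrics could be computed, interpreting the RMS error as the root-mean-square distance between the two response-time CDF curves evaluated at matching percentiles; the function names and the percentile grid are our own choices, not from the paper.

```python
import numpy as np

def rms_cdf_distance(rt_original, rt_synthetic, points=100):
    """RMS of the differences between two response-time CDF curves,
    measured as the distance between the curves at matching percentiles."""
    q = np.linspace(0.0, 1.0, points)
    orig_q = np.quantile(rt_original, q)     # response time at each percentile (original)
    synth_q = np.quantile(rt_synthetic, q)   # response time at each percentile (synthetic)
    return float(np.sqrt(np.mean((orig_q - synth_q) ** 2)))

def nrms(rt_original, rt_synthetic, points=100):
    """nRMS = RMS / m, where m is the mean response time of the original trace."""
    return rms_cdf_distance(rt_original, rt_synthetic, points) / float(np.mean(rt_original))
```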

9 Overall Methodology
Arrival pattern characteristics
–Investigate correlations: time series, self-similarity, iid distributions
Access pattern characteristics
–Sequentiality / pseudo-sequentiality / randomness
–Size characteristics
Investigate correlations between time, space and volume to arrive at the final synthesis

10 Arrival pattern
Statistical analysis
–Auto-correlation function (ACF) plots: show the correlation between the current inter-arrival time and the one x steps away
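A minimal sketch of the underlying computation, estimating the sample autocorrelation of a sequence of inter-arrival times at lags 1..max_lag; function and variable names are ours.

```python
import numpy as np

def autocorrelation(interarrivals, max_lag=50):
    """Sample autocorrelation of inter-arrival times at lags 1..max_lag."""
    x = np.asarray(interarrivals, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    acf = []
    for lag in range(1, max_lag + 1):
        # correlation between the series and itself shifted by `lag` steps
        acf.append(float(np.sum(x[:-lag] * x[lag:]) / denom))
    return acf
```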

11 Arrival pattern (continued)
–Correlations seem very weak (<0.15 for 12 queries and <0.30 for the rest)
–Errors with time-series models (AR/MA/ARIMA/ARFIMA) are high
–No suggestion of self-similarity either
–Perhaps iid (independent and identically distributed) inter-arrivals are not a bad assumption

12 Fitting distributions
–Tried hyper-exponential, normal, and Pareto distributions
–Used Maximum Likelihood Estimation (normal/Pareto) and Expectation Maximization (hyper-exponential) to estimate distribution parameters
–Used the K-S test to measure goodness of fit
–The maximum distance between the fitted distribution and the original CDF was ensured to be less than 0.1
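A minimal sketch of the fitting and goodness-of-fit step for the normal and Pareto cases, using SciPy's built-in MLE fitting and K-S statistic; the EM fit for the hyper-exponential is omitted, and the function name and return format are our own.

```python
import numpy as np
from scipy import stats

def fit_and_test(interarrivals):
    """MLE-fit normal and Pareto distributions to the inter-arrival times and
    report the K-S statistic (max distance between fitted and empirical CDFs)."""
    data = np.asarray(interarrivals, dtype=float)
    results = {}
    # Normal: MLE gives the sample mean and standard deviation.
    mu, sigma = stats.norm.fit(data)
    results["normal"] = stats.kstest(data, "norm", args=(mu, sigma)).statistic
    # Pareto: MLE for shape, location and scale.
    b, loc, scale = stats.pareto.fit(data)
    results["pareto"] = stats.kstest(data, "pareto", args=(b, loc, scale)).statistic
    return results  # a fit would be accepted if its K-S distance is below 0.1
```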

13 Comparing CDF of fitted distribution and data

14 Access Pattern (Location + Size)
Most studies use sequentiality to describe TPC-H. However, this is not always the case.
–Cat1: Q10, Q4, Q14
–Cat2: Q12, Q1, Q3, Q5, Q7, Q8, Q15, Q18, Q19, Q21
–Cat3: Q20, Q9, Q17
(Figure: block location vs. arrival time plots for a representative query from each category.)

15 Category 1: Intermingling sequential streams
Consider the following:
–Run: a strictly sequential set of I/O requests
–Stream: a pseudo-sequential set of I/O requests that could be interrupted by another stream
–i.e., a stream could have several runs that are interrupted by runs of other streams

16 Run and Stream
(Figure: an example run of 5 strictly sequential requests; an example pseudo-sequential stream of 4 requests; and an example trace in which the runs of Stream A (sectors in the 1-14 range) are interleaved with those of Stream B (sectors in the 100-112 range).)

17 Secondary Attributes
Run length: number of requests in a run
Run start location: start sector of a run
Stream length: number of requests in a stream
Inter-stream jump distance: spatial separation between the start of a run and the previous request
Intra-stream jump distance: spatial separation between successive requests within a stream
Number of active streams (at any instant)
Interference distance: number of requests between two successive requests of a stream
Derive empirical distributions for these from the trace (see the sketch below).
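As an example of deriving one of these empirical distributions, here is a minimal sketch that extracts run lengths from a trace of (start sector, size) requests. The run definition follows the previous slide (a run ends as soon as the next request is not contiguous); the function name and input format are ours.

```python
def run_lengths(requests):
    """requests: list of (start_sector, size_in_sectors) in arrival order.
    Returns the length (# of requests) of every strictly sequential run."""
    lengths = []
    current = 0
    prev_end = None
    for start, size in requests:
        if prev_end is not None and start == prev_end:
            current += 1          # contiguous with the previous request: same run
        else:
            if current:
                lengths.append(current)
            current = 1           # a new run starts here
        prev_end = start + size
    if current:
        lengths.append(current)
    return lengths

# The empirical distribution Pr(run length) is then a normalized histogram of `lengths`.
```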

18 Location Synthesis - Q10 (time and size taken from the trace)
–LocIID: locations are i.i.d.
–LocRUN: incorporate the run length distribution and the run start location distribution
–LocSTREAM: combine all stream and run statistics

19 Request Size
Request sizes take one of eight values: 64, 128, 192, 256, 320, 384, 448, or 512 blocks.
But the attributes (location, size, time) are not independent!

20 Correlations between size and location
Fraction of requests at each size (blocks):

Size (blocks)   64    128   192   256   320   384   448   512
All req.       .716  .009  .010  .009  .011              .225
Run start      .577  .012  .013  .012  .013  .015  .016  .342
Within run     .916  .004  .005                          .057

21 Correlations between size and time

22 Correlations between location and time

23 Final Synthesis Methodology (Category 1)
Location: use LocSTREAM to generate start locations. This yields two kinds of requests: run-start requests and within-run requests.
Time: use Pr(inter-arrival time | run-start request) and Pr(inter-arrival time | within-run request) to generate inter-arrival times.
Size:
1) For run-start requests, use Pr(size | inter-arrival time of run-start request) to generate sizes.
2) For within-run requests, use Pr(size | within-run request) to generate sizes.
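A minimal sketch of this generation loop, assuming the conditional distributions named above have already been estimated from the trace and are available as sampling callables; all names here are illustrative, not from the paper.

```python
def synthesize(num_requests, loc_stream, sample_gap, sample_size):
    """loc_stream:  iterator of (start_sector, is_run_start) pairs produced by LocSTREAM
       sample_gap:  sample_gap(is_run_start) -> inter-arrival time drawn from
                    Pr(inter-arrival | run-start) or Pr(inter-arrival | within-run)
       sample_size: sample_size(is_run_start, gap) -> size drawn from
                    Pr(size | inter-arrival) for run starts, Pr(size | within-run) otherwise."""
    t = 0.0
    trace = []
    for _ in range(num_requests):
        start_sector, is_run_start = next(loc_stream)
        gap = sample_gap(is_run_start)
        size = sample_size(is_run_start, gap)
        t += gap
        trace.append((t, start_sector, size))   # one synthetic (time, location, size) request
    return trace
```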

24 The methodology can be easily adapted for Category 2 (strictly sequential) and Category 3 (random) queries.
Validation: compare the response time characteristics of the synthesized and the real trace.

25 Validation of CDF of response times (Category 1)

26 Validation of CDF of response times (Category 2)

27 Validation of CDF of response times (Category 3)

28 Storage Requirements

Query                       Q1    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
Storage fraction (x0.001)   3.46  3.64  2.76  3.43  3.46  3.47  3.66  .004  2.79
nRMS                        0.10  0.09  0.20  0.07  0.01  0.04  0.05  0.15  0.16

Query                       Q12   Q14   Q15   Q17   Q18   Q19   Q20   Q21
Storage fraction (x0.001)   3.73  6.49  3.46  2.03  3.54  3.44  4.57  2.95
nRMS                        0.06  0.19  0.01  0.05  0.06  0.03  0.10  0.07

29 Contributions
A synthesis methodology to capture
–intermingling streams of requests
–correlations between request attributes
An application of this methodology to TPC-H
Along the way (for TPC-H):
–iid distributions can capture the arrival time characteristics
–strict sequentiality is not always the case

30 Backup slides

31 Validating arrival time synthesis

32 LocSTREAM
1. Use Pr(stream length) to generate stream lengths.
2. Use Pr(run length | stream length) to generate run lengths for each stream.
3. Generate a start location for each run:
   a) Use Pr(inter-stream jump distance) to generate the start location of the first run in the stream.
   b) Use Pr(intra-stream jump distance | this stream) to generate the start locations of the other runs in the stream.
4. Use Pr(interference distance) to interleave all the streams.
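A minimal sketch of these four steps, assuming sampling callables for each empirical distribution are available; all names are ours, and step 4 is simplified to a random merge that preserves per-stream run order rather than an exact draw from Pr(interference distance).

```python
import random

def loc_stream(num_streams, sample_stream_len, sample_run_len,
               sample_inter_jump, sample_intra_jump, prev_request_end=0):
    """Generate (run start location, run length) pairs for num_streams interleaved streams."""
    streams = []
    for _ in range(num_streams):
        stream_len = sample_stream_len()                    # step 1: Pr(stream length)
        start = prev_request_end + sample_inter_jump()      # step 3a: Pr(inter-stream jump dist.)
        runs, remaining = [], stream_len
        while remaining > 0:
            # step 2: Pr(run length | stream length); clamp so the stream length is respected
            run_len = max(1, min(sample_run_len(stream_len), remaining))
            runs.append((start, run_len))
            remaining -= run_len
            start += sample_intra_jump()                    # step 3b: Pr(intra-stream jump | stream)
        streams.append(runs)
    # step 4: interleave the runs of all streams (simplified stand-in for Pr(interference distance))
    order = [i for i, s in enumerate(streams) for _ in s]
    random.shuffle(order)
    cursors, merged = [0] * num_streams, []
    for i in order:
        merged.append(streams[i][cursors[i]])
        cursors[i] += 1
    return merged
```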

