Computer Science and Engineering Predicting Performance for Grid-Based P. 1 IPDPS’07 A Performance Prediction Framework.


1 Computer Science and Engineering, Predicting Performance for Grid-Based Datamining, glimcher@cse.ohio-state.edu, P. 1, IPDPS’07
A Performance Prediction Framework for Grid-Based Data Mining Applications
Leonid Glimcher, Gagan Agrawal

2 Motivating Scenario
[Figure: Data Repository Clusters, Compute Clusters, User]
3 stages: disk I/O, network, compute.

3 Remote Data Analysis
– Remote data analysis:
  – The Grid is a good fit
  – Details can be very tedious
– Middleware abstracts away many development details
– Resource selection is crucial to performance
– Performance prediction facilitates resource selection

4 Presentation Road Map
– Problem statement and motivation
– Middleware background
– Our performance prediction approach
– Experimental evaluation
– Related work
– Conclusions

5 Problem Statement
Given:
– A parallel data processing application
– Its execution time breakdown (profile)
– Configurations of the available computing resources
– Dataset replicas in repositories of different sizes
Predict the application's execution time in order to select the right dataset replica and resource configuration.
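The selection step in the problem statement can be illustrated with a small sketch. It assumes a predictor function already exists; the `Candidate` fields, the `select_best` helper, and the toy predictor in the usage note are hypothetical names chosen for illustration, not part of FREERIDE-G or the framework itself.

```python
from typing import Callable, Iterable, NamedTuple


class Candidate(NamedTuple):
    """One (replica, resource configuration) pair; all fields are illustrative."""
    replica: str          # hypothetical replica identifier
    storage_nodes: int    # n' for this candidate
    compute_nodes: int    # c' for this candidate
    bandwidth: float      # b' for this candidate


def select_best(candidates: Iterable[Candidate],
                predict: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate with the smallest predicted execution time."""
    pool = list(candidates)
    if not pool:
        raise ValueError("no candidates to choose from")
    return min(pool, key=predict)
```

For example, with a toy predictor such as `lambda c: 100.0 / c.compute_nodes + 50.0 / c.bandwidth`, `select_best` simply returns whichever candidate minimizes that value. Any real predictor would be built from the profile information described on the later slides.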

6 FREERIDE-G Design

7 FREERIDE-G Processing
KEY observation: most data mining algorithms follow a canonical loop.
Middleware API:
– Subset of data to be processed
– Reduction object
– Local and global reduction operations
– Iterator
Canonical loop:
    While( ) {
        forall( data instances d ) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
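The canonical loop on this slide can be sketched in Python as a local reduction per node followed by a global merge. This is a minimal illustration, not the FREERIDE-G API: the function names, the dictionary used as the reduction object, and the requirement that `op` be associative with an `identity` value are all assumptions made for the example.

```python
from collections import defaultdict


def local_reduction(chunk, process, op, identity):
    """Apply R(I) = R(I) op d to every data instance d in one node's chunk."""
    reduction = defaultdict(lambda: identity)
    for d in chunk:
        i = process(d)                      # which reduction element d updates
        reduction[i] = op(reduction[i], d)  # fold d into that element
    return dict(reduction)


def global_reduction(partials, op, identity):
    """Merge the per-node reduction objects into one (assumes op is associative)."""
    merged = defaultdict(lambda: identity)
    for partial in partials:
        for i, value in partial.items():
            merged[i] = op(merged[i], value)
    return dict(merged)


# Usage: sum integers by parity across two hypothetical compute nodes.
chunks = [[1, 2, 3], [4, 5]]
partials = [local_reduction(c, lambda d: d % 2, lambda a, b: a + b, 0)
            for c in chunks]
result = global_reduction(partials, lambda a, b: a + b, 0)
```

Because every update goes through the reduction object, the order in which data instances are processed does not affect the result, which is what lets the middleware split the data across nodes.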

8 Performance Prediction Approach
3 phases of execution:
– Retrieval at data server
– Data delivery to compute node
– Parallel processing at compute node
Special processing structure:
– Generalized reduction
$T_{exec} = T_{disk} + T_{network} + T_{compute}$
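The three-phase decomposition can be written down directly. The sketch below only restates the sum on this slide; the `Breakdown` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakdown:
    """Execution time split into the three phases named on the slide (seconds)."""
    disk: float      # T_disk: retrieval at the data server
    network: float   # T_network: delivery to the compute nodes
    compute: float   # T_compute: parallel processing

    def total(self) -> float:
        """T_exec = T_disk + T_network + T_compute."""
        return self.disk + self.network + self.compute
```

For instance, `Breakdown(disk=10.0, network=5.0, compute=35.0).total()` evaluates the sum of the three components.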

9 Needed Profile Information
– Numbers of storage nodes (n) and compute nodes (c)
– Available bandwidth between these (b), in the profile configuration
– Execution time breakdown into components:
  – data retrieval ($t_d$)
  – network communication ($t_n$)
  – data processing ($t_c$)
– Dataset size (s)
– Reduction object information:
  – maximum size
  – communication time
– Global reduction time

10 Data Retrieval and Communication Time
– Data retrieval: uses the dataset size (s) and number of data hosts (n) of the base profile, together with those of the predicted configuration (s’ and n’), to scale $t_d$.
– Data communication: also needs the dataset size and number of data hosts, as well as the bandwidth (b and b’), to scale $t_n$.
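The transcript does not preserve the scaling formulas themselves, so the following is only a sketch under an explicit assumption: time grows linearly with dataset size, shrinks inversely with the number of data hosts, and (for communication) shrinks inversely with bandwidth. The function names are hypothetical.

```python
def predict_retrieval(t_d, s, n, s_new, n_new):
    """Scale the profiled retrieval time t_d to a new dataset size and host count.

    Assumed model (not taken from the slides): linear in size, inverse in hosts.
    """
    return t_d * (s_new / s) * (n / n_new)


def predict_network(t_n, s, n, b, s_new, n_new, b_new):
    """Scale the profiled communication time t_n.

    Assumed model: same as retrieval, plus inverse scaling with bandwidth.
    """
    return t_n * (s_new / s) * (n / n_new) * (b / b_new)
```

Under these assumptions, doubling the dataset while quadrupling the data hosts halves the retrieval time, and halving the bandwidth doubles the communication time.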

11 Initial Data Processing Time Prediction
Dataset size (s) and number of compute nodes (c):
– base profile (s, c)
– predicted profile (s’, c’)
Used to scale $t_c$.
Limitations: not modeling
– Inter-processor communication time
– Global reduction time
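A minimal sketch of this initial compute-time prediction follows. The exact formula is not preserved in the transcript; the version below assumes perfectly parallel work (linear in dataset size, inverse in compute nodes), which is consistent with the stated limitation that communication and global reduction are not yet modeled. The function name is hypothetical.

```python
def predict_compute(t_c, s, c, s_new, c_new):
    """Scale the profiled processing time t_c to (s_new, c_new).

    Assumed model: linear in dataset size, inverse in compute-node count.
    Ignores inter-processor communication and global reduction.
    """
    return t_c * (s_new / s) * (c / c_new)
```

Because this ignores the serialized parts of the computation, it would tend to underestimate the time on larger node counts, which motivates the refinements on the following slides.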

12 Modeling Interprocessor Communication
– Parallel computation involves communication of the reduction object
– Quantities involved:
  – Communication time ($T_{ro}$)
  – Reduction object size (r)
  – Interprocessor bandwidth (w)
  – Latency (l)
– Reduction object size either remains constant or scales linearly
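The slide names the quantities but the transcript does not keep the formula relating them. One common way to combine them is a latency-plus-transfer model, sketched below as an assumption rather than as the authors' equation; the function names are hypothetical, and the linear case is assumed to scale with dataset size.

```python
def reduction_object_size(r, s, s_new, scales_linearly):
    """Size of the reduction object in the predicted configuration.

    Either stays constant or (assumed here) scales linearly with dataset size.
    """
    return r * (s_new / s) if scales_linearly else r


def reduction_comm_time(r, w, l):
    """Assumed latency-bandwidth model: T_ro = l + r / w."""
    return l + r / w
```

For a 1000-byte object over a 100 bytes/s link with 0.5 s latency, this model gives `reduction_comm_time(1000.0, 100.0, 0.5)`, i.e. the latency plus ten seconds of transfer.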

13 Modeling Global Reduction
Global reduction time ($T_g$) is also serialized.
Depending on the application, global reduction time either:
– Scales linearly with the number of nodes but is independent of data size, or
– Stays constant regardless of the number of nodes but scales linearly with data size
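The two application-dependent cases on this slide can be sketched as a single function with a mode switch. The function name and the `"nodes"` / `"data"` labels are hypothetical; only the two scaling behaviours come from the slide.

```python
def predict_global_reduction(t_g, c, s, c_new, s_new, scaling):
    """Scale the profiled global reduction time t_g.

    scaling="nodes": linear in compute-node count, independent of data size.
    scaling="data":  linear in data size, independent of compute-node count.
    """
    if scaling == "nodes":
        return t_g * (c_new / c)
    if scaling == "data":
        return t_g * (s_new / s)
    raise ValueError(f"unknown scaling mode: {scaling!r}")
```

Which mode applies would have to be known per application, for example from how its reduction object behaves.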

14 Modeling Across Heterogeneous Clusters
Need scaling factors for all 3 stages of computation (obtained from a set of representative applications).
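The slide does not say how the scaling factors are derived from the representative applications. The sketch below assumes one simple option: average the target-to-base time ratio for a stage across those applications, then multiply a base-cluster prediction by that factor. The function names and the use of an arithmetic mean are assumptions.

```python
from statistics import mean


def stage_scaling_factor(base_times, target_times):
    """Average ratio target/base for one stage over representative applications."""
    if not base_times or len(base_times) != len(target_times):
        raise ValueError("need equally sized, non-empty measurement lists")
    return mean([t / b for b, t in zip(base_times, target_times)])


def scale_breakdown(base_prediction, factors):
    """Apply per-stage factors to a base-cluster prediction (dict keyed by stage)."""
    return {stage: time * factors[stage]
            for stage, time in base_prediction.items()}
```

A separate factor per stage matters because disk, network, and CPU speeds rarely change by the same ratio between two clusters.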

15 FREERIDE-G Applications
Data mining:
– K-means clustering
– KNN search
– EM clustering
Scientific data processing:
– Vortex extraction (pictured at right on the slide)
– Molecular defect detection and categorization

16 Experimental Setup
– Base: 700 MHz Pentiums connected through Myrinet LANai 7.0
– Heterogeneous prediction: 2.4 GHz Opteron 250s connected through InfiniBand (1 Gb)
Goal: to correctly model changes in:
1. Parallel configuration
2. Dataset size
3. Network bandwidth
4. Underlying resources

17 Modeling Parallel Performance
Errors for the 3 approaches for:
1. Vortex detection; base: 1-1 configuration, 710 MB dataset
2. Defect detection; base: 1-1 configuration, 130 MB dataset
Results:
– Modeling reduction pays off
– Accurate predictions
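The slides report prediction errors without defining the metric in the transcript. A common choice, assumed here purely for illustration, is the relative error of the predicted time against the measured time; the function name is hypothetical.

```python
def relative_error(predicted, actual):
    """Relative prediction error |predicted - actual| / actual (assumed metric)."""
    if actual <= 0:
        raise ValueError("actual execution time must be positive")
    return abs(predicted - actual) / actual
```

With this definition, a prediction of 90 s for a run that actually took 100 s is off by 10 s relative to a 100 s baseline.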

18 Modeling Dataset Size
Errors for the 1 (best) approach for:
1. EM clustering (1.4 GB); base: 1-1 configuration, 350 MB dataset
2. Defect detection (1.8 GB); base: 1-1 configuration, 130 MB dataset
Results:
– Biggest error when the number of data nodes is the same as the number of compute nodes
– Accurate predictions

19 Impact of Network Bandwidth
Errors for the 1 (best) approach for:
1. EM clustering (250 Kbps); base: 1-1 configuration, 500 Kbps
2. Defect detection (250 Kbps); base: 1-1 configuration, 500 Kbps
Results:
– Biggest error when the number of data nodes is the same as the number of compute nodes
– Modeling reduction is most accurate

20 Predictions for a Different Type of Cluster
Errors for the 1 (best) approach for:
1. Defect detection (1.8 GB); base: 1-1 configuration, 710 MB dataset
2. EM clustering (700 MB); base: 8-8 configuration, 350 MB dataset
Results:
– Scaling factors differ
– Largest error when the predicted configuration has the same number of compute nodes as the base

21 Existing Work
3 broad categories for resource allocation:
– Heuristic approach to mapping
– Prediction through modeling:
  – Statistical estimation/prediction
  – Analytical modeling of parallel applications
– Simulation-based performance prediction

22 Summary
– A performance prediction approach
– Exploits similarities in application processing structure to produce very accurate results
– The approach accurately models changes in:
  – Computing configuration
  – Dataset size
  – Network bandwidth
  – Underlying compute resources

