Communication Pattern Based Node Selection for Shared Networks

Presentation transcript:

Communication Pattern Based Node Selection for Shared Networks
Srikanth Goteti, Interactive Data Corp
Jaspal Subhlok, University of Houston
AMS Symposium 2003

Resource Selection for Network/Grid Applications
[Figure: an application composed of Model, Data, GUI, Sim 1, Pre, and Stream processes to be mapped onto a shared network. Where on the network is the best performance?]

Current Approaches to Node Selection
[Figure: the same application components mapped onto the network]
- Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS)
- Find the "best" nodes for execution based on network status
- But expected application performance derived from measured network status may not be accurate:
  - it depends on application characteristics
  - translation is hard, e.g., unused bandwidth vs. expected throughput
  - data may be stale, as frequent measurements are expensive

Performance Skeleton
- A performance skeleton is a synthetic, short-running program whose execution characteristics mirror the application it represents
- An application and its skeleton have a similar:
  - communication pattern
  - synchronization pattern
  - CPU usage
  - memory usage
- Goal: the performance of the skeleton is directly related to the performance of the application under any condition
  - e.g., the skeleton executes in 0.1% of the time the application takes to execute on any part of a shared network
(a minimal sketch of such a skeleton follows below)
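To make the idea concrete, the following is a minimal sketch of what a ring-pattern skeleton might look like, assuming mpi4py and numpy; the iteration count, message size, and amount of dummy work are illustrative placeholders, and the paper's actual skeletons are scaled-down NAS Class W codes, not this program.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Placeholders: tuned so the skeleton runs in a small fraction of the app's time.
ITERS, MSG_BYTES, WORK = 50, 64 * 1024, 200_000
sendbuf = np.zeros(MSG_BYTES, dtype=np.uint8)
recvbuf = np.empty_like(sendbuf)

t0 = MPI.Wtime()
acc = 0.0
for _ in range(ITERS):
    # Compute phase: dummy work sized to mimic the application's CPU usage.
    acc += np.sum(np.sqrt(np.arange(1.0, WORK)))
    # Communication phase: the same ring pattern as the application.
    comm.Sendrecv(sendbuf, dest=(rank + 1) % size,
                  recvbuf=recvbuf, source=(rank - 1) % size)
comm.Barrier()
if rank == 0:
    print(f"skeleton time: {MPI.Wtime() - t0:.3f} s")
```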

Node Selection with Performance Skeletons
[Figure: application components Sim 1, Pre, Stream, Model, Data, GUI mapped onto candidate nodes of the network]
1. Construct a skeleton for the application of interest
2. Select candidate node sets based on network status
3. Execute the skeleton on each candidate set
4. Select the node set with the best skeleton performance to schedule the actual application
(a driver sketch for this loop follows below)
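As a sketch of this workflow, the driver is essentially "time the skeleton on each candidate node set and keep the winner." The skeleton binary name and host names here are hypothetical, and the mpirun invocation assumes an Open MPI-style launcher.

```python
import subprocess
import time

SKELETON = "./cg_skeleton"   # hypothetical skeleton executable

def skeleton_time(nodes):
    """Run the skeleton once on a candidate node set; return wall-clock seconds."""
    t0 = time.time()
    subprocess.run(["mpirun", "-np", str(len(nodes)),
                    "-host", ",".join(nodes), SKELETON], check=True)
    return time.time() - t0

def pick_node_set(candidates):
    """Schedule the application on the candidate set with the best skeleton time."""
    return min(candidates, key=skeleton_time)

# Usage, with hypothetical host names:
# best = pick_node_set([["n1", "n2", "n3"], ["n1", "n5", "n6"]])
```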

Node Selection Procedure
1. Construct a performance skeleton
   - mostly by hand in this paper; automatic construction is the subject of ongoing work
2. Select candidate node sets
   - identify the communication graph of the application, typically a chain, ring, or all-to-all structure
   - obtain available bandwidth between nodes with NWS (Network Weather Service) and build a network graph
   - select nodes to "maximize the minimum available bandwidth" between pairs of communicating nodes
   - this yields the best possible node sets given the application structure and network status (see the selection sketch after this list)
3. Execute the skeleton on each candidate node set
4. Select the node set with the best skeleton performance and map one process to each node
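For a ring-structured application, the max-min selection step can be sketched as a brute-force search over node subsets and ring orderings, which is feasible at the scale of a 10-node cluster. This is an illustrative reconstruction rather than the paper's code; `bw` is assumed to be a symmetric matrix of NWS-measured available bandwidths.

```python
from itertools import combinations, permutations

def select_ring_nodes(bw, k):
    """Choose k of len(bw) nodes and a ring ordering that maximize the
    minimum available bandwidth over the ring's edges.  bw[i][j] is the
    measured available bandwidth between nodes i and j (assumed symmetric)."""
    n = len(bw)
    best_min, best_ring = -1.0, None
    for subset in combinations(range(n), k):
        # Fix the first node: rings are invariant under rotation.
        for order in permutations(subset[1:]):
            ring = (subset[0],) + order
            edges = zip(ring, ring[1:] + ring[:1])   # consecutive pairs, wrapping
            worst = min(bw[a][b] for a, b in edges)
            if worst > best_min:
                best_min, best_ring = worst, ring
    return best_ring, best_min
```

A chain is handled the same way minus the wrap-around edge, and an all-to-all structure takes the minimum over every pair in the chosen subset.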

Communication Structure of NAS Benchmarks
[Figure: communication graphs of the NAS benchmarks CG, BT, IS, LU, MG, SP, and EP, showing which pairs of processes communicate]

Validation Experiments
- The best nodes to execute each benchmark are selected by each of the following methods:
  - skeleton based: the full framework discussed above
  - all-to-all: maximize the minimum available bandwidth on the network graph, treating all node pairs as communicating
  - random
- The performance of the application on the nodes selected by each procedure is then compared on a busy network
- Experiments are repeated a large number of times to get statistically meaningful results

Experimental Framework
- Linux cluster of 10 dual-CPU 1.7 GHz Pentium nodes connected by 100 Mbps links and a crossbar switch
- Experiments use the Class B NAS MPI benchmark suite
- Class W NAS benchmarks (average runtime ~1.5 seconds on our cluster) serve as skeletons for the Class B benchmarks
- Available bandwidth between nodes is varied with Linux iproute2 for the duration of the experiments as follows (see the emulation sketch below):
  - the path between a pair of nodes is "shared" by S streams, i.e., available bandwidth is set to 1/S of peak
  - one stream is randomly added to or removed from the cluster every 30 seconds
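The slides say the sharing is emulated with Linux iproute2; a load generator in that spirit might look like the following sketch, where the interface name and token-bucket parameters are assumptions rather than the authors' actual settings.

```python
import random
import subprocess
import time

PEAK_MBPS = 100   # Fast Ethernet peak on the test cluster
DEV = "eth0"      # hypothetical interface name

def set_shared_streams(s):
    """Shape the link to 1/S of peak bandwidth, emulating S competing
    streams, using iproute2's tc with a token bucket filter."""
    rate = max(1, PEAK_MBPS // s)
    subprocess.run(["tc", "qdisc", "replace", "dev", DEV, "root", "tbf",
                    "rate", f"{rate}mbit", "burst", "32kbit",
                    "latency", "400ms"], check=True)

streams = 1
while True:
    set_shared_streams(streams)
    time.sleep(30)   # one change every 30 seconds, as in the experiments
    streams = max(1, streams + random.choice([-1, 1]))
```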

Performance Results: Slowdown Due to Network Traffic
[Figure: slowdown of each benchmark under the three selection methods; the CG communication graph is shown alongside]
- Skeleton-based selection has an average slowdown of 20%, versus 40% for random and 27% for all-to-all selection
- There is significant variation across benchmarks; CG benefits most because it is communication heavy and uses only 3 links

Conclusions
- Performance skeletons have a role in resource management for grids
  - they remove limitations of NWS-type systems (the "what you measure versus what you get" problem)
- A lot more experimentation is needed to establish and validate the concepts
- Automatic construction of performance skeletons is a major open challenge
- Skeletons may have other uses, e.g., as a fast way of estimating the performance of an application on a slow simulated future system