
Impact of High Performance Sockets on Data Intensive Applications
Pavan Balaji, Jiesheng Wu, D. K. Panda, CIS Department, The Ohio State University
Tahsin Kurc, Umit Catalyurek, Joel Saltz, BMI Department, The Ohio State University

Presentation Layout
– Motivation and Background
– Sockets Implementations
– DataCutter Library
– Experimental Results
– Conclusions and Future Work

Background
Data Intensive Applications
– Communication intensive; I/O intensive
– Require guarantees in performance
– Scalability with guarantees
– Adaptability to heterogeneous networks
– Several of them are built over TCP/IP
Times have changed
– Faster networks available (cLAN, InfiniBand)
– Faster protocols available (VIA, EMP)

Motivation
High Performance Sockets Layers
– Take advantage of faster networks [balaji02, shah99]
– No changes to the applications
– Bottleneck: design of applications based on TCP/IP communication
Questions
– Can a high performance substrate allow the implementation of a scalable, interactive, data-intensive application with performance guarantees to the end user?
– Can a high performance substrate improve the adaptability of data-intensive applications to heterogeneous environments?
[balaji02] "High Performance User-Level Sockets over Gigabit Ethernet", Pavan Balaji, Piyush Shivam, Pete Wyckoff and D. K. Panda, Cluster 2002, Chicago.
[shah99] "High Performance Sockets and RPC over Virtual Interface (VI) Architecture", H. V. Shah, C. Pu and R. S. M., CANPC Workshop 1999.

Latency with Bandwidth Constraint
[Figure: Latency vs. message size and bandwidth vs. message size curves for TCP and VIA, with the required bandwidth marked]
Latency vs. message size is what is usually studied; latency at a required bandwidth is what is relevant for performance guarantees.
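The constraint can be made concrete with a simple linear cost model, time(m) = latency + m / peak_bandwidth. The model itself is a sketch of mine, not the talk's analysis; the latency and peak-bandwidth figures are the TCP and VIA numbers measured later in the talk, and the 400 Mbps requirement is an arbitrary illustration:

```python
def achieved_bandwidth(msg_bytes, latency_s, peak_bw_Bps):
    """Effective bandwidth of one message under a linear cost model:
    transfer time = per-message latency + bytes / peak bandwidth."""
    return msg_bytes / (latency_s + msg_bytes / peak_bw_Bps)

def min_message_size(required_bw_Bps, latency_s, peak_bw_Bps):
    """Smallest message size whose achieved bandwidth meets the requirement.
    Solving m / (L + m/B) >= R for m gives m >= R*L / (1 - R/B)."""
    assert required_bw_Bps < peak_bw_Bps, "requirement must be below peak"
    return required_bw_Bps * latency_s / (1 - required_bw_Bps / peak_bw_Bps)

# Example: a 400 Mbps requirement, with the measured numbers from the talk.
# VIA: 9 us latency, ~790 Mbps peak -> under 1 KB suffices.
via_min = min_message_size(400e6 / 8, 9e-6, 790e6 / 8)
# TCP: 45 us latency, ~510 Mbps peak -> roughly 10 KB is needed.
tcp_min = min_message_size(400e6 / 8, 45e-6, 510e6 / 8)
```

The lower per-message latency is what lets VIA meet the same bandwidth requirement at a much smaller message size.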

An Example…
[Figure: TCP and VIA latency curves against the required bandwidth]
Image rendering should be interactive; response times should be small.

Pipelining: Computation/Communication Overlap
[Figure: Latency vs. message size (log scale) for TCP and VIA, against a computation cost that is linear in message size; a root node feeding compute nodes]

An Example…
A root node distributes data to compute nodes; computation is linear in message size.
For perfect pipelining:
– TCP requires a 16 KB message size; VIA requires a 2 KB message size
– Say the computation function takes 1 sec/KB: each computation step takes 16 secs for TCP, 2 secs for VIA
Say a node becomes slower by a factor of 2:
– Time taken by the compute node becomes (16 × 2) = 32 secs for TCP, an increase of 16 seconds
– It becomes (2 × 2) = 4 secs for VIA, an increase of only 2 seconds
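The slide's arithmetic can be reproduced directly; this is just the example above in code form, with the slide's assumed 1 sec/KB computation rate:

```python
SEC_PER_KB = 1.0  # the slide's assumption: computation is linear, 1 sec/KB

def step_time(chunk_kb, slowdown=1.0):
    """Time one compute node spends on one chunk of the given size."""
    return chunk_kb * SEC_PER_KB * slowdown

tcp_normal = step_time(16)            # 16 s per step with TCP's 16 KB chunks
via_normal = step_time(2)             # 2 s per step with VIA's 2 KB chunks

# If a node becomes 2x slower, the larger TCP chunk magnifies the stall:
tcp_slow = step_time(16, slowdown=2)  # 32 s, an increase of 16 s
via_slow = step_time(2, slowdown=2)   # 4 s, an increase of only 2 s
```

The smaller chunk size that VIA's low latency permits is what bounds the damage a slow node can do to the pipeline.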

Presentation Layout
– Motivation and Background
– Sockets Implementations
– DataCutter Library
– Experimental Results
– Conclusions and Future Work

Sockets Implementations
Traditional Berkeley Sockets: Application → Sockets → TCP → IP → NIC
– Pros: high compatibility
– Cons: kernel context switches, multiple copies, CPU resources
GigaNet Sockets (LANE): Application → Sockets → TCP → IP → IP-to-VI layer → "VI aware" NIC (GigaNet cLAN)
– Cons: kernel context switches, multiple copies, CPU resources
SocketVIA: Application or Library → Sockets over VIA (OS agent, VIPL) → GigaNet cLAN NIC
– Pros: high performance

Experimental Setup
16 Dell Precision 420 nodes
– Dual 1 GHz PIII processors
– 32-bit, 33 MHz PCI bus
– 512 MB SDRAM and 256 KB L2 cache
– Linux kernel version
GigaNet cLAN NICs
– cLAN 1000 host adapters
– cLAN 5300 cluster switches

Performance of SocketVIA vs. TCP
Latency: TCP (45 µs), VIA (9 µs), SocketVIA (9.5 µs)
Bandwidth: TCP (510 Mbps), VIA (790 Mbps), SocketVIA (763 Mbps)
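Latency numbers like these typically come from a ping-pong microbenchmark. The following is a generic sketch of that methodology over an ordinary socket pair, not the benchmark used in the talk; the talk's measurements were taken over GigaNet cLAN hardware, whereas this runs over local loopback (and pays Python interpreter overhead), so only the shape of the measurement matches:

```python
import socket
import threading
import time

def pingpong_latency(iters=1000, msg=b"x"):
    """Estimate one-way latency as half the average round-trip time
    over a connected local socket pair."""
    a, b = socket.socketpair()

    def echo():
        # Peer side: bounce every message straight back.
        for _ in range(iters):
            data = b.recv(len(msg))
            b.sendall(data)

    t = threading.Thread(target=echo)
    t.start()
    start = time.perf_counter()
    for _ in range(iters):
        a.sendall(msg)   # ping
        a.recv(len(msg)) # pong
    elapsed = time.perf_counter() - start
    t.join()
    a.close()
    b.close()
    return elapsed / iters / 2  # one-way latency estimate, in seconds

loopback_latency = pingpong_latency(iters=200)
```

Bandwidth benchmarks follow the same pattern with large messages streamed in one direction.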

Presentation Layout
– Motivation and Background
– Sockets Implementations
– DataCutter Library
– Experimental Results
– Conclusions and Future Work

DataCutter
Software support for data-driven applications
Component framework for combined task/data parallelism
User defines a sequence of pipelined components (filters and filter groups)
– Stream-based communication
A user directive tells the preprocessor/runtime system to generate and instantiate copies of filters
Flow control between transparent filter copies
– Replicated individual filters
– Transparent: single-stream illusion

DataCutter Library
[Figure: the DataCutter library layered over both stacks — Applications → DataCutter Library → Sockets → TCP → IP → NIC, and Applications → DataCutter Library → Sockets over VIA (OS agent, VIPL) → GigaNet cLAN NIC]

Virtual Microscope Server
read → decompress → clip → subsample → view
Pipelining of the various stages: the data reading, decompression, clipping, and subsampling operations can be realized as a chain of filters.
Filters are replicated to obtain parallelism.
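The filter chain above can be sketched as composable stream stages. This is a toy illustration of the structure only, not the DataCutter API: real filters are separate components communicating over streams (and transparently replicable), and the stage bodies here are placeholders rather than real image operations:

```python
def read_blocks(image):
    """Source filter: yield raw data blocks from the stored image."""
    yield from image

def decompress(blocks):
    """Stand-in for decompression; a real filter would decode image data."""
    for b in blocks:
        yield b * 2

def clip(blocks, lo, hi):
    """Keep only blocks that fall inside the requested viewing region."""
    for b in blocks:
        if lo <= b <= hi:
            yield b

def subsample(blocks, step):
    """Reduce resolution by keeping every step-th block."""
    for i, b in enumerate(blocks):
        if i % step == 0:
            yield b

# Chain the filters: read -> decompress -> clip -> subsample.
pipeline = subsample(clip(decompress(read_blocks(range(10))), 4, 14), 2)
result = list(pipeline)  # [4, 8, 12]
```

Because each stage consumes its input lazily, blocks flow through the chain one at a time, which is what makes the pipelined (and replicated) execution possible.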

Software Load Balancing
[Figure: data-reading filters feeding a load balancer that distributes blocks to compute nodes, one of which is slower]

Presentation Layout
– Motivation and Background
– Sockets Implementations
– DataCutter Library
– Experimental Results
– Conclusions and Future Work

Experiments Conducted
– Optimal block size
– Guarantee on updates per second (complete image)
– Guarantee on latency of a partial update (moving the image)
– Round-robin load balancing
– Demand-driven load balancing

Effects of Guarantees on Updates per Second (Complete Images)
– SocketVIA performs better; TCP cannot give guarantees above 3.25 updates per second
– But the improvement is limited: design decisions made for TCP/IP are the bottlenecks
– Re-sizing the data blocks tries to alleviate the bottlenecks
– Since the only concern here is updates per second, the guarantee is achievable at low block sizes with no application changes (in this case), a significant performance improvement
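Why re-sizing blocks interacts with the transport can be seen in a toy cost model. This model and the 1 MB image size are my own illustration, not the paper's measurement; the latency and bandwidth numbers are the SocketVIA and TCP figures reported earlier:

```python
def updates_per_second(image_kb, block_kb, latency_s, bw_kb_per_s):
    """Toy model: one full-image update sends image_kb/block_kb blocks,
    each paying a per-message latency plus its transmission time."""
    n_blocks = image_kb / block_kb
    return 1.0 / (n_blocks * (latency_s + block_kb / bw_kb_per_s))

TCP_LAT, TCP_BW = 45e-6, 510e6 / 8 / 1000   # 45 us, 510 Mbps in KB/s
VIA_LAT, VIA_BW = 9.5e-6, 763e6 / 8 / 1000  # SocketVIA: 9.5 us, 763 Mbps

# At small block sizes the per-message latency dominates, so shrinking
# the blocks hurts TCP far more than SocketVIA:
tcp_1kb = updates_per_second(1024, 1, TCP_LAT, TCP_BW)
via_1kb = updates_per_second(1024, 1, VIA_LAT, VIA_BW)
tcp_16kb = updates_per_second(1024, 16, TCP_LAT, TCP_BW)
```

In this model SocketVIA sustains a high update rate even at 1 KB blocks, while TCP needs large blocks to amortize its latency, which is the freedom block re-sizing exploits.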

Effects of Guarantees on Latency of Partial Updates (Moving the Image)
– For high latency guarantees, blindly using SocketVIA is good enough: the bandwidth saturates, as with pre-tuned applications
– For low latency guarantees, TCP is no longer in the picture, but blindly using SocketVIA is not OK either; resizing the blocks can help

Effect of Heterogeneous Clusters on Round Robin (RR) Scheduling
Dynamic heterogeneity: shared processors, process swapping
Perfect pipelining (complete overlap of communication with computation) occurs at 16 KB for TCP, at 2 KB for VIA
Scope of error: a larger chunk given to a slower node means more time to complete the chunk, and more time before the load balancer can react

Effect of Heterogeneous Clusters on Demand Driven (DD) Scheduling
Demand-driven scheduling has an additional latency cost, so SocketVIA should perform better (?)
But the natural overlap of communication with computation means the use of SocketVIA or TCP makes no difference
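The difference between the two scheduling policies on a heterogeneous cluster can be illustrated with a small simulation. This is a toy of my own, not the paper's experiment: 16 equal work units, four nodes, one running at half speed:

```python
def round_robin(units, speeds):
    """Finish time when unit i is statically assigned to node i % n."""
    n = len(speeds)
    load = [0.0] * n
    for i, u in enumerate(units):
        load[i % n] += u / speeds[i % n]
    return max(load)

def demand_driven(units, speeds):
    """Finish time when each node pulls the next unit as soon as it is free."""
    finish = [0.0] * len(speeds)
    for u in units:
        k = finish.index(min(finish))  # the node that frees up first
        finish[k] += u / speeds[k]
    return max(finish)

units = [1.0] * 16              # 16 equal work units
speeds = [1.0, 1.0, 1.0, 0.5]   # one node runs at half speed

rr = round_robin(units, speeds)   # slow node still gets 1/4 of the work
dd = demand_driven(units, speeds) # slow node naturally pulls fewer units
```

Round-robin finishes when the slow node does (8.0 time units here), while demand-driven lets the fast nodes absorb the imbalance (5.0), which is why it tolerates dynamic heterogeneity at the cost of an extra request latency per unit.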

Presentation Layout
– Motivation and Background
– Sockets Implementations
– DataCutter Library
– Experimental Results
– Conclusions and Future Work

Conclusions and Future Work
High performance sockets are good!
– They are your friend, but use them wisely
Minor changes can make a major impact
– Order-of-magnitude performance improvement
– Sustained performance guarantees
– Fine-grained load balancing
– Higher adaptability to heterogeneous networks
– Benefits of parallelization over pipelining with SocketVIA for large clusters
High Performance Sockets implementations
– TCP termination (for the data-center environment)
– Use in DSM, data-center, and storage-server environments

Thank You!
For more information, please visit the Network Based Computing group home page, The Ohio State University.