Assessment of Data Path Implementations for Download and Streaming
Pål Halvorsen, Tom Anders Dalseng and Carsten Griwodz


Assessment of Data Path Implementations for Download and Streaming
  Pål Halvorsen (1,2), Tom Anders Dalseng (1) and Carsten Griwodz (1,2)
  (1) Department of Informatics, University of Oslo, Norway
  (2) Simula Research Laboratory, Norway
  International Conference on Distributed Multimedia Systems (DMS'05), Banff, Canada, September 2005

Overview
  Motivation
  Existing mechanisms in Linux
  Possible enhancements
  Summary and conclusions
  2005 Pål Halvorsen, Tom Anders Dalseng & Carsten Griwodz, DMS'05, Banff, Canada, September 2005

Delivery Systems
  [Diagram labels: network, bus(es)]

Delivery Systems
  [Diagram labels: application, file system, communication system, user space, kernel space, bus(es)]

Intel Hub Architecture
  [Diagram labels: Pentium 4 processor, registers, cache(s), memory controller hub, I/O controller hub, RDRAM, PCI slots, network card, disk, application, file system, communication system]
    → several in-memory data movements and context switches

Motivation
  Data copy operations are expensive
    → they consume CPU, memory, hub, bus and interface resources (proportional to data size)
    → profiling shows that ~40% of CPU time is consumed by copying data between user and kernel space
    → the gap between memory and CPU speeds is increasing
    → different memory banks have different access times
  System calls cause many switches between user and kernel space

Basic Idea of Zero-Copy Data Paths
  [Diagram labels: application, file system, communication system, user space, kernel space, bus(es), data_pointer]

Motivation (continued)
  A lot of research has been performed in this area.
  But what is the status of commodity operating systems today?

Existing Linux Data Paths

Content Download
  [Diagram labels: application, file system, communication system, user space, kernel space, bus(es)]

Content Download: read / send
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, read, send, copy, DMA transfer]
    → 2n copy operations
    → 2n system calls
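The read / send path on this slide could be sketched roughly as follows. This is an illustrative Python sketch, not the authors' test program: `read_send`, `demo`, the packet size and the local socket pair are all assumptions made for the example. The counter follows the slide's tally of one read and one send per packet and does not measure what the interpreter actually issues.

```python
import os
import socket
import tempfile


def read_send(fd, sock, size, packet_size):
    """Move `size` bytes from file descriptor `fd` to `sock`, packet by packet.

    Returns the number of read and send calls counted (two per packet).
    """
    calls = 0
    for _ in range(0, size, packet_size):
        data = os.read(fd, packet_size)  # copy: page cache -> application buffer
        sock.sendall(data)               # copy: application buffer -> socket buffer
        calls += 2                       # sendall is counted as one send here
    return calls


def demo(n=4, packet_size=512):
    """Push an n-packet temporary file through a local socket pair (hypothetical setup)."""
    payload = os.urandom(n * packet_size)
    tx, rx = socket.socketpair()
    with tx, rx, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        f.seek(0)  # rewind so os.read starts at the beginning of the file
        calls = read_send(f.fileno(), tx, len(payload), packet_size)
        tx.shutdown(socket.SHUT_WR)
        received = bytearray()
        while True:
            chunk = rx.recv(65536)
            if not chunk:
                break
            received += chunk
    return calls, bytes(received) == payload
```

Calling `demo()` transfers four packets and returns the counted calls together with a flag telling whether the receiver saw the same bytes.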

Content Download: mmap / send
  [Diagram labels: application, kernel, page cache, socket buffer, mmap, send, copy, DMA transfer]
    → n copy operations
    → 1 + n system calls
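A rough Python sketch of the mmap / send path is given below, assuming a regular, non-empty file and a connected stream socket. The names `mmap_send` and `demo` are hypothetical, and the counter mirrors the slide's tally (one mmap plus one send per packet) rather than the exact calls the interpreter makes.

```python
import mmap
import os
import socket
import tempfile


def mmap_send(fd, sock, packet_size):
    """Send the whole file behind `fd` from a read-only memory mapping.

    Returns the number of calls counted: one mmap plus one send per packet.
    """
    calls = 1  # the single mmap call
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped, \
            memoryview(mapped) as view:
        for offset in range(0, len(view), packet_size):
            # The slice is a view into the mapped pages, not a copy; the
            # remaining copy is the kernel filling the socket buffer.
            sock.sendall(view[offset:offset + packet_size])
            calls += 1
    return calls


def demo(n=4, packet_size=512):
    """Send an n-packet temporary file over a local socket pair (hypothetical setup)."""
    payload = os.urandom(n * packet_size)
    tx, rx = socket.socketpair()
    with tx, rx, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()  # make the data visible to the mapping
        calls = mmap_send(f.fileno(), tx, packet_size)
        tx.shutdown(socket.SHUT_WR)
        received = bytearray()
        while True:
            chunk = rx.recv(65536)
            if not chunk:
                break
            received += chunk
    return calls, bytes(received) == payload
```

Compared with read / send, the application never holds its own copy of the payload, which is where the saved copy per packet comes from.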

Content Download: sendfile
  [Diagram labels: application, kernel, page cache, socket buffer, sendfile, append descriptor, gather DMA transfer, DMA transfer]
    → 0 copy operations
    → 1 system call
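The sendfile path can be tried from Python through `os.sendfile`, which wraps the system call of the same name. The sketch below assumes Linux, a regular input file and a connected socket; `sendfile_download` and `demo` are hypothetical names. A single call normally covers a small file, but the kernel may return a short count, so the sketch loops.

```python
import os
import socket
import tempfile


def sendfile_download(in_fd, sock, size):
    """Send `size` bytes of the file behind `in_fd` with sendfile.

    Returns (bytes sent, number of sendfile calls).
    """
    sent = 0
    calls = 0
    while sent < size:
        # The kernel moves data from the page cache towards the socket;
        # nothing passes through an application buffer.
        n = os.sendfile(sock.fileno(), in_fd, sent, size - sent)
        calls += 1
        if n == 0:  # end of file reached earlier than expected
            break
        sent += n
    return sent, calls


def demo(size=2048):
    """Send a small temporary file over a local socket pair (hypothetical setup)."""
    payload = os.urandom(size)
    tx, rx = socket.socketpair()
    with tx, rx, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        sent, calls = sendfile_download(f.fileno(), tx, size)
        tx.shutdown(socket.SHUT_WR)
        received = bytearray()
        while True:
            chunk = rx.recv(65536)
            if not chunk:
                break
            received += chunk
    return sent, calls, bytes(received) == payload
```

Whether the transfer is truly copy-free depends on the network card supporting gather DMA, as the diagram labels indicate; the sketch only shows the call pattern.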

Content Download: Results
  [Charts: UDP, TCP]
  Tested transfer of a 1 GB file on Linux 2.6
  Both UDP (with enhancements) and TCP

Streaming
  [Diagram labels: application, file system, communication system, user space, kernel space, bus(es)]

Streaming: read / send
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, read, send, copy, DMA transfer]
    → 2n copy operations
    → 2n system calls

Streaming: read / writev
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, read, writev, copy, copy, DMA transfer]
    → 3n copy operations
    → 2n system calls
    → the previous solution (read / send) needs one copy less per packet
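The read / writev variant gathers an RTP header and the payload from two separate buffers in one call. A minimal Python sketch follows, assuming the 12-byte fixed RTP header of RFC 3550; the payload type 96, the timestamp step of 3000, the SSRC value and the function names are illustrative assumptions, not values from the slides.

```python
import os
import socket
import struct
import tempfile

RTP_HEADER = struct.Struct("!BBHII")  # 12-byte fixed RTP header (RFC 3550)


def rtp_header(seq, timestamp, ssrc, payload_type=96):
    """Version 2, no padding, no extension, no CSRC entries, marker bit clear."""
    return RTP_HEADER.pack(0x80, payload_type & 0x7F, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def read_writev(fd, sock, size, payload_size, ssrc=0x1234):
    """Read each payload, then gather header and payload with one writev.

    Returns the number of calls counted (one read and one writev per packet).
    """
    calls = 0
    packets = -(-size // payload_size)  # ceiling division
    for seq in range(packets):
        payload = os.read(fd, payload_size)         # page cache -> application buffer
        header = rtp_header(seq, seq * 3000, ssrc)  # hypothetical timestamp step
        os.writev(sock.fileno(), [header, payload])  # header and payload copied separately
        calls += 2
    return calls


def demo(n=3, payload_size=256):
    """Stream a temporary file over a local datagram socket pair (hypothetical setup)."""
    payload = os.urandom(n * payload_size)
    tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with tx, rx, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        f.seek(0)
        calls = read_writev(f.fileno(), tx, len(payload), payload_size)
        packets = [rx.recv(2048) for _ in range(n)]
    return calls, packets, payload
```

Each datagram received by `demo()` starts with the header and carries one payload chunk, so the receiver can check the sequence numbers directly.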

Streaming: mmap / send
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, mmap, cork, send, send, uncork, copy, copy, DMA transfer]
    → 2n copy operations
    → 1 + 4n system calls
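The cork, send, send, uncork sequence on this slide might look as follows for RTP over UDP on Linux. This is a sketch under stated assumptions: `UDP_CORK` is defined by hand with the value 1 from `<linux/udp.h>` because Python's `socket` module may not export it, the RTP field values are placeholders, and `mmap_corked_send` and `demo` are hypothetical names. While the socket is corked, the kernel accumulates the header and the payload into one datagram, which is transmitted when the option is cleared.

```python
import mmap
import os
import socket
import struct
import tempfile

UDP_CORK = 1  # assumption: value taken from <linux/udp.h>, Linux only


def mmap_corked_send(fd, sock, payload_size, ssrc=0x1234):
    """Send header and mmap'ed payload as one datagram per packet.

    Returns the calls counted: one mmap plus cork, send, send, uncork per packet.
    """
    calls = 1  # the single mmap call
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped, \
            memoryview(mapped) as view:
        for seq, offset in enumerate(range(0, len(view), payload_size)):
            # Placeholder RTP header: version 2, payload type 96, step 3000.
            header = struct.pack("!BBHII", 0x80, 96, seq & 0xFFFF,
                                 (seq * 3000) & 0xFFFFFFFF, ssrc)
            sock.setsockopt(socket.IPPROTO_UDP, UDP_CORK, 1)  # cork
            sock.send(header)                                  # copy 1: header
            sock.send(view[offset:offset + payload_size])      # copy 2: payload
            sock.setsockopt(socket.IPPROTO_UDP, UDP_CORK, 0)  # uncork: datagram leaves
            calls += 4
    return calls


def demo(n=3, payload_size=256):
    """Stream a temporary file over UDP on the loopback interface (hypothetical setup)."""
    payload = os.urandom(n * payload_size)
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with rx, tx, tempfile.TemporaryFile() as f:
        rx.bind(("127.0.0.1", 0))
        rx.settimeout(5)
        tx.connect(rx.getsockname())
        f.write(payload)
        f.flush()
        calls = mmap_corked_send(f.fileno(), tx, payload_size)
        packets = [rx.recv(2048) for _ in range(n)]
    return calls, packets, payload
```

The four calls per packet are what the later slides try to collapse into a single call.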

Streaming: mmap / writev
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, mmap, writev, copy, copy, DMA transfer]
    → 2n copy operations
    → 1 + n system calls
    → the previous solution (mmap / send) requires three more calls per packet

Streaming: sendfile
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, cork, send, sendfile, uncork, copy, append descriptor, gather DMA transfer, DMA transfer]
    → n copy operations
    → 4n system calls

Streaming: Results
  Tested streaming of a 1 GB file on Linux 2.6
  RTP over UDP
  [Chart label: TCP sendfile (content download)]
  Compared to not sending an RTP header over UDP, we get an increase of 29% (additional send call)
  More copy operations and system calls required → potential for improvements
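The copy and system call counts quoted on the preceding slides can be collected into a small cost model. The sketch below only restates those figures as functions of the packet count n; the `Cost` type, the `EXISTING` table and `tally` are names introduced for this illustration.

```python
from typing import NamedTuple


class Cost(NamedTuple):
    """Per-mechanism cost as stated on the slides."""
    copies_per_packet: int
    fixed_calls: int
    calls_per_packet: int

    def copies(self, n):
        return self.copies_per_packet * n

    def calls(self, n):
        return self.fixed_calls + self.calls_per_packet * n


# Figures taken from the slides on existing Linux data paths.
EXISTING = {
    "download: read/send":    Cost(2, 0, 2),  # 2n copies, 2n calls
    "download: mmap/send":    Cost(1, 1, 1),  # n copies, 1 + n calls
    "download: sendfile":     Cost(0, 1, 0),  # 0 copies, 1 call
    "streaming: read/send":   Cost(2, 0, 2),  # 2n copies, 2n calls
    "streaming: read/writev": Cost(3, 0, 2),  # 3n copies, 2n calls
    "streaming: mmap/send":   Cost(2, 1, 4),  # 2n copies, 1 + 4n calls
    "streaming: mmap/writev": Cost(2, 1, 1),  # 2n copies, 1 + n calls
    "streaming: sendfile":    Cost(1, 0, 4),  # n copies, 4n calls
}


def tally(n):
    """Return {mechanism: (copy operations, system calls)} for n packets."""
    return {name: (cost.copies(n), cost.calls(n)) for name, cost in EXISTING.items()}
```

For example, `tally(1000)` shows that streaming with sendfile needs 1000 copies and 4000 calls where the download case needs none and one, which is the gap the enhanced data paths target.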

Enhanced Streaming Data Paths

Enhanced Streaming: mmap / msend
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, mmap, cork, send, msend, uncork, copy, append descriptor, gather DMA transfer, DMA transfer]
  msend allows sending data from an mmap'ed file without a copy
    → n copy operations
    → 1 + 4n system calls
    → the previous solution requires one more copy per packet

Enhanced Streaming: mmap / rtpmsend
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, mmap, cork, send, uncork, rtpmsend, copy, append descriptor, gather DMA transfer, DMA transfer]
  RTP header copy integrated into the msend system call
    → n copy operations
    → 1 + n system calls
    → the previous solution requires three more calls per packet

Enhanced Streaming: mmap / krtpmsend
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, rtpmsend, krtpmsend, RTP engine, copy, append descriptor, gather DMA transfer, DMA transfer]
  An RTP engine in the kernel adds RTP headers
    → 0 copy operations
    → 1 system call
    → the previous solution requires one more call per packet
    → the previous solution requires one more copy per packet

Enhanced Streaming: rtpsendfile
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, cork, send, sendfile, uncork, rtpsendfile, copy, append descriptor, gather DMA transfer, DMA transfer]
  RTP header copy integrated into the sendfile system call
    → n copy operations
    → n system calls
    → the existing solution requires three more calls per packet

Enhanced Streaming: krtpsendfile
  [Diagram labels: application, application buffer, kernel, page cache, socket buffer, rtpsendfile, krtpsendfile, RTP engine, copy, append descriptor, gather DMA transfer, DMA transfer]
  An RTP engine in the kernel adds RTP headers
    → 0 copy operations
    → 1 system call
    → the previous solution requires one more call per packet
    → the previous solution requires one more copy per packet
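The per-packet savings annotated on the enhanced streaming slides can be checked against the stated counts. The sketch below encodes each mechanism's figures from the slides and computes, for every step in a chain, how many copies and calls per packet the previous solution needs in addition. The `Cost` type, the two chain lists and `per_packet_savings` are names made up for this illustration; the proposed calls (msend, rtpmsend, krtpmsend, rtpsendfile, krtpsendfile) are modelled by their counts only and are not invoked.

```python
from typing import NamedTuple


class Cost(NamedTuple):
    """Per-mechanism cost as stated on the slides."""
    copies_per_packet: int
    fixed_calls: int
    calls_per_packet: int


# Existing mechanism first, then each proposed enhancement in slide order.
MMAP_CHAIN = [
    ("mmap/send",      Cost(2, 1, 4)),  # 2n copies, 1 + 4n calls
    ("mmap/msend",     Cost(1, 1, 4)),  # n copies, 1 + 4n calls
    ("mmap/rtpmsend",  Cost(1, 1, 1)),  # n copies, 1 + n calls
    ("mmap/krtpmsend", Cost(0, 1, 0)),  # 0 copies, 1 call
]
SENDFILE_CHAIN = [
    ("sendfile",     Cost(1, 0, 4)),  # n copies, 4n calls
    ("rtpsendfile",  Cost(1, 0, 1)),  # n copies, n calls
    ("krtpsendfile", Cost(0, 1, 0)),  # 0 copies, 1 call
]


def per_packet_savings(chain):
    """For each step, return (name, copies saved, calls saved) per packet
    relative to the entry before it in the chain."""
    savings = []
    for (_, prev), (name, cur) in zip(chain, chain[1:]):
        savings.append((name,
                        prev.copies_per_packet - cur.copies_per_packet,
                        prev.calls_per_packet - cur.calls_per_packet))
    return savings
```

`per_packet_savings(MMAP_CHAIN)` yields one copy saved by msend, three calls saved by rtpmsend, and one copy plus one call saved by krtpmsend, matching the annotations on the slides.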

Enhanced Streaming: Results
  Tested streaming of a 1 GB file on Linux 2.6
  RTP over UDP
  [Chart labels: TCP sendfile (content download), existing mechanism (streaming), mmap based mechanisms, sendfile based mechanisms]
  [Chart annotations: ~27% improvement, ~25% improvement]

Conclusions
  Current commodity operating systems still pay a high price for streaming services
  However, small changes in the system call layer might be sufficient to remove most of the overhead
  In conclusion, commodity operating systems still have potential for improvement with respect to streaming support
  What can we hope to be supported?
  Road ahead: optimize the code, make a patch and submit it to kernel.org

Questions?