Using Uncacheable Memory to Improve Unity Linux Performance


Using Uncacheable Memory to Improve Unity Linux Performance
Ning Qu, Xiaogang Gou, Xu Cheng
Microprocessor Research and Development Center, Peking University

Issues

Unity SoC architecture: no snooping, so there are cache coherency problems everywhere.

Issues (cont.)

I/O data flows through two buffers: a user-process I/O buffer and a kernel I/O buffer, with DMA moving data between the kernel buffer and the I/O device buffer. Each buffer is touched only briefly per transfer, so the data has poor temporal locality.

Motivation

Heavy cost of cache coherency operations: many high-end embedded processors have caches, but many offer only limited support for guaranteeing cache coherency. For example, Unity I and some ARM processors do not support single-cache-line operations.

Poor locality leads to more data cache pollution: caches rely on the property of locality, but some programs, such as TCP/IP processing, have poor locality, so the existing cache policies do not help.

How can we avoid these disadvantages? Uncacheable memory may be a solution.

Contributions

Analyze the scenarios in which the cache does not perform well, and show that uncacheable memory has two advantages: it eliminates most cache coherency operations, and it avoids cache pollution.

Apply uncacheable memory in Unity Linux to improve I/O performance: several important benchmarks improve by 5%-29%, and with careful design it does not hurt overall system performance.

Outline

Issues
Motivation
Contributions
Uncacheable Memory
Evaluation
Related Work
Conclusions

Receive Packet Flow (using uncacheable memory)

Step 1: the I/O device receives the packet into its device buffer.
Step 2: DMA copies the packet into a kernel buffer.
Step 3: the kernel performs simple data processing on the buffer. Uncacheable memory access is slow, but because the processing is simple, even a cacheable buffer would see little reuse and mostly cache pollution.
Step 4: the CPU copies the data to the user buffer. An uncacheable kernel buffer causes no cache pollution but is slow to access; a cacheable one is fast to access but pollutes the cache heavily and must be flushed before the DMA.

Send Packet Flow (using uncacheable memory)

Step 1: the CPU copies data from the user buffer into a kernel buffer. An uncacheable write is slow but causes no cache pollution; a cacheable write must write-allocate and pollutes the cache, though the accesses themselves are faster.
Step 2: the kernel performs simple data processing on the buffer.
Step 3: with a cacheable buffer, the kernel must clean the cache before starting the DMA.
Step 4: DMA copies the buffer to the I/O device.

Cacheable vs. Uncacheable

                 Send                      Receive
CH processing    1. copy from U to K       1. clean&invalidate data cache
                 2. clean data cache       2. copy from K to U
NC processing    1. copy from U to K(N)    1. copy from K(N) to U
NC side effects  accessing uncacheable memory is slower; no data cache pollution; no cache clean operation (send) / no cache flush operation (receive)

DMA send and receive cost analysis. (U = user buffer, K = cacheable kernel buffer, K(N) = uncacheable kernel buffer, CH = cacheable, NC = uncacheable.)

Cacheable vs. Uncacheable (cont.)

[Diagram: cache-line traffic of the cacheable path. DMA send: load U into the cache, load K into the cache (write-allocate), store to K, then pay the cache clean cost before the DMA. DMA receive: pay the cache flush cost, load K into the cache, then load U into the cache and store.]

Cacheable vs. Uncacheable (cont.)

A single write is cheap, but a single read is expensive.

Send: the uncacheable approach costs one write of K in single-write mode; the cacheable approach costs at least one read of K in burst mode plus one write of K in burst mode.

Receive: the uncacheable approach costs one read of K in single-read mode; the cacheable approach costs one read of K in burst mode.

To compare the cost and the TCP/IP processing performance of the uncacheable method against the hardware cacheable method, we designed a simplified experiment based on the TCP/IP processing described in the preceding table. The experiment has two factors: the data length, which determines the cost of the data copy operations, and the data-cache dirty ratio, which determines the cost of the cache clean&invalidate operations. We implemented a kernel module to measure the cost of sending and receiving packets with both methods. In the diagrams, Cache(6.25%) means the cacheable method with a 6.25% data-cache dirty ratio, and NonC means uncacheable. The results convince us that the uncacheable method is likely to gain benefits when sending packets; for receiving packets, however, it costs too much because of the copy and the TCP/IP processing.

Recv and Send Performance, CH vs. NC

Using Uncacheable Memory

Implemented in Unity Linux, ported from Linux 2.4.17.
Uncacheable page tables: eliminate cache coherency operations when modifying the page tables.
Uncacheable socket buffers for sending: eliminate cache coherency operations and avoid data cache pollution.

Outline

Issues
Motivation
Contributions
Uncacheable Memory
Evaluation
Related Work
Conclusions

Methodology

Benchmarks: Netperf, Lmbench, and a modified Andrew benchmark.
Experimental environment:
- 160 MHz Unity network computer with 256 MB DRAM and an SoC built-in 10M/100M Ethernet card.
- Dell 4600 server: two Intel Xeon PIII 700 MHz processors with 4 GB DRAM and a 1000M/100M Ethernet card.
All benchmarks are executed in single-user mode over NFS.

Netperf Benchmark Results

Netperf TCP_STREAM Send Performance

Netperf Benchmark Results (cont.)

Q: Why does the transaction-throughput improvement decrease as the receive size increases?
A: Because uncacheable memory is also used in the receive socket buffer.

Netperf TCP_RR Performance

Lmbench Benchmark Results

Lmbench Performance

Modified Andrew Benchmark Results

As expected, execution time is reduced by 6%-12% for the first four phases. For Phase V, the reduction is less than 1%; this is reasonable because only Phase V depends heavily on computation instead of I/O.

Summary: based on these results, we believe that by using uncacheable memory with careful design, the overall performance of the Unity system will outperform an implementation that simply uses cacheable memory and hardware cache operations.

Modified Andrew Benchmark

Related Work

Related work accelerates uncacheable memory performance:
- New memory types: Intel write-combining; the MIPS R10000 uncached-accelerated page.
- New instructions: block move instructions in SPARC V9, ARM, and Unity II.

Future work: support for a new memory type that reads like a common cache with low pollution and writes like write-combining, without write-allocate.

Conclusions

This paper focuses on the use of uncacheable memory.
- Pros: eliminates coherency operations and avoids data cache pollution.
- Cons: slow access time.
Uncacheable memory can perform well with a careful design that takes system specialties into account. Many embedded architectures have their own design specialties due to limits on energy cost, area size, and design complexity, and there may be more areas where uncacheable memory can be applied.

Thank You! Questions?