Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen Institute of Computing Technology, Chinese Academy of Sciences 2009.2.15 Workshop on The Influence of I/O on Microprocessor Architecture (IOM-2009) We are from the Institute of Computing Technology, Chinese Academy of Sciences.

A Brief Intro of ICT, CAS ICT has developed the Loongson CPU ICT has built the fastest HPC in China – Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500. The Institute of Computing Technology of the Chinese Academy of Sciences was the first academic institution in China to specialize in comprehensive research into computer science and technology. ICT produced China's first general-purpose digital computer in 1958 and has turned to Dawning-series HPC development since the 1990s. Recently, the Dawning 5000 achieved a rank of 10th on the Top500 list. ICT is also the place where China's first general-purpose CPU chip, Loongson (also named Godson in academia), was born. The four-core Godson-3 will be released in 2009, with a maximum frequency of 1GHz and a power dissipation of 5-10 W. In addition, ICT has spun off many academic institutions and hi-tech companies, such as Lenovo.

Overview Background Nature of DMA Mechanism DMA Cache Scheme Research Methodology Evaluations Conclusions and Ongoing Work This slide gives an overview of the presentation.

Importance of I/O Operations I/O is ubiquitous Load binary files: Disk → Memory Browse the web, stream media: Network → Memory… I/O is important Many commercial applications are I/O intensive: databases, Internet applications, etc. I/O operations are extremely important in modern computer systems. First, I/O is ubiquitous. When we want to run an application, its binary files are loaded from disk to memory via the I/O subsystem. When we browse the web or watch online video, the content is transferred from the network to memory via the I/O subsystem. Second, many commercial applications are I/O intensive, such as databases and Internet applications.

State-of-the-Art I/O Technologies I/O Buses: 20GB/s PCI-Express 2.0 HyperTransport 3.0 QuickPath Interconnect I/O Devices RAID: 400MB/s 10GE: 1.25GB/s I/O bus and I/O device technologies have improved dramatically in the past decade. I/O buses can provide bandwidths of over 20GB/s, which are very close to that of a DDR2/3 DRAM memory system. I/O device performance has also increased significantly. On a PC platform, a software RAID0 of four SATA disks can provide a bandwidth of over 400MB/s, and 10Gb Ethernet can offer a bandwidth of 1.25GB/s.

A Typical Computer Architecture NIC This is a typical computer architecture (Intel's i7 architecture). In fact, the I/O subsystem contributes strongly to both performance and cost, although some contributions are downsides, such as latency. This reflects the importance of both the I/O subsystem and this workshop.

Direct Memory Access (DMA) DMA is an essential feature of I/O operation in all modern computers DMA allows I/O subsystems to access system memory for reading and/or writing independently of the CPU. Many I/O devices use DMA Including disk drive controllers, graphics cards, network cards, sound cards and GPUs This slide explains DMA's essential role in I/O systems; we can just read it.

Overview Background Nature of DMA Mechanism DMA Cache Scheme Research Methodology Evaluations Conclusions and Ongoing Work Next, we will introduce the nature of the DMA mechanism.

DMA in Computer Architecture NIC Note: this slide contains four animations. Action_1: to show that DMA controllers exist in various I/O devices, such as the NIC, the disk (SATA controller) and the sound device. Action_2 & Action_3: to show the sources, destinations and paths of DMA operations. For example, I/O data is moved from memory to disk (Action 2), and I/O data is moved from the NIC to memory (Action 3).

An Example of Disk Read: DMA Receiving Operation Memory CPU Driver Buffer Descriptor ① ② ③ Kernel Buffer ④ User Buffer ⑤ DMA Engine Note: this slide contains 12 animations. Action_1 ~ Action_5: to show the five steps of a DMA receiving operation. Here "Kernel Buffer" is the page cache of the OS kernel, and "User Buffer" is a memory buffer allocated by the application. Action_6 ~ Action_11: to show that we will illustrate the detailed steps of the copy operation next. The yellow box represents I/O data. At first, it resides in the "Driver Buffer", which is the left side of DRAM. In the end, it is moved to the "Kernel Buffer", which is the right side of DRAM. Action_12: to show that there are two memory access operations and two cache access operations for one piece of I/O data (e.g., one cache block); the memory access operations are the time-consuming ones. Cache Access Latency: ~20 Cycles Memory Access Latency: ~200 Cycles
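To make the five steps concrete, here is a minimal C toy model of the receive path; the buffer names and the dma_engine_transfer() helper are illustrative stand-ins, not code from any real driver, and the cycle figures in the comments are the approximate numbers from the slide.

    /* Toy model of the five steps of a DMA receiving operation (disk read).
     * Buffers are plain arrays standing in for DRAM regions; all names are
     * illustrative, not taken from any real driver.                         */
    #include <stdio.h>
    #include <string.h>

    #define BLK 64                          /* one cache block of I/O data    */

    static char disk_data[BLK] = "data on disk";
    static char driver_buf[BLK];            /* DMA buffer set up by the driver */
    static char kernel_buf[BLK];            /* OS page cache                   */
    static char user_buf[BLK];              /* buffer malloc'ed by application */

    static void dma_engine_transfer(void)
    {
        /* (3) the DMA engine moves data from the device into the driver buffer */
        memcpy(driver_buf, disk_data, BLK);
    }

    int main(void)
    {
        /* (1) the CPU prepares a buffer descriptor pointing at driver_buf      */
        /* (2) the CPU programs the DMA engine with that descriptor             */
        dma_engine_transfer();

        /* (4) on the DMA-completion interrupt, the kernel copies the data into
         *     the page cache: one memory read (~200 cycles per block) plus one
         *     cache write (~20 cycles per block)                               */
        memcpy(kernel_buf, driver_buf, BLK);

        /* (5) copy_to_user: page cache -> application buffer                   */
        memcpy(user_buf, kernel_buf, BLK);

        printf("%s\n", user_buf);
        return 0;
    }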

Potential Improvement of DMA Memory CPU Driver Buffer Descriptor ① ② ③ Kernel Buffer ④ User Buffer ⑤ Note: this slide contains 9 animations. Action_1 ~ Action_3: to show the first three steps of a DMA receiving operation. Action_4: to stress that one optimization approach could be to inject I/O data into the cache directly. Action_5 ~ Action_8: to show that there are now two cache access operations and only one memory access operation, so the copy time is reduced significantly. Action_9: to explain that this is a typical shared-cache scheme. DMA Engine This is a typical Shared-Cache Scheme
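A rough back-of-the-envelope comparison, using the access counts stated on these two slides and the approximate latencies quoted above (~20 cycles per cache access, ~200 cycles per memory access); the absolute numbers are only illustrative.

    /* Per-block copy cost, baseline vs. shared-cache (direct cache injection).
     * Latencies are the approximate figures quoted on the earlier slide.      */
    #include <stdio.h>

    int main(void)
    {
        const int cache_lat = 20, mem_lat = 200;

        /* baseline: 2 memory accesses + 2 cache accesses per I/O block        */
        int baseline = 2 * mem_lat + 2 * cache_lat;
        /* shared-cache: 1 memory access + 2 cache accesses per I/O block      */
        int shared   = 1 * mem_lat + 2 * cache_lat;

        printf("baseline: ~%d cycles, shared-cache: ~%d cycles per block\n",
               baseline, shared);            /* ~440 vs. ~240 cycles            */
        return 0;
    }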

Problems of the Shared-Cache Scheme Cache Pollution Cache Thrashing Degrades performance when DMA requests are large (>100KB) for the "Oracle + TPC-H" application The processor's LLC treats all data equally, which may cause cache pollution and thrashing problems, especially when I/O data is large (over 100KB). Our experiments show that the shared-cache scheme can degrade performance when DMA requests are large for "Oracle + TPC-H" applications.

Rethink the DMA Mechanism The Nature of DMA There is a producer-consumer relationship between the CPU and the DMA engine Memory plays the role of a transient place for I/O data transferred between processor and I/O device Corollaries Once I/O data is produced, it will be consumed I/O data within a DMA buffer will be used only once in most cases (i.e. almost no reuse) The characteristics of I/O data are different from those of CPU data It may not be appropriate to store I/O data and CPU data together Note: this slide contains 9 animations. Before proposing another optimization approach, let's rethink the DMA mechanism. Action_1: We can see that the nature of DMA is as stated above (just read it). Action_2: There are several corollaries. First, once I/O data is produced, it will be consumed (it should be noted that invalid or duplicate network packets are not consumed), and I/O data within a DMA buffer will be used only once in most cases (i.e. almost no reuse). Action_3: Just read them.

Overview Background Nature of DMA Mechanism DMA Cache Scheme Research Methodology Evaluations Conclusions and Ongoing Work Based on this understanding, next we present a DMA cache scheme to optimize I/O performance.

DMA Cache Proposal A dedicated cache storing I/O data Capable of exchanging data with the processor's last level cache (LLC) Reduces the overhead of I/O data movement DMA Note: this slide contains 3 animations. Action_1 ~ Action_3: to show that incoming I/O data is stored in the dedicated DMA cache and can be moved to the processor's LLC when the CPU accesses it. In this way, the overhead of memory copy operations can be avoided.

DMA Cache Design Issues Cache Coherence Data Path Replacement Policy Write Policy Prefetching CPU Cache State Diagram DMA Cache State Diagram Note: this slide contains 1 animation. There are several design issues for the DMA cache. Action 1: the cache coherence issue should be solved carefully. The upper figure illustrates a typical cache state diagram of a snooping cache for I/O data coherency. In fact, this cache state diagram does not need to be modified when adding a DMA cache. Moreover, the state diagram for the DMA cache is similar to that of the processor cache except for the driving sources. For example, the "Invalid → Clean" state transitions of the CPU cache and the DMA cache are driven by CPU reads and DMA reads respectively. However, these state diagrams apply to a uniprocessor platform; we are currently researching the multiprocessor platform. The DMA Cache State Diagram is similar to the CPU Cache's in a uniprocessor system We are researching the multiprocessor platform…
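Since the actual state diagrams live in the slide figures, the following is only a generic three-state (Invalid/Clean/Dirty) sketch of the kind of snooping state machine described here, for a uniprocessor; the transition rules are an assumption for illustration, not the exact diagram from the slide.

    /* Generic Invalid/Clean/Dirty state machine sketch.  In the DMA cache,
     * "local" events are driven by the DMA engine; in the CPU cache they are
     * driven by the CPU - the same diagram with different driving sources.   */
    #include <stdio.h>

    typedef enum { INVALID, CLEAN, DIRTY } line_state;
    typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event;

    static line_state next_state(line_state s, event e)
    {
        switch (e) {
        case LOCAL_READ:  return (s == INVALID) ? CLEAN : s;  /* Invalid -> Clean    */
        case LOCAL_WRITE: return DIRTY;                        /* allocate and dirty  */
        case SNOOP_READ:  return (s == DIRTY) ? CLEAN : s;     /* supply data, clean  */
        case SNOOP_WRITE: return INVALID;                      /* other cache owns it */
        }
        return s;
    }

    int main(void)
    {
        line_state s = INVALID;
        s = next_state(s, LOCAL_READ);   /* DMA read fills the DMA cache: Clean   */
        s = next_state(s, SNOOP_WRITE);  /* snooped write invalidates: Invalid    */
        printf("final state: %d\n", s);
        return 0;
    }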

DMA Cache Design Issues Cache Coherence Data Path Replacement Policy Write Policy Prefetching DMA In practice, additional data paths and data access ports for the LLC are not required, because data migration operations between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism. The required modification is that all requests sent to the LLC should also be forwarded to the DMA cache. We believe this modification is not on a critical path of the cache hierarchy. Next, we will illustrate the data paths for a CPU read and a DMA read. Additional data paths and data access ports for the LLC are not required because data migration operations between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism

Data Path: CPU Read cmd data Cache Ctrl Snoop Ctrl Last Level Cache Hit in DMA cache? Miss in LLC & Hit in DMA Cache Cache Ctrl Snoop Ctrl Last Level Cache Cache Ctrl Snoop Ctrl DMA Cache Note: this slide contains 6 animations. Action_1: a CPU read request is forwarded to both the LLC and the DMA cache. Action_2: the DMA cache's controller sends a signal to the LLC to indicate whether the request hits in the DMA cache. Action_3: the LLC controller learns that the request "misses in the LLC and hits in the DMA cache". Action_4: the LLC sends a refill request to the DMA cache. Action_5: the DMA cache sends the data to the LLC. Action_6: the requested data is forwarded to the upper-level cache. It should be noted that the data path of a CPU write is similar. For the cases of "hit in LLC" and "miss in both caches", the data paths are the same as in the current system without a DMA cache. In addition, cache coherence ensures that a piece of data can reside in only one of the two places. System Bus DMA Ctrl I/O Device Mem Ctrl Memory
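A minimal sketch of the lookup flow just described, assuming a toy direct-mapped cache structure; the data types and helper functions are illustrative, and in the real hardware hit signals and block migration of course travel over the snoop data path rather than through function calls.

    /* Toy model of the CPU-read data path: the request goes to both the LLC
     * and the DMA cache; on "miss in LLC, hit in DMA cache" the block migrates
     * to the LLC and leaves the DMA cache, so data lives in only one place.   */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES 4
    typedef struct { long tag; bool valid; } cache_t[LINES];

    static bool lookup(cache_t c, long a) { return c[a % LINES].valid && c[a % LINES].tag == a; }
    static void fill  (cache_t c, long a) { c[a % LINES].tag = a; c[a % LINES].valid = true; }
    static void evict (cache_t c, long a) { c[a % LINES].valid = false; }

    static void cpu_read(cache_t llc, cache_t dma_cache, long addr)
    {
        bool hit_llc = lookup(llc, addr);        /* request forwarded to the LLC ...      */
        bool hit_dma = lookup(dma_cache, addr);  /* ... and, in parallel, to the DMA cache */

        if (!hit_llc && hit_dma) {
            fill(llc, addr);                     /* LLC refills from the DMA cache         */
            evict(dma_cache, addr);              /* coherence: only one copy remains       */
        } else if (!hit_llc) {
            fill(llc, addr);                     /* ordinary refill from memory            */
        }
        /* the data is then forwarded to the upper-level cache / CPU */
    }

    int main(void)
    {
        cache_t llc = {0}, dma_cache = {0};
        fill(dma_cache, 0x40);                   /* pretend a DMA write placed it here     */
        cpu_read(llc, dma_cache, 0x40);
        printf("after CPU read, LLC hit: %d\n", lookup(llc, 0x40));
        return 0;
    }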

Data Path: DMA Read cmd data Cache Ctrl Snoop Ctrl Last Level Cache Miss in DMA Cache & Hit in LLC Hit in LLC? System Bus Note: this slide contains 7 animations. Action_1: a DMA read request is forwarded to both the LLC and the DMA cache. Action_2: the LLC's snoop controller sends a signal to the DMA cache to indicate whether the request hits in the LLC. Action_3: the DMA cache's snoop controller learns that the request "misses in the DMA cache and hits in the LLC". Action_4: the DMA cache sends a refill request to the LLC. Action_5: the LLC sends the data to the DMA cache. Action_6: the requested data is forwarded to the I/O device. Action_7: prefetching is triggered to fetch the next I/O data from memory or the LLC. As on the last slide, the data path of a DMA write is similar. DMA read DMA Ctrl I/O Device Mem Ctrl Memory

DMA Cache Design Issues Cache Coherence Data Path Replacement Policy Write Policy Prefetching An LRU-like Replacement Policy Invalid Block Clean Block Dirty Block We use an LRU-like replacement policy in which DMA cache blocks are selected for replacement with the following priority: 1) first, an invalidated block will be selected; 2) second, the least recently used clean block will be selected; 3) finally, the least recently used dirty block will be selected. However, the I/O data access pattern is different from that of CPU data, which implies that blocks in the DMA cache will not be reused in most cases. While other replacement policies may be more efficient for the DMA cache, a more detailed study of replacement policies is beyond the scope of this presentation.
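A small self-contained sketch of the victim-selection priority described above (invalid block first, then least recently used clean block, then least recently used dirty block); the set layout and LRU encoding are illustrative assumptions.

    /* Victim selection for the LRU-like policy: invalid > oldest clean > oldest dirty. */
    #include <stdbool.h>
    #include <stdio.h>

    #define WAYS 4
    struct way { bool valid, dirty; unsigned lru; /* larger = older */ };

    static int pick_victim(struct way set[WAYS])
    {
        int victim = -1;

        for (int i = 0; i < WAYS; i++)           /* 1) any invalid way              */
            if (!set[i].valid) return i;

        for (int i = 0; i < WAYS; i++)           /* 2) least recently used clean way */
            if (!set[i].dirty && (victim < 0 || set[i].lru > set[victim].lru))
                victim = i;
        if (victim >= 0) return victim;

        for (int i = 0; i < WAYS; i++)           /* 3) least recently used dirty way */
            if (victim < 0 || set[i].lru > set[victim].lru)
                victim = i;
        return victim;
    }

    int main(void)
    {
        struct way set[WAYS] = {
            { true, true, 3 }, { true, false, 1 }, { true, false, 2 }, { true, true, 0 }
        };
        printf("victim way: %d\n", pick_victim(set));   /* way 2: the oldest clean block */
        return 0;
    }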

DMA Cache Design Issues Cache Coherence Data Path Replacement Policy Write Policy Prefetching Adopt a Write-Allocate Policy Both Write-Back and Write-Through policies are available We adopt a write-allocate policy and a write-back policy for both caches. Thus, the dirty blocks in both caches are not written back to memory until they are selected for replacement. However, we also tried other write policies (e.g., write-through) and found that a DMA cache with a write-through policy can outperform the write-back policy.
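A minimal sketch of write-allocate handling with a switch between write-back and write-through, matching the description above; the block structure and the points where memory writes would occur are illustrative.

    /* Write-allocate with a write-back / write-through switch.  With write-back,
     * a dirty block reaches memory only when it is later evicted; with
     * write-through, every write also goes to memory and blocks stay clean.    */
    #include <stdbool.h>
    #include <stdio.h>

    struct block { bool valid, dirty; };

    static void write_block(struct block *b, bool write_through)
    {
        if (!b->valid)
            b->valid = true;            /* write-allocate: fetch the block on a write miss */

        if (write_through)
            b->dirty = false;           /* the write is propagated to memory immediately   */
        else
            b->dirty = true;            /* write-back: defer the memory write to eviction  */
    }

    static void evict_block(struct block *b)
    {
        if (b->valid && b->dirty) {
            /* the deferred write-back of the dirty block to memory happens here */
        }
        b->valid = b->dirty = false;
    }

    int main(void)
    {
        struct block b = { false, false };
        write_block(&b, false);          /* write-back policy */
        printf("dirty after write-back write: %d\n", b.dirty);
        evict_block(&b);
        return 0;
    }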

DMA Cache Design Issues Cache Coherence Data Path Replacement Policy Write Policy Prefetching I/O data has a very regular access pattern, being linear within a DMA buffer. Thus, even though the DMA cache adopts a straightforward sequential prefetching technique, it is able to significantly improve performance. We adopt a sequential prefetching scheme with a prefetching degree of four cache blocks for the DMA cache. Adopt straightforward sequential prefetching Prefetching triggered by cache misses Fetch 4 blocks at a time
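A minimal sketch of the sequential prefetcher described above: on a DMA cache miss, the demanded block and the next four sequential blocks are fetched. The 64-byte block size and the fetch_block() helper are assumptions for illustration.

    /* Sequential prefetcher with a degree of four, triggered by a cache miss. */
    #include <stdio.h>

    #define BLOCK_SIZE       64          /* bytes per cache block              */
    #define PREFETCH_DEGREE   4          /* blocks prefetched per trigger      */

    static void fetch_block(unsigned long addr)
    {
        /* in hardware this would issue a refill request to memory or the LLC  */
        printf("fetch block at 0x%lx\n", addr);
    }

    static void on_dma_cache_miss(unsigned long miss_addr)
    {
        unsigned long base = miss_addr & ~(unsigned long)(BLOCK_SIZE - 1);

        fetch_block(base);                              /* the demand miss      */
        for (int i = 1; i <= PREFETCH_DEGREE; i++)      /* sequential prefetch  */
            fetch_block(base + (unsigned long)i * BLOCK_SIZE);
    }

    int main(void)
    {
        on_dma_cache_miss(0x1008);   /* miss in block 0x1000 -> also fetch 0x1040..0x1100 */
        return 0;
    }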

Overview Background Nature of DMA Mechanism DMA Cache Scheme Research Methodology Evaluations Conclusions and Ongoing Work Next, we introduce our research methodology.

Memory Trace Collection Hyper Memory Trace Tool (HMTT) Capable of collecting all memory requests Provides APIs for injecting tags into the memory trace to identify high-level system operations We adopt a hardware snooping tool to collect all memory reference traces and an FPGA-based trace-driven emulation system to evaluate our proposals. This is the memory trace tool, called HMTT. It is plugged into a DIMM slot to collect all memory requests on the memory bus. Usually, the memory trace collected by a hardware monitor does not contain high-level application operations, such as context switches, function calls and I/O operations. HMTT has a software trace-control mechanism to correlate the memory trace with high-level operations: it provides APIs for injecting tags into the memory trace to identify these operations. In our case, we instrument the device drivers with these APIs to inject tags into the memory trace when a DMA buffer is allocated and freed. Combined with the I/O data transfer directions, the memory references to the DMA buffer can then be identified as DMA memory references. In this way, we can differentiate DMA memory references from processor memory references.
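The slide does not show the tagging API itself, so the following is only a hypothetical sketch of how a device driver could be instrumented; hmtt_inject_tag() and the tag values are invented stand-ins for the real HMTT API.

    /* Hypothetical driver instrumentation: inject a tag into the memory trace
     * when a DMA buffer is allocated and when it is freed, so references to
     * that buffer can later be classified as DMA memory references.          */
    #include <stdio.h>

    enum { TAG_DMA_BUF_ALLOC = 1, TAG_DMA_BUF_FREE = 2 };

    /* hypothetical stand-in: writes a tag record into the memory trace stream */
    static void hmtt_inject_tag(int tag, unsigned long addr, unsigned long len)
    {
        printf("tag=%d addr=0x%lx len=%lu\n", tag, addr, len);
    }

    static void driver_map_dma_buffer(unsigned long addr, unsigned long len)
    {
        hmtt_inject_tag(TAG_DMA_BUF_ALLOC, addr, len);   /* DMA buffer allocated */
        /* ... program the device's DMA engine with this buffer ... */
    }

    static void driver_unmap_dma_buffer(unsigned long addr, unsigned long len)
    {
        /* ... DMA transfer finished ... */
        hmtt_inject_tag(TAG_DMA_BUF_FREE, addr, len);    /* DMA buffer freed     */
    }

    int main(void)
    {
        driver_map_dma_buffer(0x100000, 4096);
        driver_unmap_dma_buffer(0x100000, 4096);
        return 0;
    }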

FPGA Emulation L2 Cache from Godson-2F DDR2 Memory Controller from Godson-2F DDR2 DIMM model from Micron Technology Xtreme system from Cadence Memory trace L2 Cache DMA Cache We adopt an FPGA platform to emulate the cache system. Our cycle-accurate experimental system consists of a last level cache and a DDR2 controller from the Godson-2F processor, and a DDR2 DIMM model from Micron Technology. The whole cache emulation system is implemented in synthesizable RTL code. We use an FPGA-based RTL emulation accelerator, the Xtreme system from Cadence, to accelerate this system. MemCtrl DDR2 Dram

Overview Background Nature of DMA Mechanism DMA Cache Scheme Research Methodology Evaluations Conclusions and Ongoing Work Next, we will show some evaluations.

Experimental Setup Machine AMD Opteron 2GB Memory 1 GE NIC IDE disk Benchmarks File Copy TPC-H SPECWeb2005 Configurations Snoop Cache (2MB) Shared Cache (2MB) DMA Cache 256KB + prefetch 256KB w/o prefetch 128KB + prefetch 128KB w/o prefetch 64KB + prefetch 64KB w/o prefetch 32KB + prefetch 32KB w/o prefetch We run all the benchmarks on a server machine. The server has an AMD Opteron 2.2GHz processor and 2GB of dual-channel DDR memory. We use HMTT to collect memory reference traces of three real applications (file copy, TPC-H, and SPECweb2005). We investigate various approaches: snoop cache, shared cache, and DMA cache. For the DMA cache, we have also evaluated various configurations, with or without prefetching and with various capacities.

Characterization of DMA The portion of DMA memory references varies depending on the application The sizes of DMA requests vary depending on the application Note: this slide contains 1 animation. Before_Action: we can see that the file-copy and TPC-H benchmarks have high portions of DMA memory references, 31% and 20% respectively, but DMA memory references in SPECweb account for a very small portion, only about 1%. Action_1: the average size of DMA requests and the cumulative distribution of DMA request sizes for the three benchmarks (one DMA request comprises a number of 64-byte DMA memory references). The average sizes of DMA write requests are about 110KB and 121KB for file copy and TPC-H respectively. For SPECweb2005, all DMA requests issued by the NIC are smaller than 1.5KB, because the maximum transmission unit (MTU) of a Gigabit Ethernet frame is only 1518 bytes. The size of DMA requests issued by the IDE controller for SPECweb2005 is also very small, an average of about 10KB.

Normalized Speedup Baseline is the snoop cache scheme This figure illustrates normalized speedup (the baseline is the snoop cache scheme). We find that the DMA cache scheme with a size of 256KB and prefetching outperforms all other schemes. However, small-capacity DMA caches (64KB/32KB) degrade performance for File Copy and TPC-H. The reason is that the average DMA request sizes of these two applications are about 110KB, which causes a severe cache thrashing problem. For the same reason, the shared cache also exhibits poor performance for these two applications. Baseline is the snoop cache scheme The DMA cache schemes exhibit better performance than the others

DMA Write & CPU Read Hit Rate Let's take a look at the hit rate of "DMA write and CPU read". This measures, when I/O data is written into the shared cache or the DMA cache, how much of it is actually used by the CPU. From the figure, we can see that both the shared cache and the DMA cache (see 256KB) exhibit high hit rates. However, as the previous slide showed, the shared cache can still degrade performance. Then, where do the cycles go for the shared-cache scheme? Both the shared cache and the DMA cache exhibit high hit rates Then, where do the cycles go for the shared-cache scheme?

Breakdown of Normalized Total Cycles Note: this slide contains 3 animations. This figure breaks down the total cycles for the various schemes. We can conclude that the performance improvements introduced by the DMA cache come from two parts. Action_1: First, since I/O data to be referenced by the CPU is already in the DMA cache, the CPU requests forwarded to memory are reduced significantly, causing less time for "CPU reading and writing memory". We can see that the shared-cache scheme also reduces this part of the overhead. Action_2: Second, since the I/O data access pattern is very regular, we can adopt a prefetching technique for the DMA cache to reduce the overhead of I/O read requests. Because the DMA cache only serves I/O data, the design and implementation of the prefetcher can be straightforward. For the shared-cache scheme, the prefetcher requires more complicated control logic to determine whether data is for the CPU or for I/O before it can generate prefetching requests efficiently. Action_3: However, in contrast with the snoop cache scheme, the cycles of "DMA write memory" increase significantly for the shared cache and for the DMA cache with a write-back policy. We investigated the reason and found that, in the snooping-cache scheme, all data is written into memory directly; in that situation, accessing DDR2 DRAM achieves quite good performance with few row buffer conflicts, because the I/O data is written into a continuous address region. In the shared-cache scheme, however, the data written to memory may not be continuous, for two reasons: (1) the I/O data is injected into the cache, incurring cache block replacement, and the replaced dirty blocks are usually non-continuous; (2) the continuous I/O data may itself be written back to memory non-continuously, because those blocks may later be selected for replacement in a random order. Because of these non-continuous addresses, the DDR DRAM may exhibit poor performance due to many row buffer conflicts. It should be noted that this phenomenon would be hidden by a fixed-cycle DRAM simulator; we reveal it because we adopt a detailed memory controller from a commercial CPU. In the DMA cache, we can adopt a write-through policy, different from the LLC, to avoid this negative influence. Our ongoing experiments show that the green part can be reduced when using the write-through policy, and the DMA cache can then achieve a speedup of about 1.5X.

% of DMA Writes Causing Dirty Block Replacement For the shared cache, the I/O data is injected into the cache, incurring cache block replacement. This figure shows the portion of dirty blocks which have to be written back to memory first. In fact, those DMA writes cause the cache thrashing problem. We can see that a large DMA cache can eliminate this phenomenon, but the small DMA caches still suffer from cache thrashing. Thus, the DMA cache size is a quite sensitive parameter. Those DMA writes cause cache pollution and thrashing problems The 256KB DMA cache is able to significantly eliminate these phenomena

% of Valid Prefetched Blocks This slide shows that straightforward sequential prefetching for the DMA cache is very effective: it exhibits an impressively high prefetching accuracy. The DMA caches exhibit an impressively high prefetching accuracy This is because I/O data has a very regular access pattern.

Overview Background Nature of DMA Mechanism DMA Cache Scheme Research Methodology Evaluations Conclusions and Ongoing Work Finally, we summarize this presentation and introduce our ongoing work.

Conclusions and Ongoing Work The Nature of DMA There is a producer-consumer relationship between the CPU and the DMA engine Memory plays the role of a transient place for I/O data transferred between processor and I/O device We propose a DMA cache scheme and discuss its design issues. Experimental results show that the DMA cache can significantly improve I/O performance. Ongoing Work The impact of multiprocessors and multiple DMA channels on the DMA cache In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme. Godson-3 has integrated a dedicated cache management policy for I/O data. Just read it.

THANKS! Q&A?