1 Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance
Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen, Institute of Computing Technology, Chinese Academy of Sciences. Workshop on the Influence of I/O on Microprocessor Architecture (IOM-2009). We are from the Institute of Computing Technology, Chinese Academy of Sciences.

2 A Brief Intro of ICT, CAS
ICT has developed the Loongson CPU. ICT has built the fastest HPC in China – Dawning 5000, which achieves 233.5 TFlops and ranks 10th in the Top500. The Institute of Computing Technology of the Chinese Academy of Sciences was the first academic institution in China to specialize in comprehensive research into computer science and technology. ICT produced China's first general-purpose digital computer in 1958 and has been developing the Dawning series of HPCs since the 1990s. Recently, the Dawning 5000 achieved a rank of 10th on the Top500 list. ICT is also the place where China's first general-purpose CPU chip, Loongson (also named Godson in academia), was born. The four-core Godson-3 will be released in 2009, with a maximal frequency of 1GHz and a power dissipation of 5-10 W. In addition, ICT has spun off many academic institutions and hi-tech companies, such as Lenovo.

3 Overview Background Nature of DMA Mechanism DMA Cache Scheme
Research Methodology Evaluations Conclusions and Ongoing Work This slide gives an overview of the presentation.

4 Importance of I/O operations
I/O is ubiquitous Load binary files: Disk → Memory Browse the web, media streams: Network → Memory … I/O is important Many commercial applications are I/O intensive: databases, Internet applications, etc. I/O operations are extremely important in modern computer systems. First, I/O is ubiquitous: when we want to run an application, its binary files are loaded from disk to memory via the I/O subsystem; when we browse the web or watch online video, the content is transferred from the network to memory via the I/O subsystem. Second, many commercial applications are I/O intensive, such as databases and Internet applications.

5 State-of-the-Art I/O Technologies
I/O Buses: 20GB/s PCI-Express 2.0 HyperTransport 3.0 QuickPath Interconnect I/O Devices RAID: 400MB/s 10GE: 1.25GB/s I/O bus and I/O device technologies have improved dramatically in the past decade. I/O buses can provide bandwidths of over 20GB/s, which is very close to that of a DDR2/3 DRAM memory system. I/O device performance has also increased significantly: on a PC platform, a software RAID0 of four SATA disks can provide a bandwidth of over 400MB/s, and 10Gb Ethernet can offer a bandwidth of 1.25GB/s.

6 A Typical Computer Architecture
NIC This is a typical computer architecture (Intel's i7 architecture). In fact, the I/O subsystem contributes strongly to both performance and cost, although some contributions are downsides, such as latency. This reflects the importance of both the I/O subsystem and this workshop.

7 Direct Memory Access (DMA)
DMA is an essential feature of I/O operation in all modern computers DMA allows I/O subsystems to access system memory for reading and/or writing independently of the CPU. Many I/O devices use DMA, including disk drive controllers, graphics cards, network cards, sound cards and GPUs. This slide explains DMA's essential role in I/O systems; we can just read it.

8 Overview Background Nature of DMA Mechanism DMA Cache Scheme
Research Methodology Evaluations Conclusions and Ongoing Work Next, we will introduce the nature of the DMA mechanism.

9 DMA in Computer Architecture
NIC Note: this slide contains four animations. Action_1: show that DMA controllers exist in various I/O devices, such as the NIC, the disk (SATA controller), and the sound device. Action_2 & Action_3: show the sources, destinations and paths of DMA operations. For example, I/O data is moved from memory to disk (Action_2), and I/O data is moved from the NIC to memory (Action_3).

10 An Example of Disk Read: DMA Receiving Operation
Memory CPU Driver Buffer Descriptor Kernel Buffer User Buffer DMA Engine Note: this slide contains 12 animations. Action_1 ~ Action_5: show the five steps of a DMA receive operation. Here the "Kernel Buffer" is the page cache of the OS kernel, and the "User Buffer" is a memory buffer malloc'ed by the application. Action_6 ~ Action_11: show the detailed steps of the copy operation. The yellow box represents I/O data: at first it resides in the "Driver Buffer" on the left side of DRAM; in the end it is moved to the "Kernel Buffer" on the right side of DRAM. Action_12: show that there are two memory accesses and two cache accesses for one piece of I/O data (e.g., one cache block), and the memory accesses are the time-consuming part. Cache Access Latency: ~20 Cycles. Memory Access Latency: ~200 Cycles. A sketch of this path, with the cycle arithmetic, follows.
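To make the two copies and their cost concrete, here is a minimal C sketch (not the authors' code; the buffer names and sizes are illustrative, and the closing arithmetic just plugs in the latencies quoted on the slide):

```c
/* Sketch of the classic DMA receive path: the DMA engine fills a driver
 * buffer in DRAM, the kernel copies it to the page cache ("kernel
 * buffer"), and a read() copies it again into the user buffer. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 64                      /* one cache block of I/O data  */

static uint8_t driver_buffer[BLOCK];  /* DMA target region in DRAM    */
static uint8_t kernel_buffer[BLOCK];  /* OS page cache                */
static uint8_t user_buffer[BLOCK];    /* malloc'ed application buffer */

int main(void)
{
    /* Steps 1-3: the device DMA-writes I/O data into the driver buffer. */
    memset(driver_buffer, 0xAB, BLOCK);

    /* Step 4: copy to the page cache -- a memory read (~200 cycles)
     * plus a memory write per block, since the data was deposited in
     * DRAM and is not yet cached. */
    memcpy(kernel_buffer, driver_buffer, BLOCK);

    /* Step 5: copy to the user buffer -- now mostly cache accesses
     * (~20 cycles), because step 4 brought the data into the cache. */
    memcpy(user_buffer, kernel_buffer, BLOCK);

    /* Back-of-envelope cost per block, using the slide's latencies:
     * two memory accesses plus two cache accesses. */
    printf("copy cost ~= %d cycles\n", 2 * 200 + 2 * 20); /* ~440 */
    return 0;
}
```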

11 Potential Improvement of DMA
Memory CPU Driver Buffer Descriptor Kernel Buffer User Buffer Note: this slide contains 9 animations. Action_1 ~ Action_3: show the first three steps of a DMA receive operation. Action_4: stress that one optimization approach is to inject I/O data directly into the cache. Action_5 ~ Action_8: show that there are then two cache accesses and only one memory access, so the copy time is reduced significantly. Action_9: explain that this is a typical shared-cache scheme. DMA Engine This is a typical Shared-Cache Scheme

12 Problems of Shared-Cache Scheme
Cache Pollution Cache Thrashing Degrades performance when DMA requests are large (>100KB), as for the "Oracle + TPC-H" application. The processor's LLC treats all data equally, which may cause cache pollution and thrashing problems, especially when the I/O data is large (over 100KB). Our experiments show that the shared-cache scheme can degrade performance when DMA requests are large for "Oracle + TPC-H" applications.

13 Rethink DMA Mechanism The Nature of DMA Corollaries
There is a producer-consumer relationship between the CPU and the DMA engine. Memory plays the role of a transient place for I/O data transferred between the processor and the I/O device. Corollaries: (1) Once I/O data is produced, it will be consumed; I/O data within a DMA buffer will be used only once in most cases (i.e. almost no reuse). (2) The characteristics of I/O data are different from those of CPU data, so it may not be appropriate to store I/O data and CPU data together. Note: this slide contains 9 animations. Before proposing another optimization approach, let's rethink the DMA mechanism. Action_1: we can see that the nature of DMA is ... (just read). Action_2: there are several corollaries. First, once I/O data is produced, it will be consumed (it should be noted that invalid or duplicate network packets are not consumed); I/O data within a DMA buffer will be used only once in most cases (i.e. almost no reuse). Action_3: just read them.
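The producer-consumer relationship can be pictured as a descriptor ring, sketched below in C; the ring layout and all names are illustrative, not taken from any particular driver:

```c
/* The DMA engine produces filled buffers; the CPU consumes each one
 * exactly once -- which is why I/O data shows almost no cache reuse. */
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 16

struct descriptor {
    void  *buf;        /* DMA buffer holding the I/O data           */
    size_t len;
    bool   ready;      /* set by the producer, cleared by consumer  */
};

static struct descriptor ring[RING_SIZE];
static unsigned prod, cons;

/* Producer side: the DMA engine (via its driver) publishes filled data. */
void dma_produce(void *buf, size_t len)
{
    struct descriptor *d = &ring[prod++ % RING_SIZE];
    d->buf = buf;
    d->len = len;
    d->ready = true;
}

/* Consumer side: the CPU touches each buffer once, then recycles it. */
struct descriptor *cpu_consume(void)
{
    struct descriptor *d = &ring[cons % RING_SIZE];
    if (!d->ready)
        return NULL;       /* nothing produced yet      */
    cons++;
    d->ready = false;      /* consumed exactly once     */
    return d;
}
```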

14 Overview Background Nature of DMA Mechanism DMA Cache Scheme
Research Methodology Evaluations Conclusions and Ongoing Work Based on this understanding, we next present a DMA cache scheme to optimize I/O performance.

15 DMA Cache Proposal A Dedicated Cache Storing I/O data
Capable of exchanging data with the processor's last level cache (LLC), reducing the overhead of I/O data movement. DMA Note: this slide contains 3 animations. Action_1 ~ Action_3: show that incoming I/O data is stored in the dedicated DMA cache and can be moved to the processor's LLC when the CPU accesses it. In this way, the overhead of memory copy operations is avoided.

16 DMA Cache Design Issues
Cache Coherence Data Path Replacement Policy Write Policy Prefetching CPU Cache State Diagram DMA Cache State Diagram Note: this slide contains 1 animation. There are several design issues for the DMA cache. Action_1: the cache coherence issue must be handled carefully. The upper figure illustrates a typical snooping-cache state diagram for I/O data coherency. This state diagram does not need to be modified when a DMA cache is added. Moreover, the state diagram for the DMA cache is similar to that of the processor cache except for the driving sources; for example, the "Invalid → Clean" state transitions of the CPU cache and the DMA cache are driven by CPU reads and DMA reads respectively. However, these state diagrams apply to uniprocessor platforms. The DMA Cache state diagram is similar to the CPU cache's in a uniprocessor system. We are currently researching multiprocessor platforms...
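As a rough illustration of such a three-state diagram, here is a hedged C sketch of the transitions; the state and event names are assumptions (the slide does not spell them out), and this models the uniprocessor case only:

```c
/* Simplified Invalid / Clean / Dirty state machine. The transitions are
 * the same for the CPU cache and the DMA cache; only the driving source
 * differs: "local" means CPU accesses for the CPU cache's machine and
 * DMA accesses for the DMA cache's machine. */
typedef enum { INVALID, CLEAN, DIRTY } blk_state_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

blk_state_t next_state(blk_state_t s, event_t e)
{
    switch (e) {
    case LOCAL_READ:  return (s == INVALID) ? CLEAN : s; /* Invalid->Clean */
    case LOCAL_WRITE: return DIRTY;                      /* allocate dirty */
    case SNOOP_READ:  return (s == DIRTY) ? CLEAN : s;   /* supply data    */
    case SNOOP_WRITE: return INVALID;                    /* invalidate     */
    }
    return s;
}
```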

17 DMA Cache Design Issues
Cache Coherence Data Path Replacement Policy Write Policy Prefetching DMA In practice, additional data paths and data access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism. The required modification is that all requests sent to the LLC must also be forwarded to the DMA cache; we believe this is not on the critical path of the cache hierarchy. Next, we illustrate the data paths for CPU reads and DMA reads.

18 Data Path: CPU Read cmd data Cache Ctrl Snoop Ctrl Last Level Cache
Hit in DMA cache? Miss in LLC & Hit in DMA Cache Cache Ctrl Snoop Ctrl Last Level Cache Cache Ctrl Snoop Ctrl DMA Cache Note: this slide contains 6 animations. Action_1: a CPU read request is forwarded to both the LLC and the DMA cache. Action_2: the DMA cache's controller sends a signal to the LLC indicating whether the request hits in the DMA cache. Action_3: the LLC controller learns of the "miss in LLC and hit in DMA cache" case. Action_4: the LLC sends a refill request to the DMA cache. Action_5: the DMA cache sends data to the LLC. Action_6: the requested data is forwarded to the upper cache. It should be noted that the data path of a CPU write is similar. For the cases of "hit in LLC" and "miss in both caches", the data paths are the same as in the current system without a DMA cache. In addition, cache coherence ensures that a piece of data can reside in only one of the two places. System Bus DMA Ctrl I/O Device Mem Ctrl Memory
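The lookup order of Actions 1-6 can be sketched in C as below; the single-entry "toy caches" are stand-ins for the real set-associative hardware, and the structure names are ours, not the authors':

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t addr_t;
#define BLK 64

struct toy_cache { bool valid; addr_t tag; uint8_t data[BLK]; };
static struct toy_cache llc, dmac;
static uint8_t dram[1 << 20];

static bool lookup(struct toy_cache *c, addr_t a, void *out)
{
    if (c->valid && c->tag == a) { memcpy(out, c->data, BLK); return true; }
    return false;
}

static void refill(struct toy_cache *c, addr_t a, const void *blk)
{
    c->valid = true; c->tag = a; memcpy(c->data, blk, BLK);
}

/* CPU read: forwarded to both caches; on "miss in LLC, hit in DMA
 * cache" the block migrates into the LLC over the existing snoop data
 * path, then flows on to the upper cache levels. */
void cpu_read(addr_t a, void *out)
{
    uint8_t blk[BLK];

    if (lookup(&llc, a, out))
        return;                          /* hit in LLC: unchanged path   */

    if (lookup(&dmac, a, blk)) {         /* miss in LLC, hit in DMA cache */
        refill(&llc, a, blk);            /* LLC refills from DMA cache   */
        dmac.valid = false;              /* coherence: one copy only     */
        memcpy(out, blk, BLK);
        return;
    }

    memcpy(blk, &dram[a % (sizeof dram - BLK)], BLK);  /* miss in both   */
    refill(&llc, a, blk);
    memcpy(out, blk, BLK);
}
```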

19 Data Path: DMA Read cmd data Cache Ctrl Snoop Ctrl Last Level Cache
Miss in DMA Cache & Hit in LLC Hit in LLC? System Bus Note: this slide contains 7 animations. Action_1: a DMA read request is forwarded to both the LLC and the DMA cache. Action_2: the LLC's snoop controller sends a signal to the DMA cache indicating whether the request hits in the LLC. Action_3: the DMA cache's snoop controller learns of the "miss in DMA cache and hit in LLC" case. Action_4: the DMA cache sends a refill request to the LLC. Action_5: the LLC sends data to the DMA cache. Action_6: the requested data is forwarded to the I/O device. Action_7: prefetching is triggered to fetch the next I/O data from memory or the LLC. As on the last slide, the data path of a DMA write is similar. DMA read DMA Ctrl I/O Device Mem Ctrl Memory
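For symmetry, a sketch of the DMA-read side; the helper functions are hypothetical stand-ins mirroring the previous sketch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t addr_t;

/* Hypothetical helpers, mirroring the CPU-read sketch above. */
bool dmac_lookup(addr_t a, void *out);
bool llc_lookup(addr_t a, void *out);
void dmac_refill(addr_t a, const void *blk);
void mem_read(addr_t a, void *out);
void prefetch_next(addr_t a);          /* see the prefetching slide */

/* DMA read: forwarded to both caches; on "miss in DMA cache, hit in
 * LLC" the block migrates into the DMA cache over the snoop path, and
 * any DMA cache miss also triggers sequential prefetching (Action_7). */
void dma_read(addr_t a, void *out)
{
    if (dmac_lookup(a, out))
        return;                        /* hit in DMA cache            */

    if (llc_lookup(a, out))            /* miss here, hit in LLC       */
        dmac_refill(a, out);           /* migrate over snoop path     */
    else {
        mem_read(a, out);              /* miss in both caches         */
        dmac_refill(a, out);
    }
    prefetch_next(a);                  /* fetch the next I/O data     */
}
```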

20 DMA Cache Design Issues
Cache Coherence Data Path Replacement Policy Write Policy Prefetching An LRU-like Replacement Policy Invalid Block Clean Block Dirty Block We use an LRU-like replacement policy in which DMA cache blocks are selected for replacement with the following priority: 1) first, an invalidated block is selected; 2) second, the least-recently-used clean block; 3) finally, the least-recently-used dirty block. However, the I/O data access pattern is different from that of CPU data, which implies that blocks in the DMA cache will not be reused in most cases. While other replacement policies may be more efficient for the DMA cache, a more detailed study of replacement policy is beyond the scope of this presentation. A sketch of the victim selection follows.
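A minimal C sketch of this victim-selection priority; the set structure and the LRU counter are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 8

struct way { bool valid; bool dirty; uint32_t lru; /* smaller = older */ };

/* Pick a victim way: invalid first, then LRU clean, then LRU dirty. */
int pick_victim(struct way set[WAYS])
{
    int victim = -1;
    uint32_t oldest = UINT32_MAX;

    /* 1) any invalid block wins immediately */
    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid)
            return w;

    /* 2) least-recently-used clean block */
    for (int w = 0; w < WAYS; w++)
        if (!set[w].dirty && set[w].lru < oldest) {
            oldest = set[w].lru;
            victim = w;
        }
    if (victim >= 0)
        return victim;

    /* 3) fall back to the least-recently-used dirty block */
    oldest = UINT32_MAX;
    for (int w = 0; w < WAYS; w++)
        if (set[w].lru < oldest) {
            oldest = set[w].lru;
            victim = w;
        }
    return victim;
}
```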

21 DMA Cache Design Issue Cache Coherence Data Path Replacement Policy
Write Policy Prefetching Adopt Write-Allocate Policy Both write-back and write-through policies are available We adopt the write-allocate policy and the write-back policy for both caches. Thus, dirty blocks in both caches are not written back to memory until they are selected for replacement. However, we also tried other write policies (e.g., write-through) and found that a DMA cache with the write-through policy can outperform the write-back policy. A sketch follows.
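A sketch of write-allocate with the write-back/write-through choice as a flag; the block structure and the mem_write helper are stand-ins, not the actual design:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t addr_t;

struct block { bool valid, dirty; addr_t tag; uint8_t data[64]; };

/* Stand-in for a DRAM write. */
static void mem_write(addr_t a, const void *data, int len)
{
    (void)a; (void)data; (void)len;
}

/* write_through == false gives the write-back behavior the slide
 * adopts (dirty blocks linger until eviction); true gives the variant
 * the authors found can outperform it, since I/O blocks then stream to
 * DRAM immediately at contiguous addresses. */
void dma_cache_write(struct block *b, addr_t a, const uint8_t *src,
                     bool write_through)
{
    if (!b->valid || b->tag != a) {          /* miss: write-allocate   */
        if (b->valid && b->dirty)
            mem_write(b->tag, b->data, 64);  /* evict old dirty block  */
        b->valid = true; b->tag = a; b->dirty = false;
    }
    memcpy(b->data, src, 64);
    if (write_through)
        mem_write(a, b->data, 64);           /* propagate immediately  */
    else
        b->dirty = true;                     /* defer until eviction   */
}
```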

22 DMA Cache Design Issue Cache Coherence Data Path Replacement Policy
Write Policy Prefetching I/O data has a very regular access pattern, being linear within a DMA buffer. Thus, even though the DMA cache adopts a straightforward sequential prefetching technique, it is able to significantly improve performance. We adopt a sequential prefetching scheme with a prefetching degree of four cache blocks for the DMA cache. Adopt straightforward sequential prefetching: prefetching triggered by cache miss; fetch 4 blocks at a time.
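The prefetcher itself reduces to a few lines. A sketch, assuming a 64-byte block and reading the degree as the four blocks beyond the demand miss (the slide leaves that detail open):

```c
#include <stdint.h>

typedef uint64_t addr_t;
#define BLK_BYTES 64
#define PREFETCH_DEGREE 4   /* four cache blocks, per the slide */

/* Stand-in for a refill from the LLC or memory. */
static void fetch_block(addr_t a) { (void)a; }

/* On a DMA cache miss, fetch the demand block plus the next sequential
 * blocks; I/O streams are linear within a DMA buffer, which is why so
 * naive a scheme achieves high accuracy. */
void on_dma_cache_miss(addr_t miss_addr)
{
    fetch_block(miss_addr);
    for (int i = 1; i <= PREFETCH_DEGREE; i++)
        fetch_block(miss_addr + (addr_t)i * BLK_BYTES);
}
```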

23 Overview Background Nature of DMA Mechanism DMA Cache Scheme
Research Methodology Evaluations Conclusions and Ongoing Work Next, we introduce our research methodology.

24 Memory Trace Collection
Hyper Memory Trace Tool (HMTT) Capable of collecting all memory requests Provides APIs for injecting tags into the memory trace to identify high-level system operations We adopt a hardware snooping tool to collect all memory reference traces and an FPGA-based trace-driven emulation system to evaluate our proposals. This is the memory trace tool, called HMTT. It is plugged into a DIMM slot to collect all memory requests on the memory bus. Usually a memory trace collected by a hardware monitor does not contain high-level application operations, such as context switches, function calls, and I/O operations. HMTT has a software trace-control mechanism to correlate the memory trace with high-level operations: it provides APIs for injecting tags into the memory trace to identify these operations. In our case, we instrument device drivers with these APIs to inject tags into the memory trace when a DMA buffer is allocated and freed. Combined with the I/O data transfer directions, memory references to the DMA buffer can be identified as DMA memory references. In this way, we can differentiate DMA memory references from processor memory references.
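A sketch of how a driver might use such tagging; the function name hmtt_inject_tag and the tag codes are hypothetical stand-ins, since the slide does not give the actual API:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the HMTT tag API: pushes a software tag
 * into the hardware memory trace. */
static void hmtt_inject_tag(uint32_t tag, uint64_t addr, uint64_t len)
{
    (void)tag; (void)addr; (void)len;
}

#define TAG_DMA_BUF_ALLOC 0x01   /* illustrative tag codes */
#define TAG_DMA_BUF_FREE  0x02

/* Driver-side instrumentation: bracket DMA buffer allocation and
 * release with tags, so later trace analysis can classify references
 * to [buf, buf+len) between the two tags as DMA memory references. */
void *dma_buf_alloc(size_t len)
{
    void *buf = aligned_alloc(64, len);
    hmtt_inject_tag(TAG_DMA_BUF_ALLOC, (uint64_t)(uintptr_t)buf, len);
    return buf;
}

void dma_buf_free(void *buf, size_t len)
{
    hmtt_inject_tag(TAG_DMA_BUF_FREE, (uint64_t)(uintptr_t)buf, len);
    free(buf);
}
```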

25 FPGA Emulation L2 Cache from Godson-2F
DDR2 Memory Controller from Godson-2F DDR2 DIMM model from Micron Technology Xtreme system from Cadence Memory trace L2 Cache DMA Cache We adopt an FPGA platform to emulate the cache system. Our cycle-accurate experimental system consists of a last level cache and a DDR2 controller from the Godson-2F processor, and a DDR2 DIMM model from Micron Technology. The whole cache emulation system is implemented in synthesizable RTL code. We use an FPGA-based RTL emulation accelerator, the Xtreme system from Cadence, to accelerate this system. MemCtrl DDR2 DRAM

26 Overview Background Nature of DMA Mechanism DMA Cache Scheme
Research Methodology Evaluations Conclusions and Ongoing Work Next, we will show some evaluations.

27 Experimental Setup Machine Benchmark Configurations Snoop Cache (2MB)
AMD Opteron 2GB Memory 1 GE NIC IDE disk Benchmark File Copy TPC-H SPECWeb2005 Configurations Snoop Cache (2MB) Shared Cache (2MB) DMA Cache 256KB + prefetch 256KB w/o prefetch 128KB + prefetch 128KB w/o prefetch 64KB + prefetch 64KB w/o prefetch 32KB + prefetch 32KB w/o prefetch We run all the benchmarks on a server machine with an AMD Opteron 2.2GHz processor and 2GB of dual-channel DDR memory. We use HMTT to collect memory reference traces of three real applications (file-copy, TPC-H, and SPECweb2005). We investigate various approaches: snoop cache, shared cache, and DMA cache. For the DMA cache, we also evaluate various configurations, with or without prefetching, at various capacities.

28 Characterization of DMA
The portions of DMA memory references vary depending on the application The sizes of DMA requests vary depending on the application Note: this slide contains 1 animation. Before_Action: we can see that the file-copy and TPC-H benchmarks have high portions of DMA memory references, 31% and 20% respectively, but DMA memory references in SPECweb account for a very small portion, only about 1%. Action_1: shows the average size of DMA requests and the cumulative distribution of DMA request sizes for the three benchmarks (one DMA request comprises a number of 64-byte DMA memory references). The average sizes of DMA write requests are about 110KB and 121KB for file-copy and TPC-H respectively. For SPECweb2005, all DMA requests issued by the NIC are smaller than 1.5KB because the maximum transmission unit (MTU) of a Gigabit Ethernet frame is only 1518 bytes. The DMA requests issued by the IDE controller for SPECweb2005 are also very small, about 10KB on average.

29 Normalized Speedup Baseline is snoop cache scheme
This figure illustrates normalized speedup (the baseline is the snoop cache scheme). We find that the DMA cache scheme with a size of 256KB plus prefetching outperforms all other schemes. However, small-capacity DMA caches (64KB/32KB) degrade performance for File-Copy and TPC-H. The reason is that the average DMA request size of these two applications is about 110KB, which causes severe cache thrashing. For the same reason, the shared cache also exhibits poor performance on these two applications. Baseline is the snoop cache scheme DMA cache schemes exhibit better performance than the others

30 DMA Write & CPU Read Hit Rate
Let's take a look at the hit rate of "DMA write and CPU read", i.e., when I/O data is written into the shared cache or the DMA cache, how much of it is actually used by the CPU. From the figure, we can see that both the shared cache and the DMA cache (see 256KB) exhibit high hit rates. However, as the last slide showed, the shared cache can still degrade performance. So where do the cycles go in the shared-cache scheme? Both shared cache and DMA cache exhibit high hit rates Then, where do the cycles go for the shared cache scheme?

31 Breakdown of Normalized Total Cycles
Note: this slide contains 3 animations. This figure breaks down the total cycles for the various schemes. The performance improvement introduced by the DMA cache divides into two parts: Action_1: first, since I/O data to be referenced by the CPU is already in the DMA cache, the CPU requests forwarded to memory are reduced significantly, so less time is spent on "CPU reading and writing memory". The shared-cache scheme also reduces this part of the overhead. Action_2: second, since the I/O data access pattern is very regular, we can adopt a prefetching technique for the DMA cache to reduce the overhead of I/O read requests. Because the DMA cache serves only I/O data, the design and implementation of the prefetcher can be straightforward. For the shared-cache scheme, the prefetcher requires more complicated control logic to determine whether data is for the CPU or for I/O before it can generate prefetching requests efficiently. Action_3: however, in contrast with the snoop cache scheme, the cycles for "DMA write memory" increase significantly for the shared cache and for the DMA cache with the write-back policy. We investigated the reason and found that, in the snooping-cache scheme, all data is written into memory directly; in that situation, accessing DDR2 DRAM achieves quite good performance with few row buffer conflicts, because the I/O data is written to a contiguous address region. In the shared-cache scheme, however, data written to memory may not be contiguous, for two reasons: (1) the I/O data is injected into the cache, incurring cache block replacement, and the replaced dirty blocks are usually non-contiguous; (2) the contiguous I/O data itself may be written back to memory non-contiguously because blocks may later be selected for replacement in random order. Because of these non-contiguous addresses, the DDR DRAM may exhibit poor performance due to many row buffer conflicts. It should be noted that this phenomenon would be hidden in a fixed-cycle DRAM simulator; we reveal it because we adopt a detailed memory controller from a commercial CPU. In the DMA cache, we can adopt a write-through policy, different from the LLC, to avoid this negative influence. Our ongoing experiments show that the green part can be reduced when using the write-through policy, and the DMA cache can then achieve a speedup of about 1.5X.
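The row-buffer argument can be checked with back-of-envelope code. Assuming a hypothetical 8KB DRAM row, sequential 64-byte writes stay within one open row while scattered writebacks open a new row almost every time:

```c
#include <stdint.h>
#include <stdio.h>

#define ROW_BYTES 8192   /* assumed DDR2 row size */

static uint64_t row_of(uint64_t addr) { return addr / ROW_BYTES; }

int main(void)
{
    /* 128 sequential 64B writes (8KB total): they fit in one row,
     * so essentially no row conflicts -- the snoop-cache case. */
    uint64_t conflicts = 0, prev_row = row_of(0);
    for (uint64_t a = 64; a < 128 * 64; a += 64) {
        if (row_of(a) != prev_row) conflicts++;
        prev_row = row_of(a);
    }
    printf("sequential: %llu row conflicts\n",
           (unsigned long long)conflicts);

    /* 128 scattered writebacks (randomly evicted dirty blocks):
     * nearly every access lands in a different row. */
    conflicts = 0; prev_row = row_of(0);
    for (int i = 1; i < 128; i++) {
        uint64_t a = (uint64_t)i * 1315423911u;  /* pseudo-random spread */
        if (row_of(a) != prev_row) conflicts++;
        prev_row = row_of(a);
    }
    printf("scattered:  %llu row conflicts\n",
           (unsigned long long)conflicts);
    return 0;
}
```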

32 % of DMA Writes causing Dirty Block Replacement
For the shared cache, the I/O data is injected into the cache, incurring cache block replacement. This figure shows the portion of dirty blocks that must first be written back to memory. In fact, those DMA writes cause cache thrashing. We can see that a large DMA cache eliminates this phenomenon, but small DMA caches still suffer from thrashing. Thus, the DMA cache size is a quite sensitive parameter. Those DMA writes cause cache pollution and thrashing problems. The 256KB DMA cache is able to significantly eliminate these phenomena.

33 % of Valid Prefetched Blocks
This slide shows that straightforward sequential prefetching for the DMA cache is very effective: it exhibits impressively high prefetching accuracy. DMA caches can exhibit impressively high prefetching accuracy because I/O data has a very regular access pattern.

34 Overview Background Nature of DMA Mechanism DMA Cache Scheme
Research Methodology Evaluations Conclusions and Ongoing Work Finally, we summarize this presentation and introduce our ongoing work.

35 Conclusions and Ongoing Work
The Nature of DMA: there is a producer-consumer relationship between the CPU and the DMA engine, and memory plays the role of a transient place for I/O data transferred between the processor and the I/O device. We propose a DMA cache scheme and discuss its design issues. Experimental results show that a DMA cache can significantly improve I/O performance. Ongoing Work: the impact of multiprocessors and multiple DMA channels on the DMA cache. In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme; Godson-3 has integrated a dedicated cache management policy for I/O data. Just read it.

36 THANKS! Q&A?

