
1 DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance
Dan Tang, Yungang Bao, Weiwu Hu, Mingyu Chen
2010.1
Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

2 The Role of I/O
- I/O is ubiquitous
  - Loading binary files: disk -> memory
  - Web browsing, media streaming: network -> memory
  - ...
- I/O is significant
  - Many commercial applications are I/O intensive: databases, etc.

3 State-of-the-Art I/O Technologies
- I/O bus: 20GB/s
  - PCI-Express 2.0, HyperTransport 3.0, QuickPath Interconnect
- I/O devices
  - SSD RAID: 1.2GB/s
  - 10GE: 1.25GB/s
  - Fusion-io: 8GB/s, 1M IOPS (2KB random 70/30 read/write mix)

4 Direct Memory Access (DMA)
- DMA is used for I/O operations in all modern computers
- DMA allows I/O subsystems to access system memory independently of the CPU
- Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs

5 Outline: Revisiting I/O, DMA Cache Design, Evaluations, Conclusions

6 An Example of a Disk Read: the DMA Receiving Operation
[Figure: CPU, memory, DMA engine, driver, buffer descriptor, and kernel buffer, with the transfer steps numbered ①-④]
- Cache access latency: ~20 cycles
- Memory access latency: ~200 cycles
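The four numbered steps on this slide can be sketched as a descriptor hand-off between the device driver and the DMA engine. This is a minimal, hypothetical sketch in C: the descriptor layout, the flag value, and the function names are assumptions, not a real driver interface.

```c
/* Sketch of the four-step DMA receive flow from slide 6. All names, the
 * descriptor layout, and the flag value are illustrative, not a real driver API. */
#include <stdint.h>

#define DESC_OWNED_BY_DEVICE 0x1u

struct dma_descriptor {
    uint64_t buf_phys_addr;   /* physical address of the kernel buffer */
    uint32_t length;          /* number of bytes to transfer */
    uint32_t flags;           /* ownership / status bits */
};

/* Step 1: the driver fills a buffer descriptor pointing at a kernel buffer
 * and hands ownership to the DMA engine. */
static void driver_post_receive(struct dma_descriptor *desc,
                                uint64_t kernel_buf_phys, uint32_t len)
{
    desc->buf_phys_addr = kernel_buf_phys;
    desc->length        = len;
    desc->flags         = DESC_OWNED_BY_DEVICE;
}

/* Steps 2-4: the DMA engine reads the descriptor, copies the I/O data into the
 * kernel buffer without involving the CPU (modeled here as a plain memory copy),
 * and clears the ownership bit so the CPU knows the buffer is ready to read. */
static void dma_engine_receive(struct dma_descriptor *desc,
                               const uint8_t *device_data, uint8_t *memory)
{
    if (!(desc->flags & DESC_OWNED_BY_DEVICE))
        return;                                   /* nothing posted */
    for (uint32_t i = 0; i < desc->length; i++)
        memory[desc->buf_phys_addr + i] = device_data[i];
    desc->flags &= ~DESC_OWNED_BY_DEVICE;         /* completion notification */
}
```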

7 Direct Cache Access [Ram-ISCA05], a typical shared-cache scheme; the prefetch-hint approach [Kumar-Micro07]
[Figure: the DMA receiving flow of slide 6, annotated for the shared-cache scheme]

8 Problems of the Shared-Cache Scheme
- Cache pollution
- Cache thrashing
- Not suitable for other I/O -> degrades performance when DMA requests are large (>100KB), e.g. for the "Oracle + TPC-H" application
- To address this problem thoroughly, we need to investigate the characteristics of I/O data.

9 I/O Data vs. CPU Data
[Figure: the HMTT tracing tool attached at the memory controller observes the combined stream of I/O data and CPU data]

10 A Short Ad for HMTT [Bao-Sigmetrics08]
- A hardware/software hybrid memory trace tool
- Supports the DDR2 DIMM interface on multiple platforms
- Collects full-system off-chip memory traces
- Provides traces with semantic information, e.g., virtual address, process ID, I/O operation
- Can trace commercial applications, e.g., Oracle, web servers
[Figure: the HMTT system]

11 Characteristics of I/O Data (1)
[Charts: % of memory references to I/O data; % of references by I/O type]

12 Characteristics of I/O Data (2)
[Chart: I/O request size distribution]

13 Characteristics of I/O Data (3)
- Sequential access in I/O data: compared with CPU data, I/O data accesses are very regular

14 Characteristics of I/O Data (4)
- Reuse distance (RD), measured as LRU stack distance
[Figure: a worked LRU-stack example and the reuse-distance CDF, read as "x% of references have RD <= n"]
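As a concrete illustration of reuse distance measured as LRU stack distance, the hypothetical sketch below replays a tiny block-address trace and reports, for each reference, how many distinct blocks were touched since the previous reference to the same block. MAX_BLOCKS and the toy trace are arbitrary choices, not values from the paper.

```c
/* LRU-stack-distance sketch: the reuse distance of a reference is the number of
 * distinct blocks touched since the previous reference to the same block
 * (reported as -1, i.e. infinite, for a first touch). Purely illustrative. */
#include <stdio.h>

#define MAX_BLOCKS 1024

static long lru_stack[MAX_BLOCKS];   /* most recently used block at index 0 */
static int  depth = 0;

static int reuse_distance(long block)
{
    /* Search the stack: the hit depth is the reuse distance. */
    for (int i = 0; i < depth; i++) {
        if (lru_stack[i] == block) {
            for (int j = i; j > 0; j--)       /* move the block to the top */
                lru_stack[j] = lru_stack[j - 1];
            lru_stack[0] = block;
            return i;
        }
    }
    /* First touch: push on top of the stack (dropping the LRU entry if full). */
    if (depth < MAX_BLOCKS)
        depth++;
    for (int j = depth - 1; j > 0; j--)
        lru_stack[j] = lru_stack[j - 1];
    lru_stack[0] = block;
    return -1;
}

int main(void)
{
    long trace[] = {1, 2, 3, 2, 1};           /* toy block-address trace */
    for (int i = 0; i < 5; i++)
        printf("block %ld -> reuse distance %d\n", trace[i], reuse_distance(trace[i]));
    return 0;
}
```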

15 Characteristics of I/O Data (5)
[Chart: curves for DMA-W, DMA-R, CPU-R, CPU-W, and CPU-RW references]

16 Rethinking I/O & DMA Operation
- 20~40% of memory references are to I/O data in I/O-intensive applications
- The characteristics of I/O data differ from those of CPU data:
  - an explicit produce-consume relationship for I/O data
  - the reuse distance of I/O data is smaller than that of CPU data
  - references to I/O data are primarily sequential
- => Separate I/O data from CPU data

17 Separating I/O Data and CPU Data
[Figure: the memory hierarchy before and after separating I/O data from CPU data]

18 Outline: Revisiting I/O, DMA Cache Design, Evaluations, Conclusions

19 DMA Cache Design Issues
- Write policy
- Cache coherence
- Replacement policy
- Prefetching
- Dedicated DMA Cache (DDC)

20 DMA Cache Design Issues: Write Policy
- Adopt a write-allocate policy
- Both write-back and write-through policies are available
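A minimal sketch of how write-allocate could look for incoming DMA writes under either write policy. The cache-model hooks (lookup, allocate_line, fill_data, mark_dirty, write_through) are hypothetical placeholders, not the paper's implementation.

```c
/* Write-allocate for DMA writes: on a miss, allocate the line in the DMA cache
 * first and then merge the incoming I/O data, so that the later CPU read of the
 * buffer can be served on chip. The extern hooks below are assumed, not real. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cache_line;                                         /* opaque model type */

extern struct cache_line *lookup(uint64_t addr);           /* NULL on miss */
extern struct cache_line *allocate_line(uint64_t addr);    /* may evict a victim */
extern void fill_data(struct cache_line *l, const void *src, size_t len);
extern void mark_dirty(struct cache_line *l);              /* write-back bookkeeping */
extern void write_through(uint64_t addr, const void *src, size_t len);

static void dma_write(uint64_t addr, const void *src, size_t len, bool write_back)
{
    struct cache_line *line = lookup(addr);
    if (line == NULL)
        line = allocate_line(addr);         /* write-allocate: bring the line on chip */
    fill_data(line, src, len);
    if (write_back)
        mark_dirty(line);                   /* memory is updated only on eviction */
    else
        write_through(addr, src, len);      /* memory is updated immediately as well */
}
```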

21 DMA Cache Design Issues: Cache Coherence
- IO-ESI protocol for the write-through policy
- IO-MOESI protocol for the write-back policy
- The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of the state transitions are exchanged
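The "exchange the local source and the probe source" idea can be sketched as reusing one MOESI-style transition function while swapping which requester counts as local: in the DMA cache, accesses from the DMA engine play the local role, while CPU-side accesses arrive as probes. The transition table below is deliberately simplified (for example, it ignores whether other caches share the line) and is not the paper's exact IO-MOESI specification.

```c
/* Simplified MOESI-style transition function plus an IO-MOESI wrapper that
 * swaps the local/probe roles for the DMA cache. Illustrative only. */
typedef enum { ST_I, ST_S, ST_E, ST_O, ST_M } coh_state_t;
typedef enum { EV_LOCAL_READ, EV_LOCAL_WRITE, EV_PROBE_READ, EV_PROBE_WRITE } coh_event_t;

static coh_state_t moesi_next(coh_state_t s, coh_event_t e)
{
    switch (e) {
    case EV_LOCAL_READ:  return (s == ST_I) ? ST_E : s;     /* fill (assume no sharers) */
    case EV_LOCAL_WRITE: return ST_M;                        /* obtain ownership, dirty */
    case EV_PROBE_READ:  return (s == ST_M) ? ST_O :        /* owner keeps dirty copy */
                                (s == ST_E) ? ST_S : s;      /* clean copy becomes shared */
    case EV_PROBE_WRITE: return ST_I;                        /* remote writer invalidates */
    }
    return s;
}

/* In the DMA cache, a DMA-engine access is "local" and a CPU-cache access is a
 * probe; this role swap is the only change relative to the original protocol. */
static coh_state_t io_moesi_next(coh_state_t s, int is_write, int from_dma_engine)
{
    coh_event_t e;
    if (from_dma_engine)
        e = is_write ? EV_LOCAL_WRITE : EV_LOCAL_READ;
    else
        e = is_write ? EV_PROBE_WRITE : EV_PROBE_READ;
    return moesi_next(s, e);
}
```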

22 A Big Issue
How do we prove the correctness of integrating heterogeneous cache coherence protocols in one system?

23 A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98]
[Figure: global states composed from the per-cache states of the DMA cache and the CPU caches, with transitions labeled by request/probe events; e.g. OS+I+ is marked valid (√) while MS+I+ is marked a conflict (✗)]

24 Global State Cache Coherence Theorem
Given N (N>1) well-defined cache protocols, they do not conflict if and only if no conflict global state exists in the global state transition machine.
[Table: the 5 reachable global states (e.g. S+I+, EI*I*, OS*I*), all marked valid (√)]
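One way to read the theorem operationally: enumerate the reachable global states of the combined protocols and check that none of them is a conflict. The sketch below does a breadth-first search over a hypothetical product state machine; the state encoding, the next_states() successor function, and the conflict predicate (two dirty copies of the same line) are all assumptions for illustration, not the paper's formal model.

```c
/* Reachability check suggested by the theorem: the combined protocols are free
 * of conflicts iff no conflict global state is reachable from the initial state.
 * next_states() and the state encodings are hypothetical placeholders. */
#include <stdbool.h>

#define MAX_GLOBAL 4096
#define MODIFIED   4                      /* placeholder encoding of the M state */

typedef struct { int cpu_state; int dma_state; } global_t;

extern int next_states(global_t cur, global_t out[16]);   /* assumed: returns count */

static bool is_conflict(global_t g)
{
    /* Example conflict: both caches claim a Modified copy of the same line. */
    return g.cpu_state == MODIFIED && g.dma_state == MODIFIED;
}

static bool seen_before(const global_t *seen, int n, global_t g)
{
    for (int i = 0; i < n; i++)
        if (seen[i].cpu_state == g.cpu_state && seen[i].dma_state == g.dma_state)
            return true;
    return false;
}

static bool protocols_compatible(global_t init)
{
    static global_t queue[MAX_GLOBAL];    /* doubles as the visited set */
    global_t succ[16];
    int head = 0, tail = 0;

    queue[tail++] = init;
    while (head < tail) {
        global_t g = queue[head++];
        if (is_conflict(g))
            return false;                 /* a conflict global state is reachable */
        int n = next_states(g, succ);
        for (int i = 0; i < n; i++)
            if (!seen_before(queue, tail, succ[i]) && tail < MAX_GLOBAL)
                queue[tail++] = succ[i];  /* enqueue each newly discovered state */
    }
    return true;                          /* no conflict global state reachable */
}
```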

25 MOESI + ESI
[Table: the 6 reachable global states, all marked valid (√)]

26 DMA Cache Design Issues: Replacement Policy
- An LRU-like replacement policy that considers the coherence state, in the order: 1. Invalid, 2. Shared, 3. Owned, 4. Exclusive, 5. Modified
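A sketch of the victim selection this slide suggests, assuming the numbered states give the preferred eviction order (Invalid first, Modified last) with LRU age breaking ties within a state. The set layout and field names are assumptions, not the RTL implementation.

```c
/* LRU-like victim selection: prefer lines in the "cheapest" coherence state
 * (Invalid < Shared < Owned < Exclusive < Modified); among equals, evict the
 * least recently used line. */
typedef enum { INVALID = 0, SHARED, OWNED, EXCLUSIVE, MODIFIED } line_state_t;

struct cache_way {
    line_state_t state;
    unsigned     lru_age;                  /* larger value = used longer ago */
};

static int pick_victim(const struct cache_way *set, int num_ways)
{
    int victim = 0;
    for (int w = 1; w < num_ways; w++) {
        if (set[w].state < set[victim].state ||
            (set[w].state == set[victim].state &&
             set[w].lru_age > set[victim].lru_age))
            victim = w;                    /* cheaper state, or same state but older */
    }
    return victim;
}
```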

27 DMA Cache Design Issues: Prefetching
- Adopt straightforward sequential prefetching
- Prefetching is triggered by a cache miss
- Fetch 4 blocks at a time
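The sequential prefetcher can be stated in a few lines: on a DMA cache miss, fetch the missing block and the next consecutive blocks. The fetch_block() hook is a hypothetical fill interface, and whether the demand block counts toward the four fetched blocks is not specified on the slide; in this sketch it does not.

```c
/* Miss-triggered sequential prefetching: on a DMA cache miss to block B, issue
 * the demand fill for B plus prefetches for the next PREFETCH_DEGREE blocks. */
#define PREFETCH_DEGREE 4

extern void fetch_block(unsigned long block_addr);   /* assumed cache-fill hook */

static void on_dma_cache_miss(unsigned long miss_block)
{
    fetch_block(miss_block);                          /* demand fill */
    for (unsigned i = 1; i <= PREFETCH_DEGREE; i++)
        fetch_block(miss_block + i);                  /* sequential prefetches */
}
```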

28 Design Complexity vs. Design Cost
- Dedicated DMA Cache (DDC)
- Partition-Based DMA Cache (PBDC)

29 Outline: Revisiting I/O, DMA Cache Design, Evaluations, Conclusions

30 Speedup of Dedicated DMA Cache

31 % of Valid Prefetched Blocks
- DMA caches can exhibit impressively high prefetching accuracy
- This is because I/O data has a very regular access pattern

32 Performance Comparisons
Although PBDC does not require additional on-chip storage, it achieves about 80% of DDC's performance improvement.

33 Outline: Revisiting I/O, DMA Cache Design, Evaluations, Conclusions

34 Conclusions
- We have proposed a DMA cache technique to separate I/O data from CPU data
- We adopt a global state method for integrating heterogeneous cache coherence protocols
- Experimental results show that the DMA cache schemes are better than existing approaches that use a unified, shared cache for I/O data and CPU data
- Open problems remain, e.g.:
  - Can I/O data go directly to the L1 cache?
  - How should heterogeneous caches be designed for different types of data?
  - How can the memory controller be optimized with I/O awareness?

35 Thanks! Questions?

36 RTL Emulation Platform
- LLC and DMA cache models from Loongson-2F
- DDR2 memory controller from Loongson-2F
- DDR2 DIMM model from Micron Technology
[Figure: platform block diagram with the memory trace, LL cache, DMA cache, memory controller, and DDR2 DIMM]

37 Parameters
[Table: platform parameters; DDR2-666 memory]

38 Normalized Speedup for WB
- The baseline is the snoop-cache scheme
- The DMA cache schemes exhibit better performance than the others

39 DMA Write & CPU Read Hit Rate
- Both the shared cache and the DMA cache exhibit high hit rates
- Then where do the cycles go in the shared-cache scheme?

40 Breakdown of Normalized Total Cycles

41 Design Complexity of PBDC

42 More References on Cache Coherence Protocol Verification
- Fong Pong and Michel Dubois. Formal Verification of Complex Coherence Protocols Using Symbolic State Models. Journal of the ACM, 45(4):557-587, July 1998.
- Fong Pong and Michel Dubois. Verification Techniques for Cache Coherence Protocols. ACM Computing Surveys, 29(1):82-126, March 1997.

