1 Using Uncacheable Memory to Improve Unity Linux Performance
Ning Qu, Xiaogang Gou, Xu Cheng
Microprocessor Research and Development Center, Peking University

2 Issues
Unity SoC architecture: no bus snooping, so cache coherency problems arise everywhere.

3 Issues cont.
[Diagram: data moves from the I/O device buffer via DMA into the kernel I/O buffer, and is then copied by the CPU into the process I/O buffer of the user process. These I/O buffers have poor temporal locality.]

4 Motivation
Heavy cost of cache coherency operations: many high-end embedded processors have caches, but many of them have very limited support for guaranteeing cache coherency. For example, Unity I and some ARM processors do not support single-cache-line operations.
Poor locality leads to more data cache pollution: caches rely on the property of locality, but some programs, such as TCP/IP processing, have poor locality, so the existing cache policies do not help.
How to avoid these disadvantages? Uncacheable memory may be a solution.

5 Contributions
Analyze the scenarios in which the cache does not perform well, and show that uncacheable memory has two advantages: it eliminates most cache coherency operations and avoids cache pollution.
Apply uncacheable memory in Unity Linux to improve I/O performance: some important aspects improve by 5%-29%, and with careful design it does not hurt overall system performance.

6 Outline
Issues
Motivation
Contributions
Uncacheable Memory
Evaluation
Related Work
Conclusions

7 Recv Packet Flow: using uncacheable memory
[Diagram: in steps 1-2 the packet moves from the I/O device buffer into kernel-space buffers by DMA copy; in step 3 the kernel does simple data processing (with a cache flush on the cacheable path); in step 4 the CPU copies the data into the user buffer in user space.]
Step 3: uncacheable memory access is slow, but the simple data processing means low cache pollution.
Step 4: uncacheable memory causes no cache pollution but is slow to access; cacheable memory pollutes the cache heavily but is fast to access.

8 Send Packet Flow: using uncacheable memory
[Diagram: in step 1 the CPU copies data from the user buffer into kernel-space buffers (with a cache clean on the cacheable path); in steps 2-3 the kernel does simple data processing; in step 4 the packet is DMA-copied to the I/O device.]
Step 1: uncacheable memory writes are slow but cause no cache pollution; cacheable memory must perform write allocations and pollutes the cache, but its accesses are faster.

9 Cacheable vs. Uncacheable
DMA send and receive cost analysis:
CH (cacheable). Send: 1. copy from U to K; 2. clean data cache. Receive: 1. clean & invalidate data cache; 2. copy from K to U.
NC (uncacheable). Send: 1. copy from U to K(N); no cache clean operation. Receive: 1. copy from K(N) to U; no cache flush operation.
Side effect of NC: accessing uncacheable memory is slower, but there is no data cache pollution.

10 Cacheable vs. Uncacheable cont.
[Diagram: memory traffic of the cacheable path. DMA send: load U into the cache, load K into the cache, store to K, plus the cache clean cost. DMA recv: cache flush cost, load K into the cache, load U into the cache and store.]

11 Cacheable vs. Uncacheable cont.
A single write costs little, but a single read costs a lot!
Send: uncacheable cost = write K once in single-write mode; cacheable cost >= read K once in burst mode plus write K once in burst mode.
Receive: uncacheable cost = read K once in single-read mode; cacheable cost = read K once in burst mode.
To evaluate the cost and the TCP/IP-processing performance improvement of the uncacheable method over the hardware cacheable method, we designed a simplified experiment following the TCP/IP processing described in the table. The experiment has two factors: the data length, which determines the cost of the data copy operations, and the data-cache dirty ratio, which determines the cost of the data cache's clean & invalidate operations. We implemented a kernel module to measure the cost of sending and receiving packets with both methods. In the diagrams, Cache(6.25%) means the cacheable method with a 6.25% data-cache dirty ratio, and NonC means uncacheable. The results convince us that the uncacheable method is likely to gain benefits when sending packets; for receiving packets, however, it costs too much because of the copy and the TCP/IP processing.
Recv and Send Performance: CH vs. NC

12 Using Uncacheable Memory
Implemented in Unity Linux, ported from Linux.
Uncacheable page table: eliminates cache coherency operations when modifying the page tables.
Uncacheable socket buffer for sending: eliminates cache coherency operations and avoids data cache pollution.

13 Outline
Issues
Motivation
Contributions
Uncacheable Memory
Evaluation
Related Work
Conclusions

14 Methodology
Benchmarks: Netperf, Lmbench, and the Modified Andrew benchmark.
Experiment environment:
160 MHz Unity network computer with 256 MB DRAM and an SoC built-in 10M/100M Ethernet card.
Dell 4600 server with two Intel Xeon PIII 700 MHz processors, 4 GB DRAM, and a 1000M/100M Ethernet card.
All benchmarks are executed in single-user mode over NFS.

15 Netperf Benchmark Results
Netperf TCP_STREAM Send Performance

16 Netperf Benchmark Results cont.
Q: As the receive size increases, why does the transaction-throughput improvement decrease?
A: Uncacheable memory is used in the receive socket buffer.
Netperf TCP_RR Performance

17 Lmbench Benchmark Results
Lmbench Performance

18 Modified Andrew Benchmark Results
As expected, execution time is reduced by 6%-12% for the first four phases. For Phase V, the reduction is less than 1%. This is reasonable because only Phase V depends heavily on computation instead of I/O.
Summary: based on these results, we believe that, with careful design, using uncacheable memory will make the overall performance of the Unity system outperform an implementation that simply uses cacheable memory and hardware cache operations.
Modified Andrew Benchmark

19 Related Work
Related work: accelerating uncacheable memory performance.
New memory types: Intel write-combining; the MIPS R10000 uncached-accelerated page.
New instructions: block-move instructions in SPARC V9, ARM, and Unity II.
Future work: support for a new memory type that reads like a common cache with low pollution and writes like write-combining without write-allocate.

20 Conclusions
This paper focuses on the use of uncacheable memory.
Pros: eliminates coherency operations and avoids data cache pollution. Cons: slow access time.
Uncacheable memory can perform well with a careful design that takes system specialties into account.
Many embedded architectures have design specialties of their own due to limits on energy cost, area size, and design complexity, so there may be more fields in which uncacheable memory can be applied.

21 Thank You! Questions?

