Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini Department.


1 Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany
ver0

2 Outline
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn, Luca Benini – Energy and performance exploration of ACP Using ZYNQ
- Introduction
- ZYNQ Architecture (Brief)
- Motivations & Contributions
- Infrastructure Setup (Hardware & Software)
- Memory Sharing Methods
- Experimental Results
- Lessons Learned & Conclusion

3 Introduction
Performance per watt!
- 1951, UNIVAC I: operations per 1 watt-second
- 2012, half a century later, ST P2012: 40 billion operations per 1 watt-second
(c) Luca Bedogni 2012

4 Introduction
Solution: specialized functional units (accelerators).
[Figure: CPU, L1$ and DRAM; Case 1 and Case 2 mappings of TASK 1 to TASK 4; variables var1, var2 and var3, with var1 and var2 cached]
- Accelerators are faster and more power efficient: better performance per watt!
- What about variables? The CPU should flush the cache so that the accelerator sees up-to-date values in DRAM.
- The problem can be more complicated, e.g. with multiple CPU cores.
- Every processing element should have a consistent view of the shared memory.
- Accelerator Coherency Port (ACP): allows accelerator hardware to perform coherent accesses to the memory space of the CPU(s).
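
The stale-variable problem on this slide can be illustrated with a small toy model. This is only a sketch: the class and function names are hypothetical, and a dictionary-based write-back cache is a strong simplification, not the Cortex-A9 coherency mechanism. It shows an accelerator reading DRAM directly and seeing an old value until the CPU flushes, while a coherent read through the cache sees the new value at once.

```python
class WriteBackCache:
    """Toy write-back cache in front of a dict-based DRAM (illustrative only)."""

    def __init__(self, dram):
        self.dram = dram
        self.dirty = {}  # address -> value written by the CPU, not yet in DRAM

    def write(self, addr, value):
        # CPU store: the new value stays in the cache, DRAM is not updated.
        self.dirty[addr] = value

    def read(self, addr):
        # A cached copy takes priority over the DRAM copy.
        return self.dirty.get(addr, self.dram.get(addr))

    def flush(self):
        # Write all dirty lines back to DRAM and empty the cache.
        self.dram.update(self.dirty)
        self.dirty.clear()


def accelerator_read_noncoherent(dram, addr):
    # Models an access that bypasses the CPU caches and goes straight to DRAM.
    return dram.get(addr)


def accelerator_read_coherent(cache, addr):
    # Models a coherent access: the lookup goes through the cache first.
    return cache.read(addr)


dram = {"var1": 0}
cache = WriteBackCache(dram)
cache.write("var1", 42)                                    # CPU updates var1

stale = accelerator_read_noncoherent(dram, "var1")         # old value from DRAM
coherent = accelerator_read_coherent(cache, "var1")        # new value via cache
cache.flush()
after_flush = accelerator_read_noncoherent(dram, "var1")   # new value after flush
```

In this model the flush is what makes the non-coherent path correct, which is the software cost that a coherent port is meant to avoid.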

5 Xilinx ZYNQ Architecture
[Block diagram of the PS and PL; recoverable labels:]
- Two ARM A9 cores, each with NEON, MMU and L1 caches
- Snoop
- L2 (PL310)
- OCM
- DRAM Controller (Synopsys IntelliDDR MPMC)
- DMA Controller (ARM PL330)
- Peripherals (UART, USB, Network, SD, GPIO, …)
- Interconnect (ARM NIC-301)
- Ports: HP0, HP1, HP2, HP3, SGP0, SGP1, MGP0, MGP1 and ACP, labelled in the figure as AXI masters and AXI slaves

6 Motivations & Contributions
[Figure: accelerator as AXI master, connected through ACP and HP0; two ARM A9 cores (NEON, MMU, L1), snoop, L2 (PL310), OCM and DRAM controller]
Which method is better for sharing data between CPU and accelerator? For each method:
- What is the data transfer speed?
- How much energy is consumed?
- What is the effect of background workload on performance?
- Various acceleration methods are addressed in the literature (GPU, hardware boards, …).
- We develop an infrastructure (HW+SW) for the Xilinx ZYNQ.
- We run practical tests and measurements to quantify the efficiency of different CPU-accelerator memory sharing methods.

7 Hardware

8 Software
Linux kernel-level drivers:
- AXI dummy driver (simple): initializes the dummy AXI masters (HP1) and triggers an endless read/write loop.
- AXI driver (more complicated): handles the AXI masters (ACP & HP0), memory allocation, ISR registration, statistics (PL310) and time measurement.
- Memory allocation over ACP: kmalloc. Over HP: dma_alloc_coherent.
User side:
- AXI driver user-side interface application.
- Background application: a simple memory read/write loop.
- Oprofile statistical profiler: measures all CPU performance metrics.

9 Processing Task Definition
[Figure: source image (size: image_size) at a given address, then read, FIR process and write through 128K FIFOs, to the result image (size: image_size) at a given address]
- Loop N times and measure the execution interval.
- Selection of packets (addressing): normal or bit-reversed.
- Buffers are allocated by kmalloc or dma_alloc_coherent, depending on the memory sharing method.
- Image sizes: 4 KBytes, 16K, 64K, 128K, 256K, 1 MBytes, 2 MBytes.
- We define different methods to accomplish the task.
- We measure execution time and energy.
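
The two packet selection modes can be sketched as follows. The slides do not define the packet size or the exact reversal scheme, so this assumes that "bit-reversed" means reversing the bits of the packet index, and that the number of packets is a power of two. The function names and the 1 KByte packet size in the usage line are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def packet_order(num_packets, mode="normal"):
    """Return the order in which packet indices are visited."""
    if num_packets <= 0 or num_packets & (num_packets - 1):
        raise ValueError("num_packets must be a positive power of two")
    if mode == "normal":
        return list(range(num_packets))
    if mode == "bit-reversed":
        bits = num_packets.bit_length() - 1
        return [bit_reverse(i, bits) for i in range(num_packets)]
    raise ValueError("mode must be 'normal' or 'bit-reversed'")


def packet_addresses(base, image_size, packet_size, mode="normal"):
    """Start address of each packet, in visiting order."""
    count = image_size // packet_size
    return [base + i * packet_size for i in packet_order(count, mode)]


# Hypothetical example: a 4 KByte image split into 1 KByte packets.
addresses = packet_addresses(0x1000, 4096, 1024, "bit-reversed")
```

Normal order walks the buffer sequentially, while the bit-reversed order jumps across it, which makes consecutive accesses far less local.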

10 Memory Sharing Methods
- ACP Only: data path Accelerator, ACP, SCU, L2, DRAM. HP Only is similar, but there is no SCU and L2.
- CPU only (with and without cache).
- CPU ACP (CPU HP is similar): data path Accelerator, ACP, SCU, L2, DRAM, together with the CPU.
[Figure: steps 1 and 2; timeline ACP, CPU, ACP]

11 Speed Comparison
[Figure: speed for image sizes 4K, 16K, 64K, 128K, 256K and 1 MBytes; annotated values 298 MBytes/s and 239 MBytes/s]
- ACP loses!
- CPU OCM lies between CPU ACP and CPU HP.

12 Dummy Traffic Effect
- Annotated at 256K: HP 1382 MBytes/s, ACP 1664 MBytes/s.
- CPU dummy traffic occupies cache entries, so fewer free entries remain for the accelerator.
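
As a small worked example of how such speed figures relate to measurements, the sketch below computes throughput from a byte count and an execution interval, and the relative gap between the two values quoted on this slide. It assumes that 1 MByte means 10^6 bytes, which the slides do not state. The function names and the 1 ms timing are hypothetical.

```python
def throughput_mbytes_per_s(num_bytes, seconds):
    """Transfer speed in MBytes/s, assuming 1 MByte = 10**6 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / seconds / 1e6


def relative_difference(value, reference):
    """Fractional difference of `value` with respect to `reference`."""
    return (value - reference) / reference


# Values quoted on the slide (MBytes/s).
hp_speed = 1382.0
acp_speed = 1664.0
gap = relative_difference(acp_speed, hp_speed)  # roughly 0.20, i.e. about 20%

# Hypothetical measurement: 256 KBytes moved in 1 ms.
example_speed = throughput_mbytes_per_s(256 * 1024, 0.001)
```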

13 Power Comparison

14 Energy Comparison
- CPU only methods: worst case!
- CPU ACP: always better energy than CPU HP0. As the image size grows, CPU ACP converges to CPU HP0.
- CPU OCM: always between CPU ACP and CPU HP.
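
Energy and power rank the methods differently because energy also depends on how long the task runs. A minimal sketch, assuming energy is estimated as average power times execution time. The slides do not describe how energy was actually derived, and the two methods and all numbers below are hypothetical, chosen only to show the effect.

```python
def energy_joules(avg_power_watts, exec_time_s):
    """Energy estimate as average power multiplied by execution time."""
    if avg_power_watts < 0 or exec_time_s < 0:
        raise ValueError("power and time must be non-negative")
    return avg_power_watts * exec_time_s


# Hypothetical numbers, not measurements from the slides.
energy_a = energy_joules(0.5, 2.0)   # lower power, slower method
energy_b = energy_joules(0.75, 1.0)  # higher power, faster method

# The method drawing more power still uses less energy because it finishes sooner.
lower_energy_method = "B" if energy_b < energy_a else "A"
```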

15 Lessons Learned & Conclusion
If a specific task should be done by the cooperation of CPU and accelerator:
- CPU ACP and CPU OCM are always better than CPU HP in terms of energy.
- If we are running other applications which heavily depend on caches, CPU OCM and then CPU HP are preferred!
If a specific task should be done by the accelerator only:
- For small arrays, ACP Only and OCM Only can be used.
- For large arrays (> size of L2$), HP Only always performs better.
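
The guidelines above can be written down as a small selection helper. This is a sketch of the slide's rules only, not a tool from the authors. The function name and parameters are hypothetical, the 512 KByte default for the L2 size is an assumption about the device, and treating every array that fits in L2 as "small" is also an assumption, since the slide gives no precise threshold.

```python
def suggest_methods(cpu_cooperates, array_bytes,
                    cache_heavy_background=False,
                    l2_bytes=512 * 1024):
    """Return the sharing methods suggested by the slide, in order of preference.

    l2_bytes is an assumed L2 cache size; adjust it for the target device.
    """
    if cpu_cooperates:
        if cache_heavy_background:
            # Other applications depend heavily on the caches.
            return ["CPU OCM", "CPU HP"]
        # Best energy according to the slide.
        return ["CPU ACP", "CPU OCM"]
    if array_bytes > l2_bytes:
        # Large arrays, accelerator only.
        return ["HP Only"]
    # Small arrays, accelerator only.
    return ["ACP Only", "OCM Only"]


# Example: accelerator-only processing of a 2 MByte image.
choice = suggest_methods(cpu_cooperates=False, array_bytes=2 * 1024 * 1024)
```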

