Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini

Presentation transcript:

Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ
Mohammadsadegh Sadri, Christian Weis, Norbert Wehn and Luca Benini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany
{mohammadsadegh.sadr2,luca.benini}@unibo.it, {weis,wehn}@eit.uni-kl.de

Outline
- Introduction
- ZYNQ Architecture (Brief)
- Motivations & Contributions
- Infrastructure Setup (Hardware & Software)
- Memory Sharing Methods
- Experimental Results
- Lessons Learned & Conclusion

Introduction: Performance per Watt!
1951, UNIVAC I: 0.015 operations per watt-second.
Half a century later, 2012, ST P2012: 40 billion operations per watt-second.
(c) Luca Bedogni 2012

Introduction
Solution: specialized functional units (accelerators) give better performance per watt.
The problem can be more complicated, e.g. with multiple CPU cores: every processing element should have a consistent view of the shared memory. Tasks on the CPU and on the accelerator (specialized hardware) share variables that may sit in DRAM or in the CPU's L1 cache.
- Case 1: without coherent access, the CPU must flush its cache before the accelerator can see the shared variables.
- Case 2: the Accelerator Coherency Port (ACP) allows accelerator hardware to perform coherent accesses to the CPU(s) memory space. Faster and more power efficient!
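To make the two cases concrete, here is a minimal Linux-driver-style sketch (our own illustration, not code from the slides): in the non-coherent case the CPU caches must be cleaned around the transfer, while a coherent port such as the ACP lets the Snoop Control Unit do that work in hardware.

```c
/* Illustrative sketch only: case 1 of the slide, expressed with the Linux
 * streaming DMA API, which performs the required cache maintenance. */
#include <linux/dma-mapping.h>

static int start_noncoherent_read(struct device *dev, void *buf, size_t len)
{
        dma_addr_t bus;

        /* Cleans dirty CPU cache lines covering 'buf' (the "flush" of case 1). */
        bus = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, bus))
                return -ENOMEM;

        /* ... hand 'bus' to the accelerator and wait for completion ...
         * In case 2 (ACP) the cache-maintenance part of this step is not
         * needed: the SCU snoops the CPU caches on the accelerator's behalf. */

        dma_unmap_single(dev, bus, len, DMA_TO_DEVICE);
        return 0;
}
```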

Xilinx ZYNQ Architecture (block diagram)
Processing System (PS): two ARM Cortex-A9 cores (each with NEON, MMU and L1 caches), Snoop Control Unit, L2 cache (PL310), on-chip memory (OCM), DMA controller (ARM PL330), DRAM controller (Synopsys IntelliDDR MPMC), interconnect (ARM NIC-301) and peripherals (UART, USB, network, SD, GPIO, ...).
Programmable Logic (PL): AXI masters and slaves reach the PS through the general-purpose ports (MGP0/MGP1 toward PL slaves, SGP0/SGP1 for PL masters), the high-performance ports (HP0-HP3) and the Accelerator Coherency Port (ACP).

Motivations & Contributions
Various acceleration methods are addressed in the literature (GPUs, hardware boards, ...), but which method is better to share data between the CPU and an accelerator?
- For each method, what is the data transfer speed?
- How much energy does it consume?
- What is the effect of background workload on performance?
We develop an infrastructure (hardware + software) for the Xilinx ZYNQ and run practical tests and measurements to quantify the efficiency of the different CPU-accelerator memory sharing methods.
(The slide figure shows the accelerator as an AXI master in the PL, attached to the PS through the HP0 and ACP ports.)

Hardware (figure-only slide: block diagram of the hardware side of the infrastructure)

Software
Linux kernel-level drivers, each with a user-side interface application:
- AXI driver (the more complicated one): handles the AXI masters on ACP and HP0; memory allocation, ISR registration, statistics, PL310 handling and time measurement. Buffers used over ACP are allocated with kmalloc; buffers used over HP with dma_alloc_coherent.
- AXI dummy driver (simple): initializes the dummy AXI masters (HP1) and triggers an endless read/write loop.
- Background application: a simple memory read/write loop.
- Oprofile statistical profiler: measures all CPU performance metrics.
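The kmalloc versus dma_alloc_coherent split above is the key software difference between the coherent and non-coherent paths; a minimal sketch of how such buffers might be allocated in a kernel module follows (our own illustration, not the authors' driver, and IMG_SIZE is just an example value).

```c
/* Sketch only: buffers for the two paths.  The ACP path can use ordinary
 * cacheable kernel memory because the SCU keeps it coherent; the HP path
 * bypasses the caches, so an uncached coherent allocation is used. */
#include <linux/slab.h>
#include <linux/dma-mapping.h>

#define IMG_SIZE (256 * 1024)      /* example image size, as in the tests */

static void *acp_buf;              /* cacheable buffer, reached via ACP   */
static void *hp_buf;               /* uncached buffer, reached via HP0    */
static dma_addr_t hp_bus_addr;     /* bus address programmed into the AXI master */

static int alloc_test_buffers(struct device *dev)
{
        acp_buf = kmalloc(IMG_SIZE, GFP_KERNEL);
        if (!acp_buf)
                return -ENOMEM;

        hp_buf = dma_alloc_coherent(dev, IMG_SIZE, &hp_bus_addr, GFP_KERNEL);
        if (!hp_buf) {
                kfree(acp_buf);
                return -ENOMEM;
        }
        return 0;
}
```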

Processing Task Definition
We define a processing task, different methods to accomplish it, and we measure execution time and energy for each method.
- Source image (image_size bytes) at the source address; result image (image_size bytes) at the destination address. Buffers are allocated by kmalloc or dma_alloc_coherent, depending on the memory sharing method.
- The task reads packets from the source image, processes them through an FIR filter using a 128 KB FIFO, writes the results, and repeats the loop N times; the execution interval is measured.
- Packet selection (addressing): normal or bit-reversed.
- Image sizes: 4 KB, 16 KB, 64 KB, 128 KB, 256 KB, 1 MB and 2 MB.
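A minimal sketch of the measured loop in its CPU-only form is shown below (our own reconstruction; fir_process and the iteration count are placeholders, since the slide does not give them).

```c
/* Sketch of the measured task: read a packet, run the FIR step, write the
 * result, repeat N times, and time the whole interval. */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define FIFO_BYTES   (128 * 1024)   /* the 128 KB FIFO from the slide           */
#define N_ITERATIONS 16             /* placeholder: N is not given on the slide */

/* Placeholder for the FIR kernel applied to one packet. */
extern void fir_process(const uint8_t *in, uint8_t *out, size_t len);

double run_task(const uint8_t *src, uint8_t *dst, size_t image_size)
{
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int n = 0; n < N_ITERATIONS; n++) {
                for (size_t off = 0; off < image_size; off += FIFO_BYTES) {
                        size_t len = image_size - off;
                        if (len > FIFO_BYTES)
                                len = FIFO_BYTES;
                        fir_process(src + off, dst + off, len);  /* read, process, write */
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```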

Memory Sharing Methods
- CPU only (with and without cache): the CPU performs the task by itself.
- ACP only: the accelerator reaches DRAM through the ACP, the SCU and the L2 cache (HP only is similar, except that there is no SCU and L2 on the path).
- CPU + ACP (CPU + HP is similar): the CPU and the accelerator cooperate on the task, passing the data between them (steps 1 and 2 in the slide figure).

Speed Comparison
Charts over image sizes 4 KB, 16 KB, 64 KB, 128 KB, 256 KB and 1 MB; one annotation reads "ACP loses!". For the cooperative methods, CPU + OCM lies between CPU + ACP and CPU + HP (annotated points: 298 MB/s and 239 MB/s).
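For reference, throughput figures like these follow from the measured interval of the task defined earlier; a trivial helper (ours, with the exact byte accounting as an assumption) is:

```c
/* Assumption: throughput = bytes processed in a run / execution time,
 * reported in MB/s as on the slide. */
static double throughput_mb_per_s(double bytes_processed, double exec_time_s)
{
        return (bytes_processed / exec_time_s) / (1024.0 * 1024.0);
}
```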

Dummy Traffic Effect
At 256 KB, ACP reaches 1664 MB/s versus 1382 MB/s for HP. CPU dummy traffic occupies cache entries, so fewer free entries remain for the accelerator.
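The dummy traffic is the background memory read/write loop mentioned on the software slide; a minimal user-space sketch of such a cache-polluting loop is shown below (our own, with the buffer size and stride chosen as assumptions).

```c
/* Sketch of a background "dummy traffic" generator: an endless read/write
 * sweep over a buffer larger than the 512 KB L2, touching one location per
 * 32-byte cache line so that accelerator data keeps getting evicted. */
#include <stdint.h>
#include <stdlib.h>

#define SWEEP_BYTES (4 * 1024 * 1024)   /* assumption: 4 MB working set */

static void dummy_traffic(void)
{
        volatile uint8_t *buf = malloc(SWEEP_BYTES);

        if (!buf)
                return;
        for (;;) {
                for (size_t i = 0; i < SWEEP_BYTES; i += 32) {
                        uint8_t v = buf[i];     /* read ...          */
                        buf[i] = v + 1;         /* ... modify, write */
                }
        }
}
```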

Power Comparison (figure-only slide: measured power of the different methods)

Energy Comparison
- CPU-only methods are the worst case.
- CPU + OCM is always between CPU + ACP and CPU + HP.
- CPU + ACP always has better energy than CPU + HP0; when the image size grows, CPU + ACP converges to CPU + HP0.
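The energy numbers combine the power measurements with the execution times of the task; a minimal bookkeeping sketch (ours, assuming an average power reading per run) is:

```c
/* Energy of one run = average measured power * measured execution interval;
 * the per-byte figure normalizes it by the amount of image data processed. */
typedef struct {
        double avg_power_w;     /* average power during the run, in watts    */
        double exec_time_s;     /* measured execution interval, in seconds   */
        double bytes_moved;     /* image_size times the number of iterations */
} run_stats_t;

static double energy_joules(const run_stats_t *r)
{
        return r->avg_power_w * r->exec_time_s;
}

static double energy_per_byte(const run_stats_t *r)
{
        return energy_joules(r) / r->bytes_moved;
}
```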

Lessons Learned & Conclusion
- If a task is performed by the CPU and the accelerator cooperating: CPU + ACP and CPU + OCM are always better than CPU + HP in terms of energy. If other applications that depend heavily on the caches are running, CPU + OCM and then CPU + HP are preferred.
- If a task is performed by the accelerator only: for small arrays, ACP only and OCM only can be used; for large arrays (larger than the L2 cache), HP only always performs better.