System Interfaces & Caches
RAMP Retreat, Austin, TX, June 2009

Slide 2: Disclaimer

 All information contained in this presentation is based on publicly available material
 References:
– Goldhammer & Ayer: “Understanding the Performance of PCI Express Systems”, Xilinx WP350, 2008
– John Goodacre: “The Effect and Technique of System Coherence in ARM Multicore Technology”, ARM Developer Conference 2008
– T. Shanley: “The Unabridged Pentium 4: IA32 Processor Genealogy”, Addison-Wesley, 2004

Slide 3: This Talk

You will learn:
 How an x86 CPU and an FPGA can exchange data
– I/O device mapping vs. shared memory
 How the low-level coherency interface works
– Data and control exchange
 How standard programming models are mapped
– FIFOs, message passing, shared memory
 How Direct Cache Access (DCA) can reduce latency
– And how we can overcome latency challenges

Slide 4: Context: System Interconnect

Interconnects shown: FSB, QPI, PCIe.

                 PCIe attach           FSB/QPI attach (in-socket)
Bandwidth        Equal BW: FSB vs PCIe today   FSB/QPI: 2x PCIe BW
Role             PCIe I/O device       In-socket accelerator
Control          Device driver call    Shared memory function
Data movement    Cache flush           Direct Cache Access

Diagram source: Intel, 2008

Slide 5: CPU-FPGA Communication (Generic, Without DCA; DCA = Direct Cache Access)

Components: single-core CPU, interconnect, accelerator, memory with a mailbox.

1. CPU flushes input data from cache
2. CPU writes to mailbox
3. Interrupt serviced by accelerator
4. Accelerator reads input data from memory
5. Accelerator writes result data to memory
6. Accelerator writes to mailbox
7. Interrupt serviced by CPU
8. CPU reads result data

(The C sketch below walks through the CPU side of these steps.)
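To make the eight steps concrete, here is a minimal C sketch of the CPU side of this handshake. Everything named here is an assumption for illustration: cache_flush, wait_for_interrupt, and the mailbox pointer stand in for whatever the real platform and driver provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform hooks -- illustrative names, not a real driver API. */
extern void cache_flush(void *addr, size_t len); /* write back + invalidate */
extern void wait_for_interrupt(void);            /* block until device IRQ  */
extern volatile uint32_t *mailbox;               /* uncached mailbox word   */

#define CMD_GO 1u

void run_accelerator(uint32_t *input, uint32_t *output, size_t len)
{
    /* Step 1: flush so the accelerator reads fresh data from memory. */
    cache_flush(input, len);

    /* Step 2: ring the mailbox; steps 3-6 happen on the accelerator. */
    *mailbox = CMD_GO;

    /* Step 7: the accelerator's completion interrupt wakes us up. */
    wait_for_interrupt();

    /* Step 8: drop any stale cached copies, then read the result. */
    cache_flush(output, len);
    /* output[] now holds the accelerator's result. */
}
```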

Slide 6: CPU-FPGA Communication (PCI Express Based FPGA)

Components: single-core CPU, PCIe + FSB, accelerator, memory; control via device registers.

1. CPU flushes input data from cache
2. CPU writes to device registers
3. Write seen by accelerator
4. Accelerator reads input data from memory with DMA
5. Accelerator writes result data to memory with DMA
6. Accelerator writes to device registers
7. Write seen by processor
8. CPU reads result data

(A register-level sketch follows.)
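In the device-register variant the driver typically writes physical buffer addresses and a doorbell into the accelerator's memory-mapped registers, and the device moves the data itself with DMA. A hedged sketch; the register layout below is invented, not that of any real device.

```c
#include <stdint.h>

/* Invented MMIO register layout for an illustrative PCIe accelerator;
 * a real device defines its own BAR map and descriptor format. */
struct accel_regs {
    volatile uint64_t src_addr;  /* physical address of input buffer  */
    volatile uint64_t dst_addr;  /* physical address of output buffer */
    volatile uint32_t length;    /* transfer length in bytes          */
    volatile uint32_t doorbell;  /* write 1 to start (step 2)         */
    volatile uint32_t status;    /* device sets ACCEL_DONE (step 7)   */
};

#define ACCEL_DONE 1u

void start_dma_job(struct accel_regs *regs,
                   uint64_t src_phys, uint64_t dst_phys, uint32_t len)
{
    regs->src_addr = src_phys;  /* step 4: DMA engine fetches input     */
    regs->dst_addr = dst_phys;  /* step 5: DMA engine writes results    */
    regs->length   = len;
    regs->doorbell = 1;         /* step 2: write reaches the device (3) */

    /* Steps 7-8, polling variant: spin until the device reports done.
     * A real driver would take an interrupt instead. */
    while (regs->status != ACCEL_DONE)
        ;
}
```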

Slide 7: CPU-FPGA Communication (FSB Based FPGA)

Components: Xeon, FSB snoop interface, fabric-hosted accelerator, memory with a mailbox.

1. CPU leaves data in cache
2. CPU writes to mailbox in cached memory
3. Snoop intercepted by accelerator
4. Accelerator reads input data from cached memory; snoop intercepted by CPU, which supplies the data
5. Accelerator writes result data to cached memory; snoop intercepted by CPU, which receives the data
6. Accelerator writes to mailbox in cached memory
7. Snoop intercepted by CPU
8. Result data already in cache

(See the sketch below for the software side.)
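With snoop-based coherence the software side collapses to ordinary cached loads and stores: no flushes, no uncached registers. A minimal sketch under that assumption; the mailbox variables are hypothetical.

```c
#include <stdint.h>

/* Cached, coherent mailbox words shared with the accelerator. These live
 * in ordinary write-back memory; the FSB snoop logic in the FPGA observes
 * the cache-line transactions. Names are illustrative. */
extern volatile uint32_t go_mailbox;    /* CPU -> accelerator (step 2) */
extern volatile uint32_t done_mailbox;  /* accelerator -> CPU (step 6) */

void run_coherent_job(void)
{
    /* Steps 1-2: input stays in the cache; just store to the cached
     * mailbox. The store triggers the snoop the accelerator intercepts
     * (step 3); steps 4-5 are serviced directly from/into the cache. */
    go_mailbox = 1;

    /* Steps 7-8: poll the cached mailbox. The loop hits in cache until
     * the accelerator's write invalidates the line, so polling is cheap. */
    while (done_mailbox == 0)
        ;  /* result data is already in (or snooped into) the cache */
}
```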

Slide 8: Raw FPL Interface (FPL = FSB Protocol Layer)

The fabric-hosted accelerator snoops the Xeon FSB. Memory is split into ordinary system memory (not coherent with the accelerator) and a synchronization region (coherent).

 Coherent memory mailboxes
 Unguarded shared memory accesses
– Convey added guarded access to this

Slide 9: Raw FPL Interface (FPL = FSB Protocol Layer)

 Special 2 MB synchronization region used for communication between SW and HW
– Writes by SW immediately result in a notification to the FPGA (snoop control)
– SW can poll locations waiting for a write from the FPGA
 SW can also allocate other 2 MB regions
– Simple pinned memory regions
– Use the synchronization region to pass physical addresses to hardware
 Use the 2 MB regions to move data between domains
 Use the synchronization region for start/finished indicators (see the sketch below)
– Hardware uses a snoop for start
– Software uses a poll for finished
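Putting those pieces together, the usage pattern might look like the following C sketch. The slot layout of the synchronization region is invented for illustration; only the overall pattern follows the slide.

```c
#include <stdint.h>

/* Hypothetical view of the FPL usage pattern. The 2 MB coherent
 * synchronization region is modeled as a word array; the slot offsets
 * and meanings below are invented. */
#define SYNC_WORDS (2u * 1024 * 1024 / sizeof(uint64_t))
extern volatile uint64_t sync_region[SYNC_WORDS];  /* coherent 2 MB region */

enum { SLOT_SRC = 0, SLOT_DST = 1, SLOT_START = 2, SLOT_DONE = 3 };

void fpl_run(uint64_t src_phys, uint64_t dst_phys)
{
    /* Pass physical addresses of the pinned 2 MB data regions to HW. */
    sync_region[SLOT_SRC] = src_phys;
    sync_region[SLOT_DST] = dst_phys;

    /* Start indicator: this cached store generates a snoop that the
     * FPGA's FSB protocol layer intercepts. */
    sync_region[SLOT_START] = 1;

    /* Finished indicator: software polls; the FPGA writes when done. */
    while (sync_region[SLOT_DONE] == 0)
        ;
}
```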

Slide 10: FIFO Programming Model Over FSB

 Synchronization region conveys the full/empty status of buffers
 Pinned memories act as elastic buffers for SW
 On-chip memories act as elastic buffers for HW
 AFUs see only stream reads and writes, unaware of the plumbing underneath (see the sketch below)
 Exactly the kind of setup suitable for running Map/Reduce jobs in hardware
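A FIFO in this style reduces to a single-producer/single-consumer ring: data words in a pinned elastic buffer, head/tail indices in the coherent synchronization region. A simplified sketch of the software producer side, with invented names and memory-ordering details omitted:

```c
#include <stdint.h>

/* Illustrative SW->HW FIFO in the spirit of the slide: data in a pinned
 * buffer, indices in the coherent synchronization region. A real
 * implementation also needs memory barriers between the data store and
 * the index update. */
#define FIFO_SLOTS 1024u

extern volatile uint64_t *fifo_head;  /* written by producer (SW)      */
extern volatile uint64_t *fifo_tail;  /* written by consumer (HW)      */
extern uint64_t          *fifo_data;  /* pinned elastic buffer         */

/* Returns 0 on success, -1 if the FIFO is full. */
int fifo_push(uint64_t word)
{
    uint64_t head = *fifo_head, tail = *fifo_tail;

    if (head - tail == FIFO_SLOTS)
        return -1;                        /* full: HW hasn't drained yet */

    fifo_data[head % FIFO_SLOTS] = word;  /* fill the elastic buffer     */
    *fifo_head = head + 1;                /* cached store, snooped by HW */
    return 0;
}
```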

Slide 11: Intel AAL (Accelerator Abstraction Layer)

 FPGA co-processor API
 Co-processor management API
 Streaming API inside the FPGA (FIFOs)
 AAL pins mailboxes and manages system memory
 Virtual memory support via workspaces
 Accelerator discovery services
 Accelerator configuration management services

Liu et al.: “A high performance, energy efficient FPGA accelerator platform”, FPGA 2009

Slide 12: Arches: MPI (Message Passing Interface), Symmetrical Peer Processing

[Diagram: two symmetric nodes, each with x86 MPI SW processes, a PPC MPI SW process, an MPI FSB bridge on the FSB, a hardware MPE hosting an MPI HW “process”, and memory; the nodes are linked by GT/GTX serial I/O bridges.]

 Standard MPI programming model & API (see the example below)
 Lightweight message-passing protocol implementation
 Focused on embedded systems
 Explicit rank-to-node binding support

Source: Arches Computing, 2009
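Because the programming model is standard MPI, the software side is an ordinary MPI program; in the Arches scheme one rank would be explicitly bound to the hardware MPE. The rank assignment and the doubling "kernel" below are hypothetical.

```c
#include <mpi.h>
#include <stdio.h>

/* Plain MPI: in the Arches model, rank 1 could be bound to the hardware
 * MPE on the FPGA while rank 0 runs as an x86 SW process. */
int main(int argc, char **argv)
{
    int rank, data = 42, result = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* x86 SW process */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&result, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("result = %d\n", result);
    } else if (rank == 1) {              /* HW "process" on the FPGA */
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        result = data * 2;               /* stand-in for the HW kernel */
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```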

Slide 13: Convey: Shared Memory (Convey HC-1, 2008)

 Socket-filler module
– Bridge FPGA implements the FSB protocol
– Full snoop support
 FPGA-based compute accelerator
– Pre-defined vector instruction set
– Shared-memory programming model
– ANSI C support
 Accelerator cache memory
– 80 GB/s BW
– Snoop-coherent with system memory
– Direct Cache Access

[Diagram: CPU socket bridged to eight memory-controller (MC) / Virtex-5 LX155 FPGA pairs. Source: Convey Computer, 2008]

Slide 14: Latency: PCI Express & FSB, the Effects of DCA (DCA = Direct Cache Access)

 PCIe: ~400 ns latency
– Gen1 x8 interface, 64-byte payloads
– Includes: PCIe device to chipset
– Does not include: chipset-to-CPU latency (add FSB latency)
 FSB: ~110 ns latency
– 64-byte DCA transfers
– 200+ ns latency on cache-miss operations (fetch from memory)
 DCA: ~6x reduced latency (roughly 400 ns device-to-chipset plus the chipset-to-CPU hop, versus ~110 ns straight into the cache)
 Caveats:
– Results are for minimally loaded systems (i.e., a single master active)
– The chipset can defer and/or retry transactions in loaded systems (both FSB and PCIe)
– There is typically less congestion on the FSB than on the PCIe interface

Slide 15: DCA with ARM ACP (ACP = Accelerator Coherency Port)

 Xilinx ACP platform applications are customer confidential
 Hence ARM is used as a public example here: ~8x reduced latency with the ACP

Source: ARM, 2008

Slide 16: System Memory Bandwidth (PCI Express on Virtex-5)

 Virtex-5 LXT ML555 fitted in a Dell PowerEdge machine
– Goldhammer and Ayer Jr.: “Understanding Performance of PCI Express Systems” (Xilinx WP350)
 Intel E5000P chipset
 Virtex-6 & Gen2 data is available (but not public yet)
– Rough data points: 2x the BW, similar latency
 PCIe (Gen 1) x16: partner IP (not studied)

[Table: read and write bandwidth (GB/s) for x1, x4, and x8 link widths at 8, 16, and 32 KB transfer sizes; the measured values did not survive transcription.]

Slide 17: System Memory Bandwidth (FSB on Virtex-5)

 Intel Xeon 7300 chipset
 FPL performance (FSB Protocol Layer = raw interface)
– FPL provides the primitives for data and control exchange
 Higher-level protocols may reduce BW or require longer burst sizes to achieve the same BW
– AAL, MPI, and others are built on top of FPL

BLOCK SIZE   READ BW (GB/s)    WRITE BW (GB/s)
512 B        [not recovered]   [not recovered]
1 KB         2.47              not recorded
2 KB         3.36              not recorded
4 KB         3.97              not recorded
8 KB         4.54              not recorded
16 KB        4.66*             3.4*
32 KB        [not recovered]   [not recovered]
64 KB        [not recovered]   [not recovered]

* The 16 KB values are taken from the 16 KB transfer figures on the next slide; the remaining cells were lost in transcription.

Slide 18: Bandwidth: PCI Express & FSB

 PCIe Gen2 x8 (estimated: Gen2 BW extrapolated as double that of Gen1)
– 2.66 GB/s read
– 3.54 GB/s write
 FSB
– 1.7x the read BW of PCIe Gen2 x8
– Half duplex only
– 4.66 GB/s read
– 3.4 GB/s write (not fully optimized yet)
 Data for 16 KB transfers for both PCIe and FSB

Slide 19: Summary

 FPGA mapped into shared system memory
 Raw FPL interface exposes the coherency engine in the FPGA
 Multiple programming models supported
– FIFO, message passing, shared memory
 DCA helps to reduce latency
 Application code must maximize the issue rate