
1 System Interfaces & Caches
RAMP Retreat, Austin, TX, June 2009 (Xilinx Public)

2 Disclaimer
 All information contained in this presentation is based on publicly available material
 References:
– Goldhammer & Ayer: “Understanding Performance of PCI Express Systems”, Xilinx WP350, 2008
– J. Goodacre: “The Effect and Technique of System Coherence in ARM Multicore Technology”, ARM Developer Conference, 2008
– T. Shanley: “The Unabridged Pentium 4: IA32 Processor Genealogy”, Addison-Wesley, 2004

3 This Talk: You Will Learn
 How an x86 CPU and an FPGA can exchange data
– IO device mapping vs shared memory
 How the low-level coherency interface works
– Data and control exchange
 How standard programming models are mapped
– FIFOs, message passing, shared memory
 How Direct Cache Access can reduce latency
– And how we can overcome latency challenges

4 Context: System Interconnect
[Diagram (source: Intel, 2008): interconnect evolution from FSB (2005) through QPI (2008) and beyond, contrasting the PCIe IO device model with an in-socket accelerator]
– Bandwidth: equal BW (FSB vs PCIe) in 2005; FSB/QPI at 2x PCIe BW by 2008
– PCIe IO Device -> In-Socket Accelerator
– Device Driver Call -> Shared Memory Function
– Cache Flush -> Direct Cache Access

5 CPU-FPGA Communication: Generic, Without DCA (Direct Cache Access)
[Diagram: single-core CPU, interconnect, accelerator, and memory with a mailbox; numbered steps:]
1. CPU flushes input data from cache
2. CPU writes to mailbox
3. Interrupt serviced by accelerator
4. Accelerator reads input data from memory
5. Accelerator writes result data to memory
6. Accelerator writes to mailbox
7. Interrupt serviced by CPU
8. CPU reads result data
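A minimal sketch of the CPU side of this flow (steps 1-2, then 7-8), assuming the interconnect is not coherent, as in the diagram. The mailbox layout and acc_irq_wait() are hypothetical names for illustration, not a real driver API:

```c
/* Sketch only: mailbox layout and acc_irq_wait() are hypothetical.
 * Requires x86 SSE2 for _mm_clflush. */
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

extern void acc_irq_wait(void);   /* hypothetical: block until the
                                     accelerator's interrupt (step 7) */

/* Flush a buffer line by line so the accelerator's memory accesses
 * see current data (step 1), or so the CPU re-reads fresh results. */
static void flush_range(const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < len; i += CACHE_LINE)
        _mm_clflush(p + i);
    _mm_sfence();                 /* order flushes before the doorbell */
}

void run_job(volatile uint32_t *mailbox,
             const void *in, size_t in_len, void *out, size_t out_len)
{
    flush_range(in, in_len);      /* 1: flush input data from cache   */
    *mailbox = 1;                 /* 2: write to mailbox              */
    acc_irq_wait();               /* 3-7 happen in HW; CPU sleeps     */
    flush_range(out, out_len);    /* drop any stale cached result     */
    /* 8: read result data from 'out' */
}
```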

6 CPU-FPGA Communication: PCI Express Based FPGA
[Diagram: single-core CPU, PCIe + FSB, accelerator, and memory; device registers instead of a mailbox; numbered steps:]
1. CPU flushes input data from cache
2. CPU writes to device registers
3. Write seen by accelerator
4. Accelerator reads input data from memory with DMA
5. Accelerator writes result data to memory with DMA
6. Accelerator writes to device registers
7. Write seen by processor
8. CPU reads result data

7 CPU-FPGA Communication: FSB Based FPGA
[Diagram: Xeon and fabric-hosted accelerator sharing the FSB snoop interface, plus memory with a mailbox; numbered steps:]
1. CPU leaves data in cache
2. CPU writes to mailbox in cached memory
3. Snoop intercepted by accelerator
4. Accelerator reads input data from cached memory (snoop intercepted by CPU, giving data)
5. Accelerator writes result data to cached memory (snoop intercepted by CPU, getting data)
6. Accelerator writes to mailbox in cached memory
7. Snoop intercepted by CPU
8. Result data already in cache
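For contrast, a sketch of the same exchange over the snooped FSB: no flushes, because the accelerator's snoops pull data straight out of the CPU's cache and its writes invalidate the CPU's cached copies. The two mailbox words are hypothetical:

```c
#include <stdint.h>

void run_job_coherent(volatile uint32_t *mbox_to_fpga,
                      volatile uint32_t *mbox_from_fpga)
{
    /* 1: input data simply stays in the cache */
    *mbox_to_fpga = 1;            /* 2: plain cached store; the FPGA
                                        observes it via a snoop (3)  */
    while (*mbox_from_fpga == 0) {
        /* 7: spin on a cached line; the loop hits in cache until the
         * FPGA's write (step 6) invalidates it */
    }
    /* 8: result data is already in (or snooped into) the cache */
}
```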

8 Raw FPL Interface (FPL = FSB Protocol Layer)
[Diagram: Xeon and fabric-hosted accelerator on the FSB snoop interface; a coherent synchronization region sits alongside non-coherent system memory]
 Coherent memory mailboxes
 Unguarded shared memory accesses
– Convey added guarded access to this

9 Raw FPL Interface (FPL = FSB Protocol Layer)
 Special 2 MB synchronization region used for communication between SW and HW
– Writes by SW immediately result in a notification to the FPGA (snoop control)
– SW can poll locations waiting for a write from the FPGA
 SW can also allocate other 2 MB regions
– Simple pinned memory regions
– Use the synchronization region to pass physical addresses to hardware
 Use 2 MB regions to move data between domains
 Use the synchronization region as start/finished indicators
– Hardware uses a snoop for start
– Software uses a poll for finished
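A sketch of how software might drive this pattern; the descriptor layout and field names are invented for illustration and are not the actual FPL definitions:

```c
/* Illustrative only: a descriptor in the 2 MB synchronization region
 * passes physical addresses of pinned regions to the hardware; 'start'
 * is snooped by the FPGA, 'done' is polled by software. */
#include <stdint.h>

struct fpl_desc {                 /* resides in the sync region        */
    volatile uint64_t src_phys;   /* physical addr of pinned input     */
    volatile uint64_t dst_phys;   /* physical addr of pinned output    */
    volatile uint32_t start;      /* SW writes 1 -> snoop notifies HW  */
    volatile uint32_t done;       /* HW writes 1 -> SW polls           */
};

void fpl_run(struct fpl_desc *d, uint64_t src_pa, uint64_t dst_pa)
{
    d->src_phys = src_pa;         /* pass physical addresses to HW     */
    d->dst_phys = dst_pa;
    d->done  = 0;
    d->start = 1;                 /* start indicator: HW sees a snoop  */
    while (d->done == 0) {
        /* finished indicator: SW polls a cached location */
    }
}
```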

10 FIFO Programming Model Over FSB
 Synchronization region used to convey full/empty status of buffers
 Pinned memories acted as elastic buffers for SW
 On-chip memories acted as elastic buffers for HW
 AFUs thought they were just reading from and writing to streams
 Exactly the kind of setup suitable for running Map/Reduce jobs in hardware
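A minimal sketch of a software-side producer under these assumptions: control indices live in the synchronization region (so updates are snooped or pollable), data slots in a pinned 2 MB buffer. All names and the power-of-two sizing are illustrative:

```c
#include <stdint.h>

#define FIFO_SLOTS 4096u               /* power of two, fits the region */

struct fifo_ctrl {                     /* in the sync region            */
    volatile uint32_t head;            /* written by SW producer        */
    volatile uint32_t tail;            /* written by HW consumer        */
};

/* SW-side producer; returns 0 when the FIFO is full. On x86, stores are
 * not reordered with other stores, so the slot write is visible before
 * the index update that publishes it. */
int fifo_push(struct fifo_ctrl *c, uint64_t *slots, uint64_t v)
{
    uint32_t h = c->head;
    if (h - c->tail == FIFO_SLOTS)     /* full: HW hasn't drained yet   */
        return 0;
    slots[h & (FIFO_SLOTS - 1u)] = v;  /* data into the pinned buffer   */
    c->head = h + 1;                   /* index update -> snoop to HW   */
    return 1;
}
```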

11 Intel: AAL (Accelerator Abstraction Layer)
 FPGA Co-Processor API
 Co-Processor Management API
 Streaming API Inside FPGA (FIFOs)
 AAL Pins Mailboxes / Manages System Memory
 Virtual Memory Support via Workspace
 Accelerator Discovery Services
 Accelerator Configuration Management Services
Reference: Liu et al.: “A high performance, energy efficient FPGA accelerator platform”, FPGA 2009

12 Arches MPI (Message Passing Interface): Symmetrical Peer Processing
[Diagram (source: Arches Computing, 2009): two nodes connected by a GT/GTX serial I/O bridge; each node holds x86 MPI SW processes, a PPC MPI SW process, an MPI FSB bridge, and an FSB HW MPE hosting an MPI HW “process”, all attached to memory]
 Standard MPI Programming Model & API
 Lightweight Message Passing Protocol Implementation
 Focused on Embedded Systems
 Explicit Rank-to-Node Binding Support
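Because the hardware engine appears as an ordinary rank, the x86 side is plain, standard MPI. A sketch, assuming the HW "process" has been bound to rank 1 (the rank number is an assumption for illustration):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[256] = {0}, result[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                   /* x86 MPI SW process            */
        MPI_Send(data, 256, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(result, 256, MPI_INT, 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("result[0] = %d\n", result[0]);
    }
    /* rank 1 is the MPE-attached MPI HW "process": no SW code needed   */

    MPI_Finalize();
    return 0;
}
```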

13 Convey: Shared Memory (Convey HC-1, 2008)
 Socket Filler Module
– Bridge FPGA implements the FSB protocol with full snoop support
 FPGA Based Compute Accelerator
– Pre-defined vector instruction set
– Shared memory programming model with ANSI C support
 Accelerator Cache Memory
– 80 GB/s BW
– Snoop coherent with system memory
– Direct Cache Access
[Diagram (source: Convey Computer, 2008): CPU socket plus FPGA coprocessor built from eight LX155 FPGAs, each with a memory controller (MC)]
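What the shared memory model buys, as a sketch: one pointer is valid on both host and coprocessor, with snoop coherence instead of staging copies and flushes. copcall_launch() here is a hypothetical stand-in for Convey's dispatch mechanism, not its real API:

```c
#include <stddef.h>

extern void copcall_launch(const char *kernel,
                           double *v, size_t n);  /* hypothetical */

void scale_and_adjust(double *v, size_t n)
{
    copcall_launch("vec_scale", v, n);  /* coprocessor reads/writes v
                                           in place, no DMA staging   */
    v[0] += 1.0;                        /* host touches the same data;
                                           snooping keeps CPU and
                                           accelerator caches coherent */
}
```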

14 Latency: PCI Express & FSB (The Effects of DCA; DCA = Direct Cache Access)
 PCIe
– ~400 ns latency (Gen1 x8 interface, 64-byte payloads)
– Includes: PCIe device to chipset
– Does not include: chipset-to-CPU latency (add FSB latency)
 FSB
– 110 ns latency (64-byte DCA transfers)
– 200+ ns latency on cache-miss operations (fetch from memory)
 DCA: 6x reduced latency
 Results are for minimally loaded systems (i.e. a single master active)
 Chipset can defer and/or retry transactions in loaded systems (both FSB and PCIe)
 Typically less congestion on the FSB than on the PCIe interface
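One way to read these numbers together (an inference, not stated on the slide): a PCIe transfer whose target then misses in the CPU cache pays roughly 400 ns across the link plus a 200+ ns memory fetch, while a 64-byte DCA transfer is placed directly into the cache in about 110 ns; (400 + 200) / 110 ≈ 5.5, consistent with the quoted ~6x reduction.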

15 DCA with ARM ACP (ARM ACP = Accelerator Coherency Port)
 Xilinx ACP platform applications: customer confidential
 Hence using ARM as an example
 ~8x reduced latency
[Diagram source: ARM, 2008]

16 System Memory Bandwidth (PCI Express on Virtex-5)
 Virtex-5 LXT ML555 fitted in a Dell PowerEdge machine
– Goldhammer & Ayer: “Understanding Performance of PCI Express Systems”, Xilinx WP350
 Intel E5000P chipset
 Virtex-6 & Gen2 data is available (but not public yet)
– Rough data points: 2x the BW, similar latency
 PCIe (Gen1) x16: partner IP (not studied)

Link width / transfer size    READ BW (GBytes/s)    WRITE BW (GBytes/s)
x1  8 KB                      0.163                 0.249
x1 16 KB                      0.164                 0.230
x1 32 KB                      0.164                 0.223
x4  8 KB                      0.644                 0.882
x4 16 KB                      0.668                 0.883
x4 32 KB                      0.680                 0.864
x8  8 KB                      1.257                 1.77
x8 16 KB                      1.328                 1.77
x8 32 KB                      1.371                 1.77
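A quick sanity check on the x8 rows: PCIe Gen1 runs at 2.5 GT/s per lane with 8b/10b encoding, i.e. 0.25 GBytes/s of raw bandwidth per lane per direction before packet overhead. A small C snippet; the measured values are the 32 KB row from the table above:

```c
#include <stdio.h>

int main(void)
{
    double per_lane = 2.5 * 8.0 / 10.0 / 8.0; /* GB/s: 8b/10b, bits->bytes */
    double raw_x8   = per_lane * 8.0;         /* 2.0 GB/s for x8 Gen1      */
    double rd = 1.371, wr = 1.77;             /* measured, 32 KB transfers */

    printf("x8 raw %.2f GB/s, read eff %.1f%%, write eff %.1f%%\n",
           raw_x8, 100.0 * rd / raw_x8, 100.0 * wr / raw_x8);
    /* roughly: read ~69%, write ~89% of raw link bandwidth */
    return 0;
}
```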

17 System Memory Bandwidth (FSB on Virtex-5)
 Intel Xeon 7300 chipset
 FPL performance (FSB Protocol Layer = raw interface)
– FPL: primitives for data and control exchange
 Higher-level protocols may reduce BW or require longer burst sizes to achieve the same BW
– AAL, MPI, other: higher-level protocols built on top of FPL

BLOCK SIZE    READ BW (GBytes/s)    WRITE BW (GBytes/s)
512 B         1.62                  1.3
1 KB          2.47                  Not recorded
2 KB          3.36                  Not recorded
4 KB          3.97                  Not recorded
8 KB          4.54                  Not recorded
16 KB         4.66                  3.4
64 KB         4.99                  3.4
128 KB        5.03                  3.4

18 Bandwidth: PCI Express & FSB
 PCIe Gen2 x8 (estimated performance data)
– Double the bandwidth of Gen1
– 2.66 GBytes/s read
– 3.54 GBytes/s write
 FSB
– 1.7x the read BW of PCIe Gen2 x8
– Half duplex only
– 4.66 GBytes/s read
– 3.4 GBytes/s write (not fully optimized yet)
 Data for 16 KB transfers for both PCIe and FSB
 PCIe Gen2 BW figures are estimates
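Checking the claim against the tables: 4.66 / 2.66 ≈ 1.75, matching the quoted 1.7x read advantage. Note, though, that the FSB is half duplex while PCIe lanes are full duplex, so a workload that reads and writes concurrently narrows the gap.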

19 Summary
 FPGA Mapped Into Shared System Memory
 Raw FPL Interface Exposes the Coherency Engine in the FPGA
 Multiple Programming Models Supported
– FIFO, Message Passing, Shared Memory
 DCA Helps To Reduce Latency
 Application Code Must Maximize Issue Rate

