A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics. Presented by: Chris Comis, September 23, 2005


1 A High-Speed Inter-Process Communication Architecture for FPGA-based Hardware Acceleration of Molecular Dynamics
Presented by: Chris Comis
September 23, 2005
Supervisor: Professor Paul Chow

2 Outline
1. Motivation
2. System-Level Overview
3. Protocol Development
4. Results
5. Integration into a Programming Model
6. Conclusions/Questions

3 What is Molecular Dynamics?
- A method of calculating the time-evolution of molecular configurations
- Useful in the analysis of protein folding
- Many applications in rational drug design

4 MD is Computationally Challenging
1. Forces (i.e. F=ma) are calculated between an atom and all other atoms in the system
   - An O(n^2) problem across 10,000+ atoms
2. Force calculations are performed at femtosecond timesteps
   - Interesting results may take several μs of simulation (10^9+ timesteps required)
MD simulations are typically run on supercomputers
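
The scale quoted on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, it counts each atom pair once (the slide does not say whether pairs are counted once or twice), and it assumes a timestep of exactly 1 fs, where the slide says only "femtosecond timesteps".

```c
#include <stdint.h>

/* Number of unique atom pairs among n_atoms atoms: n * (n - 1) / 2.
 * Assumption: each pair is evaluated once per timestep. */
uint64_t md_pair_count(uint64_t n_atoms)
{
    return n_atoms * (n_atoms - 1) / 2;
}

/* Number of timesteps needed to cover sim_time_fs femtoseconds of
 * simulated time with a timestep of dt_fs femtoseconds (rounded up). */
uint64_t md_timestep_count(uint64_t sim_time_fs, uint64_t dt_fs)
{
    return (sim_time_fs + dt_fs - 1) / dt_fs;
}
```

Under these assumptions, 10,000 atoms give 49,995,000 pair interactions per timestep, and 1 μs of simulated time (10^9 fs) at 1 fs per step needs 10^9 timesteps, in line with the figures on the slide.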

5 An FPGA-based MD Accelerator
- An ongoing collaborative project involves the development of an FPGA-based MD accelerator
- Advantages to an FPGA-based approach:
  1. Massive parallel computation
  2. Forces can be parallelized
  3. Force computations can be accelerated ~88x
  4. High-speed serial I/O (SERDES) may be leveraged

6 Area of Focus
- Develop a communication protocol using high-speed SERDES links
- Requirements:
  - Reliability
  - Light-weight
  - Minimal trip-time for small packets
  - Must be abstracted at the hardware and software levels

8 A Partial MD Simulator
[Figure legend: blocks → computation, arrows → communication]
- Computation blocks can be hardware, or software executed on MicroBlaze soft processors
- Software must be written using a programming model

9-14 System-Level Overview
- The MD simulator is simplified to a Producer/Consumer model
- The model is then adapted for SERDES development:
  1. Producer and consumer hardware blocks are implemented
  2. An FSL (FIFO) is used as an abstracted method of data transport with SERDES logic
  3. An OPB bus interface is added for register access of components
  4. Deep FIFOs are added for logging high-speed data

16 Protocol Overview
- A synchronous acknowledgement-based protocol was chosen
  - Simple and predictable
  - An inherent delay in waiting for acknowledgements
- To mask this delay:
  - Multiple producers are connected to the SERDES interface
  - The link is time-multiplexed across multiple producers

17 Protocol Overview
- All data has a word width of 4 bytes
- Data packets:
  - Variable size (between 32 and 2016 bytes)
  - A 32-bit CRC is appended
- Acknowledgements:
  - 8 bytes in size
  - Can interrupt transmission of data packets
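
The packet constraints on this slide can be sketched in software as follows. This is a minimal illustration, not the hardware implementation. The slide does not name the CRC polynomial, so the common IEEE 802.3 CRC-32 is assumed. The sketch also assumes that the 32 to 2016 byte range applies to the payload before the CRC is appended and that the CRC is stored least-significant byte first. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { WORD_BYTES = 4, MIN_PACKET_BYTES = 32, MAX_PACKET_BYTES = 2016 };

/* Returns 1 if len is a whole number of 4-byte words within [32, 2016]. */
int packet_len_valid(size_t len)
{
    return len >= MIN_PACKET_BYTES && len <= MAX_PACKET_BYTES
        && len % WORD_BYTES == 0;
}

/* Bitwise CRC-32, assumed IEEE 802.3 variant: reflected polynomial
 * 0xEDB88320, initial value and final XOR of 0xFFFFFFFF. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Copies the payload into frame and appends the CRC as one extra word
 * (byte order is an assumption). Returns the frame length in bytes, or 0
 * if the payload length is illegal or the frame buffer is too small. */
size_t build_frame(const uint8_t *payload, size_t len,
                   uint8_t *frame, size_t frame_cap)
{
    if (!packet_len_valid(len) || frame_cap < len + WORD_BYTES)
        return 0;
    memcpy(frame, payload, len);
    uint32_t crc = crc32_ieee(payload, len);
    for (int i = 0; i < WORD_BYTES; i++)
        frame[len + i] = (uint8_t)(crc >> (8 * i));
    return len + WORD_BYTES;
}
```

A receiver following the same assumptions would recompute the CRC over the payload and compare it with the trailing word before acknowledging the packet.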

18 Transmit Logic
- Transmitter consists mainly of two components:
  1. Dual-port buffers: the start address of the packet is kept in case a resend is necessary
  2. Scheduler: schedules ready packets in a round-robin fashion
[Figure labels: From Producer via FSL; To Scheduler of SERDES Link]
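
Round-robin scheduling of ready packets can be sketched as below. This is a software illustration of the idea rather than the hardware scheduler. The producer count of three is an assumption borrowed from the three-producer tests, and the `ready_mask` encoding and function name are hypothetical.

```c
#include <stdint.h>

#define NUM_PRODUCERS 3  /* assumed, matching the three-producer tests */

/* Picks the next producer to serve, searching in circular order starting
 * just after the producer served last.
 * ready_mask: bit i is set when producer i has a complete packet buffered.
 * last_served: index of the previously served producer, or -1 at start-up.
 * Returns the chosen producer index, or -1 if no packet is ready. */
int rr_next(uint32_t ready_mask, int last_served)
{
    for (int step = 1; step <= NUM_PRODUCERS; step++) {
        int candidate = (last_served + step) % NUM_PRODUCERS;
        if (ready_mask & (1u << candidate))
            return candidate;
    }
    return -1;
}
```

If a producer's ready bit were cleared while its last packet awaits an acknowledgement (an assumption about how the two mechanisms interact), the scheduler would naturally fill that waiting time with packets from the other producers, which is the delay-masking effect described in the protocol overview.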

19 Receive Logic
- Receiver consists mainly of two components:
  1. Dual-port buffers: the start address of the packet is kept in case errors occur
  2. Three-stage dataflow pipeline:
     - Stage 1: Determine if incoming data is properly formatted
     - Stage 2: Evaluate incoming data against all possible errors
     - Stage 3: Pass results to acknowledgement handler
[Figure labels: From SERDES Link; To Consumer via FSL]

20 Design Effort
- The majority of design effort was in error handling:
  - Transmitter:
    - Determine which packet combinations corrupt the system
    - Establish a priority among conflicting packet types
  - Receiver:
    - Handle all possible combinations of transmission errors

22 Test Environment
- All SERDES tests were performed across Xilinx Virtex-II Pro XC2VP7 and XC2VP30 series FPGAs
- Ribbon cables were used to transfer serial data between non-impedance-controlled connectors

23 Reliability and Sustainability
- Verification test environment:
  - Send data concurrently from three producers to three respective consumers
  - Pseudo-random packet length
  - Consumers read from FSL at variable rates
- Reliability:
  - Run this test under extremely poor line conditions
- Sustainability:
  - Run this test under normal line conditions for a long period of time

24 Reliability
Reliability: 128-second Test Results

Type of Error                   Average # of Errors
Soft Error (x10^6)              1.312
Hard Error                      722977
Frame Error                     22
CRC Error                       18414
Receive Buffer Full (x10^6)     1.804
Lost Acknowledgment             81769

25 Sustainability
Sustainability: 8-hour Test Results

Measurement                                            Result
Resent Packets due to Receive Buffer Full (x10^6)      502.353
Successful Packets (x10^6)                             5666.821
Total Packets (x10^6)                                  6169.174
Approximate Bit-Rate (x10^9)                           1.755
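
The figures in this table are internally consistent: 502.353 + 5666.821 = 6169.174 (x10^6 packets). A couple of derived quantities follow directly from them. The helpers below are a hypothetical sketch for that arithmetic only and are not part of the original test harness.

```c
/* Fraction of all transmitted packets that were resends. */
double resend_fraction(double resent_packets, double total_packets)
{
    return total_packets > 0.0 ? resent_packets / total_packets : 0.0;
}

/* Average packet rate over a test of the given duration in seconds. */
double packets_per_second(double total_packets, double duration_s)
{
    return duration_s > 0.0 ? total_packets / duration_s : 0.0;
}
```

With the table's values, about 8.1% of transmitted packets were resends caused by a full receive buffer, and the 8-hour run (28,800 s) averaged roughly 214,000 packets per second.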

26 Comparison Against Other Communication Mechanisms
- Two configurations are used:
  - Configuration A: Saturate the channel with packets
  - Configuration B: Loop-back test
- Compare against:
  - Simple FPGA-based 100BaseT Ethernet
  - TCP/IP FPGA-based 100BaseT Ethernet
  - TCP/IP Cluster-based Gigabit Ethernet

27 Throughput Results

28 One-way Trip Time Results

29 Area Consumption
- Each SERDES interface takes approximately 8% of a Xilinx XC2VP30
- Debug logic substantially increases area consumption:
  - FF usage increases 68%
  - LUT usage increases 43%

Area Measurement              FFs     LUTs
Area with Debug Logic         3486    3218
Area without Debug Logic      2074    2244
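
The quoted percentage increases follow from the area figures in the table. The helper below is a hypothetical one-liner used only to check that arithmetic.

```c
/* Percentage increase from base to with_extra, rounded to the nearest
 * whole percent. Assumes base > 0 and with_extra >= base. */
int percent_increase(int base, int with_extra)
{
    return (int)(100.0 * (with_extra - base) / base + 0.5);
}
```

For flip-flops, (3486 - 2074) / 2074 is about 68%. For LUTs, (3218 - 2244) / 2244 is about 43%. Both match the slide.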

31 Integration into a Programming Model
- Hardware abstraction: FSL
- Software abstraction: An MPI-based programming model
- Modified MPI_Send and MPI_Recv function calls

    int data_outgoing[64], data_incoming[64];  /* 64-word buffers */
    MPI_Status status;
    while (1) {
        /* send 64 ints to rank 0, then block until 64 ints arrive from rank 0 */
        MPI_Send(data_outgoing, 64, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(data_incoming, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

32 Integration into a Programming Model
- Replaced producers and consumers with a MicroBlaze processor
- Several communication scenarios were tested

Scenario                                        Bit-Rate (Mbps)
MicroBlaze to MicroBlaze (no traffic)           4.30
MicroBlaze to MicroBlaze (traffic)              4.10
MicroBlaze to Hardware Consumer (no traffic)    7.78
Hardware Producer to MicroBlaze (no traffic)    8.90

34 Conclusions
- Final results:
  - Reliable and sustainable
  - Abstracted at the software and hardware level
  - 2074 FFs and 2244 LUTs required for SERDES logic only
  - Given a channel rate of 2.5 Gbps, maximum bidirectional throughput of 1.928 Gbps
  - Minimum packet trip-time of 1.23 μs
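
The throughput figure can be put in context with a short calculation. The backup slides state that packets are framed with 8B/10B control characters, and 8B/10B encoding carries 8 data bits in every 10 line bits. The sketch below assumes the 1.928 Gbps figure is directly comparable to the 2.5 Gbps channel rate, which the slide does not make explicit. The function names are hypothetical.

```c
/* Payload rate available on an 8B/10B-encoded link: 8 data bits are
 * carried for every 10 bits on the line. */
double payload_rate_8b10b(double line_rate_gbps)
{
    return line_rate_gbps * 8.0 / 10.0;
}

/* Achieved rate as a fraction of the available rate. */
double link_efficiency(double achieved_gbps, double available_gbps)
{
    return achieved_gbps / available_gbps;
}
```

Under that assumption, a 2.5 Gbps line offers 2.0 Gbps of payload capacity after encoding, so 1.928 Gbps is about 77% of the raw line rate and about 96% of the post-encoding capacity. The remainder would cover framing, CRC words, acknowledgements, and clock correction.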

35 Acknowledgements
- Professor Régis Pomès, Chris Madill
- Professor Paul Chow, Professor C.Y. Chen, Lesley Shannon, Arun Patel, Manuel Saldaña, David Chui, Sam Lee, Andrew House, Nathalie Chan, Lorne Applebaum, Patrick Akl
References
- Y. Gu, T. VanCourt, M. C. Herbordt, "FPGA Acceleration of Molecular Dynamics Computations," to appear in Proceedings of Field Programmable Logic and Applications, August 2005.

36 Transmitter Packet Collision Handling
- Packets are enclosed by 8B/10B control characters (K-characters)
- The type of packet is distinguished by the K-characters used
- Certain combinations of control characters cannot be nested:
  - Clock correction has priority over acknowledgement
  - Acknowledgement cannot interrupt the end of a data packet
  - Clock correction must avoid the beginning and end of a data packet
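
The three nesting rules above can be expressed as a small arbitration function. This is a sketch of one possible reading of the slide, not the actual transmitter logic. The phase names and the function are hypothetical. One case is not covered by the slide: when clock correction is pending but blocked at the start of a data packet, this sketch lets a pending acknowledgement go first.

```c
/* Where the transmitter currently is within a data packet (assumed phases). */
typedef enum { TX_IDLE, TX_DATA_START, TX_DATA_BODY, TX_DATA_END } tx_phase_t;

/* What the transmitter should insert into the stream next. */
typedef enum { TX_CONTINUE, TX_SEND_CLOCK_CORRECTION, TX_SEND_ACK } tx_choice_t;

tx_choice_t tx_arbitrate(tx_phase_t phase, int cc_pending, int ack_pending)
{
    /* Clock correction must avoid the beginning and end of a data packet. */
    int cc_allowed = (phase != TX_DATA_START && phase != TX_DATA_END);
    /* An acknowledgement cannot interrupt the end of a data packet. */
    int ack_allowed = (phase != TX_DATA_END);

    /* Clock correction has priority over acknowledgement. */
    if (cc_pending && cc_allowed)
        return TX_SEND_CLOCK_CORRECTION;
    if (ack_pending && ack_allowed)
        return TX_SEND_ACK;
    return TX_CONTINUE;
}
```

Requests that are not allowed in the current phase simply stay pending and are reconsidered on a later call.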

37 Receiver Error Handling
- All combinations of errors at the receiver are handled correctly:
  - Data errors (CRC errors)
  - Disparity errors or invalid characters (soft errors)
  - Errors in framing (frame errors)
  - Channel failures (hard errors)
  - Lost acknowledgements/repeat packets
  - Receiver buffers full

38 Test Configuration A
- Send data concurrently from three producers to three respective consumers
- Producers write to FSL as fast as possible
- Consumers read from FSL as fast as possible
- Analyze best-case throughput results

39 Test Configuration B
- Send data from a producer to a consumer
- Delay a packet write from a producer until a packet has been completely received by the consumer on the same FPGA
- The resulting communication loop determines the round-trip time (and therefore the one-way trip time)
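
The loop-back measurement reduces to simple arithmetic. The sketch below assumes the two directions of the link are symmetric, so a one-way trip is half of a round trip, and that the total time is measured over a whole number of completed loops. The function name and parameters are hypothetical.

```c
/* Average one-way trip time from a loop-back run.
 * total_time_us: elapsed time for all loops, in microseconds.
 * loops: number of completed round trips (must be > 0).
 * Assumption: both directions take the same time. */
double one_way_trip_us(double total_time_us, unsigned long loops)
{
    return total_time_us / (2.0 * (double)loops);
}
```

For example, a one-way trip time of 1.23 μs corresponds to a measured round trip of 2.46 μs per loop.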

