
Slide 1: Simulation Case Studies Update
David Bueno, June 19, 2006
HCS Research Laboratory, ECE Department, University of Florida

Slide 2: Overview

- Set of case studies examining the performance of realistic GMTI kernels with detailed modeling of the processing and memory subsystems
  - Models developed during the Summer '05 internship serve as the basis for improvements
- Models modified to alter or mask sensitive information, and calibrated to more closely resemble Chris' testbed setup in some cases (e.g., 250 MHz RIO)
- Modeling environment allows us to change parameters, scale, and network architecture with a much greater degree of freedom than the testbed
- Three main points of emphasis for the case studies:
  - Gain insight into tradeoffs in the shared interface between processor, network, and memory
  - Develop a greater understanding of SBR processing/network requirements and the optimal system configuration
  - Quantify RapidIO latency and examine methods of improving latency for critical data
- Also intend to analytically examine the previously studied FT RIO architectures and discuss how they may apply to the case study scenarios

Slide 3: System Architecture

- Two network architectures, similar to previous GMTI/SAR experiments (key parameters summarized in the sketch below)
- Small-scale system assumes FPGA-level TMR ("current" technology)
  - 3 FPGAs (1 logical PE) per card
  - Maximum of 7 processor boards (7 logical PEs), plus 1 global-memory board
  - 4-switch network backplane card (8-port switches): Clos-like backplane with first- and second-stage switches on the same card
  - 1 system controller and 1 spacecraft interface card, each connected to the RIO network via a single dedicated link; high-level models act as the source of latency-sensitive data
- Large-scale system (shown) assumes fully radiation-hardened FPGAs, or additional management software (not modeled) capable of handling SEUs in the FPGAs (e.g., a DM-like system, "near future" technology)
  - 7 processor cards, with 4 FPGAs (4 logical PEs) and 1 RapidIO switch per card, for 28 physical and logical FPGAs total
  - 4-switch network backplane card (9-port switches): Clos-like backplane with the second stage only
  - 1 system controller and 1 spacecraft interface card
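
As a compact reference, the two configurations above can be captured in a small summary structure. The sketch below is purely illustrative: the class and field names are ours and do not correspond to anything in the MLD models; only the counts come from the slide.

```python
# Illustrative summary of the two modeled system configurations; field names
# are hypothetical, but the derived totals match the counts on the slide.
from dataclasses import dataclass

@dataclass
class SystemConfig:
    name: str
    processor_cards: int
    fpgas_per_card: int
    logical_pes_per_card: int
    backplane_switches: int
    switch_ports: int

    @property
    def total_fpgas(self):
        return self.processor_cards * self.fpgas_per_card

    @property
    def total_logical_pes(self):
        return self.processor_cards * self.logical_pes_per_card

small = SystemConfig("small-scale (FPGA-level TMR)", 7, 3, 1, 4, 8)
large = SystemConfig("large-scale (rad-hard FPGAs)", 7, 4, 4, 4, 9)

for cfg in (small, large):
    print(f"{cfg.name}: {cfg.total_fpgas} FPGAs, {cfg.total_logical_pes} logical PEs")
# small-scale (FPGA-level TMR): 21 FPGAs, 7 logical PEs
# large-scale (rad-hard FPGAs): 28 FPGAs, 28 logical PEs
```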

Slide 4: Major Improvements

Several changes from the previous SBR experiments:
- Computation time based on a working preliminary RIO testbed implementation of the GMTI kernels
- Detailed memory access model for SDRAM
  - Models contention between the processing elements and the RIO interface
  - SRAM access is deterministic and considered part of the measured computation time
- Support for measurement of latency-sensitive traffic, with latency values based on the Honeywell RapidIO implementation
- New scaling method reduces simulation runtimes from hours (overnight) to ~30 minutes (sketched below)
  - Shrink the data cube (and CPI) along the pulses dimension for simulation, then scale the reported results to the full CPI size
  - Verified accurate to <<1% error for all cases where the system is able to meet its real-time deadlines
  - Super-linear speedups in simulation runtime observed, due to reduced memory/disk access by the MLD simulator
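
A minimal sketch of the scaling idea, assuming the reported per-CPI execution time scales linearly with the pulse dimension; the function name and the 32-pulse simulation size are hypothetical choices for illustration, not values taken from the models.

```python
# Illustrative sketch of the pulse-dimension scaling method (assumed linear
# scaling); the 32-pulse shrunken cube is a made-up example size.
FULL_PULSES = 256      # pulses per CPI in the full GMTI scenario
SIM_PULSES = 32        # shrunken pulse dimension used to keep simulations short

def scale_to_full_cpi(sim_exec_time_s: float,
                      sim_pulses: int = SIM_PULSES,
                      full_pulses: int = FULL_PULSES) -> float:
    """Scale a simulated execution time back up to the full CPI size.

    Assumes work (processing and data movement) grows linearly with the
    pulse dimension, which is consistent with the <<1% error observed for
    cases that meet their real-time deadlines.
    """
    return sim_exec_time_s * (full_pulses / sim_pulses)

# Example: a 30 ms result for the shrunken cube is reported as a 240 ms
# full-CPI execution time, to be checked against the 256 ms deadline.
print(scale_to_full_cpi(0.030))   # -> 0.24
```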

Slide 5: GMTI Case Study

- A CPI of data arrives at the global-memory board every 256 ms
  - Store-and-forward pre-processing (e.g., ECCM beamforming) is performed outside the scope of the case study
- 32 bits per element (16-bit real and 16-bit imaginary components) to match the testbed
- Four kernels compose the processing stages, matching the testbed implementation: Pulse Compression, Doppler Processing, Beamforming, and CFAR
- Each input data cube is 256 pulses by 6 beams, with the number of ranges varied (cube sizes per stage are sketched below)
  - Cube shrinks along the range dimension by a factor of 0.6 after Pulse Compression
  - Cube shrinks along the beams dimension by a factor of 0.5 after Beamforming (3 beams formed from the 6 input beams)
- 1 KB of detection results is reported to the system controller at the conclusion of CFAR
- Latency-sensitive control data arrives from the spacecraft for delivery to a randomly selected processing node at Poisson-distributed intervals
- Latency-sensitive health/timer/status data is sent from the system controller to each processing node at regular intervals
  - Processing nodes respond with RapidIO responses that are also sensitive to latency and jitter
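
A rough sizing sketch using the parameters on this slide. It interprets "shrinks by 0.6" as the range dimension retaining 0.6 of its input length (consistent with the beams dimension retaining 0.5 after Beamforming); the 48k-range case is shown only as an example.

```python
# Data-cube size through the GMTI stages, using the slide's parameters.
# "Shrinks by 0.6/0.5" is interpreted as retaining that fraction of the axis.
BYTES_PER_ELEMENT = 4        # 32-bit complex samples
PULSES, BEAMS = 256, 6

def cube_bytes(pulses: int, beams: int, ranges: int) -> int:
    return pulses * beams * ranges * BYTES_PER_ELEMENT

ranges = 48 * 1024           # one of the varied range counts
stages = {
    "input cube":              cube_bytes(PULSES, BEAMS, ranges),
    "after Pulse Compression": cube_bytes(PULSES, BEAMS, int(ranges * 0.6)),
    "after Beamforming":       cube_bytes(PULSES, BEAMS // 2, int(ranges * 0.6)),
}
for stage, size in stages.items():
    print(f"{stage:>24}: {size / 2**20:6.1f} MiB")
```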

Slide 6: Single-Engine Results

- One processing engine in each FPGA is dedicated to each GMTI kernel (Pulse Compression, Doppler Processing, Beamforming, CFAR: 4 engines total per FPGA)
- Most significant jump in performance lies between 8 Gbps (125 MHz interface) and 16 Gbps (250 MHz interface) in all cases
  - Between these values, the memory interface goes from under-provisioned to adequately provisioned for the given network and processing traffic patterns (see the throughput check below)
- 64k-range cube experiences a non-linear performance penalty compared to the 32k-range and 48k-range cases
  - Penalty is due to double buffering: sharing of the SDRAM interface between incoming RIO data and PE access/corner-turn RIO data
  - Smaller cubes require limited double buffering, except for the 8 Gbps, 48k case; their data can be delivered and processed in under 256 ms, so there is potential for overlap
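
A quick check of the memory interface numbers quoted above, assuming a 32-bit DDR data bus as described in the multiple-engine discussion on the next slide.

```python
# Theoretical peak throughput of a 32-bit DDR SDRAM interface at the two
# clock rates compared above.
BUS_WIDTH_BITS = 32
TRANSFERS_PER_CLOCK = 2      # double data rate

def ddr_peak_gbps(clock_mhz: float) -> float:
    return BUS_WIDTH_BITS * TRANSFERS_PER_CLOCK * clock_mhz * 1e6 / 1e9

for mhz in (125, 250):
    print(f"{mhz} MHz -> {ddr_peak_gbps(mhz):.0f} Gbps")
# 125 MHz -> 8 Gbps (under-provisioned), 250 MHz -> 16 Gbps (adequate)
```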

Slide 7: Multiple-Engine Results

- Number of processing engines for each GMTI task in each FPGA is varied from 1 to 4 (up to 16 total engines per FPGA)
- As processing engines are added, performance becomes increasingly memory bound, and the PE requires more memory bandwidth to benefit from the increased processing capability
  - Each successive increment in memory interface clock allows the effective addition of 1 more engine per task
  - However, there are diminishing returns overall for each speed increase beyond 16 Gbps, due to the additional dependence on network throughput for data delivery and corner turns
- RIO interface requires ~4 Gbps (theoretical max) of memory bandwidth for each direction of traffic
  - The 8 Gbps required by full-duplex corner-turn communication maxes out a 125 MHz, 32-bit DDR bus (8 Gbps) without any additional PE traffic
  - Double buffering of processing at the PE level requires memory bandwidth to be 2x the "processing bandwidth" of one engine (see the tally below)
    - For all experiments, the FPGA can process one 32-bit element per cycle at 125 MHz (4 Gbps)
    - A 125 MHz bus therefore leaves zero margin for any RapidIO traffic alongside processing
- Most stressful memory access period is actually not the corner-turn network traffic, since no PE access occurs during that time
  - Instead, double-buffered processing plus network traffic may require PE reads, PE writes, AND reception of the next data cube
  - Note: it is important to distinguish double buffering of processing data (e.g., performing an FFT while loading the next chunk of data for the FFT into SRAM) from double buffering of network data (i.e., processing one data cube while receiving the next from the sensors)
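
The sketch below tallies the simultaneous SDRAM bandwidth demands during that most stressful period (double-buffered processing while the next data cube arrives), using only the figures from this slide; it is a back-of-the-envelope check, not a model of the actual contention behavior.

```python
# Bandwidth tally for the worst-case period described above, per engine.
ELEMENT_BITS = 32
PE_CLOCK_MHZ = 125                    # one 32-bit element processed per cycle

processing_gbps = ELEMENT_BITS * PE_CLOCK_MHZ * 1e6 / 1e9   # 4 Gbps
double_buffered_pe_gbps = 2 * processing_gbps                # reads + writes: 8 Gbps
incoming_cube_gbps = 4.0              # ~4 Gbps per direction of RIO traffic

demand = double_buffered_pe_gbps + incoming_cube_gbps        # 12 Gbps, single engine
for bus_gbps in (8, 16):
    margin = bus_gbps - demand
    status = "OK" if margin >= 0 else "over-subscribed"
    print(f"{bus_gbps:2d} Gbps bus: {status} (margin {margin:+.0f} Gbps)")
```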

Slide 8: Conclusions

- Testbed and simulation case studies are providing valuable insight into the implementation of real-world RapidIO systems for SBR processing
- Performance of the DDR SDRAM interface is the major determining factor in the performance of the GMTI kernels
  - It heavily influences the performance of both the PE and the RapidIO network
- Double buffering of both network and processing data greatly taxes memory, even more than the network corner turns
  - Even if a data cube is received while performing a corner turn, RIO network flow control ensures the memory interface is not taxed at more than 8 Gbps for 250 MHz RapidIO
  - However, if a data cube is received while performing double-buffered processor access, the RIO network will require 4 Gbps while each engine also requires 8 Gbps of throughput
- With the current testbed configuration, there is no point in implementing support for multiple engines per task on a single FPGA
  - Simulation results save Chris some pain and suffering

Slide 9: Future Work

- Addition of charts showing memory bandwidth utilization over time
- Addition of a chart showing the execution time components of a CPI
- Inclusion of small-scale system results
  - Mainly as a proof of concept for current technology
  - Preliminary results show the same trends at smaller scale
- Quantification of FPGA resources required for each configuration used in the experiments
  - Mostly done; just need to calculate estimates with Chris for stages not yet implemented in the testbed (i.e., addition of magnituding to CFAR, Beamforming)
  - Baseline configuration (1 engine per task per FPGA) estimated to nearly fit in our current testbed FPGAs
- Study of latency values and latency improvement tactics:
  - Cut-through routing
  - Preemption
  - Dedicated paths
  - Reduction of packet size
  - Direct access from the RapidIO network to SRAM
- SAR global-memory-based FFT case studies
  - Suite of experiments similar to the GMTI case studies
  - Also examine performance with processing performed directly out of global memory over RIO

Slide 10: Brief Testbed Update
Chris Conger, June 19, 2006
HCS Research Laboratory, ECE Department, University of Florida

Slide 11: Review of Testbed Node Architecture

- As requested, this section briefly reviews the details of the testbed node architecture
- Still awaiting arrival of the full-featured DDR SDRAM controller core, needed for maximum performance of main memory
  - Currently measured sustained throughput is 2.5 Gbps with the restricted controller (a rough efficiency estimate is sketched below)
  - Burst size is fixed at 2 (the minimum for DDR), with no bank management
- High current draw from the new DDR modules is causing noise on the overall power supply
  - This is causing reliability issues with the network link, as well as data-integrity issues through the SDRAM
  - Chris' top priority to resolve; discussed after the node architecture review
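
A rough efficiency estimate for the restricted controller. The 32-bit data bus and 100 MHz SDRAM clock are assumptions based on the node description two slides later, not figures stated for the memory interface itself.

```python
# Rough efficiency of the interim DDR controller; bus width and clock are
# assumed (32-bit @ 100 MHz), only the 2.5 Gbps measurement is from the slide.
bus_bits, clock_mhz = 32, 100
peak_gbps = bus_bits * 2 * clock_mhz * 1e6 / 1e9   # DDR: 2 transfers per clock
measured_gbps = 2.5                                 # sustained, burst size fixed at 2

print(f"peak {peak_gbps:.1f} Gbps, measured {measured_gbps} Gbps "
      f"({measured_gbps / peak_gbps:.0%} efficiency)")
# The low efficiency is consistent with the minimum burst size and the lack
# of bank management in the restricted controller.
```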

Slide 12: HCS-CNF Top-Level Architecture
[Figure: node architecture conceptual diagram]

Slide 13: Notes Regarding the Previous Slide

- Each HCS-CNF node includes:
  - Processing elements with 32 KB of internal SRAM (each) @ 100 MHz
  - External 128 MB DDR SDRAM storage @ 100/125 MHz
  - RapidIO endpoint @ 125/250 MHz
  - 64-bit internal data path, except at the processing memory interfaces (32-bit width)
  - DMA-style memory transfers, with transparent remote memory access
  - Arbitrated access to SDRAM storage with no "starvation" allowed: equal priority, 256 bytes per burst (see the arbitration sketch below)
  - Command queues and localized control for each major section
- Each colored circle in the diagram indicates an independent clock domain
  - The internal clock frequency of the Network Interface Controller depends on the RapidIO link speed
  - Processors (PowerPC and HW co-processors) are fixed at 100 MHz
  - All other parts of the design operate at the SDRAM clock frequency
- Not shown in the diagram is the control path between the PowerPC and the co-processor engines
  - A simple R/W control register on each co-processor provides the necessary control
- Baseline node design is SDRAM-centric (all transfers involve main memory)
- Modular design allows architectural flexibility for enhancement
  - An SRAM-to-RapidIO direct path can be achieved by adding one more FIFO between the OCM module and the NIC module and adjusting the control logic at each end
  - Direct data transfer between processing engines only requires a redesign of the OCM module
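
The actual controller is FPGA logic, but the arbitration policy described above (equal priority, no starvation, 256 bytes per burst) can be illustrated with a small behavioral sketch. Round-robin ordering is an assumption; the slide only states that access is equally prioritized.

```python
# Behavioral sketch (not the actual HDL) of equal-priority SDRAM arbitration:
# requesters are served round-robin, one 256-byte burst per grant, so no
# port can starve the others.
from collections import deque

BURST_BYTES = 256

def arbitrate(request_queues: dict) -> list:
    """request_queues maps a port name (e.g. 'PE', 'RIO-in') to a deque of
    pending transfer sizes in bytes; returns the ordered list of grants."""
    grants = []
    ports = deque(request_queues)
    while any(request_queues.values()):
        port = ports[0]
        ports.rotate(-1)                       # next port gets the next turn
        if not request_queues[port]:
            continue
        remaining = request_queues[port].popleft()
        grants.append((port, min(remaining, BURST_BYTES)))
        if remaining > BURST_BYTES:            # requeue the rest for later turns
            request_queues[port].appendleft(remaining - BURST_BYTES)
    return grants

# Example: a 1 KB PE read interleaves with a 512 B incoming RIO write.
print(arbitrate({"PE": deque([1024]), "RIO-in": deque([512])}))
```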

Slide 14: Additional Slides
David Bueno, June 20, 2006
HCS Research Laboratory, ECE Department, University of Florida

Slide 15: Addendum: CPI Execution Time

- Figure shows the breakdown of CPI execution time
- The 32k cube size was chosen because it does not require network double buffering
  - Double buffering makes it difficult to determine the exact breakdown of execution time, due to the overlap of communication and computation
- Data cube receive occupies a significant portion of execution time, explaining the diminishing returns as SDRAM interface speed is increased
  - RIO network speed would then need to be increased as well to maintain high levels of speedup
- Time per stage shrinks as the cube size shrinks

Slide 16: Addendum: Memory Utilization over Time

- Figures depict utilization of the 16 Gbps DDR SDRAM interface over the execution of 8 CPIs for the 32k-range and 64k-range cases
  - Utilization poll interval is 64 us, so "instantaneous" utilization is calculated over each successive 64 us interval (see the sketch below)
  - 32k (left): no double buffering required; periods of inactivity, since the cube can be received and processed within 256 ms
  - 64k (right): requires double buffering; no periods of inactivity
- "Medium" peaks (40-60%) are due to processing or corner-turn activity (in this case bounded by the processor or RIO)
- Maximum peaks (100%) are due to local data distribution (considered to be completely memory bound)
- Note that most of the first and last CPIs of the 64k-range case are "setup" and "wrap-up" CPIs, so double buffering is limited in those CPIs
  - Focus on the middle 6 CPIs of the 64k-range chart to see the full effects of double buffering
  - Essentially loading and emptying a "pipeline"
- Network double buffering slightly raises utilization during periods of data cube distribution (by ~4%)
  - Incoming data slows memory performance during local data redistribution and slows network performance during corner turns
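
A minimal sketch of how an "instantaneous" utilization trace like this could be computed from a log of memory transfers. The (start_time, bytes) event format and the attribution of each transfer to the interval in which it starts are simplifying assumptions; only the 64 us poll interval and the 16 Gbps capacity come from the slide.

```python
# Per-interval utilization of the memory interface from a hypothetical list
# of (start_time_s, bytes_moved) events.
POLL_INTERVAL_S = 64e-6
CAPACITY_BPS = 16e9            # 16 Gbps DDR SDRAM interface

def utilization_trace(events, run_time_s):
    n_bins = int(run_time_s / POLL_INTERVAL_S) + 1
    busy_bits = [0.0] * n_bins
    for start_s, nbytes in events:
        busy_bits[int(start_s / POLL_INTERVAL_S)] += 8 * nbytes
    capacity_per_bin = CAPACITY_BPS * POLL_INTERVAL_S
    return [min(b / capacity_per_bin, 1.0) for b in busy_bits]

# Example: a single 64 KiB burst starting at t = 10 us fills roughly half of
# its 64 us interval (~51% utilization), and the remaining intervals are idle.
print(utilization_trace([(10e-6, 64 * 1024)], run_time_s=256e-6))
```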

