

1 Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer
Allen Michalski, CSE Department – Reconfigurable Computing Lab, University of South Carolina

2 Outline (MAPLD 2005/253, Michalski)
Reconfigurable Computing – Introduction
 SRC-6e architecture, programming model
Sorting Algorithms
 Design guidelines
Testing Procedures, Results
Conclusions, Future Work
 Lessons learned

3 What is a Reconfigurable Computer?
Combination of:
 Microprocessor workstation for frontend processing
 FPGA backend for specialized coprocessing
 Typical PC bus for communications

4 What is a Reconfigurable Computer?
PC Characteristics
 High clock speed
 Superscalar, pipelined
 Out-of-order issue
 Speculative execution
 High-level language programming
FPGA Characteristics
 Low clock speed
 Large number of configurable elements: LUTs, Block RAMs, CPAs, multipliers
 HDL programming

5 What is the SRC-6e? (SRC = Seymour R. Cray)
Reconfigurable computer with a high-throughput memory interface
 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
 Compare PCI-X (1.0) = 1.064 GB/s

6 SRC-6e Development
Programming does not require knowledge of HW design
 C code can compile to hardware

7 SRC Design Objectives
FPGA Considerations
 Superscalar design: parallel, pipelined execution
SRC Considerations
 High overall data throughput
  Streaming versus non-streaming data transfer?
 Reduction of FPGA data-processing stalls due to data dependencies and data read/write delays
  FPGA Block RAM versus SRC OnBoard Memory?
Evaluate software/hardware partitioning
 Algorithm partitioning
 Data size partitioning

8 Sorting Algorithms
Traditional Algorithms
 Comparison sorts: Ω(n lg n) comparisons required
  Insertion sort, merge sort, heapsort, quicksort
 Counting sorts
  Radix sort: Θ(d(n+k))
HPCS FORTRAN code baseline
 Radix sort in combination with heapsort
 This research focuses on 128-bit operands
  Simplified SRC data transfer and management
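As a plain-C sketch of the counting-sort pass structure behind the Θ(d(n+k)) radix sort above: this illustrative version sorts 32-bit keys one byte (radix 2^8) at a time. The benchmark baseline works on 128-bit operands; the function name and signature here are mine, not from the HPCS or SRC code.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* LSD radix sort of n 32-bit keys, d = 4 passes of 8 bits each.
   tmp must hold n elements; each pass is a stable counting sort. */
void radix_sort_u32(uint32_t *a, uint32_t *tmp, size_t n)
{
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};

        /* histogram of the current digit */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;

        /* prefix sums give each bucket's starting offset */
        for (int b = 0; b < 256; b++)
            count[b + 1] += count[b];

        /* stable scatter into tmp, then copy back */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
}
```

The same structure extends to the 128-bit case by widening the key type and running 128/d passes.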

9 Sorting – SRC FPGA Implementation
Memory Constraints
 SRC OnBoard Memory: 6 banks × 4 MB
  Pipelined read or write access, 5-clock latency
 FPGA BRAM memory: 144 blocks, 18 Kbit each
  1-clock read and write latency
Initial Choices
 Parallel insertion sort (bubblesort)
  Produces sorted blocks
  Uses OnBoard Memory pipelined processing to minimize data-access stalls
 Parallel heapsort
  Random-access merge of sorted lists
  Uses BRAM for low-latency access; good for random data access

10 Parallel Insertion Sort (BubbleSort)
Systolic array of cells
 Pipelined SRC processing from OnBoard Memory
 Each cell keeps the highest value, passes other values along
 Latency: 2× number of cells

11 Parallel Insertion Sort (BubbleSort)
Systolic array of cells
 Results passed out in reverse order of comparison
 N = number of comparator cells
 Sorts a list of size L completely in Θ(L²)
 Limit sort size to some number a < L (list size)
  Creates multiple sorted lists
  Each list sorted in Θ(a)
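A minimal host-side model of the systolic behaviour described above: each cell retains the largest key it has seen and forwards the rest, so after the stream drains, cell 0 holds the block maximum and the cells read back in reverse give the sorted block. The hardware performs all cell comparisons in parallel, one stream element per clock; this sequential sketch (names assumed, not from the MAP/HDL source) only reproduces the dataflow.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Software model of an n-cell systolic insertion-sort array.
   cell[c] holds the largest value seen so far at position c;
   a smaller incoming value is passed on to the next cell. */
void systolic_block_sort(const uint64_t *in, uint64_t *out, size_t n)
{
    uint64_t *cell = malloc(n * sizeof *cell);
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t v = in[i];
        for (size_t c = 0; c < used; c++) {
            if (v > cell[c]) {          /* cell keeps the larger key */
                uint64_t t = cell[c];
                cell[c] = v;
                v = t;                  /* smaller key moves onward  */
            }
        }
        cell[used++] = v;               /* first empty cell latches it */
    }

    /* drain: cell[0] = max ... cell[n-1] = min, emit ascending */
    for (size_t i = 0; i < n; i++)
        out[i] = cell[n - 1 - i];
    free(cell);
}
```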

12 Parallel Insertion Sort (BubbleSort) – MAP C source (excerpt)

#include   /* header name lost in transcription */

void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno)
{
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
    wait_DMA(0);
    ....
    while (arrayindex < arraysize) {
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;
        while (arrayindex < endarrayindex) {
            for (i = arrayindex; i <= endarrayindex; i++) {
                data_high_in = a[i];  data_low_in = b[i];
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);
                c[i] = data_high_out;  d[i] = data_low_out;

13 Parallel Heapsort
Tree structure of cells
 Asynchronous operation with acknowledged data transfer
 Merges sorted lists in Θ(n lg n)
 Designed for independent BRAM block accesses
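The merge performed by the cell tree can be sketched in ordinary C as a k-way merge through a binary min-heap. The hardware runs the tree nodes concurrently with handshaked transfers; this sequential sketch (all names assumed, not from the SRC design) only reproduces the ordering behaviour of merging sorted lists.

```c
#include <stdint.h>
#include <stddef.h>

/* One heap entry: the current head key of one sorted input list. */
typedef struct { uint64_t key; size_t list, pos; } node;

static void sift_down(node *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        node t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge k sorted lists (lists[i] has len[i] keys) into out.
   Capacity 64 mirrors a small fixed leaf count, like the 48-leaf tree. */
void kmerge(const uint64_t **lists, const size_t *len, size_t k, uint64_t *out)
{
    node heap[64];
    size_t n = 0, o = 0;

    for (size_t i = 0; i < k; i++)
        if (len[i]) heap[n++] = (node){ lists[i][0], i, 0 };
    for (size_t i = n / 2; i-- > 0; )
        sift_down(heap, n, i);          /* build initial min-heap */

    while (n) {
        node top = heap[0];
        out[o++] = top.key;             /* emit current minimum */
        if (++top.pos < len[top.list])  /* refill from the same list */
            heap[0] = (node){ lists[top.list][top.pos], top.list, top.pos };
        else
            heap[0] = heap[--n];        /* list exhausted: shrink heap */
        sift_down(heap, n, 0);
    }
}
```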

14 Parallel Heapsort
BRAM Limitations
 144 Block RAMs at 512 × 32-bit values each hold relatively few 128-bit values
OnBoard Memory
 SRC constraint: up to 64 reads and 8 writes in one MAP C file
 Cascading clock delays as the number of reads increases
 Explore MUXed access: search and update only 6 of 48 leaf nodes at a time, in round-robin fashion

15 FPGA Initial Results
Baseline: one V26000
 PAR options: -ol high -t 1
Bubblesort results – 100 cells
 29,354 slices (86%)
 37,131 LUTs (54%)
 13.608 ns = 73 MHz (verified operational at 100 MHz)
Heapsort results – 95 cells (48 leaves)
 21,011 slices (62%)
 24,467 LUTs (36%)
 11.770 ns = 85 MHz (verified operational at 100 MHz)

16 Testing Procedures
All tests utilize one chip for baseline results
Evaluate fastest software radix of operation
Hardware/software partitioning
 Five cases; case 5 utilizes FPGA reconfiguration
 Data size partitioning: 100, 500, 1000, 5000, 10000
 10 runs for each test case / data partitioning combination
 List size: 500,000 values

17 Results
Fastest software operations (baseline)
 Comparison of radixsort and heapsort combinations
  Radix 4, 8, and 16 evaluated
  Minimum time: radix-8 radixsort + heapsort (size = 5000 or 10000)
 Radix-16 has too many buckets for the sort-size partitions evaluated
 Heapsort comparisons are faster than radixsort index updates
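Assuming radix-N here denotes N digit bits per pass (so radix-8 means 256 buckets over 16 passes of a 128-bit key), the pass/bucket trade-off behind the radix-16 observation can be made explicit; the helper names below are mine, for illustration only.

```c
/* For a d-bit digit over 128-bit keys:
   passes  = 128 / d   (fewer passes as the digit widens)
   buckets = 2^d       (bucket count grows exponentially)
   Radix-16's 65,536 buckets dwarf the 5,000-10,000 element
   partitions tested, so most buckets sit empty yet still
   cost index bookkeeping each pass. */
unsigned radix_passes(unsigned digit_bits)
{
    return 128u / digit_bits;
}

unsigned long radix_buckets(unsigned digit_bits)
{
    return 1ul << digit_bits;
}
```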

18 Results
Fastest SW-only time = 3.41 sec; fastest time including HW = 3.89 sec
 Bubblesort (HW), heapsort (SW)
 Partition list size of 1000
Heapsort times
 Dominated by data access
 Significantly slower than software

19 Results – Bubblesort vs. Radixsort
Some cases where HW is faster than SW
 List sizes < 5000
 SRC pipelined data access
 Fastest SW case was for list size = 10000
MAP data transfer time less significant than data processing time
 For size = 1000: input (11.3%), analyze (76.9%), output (11.5%)

20 Results – Limitations
Heapsort is limited by the overhead of input servicing
 Random accesses of OBM are not ideal
 Overhead of loop search; sequentially dependent processing
Bubblesort is limited by the number of cells
 Can increase by approximately 13 cells
 Two-chip streaming
Reconfiguration time assumed to be a one-time setup factor
 Exception: the reconfiguration case; solve by having a core per V26000

21 Conclusions
Pipelined, systolic designs are needed to overcome the microprocessor's speed advantage
 Bubblesort works well on small data sets
 Heapsort's random data access cannot exploit SRC benefits
SRC's high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs

22 Future Work
Heapsort's random data access cannot exploit SRC benefits
 Look for possible speedups using BRAM
 Unroll leaf memory access
 Exploit the SRC "periodic macro" paradigm
Currently evaluating radix sort in hardware
 Works better than bubblesort for larger sort sizes
Compare MAP C to VHDL where baseline VHDL is faster than SW

