Accelerating Applications using FPGAs Satnam Singh, Microsoft Research, Cambridge UK.

A Heterogeneous Future

Example Speedup: DNA Sequence Matching

Why are regular computers not fast enough?

FPGAs are the Lego of Hardware

– multiple independent multi-ported memories
– fine-grain parallelism and pipelining
– hard and soft embedded processors

The heart of an FPGA

LUT4 (OR)

LUT4 (AND)

LUTs are higher order functions
[Figure: lut1, lut2, lut3 and lut4 blocks with inputs i0 to i3 and output o]
inv = lut1 not
and2 = lut2 (&&)
mux = lut3 (\s d0 d1 -> if s then d1 else d0)
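The view of a LUT as a higher-order function can be sketched in software. The Python model below is only an illustration, not the library used in the talk: the helper name `lut` and the dictionary used as a truth table are assumptions made for this sketch. `lut(n, f)` evaluates `f` once for every input combination to fill a table with 2**n entries, and the returned circuit then only indexes that table, much as a hardware LUT indexes its configuration bits.

```python
from itertools import product


def lut(n, f):
    """Model an n-input LUT: tabulate f over all 2**n inputs, then look up."""
    # The table plays the role of the LUT's configuration bits.
    table = {bits: bool(f(*bits)) for bits in product((False, True), repeat=n)}

    def circuit(*inputs):
        if len(inputs) != n:
            raise ValueError(f"expected {n} inputs, got {len(inputs)}")
        return table[tuple(bool(i) for i in inputs)]

    return circuit


# The three definitions from the slide, restated in this model.
inv = lut(1, lambda a: not a)
and2 = lut(2, lambda a, b: a and b)
mux = lut(3, lambda s, d0, d1: d1 if s else d0)

if __name__ == "__main__":
    print(inv(True), and2(True, True), mux(False, True, False))
```

After construction the supplied function is never called again, so any Boolean function of n inputs costs the same single lookup.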

FPGAs as Co-Processors
– XD2000i: FPGA in-socket accelerator for Intel FSB
– XD2000F: FPGA in-socket accelerator for AMD socket F
– XD1000: FPGA co-processor module for socket 940

What kind of problems fit well on FPGA?

[Figure labels: opportunity; scientific computing, data mining, search, image processing, financial analytics; challenge]

Fibonacci Example 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765,...

entity fib is
  port (signal clk, rst : in bit ;
        signal fibnr : out natural) ;
end entity fib ;
architecture behavioural of fib is
  signal lastFib, currentFib : natural ;
begin
  compute_fibs : process
  begin
    wait until clk'event and clk='1' ;
    if rst = '1' then
      lastFib <= 0 ;
      currentFib <= 1 ;
    else
      currentFib <= lastFib + currentFib ;
      lastFib <= currentFib ;
    end if ;
  end process compute_fibs ;
  fibnr <= currentFib ;
end architecture behavioural ;
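The two signal assignments in the else branch do not run one after the other. In VHDL both take effect together once the process suspends, so each right-hand side reads the register values from before the clock edge. A minimal Python model of that behaviour is given below. The function name `fib_circuit` and the choice to record the output after every edge are assumptions of this sketch.

```python
def fib_circuit(cycles):
    """Simulate the fib process: one reset edge, then `cycles` clock edges."""
    # Reset edge: rst = '1' loads the two registers.
    last_fib, current_fib = 0, 1
    outputs = [current_fib]  # value of fibnr after the reset edge
    for _ in range(cycles):
        # Both registers update from the pre-edge values, which the
        # simultaneous tuple assignment mirrors.
        last_fib, current_fib = current_fib, last_fib + current_fib
        outputs.append(current_fib)
    return outputs


if __name__ == "__main__":
    print(fib_circuit(6))
```

Writing the two updates as ordinary sequential statements would make `last_fib` pick up the new `current_fib`, which is not what the circuit does.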

demonstration...

[Figure: data parallel descriptions targeting FPGA hardware (VHDL), GPU code (Accelerator) and SMP C++]

The Accidental Semi-colon ;

Kiwi
[Figure: three styles of description. Structural: gate-level VHDL/Verilog. Imperative (C): C-to-gates, with statements sequenced by ";" (jpeg.c). Parallel imperative: Kiwi, with thread 1, thread 2 and thread 3]

[Figure: the Kiwi Library (Kiwi.cs) and the circuit model (JPEG.cs) feed two paths. Visual Studio: multi-thread simulation, debugging, verification. Kiwi Synthesis: circuit implementation (JPEG.v)]

[Figure: a parallel C# program with Thread 1, Thread 2 and Thread 3; each part passes through C to gates to give a circuit, and the circuits are combined into Verilog for the system]

Our Implementation
Use regular Visual Studio technology to generate a .NET IL assembly language file. Our system then processes this file to produce a circuit:
– The .NET stack is analyzed and removed.
– The control structure of the code is analyzed and broken into basic blocks, which are then composed.
– The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.

System Composition
We need a way to separately develop components and then compose them together.
Don’t invent new language constructs: reuse existing concurrency machinery.
Adopt single-place channels for the composition of components.
Model channels with regular concurrency constructs (monitors).

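Writing to a Channel
using System.Threading;
public class Channel<T>
{
    T datum;
    bool empty = true;
    public void Write(T v)
    {
        lock (this)
        {
            // Block until the single slot is free.
            while (!empty) Monitor.Wait(this);
            datum = v;
            empty = false;
            // Wake any reader waiting for data.
            Monitor.PulseAll(this);
        }
    }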

Reading from a Channel
    public T Read()
    {
        T r;
        lock (this)
        {
            // Block until a value has been written.
            while (empty) Monitor.Wait(this);
            empty = true;
            r = datum;
            // Wake any writer waiting for the slot to empty.
            Monitor.PulseAll(this);
        }
        return r;
    }
}
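The same monitor pattern can be tried out in Python, where `threading.Condition` plays the role of `lock` plus `Monitor.Wait` / `Monitor.PulseAll`. This is a sketch of the pattern, not Kiwi's implementation. The class name `Channel` mirrors the slide, while the helper `transfer` is a hypothetical name added here to exercise it.

```python
import threading


class Channel:
    """A one-place channel: a writer blocks while full, a reader while empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self._empty = True
        self._datum = None

    def write(self, value):
        with self._cond:
            while not self._empty:      # wait until the slot is free
                self._cond.wait()
            self._datum = value
            self._empty = False
            self._cond.notify_all()     # wake a waiting reader

    def read(self):
        with self._cond:
            while self._empty:          # wait until a value arrives
                self._cond.wait()
            self._empty = True
            value = self._datum
            self._cond.notify_all()     # wake a waiting writer
            return value


def transfer(values):
    """Send values through one channel from a writer thread to the caller."""
    chan = Channel()
    writer = threading.Thread(target=lambda: [chan.write(v) for v in values])
    writer.start()
    received = [chan.read() for _ in values]
    writer.join()
    return received


if __name__ == "__main__":
    print(transfer([1, 2, 3]))
```

The waits sit inside `while` loops rather than `if` tests because a woken thread must re-check the condition: `notify_all` wakes readers and writers alike.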

[Figure: layers of abstraction. Systems level concurrency constructs: threads, events, monitors, condition variables. Above these: rendezvous, join patterns, transactional memory, data parallelism. At the top: user applications, domain specific languages]

class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;
    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }
    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();
        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
    }
}
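The producer and consumer arrangement can be modelled in software to see what flows out of the second channel. The sketch below rests on some assumptions. `queue.Queue(maxsize=1)` stands in for a one-place channel, the clock tick marked by `Kiwi.Pause()` is left out, and `run_pipeline` is a hypothetical helper that drains the second channel, which the slide's program never reads.

```python
import queue
import threading


def run_pipeline(count=10):
    """Producer -> chan1 -> consumer (doubles) -> chan2 -> caller."""
    chan1 = queue.Queue(maxsize=1)   # stand-in for a one-place channel
    chan2 = queue.Queue(maxsize=1)

    def producer():
        for i in range(count):
            chan1.put(i)             # blocks while chan1 is full

    def consumer():
        while True:                  # runs forever, like the hardware thread
            chan2.put(2 * chan1.get())

    # Daemon threads so the endless consumer does not keep the program alive.
    threading.Thread(target=producer, daemon=True).start()
    threading.Thread(target=consumer, daemon=True).start()
    return [chan2.get() for _ in range(count)]


if __name__ == "__main__":
    print(run_pipeline())
```

Because each channel holds one value, neither thread can run ahead of the other by more than one item, and that is the only synchronization in the model.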

Filter Example
[Figure legend: thread; one-place channel]

public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}

Transposed Filter

static void Tap(int i, byte w, Kiwi.Channel<byte> xIn, Kiwi.Channel<int> yIn, Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
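Each tap computes `x * w + y`, the multiply-accumulate step of a transposed-form FIR filter: every tap sees the newest sample, and partial sums move from tap to tap through registers. The Python sketch below compares a direct-form filter with a transposed-form one and checks that they agree. The function names and the register ordering are assumptions chosen for this illustration, and the indexing direction in the Kiwi code may differ.

```python
def fir_direct(weights, samples):
    """Direct form: keep a window of recent samples, newest first."""
    window = [0] * len(weights)
    out = []
    for x in samples:
        window = [x] + window[:-1]           # shift in the new sample
        out.append(sum(w * v for w, v in zip(weights, window)))
    return out


def fir_transposed(weights, samples):
    """Transposed form: every tap sees the new sample, partial sums shift."""
    n = len(weights)
    regs = [0] * (n - 1)                     # registers between the taps
    out = []
    for x in samples:
        carried = regs + [0]                 # the last tap has no upstream sum
        sums = [weights[k] * x + carried[k] for k in range(n)]  # x * w + y
        out.append(sums[0])                  # the first tap drives the output
        regs = sums[1:]                      # the rest become the new registers
    return out


if __name__ == "__main__":
    w, xs = [1, 2, 3], [1, 0, 0, 4, 5]
    print(fir_direct(w, xs), fir_transposed(w, xs))
```

In the transposed form no tap waits on a long adder chain, since each one performs a single multiply and a single add per sample. That independence is what lets every tap become its own thread.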

Inter-thread Communication and Synchronization
// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}

// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate()
    {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]);
    });
    tapThread.Start();
}
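The quiz concerns closure capture. The anonymous delegate captures the variable, not its value at the time the delegate is created, and a C# `for` loop has one `i` shared by all iterations. A thread that starts late could therefore see an `i` that has already moved on. Declaring `j` inside the loop body gives each delegate its own variable. Python closures bind late in a similar way, so the effect can be reproduced in a short sketch. The function names are hypothetical, and the default-argument idiom stands in for the local copy.

```python
def taps_sharing_loop_variable(size):
    """Every closure refers to the same loop variable i."""
    taps = []
    for i in range(size):
        taps.append(lambda: i)
    # By the time the closures run, the loop has finished.
    return [tap() for tap in taps]


def taps_with_local_copy(size):
    """Each closure gets its own copy, like the local j on the slide."""
    taps = []
    for i in range(size):
        taps.append(lambda j=i: j)   # j is bound when the lambda is created
    return [tap() for tap in taps]


if __name__ == "__main__":
    print(taps_sharing_loop_variable(3))
    print(taps_with_local_copy(3))
```

In the first function all closures report the final value of the loop variable, while the second yields one distinct index per closure.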

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.Research.DataParallelArrays;
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;
namespace ForOxford
{
    class Program
    {
        static void Main(string[] args)
        {
            PA.InitGPU();
            IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
            IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
            IPA is3 = new IPA(4, is1.Shape);
            is3 = PA.Add(is1, is2);
            IPA result = PA.Evaluate(is3);
            int[] ra1;
            PA.ToArray(result, out ra1);
            foreach (int i in ra1)
                Console.Write(i + " ");
            Console.WriteLine("");
        }
    }
}

Example: Bitmap Blur (Using Accelerator v1.1.1)
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;
float[,] Blur (float[] kernel)
{
    FPA pa = new FPA(bitmap);
    // Convolve in X direction
    FPA resultX = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultX += PA.Shift(pa, 0, i) * kernel[i];
    }
    // Convolve in Y direction.
    FPA resultY = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultY += PA.Shift(resultX, i, 0) * kernel[i];
    }
    float [,] result;
    PA.ToArray (resultY, out result);
    return result;
}

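Expression Graphs
FPA pa = new FPA(bitmap);
// Convolve in X direction
FPA rX = new FPA(0, pa.Shape);
for (int i = 0; i < kernel.Length; i++)
{
    rX += PA.Shift(pa, 0, i) * kernel[i];
}
[Figure: expression graph for rX. pa feeds Shift (0,0), whose result is multiplied by k[0] and added to rX; pa also feeds Shift (0,1), whose result is multiplied by k[1] and added, … giving the final rX]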
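The loop above performs no arithmetic on the data. Each `+=` and `*` only adds a node to a graph, and the work happens later when the graph is evaluated. A one-dimensional Python sketch of that idea is given below. Every name in it (`Node`, `const`, `shift`, `evaluate`, `blur_graph`) is hypothetical, and the edge behaviour of `shift`, which clamps to the nearest valid element, is an assumption of the sketch and not a statement about `PA.Shift`.

```python
class Node:
    """One node of a lazily built expression graph over 1-D arrays."""

    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Node("add", self, other)      # record the add, do not compute

    def __mul__(self, scalar):
        return Node("scale", self, scalar)   # record the multiply


def const(values):
    return Node("const", list(values))


def shift(node, k):
    return Node("shift", node, k)


def evaluate(node):
    """Walk the graph and compute the array it describes."""
    if node.op == "const":
        return list(node.args[0])
    if node.op == "add":
        a, b = evaluate(node.args[0]), evaluate(node.args[1])
        return [x + y for x, y in zip(a, b)]
    if node.op == "scale":
        return [x * node.args[1] for x in evaluate(node.args[0])]
    if node.op == "shift":
        data, k = evaluate(node.args[0]), node.args[1]
        last = len(data) - 1
        # Assumed edge rule: indices past either end clamp to the border.
        return [data[min(max(i + k, 0), last)] for i in range(len(data))]
    raise ValueError(f"unknown op {node.op}")


def blur_graph(values, kernel):
    """Build (but do not evaluate) the convolution graph from the slide."""
    pa = const(values)
    acc = const([0] * len(values))
    for i, k in enumerate(kernel):
        acc = acc + shift(pa, i) * k
    return acc


if __name__ == "__main__":
    graph = blur_graph([1, 2, 3, 4], [1, 2])
    print(graph.op, evaluate(graph))
```

Because the whole computation exists as data before anything runs, a back end is free to compile it for a GPU, a multicore CPU or an FPGA.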

class Program
{
    static void Main(string[] args)
    {
        IPA.InitGPU();
        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});
        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});
        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply (ipa4, (PA.Subtract (ipa3, PA.Add(ipa1, ipa2))));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}

class Program
{
    static void Main(string[] args)
    {
        IPA.InitFPGA();
        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});
        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});
        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply (ipa4, (PA.Subtract (ipa3, PA.Add(ipa1, ipa2))));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
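Whichever target is initialised, the program describes the same computation: `ipa4 * (ipa3 - (ipa1 + ipa2))`. Assuming `PA.Add`, `PA.Subtract` and `PA.Multiply` act element by element, the expected values can be worked out with a few lines of plain Python. The helper `elementwise` is a hypothetical name used only here.

```python
import operator


def elementwise(op, xs, ys):
    """Apply a binary operator to corresponding elements of two lists."""
    if len(xs) != len(ys):
        raise ValueError("arrays must have the same length")
    return [op(x, y) for x, y in zip(xs, ys)]


ipa1 = [1, 2, 3, 4, 5]
ipa2 = [10, 20, 30, 40, 50]
ipa3 = [21, 5, 7, 4, 8]
ipa4 = [4, 1, 7, 2, 5]

total = elementwise(operator.add, ipa1, ipa2)       # [11, 22, 33, 44, 55]
diff = elementwise(operator.sub, ipa3, total)       # [10, -17, -26, -40, -47]
result = elementwise(operator.mul, ipa4, diff)      # [40, -17, -182, -80, -235]

if __name__ == "__main__":
    print(" ".join(str(v) for v in result))
```

These values give a reference against which the GPU and FPGA runs can be compared.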

with addr select
  net_7 <= 10 when 0,
           20 when 1,
           30 when 2,
           40 when 3,
           50 when 4;
process
begin
  wait until clk'event and clk='1' ;
  net_5 <= net_6 + net_7 ;
end process ;
process
  type net_4_delay_type is array (0 to 1) of integer ;
  variable net_4_delayed : net_4_delay_type ;
begin
  wait until clk'event and clk='1' ;
  net_4_delayed(0) := net_4_delayed(1) ;
  net_4_delayed(1) := net_4 ;
  net_3 <= net_4_delayed(0) - net_5 ;
end process ;
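The second process delays `net_4` so that it meets `net_5` at the right time. The registered adder makes the sum available one clock edge after its inputs, so the value being subtracted from has to be held back by the same amount. A small Python model of the two processes is given below. It assumes that every register starts at 0 and that the three inputs present one element per clock edge, and `simulate` is a hypothetical name.

```python
def simulate(net_4, net_6, net_7):
    """Step both clocked processes together; return net_3 after each edge."""
    net_5 = 0                 # registered sum (assumed to start at 0)
    delayed = [0, 0]          # the net_4_delayed variable array
    trace = []
    for a, b, c in zip(net_4, net_6, net_7):
        # Signals read their pre-edge values, so the subtraction sees the
        # sum registered on the previous edge.
        old_net_5 = net_5
        net_5 = b + c
        # Variables update immediately, in statement order.
        delayed[0] = delayed[1]
        delayed[1] = a
        trace.append(delayed[0] - old_net_5)
    return trace


if __name__ == "__main__":
    print(simulate([21, 5, 7, 4, 8], [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

From the second edge onward each output pairs a `net_4` value with the sum computed from the inputs that arrived alongside it.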

8.249 ns max delay; 3 x DSP48Es; 63 slice registers; 24 slice LUTs

let rec bfly r n = match n with 1 -> r | n -> ilv (bfly r (n-1)) >-> evens r

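Cryptol
as = [0x3F 0xE2 0x65 0xCA] # new;
new = [| a ^ b ^ c || a <- as || b <- drop(1,as) || c <- drop(3,as) |];
[Figure: the stream as, seeded with 3F, E2, 65, CA; XOR gates combine elements of as to produce new, which is appended to as]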
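The two definitions describe a self-referential stream: `as` begins with four seed bytes and continues with `new`, and element n of `new` is the XOR of elements n, n+1 and n+3 of `as`. A Python generator can unroll that recurrence. This is a sketch of the stream's meaning, not of how Cryptol compiles it to hardware, and the name `stream` is hypothetical.

```python
from collections import deque
from itertools import islice


def stream(seed=(0x3F, 0xE2, 0x65, 0xCA)):
    """Yield the elements of `as`: the seed, then each new element in turn."""
    window = deque(seed, maxlen=4)   # the last four elements of the stream
    yield from seed
    while True:
        # a, b, c are taken at offsets 0, 1 and 3 from the current position.
        new = window[0] ^ window[1] ^ window[3]
        window.append(new)           # the oldest element falls out
        yield new


if __name__ == "__main__":
    print([f"{v:02X}" for v in islice(stream(), 6)])
```

Only the last four elements are ever needed, which matches a hardware reading of the program as a four-stage shift register with an XOR feedback network.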

Bluespec
rule enqueueSOFData (rx_src_rdy_n_input == 0 && rx_sof_n_input == 0 && recv_state == Ready_for_frame) ;
  fifo_in.enq (rx_data_input) ;
  recv_state <= Reading_frame ;
endrule

Esterel
[Figure: an Esterel design alongside void uart_device_driver () {..... } in uart.c. VHDL, Verilog -> hardware implementation; C -> software implementation]

Some Challenges for Spatial Computing
Language support:
– Specifying resources.
– Specifying memory organization.
– Specifying timing.
– Specifying control.
– Models of computation.
Co-design and verification.
System integration (OS APIs).
AWFUL AWFUL AWFUL vendor tools.

Some Challenges for Heterogeneous Systems
A single model for programming very different kinds of computational elements?
Giving up abstractions
– memory
Constant failure.
– dynamically re-mapping computations

Questions?
