
Slide 1: Switch
EECS 252 – Spring 2006, RAMP Blue Project
Jue Sun and Gary Voronel
Electrical Engineering and Computer Sciences, University of California, Berkeley
May 1, 2006

Slide 2: Outline
- Goal of the switch
- Implementation
- Performance
- Future implementation
- Current state of the project
- Project experience

Slide 3: One Piece of the Puzzle
- The main goal of RAMP Blue is to build a large-scale system
- To do useful work, processors must be able to communicate
- Therefore, we need an interconnection network

Slide 4: Implementation Goals
1. Support communication between all processors in the system
2. Flexible hardware allowing parameterization of global system constants, especially the number of Microblaze cores per FPGA
3. Minimal resource utilization
4. High throughput
5. Low latency
6. Simple, homogeneous hardware
7. Simple software interface

Slide 5: Hardware Design Constraints
- RAMP Blue will be implemented on the BEE2
- 4 user FPGAs per BEE2 board
- 2 LVCMOS links for FPGA-to-FPGA communication
  - Relatively low latency (2 or 3 cycles)
  - Throughput: more than 64 bits
- 16 MGT links per board (4 per FPGA) for board-to-board communication
  - Relatively high latency (20 or more cycles)
  - Throughput: 32 or 64 bits
- To achieve the lowest latency possible, we limit packet routes to at most 1 MGT link
- 16 Microblaze cores per FPGA (64 per board)
  - Depending on resource utilization, the number of cores per FPGA may need to be reduced

Slide 6: Physical Topology
- The topology is fixed and homogeneous throughout the system
  - Each FPGA is directly connected to 2 other FPGAs on the same board and to 4 other boards
  - The number of cores per FPGA is the same on every FPGA
- Each board has a direct connection to every other board in the system (maximum of 17 boards)
  - BOARD n connects to BOARD 16 through MGT n
  - With 16 cores per FPGA, 17 boards support 1088 processors!

Slide 7: Board-Level Connectivity (diagram)

Slide 8: FPGA-Level Connectivity (diagram)
For clarity, the configuration shown has 4 Microblaze cores per FPGA.

Slide 9: Switch Fabric Specifications
- Crossbar switch with maximal connectivity
  - Every Microblaze can access every other Microblaze on the same FPGA directly
  - Every Microblaze can access both LVCMOS links
  - Every Microblaze can access all FPGA-local MGT links
- Buffering on inputs and outputs
  - Store-and-forward buffers for the Microblazes, to decrease complexity and simplify the software interface
  - Cut-through buffers for the LVCMOS links
  - MGT links are wrapped XAUI cores that already have internal buffers

Slide 10: Microblaze-Level Connectivity (diagram)
For clarity, the configuration shown has 4 Microblaze cores per FPGA.

Slide 11: Switch Overall (diagram)

Slides 12–20: Scheduler (animation; the same text repeats on each frame)
- If two ports want to send to the same output port at the same time, the lower-numbered port is allowed to send first
- Other control logic, not shown here, implements the protocol between the switch and the buffers

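The rule above is plain fixed-priority arbitration. A minimal C model of one arbitration round follows; NUM_PORTS and the request encoding are illustrative, not the actual hardware interface:

    #define NUM_PORTS  8          /* illustrative port count */
    #define NO_REQUEST (-1)

    /* One round of the fixed-priority arbitration described above.
     * request[i] = output port that input port i wants, or NO_REQUEST.
     * grant[o]   = input port granted output o, or NO_REQUEST.
     * Scanning inputs in ascending order is exactly what makes the
     * lowest-numbered requester win. */
    void arbitrate(const int request[NUM_PORTS], int grant[NUM_PORTS])
    {
        for (int o = 0; o < NUM_PORTS; o++)
            grant[o] = NO_REQUEST;
        for (int i = 0; i < NUM_PORTS; i++) {
            int o = request[i];
            if (o != NO_REQUEST && grant[o] == NO_REQUEST)
                grant[o] = i;     /* output o is claimed for this cycle */
        }
    }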

Slide 21: Source Routing
- The fixed topology allows a straightforward source-routing implementation
- Destination routing would be more robust, but would require significantly more resources and greater complexity
- The packet header is extremely simple: just a concatenated sequence of hops
- Minimal hardware is required to determine the next hop and adjust the header at every hop (zero LUTs used – can't get better than that!)
  - The next hop is encoded in the lowest bits of the header
  - To adjust the header, the hardware simply shifts out the lowest bits
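
Because the next hop sits in the low bits and advancing the route is just a shift, the per-hop logic can be modeled in a few lines of C (a sketch; HOP_BITS follows the 5-bit encoding on the next slide):

    #include <stdint.h>

    #define HOP_BITS 5                       /* per-hop field width (parameterizable) */
    #define HOP_MASK ((1u << HOP_BITS) - 1)

    /* Peel the next hop off the source-route header: the low bits select
     * the output port, and a right shift exposes the following hop for
     * the next switch. This mirrors the zero-LUT hardware behavior. */
    static uint32_t next_hop(uint32_t *header)
    {
        uint32_t hop = *header & HOP_MASK;
        *header >>= HOP_BITS;
        return hop;
    }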

Slides 22–23: Source Routing – Hop Encoding
- Need 5 bits to represent each hop
  - Must be able to encode 16 cores per FPGA + 4 MGT links + 2 LVCMOS links = 22 total encodings (+1 for a FIN code)
  - If 8 or fewer cores per FPGA are used, each hop can be represented using only 4 bits (the hardware supports parameterization of the hop encoding width)
- Maximum of 6 hops, based on the physical topology
  - MGT links are constrained to 1 hop per route
  - Therefore, the worst-case route is: LVCMOS → LVCMOS → MGT → LVCMOS → LVCMOS → MB
- The hop encoding allows the header to fit into 1 word
  - 6 hops x 5 bits/hop = 30 bits

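Building the header at the source is then just packing the hops low-bits-first, so the first hop comes out first. A sketch under the same assumed encoding (the actual hop code values, and how the FIN code terminates shorter routes, are not spelled out in the slides):

    #include <stdint.h>

    #define HOP_BITS 5   /* 5 bits per hop; 4 would suffice for 8 or fewer cores/FPGA */

    /* Pack up to 6 hops into one 32-bit header, first hop in the lowest
     * bits so the first switch sees its hop without any decoding
     * (6 hops x 5 bits = 30 bits, so the worst case fits in one word). */
    static uint32_t pack_route(const uint32_t hops[], int nhops)
    {
        uint32_t header = 0;
        for (int i = nhops - 1; i >= 0; i--)   /* push the last hop in first */
            header = (header << HOP_BITS) | hops[i];
        return header;
    }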

Slide 24: Source Routing – Global Naming
- Processors are globally named
  - Necessary to reach the goal of a simple software interface
  - If there are 16 cores per FPGA with 4 FPGAs per board and 17 boards, the processors are numbered 0–1087
- The naming scheme scales down with fewer cores
  - Necessary to support parameterization of global system constants (especially the number of cores per FPGA)
  - If there are 4 cores per FPGA with 4 FPGAs per board and 17 boards, the processors are numbered 0–271
- An invalid processor number triggers an error at the software level
  - Again, this supports a simple software interface
  - Ensures that only packets with valid headers enter the network
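
The global name decomposes arithmetically into (board, FPGA, core), which is what lets software turn two processor numbers into a route. A sketch, assuming dense numbering with the core index varying fastest (the exact field order is an assumption, not stated in the slides):

    #define CORES_PER_FPGA  16   /* parameterizable, per the design goals */
    #define FPGAS_PER_BOARD 4

    typedef struct { int board, fpga, core; } proc_loc;

    /* Split a global processor number into its (board, fpga, core)
     * coordinates. With 16 cores/FPGA this covers IDs 0-1087 over 17
     * boards, and with 4 cores/FPGA it covers 0-271, matching the slide. */
    static proc_loc locate(int proc_id)
    {
        proc_loc loc;
        loc.core  = proc_id % CORES_PER_FPGA;
        loc.fpga  = (proc_id / CORES_PER_FPGA) % FPGAS_PER_BOARD;
        loc.board = proc_id / (CORES_PER_FPGA * FPGAS_PER_BOARD);
        return loc;
    }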

Slides 25–26: Source Routing Example
- For simplicity, assume there are 4 cores per FPGA
- We send from processor #10 to processor #24 (representative of the worst-case path)


Slides 27–29: Source Routing Example (continued)
- The destination core is on a different board, so the packet must first be routed from the source FPGA (FPGA 2) to the FPGA that is connected to the destination board (FPGA 0)
- This requires 2 hops over the LEFT LVCMOS link


Slides 30–31: Source Routing Example (continued)
- Once at the proper FPGA, the packet can be sent across the MGT link to an FPGA on the destination board


Slides 32–34: Source Routing Example (continued)
- Then the packet must be routed to the destination FPGA, which requires 2 more LVCMOS hops


Slides 35–36: Source Routing Example (continued)
- Finally, the packet must be forwarded to the destination Microblaze core


Slide 37: Source Routing Example (continued)
- Each arrowhead represents a hop – it takes 5 hops to reach the destination FPGA
- One more hop sends the packet to the destination Microblaze core, totaling 6 hops in the worst case
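
To tie this to the naming scheme: with 4 cores per FPGA, processor #10 decomposes (under the dense-numbering assumption sketched earlier) to board 0, FPGA 2, core 2, and processor #24 to board 1, FPGA 2, core 0. The route is then LEFT, LEFT (FPGA 2 → FPGA 0 on the source board), one MGT hop to the destination board, two LVCMOS hops to the destination FPGA, and a final hop into core 0 – six hops, or 6 x 4 = 24 header bits at the 4-bit hop width that suffices for 4 cores per FPGA.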

Slides 38–39: Source Routing – 17th Board
- To support the 17th board, each board communicates with the 17th board through the MGT link matching its own board number


Slide 40: Source Routing – 17th Board (continued)
- For example, for BOARD 0 to send to BOARD 16, it sends over MGT 0
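
This 17th-board rule is the only inter-board MGT assignment the slides pin down; a sketch encoding just that rule (the general board-to-board MGT map is left unspecified here rather than guessed):

    /* Which MGT link board `src` should use to reach board `dst`.
     * The slides only state the 17th-board rule: BOARD n reaches
     * BOARD 16 over MGT n. */
    static int mgt_for_board(int src, int dst)
    {
        if (dst == 16)
            return src;   /* BOARD n -> BOARD 16 over MGT n */
        return -1;        /* general mapping not specified in the slides */
    }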

Slide 41: Microblaze Interface
- Store and forward
- Connecting through the FSL bus for now
- Essentially double buffered
- MB FSL access is extremely slow compared to the switch delay – even with the fastest compilation and the most efficient code, it takes 48 cycles to write one value to the FSL bus!
- Example: send from an MB to an LVCMOS link, loop back over the LVCMOS link, and then back to the MB
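
For concreteness, a minimal sketch of pushing one fixed-length packet onto the switch's FSL port, assuming the standard Xilinx Microblaze FSL macros from mb_interface.h; the slot id and packet length below are illustrative, not the project's actual values:

    #include "mb_interface.h"   /* Xilinx putfsl/getfsl FSL macros */

    #define SWITCH_FSL_ID 0     /* hypothetical FSL slot wired to the switch  */
    #define PACKET_WORDS  16    /* hypothetical fixed packet length, in words */

    /* Blocking write of a packet (header word first, then payload).
     * Each putfsl costs on the order of 48 cycles, which is why the
     * FSL interface, not the switch fabric, dominates latency. */
    static void fsl_send_packet(const unsigned int *pkt)
    {
        for (int i = 0; i < PACKET_WORDS; i++)
            putfsl(pkt[i], SWITCH_FSL_ID);
    }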

Slide 42: LVCMOS Interface
- 2 cycles of latency
- Two buses connect each pair of FPGAs and can be used for anything
- The control bus and data bus are wired directly over LVCMOS, except that the data_full/free signal goes high 2 cycles before the buffer is actually full

Slide 43: XAUI Interface
- Much simplified because the XAUI core has an internal buffer
- Essentially just some control signals
- The interface has recently changed, so this is still in progress

Slide 44: Software Interface
- Simple interface to send and receive data
- int send(int src, int dest, byte *buf, int len)
  - Copies len bytes of buf into the local outgoing Buffer Unit
  - Constructs the source route from the src MB core to the dest MB core
  - Blocks until all data is copied
  - Returns the number of bytes sent, or -1 on error
- Receive is called by interrupt: int recv(byte *buf, int len)
  - Copies len bytes into buf from the local incoming Buffer Unit
  - Blocks until all data is received
  - Returns the number of bytes received, or -1 on error
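
A hedged usage sketch of this interface; the byte typedef, the core numbers, and the error handling are illustrative (recv is omitted since it is driven from the interrupt handler):

    typedef unsigned char byte;   /* assumed; the real headers presumably define this */

    int send(int src, int dest, byte *buf, int len);   /* prototype from the slide */

    void example(void)
    {
        byte msg[32] = "hello from core 10";
        /* Blocking send from processor #10 to processor #24; an invalid
         * destination number would be rejected here, in software, before
         * the packet ever enters the network. */
        if (send(10, 24, msg, sizeof msg) < 0) {
            /* handle error, e.g. invalid processor number */
        }
    }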

Slide 45: Simplifications
- A fixed packet length simplifies the control hardware
- The packet fits completely into every buffer in the system, so the entire packet can be transferred from hop to hop
- Once data transmission starts from the MB buffer, it is not interrupted until it reaches the MB input buffer
- Store-and-forward implementation of the MB buffers

Slide 46: Performance
- Latency 1 ≈ 48 x packet length, to write the packet into the FSL bus
- Latency 2 ≈ 2 x packet length, waiting for the MB buffer to fill
- Latency 3 ≈ 2 cycles for the switch transmission
- Latency 4 ≈ 48 x packet length, to read the packet from the FSL bus
- Bandwidth = 32 bits/cycle or 64 bits/cycle (the current FSL does not support 64 bits)
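
Worked example (assuming a hypothetical 16-word packet): 48x16 + 2x16 + 2 + 48x16 = 768 + 32 + 2 + 768 = 1570 cycles end to end – the FSL copies on both sides dominate, and the switch itself contributes almost nothing.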

Slide 47: Utilization on BEE2

Resource                 With Switch (16x32 FIFO)    Without Switch
BSCANs                   1 out of 1       (100%)     1 out of 1       (100%)
BUFGMUXs                 7 out of 16      (43%)      6 out of 16      (37%)
DCMs                     3 out of 8       (37%)      3 out of 8       (37%)
External DIFFMs          1 out of 496     (1%)       1 out of 496     (1%)
LOCed DIFFMs             1 out of 1       (100%)     1 out of 1       (100%)
External DIFFSs          1 out of 496     (1%)       1 out of 496     (1%)
LOCed DIFFSs             1 out of 1       (100%)     1 out of 1       (100%)
External IOBs            371 out of 996   (37%)      303 out of 996   (30%)
LOCed IOBs               371 out of 371   (100%)     303 out of 303   (100%)
MULT18X18s               14 out of 328    (4%)       14 out of 328    (4%)
RAMB16s                  35 out of 328    (10%)      27 out of 328    (8%)
SLICEs                   8136 out of 33088 (24%)     6901 out of 33088 (20%)

Note: Measured with a switch that connects 8 ports (2 MB, 2 LVCMOS links, no XAUI). All buffers are 32 bits wide and 16 words deep.

Slide 48: Future Implementation
- Switch topology change
- Allow variable packet length, using the FSL control bit
- DMA: 4 MBs share one DMA engine

Slide 49: "Associated Switch" (diagram)

Slide 50: Clustered Organization
- Microblaze cores are organized into clusters
  - Since there are 4 DIMMs on the BEE2, split into 4 clusters
- A NIC will coordinate the transfer of data for all MBs in a cluster
  - Faster transfers for MBs in the same cluster, because it uses DMA
  - Faster transfers overall, because data copying is done in hardware
- Only 4 bits per hop are needed now, but an extra hop is required

Slide 51: What's Working NOW!!
- Switch @ 100 MHz
- Source route generation
- Store-and-forward buffer for the MB
- TCL script and (partial) global parameterization
- Homogeneous hardware
- LVCMOS interface
- Single MB with switch booted on XUP
- Double MB with switch booted on BEE2

Slide 52: Almost Done / To Do
- Cut-through MB buffer
  - The bottleneck of copying data from software limits the performance gains of the cut-through version
- Test the XAUI / MGT link
- Interrupt controller
- Complete parameterization

Slide 53: Trouble Spots
- Tools
- Interfaces
- Putting multiple MBs on one FPGA
- Lack of infrastructure during the early stages

