Packet Switching on Raw


1 Packet Switching on Raw
Research Qualifying Exam Gleb A Chuvpilo January 28, 2005

2 Project Publications
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo and Saman Amarasinghe. In Proceedings of the International Conference on Parallel Processing (ICPP-03), Kaohsiung, Taiwan, Republic of China, October 6-9, 2003.
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture, Gleb A. Chuvpilo. S.M. Thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, August 2002.
RawNet: Network Processing on the Raw Processor, David Wentzlaff, Gleb A. Chuvpilo, Arvind Saraf, Saman Amarasinghe, and Anant Agarwal. In Research Abstracts of the MIT Laboratory for Computer Science, Cambridge, Massachusetts, March 2002.
Gigabit IP Routing on Raw, Gleb A. Chuvpilo, David Wentzlaff, and Saman Amarasinghe. In Proceedings of the 1st HPCA Workshop on Network Processors, Cambridge, Massachusetts, February 3, 2002.
Also, unpublished work on Network Calculus at the Computer Engineering and Networks Laboratory of the ETH Swiss Federal Institute of Technology.

3 Outline
Introduction
  Raw Processor Overview
  Internet Router Overview
Packet Switching on Raw
  Raw Router Architecture
  Rotating Crossbar Design for Switch Fabric
  Distributed Scheduling Algorithm
  Minimization and Scheduling
Results
Conclusion

4 Introduction

5 Goal Build an IP router on a general-purpose processor
Why?
Flexibility → new protocols and services
Price → economies of scale

6 Raw

7 Raw Processor A scalable computation fabric
4 x 4 mesh of tiles; each tile is a RISC microprocessor
Ultra-fast interconnect network
Exposes the wires to the compiler
Compiler orchestrates the communication

8 Raw Facts Performance 16 OPS/FLOPS per cycle
230 Gb/s of on-chip “bisection bandwidth”
201 Gb/s off-chip I/O bandwidth
57 GB/s of on-chip memory bandwidth

9 Raw Facts Layout Longest wire is the length of a tile → fast clocking
Each tile: MIPS R router + interconnect
32 KB IMEM
32 KB data cache
64 KB SMEM
→ 2 MB total per chip
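
The "2 MB total per chip" figure can be checked from the per-tile sizes on this slide and the 4 x 4 mesh from slide 7. A minimal sketch of that arithmetic; the dictionary and variable names are illustrative only:

```python
# Per-tile on-chip memory as listed on the slide, in KB.
TILE_MEMORY_KB = {"IMEM": 32, "data cache": 32, "SMEM": 64}
TILES = 4 * 4  # 4 x 4 mesh of tiles

per_tile_kb = sum(TILE_MEMORY_KB.values())  # memory in one tile
total_kb = per_tile_kb * TILES              # memory across the chip
total_mb = total_kb / 1024

print(per_tile_kb, "KB per tile")
print(total_mb, "MB per chip")
```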

10 Raw Facts Instruction Set Architecture
Eight-stage pipeline: FETCH, DECODE, RF/STALL, EXE, MUL, MEM, FPU
MIPS instruction set
28 general-purpose registers
4 register-mapped network ports
2-way set-associative cache, 3-cycle latency, 32-byte lines

11 Raw Facts Implementation ASIC @ 250 MHz Worst Case
122 million transistors (P4: 43 million)
18.2mm x 18.2mm die (P4: 15mm x 15mm)
1080 signal I/O pins
25 Watts
IBM SA-27E 6-layer metal copper 0.15μ process (P4: 0.13μ)

12 Raw Layout

13 Communication Mechanisms
2 static networks
2 dynamic networks

14 Static Networks Destinations known at compile time
Message size known at compile time
Cycle-by-cycle switch schedule
Three-cycle nearest-neighbor send-to-use latency
No processing overhead

15 Static Network: Send A tile wants to communicate a value to its southern neighbor

16 Static Network: Receive

17 Dynamic Networks Unpredictable events
External asynchronous interrupts
Cache misses
15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead)
Wormhole routed, two-stage pipelined, dimension-ordered
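
To illustrate what "dimension-ordered" means on a mesh, here is a minimal sketch of a route that finishes all movement in one dimension before starting the other. The function name `xy_route` and the choice of X before Y are assumptions made for illustration; the slide does not say which dimension Raw resolves first.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, moving along X first, then Y.

    src and dst are (x, y) tile coordinates on the mesh.
    The X-before-Y order is an assumption for this sketch.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # first resolve the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then resolve the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
print(route)
```

Because every message takes its turns in the same fixed order, the number of hops always equals the Manhattan distance between the two tiles.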

18 Routing

19 What is Routing? RM OSI… Let’s take a look at the Open Systems Interconnection Reference Model to figure out where routers stand.

20 IP Router Network Processor Switch Fabric Forwarding Engine Interface

21 Switch Fabric Cisco Gigabit Switch Router backplane interconnecting multiple line cards. A centralized scheduler connects to each line card and determines the configuration of the crossbar switch for each time slot

22 Click Modular Router Modular software router
MIT Parallel and Distributed OS Group
435, byte packets a second on a 700 MHz Pentium III (commodity hardware)
Flexible, configurable, and easy to understand
Interconnected collection of modules called elements

23 Click Modular Router Software router running on Intel x86 architecture

24 Packet Switching on Raw

25 Problem: Four Networks…
[Figure: four networks, labeled 1 to 4]

26 … and Sixteen Tiles:

27 What is the Mapping? Dynamic Communication → ? → Static Interconnect

28 Solution: Rotating Crossbar
[Figure: ROTATING CROSSBAR in the center, surrounded by PORT 0 to PORT 3; labels: Ingress Processor, Lookup Processor, Crossbar Processor, Egress Processor; inputs In 0 to In 3, outputs Out 0 to Out 3]
Notice the symmetry of the design. Now, let’s jump inside the center of the picture…

29 Switch Fabric Design The idea of a Token Ring network → absolute fairness
Algorithm uses two static networks; dynamic networks are idle
All deadlock-free configurations are scheduled at compile time
Four headers and token location define a global configuration
Global configuration is computed in a distributed manner at run time
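
The fairness idea can be sketched as a rotating-priority arbiter: the port holding the token chooses first, the others follow in ring order, and the token then moves on. This is a simplified illustration only. It ignores contention for the ring links inside the crossbar, and the names `schedule` and `next_token` are hypothetical, not taken from the Raw Router code.

```python
NUM_PORTS = 4


def schedule(headers, token):
    """Grant outputs to inputs, giving priority in ring order from the token.

    headers[i] is the output port requested by input i, or None if idle.
    Returns a dict mapping each granted input to its output.
    """
    grants = {}
    taken = set()
    for offset in range(NUM_PORTS):
        inp = (token + offset) % NUM_PORTS
        out = headers[inp]
        if out is not None and out not in taken:
            grants[inp] = out
            taken.add(out)
    return grants


def next_token(token):
    """Pass the token to the next port on the ring."""
    return (token + 1) % NUM_PORTS


headers = [2, 2, None, 0]          # inputs 0 and 1 both want output 2
first = schedule(headers, token=0)
second = schedule(headers, token=next_token(0))
print(first, second)
```

With the token at port 0, input 0 wins output 2; once the token rotates to port 1, input 1 wins it instead, so no input is starved.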

30 Rotating Crossbar Illustrated
[Figure: ROTATING CROSSBAR with PORT 0 to PORT 3; labels: Ingress Processor, Lookup Processor, Crossbar Processor, Egress Processor]

31 Rotating Crossbar Illustrated
[Figure: ROTATING CROSSBAR with PORT 0 to PORT 3; labels: Ingress Processor, Lookup Processor, Crossbar Processor, Egress Processor]

32 Phases of the Algorithm
[Figure: phases of the TILE PROCESSOR and SWITCH PROCESSOR; labels: headers_request, headers, send_prev_config, choose_new_config, route_body, update_token, confirm]
Pipelining = overlap routing with computation of configuration

33 Distributed Scheduling Algorithm
Let’s enumerate the configurations:
SPACE = |Hdr0| x … x |Hdr3| x |Token|, where |Hdr0| = … = |Hdr3| = 5 and |Token| = 4
→ therefore SPACE = 5^4 x 4 = 2,500 distinct configurations
The most straightforward enumeration of the configuration space is the product of four headers and a token: each header takes one of 5 values, and the token can be in four different locations in the crossbar. This enumeration is global.
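
The count can be confirmed by brute-force enumeration. In this sketch the five header values are assumed to be "empty" plus the four output ports; the slide only says each header has size 5, so the concrete values are illustrative.

```python
from itertools import product

# Assumed header values: no packet, or one of four destination ports.
HEADER_VALUES = ("empty", 0, 1, 2, 3)
TOKEN_POSITIONS = range(4)

# A global configuration is (Hdr0, Hdr1, Hdr2, Hdr3, Token).
space = [
    headers + (token,)
    for headers in product(HEADER_VALUES, repeat=4)
    for token in TOKEN_POSITIONS
]

print(len(space))
```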

34 So What?... Each tile has 8,192 words of instruction memory, same for switch
→ 8,192/2,500 = 3.3 instructions per configuration
→ not enough!
→ need to use off-chip memory
→ slow!
→ need to minimize SPACE
You may ask: “So what, memory is cheap!” But here’s the thing: each tile of the Raw processor has only 8K words of instruction memory. The same is true for the switch. So what are we left with? 8,192 divided by 2,500 leaves us about 3.3 instructions per configuration.

35 Minimization Let’s think locally!
[Figure: the Crossbar Processor of PORT 0 with its Ingress Processor and Egress Processor; links labeled in, out, cwprev, cwnext, ccwprev, ccwnext]
The symmetry of our design lets us enumerate configurations locally. Let’s shift the focus in order to minimize the configuration space and make things simpler: instead of enumerating global configurations of the Rotating Crossbar, let’s concentrate on a specific Crossbar Processor. As you can see in the figure, what we need to do is name all possible clients, or potential incoming occupants, of a Crossbar Processor’s servers, which are the static networks connecting a Crossbar Processor to its outgoing neighboring tiles.

36 Clients and Servers of a Crossbar Processor
servers: out, cwnext, ccwnext
clients: in, cwprev, ccwprev
Here are the possible values that “servers” and “clients” can take: three for servers (out, cwnext, ccwnext), and four for clients (empty, in, cwprev, and ccwprev).

37 Minimization and Scheduling
We cut down the number of configurations by a factor of 78! Now there are only 32 entries!
→ the program can fit in the local instruction memory!
Code generated by an automatic compile-time scheduler
In addition, software pipelining + loop unrolling of the assembly code of the switch processors of the crossbar to avoid deadlock
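
The effect of the minimization on the instruction budget follows from numbers already given in the talk: 8,192 words of instruction memory, 2,500 global configurations, and 32 local entries. A minimal sketch of the arithmetic; the per-entry figure is an upper bound, since it assumes the whole instruction memory is available for the schedule:

```python
IMEM_WORDS = 8192       # instruction memory per tile (and per switch)
GLOBAL_CONFIGS = 2500   # global enumeration: 5^4 x 4
LOCAL_ENTRIES = 32      # entries after local minimization

reduction = GLOBAL_CONFIGS / LOCAL_ENTRIES      # how much smaller the table is
words_per_global = IMEM_WORDS / GLOBAL_CONFIGS  # budget before minimization
words_per_local = IMEM_WORDS // LOCAL_ENTRIES   # budget after minimization

print(f"reduction: {reduction:.1f}x")
print(f"before: {words_per_global:.1f} instructions per configuration")
print(f"after: up to {words_per_local} instructions per entry")
```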

38 Scheduler Output
    /* AUTOGENERATED SCHEDULE FOR PORT 0 */
    /* Tile Processor */
    /* …*/
    conf_1_0303:
        mtsri SW_PC, %lo(sw_conf_1000)
        j conf_done
    conf_1_0304:
    conf_1_0310:
        mtsri SW_PC, %lo(sw_conf_2001)
    conf_1_0311:
        mtsri SW_PC, %lo(sw_conf_1210)

    /* HAND-CODED SCHEDULE FOR PORT 0 */
    /* Switch Processor */
    /* …*/
    /* in->out, prev->next, dist=1 */
    sw_conf_1210:
        nop route $IN->$OUT
        nop route $IN->$OUT, $PREV->$NEXT

39 Results

40 Implementation Raw Router was tested in a cycle-accurate simulator of the Raw processor
Raw prototype clock speed is assumed to be 250 MHz
The focus of research is on switch fabric, NOT on route lookup, etc.
Over 75,000 lines of assembly code, many of them hand-coded
As a disclaimer, I would like to note that the research presented so far has focused on the design and implementation of the switch fabric of the Raw Router; the rest of the router implementation is future work.

41 Raw Router Results Features
4-port edge router
3.3 Mpps
26.9 Gbps
Uses Raw static networks to stream data
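
The two throughput figures can be related to each other and to the assumed 250 MHz clock. This is a back-of-the-envelope sketch only: the 32-bit word width is an assumption not stated on the slide, and the slide does not give the packet size used in the measurement.

```python
PACKETS_PER_SEC = 3.3e6   # 3.3 Mpps
BITS_PER_SEC = 26.9e9     # 26.9 Gbps
CLOCK_HZ = 250e6          # assumed Raw prototype clock
WORD_BITS = 32            # assumption: 32-bit network words

bytes_per_packet = BITS_PER_SEC / PACKETS_PER_SEC / 8  # implied packet size
bits_per_cycle = BITS_PER_SEC / CLOCK_HZ               # aggregate bits per clock
words_per_cycle = bits_per_cycle / WORD_BITS           # aggregate words per clock

print(f"{bytes_per_packet:.0f} bytes per packet")
print(f"{bits_per_cycle:.1f} bits per cycle")
print(f"{words_per_cycle:.2f} words per cycle")
```

The implied packet size of roughly 1,000 bytes suggests the peak figures were obtained with large packets, and the aggregate rate works out to a bit over three words per cycle across the four ports.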

42 Conclusion

43 Conclusion Implemented a gigabit switch on Raw
Mapped dynamic communication to static interconnect
Can intermix switch fabric with computation
High-bandwidth I/O allows performance of custom ASIC processors

44 Future Work + Critique Take advantage of dynamic networks
Implement IP route lookup
Add computation on data (encryption)
Add support of multicast traffic
Implement Quality of Service
Add virtual output queueing
Explore larger router configurations
We are planning to implement longest prefix match for IP route lookup. Multicast traffic is traffic in which a single source simultaneously sends the same information to a set of subscribers. Quality of Service is a set of mechanisms that allow prioritization of traffic according to its pricing. Virtual output queueing is a method to avoid head-of-line blocking of packets.
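
For readers unfamiliar with longest prefix match, the planned lookup step can be illustrated with a small binary trie. This is a generic textbook sketch for IPv4, not the design planned for the Raw Router; the class name `PrefixTable` and the next-hop labels are hypothetical.

```python
import ipaddress


class PrefixTable:
    """Binary trie keyed on address bits; returns the longest matching prefix."""

    def __init__(self):
        self.root = {}

    def add(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["hop"] = next_hop  # mark the end of this prefix

    def lookup(self, address):
        bits = format(int(ipaddress.ip_address(address)), "032b")
        node = self.root
        best = node.get("hop")  # a /0 default route, if present
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            if "hop" in node:   # a longer prefix also matches
                best = node["hop"]
        return best


table = PrefixTable()
table.add("0.0.0.0/0", "default")
table.add("10.0.0.0/8", "A")
table.add("10.1.0.0/16", "B")
print(table.lookup("10.1.2.3"), table.lookup("10.2.0.1"), table.lookup("192.168.0.1"))
```

The address 10.1.2.3 matches all three prefixes, and the lookup returns the next hop of the most specific one.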

45 End of the “official” part!

46 Current Research Probabilistic Robotics with Prof. John Leonard
Robust Feature-Relative Navigation for Autonomous Underwater Vehicles

47 Robotic Kayaks

48 Questions?

