Presentation on theme: "Exploiting Task-level Concurrency in a Programmable Network Interface June 11, 2003 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture."— Presentation transcript:

1 Exploiting Task-level Concurrency in a Programmable Network Interface
June 11, 2003
Hyong-youb Kim, Vijay S. Pai, and Scott Rixner
Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/

2 Why a Programmable Network Interface?
 More complex functionality on the network interface
–TCP offload, iSCSI, etc.
 Easy maintenance
–Bug fixes, upgrades, customization, etc.
 Performance?
–51% less web server throughput than an ASIC NIC
–A big problem

3 Improving Performance
 Increase clock speed and/or complexity
–Typical solutions for general-purpose processors
–Do not work for embedded processors
 Design constraints: limited power and area
–Power is proportional to C·V²·f
–Higher f requires higher V
–Thus, power is roughly proportional to f³
–Complexity increases C for only marginal gains
 Implication: a simple, low-frequency processor
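The power argument on this slide can be checked with quick arithmetic. In the sketch below, the capacitance constant and the volts-per-MHz slope are hypothetical normalizations; only the ratios matter.

```python
# Dynamic power P = C * V^2 * f; with supply voltage V assumed to
# scale linearly with frequency f, P grows as f^3. The constants
# below are hypothetical normalizations; only ratios matter.

def dynamic_power(freq_mhz, c=1.0, volts_per_mhz=1.0):
    """Relative dynamic power of one core at the given frequency."""
    v = volts_per_mhz * freq_mhz
    return c * v ** 2 * freq_mhz

# Two Tigon-like cores at 88 MHz vs. one core at double the frequency:
two_slow = 2 * dynamic_power(88)
one_fast = dynamic_power(176)

# Doubling f costs 2^3 = 8x per core, so one double-speed core burns
# 8 / 2 = 4x the power of two half-speed cores.
print(one_fast / two_slow)  # -> 4.0
```

This is why, under a fixed power budget, adding a second simple core is preferable to raising the clock.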

4 Use Parallel Programming
 Use multiple programmable cores
–Increase computational capacity
–Achieve performance within the power limit: two slower cores consume far less power than one higher-frequency core
 Improvements with two cores over a single core
–65-157% for bidirectional traffic
–27-51% for web server workloads
–Web server throughput comparable to an ASIC NIC

5 Outline
 Background
–Tigon Programmable Gigabit Ethernet Controller
–Network Interface Processing: Send/Receive
 Parallelization of Firmware
 Experimental Results
 Conclusion

6 Tigon Gigabit Ethernet Controller
 Two programmable cores
–Based on MIPS, running at 88 MHz
–Small on-chip memory (scratch pad) per core
 Shared off-chip SRAM
 Supports event-driven firmware
 No interrupts
–Event handlers run to completion
–Handlers on the same core require no synchronization
 The released firmware fully utilizes only one core
 No previous Ethernet firmware utilizes both cores
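The execution model above can be sketched as a simple dispatch loop: no interrupts, and each handler runs to completion before the next pending event is serviced, which is why handlers on the same core need no synchronization. The event names mirror the slides; the handler bodies and dispatch loop are hypothetical.

```python
# Minimal sketch of event-driven, run-to-completion firmware of the
# kind the Tigon runs. Handlers never preempt each other on a core.

from collections import deque

handlers = {}
log = []  # records handler invocations, for illustration

def on(event):
    """Register a run-to-completion handler for a named event."""
    def register(fn):
        handlers[event] = fn
        return fn
    return register

@on("send_buffer_descriptor_ready")
def send_bd_ready(arg):
    log.append(("send_buffer_descriptor_ready", arg))

@on("dma_read_complete")
def dma_read_complete(arg):
    log.append(("dma_read_complete", arg))

def dispatch(pending):
    """Drain the pending-event queue; no handler ever preempts another."""
    queue = deque(pending)
    while queue:
        event, arg = queue.popleft()
        handlers[event](arg)

dispatch([("send_buffer_descriptor_ready", 0), ("dma_read_complete", 0)])
```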

7 Send Processing
The send path crosses the PCI bus between host main memory/CPU and the network interface card:
1. The CPU creates a buffer descriptor for the packet in main memory
2. The CPU alerts the NIC via memory-mapped I/O that a descriptor was produced
3. The NIC fetches the buffer descriptor by DMA
4. The NIC transfers the packet by DMA
5. The NIC transmits the packet on the network
6. The NIC alerts the host by interrupt that the descriptor was consumed
Tigon events involved: Mailbox, Send Buffer Descriptor Ready, DMA Read Complete, Send Data Ready, Update Send Consumer Index, DMA Write Complete
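The descriptor exchange in the send path behaves like a shared ring with a host-owned producer index and a NIC-owned consumer index. A minimal sketch, with a hypothetical ring size and field names:

```python
# Producer/consumer descriptor ring sketch: the host bumps the producer
# index when it posts a descriptor, and the NIC bumps the consumer
# index once the packet has been fetched and transmitted.

RING_SIZE = 8  # hypothetical; real rings are larger

ring = [None] * RING_SIZE
producer = 0   # written by the host
consumer = 0   # written by the NIC

def host_post(descriptor):
    """Host side: place a buffer descriptor and bump the producer index."""
    global producer
    assert (producer + 1) % RING_SIZE != consumer, "ring full"
    ring[producer] = descriptor
    producer = (producer + 1) % RING_SIZE

def nic_consume():
    """NIC side: fetch the next descriptor and bump the consumer index."""
    global consumer
    assert consumer != producer, "ring empty"
    d = ring[consumer]
    consumer = (consumer + 1) % RING_SIZE
    return d

host_post({"addr": 0x1000, "len": 1500})
print(nic_consume())  # -> {'addr': 4096, 'len': 1500}
```

Because each index has a single writer, the two sides coordinate without locks, which matches the mailbox-plus-interrupt alerts in steps 2 and 6.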

8 Receive Processing: Pre-allocation
1. The CPU allocates a receive buffer in main memory
2. The CPU creates a buffer descriptor for it
3. The CPU alerts the NIC via memory-mapped I/O that a descriptor was produced
4. The NIC fetches the buffer descriptor by DMA
Tigon events involved: Mailbox, Receive Buffer Descriptor Ready, DMA Read Complete

9 Receive Processing: Actual Receive
1. The NIC stores the incoming packet
2. The NIC creates a buffer descriptor
3. The NIC transfers the packet to the pre-allocated receive buffer by DMA
4. The NIC transfers the buffer descriptor by DMA
5. The NIC alerts the host by interrupt that a descriptor was produced
Tigon events involved: Receive Complete, DMA Write Complete, Update Receive Return Producer Index
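The two receive phases can be sketched as a pre-posted buffer pool plus a return ring: the host posts empty buffers ahead of time, and the NIC later fills one per packet and hands its descriptor back. The names below are hypothetical simplifications of the Tigon's rings.

```python
# Two-phase receive sketch: pre-allocation fills a free-buffer pool;
# actual receive consumes from it and produces on a return ring.

free_buffers = []   # buffers the host has pre-posted to the NIC
return_ring = []    # filled descriptors the NIC hands back to the host

def host_preallocate(n, size=1514):
    """Phase 1: the host posts n empty receive buffers to the NIC."""
    for _ in range(n):
        free_buffers.append(bytearray(size))

def nic_receive(packet):
    """Phase 2: the NIC stores the packet in a pre-posted buffer and
    produces a descriptor on the return ring for the host."""
    assert free_buffers, "no pre-allocated buffer: packet dropped"
    buf = free_buffers.pop(0)
    buf[:len(packet)] = packet
    return_ring.append({"buf": buf, "len": len(packet)})

host_preallocate(2)
nic_receive(b"hello")
print(return_ring[0]["len"])  # -> 5
```

Pre-allocation is what lets the NIC DMA a packet to host memory immediately on arrival instead of waiting for the host to find a buffer.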

10 Tigon Uniprocessor Performance
[Figure: maximum UDP throughput of the Tigon running the uniprocessor firmware vs. the Intel PRO/1000 MT and Netgear 622T; the Intel NIC achieves 100% more throughput than the Tigon. Maximum UDP throughput decreases due to network headers and per-frame overhead.]

11 Outline
 Background
 Parallelization of Firmware
–Principles
–Resource Sharing Patterns
–Partitioning Process
 Experimental Results
 Conclusion

12 Principles
 Identify the unit of concurrency
–Event handler
 Analyze resource sharing patterns
 Profile the uniprocessor firmware
 Partition event handlers so as to
–Balance load
–Minimize synchronization
–Maximize on-chip memory utilization

13 Resource Sharing Patterns
[Diagram: the event handlers (Mailbox, DMA Read Complete, Send Buffer Descriptor Ready, Send Data Ready, Update Send Consumer, Receive Buffer Descriptor Ready, Receive Complete, Update Receive Return Producer, DMA Write Complete) share three kinds of resources: shared data objects, the shared DMA read channel, and the shared DMA write channel.]

14 Partitioning Process
[Diagram: the same handler graph annotated with profiled loads (6%, 4%, 3%, 5%, 14%, 30%, 1%, 31%) and candidate CPU A / CPU B assignments, with intermediate splits such as 30%/31% and 52%/41% converging toward the final 47%/53% split.]
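The load-balancing step can be mechanized as a greedy heaviest-first assignment. In the sketch below, the profiled percentages come from the slide, but mapping each figure to a specific handler name is an assumption for illustration, and a real partition must also weigh shared data objects and DMA channels, which is why the firmware's final split (47%/53%) need not be a pure load balance.

```python
# Greedy heaviest-first partition of profiled event handlers across
# two cores. Handler-name-to-percentage mapping is hypothetical; the
# percentages themselves are the profiled loads from the slide.

loads = {
    "mailbox": 6, "send_bd_ready": 4, "send_data_ready": 3,
    "update_send_consumer": 5, "recv_bd_ready": 14,
    "recv_complete": 30, "update_recv_producer": 1,
    "dma_complete": 31,
}

def greedy_partition(loads):
    """Assign each handler, heaviest first, to the less-loaded core."""
    cores = {"A": [], "B": []}
    totals = {"A": 0, "B": 0}
    for name, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        target = min(totals, key=totals.get)  # less-loaded core
        cores[target].append(name)
        totals[target] += load
    return cores, totals

cores, totals = greedy_partition(loads)
print(sorted(totals.values()))  # -> [47, 47]
```

Greedy balancing alone splits the 94% of profiled load evenly; the published partition additionally keeps handlers that share data or a DMA channel on the same core to avoid cross-core synchronization.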

15 Final Partition
[Diagram: the final assignment of event handlers to the two cores, with the shared data objects and the shared DMA read/write channels between them; CPU A carries 47% of the load and CPU B 53%.]

16 Outline
 Background
 Parallelization of Firmware
 Experimental Results
–Improved Maximum Throughput
–Improved Web Server Throughput
 Conclusion

17 Experimental Setup
 Network interface card
–3Com 710024 Gigabit Ethernet interface card, based on the Tigon
 Firmware versions
–Uniprocessor firmware: 12.4.13 from the original manufacturer
–Parallel firmware: modified version of 12.4.13
 Benchmarks
–UDP bidirectional, unidirectional, and ping traffic
–Web server (thttpd) and software router (Click)
 Testbed
–PCs with an AMD Athlon 2600+ CPU and 2 GB RAM
–FreeBSD 4.7

18 Overall Improvements
[Figure: throughput of the parallel firmware vs. the uniprocessor firmware, with improvements ranging from 65% to 157%.]

19 Sources of Improvements
[Figure: breakdown of the gains: a 37% improvement due to the two processors and a 70% improvement due to the scratch pads.]

20 Comparison to ASIC NICs
[Figure: the parallel-firmware 3Com 710024 (Tigon, 1997) vs. the Intel PRO/1000 MT (Intel, 2002) and Netgear GA622T (National Semiconductor, 2001), with the uniprocessor Tigon shown for reference; the Intel NIC now leads the Tigon by only 21%.]

21 Impact on Web Server Throughput
[Figure: web server throughput improves by 27-51% overall, comparable to the ASIC NICs.]

22 Parallelization Makes Programmability Viable
 Programmability is useful for complex functions
 Clock speed is limited for embedded processors
–Limited uniprocessor performance
 Use multiple cores to improve performance
 Two cores vs. a single core
–65% increase in maximum throughput
–51% increase in web server throughput
–Web server throughput comparable to ASIC NICs

23

24 UDP Send: Overall Improvements

25 UDP Send: Sources of Improvements

26 UDP Receive: Overall Improvements

27 UDP Receive: Sources of Improvements

28 UDP Ping: Overall Improvements

29 UDP Ping: Sources of Improvements

30 Impact on Routing Throughput

