Towards a Billion Routing Lookups per Second in Software
Authors: Marko Zec, Luigi Rizzo, Miljenko Mikuc
Publisher: SIGCOMM Computer Communication Review, 2012
Presenter: Yuen-Shuo Li
Date: 2012/12/12
Outline: Idea, Introduction, Search Data Structure, Lookup Algorithm, Updating, Performance
Idea: Can a software routing implementation compete in a field generally reserved for specialized lookup hardware?
Introduction (1/3): Software routing lookups have fallen out of focus of the research community in the past decade, because the performance of general-purpose CPUs has not kept pace with quickly increasing packet rates. Software vs. hardware?
Introduction (2/3): The landscape has changed, however.
1. The performance potential of general-purpose CPUs has increased significantly over the past decade: instruction-level parallelism, shorter execution pipelines, etc.
2. The increasing interest in virtualized systems and software-defined networks calls for a communication infrastructure that is more flexible and less tied to the hardware.
Introduction (3/3): DXR is an efficient IPv4 lookup scheme for software running on general-purpose CPUs. The primary goal was not to implement a universal routing lookup scheme: the data structures and algorithms are optimized exclusively for the IPv4 protocol. DXR distills a real-world BGP snapshot with 417,000 prefixes into a structure consuming only 782 KBytes, less than 2 bytes per prefix, and achieves 490 MLps (million lookups per second).
Search Data Structure (1/11): Building the search data structure begins by expanding all prefixes from the database into address ranges.
Search Data Structure (2/11): Neighboring address ranges that resolve to the same next hop are then merged.
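The merge step can be sketched as follows (illustrative only, assuming ranges are kept sorted and each range implicitly ends where the next one begins):

```python
def merge_ranges(ranges):
    """Merge neighboring address ranges that resolve to the same
    next hop. `ranges` is a sorted list of (start, next_hop) pairs;
    each range implicitly extends to the next entry's start."""
    merged = []
    for start, nh in ranges:
        if merged and merged[-1][1] == nh:
            continue  # same next hop as predecessor: absorb into it
        merged.append((start, nh))
    return merged
```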
Search Data Structure (3/11): Each entry only needs the start address and the next hop: this is the search data we need.
Search Data Structure (4/11)
Search Data Structure (5/11)
Search Data Structure (6/11): Each entry in the lookup table must contain the position and size of the corresponding chunk in the range table. A special value of size indicates that the 19 position bits are instead an offset into the next-hop table.
Lookup table entry (32 bits): Format (1 bit) | Size (12 bits) | Position (19 bits)
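A sketch of packing and unpacking such a 32-bit entry. The exact bit positions (format bit at the top, size in the middle, position in the low 19 bits) are my assumption for illustration; the slide fixes only the field widths:

```python
FMT_SHIFT, SIZE_SHIFT = 31, 19      # assumed field positions
SIZE_MASK, POS_MASK = 0xFFF, 0x7FFFF

def pack_entry(fmt, size, position):
    """Pack a 32-bit lookup table entry:
    1 format bit | 12-bit size | 19-bit position."""
    assert fmt <= 1 and size <= SIZE_MASK and position <= POS_MASK
    return (fmt << FMT_SHIFT) | (size << SIZE_SHIFT) | position

def unpack_entry(entry):
    """Return (format, size, position) from a packed entry."""
    return (entry >> FMT_SHIFT,
            (entry >> SIZE_SHIFT) & SIZE_MASK,
            entry & POS_MASK)
```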
Search Data Structure (7/11)
Search Data Structure (8/11): Range table. With k bits already consumed to index the lookup table, the range table entries only need to store the remaining 32 − k address bits and the next-hop index.
Search Data Structure (9/11): Range table. A further optimization, especially effective for large BGP views where the vast majority of prefixes are /24 or less specific: entries can be stored in either a long or a short format.
Search Data Structure (10/11): Range table. One bit in the lookup table entry indicates whether a chunk is stored in long or short format.
Long format (32 bits): Lookup index (16 bits) | Next hop (16 bits)
Short format (16 bits): Lookup index (8 bits) | Next hop (8 bits)
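A sketch of the encoding decision for one range table entry, assuming k = 16 (so 16 remaining address bits). The eligibility test is my reading of the slides: the short format drops the low 8 address bits, which works when the range start comes from a prefix that is /24 or less specific (low 8 bits zero) and the next-hop index fits in 8 bits:

```python
def encode_range_entry(addr_bits16, next_hop):
    """Encode one range table entry (k = 16 assumed).
    addr_bits16: the remaining 16 address bits of the range start."""
    if addr_bits16 & 0xFF == 0 and next_hop < 0x100:
        # short format: 8-bit lookup index | 8-bit next hop (16 bits)
        return 'short', (addr_bits16 & 0xFF00) | next_hop
    # long format: 16-bit lookup index | 16-bit next hop (32 bits)
    return 'long', (addr_bits16 << 16) | next_hop
```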
Search Data Structure (11/11)
Lookup Algorithm (1/4): The k leftmost bits of the key are used as an index into the lookup table. Example: 1.2.4.0 = 00000001.00000010.00000100.00000000; the leading k bits form the index.
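The index extraction is a single shift; a sketch (function name mine):

```python
def lookup_index(addr, k=16):
    """The k leftmost bits of the 32-bit key index the lookup table."""
    return addr >> (32 - k)

# 1.2.4.0 -> 0x01020400; with k = 16 the index is 0x0102
```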
Lookup Algorithm (2/4): If the entry points directly to a next hop, the lookup is complete. (A special value of size indicates that the 19 bits are an offset into the next-hop table.)
Lookup Algorithm (3/4): Otherwise, the (position, size) information in the entry selects the portion of the range table on which to perform a binary search of the (remaining bits of the) key.
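The binary search over the chunk can be sketched with the standard library's `bisect` (data layout and names are illustrative, not the paper's): find the last range whose start address is at most the key's remaining bits.

```python
import bisect

def chunk_lookup(starts, next_hops, position, size, key_low):
    """Binary-search the chunk's slice of the range table,
    starts[position : position + size] (sorted start addresses),
    for the last entry whose start <= key_low."""
    i = bisect.bisect_right(starts, key_low, position, position + size) - 1
    return next_hops[i]
```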
Lookup Algorithm (4/4)
Updating (1/3): The lookup structures store only the information necessary for resolving next-hop lookups, so a separate database storing detailed information on all the prefixes is required for rebuilding the lookup structures.
Updating (2/3): Updates are handled as follows:
- Updates covering multiple chunks (prefix length < k) are expanded; each chunk is then processed independently and rebuilt from scratch.
- The rebuild finds the best matching route for the first IPv4 address belonging to the chunk and translates it into a range table entry, continuing through the chunk's address space.
- If the range table heap contains only a single element when the process ends, the next hop can be encoded directly in the lookup table.
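The final step of a chunk rebuild can be sketched as a simple decision (a hypothetical helper, not the paper's code): a chunk whose rebuilt range list collapsed to a single entry needs no range table slice at all.

```python
def finalize_chunk(ranges):
    """Decide how a rebuilt chunk is stored.
    `ranges` is the chunk's merged (start, next_hop) list."""
    if len(ranges) == 1:
        # one range covers the whole chunk: encode the next hop
        # directly in the lookup table entry
        return 'direct', ranges[0][1]
    return 'chunk', ranges  # otherwise keep a (position, size) slice
```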
Updating (3/3): The rebuild time can be reduced by:
- coalescing multiple updates into a single rebuild; we have implemented this by delaying the reconstruction for several milliseconds after the first update is received;
- processing chunks in parallel, if multiple cores are available.
Performance (1/11): Our primary test vectors were three IPv4 BGP snapshots from routeviews.org.
Performance (2/11): Here we present the results obtained using the LINX snapshot, selected because it is the most challenging for our scheme: it contains the highest number of next hops and an unusually high number of prefixes more specific than /24.
Performance (3/11): Our primary test machine had a 3.6 GHz AMD FX-8150 8-core CPU and 4 GB of RAM. Each pair of cores shares a private 2 MB L2 cache block, for a total of 8 MB of L2 cache; all 8 cores share 8 MB of L3 cache. We also ran some benchmarks on a 2.8 GHz Intel Xeon W3530 4-core CPU with smaller but faster L2 caches (256 KB per core) and 8 MB of shared L3 cache. All tests ran on the same platform/compiler (FreeBSD 8.3, amd64, gcc 4.2.2); each test ran for 10 seconds.
Performance (4/11): We used three different request patterns:
- REP (repeated): each key is looked up 16 times before moving to the next one.
- RND (random): each key is looked up only once; the test loop has no data dependencies between iterations.
- SEQ (sequential): same as RND, but the timing loop has an artificial data dependency between iterations, so that requests become effectively serialized.
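The three patterns can be sketched as below (names from the slide; the loop bodies are illustrative, not the paper's benchmark code). The SEQ trick is to fold the previous result into the next key in a way that never changes the key but still forces the CPU to wait for the previous lookup:

```python
def run_pattern(keys, lookup, pattern):
    """Drive `lookup` over `keys` with one of the three patterns."""
    result = 0
    if pattern == 'REP':
        for key in keys:
            for _ in range(16):      # each key looked up 16 times
                result = lookup(key)
    elif pattern == 'RND':
        for key in keys:             # keys pre-shuffled; iterations independent
            result = lookup(key)
    elif pattern == 'SEQ':
        for key in keys:
            # (result & 0) is always zero, but the XOR makes each
            # iteration depend on the previous lookup's result,
            # serializing the requests
            result = lookup(key ^ (result & 0))
    return result
```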
Performance (5/11)
Performance (6/11): Average lookup throughput with D18R is consistently higher than with D16R on our test systems.
Performance (7/11): The distribution of binary search steps for uniformly random queries, observed using the LINX database; n = 0 means a direct match in the lookup table.
Performance (8/11): Interference between multiple cores causes a modest slowdown of the individual threads.
Performance (9/11): Comparison with DIR-24-8-BASIC.
Performance (10/11)
Performance (11/11)