
Slide 1: CoolCAMs: Power-Efficient TCAMs for Forwarding Engines
Paper by Francis Zane, Girija Narlikar, Anindya Basu (Bell Laboratories, Lucent Technologies)
Presented by Edward Spitznagel

Slide 2: Outline
- Introduction
- TCAMs for Address Lookup
- Bit Selection Architecture
- Trie-based Table Partitioning
- Route Table Updates
- Summary and Discussion

Slide 3: Introduction
- Ternary Content-Addressable Memories (TCAMs) are becoming very popular for designing high-throughput forwarding engines; they are
  - fast
  - cost-effective
  - simple to manage
- Major drawback: high power consumption
- This paper presents architectures and algorithms for making TCAM-based routing tables more power-efficient

Slide 4: TCAMs for Address Lookup
- Fully associative memory, searchable in a single cycle
- Hardware compares the query word (destination address) to all stored words (routing prefixes) in parallel
  - each bit of a stored word can be 0, 1, or X (don't care)
  - if multiple matches occur, typically the entry with the lowest address is returned
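To make the matching semantics concrete, here is a minimal Python sketch of a ternary match (purely illustrative: a real TCAM compares all entries in parallel in hardware, and the sample entries are hypothetical):

    def tcam_search(entries, key):
        """Return the index of the first (lowest-address) matching entry, or None.
        Each entry is a (value, mask) pair; mask bits of 0 are 'X' (don't care)."""
        for i, (value, mask) in enumerate(entries):
            if (key & mask) == (value & mask):  # X positions are masked out
                return i
        return None

    # Storing longer prefixes at lower addresses makes "lowest address wins"
    # implement longest-prefix match:
    entries = [
        (0xC0A80100, 0xFFFFFF00),  # 192.168.1.0/24
        (0xC0A80000, 0xFFFF0000),  # 192.168.0.0/16
    ]
    print(tcam_search(entries, 0xC0A80105))  # -> 0 (the /24 entry wins)
    print(tcam_search(entries, 0xC0A8FF01))  # -> 1 (only the /16 matches)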

Slide 5: TCAMs for Address Lookup
- TCAM vendors now provide a mechanism that can reduce power consumption by selectively addressing smaller portions of the TCAM
- The TCAM is divided into a set of blocks; each block is a contiguous, fixed-size chunk of TCAM entries
  - e.g. a 512k-entry TCAM could be divided into 64 blocks of 8k entries each
- When a search command is issued, it is possible to specify which block(s) to use in the search
- This can save power, since the main component of TCAM power consumption during a search is proportional to the number of entries searched
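A sketch of block-selective search built on tcam_search above (again illustrative: real hardware searches the enabled blocks in parallel, and the 8k block size is just the slide's example):

    def blocked_search(entries, key, block_ids, block_size=8 * 1024):
        """Search only the enabled blocks; since search power is roughly
        proportional to the number of entries examined, enabling fewer
        blocks saves power."""
        for b in sorted(block_ids):               # lowest addresses first
            block = entries[b * block_size:(b + 1) * block_size]
            hit = tcam_search(block, key)
            if hit is not None:
                return b * block_size + hit       # convert back to a global index
        return None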

Slide 6: Bit Selection Architecture
- Based on the observation that most prefixes in core routing tables are between 16 and 24 bits long
  - over 98%, in the authors' datasets
- Put the very short (<16-bit) and very long (>24-bit) prefixes in a set of TCAM blocks that are searched on every lookup
- The remaining prefixes are partitioned into "buckets," one of which is selected by hashing on each lookup
  - each bucket is laid out over one or more TCAM blocks
- In this paper, the hashing function is restricted to merely using a selected set of input bits as an index

Slide 7: Bit Selection Architecture (figure in the original slide)

Slide 8:
- A route lookup then involves the following:
  - the hashing function (bit-selection logic, really) selects k hashing bits from the destination address, which identify the bucket to be searched
  - the blocks holding the very long and very short prefixes are also searched
- The main issues now are:
  - how to select the k hashing bits
    - restrict the choice to the first 16 bits of the address, to avoid replicating prefixes
  - how to allocate the different buckets among the various TCAM blocks (since a bucket's size may not be an integral multiple of the TCAM block size)
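A sketch of the restricted hash: the k chosen bit positions are simply concatenated into a bucket index (the bit numbering is an assumption for illustration; here position 0 is the most significant of the first 16 bits):

    def bucket_index(first16, bits):
        """first16: the first 16 bits of the destination address, as an int.
        bits: the chosen hash-bit positions, 0..15 (0 = most significant)."""
        idx = 0
        for pos in bits:
            idx = (idx << 1) | ((first16 >> (15 - pos)) & 1)
        return idx

    # With k = 2 bits chosen at positions (0, 15):
    print(bucket_index(0b1000000000000000, (0, 15)))  # -> 0b10 = 2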

Slide 9: Bit Selection: Worst-case power consumption
- Given any routing table containing N prefixes, each of length >= L, what is the size of the largest bucket generated by the best possible hash function that uses k bits out of the first L?
- Theorem III.1: there exists a hash function splitting the set of prefixes such that the size of the largest bucket is bounded
  - more details and proof in Appendix I of the paper
- An ideal hash function would generate 2^k equal-sized buckets
  - e.g. if k = 3, each bucket holds N/8 = 0.125N prefixes; if k = 6, each holds N/64 ≈ 0.016N

Slide 10: Bit Selection Heuristics
- We don't expect to see the worst-case input, but it gives designers a power budget
- Given such a power budget and a routing table, it suffices to find a set of hashing bits producing a split that does not exceed the power budget (a satisfying split)
- Three heuristics:
  - the first is simple: use the rightmost k of the first 16 bits. In almost all routing traces studied, this works well.
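A sketch of checking for a satisfying split, plus the first heuristic, building on bucket_index above (the names are hypothetical; first16_counts is assumed to map each 16-bit prefix value to the number of table prefixes sharing it):

    from collections import Counter

    def max_bucket_size(first16_counts, bits):
        """Size of the largest bucket produced by the given hash bits."""
        buckets = Counter()
        for value, n in first16_counts.items():
            buckets[bucket_index(value, bits)] += n
        return max(buckets.values())

    def simple_bits(first16_counts, k, budget):
        """First heuristic: the k rightmost of the first 16 bits."""
        bits = tuple(range(16 - k, 16))
        if max_bucket_size(first16_counts, bits) <= budget:
            return bits
        return None       # did not produce a satisfying split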

Slide 11: Bit Selection Heuristics
- Second heuristic: brute-force search that checks all possible subsets of k bits from the first 16
- Guaranteed to find a satisfying split whenever one exists
- Since it examines all C(16, k) possible sets of k bits, running time is maximal at k = 8
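A brute-force sketch along these lines, reusing the helpers above (stopping at the first satisfying split is an assumption to keep typical runs short):

    from itertools import combinations

    def brute_force_bits(first16_counts, k, budget):
        """Check all C(16, k) subsets; return the best (bits, max_bucket) found,
        stopping early once the split fits the power budget."""
        best_bits, best_max = None, float('inf')
        for bits in combinations(range(16), k):
            m = max_bucket_size(first16_counts, bits)
            if m < best_max:
                best_bits, best_max = bits, m
                if m <= budget:
                    break                 # satisfying split found
        return best_bits, best_max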

Slide 12: Bit Selection Heuristics
- Third heuristic: a greedy algorithm
  - falls between the simple heuristic and the brute-force one in complexity and accuracy
- To select k hashing bits, the algorithm performs k iterations, selecting one bit per iteration
  - the number of buckets doubles with each iteration
- The goal in each iteration is to select the bit that minimizes the size of the biggest bucket produced in that iteration

Slide 13: Bit Selection Heuristics
- Third heuristic (greedy algorithm): pseudocode in the original slide
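A minimal Python sketch of the greedy loop the previous slide describes (the function shape is an assumption; the authoritative pseudocode is on the slide image), reusing max_bucket_size from above:

    def greedy_bits(first16_counts, k):
        """Pick one bit per iteration, each time minimizing the size of the
        largest bucket that the bits chosen so far would produce."""
        chosen = []
        for _ in range(k):
            best_bit, best_max = None, float('inf')
            for bit in range(16):
                if bit in chosen:
                    continue
                m = max_bucket_size(first16_counts, tuple(chosen) + (bit,))
                if m < best_max:
                    best_bit, best_max = bit, m
            chosen.append(best_bit)       # number of buckets doubles here
        return tuple(chosen)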

Slide 14: Bit Selection Heuristics
- Combining the heuristics to reduce running time in typical cases, as sketched below:
  - first, try the simple heuristic (use the k rightmost bits), and stop if it succeeds
  - otherwise, apply the third heuristic (greedy algorithm), and stop if it succeeds
  - otherwise, apply the brute-force heuristic
- Apply the algorithm again whenever route updates cause any bucket to become too large.
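A sketch of the combined strategy, chaining the hypothetical helpers from the earlier sketches:

    def choose_hash_bits(first16_counts, k, budget):
        """Escalate simple -> greedy -> brute force, stopping at the first
        heuristic that yields a satisfying split."""
        bits = simple_bits(first16_counts, k, budget)
        if bits is not None:
            return bits
        bits = greedy_bits(first16_counts, k)
        if max_bucket_size(first16_counts, bits) <= budget:
            return bits
        return brute_force_bits(first16_counts, k, budget)[0]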

Slide 15: Bit Selection Architecture: Experimental Results
- The heuristics are evaluated on two metrics: running time and quality of the splits produced
- Applied to real core routing tables; results are presented for two, but others were similar
- Also applied to a synthetic table with ~1M entries, constructed by randomly picking how many prefixes share each combination of the first 16 bits

Slide 16: Bit Selection Results: Running Time
- Running times measured on an 800 MHz PC (results in the original slide)
- Required less than 1 MB of memory

Slide 17: Bit Selection Results: Quality of Splits
- let N denote the number of 16-24 bit prefixes
- let c_max denote the maximum bucket size
- The ratio N/c_max measures the quality (evenness) of the split produced by the hashing bits
  - it is the factor of reduction in the portion of the TCAM that needs to be searched

Slide 18: Bit Selection Architecture: Laying out TCAM buckets
- Blocks for very long prefixes and very short prefixes are placed in the TCAM at the beginning and end, respectively
  - this ensures that the longest prefix is selected if more than one should match
- Laying out buckets sequentially, in any order:
  - any bucket of size c occupies no more than ⌈c/s⌉ + 1 TCAM blocks, where s is the TCAM block size (the +1 allows for a bucket that straddles a block boundary)
  - at most ⌈c_max/s⌉ + 1 TCAM blocks need to be searched for any lookup (plus the blocks for very long and very short prefixes); e.g. with s = 8k and c_max = 20k, that is ⌈20k/8k⌉ + 1 = 4 blocks
  - thus the actual power-savings ratio is not quite as good as the N/c_max mentioned before, but it is still good

Slide 19: Bit Selection Architecture: Remarks
- Good average-case power reduction, but the worst-case bounds are not as good
  - hardware designers are thus forced to design for much higher power consumption than will be seen in practice
- Assumes most prefixes are 16-24 bits long
  - this may not always hold (e.g. the number of long (>24-bit) prefixes may increase in the future)

Slide 20: Trie-based Table Partitioning
- A partitioning scheme using a routing trie data structure
- Eliminates the two drawbacks of the bit-selection architecture:
  - worst-case bounds on power consumption that do not match power consumption in practice
  - the assumption that most prefixes are 16-24 bits long
- Two trie-based schemes (subtree-split and postorder-splitting), both involving two steps:
  - construct a binary routing trie from the routing table
  - partitioning step: carve subtrees out of the trie and place them into buckets
- The two schemes differ in their partitioning step

Slide 21: Trie-based Architecture
- Trie-based forwarding engine architecture (figure in the original slide)
  - uses an index TCAM (instead of hashing) to determine which bucket to search
  - requires searching the entire index TCAM, but the index TCAM is typically very small

Slide 22: Overview of Routing Tries
- A 1-bit trie can be used to perform longest-prefix matches
  - the trie consists of nodes, where a routing prefix of length n is stored at level n of the trie
- Routing lookup process:
  - starts at the root
  - scans the input, descending left if the next input bit is 0 and right if it is 1, until a leaf node is reached
  - the last prefix encountered is the longest matching prefix
- count(v) = number of routing prefixes in the subtree rooted at v
- The covering prefix of a node u is the prefix of the lowest ancestor of u (including u itself) that is in the routing table
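A minimal sketch of the 1-bit trie with the count bookkeeping the partitioning schemes rely on (prefixes are written as '0'/'1' strings purely for clarity):

    class TrieNode:
        def __init__(self):
            self.children = [None, None]   # 0 = left, 1 = right
            self.is_prefix = False         # does a routing prefix end here?
            self.count = 0                 # count(v): prefixes in this subtree

    def insert(root, prefix_bits):
        """Insert a prefix (e.g. '0101') and update counts along the path."""
        node = root
        node.count += 1
        for b in prefix_bits:
            i = int(b)
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
            node.count += 1
        node.is_prefix = True

    def longest_match(root, addr_bits):
        """Descend left on 0 / right on 1, remembering the last prefix seen."""
        node, best = root, None
        for i, b in enumerate(addr_bits):
            node = node.children[int(b)]
            if node is None:
                break
            if node.is_prefix:
                best = addr_bits[:i + 1]
        return best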

Slide 23: Routing Trie Example (routing table and corresponding 1-bit trie in the original slide)

Slide 24: Splitting into subtrees
- Subtree-split algorithm:
  - input: b = maximum size of a TCAM bucket
  - output: a set of K TCAM buckets, each with size in the range [⌈b/2⌉, b], and an index TCAM of size K
- Partitioning step: post-order traversal of the trie, looking for carving nodes
  - carving node: a node with count >= ⌈b/2⌉ and with a parent whose count is > b
- When we find a carving node v:
  - carve out the subtree rooted at v, and place it in a separate bucket
  - place the prefix of v in the index TCAM, along with the covering prefix of v
  - decrease the counts of all ancestors of v by count(v)

Slide 25: Subtree-split: Algorithm (pseudocode in the original slide)
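Since the pseudocode itself is an image, here is a hedged Python sketch of the carving loop as described on the previous slide, using the TrieNode from the trie sketch (the carving condition is reconstructed from the stated bucket-size bounds, and covering prefixes are noted but elided for brevity):

    import math

    def collect(node, path, out):
        """Gather all prefixes in the subtree rooted at node."""
        if node is None:
            return
        if node.is_prefix:
            out.append(path)
        collect(node.children[0], path + '0', out)
        collect(node.children[1], path + '1', out)

    def subtree_split(root, b):
        lo = math.ceil(b / 2)
        buckets = []                      # (index_prefix, prefixes) pairs

        def visit(node, path, ancestors):
            if node is None:
                return
            visit(node.children[0], path + '0', ancestors + [node])
            visit(node.children[1], path + '1', ancestors + [node])
            parent = ancestors[-1] if ancestors else None
            if parent is not None and node.count >= lo and parent.count > b:
                bucket = []
                collect(node, path, bucket)
                # path goes into the index TCAM (with node's covering prefix)
                buckets.append((path, bucket))
                for a in ancestors:       # remove carved prefixes from counts
                    a.count -= node.count
                node.count, node.is_prefix = 0, False
                node.children = [None, None]

        visit(root, '', [])
        if root.count > 0:                # leftover prefixes form the last bucket
            bucket = []
            collect(root, '', bucket)
            buckets.append(('', bucket))  # '' = the zero-length (default) prefix
        return buckets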

Slides 26-29: Subtree-split: Example (b = 4; the original slides step through the carving on an example trie)

Slide 30: Subtree-split: Remarks
- Subtree-split creates buckets whose sizes range from ⌈b/2⌉ to b (except the last, which ranges from 1 to b)
  - at most one covering prefix is added to each bucket
- The total number of buckets created ranges from ⌈N/b⌉ to ⌈2N/b⌉; each bucket results in one entry in the index TCAM
- Using subtree-split in a TCAM with K buckets, at most K + ⌈2N/K⌉ prefixes are searched from the index and data TCAMs during any lookup
- Total complexity of the subtree-split algorithm is O(N + NW/b), where W is the maximum prefix length

Slide 31: Post-order splitting
- Partitions the table into buckets of exactly b prefixes
  - an improvement over subtree-split, where the smallest and largest bucket sizes can vary by a factor of 2
  - this comes at the cost of more entries in the index TCAM
- Partitioning step: post-order traversal of the trie, looking for subtrees to carve out, but:
- Buckets are made from collections of subtrees, rather than just a single subtree
  - this is because the trie may not contain ⌈N/b⌉ subtrees of exactly b prefixes each

Slide 32: Post-order splitting
- postorder-split: does a post-order traversal of the trie, calling carve-exact to carve out subtree collections of size b
- carve-exact: does the actual carving
  - at a node with count = b, it can simply carve out that subtree
  - at a node with count < b whose parent has count <= b, it does nothing (since we will later have a chance to carve the parent)
  - at a node with count x, where x < b but the parent's count is > b, it:
    - carves out the subtree of size x at this node, and
    - recursively calls carve-exact again, this time looking for a carving of size b - x (instead of b)

Slide 33: Post-order split: Algorithm (pseudocode in the original slide)
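As with subtree-split, the pseudocode is an image; here is a hedged sketch of the same logic, where 'need' tracks how much of the current bucket remains to fill, reusing TrieNode and collect from the sketches above:

    def postorder_split(root, b):
        """Carve collections of subtrees totaling exactly b prefixes per bucket."""
        buckets, current = [], []
        state = {'need': b}                # capacity left in the current bucket

        def visit(node, path, ancestors):
            if node is None:
                return
            visit(node.children[0], path + '0', ancestors + [node])
            visit(node.children[1], path + '1', ancestors + [node])
            parent = ancestors[-1] if ancestors else None
            need, take = state['need'], node.count
            if take and (take == need or
                         (take < need and (parent is None or parent.count > need))):
                collect(node, path, current)     # node's path (plus covering
                for a in ancestors:              # prefixes) feeds the index TCAM
                    a.count -= take
                node.count, node.is_prefix = 0, False
                node.children = [None, None]
                state['need'] -= take
                if state['need'] == 0:           # bucket holds exactly b prefixes
                    buckets.append(current[:])
                    current.clear()
                    state['need'] = b

        visit(root, '', [])
        if current:                              # the last bucket may be smaller
            buckets.append(current[:])
        return buckets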

Slides 34-36: Post-order split: Example (b = 4; the original slides step through the carving)

Slide 37: Postorder-split: Remarks
- Postorder-split creates buckets of size exactly b (except the last, which ranges from 1 to b)
  - at most W covering prefixes are added to each bucket, where W is the length of the longest prefix in the table
- The total number of buckets created is exactly ⌈N/b⌉; each bucket results in at most W + 1 entries in the index TCAM
- Using postorder-split in a TCAM with K buckets, at most (W + 1)K + ⌈N/K⌉ + W prefixes are searched from the index and data TCAMs during any lookup
- Total complexity of the postorder-split algorithm is O(N + NW/b)

Slide 38: Post-order split: Experimental results
- Algorithm running time (results in the original slide)

Slide 39: Post-order split: Experimental results
- Reduction in routing-table entries searched (results in the original slide)

Slide 40: Route Table Updates
- Briefly explore performance in the face of routing-table updates
- Adding routes may cause a TCAM bucket to overflow, requiring repartitioning of the prefixes and rewriting the entire table into the TCAM
- Real-life update traces (about 3.5M updates each) are applied to the bit-selection and trie-based schemes, to see how often recomputation is needed

Slide 41: Route Table Updates
- Bit-selection architecture:
  - apply the brute-force heuristic to the initial table; note the size c_max of the largest bucket
  - recompute the hashing bits when any bucket grows beyond c_thresh = (1 + t) × c_max, for some threshold t
  - when recomputing, first try the static (simple) heuristic; if needed, try the greedy algorithm; fall back on brute force if necessary (see the sketch below)
- Trie-based architecture:
  - similar threshold-based strategy
  - subtree-split: use a bucket size of ⌈2N/K⌉
  - post-order splitting: use a bucket size of ⌈N/K⌉
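A sketch of the threshold trigger for the bit-selection case, reusing the hypothetical heuristic helpers from the earlier sketches:

    def maybe_recompute(first16_counts, current_bits, k, c_max, t):
        """Recompute hashing bits only when some bucket exceeds
        c_thresh = (1 + t) * c_max; escalate through the heuristics."""
        c_thresh = (1 + t) * c_max
        if max_bucket_size(first16_counts, current_bits) <= c_thresh:
            return current_bits               # split still fits the budget
        bits = simple_bits(first16_counts, k, c_thresh)
        if bits is not None:
            return bits
        bits = greedy_bits(first16_counts, k)
        if max_bucket_size(first16_counts, bits) <= c_thresh:
            return bits
        return brute_force_bits(first16_counts, k, c_thresh)[0]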

Slide 42: Route Table Updates
- Results for the bit-selection architecture (results in the original slide)

Slide 43: Route Table Updates
- Results for the trie-based architecture (results in the original slide)

Slide 44: Route Table Updates: the "post-opt" algorithm
- Post-opt: the post-order split algorithm, with clever handling of updates:
  - with post-order split, prefixes can be transferred between neighboring buckets easily (only a few writes to the index and data TCAMs are needed)
  - so if a bucket becomes overfull, one of its prefixes can usually just be transferred to a neighboring bucket
  - repartitioning is then needed only when both neighboring buckets are also full

Slide 45: Summary
- TCAMs would be great for routing lookup if they didn't use so much power
- CoolCAMs: two architectures that use partitioned TCAMs to reduce power consumption in routing lookup
  - bit-selection architecture
  - trie-based table partitioning (subtree-split and postorder-splitting)
  - each scheme has its own subtle advantages and disadvantages, but overall they seem to work well

Slide 46: Discussion

