Restartability Management in the Cisco Core Router CRS/NG
Stefan Schaeckeler (Cisco Systems, Inc.)
Ashwin Narasimha Murthy (Google, Inc.)


1 Restartability Management in the Cisco Core Router CRS/NG
Stefan Schaeckeler (Cisco Systems, Inc.), Ashwin Narasimha Murthy (Google, Inc.)

2 Table of Contents
- System Overview
- CRS/NG Restartability Overview: Problem Definition and High Level Solution
- Concrete Example: Statistics Resource Manager Library
- Conclusion

3 System Overview
- Core router: an extremely complex system
  - SW: 16 MLOC
  - HW: several chassis, LCs (1 CPU, 5 NPUs, chips galore), RPs (1 CPU, chips galore), fabric cards, blade cards, …
  - Forms a distributed system
- 99.9...9% uptime

4 System Overview
- System Manager: restarts a crashed process
  - HW bug
  - SW bug
- Process must maintain its state (after a crash)
- CRS/NG approach
  - Key data structures in shared memory
  - Well-written algorithms guarantee consistency
- CRS 1 -> CRS 3 -> CRS/NG (final name?)

5 CRS/NG Restartability Overview
- CRS/NG runs Cisco IOS/XR
- Cisco IOS/XR abstraction layer on Linux
- Sophisticated IPC
- Sophisticated shared memory API
  - Special malloc for shared memory
  - Static configuration file
    - Mapping identifiers to fixed virtual addresses
    - STATS_RESTART 0x50000000
  - (Re)attaching to shared memory via identifier
  - Previously allocated objects always available
  - …
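The idea of mapping an identifier to a fixed virtual address can be sketched with plain POSIX calls. This is only an illustration, not the IOS/XR shared memory API: `win_table`, `win_attach`, and the backing file path are hypothetical names, and a file-backed `mmap` with `MAP_FIXED` stands in for whatever the real library does. Only the identifier `STATS_RESTART` and the address `0x50000000` come from the slide.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the static configuration file:
 * identifier -> fixed virtual address. */
struct win_cfg {
    const char *id;
    void       *addr;
};

static const struct win_cfg win_table[] = {
    { "STATS_RESTART", (void *)0x50000000 },  /* pair taken from the slide */
};

/* Attach (or re-attach) the window named `id`, backed by the file `path`.
 * Returns the configured fixed address, or NULL on failure. */
static void *win_attach(const char *id, const char *path, size_t len)
{
    assert(id != NULL && path != NULL && len > 0);

    for (size_t i = 0; i < sizeof win_table / sizeof win_table[0]; i++) {
        if (strcmp(win_table[i].id, id) != 0)
            continue;

        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t)len) != 0) {
            close(fd);
            return NULL;
        }
        /* MAP_FIXED replaces anything already mapped in this range, so the
         * range has to be reserved for this window by convention. */
        void *p = mmap(win_table[i].addr, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);  /* the mapping stays valid after the descriptor is closed */
        return p == MAP_FAILED ? NULL : p;
    }
    return NULL;    /* unknown identifier */
}
```

Because the window always appears at the same address, raw pointers stored inside it (for example list links) are still valid after a restarted process attaches again, which is what makes "previously allocated objects always available" workable.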

6 CRS/NG Restartability Overview
Process requiring restartability:
- Key data structures in shared memory
- Careful algorithm design to avoid
  - Temporary inconsistencies: account1 := account1+X; account2 := account2-X;
  - Pointer operations (disconnection of linked lists)
  - Crashes during IPCs
  - Crashes before a return; (caller records success)
- Optional recovery phase
- Compromises are possible

7 Concrete Example: Statistics Resource Manager Library
HW: Extremely simplified view of CRS/NG [figure]

8 Concrete Example: Statistics Resource Manager Library
SW: Somewhat simplified view of CRS/NG [figure; label: Statistics Manager]

9 Concrete Example: Statistics Resource Manager Library
Client application / library crashes -> restart
- Client application: state is gone
  - Stats pointers are lost
  - Other state is lost
- Stats Lib: state is gone
  - Stats pointers are lost
- Solution for Stats Lib
  - Keep freelists in shared memory
  - Smart algorithm for keeping state consistent

10 Concrete Example: Statistics Resource Manager Library
Step 1: Keeping State in Shared Memory

    stats_cl_ctx_st *mstats_cl_bind(char *name)
    {
        void *shmem;
        stats_cl_ctx_st *con;
        int i;

        /* open shmem at a predetermined address (POSIX mmap: MAP_FIXED flag) */
        shmem = shmwin_attach(SSE_STATS_RESTART_ADDRESS);
        /* cast to char * so the byte offset is added portably */
        con = (stats_cl_ctx_st *)((char *)shmem + name_to_offset(name));

        if (strcmp(con->name, name)) {
            /* first bind: init "empty" context */
            for (i = 0; i <= max; i++)      /* slide notation: freelist[0..max] */
                con->freelist[i] = NULL;
            con->mutex = 0;
            strcpy(con->name, name);
        } else {
            /* restart: do nothing, just return con */
        }
        return con;
    }

11 Concrete Example: Statistics Resource Manager Library
Step 2a: Smart Algorithm: A Pragmatic Approach (chosen for CRS/NG)
Few concepts:
- (Re-)moving nodes from the freelist. Worst case: a page is lost (bad?)
- Requesting a fresh page from the server. Worst case: a page is lost (bad?)
- Updating the bitmap: some pointers are marked as allocated, but the client does not pick them up. Worst case: some pointers are lost (bad?)

12 Concrete Example: Statistics Resource Manager Library
Discussion of Worst Case Scenarios
- A page (or a few pointers within it) is lost
  = 256 out of 8 million stats pointers in NPU memory: no big deal
  = 80 bytes out of several GB of CPU memory for the node structure: no big deal
- Client frees a pointer from a lost page
  - Error code is returned
  - Client is irritated but has to ignore it
- We never give out the same pointer twice
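One way to get the trade-off described above (a crash may leak a page, but a pointer is never handed out twice) is to unlink a page from the freelist before handing it to anyone. The following is a minimal sketch under assumptions, not the CRS/NG code: `page_node`, `stats_ctx`, and the `ctx_*` functions are hypothetical, the context is an ordinary struct standing in for shared memory, and the `crash_after_unlink` flag merely simulates the process dying at the critical point.

```c
#include <assert.h>
#include <stddef.h>

struct page_node {
    struct page_node *next;
    unsigned          base;   /* hypothetical: index of the page's first stats pointer */
};

struct stats_ctx {            /* imagined to live in shared memory */
    struct page_node *freelist;
};

/* Put a page on the freelist. The node becomes reachable only with the
 * final store, so a crash before it loses the node but corrupts nothing. */
static void ctx_push(struct stats_ctx *ctx, struct page_node *n)
{
    n->next = ctx->freelist;
    ctx->freelist = n;
}

/* Take one page off the freelist. A non-zero crash_after_unlink simulates
 * a crash between unlinking the page and handing it to the caller. */
static struct page_node *ctx_take_page(struct stats_ctx *ctx, int crash_after_unlink)
{
    assert(ctx != NULL);
    struct page_node *n = ctx->freelist;
    if (n == NULL)
        return NULL;
    ctx->freelist = n->next;  /* step 1: one store removes the page for good */
    if (crash_after_unlink)
        return NULL;          /* "crash": the page is leaked, never reused */
    n->next = NULL;           /* step 2: hand the page to the caller */
    return n;
}

/* Count the pages still reachable from the freelist. */
static size_t ctx_free_pages(const struct stats_ctx *ctx)
{
    size_t count = 0;
    for (const struct page_node *p = ctx->freelist; p != NULL; p = p->next)
        count++;
    return count;
}
```

Doing the steps in the opposite order (hand out first, unlink later) would leave the page on the freelist after a crash, and a restarted process could then give it out a second time.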

13 Concrete Example: Statistics Resource Manager Library
Step 2b: Smart Algorithm: A Perfect Approach
- Complicated algorithm / very difficult implementation
  - Further pointers in shared memory
  - Need to figure out where the process crashed and continue from there
- Requirement: interacting libraries and processes must be "perfect" as well

14 Conclusion
Pragmatic approach of CRS/NG
+ Easy to implement
+/− Crashes: worst case is a small memory leak
+ No run-time performance hit
Perfect approach
− Very difficult to implement -> error prone
+ Crashes: no memory leak
− Perhaps a run-time performance hit

15 Thank You

