
1 A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring
Patrick P. C. Lee (1), Tian Bu (2), Girish Chandranmenon (2)
(1) The Chinese University of Hong Kong
(2) Bell Labs, Alcatel-Lucent
April 2010

2 Outline
- Motivation
- MCRingBuffer, a multi-core ring buffer
- Parallel network monitoring prototype
- Conclusions

3 Network Traffic Monitoring
Monitoring data streams in today's networks is essential for network management:
- accounting
- resource provisioning
- failure diagnosis
- intrusion detection/prevention
Goal: achieve line-rate monitoring
- Monitoring speed must keep up with link bandwidth (i.e., prepare for the worst)
Challenges:
- Data volume keeps increasing (e.g., to Gigabit scales)
- Single-CPU systems may no longer support line-rate monitoring

4 Can Multi-Core Help?
Can multi-core architectures help line-rate monitoring?
- Parallelize packet processing
The answer should be "yes"... yet exploiting the full potential of multi-core is still challenging.
Inter-core communication has overhead:
- Upper layer: protocol messages
- Lower layer: thread synchronization in shared data structures
[Figure: single-core case (one core receiving raw packets) vs. multi-core case (quad-core CPU receiving raw packets)]

5 Can Multi-Core Help?
Multi-core helps only if we minimize inter-core communication overhead.
- Let's focus on minimizing thread synchronization
- This benefits a broad class of multi-threaded network monitoring applications

6 Our Contribution
Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring.
- Why lock-free? It allows concurrent thread accesses.
- Why cache-efficient? It saves expensive memory accesses.
We embed the mechanism into MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures.

7 Producer/Consumer Problem
A classical OS problem.
Ring buffer: a bounded buffer with a fixed number of slots.
Thread synchronization:
- Producer inserts elements when the buffer is not full
- Consumer extracts elements when the buffer is not empty
First-in-first-out (FIFO): elements are extracted in the same order they were inserted.
[Figure: producer inserting elements into a ring buffer; consumer extracting them]

8 Producer/Consumer Problem
Ring buffer in the multi-core context:
[Figure: producer and consumer on separate cores, each with its own L1 cache, sharing an L2 cache; the control variables and ring buffer reside in memory, reached over the system bus]
Thread synchronization operates on the control variables. Make those operations as cache-friendly as possible.

9 Lamport's Lock-Free Ring Buffer [Lamport, Comm. of ACM, 1977]
Operates on two control variables, read and write, which point to the next read slot and the next write slot respectively, with NEXT(x) = (x + 1) % N.
[Figure: ring buffer slots 0 .. N-1 with read and write pointers]

Insert(T element)
  wait until NEXT(write) != read
  buffer[write] = element
  write = NEXT(write)

Extract(T* element)
  wait until read != write
  *element = buffer[read]
  read = NEXT(read)
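For reference, a minimal C transcription of the pseudocode above (the element type, buffer size, and spin loops are illustrative choices; read_idx/write_idx stand in for the slide's read/write, and correctness leans on the atomicity and sequential-consistency assumptions of slide 12):

#define N 2048                     /* buffer capacity (assumed) */
#define NEXT(x) (((x) + 1) % N)

typedef int T;                     /* element type (assumed) */

static T buffer[N];
static volatile int read_idx = 0;  /* next slot to extract */
static volatile int write_idx = 0; /* next slot to insert */

/* Producer only: spin while full, then publish the element. */
void Insert(T element)
{
    while (NEXT(write_idx) == read_idx)
        ;                          /* buffer full */
    buffer[write_idx] = element;
    write_idx = NEXT(write_idx);
}

/* Consumer only: spin while empty, then take the element. */
void Extract(T *element)
{
    while (read_idx == write_idx)
        ;                          /* buffer empty */
    *element = buffer[read_idx];
    read_idx = NEXT(read_idx);
}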

10 Previous Work
FastForward [Giacomoni et al., PPoPP, 2008]:
- couples data and control operations
- needs a special NULL data element defined by applications
Hardware-primitive ring buffers:
- support multiple producers/multiple consumers
- use hardware synchronization primitives (e.g., compare-and-swap)
- hardware primitives are expensive in general (sketched below)
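To make the last point concrete, here is a rough, hypothetical sketch (not code from any of the cited systems) of how a multi-producer buffer might reserve a slot with a C11 compare-and-swap; each retry bounces the counter's cache line between competing cores:

#include <stdatomic.h>

#define NSLOTS 2048                     /* capacity (assumed) */

static atomic_uint write_pos;           /* shared reservation counter */

/* Reserve the next write slot among multiple producers
 * (full-buffer checking omitted for brevity). */
unsigned reserve_slot(void)
{
    unsigned old = atomic_load(&write_pos);
    /* On failure, CAS refreshes `old` with the current value and we
     * retry; under contention this loop is far costlier than the
     * plain loads and stores of a single-producer design. */
    while (!atomic_compare_exchange_weak(&write_pos, &old,
                                         (old + 1) % NSLOTS))
        ;
    return old;
}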

11 MCRingBuffer Overview
Goal: use Lamport's ring buffer as a building block and further minimize the cost of thread synchronization.
Properties:
- Lock-free: allows concurrent accesses by producer and consumer
- Cache-efficient: improves the cache locality of synchronization
- Generic: no assumptions on data types or insert/extract patterns
- Deployable: works on general-purpose multi-core CPUs
Components:
- Cache-line protection
- Batch updates of control variables

12 MCRingBuffer Assumptions
Assumptions inherited from Lamport's ring buffer:
- single producer/single consumer
- reads and writes of read/write are atomic
- memory accesses follow sequential consistency

13 Cache-line Protection
Caches operate in units of cache lines. False sharing occurs when two threads access different variables on the same cache line:
- the cache line is invalidated when a thread modifies a variable
- the cache line is reloaded from memory when a thread reads a different variable on it, even one that is unchanged
[Figure: read, write, and N sharing one cache line]
read/write are modified frequently for thread synchronization, so N (the ring buffer size) is reloaded from memory even though it is constant.

14 Cache-line Protection
Add padding bytes to avoid false sharing (CL = cache line size):

int read
int write
char cachePad1[CL - 2*sizeof(int)]
int N
char cachePad2[CL - sizeof(int)]

[Figure: read and write now occupy a cache line separate from N]
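Transcribed into a C struct (assuming, for illustration, a 64-byte cache line):

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

struct RingControl {
    /* shared control variables, padded to their own cache line */
    volatile int read;
    volatile int write;
    char cachePad1[CACHE_LINE - 2 * sizeof(int)];
    /* constant, on a separate cache line: writes to read/write can
     * no longer invalidate the line that holds N */
    int N;
    char cachePad2[CACHE_LINE - sizeof(int)];
};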

15 Cache-line Protection
Use cache-line protection to minimize memory accesses. The variables are grouped onto separate cache lines:
- Shared variables: read, write (the main controls of synchronization)
- Consumer's local variables: localWrite, nextRead
- Producer's local variables: localRead, nextWrite
- Constants: N
Use the local variables to "guess" the shared variables (see the sketch below).
Goal: minimize the frequency of reading the shared control variables.
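A sketch of the producer's insert path under this layout, building on the sketch after slide 9 (the control flow follows the idea described above; the variable names match the slide, but this is not the authors' verbatim code):

/* Producer side: nextWrite and localRead live in the producer's
 * padded section; `read` is the shared control variable. */
void Insert(T element)
{
    if (NEXT(nextWrite) == localRead) {
        /* The local guess says the buffer is full. Only now read the
         * shared `read` (one cache-line transfer) to refresh it. */
        while (NEXT(nextWrite) == read)
            ;                  /* spin until the consumer advances */
        localRead = read;
    }
    buffer[nextWrite] = element;
    nextWrite = NEXT(nextWrite);
    /* the shared `write` is advanced in batches; see the next slide */
}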

16 Batch Updates of Control Variables
Intuition: nextRead/nextWrite are the positions to read/write next; update the shared read/write only after every batchSize reads/writes.

Producer:
  buffer[nextWrite] = element
  nextWrite = NEXT(nextWrite)
  wBatch++
  if (wBatch >= batchSize) {
    write = nextWrite
    wBatch = 0
  }

Consumer:
  *element = buffer[nextRead]
  nextRead = NEXT(nextRead)
  rBatch++
  if (rBatch >= batchSize) {
    read = nextRead
    rBatch = 0
  }

Goal: minimize the frequency of writing the shared control variables.
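Combining this batching with the local-variable guessing of slide 15, the consumer side might look like the following sketch (again an illustration, not verbatim code from the paper):

/* Consumer side: nextRead, localWrite, and rBatch live in the
 * consumer's padded section; `read` and `write` are shared. */
void Extract(T *element)
{
    if (nextRead == localWrite) {
        /* Local guess says the buffer is empty: refresh the local
         * copy from the shared `write`. */
        while (nextRead == write)
            ;                  /* spin until the producer advances */
        localWrite = write;
    }
    *element = buffer[nextRead];
    nextRead = NEXT(nextRead);
    rBatch++;
    if (rBatch >= batchSize) { /* batch update of the shared `read` */
        read = nextRead;
        rBatch = 0;
    }
}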

17 Batch Updates of Control Variables
Limitation: read/write are advanced on a per-batch basis, so elements may not be extracted even when the buffer is not empty.
However, if the elements are raw packets in high-speed networks, read/write will be updated regularly.

18 Correctness of MCRingBuffer
Correctness builds on Lamport's ring buffer:
- Lamport's: insert only if write - read < N; extract only if read < write
- We prove for MCRingBuffer: insert only if nextWrite - nextRead < N; extract only if nextRead < nextWrite
Details in the paper.

19 Evaluation
Hardware: Intel Xeon 5355 quad-core
- sibling cores: a pair of cores sharing an L2 cache
- non-sibling cores: a pair of cores not sharing an L2 cache
Ring buffers compared:
- LockRingBuffer: lock-based ring buffer
- BasicRingBuffer: Lamport's ring buffer
- MCRingBuffer with batchSize = 1: cache-line protection only
- MCRingBuffer with batchSize > 1: cache-line protection + batch control updates
Metrics:
- Throughput: number of insert/extract pairs per second
- Number of L2 cache misses: number of cache-line reload operations

20 Experiment 1
Throughput vs. element size (buffer capacity = 2K elements).
[Plots: sibling cores and non-sibling cores]
MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes.

21 Experiment 2
Throughput vs. buffer capacity (element size = 128 bytes).
[Plots: sibling cores and non-sibling cores]
MCRingBuffer's throughput is invariant once the buffer capacity is large enough.

22 Experiment 3
Code profiling from the Intel VTune Performance Analyzer. Metric numbers for 10M inserts/extracts (element size = 8 bytes, capacity = 2K elements):

                          BasicRingBuffer   MCRingBuffer (batchSize = 50)
# core cycles             1130M / 1097M     137M / 113M
# retired instructions    358M / 287M       231M / 219M
# L2 cache misses         746K / 808K       102K / 80K

MCRingBuffer improves cache locality.

23 Recap of Evaluation
MCRingBuffer improves throughput in various scenarios:
- different data sizes
- different buffer capacities
- sibling/non-sibling cores
MCRingBuffer gains its higher throughput via:
- careful organization of the control variables
- careful accesses to the control variables
MCRingBuffer's gain does not require any special insert/extract patterns.

24 Parallel Traffic Monitoring
Applying MCRingBuffer to parallel traffic monitoring.
[Figure: raw packets enter a Dispatcher, which pushes decoded packets through ring buffers to multiple SubAnalyzers; the SubAnalyzers send state reports to a MainAnalyzer]

25 Parallel Traffic Monitoring
Dispatch stage:
- decode raw packets
- distribute decoded packets by (srcIP, dstIP) (one possible rule is sketched below)
SubAnalysis stage:
- local analysis on address pairs, e.g., 5-tuple flow stats, vertical portscans
MainAnalysis stage:
- global analysis: aggregate the results of all SubAnalyzers, e.g., a source's traffic volume, horizontal portscans
Evaluation results: MCRingBuffer helps scale up packet processing throughput (details in the paper).
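For concreteness, here is one hypothetical dispatch rule; the symmetric XOR hash and the function name are illustrative assumptions, not necessarily the prototype's actual scheme:

/* Map an address pair to a SubAnalyzer index. XOR-ing the addresses
 * makes the mapping symmetric, so both directions of a connection
 * reach the same SubAnalyzer and its per-pair state stays local. */
unsigned pick_subanalyzer(unsigned src_ip, unsigned dst_ip,
                          unsigned num_subanalyzers)
{
    return (src_ip ^ dst_ip) % num_subanalyzers;
}

The Dispatcher would then insert each decoded packet into the MCRingBuffer of the chosen SubAnalyzer.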

26 Take-away Messages
Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism.
Next question: how do we apply MCRingBuffer to different network monitoring problems?

