Capriccio: Scalable Threads for Internet Services


1 Capriccio: Scalable Threads for Internet Services
Rob von Behren, Jeremy Condit, Feng Zhou, George Necula, and Eric Brewer. Presented by Guoyang Chen.

2 Overview
Motivation; Threads vs. Events; User-Level Threads; Capriccio Implementation; Linked Stack Management; Resource-Aware Scheduling; Evaluation.

3 Motivation
High demand for web content. Web services are becoming more complex, requiring maximum server performance. What is a well-conditioned service?

4 Well-Conditioned Service
A well-conditioned service behaves like a simple pipeline: as offered load increases, throughput increases proportionally, and once saturated, throughput does not degrade substantially. (Graph: performance vs. load in concurrent tasks, with the ideal curve, a peak where some resource is at its maximum, and overload where some resource is thrashing.)

5 Threads vs. Events: Thread-Based Concurrency
There are two common ways to structure a concurrent program: thread-based programming and event-based programming. In the thread-based model, each incoming request is dispatched to a separate thread, which processes the request and returns a result to the client; other I/O operations, such as disk access, are not shown here but would be folded into each thread's request processing. With too many threads, however, resource usage, context-switch overhead, and lock contention all grow. The traditional solution is to bound the total number of threads, but determining the ideal number of threads is hard.
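
To make the model concrete, here is a minimal thread-per-request sketch in C using POSIX threads. It is illustrative only; serve_forever and handle_request are invented names, not code from the paper.

/* One kernel thread per connection: simple control flow, but with tens of
 * thousands of connections the per-thread stacks and context switches become
 * the bottleneck described above. */
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

static void *handle_request(void *arg)
{
    int client = (int)(long)arg;
    char buf[4096];
    ssize_t n = read(client, buf, sizeof buf);   /* may block on a slow client */
    if (n > 0)
        write(client, buf, (size_t)n);           /* echo back as a stand-in for real work */
    close(client);
    return NULL;
}

void serve_forever(int listen_fd)                /* listen_fd: an already bound, listening socket */
{
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);
        if (client < 0)
            continue;
        pthread_t t;
        if (pthread_create(&t, NULL, handle_request, (void *)(long)client) == 0)
            pthread_detach(t);                   /* fire and forget: one thread per request */
        else
            close(client);                       /* thread limit reached: drop the connection */
    }
}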

6 Threads vs. Events: Event-Based Concurrency
A small number of event-processing threads drive many finite state machines (FSMs), yielding efficient and scalable concurrency. Many examples: the Click router, the Flash web server, TP monitors, etc.
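
For contrast, a minimal event-loop sketch with epoll; it is not taken from any of the cited systems, and the conn_state structure and its stage field are invented to show the explicit per-connection state that the thread-based model keeps on the stack for free.

/* One thread multiplexes many connections; per-connection state must be
 * carried in an explicit FSM instead of living on a thread's stack. */
#include <sys/epoll.h>
#include <unistd.h>
#include <stdlib.h>

struct conn_state {            /* the "FSM" for one connection */
    int fd;
    int stage;                 /* e.g. 0 = reading request, 1 = writing reply */
};

void event_loop(int epfd)      /* epfd: an epoll instance with connections already registered */
{
    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            struct conn_state *c = events[i].data.ptr;
            /* Dispatch on the saved stage: control flow is chopped into
             * callbacks, which is what makes event code hard to follow. */
            if (c->stage == 0) {
                char buf[4096];
                ssize_t r = read(c->fd, buf, sizeof buf);
                if (r <= 0) { close(c->fd); free(c); continue; }
                c->stage = 1;  /* remember where we are for the next event */
            } else {
                /* write the reply here, then reset or close the connection */
                c->stage = 0;
            }
        }
    }
}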

7 SEDA: Staged Event-Driven Architecture
Decompose the service into stages separated by queues; each stage performs a subset of request processing and is internally event-driven. Each stage contains a thread pool to drive its execution, but threads are not exposed to applications; a dynamic controller grows and shrinks the thread pools with demand.

8 Drawbacks of Events
Event systems hide the control flow, which makes them difficult to understand and debug, and they eventually evolve into call-and-return event pairs. Programmers must match related events and save and restore state by hand: events require manual state management. Capriccio's position: instead of adopting the event-based model, fix the thread-based model.

9 Threads vs. Events: Why Threads?
Threads are a more natural programming model: control flow is more apparent, exception handling is easier, and state management is automatic. They are also a better fit with current tools and hardware, with better existing infrastructure.

10 Capriccio: Goals and Mechanisms
Goals: simplify the programming model (one thread per concurrent activity), scalability to 100K+ threads, support for existing APIs and tools, and automated application-specific customization. Mechanisms: user-level threads; plumbing that avoids O(n) operations; compile-time analysis; run-time analysis.

11 Thread Design Principles
Decouple the programming model from the OS. Kernel threads abstract the hardware and expose device concurrency; user-level threads provide a clean programming model and expose logical concurrency. (Diagram: application on top, user threads in the middle, OS below.)

12 Thread Design Principles
Same content as the previous slide; the diagram build shows the thread layer placed at user level, between the application and the OS.

13 Thread Design and Scalability
User-level threads give flexibility: Capriccio can adopt new asynchronous I/O mechanisms without changing application code, and the user-level thread scheduler can be built along with the application. They are also extremely lightweight, which helps performance: thread synchronization overhead is reduced, mutex acquisition and release require no kernel crossings, and memory management at user level is more efficient.

14 Drawbacks of User-Level Threads
Blocking I/O calls must be replaced by non-blocking mechanisms (such as epoll), which requires a wrapper layer to translate blocking calls into non-blocking ones and can increase the number of kernel crossings. It is also difficult to use multiple processors, and with real parallelism synchronization is no longer "for free".

15 Capriccio Implementation
A user-level threading library in which all thread operations are O(1). Linked stacks address the problem of stack allocation for large numbers of threads, using a combination of compile-time and run-time analysis. A resource-aware scheduler.

16 Context Switches
Built on top of Edgar Toernig's coroutine library; context switches are fast when threads yield.
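
Capriccio's switch code comes from Toernig's coroutine library; as a stand-in, the sketch below uses POSIX ucontext to show the shape of a user-level yield: save the current register state and restore the scheduler's, with no kernel thread switch involved. The variable names are assumptions.

#include <ucontext.h>

extern ucontext_t  scheduler_ctx;     /* assumed: context of the user-level scheduler */
extern ucontext_t *current_thread;    /* assumed: set by the scheduler before dispatch */

/* Called by a user thread when it voluntarily gives up the CPU. */
void thread_yield(void)
{
    /* Save our registers and stack pointer into *current_thread and jump
     * back to the scheduler, which will pick the next runnable thread. */
    swapcontext(current_thread, &scheduler_ctx);
}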

17 I/O
Capriccio intercepts blocking I/O calls and uses epoll for non-blocking I/O.
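
A sketch of what such interception might look like, under assumptions: the descriptor has been set non-blocking, and wait_for_io() is a hypothetical library routine that parks only the calling user-level thread until epoll reports the descriptor ready.

#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>

extern void wait_for_io(int fd, uint32_t events);  /* assumed: blocks this user thread only */

ssize_t wrapped_read(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);            /* fd was made O_NONBLOCK at open/accept */
        if (n >= 0)
            return n;                              /* got data (or EOF) without blocking */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                             /* real error: hand it to the caller */
        /* Would block: register interest and yield this user-level thread
         * until the scheduler sees EPOLLIN on fd, then retry the read. */
        wait_for_io(fd, EPOLLIN);
    }
}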

18 Scheduling
The scheduler behaves very much like an event-driven application, but the events are hidden from programmers.

19 Synchronization
Capriccio supports cooperative threading on single-CPU machines, so most synchronization requires only Boolean checks, as sketched below.
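
A sketch of a cooperative mutex under these assumptions (single CPU, threads switch only at explicit yield points); the wait-queue helpers are hypothetical names, not Capriccio's API.

#include <stdbool.h>

struct waitq;                                    /* opaque wait queue (assumed) */
extern void enqueue_current(struct waitq **q);   /* assumed scheduler internals */
extern void block_current_thread(void);
extern void wake_one_waiter(struct waitq **q);

struct coop_mutex {
    bool locked;
    struct waitq *waiters;        /* user threads queued on this mutex */
};

void coop_mutex_lock(struct coop_mutex *m)
{
    while (m->locked) {
        /* No data race: with cooperative scheduling on one CPU, another
         * thread can only run at explicit yield/block points, so the flag
         * cannot change between this test and the block below. */
        enqueue_current(&m->waiters);
        block_current_thread();   /* yields to the user-level scheduler */
    }
    m->locked = true;             /* a plain Boolean store, no atomic needed */
}

void coop_mutex_unlock(struct coop_mutex *m)
{
    m->locked = false;
    wake_one_waiter(&m->waiters); /* make one blocked thread runnable again */
}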

20 Threading Microbenchmarks
Test machine: an SMP with two 2.4 GHz Xeon processors, 1 GB of memory, and two 10K RPM SCSI Ultra II hard drives, running Linux. Compared: Capriccio, LinuxThreads, and NPTL (Native POSIX Threads for Linux).

21 Latencies of Thread Primitives
NPTL: Native POSIX Thread Library. Latencies in microseconds:

                            Capriccio   LinuxThreads   NPTL
    Thread creation           21.5       (not given)   17.7
    Thread context switch      0.24         0.71        0.65
    Uncontended mutex lock     0.04         0.14        0.15

22 Thread Scalability
Producer threads put empty messages into a shared buffer, and consumer threads "process" each message by looping for a random amount of time.

23 I/O Performance: Network
Network performance is measured by passing a number of tokens among pipes, which simulates the effect of slow client links. Capriccio shows about 10% overhead compared to epoll and is twice as fast as both LinuxThreads and NPTL with more than 1000 threads. Disk I/O is comparable to kernel threads.

24 I/O Performance (graph; epoll_wait() shown for comparison)

25 I/O Performance
Capriccio benefits from the kernel's disk head scheduling algorithm because it uses asynchronous I/O primitives.

26 Disk I/O with Buffer Cache
At a low miss rate, Capriccio's throughput is about 50% of NPTL's; the source of the overhead is the asynchronous I/O interface.

27 Linked Stack Management
With fixed stacks, LinuxThreads allocates 2 MB per stack, so 1 GB of virtual memory holds only about 500 threads.

28 Safety: Linked Stacks
The problem with fixed stacks: you either risk overflow or waste space. The solution is linked stacks: allocate stack space as needed, using compiler analysis to add runtime checkpoints that guarantee enough space until the next check. (Diagram: a fixed stack showing overflow vs. wasted space, beside a chain of linked stack chunks.)
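
A sketch of the run-time side, with assumed names (alloc_chunk, switch_stack, CHUNK_SLACK, current_chunk): the checkpoint compares the space left in the current chunk against the compile-time bound for the path to the next checkpoint and links a new chunk only when necessary.

#include <stddef.h>

#define CHUNK_SLACK 4096                     /* assumed minimum extra room, MinChunk-like */

struct stack_chunk {
    char *limit;                             /* lowest usable address in this chunk */
    struct stack_chunk *prev;                /* previous chunk, for unlinking on return */
};

extern struct stack_chunk *current_chunk;    /* assumed per-thread state */
extern void *alloc_chunk(size_t size);       /* assumed chunk allocator */
extern void switch_stack(void *new_sp);      /* assumed: moves the stack pointer */

/* 'needed' is the compile-time bound on stack use until the next checkpoint. */
void stack_checkpoint(size_t needed)
{
    char *sp = (char *)&sp;                  /* rough estimate of the current stack pointer */
    if ((size_t)(sp - current_chunk->limit) < needed) {
        void *chunk = alloc_chunk(needed + CHUNK_SLACK);
        /* Continue on the freshly linked chunk; the old chunk is resumed
         * when the function that triggered the link returns. */
        switch_stack(chunk);
    }
}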

29 Linked Stacks: Algorithm
Build a weighted call graph (each function is annotated with its stack frame size) and insert checkpoints. A checkpoint determines whether there is enough stack space left to reach the next checkpoint. (Diagram: an example call graph with per-function frame sizes.)

30 Placing Checkpoints
Place one checkpoint on the back edge of every cycle in the call graph, and add further checkpoints so that the stack needed along the deepest call path between checkpoints stays bounded, as sketched below.

31 Linked Stacks: Algorithm
Parameters: MaxPath, MinChunk. Steps: break cycles, then trace back. Special cases: function pointers and external calls, which use a large stack chunk. (Diagram: the example call graph with checkpoint c1 inserted; MaxPath = 8.)

32-35 Linked Stacks: Algorithm (continued)
The same slide repeated as an animation: successive builds add checkpoints c2 through c5 to the example call graph (MaxPath = 8).

36 Dealing with Special Cases: Function Pointers
We don't know at compile time which procedure a function pointer will call, but we can compute a potential set of target procedures.

37 Dealing with Special Cases: External Functions
Allow programmers to annotate external library functions with trusted stack bounds, and allow larger stack chunks to be linked for external functions.

38 Tuning the Algorithm
Some stack space is inevitably wasted, both internally and externally. The tradeoffs: MaxPath controls the number of stack linkings, while MinChunk controls external fragmentation. (Diagram: internal vs. external waste.)

39 Memory Benefits
No preallocation of large stacks, which reduces the memory required to run large numbers of threads, and better paging behavior because stacks are used LIFO.

40 Case Study: Apache 2.0.44
Parameters: MaxPath = 2 KB, MinChunk = 4 KB. Running Apache under SPECweb99, the overall slowdown is about 3%: dynamic allocation 0.1%; linking to large chunks for external functions 0.5%; stack removal 10%.

41 Scheduling: The Blocking Graph
A lesson from event systems, which Capriccio applies to threads: each node is a location in the program where a thread blocked; the stage is deduced from stack traces taken at blocking points, and information about thread behavior is recorded there. (Diagram: a web server's blocking graph with nodes Accept, Read, Open, Read, Close, Write, Close.)

42 Scheduling: The Blocking Graph
Annotate each edge with its average running time (measured with the cycle counter); annotate each node with how long its outgoing edges take on average and with the changes in resource usage. (Same web-server blocking graph diagram.)
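
A sketch of that bookkeeping, with assumed names and an arbitrary smoothing constant: read the CPU cycle counter at each blocking point and fold the elapsed time into a running average for the edge just traversed.

#include <stdint.h>

static inline uint64_t read_cycles(void)           /* x86 time-stamp counter */
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

struct bg_edge {
    double avg_cycles;          /* smoothed CPU time charged to this edge */
};

/* Call at every blocking point: charge the cycles since the previous block
 * to the edge between the previous blocking point and this one. */
void note_blocking_point(struct bg_edge *edge, uint64_t *last_stamp)
{
    uint64_t now = read_cycles();
    double sample = (double)(now - *last_stamp);
    /* Exponentially weighted moving average; 0.3 is an arbitrary weight. */
    edge->avg_cycles = 0.7 * edge->avg_cycles + 0.3 * sample;
    *last_stamp = now;
}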

43 Resource-Aware Scheduling
Keep track of resource usage levels and decide dynamically whether each resource is at its limit. Annotate each node with the resources used on its outgoing edges, so the scheduler can predict the impact on each resource of scheduling threads from that node. Dynamically prioritize nodes (and thus threads) based on this information: increase use of a resource when it is underutilized and decrease use near saturation. Advantages: the system operates near the thrashing point and gets automatic admission control. (Same web-server blocking graph diagram.)

44 Tracking Resources
Memory usage is tracked by wrapping the malloc() family, with the resource limit detected by watching page-fault activity. File descriptors are tracked via open() and close() calls, with the limit estimated as the number of open connections at which response time jumps up.
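
A sketch of such tracking by wrapping the relevant calls; the counter names and the wrapper approach are assumptions for illustration, not Capriccio's exact mechanism.

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

static size_t bytes_in_use;     /* current heap usage seen by the wrappers */
static int    fds_in_use;       /* currently open descriptors */

void *tracked_malloc(size_t n)
{
    void *p = malloc(n);
    if (p)
        bytes_in_use += n;      /* freeing would need the size: a real tracker
                                   stores it in a header next to the block */
    return p;
}

int tracked_open(const char *path, int flags, mode_t mode)
{
    int fd = open(path, flags, mode);
    if (fd >= 0)
        fds_in_use++;
    return fd;
}

int tracked_close(int fd)
{
    int rc = close(fd);
    if (rc == 0)
        fds_in_use--;
    return rc;
}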

45 Pitfalls
It is tricky to determine the maximum capacity of a resource: thrashing depends on the workload (the disk can handle more requests when they are sequential rather than random), resources interact (e.g., VM vs. disk), and applications may manage memory themselves.

46 Yield Profiling
User-level threads are problematic if a thread fails to yield. Such threads are easy to detect, since their running times are orders of magnitude larger than normal; yield profiling identifies the places where a program fails to yield sufficiently often.
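
A sketch of the idea under stated assumptions: stamp a timer at every yield and report the blocking-graph location whenever the gap since the previous yield exceeds a threshold. read_cycles() and current_graph_node() are assumed helpers, and the threshold is arbitrary.

#include <stdint.h>
#include <stdio.h>

#define YIELD_WARN_CYCLES (100ULL * 1000 * 1000)    /* arbitrary "too long" threshold */

extern uint64_t read_cycles(void);                  /* e.g. rdtsc, as sketched earlier */
extern const char *current_graph_node(void);        /* assumed: name of the last blocking point */

static uint64_t last_yield_stamp;

/* Called from the scheduler each time a thread yields or blocks. */
void profile_yield(void)
{
    uint64_t now = read_cycles();
    if (last_yield_stamp && now - last_yield_stamp > YIELD_WARN_CYCLES)
        fprintf(stderr, "yield profiler: %s ran %llu cycles without yielding\n",
                current_graph_node(),
                (unsigned long long)(now - last_yield_stamp));
    last_yield_stamp = now;
}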

47 Web Server Performance
Server: 4 x 500 MHz Pentium processors, 2 GB of memory, an Intel e1000 Gigabit Ethernet card, running Linux. Workload: requests for 3.2 GB of static file data.

48 Web Server Performance
Request frequencies match those of SPECweb99. Each client connects to the server repeatedly and issues a series of five requests separated by 20 ms pauses. Apache's performance improved by 15% with Capriccio.

49 Web Server Performance

50 Runtime Overhead
Tested with Apache 2.0.44. Stack linking: 78% slowdown for a null function call, 3-4% overall. Resource statistics: 2% when collected all the time, 0.1% with sampling. Stack traces: 8% overhead.

51 Resource-Aware Admission Control
Touching pages too quickly causes thrashing. In the test, producer threads loop, adding memory to a global pool and randomly touching pages to force them to stay resident, while consumer threads loop, removing memory from the pool and freeing it. Capriccio quickly detects the overload condition and limits the number of producers.

52 Related Work
Programming models for high concurrency; user-level threads, where Capriccio is unique in its blocking graph, resource-aware scheduling, its focus on very large numbers of blocking threads, and POSIX compliance; application-specific optimization; stack management; resource-aware scheduling.

53 Future Work
Support multi-CPU machines; improve the resource-aware scheduler and the stack analysis; produce profiling tools to help tune Capriccio's stack parameters to an application's needs.

54 Thanks! Questions?

