Capriccio: Scalable Threads for Internet Services


1 Capriccio: Scalable Threads for Internet Services
Rob von Behren, Jeremy Condit, Feng Zhou, George Necula, and Eric Brewer. Presented by Guoyang Chen.

2 Overview
Motivation; Threads vs. Events; User-Level Threads; Capriccio Implementation; Linked Stack Management; Resource-Aware Scheduling; Evaluation.

3 Motivation
High demand for web content. Web services are becoming more complex, requiring maximum server performance. What is a well-conditioned service?

4 Well-Conditioned Service
A well-conditioned service behaves like a simple pipeline: as offered load increases, throughput increases proportionally, and once saturated, throughput does not degrade substantially. (Graph: performance vs. load in concurrent tasks, with the ideal curve, a peak where some resource is at its maximum, and overload where some resource is thrashing.)

5 Threads vs. Events: Thread-Based Concurrency
There are two common ways to structure a concurrent program: thread-based programming and event-based programming. In the thread-based model, each incoming request is dispatched to a separate thread, which processes the request and returns a result to the client; other I/O operations, such as disk access, are not shown here but would be folded into each thread's request processing. With too many threads, however, resource usage, context-switch overhead, and lock contention all grow. The traditional solution is to bound the total number of threads, but determining the ideal number of threads is hard.
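
To make the model concrete, here is a minimal thread-per-request sketch in C using POSIX threads. It is illustrative only; serve_forever and handle_request are invented names, not code from the paper.

/* One kernel thread per connection: simple control flow, but with tens of
 * thousands of connections the per-thread stacks and context switches become
 * the bottleneck described above. */
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

static void *handle_request(void *arg)
{
    int client = (int)(long)arg;
    char buf[4096];
    ssize_t n = read(client, buf, sizeof buf);   /* may block on a slow client */
    if (n > 0)
        write(client, buf, (size_t)n);           /* echo back as a stand-in for real work */
    close(client);
    return NULL;
}

void serve_forever(int listen_fd)                /* listen_fd: an already bound, listening socket */
{
    for (;;) {
        int client = accept(listen_fd, NULL, NULL);
        if (client < 0)
            continue;
        pthread_t t;
        if (pthread_create(&t, NULL, handle_request, (void *)(long)client) == 0)
            pthread_detach(t);                   /* fire and forget: one thread per request */
        else
            close(client);                       /* thread limit reached: drop the connection */
    }
}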

6 Threads vs. Events: Event-Based Concurrency
A small number of event-processing threads drive many finite state machines (FSMs), yielding efficient and scalable concurrency. Many examples: the Click router, the Flash web server, TP monitors, etc.
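
For contrast, a minimal event-loop sketch with epoll; it is not taken from any of the cited systems, and the conn_state structure and its stage field are invented to show the explicit per-connection state that the thread-based model keeps on the stack for free.

/* One thread multiplexes many connections; per-connection state must be
 * carried in an explicit FSM instead of living on a thread's stack. */
#include <sys/epoll.h>
#include <unistd.h>
#include <stdlib.h>

struct conn_state {            /* the "FSM" for one connection */
    int fd;
    int stage;                 /* e.g. 0 = reading request, 1 = writing reply */
};

void event_loop(int epfd)      /* epfd: an epoll instance with connections already registered */
{
    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            struct conn_state *c = events[i].data.ptr;
            /* Dispatch on the saved stage: control flow is chopped into
             * callbacks, which is what makes event code hard to follow. */
            if (c->stage == 0) {
                char buf[4096];
                ssize_t r = read(c->fd, buf, sizeof buf);
                if (r <= 0) { close(c->fd); free(c); continue; }
                c->stage = 1;  /* remember where we are for the next event */
            } else {
                /* write the reply here, then reset or close the connection */
                c->stage = 0;
            }
        }
    }
}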

7 SEDA: Staged Event-Driven Architecture
Decompose the service into stages separated by queues; each stage performs a subset of request processing and is internally event-driven. Each stage contains a thread pool to drive its execution, but threads are not exposed to applications; a dynamic controller grows and shrinks the thread pools with demand.

8 Drawbacks of Events
Event systems hide the control flow, which makes them difficult to understand and debug, and they eventually evolve into call-and-return event pairs. Programmers must match related events and save and restore state by hand: events require manual state management. Capriccio's position: instead of adopting the event-based model, fix the thread-based model.

9 Threads vs. Events: Why Threads?
Threads are a more natural programming model: control flow is more apparent, exception handling is easier, and state management is automatic. They are also a better fit with current tools and hardware, with better existing infrastructure.

10 Capriccio: Goals and Mechanisms
Goals: simplify the programming model (one thread per concurrent activity), scalability to 100K+ threads, support for existing APIs and tools, and automated application-specific customization. Mechanisms: user-level threads; plumbing that avoids O(n) operations; compile-time analysis; run-time analysis.

11 Thread Design Principles
Decouple the programming model from the OS. Kernel threads abstract the hardware and expose device concurrency; user-level threads provide a clean programming model and expose logical concurrency. (Diagram: application on top, user threads in the middle, OS below.)

12 Thread Design Principles
Same content as the previous slide; the diagram build shows the thread layer placed at user level, between the application and the OS.

13 Thread Design and Scalability
User-level threads give flexibility: Capriccio can adopt new asynchronous I/O mechanisms without changing application code, and the user-level thread scheduler can be built along with the application. They are also extremely lightweight, which helps performance: thread synchronization overhead is reduced, mutex acquisition and release require no kernel crossings, and memory management at user level is more efficient.

14 Drawbacks of User-Level Threads
Blocking I/O calls must be replaced by non-blocking mechanisms (such as epoll), which requires a wrapper layer to translate blocking calls into non-blocking ones and can increase the number of kernel crossings. It is also difficult to use multiple processors, and with real parallelism synchronization is no longer "for free".

15 Capriccio Implementation
A user-level threading library in which all thread operations are O(1). Linked stacks address the problem of stack allocation for large numbers of threads, using a combination of compile-time and run-time analysis. A resource-aware scheduler.

16 Context Switches
Built on top of Edgar Toernig's coroutine library; context switches are fast when threads yield.
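
Capriccio's switch code comes from Toernig's coroutine library; as a stand-in, the sketch below uses POSIX ucontext to show the shape of a user-level yield: save the current register state and restore the scheduler's, with no kernel thread switch involved. The variable names are assumptions.

#include <ucontext.h>

extern ucontext_t  scheduler_ctx;     /* assumed: context of the user-level scheduler */
extern ucontext_t *current_thread;    /* assumed: set by the scheduler before dispatch */

/* Called by a user thread when it voluntarily gives up the CPU. */
void thread_yield(void)
{
    /* Save our registers and stack pointer into *current_thread and jump
     * back to the scheduler, which will pick the next runnable thread. */
    swapcontext(current_thread, &scheduler_ctx);
}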

17 I/O
Capriccio intercepts blocking I/O calls and uses epoll for non-blocking I/O.
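
A sketch of what such interception might look like, under assumptions: the descriptor has been set non-blocking, and wait_for_io() is a hypothetical library routine that parks only the calling user-level thread until epoll reports the descriptor ready.

#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>

extern void wait_for_io(int fd, uint32_t events);  /* assumed: blocks this user thread only */

ssize_t wrapped_read(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);            /* fd was made O_NONBLOCK at open/accept */
        if (n >= 0)
            return n;                              /* got data (or EOF) without blocking */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                             /* real error: hand it to the caller */
        /* Would block: register interest and yield this user-level thread
         * until the scheduler sees EPOLLIN on fd, then retry the read. */
        wait_for_io(fd, EPOLLIN);
    }
}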

18 Scheduling
The scheduler behaves very much like an event-driven application, but the events are hidden from programmers.

19 Synchronization
Capriccio supports cooperative threading on single-CPU machines, so most synchronization requires only Boolean checks, as sketched below.
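
A sketch of a cooperative mutex under these assumptions (single CPU, threads switch only at explicit yield points); the wait-queue helpers are hypothetical names, not Capriccio's API.

#include <stdbool.h>

struct waitq;                                    /* opaque wait queue (assumed) */
extern void enqueue_current(struct waitq **q);   /* assumed scheduler internals */
extern void block_current_thread(void);
extern void wake_one_waiter(struct waitq **q);

struct coop_mutex {
    bool locked;
    struct waitq *waiters;        /* user threads queued on this mutex */
};

void coop_mutex_lock(struct coop_mutex *m)
{
    while (m->locked) {
        /* No data race: with cooperative scheduling on one CPU, another
         * thread can only run at explicit yield/block points, so the flag
         * cannot change between this test and the block below. */
        enqueue_current(&m->waiters);
        block_current_thread();   /* yields to the user-level scheduler */
    }
    m->locked = true;             /* a plain Boolean store, no atomic needed */
}

void coop_mutex_unlock(struct coop_mutex *m)
{
    m->locked = false;
    wake_one_waiter(&m->waiters); /* make one blocked thread runnable again */
}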

20 Threading Microbenchmarks
Test machine: an SMP with two 2.4 GHz Xeon processors, 1 GB of memory, and two 10K RPM SCSI Ultra II hard drives, running Linux. Compared: Capriccio, LinuxThreads, and NPTL (Native POSIX Threads for Linux).

21 Latencies of Thread Primitives
NPTL: Native POSIX Thread Library. Latencies in microseconds:

                            Capriccio   LinuxThreads   NPTL
    Thread creation           21.5       (not given)   17.7
    Thread context switch      0.24         0.71        0.65
    Uncontended mutex lock     0.04         0.14        0.15

22 Thread Scalability
Producer threads put empty messages into a shared buffer, and consumer threads "process" each message by looping for a random amount of time.

23 I/O Performance: Network
Network performance is measured by passing a number of tokens among pipes, which simulates the effect of slow client links. Capriccio shows about 10% overhead compared to epoll and is twice as fast as both LinuxThreads and NPTL with more than 1000 threads. Disk I/O is comparable to kernel threads.

24 I/O Performance (graph; epoll_wait() shown for comparison)

25 I/O Performance
Capriccio benefits from the kernel's disk head scheduling algorithm because it uses asynchronous I/O primitives.

26 Disk I/O with Buffer Cache
At a low miss rate, Capriccio's throughput is about 50% of NPTL's; the source of the overhead is the asynchronous I/O interface.

27 Linked Stack Management
With fixed stacks, LinuxThreads allocates 2 MB per stack, so 1 GB of virtual memory holds only about 500 threads.

28 Safety: Linked Stacks
The problem with fixed stacks: you either risk overflow or waste space. The solution is linked stacks: allocate stack space as needed, using compiler analysis to add runtime checkpoints that guarantee enough space until the next check. (Diagram: a fixed stack showing overflow vs. wasted space, beside a chain of linked stack chunks.)
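
A sketch of the run-time side, with assumed names (alloc_chunk, switch_stack, CHUNK_SLACK, current_chunk): the checkpoint compares the space left in the current chunk against the compile-time bound for the path to the next checkpoint and links a new chunk only when necessary.

#include <stddef.h>

#define CHUNK_SLACK 4096                     /* assumed minimum extra room, MinChunk-like */

struct stack_chunk {
    char *limit;                             /* lowest usable address in this chunk */
    struct stack_chunk *prev;                /* previous chunk, for unlinking on return */
};

extern struct stack_chunk *current_chunk;    /* assumed per-thread state */
extern void *alloc_chunk(size_t size);       /* assumed chunk allocator */
extern void switch_stack(void *new_sp);      /* assumed: moves the stack pointer */

/* 'needed' is the compile-time bound on stack use until the next checkpoint. */
void stack_checkpoint(size_t needed)
{
    char *sp = (char *)&sp;                  /* rough estimate of the current stack pointer */
    if ((size_t)(sp - current_chunk->limit) < needed) {
        void *chunk = alloc_chunk(needed + CHUNK_SLACK);
        /* Continue on the freshly linked chunk; the old chunk is resumed
         * when the function that triggered the link returns. */
        switch_stack(chunk);
    }
}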

29 Linked Stacks: Algorithm
Build a weighted call graph (each function is annotated with its stack frame size) and insert checkpoints. A checkpoint determines whether there is enough stack space left to reach the next checkpoint. (Diagram: an example call graph with per-function frame sizes.)

30 Placing Checkpoints
Place one checkpoint on the back edge of every cycle in the call graph, and add further checkpoints so that the stack needed along the deepest call path between checkpoints stays bounded, as sketched below.

31 Linked Stacks: Algorithm
Parameters: MaxPath, MinChunk. Steps: break cycles, then trace back. Special cases: function pointers and external calls, which use a large stack chunk. (Diagram: the example call graph with checkpoint c1 inserted; MaxPath = 8.)

32-35 Linked Stacks: Algorithm (continued)
The same slide repeated as an animation: successive builds add checkpoints c2 through c5 to the example call graph (MaxPath = 8).

36 Dealing with Special Cases: Function Pointers
We don't know at compile time which procedure a function pointer will call, but we can compute a potential set of target procedures.

37 Dealing with Special Cases: External Functions
Allow programmers to annotate external library functions with trusted stack bounds, and allow larger stack chunks to be linked for external functions.

38 Tuning the Algorithm
Some stack space is inevitably wasted, both internally and externally. The tradeoffs: MaxPath controls the number of stack linkings, while MinChunk controls external fragmentation. (Diagram: internal vs. external waste.)

39 Memory Benefits
No preallocation of large stacks, which reduces the memory required to run large numbers of threads, and better paging behavior because stacks are used LIFO.

40 Case Study: Apache 2.0.44
Parameters: MaxPath = 2 KB, MinChunk = 4 KB. Running Apache under SPECweb99, the overall slowdown is about 3%: dynamic allocation 0.1%; linking to large chunks for external functions 0.5%; stack removal 10%.

41 Scheduling: The Blocking Graph
A lesson from event systems, which Capriccio applies to threads: each node is a location in the program where a thread blocked; the stage is deduced from stack traces taken at blocking points, and information about thread behavior is recorded there. (Diagram: a web server's blocking graph with nodes Accept, Read, Open, Read, Close, Write, Close.)

42 Scheduling: The Blocking Graph
Annotate each edge with its average running time (measured with the cycle counter); annotate each node with how long its outgoing edges take on average and with the changes in resource usage. (Same web-server blocking graph diagram.)
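
A sketch of that bookkeeping, with assumed names and an arbitrary smoothing constant: read the CPU cycle counter at each blocking point and fold the elapsed time into a running average for the edge just traversed.

#include <stdint.h>

static inline uint64_t read_cycles(void)           /* x86 time-stamp counter */
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

struct bg_edge {
    double avg_cycles;          /* smoothed CPU time charged to this edge */
};

/* Call at every blocking point: charge the cycles since the previous block
 * to the edge between the previous blocking point and this one. */
void note_blocking_point(struct bg_edge *edge, uint64_t *last_stamp)
{
    uint64_t now = read_cycles();
    double sample = (double)(now - *last_stamp);
    /* Exponentially weighted moving average; 0.3 is an arbitrary weight. */
    edge->avg_cycles = 0.7 * edge->avg_cycles + 0.3 * sample;
    *last_stamp = now;
}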

43 Resource-Aware Scheduling
Keep track of resource usage levels and decide dynamically whether each resource is at its limit. Annotate each node with the resources used on its outgoing edges, so the scheduler can predict the impact on each resource of scheduling threads from that node. Dynamically prioritize nodes (and thus threads) based on this information: increase use of a resource when it is underutilized and decrease use near saturation. Advantages: the system operates near the thrashing point and gets automatic admission control. (Same web-server blocking graph diagram.)

44 Tracking Resources
Memory usage is tracked by wrapping the malloc() family, with the resource limit detected by watching page-fault activity. File descriptors are tracked via open() and close() calls, with the limit estimated as the number of open connections at which response time jumps up.
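
A sketch of such tracking by wrapping the relevant calls; the counter names and the wrapper approach are assumptions for illustration, not Capriccio's exact mechanism.

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

static size_t bytes_in_use;     /* current heap usage seen by the wrappers */
static int    fds_in_use;       /* currently open descriptors */

void *tracked_malloc(size_t n)
{
    void *p = malloc(n);
    if (p)
        bytes_in_use += n;      /* freeing would need the size: a real tracker
                                   stores it in a header next to the block */
    return p;
}

int tracked_open(const char *path, int flags, mode_t mode)
{
    int fd = open(path, flags, mode);
    if (fd >= 0)
        fds_in_use++;
    return fd;
}

int tracked_close(int fd)
{
    int rc = close(fd);
    if (rc == 0)
        fds_in_use--;
    return rc;
}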

45 Pitfalls
It is tricky to determine the maximum capacity of a resource: thrashing depends on the workload (the disk can handle more requests when they are sequential rather than random), resources interact (e.g., VM vs. disk), and applications may manage memory themselves.

46 Yield Profiling
User-level threads are problematic if a thread fails to yield. Such threads are easy to detect, since their running times are orders of magnitude larger than normal; yield profiling identifies the places where a program fails to yield sufficiently often.
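
A sketch of the idea under stated assumptions: stamp a timer at every yield and report the blocking-graph location whenever the gap since the previous yield exceeds a threshold. read_cycles() and current_graph_node() are assumed helpers, and the threshold is arbitrary.

#include <stdint.h>
#include <stdio.h>

#define YIELD_WARN_CYCLES (100ULL * 1000 * 1000)    /* arbitrary "too long" threshold */

extern uint64_t read_cycles(void);                  /* e.g. rdtsc, as sketched earlier */
extern const char *current_graph_node(void);        /* assumed: name of the last blocking point */

static uint64_t last_yield_stamp;

/* Called from the scheduler each time a thread yields or blocks. */
void profile_yield(void)
{
    uint64_t now = read_cycles();
    if (last_yield_stamp && now - last_yield_stamp > YIELD_WARN_CYCLES)
        fprintf(stderr, "yield profiler: %s ran %llu cycles without yielding\n",
                current_graph_node(),
                (unsigned long long)(now - last_yield_stamp));
    last_yield_stamp = now;
}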

47 Web Server Performance
Server: 4 x 500 MHz Pentium processors, 2 GB of memory, an Intel e1000 Gigabit Ethernet card, running Linux. Workload: requests for 3.2 GB of static file data.

48 Web Server Performance
Request frequencies match those of SPECweb99. Each client connects to the server repeatedly and issues a series of five requests separated by 20 ms pauses. Apache's performance improved by 15% with Capriccio.

49 Web Server Performance

50 Runtime Overhead
Tested with Apache 2.0.44. Stack linking: 78% slowdown for a null function call, 3-4% overall. Resource statistics: 2% when collected all the time, 0.1% with sampling. Stack traces: 8% overhead.

51 Resource-Aware Admission Control
Touching pages too quickly causes thrashing. In the test, producer threads loop, adding memory to a global pool and randomly touching pages to force them to stay resident, while consumer threads loop, removing memory from the pool and freeing it. Capriccio quickly detects the overload condition and limits the number of producers.

52 Related Work
Programming models for high concurrency; user-level threads, where Capriccio is unique in its blocking graph, resource-aware scheduling, its focus on very large numbers of blocking threads, and POSIX compliance; application-specific optimization; stack management; resource-aware scheduling.

53 Future Work
Support multi-CPU machines; improve the resource-aware scheduler and the stack analysis; produce profiling tools to help tune Capriccio's stack parameters to an application's needs.

54 Thanks! Questions?

