Capriccio: Scalable Threads for Internet Services (von Behren) Kenneth Chiu.


1 Capriccio: Scalable Threads for Internet Services (von Behren) Kenneth Chiu

2 Background Non-blocking I/O vs. asynchronous I/O: –Non-blocking I/O usually doesn’t work well for disks. –Async I/O: issue a request, later receive a completion. Terms used in this talk: –epoll()/poll() –Convoy: tendency for threads to “bunch up.” –Priority inversion –Call graph –Average vs. exponentially-weighted moving average –Capriccio: a piece in an improvisatory, free-form style.

3 The Problem Web “transactions” involve a number of steps which must be performed in sequence. For high throughput, we want to service many of these requests concurrently. –When does concurrency help? When does it not? If we use a single thread per request, we will have too many threads. If we multiplex requests onto a small set of threads, the programming becomes more difficult.

4 Read two numbers and add

Event-driven version (one loop multiplexes all connections; per-connection progress lives in explicit state):

    while (true) {
        fd = get_read_ready();
        state = lookup(fd);
        if (state.step == READING_FIRST) {
            c = read(fd, ..., bytes_left);
            if (have enough)
                state.step = READING_SECOND;
        } else if (state.step == READING_SECOND) {
            ...
        }
    }

Threaded version (the control flow itself holds the state):

    while (true) {
        int n1, n2;
        readexact(fd, &n1, 4);
        readexact(fd, &n2, 4);
        printf("%d\n", n1 + n2);
    }
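The threaded version relies on a `readexact` helper that is not a standard call. A minimal sketch of what it might look like over POSIX `read()` (the name and contract are the slide's, the implementation is ours):

```c
#include <unistd.h>
#include <errno.h>

/* Hypothetical helper: block until exactly `len` bytes arrive on fd.
   Returns 0 on success, -1 on error or premature EOF. */
static int readexact(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  /* retry after a signal */
            return -1;
        }
        if (n == 0) return -1;             /* EOF before len bytes */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Note the loop: `read()` may return fewer bytes than requested, which is exactly the partial-progress bookkeeping the event-driven version must track explicitly in `state`.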

5 Thread Design and Scalability

6 The Case for User-Level Threads Flexibility –Level of indirection between applications and the kernel, which helps decouple the two. –Kernel-level thread scheduling must handle all applications; user-level scheduling can be tailored. –Lightweight, which means we can use zillions of them. Performance –Cooperative scheduling is nearly free. –Does not require a kernel crossing for uncontended locks. (Why do contended locks require kernel crossings?) Disadvantages –Non-blocking I/O requires an additional system call. (Why?) –SMPs

7 Implementation Context switches –Built on coroutine library. I/O –Intercept blocking system calls, use epoll() and AIO for disk. –Can be less efficient Scheduling –Main scheduling loop looks very much like an event-driven application. (What is an EDA?) –Makes it relatively easy to switch schedulers. Synchronization –Cooperative threading on UP. Efficiency –All O(1), except sleep queue.
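The claim that the main scheduling loop looks like an event-driven application can be sketched as one polling pass over ready file descriptors. This is an illustrative epoll-based skeleton, not Capriccio's actual code; `poll_once` and `resume_thread` are hypothetical names:

```c
#include <sys/epoll.h>

/* One pass of a scheduler's I/O check: ask epoll which fds are
   ready, then mark the thread parked on each fd runnable again.
   resume_thread is a stand-in for the real scheduler hook. */
static int poll_once(int epfd, void (*resume_thread)(int fd)) {
    struct epoll_event events[64];
    int n = epoll_wait(epfd, events, 64, 0);   /* non-blocking poll */
    for (int i = 0; i < n; i++)
        if (resume_thread)
            resume_thread(events[i].data.fd);  /* unblock the waiter */
    return n;                                  /* fds made runnable */
}
```

In an event-driven application the loop body would dispatch to an event handler; here it merely makes a blocked thread runnable, which is why swapping schedulers is relatively easy.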

8 Benchmarks Linux box: 2 x 2.4 GHz Xeon, 1 GB memory, 2 x 10K RPM SCSI, GigE. –Linux 2.5.70, epoll(), AIO. Solaris box: 2 x 1.2 GHz UltraSPARC III. –Solaris 8. Threading packages: Capriccio, LinuxThreads, NPTL.

9 Thread Primitives Latencies in microseconds:

                            Capriccio   Capriccio (no trace)   LinuxThreads   NPTL   Solaris
    Thread creation           21.5          21.5                   37.9       17.7    32
    Thread context switch      0.56          0.24                   0.71       0.65
    Uncontended mutex lock     0.04          0.04                   0.14       0.15    0.08

10 Thread Scalability Producer-consumer

11 Thread Scalability The drop between 100 and 1000 threads is due to the growing cache footprint.

12 I/O Performance pipetest –Pass a number of tokens among a set of pipes. Disk scheduling –A number of threads perform random 4 KB reads from a 1 GB file. Disk I/O through buffer cache –200 threads reading with a fixed miss rate.

13 When concurrency is low, performance is poorer.

14 Benefits of disk head scheduling.

15 I/O out of the buffer cache. Performance is lower due to AIO overhead.

16 Linked Stack Management

17 Thread Stacks With many threads, the cumulative stack space can be quite large. Solution: use a dynamic allocation policy, allocate on demand, and link stack chunks together. Problems: How do you link stack chunks together? How do you know when to link in a new one?
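The "when to link" decision can be sketched as a checkpoint that compares the space left in the current chunk against a bound on the path ahead. This is a conceptual sketch, not Capriccio's implementation (which also adjusts the stack pointer and frees chunks on return); the struct and names are ours:

```c
#include <stdlib.h>

/* Hypothetical stack-chunk record: chunks are linked so a thread's
   stack can grow on demand instead of being one large contiguous
   allocation reserved up front. */
typedef struct chunk {
    struct chunk *prev;   /* link back to the previous chunk */
    size_t size;          /* bytes available in this chunk */
    size_t used;          /* bytes consumed so far */
} chunk;

/* Checkpoint: ensure `needed` bytes (a statically computed bound on
   the upcoming call path) are available; link a fresh chunk if not. */
static chunk *stack_checkpoint(chunk *cur, size_t needed, size_t chunk_size) {
    if (cur && cur->size - cur->used >= needed)
        return cur;                       /* fast path: enough room */
    chunk *c = malloc(sizeof *c);
    c->prev = cur;                        /* link to the old top chunk */
    c->size = chunk_size > needed ? chunk_size : needed;
    c->used = 0;
    return c;
}
```

The fast path is a single comparison, which is why sprinkling checkpoints through the code is affordable.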

18 Weighted Call Graph Use static analysis to create a weighted call graph. Each node is weighted by the maximum stack space that that function might consume. (Why is it a maximum, and not exact?) Now what?
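Given such a graph, each function's worst-case stack need is its own frame size plus the maximum over its callees. A toy computation on a hypothetical four-function graph (frame sizes and edges invented for illustration; this only terminates because the graph is acyclic, which is exactly why recursion forces the checkpoint fallback):

```c
#include <stddef.h>

/* Toy weighted call graph: frame[i] is function i's frame size in
   bytes; calls[i][j] != 0 means function i calls function j. */
#define NFUNCS 4
static const size_t frame[NFUNCS] = { 64, 128, 256, 32 };
static const int calls[NFUNCS][NFUNCS] = {
    { 0, 1, 1, 0 },   /* f0 calls f1 and f2 */
    { 0, 0, 0, 1 },   /* f1 calls f3 */
    { 0, 0, 0, 1 },   /* f2 calls f3 */
    { 0, 0, 0, 0 },   /* f3 is a leaf */
};

/* Worst-case stack consumption starting at function i:
   own frame plus the deepest callee path. */
static size_t max_stack(int i) {
    size_t worst = 0;
    for (int j = 0; j < NFUNCS; j++)
        if (calls[i][j]) {
            size_t s = max_stack(j);
            if (s > worst) worst = s;
        }
    return frame[i] + worst;
}
```

Here f0's bound is 64 + max(128+32, 256+32) = 352 bytes: the heavy f2 path dominates even though both paths end at the same leaf.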

19 Bounds Most real-world programs use recursion. Even without recursion, a static bound wastes too much space. Instead, insert checkpoints at key places to link in new stack chunks. Chunks are switched right before arguments are pushed.

20 Placing Checkpoints Ensure at least one checkpoint in every cycle by inserting on back edges. (How? Is this efficient?) Then make sure each checkpoint-free path (summing node weights) is not too long.

21 Function B is executing. Function D, both ways. Recursion.

22 Special Cases Function pointers –Difficult, but they try to analyze. External functions –Allow annotations. –Alternatively, link in a large chunk. Variable length arrays –C99

23 Question What kind of a problem is this? Is it being solved at the right level?

24 Resource-Aware Scheduling

25 Admission Control We’ve seen many graphs where performance degrades as some variable increases. Scheduling in Capriccio aims to keep performance in the “good” part of the curve.

26 Blocking Graph Each node is a location where the program blocked. –Location is call chain. Generated at run time. Annotate with resource usage: –Average running time (with exponentially-weighted “moving” average), memory, stack, sockets, etc. Maintain a run queue for each node. Admit threads till resources reach maximum capacity.
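The exponentially-weighted moving average used to annotate nodes with running time is one line; the smoothing factor below is a tuning choice of ours, not a value from the paper:

```c
/* Exponentially-weighted moving average: each new sample contributes
   weight alpha, and older history decays geometrically. */
static double ewma(double avg, double sample, double alpha) {
    return alpha * sample + (1.0 - alpha) * avg;
}
```

Compared with a plain average, this tracks phase changes in a node's behavior while smoothing over one-off outliers, which is what an online scheduler annotation needs.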

27 Pitfalls Too many non-linear effects to predict. One solution is to use some kind of instrumentation, plus feedback control. –But even detecting that is hard.

28 Web Server Test

30 Summary Control flow maintains state; control flow can be swapped for explicit state maintenance. Threads perform two functions: –Maintaining state (the logical threads of the programming model). –Allowing concurrency (the kernel’s job). The two should be separated, since the overhead of concurrency is unnecessary when we just want to maintain state. Cooperative multitasking has been denigrated before, but it can be good.

