Capriccio: Scalable Threads for Internet Service


1 Capriccio: Scalable Threads for Internet Service

2 Introduction
Internet services have ever-increasing scalability demands
Current hardware is meeting these demands, but software has lagged behind
Recent approaches are event-based: applications are structured as pipelines of event-handling stages

3 Drawbacks of Events
Event systems hide the control flow, making programs difficult to understand and debug
They eventually evolve into call-and-return event pairs: programmers must match related events and save/restore state between them
Capriccio: instead of adopting the event-based model, fix the thread-based model

4 Goals of Capriccio
Support for the existing thread API, with little change to existing applications
Scalability to thousands of threads, one thread per task
Flexibility to address application-specific needs
[Chart: ease of programming vs. performance, positioning Events, Threads, and an Ideal point]

5 Thread Design Principles
Kernel-level threads are for true concurrency
User-level threads provide a clean programming model with useful invariants and semantics
Decoupling user-level from kernel-level threads also improves portability

6 Capriccio
Thread package in which all thread operations are O(1)
Linked stacks: address the problem of stack allocation for large numbers of threads, using a combination of compile-time and run-time analysis
Resource-aware scheduler

7 Thread Design and Scalability
POSIX API, backward compatible with existing applications

8 User-Level Threads
+ Performance
+ Flexibility
- Complex preemption
- Bad interaction with the kernel scheduler

9 Flexibility
Decoupling user and kernel threads allows faster innovation: new kernel thread features can be used without changing application code
Scheduler can be tailored to the application
Lightweight

10 Performance
Reduced thread-synchronization overhead
No kernel crossing for preemptive threading
More efficient memory management at user level

11 Disadvantages
Blocking calls must be replaced with nonblocking ones to keep the CPU, which adds translation overhead
Problems with multiple processors: synchronization becomes more expensive

12 Context Switches
Built on top of Edgar Toernig's coroutine library
Fast context switches when threads voluntarily yield

13 I/O
Capriccio intercepts blocking I/O calls
Uses epoll for asynchronous I/O

14 Scheduling
Behaves very much like an event-driven application, but the events are hidden from programmers

15 Synchronization
Supports cooperative threading on single-CPU machines
Requires only Boolean checks

16 Threading Microbenchmarks
SMP with two 2.4 GHz Xeon processors, 1 GB memory, two 10K RPM SCSI Ultra II hard drives, running Linux
Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux (NPTL)

17 Latencies of Thread Primitives (μs)
                         Capriccio   LinuxThreads   NPTL
Thread creation          21.5        -              17.7
Thread context switch    0.24        0.71           0.65
Uncontended mutex lock   0.04        0.14           0.15

18 Thread Scalability
Producer-consumer microbenchmark
LinuxThreads begins to degrade after 20 threads; NPTL after 100
Capriccio scales to 32K producers and consumers (64K threads total)

19 Thread Scalability

20 I/O Performance
Network performance: token passing among pipes, simulating the effect of slow client links
10% overhead compared to epoll; twice as fast as both LinuxThreads and NPTL with more than 1000 threads
Disk I/O comparable to kernel threads

21 Linked Stack Management
Fixed stacks: LinuxThreads allocates 2 MB per stack, so 1 GB of virtual memory holds only 500 threads

22 Linked Stack Management
But most threads consume only a few KB of stack space at any given time
Dynamic stack allocation can significantly reduce virtual memory use
Linked stacks

23 Compiler Analysis and Linked Stacks
Whole-program analysis based on the call graph
Problematic for recursion
Static estimation may be too conservative

24 Compiler Analysis and Linked Stacks
Grow and shrink the stack on demand
Insert checkpoints that determine whether more stack must be allocated before the next checkpoint is reached
Results in noncontiguous stacks

25 Placing Checkpoints
One checkpoint on every cycle in the call graph
Bound the stack growth between checkpoints by the deepest call path

26 Dealing with Special Cases
Function pointers: the callee is unknown at compile time, but a set of potential callees can be determined

27 Dealing with Special Cases
External functions: allow programmers to annotate external library functions with trusted stack bounds
Allow larger stack chunks to be linked for external functions

28 Tuning the Algorithm
Stack space can be wasted through internal and external fragmentation
Tradeoff: fewer stack linkings vs. more external fragmentation

29 Memory Benefits
Tuning can be application-specific
No preallocation of large stacks, reducing the memory needed to run large numbers of threads
Better paging behavior: stacks are used LIFO

30 Case Study: Apache 2.0.44
Maximum stack allocation chunk: 2 KB
Apache under SPECweb99: overall slowdown about 3%
Dynamic allocation: 0.1%
Linking to large chunks for external functions: 0.5%
Stack removal: 10%

31 Resource-Aware Scheduling
Advantages of event-based scheduling: tailored to the application, with event handlers
Events provide two important pieces of scheduling information: whether a process is close to completion, and whether the system is overloaded

32 Resource-Aware Scheduling
Thread-based analogue: view applications as sequences of stages separated by blocking calls
Analogous to an event-based scheduler

33 Blocking Graph
Node: a location in the program that blocked
Edge: connects two nodes if they were consecutive blocking points
Generated at runtime

34 Resource-Aware Scheduling
1. Keep track of resource utilization
2. Annotate each node with the resources used along its outgoing edges
3. Dynamically prioritize nodes: prefer nodes that release resources

35 Resources
CPU
Memory (malloc)
File descriptors (open, close)

36 Pitfalls
Tricky to determine the maximum capacity of a resource: the thrashing point depends on the workload (e.g., a disk can handle more sequential requests than random ones)
Resources interact, e.g., VM vs. disk
Applications may manage memory themselves

37 Yield Profiling
Cooperative user threads are problematic if a thread fails to yield
Such cases are easy to detect, since their running times are orders of magnitude larger
Yield profiling identifies places where programs fail to yield sufficiently often

38 Web Server Performance
4 x 500 MHz Pentium server, 2 GB memory, Intel e1000 Gigabit Ethernet card, Linux
Workload: requests for 3.2 GB of static file data

39 Web Server Performance
Request frequencies match those of SPECweb99
Each client connects to the server repeatedly, issuing a series of five requests separated by 20 ms pauses
Apache's performance improved by 15% with Capriccio

40 Resource-Aware Admission Control
Producer-consumer application
Producers loop, allocating memory and randomly touching pages
Consumers loop, removing memory from the pool and freeing it
A fast producer may run out of virtual address space

41 Resource-Aware Admission Control
Touching pages too quickly causes thrashing
Capriccio quickly detects the overload condition and limits the number of producers

42 Programming Models for High Concurrency
Events: application-specific optimization
Threads: efficient thread runtimes

43 User-Level Threads
Capriccio is unique: blocking graph, resource-aware scheduling
Targets large numbers of blocking threads
POSIX compliant

44 Application-Specific Optimization
Most approaches require programmers to tailor their applications to manage resources
Nonstandard APIs are less portable

45 Stack Management
No garbage collection

46 Future Work
Multi-CPU machines
Profiling tools for system tuning

