Capriccio: Scalable Threads for Internet Services


1 Capriccio: Scalable Threads for Internet Services
Rob von Behren et al., University of California, Berkeley. Presented by: L R Sriram Chivukula

2 Background Internet services have increasing scalability demands
Current commodity hardware is capable, but the software is lagging behind. We need a simplified, scalable programming model for Internet services. Use of Internet-based services has kept increasing. Big servers can handle these demands without performance degradation, and not only big servers: commodity hardware can meet them too. The problem is the software, which lags behind. So to meet these scalability demands there must be a simplified programming model that can still achieve good scalability.

3 Background Available design approaches: thread model vs. event model
There are two basic approaches to structuring the application software. Thread-based: one thread handles each incoming request. Event-based: the application is event-driven; a request is a sequence of events, and when an event occurs a single scheduler/dispatcher hands it to its respective event handler. (Diagram: thread-based model vs. event-based model.)

4 Motivation In recent years …
"Why Threads Are a Bad Idea (for most purposes)" by John Ousterhout: threads are hard to implement and are only for experts; synchronization is required, races are hard to debug, and the OS limits performance through scheduling and context switches. Synchronization means coordinating access to shared data with locks: forget a lock, and you get corrupted data. Many have therefore adopted the event-based model: inexpensive synchronization due to cooperative multitasking, very little stack space required, better scheduling and locality based on application-level information, and more flexible control flow. SEDA (Staged Event-Driven Architecture) is a hybrid approach between events and threads, using events between stages and threads within them.

5 Motivation There has been a lot of criticism that threads don't perform well for highly concurrent applications. Lauer & Needham's duality argument. "Why Events Are a Bad Idea (for high-concurrency servers)" by Rob von Behren et al.: the thread paradigm is sound and should give good performance; only specific implementations perform badly. Despite all the benefits of event-based models, the thread-based model has received a lot of criticism. Then came the duality argument by Lauer and Needham: the two models are duals and should perform equally well. Taking this duality into account, Rob von Behren and others published the paper above, in which they address the criticism directed at the thread-based model and argue that it is easier to use than event-based models. One main point is that event-based models are hard to debug: though they have flexible control flow, it is complicated to follow. They conclude that with a proper implementation, threads will do better than event-based systems, and they propose Capriccio.

6 Capriccio: Design Objectives
Support the existing thread APIs. Improve scalability: one thread per connection for Internet servers. Do efficient memory management. Provide flexibility to address application-specific needs, with better scheduling. The main objectives of Capriccio are: use the thread-based model while supporting the existing thread APIs; improve scalability, for which they propose an efficient memory-management system for their thread package; and provide flexibility for application-specific needs, for which they propose a scheduler that tunes itself dynamically based on the application and its workload.

7 Capriccio Thread Package
Achieves its goals with: a user-level threads implementation; a linked-stacks mechanism; a resource-aware scheduler; and use of async I/O. (Architecture diagram: applications such as the Apache web server run on Capriccio's user-level threads, with the scheduler, memory management, and async I/O layered above the kernel.) The main advantage of decoupling is that the threads and the kernel can evolve independently; at user level the package can have its own scheduling and memory units that adapt to application-specific needs. The linked-stack mechanism dynamically allocates and deallocates stack space, which helps in creating a huge number of threads; it combines compile-time analysis with run-time checks. Resource-aware scheduling extracts information about the flow of control within a program to make scheduling decisions, adapts to the application's needs and workload, and helps improve scalability. They also take advantage of the asynchronous I/O interfaces provided by the Linux kernels of the time, and engineer their runtime system so that all thread operations are O(1). Capriccio uses a user-level thread-management approach. Kernel-level threads are generally used for enabling true concurrency across multiple devices, disk requests, or CPUs, whereas user-level threads are logical threads that provide a clean programming model with useful invariants and semantics. They argue that decoupling the threads from the underlying kernel has two advantages: (1) there is substantial variation in interfaces and semantics among modern kernels, and (2) kernel threads and asynchronous I/O were an active area of research. Logical threads hide both OS and kernel evolution.

8 Thread Design and Scalability
User-level threads. Flexibility: due to decoupling from the underlying kernel, and flexibility of the thread scheduler. Performance: reduced overhead of thread synchronization. Disadvantages: a wrapper is needed to translate blocking I/O into non-blocking I/O; non-blocking I/O means more kernel crossings; and user-level threads have difficulty taking advantage of multiple processors. Decoupling helps both the threads and the kernel evolve independently: Capriccio takes advantage of the new asynchronous I/O to improve performance without changing the application code, and it increases the flexibility of the thread scheduler. Kernel-level threads can't be tailored to fit a specific application; user-level threads don't suffer from this. They are lightweight and let programmers use a tremendous number of threads without worrying about overhead. Performance: with cooperative scheduling on a single CPU, synchronization is nearly free, since neither threads nor the scheduler can be interrupted in a critical section; memory management is also efficient, because kernel threads require data structures that eat up memory, which is not the case for user-level threads. Disadvantages: async I/O involves polling sockets and then performing I/O, and this polling causes a little overhead; the wrapper from blocking to non-blocking I/O adds further overhead; and user-level threads can't take advantage of multiple processors, where the performance advantage of lightweight synchronization is diminished because synchronization is no longer free.

9 Capriccio: Implementation
Context switches: uses Toernig's coroutine library; threads voluntarily yield. I/O: uses the latest Linux asynchronous I/O mechanisms (epoll and AIO), which adds some overhead. Scheduling: resource-aware scheduling. Synchronization: takes advantage of cooperative scheduling, using a simple check such as a boolean locked/unlocked flag. Efficiency: O(1), except for the sleep queue. Context switches: Edgar Toernig's coroutine library provides fast context switches; in the common case threads voluntarily yield, either explicitly or by making blocking I/O calls. It does not provide signal-based code that would allow preemption of long-running user threads. I/O: Capriccio intercepts blocking I/O calls at the library level by overriding the system-call stub functions in GNU libc. This works for statically linked applications and for dynamically linked applications that use GNU libc 2.2 or earlier; it does not work for later dynamically linked apps, because GNU libc 2.3 bypasses the system-call stubs for many of its routines. Scheduling: the main loop looks like the main loop of an event-driven system, running application threads and checking for new I/O completions; the user can select between different schedulers at run time. Synchronization: on a single CPU it takes advantage of cooperative scheduling and uses a simple check such as a boolean locked/unlocked flag; with multiple kernel threads it uses spin locks or optimistic concurrency-control primitives, depending on which is best for the given situation. Efficiency: great care went into choosing data structures and algorithms; all thread-management functions have O(1) complexity, except for the sleep queue.
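The cooperative-synchronization idea above can be modeled in a few lines. This is a toy Python sketch, not Capriccio's C implementation: generators stand in for user-level threads, a round-robin deque stands in for the scheduler, and the mutex is nothing but a boolean flag, which is safe only because a thread cannot be preempted between testing and setting it.

```python
import collections

class CoopMutex:
    def __init__(self):
        self.locked = False

def run(threads):
    # Round-robin cooperative scheduler: each generator runs until it yields.
    ready = collections.deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)
            ready.append(t)   # thread yielded: back on the run queue
        except StopIteration:
            pass              # thread finished

def worker(mutex, log, name):
    while mutex.locked:       # test-and-set with no atomics: we cannot be
        yield                 # preempted between the test and the set
    mutex.locked = True
    log.append(name + ":enter")
    yield                     # pretend to block on I/O inside the critical section
    log.append(name + ":exit")
    mutex.locked = False

m = CoopMutex()
log = []
run([worker(m, log, "a"), worker(m, log, "b")])
```

The two critical sections never interleave even though "a" yields while holding the lock: "b" sees the flag set and simply yields back to the scheduler.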

10 Comparison Of Different Thread Packages
Latencies (in microseconds) of thread primitives for different thread packages:

Operation                Capriccio   Capriccio_notrace   LinuxThreads   NPTL
Thread creation            21.5             —                37.9       17.7
Thread context switch      0.56            0.24              0.71       0.65
Uncontended mutex lock     0.04             —                0.14       0.15

Test machine: 2x 2.4 GHz Xeon processors, 1 GB of memory, 2x 10k RPM SCSI Ultra II hard drives, 3 Gigabit Ethernet interfaces. Operating system: Linux (with epoll support).

11 Capriccio: Memory Management
Performs a compiler analysis and generates a weighted call graph. Linked stack management: uses a dynamic allocation policy, allocating memory chunks on demand.

12 Weighted Call Graph (Figure: an example weighted call graph with nodes M, A, B, C, D, E and per-function stack weights between 0.2 K and 1.0 K.)

13 Weighted Call Graph Each function is represented as a node
weighted by the maximum stack size it needs for execution; each edge represents a direct function call. Checkpoints are inserted at call sites at compile time. A checkpoint checks whether enough stack space is left to reach the next checkpoint; if not, it allocates a new stack chunk. The problem: where should we insert checkpoints?
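The run-time side of a checkpoint can be sketched as follows. This is a toy Python model (the real mechanism is compiled into C code): each entry in `chunks` is the free space remaining in one stack chunk, `needed` is the worst-case stack use before the next checkpoint (known from the compile-time analysis), and the 4 KB chunk size is an assumption for illustration.

```python
CHUNK_SIZE = 4096   # assumed chunk size; the real allocator tunes this

class LinkedStack:
    """Toy model of a linked stack: the last entry of `chunks` is the
    free space left in the chunk currently in use."""
    def __init__(self):
        self.chunks = [CHUNK_SIZE]

    def checkpoint(self, needed):
        # Not enough room to reach the next checkpoint: link a fresh chunk.
        if self.chunks[-1] < needed:
            self.chunks.append(CHUNK_SIZE)
        self.chunks[-1] -= needed

    def unwind(self, freed):
        # On return, give the space back and unlink a now-empty chunk.
        self.chunks[-1] += freed
        if len(self.chunks) > 1 and self.chunks[-1] == CHUNK_SIZE:
            self.chunks.pop()
```

For example, two consecutive 3000-byte checkpoints overflow one 4096-byte chunk, so the second one links a new chunk, which is unlinked again when the call returns.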

14 Weighted Call Graph Insert one checkpoint on the back edge of every cycle (in the example, the recursive edge back to M).

15 Weighted Call Graph Use a bottom-up approach with MaxPath = 1.0 K
Check the longest path from each node to the next checkpoint; if the MaxPath limit is exceeded, add a checkpoint.
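The bottom-up placement pass can be sketched as a small graph algorithm. This is a hypothetical Python reconstruction of the idea (the paper's analysis runs inside a C compiler toolchain, and the weights below are illustrative, in the slide's K units): walk the acyclic call graph bottom-up, track the longest unchecked stack path below each node, and put a checkpoint on any call edge whose path would exceed MaxPath.

```python
def place_checkpoints(calls, frame, max_path):
    """calls: node -> list of callees (assumed acyclic; cycle back edges
    get a checkpoint in a prior pass). frame: node -> stack frame size.
    Returns the set of call edges that receive a checkpoint."""
    checkpoints = set()
    longest = {}                       # longest unchecked path from node down

    def visit(node):
        if node in longest:
            return longest[node]
        deepest = 0
        for callee in calls.get(node, []):
            d = visit(callee)
            if frame[node] + d > max_path:
                checkpoints.add((node, callee))   # break the path here
            else:
                deepest = max(deepest, d)
        longest[node] = frame[node] + deepest
        return longest[node]

    for node in calls:
        visit(node)
    return checkpoints

# Illustrative chain M -> A -> B with MaxPath = 1.0 K: the 0.5 + 0.8 path
# through A and B exceeds the limit, so the A -> B call gets a checkpoint.
cps = place_checkpoints({"M": ["A"], "A": ["B"]},
                        {"M": 0.2, "A": 0.5, "B": 0.8}, 1.0)
```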

16 Weighted Call Graph (Figure: the same example weighted call graph, now showing where checkpoints are inserted.)

17 Memory Allocation - Runtime
Internal wasted space (light gray in the figure) is governed by MaxPath; external wasted space (dark gray) by MinChunk. Special cases: function pointers and external functions. Stack chunks are dynamically allocated and deallocated. Function pointers: it is hard to know which function is called through a given function pointer. External functions: it is difficult to bound the stack space for these; the solutions are annotations with trusted bounds for each external function, or simply large stack chunks for external functions. This causes no issues as long as the thread doesn't block within these functions, so a small number of large stack chunks can be reused. Algorithm tuning: MaxPath trades execution time against internal wasted space, since larger path lengths require fewer checkpoints but waste more internal space; MinChunk governs external wasted space, since larger chunks reduce stack linking but waste more external space.

18 Benefits of Linked Stack
Preallocation of large stack space becomes unnecessary, reducing virtual memory use and improving paging behavior. Microbenchmark with bigstack(): just 1 MB of stack space is shared among all the threads; 800 threads each call bigstack() 10 times, 3.33 s vs. 1.07 s. With 1 MB per thread the system thrashes at 1,000 threads; with 1 MB shared across all threads it scales to 100,000. Paging behavior: stack chunks are reused in LIFO order, which allows sharing of stack chunks between threads and reduces the size of the application's working set; the allocator can also use stack chunks smaller than a single page, reducing overall memory wastage.
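The LIFO reuse described above can be sketched as a tiny chunk pool. This is an illustrative Python model, not Capriccio's allocator: the most recently freed (and thus cache-warm) chunk is the first one handed to the next thread that links a chunk, and the pool only grows when the free list is empty.

```python
class ChunkPool:
    """Toy LIFO pool of stack chunks shared by all threads."""
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.free_list = []            # LIFO free list
        self.total_allocated = 0       # how many chunks ever created

    def get(self):
        if self.free_list:
            return self.free_list.pop()        # reuse, hottest chunk first
        self.total_allocated += 1
        return bytearray(self.chunk_size)      # grow the pool on demand

    def put(self, chunk):
        self.free_list.append(chunk)

pool = ChunkPool(4096)
a = pool.get()
b = pool.get()
pool.put(a)
pool.put(b)
```

Because reuse is LIFO, the next `get()` returns `b` (freed last), and the pool's footprint stays at two chunks no matter how many threads cycle through it.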

19 Resource Aware Scheduling
The application is viewed as a sequence of stages separated by blocking points, captured in a blocking graph: each node is a location in the program where a thread blocked. Borrowed from event-based schedulers: the current handler gives a task's location in the processing chain, and the lengths of the handlers' task queues can be used to determine which stages are causing bottlenecks. Capriccio implements a similar strategy in its scheduling, using the learned information to improve scheduling and admission control. A node's weighted average is updated every time an outgoing edge is traversed; it is essentially the weighted average of the edge values, since the number of updates is proportional to the number of times each outgoing edge is taken. Node values tell us how long the next edge will take on average. Edges are annotated with resource usage; resource usage is monitored, and scheduling is done based on the resource-usage patterns.
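The per-node averaging described above can be sketched as an exponentially weighted moving average. This is an illustrative Python sketch, and the smoothing factor `alpha` is an assumed tuning parameter, not a value from the paper: because the node is updated once per edge traversal, its value converges to the frequency-weighted average of its outgoing edges' times.

```python
class BlockingGraphNode:
    """Toy blocking-graph node: folds each measured edge time into a
    running exponentially weighted average."""
    def __init__(self, alpha=0.2):     # alpha: assumed smoothing factor
        self.alpha = alpha
        self.avg_edge_time = 0.0

    def traverse(self, elapsed):
        # Move the average a fraction alpha toward the new measurement.
        self.avg_edge_time += self.alpha * (elapsed - self.avg_edge_time)

node = BlockingGraphNode()
for _ in range(100):
    node.traverse(2.0)       # every traversal of this node's edges took 2.0
```

After repeated traversals with a constant edge time, the node's value settles on that time, which is what "how long the next edge will take on average" relies on.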

20 Resource Aware Scheduling
The blocking graph is generated at run time, learning the behavior of the application dynamically. Annotations: the average running time for each edge; a weighted average for each node; and changes in resource usage (CPU, memory, file descriptors). Strategy: keep track of each resource's utilization level; annotate each node with the resources used on its outgoing edges, to predict the impact on each resource; and dynamically prioritize the nodes for scheduling based on this information. Memory usage is tracked by providing their own version of the malloc family, and the resource limit for memory is detected by watching page-fault activity; file descriptors are tracked through open() and close() calls. For each resource, utilization is increased until it reaches maximum capacity, then throttled back by scheduling nodes that release the resource: when utilization is low, nodes that consume the resource get high priority; when it is high, nodes that release it do. Tasks near completion also get high priority, so that they release the resources they hold. This is a completely adaptive strategy: the scheduler responds to changes in resource consumption. Implementation: separate run queues for each node in the blocking graph; prioritize the nodes; select among them using stride scheduling; and dequeue threads from the chosen node's run queue.
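The stride-scheduling selection step can be sketched in a few lines. This is a toy Python model of classic stride scheduling applied to per-node run queues, not Capriccio's code: each node gets tickets proportional to its dynamic priority, a node's stride is inversely proportional to its tickets, and the node with the smallest pass value runs next, so over time nodes run in proportion to their tickets. The names and the tie-breaking are illustrative simplifications.

```python
STRIDE1 = 1 << 20        # large constant, as in classic stride scheduling

class NodeQueue:
    """One run queue per blocking-graph node; tickets encode the dynamic
    priority computed from resource usage."""
    def __init__(self, name, tickets):
        self.name = name
        self.stride = STRIDE1 // tickets
        self.pass_value = 0

def pick_next(queues):
    node = min(queues, key=lambda q: q.pass_value)
    node.pass_value += node.stride    # charge the node for its turn
    return node.name

# A node with twice the tickets is chosen twice as often.
a = NodeQueue("poll", tickets=2)
b = NodeQueue("read", tickets=1)
picks = [pick_next([a, b]) for _ in range(300)]
```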

21 Pitfalls A resource's maximum capacity is difficult to determine.
It is difficult to detect thrashing, and doing so involves system overhead. Application-specific resources also present challenges: application-level memory management hides allocation and deallocation, and logical resources like locks cause challenges as well. Solution: an API to inform the runtime system about logical resources. Yield profiling: threads may fail to yield. The utilization level at which thrashing occurs often depends on the workload; e.g. the disk subsystem performs better if requests are sequential rather than random. The solution is to watch for early signs of thrashing (such as page-fault rates) and use them to find the maximum capacity. There is also an assumption that all threads are from the same application and are therefore mutually trusting. Edges that failed to yield can be found because their running times are typically orders of magnitude larger than the average edge; the full blocking graph can be obtained by sending a USR2 signal to the running server process. Example: a close() call may take 5 ms even though it should return immediately when non-blocking I/O is selected. The fix is to insert yields in the system-call library before and after the actual close() call; this doesn't remove the delay, but it cuts the long edge into small ones. A better solution is to use multiple kernel threads to run the user-level threads; this would hide latencies from occasional uncontrollable blocking operations such as close().
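The "insert a yield before and after" fix for calls like close() can be sketched as a wrapper. This is an illustrative Python sketch: `yield_to_scheduler` is a stand-in for the real scheduler hook, and the point is exactly what the slide says, the latency is not removed, but one long blocking-graph edge is cut into short ones the yield profiler can attribute.

```python
def wrap_with_yields(syscall, yield_to_scheduler):
    """Wrap an occasionally-blocking call with a scheduler yield on
    either side, splitting its blocking-graph edge in two."""
    def wrapper(*args):
        yield_to_scheduler()          # end the current edge here
        result = syscall(*args)       # the possibly-blocking call
        yield_to_scheduler()          # start a fresh edge afterwards
        return result
    return wrapper

# Hypothetical usage: wrap a fake close() and count the yields.
yields = []
fake_close = wrap_with_yields(lambda fd: 0,
                              lambda: yields.append("yield"))
result = fake_close(3)
```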

22 Experiments & Results
Thread scalability (producer & consumer), I/O performance tests, and web-server tests. Setup: 4x 500 MHz Pentium server with 2 GB of memory, running Linux, with no use of epoll or Linux AIO.

23 Thread Scalability The drop between 100 and … threads is due to the cache footprint.

24 I/O Performance Concurrently passing a 12-byte token among a fixed number of pipes. Disk-head scheduling: a number of threads perform random 4 KB reads from a 1 GB file. Disk I/O through the buffer cache: 200 threads reading with a fixed miss rate.

25 When concurrency is low, performance also decreases

26 Web Server Performance Test Results
Apache web server performance improved by 15%; Knot's performance matched that of the event-based Haboob web server.

27 Web Server Performance Test Results

28 Conclusion Capriccio illustrates that with user-level threads we can get high scalability, efficient memory/stack management, and resource-aware scheduling. Drawback: lack of multi-CPU support.

29 Future Work Extending Capriccio to multiprocessor environments.
Producing profiling tools to tune stack parameters according to application needs.

30 Critique Capriccio shows that a thread library can improve scalability, memory management, and thread scheduling. The techniques used by Capriccio are novel. Presently there is no support for the Capriccio thread library, and there is still no multiprocessor/multicore support!

