Control Theory in Log Processing Systems


1 Control Theory in Log Processing Systems
Wei Xu, UC Berkeley
Joseph L. Hellerstein, IBM T.J. Watson Research Center

2 Outline
Data streams and log processing
Applying control theory
Controlling queue length
Load balancing
Lessons learned

3 Introduction
Goal of our project: a tool and a testbed
Problem: data rate of up to 1 TB a day
Distributed infrastructure: how do we make the infrastructure itself reliable?
Speaker notes: The main goal of our project is to analyze log data from online services such as Amazon or eBay. These systems are very complex and they often fail. ... However, because of the very high data rate and the complexity of the logs, we had problems processing the data.

4 Example of system log data
Request data: Apache log, etc.
Performance data: CPU, memory, etc.
Failure data: detected problems / error messages, reports from operators
450 attributes, 11,000 requests a second

5 The big picture
[Diagram labels: Production System, raw log data, Data Collection, preprocessing (?), Sanitized Data, Repository, Automatic analysis, Failure Detection]
Speaker notes: Add an “AOL” box in front of the orange arrows, and add a feedback loop back into AOL. How would this be used in real life? It is in the critical path of failure recovery, so the speed of data analysis is critical for recovery, and so is the speed of preprocessing. ... Also, how do we evaluate this framework? How much delay do we introduce? What happens if a node in the preprocessing step fails, and can we handle that? Put the data sanitizing functions into TCQ.

6 Preprocessing
Sanitize the data
Put logs into a common format
Merge information from various sources
Sampling
Needs to be fast
Speaker notes: All the required preprocessing should be done outside the algorithm. ... As new data streams -> as new input

7 Stream processing
Log data are data streams
Preprocessing tasks are continuous queries
Telegraph Continuous Query (TCQ)
  SQL queries
  Adaptive: execution is optimized on the fly
  Performance doesn't depend on the number of queries
[Diagram labels: TCQ, query Q, SLT]
Speaker notes: We think that stream processing is a good data model for system log data.

8 Data preprocessing architecture
[Diagram labels: load splitter, TCQ query Q (Intra-Event Processing), combiner, TCQ query R (Inter-Event Processing), SLT 1, SLT 2. Tuples 1 to 6 are split across the query Q nodes, recombined in sequence, and grouped by query R as 3+2+1 and 6+5+4.]
Can be easily distributed over a cluster of machines
Linear scaling
Performance of a TCQ node depends on the data rate, not on the number of queries running => can generate many streams
Can be extended or reorganized in any way
Why (at least) two tiers?
Sampling should be the first thing to do in the pipeline to reduce the data rate (that's why we need parallelism)
How to support off-line algorithms?
Speaker notes: “One machine running TCQ can't handle 1 TB of data a day, so we need to distribute the processing. At the same time, we also want to extract temporal information from the data, and thus we need to process the data in sequence. These contradicting goals ...”
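The splitter/combiner idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each tuple carries a sequence number, that the splitter is round-robin, and that the combiner restores the sequence with a sorted merge. The function names `split_round_robin` and `combine_in_order` are hypothetical.

```python
import heapq


def split_round_robin(tuples, n_nodes):
    """Distribute (seq, payload) tuples across n_nodes lanes, round-robin.

    Assumption: one lane per node running the intra-event query.
    """
    lanes = [[] for _ in range(n_nodes)]
    for i, t in enumerate(tuples):
        lanes[i % n_nodes].append(t)
    return lanes


def combine_in_order(lanes):
    """Merge per-node outputs back into one stream ordered by sequence number.

    heapq.merge requires each lane to be sorted already, which holds here
    because every node emits its tuples in arrival order.
    """
    return list(heapq.merge(*lanes))


stream = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]
lanes = split_round_robin(stream, 3)
restored = combine_in_order(lanes)
```

With three lanes, the first node receives tuples 1 and 4, which matches the pairs shown in the diagram, and the restored stream is back in sequence for the inter-event query.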

9 Problem: performance disturbance
CPU contention
Maintenance tasks
Packet drops
Other failures
Selectivity changes

10 The result of disturbance
[Plot: end-to-end response time (ms) vs. time (seconds)]

11 Solution – Control Theory
Treat this as a failure? Not necessary, and too expensive
Feedback control theory as a first-tier defense mechanism
Try to keep the system stable, at least for some time
If that doesn't work, try recovery

12 Outline
Data streams and log processing
Applying control theory
Controlling queue length
Load balancing
Lessons learned

13 The problem
[Diagram labels: Source, Buffer, TCQ, Result Q]

14 Why does this happen?
TCQ has a complex internal structure
TCQ drops tuples silently if the result queue is full
Back pressure is not possible
[Diagram labels: Controlled Data Source, Input Buffer, TCQ]

15 Control Problems
Goal? No dropping tuples
What to control? The result queue length
The knob? Input data rate to the TCQ node

16 Control block diagram
Target system (system identification)
u(k) = u(k-1) + (Kp + KI) e(k) - Kp e(k-1)
  u(k): data rate in the next interval
  u(k-1): data rate in the last interval
  e(k): error
  e(k-1): last error
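The update rule on this slide can be written as a few lines of code. This is a minimal sketch under stated assumptions: the error is taken as target queue length minus measured queue length, the output is clamped to a valid data-rate range, and the class name `PIController` and its parameters are hypothetical. The gains and sign convention used in the original system are not given in the slides.

```python
class PIController:
    """Incremental PI controller: u(k) = u(k-1) + (Kp + KI)*e(k) - Kp*e(k-1)."""

    def __init__(self, kp, ki, target, u0=0.0, u_min=0.0, u_max=float("inf")):
        self.kp = kp
        self.ki = ki
        self.target = target      # desired result queue length
        self.u_prev = u0          # data rate in the last interval
        self.e_prev = 0.0         # last error
        self.u_min = u_min
        self.u_max = u_max

    def update(self, measured):
        """Return the data rate for the next interval given the measured queue length."""
        e = self.target - measured  # assumed sign convention
        u = self.u_prev + (self.kp + self.ki) * e - self.kp * self.e_prev
        u = min(max(u, self.u_min), self.u_max)  # keep the rate in a valid range
        self.u_prev = u
        self.e_prev = e
        return u


ctrl = PIController(kp=0.5, ki=0.1, target=100.0, u0=10.0)
u1 = ctrl.update(80.0)   # error 20: 10 + 0.6*20 - 0.5*0
u2 = ctrl.update(90.0)   # error 10: u1 + 0.6*10 - 0.5*20
```

The incremental form only needs the previous rate and the previous error, so the controller keeps two numbers of state between sampling intervals.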

17 Result – Under CPU Contention
[Diagram labels: Source, Buffer, TCQ, Result Q]

18 Why useful?
Original system: input data rate => tuple drop vs. no drop
New system: input data rate => response time
Makes the system ready for load balancing

19 Outline
System log as data streams
Applying control theory
Controlling queue length
Load balancing
Lessons learned

20 The problem
Barrier in the system
Different response times
End-to-end response time matches the slower node

21 The control problem
Goal? Make the response times equal
What to control? Response time on each node
The knob? Tuples assigned to each node
What to monitor? Queue length vs. response time
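One way to picture this control problem for two nodes is an integral-style rule that shifts the share of tuples toward the faster node. The slides do not give the actual controller (the block diagram is an image), so everything below is an assumption for illustration: the function name `rebalance`, the gain, and the clamping limits are hypothetical.

```python
def rebalance(share_a, rt_a, rt_b, gain=0.001, lo=0.05, hi=0.95):
    """Return the new fraction of tuples assigned to node A.

    share_a: current fraction of tuples sent to node A (node B gets the rest)
    rt_a, rt_b: measured response times of the two nodes, in ms
    The error is positive when node B is slower, which moves load to node A.
    """
    error = rt_b - rt_a
    new_share = share_a + gain * error
    # Clamp so that neither node is ever starved completely.
    return min(max(new_share, lo), hi)


more_to_a = rebalance(0.5, rt_a=100.0, rt_b=200.0)
unchanged = rebalance(0.5, rt_a=150.0, rt_b=150.0)
saturated = rebalance(0.9, rt_a=0.0, rt_b=1000.0)
```

When the two response times are equal the error is zero and the split stays where it is, which is the goal stated on the slide.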

22 System with control Response time

23 Control block diagram

24 Result
[Plot: end-to-end response time (ms) vs. time (seconds)]

25 Outline
System log as data streams
Applying control theory
Controlling queue length
Load balancing
Lessons learned

26 Advantages of control theory
Performance can be analyzed
  Stability
  Accuracy
  Settling time
  Overshoot
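Three of these properties can be measured directly from a recorded step response. The sketch below is an illustration with assumed definitions: overshoot as the peak excess relative to the target, settling time as the first sample index after which the output stays within a 2% band, and accuracy as the final absolute error. The function name `step_metrics` and the sample data are hypothetical. Stability is not computed here; it is normally assessed from the closed-loop model.

```python
def step_metrics(samples, target, band=0.02):
    """Compute overshoot, settling index, and steady-state error of a step response."""
    peak = max(samples)
    overshoot = max(0.0, (peak - target) / target)

    # Walk backwards to find the earliest index from which the output
    # never leaves the tolerance band around the target again.
    limit = band * abs(target)
    settle = None
    for k in range(len(samples) - 1, -1, -1):
        if abs(samples[k] - target) > limit:
            break
        settle = k

    return {
        "overshoot": overshoot,
        "settling_index": settle,  # None if the output never settles
        "steady_state_error": abs(target - samples[-1]),
    }


response = [0, 60, 110, 120, 105, 101, 100, 100]
metrics = step_metrics(response, target=100)
```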

27 Other advantages
Simple implementation
Encourages good system design
Modeling the system: treat the system as a black box
First defense mechanism against disturbances in the system

28 Limitations
Not all software systems are designed to be controlled
  Finite input produces unbounded output, e.g. Join in TCQ
  Useful state is not measurable
  Queuing theory helps, but other good theory is lacking
  Many binary variables: failed vs. working correctly

29 Other Limitations
Cannot find the cause of a problem
The model for the target system is complex
Lack of a reliable knob
  E.g. changing the result queue length of TCQ sometimes makes it crash
  What is the range you can turn it?
  How often can you turn it?
  How long does the system take to respond?

30 Solution?
More advanced modeling and controller?
  Adaptive control
Design controller-friendly systems?
  A simple model
  User-configurable parameters -> knobs?

31 Future Work
As a tool, real users?
Scheduling multiple streams
Dynamically scale up/down
Other control theory applications

32 Backup Slides

33 Future Work
Load balancer
Load control across multiple tiers
Scheduling of multiple streams

34 System with control
[Diagram labels: Controlled Output Rate Data Source, Controller, Queue Length Monitor]

35 Result
[Diagram labels: Source, Buffer, TCQ, Result Q]

36 Conclusion
Advantages of feedback control
  Makes the system more robust under disturbance
  Allows more time for failure detection
  Treats complex systems as black boxes
  Copes with the system's characteristics instead of having to change them
  Theoretical analysis
  Implementation is easy
System statistics can also be used for SLT

37 Output Thread (Code Reuse)
What is going on?
[Diagram labels: Desired Queue Length, Queue Length Controller, Data Rate to TCQ, Controlled Output Thread (Code Reuse), Actual Queue Length]

38 Theory meets reality
[Plot: queue length vs. time, with output Y from simulation]

39 Tricky part of parameter estimation
Model evaluation: making the system operate in the desired range
Easy for data source, but queue length ..
[Plot: data rate vs. free space, showing a non-linear range]

40 Why do we need control? Data source does not provide accurate data rate

41 Control Problems
Not accurate for various reasons
  Scheduling
  Time spent on I/O
  Etc.
Providing an accurate data source using feedback control
  By controlling the input of “desired rate”

42 The Control Architecture
[Diagram values: 1500, 1900, 1600]
P Controller (with precompensation): u(k) = Kp e(k)
PI Controller: u(k) = u(k-1) + (Kp + KI) e(k) - Kp e(k-1)
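To see why a P controller needs precompensation while a PI controller does not, it helps to simulate one. The sketch below assumes a first-order target system y(k+1) = a·y(k) + b·u(k), which is not stated in the slides. The values of a, b, and Kp are made up, and the function name `simulate_p_control` is hypothetical. Under that model, plain P control settles at b·Kp / (1 - a + b·Kp) times the reference, so the precompensator scales the reference by the inverse of that gain.

```python
def simulate_p_control(a, b, kp, reference, steps=50, precompensate=True):
    """Simulate y(k+1) = a*y(k) + b*u(k) under P control u(k) = kp*e(k).

    Assumed model: first-order, with the closed loop stable when |a - b*kp| < 1.
    Returns the output after `steps` sampling intervals.
    """
    # Precompensation rescales the reference so the steady-state output equals it.
    scale = (1 - a + b * kp) / (b * kp) if precompensate else 1.0
    y = 0.0
    for _ in range(steps):
        e = scale * reference - y   # error against the (scaled) reference
        u = kp * e                  # P control law from the slide
        y = a * y + b * u           # assumed target system response
    return y


with_pre = simulate_p_control(a=0.5, b=1.0, kp=0.4, reference=1000.0)
without_pre = simulate_p_control(a=0.5, b=1.0, kp=0.4, reference=1000.0,
                                 precompensate=False)
```

Without precompensation the simulated rate settles well below the requested one, which is the steady-state error that the integral term of the PI controller removes on its own.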

43 Result – An accurate data source
[Plots: P Controller with Pre-compensation; PI Controller]

44 Zoom In
A lot of small disturbances in a Java program
  Incremental garbage collection
[Plots: P Controller; PI Controller]

