Quality management in database systems A thesis proposal Yicheng Tu January 24, 2006 Advisor: Prof. Sunil Prabhakar.

Outline ► Introduction Quality-aware query processing in multimedia databases Controlling delays in data stream management systems (DSMSs) Managing uncertainty of query results in DSMSs Summary

What is quality? a degree or grade of excellence or worth … - Webster’s dictionary The nature, kind, or character (of something). Hence, the degree or grade of excellence, etc. possessed by a thing. Restricted to cases in which there is comparison (expressed or implied) with other things of the same kind. - Oxford English dictionary

Quality in data management: a series of parameters that describe the characteristics of data processing and lead to different degrees of user satisfaction. It overlaps with the concept of Quality-of-Service (QoS).

What are the problems? Two types of problems: (1) maintaining the quality of applications under highly dynamic environments; (2) determining the quality of concurrent applications for maximal user satisfaction. Focal problems are system-specific. Various techniques/solutions are involved; some are used in multimedia systems, real-time systems, and networking.

Outline Introduction ► Quality-aware query processing in multimedia databases Controlling delays in data stream management systems (DSMSs) Managing uncertainty of query results in DSMSs Summary

Quality in multimedia DBMS Quality = QoS Querying the DB with quality parameters:
    SELECT vid:[s]
    FROM VidLib1
    WHERE (vid, s) IN FindVideoWithObject(Someone)
    QUALITY Resolution = High, Color_depth = Low

QuaSAQ Quality-of-Service-Aware Query processing Users do not need to know low-level details Cost evaluation towards various global optimization goals Throughput Utilizing current system/network QoS support to deliver the query results Theory first presented in Bertino et al., 2003 Prototyping is essential

QuaSAQ architecture Our approach: Augment the query evaluation and optimization modules to directly take QoS into account Major components Offline multimedia processor Transcode media objects into copies with different QoS/formats Interesting research topic Estimate resource use Online components QoS Browser Quality Manager QoS APIs

Outline Introduction Quality-aware query processing in multimedia databases ► Controlling delays in data stream management systems (DSMSs) Managing uncertainty of query results in DSMSs Summary

Data stream management systems Applications Financial analysis Mobile services Sensor networks Network monitoring More … Continuous data, discarded after being processed Continuous query Data-active query-passive model

Qualities in DSMS data processing Data processing in DSMS is quality-critical: tuple delay, data loss, sampling rate, window size, … Overloading during spikes → degraded quality (delay). Solution on the DSMS side: adjust data loss (i.e., load shedding), eliminating excessive load by dropping data items. Tuple delay is the major concern: results generated from old data are useless! The real problem is: how to maintain processing delays while minimizing data loss?

Related work Accuracy of aggregate queries under load shedding (Babcock et al., ICDE04) Data triage (Reiss & Hellerstein, ICDE05) Put data into an asylum upon overloading QoS-driven load shedding (Tatbul et al., VLDB03) Key questions - When? - How much? - Where? Use a load shedding roadmap to decide where Simple, intuitive algorithm to decide when and how much

What's wrong? Highly dynamic environment is reality: bursty data input, variable unit processing cost. Existing solutions fail to capture current system status (queue length) and output (delay). Delay is positively related to queue length. Example 1. Unbounded increase of delay Example 2. Unnecessary data loss

Our approach View load shedding as a control problem. Control: manipulation of system behavior by adjusting system input (cruise control of automobiles, room temperature control, etc.). Open-loop vs. closed-loop (feedback) control. The feedback control loop: plant, monitor, controller, actuator, disturbances. How it works: error (e) = desired output (y_r) - measured output (y). Focal point: the controller, which maps e to the control signal u.

Challenges Can we model the system? An analytical model may not be easy to derive; system identification (experimental methods) is an alternative. How to design the controller? Use control-theoretic tools for guaranteed performance. DSMS-specific problems: lack of real-time measurement of the output signal (y); how to set the control period (T). Real system evaluation: we use Borealis in our study.

Modeling a DSMS Borealis data stream manager: round-robin operator scheduler, FIFO waiting queues. For now, fix the per-tuple processing cost c. Proposed model: y = qc, where q is the number of outstanding data tuples. Discrete form: y(k) = q(k-1)c. Denote the input load as f_i and the system processing power as f_o:
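
A minimal sketch of the discrete model y(k) = q(k-1)c, written in Python. The queue update used here, q(k) = max(0, q(k-1) + T(f_i - f_o)) with f_o = 1/c and control period T, is an assumption made for illustration and is not taken from the slide; the function and parameter names are hypothetical as well. Under sustained overload the simulated delay keeps growing, which is the behaviour described as "unbounded increase of delay" earlier in the talk.

```python
def simulate_queue(input_rates, c, T=1.0, q0=0.0):
    """Simulate the delay model y(k) = q(k-1) * c.

    input_rates: arrival rate f_i (tuples/second) for each control period.
    c: fixed per-tuple processing cost, in seconds.
    T: control period, in seconds.
    q0: initial number of outstanding tuples.
    Returns the list of delays, one per control period.
    """
    f_o = 1.0 / c  # ASSUMPTION: processing power when the CPU is fully used
    q_prev = q0
    delays = []
    for f_i in input_rates:
        delays.append(q_prev * c)  # y(k) = q(k-1) * c
        # ASSUMPTION: the backlog changes by the rate surplus and never goes negative.
        q_prev = max(0.0, q_prev + T * (f_i - f_o))
    return delays


if __name__ == "__main__":
    # 150 tuples/s arriving at a system that can process 100 tuples/s.
    print(simulate_queue([150.0] * 5, c=0.01))
```

With no load shedding, nothing in this open-loop model pulls the delay back down once the input rate exceeds the processing power.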

Controller design Design based on pole placement Guaranteed performance targeting Convergence rate - responsiveness Damping - smoothness The controller:
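
To make the pole-placement idea concrete, here is a hedged Python sketch; it is not necessarily the controller proposed in the talk. It assumes the queue dynamics q(k+1) = q(k) + T(u(k) - f_o) with f_o = 1/c, and a proportional control law u(k) = f_o + K_p e(k) / (cT), where u is the admitted input rate. Under those assumptions the closed loop has characteristic polynomial z^2 - z + K_p, so choosing K_p = 0.25 places both poles at z = 0.5. All names are hypothetical.

```python
def run_closed_loop(f_in, c, y_target, T=1.0, kp=0.25, steps=60):
    """Simulate feedback-controlled load shedding on the model y(k) = q(k-1) * c.

    f_in: constant arrival rate (tuples/second).
    c: per-tuple processing cost (seconds); y_target: desired delay (seconds).
    kp: proportional gain; 0.25 puts both closed-loop poles at 0.5 (assumed model).
    Returns the measured delay in each control period.
    """
    f_o = 1.0 / c              # ASSUMPTION: processing power at full CPU use
    q_prev, q = 0.0, 0.0       # q(k-1) and q(k)
    delays = []
    for _ in range(steps):
        y = q_prev * c                     # measured output y(k)
        e = y_target - y                   # error e = y_r - y
        u = f_o + kp * e / (c * T)         # controller: desired admitted rate
        # Actuator (load shedder): cannot admit more than arrives, or less than zero.
        admitted = min(f_in, max(0.0, u))
        q_prev, q = q, max(0.0, q + T * (admitted - f_o))
        delays.append(y)
    return delays


if __name__ == "__main__":
    # Overload: 200 tuples/s arriving, 100 tuples/s capacity, 2-second target delay.
    trace = run_closed_loop(f_in=200.0, c=0.01, y_target=2.0)
    print(trace[:6], trace[-1])
```

In this sketch the fraction of tuples shed in a period is 1 - admitted / f_in; a larger gain would respond faster but moves the poles toward oscillatory behaviour.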

Experiments Controller and load shedder implemented in Borealis Synthetic (“pareto”) and real (“Web”) data streams Small query network with variable average processing cost

Experimental results Experiments for comparison: Aurora (open-loop solution) and Baseline (a simple feedback method). Target delay: 2000 ms. Control period: 1 second. Total time: 400 seconds. For both input types, data loss is almost the same for all three load shedding strategies.

Future work Time-varying DSMS model For example, time-varying cost c Possible solution: adaptive control Adaptation other than load shedding New disturbances? Model changes?

Outline Introduction Quality-aware query processing in multimedia databases Controlling delays in data stream management systems (DSMSs) ► Managing uncertainty of query results in DSMSs Summary

The problem DSMS has limited resources: CPU, memory, bandwidth, … Users tolerate a certain level of uncertainty in the query results they receive. Probabilistic queries. Forecasting of stream values → no need to get updates from all streams at all times.

Related work Probabilistic queries (Cheng et al., SIGMOD03) Adaptive filters of stream data (Olston et al., SIGMOD03) Model-based data acquisition (Deshpande et al., VLDB04) Kalman filtering of stream data (Jain et al., SIGMOD04) Brownian model-based data filtering (Zhu et al., VLDB04) Using statistical models to predict the future value of streams s.t. we can treat streams differently!

What’s missing? Current work focuses on maintaining the correctness of stream value DSMS vs. data caching system: User-specified queries (e.g., MIN/MAX, AVG/SUM) For example, how does the future value of a stream affect the outcome of a MIN query? The one with the most uncertain prediction may not be the one we care about. Need a systematic way to address the quality of probabilistic queries under resource constraints

Problem statement Total of n data streams. Let y_{j,i} be the value of stream j at time i. Query result q_i at time i is a function of y_{j,i} (j = 1, 2, 3, …, n), e.g., q_i = f(y_{1,i}, y_{2,i}, …, y_{n,i}). At time m, we are only allowed to update c(m) streams. How to select the c(m) streams (from a total number of n) to update s.t. the query result q_i is of the highest quality?

Definition of quality Let ŷ_{j,i} be a predictor of y_{j,i}. Naturally, q̂_i = f(ŷ_{1,i}, ŷ_{2,i}, …, ŷ_{n,i}) is a predictor of q_i. Quality can be defined as the risk R of using q̂_i as an estimator of q_i. Consider precision and accuracy. A candidate for such a risk is the mean square error.
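
One way to see why mean square error covers both precision and accuracy is the standard decomposition below. It is a sketch under the assumption that the expectation is taken over the joint distribution of the predicted and true query results; the slide itself does not spell this out.

```latex
R(\hat{q}_i)
  = \mathrm{E}\!\left[(\hat{q}_i - q_i)^2\right]
  = \mathrm{Var}(\hat{q}_i - q_i) + \left(\mathrm{E}[\hat{q}_i - q_i]\right)^2
```

Read this way, the variance term reflects precision (how widely the prediction error scatters) and the squared-bias term reflects accuracy (how far the error is from zero on average).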

Issues Quantity q i is unknown, need a practical target function Models for one-step ahead estimation of y i State-space methods, let’s start with Kalman filters How to decide the benefit of individual stream values to q i ? Algorithm has to be fast - decisions have to be made within short time intervals.
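
As a hedged illustration of one-step-ahead estimation with a Kalman filter, the Python sketch below tracks a single scalar stream. It assumes a random-walk state model (the value stays the same between steps, plus Gaussian noise), which is a simplification chosen here and not a model stated in the talk; the function and parameter names are hypothetical. When a stream is not updated in a period, only the prediction step runs and the error variance grows.

```python
def kalman_step(x_est, p_est, z, q_var, r_var):
    """One predict/update cycle of a scalar Kalman filter.

    x_est, p_est: previous state estimate and its error variance.
    z: new measurement from the stream, or None if the stream was not updated.
    q_var: process noise variance; r_var: measurement noise variance.
    Returns the new (estimate, error variance).
    """
    # Predict. ASSUMPTION: random-walk model, so the predicted value is unchanged.
    x_pred = x_est
    p_pred = p_est + q_var
    if z is None:
        # No update received: keep the prediction, uncertainty has grown.
        return x_pred, p_pred
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r_var)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


if __name__ == "__main__":
    x, p = kalman_step(0.0, 1.0, 2.0, q_var=0.0, r_var=1.0)
    print(x, p)
    print(kalman_step(x, p, None, q_var=0.5, r_var=1.0))
```

The error variance returned for each stream is one possible input to a stream-selection rule, although, as the talk notes, the stream with the most uncertain prediction is not necessarily the one that matters most to the query.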

Other works Performance analysis of peer-to-peer media streaming systems [TOMCCAP05, MMCN04] Change point estimation of bi-level functions [JMASM06, CSC05] Entity-based queries with non-value tolerance [VLDB05] Tertiary storage support in VDBMS [MMJ04, ICDE03]

Publications
1. Leming Qu and Yi-Cheng Tu. Change Point Estimation of Bi-level Functions. To appear in Journal of Modern Applied Statistical Methods.
2. Yi-Cheng Tu, Jianzhong Sun, Mohamed Hefeeda, and Sunil Prabhakar. An Analytical Study of Peer-to-Peer Media Streaming Systems. To appear in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP).
3. Reynold Cheng, Ben Kao, Sunil Prabhakar, Alan Kwan, and Yi-Cheng Tu. Adaptive Stream Filters for Entity-Based Queries with Non-Value Tolerance. In Proceedings of Intl. Conf. on Very Large Databases (VLDB), pp. 37-48, Trondheim, Norway, August 2005.
4. Yi-Cheng Tu, Jingfeng Yan, and Sunil Prabhakar. Quality-Aware Replication of Multimedia Data. In Proceedings of International Conference of Database and Expert Systems Applications (DEXA), pp. 240-249, Copenhagen, Denmark, August 2005.
5. Yi-Cheng Tu, Mohamed Hefeeda, Yuni Xia, Sunil Prabhakar, and Song Liu. Control-based Quality Adaptation in Data Stream Management Systems. In Proceedings of International Conference of Database and Expert Systems Applications (DEXA), pp. 746-755, Copenhagen, Denmark, August 2005.
6. Leming Qu and Yi-Cheng Tu. Change Point Estimation of Bar Code Signals. In Proceedings of International Conference on Scientific Computing (CSC), pp. 109-114, Las Vegas, USA, June 2005.

Publications (cont'd)
7. Yi-Cheng Tu, Sunil Prabhakar, Ahmed Elmagarmid and Radu Sion. QuaSAQ: An Approach to Enabling End-to-End QoS for Multimedia Databases. In Proceedings of International Conference on Extending Database Technology (EDBT), pp. 694-711, Heraklion, Greece, March 2004.
8. Yi-Cheng Tu, Jianzhong Sun and Sunil Prabhakar. Performance Analysis of A Hybrid Media Streaming System. In Proceedings of SPIE/ACM Conference on Multimedia Computing and Networking (MMCN), pp. 69-82, San Jose, CA, January 2004.
9. W. Aref, A. Catlin, A. Elmagarmid, J. Fan, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, Y.-C. Tu and X. Zhu. VDBMS: A Testbed Facility for Research in Video Database Benchmarking. Springer/ACM Multimedia Systems, 9(6):575-585, June 2004.
10. W. Aref, A. Catlin, A. Elmagarmid, J. Fan, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, Y.-C. Tu and X. Zhu. VDBMS: A Testbed Facility for Research in Video Database Benchmarking. In Proceedings of Intl. Conf. on Distributed Multimedia Systems (DMS), pp. 160-166, 2003.
11. W. Aref, A. Elmagarmid, J. Fan, J. Guo, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, A. Rezgui, A. Teoh, E. Terzi, Y.-C. Tu, A. Vakali, X. Zhu. A Distributed Database Server for Continuous Media (Demo). In Proceedings of International Conference on Data Engineering (ICDE), pp. 490-491, San Jose, CA, March 2002.

Submitted drafts
12. Yi-Cheng Tu, Sunil Prabhakar, Jingfeng Yan, and Gang Shen. Selection of Quality-Specific Caches of Multimedia Data. Submitted to journal.
13. Yi-Cheng Tu, Song Liu, Sunil Prabhakar, and Bin Yao. Load Shedding in Stream Databases: A Control-Based Approach. Submitted to conference.
14. Yi-Cheng Tu. Control-based Load Shedding in Data Stream Management Systems. Submitted to workshop.

Future plans
1. Journal submission: control-based load shedding in DSMSs. VLDB Journal (early 2006).
2. Improving quality of (single) probabilistic queries by forecasting models. ICDE (July 2006).
3. Optimizing quality of probabilistic queries in a multi-query environment. SIGMOD (December 2006).
4. Control techniques in self-adaptive DBMS. CIDR (August 2006).
5. Thesis defense: early 2007.

References
E. Bertino, A. Elmagarmid, and M.-S. Hacid. A Database Approach to Quality of Service Specification in Video Databases. SIGMOD Record, 32(1):35-40, 2003.
B. Babcock, M. Datar, and R. Motwani. Load Shedding for Aggregation Queries over Data Streams. Procs. of ICDE 2004.
C. Olston, J. Jiang, and J. Widom. Adaptive Filters for Continuous Queries over Distributed Data Streams. Procs. of SIGMOD 2003, p. 563-574.
F. Reiss and J. M. Hellerstein. Data Triage: An Adaptive Architecture for Load Shedding in TelegraphCQ. Procs. of ICDE 2005, p. 155-156.
N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load Shedding in a Data Stream Manager. Procs. of VLDB 2003, p. 309-320.
S. Zhu and C. V. Ravishankar. Stochastic Consistency, and Scalable Pull-Based Caching for Erratic Data Sources. Procs. of VLDB 2004, p. 192-203.
A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-Driven Data Acquisition in Sensor Networks. Procs. of VLDB 2004, p. 588-599.
R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating Probabilistic Queries over Imprecise Data. Procs. of SIGMOD 2003, p. 551-562.
A. Jain, E. Y. Chang, and Y.-F. Wang. Adaptive Stream Resource Management Using Kalman Filters. Procs. of SIGMOD 2004, p. 11-22.

Thank you! Questions?

DSMS architecture Network of query operators (O1 – O3) Each operator has its own queue (q1 – q4) Scheduler decides which operator to execute Query results (Q1, Q2) pushed to clients Example systems: Aurora/Borealis STREAM

Why feedback control? [Figure: open-loop vs. closed-loop control]

Backup

Experimental results - 2 Lack of robustness of the open-loop solution: a more optimistic policy is adopted in Aurora, leading to unstable performance. Our solution is robust under input streams with different burstiness.

Backup - 2

Quality adaptation in multimedia Adaptation techniques to satisfy quality needs Dynamic adaptation: online transcoding, layered coding Static adaptation: retrieve pre-coded replica from storage Dynamic adaptation transcoding is very expensive, poor cost efficiency Topic of active research Layered coding Not included in all standards Number of qualities limited Not as popular as expected Things may change in future

Static adaptation Little CPU cost. Choice of most commercial service providers. What about storage cost? On the order of the total number of quality points. Ignored in previous research, assuming very few quality profiles and that storage is dirt cheap. Excessively high for service providers. Selection of quality becomes a problem.

Quality-aware replication We view this as a data replication problem We formulate the problem under two user behavior models Hard-quality: rigid quality requirements Soft-quality: users willing to negotiate the quality received, user satisfaction decreases when desired quality is not received

The problem An optimization: get the highest utility given the popularity (f_k) and storage cost (s_k) of all quality points under total storage S. u(j, k): the utility when a request on quality j is served by a replica of quality k. Utility is given as a function of distance in quality space. Requests are served by the closest replica. The problem is NP-hard: a variation of the k-median problem with extra difficulties coming from multiple media objects.

A greedy solution Aggressively selects replicas based on the ratio of marginal utility gain (∆u) to cost (s_k) Time complexity: where I is the # of replicas selected and m the total # of possible replicas A conjecture: Greedy is 1.33-competitive!
    selected replica set P := ∅
    available storage s' := S
    while s' > 0
        add the quality point that yields the largest ∆u/s_k value to P
        decrease s' by s_k
    return P
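
The Greedy pseudocode can be sketched in runnable Python as follows, for a single media object. Several details are assumptions made for illustration and are not from the slides: the utility function is passed in by the caller, each request is served by the selected replica that gives it the highest utility (which equals the closest replica when utility decreases with distance), and the loop also stops when no remaining replica fits in the leftover storage or adds any utility, which avoids looping forever when s' stays positive. All names are hypothetical.

```python
def total_utility(selected, popularity, utility):
    """Popularity-weighted utility when each request uses its best selected replica."""
    if not selected:
        return 0.0
    return sum(f_j * max(utility(j, k) for k in selected)
               for j, f_j in popularity.items())


def greedy_select(popularity, cost, storage, utility):
    """Pick replicas by largest marginal utility gain per unit of storage.

    popularity: {quality point j: request frequency f_j}
    cost: {quality point k: storage cost s_k}
    storage: total storage S
    utility: function u(j, k) for a request on quality j served by replica k
    """
    selected = set()
    remaining = storage
    while True:
        base = total_utility(selected, popularity, utility)
        best, best_ratio = None, 0.0
        for k, s_k in cost.items():
            if k in selected or s_k > remaining:
                continue  # already chosen, or does not fit (ASSUMED stop rule)
            gain = total_utility(selected | {k}, popularity, utility) - base
            if gain / s_k > best_ratio:
                best, best_ratio = k, gain / s_k
        if best is None:
            return selected
        selected.add(best)
        remaining -= cost[best]


if __name__ == "__main__":
    # 1-D quality space; a larger index means lower quality and a cheaper replica.
    popularity = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.2, 4: 0.1}
    cost = {0: 5, 1: 4, 2: 2, 3: 2, 4: 1}
    soft_utility = lambda j, k: max(0.0, 1.0 - abs(j - k) / 2.0)  # ASSUMED shape
    print(greedy_select(popularity, cost, storage=3, utility=soft_utility))
```

Recomputing the total utility for every candidate keeps the sketch short; it is not meant to match the time complexity quoted in the talk.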

Extensions An iterative variation can further improve the solution Iterative Greedy: run Greedy iteratively Same time complexity Handling multiple ( V > 1) media objects Greedy can be easily extended to do this Time complexity:, can be reduced to with some tweaks Dynamic selection of qualities (dynamic replication) Popularity f of replicas could change over time Naïve solution - run Greedy every time a change of f occurs An elegant solution derived from Greedy Time complexity: O ( I log V ) Provable optimality of quality selection – same as the Naïve solution

Effectiveness of algorithms For comparison: The optimal solution (by CPLEX) Random selections Local popularity-based

Efficiency of algorithms CPLEX < Iterative Greedy < Greedy < Random < Local Results on a P4 2.4 GHz CPU:

Storage for replication Empirical formula to calculate storage after transcoding to a lower quality in one dimension: Sum of all replicas when there are n qualities Three dimensions:, total storage is thus O ( n ^3) For d dimensions, O ( n^d )

Dynamic FSRS algorithm Based on the RR idea. Proved performance: results given are as optimal as those chosen by Greedy. Preprocess phase: build the RRs. Online phase: perform exchanges until total utility converges. Time complexity: O(I log V), where I is the # of storage exchanges that occur and V is the # of media objects.

More experimental results Selection of replicas by Greedy, 21 X 21 2-D quality space with larger number representing lower quality (i.e., point (20,20) is of the lowest quality), V = 30 Same inputs, results given by Iterative Greedy

An illustration: Greedy

An illustration: Iterative Greedy

Dynamic replication Popularity f of replicas could change over time Naïve solution - run Greedy every time a change of f occurs – is too slow When f of all replicas of a single media changes together, there is an elegant solution based on Greedy Consider the order replicas are selected by Greedy – follow a predefined path (replication roadmap, or RR) for each media object RRs are all convex The one that becomes more popular takes storage from the least popular one The one that becomes less popular gives up storage to the most popular one It is efficient to make exchanges at the frontiers of the RRs, no need to look inside

Replication Roadmap example Media A should take storage from media B as the slope of its current segment in RR is greater than that of B’s

Dynamic replication Randomly generated changes of f Compare with Greedy Results with (almost) the same optimality as Greedy Reason: small number of storage exchanges

Model verification Feed Borealis with synthetic streams Input rate: step function or sinusoidal function of time Average processing cost is fixed

Control period Provides a complete answer to the question “when to shed load?” Arbitrarily set in previous studies. Case-by-case decision with some systematic rules. In our problem, a tradeoff between: Sampling theory (Nyquist-Shannon theorem): in order to capture the moving trends of the disturbances, a higher sampling frequency (shorter period) is preferred. Stochastic feature of output (y) and parameter (c): more samples are needed → a longer period is preferred. The first factor should be given more weight.

Thanks to My wife and family My advisor: Prof. Sunil Prabhakar My advising committee Prof. Walid Aref Prof. Ahmed Elmagarmid Prof. Dongyan Xu Collaborators Prof. Reynold Cheng Prof. Ahmed Elmagarmid Prof. Mohamed Hefeeda Dr. Song Liu Prof. Leming Qu Mr. Gang Shen Prof. Radu Sion Dr. Jianzhong Sun Prof. Yuni Xia Mr. Jingfeng Yan
