2
Supporting Streaming Updates in an Active Data Warehouse
Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell
3
ICDE 2007, Constantinople 18/4/2007
Forecast
Problem in active data warehousing:
  – the join between a fast stream of source updates and a disk-based relation under the constraint of limited memory
Solution:
  – the mesh join, a novel join operator that operates under minimal assumptions for the stream and the relation
Features:
  – a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate
4
Roadmap
Motivation & Problem statement
The Mesh-Join Algorithm
Cost model & Tuning
Experiments
Conclusions
6
ETL workflows
[Figure: ETL workflow diagram spanning Sources, DSA, and DW. Recoverable labels: source feeds S1_PARTSUPP and S2_PARTSUPP transferred via FTP1 and FTP2; DIFF1 and DIFF2 comparing DS.PS_NEW with DS.PS_OLD on PKEY; Add_SPK1 (SUPPKEY=1) and Add_SPK2 (SUPPKEY=2); surrogate-key steps SK1 and SK2 using DS.PS.PKEY, LOOKUP_PS.SKEY, SUPPKEY; $2€ on COST; A2EDate on DATE; AddDate (DATE=SYSDATE); NotNULL and CheckQTY (QTY>0) checks with rejected rows logged; union (U) into DW.PARTSUPP; Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)) feeding views V1 and V2; TIME (DW.PARTSUPP.DATE, DAY).]
7
Active Data Warehousing
Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software
Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data
8
Issues around Active Warehousing
Smooth upgrade of the software at the (legacy) source
  – minimal modification of the software configuration at the source side
Minimal overhead of the source system
No data losses are allowed in the long run
Maximum freshness of data
  – the response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable
Scalability at the warehouse side
  – the architecture should scale up with respect to the number of sources and data consumers at the DW
  – if possible, cover issues like checkpointing, index maintenance
9
Grand view of an Active DW
10
Problem statement
Joining a fast stream of updates with a persistent relation within limited memory bounds is of particular importance in the Active Warehousing setting
Example practical cases:
  – Surrogate Key assignment
  – Duplicate detection
  – …
11
Example: Surrogate Key
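To make the surrogate-key case concrete, here is a minimal sketch of it as a stream-relation join. Each source update carries a source id and a production key, and a lookup relation maps that pair to a warehouse surrogate key. The names `LOOKUP` and `assign_surrogate_keys` and all the data are illustrative assumptions. The in-memory dictionary stands in for the disk-based relation, which is precisely the shortcut that is unavailable when memory is limited.

```python
# Hypothetical lookup relation: (source id, production key) -> surrogate key.
LOOKUP = {
    ("S1", 10): 100,
    ("S1", 11): 110,
    ("S2", 10): 120,
}


def assign_surrogate_keys(updates, lookup):
    """Replace each update's (source, production key) with its surrogate key."""
    for source, pkey, payload in updates:
        skey = lookup.get((source, pkey))
        if skey is None:
            # No match in the lookup relation: an equi-join drops the tuple.
            # A real ETL flow would probably log or reroute it instead.
            continue
        yield skey, payload


stream = [("S1", 10, "cost=5"), ("S2", 10, "cost=7"), ("S2", 99, "cost=1")]
result = list(assign_surrogate_keys(stream, LOOKUP))
```

The same production key (10) maps to different surrogate keys depending on the source, which is why the lookup has to be a join rather than a simple renumbering.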
12
Roadmap
Motivation & Problem statement
The Mesh-Join Algorithm
Cost model & Tuning
Experiments
Conclusions
13
Operation of Mesh-Join
14
(Not really any) Assumptions
No assumption of any order in either the stream or the relation
No indexes are necessarily present
Limited memory is available
The join condition is arbitrary (equality, similarity, range, etc.)
The join relationship is general (i.e., many-to-many, one-to-many, or many-to-one)
The result is exact
… but: the relation remains fixed throughout the join
15
Architecture of Mesh-Join
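The operation and architecture diagrams are not reproduced in this transcript, so the following is only a sketch of how the components named in the cost model (a buffer of b relation pages, a batch of w stream tuples, a queue Q, and a hash table H) could fit together. It assumes an equality join on the first field of each tuple, a relation held as a Python list of pages rather than on disk, and a finite stream so that the loop terminates. The function name `mesh_join` and its parameters are illustrative, not taken from the slides.

```python
from collections import defaultdict, deque
from itertools import islice
from math import ceil


def mesh_join(stream, relation_pages, b, w):
    """Yield (stream_tuple, relation_tuple) pairs whose keys (field 0) match.

    relation_pages: list of pages, each a list of tuples (stands in for disk).
    b: pages of the relation read per iteration.
    w: stream tuples admitted per iteration.
    """
    k = ceil(len(relation_pages) / b)  # iterations for one full scan of R
    if k == 0:
        return  # empty relation: nothing can join
    stream = iter(stream)
    H = defaultdict(list)  # join key -> stream tuples currently in memory
    Q = deque()            # admitted batches, oldest on the left
    live = 0               # number of stream tuples currently in H
    chunk = 0              # next chunk of R in the cyclic scan
    while True:
        # A batch that has been probed by k consecutive chunks has seen
        # all of R, so it can be expired.
        if len(Q) == k:
            expired = Q.popleft()
            for s in expired:
                bucket = H[s[0]]
                bucket.remove(s)
                if not bucket:
                    del H[s[0]]
            live -= len(expired)
        # Admit the next w stream tuples (an empty batch once the stream ends).
        batch = list(islice(stream, w))
        if not batch and live == 0:
            return
        Q.append(batch)
        live += len(batch)
        for s in batch:
            H[s[0]].append(s)
        # Read the next b pages of R and probe the hash table with each tuple.
        for page in relation_pages[chunk * b:(chunk + 1) * b]:
            for r in page:
                for s in H.get(r[0], ()):
                    yield (s, r)
        chunk = (chunk + 1) % k


R_PAGES = [[(1, "a"), (2, "b")], [(3, "c")], [(1, "d")]]
STREAM = [(1, "x"), (3, "y"), (4, "z"), (1, "w"), (2, "v")]
pairs = sorted(mesh_join(STREAM, R_PAGES, b=2, w=2))
```

Because every batch stays in memory for exactly k consecutive iterations of a cyclic scan, each stream tuple meets every page of the relation once, whatever position the scan was in when the tuple arrived. That is what lets the result be exact without any ordering or index on either input.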
17
Roadmap
Motivation & Problem statement
The Mesh-Join Algorithm
Cost model & Tuning
Experiments
Conclusions
18
Critical issues
The important measures are:
  – the stream rate λ
  – the available memory M
  – the service rate μ of the join
The main challenge is to interrelate these metrics in a cost formula, so as to be able to tune the system
  – minimize M, given a desirable rate μ
  – maximize μ, given a constraint of available memory M
19
Cost model: Memory wrt b, s
M is the sum of four terms:
  – Size of b buffer
  – Size of w buffer
  – Size of queue Q
  – Size of hash H
N_R / b = # iterations a stream tuple must “see”
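The formula itself did not survive extraction, so the sketch below is one plausible way to write the four terms, not the slide's exact expression. It assumes that a stream tuple stays in memory for N_R / b iterations, that the queue holds one pointer per resident tuple, and that the hash table holds the tuples themselves, inflated by an overhead factor. The parameter names (`page_size`, `stream_tuple_size`, `pointer_size`, `fudge`) and the sample numbers are assumptions.

```python
def mesh_join_memory(w, b, n_r, page_size, stream_tuple_size,
                     pointer_size=8, fudge=1.0):
    """Estimate memory (bytes) as the sum of the four terms on the slide.

    w: stream tuples admitted per iteration
    b: relation pages read per iteration
    n_r: number of pages in the relation (N_R)
    fudge: assumed space overhead factor of the hash table
    """
    iterations = n_r / b                      # iterations a tuple must "see"
    buffer_r = b * page_size                  # size of the b-page buffer
    buffer_s = w * stream_tuple_size          # size of the w-tuple buffer
    queue = w * iterations * pointer_size     # size of queue Q (pointers)
    hash_table = fudge * w * iterations * stream_tuple_size  # size of hash H
    return buffer_r + buffer_s + queue + hash_table


# Hypothetical setting: 1000 pages of 4 KiB, 20-byte stream tuples.
m = mesh_join_memory(w=100, b=10, n_r=1000, page_size=4096, stream_tuple_size=20)
```

Under these assumptions, raising b enlarges the relation buffer but shortens the time each stream tuple stays resident, so the queue and hash terms shrink. That tension is what makes b worth tuning.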
20
Cost model: cost of an iteration wrt b, s
21
Cost model
c_loop = function(w, b)
M = function(w, b)
Interrelated M, μ, λ via w, s
22
Tuning: M, μ as a function of b
23
Minimize M, given a desirable rate μ
Minimize w => minimize M
Minimum w_min = λ · c_loop
In this case λ = μ
Thus, M is a function only of b, computed by simple calculus
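The tuning step can be sketched numerically under stated assumptions. The slides say only that c_loop is a function of w and b, so the linear form below (a fixed seek cost, a per-page cost, and a per-stream-tuple cost) is an assumption, as are the memory expression and every constant. With that form, the condition w = λ · c_loop(w, b) has a closed-form solution for w. Where the slide uses calculus to find the best b, the sketch substitutes a brute-force scan over integer values.

```python
def c_loop(w, b, c_seek, c_page, c_tuple):
    """Assumed cost (seconds) of one iteration, linear in w and b."""
    return c_seek + b * c_page + w * c_tuple


def w_min(lam, b, c_seek, c_page, c_tuple):
    """Smallest w that keeps up with rate lam: solves w = lam * c_loop(w, b)."""
    if lam * c_tuple >= 1:
        raise ValueError("stream rate too high for the assumed per-tuple cost")
    return lam * (c_seek + b * c_page) / (1 - lam * c_tuple)


def memory(w, b, n_r, page_size, tuple_size, pointer_size=8):
    """Assumed memory model: two buffers plus queue and hash entries."""
    return b * page_size + w * tuple_size + w * (n_r / b) * (pointer_size + tuple_size)


def tune(lam, n_r, page_size, tuple_size, c_seek, c_page, c_tuple):
    """Return (b, w, M) minimizing M over integer b, with w = w_min(b)."""
    def m(b):
        return memory(w_min(lam, b, c_seek, c_page, c_tuple), b, n_r, page_size, tuple_size)

    b_best = min(range(1, n_r + 1), key=m)
    return b_best, w_min(lam, b_best, c_seek, c_page, c_tuple), m(b_best)


# Hypothetical constants: 10 ms seek, 0.1 ms per page, 10 microseconds per tuple.
COSTS = dict(c_seek=0.01, c_page=0.0001, c_tuple=0.00001)
b_best, w_best, m_best = tune(lam=2000, n_r=1000, page_size=4096, tuple_size=20, **COSTS)
```

Substituting w_min into the memory model leaves b as the only free variable, which is the point the slide makes: once the service rate is pinned to the stream rate, memory depends on b alone.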
24
Roadmap
Motivation & Problem statement
The Mesh-Join Algorithm
Cost model & Tuning
Experiments
Conclusions
25
Experimental methodology
Synthetic data set: Zipf distribution, skew in [0,1], 10% of R as available memory, 3.5M rows, domain of 1.35M values
Real data set: cloud cover data, 10M rows, domain of 36,000 values
Competitor: index nested loops (INL) over a clustered B+-tree in Berkeley DB
Platform: Pentium IV 3GHz, 1GB main memory, 7200 RPM disk
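For readers who want to build a comparable synthetic workload, the sketch below draws join keys from a Zipf-like distribution over a finite domain, with a skew parameter in [0, 1] (0 gives uniform keys). It is not the authors' generator. The function name `zipf_keys`, the seed, and the small sizes are assumptions chosen so the example runs quickly. The slide's setting would use 3.5M rows over a domain of 1.35M values.

```python
import random
from itertools import accumulate


def zipf_keys(n, domain, skew, seed=0):
    """Draw n keys from 1..domain with P(key = r) proportional to 1 / r**skew."""
    rng = random.Random(seed)
    # Cumulative weights let random.choices sample by bisection.
    cum_weights = list(accumulate(1.0 / (r ** skew) for r in range(1, domain + 1)))
    return rng.choices(range(1, domain + 1), cum_weights=cum_weights, k=n)


keys = zipf_keys(n=10_000, domain=1_000, skew=1.0)
uniform_keys = zipf_keys(n=10_000, domain=1_000, skew=0.0)
```

The standard Zipf samplers in numerical libraries typically require an exponent above 1 and an unbounded domain, so weighting a finite domain by hand is the simpler route when the skew must range over [0, 1].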
26
Predicted and measured performance (synthetic data)
27
Performance for varying memory (synthetic data)
28
Performance for varying data skew (synthetic data)
29
Performance for varying memory (real-life data)
30
Roadmap
Motivation & Problem statement
The Mesh-Join Algorithm
Cost model & Tuning
Experiments
Conclusions
31
Conclusions
We have proposed the mesh join, a join operator particularly fit for active data warehousing that operates under minimal assumptions for the stream and the relation
We have presented a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate
32
Other capabilities & Possible extensions
Approximate processing
Ordered join output
Tuning for join conditions other than equality
Dynamic tuning for changes in the stream rate
Possible Extensions
  – multi-way joins
  – other active ETL operators
33
Thank you for your attention!
… many thanks to our hosts!
This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).
Figures of the Antikythera mechanism by Rupert Russell. URL: http://www.giant.net.au/users/rupert/kythera/kythera.htm
34
Questions?
35
Backup Slides
36
Related work
Applications of Symmetric Hash-Joins over windows of streaming inputs that fit in main memory (M/M)
  – Chandrasekaran, Franklin @ VLDBJ, 2003
  – Golab, Ozsu @ VLDB 2003
  – Hammad, Franklin, Aref, Elmagarmid @ VLDB 2003
  – Viglas, Naughton, Burger @ VLDB 2003
Joins of streamed bounded relations: XJoin variants that flush overflow tuples to disk
  – Dittrich, Seeger, Taylor, Widmayer @ VLDB 2002
  – Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis @ SIGMOD 2005
37
Involved Measures
38
Cost model
[Figure panels: I/O per second; I/O per stream tuple]
39
Loops of Mesh Join