Supporting Streaming Updates in an Active Data Warehouse Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell.


2 Supporting Streaming Updates in an Active Data Warehouse Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell

3 ICDE 2007, Constantinople 18/4/2007 2 Forecast
Problem in active data warehousing:
–the join between a fast stream of source updates and a disk-based relation under the constraint of limited memory
Solution:
–the mesh join, a novel join operator that operates under minimum assumptions for the stream and the relation
Features:
–a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate

4 Roadmap: Motivation & Problem statement; The Mesh-Join Algorithm; Cost model & Tuning; Experiments; Conclusions

6 [Figure: ETL workflows between the Sources, the DSA and the DW. Recoverable activity labels: FTP, DIFF, Add_SPK, NotNULL, CheckQTY, AddDate, A2EDate, $2€, Aggregate 1 (MIN(COST) by PKEY, DAY), Aggregate 2 (AVG(COST) by PKEY, MONTH); rejected rows are logged.]

7 Active Data Warehousing
Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software.
Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data.

8 Issues around Active Warehousing
Smooth upgrade of the software at the (legacy) source
–minimal modification of the software configuration at the source side
Minimal overhead of the source system
No data losses are allowed in the long run
Maximum freshness of data
–the response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable
Scalability at the warehouse side
–the architecture should scale up with respect to the number of sources and data consumers at the DW
–if possible, cover issues like checkpointing, index maintenance

9 Grand view of an Active DW

10 Problem statement
Joining a fast stream of updates with a persistent relation within limited memory bounds is of particular importance in the Active Warehousing setting.
Example practical cases:
–Surrogate Key assignment
–Duplicate detection
–…

11 Example: Surrogate Key
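
Surrogate key assignment can be read as a join between incoming source records and a lookup relation that maps each source's production key to a warehouse-wide surrogate key. The following is a minimal sketch under stated assumptions: the in-memory `LOOKUP` table, its contents, and the field names `source`, `pkey` and `skey` are hypothetical and not taken from the slides. In the setting of the talk the lookup relation is disk-based and larger than the available memory, which is the case the mesh join addresses.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical lookup relation: (source id, production key) -> surrogate key.
LOOKUP: Dict[Tuple[int, int], int] = {
    (1, 10): 100,
    (1, 20): 110,
    (2, 10): 120,
}


def assign_surrogate_keys(stream: Iterable[dict]) -> Iterator[dict]:
    """Replace each record's production key with its surrogate key."""
    for rec in stream:
        skey = LOOKUP.get((rec["source"], rec["pkey"]))
        if skey is None:
            # No match in the lookup relation; a real workflow would
            # typically log the rejected record instead of dropping it.
            continue
        out = {k: v for k, v in rec.items() if k != "pkey"}
        out["skey"] = skey
        yield out
```

For example, a record `{"source": 2, "pkey": 10, "cost": 7.5}` would leave the operator with `skey` 120 in place of `pkey`.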

12 Roadmap: Motivation & Problem statement; The Mesh-Join Algorithm; Cost model & Tuning; Experiments; Conclusions

13 Operation of Mesh-Join

14 (Not really any) Assumptions
No assumption of any order in either the stream or the relation
No indexes are necessarily present
Limited memory is available
The join condition is arbitrary (equality, similarity, range, etc.)
The join relationship is general (i.e., many-to-many, one-to-many, or many-to-one)
The result is exact
… But: the relation remains fixed throughout the join

15 Architecture of Mesh-Join
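
The figures for the operation and architecture of Mesh-Join did not survive in this transcript. The sketch below is one possible reading of the operator, pieced together from the components named on the cost-model slides (a buffer of b relation pages, a buffer of w stream tuples, a queue Q and a hash H). It is not the authors' implementation. Assumptions: the relation is a list of in-memory pages rather than a disk file, the stream is a finite iterable, the join is an equi-join (the slides allow arbitrary conditions), and the function and parameter names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import islice
from typing import Any, Callable, Hashable, Iterable, Iterator, Sequence, Tuple


def mesh_join(
    stream: Iterable[Any],
    pages: Sequence[Sequence[Any]],
    b: int,
    w: int,
    s_key: Callable[[Any], Hashable],
    r_key: Callable[[Any], Hashable],
) -> Iterator[Tuple[Any, Any]]:
    """Join a stream with a paged relation, scanning the relation cyclically."""
    # Split the relation into fixed chunks of b pages; one chunk per iteration.
    chunks = [pages[i:i + b] for i in range(0, len(pages), b)]
    k = len(chunks)  # iterations a stream tuple must "see" (ceil(N_R / b))
    if k == 0:
        return
    it = iter(stream)
    H = defaultdict(list)  # join key -> stream tuples currently in memory
    Q = deque()            # batches of stream tuples, oldest first
    exhausted = False
    step = 0
    while True:
        # 1. Admit up to w new stream tuples.
        batch = list(islice(it, w))
        if len(batch) < w:
            exhausted = True
        for s in batch:
            H[s_key(s)].append(s)
        Q.append(batch)
        # 2. Read the next chunk of the relation and probe the hash.
        for page in chunks[step % k]:
            for r in page:
                for s in H.get(r_key(r), ()):
                    yield (s, r)
        step += 1
        # 3. The oldest batch has now met every chunk once: expire it.
        if len(Q) == k:
            for s in Q.popleft():
                bucket = H[s_key(s)]
                bucket.remove(s)
                if not bucket:
                    del H[s_key(s)]
        if exhausted and not H:
            break
```

Because exactly one batch enters the queue per iteration and the chunks are visited in a fixed cycle, every stream tuple meets every relation page exactly once before it is expired, so the output matches that of a nested-loop join regardless of ordering or indexes.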

17 Roadmap: Motivation & Problem statement; The Mesh-Join Algorithm; Cost model & Tuning; Experiments; Conclusions

18 Critical issues
The important measures are:
–the stream rate λ
–the available memory M
–the service rate μ of the join
The main challenge is to interrelate these metrics in a cost formula, so as to be able to tune the system:
–minimize M, given a desirable rate μ
–maximize μ, given a constraint of available memory M

19 Cost model: Memory wrt b, s
Memory components:
–Size of b buffers
–Size of w buffers
–Size of queue Q
–Size of hash H
N_R / b = # iterations a stream tuple must “see”
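
The four memory components can be added up as in the rough sketch below. It relies on one observation from the slide: a stream tuple stays in memory for N_R / b iterations, and w tuples arrive per iteration, so about w · N_R / b tuples are resident at once. The per-unit sizes (`page_bytes`, `tuple_bytes`, `ptr_bytes`, `hash_entry_bytes`) are hypothetical parameters, and the exact formula on the original slide is not recoverable from this transcript.

```python
import math
from typing import Dict


def mesh_join_memory(
    b: int,
    w: int,
    n_r_pages: int,
    page_bytes: int,
    tuple_bytes: int,
    ptr_bytes: int,
    hash_entry_bytes: int,
) -> Dict[str, int]:
    """Estimate the memory of each component, in bytes (assumed model)."""
    iterations = math.ceil(n_r_pages / b)  # iterations a tuple must "see"
    in_flight = w * iterations             # stream tuples resident at once
    return {
        "relation_buffer": b * page_bytes,      # b pages of the relation
        "stream_buffer": w * tuple_bytes,       # w incoming stream tuples
        "queue": in_flight * ptr_bytes,         # Q: one pointer per tuple
        "hash": in_flight * hash_entry_bytes,   # H: one entry per tuple
    }
```

Summing the dictionary values gives the total M for a chosen pair (b, w); increasing b shrinks the queue and hash terms but enlarges the relation buffer.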

20 Cost model: cost of an iteration wrt b, s

21 Cost model
c_loop = function(w, b)
M = function(w, b)
Interrelated M, μ, λ via w, s

22 Tuning: M, μ as a function of b

23 Minimize M, given a desirable rate μ
Minimize w => minimize M
Minimum: w_min = λ · c_loop
In this case λ = μ
Thus, M is a function only of b, computed by simple calculus
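
The tuning step can be illustrated numerically. The sketch below assumes a hypothetical linear iteration cost, c_loop(w, b) = seek + b · per_page + w · per_tuple (in seconds), which is not given in the slides. Under that assumption, w = λ · c_loop(w, b) has a closed-form solution for w, after which memory depends only on b. The slide refers to calculus; this sketch substitutes a plain exhaustive search over b. All function and parameter names are illustrative.

```python
import math


def c_loop(w: int, b: int, *, seek_s: float, per_page_s: float,
           per_tuple_s: float) -> float:
    """Assumed cost of one iteration, in seconds (hypothetical model)."""
    return seek_s + b * per_page_s + w * per_tuple_s


def min_w(lam: float, b: int, *, seek_s: float, per_page_s: float,
          per_tuple_s: float) -> int:
    """Smallest integer w with w >= lam * c_loop(w, b), i.e. mu >= lam."""
    denom = 1.0 - lam * per_tuple_s
    if denom <= 0:
        raise ValueError("stream rate too high for the per-tuple cost")
    return math.ceil(lam * (seek_s + b * per_page_s) / denom)


def memory_for_b(b: int, *, lam: float, n_r_pages: int, page_bytes: int,
                 tuple_bytes: int, entry_bytes: int, seek_s: float,
                 per_page_s: float, per_tuple_s: float) -> int:
    """Memory in bytes once w is fixed to its minimum for this b."""
    w = min_w(lam, b, seek_s=seek_s, per_page_s=per_page_s,
              per_tuple_s=per_tuple_s)
    in_flight = w * math.ceil(n_r_pages / b)  # tuples held in Q and H
    return b * page_bytes + w * tuple_bytes + in_flight * entry_bytes


def tune_b(**params) -> int:
    """Return the b in [1, N_R] that minimizes memory (exhaustive search)."""
    n = params["n_r_pages"]
    return min(range(1, n + 1), key=lambda b: memory_for_b(b, **params))
```

With λ = 1000 tuples/s, seek = 10 ms, 1 ms per page and 0.1 ms per stream tuple, `min_w` at b = 10 solves w = 1000 · (0.02 + 0.0001 · w), giving w ≈ 22.2, which rounds up to 23 tuples per iteration.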

24 Roadmap: Motivation & Problem statement; The Mesh-Join Algorithm; Cost model & Tuning; Experiments; Conclusions

25 Experimental methodology
Synthetic data set: Zipf distribution, skew in [0,1], 10% of R as available memory, 3.5M rows, domain of 1.35M values
Real data set: cloud cover data, 10M rows, domain of 36,000 values
INL (index nested loops) as the competing method, based on a clustered B+ tree in Berkeley DB
Platform: Pentium IV 3GHz, 1GB main memory, 7200 RPM disk

26 Predicted and measured performance (synthetic data)

27 Performance for varying memory (synthetic data)

28 Performance for varying data skew (synthetic data)

29 Performance for varying memory (real-life data)

30 Roadmap: Motivation & Problem statement; The Mesh-Join Algorithm; Cost model & Tuning; Experiments; Conclusions

31 Conclusions
We have proposed the mesh join, a join operator particularly fit for active data warehousing that operates under minimum assumptions for the stream and the relation.
We have presented a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate.

32 Other capabilities & Possible extensions
Other capabilities:
–Approximate processing
–Ordered join output
–Tuning for join conditions other than equality
–Dynamic tuning for changes in the stream rate
Possible extensions:
–multi-way joins
–other active ETL operators

33 Thank you for your attention! … many thanks to our hosts!
This research was co-funded by the European Union in the framework of the program “Pythagoras IΙ” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).
Figures of the Antikythera mechanism by Rupert Russell. URL: http://www.giant.net.au/users/rupert/kythera/kythera.htm

34 Questions?

35 Backup Slides

36 Related work
Applications of Symmetric Hash-Joins over windows of streaming inputs that fit in M/M
–Chandrasekaran, Franklin @ VLDBJ, 2003
–Golab, Ozsu @ VLDB 2003
–Hammad, Franklin, Aref, Elmagarmid @ VLDB 2003
–Viglas, Naughton, Burger @ VLDB 2003
Joins of streamed bounded relations: Xjoin variants that flush overflow tuples to disk
–Dittrich, Seeger, Taylor, Widmayer @ VLDB 2002
–Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis @ SIGMOD 2005

37 Involved Measures

38 Cost model: I/O per second; I/O per stream tuple

39 Loops of Mesh Join

