
1 A Fully Distributed, Fault-Tolerant Data Warehousing System
Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
Computing Systems Laboratory, National Technical University of Athens

2 Motivation
- Large volumes of data
  - Everyday life (Web 2.0)
  - Science (LHC, NASA)
  - Business domain (automation, digitization, globalization)
  - New regulations: log/digitize/store everything
  - Sensors
- Immense production rates
- Distributed by nature
D. Tsoumakos, HDMS 2010, 4/29/2015

3 Motivation (contd.)
- Demand for always-on analytics
  - Store huge datasets
    - Both structured and semi-structured bulk data
  - Detection of real-time changes in trends
    - Fast retrieval
      - Point, range, aggregate queries
      - Intrusion or DoS detection, effects of a product's promotion
  - Online, near real-time updates
    - From various locations, at high rates

4 (Up till) now
- Traditional Data Warehouses
  - Vast amounts of historical data: data cubes
  - Centralized, off-line approaches
    - Querying vs. updating
  - Distributed warehousing systems
    - Functionality remains centralized
- Cloud Infrastructures
  - Resource as a service
  - Elasticity, commodity hardware
  - Pay-as-you-go pricing model

5 Our Goal
- Distributed data-warehousing-like system
  - Store, query, update
    - Multi-d, hierarchical
  - Scalable, always-on
  - Shared-nothing architecture
    - Commodity nodes
  - No proprietary tool needed
    - Java libraries, socket APIs

6 Brown Dwarf in a nutshell
- Complete system for data cubes
  - Distributed storage
  - Online updates
  - Efficient query resolution
    - Point, aggregate
    - Various levels of granularity
- Elastic resources according to
  - Workload skew
  - Node churn

7 Dwarf
- Dwarf computes, stores, indexes and updates materialized cubes
- Eliminates prefix and suffix redundancies
  - Centralized structure with d levels
  - Root contains all distinct values of the first dimension
  - Each cell points to a node of the next level
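
The level-by-level layout described on this slide can be illustrated with a small sketch. It is not the Dwarf algorithm itself: it is a naive nested-dictionary cube that shares prefixes but does not coalesce identical suffixes, and the `ALL` marker, the function names and the sample fact table are all assumptions made for illustration.

```python
from itertools import product

ALL = "*"  # hypothetical marker for the aggregate ("ALL") cell of a node


def build_cube(facts, d):
    """Build a d-level nested dict. Each node holds one cell per distinct
    value of its dimension plus an ALL cell; last-level cells hold sums."""
    root = {}
    for *dims, measure in facts:
        # Every fact contributes to each combination of (value | ALL).
        for mask in product((False, True), repeat=d):
            node = root
            for level, use_all in enumerate(mask):
                key = ALL if use_all else dims[level]
                if level == d - 1:
                    node[key] = node.get(key, 0) + measure
                else:
                    # A cell points to a node of the next level.
                    node = node.setdefault(key, {})
    return root


def query(root, keys):
    """Follow one cell per level; return 0 if the path does not exist."""
    node = root
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return 0
        node = node[key]
    return node


# Made-up fact table: (store, customer, product, sales)
facts = [
    ("S1", "C2", "P2", 70),
    ("S1", "C3", "P1", 40),
    ("S2", "C1", "P1", 90),
    ("S2", "C1", "P2", 50),
]
cube = build_cube(facts, d=3)
```

A point query such as `query(cube, ("S1", "C2", "P2"))` and an aggregate query such as `query(cube, ("S2", "C1", ALL))` both walk exactly d cells. A real Dwarf would additionally share identical sub-structures, which is where its suffix-redundancy savings come from.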

8 Why distribute it?
- Store larger amounts of data
  - Dwarf may reduce but may also blow up data
    - High-dimensional, sparse: >1,000 times
- Update and query the system online
- Accelerate creation, query and update speed
  - Parallelization
- What about…
  - Failures, load balancing, communication costs?
  - Performance

9 Brown Dwarf (BD) Overview
- Dwarf nodes mapped to overlay nodes
- UID for each node
- Hint tables of the form (currAttr, child)
- Resolve/update along network path
- Mirrors on a per-node basis

10 BD Operations: Insert + Query
- One pass over the fact table
- Gradual construction of hint tables
  - Creation of cell → insertion of currAttr
  - Creation of dwarf node → registration of child
- Follow path (d hops) along the structure
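
The insertion and query steps above can be sketched as a single-process simulation. This is only a toy model under stated assumptions: the `Overlay` class, its method names and the integer UIDs are hypothetical, the "network" is a plain dictionary, and aggregate (ALL) cells are left out to keep the path-following logic visible.

```python
import itertools


class Overlay:
    """Toy stand-in for the network: maps a UID to that node's hint table."""

    def __init__(self):
        self._next_uid = itertools.count()
        self.hint_tables = {}  # uid -> {currAttr: child uid, or measure at last level}
        self.root = self.new_node()

    def new_node(self):
        uid = next(self._next_uid)
        self.hint_tables[uid] = {}
        return uid

    def insert(self, dims, measure):
        """Insert one fact-table tuple, building hint tables gradually."""
        uid = self.root
        for attr in dims[:-1]:
            table = self.hint_tables[uid]
            if attr not in table:              # new cell: record currAttr
                table[attr] = self.new_node()  # new dwarf node: register child
            uid = table[attr]
        leaf = self.hint_tables[uid]
        leaf[dims[-1]] = leaf.get(dims[-1], 0) + measure

    def query(self, dims):
        """Return (value, hops); hops counts hint-table lookups."""
        uid, hops = self.root, 0
        for attr in dims[:-1]:
            hops += 1
            uid = self.hint_tables[uid].get(attr)
            if uid is None:
                return None, hops
        hops += 1
        return self.hint_tables[uid].get(dims[-1]), hops


overlay = Overlay()
for *dims, sales in [("S1", "C2", "P2", 70), ("S1", "C3", "P1", 40),
                     ("S2", "C1", "P1", 90), ("S2", "C1", "P2", 50)]:
    overlay.insert(tuple(dims), sales)  # one pass over the fact table
```

In this model a successful point query on a 3-dimensional cube touches three hint tables, matching the d hops mentioned on the slide; in a deployed system each lookup would be a message to another overlay node.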

11 BD Operations: Update
- Longest common prefix with existing structure
- Underlying nodes recursively updated
  - Nodes expanded with new cells
  - New nodes created
  - ALL cells affected
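
A rough sketch of this update procedure follows, assuming a nested-dictionary cube in which every node carries an extra aggregate cell. The `ALL` marker and both function names are hypothetical, and the sketch runs in one process rather than across overlay nodes.

```python
ALL = "*"  # hypothetical marker for the aggregate ("ALL") cell


def common_prefix_len(root, dims):
    """Number of leading dimensions of `dims` already present in the structure."""
    node, depth = root, 0
    for key in dims:
        if not isinstance(node, dict) or key not in node:
            break
        node = node[key]
        depth += 1
    return depth


def update(node, dims, measure):
    """Recursively apply a tuple below `node`. At each level both the value
    cell and the ALL cell are touched; missing cells and nodes are created."""
    key, rest = dims[0], dims[1:]
    for cell in (key, ALL):
        if rest:
            update(node.setdefault(cell, {}), rest, measure)
        else:
            node[cell] = node.get(cell, 0) + measure


root = {}
update(root, ("S1", "C2"), 70)
update(root, ("S1", "C3"), 40)
shared = common_prefix_len(root, ("S1", "C9"))  # only "S1" already exists
update(root, ("S2", "C2"), 5)
```

The prefix length tells where the new tuple diverges from the existing structure: everything below that point is new cells or new nodes, while the ALL cells along the way are updated regardless, which is one reason updates are listed as costly later in the talk.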

12 Elasticity of Brown Dwarf
- Static and adaptive replication vs:
  - Load (min/max load)
  - Churn (require ≥ k replicas)
- Local-only interactions
  - Ping/exchange hint tables for consistency
- Query forwarding to balance load
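
One way to picture the replication rules on this slide is a purely local decision function. The slide only names the inputs (min/max load and a minimum of k replicas); the exact policy, the function names, the action labels and the least-loaded forwarding rule below are assumptions, not the system's documented behavior.

```python
def replication_action(load, live_replicas, k, min_load, max_load):
    """Decide locally whether a node should add or drop a mirror.

    load          -- current request rate seen by this node (assumed metric)
    live_replicas -- replicas that answered the last ping round
    """
    if live_replicas < k:
        return "replicate"  # churn: restore the minimum of k replicas
    if load > max_load:
        return "replicate"  # overload: add a mirror to share queries
    if load < min_load and live_replicas > k:
        return "drop"       # underused: release a surplus mirror
    return "keep"


def forward_target(loads):
    """Pick the replica with the lowest load; `loads` maps replica UID -> load."""
    return min(loads, key=loads.get)
```

For example, `replication_action(50, 3, k=3, min_load=1, max_load=10)` asks for another mirror, while `forward_target({"a": 7, "b": 2, "c": 5})` would forward the next query to replica `"b"`. Both decisions use only information a node can gather from its own replicas, in line with the local-only interactions noted above.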

13 Experimental Evaluation
- 16 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory)
- Synthetic and real datasets
  - 5-d to 25-d, various levels of skew (Zipf θ=0.95)
  - APB-1 benchmark generator
  - Forest and Weather datasets
- Simulation results with 1000s of nodes

14 Cube Construction
- Acceleration of cube creation up to 3.5 times compared to Dwarf
  - Better use of resources through parallelization
  - More noticeable effect for high-dimensional, skewed datasets
- Storage overhead
  - Mainly attributed to mapping between dwarf node and network IDs
  - Shared among network nodes
[Table: Size (MB) and Time (sec) for Dwarf vs. BD per dimensionality d, Uniform and Zipf datasets; values not preserved in the transcript]

15 Updates
- 1% updates
- Up to 2.3 times faster for skewed dataset
- Dimensionality increases the cost
[Table: Time (sec) for Dwarf vs. BD and messages per update, per dimensionality d, Uniform and Zipf datasets; values not preserved in the transcript]

16 Queries
- 1K query-sets, 50% aggregate
- Acceleration of up to 60 times
- Message cost bounded by d+1
[Table: Time (sec) for Dwarf vs. BD and messages per query, per dimensionality d, Uniform and Zipf datasets; values not preserved in the transcript]

17 Elasticity
- d 100k datasets, 5k query-sets
- λ = 10 qu/sec → 100 qu/sec
- BD adapts according to demand → elasticity
- k = 3, N_fail failing nodes every T_fail sec
- 5k queries, 10-d uniform dataset
- No loss for N_fail < k+1
- Query time increases due to redirections
Dimitrios Tsoumakos, UoI Talk 17 23/02/

18 What have we achieved so far?
- BD optimizations: work in progress
  - Replication units (chunks, …)
  - Hierarchies: faster updates (MDAC 2010), …
- Brown Dwarf focuses on
  - (+) Efficient answering of aggregate queries
  - (+) Cloud-friendly
  - (-) Preprocessing
  - (-) Costly updates
- HiPPIS project
  - (+) Explicit support for hierarchical data
  - (+) No preprocessing
  - (+) Ease of insertion and updates
  - (-) Processing for aggregate queries

19 Questions

