A Fully Distributed, Fault-Tolerant Data Warehousing System
Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
Computing Systems Laboratory, National Technical University of Athens
Motivation
- Large volumes of data
  - Everyday life (Web 2.0)
  - Science (LHC, NASA)
  - Business domain (automation, digitization, globalization)
  - New regulations: log/digitize/store everything
  - Sensors
- Immense production rates
- Distributed by nature

D. Tsoumakos, HDMS 2010
Motivation (cont'd)
- Demand for always-on analytics
- Store huge datasets
  - Both structured and semi-structured bulk data
- Detection of real-time changes in trends
- Fast retrieval: point, range, aggregate queries
  - e.g., intrusion or DoS detection, effects of a product's promotion
- Online, near real-time updates
  - From various locations, at high rates
(Up till) Now
- Traditional data warehouses
  - Vast amounts of historical data: data cubes
  - Centralized, off-line approaches
  - Querying vs. updating
- Distributed warehousing systems
  - Functionality remains centralized
- Cloud infrastructures
  - Resources as a service
  - Elasticity, commodity hardware
  - Pay-as-you-go pricing model
Our Goal
- A distributed, data-warehouse-like system
  - Store, query, update
  - Multi-dimensional, hierarchical data
- Scalable, always-on
- Shared-nothing architecture
  - Commodity nodes
- No proprietary tools needed
  - Java libraries, socket APIs
Brown Dwarf in a Nutshell
- A complete system for data cubes
  - Distributed storage
  - Online updates
  - Efficient query resolution
    - Point and aggregate queries
    - Various levels of granularity
- Elastic resources according to
  - Workload skew
  - Node churn
Dwarf
- Dwarf computes, stores, indexes and updates materialized cubes
- Eliminates prefix and suffix redundancies
- Centralized structure with d levels
  - Root contains all distinct values of the first dimension
  - Each cell points to a node of the next level
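A minimal sketch (not the authors' code) of the d-level structure described above: each node maps an attribute value to the node of the next level, so tuples sharing a prefix share the corresponding path (prefix redundancy eliminated). ALL cells for aggregates and suffix coalescing are omitted for brevity; the function names are hypothetical.

```python
def build(fact_table):
    """Build a Dwarf-like d-level trie from (dim_1, ..., dim_d, measure) rows."""
    root = {}
    for row in fact_table:
        *dims, measure = row
        node = root
        for value in dims[:-1]:
            node = node.setdefault(value, {})  # reuse the shared prefix path
        node[dims[-1]] = measure               # leaf cell stores the measure
    return root

def point_query(root, dims):
    """Descend one level per dimension: d lookups for a d-dimensional cube."""
    node = root
    for value in dims:
        node = node[value]
    return node
```

For example, `build([("a1", "b1", 10), ("a1", "b2", 20)])` produces a root with a single `a1` cell shared by both tuples.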
Why Distribute It?
- Store larger amounts of data
  - Dwarf may reduce but may also blow up data
    - High-dimensional, sparse cubes: >1,000 times
- Update and query the system online
- Accelerate creation, query and update speed
  - Parallelization
- What about…
  - Failures, load balancing, communication costs?
  - Performance?
Brown Dwarf (BD) Overview
- Dwarf nodes mapped to overlay nodes
  - A UID for each node
  - Hint tables of the form (currAttr, child)
- Queries/updates resolved along a network path
- Mirrors on a per-node basis
BD Operations – Insert + Query
- One pass over the fact table
- Gradual construction of hint tables
  - Creation of a cell → insertion of currAttr
  - Creation of a dwarf node → registration of child
- Queries follow a path (d hops) along the structure
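The d-hop resolution can be sketched as follows (an assumed simplification, not the actual implementation): each overlay node holds a hint table mapping currAttr to a child UID, and the query is forwarded hop by hop, one hop per dimension. The class and function names are hypothetical.

```python
class OverlayNode:
    """One overlay peer holding one dwarf node's hint table."""
    def __init__(self, uid):
        self.uid = uid
        self.hint_table = {}  # currAttr -> child UID (or the measure at the leaf level)

def resolve(network, root_uid, dims):
    """Follow hint tables hop by hop: one hop per dimension, d hops total."""
    uid = root_uid
    hops = 0
    for i, attr in enumerate(dims):
        entry = network[uid].hint_table[attr]
        hops += 1
        if i == len(dims) - 1:
            return entry, hops  # leaf hint entry holds the measure itself
        uid = entry             # otherwise forward the query to the child node
```

In a real deployment each hop is a message to another physical node, which is why the message cost of a point query is bounded by the dimensionality.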
BD Operations – Update
- Find the longest common prefix with the existing structure
- Underlying nodes recursively updated
  - Nodes expanded with new cells
  - New nodes created
  - ALL cells affected
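The prefix-reuse step can be sketched as below, assuming the nested-dict representation of the cube; the re-aggregation of affected ALL cells is omitted and the function name is hypothetical.

```python
def update(root, dims, measure):
    """Descend the longest common prefix; create cells/nodes only past it."""
    node = root
    for value in dims[:-1]:
        if value in node:
            node = node[value]   # longest common prefix: reuse the existing path
        else:
            node[value] = {}     # divergence point: new cell and new node
            node = node[value]
    node[dims[-1]] = measure     # expand the last node with the new leaf cell
```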
Elasticity of Brown Dwarf
- Static and adaptive replication vs.
  - Load (min/max load thresholds)
  - Churn (require ≥ k replicas)
- Local-only interactions
  - Ping/exchange hint tables for consistency
- Query forwarding to balance load
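The per-node replication policy above can be sketched as a local decision rule (thresholds and names are assumptions, not the paper's exact protocol): expand when measured load exceeds the maximum threshold, contract when it falls below the minimum, but never drop below k mirrors so that churn can be tolerated.

```python
def replication_decision(load, mirrors, min_load, max_load, k):
    """Local, per-node decision: no global coordination is needed."""
    if load > max_load:
        return "expand"    # overloaded: create an additional mirror
    if load < min_load and mirrors > k:
        return "contract"  # underloaded: release a mirror, but keep >= k
    return "steady"
```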
Experimental Evaluation
- 16 LAN commodity nodes (dual-core, 2.0 GHz, 4 GB main memory)
- Synthetic and real datasets
  - 5-d to 25-d, various levels of skew (Zipf θ = 0.95)
  - APB-1 benchmark generator
  - Forest and Weather datasets
- Simulation results with 1000s of nodes
Cube Construction
- Acceleration of cube creation by up to 3.5× compared to Dwarf
  - Better use of resources through parallelization
  - More noticeable effect for high-dimensional, skewed datasets
- Storage overhead
  - Mainly attributed to the mapping between dwarf node and network IDs
  - Shared among the network nodes

[Table: size (MB) and construction time (sec), Dwarf vs. BD, for uniform and Zipf datasets over varying d; values not recoverable]
Updates
- 1% updates
- Up to 2.3× faster for skewed datasets
- Dimensionality increases the cost

[Table: update time (sec) and messages per update, Dwarf vs. BD, for uniform and Zipf datasets over varying d; values not recoverable]
Queries
- 1k query sets, 50% aggregate queries
- Acceleration of up to 60×
- Message cost bounded by d + 1

[Table: query time (sec) and messages per query, Dwarf vs. BD, for uniform and Zipf datasets over varying d; values not recoverable]
Elasticity
- 100k datasets, 5k query sets
  - λ = 10 qu/sec → 100 qu/sec
  - BD adapts according to demand → elasticity
- k = 3, N_fail failing nodes every T_fail sec
  - 5k queries, 10-d uniform dataset
  - No loss for N_fail < k + 1
  - Query time increases due to redirections
What Have We Achieved So Far?
- BD optimizations (work in progress)
  - Replication units (chunks, …)
  - Hierarchies for faster updates (MDAC 2010), …
- Brown Dwarf focuses on:
  + Efficient answering of aggregate queries
  + Cloud-friendly
  - Preprocessing
  - Costly updates
- HiPPIS project:
  + Explicit support for hierarchical data
  + No preprocessing
  + Ease of insertion and updates
  - Processing for aggregate queries
Questions