1 Efficient Updates for a Shared Nothing Analytics Platform
29/1/2014
Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
{katerina, dtsouma, nkoziris}
Computing Systems Laboratory, National Technical University of Athens

2 Motivation
- Large volumes of data
  - Everyday life, science and business domains
- Time-series data
  - Temporally ordered, organized in hierarchies (Day < Month < Year)
  - E.g., date of a credit card purchase, time of a phone call
  - Important for monitoring a process of interest
- On-line processing
  - Fast retrieval: point, range, aggregate queries
  - Detection of real-time changes in trends
    - Intrusion or DoS detection, effects of product promotions
  - Online, cost-efficient updates

3 Up till now
- Data warehouses
  - Centralized, off-line approaches
- Distributed warehousing systems
  - Functionality remains centralized
- Distributed warehouse-like initiative: Brown Dwarf
  - Distribution of the centralized Dwarf
  - Deployed on shared-nothing, commodity hardware
  - Scalability, fault tolerance, performance
  - No special consideration for time-series data
  - Update procedure is costly, unfit for frequent updates

4 Our Goals
- Cloud-based data-warehousing-like system
  - Targeted to time-series data arriving at a high rate
  - Store, update, query data at various granularity levels
  - Multidimensional, hierarchical
- Shared-nothing architecture
  - Commodity nodes
- Without use of any proprietary tool
  - Java libraries, socket APIs

5 Our Contribution
- Complete system for multidimensional time-series data
  - Store with one pass
  - Update online
  - Query efficiently
    - Point, aggregate
    - Various levels of granularity
- Adaptive materialization
  - According to data recency
  - Accelerate cube creation/update
  - Minimize storage consumption

6 Dwarf
- Dwarf computes, stores, indexes and updates materialized cubes
- Eliminates prefix and suffix redundancies
- Any query (point or aggregate) is answered through traversal of the structure
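To make "answered through traversal of the structure" concrete, here is a minimal sketch of a fully materialized toy cube stored as nested dictionaries. It is not the Dwarf implementation: shared dictionary levels only mimic prefix sharing, suffix coalescing is left out entirely, and the names `ALL`, `insert` and `query` as well as the sample tuples are assumptions made for illustration.

```python
ALL = "*"  # hypothetical stand-in for the ALL cell (aggregate over one dimension)


def insert(node, dims, measure):
    """Add one fact tuple; at every level both the value and the ALL cell are updated."""
    head, rest = dims[0], dims[1:]
    for key in (head, ALL):
        if rest:
            insert(node.setdefault(key, {}), rest, measure)
        else:
            node[key] = node.get(key, 0) + measure


def query(cube, keys):
    """Answer a point or aggregate query by following one key per dimension."""
    node = cube
    for key in keys:
        if key not in node:
            return 0
        node = node[key]
    return node


# Illustrative fact table: (store, customer, product, price)
cube = {}
for store, customer, product, price in [
    ("S1", "C1", "P1", 10),
    ("S1", "C2", "P1", 20),
    ("S2", "C1", "P2", 5),
]:
    insert(cube, (store, customer, product), price)

point = query(cube, ("S1", "C1", "P1"))   # a point query
per_store = query(cube, ("S1", ALL, ALL))  # an aggregate query
```

Both kinds of query follow exactly one key per dimension, which is why a point query and an aggregate query cost the same traversal length.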

7 Brown Dwarf
- Dwarf nodes mapped to overlay nodes
  - UID for each node
  - Hint tables of the form (currAttr, child)
- Insertion
  - One pass over the fact table
  - Gradual structure of hint tables
- Queries
  - Overlay path of d hops
- Incremental updates
- Elasticity through adaptive mirroring

8 Advantages and Drawbacks
- Store even larger amounts of data
  - Dwarf reduces but may also blow up data
  - High-dimensional, sparse: >1,000 times
- Handle many more requests
- Query the system online
- Accelerate creation (up to 5 times) and querying (up to 60 times)
  - Parallelization
- Update remains costly

9 Time Series Dwarf (TSD)
- A concept hierarchy characterizes time and any other dimension
- Updates are applied in temporal order
- Temporal granularity of queries is relative to the time of querying
  - More detailed queries for recent events
  - More coarse-grained queries for past events

10 TSD Operations - Insertion
- Time comes first in the dimension order
- No ALL cell in the Time dimension
- An aggregate is created after completion of a level
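The rule that an aggregate is created only once a level is complete can be illustrated with the Time dimension alone. This is a minimal sketch, assuming a Day < Month < Year hierarchy, a single numeric measure and strictly ordered arrivals; the class `TimeLevelStore` and its fields are hypothetical names, not part of TSD.

```python
class TimeLevelStore:
    """Toy Time dimension: day cells plus month/year aggregates, no ALL cell."""

    def __init__(self):
        self.days = {}     # (year, month, day) -> total
        self.months = {}   # (year, month) -> total, present only for completed months
        self.years = {}    # year -> total, present only for completed years
        self._last = None  # key of the most recent record

    def insert(self, year, month, day, value):
        key = (year, month, day)
        if self._last is not None:
            if key < self._last:
                raise ValueError("records must arrive in temporal order")
            last_year, last_month, _ = self._last
            if (last_year, last_month) != (year, month):
                self._close_month(last_year, last_month)  # month level completed
            if last_year != year:
                self._close_year(last_year)               # year level completed
        self.days[key] = self.days.get(key, 0) + value
        self._last = key

    def _close_month(self, year, month):
        self.months[(year, month)] = sum(
            v for (y, m, _), v in self.days.items() if (y, m) == (year, month)
        )

    def _close_year(self, year):
        self.years[year] = sum(v for (y, _), v in self.months.items() if y == year)


store = TimeLevelStore()
store.insert(2013, 12, 30, 5)
store.insert(2013, 12, 31, 7)
store.insert(2014, 1, 1, 3)  # first record of a new month and year closes both levels
```

After the third insert, December 2013 and the year 2013 have aggregates, while January 2014 exists only as day cells because its level is still open.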

11 TSD Operations - Querying
- Follow a path along the structure
- Roll-up query for an aggregate already created
  - Within d hops (e.g., )
- Roll-up query for recent records
  - Initial query substituted by multiple lower-level queries (e.g., )
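The two roll-up cases can be sketched as a single lookup function. This is an illustration only: the `days` and `months` dictionaries, the function `query_month` and the sample values are assumptions, and the second return value simply counts lookups as a rough proxy for message cost.

```python
def query_month(days, months, year, month):
    """Return (total, lookups) for a month-level roll-up query."""
    if (year, month) in months:
        # Aggregate already created: answered directly.
        return months[(year, month)], 1
    # Recent month without an aggregate: substitute one query per stored day.
    parts = [v for (y, m, _), v in days.items() if (y, m) == (year, month)]
    return sum(parts), len(parts)


# Illustrative state: December 2013 is complete, January 2014 is still open.
days = {(2014, 1, 1): 3, (2014, 1, 2): 4, (2014, 1, 3): 1}
months = {(2013, 12): 12}

old_total, old_cost = query_month(days, months, 2013, 12)
recent_total, recent_cost = query_month(days, months, 2014, 1)
```

The recent-month query returns the same kind of answer but pays for one lookup per lower-level cell, which matches the higher message cost expected for queries on recent records.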

12 TSD Operations - Updating
- Insertion of a new tuple
  - Longest common prefix with the existing structure
  - Underlying nodes recursively updated
- Lack of ALL cell for Time + temporal ordering = fewer existing cells affected
- Example: 3 TSD nodes vs. 12 Dwarf nodes affected
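A rough way to see why dropping the ALL cell in the first dimension reduces update work is to count the cells an update touches in a toy cube. The sketch below assumes a fully materialized nested-dictionary cube without suffix coalescing, so its counts are those of the toy structure and do not reproduce the 3 vs. 12 example above; `update` and its `all_in_first_dim` flag are hypothetical.

```python
ALL = "*"  # hypothetical stand-in for the ALL cell


def update(node, dims, measure, all_in_first_dim=True, depth=0):
    """Apply one tuple and return how many leaf cells were touched."""
    head, rest = dims[0], dims[1:]
    # Every dimension keeps an ALL cell, except optionally the first one (Time).
    keys = (head, ALL) if (depth > 0 or all_in_first_dim) else (head,)
    touched = 0
    for key in keys:
        if rest:
            child = node.setdefault(key, {})
            touched += update(child, rest, measure, all_in_first_dim, depth + 1)
        else:
            node[key] = node.get(key, 0) + measure
            touched += 1
    return touched


tuple_dims = ("2014-01-29", "C1", "P1")  # time first, then two other dimensions

dwarf_like, tsd_like = {}, {}
touched_with_all = update(dwarf_like, tuple_dims, 7)
touched_without_all = update(tsd_like, tuple_dims, 7, all_in_first_dim=False)
```

In this toy setting the update without an ALL cell in the first dimension touches half as many cells, because the whole sub-structure under the first-level ALL cell no longer exists.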

13 Adaptive Materialization
- A daemon process asynchronously
  - creates roll-up views
  - deletes the corresponding drill-down ones
- The period of this process depends on the application
- Tradeoff: cube size vs. response accuracy
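One pass of such a daemon might look like the sketch below. It is a simplification under stated assumptions: only the Day and Month levels are modeled, `roll_up_old_days` and `keep_from` are hypothetical names, and the periodic scheduling (for example a timer thread firing at the application-specific period) is left out.

```python
def roll_up_old_days(days, months, keep_from):
    """Create month roll-up views and delete the day cells they summarize.

    keep_from is a (year, month) pair: months before it lose day-level detail.
    Assumes months has no entry yet for any month that still has day cells.
    """
    for (y, m, d) in sorted(days):  # sorted() copies the keys, so popping is safe
        if (y, m) < keep_from:
            months[(y, m)] = months.get((y, m), 0) + days.pop((y, m, d))
    return days, months


days = {(2013, 12, 30): 5, (2013, 12, 31): 7, (2014, 1, 1): 3}
months = {}
roll_up_old_days(days, months, keep_from=(2014, 1))
```

After the pass, December 2013 is stored as a single month cell and its two day cells are gone. The structure is smaller, but a day-level query for December can no longer be answered exactly, which is the size versus accuracy tradeoff.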

14 Experimental Evaluation
- 25 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory)
- Synthetic and real datasets
  - APB-1 Benchmark generator
    - 4-d, 3 levels for Time, various densities
  - DARPA Intrusion Detection audit data
    - 1M tuples, 7-d, 3 levels for Time
- TSD: static mode
- TSD ad: adaptive mode

15 Cube Construction
- Noticeable reduction of cube size for TSD, impressive for TSD ad (up to 85% for the APB dataset)
  - Lack of the ALL cell in the first dimension
- Acceleration of cube creation up to 89% compared to Dwarf
  - Better use of resources through parallelization (BD)
  - Further reduction due to lack of ALL and selective materialization

                   Size (MB)                    Time (sec)
Dataset  #Tuples   Dwarf   BD   TSD  TSD ad     Dwarf   BD   TSD  TSD ad
APB-A    1.2M         56   59    53       9       485  101   100      57
APB-B    2.5M        102  115    93      24       957  220   198     123
APB-C    3.7M        163  182   146      32      1530  321   289     167
DARPA    1.1M        178  191   156     127       614  222   208     189

16 Updates
- 10k updates
- TSD up to 3 times faster than Dwarf and 30% faster than BD
  - Ordered updates do not affect already created views
  - No recursive updates for the ALL cell of the first dimension
  - Smaller communication overhead (3-fold reduction)
- TSD ad does not include roll-up view creation (asynchronous)
  - Further acceleration of ~20%

          Time (sec)                     Msgs/update
Dataset   Dwarf   BD   TSD  TSD ad       BD   TSD  TSD ad
APB-A      1123  603   404     315       22     9       8
APB-B      1158  611   418     323       23    10       9
APB-C      1203  624   424     328       25    11       9
DARPA      1535  649   458     380       29    13       9

17 Queries
- DARPA 10k datasets: 3 kinds of querysets, 50% aggregates
  - Q1: Ideal
  - Q2: Recent records are queried upon in more detail (Zipfian)
  - Q3: Random
- As the queryset approximates a uniform distribution
  - Message cost increases
  - Accuracy decreases

Columns: Queryset | Time (sec): BD, TSD, TSD ad | Msgs/query: BD, TSD, TSD ad | % Inaccurate queries | % Resp. Deviation
Q1  5 6 6 | 7 7 7 | 0 | 0
Q2  5987991519
Q3  52421732 3332

18 Questions
