A Fully Distributed, Fault-Tolerant Data Warehousing System
Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
Computing Systems Laboratory, National Technical University of Athens
HDMS 2010

Motivation
- Large volumes of data
  - Everyday life (Web 2.0)
  - Science (LHC, NASA)
  - Business domain (automation, digitization, globalization)
  - New regulations – log/digitize/store everything
  - Sensors
- Immense production rates
- Distributed by nature

Motivation (contd.)
- Demand for always-on analytics
  - Storage of huge datasets, both structured and semi-structured bulk data
  - Detection of real-time changes in trends
    - Fast retrieval – point, range, aggregate queries
    - Intrusion or DoS detection, effects of a product's promotion
  - Online, near real-time updates from various locations, at high rates

(Up till) now
- Traditional data warehouses
  - Vast amounts of historical data – data cubes
  - Centralized, off-line approaches; querying vs. updating
  - Distributed warehousing systems: functionality remains centralized
- Cloud infrastructures
  - Resources as a service
  - Elasticity, commodity hardware
  - Pay-as-you-go pricing model

Our Goal
- A distributed, data-warehousing-like system
  - Store, query, update multi-dimensional, hierarchical data
  - Scalable, always-on
  - Shared-nothing architecture on commodity nodes
  - No proprietary tools needed – Java libraries, socket APIs

Brown Dwarf in a nutshell
- A complete system for data cubes
  - Distributed storage
  - Online updates
  - Efficient query resolution: point and aggregate queries at various levels of granularity
- Elastic resources according to
  - Workload skew
  - Node churn

Dwarf
- Dwarf computes, stores, indexes and updates materialized cubes
- Eliminates prefix and suffix redundancies
  - Centralized structure with d levels
  - Root contains all distinct values of the first dimension
  - Each cell points to a node of the next level
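A minimal sketch of the node structure just described, assuming an in-memory Java representation (class and field names are illustrative, not the authors' implementation): each node indexes one dimension, each cell maps an attribute value to a node of the next level, and leaf-level cells hold the aggregate values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory Dwarf node: one node per level (dimension),
// one cell per distinct attribute value at that level.
class DwarfNode {
    final int level;                                                // 0 = root dimension
    final Map<String, DwarfNode> cells = new LinkedHashMap<>();     // attr value -> next-level node
    final Map<String, Double> aggregates = new LinkedHashMap<>();   // used at the leaf level only

    DwarfNode(int level) {
        this.level = level;
    }

    // Leaf nodes of a d-dimensional cube sit at level d-1 and hold aggregates.
    boolean isLeaf(int numDimensions) {
        return level == numDimensions - 1;
    }
}
```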

Why distribute it?
- Store larger amounts of data
  - Dwarf may reduce but may also blow up data: high-dimensional, sparse cubes can grow by >1,000 times
- Update and query the system online
- Accelerate creation, query and update speed through parallelization
- What about…
  - Failures, load balancing, communication costs?
  - Performance?

Brown Dwarf (BD) Overview
- Dwarf nodes mapped to overlay nodes – a UID for each node
- Hint tables of the form (currAttr, child)
- Resolve/update along the network path
- Mirrors on a per-node basis
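As a hedged illustration of the (currAttr, child) hint tables mentioned above, the following Java sketch keeps, per dwarf node, a map from the current attribute value to the UID of the child dwarf node in the overlay; the class and method names are assumptions, not the system's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-node hint table: each entry pairs an attribute value of the
// current level (currAttr) with the UID of the child dwarf node to forward to.
class HintTable {
    private final Map<String, String> entries = new HashMap<>();

    void register(String currAttr, String childUid) {
        entries.put(currAttr, childUid);
    }

    String childFor(String currAttr) {
        return entries.get(currAttr);   // null if this attribute value is unknown
    }
}
```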

BD Operations – Insert + Query
- Insert: one pass over the fact table
  - Hint tables built gradually
  - Creation of a cell → insertion of currAttr
  - Creation of a dwarf node → registration of child
- Query: follow the path (d hops) along the structure
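The d-hop query path can be sketched as follows, building on the hypothetical HintTable sketch above and an assumed overlay lookup (interface and method names are illustrative only): each hop consumes one attribute of the query and forwards to the child UID registered for it, so a point query visits exactly one node per dimension.

```java
// Hedged sketch of point-query resolution along the d-hop path described above.
class QueryResolver {

    /** Hypothetical view of the overlay: fetch the hint table stored at a node by its UID. */
    interface Overlay {
        HintTable hintTableOf(String nodeUid);
    }

    /** Returns the UID of the leaf reached after d hops, or null if a value is absent. */
    static String resolve(Overlay overlay, String rootUid, String[] queryAttrs) {
        String current = rootUid;
        for (String attr : queryAttrs) {                   // one hop per dimension
            HintTable hints = overlay.hintTableOf(current);
            current = (hints == null) ? null : hints.childFor(attr);
            if (current == null) {
                return null;                               // attribute value not in the cube
            }
        }
        return current;                                    // node holding the requested aggregate
    }
}
```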

BD Operations – Update
- Find the longest common prefix with the existing structure
- Underlying nodes recursively updated
  - Nodes expanded with new cells
  - New nodes created
  - ALL cells affected
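A hedged sketch of the first step of an update, reusing the DwarfNode sketch above (the helper is hypothetical): walk down from the root along cells that already match the incoming tuple, and return how many levels are shared; expansion with new cells and nodes, and recomputation of the affected aggregates, then starts from that depth.

```java
// Hypothetical helper over the DwarfNode sketch above: descend while the new
// tuple's attribute values already exist, returning the length of the longest
// common prefix with the existing structure.
class Updater {
    static int longestCommonPrefix(DwarfNode root, String[] tuple) {
        DwarfNode current = root;
        int depth = 0;
        while (depth < tuple.length && current.cells.containsKey(tuple[depth])) {
            current = current.cells.get(tuple[depth]);
            depth++;
        }
        return depth;   // number of levels already present in the structure
    }
}
```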

Elasticity of Brown Dwarf
- Static and adaptive replication against:
  - Load (min/max load thresholds)
  - Churn (require ≥ k replicas)
- Local-only interactions
  - Ping/exchange hint tables for consistency
- Query forwarding to balance load
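The per-node replication decision can be illustrated with the following hedged sketch (thresholds, names and the enum are assumptions, not the system's actual logic): a node replenishes mirrors when churn drops their number below k, and expands to an extra mirror when its observed query load exceeds the upper threshold.

```java
// Hypothetical local decision rule combining the two triggers mentioned above:
// churn (keep at least k live replicas) and load (expand when overloaded,
// possibly retire a mirror when well below the lower threshold).
class ElasticityController {
    private final int k;             // minimum number of replicas required
    private final double minLoad;    // queries/sec below which a mirror may be retired
    private final double maxLoad;    // queries/sec above which a new mirror is created

    ElasticityController(int k, double minLoad, double maxLoad) {
        this.k = k;
        this.minLoad = minLoad;
        this.maxLoad = maxLoad;
    }

    enum Action { REPLENISH, EXPAND, SHRINK, NONE }

    Action decide(double observedLoad, int aliveMirrors) {
        if (aliveMirrors < k) {
            return Action.REPLENISH;    // churn dropped us below k replicas
        }
        if (observedLoad > maxLoad) {
            return Action.EXPAND;       // hot node: recruit an extra mirror
        }
        if (observedLoad < minLoad && aliveMirrors > k) {
            return Action.SHRINK;       // cold node: a mirror may be released
        }
        return Action.NONE;
    }
}
```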

Experimental Evaluation
- 16 commodity LAN nodes (dual core, 2.0 GHz, 4 GB main memory)
- Synthetic and real datasets
  - 5-d to 25-d, various levels of skew (Zipf θ=0.95)
  - APB-1 benchmark generator
  - Forest and Weather datasets
- Simulation results with 1000s of nodes

Cube Construction
- Acceleration of cube creation up to 3.5 times compared to Dwarf
  - Better use of resources through parallelization
  - More noticeable effect for high-dimensional, skewed datasets
- Storage overhead
  - Mainly attributed to the mapping between dwarf node and network IDs
  - Shared among network nodes
[Table: cube size (MB) and construction time (sec) for Dwarf vs. BD, uniform and Zipf datasets, varying d]

Updates
- 1% updates
- Up to 2.3 times faster for the skewed dataset
- Dimensionality increases the cost
[Table: update time (sec) and messages per update for Dwarf vs. BD, uniform and Zipf datasets, varying d]

Queries
- 1K query sets, 50% aggregate
- Impressive acceleration of up to 60 times
- Message cost bounded by d+1
[Table: query time (sec) and messages per query for Dwarf vs. BD, uniform and Zipf datasets, varying d]

Elasticity
- Load adaptation: 100k datasets, 5k query-sets, λ = 10 qu/sec → 100 qu/sec
  - BD adapts according to demand → elasticity
- Churn: k=3, N_fail failing nodes every T_fail sec, 5k queries, 10-d uniform dataset
  - No loss for N_fail < k+1
  - Query time increases due to redirections

What have we achieved so far?
- BD optimizations – work in progress
  - Replication units (chunks, …)
  - Hierarchies – faster updates (MDAC 2010), …
- Brown Dwarf focuses on
  (+) Efficient answering of aggregate queries
  (+) Cloud-friendly
  (–) Preprocessing
  (–) Costly updates
- HiPPIS project
  (+) Explicit support for hierarchical data
  (+) No preprocessing
  (+) Ease of insertion and updates
  (–) Processing required for aggregate queries

Questions?