BAHIR DAR UNIVERSITY Institute of technology Faculty of Computing Department of information technology Msc program Distributed Database Article Review.

BAHIR DAR UNIVERSITY, Institute of Technology, Faculty of Computing, Department of Information Technology, MSc Program, Distributed Database. Article Review on "Communication Steps for Parallel Query Processing". Reviewed by: Abinet Kindie. Submitted to: Bhabani Shankar D.M

INTRODUCTION Parallel query processing aims to reduce response time by using the processing power of multiple CPUs to process a query. Parallelism distributes the computation of data-intensive tasks across a number of machines and thereby significantly reduces the completion time of many data processing tasks. The paper considers two settings: a single communication step and multiple communication steps. For a single communication step, the authors provide lower bounds in a model where arbitrary bits may be exchanged. For multiple rounds of communication, they give lower bounds in a model where routing decisions for a tuple are tuple-based.

INTRODUCTION (cont.) Query processing for big data is executed on a shared-nothing parallel architecture. In a shared-nothing architecture, the processing units share no memory or other resources: each processor has exclusive access to its own memory and disk, and the processors communicate with one another by sending messages over a communication network. The main goal of the paper is to achieve a short response time for a given task.
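To make the shared-nothing idea concrete, the following is a minimal sketch (not taken from the reviewed paper) of a two-way join R(x, y), S(y, z) computed with one communication step followed by purely local work. The function names `hash_partition`, `local_join` and `parallel_join` are hypothetical, and the servers are simulated as entries of a dictionary rather than real machines.

```python
from collections import defaultdict

def hash_partition(tuples, key_index, num_servers):
    """Route each tuple to exactly one server by hashing its join key."""
    servers = defaultdict(list)
    for t in tuples:
        servers[hash(t[key_index]) % num_servers].append(t)
    return servers

def local_join(r_part, s_part):
    """Join R(x, y) with S(y, z) on y using only one server's local data."""
    index = defaultdict(list)
    for (x, y) in r_part:
        index[y].append(x)
    return [(x, y, z) for (y, z) in s_part for x in index[y]]

def parallel_join(R, S, num_servers):
    # Communication step: both relations are shuffled on the join attribute y.
    r_parts = hash_partition(R, 1, num_servers)
    s_parts = hash_partition(S, 0, num_servers)
    # Computation step: each simulated server works on its own data only.
    out = []
    for s in range(num_servers):
        out.extend(local_join(r_parts[s], s_parts[s]))
    return out

R = [(1, 10), (2, 20), (3, 10)]
S = [(10, "a"), (20, "b"), (30, "c")]
print(sorted(parallel_join(R, S, num_servers=4)))
```

The largest partition produced by `hash_partition` plays the role of the maximum load per server, which is the quantity the paper's bounds are about.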

[Figure: Shared-nothing architecture. Processors Proc. 1, Proc. 2, ..., Proc. n, with memory, connected by an interconnection network.]

METHODOLOGY To develop this article, the researchers use several techniques:
• HyperCube algorithm: used to compute any conjunctive query with optimal load in one round, and to optimize joins in the MapReduce model.
• Hypergraph of the query q: defined by introducing one node for each variable in the query and one hyperedge for each set of variables that occur in a single atom. It is used to state the bounds (inequalities) in terms of the query.
• Tuple-based MPC algorithm: computes the query on matching databases over several rounds of computation under a load constraint, and allows random bits to be used for load balancing in the communication.
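The HyperCube idea and the query hypergraph can be illustrated on the triangle query Q(x, y, z) = R(x, y), S(y, z), T(z, x). The sketch below is an illustration under stated assumptions, not the paper's implementation: servers are simulated as cells of a grid with one dimension per variable, the per-variable grid sizes (`shares`) are chosen by hand, and all function names are hypothetical. A tuple is hashed on the variables it contains and replicated along the dimension of the variable it lacks.

```python
from itertools import product

# Hypergraph of the triangle query: variables are nodes, atoms are hyperedges.
ATOMS = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}
VARIABLES = ("x", "y", "z")

def destinations(atom_vars, tup, shares):
    """Grid coordinates of every server that must receive this tuple."""
    bound = dict(zip(atom_vars, tup))
    axes = []
    for v in VARIABLES:
        if v in bound:
            axes.append([hash(bound[v]) % shares[v]])  # coordinate fixed by hashing
        else:
            axes.append(range(shares[v]))  # replicate along the missing dimension
    return product(*axes)

def hypercube_shuffle(relations, shares):
    """Single communication round: returns {server: {relation name: tuples}}."""
    grid = product(*(range(shares[v]) for v in VARIABLES))
    servers = {c: {name: [] for name in ATOMS} for c in grid}
    for name, tuples in relations.items():
        for t in tuples:
            for c in destinations(ATOMS[name], t, shares):
                servers[c][name].append(t)
    return servers

def local_triangles(frag):
    """Evaluate the triangle query on the fragments held by one server."""
    s_by_y = {}
    for (y, z) in frag["S"]:
        s_by_y.setdefault(y, []).append(z)
    t_set = set(frag["T"])
    return [(x, y, z) for (x, y) in frag["R"]
            for z in s_by_y.get(y, []) if (z, x) in t_set]

def triangles(relations, shares):
    out = set()
    for frag in hypercube_shuffle(relations, shares).values():
        out.update(local_triangles(frag))
    return out

rels = {"R": [(1, 2), (4, 5)], "S": [(2, 3), (5, 6)], "T": [(3, 1), (6, 7)]}
print(triangles(rels, {"x": 2, "y": 2, "z": 2}))
```

With shares of 2 per variable there are 8 simulated servers, and each tuple is sent to 2 of them. Every triangle is found at the server whose coordinates are the hashes of its x, y and z values, so no second round is needed.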

BRIEF SUMMARY Statement of the Problem: In most real-world applications, skewed data causes an uneven distribution of the load and hence reduces the effectiveness of parallelism. Proposed Solution: To deal with the problems caused by skew, design data-sensitive techniques that identify the outliers in the data and alleviate the effect of skew by further splitting the computation across more servers.
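One common way to realise "identify the outliers and split their work across more servers" is sketched below for a join R(x, y), S(y, z). This is a generic skew-handling illustration, not the specific technique of the reviewed paper. The cutoff `len(R) / num_servers` for calling a join value heavy, and the function name `skew_aware_partition`, are assumptions made for the example.

```python
from collections import Counter, defaultdict

def skew_aware_partition(R, S, num_servers):
    """Partition R(x, y) and S(y, z) on y, splitting heavy y-values of R."""
    threshold = len(R) / num_servers  # assumed outlier cutoff
    freq = Counter(y for (_, y) in R)
    heavy = {y for y, count in freq.items() if count > threshold}

    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    rr = 0  # round-robin pointer used only for heavy tuples
    for (x, y) in R:
        if y in heavy:
            r_parts[rr % num_servers].append((x, y))  # spread the outlier
            rr += 1
        else:
            r_parts[hash(y) % num_servers].append((x, y))
    for (y, z) in S:
        if y in heavy:
            for s in range(num_servers):  # replicate so every R piece finds it
                s_parts[s].append((y, z))
        else:
            s_parts[hash(y) % num_servers].append((y, z))
    return r_parts, s_parts, heavy

R = [(i, 0) for i in range(8)] + [(100, 1), (101, 2)]
S = [(0, "a"), (1, "b")]
r_parts, s_parts, heavy = skew_aware_partition(R, S, num_servers=4)
print(heavy, max(len(part) for part in r_parts.values()))
```

In this example plain hashing would place all eight tuples with y = 0 on one server, while the split keeps the largest R partition at three tuples. The price is that matching S tuples are replicated, so the sketch only pays off when the skew is on the R side.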

Motivation The motivations behind this paper are:
• Understanding the complexity of parallel query processing in big data management.
• A focus on shared-nothing architectures.
• The dominating complexity parameters of the computation: the communication cost, i.e. the number of communication rounds and the amount of data being exchanged.

RESULTS
♠ ONE ROUND:
• Lower bounds on the space exponent for any randomized algorithm that computes a conjunctive query.
• Matching upper bounds for a class of inputs (matching databases), showing that the lower bounds are tight.
♠ MULTIPLE ROUNDS:
• Nearly tight space exponent/round trade-offs for tree-like conjunctive queries under a weaker communication model.

CONTRIBUTIONS This paper mainly contributes:
• The Massively Parallel Computation (MPC) model as a theoretical tool to analyse the performance of parallel algorithms for query processing on relational data.
• An analysis, in terms of communication rounds, of the problem of computing a relational query on a large input database using a large number of servers.
• Algorithms for one and for multiple communication rounds.

FOUNDATION The article builds mainly on the following papers:
• Query Processing for Massively Parallel Systems (2015)
• Upper and lower bounds on the cost of a map-reduce computation (2012)
• Optimizing joins in a map-reduce environment (2010)

CRITIQUE The researchers carried out this work well: the article is built on a theoretical framework and on experimentation/simulation. However, a minor problem I observe is that, within the MapReduce framework, the lower bounds apply only to a single round of communication and say nothing about the limitations of multi-round MapReduce algorithms. The paper focuses mainly on lower-bound models for rounds.

CONCLUSION In general, parallelism distributes the computation of data-intensive tasks across a number of machines and hence significantly reduces the completion time of many data processing tasks. The MPC model captures two parameters of parallel query processing algorithms: the number of synchronization steps and the communication complexity. Identifying the optimal trade-off between the number of rounds and the maximum load for different computational tasks is the main challenge here.

THANK YOU !!!
