Efficient Processing Regular Queries In Shared-Nothing Parallel Database Systems Using Tree- And Structural Indexes (ADBIS 2007, Bulgaria) Vu Le Anh, Attilla.

Presentation transcript:

Efficient Processing Regular Queries In Shared-Nothing Parallel Database Systems Using Tree- And Structural Indexes (ADBIS 2007, Bulgaria)
Vu Le Anh, Attilla Kiss
Department of Information Systems, Eötvös Loránd University, Hungary
Supported by the Hungarian National Office for Research and Technology under grant no. RET14/2005.

Outline
- Problem
- State of the art: Streaming approach vs. Partial Parallel approach
- Our efficient algorithm: Tree index and Structural index
- Experiments
- Summary

Problem
- Shared-Nothing Parallel Database System
- Fragmented XML Tree
- Regular queries

Shared-Nothing Parallel Database System
- From two to thousands of sites connected by an interconnection network
- Each site has its own processor, its own non-shared memory and its own non-shared disk
- The cost per processor can be extremely low because each node is an inexpensive processor
[Figure: processors P1, P2, …, Pn, each with its own memory and disk, attached to an interconnection network]

Shared-Nothing Parallel Database System
- Parallel processing
- Provides incremental and unlimited growth
- Failures are local: if one node fails, the others stay up
- The cost of the system can be very low

Fragmented XML Tree
[Figure: an XML tree with nodes 0 to 15 and labels A to F, divided into fragments F0 to F5]
- Nodes: 0, 1, …, 15
- Label values: A, B, …, F
- Fragments: F0, F1, …, F5, with F0 = {0,1,2,6}, F1 = {3,4,5}, …
- Sites: S0, S1, with S0 = {F0, F4, F5} and S1 = {F1, F2, F3}
- Site = machine
- 1 Master site + Slave sites
- Master server:
  - Communicates with the clients
  - Controls the Slaves that process the queries

Regular queries
- A variety of query languages have been proposed for XML data: UnQL, Lorel, XQL, XML-QL, etc.
- All of them are built around regular path expressions.
- Three basic operations: Union, Concatenation and Iteration.
- Every regular path expression can be represented by a deterministic finite automaton.

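Example regular query
- Query: //B/D
- Query graph: states q0, q1, q2 with transitions labelled *, B and D (q0 loops on *, q0 goes to q1 on B, q1 goes to q2 on D)
[Figure: the XML tree with each node annotated by the automaton states that reach it]
- Answers = {3,11,13}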
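
The evaluation of //B/D over an unfragmented tree can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the tree in `LABELS` and `CHILDREN` is a small hypothetical one (the slide's own tree is not recoverable from the transcript), and the function names `step` and `evaluate` are invented for this sketch. Each node receives the set of automaton states reached after reading its label, and a node is an answer when that set contains q2.

```python
from typing import Dict, List, Set

ACCEPT = 2  # q2 is the accepting state of the //B/D automaton


def step(states: Set[int], label: str) -> Set[int]:
    """Advance the //B/D automaton by one node label."""
    nxt: Set[int] = set()
    for q in states:
        if q == 0:
            nxt.add(0)              # '*' self-loop on q0
            if label == "B":
                nxt.add(1)          # q0 -B-> q1
        elif q == 1 and label == "D":
            nxt.add(2)              # q1 -D-> q2
    return nxt


def evaluate(labels: Dict[int, str], children: Dict[int, List[int]], root: int) -> List[int]:
    """Return all nodes whose root-to-node label path is accepted."""
    answers = []
    stack = [(root, {0})]           # start in q0 before reading the root label
    while stack:
        node, incoming = stack.pop()
        states = step(incoming, labels[node])
        if ACCEPT in states:
            answers.append(node)
        for child in children.get(node, []):
            stack.append((child, states))
    return sorted(answers)


# Hypothetical tree: 0:A has children 1:A and 2:B; 2:B has children 3:D and 4:C;
# 4:C has child 5:B; 5:B has child 6:D.
LABELS = {0: "A", 1: "A", 2: "B", 3: "D", 4: "C", 5: "B", 6: "D"}
CHILDREN = {0: [1, 2], 2: [3, 4], 4: [5], 5: [6]}

print(evaluate(LABELS, CHILDREN, 0))  # [3, 6]
```

Nodes 3 and 6 are the two D nodes whose parent is a B node, which is exactly what //B/D asks for.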

Problem
- Problem: nodes are in different fragments
- 2 approaches: Streaming approach vs. Partial Parallel approach
[Figure: a path A, A, B that crosses from fragment F0 into fragment F1, annotated with states q0, q0 and q0 q1]

Basic operation: Fragment Process
Fragment-Process(F, q):
- Traverse the fragment F and the query graph together, beginning at the root of F and at state q
- When a link edge is traversed during processing, the different approaches behave differently

Streaming Approach
If a link edge F → F' is traversed:
1. The current fragment process operation over F is stopped.
2. The corresponding fragment process operation over F' is started.
3. When the operation over F' finishes, it sends its result to the operation over F, which is then resumed.

Streaming Approach
[Figure: the fragmented XML tree with fragments F0 to F5]
Sequence of events:
1. (F0, q0) is started
2. Link edge (2,3) is traversed
3. (F0, q0) is stopped
4. (F1, q0) is started
5. (F0, q0) is resumed
6. Link edge (2,3) is traversed again
7. (F0, q0) is stopped
8. (F1, q2) is started, {3} is sent to F0
9. (F0, q0) is resumed
…
No parallelism; the waiting time is high.
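
The stop/start/resume behaviour can be simulated with a recursive function. The sketch below makes several assumptions that are not in the slides: the tree and its fragmentation (`LABELS`, `CHILDREN`, `FRAG_OF`, `ROOT_OF`) are hypothetical, all names are invented, and an operation (F, q) is taken to mean "start in state q and then read the label of F's root". The recursion models the waiting: the caller does nothing until the callee returns.

```python
from typing import List, Set, Tuple

# Hypothetical fragmented tree: F0 = {0,1,2,4}, F1 = {3}, F2 = {5,6}.
LABELS = {0: "A", 1: "A", 2: "B", 3: "D", 4: "C", 5: "B", 6: "D"}
CHILDREN = {0: [1, 2], 2: [3, 4], 4: [5], 5: [6]}
FRAG_OF = {0: "F0", 1: "F0", 2: "F0", 4: "F0", 3: "F1", 5: "F2", 6: "F2"}
ROOT_OF = {"F0": 0, "F1": 3, "F2": 5}


def step(states: Set[int], label: str) -> Set[int]:
    """//B/D automaton: q0 loops on any label, q0 -B-> q1, q1 -D-> q2."""
    nxt: Set[int] = set()
    for q in states:
        if q == 0:
            nxt.add(0)
            if label == "B":
                nxt.add(1)
        elif q == 1 and label == "D":
            nxt.add(2)
    return nxt


def stream_process(frag: str, state: int, log: List[Tuple[str, str, int]]) -> List[int]:
    """Run (frag, state); on a link edge, block until the remote operation ends."""
    log.append(("start", frag, state))
    results: List[int] = []
    stack = [(ROOT_OF[frag], {state})]
    while stack:
        node, incoming = stack.pop()
        states = step(incoming, LABELS[node])
        if not states:
            continue                      # automaton died on this branch
        if 2 in states:
            results.append(node)
        for child in CHILDREN.get(node, []):
            target = FRAG_OF[child]
            if target == frag:
                stack.append((child, states))
                continue
            for q in sorted(states):      # link edge: one remote operation per state
                log.append(("stop", frag, state))
                results += stream_process(target, q, log)
                log.append(("resume", frag, state))
    return results


EVENTS: List[Tuple[str, str, int]] = []
ANSWERS = sorted(stream_process("F0", 0, EVENTS))
print(ANSWERS)
for event in EVENTS:
    print(event)
```

In this run only one operation is ever active at a time, which is the lack of parallelism the slide points out.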

Partial Parallel Approach
- While a fragment process operation is running, there is no communication with other sites
- If a link edge (F, q) → (F', q') is traversed, the following fact is recorded: if (F, q) is processed, then (F', q') will be processed
- These facts are sent to the Master, which finds all the reachable operations
- Only the results of the reachable operations are sent to the Master

Partial Parallel Approach
[Figure: the fragmented XML tree with fragments F0 to F5; S0 = {F0, F4, F5}, S1 = {F1, F2, F3}]
Sequence of events:
1. All fragment process operations of S0 and S1 are executed in parallel
2. S1 = {F1, F2, F3}. Operations: (F1,q0), (F1,q1), (F1,q2), (F2,q0), (F2,q1), (F2,q2), (F3,q0), (F3,q1), (F3,q2)
3. The list of facts: (F3,q0) → (F4,q0), (F3,q0) → (F5,q0), (F3,q0) → (F5,q1)
4. List of reachable operations: (F1,q0), (F1,q1), (F2,q0), (F3,q0)
5. The results of the reachable operations are sent to the Master
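
A minimal sketch of the partial parallel approach, under the same assumptions as before: the fragmented tree is hypothetical, all names are invented, and (F, q) means "start in state q and read the label of F's root". Every (fragment, state) pair is evaluated without any communication and records facts for its link edges; a breadth-first search at the Master then keeps only the operations reachable from (F0, q0). The loop below runs the operations one after another, whereas a real system would run each site's operations in parallel.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Op = Tuple[str, int]  # (fragment, start state)

# Hypothetical fragmented tree: F0 = {0,1,2,4}, F1 = {3}, F2 = {5,6}.
LABELS = {0: "A", 1: "A", 2: "B", 3: "D", 4: "C", 5: "B", 6: "D"}
CHILDREN = {0: [1, 2], 2: [3, 4], 4: [5], 5: [6]}
FRAG_OF = {0: "F0", 1: "F0", 2: "F0", 4: "F0", 3: "F1", 5: "F2", 6: "F2"}
ROOT_OF = {"F0": 0, "F1": 3, "F2": 5}


def step(states: Set[int], label: str) -> Set[int]:
    """//B/D automaton: q0 loops on any label, q0 -B-> q1, q1 -D-> q2."""
    nxt: Set[int] = set()
    for q in states:
        if q == 0:
            nxt.add(0)
            if label == "B":
                nxt.add(1)
        elif q == 1 and label == "D":
            nxt.add(2)
    return nxt


def fragment_process(frag: str, state: int) -> Tuple[List[int], Set[Op]]:
    """Evaluate (frag, state) locally; return local answers and link-edge facts."""
    results: List[int] = []
    facts: Set[Op] = set()
    stack = [(ROOT_OF[frag], {state})]
    while stack:
        node, incoming = stack.pop()
        states = step(incoming, LABELS[node])
        if not states:
            continue
        if 2 in states:
            results.append(node)
        for child in CHILDREN.get(node, []):
            target = FRAG_OF[child]
            if target == frag:
                stack.append((child, states))
            else:
                # Fact: if (frag, state) is processed, (target, q) must be too.
                facts.update((target, q) for q in states)
    return sorted(results), facts


def partial_parallel(start: Op = ("F0", 0), n_states: int = 3):
    # Phase 1: all operations are executed, with no communication between them.
    local: Dict[Op, Tuple[List[int], Set[Op]]] = {
        (f, q): fragment_process(f, q) for f in ROOT_OF for q in range(n_states)
    }
    # Phase 2: the Master follows the facts to find the reachable operations.
    reachable: Set[Op] = {start}
    queue = deque([start])
    while queue:
        for nxt in local[queue.popleft()][1]:
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    # Phase 3: only results of reachable operations count as answers.
    answers = sorted(n for op in reachable for n in local[op][0])
    return answers, reachable, len(local)


ANSWERS, REACHABLE, EXECUTED = partial_parallel()
print(ANSWERS, sorted(REACHABLE), EXECUTED)
```

On this toy input nine operations are executed but only four are reachable, which illustrates the unnecessary work that this approach accepts in exchange for parallelism.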

Our algorithm
Partial Parallel Approach:
- Advantages: parallelism; the number of communications is constant; each fragment is scanned at most once
- Disadvantages: many unnecessary operations
Our algorithm:
- Based on partial evaluation
- Restrains the unnecessary operations

Unnecessary operations
Unnecessary operations of type I:
- Definition: unreachable operations
- Solution: determined by the Tree Index. The Tree Index is stored at the Master and stores all the paths connecting the roots of the fragments
Unnecessary operations of type II:
- Definition: operations that return no result
- Solution: restrained by structural indexes (structural indexes = simulators of fragments)

Tree Index
[Figure: the fragmented XML tree next to its tree index. The index has one entry per fragment root (A: F0, A: F2, B: F3, B: F4, D: F1, D: F5) and edges labelled with the paths between roots (ε, AB, AC, A, ε), annotated with automaton states]
- (F2,q1), (F2,q2): unreachable
- Reachable operations: (F0,q0), (F1,q0), …
- The size of the tree index = the number of fragments
- The processing cost can be ignored
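
How the Master might derive the reachable operations from a tree index alone can be sketched as follows. The index contents are hypothetical, and the layout (parent fragment, root label, labels on the path between the two roots) is an assumption about what "all paths connecting the roots" looks like in practice. As in the slides, the query is //B/D; (F, q) is taken to mean "start in state q and read the label of F's root".

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical tree index: fragment -> (parent fragment, root label,
# labels strictly between the parent's root and this fragment's root).
TREE_INDEX: Dict[str, Tuple[Optional[str], str, List[str]]] = {
    "F0": (None, "A", []),
    "F1": ("F0", "D", ["B"]),
    "F2": ("F0", "B", ["B", "C"]),
}


def step(states: Set[int], label: str) -> Set[int]:
    """//B/D automaton: q0 loops on any label, q0 -B-> q1, q1 -D-> q2."""
    nxt: Set[int] = set()
    for q in states:
        if q == 0:
            nxt.add(0)
            if label == "B":
                nxt.add(1)
        elif q == 1 and label == "D":
            nxt.add(2)
    return nxt


def entry_states(frag: str) -> Set[int]:
    """States in which the automaton can arrive at the root of `frag`."""
    parent, _, path = TREE_INDEX[frag]
    if parent is None:
        return {0}                       # the query starts in q0 at the top fragment
    states = step(entry_states(parent), TREE_INDEX[parent][1])
    for label in path:                   # replay the labels between the two roots
        states = step(states, label)
    return states


def reachable_operations() -> Set[Tuple[str, int]]:
    return {(frag, q) for frag in TREE_INDEX for q in entry_states(frag)}


print(sorted(reachable_operations()))
```

No fragment is touched here: only the index, whose size is proportional to the number of fragments, is traversed. Every (F, q) pair outside the returned set is an unreachable operation and never needs to be sent to a site.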

Structural Indexes
- The fragment is simulated by an index graph
- Processing over the index graph is safe: it is used as a necessary condition (if an operation returns no result over the index graph, it also returns no result over the fragment)
- The size of the index should be constant so that the cost of pre-processing is minimized

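DL-Indexes
[Figure: a fragment with nodes A5, A10, D8, B7, D9, B6, C11, C12, C13, A14, F16, E15, E17, E18, A19, A20, A21, A22, and its DL-index with the label sets A; B,D; A,C; E,F; A]
- The query graph (transitions *, B, D over states q0, q1, q2) is simulated over the DL-index
- (F,q0), (F,q1) and (F,q2): unnecessary operations of type II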
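
The pruning test can be illustrated with a short sketch. It assumes that the DL-index is a chain of label sets, in the order listed in the figure (A; B,D; A,C; E,F; A), and that an operation (F, q) starts in state q and then reads the root label; both are readings of the slide rather than confirmed details, and the function names are invented. The check only over-approximates the fragment, so a False answer is safe while a True answer still requires processing the fragment.

```python
from typing import List, Set

# Label sets of the DL-index as listed on the slide, outermost first.
DL_INDEX: List[Set[str]] = [{"A"}, {"B", "D"}, {"A", "C"}, {"E", "F"}, {"A"}]


def step(states: Set[int], label: str) -> Set[int]:
    """//B/D automaton: q0 loops on any label, q0 -B-> q1, q1 -D-> q2."""
    nxt: Set[int] = set()
    for q in states:
        if q == 0:
            nxt.add(0)
            if label == "B":
                nxt.add(1)
        elif q == 1 and label == "D":
            nxt.add(2)
    return nxt


def may_return_result(levels: List[Set[str]], state: int) -> bool:
    """Necessary condition: can the accepting state q2 be reached on the index?"""
    states = {state}
    for labels in levels:
        # Any label of this index node may occur, so take the union of all moves.
        states = set().union(*(step(states, label) for label in labels))
        if 2 in states:
            return True
        if not states:
            return False
    return False


for q in (0, 1, 2):
    print(q, may_return_result(DL_INDEX, q))
```

Under these assumptions all three operations are rejected, in line with the slide: a B can only occur in the second label set, and the set after it contains no D.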

Our Algorithm
1. The Master determines the reachable operations using the tree index.
2. For each reachable operation, the corresponding structural index is used to check whether it is an unnecessary operation of type II.
3. The "good" operations are sent to the sites.
4. Each site processes its operations and sends the results back to the Master.

Experiments
- Comparing the performance of three algorithms: our algorithm (EPP), the Partial Processing algorithm (PP) and the Streaming Processing algorithm (TP)
- System: 19 Linux machines connected by a local network
- Data set: 500 Mb → 76 fragments, stored randomly on the servers
- Queries: 10 queries representing different conditions of the environment

Experiments
- Waiting time: EPP : PP : TP = 1 : 1.94 :
- The waiting time of TP is extremely high since there is no parallelism
- Processing and communication cost: EPP : PP : TP = 1 : 1.77 : 2.75
- In some cases the total cost of PP is higher than that of TP because of redundant operations of type II
- EPP is the best

Summary
- We introduced an efficient algorithm, based on partial evaluation, for processing regular queries in shared-nothing systems
- Two types of unnecessary operations:
  - Type I: unreachable operations. Restrained by processing over the tree index
  - Type II: operations returning no matching nodes. Restrained by processing over structural indexes
- Experiments: our algorithm outperforms the classical algorithms in waiting time and in processing and communication cost

Thank you. Questions?