DATA INTENSIVE QUERY PROCESSING FOR LARGE RDF GRAPHS USING CLOUD COMPUTING TOOLS Mohammad Farhan Husain Dr. Latifur Khan Dr. Bhavani Thuraisingham Department of Computer Science University of Texas at Dallas

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

Semantic Web Technologies  Data in machine-understandable format  Infer new knowledge  Standards  Data representation – RDF triples Example (Subject, Predicate, Object): (:person1, foaf:name, "John Smith")  Ontology – OWL, DAML  Query language – SPARQL

Cloud Computing Frameworks  Proprietary  Amazon S3  Amazon EC2  Force.com  Open source tools  Hadoop – Apache's open source implementation of Google's proprietary GFS file system (as HDFS) and of MapReduce – a functional programming paradigm using key-value pairs

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

Goal  To build efficient storage using Hadoop for large amounts of data (e.g. billions of triples)  To build an efficient query mechanism  Publish as an open source project   Integrate with Jena as a Jena Model

Motivation  Current Semantic Web frameworks do not scale to large numbers of triples, e.g.  Jena In-Memory, Jena RDB, Jena SDB  AllegroGraph  Virtuoso Universal Server  BigOWLIM  They lack a distributed framework and persistent storage  Hadoop runs on low-end hardware, providing a distributed framework with high fault tolerance and reliability

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

Current Approaches  State-of-the-art approach  Store RDF data in HDFS and query through MapReduce programming (our approach)  Traditional approach  Store data in HDFS and process queries outside of Hadoop, as done in the BIOMANTA project (details of its querying could not be found)

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

System Architecture  Preprocessing: LUBM Data Generator → RDF/XML → Preprocessor (N-Triples Converter, Predicate Based Splitter, Object Type Based Splitter) → preprocessed data stored in the Hadoop Distributed File System / Hadoop Cluster (MapReduce Framework)  Query answering: 1. Query → Query Rewriter → Query Plan Generator → 2. Jobs → Plan Executor → 3. Answer

Storage Schema  Data in N-Triples  Using namespaces  Example: utd:resource1  Predicate-based Splits (PS)  Split data according to predicates  Predicate Object-based Splits (POS)  Split further according to the rdf:type of objects

Example
 Input N-Triples:
D0U0:GraduateStudent20  rdf:type  lehigh:GraduateStudent
lehigh:University0  rdf:type  lehigh:University
D0U0:GraduateStudent20  lehigh:memberOf  lehigh:University0
 Predicate Split (PS):
File rdf_type: D0U0:GraduateStudent20  lehigh:GraduateStudent | lehigh:University0  lehigh:University
File lehigh_memberOf: D0U0:GraduateStudent20  lehigh:University0
 Predicate Object Split (POS):
File rdf_type_GraduateStudent: D0U0:GraduateStudent20
File rdf_type_University: lehigh:University0
File lehigh_memberOf_University: D0U0:GraduateStudent20  lehigh:University0
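To make the two-level split concrete, here is a minimal Python sketch (the real system performs this preprocessing with Hadoop jobs; the local_name helper, the tab-separated layout and the in-memory dict of files are illustrative assumptions, not the paper's implementation):

```python
from collections import defaultdict

def local_name(uri):
    # Illustrative helper: keep only the part after the namespace prefix,
    # e.g. "lehigh:University" -> "University".
    return uri.rsplit(":", 1)[-1]

def split_triples(triples):
    # `triples` is a list of (subject, predicate, object) strings.
    # Returns a mapping: output file name -> list of lines, following the
    # PS/POS schema sketched on the slides above.
    rdf_type = {}  # resource -> its rdf:type, gathered in a first pass
    for s, p, o in triples:
        if p == "rdf:type":
            rdf_type[s] = o

    files = defaultdict(list)
    for s, p, o in triples:
        if p == "rdf:type":
            # POS split of rdf:type: one file per object type, subjects only.
            files["rdf_type_" + local_name(o)].append(s)
        elif o in rdf_type:
            # POS split: predicate file refined by the object's rdf:type.
            name = p.replace(":", "_") + "_" + local_name(rdf_type[o])
            files[name].append(s + "\t" + o)
        else:
            # Object's type unknown: stay at the predicate split (PS) level.
            files[p.replace(":", "_")].append(s + "\t" + o)
    return files

triples = [
    ("D0U0:GraduateStudent20", "rdf:type", "lehigh:GraduateStudent"),
    ("lehigh:University0", "rdf:type", "lehigh:University"),
    ("D0U0:GraduateStudent20", "lehigh:memberOf", "lehigh:University0"),
]
for name, lines in sorted(split_triples(triples).items()):
    print(name, lines)
# lehigh_memberOf_University ['D0U0:GraduateStudent20\tlehigh:University0']
# rdf_type_GraduateStudent ['D0U0:GraduateStudent20']
# rdf_type_University ['lehigh:University0']
```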

Space Gain  Example  Data size at various steps for LUBM1000, reported as a table with columns Steps, Number of Files, Size (GB) and Space Gain, over the steps N-Triples → Predicate Split (PS) → Predicate Object Split (POS); the numeric cell values are not preserved in this transcript

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

SPARQL Query  SPARQL – SPARQL Protocol And RDF Query Language  Example
Query:
SELECT ?x ?y WHERE { ?z foaf:name ?x . ?z foaf:age ?y }
Data (Subject  Predicate  Object; resource names illustrative):
:person1  foaf:name  "John Smith"
:person1  foaf:age  "24"
:person2  foaf:name  "Jane Doe"
Result:
?x = "John Smith", ?y = "24"
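For readers new to SPARQL, the example can be reproduced with the rdflib Python library (rdflib is not part of the system described in these slides; the resource IRIs are assumptions):

```python
from rdflib import Graph

# Toy data mirroring the slide's example; the resource IRIs are assumptions.
data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/p1> foaf:name "John Smith" ; foaf:age "24" .
<http://example.org/p2> foaf:name "Jane Doe" .
"""

g = Graph()
g.parse(data=data, format="turtle")

results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?x ?y WHERE { ?z foaf:name ?x . ?z foaf:age ?y }
""")
for row in results:
    print(row.x, row.y)  # -> John Smith 24
```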

SPARQL Query by MapReduce  Example query:
SELECT ?p WHERE {
?x  rdf:type  lehigh:Department .
?p  lehigh:worksFor  ?x .
?x  subOrganizationOf  http://University0.edu }
 Rewritten query (using the POS split, as sketched below):
SELECT ?p WHERE {
?p  lehigh:worksFor_Department  ?x .
?x  subOrganizationOf  http://University0.edu }
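A hedged sketch of what such a rewrite could look like, assuming each triple pattern is a (subject, predicate, object) tuple; the helper logic is illustrative and ignores corner cases such as a type constraint that no other pattern can absorb:

```python
def rewrite(patterns):
    # Sketch of the POS-based rewrite on this slide: a (?x rdf:type C)
    # pattern is absorbed by pointing patterns with object ?x at the
    # predicate-object split file pred_C. Assumes the type constraint is
    # absorbed by at least one other pattern; names are illustrative.
    typed = {s: o for (s, p, o) in patterns
             if p == "rdf:type" and s.startswith("?")}
    rewritten = []
    for s, p, o in patterns:
        if p == "rdf:type" and s in typed:
            continue  # dropped: the POS file name carries the constraint
        if o in typed:
            p = p + "_" + typed[o].split(":")[-1]
        rewritten.append((s, p, o))
    return rewritten

print(rewrite([
    ("?x", "rdf:type", "lehigh:Department"),
    ("?p", "lehigh:worksFor", "?x"),
    ("?x", "subOrganizationOf", "http://University0.edu"),
]))
# [('?p', 'lehigh:worksFor_Department', '?x'),
#  ('?x', 'subOrganizationOf', 'http://University0.edu')]
```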

Inside Hadoop MapReduce Job
 Input splits:
subOrganizationOf_University: Department1, Department2 (the Map phase filters on Object == http://University0.edu)
worksFor_Department: Professor1 Department1, Professor2 Department2
 Map output (key = department, tagged value): Department1 → SO#; Department1 → WF#Professor1; Department2 → WF#Professor2
 Shuffle & sort groups values by key: Department1 → {SO#, WF#Professor1}; Department2 → {WF#Professor2}
 Reduce emits the professors of every department whose group contains an SO# tag
 Output: Professor1
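The same join, sketched as Hadoop Streaming-style Python functions (the slides describe Java MapReduce jobs; the tab-separated line format and the in-process simulation of shuffle & sort are assumptions for illustration):

```python
from collections import defaultdict

def map_works_for(line):
    # Mapper over the worksFor_Department split: emit (department, WF#professor).
    prof, dept = line.split("\t")
    yield dept, "WF#" + prof

def map_sub_org(line):
    # Mapper over the subOrganizationOf_University split: filter on the
    # bound object and emit (department, SO#) for survivors.
    dept, univ = line.split("\t")
    if univ == "http://University0.edu":
        yield dept, "SO#"

def reduce_join(dept, values):
    # Reducer: a department joins only if an SO# tag arrived alongside WF# tags.
    values = list(values)
    if "SO#" in values:
        for v in values:
            if v.startswith("WF#"):
                yield v[3:]  # binding for ?p

# Simulate the shuffle & sort phase in-process.
groups = defaultdict(list)
for line in ["Professor1\tDepartment1", "Professor2\tDepartment2"]:
    for k, v in map_works_for(line):
        groups[k].append(v)
for line in ["Department1\thttp://University0.edu",
             "Department2\thttp://OtherUniversity.edu"]:
    for k, v in map_sub_org(line):
        groups[k].append(v)
for dept in sorted(groups):
    print(dept, list(reduce_join(dept, groups[dept])))
# Department1 ['Professor1']
# Department2 []
```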

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

Query Plan Generation  Challenge  One Hadoop job may not be sufficient to answer a query  In a single Hadoop job, a single triple pattern cannot take part in joins on more than one variable simultaneously  Solution  An algorithm for query plan generation  A query plan is a sequence of Hadoop jobs which answers the query  Exploit the fact that in a single Hadoop job, a single triple pattern can take part in more than one join on a single variable simultaneously

Example  Example query:
SELECT ?X ?Y ?Z WHERE {
?X  pred1  obj1 .
subj2  ?Z  obj2 .
subj3  ?X  ?Z .
?Y  pred4  obj4 .
?Y  pred5  ?X }
 Simplified view (joining variables per triple pattern): 1. X  2. Z  3. XZ  4. Y  5. XY

Join Graph & Hadoop Jobs  Figure: the join graph for the example (nodes for the joins on X, Y and Z), followed by two valid jobs and one invalid job (a job is invalid when a triple pattern would join on two different variables at once)

Possible Query Plans  Plan A  Job 1: join patterns 1, 3, 5 on X: (x, xz, xy) → yz  Job 2: join the result with pattern 4 on Y: (yz, y) → z  Job 3: join with pattern 2 on Z: (z, z) → done; all patterns 1–5 merged into the result

Possible Query Plans  Plan B  Job 1: join patterns 4, 5 on Y: (y, xy) → x, and join patterns 2, 3 on Z: (z, xz) → x  Job 2: join the two intermediate results with pattern 1 on X: (x, x, x) → done; all patterns 1–5 merged into the result

Query Plan Generation  Goal: generate a minimum-cost job plan  Backtracking approach  Exhaustively generates all possible plans  Uses a two-coloring scheme (WHITE and BLACK) on the join graph to find valid jobs; two WHITE nodes cannot be adjacent (see the sketch below)  Chooses the best plan according to a user-defined cost model
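One way to read the two-coloring rule: the WHITE joins selected for a single job must form an independent set in a conflict graph over the MRJs. A small sketch under that reading (join names and the conflict set are illustrative):

```python
from itertools import combinations

def valid_jobs(joins, conflicts):
    # A candidate job colors some joins WHITE; no two WHITE joins may be
    # adjacent in the conflict graph, i.e. a job is an independent set.
    for r in range(1, len(joins) + 1):
        for subset in combinations(joins, r):
            if all(frozenset(pair) not in conflicts
                   for pair in combinations(subset, 2)):
                yield subset

# Illustrative conflict graph: join_x conflicts with join_y and join_z
# (e.g. they share a triple pattern but join on different variables).
joins = ["join_x", "join_y", "join_z"]
conflicts = {frozenset(("join_x", "join_y")),
             frozenset(("join_x", "join_z"))}
print(list(valid_jobs(joins, conflicts)))
# [('join_x',), ('join_y',), ('join_z',), ('join_y', 'join_z')]
```

The backtracking planner would then recursively pick one valid job, rewrite the remaining query and recurse, keeping the cheapest complete plan under the cost model.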

Some Definitions  Triple Pattern (TP)  A triple pattern is an ordered collection of subject, predicate and object which appears in a SPARQL query WHERE clause. The subject, predicate and object can each be either a variable (unbounded) or a concrete value (bounded).  Triple Pattern Join (TPJ)  A triple pattern join is a join between two TPs on a variable  MapReduceJoin (MRJ)  A MapReduceJoin is a join between two or more triple patterns on a variable.

Some Definitions  Job (JB)  A job JB is a Hadoop job in which one or more MRJs are done. A JB has a set of input files and a set of output files.  Conflicting MapReduceJoins (CMRJ)  Conflicting MapReduceJoins are a pair of MRJs sharing a triple pattern where the joins are on different variables.  Non-conflicting MapReduceJoins (NCMRJ)  Non-conflicting MapReduceJoins are a pair of MRJs either not sharing any triple pattern, or sharing a triple pattern with both MRJs on the same variable.

Example  LUBM Query  SELECT ?X WHERE {  1. ?X  rdf:type  ub:Chair .  2. ?Y  rdf:type  ub:Department .  3. ?X  ub:worksFor  ?Y .  4. ?Y  ub:subOrganizationOf  http://University0.edu }

Example (contd.)  Figure: Triple Pattern Graphs (TPG #1, TPG #2) and Join Graphs (JG #1, JG #2) for the LUBM query

Example (contd.)  The figure shows the TPG and JG for the query.  On the left, the TPG: each node represents a triple pattern in the query, and nodes are named in the order the patterns appear.  In the middle, the JG: each node of the JG represents an edge of the TPG.  For this query, a feasible query plan (FQP) can have two jobs:  the first dealing with an NCMRJ between triple patterns 2, 3 and 4,  the second with an NCMRJ between triple pattern 1 and the output of the first join.  An infeasible query plan (IQP) would have a first job with CMRJs between patterns 1, 3 and 4, and a second job with an MRJ between pattern 2 and the output of the first join.

Query Plan Generation: Backtracking

 Drawbacks of the backtracking approach  Computationally intractable  The search space is exponential in size

Steps a Hadoop Job Goes Through  The executable file (containing MapReduce code) is transferred from the client machine to the JobTracker  The JobTracker decides which TaskTrackers will execute the job  The executable file is distributed to the TaskTrackers over the network  Map processes start by reading data from HDFS  Map outputs are written to disk  Map outputs are read from disk, shuffled (transferred over the network to the TaskTrackers which will run the Reduce processes), sorted and written to disk  Reduce processes start by reading the input from disk  Reduce outputs are written to disk

MapReduce Data Flow

Observations & an Approximate Solution  Observations  Fixed overheads of a Hadoop job  Multiple reads and writes to disk  Data transferred over the network multiple times  Even a "Hello World" MapReduce job takes a couple of seconds because of these fixed overheads  Approximate solution  Minimize the number of jobs  This is a good approximation since the overhead of each job (e.g. jar file distribution, multiple disk reads/writes, multiple network data transfers) and of job switching is huge

Greedy Algorithm: Terms  Joining variable  A variable that is common to two or more triples  Example: for triples x, y, xy, xz, za (written as their variable sets), x, y and z are joining variables; a is not  Complete elimination  A join operation that eliminates a joining variable  y can be completely eliminated if we join (xy, y)  Partial elimination  A join that partially eliminates a joining variable  After complete elimination of y, x can be partially eliminated by joining (xz, x)

Greedy Algorithm: Terms  E-count  The number of joining variables in the resultant triple after a complete elimination  For the triples x, y, z, xy, xz:  E-count of x = 2 (resultant triple: yz)  E-count of y = 1 (resultant triple: x)  E-count of z = 1 (resultant triple: x)
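E-counts are easy to compute once each triple pattern is modeled as its set of joining variables; a short sketch reproducing the numbers above:

```python
def e_count(var, triples):
    # E-count of `var`: join every triple containing `var`; the resultant
    # triple keeps the remaining joining variables. Each triple pattern is
    # modeled as its set of joining variables.
    joined = [t for t in triples if var in t]
    return len(set().union(*joined) - {var})

triples = [{"x"}, {"y"}, {"z"}, {"x", "y"}, {"x", "z"}]
for v in "xyz":
    print(v, e_count(v, triples))  # x 2, y 1, z 1 (matches the slide)
```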

Greedy Algorithm: Proposition  The maximum number of jobs required for any SPARQL query is at most min(N, ⌈1.71 · log₂ K⌉) for K > 1  where K is the number of triples in the query  and N is the total number of joining variables

Greedy Algorithm: Proof  If we make just one join per joining variable, then all joins can be done in N jobs (one join per job)  Special case scenario  Suppose each joining variable is common to exactly two triples  Example: ab, bc, cd, de, ef, … (a chain)  In each job we can make K/2 joins, which reduces the number of triples to half (i.e., K/2)  So each job halves the number of triples  Therefore the total number of jobs required is log₂ K < 1.71 · log₂ K

Greedy Algorithm: Proof (Continued)  General case  Sort the joining variables in decreasing order of their frequency across triples  Let v_i have frequency f_i  Therefore f_i ≤ f_(i-1) ≤ f_(i-2) ≤ … ≤ f_1  Note that if f_1 = 2, the general case reduces to the special case  Therefore f_1 > 2 in the general case; also f_N ≥ 2  Now we keep joining on v_1, v_2, …, v_N as long as there is no conflict

Greedy Algorithm: Proof (Continued)  Suppose L triples could not be reduced because each of them is left with one or more conflicting joining variables (e.g., try reducing xy, yz, zx)  Therefore M ≥ L joins have been performed, producing M triples (M + L triples remain in total)  Since each join involves at least 2 triples: 2M + L ≤ K  Writing M = L + e with e ≥ 0: 2(L + e) + L ≤ K, i.e., 3L + 2e ≤ K  Multiplying both sides by 2/3: 2L + (4/3)e ≤ (2/3)K

Greedy Algorithm: Proof (Continued)  Since e ≥ 0: 2L + e ≤ 2L + (4/3)e ≤ (2/3)K  So each job reduces the number of triples to at most 2/3 of what it was  Therefore, after Q jobs: K·(2/3)^Q ≥ 1 ≥ K·(2/3)^(Q+1)  Hence (3/2)^Q ≤ K ≤ (3/2)^(Q+1), and Q ≤ log_(3/2) K = 1.71 · log₂ K ≤ Q + 1  In most real-world scenarios, queries with more than 100 triples are extremely rare  So the maximum number of jobs required in this case is ⌈1.71 · log₂ 100⌉ = 12

Greedy Algorithm  Early elimination heuristic  Make as many complete eliminations as possible in each job  This leaves the fewest variables to join in the next job  Choose first the join with the least e-count (the fewest joining variables in the resultant triple)

Greedy Algorithm

 Step I: remove non-joining variables  Step II: sort the variables by e-count  Step III: choose a variable for elimination as long as a complete or partial elimination is possible; these joins make up one job  Step IV: continue from Step II while more triples remain (a sketch follows below)
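A compact Python sketch of these four steps, under simplifying assumptions (triple patterns are reduced to their sets of joining variables, partial eliminations are not modeled separately, and ties in e-count are broken alphabetically); it illustrates the heuristic rather than reproducing the paper's implementation:

```python
def greedy_plan(triples):
    # Each triple pattern is modeled as a set of joining variables;
    # each outer-loop iteration corresponds to one Hadoop job.
    triples = [set(t) for t in triples]
    jobs = []
    while True:
        # Step I: keep only joining variables (shared by >= 2 triples).
        counts = {}
        for t in triples:
            for v in t:
                counts[v] = counts.get(v, 0) + 1
        joining = {v for v, c in counts.items() if c > 1}
        triples = [t & joining for t in triples if t & joining]
        if not joining or not triples:
            break

        # Step II: order candidate variables by e-count (ascending).
        def e_count(v):
            return len(set().union(*(t for t in triples if v in t)) - {v})

        # Step III: pick non-conflicting complete eliminations for this job.
        job, touched, results = [], set(), []
        for v in sorted(sorted(joining), key=e_count):
            group = [t for t in triples if v in t and not (t & touched)]
            if len(group) >= 2:
                job.append(v)
                touched |= {v} | set().union(*group)
                results.append(set().union(*group) - {v})
                for t in group:
                    triples.remove(t)
        if not job:
            break  # no progress possible
        triples += results
        jobs.append(job)
        # Step IV: loop back while joining triples remain.
    return jobs

# Chain example from the proof (ab, bc, cd, de, ef):
print(greedy_plan([{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}, {"e", "f"}]))
# -> [['b', 'e'], ['c'], ['d']]  (3 jobs)
```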

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

Experiment  Dataset and queries  Cluster description  Comparison with the Jena In-Memory, Jena SDB and BigOWLIM frameworks  Experiments with the number of reducers  Algorithm runtimes: greedy vs. exhaustive  Some query results

Dataset And Queries  LUBM  Dataset generator  14 benchmark queries  Generates data about imaginary universities  Used for query execution performance comparison by many researchers

Our Clusters  10-node cluster in the SAIAL lab  Per node: 4 GB main memory  Intel Pentium IV 3.0 GHz processor  640 GB hard drive  OpenCirrus test bed at HP Labs

Comparison: LUBM Query 2

Comparison: LUBM Query 9

Comparison: LUBM Query 12

Experiment with Number of Reducers

Greedy vs. Exhaustive Plan Generation

Some Query Results  Figure: query runtimes (seconds) against dataset size (millions of triples)

Outline  Semantic Web Technologies & Cloud Computing Frameworks  Goal & Motivation  Current Approaches  System Architecture & Storage Schema  SPARQL Query by MapReduce  Query Plan Generation  Experiment  Future Work

Future Work  Enable the plan generation algorithm to handle queries with complex structures  Ontology-driven file partitioning for faster query answering  Balanced partitioning for datasets with skewed distributions  Materialization with a limited number of jobs for inference  Experiments with non-homogeneous clusters

Publications  Mohammad Husain, Latifur Khan, Murat Kantarcioglu, Bhavani M. Thuraisingham: Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools, IEEE International Conference on Cloud Computing, 2010 (acceptance rate 20%)  Mohammad Husain, Pankil Doshi, Latifur Khan, Bhavani M. Thuraisingham: Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce, International Conference on Cloud Computing Technology and Science, Beijing, China, 2009  Mohammad Husain, Mohammad M. Masud, James McGlothlin, Latifur Khan, Bhavani Thuraisingham: Greedy Based Query Processing for Large RDF Graphs Using Cloud Computing, IEEE Transactions on Knowledge and Data Engineering Special Issue on Cloud Computing (submitted)  Mohammad Farhan Husain, Tahseen Al-Khateeb, Mohmmad Alam, Latifur Khan: Ontology based Policy Interoperability in Geo-Spatial Domain, CSI Journal (to appear)  Mohammad Farhan Husain, Mohmmad Alam, Tahseen Al-Khateeb, Latifur Khan: Ontology based policy interoperability in geo-spatial domain, ICDE Workshops 2008  Chuanjun Li, Latifur Khan, Bhavani M. Thuraisingham, M. Husain, Shaofei Chen, Fang Qiu: Geospatial Data Mining for National Security: Land Cover Classification and Semantic Grouping, Intelligence and Security Informatics, 2007

Questions/Discussion