Managing Large RDF Graphs Vaibhav Khadilkar Dr. Bhavani Thuraisingham Department of Computer Science, The University of Texas at Dallas December 2008.



Introduction
The Provost of the University of Texas at Dallas, Dr. B. Hobson Wildenthal, together with the Vice President for Research and Development, Dr. Bruce Gnade, committed the university to becoming a leader in emerging technologies, recognizing that it did not want to compete in legacy technologies. After a detailed analysis of unsolved problems, the university chose the Semantic Web and Cloud Computing as research areas. This choice was vetted with a large number of government and industrial clients and resulted in the creation of the Semantic Web Lab.

Our Projects on Semantic Web
- Confidentiality, Privacy and Trust for the Semantic Web
  - Texas Enterprise Funds, 2005; NSF, 2007
- Building Geospatial Semantic Web
  - Raytheon, 2006; NGA, 2007
- Blackbook Experimentation
  - Texas Enterprise Funds, 2007
- Ontology Mining (part of the text mining project)
  - NASA, 2007
- Assured Information Sharing
  - AFOSR MURI, 2008
- Managing Large RDF Graphs and Ontology Homogenization
  - IARPA, 2008

Managing Large RDF Graphs

Current problems
- The Semantic Web does not scale
- This hinders the ability to do reasoning and large graph processing
- Current work focuses on load balancing and fault tolerance, but the big bottleneck is memory
- Current systems can break with as few as 100,000 triples
- We work on load balancing and polynomial reasoning, but memory management breaks the systems before any of the other problems can be addressed

Solution History
- To solve this problem we need only look at history
- In the 1960s Dijkstra invented the multiprocess operating system
- This gave us general-purpose resource management for files and memory
- In the 1970s efforts were directed at taking the general-purpose OS and placing database applications on top of it
- The drawback was that these systems did not scale
- In the 1980s Robert Epstein and Michael Stonebraker from UC Berkeley defined specific algorithms for database processing, such as LRU/MRU
- These principles are now accepted as a solved problem space, resulting in Oracle, MySQL and others

Solution History (continued)
- In 2001 we started with the Semantic Web
- Oracle, HP and others tried to apply database algorithms to graph processing
- We worked to expand resource management to use specific graph algorithms
- The solution is constructed so that memory appears boundless (infinite graph), with deterministic reads that are an order of magnitude slower than pure in-memory solutions
[Figure: memory management (LRU/MRU) over nodes A, B and C]
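The "boundless memory" idea above can be sketched as a store that keeps a bounded number of nodes in memory and spills the rest to disk. This is only an illustration under stated assumptions: the class name `SpillingNodeStore` is hypothetical, plain JSON files stand in for the disk-backed store, eviction is pure LRU, and node ids are assumed to be safe file names.

```python
import json
import tempfile
from collections import OrderedDict
from pathlib import Path


class SpillingNodeStore:
    """Hypothetical sketch: at most `capacity` nodes stay in memory."""

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity
        self.spill_dir = Path(spill_dir)
        self.in_memory = OrderedDict()  # node id -> properties, oldest first

    def _path(self, node_id):
        # Assumption: node ids are simple strings usable as file names.
        return self.spill_dir / f"{node_id}.json"

    def _evict_if_needed(self):
        # Write least recently used nodes to disk until we fit again.
        while len(self.in_memory) > self.capacity:
            victim_id, victim_props = self.in_memory.popitem(last=False)
            self._path(victim_id).write_text(json.dumps(victim_props))

    def put(self, node_id, properties):
        self.in_memory[node_id] = properties
        self.in_memory.move_to_end(node_id)  # mark as most recently used
        self._evict_if_needed()

    def get(self, node_id):
        if node_id not in self.in_memory:
            # Slow path: the node was spilled, so read it back from disk.
            path = self._path(node_id)
            if not path.exists():
                raise KeyError(node_id)
            self.in_memory[node_id] = json.loads(path.read_text())
        self.in_memory.move_to_end(node_id)
        properties = self.in_memory[node_id]
        self._evict_if_needed()
        return properties


with tempfile.TemporaryDirectory() as demo_dir:
    store = SpillingNodeStore(capacity=2, spill_dir=demo_dir)
    store.put("a", {"age": 35})
    store.put("b", {"name": "ACM"})
    store.put("c", {"name": "Semantic Web Journal"})  # spills "a" to disk
    print(store.get("a"))  # served from disk, then cached again
```

The caller never sees whether a node came from memory or disk; only the access time differs, which is the trade-off the slide describes.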

Relevance of problem
- This was an unsolved problem
- Critical in handling the terabytes of data that are relevant today
- Virtualize from memory space to disk space

Tools Used
- Jena
  - An open source Semantic Web framework used to build and manipulate large RDF graphs
  - Also gives the capability to handle RDFS and OWL
  - Provides a query language (SPARQL) and a rule-based inference engine
  - Developed by HP Labs
  - Can represent RDF graphs as a model

Tools Used (continued)
- Lucene
  - A Java-based text search engine library
  - Suitable for any application and platform independent
  - Does indexing and retrieval in a few milliseconds across terabytes of data
- MySQL
  - An open source RDBMS used with the various database representations in Jena (RDB, SDB, and TDB)
  - An easy-to-use alternative to other RDBMSs

In-memory Jena Model
- The in-memory solution forms the basis of the solution that we use for the RDB problem
- As nodes are added to the in-memory graph, memory fills up
- Therefore we can handle only medium-sized graphs
- Once memory is full, an out-of-memory exception stops program execution
- We want to solve this out-of-memory problem

Memory Management Algorithm
- Graph representation
[Figure: example RDF graph with an author node and the labels Age (35), Time, Phone, ACM, Society, Journal, Semantic Web Journal and Journal Name]

Memory Management Algorithm
- Graph representation
  - The graph is constructed in Jena by specifying nodes and their properties
  - Triples are added in a monotonically increasing fashion
  - Nodes may be accessed at any time (this is a key point in the algorithm)
- Data structure used in the algorithm
  - Create an in-memory LRU-based cache
  - For each node in the graph, store an index number, a timestamp for when it was last accessed, and the number of connections for that node
  - Each time a node is accessed or a triple is added, update the associated cache entry
  - This structure is used to determine the candidate node that will be written to disk
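The cache bookkeeping described above can be sketched as follows. The slides do not say how recency and connectivity are combined, so the rule used here (oldest access first, fewest connections as the tie-breaker) is an assumption, as are the names `CacheEntry` and `NodeCache` and the use of a logical clock instead of wall-clock timestamps.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    index: int            # index number assigned when the node is first seen
    last_access: int = 0  # logical timestamp of the most recent access
    connections: int = 0  # number of triples that touch this node


class NodeCache:
    """Hypothetical per-node bookkeeping for choosing a node to write out."""

    def __init__(self):
        self.entries = {}
        self.tick = 0  # advances once per access or per added triple

    def _entry(self, node):
        if node not in self.entries:
            self.entries[node] = CacheEntry(index=len(self.entries))
        return self.entries[node]

    def access(self, node):
        # A read of the node refreshes its timestamp.
        self.tick += 1
        self._entry(node).last_access = self.tick

    def add_triple(self, subject, predicate, obj):
        # Adding a triple refreshes both endpoints and grows their degree.
        self.tick += 1
        for node in (subject, obj):
            entry = self._entry(node)
            entry.last_access = self.tick
            entry.connections += 1

    def eviction_candidate(self):
        # Assumed rule: least recently used, then least connected.
        return min(
            self.entries,
            key=lambda n: (self.entries[n].last_access,
                           self.entries[n].connections),
        )


cache = NodeCache()
cache.add_triple("paper1", "author", "alice")
cache.add_triple("paper1", "journal", "swj")
cache.access("alice")
print(cache.eviction_candidate())
```

In this run `paper1` and `swj` share the oldest timestamp, and `swj` has fewer connections, so it is the node the assumed rule would write to disk first.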

Memory Management Algorithm
- Algorithm for the RDB model
  - We use the LIMIT clause in MySQL to retrieve only part of the results at a time
  - The retrieved triples are added to the revised in-memory Jena model
  - This leverages the memory management algorithm for the in-memory model
  - Since the revised in-memory model never runs out of memory, the RDB solution does not run out of memory either
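A minimal sketch of the paged retrieval described above. To keep it self-contained it uses Python's built-in `sqlite3` as a stand-in for MySQL (both accept `LIMIT ... OFFSET ...`), and the `triples` table and its columns are hypothetical, not Jena's actual RDB schema.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory database with a hypothetical triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        "id INTEGER PRIMARY KEY, subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany(
        "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
        [
            ("paper1", "author", "alice"),
            ("paper1", "journal", "swj"),
            ("alice", "age", "35"),
            ("swj", "society", "acm"),
            ("swj", "name", "Semantic Web Journal"),
        ],
    )
    return conn


def iter_triple_pages(conn, page_size):
    """Yield the triples one page at a time instead of all at once."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT subject, predicate, object FROM triples "
            "ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += len(rows)


conn = make_demo_db()
for page in iter_triple_pages(conn, page_size=2):
    # Each page would be handed to the memory-managed in-memory model.
    print(len(page), "triples")
```

Only one page of rows is held by the retrieval loop at any time, so the memory pressure is left entirely to the in-memory model and its eviction policy.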

Conclusions from In-Memory Jena Model
- As the threshold increases, the time required for the calculations decreases
- As the memory size increases, the time needed for the calculations increases, since more triples can be stored in memory
- A node in memory takes about 35 ms, whereas one cached to Lucene takes about 300 ms
- The goal is for usage patterns to pull from memory
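To see why pulling from memory matters, the two timings above can be blended by the fraction of accesses served from memory. The hit ratio is a hypothetical parameter that the slides do not report; the 35 ms and 300 ms figures are the ones quoted above. At a 90% hit ratio, for example, the blend is 0.9 × 35 + 0.1 × 300 = 61.5 ms.

```python
MEMORY_MS = 35   # per-node time in memory, as quoted in the slides
LUCENE_MS = 300  # per-node time when cached to Lucene, as quoted


def expected_access_ms(hit_ratio):
    """Average per-node time for a given fraction of in-memory hits."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * MEMORY_MS + (1.0 - hit_ratio) * LUCENE_MS


for ratio in (0.5, 0.9, 0.99):
    print(f"hit ratio {ratio:.2f}: {expected_access_ms(ratio):.1f} ms")
```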

Conclusions from the RDB Jena Model
- Database creation times are almost the same as with the original Jena implementation
- Database querying times vary depending on the threshold value set in the algorithm

General Conclusions
- Implemented an in-memory LRU/connectivity-based memory management algorithm
- Solves the out-of-memory problem for the in-memory and RDB-based models in Jena by giving the user the impression of infinite memory

Future Work
- Implement the memory management algorithms for cloud computing
- Generalize the algorithm for all models
- Try various other memory management algorithms that affect usage