Archival Integration with Neo4j Mike Bryant Centre for e-Research King’s College London.

Slides:



Advertisements
Similar presentations
XML DOCUMENTS AND DATABASES
Advertisements

Knowledge Graph: Connecting Big Data Semantics
Introduction to Databases
Feature requests for Case Manager By Spar Nord Bank A/S IBM Insight 2014 Spar Nord Bank A/S1.
Slide 1 Web-Base Management Systems Aaron Brown and David Oppenheimer CS294-7 February 11, 1999.
0 General information Rate of acceptance 37% Papers from 15 Countries and 5 Geographical Areas –North America 5 –South America 2 –Europe 20 –Asia 2 –Australia.
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
File Systems and Databases
CONTENTdm Important Features and Capabilities. CONTENTdm provides an “out of the box” solution to a complex web programming challenge. With minimal customization,
1 Lecture 13: Database Heterogeneity Debriefing Project Phase 2.
File Systems and Databases Hachim Haddouti
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Graph databases …the other end of the NoSQL spectrum. Material taken from NoSQL Distilled and Seven Databases in Seven Weeks.
Academic Year 2014 Spring.
IBM User Technology March 2004 | Dynamic Navigation in DITA © 2004 IBM Corporation Dynamic Navigation in DITA Erik Hennum and Robert Anderson.
New “Collaborate” Button Integrate UI directly into the browser. Preferred target: Firefox Easiest browser to extend in terms of UI.
Systems analysis and design, 6th edition Dennis, wixom, and roth
A Level Computer Science Topic 9: Data Structures T eaching L ondon C omputing William Marsh School of Electronic Engineering and Computer Science Queen.
Domain Driven Design. Set of blog posts spanning 10 months – building an app Fefactored along the way code to Patterns eg repository.
MIS 385/MBA 664 Systems Implementation with DBMS/ Database Management Dave Salisbury ( )
Completing the Model Common Problems in Database Design.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
-1- Philipp Heim, Thomas Ertl, Jürgen Ziegler Facet Graphs: Complex Semantic Querying Made Easy Philipp Heim 1, Thomas Ertl 1 and Jürgen Ziegler 2 1 Visualization.
Copyright © 2005 Ed Lance Fundamentals of Relational Database Design By Ed Lance.
Module 13 Implementing Business Continuity. Module Overview Protecting and Recovering Content Working with Backup and Restore for Disaster Recovery Implementing.
Why use a Database B8 B8 1.
M1G Introduction to Database Development 5. Doing more with queries.
Module 4 Designing and Implementing Views. Module Overview Introduction to Views Creating and Managing Views Performance Considerations for Views.
1 Computing Challenges for the Square Kilometre Array Mathai Joseph & Harrick Vin Tata Research Development & Design Centre Pune, India CHEP Mumbai 16.
Ontology-Based Computing Kenneth Baclawski Northeastern University and Jarg.
Resource Discovery for Extreme Scale Collaboration Benno Lee Patrick West 1 William Smith 2
Templated Search over Relational Databases Date: 2015/01/15 Author: Anastasios Zouzias, Michail Vlachos, Vagelis Hristidis Source: ACM CIKM’14 Advisor:
Representing Uncertainty, Unknowns and Dynamics of Social Networks NX-Workshop on Social Network Analysis and Visualization for Public Safety 18 – 19 October.
Database Management Systems (DBMS)
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
Search Worms, ACM Workshop on Recurring Malcode (WORM) 2006 N Provos, J McClain, K Wang Dhruv Sharma
Lection №4 Development of the Relational Databases.
11 Introduction to Neo4j. 2 We all have our own graphs...
Session 1 Module 1: Introduction to Data Integrity
Interface for Glyco Vault Functionality and requirements. Initial proposal. Maciej Janik.
Working with XML. Markup Languages Text-based languages based on SGML Text-based languages based on SGML SGML = Standard Generalized Markup Language SGML.
An Effective SPARQL Support over Relational Database Jing Lu, Feng Cao, Li Ma, Yong Yu, Yue Pan SWDB-ODBIS 2007 SNU IDB Lab. Hyewon Lim July 30 th, 2009.
EMBL-EBI Dimitris Dimitropoulos MSD-mine. EMBL-EBI MSD-mine overview  Web application for online data analysis and mining  For the advanced MSDSD researcher.
NoSQL: Graph Databases. Databases Why NoSQL Databases?
Introduction to the Semantic Web Tutorial Introduction to the Introduction to the Semantic Web Jim Hendler Rensselaer
Retele de senzori Curs 2 - 1st edition UNIVERSITATEA „ TRANSILVANIA ” DIN BRAŞOV FACULTATEA DE INGINERIE ELECTRICĂ ŞI ŞTIINŢA CALCULATOARELOR.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
External Data Access 5/29/08. Current Problems No way to load, process & analyze live Atlas data via critical analysis & programming tools (SAS, R, Perl)
Preservation Through Evolution Management: The DIACHRON Approach DIACHRON Final Dissemination Workshop Giorgos Flouris (FORTH)
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
The Palantir Platform… …Changes in 2.3
Neo4j: GRAPH DATABASE 27 March, 2017
Linking Ontologies to Spatial Databases
Chapter 14: System Protection
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Every Good Graph Starts With
Web Engineering.
NOSQL databases and Big Data Storage Systems
Storage Virtualization
Logics for Data and Knowledge Representation
Introduction to Smart Search
File Systems and Databases
CS Introduction to Operating Systems
Analysis models and design models
Chapter 14: Protection.
Dr.s Khem Ghusinga and Alan Jones
MIS 385/MBA 664 Systems Implementation with DBMS/ Database Management
Legacy Databases.
Presentation transcript:

Archival Integration with Neo4j Mike Bryant Centre for e-Research King’s College London

Integrate existing electronic descriptions of Holocaust-related archival material Create new electronic descriptions of hitherto-inaccessible material Recreate associations and connections between material What are we doing?

CountryArchiveFondsSeriesFileItem FileItemSeriesFileItemFondsArchiveFondsSeriesFile Archival data is hierarchical… CountryArchiveFondsSeriesFileItemSeriesFileItemFonds

Working with hierarchies in SQL can be… …challenging  (Good book, BTW)

Why a graph database? Dynamic, heterogeneous, evolving data set Lots of generic functionality applied to many different entity types Hierarchies and freeform connections Social relation aspect

Permissions & access control Lots of people involved in different aspects of the data management Permission structure roughly follows data hierarchies Requirement that access to some data be restricted to accredited users

“Social” stuff Allow forms of group collaboration Put researchers in touch with each other Find out who is interested in similar material (recommendations, etc) Curation of “virtual” data collections

Neo4j Blueprints Domain/ACL Stuff Jersey web service (CRUD, etc) Frontend web application

Traversals A lot of traversals we do are of the form: MATCH child-[:CHILD_OF*]->parent parent-[:HELD_BY]->repo RETURN DISTINCT repo Frequently traverse paths of unknown length: Leaf node up to root Searching a newest-first linked-list of events

wrapp Blueprints Wrapping the graph Item User A Restricted To Neo4j User B ACL Wrapper Low-level access filtering makes life simpler and easier

a ; "Accession number: 1991.A " ; "us eng" ; "eng" ; "1985 October 26" ; "Oral history interview with Mike Jacobs" ; """User copies, 2 videocassettes (VHS) (NTSC) : sound, color ; 1/2 in. Protection copies, 2 videocassettes (U-Matic) (NTSC): sound, color ; 3/4 in. Linked data Via Blueprints SailGraph* we can: Export triples Query via SPARQL, but: Still need to construct/infer new triples for domain-level ontology mapping… preferably in real-time *

Performance…

Not a huge amount of data (< 1M nodes) Not expecting that many users But –Performance matters, e.g. indexing whole DB –Can’t just throw hardware at the problem –Lack of experience with Neo4j –Performance problems in production difficult to reproduce and diagnose on hosted VMs

Challenges Maintaining integrity –Flip side to schema-less model –(pre-Neo4j 2.0 introduction of constraints) –Had to implement our own poor facsimile of compound PKs for uniqueness within hierarchies What would be nice –Way of expressing “optional” constraints to the graph – runnable on-demand to validate integrity –Investigating solutions to this, including OWL

Challenges (continued…) Good versioning is difficult to “bolt on” after the fact: a) Can I recover this data in a pinch? vs. b) Can I go back in time?! Unfortunately we currently have a). Very interested in Datomic* for going back in time abilities. *

Final thoughts Using Neo4j/Blueprints removed a lot of friction and complexity in modeling our data –Added friction in other places, but still a win Systematic benchmarking is important –Confounding factors are hard to remove –Developers (at least, me) are bad at guessing what’s fast, what’s slow, and why

Thanks