XML Store Christian Theil Have, René Kofoed, References: Kasper Pedersen & Jesper Pedersen, Value-oriented.

XML Store. Christian Theil Have, René Kofoed. References: Kasper Pedersen & Jesper Pedersen, Value-oriented XML Store; Mads Pultz, Garbage Collection in Distributed Value-Oriented Storage System.

XML Store overview. XML store: stores values. Name service: like a DNS service.

XML Store overview. What’s in an XML store: values and value references. The XML store helps you store values and locate them again using references.

What’s a value? A piece of data of some sort, maybe the number 5 :) Are 5 and 5 and 5 three different values? Nope: there is only one value, 5. The other two are occurrences of that same value.

What’s a value reference? A short name for a value :) The short name is shorter than the data being stored.
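A later slide labels the value reference as a "hashed value". The sketch below assumes SHA-1 over the serialized bytes (the hash function and the sample XML fragments are assumptions, not taken from the slides) to illustrate why every occurrence of the same value ends up with the same reference.

```python
import hashlib


def value_reference(value: bytes) -> str:
    # SHA-1 is an assumed choice; the slides only say the reference is a hashed value.
    return hashlib.sha1(value).hexdigest()


# Two occurrences of the same value, plus one different value (hypothetical data).
line_a = b"<LINE>to be or not to be, that is the question</LINE>"
line_b = b"<LINE>to be or not to be, that is the question</LINE>"
other = b"<LINE>to xml or not to xml</LINE>"

ref_a = value_reference(line_a)
ref_b = value_reference(line_b)
ref_other = value_reference(other)
```

A digest has a fixed length (40 hex characters here), so it is only shorter than the data for values larger than the digest itself.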

Value references. [Diagram: the values "Hamlet" and "to xml or not to xml" shown alongside the value references BBCD01 and BBCD02.]

Shared values. [Diagram: Document A (BBCD01, BBCD02) and Document B (BBCD01, BBCD03), with the values "Hamlet" and "to be or not to be"; one value is marked as shared.]

And updates? Changing the speaker "Hamlet" in document A would also change document B. This is problematic!

Therefore: perform non-destructive updates by (1) creating a new value in a new tree, and (2) replacing existing references to the old value with references to the new value.

Changing Hamlet to Ophelia. [Diagram: Document A (BBCD01, BBCD02), the new Document A’ (BBCD04, BBCD03) and Document B (BBCD01, BBCD03), with the values "Hamlet" and "Ophelia".]
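A non-destructive update can be sketched as path copying over an immutable tree. This is a minimal illustration, not the XML Store implementation: the `Node` class, the `replace_at` helper and the SPEECH/SPEAKER/LINE document shape are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Immutable tree node (hypothetical stand-in for a stored value)."""
    label: str
    children: Tuple["Node", ...] = ()


def replace_at(node: Node, path: Tuple[int, ...], new: Node) -> Node:
    """Return a new tree in which the node at `path` is `new`; `node` is untouched."""
    if not path:
        return new
    kids = list(node.children)
    kids[path[0]] = replace_at(kids[path[0]], path[1:], new)
    return Node(node.label, tuple(kids))


speaker = Node("SPEAKER", (Node("Hamlet"),))
line = Node("LINE", (Node("to be or not to be"),))
doc_a = Node("SPEECH", (speaker, line))
doc_b = Node("SPEECH", (speaker, line))  # shares both subtrees with doc_a

# Only the nodes on the path from the root to the change are rebuilt.
doc_a2 = replace_at(doc_a, (0, 0), Node("Ophelia"))
```

Here `doc_a2` gets a new root and a new SPEAKER node, still shares the untouched LINE subtree, and `doc_b` keeps seeing "Hamlet".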

Put another way. [Diagram: Node, Value.]

XML: a format for representing tree-structured data. XML models: Document Object Model (DOM), which loads the entire document into memory in order to manipulate it; XML Enabled Databases (XED), which store XML data in an existing relational database; Native XML Databases, whose internal model is designed for persisting XML; Document Value Model (DVM), a way to program XML-centric applications.

What’s DVM? An API consisting of only two methods: Save and Load. That’s it!
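A two-method API of this kind might look roughly like the following. It is a toy in-memory sketch under stated assumptions: the class name `InMemoryStore`, the use of bytes as values and the SHA-1 based reference are all illustrative and not the actual DVM signatures.

```python
import hashlib
from typing import Dict


class InMemoryStore:
    """Toy stand-in for an XML store exposing only save and load."""

    def __init__(self) -> None:
        self._values: Dict[str, bytes] = {}

    def save(self, value: bytes) -> str:
        # Assumed: the reference is derived from the content, so saving an
        # equal value twice yields the same reference and one stored copy.
        ref = hashlib.sha1(value).hexdigest()
        self._values[ref] = value
        return ref

    def load(self, ref: str) -> bytes:
        # Raises KeyError for an unknown reference.
        return self._values[ref]


xml_store = InMemoryStore()
speaker_ref = xml_store.save(b"<SPEAKER>Hamlet</SPEAKER>")
same_ref = xml_store.save(b"<SPEAKER>Hamlet</SPEAKER>")
loaded = xml_store.load(speaker_ref)
```

There is deliberately no update or delete operation: changing a document means saving a new value and obtaining a new reference.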

Implementations: 1. Based on Chord. 2. Based on IP-multicast. Uses locators and a reference server to map a value reference to a location-independent value.

XMLStore architecture overview. Layered architecture: application layer, DVM layer, disk layer. A concrete XMLStore is built from pluggable modules.

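Layers in XMLStore. [Diagram: a name ("SPEECH") is looked up in the name server to get a value reference (hashed value); a lookup (p2p routing) maps the value reference to a locator; load on the locator returns the value. Sample labels in the figure: 6FE02A, :6949/XAA/42,8, 5A7012,87D311.]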

Modules (1). XMLStore is organized in a modular way. Functionality can be added using decorators. [Diagram: a decorator wrapping Save and Load that spawns a thread.]
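The decorator idea can be sketched as a wrapper that exposes the same save/load interface as the store it wraps. The classes `DictStore` and `CachingStore` below are hypothetical, and caching is just one of the possible modules.

```python
class DictStore:
    """Hypothetical base store; counts loads so the cache effect is visible."""

    def __init__(self):
        self.values = {}
        self.load_calls = 0

    def save(self, value):
        ref = "%04X" % len(self.values)  # toy reference, not a real hash
        self.values[ref] = value
        return ref

    def load(self, ref):
        self.load_calls += 1
        return self.values[ref]


class CachingStore:
    """Decorator: same save/load interface, with a cache in front of `inner`."""

    def __init__(self, inner):
        self.inner = inner
        self.cache = {}

    def save(self, value):
        ref = self.inner.save(value)
        self.cache[ref] = value
        return ref

    def load(self, ref):
        if ref not in self.cache:
            self.cache[ref] = self.inner.load(ref)
        return self.cache[ref]


base = DictStore()
ref = base.save("Hamlet")
cached = CachingStore(base)
first = cached.load(ref)   # goes to the wrapped store
second = cached.load(ref)  # served from the cache
```

Because stored values never change, a cached copy cannot become stale, which is what makes this kind of layering straightforward.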

Modules (2). Example modules: buffering, caching, asynchronous operation, replication, (peer-to-peer) distribution, garbage collection.

Name service. Binds a human-readable name to a value reference. Must provide destructive updates, which raises the lost update problem; the options are pessimistic locking or optimistic concurrency control. Updates should be atomic.
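Optimistic concurrency control for name bindings can be sketched as a compare-and-swap: a rebind succeeds only if the name still points at the reference the client last saw. The `NameService` class and its `rebind` method are hypothetical names for illustration, and the single-process lock stands in for whatever atomicity mechanism a distributed name service would need.

```python
import threading


class NameService:
    """Toy name service: maps human-readable names to value references."""

    def __init__(self):
        self._bindings = {}
        self._lock = threading.Lock()

    def lookup(self, name):
        return self._bindings.get(name)

    def rebind(self, name, expected_ref, new_ref):
        """Atomically bind name to new_ref only if it is still bound to expected_ref."""
        with self._lock:
            if self._bindings.get(name) != expected_ref:
                return False  # someone else updated first; caller must retry
            self._bindings[name] = new_ref
            return True


ns = NameService()
ok_first = ns.rebind("hamlet", None, "BBCD01")  # initial binding
seen = ns.lookup("hamlet")                      # two clients both read BBCD01
ok_a = ns.rebind("hamlet", seen, "BBCD04")      # first writer wins
ok_b = ns.rebind("hamlet", seen, "BBCD05")      # stale view, rejected
```

The rejected client has to look up the name again and redo its change on top of the new value, so no update is silently lost.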

Distributed name service

Garbage collection in XMLStore. Problem: since values cannot be deleted from XMLStore, it could eventually fill up. The solution: garbage collection! Garbage collection in XMLStore is concerned with reclaiming values that cannot be reached using references.

Garbage collection. Live values: can be reached using an existing value reference. The existing value references comprise the "root set". Mutator: a program that accesses memory. Collector: the garbage collector. Local vs. global garbage collection. Different approaches: reference counting, reference listing, tracing.

Garbage collection. [Diagram: a name server and stored values ("foo", "bar", "foobar") identified by value references such as 1234, F045, 9A01, F3B2 and FF01, with stubs, skeletons and a reference count per value.]

Garbage collection. XMLStore needs an extended API to keep track of value references.

Garbage collection: reference counting. Basic idea: count the number of references to a given object, and reclaim it when the count reaches zero. A reference count is associated with each value occurrence. Scales well. Cannot reclaim cycles; this is not a problem since XMLStore contains only acyclic data structures. No resilience to lost messages, so fault tolerance is low.
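The basic idea can be sketched on a single node as follows. The class `RefCountingStore` is hypothetical, and the reference strings are illustrative labels only; a distributed version would exchange increment and decrement messages, which is where lost messages become a problem.

```python
class RefCountingStore:
    """Toy store that reclaims a value when its reference count reaches zero."""

    def __init__(self):
        self.values = {}  # ref -> (payload, child refs)
        self.counts = {}  # ref -> number of references to it

    def add(self, ref, payload, children=()):
        self.values[ref] = (payload, tuple(children))
        self.counts.setdefault(ref, 0)
        for child in children:
            self.counts[child] += 1  # the new value references its children

    def incref(self, ref):
        self.counts[ref] += 1

    def decref(self, ref):
        self.counts[ref] -= 1
        if self.counts[ref] == 0:
            # Reclaim the value and release its references to its children.
            _, children = self.values.pop(ref)
            del self.counts[ref]
            for child in children:
                self.decref(child)


rc = RefCountingStore()
rc.add("F3B2", "foo")
rc.add("FF01", "bar")
rc.add("9A01", "foobar", children=("F3B2", "FF01"))
rc.incref("9A01")  # e.g. a name binding points at the document
rc.incref("F3B2")  # a second, external reference to one child

rc.decref("9A01")  # the binding goes away
remaining = set(rc.values)
```

Dropping the last reference to 9A01 cascades to FF01, while F3B2 survives because something else still refers to it.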

Garbage collection: reference listing. Each value occurrence has a separate skeleton for each client… Mapping of peers to value references. [Diagram: the value F3B2 ("hamlet") on a peer, with skeletons for Client A and Client B.]
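Reference listing can be sketched as keeping, per value reference, the set of clients that hold it instead of a bare counter. The `ReferenceList` class and the client names are hypothetical; the point of the sketch is that adding or removing the same holder twice has no extra effect.

```python
class ReferenceList:
    """Toy reference list: tracks which clients hold each value reference."""

    def __init__(self):
        self.holders = {}  # ref -> set of client ids

    def add(self, ref, client):
        # Adding the same client twice has no extra effect.
        self.holders.setdefault(ref, set()).add(client)

    def remove(self, ref, client):
        """Drop `client` as a holder; return True if the value is now unreferenced."""
        clients = self.holders.get(ref, set())
        clients.discard(client)
        if not clients:
            self.holders.pop(ref, None)
            return True
        return False


rl = ReferenceList()
rl.add("F3B2", "Client A")
rl.add("F3B2", "Client A")  # duplicated message, harmless
rl.add("F3B2", "Client B")

free_after_a = rl.remove("F3B2", "Client A")
free_after_b = rl.remove("F3B2", "Client B")
```

With a plain counter the duplicated add would have left the count one too high; with a list of holders it does not matter.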

Garbage collection: tracing. Live objects are recursively traced from the roots, and unreachable objects are determined to be garbage. Stop-and-collect vs. incremental collection. Group-based tracing. Pros and cons: can detect cycles, but has scalability issues.
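Tracing can be sketched as a reachability walk from the root set. The function `trace_live` and the sample graph are hypothetical; the graph deliberately contains an unreachable cycle to show that tracing, unlike counting, identifies it as garbage.

```python
def trace_live(roots, children_of):
    """Return the set of references reachable from `roots`."""
    live = set()
    stack = list(roots)
    while stack:
        ref = stack.pop()
        if ref in live:
            continue  # already visited; also stops loops on cyclic data
        live.add(ref)
        stack.extend(children_of.get(ref, ()))
    return live


# Hypothetical object graph: ref -> refs it contains.
graph = {
    "9A01": ["F3B2", "FF01"],
    "F3B2": [],
    "FF01": [],
    "1234": ["F045"],  # 1234 and F045 reference each other
    "F045": ["1234"],  # but nothing in the root set reaches them
}
roots = ["9A01"]

live = trace_live(roots, graph)
garbage = set(graph) - live
```

The cost is that every live value has to be visited on each collection, which is the scalability concern once the graph spans many peers.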

Approach adapted for XMLStore. Local garbage collection: reference listing, plus tracing (copying collector) so that cycles can be collected (only relevant with cells/mutable data). Global garbage collection: tracing.

Summary