Duke Systems: Scaling Data and Services. Jeff Chase, Duke University.

Challenge: data management

Data volumes are growing enormously, and mega-services are "grounded" in data. How do we scale the data tier?
– Scaling requires dynamic placement of data items across data servers, so we can grow the number of servers.
– Sharding divides data across multiple servers or storage units.
– Caching helps to reduce load on the data tier.
– Replication helps to survive failures and balance read/write load.
– Caching and replication require careful update protocols to ensure that servers see a consistent view of the data.

Concept: load spreading

Spread ("deal") the data across a set of storage units.
– Make it "look like one big unit", e.g., "one big disk".
– Redirect requests for a data item to the right unit.

The concept appears in many different settings/contexts.
– We can spread load across many servers too, to make a server cluster look like "one big server".
– We can spread out different data items: objects, records, blocks, chunks, tables, buckets, keys…
– Keep track using maps or a deterministic function (e.g., a hash).

Also called sharding, declustering, striping, or "bricks".
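A deterministic function over the key is the simplest way to "deal" items across units. A minimal sketch in Python (the function name and unit count are illustrative, not from the slides):

```python
import hashlib

def unit_for(key: str, n_units: int) -> int:
    """Deterministically map a key to one of n_units storage units."""
    digest = hashlib.md5(key.encode()).hexdigest()  # stable across processes
    return int(digest, 16) % n_units

# Every request for the same key is redirected to the same unit,
# so the set of units "looks like one big disk" to clients.
unit = unit_for("block:1042", 8)
```

A modulo scheme balances well, but changing the number of units remaps almost every key, which motivates more churn-friendly schemes such as consistent hashing.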

“Sharding”

Key-value stores

Many mega-services are built on key-value stores.
– Store variable-length content objects: think "tiny files" (values).
– Each object is named by a "key", usually fixed-size.
– The key is also called a token: not to be confused with a crypto key! Although it may be a content hash (SHAx or MD5).
– Simple put/get interface with no offsets or transactions (yet).
– Goes back to the literature on Distributed Data Structures [Gribble 2000] and Distributed Hash Tables (DHTs).

[Slide notes: a first-person account of Amazon's transition to a service-oriented architecture.]

Over the next couple of years, Amazon transformed internally into a service-oriented architecture. They learned a tremendous amount:
– Pager escalation gets way harder; you build a lot of scaffolding and metrics and reporting.
– Every single one of your peer teams suddenly becomes a potential DoS attacker. Nobody can make any real forward progress until very serious quotas and throttling are put in place in every single service.
– Monitoring and QA are the same thing. You'd never think so until you try doing a big SOA. But when your service says "oh yes, I'm fine", it may well be the case that the only thing still functioning in the server is the little component that knows how to say "I'm fine, roger roger, over and out" in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA. So they're a continuum.
– If you have hundreds of services, and your code MUST communicate with other groups' code via these services, then you won't be able to find any of them without a service-discovery mechanism. And you can't have that without a service registration mechanism, which itself is another service. So Amazon has a universal service registry where you can find out reflectively (programmatically) about every service, what its APIs are, and also whether it is currently up, and where.
– Debugging problems with someone else's code gets a LOT harder, and is basically impossible unless there is a universal standard way to run every service in a debuggable sandbox.

That's just a very small sample. There are dozens, maybe hundreds of individual learnings like these that Amazon had to discover organically. There were a lot of wacky ones around externalizing services, but not as many as you might think. Organizing into services taught teams not to trust each other in most of the same ways they're not supposed to trust external developers.

This effort was still underway when I left to join Google in mid-2005, but it was pretty far advanced. From the time Bezos issued his edict through the time I left, Amazon had transformed culturally into a company that thinks about everything in a services-first fashion. It is now fundamental to how they approach all designs, including internal designs for stuff that might never see the light of day externally.

[image from Sean Rhea, opendht.org, 2004]

Scalable key-value stores

Can we build massively scalable key/value stores?
– Balance the load: distribute the keys across the nodes.
– Find the "right" server(s) for a given key.
– Adapt to change (growth and "churn") efficiently and reliably.
– Bound the "spread" of each object (to reduce cost). Warning: it's a consensus problem!

What is the consistency model for massive stores?
– Can we relax consistency for better scaling? Do we have to?
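One standard answer to "find the right server and adapt to churn" is consistent hashing: hash nodes and keys onto the same ring, and assign each key to the first node clockwise from its hash. A minimal sketch with made-up node names (no virtual nodes or replication shown):

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Sort node hash points around the ring.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # The key is owned by the first node clockwise from its hash.
        hashes = [p for p, _ in self.points]
        i = bisect_right(hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
```

When a node joins or leaves, only the keys on the adjacent arc of the ring move; the other nodes keep their keys, which is what makes churn cheap relative to hash-mod-N.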

Key-value stores

Data objects are named in a "flat" key space (e.g., "serial numbers"). K-V is a simple and clean abstraction that admits a scalable, reliable implementation: a major focus of R&D. Is put/get sufficient to implement non-trivial apps?

[Figure: a distributed application issues put(key, data) and get(key) to a distributed hash table; a lookup service maps lookup(key) to a node IP address. Image from Morris, Stoica, Shenker, et al.]
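The structure in the figure can be sketched directly: the application calls put/get, which consults a lookup service to find the node responsible for a key. All names and addresses below are illustrative stand-ins:

```python
class LookupService:
    """Maps lookup(key) -> node address, as in the figure."""
    def __init__(self, nodes):
        self.nodes = nodes

    def lookup(self, key: str):
        # A deterministic placement function stands in for a real DHT lookup.
        return self.nodes[sum(key.encode()) % len(self.nodes)]

class DHT:
    """put/get routed through the lookup service to per-node storage."""
    def __init__(self, node_addrs):
        self.lookup_svc = LookupService(node_addrs)
        self.storage = {addr: {} for addr in node_addrs}  # one dict per node

    def put(self, key: str, data):
        self.storage[self.lookup_svc.lookup(key)][key] = data

    def get(self, key: str):
        return self.storage[self.lookup_svc.lookup(key)].get(key)

dht = DHT(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
dht.put("photo:99", b"...bytes...")
```

Because put and get route through the same deterministic lookup, a get always lands on the node that holds the key.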

Service-oriented architecture of Amazon's platform. Dynamo is a scalable, replicated key-value store.

Memcached is a scalable in-memory key-value cache.
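Memcached is typically used in the demand-side ("look-aside") pattern: check the cache first, fall back to the data tier on a miss, and invalidate on writes. A sketch with a plain dict standing in for memcached and for the backing store:

```python
class LookAsideCache:
    """Demand-filled cache in front of a slower data tier (look-aside pattern)."""
    def __init__(self, backing_store):
        self.cache = {}
        self.store = backing_store  # e.g., a database; here just a dict

    def get(self, key):
        if key in self.cache:            # hit: serve from memory
            return self.cache[key]
        value = self.store.get(key)      # miss: fetch from the data tier
        self.cache[key] = value          # fill the cache for next time
        return value

    def put(self, key, value):
        self.store[key] = value
        self.cache.pop(key, None)        # invalidate so readers see the update
```

Invalidate-on-write (rather than update-on-write) is the common choice because it keeps the cache from serving a stale value if the store write succeeds but a cache update would have been lost.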

Storage services: 31 flavors

Can we build richly functional services on a scalable data tier that is "less" than an ACID database or even a consistent file system? People talk about the "NoSQL movement" to scale the data tier beyond classic databases, but there's a long history. Today most of the active development in scalable storage is in key-value stores.

Load spreading and performance

What effect does load spreading across N units have on performance, relative to 1 unit?
– What effect does it have on throughput?
– What effect does it have on response time?
– How does the workload affect the answers? What if the accesses follow a skewed distribution, so some items are more "popular" than others?

"Hot spot" bottlenecks

What happens if the workload references items according to a skewed popularity distribution? Some items are "hot" (popular) and some are "cold" (rarely used). A read or write of a stored item must execute where the item resides. The servers/disks/units that store the "hot" items get more requests, resulting in an unbalanced load: they become "hot" units. The "hot" units saturate before the others (bottleneck or hot spot). Requests for items on "hot" units have longer response times. (Why?)

A bottleneck limits throughput and/or may increase response time for some class of requests.
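The effect is easy to see in a small simulation: draw requests from a Zipf-like popularity distribution, place items on units by simple modulo, and compare per-unit request counts. The parameters here are illustrative:

```python
import random
from collections import Counter

random.seed(1)
N_UNITS, N_ITEMS, N_REQS = 8, 1000, 10_000

# Zipf-like popularity: item i is requested with weight ~ 1/(i+1).
weights = [1.0 / (i + 1) for i in range(N_ITEMS)]
requests = random.choices(range(N_ITEMS), weights=weights, k=N_REQS)

# Place item i on unit i % N_UNITS; count requests per unit.
load = Counter(item % N_UNITS for item in requests)

# The unit holding item 0 (the hottest item) becomes the hot spot:
# it serves far more than its "fair share" of N_REQS / N_UNITS requests.
```

Even with items spread evenly by count, the unit that happens to hold the hottest items saturates first; that unit's queue grows, and response times for its items stretch while other units sit partly idle.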

What about failures?

Systems fail. Here's a reasonable set of assumptions about failure properties for servers/bricks (or disks):
– Fail-stop or fail-fast fault model.
– Nodes either function correctly or remain silent.
– A failed node may restart, or not.
– A restarted node loses its memory state, and recovers its secondary (disk) state.

If failures are random/independent, the probability of some failure is linear with the number of units. Higher scale → less reliable!
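The "higher scale → less reliable" claim is just independence arithmetic: if each unit fails in some interval with probability p, the chance that at least one of N units fails is 1 − (1 − p)^N, which grows roughly linearly in N while p·N is small:

```python
def p_some_failure(p_unit: float, n_units: int) -> float:
    """Probability that at least one of n independent units fails."""
    return 1.0 - (1.0 - p_unit) ** n_units

# With a 1% per-unit failure probability:
#   1 unit    -> 0.01
#   100 units -> ~0.63, so some failure is the common case at scale.
```

This is why redundancy policies (replication, parity) are not optional at scale: the system must expect partial failure as the normal operating condition.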

"Declustering" data

[Figure: clients issue write A, write B, and write C; each write goes to the distinct brick holding that item's state, so the writes proceed in parallel. Drawing adapted from Barbara Liskov.]

"Declustering" data

[Figure: clients issue read A, read B, and read C; each read is served by the brick holding that item's state. Drawing adapted from Barbara Liskov.]

Replicating data

[Figure: a coordinator issues write B to multiple bricks, each holding a copy of the item's state. Drawing adapted from Barbara Liskov.]

Replicating data

[Figure: a coordinator issues read B, served by one of the bricks holding a copy. Drawing adapted from Barbara Liskov.]

Replicating data

[Figure: one brick holding a copy of B has failed (marked X); the coordinator's read B is served by a surviving replica. Drawing adapted from Barbara Liskov.]

Replicating data

[Figure: one brick holding a copy of B has failed (marked X); the coordinator's write B completes at the surviving replicas. Drawing adapted from Barbara Liskov.]

Scalable storage: summary of drawings

The items A, B, C could be blocks, or objects (files), or any other kind of read/write service request.

The system can write different items to different nodes, to enable reads/writes on those items to proceed in parallel (declustering).
– How does declustering affect throughput and response time?

The system can write copies of the same item to multiple nodes (replication), to protect the data against failure of one of the nodes.
– How does replication affect throughput and response time?
– Replication → multiple reads of the same item may proceed in parallel.
– When a client reads an item, it can only read it from a node that has an up-to-date copy.

Where to put the data? How to keep track of where it is? How to keep the data up to date? How to adjust to failures (node "churn")?
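One classic way to "keep the data up to date" across replicas is a read/write quorum: write to W replicas and read from R, with R + W > N, so every read quorum overlaps the most recent write quorum. A failure-free sketch (the replica layout and version clock are illustrative, not any particular system's protocol):

```python
class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "quorums must overlap"
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0  # a simple global version clock

    def put(self, key, value):
        self.version += 1
        for rep in self.replicas[:self.w]:   # a write quorum of W replicas
            rep[key] = (self.version, value)

    def get(self, key):
        # Read a (deliberately different) quorum of R replicas; the overlap
        # guarantees at least one of them holds the latest version.
        reads = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(reads)[1]                 # highest version wins

store = QuorumStore()
store.put("A", "v1")
store.put("A", "v2")
```

The overlap argument is the whole point: with N=3, W=2, R=2, any two-replica read set intersects any two-replica write set, so a reader can always find an up-to-date copy by taking the highest version it sees.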

Recap: scalable data (an abstract model)

Requests (e.g., reads and writes on blocks) arrive. Pending requests build up on one or more queues, as modeled by queueing theory (if the assumptions of the theory are met). A dispatcher with a request-routing policy draws requests from the queues and dispatches them to an array of N functional units ("bricks": disks, or servers, or disk servers).

Throughput depends on a balanced distribution, ideally with low spread (for locality and cache performance). Throughput (as a function of N) also depends in part on the redundancy policy chosen to protect against failures of individual bricks.

This model applies to a service cluster serving clients, or to an I/O system receiving block I/O requests from a host, or both.
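Under the model's assumptions (Poisson arrivals, exponential service times, perfectly balanced routing), each brick behaves like an M/M/1 queue, and adding bricks helps response time by lowering per-unit utilization. A sketch using the standard M/M/1 formula R = S / (1 − U):

```python
def response_time(service_time_s: float, arrival_rate: float, n_units: int) -> float:
    """Mean response time for N balanced M/M/1 units.

    Per-unit utilization U = (arrival_rate / n_units) * service_time,
    and mean response time R = S / (1 - U).
    """
    utilization = (arrival_rate / n_units) * service_time_s
    if utilization >= 1.0:
        raise ValueError("saturated: throughput is capped at n_units / service_time")
    return service_time_s / (1.0 - utilization)

# 1000 req/s at 1 ms service demand: 2 bricks run at U = 0.5 (R = 2 ms),
# 4 bricks at U = 0.25 (R ~ 1.33 ms). Skewed routing breaks this math:
# the hot unit's U approaches 1 and its R blows up, as in the hot-spot slide.
```

The formula also makes the throughput cap explicit: aggregate throughput can never exceed n_units / service_time, no matter how the dispatcher routes requests.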