
1 Detour: Distributed Systems Techniques & Case Studies I (CSci8211) • Distributing (Logically) Centralized SDN Controllers • the NIB needs to be maintained by multiple (distributed) SDN controllers • multiple SDN controllers may need to concurrently read or write the same shared state • a Distributed State Management Problem! • Look at three case studies from distributed systems • Google File System (GFS) • Amazon’s Dynamo • Yahoo!’s PNUTS

2 Distributed Data Stores & Consistency Models: Availability & Performance vs. Consistency Trade-offs • Traditional (Transactional) Database Systems: • Query Model: more expressive query language, e.g., SQL • ACID Properties: Atomicity, Consistency, Isolation and Durability • Efficiency: very expensive to implement at large scale! • Many real Internet applications/systems do not require strong consistency, but require high availability • Google File System: many reads, few writes (mostly appends) • Amazon’s Dynamo: simple query model, small data objects, but needs to be “always-writable” at massive scale • Yahoo!’s PNUTS: databases with relaxed consistency for web apps requiring more than “eventual consistency” (e.g., ordered updates) • Implicit/Explicit Assumptions: applications often can tolerate, or know best how to handle, inconsistencies (if they happen rarely), but care more about availability & performance

3 Data Center and Cloud Computing Data center: large server farms + data warehouses –not simply for web/web services –managed infrastructure: expensive! From web hosting to cloud computing –individual web/content providers: must provision for peak load, which is expensive, and typically resources are under-utilized –web hosting: third party provides and owns the (server farm) infrastructure, hosting web services for content providers –“server consolidation” via virtualization [Diagram: virtualization stack; applications and guest OSes (under client web-service control) running on a VMM]

4 Cloud Computing Cloud computing and cloud-based services: –beyond web-based “information access” or “information delivery” –computing, storage, … Cloud Computing: NIST Definition "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." Models of Cloud Computing –“Infrastructure as a Service” (IaaS), e.g., Amazon EC2, Rackspace –“Platform as a Service” (PaaS), e.g., Microsoft Azure –“Software as a Service” (SaaS), e.g., Google

5 Data Centers: Key Challenges With thousands of servers within a data center, How to write applications (services) for them? How to allocate resources, and manage them? –in particular, how to ensure performance, reliability, availability, … Scale and complexity bring other key challenges –with thousands of machines, failures are the default case! –load-balancing, handling “heterogeneity,” … Data center (server cluster) as a “computer”: “super-computer” vs. “cluster computer” –a single “super-high-performance” and highly reliable computer –vs. a “computer” built out of thousands of “cheap & unreliable” PCs –pros and cons?

6 Google Scale and Philosophy Lots of data –copies of the web, satellite data, user data, email and USENET, Subversion backing store Workloads are large and easily parallelizable No commercial system big enough –couldn’t afford it if there was one –might not have made appropriate design choices –but truckloads of low-cost machines: 450,000 machines (NYTimes estimate, June 14th, 2006) Failures are the norm –even reliable systems fail at Google scale Software must tolerate failures –which machine an application is running on should not matter –firm believers in the “end-to-end” argument Care about perf/$, not absolute machine perf

7 Typical Cluster at Google [Diagram: cluster-wide services (cluster scheduling master, lock service, GFS master) sit above a row of commodity machines; each machine runs Linux with a scheduler slave, a GFS chunkserver, and user tasks; some machines also host a BigTable server or the BigTable master]

8 Google: System Building Blocks Google File System (GFS): –raw storage (Cluster) Scheduler: –schedules jobs onto machines Lock service: –distributed lock manager –also can reliably hold tiny files (100s of bytes) w/ high availability Bigtable: –a multi-dimensional database MapReduce: –simplified large-scale data processing …

9 Google File System Key Design Considerations Component failures are the norm – hardware component failures, software bugs, human errors, power supply issues, … –Solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery Files are huge by traditional standards –multi-GB files are common, billions of objects –most writes (modifications or “mutations”) are “append” –two types of reads: large # of “stream” (i.e., sequential) reads, with small # of “random” reads High concurrency (multiple “producers/consumers” on a file) –atomicity with minimal synchronization Sustained bandwidth more important than latency

10 GFS Architectural Design A GFS cluster: –a single master + multiple chunkservers per master –running on commodity Linux machines A file: a sequence of fixed-size chunks (64 MB each) –labeled with 64-bit unique global IDs –stored at chunkservers (as “native” Linux files, on local disk) –each chunk mirrored across (default 3) chunkservers Master server: maintains all metadata –name space, access control, file-to-chunk mappings, garbage collection, chunk migration –why only a single master? (with read-only shadow masters) simple, and it only answers chunk-location queries to clients! Chunkservers (“slaves” or “workers”): –interact directly with clients, perform reads/writes, …

11 GFS Architecture: Illustration GFS clients –consult master for metadata –typically ask for multiple chunk locations per request –access data directly from chunkservers Separation of control and data flows

12 Chunk Size and Metadata Chunk size: 64 MB –fewer chunk-location requests to the master –client can perform many operations on a chunk reduces overhead to access a chunk can establish a persistent TCP connection to a chunkserver –fewer metadata entries metadata can be kept in memory (at master) in-memory data structures allow fast periodic scanning –some potential problems with fragmentation Metadata: –file and chunk namespaces (files and chunk identifiers) –file-to-chunk mappings –locations of each chunk’s replicas
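To make the master-side bookkeeping concrete, here is a minimal Python sketch of the three kinds of metadata listed above; the class and field names (MasterMetadata, ChunkInfo, etc.) are illustrative assumptions, not Google's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                   # 64-bit globally unique chunk ID
    version: int = 1              # bumped when a new lease is granted
    locations: list = field(default_factory=list)   # chunkserver addresses (soft state)

class MasterMetadata:
    """Hypothetical sketch of the GFS master's in-memory state."""
    def __init__(self):
        self.namespace = {}       # full path name -> file attributes (prefix-compressed in real GFS)
        self.file_chunks = {}     # full path name -> ordered list of chunk handles
        self.chunks = {}          # chunk handle -> ChunkInfo

    def chunk_for(self, path, chunk_index):
        handle = self.file_chunks[path][chunk_index]
        return self.chunks[handle]
```

With 64 MB chunks, even very large files need relatively few entries, which is what lets all of this stay in one machine's memory.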

13 Chunk Locations and Logs Chunk locations: –master does not keep a persistent record of chunk locations –polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity! because of chunkserver failures, it is hard to keep a persistent record of chunk locations –on-demand approach vs. coordination: on-demand wins when changes (failures) are frequent Operation logs –maintain a historical record of critical metadata changes –namespace and mapping –for reliability and consistency, replicate the operation log on multiple remote machines (“shadow masters”)
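A rough sketch of the "don't persist chunk locations" idea: the directory below is soft state, rebuilt from chunkserver heartbeat reports. The report format and method names are assumptions made for illustration.

```python
class ChunkLocationDirectory:
    """Chunk locations as soft state, rebuilt from chunkserver heartbeats."""
    def __init__(self):
        self.locations = {}   # chunk handle -> set of chunkserver IDs

    def on_heartbeat(self, chunkserver_id, reported_handles):
        # Record every chunk this server claims to hold.
        for handle in reported_handles:
            self.locations.setdefault(handle, set()).add(chunkserver_id)

    def on_server_dead(self, chunkserver_id):
        # No persistent record to repair: simply forget this server's replicas.
        for holders in self.locations.values():
            holders.discard(chunkserver_id)
```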

14 Clients and APIs GFS not transparent to clients –requires clients to perform certain “consistency” verification (using chunk id & version #), make snapshots (if needed), … APIs: –open, delete, read, write (as expected) –append: at least once, possibly with gaps and/or inconsistencies among clients –snapshot: quickly create a copy of a file Separation of data and control: –clients issue control (metadata) requests to the master server –clients issue data requests directly to chunkservers Clients cache metadata, but do no caching of data –no consistency difficulties among clients –streaming reads (read once) and append writes (write once) don’t benefit much from caching at the client

15 System Interaction: Read Client sends master: –read(file name, chunk index) Master’s reply: –chunk ID, chunk version #, locations of replicas Client sends to “closest” chunkserver w/ replica: –read(chunk ID, byte range) –“closest” determined by IP address on a simple rack-based network topology Chunkserver replies with data
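The read path above can be sketched as follows; master.lookup, connect, and pick_closest are hypothetical stand-ins for the RPC and topology layer, and the 64 MB chunk size comes from the earlier slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, from the earlier slide

def gfs_read(master, connect, pick_closest, filename, offset, length):
    """'master', 'connect', and 'pick_closest' are hypothetical stubs for the RPC/topology layer."""
    # 1. Turn the byte offset into a chunk index and ask the master for that chunk's metadata.
    chunk_index = offset // CHUNK_SIZE
    chunk_id, version, replicas = master.lookup(filename, chunk_index)

    # 2. Read the byte range directly from the "closest" replica (rack-aware choice by IP address).
    chunkserver = connect(pick_closest(replicas))
    start = offset % CHUNK_SIZE
    return chunkserver.read(chunk_id, version, (start, start + length))
```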

16 System Interactions: Write and Record Append Write and Record Append –slightly different semantics: record append is “atomic” The master grants a chunk lease to one chunkserver (the primary), and replies back to the client Client first pushes data to all chunkservers –pushed linearly: each replica forwards as it receives –pipelined transfer: 13 MB/second with a 100 Mbps network Then issues a write/append to the primary chunkserver Primary chunkserver determines the order of updates to all replicas –in record append: primary chunkserver checks whether the record append would exceed the maximum chunk size –if yes, pad the chunk (and ask secondaries to do the same), and then ask the client to append to the next chunk
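A hedged sketch of the record-append control flow described above; the RPC names and the PAD_AND_RETRY status are invented for illustration, the real protocol is more involved.

```python
def record_append(client, master, filename, record):
    """Hypothetical sketch of record append; 'client', 'master' and the replica stubs are assumed."""
    while True:
        # Master replies with the file's last chunk, the lease-holding primary, and the secondaries.
        chunk_id, primary, secondaries = master.last_chunk_and_lease(filename)

        # 1. Push the data linearly along the chain of chunkservers
        #    (data flow is decoupled from control flow).
        client.push_data([primary] + secondaries, record)

        # 2. Ask the primary to commit; it assigns the offset and the mutation order
        #    that all replicas follow.
        status, offset = primary.append(chunk_id, record)
        if status == "OK":
            return offset            # appended "at least once" at this offset on every replica
        if status == "PAD_AND_RETRY":
            continue                 # record would cross the chunk boundary: chunk was padded,
                                     # so retry against the file's next chunk
        raise IOError(status)        # other errors surface to the caller; client-level retries
                                     # are what can leave duplicates or gaps
```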

17 Leases and Mutation Order Lease: –60-second timeouts; can be extended indefinitely –extension requests are piggybacked on heartbeat messages –after a timeout expires, the master can grant new leases Use leases to maintain a consistent mutation order across replicas Master grants a lease to one of the replicas -> the primary Primary picks a serial order for all mutations Other replicas follow the primary’s order
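A minimal sketch of master-side lease bookkeeping with 60-second timeouts; the LeaseTable class and its methods are assumptions for illustration.

```python
import time

LEASE_SECONDS = 60   # per the slide; extensions are piggybacked on heartbeats

class LeaseTable:
    """Hypothetical master-side bookkeeping for chunk leases."""
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary chunkserver, expiry time)

    def grant_or_extend(self, handle, primary):
        holder, expiry = self.leases.get(handle, (None, 0.0))
        now = time.time()
        if holder in (None, primary) or now >= expiry:
            # Grant to the requested primary, or extend the current holder's lease.
            self.leases[handle] = (primary, now + LEASE_SECONDS)
            return True
        return False   # another chunkserver still holds an unexpired lease
```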

18 Consistency Model Changes to namespace (i.e., metadata) are atomic –done by the single master server! –master uses a log to define a global total order of namespace-changing operations Relaxed consistency for file data –concurrent changes are consistent but “undefined” defined: after a data mutation, a file region is defined if it is consistent and all clients see the entire mutation –an append is atomically committed at least once occasional duplications All changes to a chunk are applied in the same order to all replicas Use version numbers to detect missed updates (stale replicas)
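One way to picture the version-number mechanism (illustrative only): the master bumps a chunk's version whenever it grants a new lease, and any replica carrying an older number is treated as stale.

```python
class ChunkVersionTracker:
    """Hypothetical sketch: the master bumps a chunk's version when granting a new lease,
    so replicas that missed mutations (e.g., while down) can be recognized as stale."""
    def __init__(self):
        self.master_version = {}      # chunk handle -> current version

    def grant_lease(self, handle):
        self.master_version[handle] = self.master_version.get(handle, 0) + 1
        return self.master_version[handle]

    def is_stale(self, handle, replica_version):
        return replica_version < self.master_version.get(handle, 0)

t = ChunkVersionTracker()
v = t.grant_lease("chunk-42")          # replicas holding version v are current
print(t.is_stale("chunk-42", v - 1))   # True: this replica missed the last mutation batch
```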

19 Master Namespace Management & Logs Namespace: files and their chunks –metadata maintained as “flat names”, no hard/symbolic links –full-path-name-to-metadata mapping with prefix compression Each node in the namespace has an associated read-write lock (-> a total global order, no deadlock) –concurrent operations can be properly serialized by this locking mechanism Metadata updates are logged –logs replicated on remote machines –take global snapshots (checkpoints) to truncate logs (checkpoints can be created while updates keep arriving) Recovery –latest checkpoint + subsequent log files
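A small sketch of the prefix-locking rule (read locks on every ancestor, a write lock on the leaf, acquired in one total order so no deadlock is possible); the helper below is hypothetical.

```python
def locks_needed(path, write_leaf=True):
    """Return (path, mode) pairs in a single total order, which rules out deadlock."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = [(p, "read") for p in prefixes[:-1]]                  # read-lock every ancestor
    locks.append((prefixes[-1], "write" if write_leaf else "read"))  # write-lock the leaf
    return sorted(locks)   # consistent global order across all operations

print(locks_needed("/d1/d2/leaf"))
# [('/d1', 'read'), ('/d1/d2', 'read'), ('/d1/d2/leaf', 'write')]
```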

20 Replica Placement Goals: –maximize data reliability and availability –maximize network bandwidth utilization Need to spread chunk replicas across machines and racks Higher priority to re-replicating chunks with lower replication factors Limited resources spent on replication

21 Other Operations Locking operations –one lock per path node, so a directory can be modified concurrently –to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf: each operation acquires read locks on the directories & a write lock on the file –totally ordered locking to prevent deadlocks Garbage Collection: –simpler than eager deletion (copes with unfinished replica creation, lost deletion messages) –deleted files are hidden for three days, then they are garbage collected, combined with other background ops (e.g., taking snapshots) –safety net against accidents
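A toy sketch of the lazy-deletion scheme, assuming a hidden-name prefix and the three-day grace period described above; the class and names are illustrative only.

```python
import time

GRACE_SECONDS = 3 * 24 * 3600   # deleted files stay recoverable for three days

class LazyDeleter:
    """Hypothetical sketch of GFS-style lazy deletion / garbage collection."""
    def __init__(self):
        self.namespace = {}   # path -> metadata
        self.hidden = {}      # hidden path -> (metadata, deletion time)

    def delete(self, path):
        # Deletion just renames the file to a hidden name; chunk data is untouched for now.
        self.hidden["/.deleted" + path] = (self.namespace.pop(path), time.time())

    def background_scan(self):
        # Runs together with other background work (snapshots, rebalancing, ...).
        now = time.time()
        for hpath, (_, when) in list(self.hidden.items()):
            if now - when > GRACE_SECONDS:
                del self.hidden[hpath]   # metadata gone; orphaned chunks reclaimed later
```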

22 Fault Tolerance and Diagnosis Fast recovery –master and chunkservers are designed to restore their state and start in seconds, regardless of termination conditions Chunk replication Data integrity –a chunk is divided into 64-KB blocks –each with its own checksum –verified at read and write times –also background scans for rarely used data Master replication –shadow masters provide read-only access when the primary master is down
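A minimal sketch of the per-block data-integrity check, using CRC32 as a stand-in checksum; the slide only specifies 64-KB blocks, not the checksum function, so that choice is an assumption.

```python
import zlib

BLOCK = 64 * 1024   # 64-KB checksum blocks, per the slide

def block_checksums(chunk_bytes):
    """Compute one checksum per 64-KB block of a chunk (CRC32 as an illustrative stand-in)."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK]) for i in range(0, len(chunk_bytes), BLOCK)]

def verified_read(chunk_bytes, checksums, block_index):
    """Verify a block's checksum before returning it; a mismatch means corruption."""
    block = chunk_bytes[block_index * BLOCK:(block_index + 1) * BLOCK]
    if zlib.crc32(block) != checksums[block_index]:
        raise IOError("checksum mismatch: report to master and read from another replica")
    return block
```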

23 GFS: Summary GFS is a distributed file system that supports large-scale data-processing workloads on commodity hardware –GFS occupies a different point in the design space: component failures as the norm optimize for huge files –success: used actively by Google to support its search service and other applications –but performance may not be good for all apps assumes a read-once, write-once workload (no client caching!) GFS provides fault tolerance –replicating data (via chunk replication), fast and automatic recovery GFS has a simple, centralized master that does not become a bottleneck Semantics not transparent to apps (“end-to-end” principle?) –must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)

24 Highlights of Dynamo Dynamo: a key-value data store at massive scale –used to maintain users’ shopping-cart info Key Design Goals: highly available and resilient at massive scale, while also meeting SLAs! –i.e., all customers have a good experience, not simply most! Target Workload & Usage Scenarios: –simple read/write operations to a (relatively small) data item uniquely identified by a key, usually less than 1 MB –services must be able to configure Dynamo to consistently achieve their latency and throughput requirements –used by internal services: non-hostile environments System Interface: –get(key), put(key, context, object)
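To illustrate only the interface shape (a single-node toy, not the distributed system), here is a sketch where get() returns the object plus an opaque context and put() passes that context back; everything below is assumed for illustration.

```python
class TinyDynamoLikeStore:
    """Single-node toy showing the get/put-with-context interface shape."""
    def __init__(self):
        self.data = {}    # key -> (object, version counter)

    def get(self, key):
        obj, version = self.data.get(key, (None, 0))
        context = {"version": version}     # opaque to the caller, handed back on put()
        return obj, context

    def put(self, key, context, obj):
        # The context says which version this write is based on; in the real system
        # that is how concurrent (conflicting) updates are detected.
        self.data[key] = (obj, context["version"] + 1)

store = TinyDynamoLikeStore()
cart, ctx = store.get("cart:alice")
store.put("cart:alice", ctx, (cart or []) + ["book"])
```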

25 Amazon Service-Oriented Architecture [figure-only slide]

26 Dynamo: Techniques Employed [figure-only slide]

27 Dynamo: Key Partitioning & Replication & Sloppy Quorum for Read/Write # of key replicas >= N (here N = 3) Each key is associated with a preference list of N ranked (virtual) nodes Read via get(): read from all N replicas; success if receiving R responses Write via put(): write to all N replicas; success if receiving W-1 “write OK” acks (in addition to the coordinator’s own local write) Sloppy Quorum: R + W > N • each read is handled by a (read) coordinator -- any node in the ring is fine • each write is also handled by a (write) coordinator -- the highest-ranked available node in the preference list
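A sketch of the coordinator-side quorum logic under the stated N, R, W; the replica stubs and their get/put methods are assumptions, and requests are shown sequentially although they would really be issued in parallel.

```python
N, R, W = 3, 2, 2   # replication factor and quorum sizes; R + W > N

def quorum_read(preference_list, key):
    """'preference_list' is assumed to hold the first N healthy replica stubs for this key."""
    replies = [node.get(key) for node in preference_list[:N]]
    ok = [r for r in replies if r is not None]
    if len(ok) < R:
        raise RuntimeError("read quorum not met")
    return ok    # coordinator reconciles divergent versions (e.g., via vector clocks)

def quorum_write(preference_list, key, value):
    """Coordinator (highest-ranked available node) writes too, so W acks in total suffice."""
    acks = sum(1 for node in preference_list[:N] if node.put(key, value))
    if acks < W:
        raise RuntimeError("write quorum not met")
```

Because R + W > N, every read quorum overlaps every write quorum, which is what lets a read return at least one up-to-date version even when some replicas are temporarily behind.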

28 Dynamo: Vector Clock [figure: version evolution of an object over time]
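A minimal vector-clock sketch showing how Dynamo-style versions can be compared: one version "descends" from another if it has seen everything the other has, and two versions that do not descend from each other are concurrent (conflicting). The class below is illustrative, not Dynamo's implementation.

```python
class VectorClock(dict):
    """Map from node name -> counter; a minimal sketch of vector-clock versioning."""

    def incremented(self, node):
        vc = VectorClock(self)
        vc[node] = vc.get(node, 0) + 1
        return vc

    def descends(self, other):
        """True if self has seen everything 'other' has (self is newer or equal)."""
        return all(self.get(n, 0) >= c for n, c in other.items())

def conflict(a, b):
    # Neither clock descends from the other: concurrent writes that need reconciliation.
    return not a.descends(b) and not b.descends(a)

v1 = VectorClock().incremented("Sx")        # written at node Sx
v2 = v1.incremented("Sy")                   # later write at Sy (descends from v1)
v3 = v1.incremented("Sz")                   # concurrent write at Sz
print(v2.descends(v1), conflict(v2, v3))    # True True
```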

29 Highlights of PNUTS PNUTS: a massively parallel and geographically distributed database system for Yahoo!’s web apps –data storage organized as hashed or ordered tables –hosted, centrally managed, geographically distributed service with automated load-balancing & fail-over Target Workload –managing session states, content meta-data, user-generated content such as tags & comments, etc. for web applications Key Design Goals: –scalability –response time and geographic scope –high availability and fault tolerance –relaxed consistency guarantees that are stronger than the eventual consistency provided by GFS & Dynamo

30 PNUTS Overview Data model and Features –exposes a simple relational model to users, & supports single-table scans with predicates –includes: scatter-gather ops, async. notification, bulk loading Fault Tolerance –redundancy at multiple levels: data, meta-data, serving components, etc. –leverages the consistency model to support highly available reads & writes even after a failure or partition Pub-Sub Message System: topic-based YMB (message broker) Record-level Mastering: writing synchronously to all copies is too expensive! –instead, make all high-latency ops asynchronous: allow local writes, and use record-level mastering to serve all requests locally Hosting: a hosted service shared by many applications

31 PNUTS Data & Query Model A simplified relational data model –data organized into tables of records with attributes; in addition to typical data types, a “blob” data type is allowed –schemas are flexible: new attributes can be added at any time without halting query or update activity; records are not required to have values for all attributes –each record has a primary key: delete(key)/update(key) Query language: PNUTS supports –selection and projection from a single table –both hashed (for point access) and ordered (for scans) tables: get(key), multi-get(list-of-keys), scan(range[, predicate]) –no support for “complex” queries, e.g., “join” or “group-by” –in the near future, provide an interface to Hadoop, Pig Latin, …
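An in-memory toy (not Yahoo!'s API) illustrating the query shapes listed above: point get, multi-get, and a range scan with a predicate over an ordered table.

```python
import bisect

class OrderedTable:
    """Toy stand-in for a PNUTS-style ordered table supporting the listed operations."""
    def __init__(self, records):
        self.keys = sorted(records)     # primary keys kept in order, enabling range scans
        self.rows = dict(records)

    def get(self, key):
        return self.rows.get(key)

    def multi_get(self, keys):
        return {k: self.rows.get(k) for k in keys}

    def scan(self, lo, hi, predicate=lambda rec: True):
        i, j = bisect.bisect_left(self.keys, lo), bisect.bisect_right(self.keys, hi)
        return [self.rows[k] for k in self.keys[i:j] if predicate(self.rows[k])]

t = OrderedTable({"a": {"v": 1}, "b": {"v": 2}, "c": {"v": 3}})
print(t.get("b"), t.scan("a", "b"))   # {'v': 2} [{'v': 1}, {'v': 2}]
```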

32 PNUTS Consistency Model Applications typically manipulate one record at a time PNUTS provides per-record timeline consistency –all replicas of a given record apply all updates to the record in the same order (one replica is designated as the “master”) –a range of APIs with varying levels of consistency guarantees: read-any, read-critical(required-version), read-latest, write, test-and-set-write(required-version) Future: i) bundled updates; ii) “more” relaxed consistency to cope w/ major (regional data center) failures (record versions are identified as generation.version)
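A toy sketch of the per-record timeline-consistency calls on a single replica; the method names follow the API list above, but the class and its version bookkeeping are simplified assumptions.

```python
class RecordReplica:
    """Toy sketch of PNUTS-style per-record timeline consistency (illustrative only)."""
    def __init__(self):
        self.value, self.version = None, 0     # versions only move forward on the record's timeline

    def read_any(self):
        return self.value                      # may be stale, but never out of timeline order

    def read_critical(self, required_version):
        if self.version < required_version:
            raise RuntimeError("replica too stale; retry or redirect toward the master copy")
        return self.value

    def test_and_set_write(self, required_version, new_value):
        # Applies only if nobody has updated the record since 'required_version'.
        if self.version != required_version:
            return False
        self.value, self.version = new_value, self.version + 1
        return True
```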

33 PNUTS System Architecture [figure: PNUTS system architecture with interval mappings] • Tables are partitioned into tablets; each tablet is stored on one server per region; each tablet: ~100s of MBs to a few GBs • Planned scale: 1,000 servers per region, 1,000 tablets each; keys of ~100 bytes -> interval mapping table fits in 100s of MBs of RAM; tablets of ~500 MB -> a database of ~500 TB
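A sketch of how a router might use the in-RAM interval mapping to find a key's tablet and storage server; the split-point representation below is an assumption for illustration.

```python
import bisect

class IntervalMapping:
    """Hypothetical router-side map: sorted tablet split points -> (tablet index, server)."""
    def __init__(self, split_points, tablet_servers):
        # split_points[i] is the smallest key of tablet i+1;
        # len(tablet_servers) == len(split_points) + 1
        self.split_points = split_points
        self.tablet_servers = tablet_servers

    def lookup(self, key):
        tablet = bisect.bisect_right(self.split_points, key)
        return tablet, self.tablet_servers[tablet]

m = IntervalMapping(["grape", "peach"], ["server-A", "server-B", "server-C"])
print(m.lookup("banana"), m.lookup("plum"))   # (0, 'server-A') (2, 'server-C')
```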

34 Interval Mappings [figures: interval mapping for an ordered table and for a hashed table]

35 PNUTS: Other Features Yahoo! Message Broker (YMB) –topic-based pub/sub system –together w/ PNUTS: the Yahoo! Sherpa data service platform YMB and Wide-Area Data Replication –data updates are considered “committed” once they are published to YMB –YMB asynchronously propagates the update to different regions and applies it to all replicas –YMB provides “logging” and guarantees that all published messages will be delivered to all subscribers –YMB logs are purged only after PNUTS verifies that the update has been applied to all replicas Consistency via YMB and Mastership –YMB provides partial ordering of published messages –per-record mastering: updates are directed to the master first, then propagated to other replicas by publishing to YMB Recovery: can survive storage-unit failures; tablet boundaries are kept in sync across tablet replicas; a lost tablet is recovered by copying from a remote replica

