
1 CMSC 34702-1 Cluster Computing Basics
Junchen Jiang, The University of Chicago, October 8, 2018

2 MapReduce: Simplified Data Processing on Large Clusters; The Google File System; Bigtable: A Distributed Storage System for Structured Data; Cassandra: A Decentralized Structured Storage System

3 Agenda
Cluster Computing (focus on "private cloud" today); the "Google Stack": MapReduce, The Google File System, Bigtable; Lessons

4 Scaling up vs. Scaling out
Scale-up: high-end servers (Sun Starfire, Enterprise, …), ~$1 million apiece; used by eBay, …
Scale-out: many "Commercial Off-The-Shelf" (COTS) computers (Google had ~15,000 of them c. 2004)

5 Price/Performance Comparison (c. 2004)
High-end server rack: 8 x 2GHz Xeon CPUs, 64GB RAM, 8TB disk, $758K
Rack of COTS nodes: 176 x 2GHz Xeon CPUs, 176GB RAM, 7.04TB disk ({dual CPUs, 2GB RAM, 80GB disk} x 88), $278K
Higher performance and cheaper! Too good to be true?

6 Disadvantages of a cluster of COTS nodes?
[Figure: high-end server vs. rack of COTS computers, compared on CPU, RAM, and disk]

7 New problems in distributed/cluster computing
Fault tolerance, network traffic, data consistency, programming complexity

8 Cluster Computing Needs a Software Stack
Typical software analytics stack:
Layer         | Google                       | Hadoop                                 | Berkeley
Data mngt     | The Google File System (GFS) | Hadoop Distributed File System (HDFS)  | Alluxio
Processing    | MapReduce                    | MapReduce                              | Spark
Database      | Bigtable                     | HBase                                  | Shark
Resource mngt | Borg                         | YARN                                   | Mesos

9 MapReduce: Simplified Data Processing on Large Clusters

10 Why is parallelization difficult?
If the initial state is x=6, y=0, what happens when these threads finish running?
Thread 1: void foo(){ x++; y = x; }
Thread 2: void bar(){ y++; x += 3; }
Multithreading = unpredictability
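To make the unpredictability concrete, here is a minimal sketch (not from the slides; it assumes each statement executes atomically, which real hardware does not guarantee) that enumerates every interleaving of the two threads and prints the distinct final states:

```python
# Each statement maps the shared state (x, y) to a new state.
thread1 = [lambda s: (s[0] + 1, s[1]),   # x++
           lambda s: (s[0], s[0])]       # y = x
thread2 = [lambda s: (s[0], s[1] + 1),   # y++
           lambda s: (s[0] + 3, s[1])]   # x += 3

def interleavings(a, b):
    """Yield all orderings that preserve each thread's program order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

results = set()
for order in interleavings(thread1, thread2):
    state = (6, 0)                        # initial state x=6, y=0
    for op in order:
        state = op(state)
    results.add(state)

print(sorted(results))   # several distinct (x, y) outcomes: (10, 7), (10, 8), (10, 10)
```

Even with only four statements, the final values of x and y depend on the scheduler, which is exactly why unrestricted shared-memory multithreading is hard to reason about.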

11 Functional Programming
[Figure: imperative updates (x++; y = x; y++; x += 3) on shared variables vs. pure functions f: X -> A and f: Y -> B]
Functional programming: no mutable variables, no changing state, no side effects
In contrast, with shared mutable state: states can change (not idempotent), and there are too many interdependent variables

12 Key Functional Programming ops: map & fold
[Figure: map applies f to each element independently (X -> X', Y -> Y', Z -> Z'); fold repeatedly applies f to combine the elements into one accumulated value]
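As a small illustration (not from the slides), here is what map and fold look like on an ordinary Python list:

```python
# map: apply a pure function to every element independently.
# fold (reduce): combine the elements into a single accumulated value.
from functools import reduce

xs = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, xs))               # -> [1, 4, 9, 16]
total   = reduce(lambda acc, x: acc + x, squared, 0)   # -> 30
print(squared, total)
```

Because neither operation mutates shared state, each application of f can run anywhere and in any order, which is exactly the property a cluster scheduler needs.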

13 MapReduce: An instantiation of “map” & “fold”
[Figure: map turns input records (key_1, val_1), (key_2, val_2) into intermediate pairs (key_a, val_11), (key_b, val_12), (key_b, val_21), (key_c, val_22); after grouping by key, reduce produces (key_a, R([val_11])), (key_b, R([val_12, val_21])), (key_c, R([val_22]))]
Example: Count word occurrences
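A minimal single-process sketch of the word-count example (the shuffle/group-by-key step is done here with a plain dictionary; in the real system the framework performs it across machines):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all counts emitted for the same word.
    return (word, sum(counts))

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}

# Toy "shuffle": group intermediate values by key.
groups = defaultdict(list)
for doc_id, text in docs.items():
    for k, v in map_fn(doc_id, text):
        groups[k].append(v)

print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
```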

14 Rationale behind the MapReduce Interface: A Minimalist Approach
Applications / data analytics algorithms: Google Search, machine learning, graph mining, grep, sort, word counting, …
Interface: Map & Reduce (rather than a general imperative, object-oriented, or functional interface)
Cluster computing system: the MapReduce system
Can you think of another example of the minimalist approach?

15 What’s the contribution of the MapReduce System?
Make it easier to write parallel programs

16 What’s the contribution of the MapReduce System?
Make it easier to write parallel programs: fault tolerance, data locality, load balancing, straggler mitigation, consistency, data integrity

17 System Architecture

18 Performance: Data locality
Co-locate workers with the data Co-locate reducers with mappers

19 Performance: Speeding up “Reducer” with “Combiner”
When can “Combiner” help?
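One answer: a combiner helps when the reduce function is associative and commutative (as in summing word counts), because partial aggregation on the map worker shrinks the data shipped across the network. A hedged sketch (not the real library API):

```python
from collections import Counter

def map_fn(text):
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Combiner: local partial aggregation on the map worker, valid here because
    # summing counts is associative and commutative (same logic as the reducer).
    acc = Counter()
    for w, c in pairs:
        acc[w] += c
    return list(acc.items())

mapper_output = map_fn("to be or not to be")
print(len(mapper_output), len(combine(mapper_output)))  # 6 pairs -> 4 pairs
```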

20 Fault Tolerance: What if a map worker fails?
Re-execute its in-progress and completed map tasks (completed map output sits on the failed worker's local disk, so it is lost with the machine)

21 Fault Tolerance: What if a reduce worker fails?
Re-execute only its in-progress reduce tasks (completed reduce output is already stored in the global file system)

22 Fault Tolerance: What if the master fails?
Expose the failure to the user, who can retry the computation (a single master is unlikely to fail)

23 Why must reduce() start only after all map() tasks finish?
Why not start “reducing” whenever a new <k,v> pair is produced? Does the complexity justify the performance gain?

24 Mitigating stragglers via re-execution
Re-execute! Will re-executing the task mess up the computation?

25 MapReduce Summary
A minimalist approach: many problems are easily expressible with the MapReduce primitives, which greatly simplifies fault tolerance & performance optimization
(Almost) completely transparent fault tolerance at a large scale
Dramatically eases the burden on programmers, though users still need to step in in some cases…

26 take a break

27 Questions on Piazza
In 4.7 Local Execution, can the sequential case detect all the logical errors that would occur in a distributed environment?
How can mappers and reducers that exchange a lot of traffic be placed on the same physical server?
MapReduce can skip malformed results, but what happens if a worker somehow introduces a well-formed but incorrect result?
Is there a smart resource allocation scheme for mappers and reducers? How does the system load-balance the tasks?
How expensive can sorting of the intermediate keys get between the map and reduce steps?
Does linearly increasing the number of workers also linearly decrease total execution time? Is there an upper bound on the number of workers beyond which adding more does not help?
Can reducing start even when mapping isn't fully complete?
Why is it necessary to have two operations, Map and Reduce, instead of one operation that takes in (part of) an input file and generates an output?
Is this technique only designed to address embarrassingly parallel problems (i.e., ones that trivially decompose into independent subproblems)?

28 The Google File System

29 Why does cluster computing need a new filesystem?
[Figure: on a single machine, `more /foo/bar.txt` goes through the UNIX FS to the local hard drive]

30 Provide a unified view of local and remote File Systems
[Figure: a Virtual File System layer on client and server routes `more /foo/bar.txt` either to the local UNIX FS and hard drive or across the network to the server's UNIX FS]

31 Design choices of a Virtual File System
NFS (Network File System), Sun Microsystems: files can be arbitrary -> block (10K) caching; many reads & writes -> write back every 30 sec; 10s - 100s of users -> ask server on open
AFS (Andrew File System), Carnegie Mellon U.: files can fit on disk/in RAM -> whole-file caching; read-heavy, short lifetime -> write on close; writes are relatively rare -> server callback
Design = F(assumptions)

32 Why does cluster computing need a different file system?
Google's problems are different: high component failure rates; a few million HUGE files (100MB ~ multi-GB); file writes are mostly appends; large streaming reads; high throughput is favored over low latency; only Google apps need to be supported

33 GFS Architecture

34 GFS Architecture: Design choices
Single master (why?); 64MB chunks
Cons? Single point of failure? Small files become hotspots? Performance bottleneck?
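A tiny sketch of why large chunks keep the single master off the data path (the function name here is hypothetical, not the real GFS client API): the client turns a byte offset into a chunk index locally and asks the master only for that chunk's handle and replica locations; the data itself flows directly between client and chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # GFS's fixed 64MB chunk size

def chunk_index(byte_offset):
    # Done locally by the client; only the (file, chunk index) lookup goes to
    # the master, which replies with the chunk handle and replica locations.
    return byte_offset // CHUNK_SIZE

print(chunk_index(200 * 1024 * 1024))   # reading at offset 200MB hits chunk 3
```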

35 Master is not the performance bottleneck

36 GFS Architecture: Design choices
Replication (3+) for reliability. Why not NFS/AFS-style recovery?

37 GFS Architecture: Design choices
No local caching; lease-based mutation. Why delegate control to the primary?

38 GFS Architecture: Design choices
Consistency model under concurrent writes: atomic "at-least-once" appends. What about duplicated records?
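One answer from the GFS paper: readers filter out duplicates (and padding), for example using unique identifiers that writers embed in each record. A hedged sketch of that reader-side filtering, with an invented record format:

```python
def dedup(records):
    # records: (unique_record_id, payload) pairs as read back from the file.
    seen, out = set(), []
    for rec_id, payload in records:
        if rec_id not in seen:        # skip duplicates left by retried appends
            seen.add(rec_id)
            out.append(payload)
    return out

print(dedup([(1, "a"), (2, "b"), (1, "a"), (3, "c")]))   # -> ['a', 'b', 'c']
```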

39 Design = F(workload, environment)
Google's assumptions: high component failure rates; a few million HUGE files (100MB ~ multi-GB); file writes are mostly appends; large streaming reads; high throughput favored over low latency; support Google apps only
GFS design choices: single master; large file chunks (64MB); replication (3+) for reliability; no local caching; lease-based mutation; atomic "at-least-once" appends

40 GFS Summary
Design = F(assumptions): optimize for a given workload
Simple architecture: highly scalable, fault tolerant

41 take a break

42 Questions
How does GFS handle network partitions?
Can we change the chunk size later without rebooting the system?
Is it possible that the data stored in a chunk is much smaller than 64MB, wasting space?
How does a fixed 64MB chunk size affect network traffic, and thus user latency?
The master server is the center of this system, so is it still the bottleneck? Although it has a much lighter job than the chunkservers, read/write performance in Figure 3 is still well below the network upper bound.
How does GFS handle hotspot chunks?
How can the master node be kept from becoming a performance bottleneck?
If GFS was designed to accommodate appends, why are they still so slow?
How does GFS exploit locality?

43 BigTable: A Distributed Storage System for Structured Data

44 Motivation
Highly available distributed storage for structured data: web content (URLs, page content, PageRank index, …); geographical data (geo-location, satellite images, …); user information (preferences, history, queries, …)
Large scale: petabytes of data across thousands of commodity servers; 1.2 million requests per second; 10s of terabytes of satellite images

45 Bigtable
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map." (from the Bigtable paper)
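A hedged, in-memory sketch of that definition: the map is keyed by (row key, column key, timestamp) and its values are uninterpreted byte strings (the webtable rows below follow the example used in the Bigtable paper):

```python
# (row key, column key, timestamp) -> uninterpreted bytes
table = {
    ("com.cnn.www", "contents:",        6): b"<html>...v6...",
    ("com.cnn.www", "contents:",        5): b"<html>...v5...",
    ("com.cnn.www", "anchor:cnnsi.com", 9): b"CNN",
}

# Reading the most recent version of a cell = highest timestamp for that (row, column).
row, col = "com.cnn.www", "contents:"
latest = max((k for k in table if k[0] == row and k[1] == col), key=lambda k: k[2])
print(table[latest])
```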

46 Does the benefit of Tablets ring any bells to you?
Tablet = range of contiguous rows; the unit of distribution and load balancing
Rows in the same tablet are usually co-located; usually, MB per tablet
Users can "sort of" control which rows are in the same tablet, e.g., store maps.google.com/index.html under key com.google.maps/index.html
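A small sketch of that row-key trick: reversing the hostname puts pages from the same domain on adjacent row keys, so they tend to land in the same tablet.

```python
def row_key(host, path):
    # maps.google.com -> com.google.maps, then append the path
    return ".".join(reversed(host.split("."))) + path

print(row_key("maps.google.com", "/index.html"))   # -> com.google.maps/index.html
```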

47 Timestamps
Each cell in a Bigtable can contain multiple versions of the same data
Versions are indexed by a 64-bit timestamp: real time or assigned by the client
Column-family-based garbage collection: keep only the latest n versions, or keep only versions written since time t

48 Three-level tablet hierarchy
Each METADATA tablet holds 128MB (in RAM); each tablet needs 1KB of METADATA. How many tablets can be addressed?
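Working the slide's numbers through (this matches the 2^34 figure given in the Bigtable paper):

```python
METADATA_TABLET = 128 * 2**20    # 128MB per METADATA tablet
PER_TABLET_META = 1 * 2**10      # ~1KB of METADATA per user tablet

entries_per_metadata_tablet = METADATA_TABLET // PER_TABLET_META   # 2**17 = 131072
addressable_tablets = entries_per_metadata_tablet ** 2             # two METADATA levels
print(entries_per_metadata_tablet, addressable_tablets)            # 131072  17179869184 (= 2**34)
```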

49 Bigtable: A combination of many building blocks
GFS: stores log & data files. SSTable: sorted <key, value> data. Chubby: Paxos-based distributed lock service

50 Bigtable over GFS
Tablet recovery: METADATA stores the list of SSTables that comprise a tablet and a set of redo points; the tablet server reconstructs the memtable by applying all updates that have committed since the redo points
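A hedged sketch of that replay step (the log format below is invented for illustration): state already persisted in SSTables is loaded as-is, and only committed log entries at or after the redo point are replayed to rebuild the memtable.

```python
def recover_memtable(redo_point, commit_log):
    # commit_log: committed (sequence_number, key, value) entries, in order.
    memtable = {}
    for seqno, key, value in commit_log:
        if seqno >= redo_point:      # older updates are already persisted in SSTables
            memtable[key] = value
    return memtable

log = [(1, "a", "old"), (2, "b", "x"), (3, "a", "new")]
print(recover_memtable(redo_point=2, commit_log=log))   # {'b': 'x', 'a': 'new'}
```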

51 Bigtable vs. Relational Databases
Clients have control over data layout and storage locality
Only single-row transactions, no multi-row transactions (follow-up project: Percolator, OSDI 2010)
API: not SQL (no complex queries). What if people don't get used to it?

52 Performance: Horizontally scalable

53 Questions on Piazza
How fast is load balancing / how long does it take to recover from failed machines?
If tablets can be unassigned, how can we access the data in them?
How does Bigtable monitor a large set of servers?
How does Bigtable scale to datasets that are rapidly growing?
When inserting a new row, does all the data below that row need to be moved to make space?
Since Bigtable runs on top of GFS, were any optimizations made between the two systems, or were they designed completely separately?
In 5.3 Tablet Serving, wouldn't it be more efficient to save the final state of the data and retrieve it, instead of reconstructing it from the commit log?
What if many clients write to the same row? Will this cause lock contention, since reads and writes to it have to be atomic?

54 Summary
Huge impact
Design lessons: deeply understand the workloads -> hard tradeoffs (it's OK not to be good at everything); simple systems are much easier to scale and to make fault tolerant
Systems research: new problem vs. old problem with new assumptions; what's "fundamental" in systems research?
See "Building Software Systems at Google and Lessons Learned" by Jeff Dean

55 Reminders
Post on Piazza (the earlier the better); use "Note", not "Questions"
Project proposal due in one week! Find a teammate (if you want one) on Piazza, the mailing list, …; each group should schedule a discussion with the instructor
Next lecture: Streaming Analytics, Cassandra (CAP theorem)

