Distributed Storage Systems

Presentation on theme: "Distributed Storage Systems"— Presentation transcript:
1 Distributed Storage Systems
Zhi Yang School of EECS, Peking University

4 Another Description of 3-Tier Architecture
2.2 System architecture

5 E.g., Internet Search Engine
As a first example, consider an Internet search engine. Ignoring all the animated banners, images, and other fancy window dressing, the user interface of a search engine is very simple: …

7 Distributed File Systems (continued)
One of the most common uses of distributed computing.
Goal: provide the common view of a centralized file system, but with a distributed implementation:
The ability to open and update any file on any machine on the network
All of the synchronization issues and capabilities of shared local files

8 Architecture of a Distributed File System
Figure 1 Architecture of a Distributed File System

11 Use RPC to translate file system calls
Remote disk model: reads and writes are forwarded to the server (a Read RPC returns the data; a Write RPC is acknowledged).
No local caching; caching may happen at the server side.
Advantage: the server provides a completely consistent view of the file system to multiple clients.
Problems? Performance!
Going over the network is slower than going to local memory.
Lots of network traffic, not well pipelined.
The server can become a bottleneck.
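The remote-disk model above can be sketched in a few lines of Python. This is only an illustration: `FileServer` and `RemoteDiskClient` are invented names, and the RPC is simulated by a direct method call.

```python
# Sketch of the remote-disk model: every read/write crosses the "network".
class FileServer:
    """Holds the only copy of each file, so all clients see one consistent view."""
    def __init__(self):
        self.files = {}                   # filename -> contents

    def read(self, name):
        return self.files.get(name, "")

    def write(self, name, data):
        self.files[name] = data
        return "ACK"

class RemoteDiskClient:
    """No local cache: every operation is forwarded to the server."""
    def __init__(self, server):
        self.server = server

    def read(self, name):
        return self.server.read(name)     # an RPC in a real system

    def write(self, name, data):
        return self.server.write(name, data)

server = FileServer()
a, b = RemoteDiskClient(server), RemoteDiskClient(server)
a.write("f1", "v1")
print(b.read("f1"))   # "v1": both clients always see the server's latest data
```

Consistency is trivial here precisely because there is a single copy; the cost is one round trip per operation, which is what the slide's performance complaints are about.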

12 In practice: use a buffer cache at source and destination
Idea: use caching to reduce network load (e.g., one client reads f1:V1 into its cache while another writes f1:V2 through its own cache).
Advantage: if open/read/write/close can be done locally, no network traffic is needed: fast!
Problems:
Failure: a crashed client's cache may hold data not yet committed at the server.
Cache consistency: client caches can become inconsistent with the server and with each other (a client may keep reading V1 after another client has written V2).
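The staleness problem can be made concrete with a minimal sketch, assuming a write-back client cache that flushes on close (all class names here are invented):

```python
# Hypothetical sketch: client-side buffer caches and the staleness they cause.
class Server:
    def __init__(self):
        self.files = {}
    def read(self, name):
        return self.files.get(name, "")
    def write(self, name, data):
        self.files[name] = data

class CachingClient:
    """Write-back cache: reads hit the cache, writes reach the server on close."""
    def __init__(self, server):
        self.server, self.cache = server, {}
    def read(self, name):
        if name not in self.cache:               # miss: one RPC, then local hits
            self.cache[name] = self.server.read(name)
        return self.cache[name]
    def write(self, name, data):
        self.cache[name] = data                  # not yet committed at the server
    def close(self, name):
        self.server.write(name, self.cache[name])

server = Server()
server.write("f1", "V1")
a, b = CachingClient(server), CachingClient(server)
b.read("f1")                  # b caches V1
a.write("f1", "V2")
a.close("f1")                 # server now holds V2
print(b.read("f1"))           # "V1": b's cache is inconsistent with the server
```

Note also the failure case: if client `a` crashed before `close`, V2 would exist only in its cache and be lost.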

13 File Sharing Semantics (1/2)
Problem: when dealing with distributed file systems, we need to take into account the ordering of concurrent read/write operations and the expected semantics (= consistency). On a single processor, when a read follows a write, the value returned by the read is the value just written. In a distributed system with caching, obsolete values may be returned.

14 File Sharing Semantics (2/2)
UNIX semantics (one-copy semantics): a read operation returns the effect of the last write; updates are written to the single copy and are available immediately => can only be implemented for remote access models in which there is only a single copy of the file.
Transaction semantics: the file system supports transactions on a single file (serializability: lock files, shared for read and exclusive for write) => the issue is how to allow concurrent access to a physically distributed file.
Session semantics: the effects of read and write operations are seen only by the client that has opened (a local copy of) the file: copy the file on open, work on the local copy, and copy it back on close => what happens when a file is closed? (Only one client may actually win.)
Note: although NFS in theory follows the remote access model, most implementations use a local cache, effectively implementing the upload/download model.

17 Stateful or stateless design?
Stateful: the server maintains client-specific state.
Shorter requests
Better performance in processing requests
Cache coherence is possible: the server knows who is accessing what
File locking is possible

18 Stateful or stateless design?
Stateless: the server maintains no information on client accesses.
Each request must identify the file and offsets.
The server can crash and recover; the client can crash and recover.
No open/close needed (they only establish state).
No server space is used for state, so there is no need to worry about supporting many clients.
Problems: if a file is deleted on the server, clients may be left holding stale references; file locking is not possible.

19 Consistency and Replication
Observation: in modern distributed file systems, client-side caching is the preferred technique for attaining performance; server-side replication is done for fault tolerance.
Observation: clients are allowed to keep (large parts of) a file, and will be notified when control is withdrawn => servers are now generally stateful.
Figure: using the NFSv4 callback mechanism to recall file delegation.

20 Fault Tolerance
Observation: fault tolerance is handled by simply replicating file servers, generally using a standard primary-backup protocol.

22 NFS
The Sun Network File System (NFS) has become the de facto standard for distributed UNIX file access.
NFS runs over LANs and even WANs (slowly).
Any system may be both a client and a server.
Basic idea: a remote directory is mounted onto a local directory; the remote directory may itself contain mounted directories.
CS-4513, D-Term 2007 Distributed File Systems

23 How do you access them? Access remote files as local files
The remote FS name space should be syntactically consistent with the local name space. Two approaches:
Redefine the way all files are named and provide a syntax for specifying remote files, e.g. //server/dir/file. This can cause legacy applications to fail.
Use a file-system mounting mechanism: overlay portions of another FS name space over the local name space. This makes the remote name space look like it's part of the local name space.

24 Mounting Remote Directories (NFS)

25 Mounting Remote Directories (continued)
Note: the names of files are not unique, as represented by path names.
E.g., Server A sees /users/steen/mbox; Client A sees /remote/vu/mbox; Client B sees /work/me/mbox.
Consequence: file "names" cannot be passed around haphazardly.
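The name-mapping behind this can be sketched as a per-client mount table that rewrites local paths into server-side paths (the client names, mount points, and `resolve` function are all invented for this example, using the paths from the slide):

```python
# Mount-table sketch: each client overlays a remote directory at its own mount
# point, so the same server file has different names on different clients.
MOUNTS = {                      # client -> {mount point: (server, server dir)}
    "ClientA": {"/remote/vu": ("ServerA", "/users/steen")},
    "ClientB": {"/work/me":   ("ServerA", "/users/steen")},
}

def resolve(client, path):
    """Map a client-local path to (server, server-side path)."""
    for mount, (server, sdir) in MOUNTS[client].items():
        if path.startswith(mount + "/"):
            return server, sdir + path[len(mount):]
    return None, path           # not under any mount: a purely local file

print(resolve("ClientA", "/remote/vu/mbox"))   # ('ServerA', '/users/steen/mbox')
print(resolve("ClientB", "/work/me/mbox"))     # same server file, different name
```

Because the translation is per-client, a path string copied from Client A to Client B resolves to something else entirely, which is the slide's warning.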

26 Nested Mounting (NFS)

27 NFS Operations: Lookup and File Handles
Lookup is the fundamental NFS operation: it takes a pathname and returns a file handle.
File handle: the unique identifier of a file within a server; persistent, never reused; storable, but opaque to the client; 64 bytes in NFS v3, 128 bytes in NFS v4.
Most other operations take a file handle as an argument.
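The lookup/handle relationship can be sketched as follows; the handle format and table are invented, and a real server derives handles from on-disk identifiers rather than a counter:

```python
# Sketch of lookup and file handles: pathname in, opaque persistent handle out.
# Later operations (read, write, getattr) take the handle, not the name.
HANDLES = {}          # path -> handle; a real server never reuses a handle
_next = [0]

def lookup(path):
    """The fundamental NFS operation: translate a pathname to a file handle."""
    if path not in HANDLES:
        _next[0] += 1
        HANDLES[path] = "fh-%08d" % _next[0]   # opaque bytes to the client
    return HANDLES[path]

h = lookup("/users/steen/mbox")
print(h == lookup("/users/steen/mbox"))   # True: handles are stable and storable
```

Stability is the point: a client can store a handle across reboots and keep using it, without the server remembering anything about that client.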

28 Other NFS Operations (version 3)
read, write; link, symlink; mknod, mkdir; rename, rmdir; readdir, readlink; getattr, setattr; create, remove.
Conspicuously absent: open, close.

29 Client Caching (1) Client-side caching in NFS.

30 Client Caching (2) Using the NFS version 4 callback mechanism to recall file delegation.

31 Security The NFS security architecture.

32 NFS Pros and Cons
NFS Pros: simple, highly portable.
NFS Cons: sometimes inconsistent! Doesn't scale to a large number of clients: clients must keep checking whether their caches are out of date, and the server becomes a bottleneck due to the polling traffic.

33 Motivating Application: Search
Crawl the whole web. Store it all on "one big disk". Process users' searches on "one big CPU". $$$$$. Doesn't scale.

34 Google Disk Farm: early days … today
CS 5204 – Operating Systems

45 Processing Granularity
From small data to large data:
Single-core, single processor
Single-core, multi-processor
Multi-core, single processor
Multi-core, multi-processor
Cluster of processors (single- or multi-core) with shared memory
Cluster of processors with distributed memory: embarrassingly parallel processing; MapReduce, distributed file system
Grid of clusters: cloud computing
Parallelism at every level: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level).
Bina Ramamurthy 2011

46 A new way to store and analyze data

47 MapReduce Example in my Operating System Class
Pet database (size: TByte) of dogs, cats, snakes, and fish: split → map → combine → reduce → part0, part1, part2.
Bina Ramamurthy 2010

48 Large-scale data splits, Map, and Reduce
Large-scale data splits → Map emits <key, value> pairs (e.g., <key, 1>) → parse/hash partitions the keys → Reducers (say, Count) → output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3).
Bina Ramamurthy 2010
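The split → map → parse/hash → reduce pipeline on these two slides can be sketched end-to-end in a few lines, using the pet-counting example (a single-process toy, not a distributed implementation):

```python
# Minimal MapReduce sketch: split -> map -> shuffle (parse/hash) -> reduce.
from collections import defaultdict

def map_phase(split):
    return [(word, 1) for word in split]           # emit <key, 1> pairs

def shuffle(pairs, n_parts=3):
    """Hash each key to one of n_parts partitions (part0, part1, part2)."""
    parts = [defaultdict(list) for _ in range(n_parts)]
    for key, value in pairs:
        parts[hash(key) % n_parts][key].append(value)
    return parts

def reduce_phase(part):
    return {key: sum(values) for key, values in part.items()}   # Count reducer

data = ["dogs", "cats", "dogs", "fish", "snakes", "dogs", "cats"]
splits = [data[:4], data[4:]]                      # large-scale data splits
pairs = [p for s in splits for p in map_phase(s)]
counts = {}
for part in shuffle(pairs):
    counts.update(reduce_phase(part))
print(counts)    # dogs: 3, cats: 2, fish: 1, snakes: 1 (in some order)
```

Because the hash sends every occurrence of a key to the same partition, each reducer sees all values for its keys, which is what makes the per-key count correct.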

49 Data Flow
Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL
This is the architecture of our backend data warehousing system. The system provides important information on the usage of our website, including but not limited to the number of page views of each page and the number of active users in each country. We generate 3 TB of compressed log data every day. All these data are stored and processed by the Hadoop cluster, which consists of over 600 machines. The summary of the log data is then copied to Oracle and MySQL databases, to make it easy for people to access.
By Harsha Jain

50 HDFS Architecture
The client performs metadata ops against the Namenode, which stores the metadata (name, replicas, ...; e.g., /home/foo/data, 6, ...), and block ops (read, write) against the Datanodes, which store the blocks. Blocks are replicated across Datanodes spread over racks (Rack 1, Rack 2).
Bina Ramamurthy 2010

51 Heartbeat and Blockreport
The Namenode's metadata maps each block to the Datanodes that hold it, e.g. <1, (1,2)>, <2, (2,3)>, <3, (1,3)>: Datanode 1 holds blocks 1 and 3, Datanode 2 holds blocks 1 and 2, Datanode 3 holds blocks 2 and 3.
1. Keeps the metadata consistent with the data.
2. The Namenode uses this mechanism to detect whether a Datanode is dead or alive, and regenerates replicas when it discovers that a Datanode has died.
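The re-replication reaction can be sketched with the exact block map from the slide; the `datanode_died` function and the replica-picking rule (`min` of the eligible live nodes) are simplifications invented for this example:

```python
# Sketch: the Namenode's block map and its reaction to a missed heartbeat.
block_map = {1: {1, 2}, 2: {2, 3}, 3: {1, 3}}   # block -> set of datanodes
alive = {1, 2, 3}
REPLICAS = 2                                     # target replication factor

def datanode_died(node):
    """Called when a datanode stops sending heartbeats: re-replicate its blocks."""
    alive.discard(node)
    for block, nodes in block_map.items():
        nodes.discard(node)                      # metadata now matches reality
        while len(nodes) < REPLICAS and len(nodes) < len(alive):
            nodes.add(min(alive - nodes))        # pick some live node lacking it

datanode_died(2)                                 # datanode 2 misses its heartbeat
print(block_map)   # every block is back to two replicas on the live nodes
```

A real Namenode also uses blockreports in the other direction, to learn which blocks a recovering datanode actually has; this sketch covers only the death-detection half.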

52 Data Flow: File Read
The client first contacts the NameNode to obtain the block locations, then reads the data from the corresponding DataNodes.
This avoids routing all data through the NameNode, which would otherwise become a performance bottleneck.
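The two-step read path can be sketched as follows; the metadata layout, node names, and `read_file` helper are invented for illustration:

```python
# Sketch of the HDFS read path: the NameNode serves only block locations;
# block contents come directly from DataNodes, keeping the NameNode off
# the data path so it does not become a bottleneck.
NAMENODE = {"/logs/day1": [("blk1", "dn-a"), ("blk2", "dn-b")]}   # metadata only
DATANODES = {"dn-a": {"blk1": "hello "}, "dn-b": {"blk2": "hdfs"}}

def read_file(path):
    locations = NAMENODE[path]                  # step 1: ask the NameNode
    return "".join(DATANODES[node][blk]         # step 2: fetch from DataNodes
                   for blk, node in locations)

print(read_file("/logs/day1"))   # 'hello hdfs'
```

The NameNode's reply is small and constant-size per block, so it can serve many clients even while terabytes of block data flow between clients and DataNodes.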

53 Data Flow: File Write

54 HDFS Replica Placement
Original (up to Hadoop 0.17):
First: a different node in the same rack
Second: another node in the same rack
Third: a node in a different rack
More: chosen at random
Hadoop 0.17 onward:
First: on the same node as the client
Second: on a node in a different rack
Third: on another node in the same rack as the second replica
(Figure: Rack A, Rack B)
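The Hadoop 0.17+ policy from the slide can be sketched as a small placement function; the node names, rack layout, and "first eligible node" tie-breaking are invented (real HDFS chooses randomly among eligible nodes):

```python
# Sketch of the Hadoop 0.17+ replica placement policy:
# 1st replica on the client's node, 2nd on a node in a different rack,
# 3rd on another node in the second replica's rack.
NODES = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

def place_replicas(client_node):
    first = client_node
    second = next(n for n, r in NODES.items()
                  if r != NODES[first])                    # a different rack
    third = next(n for n, r in NODES.items()
                 if r == NODES[second] and n != second)    # same rack as 2nd
    return [first, second, third]

print(place_replicas("n1"))   # e.g. ['n1', 'n3', 'n4']
```

The shape of the policy is the point: one replica is local (fast write), and the other two share a second rack (one inter-rack transfer survives a whole-rack failure).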

