Beyond the File System: Designing Large Scale File Storage and Serving. Cal Henderson.

1 Beyond the File System
Designing Large Scale File Storage and Serving
Cal Henderson

2 Web Builder 2.0
Hello!

3 Big file systems?
Too vague!
What is a file system?
What constitutes big?
Some requirements would be nice

4 1. Scalable
Looking at storage and serving infrastructures

5 2. Reliable
Looking at redundancy, failure rates, on the fly changes

6 3. Cheap
Looking at upfront costs, TCO and lifetimes

7 Four buckets
Storage
Serving
BCP
Cost

8 Storage

9 The storage stack
File protocol: NFS, CIFS, SMB
File system: ext, reiserFS, NTFS
Block protocol: SCSI, SATA, FC
RAID: Mirrors, Stripes
Hardware: Disks and stuff

10 Hardware overview
The storage scale, from lower to higher: Internal, DAS, SAN, NAS

11 Internal storage
A disk in a computer
–SCSI, IDE, SATA
4 disks in 1U is common
8 for half depth boxes

12 DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30 – 14 disks in 3U

13 SAN
Storage Area Network
Dumb disk shelves
Clients connect via a fabric
Fibre Channel, iSCSI, Infiniband
–Low level protocols

14 NAS
Network Attached Storage
Intelligent disk shelf
Clients connect via a network
NFS, SMB, CIFS
–High level protocols

15 Of course, it's more confusing than that

16 Meet the LUN
Logical Unit Number
A slice of storage space
Originally for addressing a single drive:
–c1t2d3
–Controller, Target, Disk (Slice)
Now means a virtual partition/volume
–LVM, Logical Volume Management
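The c1t2d3 style of address can be taken apart mechanically. The sketch below is only an illustration: it assumes the Controller/Target/Disk layout named on the slide plus an optional trailing "sN" slice part, and the names `DiskAddress` and `parse_address` are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class DiskAddress(NamedTuple):
    controller: int
    target: int
    disk: int
    slice_no: Optional[int]  # None when the address has no "sN" suffix


# Assumed layout: c<controller>t<target>d<disk>, optionally followed by s<slice>.
_ADDRESS = re.compile(r"c(\d+)t(\d+)d(\d+)(?:s(\d+))?")


def parse_address(name: str) -> DiskAddress:
    """Split an address such as 'c1t2d3' into its numeric parts."""
    match = _ADDRESS.fullmatch(name)
    if match is None:
        raise ValueError(f"not a cXtYdZ[sN] address: {name!r}")
    c, t, d, s = match.groups()
    return DiskAddress(int(c), int(t), int(d), int(s) if s is not None else None)


print(parse_address("c1t2d3"))
```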

17 NAS vs SAN
With SAN, a single host (initiator) owns a single LUN/volume
With NAS, multiple hosts own a single LUN/volume
NAS head – NAS access to a SAN

18 SAN Advantages
Virtualization within a SAN offers some nice features:
Real-time LUN replication
Transparent backup
SAN booting for host replacement

19 Some Practical Examples
There are a lot of vendors
Configurations vary
Prices vary wildly
Let's look at a couple
–Ones I happen to have experience with
–Not an endorsement ;)

20 NetApp Filers
Heads and shelves, up to 500TB in 260U
FC SAN with 1 or 2 NAS heads

21 Isilon IQ
2U Nodes, 3-96 nodes/cluster, 6-600 TB
FC/InfiniBand SAN with NAS head on each node

22 Scaling
Vertical vs Horizontal

23 Vertical scaling
Get a bigger box
Bigger disk(s)
More disks
Limited by current tech – size of each disk and total number in appliance

24 Horizontal scaling
Buy more boxes
Add more servers/appliances
Scales forever*
*sort of

25 Storage scaling approaches
Four common models:
Huge FS
Physical nodes
Virtual nodes
Chunked space

26 Huge FS
Create one giant volume with growing space
–Sun's ZFS
–Isilon IQ
Expandable on-the-fly?
Upper limits
–Always limited somewhere

27 Huge FS
Pluses
–Simple from the application side
–Logically simple
–Low administrative overhead
Minuses
–All your eggs in one basket
–Hard to expand
–Has an upper limit

28 Physical nodes
Application handles distribution to multiple physical nodes
–Disks, Boxes, Appliances, whatever
One volume per node
Each node acts by itself
Expandable on-the-fly – add more nodes
Scales forever

29 Physical Nodes
Pluses
–Limitless expansion
–Easy to expand
–Unlikely to all fail at once
Minuses
–Many mounts to manage
–More administration

30 Virtual nodes
Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
Multiple volumes per node
Flexible
Expandable on-the-fly – add more nodes
Scales forever

31 Virtual Nodes
Pluses
–Limitless expansion
–Easy to expand
–Unlikely to all fail at once
–Addressing is logical, not physical
–Flexible volume sizing, consolidation
Minuses
–Many mounts to manage
–More administration
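"Addressing is logical, not physical" can be shown with a small lookup table. This is a minimal sketch, not any particular system's design: the host names, the free-space figures, the path layout and the helpers `choose_volume` and `resolve` are all invented for illustration. The application records only the virtual volume number, so moving a volume to another box changes one table entry and nothing else.

```python
from typing import Dict


def choose_volume(free_bytes: Dict[int, int]) -> int:
    """Pick the virtual volume with the most free space for a new file."""
    return max(free_bytes, key=lambda volume: free_bytes[volume])


def resolve(volume: int, name: str, volume_to_host: Dict[int, str]) -> str:
    """Turn a stored (volume, name) pair into a physical location."""
    host = volume_to_host[volume]
    return f"{host}:/volumes/{volume:04d}/{name}"


# Hypothetical layout: three virtual volumes spread over two physical hosts.
volume_to_host = {1: "store-a", 2: "store-a", 3: "store-b"}
free_bytes = {1: 10_000, 2: 750_000, 3: 200_000}

volume = choose_volume(free_bytes)  # the app stores this number with the file record
before = resolve(volume, "photo.jpg", volume_to_host)

# Consolidation: the volume moves to a new host, the stored number stays the same.
volume_to_host[volume] = "store-c"
after = resolve(volume, "photo.jpg", volume_to_host)

print(before)
print(after)
```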

32 Chunked space
Storage layer writes parts of files to different physical nodes
A higher-level RAID striping
High performance for large files
–read multiple parts simultaneously

33 Chunked space
Pluses
–High performance
–Limitless size
Minuses
–Conceptually complex
–Can be hard to expand on the fly
–Can't manually poke it
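The striping idea can be sketched as a placement plan: cut the file into fixed-size chunks and deal them out to nodes in turn. Round-robin placement, the chunk size and the function name `plan_chunks` are assumptions chosen for this example; real chunked stores use their own placement rules.

```python
from typing import List, Tuple


def plan_chunks(file_size: int, chunk_size: int, nodes: List[str]) -> List[Tuple[int, int, int, str]]:
    """Return (chunk_index, offset, length, node) for every chunk of a file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not nodes:
        raise ValueError("at least one node is required")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)  # the last chunk may be short
        plan.append((index, offset, length, nodes[index % len(nodes)]))
        offset += length
        index += 1
    return plan


for entry in plan_chunks(file_size=10, chunk_size=4, nodes=["node-a", "node-b"]):
    print(entry)
```

Because consecutive chunks land on different nodes, a reader can fetch several of them at once, which is where the large-file performance comes from.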

34 Real Life Case Studies

35 GFS – Google File System
Developed by … Google
Proprietary
Everything we know about it is based on talks they've given
Designed to store huge files for fast access

36 GFS – Google File System
Single Master node holds metadata
–SPOF (single point of failure) – Shadow master allows warm swap
Grid of chunkservers
–64-bit filenames
–64 MB file chunks

37 GFS – Google File System
[Diagram: a Master node and chunkservers holding chunk copies 1(a), 1(b), 2(a)]

38 GFS – Google File System
Client reads metadata from master then file parts from multiple chunkservers
Designed for big files (>100MB)
Master server allocates access leases
Replication is automatic and self repairing
–Synchronously for atomicity

39 GFS – Google File System
Reading is fast (parallelizable)
–But requires a lease
Master server is required for all reads and writes
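The read path described above (ask the master where the chunks live, then fetch them from chunkservers) can be modelled with a toy lookup. Only the 64 MB chunk size comes from the slides; the location table, the server names and the functions `chunks_for_range` and `plan_read` are hypothetical and are not GFS's actual interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated on the slide


def chunks_for_range(offset, length):
    """Indexes of the chunks that cover bytes [offset, offset + length)."""
    if offset < 0 or length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))


def plan_read(filename, offset, length, locations):
    """Pair each needed chunk with the servers the (toy) master lists for it."""
    return [(index, locations[(filename, index)])
            for index in chunks_for_range(offset, length)]


# Toy stand-in for the master's metadata: (file, chunk index) -> replica holders.
locations = {
    ("video.bin", 0): ["cs-1", "cs-2"],
    ("video.bin", 1): ["cs-2", "cs-3"],
    ("video.bin", 2): ["cs-1", "cs-3"],
}

# A read that straddles the boundary between chunk 0 and chunk 1.
print(plan_read("video.bin", CHUNK_SIZE - 10, 20, locations))
```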

40 MogileFS – OMG Files
Developed by Danga / SixApart
Open source
Designed for scalable web app storage

41 MogileFS – OMG Files
Single metadata store (MySQL)
–MySQL Cluster avoids SPOF (single point of failure)
Multiple tracker nodes locate files
Multiple storage nodes store files

42 MogileFS – OMG Files
[Diagram: Tracker, MySQL]

43 MogileFS – OMG Files
Replication of file classes happens transparently
Storage nodes are not mirrored – replication is piecemeal
Reading and writing go through trackers, but are performed directly upon storage nodes

44 Flickr File System
Developed by Flickr
Proprietary
Designed for very large scalable web app storage

45 Flickr File System
No metadata store
–Deal with it yourself
Multiple StorageMaster nodes
Multiple storage nodes with virtual volumes

46 Flickr File System
[Diagram: SM (StorageMaster)]

47 Flickr File System
Metadata stored by app
–Just a virtual volume number
–App chooses a path
Virtual nodes are mirrored
–Locally and remotely
Reading is done directly from nodes

48 Flickr File System
StorageMaster nodes only used for write operations
Reading and writing can scale separately

49 Serving

50 Serving files
Serving files is easy!
[Diagram: Apache, Disk]

51 Serving files
Scaling is harder
[Diagram: three Apache and Disk pairs]

52 Serving files
This doesn't scale well
Primary storage is expensive
–And takes a lot of space
In many systems, we only access a small number of files most of the time

53 Caching
Insert caches between the storage and serving nodes
Cache frequently accessed content to reduce reads on the storage nodes
Software (Squid, mod_cache)
Hardware (Netcache, Cacheflow)

54 Why it works
Keep a smaller working set
Use faster hardware
–Lots of RAM
–SCSI
–Outer edge of disks (ZCAV)
Use more duplicates
–Cheaper, since they're smaller

55 Two models
Layer 4
–Simple balanced cache
–Objects in multiple caches
–Good for few objects requested many times
Layer 7
–URL balanced cache
–Objects in a single cache
–Good for many objects requested a few times
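The difference between the two models comes down to whether the balancer looks at the URL. In this sketch the cache names are invented, round-robin stands in for "simple balancing", and a CRC32 of the URL stands in for whatever hashing a real layer 7 balancer uses.

```python
import zlib
from itertools import cycle

CACHES = ["cache-1", "cache-2", "cache-3"]  # hypothetical cache hosts

_round_robin = cycle(CACHES)


def layer4_pick(url):
    """Layer 4: ignore the URL, just take the next cache in turn.

    The same object ends up stored in several caches over time.
    """
    return next(_round_robin)


def layer7_pick(url, caches=CACHES):
    """Layer 7: hash the URL so each object always maps to one cache."""
    return caches[zlib.crc32(url.encode("utf-8")) % len(caches)]


for _ in range(3):
    print("layer 4:", layer4_pick("/photos/foo.jpg"))
for _ in range(3):
    print("layer 7:", layer7_pick("/photos/foo.jpg"))
```

With layer 7 balancing the total cache capacity is the sum of all caches, which is why it suits many objects requested a few times each.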

56 Replacement policies
LRU – Least recently used
GDSF – Greedy dual size frequency
LFUDA – Least frequently used with dynamic aging
All have advantages and disadvantages
Performance varies greatly with each
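Of the three policies, LRU is the simplest to show in code. The class below is a minimal in-memory sketch for illustration, not how Squid or any other cache implements it; it counts objects rather than bytes, which a real cache would not do.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` items, evicting the least recently used one."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a.jpg", 1)
cache.put("b.jpg", 2)
cache.get("a.jpg")     # touch a.jpg so b.jpg becomes the oldest
cache.put("c.jpg", 3)  # evicts b.jpg
print(cache.get("b.jpg"))
```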

57 Cache Churn
How long do objects typically stay in cache?
If it gets too short, we're doing badly
–But it depends on your traffic profile
Make the cached object store larger

58 Problems
Caching has some problems:
–Invalidation is hard
–Replacement is dumb (even LFUDA)
Avoiding caching makes your life (somewhat) easier

59 CDN – Content Delivery Network
Akamai, Savvis, Mirror Image Internet, etc
Caches operated by other people
–Already in-place
–In lots of places
GSLB/DNS balancing

60 Edge networks
[Diagram: Origin]

61 Edge networks
[Diagram: Origin, Cache]

62 CDN Models
Simple model
–You push content to them, they serve it
Reverse proxy model
–You publish content on an origin, they proxy and cache it

63 CDN Invalidation
You don't control the caches
–Just like those awful ISP ones
Once something is cached by a CDN, assume it can never change
–Nothing can be deleted
–Nothing can be modified

64 Versioning
When you start to cache things, you need to care about versioning
–Invalidation & Expiry
–Naming & Sync

65 Cache Invalidation
If you control the caches, invalidation is possible
But remember ISP and client caches
Remove deleted content explicitly
–Avoid users finding old content
–Save cache space

66 Cache versioning
Simple rule of thumb:
–If an item is modified, change its name (URL)
This can be independent of the file system!

67 Virtual versioning
Database indicates version 3 of file
Web app writes version number into URL
Request comes through cache and is cached with the versioned URL
mod_rewrite converts versioned URL to path
[Diagram: Version 3; Cached: foo_3.jpg; foo_3.jpg -> foo.jpg]
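The two halves of this scheme can be sketched as a pair of functions: one the web app uses to build the versioned URL, and one that does what the rewrite rule does on the origin. The slides use mod_rewrite for the second half; the Python below only mirrors that mapping, and the "name_version.ext" pattern and function names are assumptions based on the foo_3.jpg example.

```python
import re

# Assumed URL shape: <stem>_<version><.ext>, e.g. /photos/foo_3.jpg
_VERSIONED = re.compile(r"^(?P<stem>.+)_(?P<version>\d+)(?P<ext>\.[A-Za-z0-9]+)$")


def versioned_url(path, version):
    """Web app side: write the version number from the database into the URL."""
    stem, dot, ext = path.rpartition(".")
    if not dot:
        raise ValueError(f"path has no file extension: {path!r}")
    return f"{stem}_{version}.{ext}"


def strip_version(url_path):
    """Origin side: map the versioned URL back to the real file path."""
    match = _VERSIONED.match(url_path)
    if match is None:
        return url_path  # not versioned, serve as-is
    return match.group("stem") + match.group("ext")


url = versioned_url("/photos/foo.jpg", 3)
print(url)
print(strip_version(url))
```

Caches see a brand new URL whenever the version changes, so the old cached copy is simply never requested again.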

68 Authentication
Authentication inline layer
–Apache / perlbal
Authentication sideline
–ICP (CARP/HTCP)
Authentication by URL
–FlickrFS

69 Auth layer
Authenticator sits between client and storage
Typically built into the cache software
[Diagram: Cache, Authenticator, Origin]

70 Auth sideline
Authenticator sits beside the cache
Lightweight protocol used for authenticator
[Diagram: Cache, Authenticator, Origin]

71 Auth by URL
Someone else performs authentication and gives URLs to client (typically the web app)
URLs hold the keys for accessing files
[Diagram: Cache, Origin, Web Server]
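One common way to make a URL "hold the key" is to sign it. The sketch below uses an HMAC over the path and an expiry time; this is a generic technique chosen for illustration, not the scheme the slides attribute to FlickrFS, and the secret, parameter names and functions are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical key shared by the web app and the file servers


def sign(path, expires, secret=SECRET):
    """Signature over the path and expiry time (seconds since the epoch)."""
    message = f"{path}:{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_url(path, expires):
    """Web app side: hand the authenticated user a URL that carries its own key."""
    return f"{path}?expires={expires}&sig={sign(path, expires)}"


def is_valid(path, expires, sig, now, secret=SECRET):
    """Serving side: accept the request only if the signature matches and is fresh."""
    if now > expires:
        return False
    return hmac.compare_digest(sign(path, expires, secret), sig)


print(make_url("/private/1234.jpg", expires=1000))
```

The serving tier never needs to talk to the web app or a session store; it only needs the shared secret.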

72 BCP

73 Business Continuity Planning
How can I deal with the unexpected?
–The core of BCP
Redundancy
Replication

74 Reality
On a long enough timescale, anything that can fail, will fail
Of course, everything can fail
True reliability comes only through redundancy

75 Reality
Define your own SLAs
How long can you afford to be down?
How manual is the recovery process?
How far can you roll back?
How many node x boxes can fail at once?

76 Failure scenarios
Disk failure
Storage array failure
Storage head failure
Fabric failure
Metadata node failure
Power outage
Routing outage

77 Reliable by design
RAID avoids disk failures, but not head or fabric failures
Duplicated nodes avoid host and fabric failures, but not routing or power failures
Dual-colo avoids routing and power failures, but may need duplication too

78 Tend to all points in the stack
Going dual-colo: great
Taking a whole colo offline because of a single failed disk: bad
We need a combination of these

79 Recovery times
BCP is not just about continuing when things fail
How can we restore after they come back?
Host and colo level syncing
–replication queuing
Host and colo level rebuilding

80 Reliable Reads & Writes
Reliable reads are easy
–2 or more copies of files
Reliable writes are harder
–Write 2 copies at once
–But what do we do when we can't write to one?

81 Dual writes
Queue up data to be written
–Where?
–Needs itself to be reliable
Queue up journal of changes
–And then read data from the disk whose write succeeded
Duplicate whole volume after failure
–Slow!
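The "journal of changes" option can be sketched with two in-memory replicas. Everything here is a toy model under stated assumptions: the `Replica` class, the list used as a journal and the function names are invented, and a real journal would itself need to be stored reliably, as the slide points out.

```python
class Replica:
    """Toy storage node: a dict of files plus an on/off switch."""

    def __init__(self, name):
        self.name = name
        self.files = {}
        self.online = True

    def write(self, key, data):
        if not self.online:
            raise IOError(f"{self.name} is offline")
        self.files[key] = data


def dual_write(key, data, replicas, journal):
    """Write to every replica; journal the (replica, key) pairs that failed."""
    written = []
    for replica in replicas:
        try:
            replica.write(key, data)
            written.append(replica)
        except IOError:
            journal.append((replica.name, key))  # only the change, not the data
    if not written:
        raise IOError("no replica accepted the write")
    return written


def replay(journal, replicas_by_name, source):
    """Re-apply journalled writes, reading data from the replica that succeeded."""
    remaining = []
    for name, key in journal:
        try:
            replicas_by_name[name].write(key, source.files[key])
        except IOError:
            remaining.append((name, key))  # still down, keep for the next pass
    journal[:] = remaining


a, b = Replica("a"), Replica("b")
journal = []
b.online = False
dual_write("foo.jpg", b"...", [a, b], journal)
b.online = True
replay(journal, {"a": a, "b": b}, source=a)
print(b.files, journal)
```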

82 Cost

83 Judging cost
Per GB?
Per GB upfront and per year
Not as simple as you'd hope
–How about an example

84 Hardware costs
Cost of hardware / Usable GB
Single cost

85 Power costs
Cost of power per year / Usable GB
Recurring cost

86 Power costs
Power installation cost / Usable GB
Single cost

87 Space costs
[Cost per U x Us needed (inc network)] / Usable GB
Recurring cost

88 Network costs
Cost of network gear / Usable GB
Single cost

89 Misc costs
[Support contracts + spare disks + bus adaptors + cables] / Usable GB
Single & recurring costs

90 Human costs
[Admin cost per node x Node count] / Usable GB
Recurring cost

91 TCO
Total cost of ownership in two parts
–Upfront
–Ongoing
Architecture plays a huge part in costing
–Don't get tied to hardware
–Allow heterogeneity
–Move with the market
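The per-GB cost slides add up into the two TCO parts. The function below is a rough sketch of that sum: the parameter names are my own, misc costs are split into an upfront and a yearly part as an assumption, and every number in the example call is invented purely to show the arithmetic.

```python
def cost_per_gb(*, usable_gb, hardware, power_install, network_gear, misc_upfront,
                power_per_year, cost_per_u_per_year, rack_units,
                misc_per_year, admin_per_node_per_year, node_count):
    """Return (upfront cost per GB, recurring cost per GB per year)."""
    upfront = hardware + power_install + network_gear + misc_upfront
    recurring = (power_per_year
                 + cost_per_u_per_year * rack_units    # space
                 + misc_per_year                        # e.g. support contracts
                 + admin_per_node_per_year * node_count)  # human cost
    return upfront / usable_gb, recurring / usable_gb


# Made-up figures, for illustration only.
upfront, yearly = cost_per_gb(
    usable_gb=1000,
    hardware=5000, power_install=500, network_gear=1000, misc_upfront=500,
    power_per_year=600, cost_per_u_per_year=100, rack_units=4,
    misc_per_year=200, admin_per_node_per_year=300, node_count=2,
)
print(f"upfront: {upfront:.2f} per GB, ongoing: {yearly:.2f} per GB per year")
```

Running the same function over two candidate architectures makes the comparison the slides argue for: cost per usable GB, split into upfront and ongoing.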

92 (fin)

93 Photo credits

94 You can find these slides online:
