Intelligent People. Uncommon Ideas. Yottabytes and Beyond Demystifying Storage and Building large Storage Networks Part I by Bhavin Turakhia, CEO, Directi.

1 Intelligent People. Uncommon Ideas. Yottabytes and Beyond: Demystifying Storage and Building Large Storage Networks, Part I, by Bhavin Turakhia, CEO, Directi (shared under the Creative Commons Attribution Share-alike License, incorporated herein by reference)

2 Why is storage important?
- Web 2.0 applications are an extension of your desktop
- SaaS is here and growing
- Broadband is a reality
- Storage costs are dropping
- Everyone expects near-unlimited storage online
- YouTube, Flickr, Facebook et al. are storing your life online*
- (... and yeah, let's not forget your personal BitTorrent collection)
* It would take 1400 TB to store your entire life in video, or 5700 TB if you also want to know what was happening around you, plus another 73 TB for the audio of everything you heard (MP3 quality). That's about 6000 TB for a copy of your life.

3 Agenda
- Hard disks: SATA, SAS, FC, solid state
- RAID
- DAS
- SAN

4 Large scale storage requires careful planning

5 Choosing your Hard Disk (SATA, FC, SAS, SCSI, Solid State)

6 Introduction to Hard Drives
- Basic physical storage unit (aka physical block device)
- Variables to consider when selecting a drive:
  - Type (SAS, SATA, FC)
  - RPM
  - Capacity
  - MTBF (Mean Time Between Failures)
  - Life expectancy

7 Hard Disk types: SATA (Serial ATA), SAS (Serial Attached SCSI), FC (Fibre Channel)
- Typical use:
  - SATA: low-cost, high-volume, low-speed, large-storage environments; CDP / backups
  - SAS: replacement for SCSI
  - SAS / FC: high-performance, transaction-oriented applications with high IOPS requirements
- Performance:
  - SATA: average; typically 7200 RPM
  - SAS: good (similar to FC); 10k / 15k RPM
  - FC: good (similar to SAS); 10k / 15k RPM
- Hard drive capacities:
  - SATA: typically 250 GB, 500 GB, 750 GB, 1 TB
  - SAS / FC: typically 73 GB, 146 GB, 300 GB, 400 GB

8 Hard Disk types: SATA (Serial ATA), SAS (Serial Attached SCSI), FC (Fibre Channel)
- Price per GB (based on retail web price of the maximum drive capacity): SATA = $0.33; SAS = $2; FC = $3
- Misc:
  - SATA: -
  - SAS: backward compatible with SATA; allows mixing SATA drives on the same backplane
  - FC: -

9 Hard Disk Conclusions
- For high-IOPS database applications with low storage requirements, you have a choice between FC and SAS
- SAS currently seems like the better option
- Future SAS standards promise to be faster than FC (though the two are likely to remain neck and neck)
- For high-storage requirements (video servers, file servers, photo storage, archives, mail servers, backup servers), SATA is the way to go
- You can combine SAS and SATA to reduce average cost and still meet your goals, especially since the backplanes are cross-compatible
- Read the spec sheets of the hard drives you plan to use to determine specifics
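The cost argument for mixing drive types can be made concrete with a little arithmetic. The sketch below uses the per-GB retail prices quoted on the earlier "Hard Disk types" slide; the function name and the example capacities are illustrative assumptions, not figures from the talk.

```python
# Per-GB prices quoted on the earlier "Hard Disk types" slide (2007 retail).
PRICE_PER_GB = {"SATA": 0.33, "SAS": 2.00, "FC": 3.00}


def blended_cost_per_gb(capacity_gb_by_type):
    """Return the average cost per GB of a deployment that mixes drive types.

    capacity_gb_by_type maps a drive type ("SATA", "SAS", "FC") to the
    number of GB bought of that type.
    """
    total_gb = sum(capacity_gb_by_type.values())
    if total_gb <= 0:
        raise ValueError("total capacity must be positive")
    total_cost = sum(
        PRICE_PER_GB[drive_type] * gb
        for drive_type, gb in capacity_gb_by_type.items()
    )
    return total_cost / total_gb


# Hypothetical mix: 1 TB of SAS for the database, 9 TB of SATA for bulk files.
mix = {"SAS": 1000, "SATA": 9000}
print(f"${blended_cost_per_gb(mix):.3f} per GB")
```

With this hypothetical mix the average lands much closer to the SATA price than to the SAS price, which is the point of the "combine SAS and SATA" advice.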

10 Solid State Drives
- Use solid state memory to store persistent data
- Eliminate mechanical parts
- Useful for creating efficient in-between caches or storing small to mid-sized high-performance databases

11 Solid State Drives
- Advantages:
  - Faster startup (no spinning)
  - Significantly faster on random IO (from 250x to 1000x+)
  - Extremely low latency (25x to 200x better)
  - No noise
  - Lower power consumption
  - Less heat production
- Disadvantages:
  - Significantly more expensive ($10-30/GB for flash based, $100-200/GB for DDR RAM based)
  - Slightly slower on large sequential reads
  - Slower random write speeds in the case of flash-based storage
- References: Intro; RAM vs Flash based (flash.html); SSD based SAN

12 RAID Primer (0, 1, 2, 3, 4, 5, 6, TP, 0+1, 10, 50, 60)

13 Introduction to RAID
- Allows multiple disks to appear as a single contiguous physical block device
- Provides redundancy / high availability
- A RAID group appears as a single physical block device
[Diagram: HD1 and HD2 combined by RAID]

14 Comparison of Single RAID Levels (RAID 0, RAID 1, RAID 5, RAID 6)
- Description: RAID 0 = striping; RAID 1 = mirroring; RAID 5 = striping with parity; RAID 6 = striping with dual parity
- Minimum disks: RAID 0 = 2; RAID 1 = 2; RAID 5 = 3; RAID 6 = 4
- Maximum disks: RAID 0 = controller dependent; RAID 1 = 2
- Array capacity:
  - RAID 0: No. of drives x drive capacity
  - RAID 1: drive capacity
  - RAID 5: (No. of drives - 1) x drive capacity
  - RAID 6: (No. of drives - 2) x drive capacity

15 Comparison of Single RAID Levels (RAID 0, RAID 1, RAID 5, RAID 6)
- Storage efficiency: RAID 0 = 100%; RAID 1 = 50%; RAID 5 = (No. of drives - 1) / No. of drives; RAID 6 = (No. of drives - 2) / No. of drives
- Fault tolerance: RAID 0 = none; RAID 1 / RAID 5 = 1 drive failure; RAID 6 = 2 drive failures
- High availability: RAID 0 = none; RAID 1 / RAID 5 = good; RAID 6 = very good
- Degradation during rebuild:
  - RAID 0: NA
  - RAID 1: slight degradation; rebuilds very fast
  - RAID 5: high degradation; slow rebuild (due to write penalty of parity)
  - RAID 6: very high degradation; very slow rebuild (due to write penalty of dual parity)
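The capacity and efficiency formulas for the single RAID levels translate directly into a few lines of code. This is a minimal sketch: the function names are made up for illustration, and RAID 1 is modelled as a plain two-drive mirror, as in the comparison table.

```python
# RAID level -> (minimum drives, drives' worth of capacity used for parity)
PARITY_OVERHEAD = {0: (2, 0), 5: (3, 1), 6: (4, 2)}


def raid_usable_drives(level, num_drives):
    """Return how many drives' worth of capacity is usable for data."""
    if level == 1:
        if num_drives != 2:
            raise ValueError("RAID 1 is modelled here as a two-drive mirror")
        return 1
    if level not in PARITY_OVERHEAD:
        raise ValueError(f"unsupported RAID level: {level}")
    minimum, overhead = PARITY_OVERHEAD[level]
    if num_drives < minimum:
        raise ValueError(f"RAID {level} needs at least {minimum} drives")
    return num_drives - overhead


def raid_capacity(level, num_drives, drive_capacity_gb):
    """Array capacity in GB, following the formulas in the table."""
    return raid_usable_drives(level, num_drives) * drive_capacity_gb


def raid_efficiency(level, num_drives):
    """Fraction of raw capacity that is usable for data."""
    return raid_usable_drives(level, num_drives) / num_drives


# Example with hypothetical 500 GB drives.
for level, drives in ((0, 4), (1, 2), (5, 4), (6, 4)):
    capacity = raid_capacity(level, drives, 500)
    efficiency = raid_efficiency(level, drives)
    print(f"RAID {level}, {drives} drives: {capacity} GB usable ({efficiency:.0%})")
```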

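16 Comparison of Single RAID Levels (RAID 0, RAID 1, RAID 5, RAID 6)
- Random read performance: RAID 0 = very good; RAID 1 = good; RAID 5 / RAID 6 = very good
- Random write performance: RAID 0 = very good; RAID 1 = good (slightly worse than single drive); RAID 5 = fair (parity overhead); RAID 6 = poor (dual parity overhead)
- Sequential read performance: RAID 0 = very good; RAID 1 = fair; RAID 5 / RAID 6 = good
- Sequential write performance: RAID 0 = very good; RAID 1 = good; RAID 5 / RAID 6 = fair
- Cost: RAID 0 = lowest; RAID 1 = high; RAID 5 = moderate; RAID 6 = moderate+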

17 Comparison of Single RAID Levels (RAID 0, RAID 1, RAID 5, RAID 6)
- Use case:
  - RAID 0: non-critical data; high speed requirements; data backed up elsewhere
  - RAID 1: typically used as RAID 10 in OLTP / OLAP applications
  - RAID 5 / RAID 6: non-write-intensive OLTP applications, file servers, etc.
- Misc:
  - RAID 0: -
  - RAID 1: -
  - RAID 5: parity can considerably slow down the system
  - RAID 6: not supported on all RAID cards

18 Understanding the Parity Penalty
- RAID 5 and RAID 6 store parity information alongside the data so that it can be rebuilt
- Single parity can be calculated using a simple XOR
- E.g. the data ABCDEFGHIJKL on a 4-disk RAID 5 array:

  Disk 1         Disk 2         Disk 3         Disk 4
  A (01000001)   B (01000010)   C (01000011)   {P} (01000000)
  {P}            D              E              F
  G              {P}            H              I
  J              K              {P}            L

- If Disk 2 fails, the data B can be recalculated as (01000001 XOR 01000011 XOR 01000000) => 01000010 => B
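The XOR rebuild in this example can be checked directly. The sketch below treats each block as a single byte, as the slide does; the helper name `xor_blocks` is an assumption made for illustration.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# First stripe of the example: A, B, C on disks 1-3, parity on disk 4.
a, b, c = b"A", b"B", b"C"
parity = xor_blocks(a, b, c)

# Disk 2 fails: rebuild B from the surviving data blocks and the parity.
rebuilt_b = xor_blocks(a, c, parity)

print(format(parity[0], "08b"))  # parity byte as bits
print(rebuilt_b)                 # recovered block
```

The same function rebuilds any single missing block, because XOR-ing the parity with all surviving blocks cancels everything except the lost one.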

19 Understanding the Parity Penalty
- Steps to change B to X on Disk 2:
  1. Read A, C and {P}
  2. Recalculate {P} as A XOR X XOR C
  3. Write X and {P}
- A single update requires 3 reads and 2 writes
- Random writes in RAID 5 and RAID 6 are very, very expensive

  Disk 1         Disk 2                            Disk 3         Disk 4
  A (01000001)   B -> X (01000010) -> (01011000)   C (01000011)   {P} (01000000)
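The parity update for this write can be sketched as follows. The first calculation follows the slide (recompute parity from A, X and C). The second is a read-modify-write variant that is not on the slide but is a common alternative: it derives the new parity from the old parity, the old block and the new block. Both give the same result. The helper name is an assumption.

```python
def xor_bytes(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


a, b, c = b"A", b"B", b"C"
old_parity = xor_bytes(a, b, c)  # parity before the update
x = b"X"                         # new contents for disk 2

# As on the slide: recompute parity from the other data blocks and X.
new_parity_recomputed = xor_bytes(a, x, c)

# Alternative (not on the slide): cancel old B out of the old parity, add X.
new_parity_rmw = xor_bytes(old_parity, b, x)

assert new_parity_recomputed == new_parity_rmw
print(format(new_parity_recomputed[0], "08b"))
```

Either way, one logical write turns into several physical reads plus two physical writes (the data block and the parity block), which is the penalty the slide describes.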

20 Understanding the Parity Penalty
- Rebuilding in RAID 5 and RAID 6 is expensive
- The cost increases with the number of disks
- As if this isn't enough, there is an additional penalty: all the writes after the computation (i.e. the parity and the changed block) must be simultaneous, involving a two-phase commit operation
- The impact can be marginally reduced through write-back caching

21 Comparison of Nested RAID Levels (RAID 10, RAID 50)
- Description: RAID 10 = mirroring, then striping; RAID 50 = striping with parity, then striping without parity
- Minimum disks: RAID 10 = even number, at least 4; RAID 50 = at least 6
- Maximum disks: controller dependent
- Array capacity:
  - RAID 10: (size of drive) x (number of drives) / 2
  - RAID 50: (size of drive) x (No. of drives in each RAID 5 set - 1) x (No. of RAID 5 sets)

22 Comparison of Nested RAID Levels (RAID 10, RAID 50)
- Storage efficiency: RAID 10 = 50%; RAID 50 = (No. of drives in each RAID 5 set - 1) / (No. of drives in each RAID 5 set)
- Fault tolerance:
  - RAID 10: multiple drive failures, as long as 2 drives from the same RAID 1 set do not fail
  - RAID 50: multiple drive failures, as long as 2 drives from the same RAID 5 set do not fail
- High availability: excellent
- Degradation during rebuild: RAID 10 = minor; RAID 50 = moderate degradation, slow rebuild (due to write penalty of parity)
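The nested-RAID capacity formulas and the "no two failures in the same set" rule can be sketched as below. The function names are hypothetical, and each failed drive is identified only by the index of the RAID 1 or RAID 5 set it belongs to.

```python
from collections import Counter


def raid10_capacity(drive_size_gb, num_drives):
    """RAID 10: half the drives hold mirror copies."""
    if num_drives < 4 or num_drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drive_size_gb * (num_drives // 2)


def raid50_capacity(drive_size_gb, drives_per_set, num_sets):
    """RAID 50: each RAID 5 set gives up one drive's worth of space to parity."""
    if drives_per_set < 3 or num_sets < 2:
        raise ValueError("RAID 50 needs at least 2 sets of at least 3 drives")
    return drive_size_gb * (drives_per_set - 1) * num_sets


def survives(failed_drive_sets):
    """Return True if no set has lost two or more drives.

    failed_drive_sets lists, for every failed drive, the index of the
    RAID 1 (RAID 10) or RAID 5 (RAID 50) set that drive belongs to.
    """
    return all(count < 2 for count in Counter(failed_drive_sets).values())


# Hypothetical 500 GB drives.
print(raid10_capacity(500, 8))     # 8 drives as 4 mirrored pairs
print(raid50_capacity(500, 4, 2))  # 2 RAID 5 sets of 4 drives each
print(survives([0, 1, 2]))         # one failure in each of three sets
print(survives([0, 0]))            # two failures in the same set
```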

23 Comparison of Nested RAID Levels (RAID 10, RAID 50)
- Read performance: very good
- Write performance: RAID 10 = very good; RAID 50 = good
- Use case: RAID 10 = OLTP / OLAP applications; RAID 50 = medium-write-intensive OLTP / OLAP applications

24 Nested RAID Misc Notes
- RAID 10 is faster and better than RAID 0+1 for the same cost
- RAID 60 is similar to RAID 50, except that the striped sets with parity contain dual parity
- Ideally, RAID 10 and RAID 50 will be the only nested RAID levels you will use

25 RAID Considerations
- Select your stripe size by empirical testing
  - A smaller stripe size increases transfer performance and decreases positioning performance, and vice versa
  - Ideal stripe sizes depend on your application: typical amount of data per read, sequential vs random reads, etc.
- Try to select hard drives from separate production batches
- Maintain sufficient spares in a large array (typically 1 per 10-15 disks is sufficient)
- Use global spares across RAID groups if your controller supports it
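One way to see the stripe-size trade-off is to count how many disks a single request touches. The sketch below assumes a plain RAID 0 layout in which "stripe size" means the chunk written to one disk before moving to the next, with chunks laid out round-robin; the function name and the request sizes are illustrative assumptions.

```python
def disks_touched(offset, length, stripe_size, num_disks):
    """Count the distinct disks one request touches in a RAID 0 layout.

    Assumes stripe units are placed round-robin across the disks.
    All sizes are in bytes.
    """
    if length <= 0 or stripe_size <= 0 or num_disks <= 0:
        raise ValueError("length, stripe_size and num_disks must be positive")
    first_unit = offset // stripe_size
    last_unit = (offset + length - 1) // stripe_size
    units = last_unit - first_unit + 1
    return min(units, num_disks)


KB = 1024
for stripe_kb in (16, 64, 256):
    n = disks_touched(0, 64 * KB, stripe_kb * KB, num_disks=4)
    print(f"{stripe_kb:>3} KB stripe: a 64 KB read touches {n} disk(s)")
```

A small stripe spreads one request over many disks (more parallel transfer, but every disk has to position its head), while a large stripe keeps the request on one disk and leaves the others free for other requests. Real workloads mix request sizes, which is why the slide recommends empirical testing.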

26 RAID Considerations
- Use hardware RAID unless performance is not a consideration
  - Nested and parity-based RAID levels in particular consume more CPU cycles and increase rebuild time if implemented in software
- General rule about controller cache: the more, the better
- Ensure the controller has battery backup to retain its cache in case of power failure
- For internal RAID controller cards, use faster PCI buses (PCI-x)

27 The Fun starts: let's build our storage system

28 Passive Disk Enclosure based Direct Attached Storage (PDE based DAS)

29 Passive Disk Enclosure based DAS
- DAS: Direct Attached Storage
- RAID controller inside the host machine
- External chassis is simply a JBOD (Just a Bunch Of Disks), or what I'd like to call a Passive Disk Enclosure (PDE)
- A PDE enables stringing together a larger number of drives than an internal RAID array
- E.g. Dell Powervault MD1000

30 Passive Disk Enclosure based DAS
- A Passive Disk Enclosure can consist of SAS, SATA or FC drives
- PDE to RAID controller connectivity can be SAS, FC or SCSI (possibly different from the backplane)
- Multiple PDEs can be daisy-chained if they support it
- The RAID card is a single point of failure
- Only one host machine is supported
- The array of disks can be divided into multiple RAID groups

31 Passive Disk Enclosure based DAS
- The array of disks can be divided into multiple heterogeneous RAID groups
- The size and type of a RAID group depend on the RAID card
- A PDE may have multiple paths to the system, with the possibility of multiplexing for increased speed
- Global spares can be defined on the RAID card
- Maximum storage size = maximum number of PDEs that can be daisy-chained x number of drives per PDE x size of each drive
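The maximum-size rule is simple multiplication once the number of drives each enclosure holds is included as a factor. The numbers below are hypothetical, not the limits of any particular product, and the result is raw capacity before RAID overhead and spares.

```python
def max_das_capacity_tb(max_enclosures, drives_per_enclosure, drive_size_tb):
    """Raw capacity of a daisy-chained DAS setup, before RAID and spares."""
    if min(max_enclosures, drives_per_enclosure) < 1 or drive_size_tb <= 0:
        raise ValueError("all inputs must be positive")
    return max_enclosures * drives_per_enclosure * drive_size_tb


# Hypothetical: 3 chained enclosures, 15 bays each, 1 TB SATA drives.
print(max_das_capacity_tb(3, 15, 1), "TB raw")
```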

32 Passive Disk Enclosure based DAS: Performance Considerations
- Drives
- RAID configuration
- PDE interconnect
- PDE to RAID card connection
- RAID card config (cache etc.)
- PCI bus

33 Active Disk Enclosure based Direct Attached Storage (ADE based DAS)

34 Active Disk Enclosure based DAS
- How an ADE differs: the RAID card is not in the host machine but in the enclosure
- The host machine has a SAS/FC Host Bus Adaptor (HBA), depending on the ADE-to-host connectivity supported
- Some ADEs may support multiple connection protocols
- An ADE may support SAS/FC/SATA drives
- An ADE can support daisy-chaining PDEs
- Examples of ADEs: Dell MD 3000, Infortrend eonstor devices, Nexsan Satabeast and Sataboy, etc.

35 Active Disk Enclosure based DAS
- An ADE may support dual RAID controllers
- RAID controllers can be used as Active-Active (in case of multiple RAID groups), otherwise as Active-Passive
- RAID controller to HBA connectivity can be multiplexed, if supported, for higher throughput
- ADEs are commonly but wrongly referred to as a SAN (calling one a SAN device would still be all right)

36 Partitioning and Mounting

37 Logical Volumes
- A RAID group is a physical unit of storage
- At the operating system level, a Logical Group (a Volume Group in LVM terms) can be created out of multiple RAID groups
- Each Logical Group can be further divided into Logical Volumes
- Each Logical Volume represents a mountable block device
- In Linux this is done using LVM
- In LVM, Logical Volumes are resizable

38 SAN (Storage Area Network)

39 SAN
- Multiple host machines connected to an ADE through a SAN switch
- SAN refers to the interconnect + switch + ADE + PDE
- Switch and HBA can be SAS / FC, depending on the interconnect type supported by the ADE
- The ADE supports creation of volumes
- These can be mounted onto a client and further subdivided

40 SAN
- Care must be taken to mount each Logical Volume onto a single client (unless you are running a clustered file system)
- This can be achieved by host masking, supported by the ADE and/or the switch
- Without careful host masking and mounting, data corruption can take place
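The single-client rule can be expressed as a simple consistency check over a masking table. This is only a model: real ADEs and switches configure host masking through their own management tools, and the volume names, host names and function below are all hypothetical.

```python
def unsafe_volumes(masking, clustered=frozenset()):
    """Return volumes exposed to more than one host without a clustered FS.

    masking maps a volume name to the hosts allowed to see it.
    clustered is the set of volumes that run a clustered file system and
    may therefore be mounted on several hosts at once.
    """
    return sorted(
        volume
        for volume, hosts in masking.items()
        if len(set(hosts)) > 1 and volume not in clustered
    )


# Hypothetical masking table.
masking = {
    "vol_db": ["host1"],
    "vol_mail": ["host2", "host3"],    # mistake: two hosts, plain file system
    "vol_shared": ["host1", "host2"],  # fine: clustered file system
}
print(unsafe_volumes(masking, clustered={"vol_shared"}))
```

Any volume the check reports is one where two hosts could write to the same block device without coordinating, which is how the data corruption mentioned above arises.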

41 SAN
- Complex SAN configs include multiple hosts and multiple ADEs connected to active-active switches with multiplexed connections
- Client hosts can run heterogeneous operating systems
- (Funnily, ADE to PDE paths are sometimes not multiplexed)

42 SAN
- While this looks complex, just think of it as removing hard disks from the machine and hosting them outside in separate enclosures
- Each machine mounts an independent partition from the SAN

43 SAN Performance Considerations
- All the variables we covered before
- Switch config
- Ensure that the switch / HBA / interconnect does not become the bottleneck and that full hard disk throughput can be utilized

44 Throughput Calculations
- Hard disk performance: type, RPM, etc.
- Data distribution and type of data access
- RAID performance: number of drives, RAID type
- RAID card performance: cache, active-active config, etc.
- ADE to switch connection speed
- Switch to HBA connection speed
- HBA to PCI bus speed
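Since data passes through each of these stages in turn, end-to-end throughput is capped by the slowest one. The sketch below finds that stage. The stage names follow the list above, while the MB/s figures are made-up placeholders that you would replace with measured or spec-sheet values.

```python
def bottleneck(stages):
    """Return (stage name, throughput) of the slowest stage in the path.

    stages maps a stage name to its sustained throughput in MB/s.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    name = min(stages, key=stages.get)
    return name, stages[name]


# Hypothetical figures, in MB/s.
path = {
    "drives (aggregate, after RAID)": 840,
    "RAID controller": 600,
    "ADE to switch link": 400,
    "switch to HBA link": 200,
    "HBA to PCI bus": 1000,
}
name, mbps = bottleneck(path)
print(f"End-to-end throughput is capped at {mbps} MB/s by: {name}")
```

With these placeholder numbers the drives could deliver far more than the host link can carry, which is exactly the situation the earlier performance slide warns against.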

45 That's all, Folks
- Let's go build out our Yottabyte arrays and fill 'em up
- [Considerably exaggerated hyperbole, given that the combined space of all computers in the world today (2007) doesn't add up to 1 yottabyte (2^80 bytes). In fact, the entire world's storage is projected to hit 988 exabytes (1 exabyte = 2^60 bytes) by 2010]
- [6th Sep 2007: Nanotech breakthrough could put entire YouTube contents on an iPod-size device]

46 Part II sneak preview
- Complex SAN configurations
- iSCSI
- NAS
- Clustered Storage
- GFS
- Backups
- Storage Monitoring
- Storage Benchmarking
- Some commercial storage vendors

47 Intelligent People. Uncommon Ideas. Shameless HR Propaganda Slide
- Directi builds cool Web products
- Deployed on distributed architecture
- Using terabytes of storage
- Used by millions of users
- Generating billions of pageviews and transactions
- Spanning every possible software engineering technology
http://careers.directi.com | http://cosmos.directi.com | http://wiki.directi.com
