1 HPC Storage: Current Status and Futures
Torben Kling Petersen, PhD
Principal Architect, HPC
2 Agenda??
Where are we today??
File systems
Interconnects
Disk technologies
Solid state devices
Solutions
Final thoughts …
Pan Galactic Gargle Blaster: "Like having your brains smashed out by a slice of lemon wrapped around a large gold brick."
5 Other parallel file systems
GPFS
 – Running out of steam??
 – Let me qualify!! (and he then rambles on …)
Fraunhofer
 – Excellent metadata performance
 – Many modern features
 – No real HA
Ceph
 – New, interesting and with a LOT of good features
 – Immature and with limited track record
Panasas
 – Still RAID 5 and running out of steam …
6 Object based storage
A traditional file system includes a hierarchy of files and directories
 – Accessed via a file system driver in the OS
Object storage is “flat”; objects are located by direct reference
 – Accessed via custom APIs
   o Swift, S3, librados, etc.
The difference boils down to 2 questions:
   o How do you find files?
   o Where do you store metadata?
Object store + metadata + driver is a file system
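The closing claim of this slide (object store + metadata + driver is a file system) can be illustrated with a minimal in-memory sketch. All class and method names here (`ObjectStore`, `TinyFileSystem`, `put`, `get`) are hypothetical and do not correspond to the Swift, S3 or librados APIs; the point is only that the flat store answers "where is the data?" while a separate metadata map answers "how do you find files?".

```python
import uuid


class ObjectStore:
    """Flat store: objects are located by direct reference (an ID), no hierarchy."""

    def __init__(self):
        self._objects = {}

    def put(self, data):
        object_id = uuid.uuid4().hex  # hypothetical ID scheme, for illustration only
        self._objects[object_id] = data
        return object_id

    def get(self, object_id):
        return self._objects[object_id]


class TinyFileSystem:
    """Metadata (path -> object ID) plus a driver-like API on top of the flat store."""

    def __init__(self, store):
        self._store = store
        self._metadata = {}  # the hierarchy lives here, not in the object store

    def write(self, path, data):
        self._metadata[path] = self._store.put(data)

    def read(self, path):
        return self._store.get(self._metadata[path])

    def listdir(self, directory):
        # Returns every known path under the directory prefix, sorted.
        prefix = directory.rstrip("/") + "/"
        return sorted(p for p in self._metadata if p.startswith(prefix))


store = ObjectStore()
fs = TinyFileSystem(store)
fs.write("/scratch/run1/out.dat", b"results")
print(fs.listdir("/scratch"))
print(fs.read("/scratch/run1/out.dat"))
```

Removing `TinyFileSystem` leaves a usable object store, which is the sense in which other interfaces can be layered on the same backend.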
7 Object Storage Backend: Why?
A generalized storage architecture vs. a file system
It’s more flexible. Interfaces can be presented in other ways, without the FS overhead.
It’s more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
It’s simpler. With the file system “isms” removed, an elegant (= scalable, flexible, reliable) foundation can be laid.
8 Elephants all the way down...
Most clustered file systems (FS) and object stores (OS) are built on local file systems … and inherit their problems
Native FS
   o XFS, ext4, ZFS, btrfs
OS on FS
   o Ceph on btrfs
   o Swift on XFS
FS on OS on FS
   o CephFS on Ceph on btrfs
   o Lustre on OSS on ext4
   o Lustre on OSS on ZFS
9 The way forward …
ObjectStor based solutions offer a lot of flexibility:
 – Next-generation design, for exascale-level size, performance, and robustness
 – Implemented from scratch: "If we could design the perfect exascale storage system..."
 – Not limited to POSIX
 – Non-blocking availability of data
 – Multi-core aware
 – Non-blocking execution model with thread-per-core
 – Support for non-uniform hardware
 – Flash, non-volatile memory, NUMA
 – Using abstractions, guided interfaces can be implemented, e.g., for burst buffer management (pre-staging and de-staging)
10 Interconnects (Disk and Fabrics)
S-ATA 6 Gbit
FC-AL 8 Gbit
SAS 6 Gbit
SAS 12 Gbit
PCI-E direct attach
Ethernet
Infiniband
Next gen interconnect …
11 12 Gbit SAS
Doubles bandwidth compared to SAS 6 Gbit
Triples the IOPS!!
Same connectors and cables
4.8 GB/s in each direction with 9.6 GB/s following
 – 2 streams moving to 4 streams
24 Gb/s SAS is on the drawing board …
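The 4.8 GB/s figure can be reproduced with a short calculation, under two assumptions that the slide does not state: that it refers to a four-lane wide port, and that the link uses 8b/10b encoding (8 payload bits for every 10 bits on the wire). The function name and parameters are illustrative only.

```python
def sas_payload_gb_per_s(line_rate_gbit, lanes):
    """Usable GB/s per direction, assuming 8b/10b encoding on every lane."""
    payload_gbit = line_rate_gbit * 8 / 10  # strip the encoding overhead
    return payload_gbit / 8 * lanes         # bits to bytes, then scale by lane count


# Assumed configurations: x4 wide ports at 6 and 12 Gbit, and an x8 port at 12 Gbit.
for rate, lanes in [(6, 4), (12, 4), (12, 8)]:
    print(f"{rate} Gbit x{lanes}: {sas_payload_gb_per_s(rate, lanes):.1f} GB/s per direction")
```

Under these assumptions, doubling the line rate doubles the payload bandwidth of the same port, and doubling the lane count doubles it again.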
12 PCI-E direct attach storage

PCIe Arch   Raw bit rate   Data encoding   Interconnect bandwidth   BW per lane per direction   Total BW for x16 link
PCIe 1.x    2.5 GT/s       8b/10b          2 Gb/s                   ~250 MB/s                   ~8 GB/s
PCIe 2.0    5.0 GT/s       8b/10b          4 Gb/s                   ~500 MB/s                   ~16 GB/s
PCIe 3.0    8.0 GT/s       128b/130b       8 Gb/s                   ~1 GB/s                     ~32 GB/s
PCIe 4.0    ??

M.2 Solid State Storage Modules
Lowest latency/highest bandwidth
Limitations in # of PCI-E channels available
 – Ivy Bridge has up to 40 lanes per chip
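The table values follow from the raw bit rate and the encoding efficiency. The sketch below recomputes them; the helper names are hypothetical, and it assumes that the "Total BW for x16 link" column counts both directions together, which is what makes the ~8, ~16 and ~32 GB/s figures come out.

```python
# Raw rate in GT/s and encoding efficiency (payload bits / line bits), from the table.
PCIE_GENERATIONS = {
    "PCIe 1.x": (2.5, 8 / 10),
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),
}


def lane_mb_per_s(generation):
    """Payload bandwidth of one lane in one direction, in MB/s."""
    raw_gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return raw_gt_per_s * efficiency * 1000 / 8  # Gb/s to MB/s


def link_gb_per_s(generation, lanes=16, both_directions=True):
    """Payload bandwidth of a whole link in GB/s."""
    per_direction = lane_mb_per_s(generation) * lanes / 1000
    return per_direction * (2 if both_directions else 1)


for gen in PCIE_GENERATIONS:
    print(f"{gen}: {lane_mb_per_s(gen):.0f} MB/s per lane, "
          f"{link_gb_per_s(gen):.1f} GB/s for x16 (both directions)")
```

For PCIe 3.0 the 128b/130b encoding leaves slightly less than 1 GB/s per lane, so the table's "~1 GB/s" and "~32 GB/s" are rounded up.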
13 Ethernet – Still going strong??
Ethernet has now been around for 40 years!!!
Currently around 41% of Top500 systems …
 – 28% 1 GbE
 – 13% 10 GbE
40 GbE shipping in volume
100 GbE being demonstrated
 – Volume shipments expected in 2015
400 GbE and 1 TbE are on the drawing board
 – 400 GbE planned for 2017
21 Hard drive futures …
HAMR drives (Seagate)
 – Using a laser to heat the magnetic substrate (iron/platinum alloy)
 – Projected capacity: 30-60 TB / 3.5 inch drive …
 – 2016 timeframe …
BPM (bit patterned media recording)
 – Stores one bit per cell, as opposed to regular hard-drive technology, where each bit is stored across a few hundred magnetic grains
 – Projected capacity: 100+ TB / 3.5 inch drive …
22 What about RAID?
RAID 5 – No longer viable
RAID 6 – Still OK
 – But re-build times are becoming prohibitive
RAID 10
 – OK for SSDs and small arrays
RAID Z/Z2 etc
 – A choice but limited functionality on Linux
Parity Declustered RAID
 – Gaining foothold everywhere
 – But … PD-RAID ≠ PD-RAID ≠ PD-RAID …
No RAID???
 – Using multiple distributed copies works but …
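Why re-build times become prohibitive, and why parity declustering helps, can be shown with a back-of-envelope model. This is an idealized sketch, not a description of any particular PD-RAID implementation: it assumes rebuild time is simply the failed drive's capacity divided by the aggregate rebuild rate, and that declustering spreads the work evenly over many drives. The drive size, per-drive rebuild rate and drive count are hypothetical inputs.

```python
def rebuild_hours(drive_tb, rebuild_mb_per_s, rebuilding_drives=1):
    """Idealized rebuild time: data to reconstruct divided by aggregate rebuild rate."""
    drive_mb = drive_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    aggregate_mb_per_s = rebuild_mb_per_s * rebuilding_drives
    return drive_mb / aggregate_mb_per_s / 3600


# Hypothetical inputs, chosen only for illustration.
classic = rebuild_hours(drive_tb=4, rebuild_mb_per_s=50)
declustered = rebuild_hours(drive_tb=4, rebuild_mb_per_s=50, rebuilding_drives=40)
print(f"single-target rebuild: {classic:.1f} h, declustered: {declustered:.1f} h")
```

In this model rebuild time grows linearly with drive capacity when a single spare absorbs all writes, and shrinks in proportion to the number of drives sharing the work. Real systems will see less than the ideal speed-up because of controller, network and foreground I/O limits.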
23 Flash (NAND)
Supposed to “Take over the World” [cf. Pinky and the Brain]
But for high performance storage there are issues …
 – Price and density not following predicted evolution
 – Reliability (even on SLC) not as expected
Latency issues
 – SLC access ~25µs, MLC ~50µs …
 – Larger chips increase contention
   o Once a flash die is accessed, other dies on the same bus must wait
   o Up to 8 flash dies share a bus
 – Address translation, garbage collection and especially wear leveling add significant latency
24 Flash (NAND)
MLC
 – 3-4 bits per cell @ 10K duty cycles
SLC
 – 1 bit per cell @ 100K duty cycles
eMLC
 – 2 bits per cell @ 30K duty cycles
Disk drive formats (S-ATA / SAS bandwidth limitations)
PCI-E accelerators
PCI-E direct attach
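The cycle counts on this slide translate into write endurance roughly as follows. This is a simplified sketch that treats the slide's "duty cycles" as program/erase cycles and assumes perfect wear leveling; the device capacity, daily write volume and write amplification factor are hypothetical inputs, not figures from the presentation.

```python
# Cycle counts as quoted on the slide.
PE_CYCLES = {"SLC": 100_000, "eMLC": 30_000, "MLC": 10_000}


def lifetime_writes_tb(capacity_tb, pe_cycles, write_amplification=1.0):
    """Total host writes (TB) before cells wear out, assuming perfect wear leveling."""
    return capacity_tb * pe_cycles / write_amplification


def years_of_service(capacity_tb, pe_cycles, writes_tb_per_day, write_amplification=1.0):
    """Years until the endurance budget is used up at a constant daily write volume."""
    budget_tb = lifetime_writes_tb(capacity_tb, pe_cycles, write_amplification)
    return budget_tb / writes_tb_per_day / 365


# Hypothetical device: 1 TB, 20 TB written per day, write amplification of 2.
for cell, cycles in PE_CYCLES.items():
    years = years_of_service(1, cycles, writes_tb_per_day=20, write_amplification=2.0)
    print(f"{cell}: {years:.1f} years")
```

The ratio between cell types is independent of the assumed workload: at the quoted cycle counts, SLC tolerates ten times the writes of MLC for the same capacity.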
25 NV-RAM
Flash is essentially NV-RAM but …
Phase Change Memory (PCM)
 – Significantly faster and more dense than NAND
 – Based on chalcogenide glass
   o Thermal vs electronic process
 – More resistant to external factors
Currently the expected solution for burst buffers etc …
 – but there’s always Hybrid Memory Cubes …
26 Solutions …
Size does matter …
 – 2014 – 2016: >20 proposals for 40+ PB file systems
 – Running at 1 – 4 TB/s!!!!
Heterogeneity is the new buzzword
 – Burst buffers, data capacitors, cache off loaders …
Mixed workloads are now taken seriously …
Data integrity is paramount
 – T10-DIF/X is a decent start but …
Storage system resiliency is equally important
 – PD-RAID needs to evolve and become system wide
Multi tier storage as standard configs …
Geographically distributed solutions commonplace
27 Final thoughts ???
Object based storage
 – Not an IF but a WHEN …
 – Flavor(s) still TBD – DAOS, Exascale10, XEOS, …
Data management core to any solution
 – Self aware data, real time analytics, resource management ranging from job scheduler to disk block …
 – HPC storage = Big Data
Live data
 – Cache ↔ Near line ↔ Tier2 ↔ Tape ? ↔ Cloud ↔ Ice
Compute with storage ➜ Storage with compute
Storage is no longer a 2nd class citizen