1
Isilon Clustered Storage OneFS
Nick Kirsch
2
Introduction
Who is Isilon?
What Problems Are We Solving? (Market Opportunity)
Who Has These Problems? (Our Customers)
What Is Our Solution? (Our Product)
How Does It Work? (The Cool Stuff)
3
Who is Isilon Systems?
Founded in 2000, located in Seattle (Queen Anne)
IPO’d in 2006 (ISLN)
~400 employees
Quarterly revenue: $30 million, 40% Y/Y
Co-founded by Paul Mikesell, UW/CSE
I’ve been at the company for 6+ years
4
What Problems Are We Solving?
Structured data: small files; modest-size data stores; I/O intensive; transactional; steady capacity growth
Unstructured data: larger files; very large data stores; throughput intensive; sequential; explosive capacity growth
So, while size does matter, it’s not the only thing that matters. Traditional data and digital content are fundamentally different in many ways: small files vs. large, I/O intensive vs. throughput intensive, steady and predictable growth vs. rapid and unpredictable growth. While the traditional storage systems from the incumbent vendors are well suited for traditional data, they were never designed to meet the unique needs of digital content. That’s why organizations grappling with the storage, management and distribution of digital content chose Isilon IQ. It’s the right tool for the right job.
5
Traditional Architectures
Data organized in layers of abstraction: file system, volume manager, RAID
Server/storage architecture - "head" and "disk"
Scale up (vs. scale out)
Islands of storage
Hard to scale
Performance bottlenecks
Not highly available
Overly complex
Cost prohibitive
(Diagram: three separate storage devices - #1, #2, #3)
6
Who Has These Problems? Isilon has over 850 customers today.
Chart: Worldwide File and Block Disk Storage Systems capacity (PB) - file-based growing at 79.3% CAGR vs. 31% CAGR for block-based (source: IDC, 2007)
By 2011, 75% of all storage capacity sold will be for file-based data
Example markets: cloud computing, rich media content, HPC, file server consolidation, disk-based archiving
7
What is Our Solution?
Isilon IQ Clustered Storage: enterprise-class hardware plus OneFS™ intelligent software
A 3-node Isilon IQ cluster
Scales to 96 nodes, 2.3 PB (single file system), 20 GB/s (aggregate)
8
Clustered Storage Consists Of “Nodes”
Largely commodity hardware
Quad-core 2.3 GHz CPU, 4 GB memory read cache
GbE and 10GbE for the front-end network
12 disks per node
InfiniBand for intra-cluster communication
High-speed NVRAM journal
Hot-swappable disks, power supplies, and fans
NFS, CIFS, HTTP, FTP; integrates with Windows and UNIX
OneFS operating system
9
Isilon Network Architecture
(Diagram: CIFS, NFS, and mixed-protocol clients connect to the cluster over Ethernet)
Drop-in replacement for any NAS device
No client-side drivers required (unlike Andrew FS/Coda or Lustre)
No application changes required (unlike Google FS or Amazon S3)
No changes required to adopt
10
How Does It Work? Built on FreeBSD 6.x (originally 5.x)
New kernel module for OneFS, plus modifications to the kernel proper, plus user-space applications
Leverage open source where possible
Almost all of the heavy lifting is in the kernel
Commodity hardware, with a few exceptions:
  - A high-speed NVRAM journal for data consistency
  - An InfiniBand low-latency cluster interconnect
  - A close-to-commodity SAS card (commodity chips)
  - A custom monitoring board (fans, temperatures, voltages, etc.)
SAS and SATA disks
11
OneFS architecture: Fully Distributed (Top Half / Bottom Half)
The OneFS architecture is basically an InfiniBand SAN
All data access across the back-end network is block-level
The participants act as very smart disk drives
Much of the back-end data traffic can be RDMA
Top half (initiator): network operations (TCP, NFS, CIFS); VFS layer, locking, etc.; FEC calculations and block reconstruction; file-indexed cache
Bottom half (participant): journal and disk operations; block-indexed cache
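To make the initiator/participant split concrete, here is a minimal sketch (class names, interfaces, and the cache placement are illustrative assumptions, not the real OneFS code) of an initiator assembling a file from block-level requests to participants that behave like smart disks:

```python
# Sketch: the initiator (top half) turns a file read into block-level
# requests; each participant (bottom half) serves blocks from its own
# disks or block-indexed cache. Purely illustrative, not the real RPC.
class Participant:
    def __init__(self):
        self.blocks = {}                 # block address -> bytes (the "disk")
        self.block_cache = {}            # Level 2: block-indexed cache

    def read_block(self, addr):
        if addr not in self.block_cache:
            self.block_cache[addr] = self.blocks[addr]
        return self.block_cache[addr]

class Initiator:
    def __init__(self, participants):
        self.participants = participants
        self.file_cache = {}             # Level 1: file-indexed cache

    def read_file(self, layout):
        """layout: list of (participant index, block address), one per block."""
        key = tuple(layout)
        if key not in self.file_cache:
            self.file_cache[key] = b"".join(
                self.participants[node].read_block(addr) for node, addr in layout)
        return self.file_cache[key]

p0, p1 = Participant(), Participant()
p0.blocks[100], p1.blocks[200] = b"hello ", b"world"
print(Initiator([p0, p1]).read_file([(0, 100), (1, 200)]))   # b'hello world'
```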
12
OneFS architecture: OneFS started from UFS (aka FFS)
Generalized for a distributed system; little resemblance in the code today, but the concepts are there
Almost all data structures are trees
OneFS knows everything - no volume manager, no RAID
The lack of abstraction allows us to do interesting things, but forces the file system to know a lot - everything
Cache/memory architecture is split:
  - "Level 1" - file cache (cached as part of the vnode)
  - "Level 2" - block cache (local or remote disk blocks)
  - Memory is also used for a high-speed write coalescer
Much more resource intensive than a local FS
Per-file: snapshots, quotas, protection; could also be allocation policies, prefetch policies (sketch below)
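Because there is no volume manager or RAID layer underneath, policy can live on the file itself. A rough sketch of what per-file settings might look like; all field names and defaults are hypothetical:

```python
# Sketch: with no volume manager or RAID layer below it, the file system
# itself carries per-file policy. Field names and defaults are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FilePolicy:
    protection: str = "+2"          # mirroring ("2x".."8x") or FEC ("+1".."+4")
    snapshots: bool = True
    quota_bytes: Optional[int] = None
    # could also carry: allocation policy, prefetch policy, ...

policies = {
    "/ifs/scratch/tmp.dat":    FilePolicy(protection="1x", snapshots=False),
    "/ifs/archive/master.mov": FilePolicy(protection="+4", quota_bytes=10**12),
}
print(policies["/ifs/archive/master.mov"].protection)   # "+4"
```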
13
Atomicity/Consistency Guarantees
POSIX file system: namespace operations are atomic; fsync/sync operations are guaranteed synchronous
FS data is either mirrored or FEC-protected
Metadata is always mirrored, up to 8x
User data can be mirrored (up to 8x) or FEC-protected up to +4; we use Reed-Solomon codes for FEC
The protection level can be chosen on a per-file or per-directory basis: some files can be at 1x (no protection) while others are at +4 (survive four failures) (toy sketch below)
Metadata must be protected at least as highly as anything it refers to
All writes go to NVRAM first as part of a distributed transaction - guaranteed to commit or abort
Mirroring allows metadata updates to be faster: easy DSR, no read-modify-write
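As a toy illustration of the protection trade-off (the real system uses Reed-Solomon codes; the single XOR parity below only stands in for "+1"), here is how a per-file policy might translate into extra blocks written:

```python
# Toy sketch of per-file protection: mirroring vs. single-parity FEC.
# OneFS actually uses Reed-Solomon codes and richer layouts; this only
# illustrates the space/robustness trade-off described on the slide.

def xor_parity(blocks: list) -> bytes:
    """Compute a single parity block (survives one lost block)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def protect(data_blocks: list, policy: str) -> list:
    """Return the extra blocks written for a given protection policy."""
    if policy == "2x":                 # mirroring: one full copy of the data
        return list(data_blocks)
    if policy == "+1":                 # FEC: one parity block for the stripe
        return [xor_parity(data_blocks)]
    raise ValueError(f"unknown policy {policy!r}")

stripe = [bytes([i]) * 8 for i in range(4)]        # 4 data blocks
print(len(protect(stripe, "2x")))                  # 4 extra blocks (100% overhead)
print(len(protect(stripe, "+1")))                  # 1 extra block  (25% overhead)
```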
14
Group Management: a transactional way to handle state changes
All nodes need to agree on their peers
Group changes: split, merge, add, remove
Group changes don’t “scale”, but they are rare
(Diagram: two node groups, {1, 4} and {2, 3}, merging into one)
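A minimal sketch of the idea that all nodes must agree on the new group before normal operation resumes; the propose/commit shape and class names are assumptions for illustration, not the actual protocol:

```python
# Sketch: nodes agree on a new group membership before resuming work.
# The propose/ack/commit shape is illustrative only.
class Node:
    def __init__(self, node_id, group):
        self.node_id = node_id
        self.group = group          # current agreed-upon membership
        self.paused = False

    def propose(self, new_group):
        self.paused = True          # pause normal operations during the change
        return self.node_id in new_group   # ack only if we belong

    def commit(self, new_group):
        self.group = new_group
        self.paused = False

def change_group(nodes, new_group):
    """Apply a split/merge/add/remove as a single transactional change."""
    if all(n.propose(new_group) for n in nodes if n.node_id in new_group):
        for n in nodes:
            if n.node_id in new_group:
                n.commit(new_group)

# Merge {1, 4} with {2, 3} into one group of four nodes.
nodes = [Node(i, frozenset({i})) for i in (1, 2, 3, 4)]
change_group(nodes, frozenset({1, 2, 3, 4}))
print(nodes[0].group)   # frozenset({1, 2, 3, 4})
```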
15
Distributed Lock Manager
Textbook-ish DLM
Anyone requesting a lock is an initiator
The coordinator knows the definitive owner for the lock and controls access to it
The coordinator is chosen by a hash of the resource (sketch below)
Split/merge behavior:
  - Locks are lost at merge time, not split time
  - Since POSIX has no lock-revoke mechanism, advisory locks are silently dropped
  - The coordinator renegotiates on split/merge
Locking optimizations - "lazy locks": locks are cached; lock-lost callbacks; lock-contention callbacks
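A minimal sketch of coordinator selection by hashing the resource onto the current group (function and names are hypothetical); re-running the same hash over the new membership after a split or merge is what "renegotiates" the coordinator:

```python
import hashlib

def coordinator_for(resource: str, group: list) -> int:
    """Pick the lock coordinator for a resource by hashing onto the group."""
    digest = hashlib.sha1(resource.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(group)
    return sorted(group)[index]

group = [1, 2, 3, 4]
print(coordinator_for("/ifs/data/file.txt", group))      # some node in the group

# After a split, the hash is recomputed over the surviving group,
# so lock coordination is renegotiated on the new membership.
print(coordinator_for("/ifs/data/file.txt", [1, 2]))
```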
16
RPC Mechanism: uses SDP over InfiniBand; batch system
Allows you to put dependencies on the remote side, e.g. send 20 messages, checkpoint, send 20 more
Messages run in parallel, then synchronize, etc.
Coalesces errors
Async messages (callback), sync messages, update messages (no response)
Used by the DLM, RBM, etc. (everything)
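A rough sketch of the batch pattern described above: messages within a batch run in parallel, a checkpoint synchronizes before the next batch, and errors are coalesced rather than aborting the whole run. The API shape is hypothetical:

```python
# Sketch of a batched RPC pattern: run a batch of messages in parallel,
# checkpoint (synchronize), then run the next batch; errors are coalesced.
# Not the actual OneFS RBM interface.
from concurrent.futures import ThreadPoolExecutor

def send_batches(batches, send):
    """`send` delivers one message; each batch runs in parallel up to a checkpoint."""
    errors = []
    with ThreadPoolExecutor() as pool:
        for batch in batches:                       # checkpoint between batches
            futures = [pool.submit(send, msg) for msg in batch]
            for f in futures:
                try:
                    f.result()                      # wait: the checkpoint barrier
                except Exception as exc:            # coalesce errors, keep going
                    errors.append(exc)
    return errors

def fake_send(msg):
    if msg == "bad":
        raise IOError("remote participant failed")

errs = send_batches([["m1", "m2"], ["bad", "m3"]], fake_send)
print(len(errs))   # 1 coalesced error; the remaining messages still ran
```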
17
Writing a file to OneFS
Writes occur via NFS, CIFS, etc. to a single node; that node coalesces data and initiates transactions
Optimizing for write performance is hard: lots of variables, each node might have a different load, unusual scenarios (e.g. degraded writes)
Asynchronous Write Engine: build a directed acyclic graph (DAG) of the work, do each piece as soon as its dependencies are satisfied, and prioritize and pipeline work for efficiency
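A minimal sketch of dependency-driven execution, using Python's graphlib as a stand-in for the in-kernel engine; task names are hypothetical:

```python
# Sketch of dependency-driven execution: each unit of work runs as soon as
# everything it depends on has finished. Task names are hypothetical.
from graphlib import TopologicalSorter

def run_write_plan(deps, run):
    """deps: {task: set of prerequisite tasks}; run(task) does the work."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    while ts.is_active():
        for task in ts.get_ready():     # every task whose deps are satisfied
            run(task)                   # (a real engine runs these in parallel)
            ts.done(task)

deps = {
    "allocate_blocks": {"compute_layout"},
    "compute_fec":     {"allocate_blocks"},
    "write_data":      {"allocate_blocks"},
    "write_fec":       {"compute_fec"},
    "commit_txn":      {"write_data", "write_fec"},
}
run_write_plan(deps, lambda t: print("run", t))
```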
18
Writing a file to OneFS
(Diagram: clients and servers speaking NFS, CIFS, FTP, or HTTP connect through one or two front-end switches to the cluster nodes.)
19
Writing a file to OneFS (continued)
20
Writing a file to OneFS Break the write into regions
Regions are protection-group aligned
For each region: create a layout, use the layout to generate a plan, execute the plan asynchronously
(Diagram: plan steps - compute layout, allocate blocks, write blocks, compute FEC, write FEC)
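A minimal sketch of the first step, splitting a write into protection-group-aligned regions; the 128 KiB group size is an assumption for illustration:

```python
# Sketch: split a write into protection-group-aligned regions.
# The 128 KiB protection-group size is an assumption for illustration.
PROTECTION_GROUP = 128 * 1024

def split_into_regions(offset: int, length: int):
    """Yield (offset, length) pairs aligned to protection-group boundaries."""
    end = offset + length
    while offset < end:
        boundary = (offset // PROTECTION_GROUP + 1) * PROTECTION_GROUP
        region_end = min(boundary, end)
        yield offset, region_end - offset
        offset = region_end

# A 300 KiB write starting 32 KiB into the file becomes three regions,
# each of which gets its own layout and plan.
print(list(split_into_regions(32 * 1024, 300 * 1024)))
```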
21
Writing a file to OneFS Plan executes and transaction commits
Data and parity blocks are now on disks
(Diagram: data and parity blocks spread across the nodes' disks, plus inode mirrors 0 and 1)
22
Reading a file from OneFS
(Diagram: clients and servers speaking NFS, CIFS, FTP, or HTTP connect through one or two front-end switches to the cluster nodes.)
23
Reading a file from OneFS
(Same diagram, continued.)
24
Handling Failures What could go wrong during a single transaction?
A block-level I/O request fails; a drive goes down; a node runs out of space; a node disconnects or crashes
In a distributed system, things are expected to fail
Most of our system calls automatically restart
Have to be able to gracefully handle all of the above, plus much more!
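A rough sketch of the "automatically restart" behavior: if a transaction aborts because the group changed mid-flight, rebuild the plan against the new group and retry. Exception and function names are hypothetical:

```python
# Sketch of restartable operations: if a transaction aborts because cluster
# membership changed (node died, drive dropped), re-resolve the plan against
# the new group and retry. Names are hypothetical.
class GroupChanged(Exception):
    """Raised when cluster membership changes during a transaction."""

def restartable(op, build_plan, max_retries=5):
    """Run op(plan); on a group change, rebuild the plan and try again."""
    for _ in range(max_retries):
        plan = build_plan()            # layout against the *current* group
        try:
            return op(plan)
        except GroupChanged:
            continue                   # group shrank or grew: start over
    raise RuntimeError("transaction could not complete")

attempts = iter([GroupChanged(), None])
def op(plan):
    err = next(attempts)
    if err:
        raise err
    return "committed"

print(restartable(op, build_plan=lambda: {"nodes": [1, 2, 3]}))   # committed
```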
25
Handling Failures
When a node goes "down":
  - New files will use effective protection levels (if necessary)
  - Affected files will be reconstructed automatically per request
  - That node's IP addresses are migrated to another node
  - Some data is orphaned and later garbage collected
When a node "fails":
  - Affected files will be repaired automatically across the cluster
  - AutoBalance will automatically rebalance data
We can safely, proactively SmartFail nodes/drives:
  - Reconstruct data without removing the device
  - If a multiple-component failure occurs, we can still use the original device - minimizes WOR
26
SmartConnect
(Diagram: CIFS and NFS clients connecting to the cluster over Ethernet)
A client must connect to a single IP address
SmartConnect is a DNS server that runs on the cluster; the customer delegates a zone to the cluster's DNS server
SmartConnect responds to DNS queries with only available nodes
SmartConnect can also be configured to respond with nodes based on load, connection count, throughput, etc.
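A minimal sketch of how a SmartConnect-style DNS answer might be chosen: only nodes that are up are candidates, and a policy (round-robin vs. connection count here) picks among them. Data layout and policy names are hypothetical:

```python
# Sketch of SmartConnect's answer to a DNS query: return the IP of one
# available node, chosen by a configurable policy. Field names hypothetical.
import itertools

nodes = [
    {"ip": "10.0.0.1", "up": True,  "connections": 12},
    {"ip": "10.0.0.2", "up": False, "connections": 0},   # down: never returned
    {"ip": "10.0.0.3", "up": True,  "connections": 4},
]
_round_robin = itertools.cycle(range(len(nodes)))

def resolve(policy: str = "round_robin") -> str:
    available = [n for n in nodes if n["up"]]
    if policy == "connection_count":          # least-loaded available node
        return min(available, key=lambda n: n["connections"])["ip"]
    while True:                               # default: round-robin, skip down nodes
        candidate = nodes[next(_round_robin)]
        if candidate["up"]:
            return candidate["ip"]

print(resolve())                       # 10.0.0.1 (then 10.0.0.3, ...)
print(resolve("connection_count"))     # 10.0.0.3
```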
27
We've Got Lego Pieces
Accelerator Nodes: top half only; adds CPU and memory - no disks or journal; only has Level 1 cache, so high single-stream throughput
Storage Nodes: both top and bottom half; in some workloads, bottom-half only makes sense
Storage Expansion Nodes: just a dumb extension of a storage node - add disks; grow capacity without performance
28
SmartConnect Zones (example)
hpc.tx.com - Processing: 10 GigE dedicated, Accelerator X nodes, NFS failover required
bizz.tx.com - BizDev: renamed sub-domain, CIFS clients (static IP)
eng.tx.com - Eng: shared subnet, separate sub-domain, NFS failover
gg.tx.com - Interpreters: storage nodes, NFS clients, no failover
fin.tx.com - Finance: VLAN (confidential traffic, isolated), same physical LAN
it.tx.com - IT: full access, maintenance interface; corporate DNS, no SmartConnect; static (well-known) IPs required
29
Initiator Software Block Diagram
30
Participant Software Block Diagram
31
System Software Block Diagram
(Diagram: accelerator node and storage node software stacks)
32
Too much to talk about…
Snapshots, Quotas, Replication, Bit Error Protection, Rebalancing Data, Handling Slow Drives, Statistics Gathering, I/O Scheduling, Network Failover, Native Windows Concepts (ACLs, SIDs, etc.), Failed Drive Reconstruction, Distributed Deadlock Detection, On-the-fly Filesystem Upgrade, Dynamic Sector Repair, Globally Coherent Cache
33
Thank You! Questions?