Cluster Shared Volumes (CSV) on SAN vs Storage Spaces Direct (S2D): The Details
Let's talk Windows Failover Clusters (WFC) and how storage is seen by a cluster. You basically have three options for storage when deploying a WFC: two are valid only for SAN (iSCSI or FC) type storage, and one is valid ONLY for JBOD/faceplate SAS/SATA/NVMe storage.
Cluster Volumes: These are the old-fashioned cluster volumes introduced with Windows NT 4.0. Yes, they have been updated since then, but they still operate in a very basic way. You create a volume on a SAN and expose it to multiple servers (over multiple paths). If more than one server tries to read/write to a cluster volume at the same time, the data can (and eventually will) be corrupted. To prevent this, cluster volumes use SCSI Reservations, so only one server in the cluster can read/write to the drive at a time. This is a hardware method of enforcing protection on the disk. When you want Cluster Node 2 to take over a disk from the current owner, Cluster Node 1, you issue a failover command. In the background this tells Node 1 to flush its outstanding writes (if possible) and then let its SCSI reservation expire, while Node 2 requests a reservation break. This mechanism only allows a failover to a node that holds the cluster quorum: a node without quorum will refuse to defend its reservation, so if Node 1 has lost quorum it will not defend and the break succeeds. This type of failover is a hard cut-over and can take between 500 ms (half a second) and 30 seconds to appear on the other node (usually it is under 5 seconds). If the volume is hosting a VM, that VM may see a slight pause, but it is barely noticeable.
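To make the reservation/quorum interaction concrete, here is a minimal sketch in Python. It is a toy model only; the class and method names (Node, ClusterDisk, request_reservation_break, etc.) are invented for illustration and do not correspond to any real Windows or HPE API.

```python
# Toy model of classic Cluster Volume failover: one SCSI reservation,
# one owner at a time, quorum decides whether the owner defends.

class Node:
    def __init__(self, name, has_quorum=True):
        self.name = name
        self.has_quorum = has_quorum
        self.pending_writes = []

class ClusterDisk:
    """One LUN exposed to every node; a SCSI reservation enforces single-writer access."""
    def __init__(self):
        self.reservation_holder = None      # only this node may read/write

    def reserve(self, node):
        if self.reservation_holder is None:
            self.reservation_holder = node

    def request_reservation_break(self, challenger):
        holder = self.reservation_holder
        # A holder that still has quorum defends its reservation;
        # a holder without quorum must let the reservation lapse.
        if holder is not None and holder.has_quorum:
            return False                     # break refused, challenger backs off
        if holder is not None:
            holder.pending_writes.clear()    # flush outstanding writes if possible
        self.reservation_holder = challenger # hard cut-over to the new owner
        return True

node1, node2 = Node("Node1"), Node("Node2")
disk = ClusterDisk()
disk.reserve(node1)

node1.has_quorum = False                          # e.g. Node1 is partitioned off
print(disk.request_reservation_break(node2))      # True: the break succeeds
print(disk.reservation_holder.name)               # Node2 now owns the disk
```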
Cluster Shared Volumes: This is a far different tool than the Cluster Volume. Yes, CSVs still use SCSI reservations, but ALL nodes of the cluster are able to read/write to the drive at the same time. To prevent corruption, the drive uses a CSV Owner Node to coordinate actions on it. Basically, all nodes can read any file, but they need to register with the owner node to open a file for write/modify operations, and a node can lease a single file, a folder, or a collection of files. Now let's take the same situation, where a VM is running on Node 1 and you want to fail that VM over to Node 2. All of the files that make up that VM are currently leased/owned by Node 1, so Node 2 cannot write to them; when Node 2 wants to bring that VM up, it simply coordinates the expiration of the VM's file leases and leases those files for itself. Since all of this happens in the cluster software on the WFC, there is no waiting for hardware reservations or for Plug and Play to detect a hardware change, so the failover can happen far faster, usually below 500 ms (half a second), and the VM being moved will not even notice, i.e. it is completely undetectable.

The CSV Owner Node does not JUST manage the leases on individual files; it also has a different mode, called redirected mode. Let's say you have a VM on Node 1, but Node 1 loses access to its iSCSI or FC connection while the node itself is still up. Node 1 can then send its storage traffic over SMB to the CSV Owner Node, which writes to the storage on its behalf. This keeps the VM from crashing and allows it to be moved to another node without a crash. Redirected mode is also used when you want to create a cache-flushed (VSS) snapshot of a CSV: remember that each node in the cluster may have outstanding writes to the disks, so all write traffic is redirected to the owner node, which can then take a VSS-flushed snapshot. This takes no longer than 10 seconds (usually under 5 seconds), and once completed the CSV is returned to normal operation. Generally the weak spot of a CSV is that you really need to deploy a high-speed node-to-node network to support this redirected mode, as well as unblocked access for each node to make lease requests and breaks.
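The lease bookkeeping is the key difference from the hardware reservation above, so here is another minimal sketch of that idea, again with invented names (CsvOwnerNode, acquire_write_lease, move_vm) that are purely illustrative and not a real cluster API.

```python
# Toy model of how a CSV owner node might coordinate per-file write leases
# so that every node can use the same volume concurrently.

class CsvOwnerNode:
    def __init__(self):
        self.leases = {}                  # path -> node currently holding the write lease

    def acquire_write_lease(self, path, node):
        holder = self.leases.get(path)
        if holder is None or holder == node:
            self.leases[path] = node      # reads need no lease; writes/modifies do
            return True
        return False                      # another node holds the lease

    def move_vm(self, vm_files, old_node, new_node):
        # Failover is just lease bookkeeping inside the cluster software,
        # so it completes quickly: no SCSI reservation break, no PnP rescan.
        for path in vm_files:
            if self.leases.get(path) == old_node:
                del self.leases[path]                 # expire the old node's lease
            self.acquire_write_lease(path, new_node)  # lease the file to the new node

owner = CsvOwnerNode()
vm_files = ["vm1.vhdx", "vm1.vmcx"]
for f in vm_files:
    owner.acquire_write_lease(f, "Node1")

owner.move_vm(vm_files, "Node1", "Node2")
print(owner.leases)   # both files are now leased to Node2
```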
Storage Spaces Direct (S2D): Originally introduced in Windows Server 2012 (as Shared Storage Spaces) and then reintroduced in Windows Server 2016 as Storage Spaces Direct (the two are very different). This is a completely different beast. In this mode, you deploy a large number of raw drives in the faceplate of each server, and that storage is joined to create a storage pool. As an example, in a 4-node cluster you might have 12 drives in each node, which means your pool would be 4 x 12 (48 drives in total). When you deploy a SAN, the array takes care of the RAID levels and protection for you; here, the S2D layer creates something similar to a RAID set on top of that primordial pool. Since we are using direct-attached storage (DAS), a node failure takes down all of that node's captive drives, so to survive, the S2D software must use a protection method that mirrors data among the surviving nodes. The recommended cluster size is 4 nodes, and the recommended protection method is 3-way mirroring.

So a volume is created in the primordial pool and 3-way mirroring is selected. That volume is split into thousands (or millions) of slabs, and each slab is written to 3 different servers; the only rule is that the 3 copies of any slab must exist on different physical servers. If a VM running on Node 1 needs to write a large file to its local file system, it writes to its VHD, which causes a collection of slabs to be updated. To update those slabs, it writes some slabs locally and two copies of each slab to the other nodes of the cluster. I.e. if the file is 10 GB, it generates 30 GB of data to be written, distributed across the 4 nodes, which means it writes roughly 7.5 GB locally and 7.5 GB to each other node in the cluster. This represents write amplification, and as such the CSV network MUST be sized to handle the workload. One additional note: a storage pool can be all-flash, all-spinning-disk, or a tiered pool, where you join a capacity tier with a caching tier. Note, however, that each set of drives in a tiered pool can be used either for capacity or for caching, but never both.
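The 10 GB example above is just arithmetic, so here is a small back-of-the-envelope helper that reproduces it. It is a simplification under the stated assumptions (3-way mirror, copies spread roughly evenly across the nodes, caching and slab placement details ignored); the function name is invented for illustration.

```python
# Rough write-amplification estimate for a 3-way mirrored S2D volume.

def s2d_write_traffic(write_gb, copies=3, nodes=4):
    total = write_gb * copies              # total data landed on disks (all mirror copies)
    per_node = total / nodes               # roughly even spread of slab copies per node
    sent_over_network = total - per_node   # everything not written locally crosses the CSV network
    return total, per_node, sent_over_network

total, per_node, net = s2d_write_traffic(10)
print(total)      # 30   -> a 10 GB guest write becomes 30 GB of physical writes
print(per_node)   # 7.5  -> about 7.5 GB lands on each of the 4 nodes
print(net)        # 22.5 -> about 22.5 GB crosses the node-to-node (CSV) network
```

This is why the post stresses sizing the node-to-node network: roughly three quarters of every write leaves the originating node.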
So to sum up, the following illustrates the layers of control in each case:
- Cluster Volume --> Exposed Volume --> Multipath Driver --> Disk Driver --> SCSI Reservation --> Array Cache --> Array (RAID Layer)
- Cluster Shared Volume --> Exposed Volume --> CSV Owner Node Lease --> Multipath Driver --> Disk Driver --> Array Cache --> Array (RAID Layer)
- S2D --> Exposed Volume --> CSV Lease --> OS RAID --> S2D slab RAID --> SMB Driver (CSV Network) --> RAW Drives
NTFS vs ReFS
Now we need to talk about NTFS and ReFS. With SAN-based storage, the array can internally handle all of the retries, RAID scrubs, data-integrity checks, and similar operations, and an upper layer of NTFS meshes with that ecosystem perfectly. I should also note that NTFS is a known quantity, in that its internal workings are well known, and as such many layers of enhancements and optimizations have been baked in over the years. This is both an advantage and a disadvantage. The clear advantage is backwards compatibility, interoperability, etc.; the disadvantage is that not much can be changed without breaking that compatibility, and to optimize S2D, Microsoft needed to implement some of the features that storage arrays have been doing for 20 years. As an example, NTFS uses a simple LBA layout, while ReFS has an extra layer of redirection; this allows ReFS to relocate slabs, do checksumming, self-heal when a slab is missing (or a node goes down), and so on.
Additionally, ReFS is not tested/certified to work on SAN-type hardware, and features such as ODX (Offloaded Data Transfer), VSS snapshots, TRIM/UNMAP, volume pivot, and Linux compatibility may give unpredictable results.
For this reason ReFS should always be used for S2D, and NTFS should always be used for SAN.
Note also that I simplified the recommendation to a 4-node WFC using S2D; this is due to the following reasons.
- If you deploy a single-node cluster, you still need to support some form of protection, but you have NO protection from a node failure.
- If you deploy a 2- or 3-node cluster, it is common to deploy parity RAID on each node and then a 2- or 3-way mirror between the nodes; however, to grow from a 2-node cluster to a larger one you will have to destroy this layout, as it cannot scale. The upgrade path is to destroy and recreate the cluster.
- If you deploy a 4-node cluster, you can grow (data in place) by simply adding more nodes and expanding the pool to incorporate them: no downtime, no growth pains. Eventually the storage will rebalance across all the nodes.
- If you are deploying a NEW cluster, right-sizing it to a 4-node cluster gives you all the advantages of failover and protection from faults while keeping the complexity level low. Larger node counts increase complexity without increasing resiliency: a 4-node cluster can survive 2 nodes being down, a 6-node cluster can also ONLY survive 2 nodes being down, and an 8-node cluster can likewise only survive 2 nodes being down (see the sketch after this list).
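The resiliency point follows directly from the mirror count, and the small sketch below illustrates it under a worst-case assumption (every failed node held a copy of some slab); the helper function is invented purely for illustration and ignores quorum/witness considerations.

```python
# With 3-way mirroring, at least one copy of every slab survives as long as
# no more than 2 nodes are down, no matter how many nodes the cluster has.

def surviving_copies(failed_nodes, copies=3):
    # Worst case: each failed node held one copy of the same slab.
    return max(copies - failed_nodes, 0)

for nodes in (4, 6, 8):
    tolerated = 0
    while tolerated + 1 < nodes and surviving_copies(tolerated + 1) >= 1:
        tolerated += 1
    print(nodes, "node cluster tolerates", tolerated, "simultaneous node failures")
# 4-, 6- and 8-node clusters all tolerate 2 failures with 3-way mirroring
```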
I work at HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
