StoreVirtual Storage

chrisgatguis
Occasional Visitor

Over Provisioned storage at 95% but writes failing

Hi, 

I've hit 95% of the capacity of our SAN (P4500 G2 x 6 shelves x 12 disks, RAID 10 @ 19.9TB total usable space).

 

I've got two volumes, both thin provisioned, totalling 18.9TB provisioned.

It's telling me I still have 995GB available - however our VMware hosts have lost access to both disks and the associated VMFS datastores.

It's caused a knock-on effect on a couple of hosts, but I'll deal with that via VMware.

 

My only thought is that when the thin-provisioned disk tries to expand into the available space, it is trying to do so in a chunk larger than 995GB?

Is this a possibility, and is there any way I can get the volumes working again? I could then delete some actual VMware disks that reside on the VMFS datastores (both of these volumes are only used by our backup server), but at the moment I'm just unable to do anything within the volumes to clear out some space.

I could probably run an unmap command from VMware, but in the current state I can't even get access to the volumes!

Thanks

chris

4 REPLIES
Mukesh2
Advisor

Re: Over Provisioned storage at 95% but writes failing

Hi Chris,

Thank you for reaching out to HPE!

Please note that 95% is the threshold set on StoreVirtual to ensure data integrity on volumes.
Once the cluster reaches the 95% mark, the lack of available space can lead to reads/writes failing, which raises the risk of data corruption.
There are a few options that can be tried, provided the prerequisites are met.

Action Plan :

1. As the volumes are thin provisioned, please check if they are NR10 (2-way replicated).
2. If they are NR10 and not unreplicated NR0, then you can convert either of these volumes to NR0 for now (a rough command-line sketch follows below).
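
For reference, this is a minimal sketch of the replication change using the CLiQ command line. The command and parameter names here are my assumption of the usual SAN/iQ syntax, so please verify them against the CLI guide for your LeftHand OS version; the same change can also be made in the CMC by editing the volume's Data Protection Level.

  # Assumed CLiQ syntax - verify against your SAN/iQ CLI reference before running
  # Drop a volume to Network RAID-0 (no replication) to temporarily free space
  cliq modifyVolume volumeName=<volume-name> replication=1 login=<VIP-or-node-IP> userName=<admin-user> passWord=<password>

  # Later, once space has been reclaimed, restore Network RAID-10 (2-way replication)
  cliq modifyVolume volumeName=<volume-name> replication=2 login=<VIP-or-node-IP> userName=<admin-user> passWord=<password>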

The above action will give the cluster some breathing space and bring the utilization down to an acceptable level.
The volumes will become accessible again, so you should take a backup immediately.
As soon as that's done, you can start the cleanup by removing unwanted data from the datastores.
While you are doing so, please check if space reclamation is enabled on the management group your StoreVirtual nodes are running in.
If not, enable it and then run the unmap command on the ESXi host to reclaim space.
If you are running ESXi 6.5, it runs automatic unmap provided the datastores are VMFS 6; otherwise automatic unmap will not work.
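
As a rough guide to the host-side steps (assuming ESXi 5.5 or later and the standard esxcli syntax; substitute your own datastore name), it would look something like this:

  # List datastores and their VMFS version (VMFS-5 needs a manual unmap; VMFS-6 on ESXi 6.5 reclaims automatically)
  esxcli storage filesystem list

  # Manual space reclaim on a VMFS-5 datastore, reclaiming 200 blocks per pass
  esxcli storage vmfs unmap --volume-label=<datastore-name> --reclaim-unit=200

  # On ESXi 6.5 with VMFS-6, check or adjust the automatic reclaim setting instead
  esxcli storage vmfs reclaim config get --volume-label=<datastore-name>
  esxcli storage vmfs reclaim config set --volume-label=<datastore-name> --reclaim-priority=low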

Once you have reclaimed space on the blocks and can see enough free space to accommodate changing the volume back to NR10, you can proceed with that.

As there are quite a few activities involved here, I would request that you raise a case with HPE support for assistance.

NOTE : ACTIVITIES SUGGESTED INVOLVE DATA, HENCE PLEASE RAISE A REQUEST WITH HPE SUPPORT TO PERFORM THESE STEPS.

Regards,
Mukesh

I am an HPE employee


chrisgatguis
Occasional Visitor

Re: Over Provisioned storage at 95% but writes failing

Hi Mukesh, 

I think it is already reporting an out-of-space condition to VMware, so I'm unable to 'work' with the VMFS datastore at all to allow me to run the unmap command.

Is there any way of doing the unmap command locally on the storage rather than via VMware?

Sadly our volumes are both at 'Data Protection Level' Network RAID-0 (None) - I assume this is what you are referring to below.

Underneath the volumes, the underlying storage is RAID 10 at the disk level, which is why I believe this was configured this way.

 

It is just frustrating because I do believe there is still almost 1TB of available space there, but VMware thinks it has all run out. When using thin volumes, does the consumed space just grow naturally, i.e. in normal increments, or does it in any way attempt to pre-assign storage in chunks?

I just can't understand why it's showing 995GB of available space and yet VMware is seeing:

2018-10-26T10:05:40.819Z cpu24:33687)ScsiDeviceIO: 2338: Cmd(0x412eca73c780) 0xfe, CmdSN 0x27fe from world 39493 to dev "naa.6000eb3ef94ed5b200000000001310ae" failed H:0x0 D:0x2 P:0x8 Possible sense data: 0x7 0x27 0x7.
2018-10-26T10:05:40.871Z cpu24:33510)ScsiDeviceIO: 2338: Cmd(0x412ed3ad3c40) 0xfe, CmdSN 0x2800 from world 39493 to dev "naa.6000eb3ef94ed5b200000000001310ae" failed H:0x0 D:0x2 P:0x8 Possible sense data: 0x7 0x27 0x7.
2018-10-26T10:05:40.922Z cpu24:33510)ScsiDeviceIO: 2338: Cmd(0x412ed14ccc80) 0xfe, CmdSN 0x2802 from world 39493 to dev "naa.6000eb3ef94ed5b200000000001310ae" failed H:0x0 D:0x2 P:0x8 Possible sense data: 0x7 0x27 0x7.
2018-10-26T10:05:40.922Z cpu2:39493 opID=3e9664f4)HBX: 850: Setting pulse [HB state abcdef02 offset 3542528 gen 1 stampUS 15131395340190 uuid 5aec03fc-942704c8-d5ae-6cc2173cb540 jrnl <FB 0> drv 14.60] on vol 'HP-SATA-BACKUP' failed: No space left on dev$


oikjn
Honored Contributor

Re: Over Provisioned storage at 95% but writes failing

What does it show in the CMC? Do you have, or are you trying to use, snapshots?

 

I generally suggest that everyone who uses thin volumes (suggested) and overprovisioning (also typical) should create at least one thick-provisioned LUN large enough to save you in a situation like this, where you weren't paying attention and the SAN filled up completely. Then it takes all of 30 seconds to delete that LUN and regain write access to the other LUNs. Just make sure that the fully provisioned "emergency space" LUN is large enough that deleting it gives you enough time to either delete/migrate data off the SAN to free up space OR find and add another node into the system.
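
As a rough sketch of what I mean (the CLiQ parameter names here are from memory, so check them against the CLI guide for your SAN/iQ version - you can of course just create and delete the LUN in the CMC instead):

  # Assumed CLiQ syntax - verify before running
  # Create a fully provisioned (thick) 1TB "emergency space" LUN; never present it to any host
  cliq createVolume volumeName=emergency-space clusterName=<cluster-name> size=1TB thinProvision=0 replication=2 login=<VIP-or-node-IP> userName=<admin-user> passWord=<password>

  # When the SAN fills up, delete it to immediately release the reserved space
  cliq deleteVolume volumeName=emergency-space login=<VIP-or-node-IP> userName=<admin-user> passWord=<password>

Size the emergency LUN for however much headroom you realistically need to buy yourself.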

If you are 100% out of space AND there is nothing you can delete, you either have to switch a LUN to NR0 for long enough to free up space, or spin up a VSA on any hardware you can find that has at least as much raw storage capacity as your existing nodes and add that to the cluster. Switching to NR0 would be faster, but it does put your data at risk.

 

Mukesh2
Advisor

Re: Over Provisioned storage at 95% but writes failing

Hi Chris,

Apologies for the delayed response. The storage has run out of space because it has exceeded the 95% threshold.

As mentioned in my earlier post, this threshold is there to protect data integrity. If the threshold is exceeded there is a higher chance of data corruption, which is why the volumes become inaccessible the moment it is exceeded.

I would request that you raise a case with HPE support to check if anything can be done, which I doubt, as you have NR0 volumes (thinly provisioned, if I am not wrong).

Regards,

Mukesh

I am an HPE employee
