Storevirtual P4500 G2 Manager stuck in Starting/Stopping

 
Galtster
Occasional Visitor


Hi,

I have an interesting case that I haven't been able to solve after scouring the internet and this forum for a solution.

Background: We are running 3 x HP StoreVirtual (formerly LeftHand) P4500 G2 in a single cluster, with RAID 5 on each chassis and Network RAID 10 across the SANs on a single volume. These SANs are used for backups, which means I can reboot them during the day for troubleshooting, but I need to keep the information stored on them. I can't wipe them.

About a week ago we experienced a disk failure on one of the SANs and immediately started getting emails from the management group that the affected SAN was overloaded and that the manager was having problems. Over the space of a couple of hours we kept getting emails from the management group that the volume was UP, then down, then that the SAN was overloaded. Eventually the messages stopped and the faulty SAN was left in the state "Not Ready", but the volume was "UP". When I checked CMC, I could see that the manager on that SAN was not running, which seemed strange to me. My understanding is that a single disk failure in a volume protected by RAID 5 shouldn't cause any issues at all.

My next step was to try to start the manager again on the affected SAN. Since then, the SAN has been stuck in the state "Starting/Stopping Manager" and, no matter what I try, I can't stop or start the manager. This means I can't remove the SAN and repair it, since it keeps telling me to stop the running manager on the affected SAN. I also get messages when I log in that the Management Group and the affected SAN do not agree about the managers that are being run in the Management Group.

Steps I have tried for fixing the issue:

1) Rebooted the affected SAN - no luck.

2) Rebooted the whole management group - no luck.

3) Checked and rechecked all the network connectivity. We are using jumbo frames, and pinging between all the SANs with an MTU of 9000 works without any issues (the kind of check I mean is sketched below). Nothing has changed on the networking side, so I am confident it is not a networking issue.
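For completeness, this is the sort of jumbo-frame check I mean (a minimal sketch, assuming a Linux management host with iputils ping; the node addresses are placeholders, and 8972 bytes is the largest ICMP payload that still fits in a 9000-byte frame once the 28 bytes of IP and ICMP headers are added; on Windows the equivalent flags are ping -f -l 8972):

```python
#!/usr/bin/env python3
"""Ping every node with a non-fragmentable, full-size jumbo frame."""
import subprocess

# Placeholder addresses -- replace with the storage NICs of your own nodes.
NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

PAYLOAD = 9000 - 28  # 8972 bytes: 9000-byte MTU minus IP (20) and ICMP (8) headers

for ip in NODES:
    # -M do -> prohibit fragmentation, so the ping fails if any hop drops to 1500 MTU
    # -s    -> ICMP payload size
    # -c 3  -> three probes per node
    result = subprocess.run(
        ["ping", "-c", "3", "-M", "do", "-s", str(PAYLOAD), ip],
        capture_output=True, text=True,
    )
    status = "jumbo frames OK" if result.returncode == 0 else "FAILED (check switch/NIC MTU)"
    print(f"{ip}: {status}")
```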

On a side note: after I realised that the manager was stuck in Starting/Stopping, I tried to reboot the affected SAN. At this stage I had NOT removed the faulty disk yet, as I didn't want to start the RAID rebuild until I got the SAN working again. The SAN didn't boot up; it got stuck at the point where it was loading the operating system and seemed to hang there for a very long time. In an act of desperation I removed the faulty drive (without replacing it with the new one) and rebooted the SAN again. This time it booted perfectly back into its operating system. The SAN then resynced with the rest of the cluster and returned to an operational state. However, the manager on the affected SAN was still stuck in the Starting/Stopping state.

I decided to replace the faulty disk to allow the SAN to rebuild the RAID 5 array; 26 hours later the rebuild had finished. My initial thought was that something on the faulty disk had caused the SAN to stop working properly (again, this shouldn't happen), but after the rebuild completed successfully I tried rebooting the single SAN to see if that would fix the manager, and then rebooted the whole Management Group as a last attempt.

Nothing has worked.

Is there some way to delete the settings of the faulty manager via CLI and let it resync with the rest of the Management Group?
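For reference, this is the kind of thing I was hoping to script (a rough sketch only; I have not confirmed that our SAN/iQ release exposes these CLIQ verbs, so treat the command names, the login=/userName=/passWord= parameters and the node address as assumptions to be checked against the CLIQ user guide):

```python
#!/usr/bin/env python3
"""Sketch of forcing a manager stop/start on one node via the CLIQ CLI.

Assumptions (verify against the CLIQ documentation for your SAN/iQ version):
  * `cliq` is on PATH on the management host,
  * the stopManager / startManager / getNodeInfo verbs and the
    login=/userName=/passWord= parameters behave as in the CLIQ user guide,
  * 10.0.0.11 stands in for the affected node's IP.
"""
import subprocess

NODE_IP = "10.0.0.11"   # placeholder for the affected node
USER = "admin"          # placeholder credentials
PASSWORD = "changeme"

def cliq(verb: str) -> None:
    """Run a single CLIQ command against the node and show its output."""
    cmd = ["cliq", verb, f"login={NODE_IP}", f"userName={USER}", f"passWord={PASSWORD}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"$ cliq {verb} ...")
    print(result.stdout or result.stderr)

# Try to stop the wedged manager, then start it again so it can rejoin the
# management group; getNodeInfo before and after shows whether the state changed.
cliq("getNodeInfo")
cliq("stopManager")
cliq("startManager")
cliq("getNodeInfo")
```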

1 REPLY
oikjn
Honored Contributor

Re: Storevirtual P4500 G2 Manager stuck in Starting/Stopping

First... a SAN is the entire storage network. You appear to be calling your NODES a SAN; they are nodes, each of which is a member of a cluster within a management group, and all of that together makes up your SAN.

I don't have experience with that specific hardware, but I've seen disks that are going bad cause some funky things with nodes. They are not unhealthy enough to be taken offline, but troubled enough to affect the other nodes. It's rare, but I've seen it happen. Are you having quorum issues? If you leave the troubled node offline, do the other nodes in the cluster keep functioning with the LUN in a warning state? If so, I'd suggest you turn the bad node off, replace the bad disk and then turn it back on to rebuild. If the node still doesn't join back correctly, I would suggest you reset that node completely and do a node exchange, swapping the reformatted node in for the RIP placeholder node that you will see in CMC when you bring the reset node back online with the same MAC and IP address.
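To make the quorum point concrete, here's a minimal sketch of the majority rule managers follow (a management group needs more than half of its configured managers running to hold quorum; the three-manager counts below just mirror the setup described above):

```python
def has_quorum(total_managers: int, running_managers: int) -> bool:
    """A management group keeps quorum only while a strict majority
    of its configured managers is running."""
    needed = total_managers // 2 + 1
    return running_managers >= needed

# Three managers (one per node), one node's manager wedged:
# 2 of 3 still running -> majority held, the volumes can stay online.
print(has_quorum(3, 2))  # True
# If a second manager drops, quorum is lost and the volumes go offline.
print(has_quorum(3, 1))  # False
```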