StoreVirtual Storage

Cache error and Node Repair

 
SOLVED
JMan1
Advisor

Cache error and Node Repair

Hi all,

Running CMC v10.5 with SAN/IQ 9.5.  Unable to upgrade due to some legacy nodes.  Multisite setup.

I had to take one of our nodes down yesterday, and when I powered it back on it didn't rejoin the cluster. I was getting a Cache Status 'Corrupt' error. I have replaced the RAID card, battery, and cache module, but the error was still there.

So I selected Repair Storage System. That left a ghost placeholder for that storage system in the cluster. I was able to get the node back online by reconfiguring the RAID on that node.

So: I have the node back online, but now it is outside the cluster with the old one sitting there as a "ghost". When I try to exchange it, I get "There is a ghost storage system in this cluster.  To avoid restriping the entire cluster, exchange the placeholder system with a new or repaired system".

So I select the ghosted one and then the "repaired" one from the selection list (it is the only one). It tells me that at least one object has an alarm and asks if I want to continue. I continue. Then I get "This operation will cause cluster to change from a Multi-Site cluster to a standard cluster.  Are you sure?"

What happens if it changes from a Multi-site cluster to a standard cluster?  And can I go back to a multi-site?  I am not sure why it is trying to change the cluster type.

Any ideas on options here?  Thanks.

 

oikjn
Honored Contributor

Re: Cache error and Node Repair

You forgot to set the location of the "new" node. Once it is in the management group, you can add that node back into the correct site, and then you should be good to do the swap.

 

I believe you want to do the "node exchange" option and exchange the available one with the RIP one. 

 

Keep in mind that if you want to speed up the rebuild, or if you are running into production performance problems, you can adjust the restripe speed by setting the management speed under the properties of the management group.

JMan1
Advisor

Re: Cache error and Node Repair

That node is already in the management group for the SAN; it never left the management group. When I selected Repair, it moved the node out of the cluster to just below the management group in the tree and "ghosted" that node's IP in the cluster itself.

Where do I set location?  Are you referring to the Site location?

JMan1
Advisor

Re: Cache error and Node Repair

So, these are the steps I originally took but I am unable to exchange it with the ghost storage system.  I know it is long.  This was taken from the Storevirtual 4000 User Manual, starting on pg. 138.

----------------------------

Repairing a storage system
Repairing a storage system allows you to replace a failed disk in a storage system that contains
volumes configured for data protection levels other than Network RAID-0, and trigger only one
resynchronization of the data, rather than a complete restripe. Resynchronizing the data is a shorter
operation than a restripe.
Because of the data protection level, removing and returning the storage system to the cluster would
normally cause the remaining storage systems in the cluster to restripe the data twice—once when
the storage system is removed from the cluster and once when it is returned.
However, the Repair Storage System feature creates a placeholder in the cluster, in the form of a
“ghost” storage system. This ghost storage system keeps the cluster intact while you remove the
storage system, replace the disk, configure RAID, and return the storage system to the cluster. The
returned storage system only has to resynchronize with the other two storage systems in the cluster.

• The volume must have Network RAID-10, Network RAID-10+1, Network RAID-10+2, Network
RAID-5, or Network RAID-6.
• The storage system must display the blinking red and yellow triangle in the navigation window.
A disk inactive or disk off event appears in the Events list, and the Status label in the tab
window shows the failure.
• If the storage system is running a manager, stopping that manager must not break quorum.
1. If the storage system is running a manager, stop the manager. See “Stopping managers”
(page 113).
2. Right-click the storage system, and select Repair Storage System.
3. From the Repair Storage System window, select the item that describes the problem to solve.
Click More for more detail about each selection.
• Repair a disk problem
If the storage system has a bad disk, be sure to read “Replacing a disk” (page 41) before
beginning the process.
• Storage system problem
Select this choice if you have verified that the storage system must be removed from the
management group to fix the problem. For more information about using Repair Storage
System with a disk replacement, see “Replacing disks” (page 243).
• Not sure
This choice offers the opportunity to confirm whether the storage system has a disk problem
by opening the Disk Setup window so that you can verify disk status. As with repairing
a disk problem, be sure to plan carefully for a disk replacement.
4. Click OK.
The storage system leaves the management group and moves to the Available Systems pool.
A placeholder, or “ghost” storage system, remains in the cluster. It is labeled with the IP address
instead of the host name, and a special icon.
5. Replace the disk in the storage system and perform any other physical repairs.
Depending on the model, you may need to power on the disk and reconfigure RAID. See
“Replacing a disk” (page 41).
6. Return the repaired storage system to the management group.
The ghost storage system remains in the cluster.
NOTE: The repaired storage system will be returned to the cluster in the same place it
originally occupied to ensure that the cluster resyncs, rather than restripes. See “Glossary”
(page 262) for definitions of restripe and resync.
7. [Optional] Start a manager on the repaired storage system.
8. Use the Exchange Storage System procedure to replace the ghost storage system with the
repaired storage system. See “Exchange a storage system in a cluster” (page 136).

------------------------------------------------------------------------------------
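As a back-of-the-envelope illustration of the resync-vs-restripe savings the manual describes above (the function names and all numbers here are hypothetical, just to show the shape of the trade-off):

```python
# Illustrative sketch only -- not SAN/iQ code. Real data movement depends on
# the Network RAID level, cluster size, and volume layout.

def restripe_cost(n_systems: int, data_per_system_tb: float) -> float:
    # A restripe rewrites the stripe layout across every system in the
    # cluster, and without the ghost placeholder it happens twice: once
    # when the system is removed and once when it is returned.
    return 2 * n_systems * data_per_system_tb

def resync_cost(data_per_system_tb: float) -> float:
    # With the ghost placeholder, the returned system only copies back
    # the data it missed while it was out -- a single resync.
    return data_per_system_tb

# Hypothetical 10-node cluster with 1 TB of striped data per node:
print(restripe_cost(10, 1.0))  # 20.0 (TB rewritten across two restripes)
print(resync_cost(1.0))        # 1.0 (TB copied in one resync)
```

Which is why the GUI pushes you to exchange the placeholder rather than let the cluster restripe.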

So step 8 seems to be where I am stuck because I can't return it to the cluster.  The only thing that I can see that is different is that the node never left the management group.  It left the cluster but is still listed in the management group.  I don't know if that is the issue right now.  Should I manually force it out of the management group?

oikjn
Honored Contributor

Re: Cache error and Node Repair

Your last msg was too dense for me :)

 

The ghost node is just a placeholder.  You do not do anything with it and once the real node is back in the system it will go away.

 

If you did reset the node, do you know its IP address?

1. Scan for available systems using that IP address; you should see the node listed as available.

2. Upgrade the node to match the software level of the management group, and name it what you want.

3. Join the available node to the management group.

4. Once it has joined the group, set its site location to the correct site.

5. Right-click the cluster and select "exchange nodes", where you will select the dead node and the replacement node, and it will do the swap.

While the swap is going on, the RIP node will still show in CMC, but it will go away on its own once the exchange completes, without you doing anything further to it.

 

oikjn
Honored Contributor

Re: Cache error and Node Repair

if you are really paranoid and want to test this, you can create a couple VSAs with very small storage sizes and then test everything to see it for yourself.

 

I've done this process many times using VSA nodes (CMC treats them the same as hardware nodes).  I found this was the best/fastest way for me to upgrade nodes when I upgrade a host instead of replacing a host.

JMan1
Advisor

Re: Cache error and Node Repair

Yeah, I know.  It was long...

The node was never reset. The RAID was reconfigured. However, it never removed itself from the management group and still shows up there. It looks like it will let me exchange it, but it says it will convert the cluster to a standard cluster instead of multi-site in order to do that, and I need to keep it multi-site.

I am thinking I just need to manually remove it from the management group and then try to add it back in. Thoughts? It is already showing as offline, so I don't think removing it from the management group will hurt anything at this point.

Yes, a bit concerned because this holds our VMs and enterprise data.

oikjn
Honored Contributor

Re: Cache error and Node Repair

Check the "Availability" tab on the node to verify that no LUN is listed on the failed node.

If nothing shows there, make sure it isn't running as a manager. If it is, stop the manager and then adjust the number of running managers to keep quorum correct (since you said multi-site, I assume you normally run 5 managers, so you should now run 3, or install a FOM as a temporary stand-in for a manager if you don't want to run 3).
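The quorum arithmetic behind that advice can be sketched like this (illustrative only, not SAN/iQ code):

```python
# Illustrative sketch: a management group stays online while a strict
# majority of its managers (regular or FOM) are running.

def quorum_needed(total_managers: int) -> int:
    """Smallest number of running managers that still forms a majority."""
    return total_managers // 2 + 1

def has_quorum(running: int, total: int) -> bool:
    return running >= quorum_needed(total)

# With the usual multi-site count of 5 managers, quorum is 3,
# so stopping one manager (4 still running) is safe:
print(quorum_needed(5))   # 3
print(has_quorum(4, 5))   # True
# Reconfiguring down to 3 managers keeps quorum reachable at 2:
print(quorum_needed(3))   # 2
```

Hence the suggestion to run 3 managers, or add a FOM, while the repaired node is out.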

 

Once all that is set, you can pretty much do anything you want with the node without too much concern. 

Personally, I would remove it from the management group and then re-add it to the management group to make sure it has a clean configuration.  Then all you should have to do is add the node to the correct site and you shouldn't get the warning you are seeing.

 

The warning you are seeing is all about the site assignment of the node. Have you checked the sites to verify they are configured correctly? My guess is the site list is where you will find the problem, and you could fix it without removing the node from the management group, but personally I would do that extra step anyway, just to make sure the node is fresh and doesn't carry over any issue from the reconfiguration you did.

 

 

 

JMan1
Advisor

Re: Cache error and Node Repair

Thanks for the reply. No, there is nothing listed in the Availability tab for that node.

I removed it as a manager prior to putting it into repair mode and set a different node as a manager. Yes, I normally have 5 managers running.

After your comments and my own perusing of the manual, I agree that I should remove it and re-add it.

----------------------------------------------------------

On a similar topic: if I check the Availability tab on each of my other nodes (I have 10 total), only one of them shows volumes that would become unavailable, and that one lists all of my volumes. It is not the one being repaired. Is that normal? It seems to me that availability should be spread out among all of the nodes. Or am I misinterpreting?

JMan1
Advisor

Re: Cache error and Node Repair

OK, so I removed it from the management group and added it back in. When I tried to exchange it with the ghost system, I got exactly the same thing as in my first post:

It tells me that at least one object has an alarm and do I want to continue.  I continue.  Then I get "This operation will cause cluster to change from a Multi-Site cluster to a standard cluster.  Are you sure?"