StoreVirtual Storage

Cache error and Node Repair

 
JMan1
Advisor


Hi all,

Running CMC v10.5 with SAN/IQ 9.5.  Unable to upgrade due to some legacy nodes.  Multisite setup.

I had to take one of our nodes down yesterday, and when I tried to bring it back up, it didn’t come back. I was getting a Cache Status ‘Corrupt’ error. I have replaced the RAID card, battery and cache, but the error was still there.

So I selected Repair storage system.  That left a ghost place holder for that storage system in the cluster.  I was able to get the node back online by reconfiguring the RAID on that node.

So, I have the node back online, but now it is outside the cluster with the old one sitting there as a “ghost”. When I try to exchange it, I get "There is a ghost storage system in this cluster.  To avoid restriping the entire cluster, exchange the placeholder system with a new or repaired system".

So I select the ghosted one and then the “repaired” one from the selection list (it is the only one).  It tells me that at least one object has an alarm and do I want to continue.  I continue.  Then I get "This operation will cause cluster to change from a Multi-Site cluster to a standard cluster.  Are you sure?"

What happens if it changes from a Multi-site cluster to a standard cluster?  And can I go back to a multi-site?  I am not sure why it is trying to change the cluster type.

Any ideas on options here?  Thanks.

 

16 REPLIES
oikjn
Honored Contributor

Re: Cache error and Node Repair

You forgot to set the location of the "new" node. Once it is in the management group, you can add that node back into the correct site, and then you should be good to do the swap.

 

I believe you want to do the "node exchange" option and exchange the available one with the RIP one. 

 

Keep in mind that if you want to go faster on the rebuild or if you are running into production performance problems, you can adjust the speed of the restripe by setting the management speed under the properties of the management group.

JMan1
Advisor

Re: Cache error and Node Repair

That node is already in the management group for the SAN. It never left the management group. When I selected to repair it, it was moved out of the cluster to just below the management group tree node, and the IP of that node was "ghosted" in the cluster itself.

Where do I set location?  Are you referring to the Site location?

JMan1
Advisor

Re: Cache error and Node Repair

So, these are the steps I originally took, but I am unable to exchange it with the ghost storage system. I know it is long. This was taken from the StoreVirtual 4000 User Manual, starting on pg. 138.

----------------------------

Repairing a storage system
Repairing a storage system allows you to replace a failed disk in a storage system that contains
volumes configured for data protection levels other than Network RAID-0, and trigger only one
resynchronization of the data, rather than a complete restripe. Resynchronizing the data is a shorter
operation than a restripe.
Because of the data protection level, removing and returning the storage system to the cluster would
normally cause the remaining storage systems in the cluster to restripe the data twice—once when
the storage system is removed from the cluster and once when it is returned.
However, the Repair Storage System feature creates a placeholder in the cluster, in the form of a
“ghost” storage system. This ghost storage system keeps the cluster intact while you remove the
storage system, replace the disk, configure RAID, and return the storage system to the cluster. The
returned storage system only has to resynchronize with the other two storage systems in the cluster.

• The volume must have Network RAID-10, Network RAID-10+1, Network RAID-10+2, Network
RAID-5, or Network RAID-6.
• The storage system must display the blinking red and yellow triangle in the navigation window.
A disk inactive or disk off event appears in the Events list, and the Status label in the tab
window shows the failure.
• If the storage system is running a manager, stopping that manager must not break quorum.
1. If the storage system is running a manager, stop the manager. See “Stopping managers”
(page 113).
2. Right-click the storage system, and select Repair Storage System.
3. From the Repair Storage System window, select the item that describes the problem to solve.
Click More for more detail about each selection.
• Repair a disk problem
If the storage system has a bad disk, be sure to read “Replacing a disk” (page 41) before
beginning the process.
• Storage system problem
Select this choice if you have verified that the storage system must be removed from the
management group to fix the problem. For more information about using Repair Storage
System with a disk replacement, see “Replacing disks” (page 243).
• Not sure
This choice offers the opportunity to confirm whether the storage system has a disk problem
by opening the Disk Setup window so that you can verify disk status. As with repairing
a disk problem, be sure to plan carefully for a disk replacement.
4. Click OK.
The storage system leaves the management group and moves to the Available Systems pool.
A placeholder, or “ghost” storage system, remains in the cluster. It is labeled with the IP address
instead of the host name, and a special icon.
5. Replace the disk in the storage system and perform any other physical repairs.
Depending on the model, you may need to power on the disk and reconfigure RAID. See
“Replacing a disk” (page 41).
6. Return the repaired storage system to the management group.
The ghost storage system remains in the cluster.
NOTE: The repaired storage system will be returned to the cluster in the same place it
originally occupied to ensure that the cluster resyncs, rather than restripes. See “Glossary”
(page 262) for definitions of restripe and resync.
7. [Optional] Start a manager on the repaired storage system.
8. Use the Exchange Storage System procedure to replace the ghost storage system with the
repaired storage system. See “Exchange a storage system in a cluster” (page 136).

------------------------------------------------------------------------------------

So step 8 seems to be where I am stuck because I can't return it to the cluster.  The only thing that I can see that is different is that the node never left the management group.  It left the cluster but is still listed in the management group.  I don't know if that is the issue right now.  Should I manually force it out of the management group?

oikjn
Honored Contributor

Re: Cache error and Node Repair

your last msg was too dense for me :)

 

The ghost node is just a placeholder.  You do not do anything with it and once the real node is back in the system it will go away.

 

If you did reset the node, do you know its IP address?

Scan for available nodes using that IP address and you should see a node listed as available.

Upgrade the node to match the software level of the management group, name it what you want, and then...

Join that available node to the management group.

Once joined to the group, you should set the site location to be what you want it.

Once the site is correct, you can right click on the cluster and select "exchange nodes", where you will pick the dead node and the replacement node, and then it will do the swap.

While the swap is going on, the RIP node will show in CMC, but it will go away once the exchange is complete without you doing anything further to it.
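
The order of those steps matters, so here is a minimal sketch of them as an ordered checklist. It is plain Python, not a CMC or CLI interface; the step wording and the `next_step` helper are illustrative assumptions drawn only from the list above.

```python
from typing import List, Optional

# The steps from the post above, in the order they must be done.
STEPS = [
    "find the node by IP address",
    "upgrade and name the node",
    "join the node to the management group",
    "assign the node to the correct site",
    "exchange the ghost for the node from the cluster menu",
]


def next_step(done: List[str]) -> Optional[str]:
    """Return the first step not yet done; reject steps done out of order."""
    for index, step in enumerate(STEPS):
        if index >= len(done):
            return step
        if done[index] != step:
            raise ValueError(f"expected {step!r} before {done[index]!r}")
    return None  # everything is done


# After joining the management group, the site assignment comes next,
# and only then the exchange.
print(next_step(STEPS[:3]))
```

Going straight from the join to the exchange raises an error in this sketch, because the site assignment has to come first.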

 

oikjn
Honored Contributor

Re: Cache error and Node Repair

If you are really paranoid and want to test this, you can create a couple of VSAs with very small storage sizes and then test everything to see it for yourself.

 

I've done this process many times using VSA nodes (CMC treats them the same as hardware nodes).  I found this was the best/fastest way for me to upgrade nodes when I upgrade a host instead of replacing a host.

JMan1
Advisor

Re: Cache error and Node Repair

Yeah, I know.  It was long...

The node was never reset. The RAID was reconfigured. However, it never removed itself from the management group and still shows up there. It looks like it will let me exchange it, but it says it will convert the cluster from Multi-Site to standard in order to do that, and I need to keep it Multi-Site.

I am thinking I just need to manually remove it from the management group and then try to add it back in. Thoughts? It is already showing as offline, so I don't think removing it from the management group will hurt it at this point.

Yes, a bit concerned because this holds our VMs and enterprise data.

oikjn
Honored Contributor

Re: Cache error and Node Repair

Check out the "availability" tab on the node to verify that no LUN is listed on the failed node.

If it doesn't show anything there, make sure it isn't running as a manager. If it is, stop the manager and then reconfigure the number of managers running so the group keeps quorum (since you said multi-site, I assume you normally run 5 managers, so now you should run 3, or install a FOM as a temporary stand-in for a manager if you don't want to run only 3).
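
The manager arithmetic behind that advice can be sketched as follows, assuming quorum means a strict majority of the managers configured in the management group. The function names are illustrative and not part of CMC.

```python
def quorum_size(managers: int) -> int:
    """Smallest number of managers that forms a strict majority."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can go offline while a majority remains."""
    return managers - quorum_size(managers)


# Compare the normal count (5), the count after stopping one manager (4),
# and the suggested reduced count (3).
for count in (5, 4, 3):
    print(count, quorum_size(count), tolerated_failures(count))
```

Under that assumption, four managers tolerate no more failures than three, which is why the choice is between dropping to 3 and adding a FOM to get back to an odd count.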

 

Once all that is set, you can pretty much do anything you want with the node without too much concern. 

Personally, I would remove it from the management group and then re-add it to the management group to make sure it has a clean configuration.  Then all you should have to do is add the node to the correct site and you shouldn't get the warning you are seeing.

 

The warning you show is all about the site assignment of the node. Have you checked the sites to verify they are configured correctly? My guess is the site list is where you will find the problem, and you could fix that without having to remove the node from the management group, but personally I would do that extra step just because it will make sure the node is fresh and doesn't have any holdover issue from the reconfiguration you did.

 

 

 

JMan1
Advisor

Re: Cache error and Node Repair

Thanks for the reply. No, there is nothing listed in the Availability tab for that node.

I removed it as a manager prior to putting it into repair mode and set a different node as manager. Yes, I normally have 5 managers running.

After your comments and my own perusing of the manual, I agree that I should remove it and re-add it.

----------------------------------------------------------

On a similar topic, if I select the Availability tab on each of my other nodes (I have 10 total), only one of them shows volumes that would become unavailable, and that one lists all of my volumes. This is not the node that is being repaired. Is that normal? It seems to me that availability should be spread out among all of the nodes. Or am I misinterpreting?

JMan1
Advisor

Re: Cache error and Node Repair

Ok, so I removed it from the management group and added it back in. When I tried to exchange it with the ghost system, I got the exact same thing as in my first post:

It tells me that at least one object has an alarm and do I want to continue.  I continue.  Then I get "This operation will cause cluster to change from a Multi-Site cluster to a standard cluster.  Are you sure?"

JMan1
Advisor

Re: Cache error and Node Repair

If I look at the site configuration, the node has already been added into the site even though I didn't add it.  So now i have 6 nodes in that site (should be 5).  I have the ghost one in there and the repaired node.

What happens if I remove the ghost node from the management group and just add the repaired node in without doing an exchange?

oikjn
Honored Contributor

Re: Cache error and Node Repair

I don't think you can remove the ghost node even if you try...  can't hurt to try.

 

You are 100% sure the "new" node is on the site?  If you didn't add it to it, it shouldn't be there.

Can you attach a screenshot of the node list for the cluster and the site list? 

 

Too much to read, but if the dead node is still in the cluster, you cannot/should not add the new node unless you do an exchange, as the stripe wouldn't be correct: the cluster would assume the dead node will come back at some point. Also, assuming the dead node IS still showing in the cluster, right click on it and select "repair storage system". If that wasn't done, it might not have triggered the condition where the cluster views that system as truly dead and not returning, instead of just missing. Once that is done, it should show the node as RIP:mac_address, at which point the procedure you are trying should work without the warning.

 

 

 

 

 

 

JMan1
Advisor

Re: Cache error and Node Repair

Ok, so here is what I did.  I removed that node from the management group with no problem, checked the site list for the second site and it wasn’t there.  I added it back in to the management group, checked the site list (Image 1), and it still wasn’t there.  What is still showing up in the site is the ghosted machine.  So far so good.

I right click on the cluster, select Edit cluster, and then Exchange Storage Systems.  I select the ghosted system from the list, then select the “Exchange Storage System…” button.  I then select the repaired unit as it is the only one in the list of available units and click OK.  When I click OK, I get the message shown in Image 2.

After I click OK, I get the message shown in Image 3.

Does this mean that the site will be deleted or the servers listed will be moved from that site?  I don’t want to do anything to change my sites so I click Cancel, then Cancel 2 more times to get out of this and get back to the CMC.

However, now, when I go into sites, SAN2 is listed in the secondary site, even though I cancelled out of the exchange. See Image 4.

And SAN2 is still sitting outside of the cluster with the ghosted machine still in there. See Image 5.

And yes, if I click on the ghosted node, it shows the MAC as RIP0-xx:xx:xx, where the MAC is the same as SAN2's.

oikjn
Honored Contributor
Solution

Re: Cache error and Node Repair

Before you attempt the exchange node action, you need to see SAN2 in the same site as the RIP node in the cluster. You did not mention doing that step after joining the management group, and I think img2 is a warning about that. The node you add must be in the same site as the missing node, or it will mess up the stripe pattern for redundancy and cause the system to do a total restripe, which is not ideal (understatement of the day).

 

That said, the warning in img4 references a single user... that is strange, and I don't know if/what user preferences are site specific. I don't see my users with any site-specific settings. I think as long as you are 100% sure of the site definitions for your systems in the management group, you should be fine to proceed with the node exchange. The 100% critical point is that you must see the new node and the old node listed in the same site BEFORE you initiate the cluster exchange node action.

Once you start that action, the RIP node will show outside the cluster but inside the management group, and the new node will show in the cluster with a caution until it completes its restripe of each LUN. Once that is done, the RIP node will disappear from the management group.
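
The same-site condition can be written down as a small check. This is only a model of the rule described above, not anything CMC exposes; the site names, the node names other than SAN2, and both helper functions are hypothetical.

```python
from typing import Dict, Optional, Set


def site_of(system: str, sites: Dict[str, Set[str]]) -> Optional[str]:
    """Return the site a storage system is assigned to, or None if unassigned."""
    for site, members in sites.items():
        if system in members:
            return site
    return None


def safe_to_exchange(ghost: str, replacement: str,
                     sites: Dict[str, Set[str]]) -> bool:
    """True only when both systems are assigned to the same site."""
    ghost_site = site_of(ghost, sites)
    return ghost_site is not None and ghost_site == site_of(replacement, sites)


# Hypothetical layout: the ghost still holds its slot in the secondary site,
# while the repaired node has rejoined the management group without a site.
sites = {
    "Primary": {"node-a", "node-b"},
    "Secondary": {"RIP-ghost", "node-c"},
}
print(safe_to_exchange("RIP-ghost", "SAN2", sites))  # not yet in the site

sites["Secondary"].add("SAN2")  # assign the repaired node to the ghost's site
print(safe_to_exchange("RIP-ghost", "SAN2", sites))
```

The first call models the state that produced the Multi-Site warning; the second models the state after the repaired node has been added to the ghost's site.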

 

Side note: the "SAN" is the entire network (i.e. the management group(s)); the systems aren't SANs, they are nodes inside the SAN. ;)

 

 

you do NOT want to add the node and if CMC

JMan1
Advisor

Re: Cache error and Node Repair

You are correct. I didn't add it to the site after joining the management group. The users in that image are servers, but LeftHand is calling them users.

Yes, I know that they are nodes. That is just the naming scheme that was chosen (for whatever reason), so we stuck with it. If I had it to do over again, they would be named differently.

Sorry, didn't understand your last sentence.

JMan1
Advisor

Re: Cache error and Node Repair

Yes!!!  You rule!!!

The correct solution was to add it to the site before exchanging it.  It is restriping now.

My last question is now I have the ghost node sitting outside the cluster but still within the management group.  Will that go away on its own or do I need to remove that from the Management Group?

oikjn
Honored Contributor

Re: Cache error and Node Repair

Glad to hear it's working :)

 

Yes, it will go away on its own when the restripe is complete.