
replacing a bad disk on SC10 array / FC60 (help plz)

 
Super Advisor


Hi All,

 

Kindly help me with the following issue:

 

One of our disks (4:0) on the SC10 array went bad, and after replacing it the disk's orange LED on the front panel is still on. In addition, the amdsp output shows the following (see also the attachment):

Disk State = REPLACED (instead of "OPTIMAL"), and the hot spare activity field shows "2:4 is sparing 4:0". What is the reason for this behavior, and how can it be fixed? Also, why is the sparing activity still ongoing, and why didn't the rebuild start automatically?

Note that we already know the battery is in a critical state and controller B is in a BAD state; we are waiting for spare parts to replace them.

 

Thanks in advance for your replies

Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Note that controller B is the owning controller for the LUN in which the affected disks reside.

 

   -----------------------------------------
   LUN       Status        Capacity   Ctrl  RAID  Segment  Disks
   ---  -----------------  ---------  ----  ----  -------  -----
     0  OPTIMAL              67.7 GB   A     5         16  1:0
                                                           3:0
                                                           5:0
                                                           1:1
                                                           3:1

     1  OPTIMAL             136.7 GB   B     5         16  2:0
                                                           4:0
                                                           6:0

     2  OPTIMAL             136.7 GB   A     5         16  2:1
                                                           4:1
                                                           6:1

Since controller B is not functioning correctly, I would wait until it is working before doing anything with the disks.  Its unknown status could be the reason the rebuild did not occur.  Right now the disk is being spared, so the LUN still has redundancy.  Performing tasks like trying to force a rebuild could cause more problems.
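
To check which LUN and owning controller a given disk belongs to, the LUN table quoted above can be scanned with a few lines of awk. This is only a sketch: lun_for_disk is a hypothetical helper name, and it assumes the column layout shown in the table (LUN number first, disk ID last, Ctrl fourth from the end). It reads text only and does not touch the array.

```shell
# lun_for_disk DISK: read the LUN table on stdin and print
# "LUN CTRL STATUS" for the LUN that contains DISK (for example 4:0).
# Hypothetical helper; assumes the column layout of the table in this thread.
lun_for_disk() {
    awk -v disk="$1" '
        # A row that opens a LUN entry starts with the LUN number.
        # Ctrl is taken as the 4th field from the end (Ctrl RAID Segment Disk).
        # Only the first word of the status is kept.
        $1 ~ /^[0-9]+$/ && NF >= 8 { lun = $1; status = $2; ctrl = $(NF - 3) }
        # Every data row, including continuation rows, ends with a disk ID.
        lun != "" && $NF == disk   { print lun, ctrl, status }
    '
}

# Example with part of the table from this post:
lun_for_disk 4:0 <<'EOF'
  LUN       Status        Capacity   Ctrl  RAID  Segment  Disks
  ---  -----------------  ---------  ----  ----  -------  -----
    0  OPTIMAL              67.7 GB   A     5         16  1:0
                                                          3:0
    1  OPTIMAL             136.7 GB   B     5         16  2:0
                                                          4:0
                                                          6:0
EOF
```

For disk 4:0 this prints the LUN number, controller, and status of the row it falls under, which here is LUN 1 on controller B.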

 

-Bob

Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Thanks a lot for your cooperation.

In addition, is there a way to temporarily move the LUN containing the failed disk to controller A? If so, could you please advise?

And once we receive the controller, what steps should be performed before replacing it? Since the LUN containing the replaced disk is owned by controller B, which is not in a GOOD state, I assume some steps are required before replacing the controller.

Once more, many thanks for your precious help

 

Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

OK, so first, I implore you to make sure you have a good backup of your data.  These FC60s can be a bit touchy when working on them.  Unfortunately, somewhat simple problems like this can quickly lead to a DEAD LUN.  OK, so you are warned!

 

You can transfer ownership of LUNs from one controller to another.  Use the amcfg command:

 

amcfg -M <LUN> -c <CtrlID> <ArrayID>
    For example:  To set the ownership of LUN 1 to controller A on array with ID "000800A0B809500A":
    # amcfg -M 1 -c A 000800A0B809500A
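
If more than one LUN has to move, the amcfg syntax above can be wrapped in a small dry-run helper. This is a minimal sketch: print_transfer_cmds is a hypothetical function name, and it only prints the commands so they can be reviewed before anything is typed on the host. The array ID is the example ID used above.

```shell
# Print (do not run) one amcfg ownership-transfer command per LUN.
# Usage: print_transfer_cmds CTRL ARRAY_ID LUN [LUN ...]
# Hypothetical helper; it only echoes the command form shown in this post.
print_transfer_cmds() {
    ctrl=$1
    array_id=$2
    shift 2
    for lun in "$@"; do
        echo "amcfg -M $lun -c $ctrl $array_id"
    done
}

# Example: show the command that would move LUN 1 to controller A.
print_transfer_cmds A 000800A0B809500A 1
```

Printing instead of executing is deliberate here, given how touchy these arrays are; each printed line can be pasted once the current ownership has been confirmed.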

 

Replacing an FC60 controller is typically straightforward and should be done "hot"; that is, with power on.

 - Remove the original controller.

 - Check to be sure the replacement controller has the same DIMM configuration as the original.

 - Install the replacement. 

 

The replacement should sync up with firmware and become active. Check the state with the amdsp -c or amdsp -a command.

 

You can then move the LUN ownership back as desired.

 

Good luck!

 

-Bob


 

 

Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

One more thing, please: while reading the FC60 Advanced User's Guide, I saw the ammgr command.

Would    ammgr -c AA <ArrayID>   make any change to the controller B status in this case? Please give me your opinion, as I do not want to type any command that could have negative consequences.

 

Many Thanks

Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Hi All,

 

I wanted to be 100% sure before concluding that controller B is defective, so I shut down the whole platform and swapped the physical locations of the two controllers, A and B.

The results I got can be found in the attachment.

 

Any suggestions?

Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Well, I would not have advised what you did.  Shutting down these arrays is really a last resort, especially with known problems.  I suppose it is lucky that your LUNs are still accessible.

 

Basically now you have:

 

 - A bad array controller, though possibly something more, since the problem stayed with location B.  Because the controllers were swapped with the array offline, the array does not see this as a controller replacement; it could simply be that the array controllers are now very confused.

 - A bad BCC controller in one of the SC10 enclosures.

 - The two bad disks are both on channel 4.  This could simply be a coincidence, or a failure due to the BCC controller.

 

A good thing is that you were able to get that LUN rebuild kicked off.  I would wait for that to complete before doing anything else.

 

The controller is what I would focus on first: try replacing it and see if the status recovers.  If that succeeds, continue with replacing the BCC.  If the drives are still marked as bad, then look to replace them as well.

 

If you cannot replace the hardware, you can try the following to fail and unfail the controller:

 - Transfer ownership of LUN 2 to controller A.

 - Attempt to "fail" the controller:  amutil -C b <arrayID>

 - If the command fails, then you will need to actually replace it.  If this works, unfail it as follows:  amutil -c b <arrayID>
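
The order of the steps above, including the stop condition, can be sketched as a small script. This is an illustration only: run and fail_unfail_ctrl_b are hypothetical helper names, the amcfg and amutil invocations are copied as written in this thread, and DRY_RUN defaults to 1 so the commands are printed rather than executed.

```shell
# run CMD...: print the command in dry-run mode (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# fail_unfail_ctrl_b LUN ARRAY_ID
# Hypothetical wrapper: move LUN to controller A, fail controller b,
# and unfail it only if the fail step succeeded.
fail_unfail_ctrl_b() {
    lun=$1
    array_id=$2
    run amcfg -M "$lun" -c A "$array_id" || return 1
    if run amutil -C b "$array_id"; then
        run amutil -c b "$array_id"
    else
        echo "fail step did not succeed; the controller needs replacing" >&2
        return 1
    fi
}

# Example (dry run): LUN 2 as in the list above, example array ID from this thread.
fail_unfail_ctrl_b 2 000800A0B809500A
```

In dry-run mode this prints the three commands in order. Setting DRY_RUN=0 would execute them, which should only be considered once the rebuild has completed and a backup exists.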

 

Oh, and the ammgr -c AA command simply sets the array controllers to an Active/Active status.  This would only need to be done if one controller was in a "Passive" state.  I don't think "Unknown" would qualify for this.

 

-Bob

Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Hi Bob, and once again thanks for your precious help.

Just to let you know: after restarting the whole platform and after the rebuild completed, the orange LED on the recently replaced disk went off and the disk's LED turned green. It is really strange. Could you please explain what happened? I would really like a logical explanation for all this.

 

Thanks in advance

 

Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Well, I would say that the BCC controller was problematic and/or the problematic array controller was affecting the ability of the LUN to automatically perform the rebuild.  By resetting the array, the controller was able to start the rebuild process that was "hung".

 

As I stated, these arrays can be a bit tricky to work with.

 

-Bob

Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Thanks Bob once again.

I attached the latest output, received on 1st September 2011. As you will see in the amdsp -a output, the state of both disks 4:0 and 4:1 is "NO RESPONSE" even though the green LED is on for both disks.

Any suggestions?