Disk Arrays

replacing a bad disk on SC10 array / FC60 (help plz)

meekrob
Super Advisor

replacing a bad disk on SC10 array / FC60 (help plz)

Hi All,

 

kindly help me with the following issue:

 

One of our disks (4:0) on the SC10 array went bad. After replacing it, the disk's orange LED on the front panel is still on. In addition, the amdsp output shows the following (see also the attachment):

Disk State = REPLACED (instead of "OPTIMAL"), and the hot spare activity field shows "2:4 is sparing 4:0". What is the reason for this behavior, and how can it be fixed? Also, why is there always sparing activity, and why didn't the rebuild start automatically?

Note that we already know the battery is in critical status and controller B is in BAD status; we are waiting for spare parts to replace them.

 

Thanks in advance for your replies

16 REPLIES
Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Note that controller B is the owning controller for the LUN in which the affected disks reside.

 

   -----------------------------------------
   LUN       Status        Capacity   Ctrl  RAID  Segment  Disks
   ---  -----------------  ---------  ----  ----  -------  -----
     0  OPTIMAL              67.7 GB   A     5         16  1:0
                                                           3:0
                                                           5:0
                                                           1:1
                                                           3:1

     1  OPTIMAL             136.7 GB   B     5         16  2:0
                                                           4:0
                                                           6:0

     2  OPTIMAL             136.7 GB   A     5         16  2:1
                                                           4:1
                                                           6:1

Since controller B is not functioning correctly, I would wait until it's working before doing anything with the disks.  Its unknown status could be the reason the rebuild did not occur.  Right now the disk is being spared, so the LUN still has redundancy.  Performing tasks like trying to force a rebuild could cause more problems.
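
To check which LUN and owning controller a given disk belongs to, the LUN table above can be parsed mechanically. The following is only a rough Python sketch: it assumes the column layout shown in this thread's amdsp output, and the helper names (`parse_luns`, `luns_for_disk`) are made up for illustration, not part of any HP tool.

```python
import re

# Sample copied from the amdsp LUN table quoted in this thread.
SAMPLE = """\
   LUN       Status        Capacity   Ctrl  RAID  Segment  Disks
   ---  -----------------  ---------  ----  ----  -------  -----
     0  OPTIMAL              67.7 GB   A     5         16  1:0
                                                           3:0
                                                           5:0
                                                           1:1
                                                           3:1

     1  OPTIMAL             136.7 GB   B     5         16  2:0
                                                           4:0
                                                           6:0

     2  OPTIMAL             136.7 GB   A     5         16  2:1
                                                           4:1
                                                           6:1
"""

# First row of a LUN: number, status, capacity, controller, RAID, segment, first disk.
LUN_ROW = re.compile(
    r"^\s*(?P<lun>\d+)\s+(?P<status>[A-Z][A-Z ]*?)\s+(?P<cap>[\d.]+ [KMGT]B)\s+"
    r"(?P<ctrl>[AB])\s+(?P<raid>\S+)\s+(?P<seg>\d+)\s+(?P<disk>\d+:\d+)\s*$"
)
# Continuation row: only a channel:id disk address.
DISK_ROW = re.compile(r"^\s*(?P<disk>\d+:\d+)\s*$")


def parse_luns(text):
    """Return a list of dicts, one per LUN, with its owner and member disks."""
    luns = []
    for line in text.splitlines():
        m = LUN_ROW.match(line)
        if m:
            luns.append({
                "lun": int(m["lun"]),
                "status": m["status"],
                "ctrl": m["ctrl"],
                "disks": [m["disk"]],
            })
            continue
        m = DISK_ROW.match(line)
        if m and luns:
            luns[-1]["disks"].append(m["disk"])
    return luns


def luns_for_disk(luns, disk):
    """Return (lun, owning controller) pairs for LUNs containing the disk."""
    return [(l["lun"], l["ctrl"]) for l in luns if disk in l["disks"]]


print(luns_for_disk(parse_luns(SAMPLE), "4:0"))
```

With the table from this thread, disk 4:0 comes back as a member of LUN 1, owned by controller B.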

 

-Bob

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Thanks a lot for your cooperation.

In addition, is there a way to temporarily move the LUN containing the failed disk to controller A? If so, could you please advise?

And once we receive the controller, what steps should be performed before replacing it? Since the LUN containing the replaced disk is owned by controller B, which is not in a GOOD state, I assume some steps are required before replacing the controller.

Once more, many thanks for your precious help

 

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

OK, so first, I implore you to make sure you have a good backup of your data.  These FC60s can be a bit touchy when working on them.  Unfortunately, somewhat simple problems like this can quickly lead to a DEAD LUN.  OK, so you are warned!

 

You can transfer ownership of LUNs from one controller to another.  Use the amcfg command:

 

amcfg -M <LUN> -c <CtrlID> <ArrayID>
    For example:  To set the ownership of LUN 1 to controller A on array with ID "000800A0B809500A":
    # amcfg -M 1 -c A 000800A0B809500A

 

Replacing an FC60 controller is typically straight-forward and should be done "hot"; that is with power on.

 - Remove the original controller.

 - Check to be sure the replacement controller has the same DIMM configuration as the original.

 - Install the replacement. 

 

The replacement should sync up with firmware and become active. Check state with amdsp -c or amdsp -a command.

 

You can then move the LUN ownership back as desired.

 

Good luck!

 

-Bob


 

 

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

One more thing, please: while reading the FC60 Advanced User's Guide, I saw the ammgr command.

Does    ammgr -c AA <ArrayID>   make any change to the controller B status in that case? Please give me your opinion, as I do not want to run any command that could have negative consequences.

 

Many Thanks

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Hi All,

 

I wanted to be 100% sure before concluding that controller B is defective, so I shut down the whole platform and swapped the physical locations of the two controllers, A and B.

The results I got can be found in the attachment.

 

Any suggestions?

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Well, I would not have advised what you did.  Shutting down these arrays is really the last thing to do, especially with known problems.  I suppose it is lucky that your LUNs are still accessible.

 

Basically now you have:

 

 - A bad array controller (though maybe something more, as the problem stayed with location B.  The catch is that the swap was done offline, so the array did not see a controller replacement.  It could simply be that the array controllers are now very confused).

 - A bad BCC controller in one of the SC10 enclosures.

 - The two bad disks are both on channel 4.  This could simply be a coincidence, or a failure due to the BCC controller.

 

A good thing is that you were able to get that LUN rebuild kicked off.  I would wait for that to complete before doing anything else.

 

The controller is what I would focus on first...try replacing it and see if the status recovers.  If that succeeds, continue with replacing the BCC.  If the drives are still marked as bad, then look to replace them as well.

 

If you cannot replace the hardware, you can try the following to fail and unfail the controller:

 - Transfer ownership of LUN 2 to controller A.

 - Attempt to "fail" the controller:  amutil -C b <arrayID>

 - If the command fails, then you will need to actually replace it.  If this works, unfail it as follows:  amutil -c b <arrayID>
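
The three steps above can be written down as a dry-run plan before anything is typed on the host. This Python sketch only prints the command lines and executes nothing. The flags are taken exactly as quoted in this thread and are not independently verified; the array ID is the example ID from the earlier amcfg post, and the function name `fail_unfail_plan` is hypothetical.

```python
def fail_unfail_plan(array_id, lun, healthy_ctrl="A", suspect_ctrl="b"):
    """Build the command lines for the fail/unfail procedure, in order.

    Nothing is executed here; the result is a list of argv lists to review.
    Flags are copied from the thread, not checked against the manuals.
    """
    return [
        # 1. Move the LUN to the healthy controller.
        ["amcfg", "-M", str(lun), "-c", healthy_ctrl, array_id],
        # 2. Attempt to "fail" the suspect controller.
        ["amutil", "-C", suspect_ctrl, array_id],
        # 3. If step 2 worked, unfail it again.
        ["amutil", "-c", suspect_ctrl, array_id],
    ]


# Example array ID taken from the amcfg example earlier in this thread.
for argv in fail_unfail_plan("000800A0B809500A", 2):
    print(" ".join(argv))
```

Printing the plan first makes it easier to double-check the upper-case and lower-case option letters, which differ only by case between the fail and unfail steps.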

 

Oh, and the ammgr -c AA command simply sets the array controllers to an Active / Active status.  This would only need to be done if one controller was in a "Passive" state.  I don't think "Unknown" would qualify for this.

 

-Bob

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Hi Bob, and once again thanks for your precious help.

Just to let you know: after restarting the whole platform and after the rebuild completed, the orange LED on the recently replaced disk went off and the disk's LED turned green. It is really strange. Could you please explain what happened? I would really like a logical explanation for all this.

 

Thanks in advance

 

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Well, I would say that either the BCC controller was problematic and/or the problematic array controller was affecting the ability of the LUN to automatically perform the rebuild.  By resetting the array, the controller was able to start the rebuild process that was "hung".

 

As I stated, these arrays can be a bit tricky to work with.

 

-Bob

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Thanks Bob once again.

I attached the latest output, received on 1st September 2011. As you will see in the output of amdsp -a, the state of both disks 4:0 and 4:1 is "NO RESPONSE" even though the green LED is on for both disks.

Any suggestions?

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Both drives are on the same channel.  Are these two disks in the same enclosure as the BCC that is marked as "Unknown"?  Are there any other drives in this enclosure on this same bus?  It could be that the "bad" BCC is causing the disks to report as bad/unknown.  One of them is being spared, but the other (4:0) is not, yet the LUN it belongs to is still marked as OPTIMAL.  Strange.  I'd work to get the BCC and the array controller working before messing with the disk modules at this time.

 

-Bob

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Hello Bob,

 

How can I find out if these 2 disks (4:0 and 4:1) are in the same enclosure as the BCC controller that is marked as "unknown"? These 2 disks are located in the SC10 disk array (bay 2). However, isn't the BCC controller meant to be located in the controller B enclosure? Note that both controllers A and B are meant to be configured for redundancy (connected to 3 SC10 disk arrays and 2 servers; server 1: application server, server 2: database server).

So, in your opinion, should I first replace the BCC and the controller?

 

Thanks in advance

 

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

> how can i find if these 2 disks (4:0  and  4:1) are in the same enclosure with the BCC controller that is marked as "unknown"?

According to the amdsp output, the bad controller is in disk enclosure "2", although the thumbwheel setting referenced is 1:

Firmware Revision   = HP04
   Information for Disk System 2 (USSD00098870), Controller A:
      SCSI Channel        = 3
      Thumbwheel Setting  = 1
      Controller Status   = GOOD
      Vendor ID           = HP      
      Product ID          = A5294A          
      BCC Serial Number   = USSD00098870
      Firmware Revision   = HP04
   Information for Disk System 2 (USSD00098870), Controller -1:
      SCSI Channel        = 0
      Thumbwheel Setting  = -1
      Controller Status   = UNKNOWN
      Vendor ID           = NO_VENDOR
      Product ID          = NO_MODEL
      BCC Serial Number   = NO_SER_NUM
      Firmware Revision   = NO_FWREV

From your first amdsp attachment you can see that disk 4:0 is in enclosure 1, slot 1 and disk 4:1 is in enclosure 1, slot 3.  Both of these disks are affected by the faulted BCC "B".   Replace that to get your disks back (hopefully).
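
For anyone wanting to scan a long amdsp dump for BCC entries like the one above, here is a rough Python sketch. It assumes the "Information for Disk System ..." block layout exactly as pasted in this thread; the function names (`parse_bcc_blocks`, `not_good`) are illustrative only.

```python
import re

# Sample copied from the amdsp output quoted in this thread.
SAMPLE = """\
   Information for Disk System 2 (USSD00098870), Controller A:
      SCSI Channel        = 3
      Thumbwheel Setting  = 1
      Controller Status   = GOOD
      Vendor ID           = HP
      Product ID          = A5294A
      BCC Serial Number   = USSD00098870
      Firmware Revision   = HP04
   Information for Disk System 2 (USSD00098870), Controller -1:
      SCSI Channel        = 0
      Thumbwheel Setting  = -1
      Controller Status   = UNKNOWN
      Vendor ID           = NO_VENDOR
      Product ID          = NO_MODEL
      BCC Serial Number   = NO_SER_NUM
      Firmware Revision   = NO_FWREV
"""

HEADER = re.compile(
    r"Information for Disk System (\d+) \((\S+)\), Controller (\S+):"
)
FIELD = re.compile(r"^\s*(.+?)\s*=\s*(.*?)\s*$")


def parse_bcc_blocks(text):
    """Split the dump into one dict per 'Information for Disk System' block."""
    blocks = []
    for line in text.splitlines():
        h = HEADER.search(line)
        if h:
            blocks.append({
                "disk_system": int(h.group(1)),
                "serial": h.group(2),
                "controller": h.group(3),
                "fields": {},
            })
            continue
        f = FIELD.match(line)
        if f and blocks:
            blocks[-1]["fields"][f.group(1)] = f.group(2)
    return blocks


def not_good(blocks):
    """Return the blocks whose Controller Status is anything other than GOOD."""
    return [b for b in blocks if b["fields"].get("Controller Status") != "GOOD"]


for b in not_good(parse_bcc_blocks(SAMPLE)):
    print(b["disk_system"], b["controller"], b["fields"]["Controller Status"])
```

On the output quoted here, this flags only the "Controller -1" entry of Disk System 2, which is the BCC reported as UNKNOWN.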

 

> however  the BCC controller isn't meant to be located in the controller B enclosure?

 

Not sure what you mean, but array controllers and BCC (bus control card) are different.  There is an A and B for each.

 

> So, in your opinion, first i should replace the BCC & controller ?

 

Replace the BCC to hopefully get all of your drives in good order.  Then focus on the array controller as previously discussed.

 

That's my suggestion, anyway!

 

-Bob

 

 

 

 

 

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

Hello,

 

We are still having issues on this topic. We replaced both BCCs on disk system 2 and then replaced the controller, with no results. We then backed up all data and syswiped the array, and now when trying to create a LUN on controller B we receive the following output:

> amcfg -force -L B:1 -d 2:0,4:0,6:0 -r 5 -s 16 Array1
Error in command execution, "RMT_AM60ERRORSTATUS_MSG"

   AM60ERR        : ERR_COMMAND_FAILED
   AM60ERR QUAL   : CREATE_LUN
   MODULE_CODE_ID : SUBSYSTEM
   COMMAND STATE  : A SCSI error occurred
   ERROR NUMBER   : 2

   Sense Key                  = 0x05:  "ILLEGAL REQUEST"
   Additional Sense Code      = 0x91
   Additional Sense Code Qual = 0x03

Decoded SCSI Sense:
Illegal Operation for Current Disk State

amcfg:  Error in command execution
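
When comparing several failed attempts, it can help to pull the sense values out of the error text as numbers. This is a small Python sketch that assumes the "Sense Key / Additional Sense Code" layout shown above; the function name `parse_sense` is made up for illustration. As far as I know, additional sense codes of 0x80 and above fall in the vendor-specific range of the SCSI standard, so the 0x91 here would need the array vendor's documentation to interpret beyond the decoded text already printed.

```python
import re

# Sample copied from the amcfg error output quoted in this thread.
SAMPLE = '''\
   Sense Key                  = 0x05:  "ILLEGAL REQUEST"
   Additional Sense Code      = 0x91
   Additional Sense Code Qual = 0x03
'''


def parse_sense(text):
    """Extract sense key, ASC and ASCQ as integers (None if a field is missing)."""
    patterns = {
        "sense_key": r"Sense Key\s*=\s*0x([0-9A-Fa-f]+)",
        "asc": r"Additional Sense Code\s*=\s*0x([0-9A-Fa-f]+)",
        "ascq": r"Additional Sense Code Qual\s*=\s*0x([0-9A-Fa-f]+)",
    }
    out = {}
    for name, pat in patterns.items():
        m = re.search(pat, text)
        out[name] = int(m.group(1), 16) if m else None
    return out


sense = parse_sense(SAMPLE)
# Assumption: ASC values from 0x80 upward are vendor-specific.
vendor_specific = sense["asc"] is not None and sense["asc"] >= 0x80
print(sense, "vendor-specific ASC:", vendor_specific)
```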

Isn't syswipe intended to delete all disk data and configuration?

 

Any help is much appreciated

 

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

It looks like there is still  a hardware issue somewhere. 

 

Does amdsp -a report any improvements since the controller replacements you referenced?

 

-Bob

meekrob
Super Advisor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

No improvements even after replacing the controller.

These FC60 are too weird, I checked everything (SCSI cables, ports.......)

Then, by itself, after some time it showed no problems and everything went OK.

However, nobody knows what happened. Too weird!

Robert_Jewell
Honored Contributor

Re: replacing a bad disk on SC10 array / FC60 (help plz)

This is probably the line that says it all;

> These FC60 are too weird,

 

Very true!

 

-bob
