
msa1000 strangeness after upgrading to fw 4.48

 

After upgrading from 4.24 to 4.48 I restarted the MSA1000 and, lo and behold, TWO of the drives indicated failures (bright orange LED on the drives), but the MSA1000 LCD did not report any errors. I powered up the servers and they came up just fine. Diagnostics on the drives indicate that they are operating within spec and do not need to be replaced. The single configured array is fine and the 4 configured units are fine.

Tech support has worked with me on the phone and has sent a field engineer to check out the situation, but no one seems to really know what's going on. Here is the output of 'show disks' from the console CLI:

CLI> show disks
          box,bay  bus,ID  Size     Speed     Units
Disk101   1,01     0,00    72.8 GB  160 MB/s  0, 1, 2, 3
Disk102   1,02     0,01    72.8 GB  160 MB/s  0, 1, 2, 3
Disk103   1,03     0,02    72.8 GB  160 MB/s  0, 1, 2, 3
Disk104   1,04     0,03    72.8 GB  160 MB/s  0, 1, 2, 3
Disk105   1,05     0,04    72.8 GB  160 MB/s  0, 1, 2, 3
Disk107   1,07     0,08    72.8 GB  160 MB/s  0, 1, 2, 3
Disk112   1,12     0,13    72.8 GB  160 MB/s  0, 1, 2, 3
Disk108   1,08     1,00    72.8 GB  160 MB/s  0, 1, 2, 3
Disk109   1,09     1,01    72.8 GB  160 MB/s  0, 1, 2, 3
Disk110   1,10     1,02    72.8 GB  160 MB/s  0, 1, 2, 3
Disk111   1,11     1,03    72.8 GB  160 MB/s  0, 1, 2, 3
Disk112   1,12     1,04    72.8 GB  160 MB/s  0, 1, 2, 3
Disk114   1,14     1,08    72.8 GB  160 MB/s  0, 1, 2, 3
Disk1255  1,255    1,13    72.8 GB  160 MB/s  0, 1, 2, 3


Notice that two disks are indicating that they are in physical bay 12. None of them indicate they are in 6 or 13, and what's up with bay 255? The two disks with the failure lights are actually in bays 6 and 13.
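
The bay anomalies described above can also be checked mechanically. The following is a minimal Python sketch, not an HP tool: it assumes a single 14-bay shelf with bays numbered 1 to 14, and the function names `parse_show_disks` and `bay_anomalies` are made up for illustration. The input is the `show disks` output exactly as posted.

```python
import re
from collections import Counter

# `show disks` output copied from the post above.
SHOW_DISKS = """\
Disk101 1,01 0,00 72.8 GB 160 MB/s 0, 1, 2, 3
Disk102 1,02 0,01 72.8 GB 160 MB/s 0, 1, 2, 3
Disk103 1,03 0,02 72.8 GB 160 MB/s 0, 1, 2, 3
Disk104 1,04 0,03 72.8 GB 160 MB/s 0, 1, 2, 3
Disk105 1,05 0,04 72.8 GB 160 MB/s 0, 1, 2, 3
Disk107 1,07 0,08 72.8 GB 160 MB/s 0, 1, 2, 3
Disk112 1,12 0,13 72.8 GB 160 MB/s 0, 1, 2, 3
Disk108 1,08 1,00 72.8 GB 160 MB/s 0, 1, 2, 3
Disk109 1,09 1,01 72.8 GB 160 MB/s 0, 1, 2, 3
Disk110 1,10 1,02 72.8 GB 160 MB/s 0, 1, 2, 3
Disk111 1,11 1,03 72.8 GB 160 MB/s 0, 1, 2, 3
Disk112 1,12 1,04 72.8 GB 160 MB/s 0, 1, 2, 3
Disk114 1,14 1,08 72.8 GB 160 MB/s 0, 1, 2, 3
Disk1255 1,255 1,13 72.8 GB 160 MB/s 0, 1, 2, 3
"""

# name, then "box,bay", then "bus,ID"
LINE = re.compile(r"^(Disk\d+)\s+(\d+),(\d+)\s+(\d+),(\d+)\s")


def parse_show_disks(text):
    """Return a list of (name, box, bay, bus, scsi_id) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            name, box, bay, bus, scsi_id = m.groups()
            rows.append((name, int(box), int(bay), int(bus), int(scsi_id)))
    return rows


def bay_anomalies(rows, bays_in_shelf=14):
    """Flag bays reported twice, never reported, or outside 1..bays_in_shelf.

    Assumption: one shelf whose bays are numbered 1..bays_in_shelf.
    """
    counts = Counter(bay for _, _, bay, _, _ in rows)
    expected = set(range(1, bays_in_shelf + 1))
    return {
        "duplicate": sorted(b for b, n in counts.items() if n > 1),
        "missing": sorted(expected - set(counts)),
        "out_of_range": sorted(set(counts) - expected),
    }


rows = parse_show_disks(SHOW_DISKS)
report = bay_anomalies(rows)
print(report)
```

Run against the listing above, this flags bay 12 as reported twice, bays 6 and 13 as never reported, and bay 255 as out of range, which matches what was observed by eye.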

The firmware was upgraded from 4.24 to 4.48. The EMU is 1.86

The field engineer is coming again Tuesday with a replacement backplane, EMU, and controller.

Does any of this make sense? What are the chances of data loss when replacing the above components?

Thanks for any thoughts,

RS
27 REPLIES

Re: msa1000 strangeness after upgrading to fw 4.48

BTW, the following appears next to each drive in the report from the Array Diagnostic Utility.

"Error occurred reading RIS copy"

Is this going to affect what happens when the engineer starts swapping out components? Should I be concerned?

Thanks,

RS
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

Strange, I haven't seen this problem in my lab.

Give this a try. Power down the servers, and then the MSA. Reseat drives Disk106 and Disk113. Then power on everything.

jk

Re: msa1000 strangeness after upgrading to fw 4.48

I tried this while on the phone with tech support the other day. No luck.

They also had me reseat all the drives and the controller. When I did that two OTHER drives showed up as failed and it really did break everything. (That was a scary few minutes.) Reseating those two again solved that problem, but the original problem remains.

Any thoughts about the RIS message?

Thanks,

RS
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

Even after you reseated the drives the report looks the same?

The RIS area stores your MSA array and LUN configuration and a couple of other items. Each drive has a copy. I believe we read each drive's RIS, just in case someone performed a DTS (Direct to SAN).

When you do a >show unit 0, does the drive show up as failed? Can you upload a >show tech_support?

I'm suspecting the EMU.

Re: msa1000 strangeness after upgrading to fw 4.48

Yes, the status has not changed at all since the firmware upgrade. The only time it differed was when the two OTHER drives reported as failed and had to be reseated.

I'm asking about the RIS because I don't know whether or not to be concerned that the information is unavailable to the diagnostic tools.

Here is the show unit 0 result:

Unit 0:
In PDLA mode, Unit 0 is Lun 1; In VSA mode, Unit 0 is Lun 0.
Unit Identifier :
Device Identifier : 600805F3-000D4F00-A91B235C-AC1E0013
Cache Status : Enabled
Max Boot Partition: Disabled
Volume Status : VOLUME OK
Parity Init Status: Complete
14 Data Disk(s) used by lun 0:
Disk101: Box 1, Bay 01, (SCSI bus 0, SCSI id 0)
Disk102: Box 1, Bay 02, (SCSI bus 0, SCSI id 1)
Disk103: Box 1, Bay 03, (SCSI bus 0, SCSI id 2)
Disk104: Box 1, Bay 04, (SCSI bus 0, SCSI id 3)
Disk105: Box 1, Bay 05, (SCSI bus 0, SCSI id 4)
Disk112: Box 1, Bay 12, (SCSI bus 0, SCSI id 13)
Disk107: Box 1, Bay 07, (SCSI bus 0, SCSI id 8)
Disk108: Box 1, Bay 08, (SCSI bus 1, SCSI id 0)
Disk109: Box 1, Bay 09, (SCSI bus 1, SCSI id 1)
Disk110: Box 1, Bay 10, (SCSI bus 1, SCSI id 2)
Disk111: Box 1, Bay 11, (SCSI bus 1, SCSI id 3)
Disk112: Box 1, Bay 12, (SCSI bus 1, SCSI id 4)
Disk1255: Box 1, Bay 255, (SCSI bus 1, SCSI id 13)
Disk114: Box 1, Bay 14, (SCSI bus 1, SCSI id 8)
Spare Disk(s) used by lun 0:
No spare drive is designated.
Logical Volume Raid Level: DISTRIBUTED PARITY FAULT TOLERANCE (Raid 5)
stripe_size=16kB
Logical Volume Capacity : 498MB



The show tech_support result is attached.

What are the implications for data loss of replacing the EMU?
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

Richard,

Replacing EMU will not be a problem.

Strange no drive failures are being reported.

At the CLI:
>locate disk disk105 2
Does the drive LED flash?

Then try the next drive in the sequence.

Have you backed up everything?

Re: msa1000 strangeness after upgrading to fw 4.48

>Strange no drive failures are being reported.

How so? Do you mean the physical drives themselves or the logical units?

>at the cli locate disk disk105 2
>Does the drive LED flash?

The drive in bay 5 flashes.

Yesterday the field engineer and I identified the drives one by one using the HP Insight Diagnostics tool. The drives are not being reported by that utility as 0-13 or 1-14. They are represented as 4-17. The mapping from this number to the bay # in the shelf is:

4-1
5-2
6-3
7-4
8-5
9-7
10-??? (two drives are solid amber so couldn't tell)
11-8
12-9
13-10
14-11
15-12
16-14
17-??? (two drives are solid amber...)
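
The mapping above can be written down as data to see what is still unresolved. This is only an illustrative Python sketch: `DIAG_TO_BAY` and `unresolved` are hypothetical names, and the 14-bay shelf is an assumption.

```python
# Insight Diagnostics drive number -> shelf bay, as observed above.
# None marks the two numbers that could not be identified
# because both suspect drives show solid amber.
DIAG_TO_BAY = {
    4: 1, 5: 2, 6: 3, 7: 4, 8: 5, 9: 7, 10: None,
    11: 8, 12: 9, 13: 10, 14: 11, 15: 12, 16: 14, 17: None,
}


def unresolved(mapping, bays_in_shelf=14):
    """Return (diagnostic numbers with no known bay, bays no number maps to).

    Assumption: one shelf with bays numbered 1..bays_in_shelf.
    """
    known_bays = {bay for bay in mapping.values() if bay is not None}
    unknown_ids = sorted(k for k, bay in mapping.items() if bay is None)
    unclaimed = sorted(set(range(1, bays_in_shelf + 1)) - known_bays)
    return unknown_ids, unclaimed


unknown_ids, unclaimed_bays = unresolved(DIAG_TO_BAY)
print(unknown_ids, unclaimed_bays)
```

The two unidentified numbers (10 and 17) and the two unclaimed bays (6 and 13) line up with the two amber drives, although this alone does not say which number belongs to which bay.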

>Have you backed up everything?

Yes. I do nightly backups but I'm deathly afraid of having to resurrect MS-SQL and Exchange on the same evening. Any good pointers to quick recovery procedures wouldn't be unwelcome...

---

I just spoke with the field engineer who said that the people he has spoken to think it is the backplane. Any thoughts on this -vs- the EMU? Why do you think one over the other?

Something else he said that he wants to try first is to change the read/write % on the controller, forcing a configuration change on it.

I'm dying here not knowing what, if anything, is actually wrong.

In the meantime, thanks for your help.

RS
Anthony Martin_1
Frequent Advisor

Re: msa1000 strangeness after upgrading to fw 4.48

Hi Richard,
It could be worthwhile to check the firmware versions on these 2 disk drives. It is possible that they are down rev and the new firmware doesn't like them too much.

Cheers
Anthony

Re: msa1000 strangeness after upgrading to fw 4.48

4 of them are HPB4, the other 10 are HPB3. The HPB4 drives are in bays 1-4.

Thanks for weighing in.

RS

Re: msa1000 strangeness after upgrading to fw 4.48

My understanding is that the field engineer is going to bring the following:

EMU
Backplane
MSA1000 Controller

Can all of these be replaced in one operation? Any thoughts about which is most likely the culprit?

Where is the array information stored? On the drives and on the controller, right? Is it possible to 'backup' this information somehow?

Thanks again for all the assistance.

RS
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

The array information is stored on the drives and in controller nvram.

The EMU controls environmental status plus handles some drive events. Example hot remove and hot add.

Look at the show tech_support: your LUNs are not showing a drive fault of any kind. If you experienced a drive fault, there would be a line added to the faulty disk in your LUN.

Just as a safety precaution, locate cpqacuxe.exe or hpacuxe.exe. At a cmd prompt, in the directory of the executable, issue the command:
cpqacuxe -c
This will capture your server's array configuration plus the MSA's. The file created will be acucapt.ini.
cpqacuxe -h will pop up a help screen.
If needed, you can edit out the embedded Smart Array configuration information, leaving the MSA information. Then execute cpqacuxe -i acucapt.ini. This will recreate your configuration in its original state.
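
For anyone scripting this across several servers, the capture and restore steps could be wrapped as below. This is a hedged Python sketch: only the `-c` and `-i acucapt.ini` options come from the post above; the helper names and the dry-run behaviour are assumptions added for illustration, and nothing is executed unless `dry_run=False` is passed and the executable is on the PATH.

```python
import shutil
import subprocess


def capture_cmd(exe="cpqacuxe"):
    """Command line that captures the array configuration (writes acucapt.ini)."""
    return [exe, "-c"]


def restore_cmd(ini="acucapt.ini", exe="cpqacuxe"):
    """Command line that recreates the configuration from a capture file."""
    return [exe, "-i", ini]


def run(cmd, dry_run=True):
    """Print the command; run it only when dry_run is False and the tool exists."""
    if dry_run or shutil.which(cmd[0]) is None:
        print("would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


run(capture_cmd())   # dry run: prints the capture command
run(restore_cmd())   # dry run: prints the restore command
```

Keeping a copy of acucapt.ini somewhere other than the SAN itself is the point of the exercise, since the file is only useful if it survives the hardware swap.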

It's difficult to say exactly whether it is the backplane or the EMU. The only thing I can think of on the backplane that could cause something like this is that one of the resistors in an RNET used for the SCSI terminators has either opened or shorted with another resistor in the RNET.

Tomorrow I'll do a little more digging. I would be interested in capturing this equipment.








Re: msa1000 strangeness after upgrading to fw 4.48

>Look at the show tech_support, your LUNs are not showing a drive fault of any kind. If you experience a drive fault, there would be a line added to the faulty disk in your LUN.

This is how I interpret that information. Several suggestions have been made to pull one of the affected drives and let the array rebuild. It doesn't seem that this is likely to be a solution. In fact, it seems that this could be counterproductive given the extremely long rebuild time (?? hours) when the overall hardware state is suspect.

I have backed up the array configuration information from all 4 servers accessing this SAN using a local instance of cpqacuxe.exe on each server. Can you give me the 10 cent rundown on how this might be useful in case something goes awry during the equipment swap?

>Tomorrow I'll do a little more digging. I would be interested in capturing this equipment.

Thanks. I would be interested in sending it to you. Do you or I have to do anything to redirect it from the normal process once the faulty hardware has been identified?

RS

PS Just to make sure I understand, say for instance we brought in a completely new MSA1000 and populated it with our current drives. Theoretically that would start up and be usable, right? Sorry for all the questions. I'm new to HP land (though customer service like this is making me glad I came over!)
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

I was going to suggest a hot remove and insert, but you're reporting two faulty drives in a RAID 5 configuration and I didn't want to take any chances with your data.

Even though you have two drives with amber lights, your system is still up and functional, yes?

If you do try the hot remove and insert, move your rebuild priority to medium or high.


Re: msa1000 strangeness after upgrading to fw 4.48

Yes. The system is actually working with no degradation or loss of performance.

I really don't want to pull a drive if we don't have to.

Regarding the theory of how the device operates, what would happen if I took all of those drives and put them into a different MSA1000? Would the array be retained? I'm not asking because I want a new MSA1000. I truly just want to understand how far the system is designed to go before losing the data.

Thanks,

RS
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

This is a long shot.
Any chance you pulled the controller and reinserted it? If so, could you have bent a pin?

If able, power down everything and pull the controller. Look at the backplane connector for bent pins, or look at the controller connector for damaged holes.

Still stumped.

By having the RIS data on the drives, you can take a LUN and move it to another Smart Array device and still preserve everything. There is some backward compatibility with older SA controllers.

Re: msa1000 strangeness after upgrading to fw 4.48

The controller was removed and reinserted, but only after this problem arose. I did this at the request of the initial phone support technician.

>By having the RIS data on the drives you can take a LUN and move it to another Smart Array device and still preserve everything.

This is why I asked about the "Error occurred reading RIS copy" messages. Does this mean that the array information on the drives is corrupted? If that's the case, should the swap of the controller only be as a last resort?

Here is a snippet from the ADU report. The entire report is attached.

SLOT 2 (ID 65536) MSA1000 Array Controller ERROR REPORT:

Error occurred reading RIS copy from SCSI Port 1 Drive ID 0
Error occurred reading RIS copy from SCSI Port 1 Drive ID 1
Error occurred reading RIS copy from SCSI Port 1 Drive ID 2
Error occurred reading RIS copy from SCSI Port 1 Drive ID 3
Error occurred reading RIS copy from SCSI Port 1 Drive ID 4
Error occurred reading RIS copy from SCSI Port 1 Drive ID 8
Error occurred reading RIS copy from SCSI Port 1 Drive ID 13
Error occurred reading RIS copy from SCSI Port 2 Drive ID 0
Error occurred reading RIS copy from SCSI Port 2 Drive ID 1
Error occurred reading RIS copy from SCSI Port 2 Drive ID 2
Error occurred reading RIS copy from SCSI Port 2 Drive ID 3
Error occurred reading RIS copy from SCSI Port 2 Drive ID 4
Error occurred reading RIS copy from SCSI Port 2 Drive ID 8
Error occurred reading RIS copy from SCSI Port 2 Drive ID 13
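
To see whether the RIS errors single out particular drives, the report lines can be grouped by port. A small Python sketch follows; the function name `ris_errors_by_port` is made up for illustration, and the input is the ADU snippet as posted.

```python
import re
from collections import defaultdict

# Error lines copied from the ADU report snippet above.
ADU_ERRORS = """\
Error occurred reading RIS copy from SCSI Port 1 Drive ID 0
Error occurred reading RIS copy from SCSI Port 1 Drive ID 1
Error occurred reading RIS copy from SCSI Port 1 Drive ID 2
Error occurred reading RIS copy from SCSI Port 1 Drive ID 3
Error occurred reading RIS copy from SCSI Port 1 Drive ID 4
Error occurred reading RIS copy from SCSI Port 1 Drive ID 8
Error occurred reading RIS copy from SCSI Port 1 Drive ID 13
Error occurred reading RIS copy from SCSI Port 2 Drive ID 0
Error occurred reading RIS copy from SCSI Port 2 Drive ID 1
Error occurred reading RIS copy from SCSI Port 2 Drive ID 2
Error occurred reading RIS copy from SCSI Port 2 Drive ID 3
Error occurred reading RIS copy from SCSI Port 2 Drive ID 4
Error occurred reading RIS copy from SCSI Port 2 Drive ID 8
Error occurred reading RIS copy from SCSI Port 2 Drive ID 13
"""

PATTERN = re.compile(
    r"Error occurred reading RIS copy from SCSI Port (\d+) Drive ID (\d+)"
)


def ris_errors_by_port(text):
    """Group the drive IDs named in RIS read errors by SCSI port."""
    by_port = defaultdict(list)
    for port, drive_id in PATTERN.findall(text):
        by_port[int(port)].append(int(drive_id))
    return dict(by_port)


errors = ris_errors_by_port(ADU_ERRORS)
total = sum(len(ids) for ids in errors.values())
print(errors, total)
```

The grouping shows the same seven drive IDs on both ports, 14 errors in all, so the message is logged for every drive rather than only for the two with amber LEDs.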

What would make the RIS copy of each drive unreadable? Does this point to a specific piece of hardware such as the EMU?


Thanks for working so hard on this.

RS

Re: msa1000 strangeness after upgrading to fw 4.48

Any more thoughts? It is 5PM EDT and the field engineer is due to arrive in a few hours with a handful of parts.

Thanks,

RS
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

The MSA is able to read a RIS from somewhere, because you still have your LUNs intact.

I've looked over the backplane schematic. If it were a backplane issue, I would expect other problems, especially accessing your LUNs.

Tell the FE that I work in MSA development. A FW engineer and I would like to get our hands on the faulty component.

Re: msa1000 strangeness after upgrading to fw 4.48

OK. We'll start with the EMU, then the controller, then the backplane.

The first thing he wants to try is to change the read/write cache distribution, so we'll do that then start on the hardware.

Wish us luck...

RS

Re: msa1000 strangeness after upgrading to fw 4.48

The FE said that you need to contact him about obtaining the part. I am happy to give you his name, email, phone, etc., but I am reluctant to provide this information in the public forum. How would you like me to send it to you?

We did not make any changes last night. We decided to try and make it until this weekend when I've got more time to recover from a worst-case scenario.

Thanks,

RS

Re: msa1000 strangeness after upgrading to fw 4.48

Update - 7/3/06

None of the HP-supplied parts had any effect. Individually, I replaced the EMU, backplane, MSA1000 controller with the old cache installed, and a new cache module in the old controller. That is, each part was tried by itself without combining more than one new part at a time.

In each case the array came up and displayed the same odd behavior.

Since the drive in bay 13 was known to be the one displaying:

Disk1255 1,255 1,13 72.8 GB 160 MB/s 0, 1, 2, 3

in the show disks command, I shut down the MSA1000 and replaced that drive with a new one. Well, things only got weirder. Instead of the new drive taking the place of the missing one and the array rebuilding, the old disk was still being seen as inserted, but obviously with a failed status. The new physical disk showed up, but couldn't be added to the array since it was the '15th' drive. Examples of show disks and show units below:
CLI> show disks
          box,bay  bus,ID  Size     Speed     Units
Disk101   1,01     0,00    72.8 GB  160 MB/s  0, 1, 2, 3
Disk102   1,02     0,01    72.8 GB  160 MB/s  0, 1, 2, 3
Disk103   1,03     0,02    72.8 GB  160 MB/s  0, 1, 2, 3
Disk104   1,04     0,03    72.8 GB  160 MB/s  0, 1, 2, 3
Disk105   1,05     0,04    72.8 GB  160 MB/s  0, 1, 2, 3
Disk107   1,07     0,08    72.8 GB  160 MB/s  0, 1, 2, 3
Disk112   1,12     0,13    72.8 GB  160 MB/s  0, 1, 2, 3
Disk108   1,08     1,00    72.8 GB  160 MB/s  0, 1, 2, 3
Disk109   1,09     1,01    72.8 GB  160 MB/s  0, 1, 2, 3
Disk110   1,10     1,02    72.8 GB  160 MB/s  0, 1, 2, 3
Disk111   1,11     1,03    72.8 GB  160 MB/s  0, 1, 2, 3
Disk112   1,12     1,04    72.8 GB  160 MB/s  0, 1, 2, 3
Disk113   1,13     1,05    72.8 GB  160 MB/s  none
Disk114   1,14     1,08    72.8 GB  160 MB/s  0, 1, 2, 3

CLI> show units

Unit 0:
In PDLA mode, Unit 0 is Lun 1; In VSA mode, Unit 0 is Lun 0.
Unit Identifier :
Device Identifier : 600805F3-000D4F00-A91B235C-AC1E0013
Cache Status : Enabled
Max Boot Partition: Disabled
Volume Status : VOLUME USING REGENERATE
Parity Init Status: Complete
14 Data Disk(s) used by lun 0:
Disk101: Box 1, Bay 01, (SCSI bus 0, SCSI id 0)
Disk102: Box 1, Bay 02, (SCSI bus 0, SCSI id 1)
Disk103: Box 1, Bay 03, (SCSI bus 0, SCSI id 2)
Disk104: Box 1, Bay 04, (SCSI bus 0, SCSI id 3)
Disk105: Box 1, Bay 05, (SCSI bus 0, SCSI id 4)
Disk112: Box 1, Bay 12, (SCSI bus 0, SCSI id 13)
Disk107: Box 1, Bay 07, (SCSI bus 0, SCSI id 8)
Disk108: Box 1, Bay 08, (SCSI bus 1, SCSI id 0)
Disk109: Box 1, Bay 09, (SCSI bus 1, SCSI id 1)
Disk110: Box 1, Bay 10, (SCSI bus 1, SCSI id 2)
Disk111: Box 1, Bay 11, (SCSI bus 1, SCSI id 3)
Disk112: Box 1, Bay 12, (SCSI bus 1, SCSI id 4)
Disk1255: Box 1, Bay 255, (SCSI bus 1, SCSI id 13) DRIVE FAILED!
Disk114: Box 1, Bay 14, (SCSI bus 1, SCSI id 8)
Spare Disk(s) used by lun 0:
No spare drive is designated.
Logical Volume Raid Level: DISTRIBUTED PARITY FAULT TOLERANCE (Raid 5)
stripe_size=16kB
Logical Volume Capacity : 498MB
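
The mismatch between the two listings above can be made explicit by comparing them on (SCSI bus, SCSI id), since the disk names themselves are not unique (Disk112 appears twice). This is a rough Python sketch with made-up helper names (`disks_seen`, `unit_members`); the inputs are the two outputs as posted.

```python
import re

# `show disks` after the drive swap, copied from above.
SHOW_DISKS = """\
Disk101 1,01 0,00 72.8 GB 160 MB/s 0, 1, 2, 3
Disk102 1,02 0,01 72.8 GB 160 MB/s 0, 1, 2, 3
Disk103 1,03 0,02 72.8 GB 160 MB/s 0, 1, 2, 3
Disk104 1,04 0,03 72.8 GB 160 MB/s 0, 1, 2, 3
Disk105 1,05 0,04 72.8 GB 160 MB/s 0, 1, 2, 3
Disk107 1,07 0,08 72.8 GB 160 MB/s 0, 1, 2, 3
Disk112 1,12 0,13 72.8 GB 160 MB/s 0, 1, 2, 3
Disk108 1,08 1,00 72.8 GB 160 MB/s 0, 1, 2, 3
Disk109 1,09 1,01 72.8 GB 160 MB/s 0, 1, 2, 3
Disk110 1,10 1,02 72.8 GB 160 MB/s 0, 1, 2, 3
Disk111 1,11 1,03 72.8 GB 160 MB/s 0, 1, 2, 3
Disk112 1,12 1,04 72.8 GB 160 MB/s 0, 1, 2, 3
Disk113 1,13 1,05 72.8 GB 160 MB/s none
Disk114 1,14 1,08 72.8 GB 160 MB/s 0, 1, 2, 3
"""

# Member list of Unit 0 from `show units`, copied from above.
UNIT_MEMBERS = """\
Disk101: Box 1, Bay 01, (SCSI bus 0, SCSI id 0)
Disk102: Box 1, Bay 02, (SCSI bus 0, SCSI id 1)
Disk103: Box 1, Bay 03, (SCSI bus 0, SCSI id 2)
Disk104: Box 1, Bay 04, (SCSI bus 0, SCSI id 3)
Disk105: Box 1, Bay 05, (SCSI bus 0, SCSI id 4)
Disk112: Box 1, Bay 12, (SCSI bus 0, SCSI id 13)
Disk107: Box 1, Bay 07, (SCSI bus 0, SCSI id 8)
Disk108: Box 1, Bay 08, (SCSI bus 1, SCSI id 0)
Disk109: Box 1, Bay 09, (SCSI bus 1, SCSI id 1)
Disk110: Box 1, Bay 10, (SCSI bus 1, SCSI id 2)
Disk111: Box 1, Bay 11, (SCSI bus 1, SCSI id 3)
Disk112: Box 1, Bay 12, (SCSI bus 1, SCSI id 4)
Disk1255: Box 1, Bay 255, (SCSI bus 1, SCSI id 13) DRIVE FAILED!
Disk114: Box 1, Bay 14, (SCSI bus 1, SCSI id 8)
"""


def disks_seen(text):
    """Map (bus, scsi_id) -> (name, bay) for each disk in `show disks`."""
    pat = re.compile(r"^(Disk\d+)\s+\d+,(\d+)\s+(\d+),(\d+)\s", re.M)
    return {(int(bus), int(sid)): (name, int(bay))
            for name, bay, bus, sid in pat.findall(text)}


def unit_members(text):
    """Map (bus, scsi_id) -> (name, bay, failed) for each unit member."""
    pat = re.compile(
        r"^(Disk\d+): Box \d+, Bay (\d+), "
        r"\(SCSI bus (\d+), SCSI id (\d+)\)(.*)$", re.M)
    return {(int(bus), int(sid)): (name, int(bay), "FAILED" in rest)
            for name, bay, bus, sid, rest in pat.findall(text)}


seen = disks_seen(SHOW_DISKS)
members = unit_members(UNIT_MEMBERS)
member_not_seen = sorted(set(members) - set(seen))   # in the unit, not listed
seen_not_member = sorted(set(seen) - set(members))   # listed, not in the unit
failed = sorted(key for key, value in members.items() if value[2])
print(member_not_seen, seen_not_member, failed)
```

With this data the unit still references a member at bus 1, id 13 (marked failed and absent from show disks), while the replacement drive appears at bus 1, id 5 and belongs to no unit, which is consistent with the new drive being treated as a '15th' disk.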

Some very friendly HP support folks had me replace the new drive with the original and the array rebuilt itself. We still have two drives with failed indications, but the MSA1000 is performing just fine.

My thought now is that the problem is in one or more of the physical drives, but which one(s)?
John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

What does ACU say about your system? Does it show the disk correctly?


Re: msa1000 strangeness after upgrading to fw 4.48

Yes, but it appeared outside the array. The 'failed' drive (the one that was removed) was listed as still being part of the array.

John Kufrovich
Honored Contributor

Re: msa1000 strangeness after upgrading to fw 4.48

So, ACU shows DISK106 and DISK113, while the MSA CLI doesn't report those disks.

Does ACU report your LUN as failed?