Operating System - OpenVMS

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

 
Joe Trimble
Advisor

This morning I powered down the ES40, including pulling the power cables. I left it completely unplugged for about 5 minutes, then started it back up again.

The DATACHECK problems are still occurring, even after a complete utility power-down of the ES40.

Still searching for a cure....

Thanks,
Joe
Joe Trimble
Advisor

Jon,

I'm sorry I failed to answer one of your questions.

YES, the HBA firmware is up-to-date. The same firmware is installed on the FCA2684 cards in both my ES40 (DATACHECK errors) and the DS25 (works without errors). The revision level is TS1.91X6.

Thanks,
Joe
EdgarZamora
Trusted Contributor

This sure smells like a hardware problem to me. Have you tried reseating the HBAs, checking cable connections, etc.? Have you tried swapping the HBAs as Jon suggested? Have you tried switching the paths of the problem disks?
Jon Pinkley
Honored Contributor

Any clues in the errlog on the ES40?

What is odd is that the path to the MSA disks at least seemed to work with the large files, but there seems to be something about i/o with datacheck operations that is tripping something up.

Does the problem get worse if you mount the disk on the ES40 with the /datacheck qualifier? It isn't obvious to me why that would cause errors. Are you sure the 20G files you copied to the device were copied correctly? Did you do a backup/verify or a backup/compare (after the initial copy)?
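To illustrate the kind of check being asked about here, a block-by-block comparison of a source file against its copy can be sketched as follows. This is a minimal Python sketch, not how BACKUP/COMPARE or BACKUP/VERIFY is implemented; the function name `mismatched_blocks` and the demo file names are hypothetical, and the 512-byte unit is simply the OpenVMS disk block size.

```python
import os
import tempfile

BLOCK_SIZE = 512  # OpenVMS disk block size in bytes


def mismatched_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the 0-based indexes of blocks that differ between two files.

    A block that exists in only one file (length mismatch) counts as different.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad


if __name__ == "__main__":
    # Hypothetical demo: build a 4-block file and a copy with one flipped byte.
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "original.dat")
    copy = os.path.join(workdir, "copy.dat")
    data = bytes(range(256)) * 8
    damaged = bytearray(data)
    damaged[600] ^= 0xFF  # byte 600 lies in block 1
    with open(original, "wb") as f:
        f.write(data)
    with open(copy, "wb") as f:
        f.write(bytes(damaged))
    print(mismatched_blocks(original, copy))
```

A clean copy yields an empty list; a long list of block numbers would correspond to the "thousands of blocks" kind of failure a verify pass reports when data is being corrupted in flight.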

Can you confirm that the ES40 used to work with other disks, and that the only thing changed was the addition of the HBA/MSA1000?

If the problem exists in the MSA1000, can anyone explain why the DS25 isn't seeing the same problem? If it was a problem in the FC switch or GBIC, or the fiber cable, we would expect to see errors (like CRC errors) on that port. We don't. If you want to eliminate that as a possibility, you could swap the ports the DS25 and ES40 are plugged into.

Since you are using a dedicated switch, and I think the MSA is limited to serving a single type of OS at a time, you probably don't have to worry about zoning on the switch.

Do you have any other FC controllers you can connect the ES40 to? Since you are using the integrated switch in the MSA, my guess is that you do not.

I don't know if there are any "loopback" type diagnostics that can test the HBA.

Has HP discovered anything from the data they collected?

Jon
it depends
Joe Trimble
Advisor

Jon,

Thank you and others for hanging in here with me on this...

To answer your questions:

1. I have not tried the /data_check qualifier. Not real sure what good that would do in this situation.

2. I had not used backup/VERIFY in previous tests, but did try it today with some large files (6GB and 19GB files). The files copied without error, but the verify pass went crazy with verification errors for thousands of blocks. So this problem persists for large files, but manifests itself differently. Strange.....

3. The ES40 was (and still is) connected to a RaidArray 3000 system via SCSI. These drives have worked, and continue to work, flawlessly. In addition, there are a few disks in the system cage itself that have been and are working without error. Only the new SAN array I/O is bad from the ES40.

4. I upgraded the firmware on the 2/8 switches today. So now the switches (3.2.1c) and the MSA controllers (7.00) are up to the most recent firmware levels. There was no change in the behavior on either system.

5. Zoning is not an issue. We are only trying to use this array with OpenVMS systems.

6. I do not have any other FC controllers available.

7. I'm not aware of any loopback diags; HP support has not mentioned anything like that.

8. So far HP Support has recommended firmware updates and locking down the speed of the ports, which I have done, to no avail.

I'm going to try swapping the fibre cables between the ES40 and DS25. This will eliminate the switches, port connectors and fibre cable itself as a problem source.

Next, I'm thinking of re-seating or moving the HBA cards to different PCI slots in the ES40, if there are two more slots available.

Next, I may try swapping the HBA cards between the ES40 and DS25. If the problem is in the cards, that will isolate them.

Thanks again,
Joe

Joe Trimble
Advisor

Update: Late today I swapped the fibre cables between the ES40 and DS25 systems to see if the problem followed the cables. It did not; the ES40 continued to generate DATACHECK and related errors writing to the MSA1000 LUNs.

While the cables were swapped and files were copying (and generating DATACHECK write errors), I tried pulling the fibre cables off each HBA (one at a time) to break the datapath. The system fail-over worked as it should, and I proved again that the errors occur regardless of which HBA the data is going through to reach the MSA.

So I think I've proven that the MSA1000 is not at fault but is working as it should, and the fault lies in the ES40 box which contains the FCA2684 HBA cards.

Tonight I logged an additional call with HP, this time as a hardware call cross-referencing the open software call. (The original call was logged with VMS support.) I have 24/7 coverage on my ES40 and the FCA2684's. I'm hoping to get an engineer on-site on Saturday.

More updates later. Thanks!
Joe
Joe Trimble
Advisor

Resolution:

A field engineer came on-site last Saturday night and spent about 9 hours helping isolate the problem. Here is the result.

The problem was ultimately resolved by moving the FC HBA cards to different PCI slots in the ES40 PCI backplane. The PCI backplane consists of 10 slots divided into 2 separate I/O buses: 4 slots in bus PCI-0 and 6 slots in bus PCI-1. Several months ago I installed the FC cards into bus PCI-0, which also contains the VGA video card and nothing else.

During the evening the FE replaced the PCI backplane, which ultimately proved to be unnecessary, we believe.

By moving the FC cards over to bus PCI-1, the data corruption seems to have disappeared.

The lingering question is whether there is a compatibility issue between PCI-0 and the FC cards, or if the video card was somehow hosing the bus and causing the I/O problems. At 4am on a Sunday morning, I didn't really care to find out since it appeared we had a working system at that point and I was more interested in getting some breakfast and sleeping...

Thanks to all of you who offered advice. I learned a lot about the MSA1000 during this event.

Joe
Joe Trimble
Advisor

Resolved.
Jon Pinkley
Honored Contributor

Joe,

Thanks for the update.

Does anyone reading this have an ES40 with an FCA2684 (also known as DS-A5132-AA (370426-B21) PCI-X 64BIT 133MHZ 2Gb-ALPHA LP10000) working in PCI-0?

You can use

SDA> clue config

and look for 3R-A513*-AA (Emulex LP10000). See the attachment for the Adapter config section on our ES40 showing the FCA2684 (Emulex LP10000) on PCI-1 (PCI on Hose 1).

In our ES40, which is currently a test system, we have only a single FC HBA (same FCA2684 as you have), and it happens to be on the PCI-1 bus. However we have other adapters using the PCI-0 hose.

Just curious, are you using the video card? When we ordered our ES40 M2, I didn't order the video card, since we didn't plan to use it, instead using the serial console.

Which video card are you using?

From a performance standpoint, it would be nice to put something on PCI-0. In the systems and options catalog, there are no restrictions listed on using the FCA2684 in PCI-0 or alongside a video card, although the SN-PBXGK-BB (400712-B21) ELSA GLORIA SYNERGY 8MB GFX (GRAPHICS) video card does have a restriction to use PCI-0.

Ideally, it would be best to have the two FCA2684's on different PCI buses, since these are high bandwidth devices, and VMS can use different paths for different LUNs, so it could be actively using both FCA2684s. Having them on separate buses could reduce PCI contention. Whether you would notice a difference is another question. You probably could, using something like iohammer or diskblock, but may not in normal operation.
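A rough back-of-the-envelope calculation shows why two 2Gb HBAs sharing one bus could contend. This is only a sketch under stated assumptions: it assumes each ES40 PCI bus is 64 bits wide and clocked at 33 MHz (the thread does not state this), uses the nominal 200 MB/s one-direction payload rate for 2Gb Fibre Channel, and ignores protocol overhead and other devices on the bus. The function names are hypothetical.

```python
PCI_WIDTH_BYTES = 8        # assumption: 64-bit PCI bus in the ES40
PCI_CLOCK_HZ = 33_000_000  # assumption: 33 MHz bus clock
FC_2G_MB_PER_S = 200       # nominal 2Gb FC payload rate, one direction


def pci_peak_mb_per_s(width_bytes=PCI_WIDTH_BYTES, clock_hz=PCI_CLOCK_HZ):
    """Theoretical peak PCI transfer rate in MB/s (one transfer per clock)."""
    return width_bytes * clock_hz / 1_000_000


def bus_demand_ratio(hba_count, hba_mb_per_s=FC_2G_MB_PER_S):
    """Peak demand of the HBAs on one bus divided by that bus's peak rate.

    A ratio above 1.0 means the HBAs could, at full line rate,
    ask for more bandwidth than the bus can deliver.
    """
    return hba_count * hba_mb_per_s / pci_peak_mb_per_s()


if __name__ == "__main__":
    print("PCI peak MB/s:", pci_peak_mb_per_s())
    for count in (1, 2):
        print(count, "HBA(s) on one bus, demand ratio:",
              round(bus_demand_ratio(count), 2))
```

Under these assumptions one HBA fits within a single bus's peak rate while two on the same bus could exceed it, which is consistent with the point that the difference would mainly show up under a synthetic load rather than in normal operation.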

When you stated that you had two FCA2684's, it seemed unlikely that it was a bad HBA, and more likely that something else was corrupting the data.

So I think you still have a hardware problem or a non-documented incompatibility in the ES40, and you are avoiding it by not using PCI-0. Do you know if the video adapter works? If it isn't being used, I would try removing it and moving one of the FCA2684's to PCI-0. Or if you have a Gb ethernet adapter, you could try that, as it could help in determining if the problem is the PCI bus or an incompatibility of PCI-0 with the FCA2684. If that doesn't work, then there may be a problem with the PCI-0, although there are other things on our ES40 M2 that use PCI-0, for example the serial console and CD IDE interface.

Jon
it depends
Bill Hall
Honored Contributor

Jon,

I've got a few ES40s with a 3R-A513*-AA (Emulex LP10000) on each PCI hose. Also have ES40s with two KGPSA-DA (Emulex LP9002) on each PCI hose. We have never had video adapters in our ES40s though as they are all rack mounted in a data center environment.

Bill
Bill Hall