
DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

 
Joe Trimble
Advisor

Hello, fellow VMS folks...

I just installed a new MSA1000 integrated system for use with OpenVMS. I created 4 new LUNs, gave them LUN ID numbers, and can see them on my system as $1$DGA1 thru $1$DGA4.

When I try to initialize or use the new drives with VMS, I get hundreds of errors, some random in nature, during the INIT command or while copying (or using BACKUP to copy) files to the new devices.

Here are some notes:
- OpenVMS V8.2
- All relevant patches applied, including VMS82A_FIBRE_SCSI V6.0.
- MSA1000 upgraded to firmware version 7
- MSA1000 in active/active mode
- 3 LUNs created using ADG (RAID6)
- 1 LUN created using RAID1
- 8 drives (out of 14) hosting the 4 LUNs
- LUNs are configured using disks 1-4 and 8-11 to optimize I/O across the SCSI buses.
- each MSA1000 controller has 512MB cache, configured 50% read, 50% write.
- caching is enabled on the configured luns

When I initialize the drives, they will randomly fail as the following example shows:

$ init /system /share /headers=64000 /structure=5 $1$dga3: disk3n
%INIT-F-DATACHECK, write check error

When the drive does init correctly, and then you mount and copy files to the device, these errors are routinely seen.

%BACKUP-E-OPENOUT, error opening DISK3N:[IS$DISK.MISTY.SQL]PZVDEDN.SQL;3 as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-F-DATACHECK, write check error
%BACKUP-E-OPENOUT, error opening DISK3N:[IS$DISK.OPS.MB]MB901S.RPT;1 as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-F-DATACHECK, write check error
%BACKUP-E-OPENOUT, error opening DISK3N:[IS$DISK.OPS.MB]MB911S.RPT;2 as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-W-BADIRECTORY, bad directory file format
%BACKUP-E-CREDIRERR, error creating directory DISK3N:[IS$DISK.OPS.MB.STORED]
-SYSTEM-W-BADIRECTORY, bad directory file format

This is only a small sample.
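
To see which errors dominate a larger capture, the messages could be tallied by code. The following is only a minimal sketch, assuming the BACKUP output has been saved as plain text and that Python is available on some machine where the log can be read; the function name `tally_messages` and the sample text are illustrative, not part of any OpenVMS tool.

```python
import re
from collections import Counter

# Matches OpenVMS message lines such as "%BACKUP-E-OPENOUT," or
# "-SYSTEM-F-DATACHECK," and captures facility, severity and ident.
MESSAGE = re.compile(r"^[%-]([A-Z0-9$_]+)-([SIWEF])-([A-Z0-9$_]+),", re.MULTILINE)

def tally_messages(log_text):
    """Count each FACILITY-SEVERITY-IDENT code found in the log text."""
    return Counter("-".join(m.groups()) for m in MESSAGE.finditer(log_text))

# Sample lines copied from the errors shown in this post.
sample = """\
%BACKUP-E-OPENOUT, error opening DISK3N:[IS$DISK.MISTY.SQL]PZVDEDN.SQL;3 as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-F-DATACHECK, write check error
%BACKUP-E-CREDIRERR, error creating directory DISK3N:[IS$DISK.OPS.MB.STORED]
-SYSTEM-W-BADIRECTORY, bad directory file format
"""

for code, count in tally_messages(sample).most_common():
    print(f"{count:4d}  {code}")
```

Run against a full log, a tally like this would show whether DATACHECK or BADIRECTORY is the more frequent secondary status.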

It seems that copying small files presents the problem more than copying large files. I have successfully copied large (20G) files to a device, and also restored a backup saveset containing several large files to the disk successfully. When I copy lots of small files, like moving user directories over to the new device, then I get hundreds of errors in a short time.

I'm likely going to call HP on this, but wanted to see what your experience has been and ask for your help.

Thank you!
Joe
29 REPLIES
Uwe Zessin
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Joe,

check the connections on the MSA1000 and make sure they are working with the "OpenVMS" profile.

CLI> show connections
...

CLI> add connection ALPHA1_PGA0 wwpn=10000000-C9244321 profile=OpenVMS
.
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

I have double-checked. All connections are designated profile=OpenVMS

Here's a thought...

I may have switched the names of the two connections between the MSA's and the HBA's in my system since VMS was last booted 2 days ago. The connections are called flash-1 and flash-2. I switched them in the MSA configuration so they would match the order of the HBA cards on the PCI bus in my system. The LUNs and ACLs were created after that switch. Could that make a difference? Could the system be confused? Should I try reinitializing the Alphaserver?

Thanks.
Jon Pinkley
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Joe,

Disclaimer: I have never used an MSA.

This is only a guess, but because you notice the problems with small files, I would suspect the cache. Does the MSA have any diagnostics it can run to test its memory?

Does the problem go away if you turn off caching?

Summary: My guess is that it is a hardware issue that needs to be fixed.

Jon
it depends
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Thanks, Jon.

I thought about the cache, and that is still on the table, as far as I'm concerned.

I ended up opening a call with HP OpenVMS Support this afternoon. I'm going to get some additional information for the support rep tomorrow morning, and power-cycle the MSA and my AlphaServer system. Unfortunately, today was a work-at-home day, so I'm not physically with the equipment today.

I mentioned the cache to the support rep. He indicated it was a potential problem, but needs additional information, thus my trip to the office early Saturday.

Thanks for the replies so far.. I'll update this thread again when I have more information or questions.

Joe
Rob Leadbeater
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Hi Joe,

If you've just installed the MSA1000, it's possible that the RAID initialisation is still going on in the background...

This *shouldn't* affect the hosts connected to it, but I guess it might have an effect.

If you're talking to HP support, they'll probably have you hook up the CLI cable to the front of the MSA, and do a "show techsupport" (IIRC). Looking through that output might highlight some issues.

You might also want to check that the firmware on the MSA is current. (5.20 or 7.00 depending on whether you're active/standby or active/active.)

Cheers,

Rob
Rinkens
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Are these disks write-enabled on the controller?
Maybe a stupid question.

$ init /system /share /headers=64000 /structure=5 $1$dga3: disk3n
%INIT-F-DATACHECK, write check error

Here it goes wrong already; you should solve this problem first.

Check the settings on your MSA1000:

writeback cache
read cache and so on



Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Here is an update on this situation. No resolution yet, so it is still being worked by HP Support.

Currently, the Storage team is still examining several log files created over the weekend during several hours of testing. Today it will likely be passed over to the AlphaServer team.

The Storage team's initial conclusion is that the MSA1000 is set up and configured correctly. Firmware (V7.0) looks good on both MSA1000 controllers, which are in active/active mode.

I found over the weekend that when I create and connect additional LUNs to another VMS server (DS25, identical VMS V8.2 and patches applied), that system can write files without errors. So the errors appear to be a problem with the ES40 connection to the MSA1000 only.

To answer recent suggestions:
- the initialization is complete on all LUNs; it made no difference in the errors.
- I have tried LUNs with cache both on and off. There seem to be fewer errors with cache turned off, but plenty of errors are still encountered. (From the DS25 test system, no DATACHECK errors occurred regardless of the cache setting.)
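
A repeatable small-file test would make comparisons like these (cache on versus off, ES40 versus DS25) easier to quantify. Below is a minimal sketch only, assuming Python is available on the host under test and that `target_dir` points at a directory on the suspect volume; the function name `small_file_check` and its defaults are hypothetical. Note that a read-back may be satisfied from host or controller cache, so a clean run does not prove the data on disk is good.

```python
import os
import tempfile

def small_file_check(target_dir, count=200, size=2048):
    """Write `count` small files of random data, read each one back,
    and return the paths whose create, write, read or compare failed."""
    failures = []
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(target_dir, f"chk{i:05d}.tmp")
        try:
            with open(path, "wb") as f:
                f.write(data)
            with open(path, "rb") as f:
                readback = f.read()
        except OSError:
            # File create, write or read failed outright.
            failures.append(path)
            continue
        if readback != data:
            # File was written but the contents do not match.
            failures.append(path)
    return failures

# Demonstration against a scratch directory; on a real system the
# directory would be on the volume being tested.
with tempfile.TemporaryDirectory() as scratch:
    bad = small_file_check(scratch, count=20, size=512)
    print(f"{len(bad)} of 20 files failed")
```

Running the same count and size on each system and cache setting gives a failure count that can be compared directly.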

I now believe there is some configuration or compatibility issue between my ES40 system and the MSA1000 integrated box. The ES40 has FCA2684 HBA cards installed (same as the DS25), and the firmware is updated to the latest rev (TS1.91X6).

Thanks for your continued assistance and questions. I'll update again later.

Joe
Rob Leadbeater
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Hi Joe,

Are you using the embedded 2/8 switch in the back of the MSA1000 ?

If so, has the version of FabricOS been checked ?


Cheers,

Rob
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Hi Rob,

Yes, we are using the embedded 2/8 switches.

No, I have not checked the firmware levels there. Do you think that might make a difference? The DS25 is newer hardware than the ES40. Could there be some incompatibility with the ES40?

The HP Storage team has not asked about the switches, at least not yet. I'll try to find out more information on this.

Thanks!
Joe
Rob Leadbeater
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Hi,

If the firmware on the switches is too old, it might cause incompatibility issues with the newer firmware on the HBAs...although if the DS25's HBAs are also at the same firmware revision, that should rule that out...

I'm surprised that Storage Support haven't asked you to look at the switch port status, as that might indicate some faults, either with an HBA or a cable etc.

Cheers,

Rob
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

The switches show firmware as follows. Both are identical, so only switch1 is displayed here.

MSA-VMS-1-switch1:admin> version
Kernel: 5.4
Fabric OS: v3.2.1b
Made on: Fri Jul 28 14:42:33 PDT 2006
Flash: Fri Jul 28 14:43:15 PDT 2006
BootProm: Mon Jul 8 18:35:44 PDT 2002

A quick check on the HP download site shows this is one minor revision behind on the fabric OS. The latest download is v3.2.1c, dated in 2007.

Joe
Rob Leadbeater
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

That version shouldn't cause any problems...

Does a "portErrShow" indicate any issues on the ES40's ports ?

Cheers,
Rob
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Rob,

I'll attach a small file showing output from a few commands on both 2/8 switches, including porterrshow. I don't know how to interpret most of the stats.

I've sent the same information to HP Storage Support as part of my open call with them.

Note the ES40 is connected to port 1 on each switch; the DS25 is connected to port 2.

Thanks,
Joe
Khairy
Esteemed Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Hi Joe,

Can you provide the show tech_support output from the MSA controller? I know you've checked this, but I just want to understand the configuration and the problem you have a bit better.

I installed a dual-controller MSA1000 (14 x 72GB) with 2 x DS25 running OpenVMS 7.3.2 last year without any problems. The only thing I encountered was a patch issue. I upgraded SRM on both systems to rule out a firmware problem, and it works for me.

> show tech_support
Jon Pinkley
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Joe,

No guarantees, but have you tried a complete power cycle (complete removal of power cords) on the ES40?

We installed DS-A5132-AA (370426-B21) PCI-X 64BIT 133MHZ 2Gb-ALPHA LP10000, FCA2684 HBAs in ES40 M2 4 6/667 systems, and had problems until I did a firmware upgrade on the ES40, using the manual procedure and upgraded everything, including RMC, followed by a complete power cycle (per the firmware upgrade instructions). For more details see this thread:

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1140804

The storage was EVA on HP (Brocade) 4/16 switch fabrics. The errors we saw were problems in autoconfig, not exactly the same as you are seeing, but I would still try it; it only costs time.

Your porterrshow output looks fine. The "enc out" errors are related to auto-negotiation and they are normal. If you really dislike them, you can lock the speed, but in my opinion, that tends to come back to haunt you in the future when you plug something else into that port (perhaps for debugging).

The MSA works with your DS25, which tends to point toward the ES40 as the problem location.

If you are going to need to power cycle, I would take the opportunity to do a firmware update on the ES40, and I would use the manual procedure to update everything, including RMC, followed by a complete power disconnect (not just turning off the power at the front panel). I shut my systems down, then disconnected the cords for a minute (perhaps overboard, but it isn't something you need to do frequently).

If that doesn't fix the problem, the next thing I would do is swap the FC HBA in the problem box. But I would try the firmware upgrade/power cycle first before introducing other changes.

Good luck,

Jon
it depends
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

I'm attaching a file containing the show tech_support command from controller 1, captured last Saturday.

Regarding ES40 firmware -- the latest firmware updates (7.3) were installed before the MSA1000 was connected. We needed to upgrade because our old firmware was not up to the correct level to recognize the MSA devices.

Current firmware revisions (taken from LFU capture last Saturday during a review):

Abios v5.71
SRM v7.3-1
pga0 TS1.91X6
pgb0 TS1.91X6
rmc V2.8
srom V2.22-G
tig 10

I have power-cycled the ES40 multiple times, but only by turning off the switch on the front panel. The LED's on the HBA's do go dark when the system is powered off. Do you think pulling the power cables will make a real difference?

Thanks,
Joe
Jon Pinkley
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Is the firmware on the HBA's up to date?

The first time I did my firmware upgrade, I didn't cycle power and I didn't update the RMC (I hadn't used the manual update, which is still automated).

Whether it makes a difference? We have had several flaky problems on the ES40s (in this case a fan was being reported as bad, when it appeared to be working). A Field Service engineer replaced the fan, it worked, but the show power command continued to call out a bad fan. He was about to replace the motherboard (he had even ordered it), but before we did, he said he wanted to cycle power, with a complete disconnect.

That solved the problem.

So in at least those two cases, cycling power completely seems to have had an effect. The ES40 is never "completely off" when you use the front panel.

I think there is even something in the firmware update documentation that says you need to do a complete power cycle after applying certain firmware updates (for example the RMC).

Like I said before, no guarantees. But I would try it. And if you haven't updated your RMC, I would do that too.

If you looked at the thread I referenced before, you will see we were getting errors. Have you analyzed your errlog with SEA?

Jon
it depends
Jon Pinkley
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

This thread is about a problem that sounds exactly like the one we had that was solved by a complete power cycle. My guess is they really didn't need to replace anything.

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1146545
it depends
Jon Pinkley
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Here's another recent thread involving ES40s.

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1247808

What "fixed" the problem was moving the CPU boards and memory banks. So the question remains, was it the complete power cycle (assuming they really did remove power when moving the modules), or was it the reseating of the components? If you don't try a complete power cycle before changing something, you never will know for sure what the fix was.

So I always do the complete power cycle first and verify that the problem still exists before starting to change anything else.

Jon
it depends
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

This morning I powered down the ES40, including pulling the power cables. I left it completely unplugged for about 5 minutes, then started it back up again.

The DATACHECK problems are still occurring, even after a complete utility power-down of the ES40.

Still searching for a cure....

Thanks,
Joe
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Jon,

I'm sorry I failed to answer one of your questions.

YES, the HBA firmware is up-to-date. The same firmware is installed on the FCA2684 cards in both my ES40 (DATACHECK errors) and the DS25 (works without errors). The revision level is TS1.91X6.

Thanks,
Joe
EdgarZamora
Trusted Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

This sure smells like a hardware problem to me. Have you tried reseating the HBAs, checking cable connections, etc.? Have you tried swapping the HBAs as Jon suggested? Have you tried switching the paths of the problem disks?
Jon Pinkley
Honored Contributor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Any clues in the errlog on the ES40?

What is odd is that the path to the MSA disks at least seemed to work with the large files, but there seems to be something about i/o with datacheck operations that is tripping something up.

Does the problem get worse if you mount the disk on the ES40 with the /datacheck qualifier? It isn't obvious to me why that would cause errors. Are you sure the 20G files you copied to the device were copied correctly? Did you do a backup/verify or a backup/compare (after the initial copy)?
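
On OpenVMS itself, BACKUP/COMPARE is the natural tool for that check. Where both copies can be read from a machine with Python, the idea can also be sketched as a block-by-block comparison that reports which 512-byte blocks differ. This is only an illustration under those assumptions; the function name `mismatched_blocks` and the file names are made up for the example.

```python
import os
import tempfile

BLOCK_SIZE = 512  # OpenVMS disk block size in bytes

def mismatched_blocks(source_path, copy_path, block_size=BLOCK_SIZE):
    """Return the zero-based numbers of the blocks that differ
    between the two files (a length difference also counts)."""
    bad = []
    with open(source_path, "rb") as src, open(copy_path, "rb") as dst:
        block = 0
        while True:
            a = src.read(block_size)
            b = dst.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(block)
            block += 1
    return bad

# Demonstration: a four-block file and a copy with one damaged byte.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.dat")
    copy = os.path.join(tmp, "copy.dat")
    data = bytes(range(256)) * 8        # 2048 bytes = 4 blocks
    damaged = bytearray(data)
    damaged[700] ^= 0xFF                # byte 700 lies in block 1
    with open(original, "wb") as f:
        f.write(data)
    with open(copy, "wb") as f:
        f.write(damaged)
    print(mismatched_blocks(original, copy))  # prints [1]
```

Knowing which blocks differ, and whether they cluster or are scattered, may help distinguish a cache or transfer-size problem from random corruption.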

Can you confirm that the ES40 used to work with other disks, and that the only thing changed was the addition of the HBA/MSA1000?

If the problem exists in the MSA1000, can anyone explain why the DS25 isn't seeing the same problem? If it were a problem in the FC switch, the GBIC, or the fiber cable, we would expect to see errors (like CRC) on that port. We don't. If you want to eliminate that as a possibility, you could swap the ports the DS25 and ES40 are plugged into.

Since you are using a dedicated switch, and I think the MSA is limited to serving a single type of OS at a time, you probably don't have to worry about zoning on the switch.

Do you have any other FC controllers you can connect the ES40 to? Since you are using the integrated switch in the MSA, my guess is that you do not.

I don't know if there are any "loopback" type diagnostics that can test the HBA.

Has HP discovered anything from the data they collected?

Jon
it depends
Joe Trimble
Advisor

Re: DATACHECK errors with MSA1000 DGA Devices on OpenVMSV8.2

Jon,

Thank you and others for hanging in here with me on this...

To answer your questions:

1. I have not tried the /data_check qualifier. Not real sure what good that would do in this situation.

2. I had not used backup/VERIFY in previous tests, but did try it today with some large files (6GB and 19GB files). The files copied without error, but the verify pass went crazy with verification errors for thousands of blocks. So this problem persists for large files, but manifests itself differently. Strange.....

3. The ES40 was (and still is) connected to a RaidArray 3000 system via SCSI. These drives have worked, and continue to work, flawlessly. In addition, there are a few disks in the system cage itself that have been and are working without error. Only the new SAN array I/O is bad from the ES40.

4. I upgraded the firmware on the 2/8 switches today. So now the switches (3.2.1c) and the MSA controllers (7.00) are at the most recent firmware levels. There was no change in the behavior on either system.

5. Zoning is not an issue. We are only trying to use this array with OpenVMS systems.

6. I do not have any other FC controllers available.

7. I'm not aware of any loopback diags; HP support has not mentioned anything like that.

8. So far HP Support has recommended firmware updates and locking down the speed of the ports, which I have done, to no avail.

I'm going to try swapping the fibre cables between the ES40 and DS25. This will eliminate the switches, port connectors and fibre cable itself as a problem source.

Next, I'm thinking of re-seating or moving the HBA cards to different PCI slots in the ES40, if there are two more slots available.

Next, I may try swapping the HBA cards between the ES40 and DS25. If the problem is in the cards, that will isolate them.

Thanks again,
Joe