Operating System - OpenVMS

%SCSI_CPU-W-RETRY errors on boot

SOLVED
Jess Goodman
Esteemed Contributor

%SCSI_CPU-W-RETRY errors on boot

I have an ES40 attached to a SCSI Hub which is attached to three other VMS nodes and a dual HSZ70 controller Raid cabinet.

The entire cluster (all are Alphas VMS 7.3-2 with patches and EVA SAN storage) was rebooted last week and this node came up with no problems. But when I rebooted this node last night (and two more attempts today), it failed to make a local connection to this Raid cabinet, due to this error:

%SCSI_CPU-W-RETRY, port PKC0 alloclass 3 status 94 inconsistency

That error message shows up five times ("retry n/5") before the %MSCPLOAD-I-CONFIGSCAN message and then another five times after the %STDRV-I-STARTUP and %STDRV-I-LOG messages.

See the text attachment for the console output.

Relevant system parameters are:
ALLOCLASS=1
DEVICE_NAMING=1

In SYS$DEVICES.DAT (shown fully in the text attachment) this node's port PKC is set to allocation class 3.

A >>>SHOW DEVICE properly displays the PKC controller and all the devices that the HSZ70 cabinet presents.

On my last boot attempt I used -flags 0,30000 and these lines were displayed immediately before the first %SCSI_CPU message (I do not know if these are normal or not):

%LOADER-I-INIT, initializing SYS$PKQDRIVER.EXE
%PKQDRIVER-I- PKC0, loading firmware version 5.57 from console
%PKQDRIVER-I- PKC0, initialization complete; port online

Nothing relevant was changed since the last reboot. Thanks in advance for any help you can offer.

Jess
I have one, but it's personal.
13 REPLIES
Jon Pinkley
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Jess,

We no longer have an HSZ70, but when we had a SCSI bus shared between two ES40s, a >>> SHOW DEVICE at the SRM console would display the SCSI adapter of the other VMS system (with VMS in the description). We were not using Port Allocation classes, or new device naming, so that is one difference. We also did not use a SCSI hub.

I have never seen the SCSI_CPU message (that I remember), but the text suggests that the system may be checking whether the other systems sharing the bus have the same allocation class. Do the other 3 VMS nodes have allocation class 3 on the SCSI controllers connected to the SCSI hub?

What were the "non-relevant" changes since the last boot? :-)

Jon
it depends
Shriniketan Bhagwat
Trusted Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Hi Jess,

If you look at the message, it looks like the same port allocation class is used for different interfaces.

%SCSI_CPU-W-RETRY, port PKC0 alloclass 3 status 94 inconsistency

Please check whether the same port allocation class has been set on different interfaces, and do not assign the same port allocation class to different interfaces.

Regards,
Ketan
Volker Halle
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Jess,

I've recently seen this message as well under the following circumstances:

ES40 configured with shared SCSI (HSZ70) and a common SCSI system disk (V7.3-1).

2 HBAs have then been added to this ES40 and a new root has been configured on the SAN system disk, from which the ES40 had then been booted. The SAN system disk was actually a copy of the SCSI system disk (also V7.3-1).

While booting the ES40 from the SAN system disk, I saw those messages. I had checked the system parameters and port allocation classes and they looked o.k. There are 2 systems still booted from the HSZ70 system disk in this cluster, who are still connected to the shared SCSI bus.

Volker.
Ian Miller.
Honored Contributor
Solution

Re: %SCSI_CPU-W-RETRY errors on boot

%SCSI_CPU-W-RETRY - one of my favourites is that one :-)

Check the system parameters and SYS$DEVICES on each node in the cluster.
There could be a node that has the port allocation class configured but is not connected to the HSZ.

In [CLUSTER]SCCPUVER there is some code that compares the nodes that have a port allocation class configured against a list of nodes this node can find by looking at what's on the SCSI buses.

It will not surprise Volker that I wrote an SDA extension to display that internal list :-)
____________________
Purely Personal Opinion
Verne Britton
Regular Advisor

Re: %SCSI_CPU-W-RETRY errors on boot

IAN IAN IAN ... you are so funny ...

you tell us about it, but don't let us know where it is (download) or how to use it :-) :-)

Please, please ... share !!

p.s. how old or new does my VMS version have to be to use your SDA extension ??



Respectfully,

Verne
Jess Goodman
Esteemed Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Ian,

You were right. I had, of course, carefully checked the systems that were connected to the HSZ. But I had not thought it necessary to check the other nodes in the cluster.

Sure enough, one of them used to be connected to this HSZ and it still had allocation class 3 defined for a SCSI port that currently has nothing connected to it.

I can't reboot that system until late tonight or tomorrow, so I won't know if this fixed my problem with the other node until then. I will post the results and award points after that.

Assuming it does fix it, I must say that I am quite surprised by this restriction. What is this software check protecting me against?

This would also mean that if my two other nodes that are still using the HSZ rebooted, they would not be able to access the HSZ either. That would leave the cluster with no path to access the HSZ until the unconnected node was reconfigured and rebooted.
I have one, but it's personal.
Jess Goodman
Esteemed Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Ian was absolutely correct. I first rebooted a system that was not connected to the HSZ70s to clear its unused SCSI port's allocation class.

I then rebooted the problem system. There was no %SCSI_CPU error message, and it used its connection to the HSZ70s.

Would be nice if that error message was fully documented.

$ help /message scsi_cpu ! VMS 7.3-2
%MSGHLP-F-NOTFOUND, message not found in Help Message database
I have one, but it's personal.
Ian Miller.
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

SCSI_CPU-W-RETRY is defined only in the code and is not defined in a message file anywhere.

I've been caught myself by nodes having a port allocation class defined but not being connected to that shared SCSI bus.

Each node will take a lock on a resource named after any non-zero port allocation class defined, e.g. for PAC=10 an EX lock on a resource called IOGEN$_10 is queued. Getting information about these locks allows a list to be created of the nodes that have that PAC defined. This can be seen using
SDA SHOW RESOURCE/NAME=IOGEN$_10

This list is compared against a list of nodes that can be seen on shared SCSI buses. If the two lists do not match then SCSI_CPU-W-RETRY is output to the console and only to the console.
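
As a small illustration of the naming rule above, here is a Python sketch that turns a set of port allocation classes into the matching resource names and SDA commands. The function names are made up for this example, and it assumes the PAC appears in the resource name as a plain decimal number, as in the PAC=10 example; that has not been checked against the VMS sources.

```python
def pac_resource_name(pac: int) -> str:
    """Build the lock resource name for a non-zero port allocation class.

    Assumption: the PAC is appended in decimal with no padding,
    matching the IOGEN$_10 example for PAC=10.
    """
    if pac == 0:
        raise ValueError("no lock is taken for a zero port allocation class")
    return f"IOGEN$_{pac}"


def sda_commands(pacs):
    """Return one SDA SHOW RESOURCE command per distinct non-zero PAC."""
    return [
        f"SHOW RESOURCE/NAME={pac_resource_name(pac)}"
        for pac in sorted(set(pacs))
        if pac != 0
    ]


if __name__ == "__main__":
    # PAC values gathered by hand from each node's SYS$DEVICES.DAT (hypothetical input)
    for command in sda_commands([3, 0, 10, 3]):
        print(command)
```

Running the commands it prints inside SDA on one node should show which cluster members currently hold a lock for each PAC.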
____________________
Purely Personal Opinion
Jon Pinkley
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Jess,

This check is to "protect" you against a misconfiguration that would allow two different physical devices to have the same ALLDEVNAM. It is a stricter check than the pre-PAC (Port Allocation Class) checks, which allowed multiple devices to have the same name.

If you had a device $3$DKC100: on the HSZ70, you disconnected from the HSZ70, and then plugged in a SCSI disk with SCSI ID 1 to the "unused" SCSI adapter, it would also show up as $3$DKC100:. If $3$DKC100: was being MSCP served and was mounted on the system with the "unused" adapter, I am not sure what would happen to the node you just plugged the disk into. The disk may go into mount verification, or perhaps the node would crash. At any rate, it isn't something that I would want to try on a production system. Since the HSZ uses HV differential wide SCSI, you probably wouldn't have a disk that you could easily use with the adapter, but you get the point.

Here's what the code does (paraphrased).

When the SCSI initialization is done, inquiries are sent to every device on the bus, and for every "CPU" response, an entry is filled in that has the PAC, SCSSYSTEMID, controller id (letter) and SCSI info. When all the devices have been configured, all CPUs that the system can directly see on the SCSI bus will be in the list. For each PAC in the list, the lock manager is used to determine which cluster nodes have configured an interface with the PAC. If there is a system (currently in the cluster) that has the PAC configured, but its SCSSYSTEMID isn't in the PAC_ID_LIST, then SS$_DUPLNAM status is returned (this is the status 94 in the cryptic message). However, the routine that has determined the offending SCSSYSTEMID does not print the message, and therefore this useful piece of information is not used in the messages printed on the console. If the routine that was checking the locks printed the message for every SCSSYSTEMID that it did not find in the list, it would make the message much more meaningful, i.e. something like

port PKC0 alloclass 3 configured on ES40_1 PKC but not seen on local PKC SCSI bus
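
To make the paraphrase concrete, here is a rough Python model of the comparison. It is not the VMS code: the function names are invented, nodes are identified by name rather than by SCSSYSTEMID, and the lock-holder list is passed in as plain data instead of being read from the lock manager. The only values taken from this thread are the "status 94" (SS$_DUPLNAM, treated here as hex) and the suggested message wording.

```python
SS_NORMAL = 1      # standard VMS success status
SS_DUPLNAM = 0x94  # the "status 94" in the console message, per this thread


def check_pac(lock_holders, seen_on_bus, local):
    """Compare cluster nodes holding the PAC lock with CPUs seen on the bus.

    lock_holders: nodes currently in the cluster with this PAC configured
    seen_on_bus:  nodes that answered as "CPU" on the local shared SCSI bus
    local:        the node doing the check (it holds the lock itself)
    Returns (status, missing), where missing lists the offending nodes.
    """
    missing = sorted(set(lock_holders) - set(seen_on_bus) - {local})
    return (SS_DUPLNAM if missing else SS_NORMAL), missing


def describe(port, pac, missing):
    """Format the more helpful message suggested above, one line per node."""
    return [
        f"port {port} alloclass {pac} configured on {node} "
        f"but not seen on local {port} SCSI bus"
        for node in missing
    ]


if __name__ == "__main__":
    # ES40_1 still has PAC 3 configured but is no longer cabled to the hub.
    status, missing = check_pac(
        lock_holders=["ES40_1", "ES40_2", "ES40_3"],
        seen_on_bus=["ES40_3"],
        local="ES40_2",
    )
    print(hex(status))
    for line in describe("PKC0", 3, missing):
        print(line)
```

The point of returning the missing nodes along with the status is exactly the gap described above: the real routine knows which SCSSYSTEMID is the offender, but only the status makes it to the console.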

The takeaway is the following:

If you have PAC-configured SCSI adapters, and you plan to remove the connection to the shared bus permanently, plan to reconfigure the PAC and reboot the system at the time you remove the cable.

I see that Ian just posted with similar info, including the resource name of the lock used to find the SCSSYSTEMIDs of the current cluster members that have the PAC configured.

Jon
it depends
Ian Miller.
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

PAC$SDA now added to my freeware page at

http://eisner.encompasserve.org/~miller/
____________________
Purely Personal Opinion
Jess Goodman
Esteemed Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

The reason I was surprised to hear about this is that, unless I am mistaken, it can turn boot order into a critical factor affecting whether any node in the cluster can access an entire disk farm.

As I said we rebooted the cluster last week and we didn't have this problem. All the nodes on the SCSI hub connected to the HSZ70 and its volumes.

This must have been because the node that is no longer connected to the HSZ70 was booted last, so its unneeded port having a PAC set did not cause a problem.

But if I understand this algorithm, if that node had booted first then all the nodes connected to the SCSI hub would have failed to connect to the HSZ70 (showing this unhelpful message) and the HSZ70's storage would have been inaccessible to the cluster.
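
The boot-order effect can be played through with a toy Python simulation. It only reflects my reading of the algorithm as described in this thread, with two simplifying assumptions: a node runs the check only if it sees at least one other CPU on its bus, and every node cabled to the bus is visible to the others whether or not it has booted yet. The node names are made up.

```python
def simulate_boot(order, pac_nodes, bus_members):
    """Return the nodes that would report the warning, in boot order.

    order:       sequence of node names in the order they (re)boot
    pac_nodes:   nodes with the port allocation class configured
    bus_members: nodes actually cabled to the shared SCSI bus
    """
    pac_nodes, bus_members = set(pac_nodes), set(bus_members)
    holders = set()  # nodes currently holding the PAC lock
    warned = []
    for node in order:
        if node not in pac_nodes:
            continue
        # Peers this node can see; empty if it is not cabled to the bus.
        seen = bus_members - {node} if node in bus_members else set()
        others = holders - {node}
        # Assumption: no CPUs seen on the bus means no check is made.
        if seen and not others <= seen:
            warned.append(node)
        holders.add(node)
    return warned


if __name__ == "__main__":
    pac = {"A", "B", "C", "OLD"}  # OLD still has the PAC set
    bus = {"A", "B", "C"}         # but is no longer on the hub
    print(simulate_boot(["A", "B", "C", "OLD"], pac, bus))       # stale node last
    print(simulate_boot(["OLD", "A", "B", "C"], pac, bus))       # stale node first
    print(simulate_boot(["A", "B", "C", "OLD", "A"], pac, bus))  # A rebooted later
```

Under those assumptions the first order produces no warnings, the second makes all three connected nodes warn, and the third (one connected node rebooted after the stale node is up) reproduces what I saw this week.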

This seems extreme to me, since nothing is plugged into that port. Why can't this check be delayed until AUTOCONFIGURE discovers at least one device on the port? Even better would be if it only happened when there is an actual "duplicate device" name, but I see why that might be technically difficult.

Jess
I have one, but it's personal.
Volker Halle
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Note: PAC$SDA V1.0 only shows the FIRST entry in the IOC$GL_PAC_ID_LIST, additional entries are not being displayed (e.g. in case of a 3 node shared SCSI cluster).

Volker.
Volker Halle
Honored Contributor

Re: %SCSI_CPU-W-RETRY errors on boot

Jess,

this problem can ALSO show up on a CORRECTLY configured shared-SCSI cluster (using port allocation class 1), which ALSO uses SAN devices (DGA) - at least on OpenVMS V7.3-1.

I've found that there are many more locks on resource IOGEN$_1 than there were nodes connected to that shared SCSI bus with PAC=1.

I looked through the DDBs until I found a DDB with DDB$L_CLASS_LKID .ne. 0. That entry pointed to a PKB, which contained the LKID of an IOGEN$_1 lock. And that DDB was for a 'DGA' (SAN) disk.

This finding supports the suggestion of NOT using port allocation classes 1 or 2 for shared SCSI clusters, especially if planning to migrate to SAN disks later on.

Volker.