Re: %SCSI_CPU-W-RETRY errors on boot

Jess Goodman · ‎07-14-2010

I have an ES40 attached to a SCSI Hub which is attached to three other VMS nodes and a dual HSZ70 controller Raid cabinet.

The entire cluster (all Alphas running VMS 7.3-2 with patches, plus EVA SAN storage) was rebooted last week and this node came up with no problems. But when I rebooted this node last night (and on two more attempts today), it failed to make a local connection to this Raid cabinet, due to this error:

%SCSI_CPU-W-RETRY, port PKC0 alloclass 3 status 94 inconsistency

That error message shows up five times ("retry n/5") before the %MSCPLOAD-I-CONFIGSCAN message and then another five times after the %STDRV-I-STARTUP and %STDRV-I-LOG messages.

See the text attachment for the console output.

Relevant system parameters are:
ALLOCLASS=1
DEVICE_NAMING=1

In SYS$DEVICES.DAT (shown fully in the text attachment) this node's port PKC is set to allocation class 3.

A >>>SHOW DEVICE properly displays the PKC controller and all the devices that the HSZ70 cabinet presents.

On my last boot attempt I used -flags 0,30000 and these lines were displayed immediately before the first %SCSI_CPU message (I do not know if these are normal or not):

%LOADER-I-INIT, initializing SYS$PKQDRIVER.EXE
%PKQDRIVER-I- PKC0, loading firmware version 5.57 from console
%PKQDRIVER-I- PKC0, initialization complete; port online

Nothing relevant was changed since the last reboot. Thanks in advance for any help you can offer.

Jess

I have one, but it's personal.

Jon Pinkley · ‎07-14-2010

Jess,

We no longer have an HSZ70, but when we had a SCSI bus shared between two ES40s, a >>> SHOW DEVICE at the SRM console would display the SCSI adapter of the other VMS system (with VMS in the description). We were not using Port Allocation classes, or new device naming, so that is one difference. We also did not use a SCSI hub.

I have never seen the SCSI_CPU message (that I remember), but the text suggests that the system may be checking whether the other systems sharing the bus have the same allocation class. Do the other 3 VMS nodes have allocation class 3 on the SCSI controllers connected to the SCSI hub?

What were the "non-relevant" changes since the last boot? :-)

Jon

it depends

Shriniketan Bhagwat · ‎07-14-2010

Hi Jess,

If you look at the message, it looks like the same port allocation class is used for different interfaces.

%SCSI_CPU-W-RETRY, port PKC0 alloclass 3 status 94 inconsistency

Please check whether the same port allocation class has been assigned to different interfaces; it should not be.

Regards,
Ketan

Volker Halle · ‎07-15-2010

Jess,

I've recently seen this message as well under the following circumstances:

ES40 configured with shared SCSI (HSZ70) and a common SCSI system disk (V7.3-1).

2 HBAs have then been added to this ES40 and a new root has been configured on the SAN system disk, from which the ES40 had then been booted. The SAN system disk was actually a copy of the SCSI system disk (also V7.3-1).

While booting the ES40 from the SAN system disk, I saw those messages. I had checked the system parameters and port allocation classes and they looked o.k. There are 2 systems still booted from the HSZ70 system disk in this cluster, which are still connected to the shared SCSI bus.

Volker.

Ian Miller. · ‎07-15-2010

%SCSI_CPU-W-RETRY - one of my favourites is that one :-)

Check the system parameters and SYS$DEVICES.DAT on each node in the cluster.
There could be a node that has the port allocation class configured but is not connected to the HSZ.

In [CLUSTER]SCCPUVER there is some code that compares the nodes that have a port allocation class configured against a list of nodes this node can find by looking at what's on the SCSI buses.

It will not surprise Volker that I wrote an SDA extension to display that internal list :-)

____________________
Purely Personal Opinion

Verne Britton · ‎07-15-2010

IAN IAN IAN ... you are so funny ...

you tell us about it, but don't let us know where it is (download) or how to use it :-) :-)

Please, please ... share !!

p.s. how old or new does my VMS version have to be to use your SDA extension ??

Respectfully,

Verne

Jess Goodman · ‎07-15-2010

Ian,

You were right. I had, of course, carefully checked the systems that were connected to the HSZ. But I had not thought it necessary to check the other nodes in the cluster.

Sure enough, one of them used to be connected to this HSZ and it still had allocation class 3 defined for a SCSI port that currently has nothing connected to it.

I can't reboot that system until late tonight or tomorrow, so I won't know if this fixed my problem with the other node until then. I will post the results and award points after that.

Assuming it does fix it, I must say that I am quite surprised by this restriction. What is this software check protecting me against?

This would also mean that if my two other nodes that are still using the HSZ rebooted, they would not be able to access the HSZ either. That would leave the cluster with no path to access the HSZ until the unconnected node was reconfigured and rebooted.

I have one, but it's personal.

Jess Goodman · ‎07-15-2010

Ian was absolutely correct. I first rebooted a system that was not connected to the HSZ70s to clear its unused SCSI port's allocation class.

I then rebooted the problem system; there were no %SCSI_CPU error messages and it used its connection to the HSZ70s.

It would be nice if that error message was fully documented.

$ help /message scsi_cpu ! VMS 7.3-2
%MSGHLP-F-NOTFOUND, message not found in Help Message database

I have one, but it's personal.

Ian Miller. · ‎07-16-2010

SCSI_CPU-W-RETRY is defined only in the code and is not defined in a message file anywhere.

I've been caught myself by nodes having a port allocation class defined but not being connected to that shared scsi bus.

Each node will take a lock on a resource named after any non-zero port allocation class defined, e.g. for PAC=10 an EX lock on a resource called IOGEN$_10 is queued. Getting information about these locks allows a list to be created of the nodes that have that PAC defined. This can be seen using
SDA SHOW RESOURCE/NAME=IOGEN$_10

This list is compared against a list of nodes that can be seen on shared SCSI buses. If the two lists do not match then SCSI_CPU-W-RETRY is output to the console and only to the console.
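
The comparison described above can be pictured with a small Python sketch. This is only an illustration, not the VMS code: the function names and node names are made up, and it assumes the local node appears in both lists. Only the IOGEN$_<PAC> resource name comes from the description above.

```python
def pac_resource_name(pac):
    """Resource name locked for a non-zero port allocation class (PAC)."""
    if pac == 0:
        raise ValueError("no lock is taken for port allocation class 0")
    return "IOGEN$_%d" % pac


def unseen_holders(lock_holders, seen_on_bus):
    """Nodes that hold the PAC lock but are not visible on the shared SCSI bus.

    lock_holders: nodes found via the locks on the IOGEN$_<PAC> resource
    seen_on_bus:  nodes found by scanning the shared SCSI bus
    A non-empty result stands for the mismatch that triggers the warning.
    """
    return sorted(set(lock_holders) - set(seen_on_bus))


if __name__ == "__main__":
    # Hypothetical cluster: NODEE still has PAC 3 configured but is unplugged.
    holders = ["NODEA", "NODEB", "NODEC", "NODED", "NODEE"]
    seen = ["NODEA", "NODEB", "NODEC", "NODED"]
    print(pac_resource_name(3), unseen_holders(holders, seen))
```

An empty result means the two lists agree; any name returned is a node whose stale PAC setting needs to be cleared.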

____________________
Purely Personal Opinion

Jon Pinkley · ‎07-16-2010

Jess,

This check is to "protect" you against a misconfiguration that would allow two different physical devices to have the same ALLDEVNAM. It is a stricter check than the pre-PAC (Port Allocation Class) checks, which allowed multiple devices to have the same name.

If you had a device $3$DKC100: on the HSZ70, you disconnected from the HSZ70, and then plugged in a SCSI disk with SCSI ID 1 to the "unused" SCSI adapter, it would also show up as $3$DKC100:. If $3$DKC100: was being MSCP served and was mounted on the system with the "unused" adapter, I am not sure what would happen to the node you just plugged the disk into. The disk may go into mount verification, or perhaps the node would crash. At any rate, it isn't something that I would want to try on a production system. Since the HSZ uses high-voltage differential (HVD) wide SCSI, you probably wouldn't have a disk that you could easily use with the adapter, but you get the point.

Here's what the code does (paraphrased).

When the SCSI initialization is done, inquiries are sent to every device on the bus, and for every "CPU" response, an entry is filled in that has the PAC, SCSSYSTEMID, controller id (letter) and SCSI info. When all the devices have been configured, all CPUs that the system can directly see on the SCSI bus will be in the list. For each PAC in the list, the lock manager is used to determine which cluster nodes have configured an interface with the PAC. If there is a system (currently in the cluster) that has the PAC configured, but its SCSSYSTEMID isn't in the PAC_ID_LIST, then SS$_DUPLNAM status is returned (this is the status 94 in the cryptic message). However, the routine that has determined the offending SCSSYSTEMID does not print the message, and therefore this useful piece of information is not used in the messages printed on the console. If the routine that was checking the locks printed the message for every SCSSYSTEMID that it did not find in the list, it would make the message much more meaningful, i.e. something like

port PKC0 alloclass 3 configured on ES40_1 PKC but not seen on local PKC SCSI bus
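
To make the suggestion concrete, here is a rough Python sketch of what such per-node reporting could look like. It is purely illustrative and is not how the VMS routine is written: PacHolder, describe_mismatches and the node names are hypothetical, and nodes are identified by name rather than by SCSSYSTEMID to keep it short.

```python
from typing import NamedTuple


class PacHolder(NamedTuple):
    """A cluster node that has the PAC configured on one of its controllers."""
    node: str        # node name (stand-in for its SCSSYSTEMID)
    controller: str  # controller on that node, e.g. "PKC"


def describe_mismatches(local_port, pac, holders, seen_nodes):
    """Return one message per PAC holder not seen on the local SCSI bus."""
    # "PKC0" -> "PKC": drop the unit number to name the local controller.
    local_ctrl = local_port.rstrip("0123456789")
    return [
        "port %s alloclass %d configured on %s %s but not seen on local %s SCSI bus"
        % (local_port, pac, h.node, h.controller, local_ctrl)
        for h in holders
        if h.node not in seen_nodes
    ]


if __name__ == "__main__":
    holders = [PacHolder("ES40_1", "PKC"), PacHolder("ES40_2", "PKC")]
    for line in describe_mismatches("PKC0", 3, holders, {"ES40_2"}):
        print(line)
```

With one message per offending node, the stale configuration could be found without checking every cluster member by hand.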

The takeaway is the following:

If you have PAC-configured SCSI adapters and you plan to remove the connection to the shared bus permanently, plan to reconfigure the PAC and reboot the system at the time you remove the cable.

I see that Ian just posted with similar info, including the resource name of the lock used to find the SCSSYSTEMIDs of the current cluster members that have the PAC configured.

Jon

it depends
