Operating System - OpenVMS

Autoconfiguration problems ES40 with SCSI and FC

 
SOLVED
Jon Pinkley
Honored Contributor

Autoconfiguration problems ES40 with SCSI and FC

We have 2 ES40 M2 4/667 EV67 systems, which are boot nodes in a shared SCSI cluster.

After recently adding a single FC HBA to each ES40, the systems hang or crash during autoconfiguration of one of the SCSI adapters if a tape drive is attached, seemingly due to memory being corrupted.

Background follows:

Each node is configured identically, except that one has an FDDI DAS controller, which connects to a satellite node.

The ES40 M2 has 10 PCI slots. Ours are configured like this:

Slots 1-3 have 3X-KZPCA-AA 1 ch U2 LVD SCSI; slots 1 and 2 connect to DLT tapes, and slot 3 to internal disks
Slot 4 has 1Gb ethernet DEGPA-TA (old, slow 10/100/1000 Ethernet)
Slots 5-7 have KZPBA-CB 1 ch UWD SCSI (shared bus); slots 5 and 6 connect to HSZ70 controllers, slot 7 to an HSZ40 controller
Slot 8 DEFPA-DB FDDI DAS MMF (in one system, other has this slot empty)
Slot 9 (was empty, more on this later)
Slot 10 3X-DE602-AA Dual port 10/100 UTP Ethernet

With the above configuration, the systems were running Alpha VMS 7.3-2 with the latest patches as of May 6, 2007, and the firmware versions from Alpha Firmware CD 7.2 (the one that shipped with VMS 8.3) had been applied.

We want to connect the ES40s to an EVA6000 and needed to add FC adapters. We will be migrating all the storage from the HSZ before we disconnect them; we need the ability to connect to both the legacy SCSI and the EVA at least for a while.

We bought two DS-A5132-AA 1 ch 2Gb FC (Emulex LP10000) HBAs (one adapter for each ES40)

These were installed in slot 9, and the firmware for the cards was updated from the Alpha Firmware 7.3 CD (April 2007). I updated everything the update CD could update.

After these were added, I was able to connect to the EVA6000.

The problem I am seeing seems to be related to autoconfiguration and the combination of cards I have. If there is a problem in the configuration, I didn't see anything obvious in the Supported options list.

http://h18002.www1.hp.com/alphaserver/options/ases40/ases40_options.html

The systems have been stable prior to the latest addition, so I am inclined to believe the problems are related to the latest changes.

The "fine print" on the LP10000 has the following

-------------------------------------------------------

http://h18002.www1.hp.com/alphaserver/options/ases40/ases40_ds-a5132-aa.html

DS-A5132-AA
(370426-B21) PCI-X 64BIT 133MHZ 2Gb-ALPHA LP10000, FCA2684

"Option Restrictions

Total adapter ports installed cannot exceed maximum 6 ports."

-------------------------------------------------------

There are single and dual channel versions of this card, and I read this as no more than 6 FC ports can be installed on the system. The backplane on the ES40 is quite limited compared to new machines, so I wouldn't be surprised if the backplane could easily become a bottleneck.

The problem I am seeing occurs during autoconfiguration. First I had a problem after several autoconfig runs adding $1$DGA devices. Specifically, when I ran autoconfig on one node, that node would serve the device to the other node before I ran autoconfig on the other node. So I tried using

$ mcr sysman set env/node=(sigma,omega)
SYSMAN> io scsi
SYSMAN> io auto/log ! Hung here attempting to acquire a resource containing the string IOGEN$LOCK, aborted with ^Y

This evidently caused a problem for the SMISERVER on the SIGMA node, and later, in our overnight processing, a use of SYSMAN to execute commands on multiple nodes hung. I shut down the SIGMA node, and that solved the overnight problem on OMEGA, but later, when I attempted to boot the SIGMA node, it hung in STACONFIG with the PKE0 device busy (gathered from a forced crash). PKE0 had a DLT8000 tape drive attached. After trying other things (like removing the FC cables, which didn't help), I removed the cable from the DLT8000 drive, and then the system made it past STACONFIG and booted successfully.

After the users were done for the day, I reconnected the DLT8000 drive (while drive powered off), and issued a

$ mc sysman io scsi
$ mc sysman io auto /select=mke*/log

This didn't find anything so I did an autoconfig without /select

This scanned the PGA (FC controller) and I got a

%IOGEN-I-FIBREPOLL, scanning for devices through FIBRE port PGA0
%IOGEN-F-FTLIOERR, fatal I/O error while trying to access device <-- Note error message

which I assumed was because no FC cable was connected, so it could not poll for devices.

This didn't add any mke400 device. I then connected the cable to another DLT8000 drive that was originally connected to OMEGA (MKE300) and issued a sysman io auto/log; that was the last thing displayed on the terminal (the system crashed).

I connected to the console over a serial connection, and it was at the P00>>> prompt. Whether connecting to the console port "generated" a break or the system had halted, I don't know; normally I use ^P. The auto action is set to RESTART, so I was expecting an automatic reboot.

I booted, and the system crashed

I've attached a log with a summary of the crash.

Has anyone seen a similar problem? Would moving PCI cards to different slots possibly solve the problem? Is there anything else I should try?
it depends
9 REPLIES
Volker Halle
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

Jon,

PKE0 has accumulated an error count of 47 in the dump. Try

SDA> CLUE ERRLOG

and have a look at the errors in CLUE$ERRLOG.SYS with DECevent (or SEA).

Note R0=00000054 %SYSTEM-F-CTRLERR in the dump.

SCDRP$L_STS_PTR(R5) was invalid (FFFFFFFF), this caused the INVEXCEPTN crash.

Does the current SCDRP format correctly?

SDA> CLUE SCSI/REQUEST=8171F3E0

Volker.
Volker Halle
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

Jon,

SYSMAN IO AUTO uses an EX lock on the resource xxxxIOGEN$LOCK (where xxxx is the CSID of the local system) to synchronize certain operations during AUTOCONFIGURE on the local node. If your SYSMAN process was waiting for this lock, this would indicate that someone else may also have been running such an operation on the same node at the same time. Because you were using SYSMAN with a cluster environment set, the problem appears to have been on SIGMA, i.e. your 'remote' IO AUTO conflicted with a similar operation already 'hanging' on SIGMA.
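
The waiting behaviour described above can be pictured with a toy Python model. This is only a sketch, not OpenVMS code: the class, the method names and the CSID values are invented for illustration, and the real lock is taken through the OpenVMS lock manager. It shows one exclusive lock per node, so a second request on the same node waits while a first one is stuck, and a request on another node is unaffected.

```python
import threading


class NodeLocks:
    """Toy stand-in for per-node exclusive (EX) locks; not the VMS lock manager."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def resource_name(self, csid):
        # Naming assumed for illustration: CSID in hex followed by IOGEN$LOCK.
        return f"{csid:08X}IOGEN$LOCK"

    def _lock_for(self, csid):
        with self._guard:
            return self._locks.setdefault(self.resource_name(csid), threading.Lock())

    def hold(self, csid):
        """Simulate an autoconfigure on this node that never finishes."""
        self._lock_for(csid).acquire()

    def try_autoconfigure(self, csid, timeout):
        """Request the node's lock, giving up after `timeout` seconds."""
        name = self.resource_name(csid)
        lock = self._lock_for(csid)
        if not lock.acquire(timeout=timeout):
            return f"{name}: still waiting after {timeout}s"
        try:
            return f"{name}: granted, autoconfigure runs"
        finally:
            lock.release()


locks = NodeLocks()
SIGMA, OMEGA = 0x00010001, 0x00010002  # hypothetical CSIDs
locks.hold(SIGMA)                      # an IO AUTO already hanging on SIGMA
results = [locks.try_autoconfigure(SIGMA, 0.1),
           locks.try_autoconfigure(OMEGA, 0.1)]
for line in results:
    print(line)
```

In the model, the request against SIGMA times out while the one against OMEGA is granted, which matches a cluster-wide IO AUTO appearing to hang because of a single node.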

Volker.
Edgar Ulloa
Frequent Advisor

Re: Autoconfiguration problems ES40 with SCSI and FC

Hi

Also try this


$mcr sysman

io list
io auto/log
$ana/system
sda>fc set dev fga0
sda>fc name
sda>fc set dev fgb0
sda>fc name
sda>exit

again
sysman> io list

$sh dev

and that's it
Jon Pinkley
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

Volker>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
PKE0 has accumulated an error count of 47 in the dump. Try

SDA> CLUE ERRLOG

and have a look at the errors in CLUE$ERRLOG.SYS with DECevent (or SEA).

Note R0=00000054 %SYSTEM-F-CTRLERR in the dump.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<<

I did this and got

SDA> clue err

Dumpfile Errorlog Entry Information:
------------------------------------
Sequence  Date         Time         Error Message Type
--------  -----------  -----------  --------------------------------
       0  28-JUN-2007  19:08:13.00  CRD Throttle Event
       1  28-JUN-2007  19:08:13.00  CRD Throttle Event
       2  28-JUN-2007  19:08:13.00  CRD Throttle Event
       3  28-JUN-2007  19:08:13.00  CRD Throttle Event
       4  28-JUN-2007  19:08:23.61  Asynch Device Attention
... many repeats
      54  28-JUN-2007  19:09:11.62  Asynch Device Attention
      55  28-JUN-2007  19:09:11.62  * Crash Entry

Config Entry and Errlog Entries written to CLUE$ERRLOG.SYS file, use COMPAQ Analyze or DECevent to analyze.
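
For a quick overview before a proper decoder is available, a listing like the one above can be tallied by message type. The following Python sketch assumes the listing has been saved as plain text; the function name is made up, and the sample contains only the lines shown in this post, so the elided repeats are not counted.

```python
from collections import Counter

# Only the entries actually shown above; the "... many repeats" are omitted.
LISTING = """\
0 28-JUN-2007 19:08:13.00 CRD Throttle Event
1 28-JUN-2007 19:08:13.00 CRD Throttle Event
2 28-JUN-2007 19:08:13.00 CRD Throttle Event
3 28-JUN-2007 19:08:13.00 CRD Throttle Event
4 28-JUN-2007 19:08:23.61 Asynch Device Attention
54 28-JUN-2007 19:09:11.62 Asynch Device Attention
55 28-JUN-2007 19:09:11.62 * Crash Entry
"""


def tally_errlog(listing):
    """Count entries per message type in a CLUE ERRLOG style listing."""
    counts = Counter()
    for line in listing.splitlines():
        # Expected fields: sequence number, date, time, message type.
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[0].isdigit():
            # Drop the "*" marker that flags the crash entry.
            counts[fields[3].lstrip("* ").strip()] += 1
    return counts


counts = tally_errlog(LISTING)
for message, n in counts.most_common():
    print(f"{n:3d}  {message}")
```

This only summarizes the listing; the detail of each entry still has to come from DECevent or WEBES.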

I don't have WEBES installed; to install it I would need to convert a disk to ODS-5.

Volker>>>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

SCDRP$L_STS_PTR(R5) was invalid (FFFFFFFF), this caused the INVEXCEPTN crash.

Does the current SCDRP format correctly ?

SDA> CLUE SCSI/REQUEST=8171F3E0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<<
SDA> clue scsi /REQUEST=8171F3E0

SCSI Class Driver Request Packet (SCDRP):
-----------------------------------------
%CLUE-W-NOTSCDRP, (8171F3E0) is not a valid SCDRP address
SDA>


>>>Edgar Ulloa++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Also try this


$mcr sysman

io list
io auto/log
$ana/system
sda>fc set dev fga0
sda>fc name
sda>fc set dev fgb0
sda>fc name
sda>exit

again
sysman> io list

$sh dev

and thats it
>>>Edgar Ulloa++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The sysman io list didn't display anything.
io auto didn't discover any new devices
only fga0 exists (only one FC card in each ES40)

From the crash the output was:

Dump taken on 28-JUN-2007 19:09:11.62
INVEXCEPTN, Exception while above ASTDEL

SDA> fc set dev pga0
Unsupported device class (0x80) or device type (0x37)
SDA> fc show name
FGA0: Name List
Index qfl      qbl      port name        node name        state   ale_index rpi
----- -------- -------- ---------------- ---------------- ------- --------- ----
SDA>

Remember, the fibre was disconnected from the switch at the time of the crash.

Here's what things look like on the working system:

$ sho dev mk

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
SIGMA$MKD600:           Online               0
SIGMA$MKE400:           Online               0
$ sho dev dg

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
$1$DGA102:     (SIGMA)  Mounted              0  EVA001            672      1   3
$ anal/sys

OpenVMS (TM) system analyzer

SDA> fc set dev pga0
Unsupported device class (0x80) or device type (0x37)
SDA> fc show name
FGA0: Name List
Index qfl      qbl      port name        node name        state   ale_index rpi
----- -------- -------- ---------------- ---------------- ------- --------- ----
   0: 00000000 00000000 0000000000000000 0000000000000000 *******         0 0000
   1: 8165BB68 8165BB68 200900051E03F43E 100000051E03F43E VALID           1 0001
   2: 8165BBB0 8165BBB0 21FC00051E03F43E 100000051E03F43E VALID           2 0003
   3: 8165BBF8 8165BBF8 50001FE1500B89BD 50001FE1500B89B0 VALID           3 0004
   4: 8165BC40 8165BC40 50001FE1500B89B9 50001FE1500B89B0 VALID           4 0005
   5: 8165BC88 8165BC88 200900051E03F43E 100000051E03F43E VALID           5 0006
SDA>
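
To read a name list like this at a glance, the valid entries can be grouped by node name. The Python sketch below works on a pasted copy of the output above; the function names are invented for illustration, and the thread does not say which node name belongs to the switch and which to the EVA, so the code makes no such claim.

```python
from collections import Counter, defaultdict

# Data rows copied from the FC SHOW NAME output above.
NAME_LIST = """\
0: 00000000 00000000 0000000000000000 0000000000000000 ******* 0 0000
1: 8165BB68 8165BB68 200900051E03F43E 100000051E03F43E VALID 1 0001
2: 8165BBB0 8165BBB0 21FC00051E03F43E 100000051E03F43E VALID 2 0003
3: 8165BBF8 8165BBF8 50001FE1500B89BD 50001FE1500B89B0 VALID 3 0004
4: 8165BC40 8165BC40 50001FE1500B89B9 50001FE1500B89B0 VALID 4 0005
5: 8165BC88 8165BC88 200900051E03F43E 100000051E03F43E VALID 5 0006
"""


def ports_by_node(text):
    """Map each node name to the port names of its VALID entries."""
    nodes = defaultdict(list)
    for line in text.splitlines():
        # Columns: index, qfl, qbl, port name, node name, state, ale_index, rpi
        f = line.split()
        if len(f) == 8 and f[0].endswith(":") and f[5] == "VALID":
            nodes[f[4]].append(f[3])
    return dict(nodes)


def repeated_ports(nodes):
    """Return port names that appear in more than one VALID entry."""
    seen = Counter(port for ports in nodes.values() for port in ports)
    return sorted(port for port, n in seen.items() if n > 1)


nodes = ports_by_node(NAME_LIST)
for node, ports in nodes.items():
    print(node, "->", ", ".join(ports))
print("listed more than once:", repeated_ports(nodes))
```

On this data the grouping yields two node names, and one port name shows up in two entries (index 1 and index 5).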

I did some more testing on Monday night (first available time due to period end processing).

I reconnected the DLT8000 drive to the PKE controller (only thing on bus) and did sysman io auto/log, but no device was discovered.

I shut down the node, did an INIT, then show device. No MKE showed up (in both cases MKD did).

I booted the Firmware 7.3 CD and did a manual update. The RMC was updated to 2.8 (it still had 2.7, which was replaced in 6.5 according to the firmware release notes, but it had never been updated on the systems).

I hard-cycled the power (removed the power cords from the power supplies and waited for 1 minute after the last LED went out).

After that, init followed by show device showed the mke device. On the other ES40, the mke device showed up, and it had RMC 2.7 as well, so something was "confused" on the SIGMA ES40, and the update/power cycle seemed to clear the problem, at least for now. Two weeks ago, the same ES40 reported 2 bad fans, FAN1 and FAN2 (PCI), although they were working, and a hard power cycle "fixed" that problem too.

The other change was reformatting the FCA-2684 (LP10000) NVRAM by setting the topology to FABRIC with wwidmgr. Note that even without this setting, the system had been able to see the EVA disks, but it is possible that it was causing a problem.

I will leave this topic open for about a week in case anyone has any other input.

Thanks,

Jon

See attachment for logfile of the reappearing mke after hard powercycle.
Volker Halle
Honored Contributor
Solution

Re: Autoconfiguration problems ES40 with SCSI and FC

Jon,

you can use DECevent to decode SCSI adapter (PKE) errors.

For WEBES, you could either use an LD device with ODS-5 or install WEBES on your PC/laptop. You might want to look at the CRD (Correctable Read Data) errors as well.

Use SDA> FC SET DEV FGA0 (not PGA0)

The crash might have been caused by some problem handling those SCSI adapter errors. Probably not worth investigating any further.
It looks like parts of the SCDRP had been overwritten...

Volker.
Jon Pinkley
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

Volker,

I saw the kit for Windows, but I assumed that it was for decoding errors from Windows platforms.

If I can install it there, I would prefer that, because I have heard that WEBES is a resource hog, but perhaps that is not an issue on a reasonably new machine.

I will download the Windows kit and try it.

Thanks for the update.

Also thanks for the hint that CRD wasn't an abbreviation for "Card".

Jon
Volker Halle
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

Jon,

the WEBES tool can decode errorlog files from any supported platform on any supported platform. Don't bother with the CLI interface; use the WEB GUI on localhost:7902

The advantage of running WEBES on the ES40s is that the analysis would automatically send you mail if it finds any problem patterns while scavenging ERRLOG.SYS.

Volker.
Jon Pinkley
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

I am closing this; things have been stable for the last 10 days.

I am not sure what the fix was, but things have been working correctly since I reapplied the firmware, using the manual update procedure to update everything. The RMC was updated, and after updating I did a complete power-off (including removing the power cords from all power supplies).

Special thanks to Volker, who ran my ERRLOG files through WSEA.
Jon Pinkley
Honored Contributor

Re: Autoconfiguration problems ES40 with SCSI and FC

See above.