
Formerly reliable V8.3 AS4100 boot now hangs

SOLVED
Edward Miller_1
Occasional Advisor

We have had a persistent problem in booting either of our two (CI-clustered) AS4100
VMS 8.3 systems since February 2013. That problem is still only partially understood
and only partially resolved. The misbehaviour appears to be identical on both systems,
though our tests have been done almost exclusively on one of the systems.
|
| The problem behaviour:
| If a system is shut down normally, then rebooted normally at the
| >>> console prompt, the boot is successful.
|
| However, if a system is shut down normally, but then
| either the >>> HALT command is executed
| or the system is power cycled
| then a subsequent attempt to boot fails. The boot starts normally and is clearly
| successfully accessing the system disk, but it eventually hangs.
| There is no failure message, no return to a console prompt -- the boot
| is simply not proceeding.
|
When we first encountered the above failure we were fortunate enough
to discover the following workaround:
|
| After a power cycle (or a >>>HALT) -- which would otherwise lead to the above
| failure in the normal boot attempt:
|
| Boot instead from the VMS 8.3 distribution CD, which eventually
| leads to a prompt asking you what action you wish to select from a menu.
|
| At that prompt, enter ^P, which leads to the console prompt (>>>).
|
| At the console prompt, issue the normal boot command, which now succeeds.
|
| (No, we had no good idea why booting from the CD first makes the normal
| VMS boot, which would otherwise fail, succeed.)
|
We have done considerably more investigation of this behaviour, and have accumulated
a lot of evidence, but no clear indication of the cause of this problem. There have
been reports (on both I64 and Alpha systems) of a boot problem which has some similarities
to what we observe. However, in those reports the failure is an unexpected prompt for the
date and time after the boot is initiated, not the boot hang which we observe. Some references:
|
| http://labs.hoffmanlabs.com/node/1795
| http://h30499.www3.hp.com/t5/Integrity-Servers/Hp-Integrity-rx2620-Date-Time-issue/td-p/4344487#.Ua4u1kBeY-F
| https://groups.google.com/forum/#!topic/comp.os.vms/NxACjO3dLjw
|
The information in the above references (together with our own observation that our failures first
occurred 5 years after the LINK date stored in our SYS$BASE_IMAGE.EXE file -- 14-FEB-2008 11:34:23.39)
inspired us to make the following test:
|
| (1) Set the system date back to 2012
| (2) Do a normal shutdown
| (3) At the prompt >>> HALT
| (4) At the prompt >>> B
|
and this boot now succeeds instead of hanging (as it would if we were using the correct system date).
We expect (but have not yet demonstrated) that applying VMS 8.3 updates that provide
a SYS$BASE_IMAGE.EXE file with a more recent LINK date will similarly fix the problem,
even when using the correct system date. (The reason for our delay in installing these
updates is that we have a common system disk for the two AS4100 systems, and one of those
systems is a production system that can not be disturbed until a "downtime" opportunity
arises.)
|
Summary to this point:
| We expect that the "solution" to the problems others have reported will be a
| solution to our problem, but why do we see a boot hang instead of the date-time prompt?
|
| We have seen no official HP acknowledgment (or real fix) for the Alpha
| date-time-prompt version of this problem (assuming it is the "same" problem as ours).
|
Some additional information on the nature of the boot hang:
|
| We have performed the failing boot with debug output enabled and have
| forced a crash dump when it hangs. We have done this a few times, and the
| results are repeatable.
|
| The crash dump indicates that the time of the crash was exactly 20
| years in the past -- we have no explanation for that, but it seems
| to be implicated in the hang which we observe (see details below).
|
| Some output from the crash analysis:
|
| ===============================================================================
| Dump taken on 16-MAY-1993 09:30:00.39 using version V8.3
| OPERCRASH, Operator forced system crash
| ===============================================================================
| Current process summary
| -----------------------
|
| Extended Indx Process name Username State Pri PCB/KTB PHD Wkset
| -- PID -- ---- --------------- ------------ ------- --- -------- -------- ------
| 00000401 0001 SWAPPER SYSTEM PFW 16 824E39C8 824E3400 0
| ===============================================================================
| Crash Time: 16-MAY-1993 09:30:00.39
| Bugcheck Type: OPERCRASH, Operator forced system crash
| Node: MCCDEV (Cluster)
| CPU Type:
| VMS Version: V8.3
| Current Process: NULL
| Current Image: <not available>
| Failing PC: FFFFFFFF.8AC68688
| Failing PS: 00000000.00000004
| Module: <not available>
| Offset: 00000000
|
| Boot Time: 17-NOV-1858 00:00:00.00
| System Uptime: 0 00:00:00.00
| ===============================================================================
|
| The console debug output at the point of the hang:
| ...
| %LOADER-I-INIT, initializing MSCP
| %LOADER-I-INIT, initializing SYSLDR_DYN
| %LOADER-I-INIT, initializing SYS$MME_SERVICES
| %LOADER-I-INIT, initializing SYS$NTA ** last line appearing in the hung boot
| %LOADER-I-INIT, initializing SSPI ** next line that appears in a normal boot
|
| (Note that this is not the first set of references to SYS$NTA and SSPI, but
| the fifth such set of references in the debug output.)
|
| If, at the hang, one issues a series of ^P and >>> CONT commands, the
| address reported is always FFFFFFFF.8AC68688, and the CLUE CRASH analysis of the
| dump indicates that it was in the NULL process, with this code at that address:
|
| ...
| FFFFFFFF.8AC68680: LDA R16,#X0041(R31)
| FFFFFFFF.8AC68684: CSERVE
| FFFFFFFF.8AC68688: BR R31,#XFFFF9F
|
| Question: what is CSERVE with function code #X0041 supposed to be doing?
| Is this supposed to be an implementation of the NULL process, or something
| else? (We found no such CSERVE reference on our VMS 8.3 source CD set -- but
| maybe we missed it.)
|
| From a detailed examination of the dump we see:
|
| The SWAPPER (which is the only process listed in the system, and which
| is in PFW state) is supposed to be accessing pages on the disk for
| image SSPI.
|
| None of the expected connections over the CI have been made (to the
| other AS4100 and the 4 HSJ disk controllers). (But the system disk is
| being booted thru one of those HSJ controllers.)
|
| The failure of the SWAPPER to access the disk appears to be a consequence of
| the 20 year old system time reported for the dump. The 20 year discrepancy
| in EXE$GQ_SYSTIME leads to the failure to call the routine EXE$TIMEOUT,
| which in turn leads to the failure to call the routine PN_TIMER within the
| CIPCA driver. This prevents the driver from polling for other CI nodes, and
| thus from discovering paths to other CI nodes when those nodes respond to the
| polls, and finally from building SCS path blocks and system blocks and linking
| these into the system database. In the absence of this database infrastructure,
| the local copy of DUDRIVER is never informed that the HSJ disk controllers possess
| instances of the MSCP disk server to which it can connect.
|
|Some configuration details:
|--------------------------
| Configuration: VMS 8.3 cluster with two AS4100 CI nodes, 4 HSJ50's on CI
| and a few NI nodes. The AS4100 systems have a common system disk on
| one pair of the HSJ's.
|
| AS4100 running VMS 8.3 with following updates only:
| DEC AXPVMS VMS83A_SMGRTL V1.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_CLIUTL V1.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_COPY V1.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_DCL V3.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_LOGIN V1.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_MAILSHR V1.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_MANAGE V3.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_RMS V7.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_SYS V9.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_UPDATE V6.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_PCSI V2.0 Patch Install Val 09-SEP-2008
| DEC AXPVMS VMS83A_FORRTL V1.0 Patch Install Val 11-DEC-2007
| HP AXPVMS DCPS V2.6 Full LP Install (U) 06-DEC-2007
| HP AXPVMS DCPS V2.5 Full LP Remove - 06-DEC-2007
| DEC AXPVMS FORTRAN V8.0-2 Full LP Install (U) 21-SEP-2007
| DEC AXPVMS FORTRAN V8.0-2 Full LP Install (U) 21-SEP-2007
| DEC AXPVMS FORT95HOT T8.1-104662 Patch Remove - 21-SEP-2007
| DEC AXPVMS FORTRAN V8.0-1 Full LP Remove - 21-SEP-2007
| DEC AXPVMS CXML V5.2-1 Full LP Install (U) 21-SEP-2007
| DEC AXPVMS FORRTL V7.6-1 Full LP Install (U) 21-SEP-2007
| DEC AXPVMS VMS83A_SYS V1.0 Patch Install Val 11-SEP-2007
| DEC AXPVMS VMS83A_MOUNT96 V3.0 Patch Install Val 11-SEP-2007
| DEC AXPVMS VMS83A_BACKUP V3.0 Patch Install Val 11-SEP-2007
| DEC AXPVMS VMS83A_ACRTL V3.0 Patch Install Val 11-SEP-2007
| DEC AXPVMS DWMOTIF_ECO02 V1.6 Patch Install Val 11-SEP-2007
| DEC AXPVMS VMS83A_UPDATE V3.0 Patch Install Val 11-SEP-2007
| CPQ AXPVMS CDSA V2.2-271 Full LP Install (C) 11-SEP-2007
| DEC AXPVMS DECNET_PHASE_IV V8.3 Full LP Install (U) 11-SEP-2007
| DEC AXPVMS DWMOTIF V1.6 Full LP Install (C) 11-SEP-2007
| DEC AXPVMS DWMOTIF_SUPPORT V8.3 Full LP Install (U) 11-SEP-2007
| DEC AXPVMS OPENVMS V8.3 Platform Install (U) 11-SEP-2007
| DEC AXPVMS VMS V8.3 Oper System Install (U) 11-SEP-2007
|
| Our console version:
| PALCODE_VERSION = 1.21-2
| CONSOLE_VERSION = V6.0-4
| The HP site indicates that V6.1 is the latest console for AS4100, and we
| intend to install that. V6.1 purportedly includes the same PALCODE
| version that we are running.
|
Any help in further explaining what we have reported above would be welcome.
Has anyone else experienced boot hangs that are the same as or similar to ours?
Is there an acknowledgment from HP (and a fix) for the date-time interference with
booting in our (Alpha VMS 8.3 AS4100) environment?

Thanks,
Ed Miller

3 REPLIES
Jur van der Burg
Respected Contributor

Re: Formerly reliable V8.3 AS4100 boot now hangs

It's an educated guess, but there was a problem in a VMS patch kit for V8.3 which could cause a hang during boot when the time was wrong. The affected image was EXEC_INIT, which had a problem calculating a time difference, so look for the latest kits that include that image and apply them.

Fwiw,

Jur.

Edward Miller_1
Occasional Advisor

Re: Formerly reliable V8.3 AS4100 boot now hangs

Hello Jur,
Thanks very much for your suggestion. In the VMS83A_SYS-V1100.RELEASE_NOTES I find
the following fix (EXEC_INIT.EXE created 24-JUN-2009):
|
| 5.2.9 TQE Relocation To Avoid Hang While Booting
|
| 5.2.9.1 Problem Description:
|
| TQEs (Timer Queue Entry) that were supposed to
| expire during the boot process did not expire,
| resulting in a system hang during boot. Boot time
| hangs were sometimes caused by a difference between
| the default system time and the current system time,
| which could not be obtained from the HW clock. This
| caused some TQEs to wait until the current system
| time "caught up" to the default system time.
|
| Images Affected:
|
| - [SYS$LDR]EXEC_INIT.EXE
|
| - [SYS$LDR]EXEC_INIT.STB
|
This probably is (at least part of) our problem. We do have a hang, and we do have several TQEs
(set to expire about 5 years ago, whereas the system time is 20 years ago). What caused
us to trigger this bug is less clear -- we have no indication of a problem with the
hardware clock, and the problem appeared "simultaneously" on two systems each with
its own hardware clock.
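
To make the mechanism in the release note concrete, here is a minimal toy model in Python. It is only a sketch of the idea (a queue entry fires once the clock reaches its due time), not VMS code; the function name run_timer_queue and the one-second tick are assumptions made for illustration. The entry names and due times are borrowed from the timer queue dump below.

```python
import heapq
from datetime import datetime, timedelta

def run_timer_queue(now, entries, ticks, step):
    """Advance a toy clock by `ticks` steps; return names of entries that fired."""
    queue = list(entries)            # (due_time, name) pairs
    heapq.heapify(queue)             # earliest due time first
    fired = []
    for _ in range(ticks):
        now += step
        # An entry fires only when the clock has reached its due time.
        while queue and queue[0][0] <= now:
            fired.append(heapq.heappop(queue)[1])
    return fired

entries = [
    (datetime(2008, 1, 1, 0, 0, 0), "EXE$TIMEOUT"),
    (datetime(2008, 1, 1, 0, 0, 20), "SCS$MSCP_CHECK_SERVICE"),
]
one_second = timedelta(seconds=1)

# Clock agrees with the time base the entries were queued against: both fire.
ok = run_timer_queue(datetime(2008, 1, 1), entries, 60, one_second)

# Clock is years behind (as in our dump): nothing fires within the minute.
hung = run_timer_queue(datetime(1993, 5, 16, 9, 30), entries, 60, one_second)

print(ok)    # ['EXE$TIMEOUT', 'SCS$MSCP_CHECK_SERVICE']
print(hung)  # []
```

In the second case the entries would only fire once the clock "caught up" to 2008, which matches the wording of the release note.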
|
From one of our forced crashes with the system hung in boot:
Timer queue entries
-------------------
|
| System time: 16-MAY-1993 09:30:00.39
| First TQE time: 16-MAY-1993 09:30:01.03
|
| TQE PID/
| address Expiration Time Type routine
| -------- ----------------------------------------- ------ --------
| 82C81D40 0096C955.236C61A7 16-MAY-1993 09:30:01.03 SRD--- 8257DED8 CNX$BUGCHECK_CLUSTER+00498
| 82C81840 0096C955.23844613 16-MAY-1993 09:30:01.18 SRD--- 825822A8 LCK$SND_REDO_SRCH+00B18
| 82D32600 0096C955.23D1BC44 16-MAY-1993 09:30:01.69 SSD--- 82598498 SYS$PCADRIVER+27C98
| 8258B210 0096C95D.3C8444C9 16-MAY-1993 10:27:59.10 SSD--- 8258BC20 MPDEV$DO_IO_TO_PATH+00320
| 824520E8 00A72F91.B7238000 1-JAN-2008 00:00:00.00 SRD--- 824542F0 EXE$TIMEOUT
| 82C80280 00A72F91.B7238000 1-JAN-2008 00:00:00.00 SRD--- 82577178 SCS$PORT_INIT_DONE+00120
| 82D1FC00 00A72F91.C30F4200 1-JAN-2008 00:00:20.00 SRD--- 8258CCA8 SCS$MSCP_CHECK_SERVICE+00040
| 82448B58 00A72F91.C9052300 1-JAN-2008 00:00:30.00 SRD--- 82453350 EXE$TRIM_POOL_LIST+000A8
| 824BBB10 00A72F91.C9052300 1-JAN-2008 00:00:30.00 SSD--- 824BE768 SMP$CALIBRATE_SCC
| 824E5978 00A72F91.C9052300 1-JAN-2008 00:00:30.00 SSD--- 824ED4F0 EXE$PSHARED_TEARDOWN+00248
| 82C7FF80 00A72F91.DAE6C600 1-JAN-2008 00:01:00.00 SRD--- 82577560 SCS$SET_LOAD_RATING+000A0
|
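
As a cross-check of the dump above, the expiration quadwords can be decoded by hand. The sketch below (Python, written for this post; the helper name vms_quadword_to_datetime is hypothetical) relies only on the fact that a VMS system time is a 64-bit count of 100 ns intervals since 17-NOV-1858, the base date that also appears as the "Boot Time" in the crash summary.

```python
from datetime import datetime, timedelta

VMS_EPOCH = datetime(1858, 11, 17)   # base date of the VMS system time

def vms_quadword_to_datetime(text):
    """Convert 'HHHHHHHH.LLLLLLLL' (high.low longwords, hex) to a datetime."""
    high, low = text.split(".")
    ticks = (int(high, 16) << 32) | int(low, 16)    # 100 ns units
    return VMS_EPOCH + timedelta(microseconds=ticks // 10)

first_tqe = vms_quadword_to_datetime("0096C955.236C61A7")    # first entry in the queue
exe_timeout = vms_quadword_to_datetime("00A72F91.B7238000")  # EXE$TIMEOUT entry

print(first_tqe)     # 16-MAY-1993, 09:30:01 plus a fraction of a second
print(exe_timeout)   # 2008-01-01 00:00:00

# How long the EXE$TIMEOUT entry would have to wait on the 1993 clock.
wait = exe_timeout - first_tqe
print(wait.days)     # 5342 days, i.e. more than 14 years
```

So with the system time stuck in 1993, the EXE$TIMEOUT entry queued for 1-JAN-2008 is more than 14 years away from expiring, which is consistent with the hang.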
Our previous guess was that the 5-year old linkdate in SYS$BASE_IMAGE.EXE was
leading to our problems. This was supported weakly by the evidence that
(1) the problem started 5 years after that link date
(2) the problem could be fixed by setting the hardware clock back by a year.
However, we have now made a test which pretty much negates this as a possible cause.
We have patched the link date in SYS$BASE_IMAGE.EXE from Feb 2008 to Feb 2013.
We find that booting using that patched image also hangs.
|
So at this point our only good candidate for a fix is the EXEC_INIT fix.
And for that, we may have to wait for a suitable downtime in our production
system.
|
Thanks again for your help,
Ed Miller

Edward Miller_1
Occasional Advisor
Solution

Re: Formerly reliable V8.3 AS4100 boot now hangs

We have now found a fix for our boot hang problem -- the fix is to patch the image
SYS$CPU_ROUTINES_1605.EXE
as described in the discussion in the OpenVMS hardware section of this forum:
"Re: Changes Boot behavior on XP1000 since 2013"
|
The failure in our case (system hangs during boot) is NOT the same as described
in that discussion (system prompts for time during boot), but the root cause (the
systematic corruption of the year field in the hardware clock during a console INIT)
is the same, so the fix is the same.
|
Why did our system hang rather than prompt for a time? It is pretty clear that
this is because our file EXEC_INIT.EXE still harbors the bug that was fixed in
update VMS83A_SYS-V1100 [24-JUN-2009] -- an update which we have yet to apply.
However, we have not gone to the effort required to prove this point.
Thanks again to Jur for pointing out this fix.
|
Note that we previously had found a work-around that avoided the boot hang. That
workaround was to first boot from a VMS installation CD, and then boot normally.
It is clear now that the reason this worked is that the VMS installation CD prompts
the user for the current date and time, which it then writes to the hardware clock,
thereby repairing the corruption done by the console INIT.
|
-- Ed Miller