Operating System - OpenVMS

Bugcheck 0000036c

 
Wim Van den Wyngaert
Honored Contributor

I have a bugcheck on an AlphaStation 400/400 running VMS 7.3. It is in a cluster with 2 GS160s (it's the quorum station). See enclosure.

I suspect that a file has gone bad.
A power cycle has already been done.
System parameters have not been changed since the previous boot (2 years ago).
I tried minimal boot and it bugchecks too.

How do I debug this bugcheck?

Wim

PS Machine is not at my location.
25 REPLIES
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

Looks scrambled at the end.
Post it again.

Wim
Jur van der Burg
Respected Contributor

Re: Bugcheck 0000036c

The system disk is bad. Error code 910 = %SYSTEM-W-NOSUCHFILE ("no such file") was returned while trying to start SYSINIT.EXE. Check the disk by booting from a CD or an alternate disk.
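
For anyone who wants to pull the 00000910 status apart by hand, here is a minimal sketch. It assumes the standard OpenVMS condition-value layout (severity in bits 0-2, message number in bits 3-15, facility number in bits 16-27, control bits on top); the function name is made up for illustration.

```python
# Severity codes carried in the low three bits of an OpenVMS condition value.
SEVERITY = {0: "W", 1: "S", 2: "E", 3: "I", 4: "F"}

def decode_condition(value):
    """Split a 32-bit condition value into its standard fields."""
    return {
        "severity": SEVERITY.get(value & 0x7, "?"),   # bits 0-2
        "message_number": (value >> 3) & 0x1FFF,      # bits 3-15
        "facility_number": (value >> 16) & 0xFFF,     # bits 16-27
        "control": (value >> 28) & 0xF,               # bits 28-31
    }

fields = decode_condition(0x910)
print(fields)
```

Facility number 0 is the SYSTEM facility and the severity comes out as W, which is consistent with the %SYSTEM-W-NOSUCHFILE reading of the status.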

Jur.
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

I MOP-booted the station via another station, though without a page/swap file.

Then I verified the disk with anal/dis/read/repair. No errors at all.

Did anal/rms of sysinit.exe. No errors.

Tried booting it again from disk. Bugcheck.

Strange.

Wim

Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

%EXECINIT-I-LOADING, loading SYS$FILES_64.EXE
%EXECINIT-E-LOADERR, error loading SYS$FILES_64.EXE, status = 00000910
%EXECINIT-I-LOADING, loading SYS$XFS_CLIENT.EXE
%EXECINIT-E-LOADERR, error loading SYS$XFS_CLIENT.EXE, status = 00000910
%EXECINIT-I-LOADING, loading SYS$XFS_SERVER.EXE
%EXECINIT-E-LOADERR, error loading SYS$XFS_SERVER.EXE, status = 00000910
%EXECINIT-I-LOADING, loading SYS$LFS.EXE
%EXECINIT-E-LOADERR, error loading SYS$LFS.EXE, status = 00000910

These files don't exist on my other systems either.

Wim
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

I filled the disk completely while booted via MOP. Then I verified it again. Still no errors.

Wim
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

I stopped the whole GS160 cluster. Then restarted it (yes, lucky I can do that today). Bugcheck is present even when booted as first node in cluster.

I also defragmented the disk before reusing it. No problems while defragmenting.

Wim
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

I analyzed the dump. See encl.

Wim
Hoff
Honored Contributor

Re: Bugcheck 0000036c

PROCGONE involves looking at the registers to see what happened. 00000910 is file not found/FNF, as others have mentioned. Then digging around from there.

V7.3? ECO it or (better) upgrade to V7.3-2 and ECO it. Chasing old bugs and particularly chasing old and fixed bugs is a waste of time. I'd slam in a replacement copy of OpenVMS Alpha onto this disk.

Do check the system battery for this box; the boot year is 1858. That can cause various weirdness, though I've not specifically seen it trigger a crash.

Those files listed as 00000910 errors are very likely the remnants of an old Spiralog installation, and that product should have been deinstalled several releases ago. I'd *guess* that a series of 00000910/FNF references are not the proximate trigger for the error, though stranger things have happened. See if a combination of SYS_LOADABLE REMOVE on the Spiralog images and invoking SYS$UPDATE:VMS$SYSTEM_IMAGES.COM cleans up that part of the diagnostic display. (The diagnostic bootstrap might point at the specific file.)

As for another approach toward troubleshooting, enable boot time diagnostics via R5 (30000, IIRC), and look to clear out the Spiralog loadable images via the LOAD_SYS_IMAGES parameter.
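
Since the 30000 value is a hexadecimal bit mask, it can be decoded bit by bit. The following is only a rough sketch: the bit descriptions are assumptions based on the commonly documented Alpha boot flags and should be checked against the documentation for the release in use, and the helper name is invented for illustration.

```python
# Assumed meanings of a few Alpha boot-flag bits (verify against your release).
BOOT_FLAGS = {
    0x1: "conversational boot (stop at SYSBOOT>)",
    0x10000: "display detailed debug messages during boot",
    0x20000: "display user-oriented messages during boot",
}

def describe_boot_flags(flags):
    """Return (descriptions of known bits that are set, leftover unknown bits)."""
    known = [text for bit, text in BOOT_FLAGS.items() if flags & bit]
    unknown = flags
    for bit in BOOT_FLAGS:
        unknown &= ~bit  # clear every bit this table knows about
    return known, unknown

known, unknown = describe_boot_flags(0x30000)
for line in known:
    print(line)
print("unrecognised bits: %#x" % unknown)
```

With 0x30000, both message-display bits are set and nothing is left over, which is the combination used to get the verbose %EXECINIT and %SYSINIT progress lines on the console.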

But then, the best approach with what is very clearly a somewhat questionable system disk that's not being used for anything other than as a quorum node is to slam new V7.3-2 bits onto the disk, load the current ECOs, slam in CLUSTER_AUTHORIZE.DAT and set VOTES and EXPECTED_VOTES to the values expected for the cluster, and see if that fixes this case.

I have to ask what you expect to gain by debugging this. Not to be sarcastic or cynical or such here; I've simply found that there are cases when the knowledge and the value that might be provided from troubleshooting a bugcheck is worth zero. And it costs. Analyzing a crashdump is the software equivalent of troubleshooting hardware. There are cases where that approach is valuable, and there are cases when it is better to swap out the gear with the more current gear. But for this "quorum host" box, new bits (ECOs and/or upgrades) seem an easy and efficient and expeditious diagnostic.

Stephen Hoffman
HoffmanLabs LLC


Richard Brodie_1
Honored Contributor

Re: Bugcheck 0000036c

I'm not an expert but I would take a look at SHOW EXEC and compare it with a live system. What was the last module loaded, and the first not loaded?
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

I compared all exe files with an identical system. sys$base_image.exe was not the same. I copied it again to be sure. Still bugcheck.

Upgrade is not allowed. Nothing installed since 2003 !!

Wim
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

Richard,

Show exec in ana/cra shows that all images were loaded (compared with a crash where the node exited the cluster after some uptime).

Wim
Hoff
Honored Contributor

Re: Bugcheck 0000036c

[[[Upgrade is not allowed.]]]

You've caught yourself very firmly in the enterprise trap, eh? Okfine. The more firmly a business gets itself stuck in that particular trap, the more it tends to cost the business. TANSTAAFL, as Mr Heinlein posited, applies here.

[[[ Nothing installed since 2003 !!]]]

Respectfully, um, so? Latent bugs have shown up after twenty or even thirty years, and all hardware eventually goes weird. There are PROCGONE bugcheck fixes around for V7.3 (I found one ECO listing a case where a directory appears to erroneously go walkabout), and for later releases.

Roll in your last good backup, or your post V7.3 install disk image. Or roll in a fresh installation of V7.3, and ECO to current. (That gets rid of all traces of Spiralog, too.)

Or pay somebody to get this bugcheck resolved within the site-local rules. The perceived savings and the safety of not upgrading -- the enterprise trap -- eventually do start to incur costs.

And check that battery. Those can and do fail, and an AlphaStation 400 (if that's what is in use here; I don't recall a 400 MHz variant) is certainly within the range for those failures; that box is easily old enough.

Volker Halle
Honored Contributor

Re: Bugcheck 0000036c

Wim,

A successful boot looks like this (from a V6.2-1H3 system with -fl 0,30000):

...
%SWAPPER-I-CREPRC, creating the SYSINIT process
%SWAPPER-I-MAINLOOP, entering SWAPPER main loop

%SYSINIT-I-START, SYSINIT process execution begins
%SYSINIT-I-UNLOAD, unloading EXEC_INIT
%SYSINIT-I-AUDIT, initializing security auditing
%SYSINIT-I-LOAD, loading RMS.EXE
%SYSINIT-I-LOAD, loading RECOVERY_UNIT_SERVICES.EXE
%SYSINIT-I-LOAD, loading DDIF$RMS_EXTENSION.EXE
%SYSINIT-I-LOAD, loading SYSMSG.EXE
%SYSINIT-I-ALTLOAD, loading site specific execlets
%SYSINIT-I-TIME, setting the system time
%SYSINIT-I-CLUSTER, cluster/lock manager initialization
%SYSINIT-I-DEFINE, defining system logical names
%SYSINIT-I-INIT, initializing the XQP
%SYSINIT-I-MOUNT, mounting the system disk
%SYSINIT-I-TIME, setting the system time
%SYSINIT-I-FILCACHE, deallocating the primitive file cache
%SYSINIT-I-DEFINE, defining SYS$TOPSYS
%SYSINIT-I-OPEN, marking system files open
%SYSINIT-I-CREPRC, creating the STARTUP process
%SYSINIT-I-FINISH, SYSINIT process execution completed
...

In your PROCGONE crash, R0=00000910 = %SYSTEM-W-NOSUCHFILE must indicate that SYSINIT failed to start. Look at all the images SYSINIT.EXE requires in order to start, and check that they all exist and can be accessed successfully:

From $ ANAL/IMA SYSINIT.EXE
...
Shareable Image List

0) "DECC$SHR"
1) "LIBRTL"
2) "LIBOTS"
3) "SYS$BASE_IMAGE"
4) "SYS$PUBLIC_VECTORS"

Volker.
Volker Halle
Honored Contributor

Re: Bugcheck 0000036c

Wim,

The current process name is SYSINIT and there is no image name. This may be the symptom of an image activation problem!

Look at the Image Activator Scratch Area in the dump.

SDA> clue proc/lay
...
Image Activator Scratch Area 00000000.xxxxxxxxx 00000000.7FFD1800 00001000
...

SDA> exa xxxxxxxx;1000

Do you see a filename in the ASCII dump?
Does that file exist?
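
If the region has been saved off as raw bytes, picking the file names out of it can be automated. This is a small sketch along the lines of the Unix strings utility; the sample bytes and the file name in them are made up and merely stand in for a real scratch-area dump.

```python
import re

def ascii_runs(data, min_len=6):
    """Return printable-ASCII runs of at least min_len bytes, decoded to str."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Invented bytes standing in for part of a scratch-area dump.
sample = b"\x00\x00\x12SYS$LIBRARY:CMA$TIS_SHR.EXE\x00\x07\xffab\x00"
print(ascii_runs(sample))
```

Each name found this way can then be checked against the directory on the suspect disk.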

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

In the meantime the disk has been replaced.
I still have the old disk but can't use it for boot experiments. A backup copy to the new disk gave no errors during the backup itself, but the verification pass (/ver) reported about 20 parity errors. None of them were on exe files (mostly CDE files). Booting the copy resulted in the same crash. We then recreated the OS on the new disk from scratch.

Volker,

Examining the address shows multiple file names.
See enclosure. All the exe files you mentioned exist.

Wim
Volker Halle
Honored Contributor
Solution

Re: Bugcheck 0000036c

Wim,

Are any shareable images missing that the shareable images directly linked to SYSINIT may need?

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

Missing are: CMA$* (4 files) and CRFSHR.

Last week I compared all files present with those on another system and discovered 1 file different. I should have compared all exe files.

Are these files the reason? I found cma$tis_shr in the SDA enclosure too.

Wim
Richard Brodie_1
Honored Contributor

Re: Bugcheck 0000036c

cma$tis_shr is one of the shareable images that the DEC C run-time library (DECC$SHR) depends on, so it would be indirectly needed for SYSINIT (at least if the normal image activation rules apply).
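
The indirect-dependency check can be sketched as a walk over the dependency graph. In this illustration the first-level list for SYSINIT is the one from the ANAL/IMAGE output earlier in the thread; the second level (DECC$SHR needing the threads image, written here as CMA$TIS_SHR) and the set of files present are assumptions for the example, and the function name is hypothetical.

```python
from collections import deque

def missing_images(root, depends_on, present):
    """Walk the dependency graph breadth-first and list images that are absent."""
    seen, missing = {root}, []
    queue = deque([root])
    while queue:
        image = queue.popleft()
        if image not in present:
            missing.append(image)
            continue  # cannot inspect the dependencies of a file that is gone
        for dep in depends_on.get(image, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return missing

# First level from the thread's ANAL/IMAGE listing; second level is illustrative.
depends_on = {
    "SYSINIT": ["DECC$SHR", "LIBRTL", "LIBOTS",
                "SYS$BASE_IMAGE", "SYS$PUBLIC_VECTORS"],
    "DECC$SHR": ["CMA$TIS_SHR"],
}
# Assumed disk contents: every directly linked image exists, the indirect one does not.
present = {"SYSINIT", "DECC$SHR", "LIBRTL", "LIBOTS",
           "SYS$BASE_IMAGE", "SYS$PUBLIC_VECTORS"}

print(missing_images("SYSINIT", depends_on, present))  # → ['CMA$TIS_SHR']
```

This shows why checking only the images directly linked to SYSINIT can come back clean while activation still fails with "no such file".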
Volker Halle
Honored Contributor

Re: Bugcheck 0000036c

Wim,

I just learned that there is a toolset called VOIT on the V8 freeware CDs. It contains the tool SHMTL, which diagnoses and prints the shareable image list for a given image and its dependent shareable images. Much easier than multiple invocations of ANAL/IMAGE...

Volker.

Volker Halle
Honored Contributor

Re: Bugcheck 0000036c

Wim,

correction: the tool is called SHIML

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

If I had monitored the system disk with a checksum on all files (as I do on the SOX systems), I would have found it immediately.

Thanks Volker

Wim
Hoff
Honored Contributor

Re: Bugcheck 0000036c

I'd think it would be to everyone's advantage to monitor system integrity; to maintain some idea of which files are present, and what the SHA-1 values (or better) are for the various files.

Seems like a reasonable enhancement for supportability.
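
One way such monitoring could look is a baseline manifest of file hashes that is compared against the current state of the disk. This is a minimal sketch that uses SHA-256 as the "or better" hash; it assumes the files are reachable from a host that has Python, and the function names are invented for illustration.

```python
import hashlib
from pathlib import Path

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest

def compare_manifests(baseline, current):
    """Return (missing, changed, added) file names relative to the baseline."""
    missing = sorted(set(baseline) - set(current))
    added = sorted(set(current) - set(baseline))
    changed = sorted(name for name in baseline
                     if name in current and baseline[name] != current[name])
    return missing, changed, added
```

The baseline would be taken while the system is known to be good and kept somewhere other than the disk being monitored. A non-empty "missing" list is the case that would have flagged the lost shareable images here.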

Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

I added monitoring to about 50 stations with a single disk.
But on all of them sys$library was still complete.

Wim
Wim Van den Wyngaert
Honored Contributor

Re: Bugcheck 0000036c

Shouldn't all files (CMA*) needed to reboot be checked by the reboot option REBOOT_CHECK?

We used it and the reboot was not stopped!

Wim