Alphaserver 1000 4/266 crashing every few days

 
aaroncf
Occasional Visitor

I have an Alphaserver 1000 4/266 running Digital Unix version 4.0D that has started to crash every few days.  There is no crash-file, and the only entries uerf shows are from when I reboot the system (it always reboots with no problem).

 

Any suggestions on what might cause this would be greatly appreciated.

4 REPLIES
Hoff
Honored Contributor

Re: Alphaserver 1000 4/266 crashing every few days

So no details and no diagnostics, and a fossil of a system.  This box probably has sentimental value, or there is no hope of replacing it wholesale with a box from this millennium, or both.

 

Typical guesses: a disk error underneath some bit of DU/Tru64 that matters, or any of various potential hardware errors including memory errors, SCSI bus errors, disk errors, power supply problems, bad fan, thermal limits reached, random controller errors, CPU errors, etc.   Pretty much anything can go wrong on an old system, and this series is ~twenty years old...

 

You're going to have to go gather some evidence, either through the console diagnostics or through swapping parts.  Check the console for any relevant output.  If you don't have a serial console, configure and wire one and connect via an emulator with a large buffer enabled.  At a minimum, read up on the SRM diagnostics and run them.
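
If a terminal emulator with a large scrollback is not handy, the serial console output can also be captured to a file on another machine.  What follows is only a minimal sketch in Python, not something from the manuals: the function names `log_console` and `main`, the device path `/dev/ttyS0`, and the log file name `console.log` are all assumptions for illustration, and the serial port is assumed to be already configured (speed, parity, and so on) to match the console.

```python
import time
from typing import BinaryIO, Callable, TextIO


def log_console(source: BinaryIO, log: TextIO,
                clock: Callable[[], float] = time.time) -> int:
    """Copy console output line by line to a log, prefixing a timestamp.

    Returns the number of lines written. Stops when the source reports EOF.
    """
    count = 0
    for raw in iter(source.readline, b""):
        # Console output can contain stray bytes; replace them rather than fail.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        log.write(f"{stamp} {text}\n")
        log.flush()  # keep the file current so the last lines before a crash survive
        count += 1
    return count


def main(device: str = "/dev/ttyS0", logfile: str = "console.log") -> None:
    # Hypothetical device path and log name; adjust for the logging host.
    with open(device, "rb", buffering=0) as port, \
         open(logfile, "a", encoding="ascii", errors="replace") as out:
        log_console(port, out)
```

Calling `main()` on the logging host would append timestamped console lines to the log until interrupted.  The timestamps make it easier to line up the last console messages with the time of each crash.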

 

Manuals: Owner's, Service

 

FWIW on VMS, a halt instruction execution will only trigger a crash when the SRM system console is set for a restart.  If you reset and reboot, you won't get a crashdump written.  Caveat:  I'm not familiar with troubleshooting DU/Tru64 crashes and failures.  Definitely look for console errors.

 

aaroncf
Occasional Visitor

Re: Alphaserver 1000 4/266 crashing every few days

Thanks for getting back to me.  I have some additional information:  I did a memory test from the >>> prompt on the console and the system powered down after a few seconds.  I then restarted the system and booted and we are now running fine.

Hoff
Honored Contributor

Re: Alphaserver 1000 4/266 crashing every few days


@aaroncf wrote:

Thanks for getting back to me.  I have some additional information:  I did a memory test from the >>> prompt on the console and the system powered down after a few seconds.  I then restarted the system and booted and we are now running fine.


 

This system will probably continue to run until the fault is triggered again, and Digital Unix / Tru64 Unix / OSF/1 will again crash.

 

Powering down during a memory diagnostics test is not a particularly auspicious result from the tests, either.  That can be a bad fan, bad power supply, maybe a CPU problem, etc.   See the service manual.

 

Consider bringing in somebody who can troubleshoot this box for you.  An equivalent or cheaper approach, and one likely to have a better longer-term outcome, is to replace this box with a newer Alpha box and migrate your data and applications, or to replace it with an Alpha emulator.  The longer-term requirement for this configuration is obviously to migrate this environment to a newer Unix platform and newer hardware.

aaroncf
Occasional Visitor

Re: Alphaserver 1000 4/266 crashing every few days

Thanks for getting back to me.  I replaced the UPS that the machine was plugged into about 2 weeks ago and since then the system has not powered down by itself.

However, UNIX still crashes a few times a week, and there doesn't seem to be a crash-dump file.  Are there settings I can check to figure out why no crash-dump file is being written?