Operating System - OpenVMS

Alpha Server 4100

Sayed Shalaby
Occasional Advisor

Alpha Server AS4100 restarting every few days
20 REPLIES
Jim_McKinney
Honored Contributor

Re: Alpha Server 4100

Could be most anything, but, you might start investigating with

$ analyze/crash sys$system:sysdump.dmp
SDA> clue crash
SDA> exit

If you've got valid crash dump and the timestamp within is recent, post the output here.

The content of the system error log might also be interesting - perhaps you've got DECevent installed? If so, the following might reveal something interesting.

$ DIAGNOSE/EXCLUDE=(CONT,VOLU)/SINC=20-JUN-2008
Robert Gezelter
Honored Contributor

Re: Alpha Server 4100

Sayed,

The first question is: Why?

Is there a hardware problem? Power Problem? Software Crash?

When this answer is determined, then it is possible to isolate the specific problem.

- Bob Gezelter, http://www.rlgsc.com
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Jim,
I'll start analyzing the problem tonight and post the results.
Robert,
I have already changed both power supplies;
tonight I am going to change both CPUs.
Robert Gezelter
Honored Contributor

Re: Alpha Server 4100

Sayed,

Personally, without error logs showing a CPU problem, I would not change the hardware.

Most hardware problems produce entries in the error logs.

Power supply problems can produce strange problems however.

- Bob Gezelter, http://www.rlgsc.com
Andy Bustamante
Honored Contributor

Re: Alpha Server 4100

Before troubleshooting the 4100, is there something else on the same power/ups that indicates a restart or continuous service?

That said, check for a crash dump and review error logs as previously noted.

Good luck,

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Robert,
the crash dump summary information:
ssrvexcept, unexpected system service exception
failing pc:00000000 00000000
failing ps:00000000 00000000
crash/primary cpu: 00/00
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Jim, Robert and Andy,
the crash dump summary information:
ssrvexcept, unexpected system service exception
failing pc:00000000 00000000
failing ps:00000000 00000000
crash/primary cpu: 00/00
Khairy
Esteemed Contributor

Re: Alpha Server 4100

Hi Sayed,

Could you please post the output of the following:

$ write sys$output f$getsyi("power_vector")
$ write sys$output f$getsyi("thermal_vector")
$ write sys$output f$getsyi("fan_vector")

Rgds
Jim_McKinney
Honored Contributor

Re: Alpha Server 4100

> the crashdump summery information :


How about posting the entire output of "SDA> CLUE CRASH" as an attachment for a start? That little snippet isn't enough to determine anything...
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Jim,
that's all the crashdump information :
Crashdump Summary Information:
------------------------------
Crash Time: 24-JUN-2008 13:53:17.97
Bugcheck Type: SSRVEXCEPT, Unexpected system service exception
Node: AFHSR (Clustered)
CPU Type: AlphaServer 4100 5/400 4MB
VMS Version: V6.2-1H3
Current Process: SRV0400_01_0300
Current Image:
Failing PC: 00000000 00000000
Failing PS: 00000000 00000000
Module:
Offset: 00000000

Boot Time: 22-JUN-2008 00:31:26.00
System Uptime: 2 13:21:51.97
Crash/Primary CPU: 00/00
Saved Processes: 0
Pagesize: 8 KByte (8192 bytes)
Physical Memory: 1536 MByte (196608 PFNs)
Dumpfile Pagelets: 180046 blocks
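
As a quick sanity check on the summary above, the reported uptime can be recomputed from the boot and crash timestamps. This is a minimal Python sketch, assuming the absolute time format shown in the dump (DD-MMM-YYYY HH:MM:SS.CC); the helper name `parse_vms_time` is made up for illustration and is not part of any VMS tool.

```python
from datetime import datetime, timedelta

# Month abbreviations as they appear in the dump output.
MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

def parse_vms_time(text):
    """Parse 'DD-MMM-YYYY HH:MM:SS.CC' (hypothetical helper, not a VMS API)."""
    date_part, time_part = text.split()
    day, mon, year = date_part.split("-")
    hms, _, frac = time_part.partition(".")
    hour, minute, second = (int(x) for x in hms.split(":"))
    # Pad the fractional part to microseconds, e.g. "97" -> 970000.
    micro = int(frac.ljust(6, "0")) if frac else 0
    return datetime(int(year), MONTHS[mon.upper()], int(day),
                    hour, minute, second, micro)

boot = parse_vms_time("22-JUN-2008 00:31:26.00")   # Boot Time from the dump
crash = parse_vms_time("24-JUN-2008 13:53:17.97")  # Crash Time from the dump
uptime = crash - boot
print(uptime)
```

The difference comes out to 2 days 13:21:51.97, which agrees with the System Uptime line, so the system had been up about two and a half days before this crash.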
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Khairy,
I'll post those answers, but I have already changed both power supplies, and the show power command reports everything as OK.
Jim_McKinney
Honored Contributor

Re: Alpha Server 4100

Not much info in that dump file... perhaps a prior crash left more footprints. Take a look in SYS$ERRORLOG

$ dire/sinc=1-Jun-2008 sys$errorlog:clue*.lis

and see if there are any more interesting CLUE files.
Hoff
Honored Contributor

Re: Alpha Server 4100

Long before swapping hardware, I'd be looking at the details of the crash footprint; the crash is nulls, and not the usual sort of machine check or such. This particular case can arise when the PC ends up in a block of zeros.

Unexpected system service exceptions can potentially be power-related (such as a peripheral bus powering down unexpectedly; anybody that's dealt with that pesky coil cord and that circuit breaker on the back of the ancient BA11 box has probably seen one of those crashes), but that's a comparatively rare case.

Post up the CLUE CRASH, and any previous crash footprints you have. That'll tell us which of the system service exceptions were logged. If you're not collecting CLUE CRASH footprints already, now is the time to add that.

I'd also gather up the OpenVMS ECO (patch) kits and would ensure that this box is current on its patches for OpenVMS Alpha V6.2-1H3. Also for your network stack(s), and for any other kernel-mode code.

The RCM console (or is it RMC on this box?) is another option for monitoring thermal and environmental data.

But swapping hardware? Not my first target choice here. Not without a peek at the crash stack (and preferably at a couple of crash stacks), at a minimum.

Any chance you can get somebody to look at this system and these dumps via the RCM console or other remote access? That'll probably be faster than this current approach. Better yet, get HP or your preferred support organization on-line here.

Stephen Hoffman
HoffmanLabs LLC

Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

AFHSR$ TYPE CLUE$HISTORY
Date Version System/CPU Node Bugcheck Process PC Module Offset
----------------- -------- ------------------- ------ ------------ --------------- -------- ----------------------- --------
1-JUL-2002 18:01 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT TP_SERVER 00000000 00000000
4-JUL-2002 05:24 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0255 00000000 00000000
11-OCT-2002 01:21 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1234 00000000 00000000
31-OCT-2002 15:18 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT TP_SERVER 00000000 00000000
24-JAN-2003 13:40 V6.2-1H3 AlphaServer 4100 5/ AFHSR MACHINECHK NULL 800525F4 EXCEPTION 000145F4
8-FEB-2003 09:18 V6.2-1H3 AlphaServer 4100 5/ AFHSR MACHINECHK NULL 800525F4 EXCEPTION 000145F4
8-FEB-2003 09:43 V6.2-1H3 AlphaServer 4100 5/ AFHSR MACHINECHK NULL 800525F4 EXCEPTION 000145F4
1-MAR-2003 10:26 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0932 00000000 00000000
1-APR-2003 03:31 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1234 00000000 00000000
10-APR-2003 18:53 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT TP_SERVER 00000000 00000000
8-MAR-2004 09:12 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1031 00000000 00000000
22-JUL-2004 10:32 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0219 00000000 00000000
10-AUG-2004 14:22 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 6162 00000000 00000000
4-DEC-2004 13:29 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1031 00000000 00000000
25-DEC-2004 18:22 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0000 00000000 00000000
27-FEB-2005 19:13 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT TP_SERVER 00000000 00000000
19-MAR-2005 13:15 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 8524 00000000 00000000
6-MAY-2005 03:32 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1234 00000000 00000000
8-MAY-2005 13:42 V6.2-1H3 AlphaServer 4100 5/ AFHSR INVEXCEPTN MNGR_APEXP 00000000 00000000
22-MAY-2005 10:16 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 2324 00000000 00000000
23-DEC-2005 01:12 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0092 00000000 00000000
23-MAY-2006 07:44 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1021 00000000 00000000
6-JUN-2006 08:45 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 8517 00000000 00000000
8-JUN-2006 05:52 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT CX9___CHEM 00000000 00000000
4-AUG-2006 14:47 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1070 00000000 00000000
19-SEP-2006 15:11 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0465 00000000 00000000
26-SEP-2006 12:10 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0131 00000000 00000000
31-OCT-2006 10:02 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT MNGR_CHRT1 00000000 00000000
10-NOV-2006 02:07 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0092 00000000 00000000
19-JAN-2007 02:31 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1031 00000000 00000000
19-JAN-2007 22:07 V6.2-1H3 AlphaServer 4100 5/ AFHSR INVEXCEPTN NULL 8001F2D8 SYSTEM_PRIMITIVES_MIN 0000B2D8
16-FEB-2007 14:17 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1031 00000000 00000000
30-MAY-2007 15:09 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1080 00000000 00000000
2-JUN-2007 20:40 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0083 00000000 00000000
1-JUL-2007 19:10 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 9548 00000000 00000000
9-JUL-2007 13:00 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 8592 00000000 00000000
15-JUL-2007 10:33 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SRV0400_01_0300 00000000 00000000
19-AUG-2007 10:24 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT MNGR_PAT2 00000000 00000000
29-AUG-2007 19:28 V6.2-1H3 AlphaServer 4100 5/ AFHSR MACHINECHK SUB22306 800525F4 EXCEPTION 000145F4
5-SEP-2007 16:07 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1234 00000000 00000000
12-SEP-2007 13:58 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3698 00000000 00000000
18-OCT-2007 14:36 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0909 00000000 00000000
10-NOV-2007 08:25 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1080 00000000 00000000
11-NOV-2007 09:21 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1636 00000000 00000000
15-JAN-2008 14:40 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
16-JAN-2008 12:00 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3599 00000000 00000000
19-JAN-2008 13:06 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3599 00000000 00000000
10-FEB-2008 13:10 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3599 00000000 00000000
25-FEB-2008 13:18 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3599 00000000 00000000
4-MAR-2008 13:14 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3599 00000000 00000000
11-MAR-2008 13:11 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
18-MAR-2008 13:18 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
29-MAR-2008 09:22 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1636 00000000 00000000
4-APR-2008 09:58 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 2803 00000000 00000000
5-APR-2008 14:59 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1243 00000000 00000000
7-APR-2008 13:28 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0131 00000000 00000000
12-APR-2008 14:08 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3597 00000000 00000000
13-APR-2008 10:07 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 2675 00000000 00000000
13-APR-2008 11:13 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SUB21635 00000000 00000000
27-APR-2008 09:02 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
1-MAY-2008 11:31 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT MNGR_CHRT1 00000000 00000000
5-MAY-2008 09:54 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 1080 00000000 00000000
5-MAY-2008 13:01 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
10-MAY-2008 14:44 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SUB22890 00000000 00000000
24-MAY-2008 15:29 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SUB20118 00000000 00000000
26-MAY-2008 12:55 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
7-JUN-2008 14:00 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SUB22890 00000000 00000000
15-JUN-2008 14:37 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3599 00000000 00000000
20-JUN-2008 20:29 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 0029 00000000 00000000
21-JUN-2008 10:20 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 3580 00000000 00000000
21-JUN-2008 13:15 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT 9934 00000000 00000000
24-JUN-2008 13:53 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SRV0400_01_0300 00000000 00000000
Hoff
Honored Contributor

Re: Alpha Server 4100

Now that list of errors looks more like there's bad hardware involved.

Bring in your preferred service vendor, or you can keep swapping pieces. On no particular research and based on existing swaps of processor and power, I'd likely next head toward memory, then I/O, then...

And as mentioned in an earlier reply, I would also install all of the current mandatory ECO kits for OpenVMS Alpha V6.2-1H3, for the network stacks, and for anything else that is operating in kernel mode. (Not all of these errors are hardware errors, and big blocks of zeros can easily be caused by kernel-mode code bugs, too.) Do also look at the I/O path while you're working, as big blocks of zeros can also arrive back from a faulty I/O widget.

And as you appear inclined toward self-maintenance, do look around on the network. You'll be able to find the "AlphaServer 4000/4100 Service Manual" EK-4100A-SV. The folks over at the MANX archive likely have a copy.

And if this is a critical server, set up with a spare of this same model, and do look to move forward to newer (or newer used) hardware and to a newer pair of boxes.

Volker Halle
Honored Contributor

Re: Alpha Server 4100

Sayed,

the CLUE history tells it all: there is (most likely) some software problem in a specific process, which has been causing crashes with nearly identical footprints for a couple of years! There have been some MACHINECHK crashes in between, but the majority of crashes are SSRVEXCEPT crashes with PC=0. In nearly all cases, these are software problems strongly related to the current process!

Unless proven or suspected otherwise, assume that there is a software problem.
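
To make that pattern easier to see, the flattened CLUE$HISTORY rows can be tallied by bugcheck type and by whether the PC is zero. This is a rough Python sketch, not a CLUE feature: the function name `tally_bugchecks` is hypothetical, and it assumes the node name appears as its own token and that process names contain no spaces (which the rows posted in this thread satisfy).

```python
from collections import Counter

def tally_bugchecks(lines, node):
    """Count (bugcheck, pc_is_zero) pairs in flattened CLUE$HISTORY rows."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if node not in tokens:
            continue  # header, separator, or unrelated line
        # Fields after the node name: bugcheck, process, PC, [module], offset.
        rest = tokens[tokens.index(node) + 1:]
        if len(rest) < 3:
            continue  # row too short to hold bugcheck, process and PC
        bugcheck, pc = rest[0], rest[2]
        counts[(bugcheck, pc == "00000000")] += 1
    return counts

# Three rows copied from the history posted earlier in the thread.
sample = [
    "24-JAN-2003 13:40 V6.2-1H3 AlphaServer 4100 5/ AFHSR MACHINECHK NULL 800525F4 EXCEPTION 000145F4",
    "8-MAY-2005 13:42 V6.2-1H3 AlphaServer 4100 5/ AFHSR INVEXCEPTN MNGR_APEXP 00000000 00000000",
    "24-JUN-2008 13:53 V6.2-1H3 AlphaServer 4100 5/ AFHSR SSRVEXCEPT SRV0400_01_0300 00000000 00000000",
]
counts = tally_bugchecks(sample, "AFHSR")
for (bugcheck, pc_zero), n in sorted(counts.items()):
    print(bugcheck, "PC=0" if pc_zero else "PC!=0", n)
```

Feeding it the full history file instead of the three sample rows gives the split between SSRVEXCEPT with PC=0 and the occasional MACHINECHK entries.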

Start with:

$ ANAL/CRASH SYS$SYSTEM
SDA> CLUE ERRLOG ! to extract errlog entries
SDA> EXIT

Look at ANAL/ERR SYS$SCRATCH:CLUE$ERRLOG.SYS
If there are any HW errors preceding the crash (by less than 1 minute), try to fix the HW problem first. If there are no errors, please provide a full CLUE file from CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS as an attachment to your next reply.
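
The "within less than 1 minute" check can be sketched in Python once the error-log timestamps have been read off the ANAL/ERR output by hand. The function name `errors_before_crash` is hypothetical and the error timestamps below are made up for illustration; only the crash time is taken from the dump posted in this thread.

```python
from datetime import datetime, timedelta

def errors_before_crash(crash_time, error_times, window=timedelta(minutes=1)):
    """Return error-log timestamps that fall within `window` before the crash."""
    return [t for t in error_times
            if timedelta(0) <= crash_time - t < window]

crash = datetime(2008, 6, 24, 13, 53, 17, 970000)  # Crash Time from the dump
# Made-up error-log timestamps, for illustration only.
errors = [datetime(2008, 6, 24, 13, 52, 40), datetime(2008, 6, 24, 9, 5, 0)]
suspects = errors_before_crash(crash, errors)
print(suspects)
```

Only the first made-up entry lands in the window here; an empty result would point away from a hardware error as the immediate trigger.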

Volker.
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Dear all,
Please find attached:
$ type sys$errorlog:clue$afhsr_280608-1423.lis;1
(bootstrap or powerfail)
Thank you
Volker Halle
Honored Contributor

Re: Alpha Server 4100

Sayed,

thanks for providing the 2 CLUE files. The major symptoms of both crashes are identical. The kernel stack is not accessible, and the return PC in R26 as well as the current PC value are ZERO. In both cases, the current process has an XQP operation (file system IO) outstanding, so the problem could come from the file system IO or the lower layers (driver).

What do the following commands report ?

$ ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP
SDA> SHOW PROC/CHAN
! look for any busy channels
SDA> CLUE ERRLOG
! any error immediately preceding the crash ?

To find out more about the context of the problem, one would need to check where all the registers point to for the current process:

Try SDA> FORMAT @Rx !(with x=0...31)
and SDA> EXAMINE @Rx ! for all registers

Do this for any future crashes and save the information.
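
Typing 64 commands by hand is error-prone, so the list can be generated. This is a small Python sketch that only builds the command text exactly as quoted above; the function name `sda_register_commands` is hypothetical, and whether the lines are pasted into the SDA session or fed to it some other way on V6.2-1H3 is left open.

```python
def sda_register_commands(registers=range(32)):
    """Build the FORMAT/EXAMINE command pairs quoted above for R0..R31."""
    commands = []
    for n in registers:
        commands.append(f"FORMAT @R{n}")
        commands.append(f"EXAMINE @R{n}")
    return commands

# Print one command per line, ready to copy into an SDA session.
print("\n".join(sda_register_commands()))
```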

One would also need to try and find the current stack prior to the exception. Until the problem is isolated, all crash dumps need to be saved (for future comparison). Consider saving the SYSDUMP.DMP for the current crash (preferably with the SDA> COPY command, otherwise use BACKUP/IGN=NOBACKUP).

Volker.
Sayed Shalaby
Occasional Advisor

Re: Alpha Server 4100

Volker,
I have a two-node cluster. One of the nodes stops working when the application problem occurs,
and this can cause the server to reboot every few days.
Sayed
Volker Halle
Honored Contributor

Re: Alpha Server 4100

Sayed,

OpenVMS system crashes can be diagnosed and in most cases, a reason for the crash can be determined. It's not magic or black art, but needs some experience working with OpenVMS crashes.

If the crash were really caused by a problem within the OpenVMS operating system, you would need to check whether you've applied the most recent patches for the part of the OS in which the crash happened (e.g. SCSI drivers). If it still crashes and it's not an application problem, you might be stuck if you don't have a prior version support contract for V6.2-1H3.

I can help you in this forum, but you have to supply the requested information. I can also help you formally, if you would like, as the company I work for also does OpenVMS support. It's your choice...

Volker.