Operating System - OpenVMS

Re: CPUSPINWAIT, CPU spinwait timer expired

 
Gregory Githens
Occasional Advisor

CPUSPINWAIT, CPU spinwait timer expired

Yesterday morning I applied the following patches to our OpenVMS 7.3-2 system:
DEC AXPVMS VMS732_GRAPHICS V5.0
DEC AXPVMS VMS732_GRAPHICS V4.0
DEC AXPVMS VMS732_FIBRE_SCSI V11.0
DEC AXPVMS VMS732_DCL V8.0
DEC AXPVMS VMS732_SYS V13.0
DEC AXPVMS VMS732_RMS V4.0
DEC AXPVMS VMS732_AUDSRV V4.0
DEC AXPVMS TCPIP_ECO V5.4-156
I applied them in order from the bottom up in the early morning hours. At around 1:30 pm our system crashed with CPUSPINWAIT, CPU spinwait timer expired. I am attaching the output of ANALYZE/CRASH_DUMP and the clue file.

I would greatly appreciate any help in figuring out what happened and what I can do to prevent this in the future.

Could this be from the patches applied or is that just a coincidence?

As a little bit of background, our system usually has 100-200 users logged in using ssh2 via public key authentication. The user shown in the clue file and in the ANALYZE/CRASH_DUMP output is an unprivileged user, and the executable DEBTOR.EXE is a custom app compiled from BASIC that we have been running for years and years without this problem.

Thanks for the help.
Greg Githens
17 REPLIES
Gregory Githens
Occasional Advisor

Re: CPUSPINWAIT, CPU spinwait timer expired

Here is the clue file.
Volker Halle
Honored Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Gregory,

CPU 01 incurred a HALT instruction in kernel mode. The HALT PC reported is 8044F1FC.
CPU 0 tried to send an interprocessor interrupt to CPU 01, but the operation timed out, so CPU 0 took down the system with a CPUSPINWAIT crash.

Your AUTO_ACTION console environment variable is most likely NOT set to RESTART but to HALT. Otherwise you would have gotten a HALT restart bugcheck.

The problem is caused by whatever code was executing on CPU 1.

SDA> EXA/INS 8044F1FC

if this shows a HALT instruction, continue with:

SDA> EXA/INS 8044F1FC-20;30
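For anyone following along, the address arithmetic behind that second command can be checked with a short Python sketch. It assumes SDA's default hexadecimal radix, that `start;length` gives the length in bytes, and that Alpha instructions are 4 bytes wide; the function name is made up for illustration.

```python
# Sketch: work out which addresses "EXA/INS 8044F1FC-20;30" would cover.
# Assumptions: hex radix, "start;length" with length in bytes, and
# fixed-width 4-byte Alpha instructions.

INSTRUCTION_BYTES = 4


def examine_range(halt_pc: int, back: int, length: int):
    """Return (start, end, instruction count, index of halt_pc in the range)."""
    start = halt_pc - back
    end = start + length - 1
    count = length // INSTRUCTION_BYTES
    index_of_pc = (halt_pc - start) // INSTRUCTION_BYTES
    return start, end, count, index_of_pc


if __name__ == "__main__":
    start, end, count, idx = examine_range(0x8044F1FC, 0x20, 0x30)
    print(f"{start:08X}..{end:08X}: {count} instructions, "
          f"{idx} before the halt PC and {count - idx - 1} after it")
```

Under those assumptions the command covers 8044F1DC through 8044F20B, that is, eight instructions leading up to the reported HALT PC, the instruction at the PC itself, and three after it.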

Consider setting AUTO_ACTION to RESTART and try to capture the console output, so that you have more data if this problem happens again.

Also try SDA> CLUE ERRLOG to check whether any errors were reported immediately preceding the crash.

Volker.
Duncan Morris
Honored Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Gregory,

just an aside, your console firmware version is very old (6.6 - current = 7.3).

Your KGPSA and Gig-E adapters would probably benefit from the console update as well! See the release notes for the DS20 firmware.

The firmware page is here....

ftp://ftp.digital.com/pub/DEC/Alpha/firmware/index.html

Regards,

Duncan
Dean McGorrill
Valued Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Hi Greg,
Volker's right on. I seem to remember you get a cpuspinlock timeout crash if it's not the primary cpu that issues a halt. I doubt your basic app would be at fault unless you're doing something tricky, i.e. stack swapping. Curious what you find.
Gregory Githens
Occasional Advisor

Re: CPUSPINWAIT, CPU spinwait timer expired

Volker,
Thanks for the help. I ran the commands you're talking about, but I really didn't understand the output. I am attaching the output.

Also, we have a dumb terminal connected to the console. How can I capture the output? I was thinking of maybe setting up a PC with a terminal emulation program to capture the output, but I am kind of leery of doing that in case the PC hangs.

Any further assistance or info you can give would be greatly appreciated.


Duncan,
Thanks for the information about the console firmware version. I will look into updating it.
Volker Halle
Honored Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Greg,

the instruction stream is inside RMS, but there is no HALT instruction in that instruction stream. The HALT PS = 0000000A is somehow consistent with being in RMS (current mode = EXEC).
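To show how a PS value of 0000000A maps to EXEC mode, here is a minimal Python sketch that splits the value into fields. The bit positions (SW in bits 1:0, interrupt pending in bit 2, current mode in bits 4:3, IPL in bits 12:8) are an assumption based on the OpenVMS Alpha PS layout, and the function name is hypothetical.

```python
# Sketch: decode the low bits of an OpenVMS Alpha processor status (PS) value.
# The field positions below are assumed from the OpenVMS Alpha PS layout:
# SW <1:0>, IP <2>, CM <4:3>, IPL <12:8>.

MODES = ("KERNEL", "EXEC", "SUPER", "USER")


def decode_ps(ps: int) -> dict:
    """Split a PS value into its software, mode, and IPL fields."""
    return {
        "sw": ps & 0x3,
        "interrupt_pending": bool((ps >> 2) & 0x1),
        "current_mode": MODES[(ps >> 3) & 0x3],
        "ipl": (ps >> 8) & 0x1F,
    }


if __name__ == "__main__":
    print(decode_ps(0x0000000A))
```

With that layout, 0000000A decodes to current mode EXEC at IPL 0, which fits code running inside RMS rather than in kernel mode at elevated IPL.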

When the crash footprint is not explainable, one starts to think about possible HW (CPU) problems. CPU 1 is an older EV6 Pass 2.3 module.

Consider connecting a PC with a terminal emulator and enabling session logging to capture the console output. Make a copy of the crash dump (SDA> COPY dev:filename) for further reference.

Volker.
Dean McGorrill
Valued Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Unless I missed it (I'm rusty), what is the spinlock and who owned it? You can get that from SDA with show spinlock/full on the dump. It might help. We (decnet+) used to get a few iolock8 spinwait timeout crashes until we lightened our heavy-handed use of it.
Gregory Githens
Occasional Advisor

Re: CPUSPINWAIT, CPU spinwait timer expired

Volker,
Thanks for the info about the older cpu. My hardware supplier thinks there may be an issue with having a pass 2.3 cpu and a pass 2.5 cpu in the same system, so we are going to look at upgrading the 2.3 one.

Dean,
I am attaching the output of the command you requested.

Thanks,
Greg Githens
Gregory Githens
Occasional Advisor

Re: CPUSPINWAIT, CPU spinwait timer expired

Ooops, I didn't notice the Press return for more. Attached is the full output.
Dean McGorrill
Valued Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Greg
there it is, SPL$C_INVALIDATE. Its timeout is 300000. My system is only 100000; I wonder if someone saw this before and upped it. I'll look this spinlock up when I get home, but that and the 01 CPUCEASED sure put me in the 'it's hardware' camp.
You might want to shut CPU 1 down to keep from crashing again while you find a new cpu board.

$ stop/cpu 1
Volker Halle
Honored Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

Dean,

this CPUSPINWAIT bugcheck is not the 'usual' bugcheck, where a CPU requests a spinlock held by another CPU, which is not giving up the spinlock in time.

In this case, CPU 0 wants to invalidate a TB (translation buffer) entry on all other CPUs. It did send an interprocessor interrupt to CPU 1 (see inv_tbs bit set in WorkReq for CPU 1), but CPU 1 did not execute the interprocessor interrupt and did not set the ACK bit in time. This is because it seems to have HALTed unexpectedly...
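The handshake described above can be pictured with a toy Python model. This is not OpenVMS code: the class, field, and function names are invented for illustration, and the spin limit is just a loop count standing in for the real spinwait timer.

```python
# Toy model of a TB-invalidate request: the requester sets a work-request bit
# for every other CPU and spins until each one acknowledges or the limit
# expires. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    halted: bool = False
    inv_tbs_pending: bool = False  # work-request bit set by the requester
    acked: bool = False

    def poll(self) -> None:
        """Service a pending invalidate request, unless the CPU has halted."""
        if self.inv_tbs_pending and not self.halted:
            self.inv_tbs_pending = False
            self.acked = True


def invalidate_tb_everywhere(others, spin_limit: int) -> str:
    """Request a TB invalidate on all other CPUs and wait for every ACK."""
    for cpu in others:
        cpu.acked = False
        cpu.inv_tbs_pending = True
    for _ in range(spin_limit):
        for cpu in others:
            cpu.poll()
        if all(cpu.acked for cpu in others):
            return "OK"
    stuck = [cpu.cpu_id for cpu in others if not cpu.acked]
    return f"CPUSPINWAIT: no ACK from CPU(s) {stuck}"


if __name__ == "__main__":
    print(invalidate_tb_everywhere([Cpu(1)], spin_limit=10))
    print(invalidate_tb_everywhere([Cpu(1, halted=True)], spin_limit=10))
```

In the model, a halted CPU never clears its pending bit and never sets the ACK, so the requester is the one that gives up and reports the timeout, even though it did nothing wrong itself.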

Volker.
Gregory Githens
Occasional Advisor

Re: CPUSPINWAIT, CPU spinwait timer expired

I am getting ready to upgrade my firmware from 6.6 to 7.3, but I have a fibre channel encryption device that encrypts data going to the disk array, and it is kind of picky. Is there a way to back out the firmware update if I have a problem?
Dean McGorrill
Valued Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

hi Greg,
Back out of an Alpha firmware update? Not really. You could reinstall your older one, I suppose, if you have it. It writes it to erom. Make sure you have the right version for that cpu, and don't power it off until it prompts you. How have you been doing with it, any more crashes? Did you stop cpu 1? -Dean
Gregory Githens
Occasional Advisor

Re: CPUSPINWAIT, CPU spinwait timer expired

I haven't stopped cpu 1. I was worried about performance issues. It has crashed 3 more times since then on the 12th, 13th, and early this morning. I should have a replacement cpu in a couple days. And I can also get a 6.6 firmware cd from our vendor. Thanks for your help.
Jur van der Burg
Respected Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

The 500 MHz EV6 pass 2.3 or lower CPUs are known for problems, especially when combined with newer-pass CPUs. They can be very sensitive to the layout of data in memory, so I can understand that the crashes started after applying software updates. I've seen that many times before. Replacing that CPU with at least a pass 2.5 one will fix it.

Jur.

Dean McGorrill
Valued Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

I wonder, was the footprint the same for all of them? If so, I'd check and see if the system is really that busy. I know I (and my users) would rather limp along a bit than see crashes! Great news on the cpu coming in! Dean
John Travell
Valued Contributor

Re: CPUSPINWAIT, CPU spinwait timer expired

When I was still 'in house' there used to be an ECO program for replacing 500 MHz rev 2.3 CPUs on ES40s. I don't know if there was any program put in place for other machines using the same CPU chips.
The problem was believed at the field support level to be a memory interlock timing error, and was only a problem on multi-cpu machines.
This crash is entirely consistent with the rev 2.3 problem, and as others have said, 'should' vanish when that cpu is replaced.
JT: