Operating System - OpenVMS
1755883 Members
3919 Online
108838 Solutions
New Discussion юеВ

Disk File Optimizer Error

 
SOLVED
Go to solution
Jim_McKinney
Honored Contributor

Re: Disk File Optimizer Error

Before concluding that the spin wait timeout was a result of software you should verify that CPU 1 is currently functional and also inspect the error log for possible failures of CPU 1. Failed CPUs often result in machinechecks such as this.

$ show cpu ! are there 2 present?
$ diag/sinc=yest/incl=cpu ! any failure?
$ anal/cras sys$system:sysdump.dmp
SDA> clue errlog
SDA> exit
$ diag/incl=cpu clue$errlog.sys
Phillip Thayer
Esteemed Contributor

Re: Disk File Optimizer Error

Opps, Your right. It should be SMP_LNGSPINWAIT not SMP_SPINWAIT. Sorry.

Phil
Once it's in production it's all bugs after that.
Lisa Collins
Advisor

Re: Disk File Optimizer Error

System: NEOMN2, AlphaServer DS20 500 MHz

CPU ownership sets:
Active 0,1
Configure 0,1

CPU state sets:
Potential 0,1
Autostart 0,1
Powered Down None
Failover None

I don't have Diagnose installed. I was able to create the CLUE$ERRLOG.SYS, but I get a lot of binary when I use just analyze. At the end I see this

Analyze Object File 3-MAR-2006 12:51:25.82 Page 20
D1:[COLLINS]CLUE$ERRLOG.SYS;1
ANALYZ A07-04

This is an OpenVMS VAX object file

3. MODULE HEADER (OBJ$C_HDR_MHD), 1720 bytes
*** End of module record is missing from previous module.

structure level: 0
maximum record size: 0
module name: ""
*** Length of symbol is not within valid range of 1 to 39.
module version: ""
*** Length of symbol is not within valid range of 1 to 31.
creation date/time:
*** Date is not in the correct 17-byte format (dd-mmm-yyyy hh:mm).
*** Object record is longer than maximum size in module header (0 bytes).


*** End of module record is missing from previous module.
*** No psects were defined in this module.


Analyze Object File 3-MAR-2006 12:51:25.82 Page 21
D1:[COLLINS]CLUE$ERRLOG.SYS;1
ANALYZ A07-04

SUMMARY STATISTICS:

Record Type Count Total Bytes

OBJ$C_HDR 2 9912
OBJ$C_GSD 0 0
OBJ$C_TIR 0 0
OBJ$C_EOM 0 0
OBJ$C_DBG 0 0
OBJ$C_TBT 0 0
OBJ$C_LNK 0 0
OBJ$C_EOMW 0 0

Totals 2 9912


The analysis uncovered 13 errors.
Lisa Collins
Advisor

Re: Disk File Optimizer Error

If it is the smp_lngspinwait, then I am already set to 300000

SMP_LNGSPINWAIT 3000000 3000000 1 8388607 10 usec.
Volker Halle
Honored Contributor

Re: Disk File Optimizer Error

Lisa,

when you did a SDA> CLUE ERRLOG to create CLUE$ERRLOG.SYS, can you post the SDA output ? If a HW problem is causing the CPUSPINWAIT crash, an errlog entry should be immediately preceeding the crash entry.

The CPUSPINWAIT crash was caused by CPU 00 executing code in system space and holding the SCHED spinlock for too long. We need to find out, which code CPU 00 was executing:

$ ANAL/CRASH SYS$SYSTEM:
SDA> READ/EXEC/NOLOG
SDA> SET CPU 00
SDA> EXA/INS 8FA229AC-30;50
SDA> SHOW PROC
SDA> CLUE CALL

You need DECevent ($ DIAGNOSE command) or even SEA to analyse CLUE$ERRLOG.SYS. No need to do this, if there were no HW errors immediately preceeding the crash (as shown by SDA> CLUE ERRLOG).

Volker.
Jim_McKinney
Honored Contributor

Re: Disk File Optimizer Error

> when I use just analyze

Try

$ ANALYZE/ERROR/ELV TRANS CLUE$ERRORLOG.SYS

rather than plain old ANALYZE (which defaults to /OBJECT). (I'm not sure of that syntax as I am able use DECevent on all of my systems.) I don't know if you'll be able to decode DS20 error packets with that utility - but, give it a try.
Zeni B. Schleter
Regular Advisor

Re: Disk File Optimizer Error

SMP_LNGSPINWAIT by default is 3 million . It was the SMP_SPINWAIT that was too low and needed to be raised. This doesn't mean that the crash was not for a real problem. Just it could be that one processor was busy for an unusually long time and the SPINWAIT timed out because it was too low for your purposes. Also this was a problem encountered in 7.1 and I don't know if it was there in another form later on. Doing a wildcard search through the 7.1 release_notes for patches got a lot of hits.

Here is a bit from the ALPPORTS01_071

o Multiprocessor Systems with CIPCAs: SMP_SPINWAIT Restriction

If your system uses a CIPCA adapter and you operate with
MULTIPROCESSING set to a non-zero value, you must reset the
value of the SMP_SPINWAIT parameter to 300000 (3 seconds)
instead of the default 100000 (1 second).

If you do not change the value of SMP_SPINWAIT, a CIPCA adapter
error could generate a CPUSPINWAIT system bugcheck similar to
the following:

**** OpenVMS (TM) Alpha Operating System V7.1 - BUGCHECK ****
** Code=0000078C: CPUSPINWAIT, CPU spinwait timer expired

This restriction will be removed in a future OpenVMS release.

- I looked on the patch release notes on a VMS 7.3-2 system and there were still mentions of cpuspinwait bugchecks.

Volker Halle
Honored Contributor

Re: Disk File Optimizer Error

All,

another 'speculation' warning !

A CPUSPINWAIT crash can have many different reasons. It is possible to extract some more information from the dump, to be able to do a more 'educated' guess about the problem. See my previous request for more info...

Lisa,

please make sure, that you enable a logfile for the DFG$SCRIPT32 script. Without the error message from the logfile (/NOLOG is the default), you open up room for even more speculation:

$ DEFRAGMENT MODIFY DFG$SCRIPT32/LOG=SCRIPT32.LOG

Volker.
Lisa Collins
Advisor

Re: Disk File Optimizer Error

Volker,

Here is the output of the creating the CLUE$ERRLOG.SYS

Dumpfile Errorlog Entry Information:
------------------------------------
Sequence Date Time Error Message Type
-------- ----------- ----------- --------------------------------
38069 3-MAR-2006 07:21:19.81 Timestamp Entry
38061 3-MAR-2006 07:02:01.03 Volume Mount
38070 3-MAR-2006 07:26:28.09 * Crash Entry

Config Entry and Errlog Entries written to CLUE$ERRLOG.SYS file, use COMPAQ Anal
yze or DECevent to analyze.



Here is the output of the additional analyze of the dump that you outlined.

$ ANAL/CRASH SYS$SYSTEM:

OpenVMS (TM) Alpha system dump analyzer
...analyzing a compressed selective memory dump...

Dump taken on 3-MAR-2006 07:26:28.09
CPUSPINWAIT, CPU spinwait timer expired

SDA> READ/EXEC/NOLOG
%SDA-W-OPENIN, error opening SDA$READ_DIR:RCDDRIVER as input
-RMS-W-FNF, file not found
%SDA-W-OPENIN, error opening SDA$READ_DIR:RMTDRIVER as input
-RMS-W-FNF, file not found
%SDA-W-OPENIN, error opening SDA$READ_DIR:LAVDRIVER as input
-RMS-W-FNF, file not found
%SDA-W-OPENIN, error opening SDA$READ_DIR:PWIPDRIVER as input
-RMS-W-FNF, file not found
%SDA-W-OPENIN, error opening SDA$READ_DIR:UCXDRIVER as input
-RMS-W-FNF, file not found
%SDA-W-OPENIN, error opening SDA$READ_DIR:NTYDRIVER as input
-RMS-W-FNF, file not found
%SDA-W-OPENIN, error opening SDA$READ_DIR:INETDRIVER as input
-RMS-W-FNF, file not found
SDA> SET CPU 00
SDA> EXA/INS 8FA229AC-30;50
FFFFFFFF.8FA2297C: ZAPNOT R20,#XFD,R20
FFFFFFFF.8FA22980: BIS R31,R31,R17
FFFFFFFF.8FA22984: BIS R18,R20,R18
FFFFFFFF.8FA22988: STL R18,#X000C(R16)
FFFFFFFF.8FA2298C: LDA R27,#XFD88(R2)
FFFFFFFF.8FA22990: LDL R22,#X003C(R21)
FFFFFFFF.8FA22994: ADDL R22,#X01,R22
FFFFFFFF.8FA22998: STL R22,#X003C(R21)
FFFFFFFF.8FA2299C: BSR R26,#XFFF70C
FFFFFFFF.8FA229A0: LDL R6,(R6)
FFFFFFFF.8FA229A4: XOR R6,R5,R0
FFFFFFFF.8FA229A8: BNE R0,#XFFFFE9
FFFFFFFF.8FA229AC: LDL R3,(R3)
FFFFFFFF.8FA229B0: BIS R31,R4,R16
FFFFFFFF.8FA229B4: BLBS R3,#X000002
FFFFFFFF.8FA229B8: MTPR IPL
FFFFFFFF.8FA229BC: BR R31,#X000004
FFFFFFFF.8FA229C0: BIS R31,R31,R31
FFFFFFFF.8FA229C4: BIS R31,R4,R0
FFFFFFFF.8FA229C8: LDA R27,#X5730(R2)
FFFFFFFF.8FA229CC: BSR R26,#XFF3FB2
SDA> sh proc

Process index: 013E Name: PMDFFB pmas Extended PID: 0000053E
--------------------------------------------------------------------
Process status: 00040001 RES,PHDRES
status2: 00000001 QUANTUM_RESCHED

PCB address 816D0B00 JIB address 817F2240
PHD address 85C5A000 Swapfile disk address 00000000
KTB vector address 816D0DEC HWPCB address FFFFFFFF.85C5A080
Callback vector address 00000000 Termination mailbox 001D
Master internal PID 0001013E Subprocess count 0
Creator extended PID 00000000 Creator internal PID 00000000
Previous CPU Id 00000000 Current CPU Id 00000000
Previous ASNSEQ 000000000000034A Previous ASN 0000000000000013
Initial process priority 4 # open files remaining 189/200
Delete pending count 0 Direct I/O count/limit 200/200
UIC [00001,000004] Buffered I/O count/limit 10000/10000
Abs time of last event 000251E3 BUFIO byte count/limit 97824/97824
# of threads 1 ASTs remaining 248/250
Swapped copy of LEFC0 00000000 Timer entries remaining 199/200
Swapped copy of LEFC1 00000000 Active page table count 0
Global cluster 2 pointer 00000000 Process WS page count 729

Press RETURN for more.
SDA>

Process index: 013E Name: PMDFFB pmas Extended PID: 0000053E
--------------------------------------------------------------------
Global cluster 3 pointer 00000000 Global WS page count 336
PCB Specific Spinlock 816D3A00

Press RETURN for more.
SDA>

Process index: 013E Name: PMDFFB pmas Extended PID: 0000053E
--------------------------------------------------------------------

Thread index: 0000
------------------
Current capabilities: System: 0000000C QUORUM,RUN
User: 00000000
Permanent capabilities: System: 0000000C QUORUM,RUN
User: 00000000
Current affinities: 00000000
Permanent affinities: 00000000
Thread status: 00040001
status2: 00000001

KTB address 816D0B00 HWPCB address FFFFFFFF.85C5A080
PKTA address 7FFEFF98 Callback vector address 00000000
Internal PID 0001013E Callback error 00000000
Extended PID 0000053E Current CPU id 00000000
State CUR 00 Flags 00000000
Base priority 4 Current priority 5
Waiting EF cluster 4 Event flag wait mask FFFFFFFE

Press RETURN for more.
SDA>

Process index: 013E Name: PMDFFB pmas Extended PID: 0000053E
--------------------------------------------------------------------
CPU since last quantum FFF7 Mutex count 0
ASTs active NONE

SDA>
SDA> clue call


Call Chain: Process index: 013E Process name: PMDFFB pmas PCB: 816D0B00
(CPU 0)
--------------------------------------------------------------------------------
-------
Procedure Frame Procedure Entry Return A
ddress
------------------ ---------------------------------------------- --------
----------------------------------------
7FFA1DE8 Null 800B82B0 SMP$ACQUIREL_C
7FFA1E40 Stack 8FA228D8 8FA2342C

7FFA1E80 Stack 8FA233E0 8F9EF994

7AE27DD0 Stack 0005FD00 PMAS_MASTER+0005FD00 000632BC
PMAS_MASTER+000632BC
7AE27F80 Stack 0005FD00 PMAS_MASTER+0005FD00 000632BC
PMAS_MASTER+000632BC
7AE28130 Stack 0005FD00 PMAS_MASTER+0005FD00 00062D88
PMAS_MASTER+00062D88
7AE282E0 Stack 0005FD00 PMAS_MASTER+0005FD00 000621DC
PMAS_MASTER+000621DC

Press RETURN for more.
SDA>

Call Chain: Process index: 013E Process name: PMDFFB pmas PCB: 816D0B00
(CPU 0)
--------------------------------------------------------------------------------
-------
Procedure Frame Procedure Entry Return A
ddress
------------------ ---------------------------------------------- --------
----------------------------------------
7AE28490 Stack 0005FD00 PMAS_MASTER+0005FD00 0006441C
PMAS_MASTER+0006441C
7AE28640 Stack 00063B60 PMAS_MASTER+00063B60 00056BF8
PMAS_MASTER+00056BF8
7AE28760 Stack 00056B60 PMAS_MASTER+00056B60 00043DA4
PMAS_MASTER+00043DA4
7AE287C0 Stack 00043670 PMAS_MASTER+00043670 00046408
PMAS_MASTER+00046408
7AE28A10 Stack 00045400 PMAS_MASTER+00045400 00041930
PMAS_MASTER+00041930
7AE28F90 Stack 00040F00 PMAS_MASTER+00040F00 00040068
PMAS_MASTER+00040068


Press RETURN for more.
SDA>

Call Chain: Process index: 013E Process name: PMDFFB pmas PCB: 816D0B00
(CPU 0)
--------------------------------------------------------------------------------
-------
Procedure Frame Procedure Entry Return A
ddress
------------------ ---------------------------------------------- --------
----------------------------------------
7AE29800 Stack 00040000 PMAS_MASTER+00040000 7BCE05B0
PTHREAD$RTL+525B0
7AE29850 Stack 7BCE0320 PTHREAD$RTL+52320 7BCBE31C
PTHREAD$RTL+3031C
7AE29A50 Stack 7BCBE140 PTHREAD$RTL+30140 8028D63C
SYS$IMGSTA_C+0015C
7AE2DB30 Stack 8028D4E0 SYS$IMGSTA_C 7AF75DD8

7AE2DBB0 Stack 7AF75C2C



Zeni,
You are saying to still change my
SMP_SPINWAIT from 100,000 to 300,000 correct? Do you think I should also change my SMP_LNGSPINWAIT from 3 million to 9 million as the 7.3--1 release notes (the link I posted) suggests? It mentions modifying this param to reduce the likelood of encountering a problem.


Jim,
The command ANALYZE/ERROR/ELV TRANS CLUE$ERRORLOG.SYS did not work.

Jim_McKinney
Honored Contributor

Re: Disk File Optimizer Error

> ANALYZE/ERROR/ELV TRANS CLUE$ERRORLOG.SYS did not work.

> 38061 3-MAR-2006 07:02:01.03 Volume Mount
> 38070 3-MAR-2006 07:26:28.09 * Crash Entry

No need to examine the errorlog entries - as Volker previously stated - if you didn't find an entry that was almost simultaneous with the crash then there'll be nothing to look at. Let Volker (or whomever) follow the trail thru SDA and see if they can find the fault. Again, heed Volker's advice and not jump to a conclusion that you need just increase the spinwait timeout value and your problem is cured. There are many things that can produce failures such as this.