Operating System - OpenVMS

Periodic access violation

Steve Meredith_1
Occasional Visitor

Thank you in advance for any and all help with this issue. We are running OpenVMS V7.1 on an Alpha 4100 in a process control/manufacturing facility with programs written in Fortran. Using Oracle DB for production storage.

The problem did not begin until after a recent patch/minor upgrade to Oracle to fix a bug (imagine that!). The program in question runs every hour: it reads data from one instance, does some manipulation, and inserts the data into another instance. The program crashes regularly, roughly every 3 to 3.5 days.

I have several examples of the crash output, and each one seems to be in a different location of the program task. Across the 14 examples I have, the reason mask is always the same (00), the virtual address is the same in all but 2 of them, and the PC and PS are the same in every case.

Below is a copy of one of the access violation crash outputs. Again, thanks for any and all help/suggestions.

Steve

%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=000000007AFA37C0, PC=FFFFFFFF8091A7A0, PS=0000001B

Improperly handled condition, image exit forced.
Signal arguments: Number = 0000000000000005
Name = 000000000000000C
0000000000010000
000000007AFA37C0
000000008091A7A0
000000000000001B

Register dump:
R0 = 0000000002607934 R1 = 00000000026444E8 R2 = 0000000000812600
R3 = 00000000026444E8 R4 = 00000000008126E0 R5 = 0000000000000000
R6 = 0000000000000001 R7 = 0000000002607934 R8 = 00000000008801B8
R9 = 0000000000000000 R10 = 0000000000008000 R11 = 0000000002651080
R12 = 000000007AFA4F6C R13 = 00000000026B4868 R14 = 000000000260EB60
R15 = 0000000000000000 R16 = 00000000026444E8 R17 = 00000000008126E0
R18 = 0000000000000000 R19 = 0000000000000001 R20 = 0000000000000000
R21 = 0000000002607938 R22 = 0000000000000100 R23 = 0000000000000100
R24 = 0000000000000100 R25 = 0000000000000000 R26 = FFFFFFFF8091C07C
R27 = 0000000000812528 R28 = FFFFFFFF8091A67C R29 = 0000000000000003
SP = 000000007AFB0000 PC = FFFFFFFF8091A7A0 PS = 000000000000001B
3 REPLIES
Volker Halle
Honored Contributor

Re: Periodic access violation

Steve,

the reason mask value in the Signal Array (10000) indicates an access violation during an instruction fetch.
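
As a rough illustration of how the signal array in the crash output lines up with the message text, here is a minimal Python sketch. The field order (argument count, condition name, reason mask, virtual address, PC, PS) is read off the dump above, and the meaning of bit 16 is taken from this reply. The names `decode_accvio` and `IFETCH_BIT` are made up for the example, and the remark about the low byte is only an assumption about how the one-line message is formatted.

```python
# Signal arguments copied from the crash output in the first post.
signal_args = [0x5, 0xC, 0x10000, 0x7AFA37C0, 0x8091A7A0, 0x1B]

SS_ACCVIO = 0x0C          # condition value for SS$_ACCVIO
IFETCH_BIT = 1 << 16      # hex 10000; "instruction fetch" per the reply above


def decode_accvio(args):
    """Split an ACCVIO signal array into named fields (hypothetical helper)."""
    count, body = args[0], args[1:]
    if count != len(body):
        raise ValueError("argument count does not match the array length")
    name, mask, va, pc, ps = body
    return {
        "is_accvio": name == SS_ACCVIO,
        "reason_mask": mask,
        "instruction_fetch": bool(mask & IFETCH_BIT),
        # Assumption: the one-line message shows only the low byte of the mask.
        "mask_low_byte": mask & 0xFF,
        "virtual_address": va,
        "pc": pc,
        "ps": ps,
    }


fields = decode_accvio(signal_args)
print(fields["is_accvio"], fields["instruction_fetch"], hex(fields["mask_low_byte"]))
```

The low byte of the mask is zero, which would be consistent with the "reason mask=00" shown in the %SYSTEM-F-ACCVIO line if that line prints only the low byte.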

This exception happens in a system routine (PC=8091A7A0). You can determine which instruction was being executed and which routine it belongs to by issuing the following commands on the running system (the same one on which this program ran):

$ ANAL/SYS ! needs CMKRNL privilege
SDA> READ/EXEC
SDA> EXA/INS 8091A7A0
SDA> EXA/INS 8091A7A0-30;40
SDA> EXIT

Then please provide the data reported by EXA/INS

You should also set up the process to write a process dump: $ SET PROC/DUMP before running the image in this process. If the process then aborts with an improperly handled condition, a process dump will be written (as imagename.DMP in the current default directory).

Analysing process dumps on V7.1 with the PC in system space may prove quite complicated. More recent versions of OpenVMS have improved the IMGDMP facility and allow much better analysis.

Volker.
Steve Meredith_1
Occasional Visitor

Re: Periodic access violation

Volker, thanks for your help.

Below is the output from the SDA commands you recommended. I also wanted to set up this process to write a process dump when it crashes, but apparently the SET PROC/DUMP qualifier is only valid for the current process. The EXEs that run on my system are managed by a process management program that stops, starts, restarts and otherwise controls the images.
SDA> EXA/INS 8091A7A0
FFFFFFFF.8091A7A0: STQ R27,(SP)
SDA> EXA/INS 8091A7A0-30;40
FFFFFFFF.8091A770: LDQ R3,#X0018(FP)
FFFFFFFF.8091A774: LDQ R4,#X0020(FP)
FFFFFFFF.8091A778: LDQ R5,#X0028(FP)
FFFFFFFF.8091A77C: LDQ R6,#X0030(FP)
FFFFFFFF.8091A780: LDQ R7,#X0038(FP)
FFFFFFFF.8091A784: LDQ R8,#X0040(FP)
FFFFFFFF.8091A788: LDQ FP,#X0048(FP)
FFFFFFFF.8091A78C: LDA SP,#X0050(SP)
FFFFFFFF.8091A790: RET R31,(R26)
FFFFFFFF.8091A794: BIS R31,R31,R31
FFFFFFFF.8091A798: LDA SP,#XF520(SP)
FFFFFFFF.8091A79C: BIS R31,R16,R1
FFFFFFFF.8091A7A0: STQ R27,(SP)
FFFFFFFF.8091A7A4: LDA R16,#X0060(SP)
FFFFFFFF.8091A7A8: STQ R26,#X0A78(SP)
FFFFFFFF.8091A7AC: LDA R0,#X0081(SP)
FFFFFFFF.8091A7B0: STQ R2,#X0A80(SP)
SDA> exit

Volker Halle
Honored Contributor

Re: Periodic access violation

Steve,

although the information in the signal array (reason mask) and the instruction stream are not 100% consistent, I would assume a stack overflow condition as a possible reason for this process crash.

The ACCVIO happened on an instruction trying to write to the user stack (STQ R27,(SP)). The stack pointer (SP) had just been decremented by 2784 bytes (decimal) by the preceding LDA SP,#XF520(SP).

Failing virtual address: 7AFA37C0
SP decrement: FFFFF520
Previous SP value: 7AFA42A0

So the previous SP pointed to a different page. Each Alpha page is 8 KB in size (hex 2000), so the stack pointer crossed the page boundary at 7AFA4000. The ACCVIO seems to have happened when trying to access the next-lower page.
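
The arithmetic above can be checked with a short Python sketch. It uses only the values quoted in this thread (failing address 7AFA37C0, displacement F520, 8 KB pages); the helper names `sign_extend_16` and `page_base` are made up for the example.

```python
PAGE_SIZE = 0x2000  # 8 KB Alpha page, as stated in the post


def sign_extend_16(value: int) -> int:
    """Interpret a 16-bit displacement field as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def page_base(address: int) -> int:
    """Return the starting address of the page containing `address`."""
    return address & ~(PAGE_SIZE - 1)


failing_va = 0x7AFA37C0                  # virtual address from the ACCVIO message
displacement = sign_extend_16(0xF520)    # from LDA SP,#XF520(SP)
previous_sp = failing_va - displacement  # SP before the LDA executed

print(displacement)                # -2784
print(hex(previous_sp))            # 0x7afa42a0
print(hex(page_base(previous_sp))) # 0x7afa4000
print(hex(page_base(failing_va)))  # 0x7afa2000
```

The two page bases differ, which matches the observation that the new SP landed on the page below the boundary at 7AFA4000.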

The user stack normally expands automatically towards lower addresses in P1 space. Stack expansion may be limited by VIRTUALPAGECNT or process PGFLQUOTA.

Try to watch this process during normal operations using SHOW PROC/CONT/ID= and look for increasing values of Virtual Pages, and check whether this value approaches the VIRTUALPAGECNT system parameter.

You could also use SDA to check the P1 space low address:

$ ANAL/SYS
SDA> SET PROC/ID=
SDA> SHOW PROC/PHD
...
First free P1 VA 00000000.7AE38000 ...
...

If this address continually decreases while the process is in operation, this may indicate a stack overflow problem.

You can also check the process for remaining page file quota using

$ SHOW PROC/QUOTA/ID=
If the remaining Paging File Quota continuously decreases, you can also predict that the process is going to crash at some point in time.
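
One way to turn such observations into a rough prediction is a straight-line fit through a few readings of the remaining quota. The Python sketch below is only an illustration: the function name `hours_until_exhausted` is made up, the sample readings are hypothetical and not taken from this system, and a linear trend is an assumption that may not hold for a real process.

```python
def hours_until_exhausted(samples):
    """Least-squares line through (hours_elapsed, remaining_quota) pairs.

    Returns the elapsed time at which the fitted line reaches zero,
    or None if the remaining quota is not decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_q = sum(q for _, q in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("samples must be taken at different times")
    sxy = sum((t - mean_t) * (q - mean_q) for t, q in samples)
    slope = sxy / sxx
    if slope >= 0:
        return None  # quota is steady or growing; no exhaustion predicted
    intercept = mean_q - slope * mean_t
    return -intercept / slope


# Hypothetical readings noted by hand from SHOW PROC/QUOTA (hours, pages left).
readings = [(0, 100000), (24, 70000), (48, 40000)]
estimate = hours_until_exhausted(readings)
print(estimate)
```

The same fit could be applied to readings of Virtual Pages against the VIRTUALPAGECNT limit by fitting the remaining headroom instead of the remaining quota.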

Volker.