Operating System - OpenVMS

Access violation where VA = PC

 
Sebastian Bazley
Regular Advisor

Access violation where VA = PC

I've been given details of an access violation.

The reason code is 00 (data read).

What is a bit odd is that the virtual address is the same as the PC.

I've tried to reproduce this condition (by hacking function pointers and stack stomping) but so far failed - the PC is always different from the virtual address.

Any idea how this situation might come about?

i.e. the sort of bugs to look for.

Details below.

%SYSTEM-F-ACCVIO, access violation,
reason mask=00
virtual address=005D99A8005D98C0,
PC=005D99A8005D98C0
PS=0000001B

Improperly handled condition, image exit forced.
Signal arguments: Number = 0000000000000005
Name = 000000000000000C
0000000000010000
005D99A8005D98C0
005D99A8005D98C0
000000000000001B

I've asked for the MAP file which may give some more clues.
19 REPLIES
Hein van den Heuvel
Honored Contributor

Re: Access violation where VA = PC

Looks like a 32-bit / 64-bit confusion
Like a RETURN address on a stack being corrupted.
Or a badly passed AST address?

Is this new code? Language? Recent modifications to working code? 'Nothing changed'?

>> What is a bit odd is that the virtual address is the same as the PC.

Well, that PV would be a bad virtual address in all likelihood, no?

The real target address is probably 0x5D98C0

Check that address (debugger? MAP?) what is expected there for clues.

hth,
Hein.
Sebastian Bazley
Regular Advisor

Re: Access violation where VA = PC

It's C-code - unfortunately the crash does not always happen...

We are asking the customer to enable crash dumps in case it happens again.
labadie_1
Honored Contributor

Re: Access violation where VA = PC

You have a very good document on that subject at

http://www.eight-cubed.com/articles/traceback.html
Hein van den Heuvel
Honored Contributor

Re: Access violation where VA = PC



Consider this:

$ mcr sys$login:tmp -11 -12
%SYSTEM-F-ACCVIO, access violation,
reason mask=00,
virtual address=000201B8000201B8, PC=000201B8000201B8, PS=0000001B

Code?


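#include <stdlib.h>   /* atoi */

/* Deliberately buggy: copies x[j] to x[i] with no bounds check.
   Negative indexes reach below the array and can overwrite a
   saved return address in a callee's frame. */
void b (int i, int j, int *x)
{
    x[i] = x[j];
}

void a (int i, int j, int *x)
{
    b(i, j, x);
}


int main(int argc, char *argv[])
{
    int i,j,x[100];
    i = atoi(argv[1]);   /* no argc check: omitting the arguments also faults */
    j = atoi(argv[2]);
    a(i,j,x);
    return 0;
}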


Cheers,
Hein.
Hoff
Honored Contributor

Re: Access violation where VA = PC

Something probably romped on the stack or on the heap or on a queue header, by the look of it. (Yep, that really narrows it down. There's a corruption. :-) A hammerlock grasp of the obvious, eh?) This footprint could be a stack corruption or stack-stomp causing a return heading off into hyperspace; the error here is usually secondary to the actual failure.

Do post the full stack trace.

If that's the full stack trace, then you'll probably end up adding at least debugging information and potentially instrumenting the code. Unfortunately, adding debug or instrumentation can perturb the bug, and might well cause it to enter remission.

Stephen Hoffman
HoffmanLabs


John Gillings
Honored Contributor

Re: Access violation where VA = PC

Sebastian,

I'd be enabling PROCESS dumps - SET PROCESS/DUMP

You don't mention a version - make sure you're at V7.3-2 at least. Use DEBUG or SDA to analyse the resulting dump.

VA=PC is a fairly specific footprint. Since there are few modern language constructs which allow jumping to an arbitrary address, the most likely cause is loading a PC from the stack. So, you're looking for something that overwrites the return address in a call frame. The "improperly handled condition" suggests that other pieces of the call frame have also been corrupted.

The biggest problem with tracking this type of error down is the "distance" between cause and effect. The corruption could be caused by the first executable line in the routine, but not actually seen until you attempt to exit. In DEBUG you can step a few lines, then check the integrity of the stack with SHOW CALL. From within a program, you can do something similar with LIB$GET_INVO* routines (potentially non-destructive). Another option would be to declare a condition handler at top level which can SS$_CONTINUE some specific condition. You can then check the integrity of the stack by LIB$SIGNALling that condition. If it's good, you continue; if not, you'll get an improperly handled condition.

If your magic number "005D99A8005D98C0" is always the same, have a look at the 32 bit addresses in a live process from SDA. I'd guess they're pointers within a data structure which has a size mismatch between the caller (too small) and the called routine.
A crucible of informative mistakes
Robert Gezelter
Honored Contributor

Re: Access violation where VA = PC

Sebastian,

First, a question. Does enabling DEBUG and using the debugger preserve the problem?

If so, then you can use the more advanced facilities in the DEBUGGER to isolate the region in the program where the stack is being corrupted. Most importantly, is there a test case that consistently produces the failure, and does it still fail when the DEBUGGER is enabled?

In nearly 30 years of debugging programs under OpenVMS, for myself and clients, I have had a few cases where stack corruption bugs manifested themselves at great distances from the actual corruption. Tracking them down can be difficult. Some of them have even appeared/disappeared depending on the presence of the DEBUGGER, which can be particularly aggravating.

-Bob Gezelter, http://www.rlgsc.com

John Travell
Valued Contributor

Re: Access violation where VA = PC

In my experience, VA = PC means that code tried to use the bogus VA as the destination of a JSR. If you have the register contents, have a look at R26.
This *may* contain the address the supposed JSR should have returned to.
If you have what looks like a valid address there, check to see if it is code, and if so, is that address preceded by a JSR R26(R26) ?
If all this works out, you should be able to work out where your bogus VA came from.

Sebastian, should you wish to do so, you are welcome to call me, I think I gave you my card at one of the London seminars.
JT:
Sebastian Bazley
Regular Advisor

Re: Access violation where VA = PC

Thanks for all the replies.

I've suggested that the customer enable process dumps for the application. It is a batch process so that's easy to do. Using DEBUG on the live process is not an option.

Hein: I tried your crash program on VMS 7.3-1 (Compaq C V6.4-008) and it does not crash using -11 -12 as parameters - nor any others I tried. (I do get an access violation if the parameters are omitted). Also fails to crash using Compaq C V6.5-001 on OpenVMS Alpha V7.3-2.

Not yet had a chance to look at the other suggestions.
Hein van den Heuvel
Honored Contributor

Re: Access violation where VA = PC

I compiled my 'reproducer' with /noopt.
That may well make a difference.

The point is of course not the actual crash, but to show how a simple out-of-bounds array reference can zap a return address on the stack, and that this can in turn cause an ACCVIO with VA=PC.

Array out of bound is one method.
Another method (a user code bug) is to remember the addresses of local variables outside the scope of those variables and then use them.

Tracking the problem down remains as suggested by many: try running with the debugger, or at the very least with SET PROC/DUMP. And use the lower 32 bits of the VA as an indication of an address involved in the issue.

Good luck,
Hein.
Sebastian Bazley
Regular Advisor

Re: Access violation where VA = PC

Yes, /noopt is essential. From looking at the machine code, A was not being called by MAIN when default optimisation was used...

I now get crashes with -9 -10 and -10 -11, but not -11 -12. No matter; at least I can reproduce the symptoms.

==

There are 3 crashes (see attached) for which I have details - in each case the VA is different, but in each case VA = PC = R26.

This suggests that the problem might be corruption of the return address in R26, i.e. the fault could well be caused by the next instruction after a "RET R26".

The SP in the 3 cases is the same, so that suggests the corruption is occurring in the same place each time.

The customer has been asked to send us the crash dump if/when it next occurs.

Unless anything obvious jumps out from the stack traces, I think we're stuck until we have a dump.
Hein van den Heuvel
Honored Contributor

Re: Access violation where VA = PC

>> There are 3 crashes (see attached) for which I have details - in each case the VA is different, but in each case VA = PC = R26.

There is SO much more information to play with.
For example, did you notice how the upper 32 bits and lower 32 bits in the VA are always 232 (0xE8) apart? In the original post, 0x005D99A8 - 0x005D98C0 = 0xE8. No coincidence!
Other registers have values that look 'interesting'.

>> This suggests that the problem might be corruption of the return address in R26, i.e the fault could well be caused by the next instruction after a "RET R26".

Well duh! But the corruption would 99.99% sure have happened on the stack before it was popped back into r26

>> The SP in the 3 cases is the same, so that suggests the corruption is occurring in the same place each time.

Yes, some piece of code could have (erroneously) passed along the address of its local variables, which was then used out of scope.
With the debugger you could potentially set a watchpoint on the location.

>> The customer has been asked to send us the crash dump if/when it next occurs.

Process dump. Not crash dump.

>> Unless anything obvious jumps out from the stack traces, I think we're stuck until we have a dump.

I think you can work it some more, but it will be tedious and expensive in time.
Most importantly, you need to explain to the customer that they have bad code, no matter how long it seemed to work before.
Good luck,
Hein.
John Gillings
Honored Contributor

Re: Access violation where VA = PC

Sebastian,
I'd be looking around 005BD400 and 005B8C00 - they're low enough to be static code. Try RUN/DEBUG the program. Don't panic, you won't actually execute anything, just poke around the address space. Look at SHOW IMAGE and SHOW MODULE, then EXAMINE any addresses you see in registers. The P1 and high P0 space addresses are unlikely to exist, but anything in low P0 is likely to be static data or code.

Alternative is to ANALYZE/SYSTEM and look at the live process. That will only give you offsets into images, so you'll need full MAP files to identify modules.
A crucible of informative mistakes
Sebastian Bazley
Regular Advisor

Re: Access violation where VA = PC

We now have a process dump. This has been sent to us using zip "-V", but we cannot seem to read it on our system. ANA/RMS does not show any errors, and I have tried using zip "-V" locally without problems.

Anyone know what the following errors mean?

================

OpenVMS Alpha Debug64 Version V7.3-200


%SDA-W-EXCLDATA, data excluded from dump due to insufficient privilege
%SDA-W-LINKTIMEMISM, link time of SYS$COMMON:[SYS$LDR]SYS$BASE_IMAGE.EXE;2
( 6-SEP-2006 16:04) does not match link time of image in process dump ( 8-FEB-2006 13:47)
%SDA-W-SDALINKMISM, link time of SYS$BASE_IMAGE built into SDA$SHARE
( 6-SEP-2006 16:04) does not match link time of image in system dump ( 8-FEB-2006 13:47)
%DEBUG-W-IMAGENF, target image xxxxxxxxxxxxxxxxxxxxx not found on host system
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=000000007FEEA5EC, PC=0000000000048A38, PS=0000001B

...

module name routine name line rel PC abs PC
%DEBUG-I-TRUNC64, address 00551A0831A97000 being truncated in DBGKREGISTERS\DBG$SETUP_INVCTX
%DEBUG-I-TRUNC64, address 00551A0831A97000 being truncated in ALPHA_PROCEDURE_UTILS\DBG$IS_EXC_DISPATCH_FRAME
%DEBUG-I-TRUNC64, address 00551A0831A97000 being truncated in ALPHA_PROCEDURE_UTILS\DBG$IS_AST_DISPATCH_FRAME

Due to this internal error this debug session may be unreliable.

%DEBUG-E-INTERR, debugger error in DBGMAIN\SUB_CALL_KERNEL_RPC - error in kernel
routine GET_IMAGE_ENTRY or session corruption
DBG> show stack
%DEBUG-E-NOPROCESSES, the current command is targetted at an empty process set
DBG>

============

The difference in dates is understandable, but I don't understand why we cannot analyse the dump on a different system.
Hein van den Heuvel
Honored Contributor

Re: Access violation where VA = PC


You are using $ANAL/PROCESS_DUMP right?

Anyway, you indicate you have a problem unzipping. The zip file probably came via FTP. Was it transferred in binary mode? Are the attributes reasonable?
If it failed to unzip correctly, then IMHO all bets are off for using the dump.

Still, you get a complaint about a version mismatch. What are the versions involved? Does ANAL/PROC work on a freshly produced dump on your local system?

You may have to analyze this on the target machine!

Regards,
Hein.
Brian Reiter
Valued Contributor

Re: Access violation where VA = PC

We had a similar problem caused by an old bit of code which issued a QIO with the IOSB declared locally to the procedure.

There was no attempt to check the IOSB later (just as well), the assumption was that the AST firing implied it all worked.

The crashing got worse after a minor change to improve throughput.


Sebastian Bazley
Regular Advisor

Re: Access violation where VA = PC

Sorry, my posting was ambiguous - we did not have any problem unzipping the DMP file; and ANAL/RMS on the DMP did not show any errors.

However, using ANAL/PROC on the extracted DMP file produces the errors as shown.

We've now sent a script to the customer to hopefully get at least some of the info from the dump. But obviously it would be easier if we could analyse the dump on our own system. [The customer is in a different time-zone]
Hoff
Honored Contributor

Re: Access violation where VA = PC

I might well next try to analyze the process dump directly on the system that generated it. That might be the most expedient approach.

(For quite some time, the process dump mechanism really only mostly worked directly on the node that generated the dump. That did get fixed, but I don't immediately recall the OpenVMS version where that got fixed. Even with the fix -- what amounted to an overhaul of the exception handling and the debugger bootstrap -- the nodes involved may still need to be fairly similar in configuration. There are still ECOs for process dump for various OpenVMS releases.)

There is another and more subtle problem, too, in that a process that has badly corrupted its call stack is sometimes rather difficult to analyze.


Sebastian Bazley
Regular Advisor

Re: Access violation where VA = PC

Unfortunately the dump cannot be analysed even on the system where it was created.

Thanks again for all the useful suggestions and information.