Re: Access violation where VA = PC

Hein van den Heuvel · ‎02-19-2007

I compiled my 'reproducer' with /noopt.
That may well make a difference.

The point is of course not the actual crash, but to show how a simple out-of-bounds array reference can zap a return address on the stack, and that this can in turn cause an ACCVIO with VA = PC.
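The mechanism can be modelled without crashing anything. What follows is a toy model in Python, not Alpha code and not the reproducer itself: the stack frame is a plain list of slots, and the frame layout (a four-element local array sitting directly below the saved return address), the function names, and the addresses are all assumptions made up for illustration.

```python
# Toy model of one stack frame as a list of integer slots.
# Assumed (hypothetical) layout: four slots of local array, then the saved return address.

ARRAY_BASE = 0                           # slot index of local_array[0]
ARRAY_LEN = 4                            # declared length of the local array
RETURN_SLOT = ARRAY_BASE + ARRAY_LEN     # saved return address sits just past the array


def make_frame(return_address):
    """Build a frame with a zeroed local array and a saved return address."""
    frame = [0] * ARRAY_LEN
    frame.append(return_address)
    return frame


def store_unchecked(frame, index, value):
    """Store into local_array[index] with no bounds check, as unchecked compiled code would."""
    frame[ARRAY_BASE + index] = value


def procedure_return(frame):
    """'Return' by loading the saved return address; this value becomes the new PC."""
    return frame[RETURN_SLOT]


frame = make_frame(return_address=0x30000)   # 0x30000 is a made-up caller address
store_unchecked(frame, 3, 0x11111111)        # in bounds: last element of the array
store_unchecked(frame, 4, 0xDEADBEEF)        # one past the end: lands on the return slot
new_pc = procedure_return(frame)
print(f"PC after return: {new_pc:#x}")       # a data value is now used as a code address
```

On a real system it is the instruction fetch from that bogus address that faults, which is why the reported virtual address equals the PC.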

An out-of-bounds array reference is one method.
Another method (also a user code bug) is to keep the addresses of local variables beyond the scope of those variables and then use them.

Tracking the problem down remains as suggested by many: try running with the debugger, or at the very least with SET PROC/DUM. And use the lower 32 bits of the VA as an indication of an address involved in the issue.

Good luck,
Hein.

Sebastian Bazley · ‎02-19-2007

Yes, /noopt is essential. From looking at the machine code, A was not being called by MAIN when default optimisation was used...

I now get crashes with -9 -10 and -10 -11, but not -11 -12. No matter; at least I can reproduce the symptoms.

==

There are 3 crashes (see attached) for which I have details - in each case the VA is different, but in each case VA = PC = R26.

This suggests that the problem might be corruption of the return address in R26, i.e. the fault could well be caused by the next instruction after a "RET R26".

The SP in the 3 cases is the same, so that suggests the corruption is occurring in the same place each time.

The customer has been asked to send us the crash dump if/when it next occurs.

Unless anything obvious jumps out from the stack traces, I think we're stuck until we have a dump.

Hein van den Heuvel · ‎02-19-2007

>> There are 3 crashes (see attached) for which I have details - in each case the VA is different, but in each case VA = PC = R26.

There is SO much more information to play with.
For example, did you notice how the upper 32 bits and lower 32 bits in the VA are always 0x232 apart? No coincidence!
Other registers have values that look 'interesting'.
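Comparing the two halves of a 64-bit value is quick to script. The crash details are in the attachment and are not reproduced here, so this sketch uses the address from the %DEBUG-I-TRUNC64 messages elsewhere in this thread purely as sample input; the helper names `split_va` and `half_distance` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def split_va(va):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit address."""
    return (va >> 32) & MASK32, va & MASK32


def half_distance(va):
    """Absolute distance between the upper and lower 32-bit halves."""
    hi, lo = split_va(va)
    return abs(hi - lo)


sample = 0x00551A0831A97000   # sample input only, taken from the TRUNC64 messages
hi, lo = split_va(sample)
print(f"upper={hi:08X} lower={lo:08X} distance={half_distance(sample):#x}")
```

Running the same check over each reported VA shows at a glance whether the distance between the halves is constant, and the lower half on its own is the value to look up against the image's address space.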

>> This suggests that the problem might be corruption of the return address in R26, i.e the fault could well be caused by the next instruction after a "RET R26".

Well, duh! But the corruption almost certainly (99.99%) happened on the stack, before the value was popped back into R26.

>> The SP in the 3 cases is the same, so that suggests the corruption is occurring in the same place each time.

Yes, some piece of code could have (erroneously) passed along the address of its local variables, which was then used out of scope.
With the debugger you could potentially set a watchpoint on the location.

>> The customer has been asked to send us the crash dump if/when it next occurs.

Process dump. Not crash dump.

>> Unless anything obvious jumps out from the stack traces, I think we're stuck until we have a dump.

I think you can work it some more, but it will be tedious and expensive in time.
Most importantly, you need to explain to the customer that they have bad code, no matter how long it seemed to work before.
Good luck,
Hein.

John Gillings · ‎02-19-2007

Sebastian,
I'd be looking around 005BD400 and 005B8C00 - they're low enough to be static code. Try RUN/DEBUG the program. Don't panic, you won't actually execute anything, just poke around the address space. Look at SHOW IMAGE and SHOW MODULE, then EXAMINE any addresses you see in registers. The P1 and high P0 space addresses are unlikely to exist, but anything in low P0 is likely to be static data or code.

An alternative is to ANALYZE/SYSTEM and look at the live process. That will only give you offsets into images, so you'll need full MAP files to identify modules.

A crucible of informative mistakes

Sebastian Bazley · ‎02-20-2007

We now have a process dump. This has been sent to us using zip "-V", but we cannot seem to read it on our system. ANA/RMS does not show any errors, and I have tried using zip "-V" locally without problems.

Anyone know what the following errors mean?

================

OpenVMS Alpha Debug64 Version V7.3-200

%SDA-W-EXCLDATA, data excluded from dump due to insufficient privilege
%SDA-W-LINKTIMEMISM, link time of SYS$COMMON:[SYS$LDR]SYS$BASE_IMAGE.EXE;2
( 6-SEP-2006 16:04) does not match link time of image in process dump ( 8-FEB-2006 13:47)
%SDA-W-SDALINKMISM, link time of SYS$BASE_IMAGE built into SDA$SHARE
( 6-SEP-2006 16:04) does not match link time of image in system dump ( 8-FEB-2006 13:47)
%DEBUG-W-IMAGENF, target image xxxxxxxxxxxxxxxxxxxxx not found on host system
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=000000007FEEA5EC, PC=0000000000048A38, PS=0000001B

...

module name routine name line rel PC abs PC
%DEBUG-I-TRUNC64, address 00551A0831A97000 being truncated in DBGKREGISTERS\DBG$SETUP_INVCTX
%DEBUG-I-TRUNC64, address 00551A0831A97000 being truncated in ALPHA_PROCEDURE_UTILS\DBG$IS_EXC_DISPATCH_FRAME
%DEBUG-I-TRUNC64, address 00551A0831A97000 being truncated in ALPHA_PROCEDURE_UTILS\DBG$IS_AST_DISPATCH_FRAME

Due to this internal error this debug session may be unreliable.

%DEBUG-E-INTERR, debugger error in DBGMAIN\SUB_CALL_KERNEL_RPC - error in kernel
routine GET_IMAGE_ENTRY or session corruption
DBG> show stack
%DEBUG-E-NOPROCESSES, the current command is targetted at an empty process set
DBG>

============

The difference in dates is understandable, but I don't understand why we cannot analyse the dump on a different system.

Hein van den Heuvel · ‎02-20-2007

You are using $ANAL/PROCESS_DUMP, right?

Anyway, you indicate you have a problem unzipping. The zip file probably came via FTP. Was it transferred in binary mode? Are the file attributes reasonable?
If it failed to unzip correctly, then IMHO all bets are off for using the dump.

Still, you get a complaint about a version mismatch. What are the versions involved? Does ANAL/PROCESS_DUMP work on a freshly produced dump on your local system?

You may have to analyze this on the target machine!

Regards,
Hein.

Brian Reiter · ‎02-21-2007

We had a similar problem caused by an old bit of code which issued a QIO with the IOSB declared locally to the procedure.

There was no attempt to check the IOSB later (just as well); the assumption was that the AST firing implied it all worked.

The crashing got worse after a minor change to improve throughput.
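Why a locally declared IOSB is dangerous can be shown with a toy model. This sketch is Python, makes no real QIO call, and every name, slot number, and value in it is an assumption for illustration: a list stands in for stack memory, a procedure hands out the slot of its local "IOSB" and returns, a later call reuses that slot for its saved return address, and then the queued completion writes its status there.

```python
# Toy model only: list slots stand in for stack memory.
stack = [0] * 8
pending = []                              # queued completions: (slot index, status value)


def start_io_with_local_iosb(sp):
    """Hypothetical procedure: uses stack[sp] as its IOSB, queues the I/O, returns at once."""
    iosb_slot = sp
    stack[iosb_slot] = 0
    pending.append((iosb_slot, 0x0001))   # completion will write a status here later
    # Returning ends the IOSB's lifetime, but the queued request still refers to the slot.


def call_other_procedure(sp, return_address):
    """A later call reuses the same stack slot to save its return address."""
    stack[sp] = return_address


def deliver_completions():
    """The I/O finishes: each status is written through the stale slot reference."""
    while pending:
        slot, status = pending.pop(0)
        stack[slot] = status


SP = 3                                    # made-up stack slot
start_io_with_local_iosb(SP)
call_other_procedure(SP, 0x30000)         # 0x30000 is a made-up return address
before = stack[SP]
deliver_completions()
after = stack[SP]
print(f"saved return address before: {before:#x}, after: {after:#x}")
```

Whether the late write lands on anything important depends on what happens to occupy the slot when the I/O completes, so a change in timing can plausibly change how often the damage shows up.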

Sebastian Bazley · ‎02-21-2007

Sorry, my posting was ambiguous - we did not have any problem unzipping the DMP file; and ANAL/RMS on the DMP did not show any errors.

However, using ANAL/PROC on the extracted DMP file produces the errors as shown.

We've now sent a script to the customer to hopefully get at least some of the info from the dump. But obviously it would be easier if we could analyse the dump on our own system. [The customer is in a different time-zone]

Hoff · ‎02-21-2007

I might well next try to analyze the process dump directly on the system that generated it. That might be the most expedient approach.

(For quite some time, the process dump mechanism really only worked reliably on the node that generated the dump. That did get fixed, but I don't immediately recall the OpenVMS version where that got fixed. Even with the fix -- what amounted to an overhaul of the exception handling and the debugger bootstrap -- the nodes involved may still need to be fairly similar in configuration. There are still ECOs for process dump for various OpenVMS releases.)

There is another and more subtle problem, too, in that a process that has badly corrupted its call stack is sometimes rather difficult to analyze.

Sebastian Bazley · ‎03-05-2007

Unfortunately the dump cannot be analysed even on the system where it was created.

Thanks again for all the useful suggestions and information.
