Operating System - OpenVMS

Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

Krax
Occasional Visitor


Hello,

Below I have pasted a piece of my function and the related instructions (unfortunately optimized) from the dump analyzer.

Register r21 is filled by adding a constant (I guess the position of the gaul_trc_fld variable) to the r1 (gp) register. It seems that r1 contains a wrong value, because the access to that memory causes an "Access Violation". Is that possible? Can r1 be overwritten, for example by a buffer overflow or something like that?

What I cannot understand is that the code crashes at the instruction marked below only from time to time. (The function is called many times per second; the program runs for, say, a week and then crashes.)


21065: void queue_ev( T_MSG *pi_ar_msg,
unsigned short pi_uw_event,
unsigned short pi_uw_process,
unsigned long pi_ul_cust,
void *pi_ar_cust )
{

21072: assert( pi_ar_msg != NULL );

/*
** Add the event to the queue:
*/
21077: pi_ar_msg->w_num_elem_queue++;
21078: assert(pi_ar_msg->w_num_elem_queue <= SIZE_QUEUE ); /* Should not happen */

/* Add to queue, when at end wrap around: */
21081: pi_ar_msg->w_last_elem_queue++;
21082: if( pi_ar_msg->w_last_elem_queue == SIZE_QUEUE )
{
21084: pi_ar_msg->w_last_elem_queue = 0;
}

/* Set the process and event: */
21088: pi_ar_msg->r_queue[pi_ar_msg->w_last_elem_queue].uw_event = pi_uw_event;
21089: pi_ar_msg->r_queue[pi_ar_msg->w_last_elem_queue].uw_process = pi_uw_process;
21090: pi_ar_msg->r_queue[pi_ar_msg->w_last_elem_queue].ul_cust = pi_ul_cust;
21091: pi_ar_msg->r_queue[pi_ar_msg->w_last_elem_queue].ar_cust = pi_ar_cust;

21093: if( ( (*gaul_trc_fld) & (gul_stm_msk) ) || ( (*gaul_trc_fld) & (0) ) )
{
.....

Some declarations:
unsigned long *gaul_trc_fld;
unsigned long gul_stm_msk;


21065: alloc r38 = ar.pfs, 08, 08, 00
: mov r37 = b0
21072: cmp4.eq p7, p0 = r32, r0
21065: mov r39 = r1
21072: (p7) br.cond.dpnt.few 0000010
: br.many 0000050 ;;
: add r3 = 216560, r1
: add r42 = 216550, r1
: mov r25 = 000003 ;;
: ld8 r3 = [r3]
: ld8 r41 = [r42]
: mov r42 = 0000FB ;;
: mov r40 = r3
: nop.f 000000
: br.call.sptk.many b0 = 00AB8B0 ;;
: mov r1 = r39
: nop.f 000000
: nop.i 000000
21077: mov r3 = 0049CC
: mov r8 = 0049CC
21078: mov r9 = 0049CC ;;
21077: add r3 = r32, r3
: add r8 = r32, r8
21078: add r9 = r32, r9 ;;
21077: ld2 r3 = [r3] ;;
: nop.m 000000
: sxt2 r3 = r3 ;;
: add r3 = 0001, r3 ;;
: st2 [r8] = r3
: nop.i 000000 ;;
21078: ld2 r9 = [r9] ;;
: nop.m 000000
: sxt2 r9 = r9 ;;
: cmp.lt p7, p0 = 19, r9
: (p7) br.cond.dpnt.few 0000010
: br.many 0000050 ;;
: add r42 = 216550, r1
: add r10 = 216568, r1
: mov r25 = 000003 ;;
: ld8 r41 = [r42]
: ld8 r40 = [r10]
: mov r42 = 000101
: nop.m 000000
: nop.f 000000
: br.call.sptk.many b0 = 00AB830 ;;
: mov r1 = r39
: nop.f 000000
: nop.i 000000
21081: mov r3 = 0049D0
: mov r8 = 0049D0 ;;
: add r3 = r32, r3
: add r8 = r32, r8 ;;
: ld2 r3 = [r3]
: nop.i 000000 ;;
: nop.m 000000
: sxt2 r3 = r3 ;;
: add r3 = 0001, r3 ;;
: nop.m 000000
: zxt2 r3 = r3 ;;
21082: cmp.eq p0, p6 = 19, r3
21081: st2 [r8] = r3
: nop.f 000000
21082: (p6) br.cond.dpnt.few 0000030 ;;
21084: mov r9 = 0049D0 ;;
: add r9 = r32, r9
: nop.i 000000 ;;
: st2 [r9] = r0
: nop.f 000000
: nop.i 000000
21088: mov r10 = 0049D0
21093: add r21 = 2159D8, r1
21088: mov r17 = 0049D4 ;;
: add r10 = r32, r10
21089: mov r18 = 0049D6
21090: mov r19 = 0049D8
21091: mov r20 = 0049DC ;;
21088: ld2 r44 = [r10]
21093: add r22 = 215AF8, r1
: ld8 r21 = [r21] ;; <=============== %SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=FFFFFFFFFFE159D8, PC=000000000014C800, PS=0000001B
break on unhandled exception at XXX_MISC\queue_ev\%LINE 21093+14
: ld8 r22 = [r22]
21088: sxt2 r45 = r44 ;;
: shladd r11 = r45, 1, r45 ;;
: nop.m 000000
: sxt4 r11 = r11 ;;
: shladd r11 = r11, 2, r32 ;;
: add r17 = r11, r17
21089: add r18 = r11, r18
21090: add r19 = r11, r19
21091: add r20 = r11, r20
: nop.i 000000 ;;
21088: st2 [r17] = r33
21089: st2 [r18] = r34
: nop.i 000000 ;;
21090: st4 [r19] = r35
21091: st4 [r20] = r36
: nop.i 000000 ;;
21093: ld4 r21 = [r21]
: ld4 r23 = [r22]
: nop.i 000000 ;;
: nop.m 000000
: sxt4 r21 = r21
: nop.b 000000 ;;
: ld4 r21 = [r21] ;;
: and r21 = r21, r23
: nop.i 000000 ;;
: cmp4.eq p9, p0 = r0, r21
: nop.f 000000
: (p9) br.cond.dpnt.few 00003A0 ;;
13 REPLIES
Joseph Huber_1
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

Why are You guessing and fiddling with machine code, better use the source language debugger!

Obviously the access violation happens at
if( ( (*gaul_trc_fld) & (gul_stm_msk) ) || ( (*gaul_trc_fld) & (0) ) )
when dereferencing the pointer gaul_trc_fld:
just the variable you don't show us.
Look at its content with the debugger, and try to find out where it is set (to an invalid address).
http://www.mpp.mpg.de/~huber
Hoff
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

Ok, so this is what looks to be a memory queue, this is likely involving an Integrity multiprocessor, it's transient, and you've hit a clobbered value.

My bet? Contention. I see no interlocking on that queue, which means the code is potentially vulnerable to contention.

What to do? Switch to the C builtin.h interlocked queue routines, or to the analogous RTL queue calls, or set up spinlocks or such, or implement one of Knuth's algorithms, or use existing code that ensures the queue access is "safe", or...

If you want to test this stuff, hack this particular queue code out of the enveloping application code and build a skeleton that hammers on these queues from multiple threads or multiple processes. Get rid of everything else in the application around this code that can reduce contention on these queues. Let this stuff run full-tilt.
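
A minimal sketch of the kind of skeleton described above, under loud assumptions: it serializes access with a POSIX threads mutex instead of the OpenVMS interlocked queue builtins or RTL queue calls, and every name in it (demo_queue, demo_enqueue, demo_dequeue, demo_stress, DEMO_QUEUE_SIZE) is hypothetical rather than taken from the original application. It only shows the shape of a harness that hammers one queue from several threads and then checks that the element count is consistent.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_QUEUE_SIZE 25      /* hypothetical capacity */
#define DEMO_MAX_THREADS 16     /* hypothetical upper bound for the harness */

typedef struct {
    pthread_mutex_t lock;       /* serializes every access to the fields below */
    short num_elem;             /* number of queued elements */
    short first_elem;           /* index of the oldest element */
    short last_elem;            /* index of the newest element */
    unsigned short events[DEMO_QUEUE_SIZE];
} demo_queue;

void demo_queue_init(demo_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->num_elem = 0;
    q->first_elem = 0;
    q->last_elem = DEMO_QUEUE_SIZE - 1;   /* first enqueue wraps to slot 0 */
}

/* Returns 0 on success, -1 if the queue is full. */
int demo_enqueue(demo_queue *q, unsigned short event)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->num_elem < DEMO_QUEUE_SIZE) {
        q->last_elem = (short)((q->last_elem + 1) % DEMO_QUEUE_SIZE);
        q->events[q->last_elem] = event;
        q->num_elem++;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Returns 0 on success, -1 if the queue is empty. */
int demo_dequeue(demo_queue *q, unsigned short *event)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->num_elem > 0) {
        *event = q->events[q->first_elem];
        q->first_elem = (short)((q->first_elem + 1) % DEMO_QUEUE_SIZE);
        q->num_elem--;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

typedef struct {
    demo_queue *q;
    int iterations;
} demo_worker_arg;

/* Each worker removes one element for every element it managed to add. */
static void *demo_worker(void *p)
{
    demo_worker_arg *arg = p;
    unsigned short ev;
    for (int i = 0; i < arg->iterations; i++) {
        if (demo_enqueue(arg->q, (unsigned short)i) == 0) {
            demo_dequeue(arg->q, &ev);
        }
    }
    return NULL;
}

/* Runs nthreads workers and returns the final element count (0 if consistent). */
int demo_stress(int nthreads, int iterations)
{
    demo_queue q;
    pthread_t threads[DEMO_MAX_THREADS];
    demo_worker_arg arg = { &q, iterations };

    assert(nthreads > 0 && nthreads <= DEMO_MAX_THREADS);
    demo_queue_init(&q);
    for (int i = 0; i < nthreads; i++) {
        pthread_create(&threads[i], NULL, demo_worker, &arg);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_mutex_destroy(&q.lock);
    return q.num_elem;
}
```

Calling demo_stress(4, 10000) should return 0 when the locking is intact. Removing the lock/unlock pairs and rerunning on a multiprocessor is one way to see whether an unprotected version of the same queue logic drifts out of sync.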
Joseph Huber_1
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

And yes, since it happens only sometimes, the pointer variable has been filled with a wrong/invalid address, which could well be caused by a buffer overflow or an array index violation.

Compile with CC/NOOPT/CHECK=(BOUNDS)/DEBUG and hope to catch the error (if using dimensioned arrays, not only pointers).

But any other language but C is probably better in avoiding such errors...
http://www.mpp.mpg.de/~huber
Hoff
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

ps: This sort of stuff is a bit more involved to implement on OpenVMS than on other platforms (and Itanium isn't "fun" for multithreading; see the KP Threads services if that requirement is involved here), but OpenVMS has one tool that can be brought to bear here: program the OpenVMS Debugger to break on the failing code and conditionally check for the corruption or error (preferably within the test harness mentioned earlier), and turn it loose waiting for the corruption to arise. But again, queues need to be interlocked or bad things happen when contention arises.
John Reagan
Respected Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

So are you sure that is really the failing instruction? The GP is saved into R39 in the prologue and restored back to R1 after the external routine calls. Unfortunately, since you can't reproduce it every time, it would be interesting to see what is really in R1 (although you can reverse engineer that with the failing VA).

So how could R39 get trashed? The only way I can think of is that the called routines (which call other routines, etc.) eventually use enough registers that the R39 from this frame eventually got pushed out to the register backing store. In theory, you could trash it there and eventually when it gets reloaded back into the chip and finally back into the frame when the routine returns, the saved R1 in R39 is the wrong value. That said, I've never seen such a failure. The register backing store isn't anywhere near the rest of your data or the memory stack.

Of course, there could be a bug in the OS (notably SWIS) dealing with saved register values, etc. Did you say what version of OpenVMS you are running and what ECOs you have installed? Did you say what version of the compiler you are running?

For those confused, the real fetches of the user data (ie, the pointers) occur later in the instruction stream as:

ld4 r21 = [r21]
ld4 r23 = [r22]
Krax
Occasional Visitor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

Hi John,

The OpenVMS debugger/analyzer marked this instruction with a small triangle on the left.

Thank you for confirming that it crashes before accessing the referenced data, i.e. it crashes while reading the pointer value, not the referenced data.

It is compiled with: HP C V7.2-022 on OpenVMS IA64 V8.3.
I am still trying to find out details about the system where it crashed. I will supply this information when I have it.

Another strange thing that I dug out of the analyzer:

DBG> show stack
Invocation block 0 Invocation handle 8786429167104
GP: 0 <====================
PC: XXX_MISC\queue_ev\%LINE 21093+14
RETURN PC: XXX_MISC\queue_event\%LINE 21058+14
SP: 2059492768
BSP: 8786429167104
Is register stack frame:
previous BSP: 8786429167056
CFM: 1040
Ins/Locals R32:R39 Outs R40:R47
Invocation block 1 Invocation handle 8786429167056
GP: 4980736
PC: XXX_MISC\tcap_queue_event\%LINE 21058+14
RETURN PC: XXX_TO\to_data\%LINE 21622+16
SP: 2059492768
BSP: 8786429167056
Is register stack frame:
previous BSP: 8786429166960
CFM: 650
Ins/Locals R32:R36 Outs R37:R41

Contents of GP is really... "interesting". R39 contains 0 as well.

Furthermore, R21 contains the value -2008616, which is FFFFFFFFFFE159D8 (the virtual address where the program crashes)

Any ideas?
H.Becker
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

>>>
DBG> show stack
Invocation block 0 Invocation handle 8786429167104
GP: 0 <====================
...
Furthermore, R21 contains the value -2008616, which is FFFFFFFFFFE159D8 (the virtual address where the program crashes)
<<<

Everything is fine except the contents of the GP. If you are wondering about the negative value in r21, that's because the add actually adds a negative number to zero. The compilers use short literals to get to addresses. Short means 22-bit sign-extended offsets, so add r21 = 2159D8, r1 is an add ffffffff.ffe159d8 to 0.

How the GP got zeroed is the question. John had some answers. The previous GP, 4980736 (Hex = 004C0000) looks like a nice GP of a main image.
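
The arithmetic above can be checked with a short C sketch. It assumes only what is stated in this thread: the literal in add r21 = 2159D8, r1 is a 22-bit value that is sign-extended before being added to r1. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 22-bit short literal to a signed 64-bit value. */
int64_t sign_extend_imm22(uint32_t imm22)
{
    assert(imm22 <= 0x3FFFFFu);      /* the literal must fit in 22 bits */
    int64_t value = (int64_t)imm22;
    if (imm22 & 0x200000u) {         /* bit 21 is the sign bit */
        value -= 0x400000;           /* subtract 2^22 to get the negative value */
    }
    return value;
}

/* Address produced by "add rX = imm22, r1" for a given GP (r1) value. */
uint64_t gp_relative_address(uint64_t gp, uint32_t imm22)
{
    return gp + (uint64_t)sign_extend_imm22(imm22);
}

/* Work backwards from a faulting virtual address to the GP that was used. */
uint64_t gp_from_fault(uint64_t fault_va, uint32_t imm22)
{
    return fault_va - (uint64_t)sign_extend_imm22(imm22);
}
```

With the values from this thread, sign_extend_imm22(0x2159D8) is -2008616, gp_relative_address(0, 0x2159D8) is 0xFFFFFFFFFFE159D8 (the failing VA), and gp_from_fault(0xFFFFFFFFFFE159D8, 0x2159D8) is 0, which matches the GP shown by SHOW STACK for the top frame. For comparison, a GP of 0x4C0000 would have produced the address 0x2D59D8.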

Any ASTs involved? Not that I think they are the problem, but it looks like there is more going on than just some C code calling some other C code.
Dennis Handly
Acclaimed Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Integrity with OpenVMS)

>Can be r1 overwritten for example by buffer overflow or something like that?

(My experience is for HP-UX.)
It would be very hard to do. The registers are typically saved in the RSE stack, not the user stack.
GP: 0
SP: 2059492768
BSP: 8786429167104
(It would be better if this was in hex.)

>Contents of GP is really... "interesting". R39 contains 0 as well.

This isn't good. r1 was bad when this function was called. You need to look there.

>Joseph: Why are You guessing and fiddling with machine code, better use the source language debugger!

This is a case to look for zebras, not horses. ;-)
A horse tool isn't likely to find why r1 is 0.

>John: since you can't reproduce it every time, it would be interesting to see what is really in R1

On HP-UX, I found a case where BOR calls weren't thread safe and corrupted R1. I was lucky to guess the cause of something that occurred very infrequently. Fixes had to be made in the compiler, linker and dld, plus an extra change in the linker to help detect whether the other two weren't fixed: setting a dummy GP of -1.

>I've never seen such a failure. The register backing store isn't anywhere near the rest of your data or the memory stack.

You haven't lived long enough. ;-)
I had an example where a move was reversed and shuffled the whole stack down.
Also a thread stack overflow would do it.

>H.Becker: Short means 22-bit sign-extended offsets, so add r21 = 2159D8, r1 is an add ffffffff.ffe159d8 to 0.

Actually it means the disassembler is broken. All literals should be sign-extended to 64 bits, to make them easier to understand.

>The previous GP, 0x004C0000, looks like a nice GP of a main image.

On HP-UX, when (possibly) transferring from one load module to another, the PC and GP are fetched from the Procedure Linkage Table and if this is overwritten, bad things can happen. But typically both are corrupted so you end up in an invalid location, not the right location but wrong GP.

Krax
Occasional Visitor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

Hi Dennis,

thank you for valuable comments.

The calling function is located in the same module as the function where the crash occurs. The output of "show stack" is in my previous post; you can see that GP was fine when queue_ev function was called.

Calling function looks very simple:

21054: void queue_event( T_MSG *pi_ar_imsg,
unsigned short pi_uw_event,
unsigned short pi_uw_process )
{
21058: queue_ev( pi_ar_imsg,
21059: pi_uw_event,
21060: pi_uw_process,
21061: 0,
21062: NULL );
21063: }

And related instructions:

21054: alloc r36 = ar.pfs, 05, 05, 00
21059: mov r8 = 00FFFF
21054: mov r35 = b0 ;;
21058: mov r41 = r0
: mov r40 = 000000
: mov r37 = r32
21060: and r39 = r34, r8 ;;
21059: and r38 = r33, r8
: nop.i 000000
: nop.m 000000
: nop.f 000000
21058: br.call.sptk.many b0 = 0000030 ;;
: nop.m 000000
21063: mov.i ar.pfs = r36 ;;
: mov b0 = r35
: nop.m 000000
: nop.f 000000
: br.ret.sptk.many b0 ;;

I don't see anything that could go wrong here or that could corrupt GP.

Some information about target system:

> tcpip sh ver

HP TCP/IP Services for OpenVMS Industry Standard 64 Version V5.6 - ECO 2 on an HP BL860c (1.59GHz/9.0MB) running OpenVMS V8.3-1H1

> product sh hist

------------------------------------ ----------- ----------- --- -----------
PRODUCT                              KIT TYPE    OPERATION   VAL DATE
------------------------------------ ----------- ----------- --- -----------
HP I64VMS OVPA V4.0-37               Full LP     Install     (U) 13-FEB-2009
HP I64VMS OVPA V4.0-37               Full LP     Install     (U) 11-FEB-2009
HP I64VMS OVPA V4.0-37               Full LP     Install     (U) 30-JAN-2009
HP I64VMS OVPA V4.0-37               Full LP     Install     (U) 30-JAN-2009
HP I64VMS VMSSPI V8.0-1              Full LP     Install     Val 23-OCT-2008
HP I64VMS OVEAAGT V8.0-1             Full LP     Install     Val 23-OCT-2008
HP I64VMS OVCTRL V8.0-1              Full LP     Install     Val 23-OCT-2008
HP I64VMS OVDEPL V8.0-1              Full LP     Install     Val 23-OCT-2008
HP I64VMS OVSECCC V8.0-1             Full LP     Install     Val 23-OCT-2008
HP I64VMS OVBBC V8.0-1               Full LP     Install     Val 23-OCT-2008
HP I64VMS OVSECCO V8.0-1             Full LP     Install     Val 23-OCT-2008
HP I64VMS OVCONF V8.0-1              Full LP     Install     Val 23-OCT-2008
HP I64VMS OVXPL V8.0-1               Full LP     Install     Val 23-OCT-2008
HP I64VMS SEA V5.1                   Full LP     Install     (U) 06-JUN-2008
HP I64VMS WEBES V5.1                 Platform    Install     (U) 06-JUN-2008
HP I64VMS WCCPROXY V2.1              Full LP     Install     (U) 06-JUN-2008
HP I64VMS VMS831H1I_UPDATE V1.0      Patch       Install     Val 25-MAR-2008
HP I64VMS AVAIL_MAN_BASE V8.3-1H1    Full LP     Install     (U) 06-NOV-2007
HP I64VMS CDSA V2.3-306              Full LP     Install     Val 06-NOV-2007
HP I64VMS DECNET_PLUS V8.3-1H1       Full LP     Install     Val 06-NOV-2007
HP I64VMS DWMOTIF_SUPPORT V8.3-1H1   Full LP     Install     (U) 06-NOV-2007
HP I64VMS KERBEROS V3.1-152          Full LP     Install     Val 06-NOV-2007
HP I64VMS OPENVMS V8.3-1H1           Platform    Install     Sys 06-NOV-2007
HP I64VMS TCPIP V5.6-9ECO2           Full LP     Install     Val 06-NOV-2007
HP I64VMS TDC_RT V2.3-1              Full LP     Install     Val 06-NOV-2007
HP I64VMS VMS V8.3-1H1               Oper System Install     Sys 06-NOV-2007
HP I64VMS WBEMCIM V2.61-A070728      Full LP     Install     Val 06-NOV-2007
HP I64VMS WBEMPROVIDERS V1.5-31      Full LP     Install     Val 06-NOV-2007
HP I64VMS DWMOTIF_ECO01 V1.6         Patch       Install     Val 04-APR-2007
JFP I64VMS PYTHON250 V1.18-0         Full LP     Install     (U) 02-MAR-2007
JFP I64VMS ZLIB V1.2-3               Full LP     Install     (U) 02-MAR-2007
JFP I64VMS LIBBZ2 V1.0-2             Full LP     Install     (U) 02-MAR-2007
HP I64VMS MSA_UTIL V1.0-1            Full LP     Install     Val 05-FEB-2007
HP I64VMS DFU V3.1-1                 Full LP     Install     (U) 17-AUG-2006
HP I64VMS DWMOTIF V1.6               Full LP     Install     Val 14-AUG-2006
HP I64VMS SSL V1.3-284               Full LP     Install     Val 14-AUG-2006

H.Becker
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

>>>
On HP-UX, when (possibly) transferring from one load module to another, the
PC and GP are fetched from the Procedure Linkage Table and if this is
overwritten, bad things can happen. But typically both are corrupted so you
end up in an invalid location, not the right location but wrong GP.
<<<
Later info states that the shown functions are in one module. On VMS this is usually a source or object module, which is linked into one image. On VMS, when transferring from one image to another image PC and GP are fetched from function descriptors in short data. By default function descriptors live in read-only memory. Only explicitly linking with /segment=short=write makes them (and other linker generated short data) writable.
Dennis Handly
Acclaimed Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

>you can see that GP was fine when queue_ev function was called.

On HP-UX GP isn't part of the unwind info so it can't be restored. While doing exception handling, the unwinder can search the module table by PC and then get a value. Perhaps the debugger is displaying what it should be and, for the top frame, what's actually there?

>Calling function looks very simple:

Yes, very simple.

> 21058: br.call.sptk.many b0 = 0000030 ;;

You might want to display the value of GP before the call and single step at the instruction level to watch how GP changes.

Is it actually going to 0x30, or is this a broken disassembler that doesn't show the actual target, just the offset?

>H.Becker: By default function descriptors live in read-only memory.

On HP-UX this can't occur since these can be dynamically changed at loadtime and the dynamic loader needs to modify them.
H.Becker
Honored Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

>>>
On HP-UX this can't occur since these can be dynamically changed at loadtime and the dynamic loader needs to modify them.
<<<
On VMS, the corresponding changes (fixups and relocations) are done in inner mode. Before transferring control to the program, the pages are set to read-only.

>>>
On HP-UX GP isn't part of the unwind info so it can't be restored. While doing exception handling, the unwinder can search the module table by PC and then get a value. Perhaps the debugger is displaying what it should be and, for the top frame, what's actually there?
<<<
This isn't different on VMS, the GP is not in the unwind info. I don't know what the debugger shows, but it seems to behave as you describe.

It looks like it uses r1 for the current call stack and the (image) GP for the other call stacks in this image. A simple example where I zeroed r1 before calling other functions supports this assumption: SHOW STACK only shows the zero for the current, top call stack.

So the GP of the caller looks good in the SHOW STACK output, but may already be zero for the caller.
Dennis Handly
Acclaimed Contributor

Re: Can be r1 (gp) register overwritten by error in code? (run on Itanium with OpenVMS)

>H.Becker: Before transferring control to the program, the pages are set to read-only.

Ah, we have an option +protect to do that. It would have to align on page boundaries to do mprotect(2), so it isn't done by default.

>So the GP of the caller looks good in the SHOW STACK output, but may already be zero for the caller.

So the debugger is showing you what you want to see but not what it is. :-)
So looking at the caller and the caller's caller may be helpful.
(gdb's info shared will show the GPs and addresses for each shlib.)