Operating System - OpenVMS

ILLEGAL_SHADOW error in C, casting NaN to unsigned int

 
Craig A Berry
Honored Contributor


Running the small C program in the attached listing gives the rather spectacular crash indicated below. If I compile /NOOPT, the problem goes away, which leads me to believe the optimiser is getting confused somehow. I only see this on Alpha (with CC v7.1 as well as v7.3) but not on Itanium.

According to the traceback, the line where the error occurs is this one:

return (unsigned int) d;

but if I remove the preceding if block (which does not modify d in any way), the crash doesn't happen, so the traceback info is suspect. Running under the debugger also prevents the error from occurring, so I'm a bit stumped.

Whether it makes sense to cast a double holding an IEEE NaN into an unsigned int is an interesting question, but this code is found, not made (i.e., I didn't write it), and I'm stuck with it whether what it's doing makes sense or not. I'm open to suggestion about whether the compiler is doing something wrong or the code is doing something wrong or some combination thereof.

$ cc/vers
HP C V7.3-009 on OpenVMS Alpha V8.3
$ cc/float=ieee/ieee=denorm/list/show=expansion/machine nan
$ link/trace nan
$ run nan
%SYSTEM-F-ILLEGAL_SHADOW, illegal formed trap shadow, Imask=00000000, Fmask=00008000, summary=03, PC=0000000000020098, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
image module routine line rel PC abs PC
NAN NAN cast_uv 1824 0000000000000098 0000000000020098
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000100000000, PC=0000000100000000, PS=0000001B

Improperly handled condition, image exit forced.
Signal arguments: Number = 0000000000000005
Name = 000000000000000C
0000000000010000
0000000100000000
0000000100000000
000000000000001B

Register dump:
R0 = 0000000000000001 R1 = 0000000000000001 R2 = 000000007BF7C590
R3 = 000000007ADDF2F0 R4 = 000000007ADDF2E0 R5 = 000000007ADDF2C8
R6 = 000000007ADDF360 R7 = FFFFFFFF81D4CD20 R8 = 000000007FF9CDE8
R9 = 000000007FF9DDF0 R10 = 000000007FFA4F28 R11 = 000000007FFCDC18
R12 = 000000007FFCDA98 R13 = FFFFFFFF81D4D1F0 R14 = 0000000000000000
R15 = 000000007AEE2670 R16 = 0000000000000EE0 R17 = FFFFFFFF77773700
R18 = 0000000100044D18 R19 = 000000007ADDF030 R20 = 0000000000000729
R21 = 000000007B67C848 R22 = 0000000100044CD8 R23 = 000000007ADDF020
R24 = 0000000000000000 R25 = 0000000000000001 R26 = 0000000100000002
R27 = 000000007B63F590 R28 = 000000007BF90438 R29 = 000000007ADDEFF0
SP = 000000007ADDEFF0 PC = 0000000100000000 PS = 300000000000001B
Hoff
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Try returning an integer, rather than a float or double cast, as a test.

I'm wondering if the attempt to cast the NaN is what's nailing the sequence. (The error that's signaled certainly points this way; trying to use a NaN...)

That, and the other part that's a little odd here is at the end of the main function; falling off the end can tend to spew whatever value was in R0 last as the final status.
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Thanks, I actually had already tried that modification, specifically this change from what I posted before:

$ gdiff -pu0 nan.c;-2 nan.c
--- nan.c;-2 Sat Dec 15 10:40:33 2007
+++ nan.c Sat Dec 15 15:15:47 2007
@@ -9,0 +10,2 @@ cast_uv(double d)
+ unsigned int u;
+
@@ -11 +13 @@ cast_uv(double d)
- return d < IV_MIN
+ u = d < IV_MIN
@@ -17 +19 @@ cast_uv(double d)
- return (unsigned int) d;
+ u = (unsigned int) d;
@@ -18,0 +21 @@ cast_uv(double d)
+ return u;
[end of diff]

The shadow error is not triggered for what should be code with identical behavior. We still cast a NaN to an unsigned int, the only difference being the result of the cast is now stored in a local variable and that variable is returned rather than the result of an expression being returned directly. Whether it can be said to "work" is an open question, since what it means to give the following value as the unsigned int representation of a NaN is difficult to say:

$ run nan
2079679152

As far as the main() function not having an explicit exit(), that shouldn't be necessary, and looking at the machine listing you can see that it moves 1 into R0. The case that blows up does so before it gets anywhere near that far anyway.

BTW, I forgot to mention before that the ACCVIO occurs during the traceback. If I link /NOTRACE, the ACCVIO does not occur.
John Gillings
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

Debugging this type of error is the art of not looking where you think you should be looking.

The "trap shadow" has to do with pipelining instructions. Floating point instructions in particular can take many cycles to complete, so between issuing the instruction and finding some error, other instructions may have been issued or completed. The "TRAP_SHADOW" is the range of instructions between the failing instruction and the current one. To assist debugging, there are structures which (hopefully) point to the real culprit.

My guess is the cast is generating instructions that are dealing with the same object as both floating point and as integer. The processor thinks they can be executed in parallel, but they can't. Somehow that's messing up the trap shadow structures.

Look at the instruction stream around the reported error, or at least where you think it's happening. Work backwards, looking for floating point operations that might fail.


This won't happen on Itanium because there it's the compiler doing any pipelining, not the processor. There is no trap shadow, so it can't be illegally formed. (That's the fundamental architectural difference between Alpha, which is RISC, and Itanium, which is EPIC.)

A crucible of informative mistakes
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Thanks for the explanation, John. That makes a lot of sense, but at the same time is rather over my head. When you say to look at the instruction stream, I assume you mean more than just look at the machine listing, which I would guess is serialized for human consumption, or at least I don't know how to assess what in it is parallelized.

I do have one other observation. In the if block that looks like this:

if (d < UV_MAX_P1) {
    return (unsigned int) d;
}

we should never hit the cast-and-return line when d is a NaN, because I think any comparison with a NaN is supposed to be false. Stepping through the non-optimized version with the debugger confirms that we don't get there; we only reach that line when the code is compiled with the default optimizations turned on.
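As a small sketch of the rule being relied on here: under IEEE 754 semantics the ordered comparisons (<, <=, >, >=) and == are all false when either operand is a NaN, while != is true. The helper names and the UV_MAX_P1 value below are assumptions for illustration (a 32-bit unsigned int is assumed); they are not taken from the attached listing.

```c
#include <math.h>

/* Assumed value: UINT_MAX + 1 for a 32-bit unsigned int. */
#define UV_MAX_P1 4294967296.0

/* Returns 1 if d compares below limit, 0 otherwise.
   With IEEE semantics a NaN operand makes the comparison false. */
static int is_below(double d, double limit)
{
    return d < limit;
}

/* Returns 1 only for a NaN: it is the one value that is not equal to itself. */
static int differs_from_itself(double d)
{
    return d != d;
}
```

So `is_below(NAN, UV_MAX_P1)` is expected to be 0, which is why the guarded cast should be unreachable for a NaN when the comparison is performed as written. Note that options that relax IEEE conformance may change this.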
Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

this is a nice problem ;-)

I will try to share what I know about ILLEGAL_SHADOW traps, as I once worked an IPMT involving such a thing and also found a problem causing an ILLEGAL_SHADOW trap in an earlier version of the PersonalAlpha emulator.

If your code incurs an ILLEGAL_SHADOW exception, you need to work backwards in the Alpha instruction stream from the TRAPB (Trap Barrier) instruction (pointed to by the exception PC) to the first instruction found using the /S qualifier (requesting software completion).

CMPTLT/SU F16, F14, F15
FCMOVNE F18, F16, F18
TRAPB <- exception PC points here

The Imask and Fmask provide a bit for each register, which was a target of any instruction issued inside the 'trap shadow'.

The exception summary bits: summary=03 indicate:

bit 0 = SWC (Software Completion)
bit 1 = INV (Invalid Operation)

In this case, Fmask=00008000 points to F15 being a target register and therefore identifies the CMPTLT/SU F16, F14, F15 instruction as the one causing the trap.

summary=03 indicates an INV; this bit is set when one of the operands has an illegal value.

The CMPTLT (IEEE Floating Compare) instruction will trap if one of the input operands (F16 or F14) is a NaN. In this case it's F16.
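The decoding described above can be sketched in a few lines of C. This only mirrors the arithmetic in this post: one mask bit per register number, and the two summary bits named above. The function and macro names are made up for illustration, and other summary bits are deliberately left undecoded.

```c
/* Summary bits as described in the post; names are illustrative. */
#define SUMMARY_SWC 0x1u  /* bit 0: software completion */
#define SUMMARY_INV 0x2u  /* bit 1: invalid operation */

/* Stores the numbers of the registers whose bits are set in mask
   into regs[] (room for 32 entries) and returns how many were found. */
static int mask_to_registers(unsigned long mask, int regs[32])
{
    int count = 0;
    for (int bit = 0; bit < 32; bit++) {
        if (mask & (1ul << bit))
            regs[count++] = bit;
    }
    return count;
}

/* Returns 1 if the given summary bit is set, 0 otherwise. */
static int summary_has(unsigned int summary, unsigned int flag)
{
    return (summary & flag) != 0;
}
```

For the values in the error message, `mask_to_registers(0x00008000ul, regs)` finds a single register, number 15 (that is, F15), and `summary_has(0x03u, SUMMARY_INV)` is true.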

Software completion is handled by the operating system, in this case by [SYS]IEEE_INST. If this handler believes there is an inconsistency in the instruction stream preceding the TRAPB instruction that declared the exception (there are lots of rules for a trap shadow to be valid), or if it incurs any other error while checking this, it will signal the ILLEGAL_SHADOW trap. So this is all done by software!

I also get 'interesting' results if I run the SAME NAN.EXE on different versions of OpenVMS and on real (or emulated) Alphas. As the ILLEGAL_SHADOW is being reported from the exec, you also need to include the version of EXCEPTION.EXE as an additional 'parameter' to this problem.

Running NAN.EXE on an AlphaServer 1000A with OpenVMS V8.2, I get:

AXPVMS $ run nan
%SYSTEM-F-ILLEGAL_SHADOW, illegal formed trap shadow, Imask=00000000, Fmask=00008000, summary=03, PC=0000000000020098, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
image module routine line rel PC abs PC
NAN NAN cast_uv 1824 0000000000000098 0000000000020098
NAN NAN main 1833 00000000000001DC 00000000000201DC
NAN NAN __main 1829 0000000000000174 0000000000020174
0 FFFFFFFF8031DF94 FFFFFFFF8031DF94

Note: no ACCVIO during traceback handling !

On a PersonalAlpha (V1.2.2) OpenVMS V8.3 with VMS83A_UPDATE-V0400, I get:
CHAALP $ run nan
2079916720


This seems to be an interesting corner case, and no one except maybe HP OpenVMS engineering has a good chance of solving this mystery. You may also need to read up in the Alpha Architecture Reference Manual to at least get an idea of what may be happening.

Volker.
Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

there is a path through your cast_uv routine that does NOT return a value:

...
    if (d < UV_MAX_P1) {
        return (unsigned int) d;
    }
    return (unsigned int) d; /* also return a value here */
}
...

When adding this 'fix', I get reliable results and no ILLEGAL_SHADOW or ACCVIO anymore.

When analyzing the machine code flow through cast_uv, I found a path that does not load R0 and so returns a bogus value in R0. This explains why I got different printed values from the printf when run on different machines.

Volker.
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Volker,

Thanks for your replies and for taking the time to do your own testing. You are quite right about the cast_uv function in my example not returning a valid value when either of the two if blocks in it evaluates to false. I now think this is what Hoff meant by stepping off the end of the main function (I had thought he meant the function main(), but now think he just meant the primary function in the example, which is cast_uv).

The reason there is no else clause or fallback return statement in the example is that I deleted it in trying to reduce the example to the smallest possible reproducer; code you've deleted can't be causing the problem. It's a red herring in this case, though I did allow it to confuse me.

It is true that if I put your fallback return statement:

return (unsigned int) d;

as the last statement in the function, the illegal shadow problem goes away. However, if instead of your fallback statement I restore the original one I deleted (for which you'd need to include limits.h):

return d > 0 ? UINT_MAX : 0;

the exact same problem is still there as in my original example.

There appear to be any number of ways to rewrite the function such that it dodges the illegal shadow problem, but then how to know that the next innocent edit won't trigger it again? I'm more convinced than ever that the function as written is legal (if a bit strange) yet triggers pathological behavior when optimized.
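For what it's worth, standard C leaves the conversion undefined when the value cannot be represented in the target integer type, which includes a NaN. One way to stay clear of that, sketched below, is to test for NaN explicitly before any comparison or cast and to return on every path. This is a hypothetical helper, not the Perl source: the name cast_uv_checked is made up, the NaN-to-0 policy mirrors the `d > 0 ? UINT_MAX : 0` fallback quoted above, and the negative branch is an assumption modeled on the `d < IV_MIN` line visible in the earlier diff. Whether it also avoids the Alpha optimizer problem is untested.

```c
#include <limits.h>
#include <math.h>

/* Hypothetical sketch: converts a double to unsigned int without ever
   converting a NaN or an out-of-range value. */
static unsigned int cast_uv_checked(double d)
{
    /* First value too large to convert (exact for a 32-bit unsigned int). */
    const double uv_max_p1 = (double)UINT_MAX + 1.0;

    if (isnan(d))
        return 0;                        /* assumed policy: NaN maps to 0 */
    if (d < 0.0) {
        if (d < (double)INT_MIN)
            return (unsigned int)INT_MIN; /* clamp very negative values */
        return (unsigned int)(int)d;      /* in-range negatives wrap via int */
    }
    if (d < uv_max_p1)
        return (unsigned int)d;           /* in range: plain truncation */
    return UINT_MAX;                      /* saturate on overflow */
}
```

With this shape, `cast_uv_checked(42.9)` gives 42, `cast_uv_checked(-1.0)` gives UINT_MAX, and a NaN gives 0 without the NaN ever reaching a comparison that decides a cast.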

For the curious the original comes from the Perl sources and can be seen by hunting for "Perl_cast_uv" here:

http://public.activestate.com/cgi-bin/perlbrowse/f/numeric.c

John Gillings
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

>The reason there is no else clause ...
>
>code you've deleted can't be causing the
>problem. It's a red herring in this case,
>though I did allow it to confuse me.

Don't be so sure! In the world of heavily optimised and pipelined processors there's a concept of "speculative execution". That is, the pipeline may be busily processing BOTH sides of a conditional before (or while) the test is evaluated. When the test result is known, the result for the other branch is discarded. This can avoid a true branch operation (which tends to slow down the pipe). There are obvious things like function side effects that cannot be done like this - the optimiser will know.
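A source-level analogy of "compute both sides, then keep one" might look like the sketch below. It is only an illustration of the idea, not what the compiler or the processor actually does: real speculation happens in hardware or in the optimizer (the FCMOVNE in the listing earlier in this thread is a conditional move of that general flavour). All names here are invented, and the counter exists only to make the difference observable.

```c
/* Counts how many arms were actually computed. */
static int evaluations;

static unsigned int arm(unsigned int value)
{
    evaluations++;
    return value;
}

/* Branching form: only the selected arm is computed. */
static unsigned int pick_branching(int cond)
{
    if (cond)
        return arm(1u);
    return arm(2u);
}

/* Select form: both arms are computed first, then one result is kept
   and the other is discarded. */
static unsigned int pick_select(int cond)
{
    unsigned int if_true = arm(1u);
    unsigned int if_false = arm(2u);
    return cond ? if_true : if_false;
}
```

Both functions return the same value for a given cond, but pick_select bumps the counter twice per call. If computing the discarded arm can raise an exception, as a floating-point operation on a NaN can, that exception has to be dealt with even though its result is never used.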

One of the potential consequences is handling exceptions for non-taken branches! As well as creating some interesting cases when debugging.

I'm not sure how much this is used by which Alpha processor versions, but it might explain the differences between different systems Volker observed. Note that you may see even more of this type of thing on Itanium.

As processors grow more threads, cores, pipelines and execution units, your code no longer can be seen as a strict linear sequence of operations. Compilers and processors get more dependent on the "complete" correctness of the code.
A crucible of informative mistakes
Willem Grooters
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Please forgive my ignorance - I have no C experience - but IMHO the cause is casting a "NaN", in other words doing something with an uninitialized variable. I've been taught to try to prevent this type of condition (and (blush) tend to forget about it)...

WG
Willem Grooters
OpenVMS Developer & System Manager