Operating System - OpenVMS

ILLEGAL_SHADOW error in C, casting NaN to unsigned int

 
Craig A Berry
Honored Contributor

Running the small C program in the attached listing gives the rather spectacular crash indicated below. If I compile /NOOPT, the problem goes away, which leads me to believe the optimiser is getting confused somehow. I only see this on Alpha (with CC v7.1 as well as v7.3) but not on Itanium.

According to the traceback, the line where the error occurs is this one:

return (unsigned int) d;

but if I remove the preceding if block (which does not modify d in any way), the crash doesn't happen, so the traceback info is suspect. Running under the debugger also prevents the error from occurring, so I'm a bit stumped.

Whether it makes sense to cast a double holding an IEEE NaN into an unsigned int is an interesting question, but this code is found, not made (i.e., I didn't write it), and I'm stuck with it whether what it's doing makes sense or not. I'm open to suggestion about whether the compiler is doing something wrong or the code is doing something wrong or some combination thereof.

$ cc/vers
HP C V7.3-009 on OpenVMS Alpha V8.3
$ cc/float=ieee/ieee=denorm/list/show=expansion/machine nan
$ link/trace nan
$ run nan
%SYSTEM-F-ILLEGAL_SHADOW, illegal formed trap shadow, Imask=00000000, Fmask=00008000, summary=03, PC=0000000000020098, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
image module routine line rel PC abs PC
NAN NAN cast_uv 1824 0000000000000098 0000000000020098
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual address=0000000100000000, PC=0000000100000000, PS=0000001B

Improperly handled condition, image exit forced.
Signal arguments: Number = 0000000000000005
Name = 000000000000000C
0000000000010000
0000000100000000
0000000100000000
000000000000001B

Register dump:
R0 = 0000000000000001 R1 = 0000000000000001 R2 = 000000007BF7C590
R3 = 000000007ADDF2F0 R4 = 000000007ADDF2E0 R5 = 000000007ADDF2C8
R6 = 000000007ADDF360 R7 = FFFFFFFF81D4CD20 R8 = 000000007FF9CDE8
R9 = 000000007FF9DDF0 R10 = 000000007FFA4F28 R11 = 000000007FFCDC18
R12 = 000000007FFCDA98 R13 = FFFFFFFF81D4D1F0 R14 = 0000000000000000
R15 = 000000007AEE2670 R16 = 0000000000000EE0 R17 = FFFFFFFF77773700
R18 = 0000000100044D18 R19 = 000000007ADDF030 R20 = 0000000000000729
R21 = 000000007B67C848 R22 = 0000000100044CD8 R23 = 000000007ADDF020
R24 = 0000000000000000 R25 = 0000000000000001 R26 = 0000000100000002
R27 = 000000007B63F590 R28 = 000000007BF90438 R29 = 000000007ADDEFF0
SP = 000000007ADDEFF0 PC = 0000000100000000 PS = 300000000000001B
Hoff
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Try returning an integer, rather than a float or double cast, as a test.

I'm wondering if the attempt to cast the NaN is what's nailing the sequence. (The error that's signaled certainly points this way; trying to use a NaN...)

That, and the other part that's a little odd here is at the end of the main function; falling off the end can tend to spew whatever value was in R0 last as the final status.
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Thanks, I actually had already tried that modification, specifically this change from what I posted before:

$ gdiff -pu0 nan.c;-2 nan.c
--- nan.c;-2 Sat Dec 15 10:40:33 2007
+++ nan.c Sat Dec 15 15:15:47 2007
@@ -9,0 +10,2 @@ cast_uv(double d)
+ unsigned int u;
+
@@ -11 +13 @@ cast_uv(double d)
- return d < IV_MIN
+ u = d < IV_MIN
@@ -17 +19 @@ cast_uv(double d)
- return (unsigned int) d;
+ u = (unsigned int) d;
@@ -18,0 +21 @@ cast_uv(double d)
+ return u;
[end of diff]

The shadow error is not triggered for what should be code with identical behavior. We still cast a NaN to an unsigned int, the only difference being the result of the cast is now stored in a local variable and that variable is returned rather than the result of an expression being returned directly. Whether it can be said to "work" is an open question, since what it means to give the following value as the unsigned int representation of a NaN is difficult to say:

$ run nan
2079679152

As far as the main() function not having an explicit exit(), that shouldn't be necessary, and looking at the machine listing you can see that it moves 1 into R0. The case that blows up does so before it gets anywhere near that far anyway.

BTW, I forgot to mention before that the ACCVIO occurs during the traceback. If I link /NOTRACE, the ACCVIO does not occur.
John Gillings
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

Debugging this type of error is the art of not looking where you think you should be looking.

The "trap shadow" has to do with pipelining instructions. Floating point instructions in particular can take many cycles to complete, so between issuing the instruction and finding some error, other instructions may have been issued or completed. The "TRAP_SHADOW" is the range of instructions between the failing instruction and the current one. To try to assist debugging there are structures which (hopefully) point to the real culprit.

My guess is the cast is generating instructions that are dealing with the same object as both floating point and as integer. The processor thinks they can be executed in parallel, but they can't. Somehow that's messing up the trap shadow structures.

Look at the instruction stream around the reported error, or at least where you think it's happening. Work backwards, looking for floating point operations that might fail.


This won't happen on Itanium because it's the compiler doing any pipelining, not the processor. There is no trap shadow, so it can't be illegally formed (that's the fundamental architectural difference between Alpha - RISC and Itanium - EPIC)

A crucible of informative mistakes
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Thanks for the explanation, John. That makes a lot of sense, but at the same time is rather over my head. When you say to look at the instruction stream, I assume you mean more than just look at the machine listing, which I would guess is serialized for human consumption, or at least I don't know how to assess what in it is parallelized.

I do have one other observation. In the if block that looks like this:

if (d < UV_MAX_P1) {
return (unsigned int) d;
}

we should never hit the cast-and-return line when d is a NaN because I think any comparison with a NaN is supposed to be false, and stepping through the non-optimized version with the debugger confirms that we don't get there except when compiled with default optimizations turned on.
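To make the "any comparison with a NaN is false" point concrete, here is a minimal sketch in plain C. The function names are made up for illustration and are not part of the attached listing; the point is only that a NaN falls through two consecutive less-than checks, so the cast-and-return line should be unreachable for it.

```c
#include <math.h>

/* Hypothetical helper: returns 1 when d passes both range checks
   without satisfying either of them, 0 otherwise. */
int falls_through(double d, double lo, double hi)
{
    if (d < lo) {
        return 0;   /* first if block taken */
    }
    if (d < hi) {
        return 0;   /* second if block taken */
    }
    return 1;       /* neither comparison was true */
}

/* Hypothetical helper: a NaN is the only value that is "unordered". */
int is_unordered(double d)
{
    return isnan(d) != 0;
}
```

With IEEE semantics, `falls_through(NAN, 0.0, 4294967296.0)` yields 1, while an ordinary value such as 1.0 yields 0.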
Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

this is a nice problem ;-)

I will try to share what I know about ILLEGAL_SHADOW traps, as I once worked an IPMT involving one and also found a problem causing an ILLEGAL_SHADOW trap in an earlier version of the PersonalAlpha emulator.

If your code incurs an ILLEGAL_SHADOW exception, you need to work backwards in the Alpha instruction stream from the TRAPB (Trap Barrier) instruction (pointed to by the exception PC) to the first instruction found using the /S qualifier (requesting software completion).

CMPTLT/SU F16, F14, F15
FCMOVNE F18, F16, F18
TRAPB <- exception PC points here

The Imask and Fmask provide a bit for each register, which was a target of any instruction issued inside the 'trap shadow'.

The exception summary bits: summary=03 indicate:

bit 0 = SWC (Software Completion)
bit 1 = INV (Invalid Operation)

In this case, Fmask=00008000 points to F15 being a target register and therefore identifies the CMPTLT/SU F16, F14, F15 instruction as the one causing the trap.

summary=03 indicates an INV, this bit is set, when one of the operands has an illegal value.

The CMPTLT (IEEE Floating Compare) instruction will trap, if one of the input operands (F16 or F14) is a NaN. In this case it's F16.
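The decoding above can be sketched in a few lines of C. This is only an illustration of the bit arithmetic: the function and macro names are invented here, and only the two summary bits named in this post (bit 0 = SWC, bit 1 = INV) are modelled.

```c
#include <stdint.h>

/* Summary bits as described in this post; names are hypothetical. */
#define EXC_SUM_SWC UINT32_C(0x1)   /* bit 0: software completion */
#define EXC_SUM_INV UINT32_C(0x2)   /* bit 1: invalid operation   */

/* Return the lowest register number whose bit is set in an
   Imask/Fmask value, or -1 if the mask is empty. */
int first_target_register(uint32_t mask)
{
    for (int reg = 0; reg < 32; reg++) {
        if (mask & (UINT32_C(1) << reg)) {
            return reg;
        }
    }
    return -1;
}

/* Return 1 if the given summary bit is set, 0 otherwise. */
int summary_has(uint32_t summary, uint32_t bit)
{
    return (summary & bit) != 0;
}
```

For the values in the error message, `first_target_register(0x00008000)` gives 15 (F15), `first_target_register(0x00000000)` gives -1 (no integer register targeted), and summary=03 has both SWC and INV set.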

The software completion is to be handled by the operating system, in this case [SYS]IEEE_INST. If this handler believes there is an inconsistency (there are lots of rules for a trap shadow to be valid) in the instruction stream preceding the TRAPB instruction that declared the exception, or it incurs any other error while checking this, it will signal the ILLEGAL_SHADOW trap. So this is all done by software!

I also get 'interesting' results if I run the SAME NAN.EXE on different versions of OpenVMS and real (or emulated) Alphas. As the ILLEGAL_SHADOW is being reported from the exec, you also need to include the version of EXCEPTION.EXE as an additional 'parameter' to this problem.

Running NAN.EXE on an AlphaServer 1000A with OpenVMS V8.2, I get:

AXPVMS $ run nan
%SYSTEM-F-ILLEGAL_SHADOW, illegal formed trap shadow, Imask=00000000, Fmask=00008000, summary=03, PC=0000000000020098, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
image module routine line rel PC abs PC
NAN NAN cast_uv 1824 0000000000000098 0000000000020098
NAN NAN main 1833 00000000000001DC 00000000000201DC
NAN NAN __main 1829 0000000000000174 0000000000020174
0 FFFFFFFF8031DF94 FFFFFFFF8031DF94

Note: no ACCVIO during traceback handling !

On a PersonalAlpha (V1.2.2) OpenVMS V8.3 with VMS83A_UPDATE-V0400, I get:
CHAALP $ run nan
2079916720


This seems to be an interesting corner case, and no one except maybe HP OpenVMS engineering has a good chance of solving this mystery. You may also need to read up in the Alpha Architecture Reference Manual to at least get an idea of what may be happening.

Volker.
Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

there is a path through your cast_uv routine, which does NOT return a status value:

...
if (d < UV_MAX_P1) {
return (unsigned int) d;
}
return (unsigned int) d; /* also return a value here */
}
...

When adding this 'fix', I get reliable results and no ILLEGAL_SHADOW or ACCVIO anymore.

When analyzing the machine code flow through cast_uv, I found a path, which does not load R0 and so returned a bogus value for R0. This explains, why I got different printed values from the printf when run on different machines.

Volker.
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Volker,

Thanks for your replies and for taking the time to do your own testing. You are quite right about the cast_uv function in my example not returning a valid value when either of the two if blocks in it evaluates to false. I now think this is what Hoff meant by stepping off the end of the main function (I had thought he meant the function main(), but now think he just meant the primary function in the example, which is cast_uv).

The reason there is no else clause or fallback return statement in the example is that I deleted it in trying to reduce the example to the smallest possible reproducer; code you've deleted can't be causing the problem. It's a red herring in this case, though I did allow it to confuse me.

It is true that if I put your fallback return statement:

return (unsigned int) d;

as the last statement in the function, the illegal shadow problem goes away. However, if, instead of your fallback statement, I restore the original one I deleted (for which you'd need to include limits.h):

return d > 0 ? UINT_MAX : 0;

the exact same problem is still there as in my original example.

There appear to be any number of ways to rewrite the function such that it dodges the illegal shadow problem, but then how to know that the next innocent edit won't trigger it again? I'm more convinced than ever that the function as written is legal (if a bit strange) yet triggers pathological behavior when optimized.

For the curious the original comes from the Perl sources and can be seen by hunting for "Perl_cast_uv" here:

http://public.activestate.com/cgi-bin/perlbrowse/f/numeric.c

John Gillings
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

>The reason there is no else clause ...
>
>code you've deleted can't be causing the
>problem. It's a red herring in this case,
>though I did allow it to confuse me.

Don't be so sure! In the world of heavily optimised and pipelined processors there's a concept of "speculative execution". That is, the pipeline may be busily processing BOTH sides of a conditional before (or while) the test is evaluated. When the test result is known the result for the other branch is discarded. This can avoid a true branch operation, (which tend to slow down the pipe). There are obvious things like function side effects that cannot be done like this - the optimiser will know.

One of the potential consequences is handling exceptions for non-taken branches! As well as creating some interesting cases when debugging.

I'm not sure how much this is used by which Alpha processor versions, but it might explain the differences between different systems Volker observed. Note that you may see even more of this type of thing on Itanium.

As processors grow more threads, cores, pipelines and execution units, your code no longer can be seen as a strict linear sequence of operations. Compilers and processors get more dependent on the "complete" correctness of the code.
A crucible of informative mistakes
Willem Grooters
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Please forgive me my ignorance - I have no C experience - but IMHO the cause is casting a "NaN" - in other words: do something with an uninitialized variable. I've been educated to try to prevent this type of condition (and (blush) tend to forget about it)...

WG
Willem Grooters
OpenVMS Developer & System Manager
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

John, I perhaps wasn't clear in expressing that I still see the error after deleting the final fallback return, which is why I omitted it from my original reproducer. But your point is well taken since if I omit either of the remaining if blocks, things don't blow up, but with both of them there together, the illegal trap shadow happens.

Willem, casting a NaN to an int is a bit weird, but it works ok by itself, which I think has as much to do with IEEE floating-point semantics as it does with C. There is a nice article on IEEE floating point here for the mathematically inclined:

http://docs.sun.com/source/806-3568/ncg_goldberg.html

The context of the code that generates the error is that Perl is a dynamic language and whether a variable is a string or a number and whether a number is an integer or floating point are things that get determined on the fly. This involves asking a lot of questions of the form, "Can this chunk of memory be treated as ...?" Even if the answer is no, the question still has to be asked, which leads to some interesting conversion attempts such as the one that triggered the error.
Dennis Handly
Acclaimed Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

The results on PA and HP-UX IPF indicate that the function cast_uv is missing a return.

In fact it seems the OVMS compiler may be broken because no compares with NaN should be true and it should fall off cast_uv as Volker points out.

>Volker: When analyzing the machine code flow through cast_uv,

No analysis is needed. A real compiler should have told you that. :-)
warning #2940-D: missing return statement at end of non-void function "cast_uv"

>when either of the two if-blocks in it evaluates to false.

These should always evaluate to false if d is a NaN.

>WG: the cause is casting a "NaN" - in other words: do something with an uninitialized variable.

For IPF, the hardware says the result is a long long 0x800000000000000LL. PA-RISC handles it in a kernel trap handler but with a completely different value.

>casting a NaN to an int is a bit weird, but it works ok by itself

On HP-UX, it gets truncated to 0 (IPF) or UINT_MAX (PA).
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Dennis,

Thanks for the reply.

>The results on PA and HP-UX IPF indicate that the function cast_uv is missing a return.

And as we've already discussed at some length, the return is only missing from the pared-down reproducer, and its presence or absence makes no difference as far as the trap shadow error is concerned. The function never returns at all when the error is triggered, so the return value, bogus though it may be, is not of particular interest to the problem at hand and is not visible in an environment that exercises the bug. I've attached a revised reproducer with the return statement restored just so we can stop confusing ourselves about it.

>In fact it seems the OVMS compiler may be broken
>because no compares with NaN should be true and it
>should fall off cast_uv as Volker points out.

They aren't true on OVMS either. There does appear to be a gotcha in the Alpha compiler as far as one path of parallel execution not defending itself quite enough from what another path might be doing at the same time in this rather odd corner case.

>>when either of the two if-blocks in it evaluates to false.

>These should always evaluate to false if d is a NaN.

They do -- except when they blow up and neither is evaluated.


>>casting a NaN to an int is a bit weird, but it works ok by itself

>On HP-UX, it gets truncated to 0 (IPF) or UINT_MAX (PA).

On VMS, it also gets truncated to 0, but that's irrelevant. It would have been better if I never said anything about "casting NaN" in my subject line; "parallel comparison operations with NaN" is more to the point except I did not yet know that was the problem at the time I posted.

Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

I've built your new example and run it on a V8.2 AlphaServer 1000A 5/400 (EV56) and it fails with ILLEGAL_SHADOW. I've copied the same image to eisner (.decuserve.org) (DS20 V7.2-1) and it runs there without a failure.

As I said before, the ILLEGAL_SHADOW is a condition detected and reported by the IEEE handler in EXCEPTION.EXE. There are about a dozen checks, which may report this condition.

If John talks about 'speculative execution', this only seems to apply to Alpha 21264 (EV6 or higher) CPUs, so apparently can be ruled out on my EV56.

Volker.
Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Craig,

I've reduced the instruction stream causing the ILLEGAL_SHADOW to a simple MACRO-64 program (see attached). Swapping the instruction following the CMPTLT/SU F16,F14,F15 instruction causes various types of failures or causes the ILLEGAL_SHADOW to disappear, but on the other hand works on some Alphas without a problem.

There is something wrong here, so please - if you can - log a call with HP.

Volker.
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Volker,

Thanks and double thanks. One thing I never mentioned is that I was seeing this on a DPW 500au, thus EV56, which confirms your experiments. I think there's plenty of info in this thread for someone with access to the compiler sources to dig in and fix the problem. For me this is hobbyist work done on my own time, and the only way to report it is to post here and at the C compiler feedback link off the OpenVMS home page, which I have now done.

Folks following this thread may be interested to know Hoff has written a nice background article on the alpha trap shadow here:

http://64.223.189.234/node/690



Happy New Year.
Craig


John Reagan
Respected Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

il CMPTLT/SU F16,F14,F15
FCMOVNE F18,F16,F18 ; inserted, causing ILLEGAL_SHADOW !
;; FCMOVNE F18,F17,F18 ; inserted, causing ILLEGAL_SHADOW !
;; FBNE F18, out ; inserted, causes HPARITH
;; LDA R0,nan ; inserted - no problem
;; CPYSE F18,F19,F20 : inserted - no problem

Of course you can get illegal shadows if you code in Macro-64 since you are responsible for following (or ignoring) the rules.

The Trap Shadow Rules are in the Alpha Architecture Manual, section 4.7.7.3.1.

The FCMOV instructions violate rule #4. You used F18 as both an input and output register inside the trap shadow.

The FBNE violates rule #2. No branches or jumps allowed in a shadow. I would have expected an illegal shadow message here as well.

The rules allow the OS to re-execute the faulting instruction and all the other instructions up to the TRAPB.

I haven't been reading all the C examples. If somebody has a short C example where the compiler violates the rules, email it to me please.
Volker Halle
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

John,

I've mailed you the C code and instructions for reproducing this problem.

I had written the MACRO-64 example strictly based on the I-stream generated by the C compiler.

This 'little example' seems to show a couple of different problems in various components of OpenVMS.

Volker.
John Reagan
Respected Contributor
Solution

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

Well, I don't see any OpenVMS issues other than the HPARITH vs ILLEGAL_SHADOW for the branch inside the shadow. The other behaviours look correct.

Now, the C compiler shouldn't be generating the

FCMOVNE F18, d, F18

inside the shadow. I'll see if I can reproduce it with the latest compiler.
John Reagan
Respected Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

With /NOOPT, the suspect FCMOV is below the TRAPB instruction. The peepholer is trying to move instructions into the trap shadow as an optimization. It moves the FCMOV into the shadow intentionally.

When the same register appears in more than one operand, I found a comment:

"If Rb==Rc, there is no real move occuring at all. If Ra==Rc, and the move didn't occur the first time, then Ra/Rc will be unchanged. If it does happen the first time, Ra/Rc will get the new value from Rb, but then it doesn't matter if the move happens again or not."

Only when all three register operands are disjoint will the instruction not be moved into any trap shadow.

I also found some other comments about some ECO to the Alpha Architecture which adds some more wording to the trap shadow rules. Perhaps the OS' trap shadow checking code didn't catch up. I'll check that case.
Craig A Berry
Honored Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

John R.,

Thanks for the additional insight and for checking into what the compiler is doing and why. Note that the illegal shadow error only appears when operating on a NaN -- it may be the optimization is kosher for other values but not when one of the operands is a NaN.
John Reagan
Respected Contributor

Re: ILLEGAL_SHADOW error in C, casting NaN to unsigned int

For those of you keeping score at home, I've done some archaeology and decided that:

1) the compiler is correct.

2) the OpenVMS IEEE handler is broken when it comes to validating instructions in the trap shadow.

3) the Alpha SRM section on Trap Shadows was actually written by the GEM team to capture the implementation details, not the other way around.

I've entered a problem report against the code, but the authors are long gone. That makes me the "expert" based on my 15 minutes of looking (to be honest, the code in question isn't that large, just ugly). I guess that means I should fix it myself when I can suppress my gag reflex for 30 minutes.