Re: IA64 dec$ForRTL exp(-z) 2000x too slow for large z?

Charles F. Driscoll · ‎04-05-2008

Has anyone seen documentation (or a patch) for
glacially slow exp(-z) evaluations on IA64?

The exponential function exp(-z) in dec$ForRTL on IA64 takes 2000 times as long to return 0.0 for z > 120 as it does to return 1.8e-35 for z = 80. This can be a real "killer" for particle simulations.

This extreme slowdown is observed for both /Float=IEEE_Float and /IEEE_Mode=Fast. Strangely, when /IEEE_Mode=DeNorm, the slowdown is "only" 40x. The slowdown is about the same for /Real_Size=64 and /Real_Size=32.

On an Alpha XP-1000, the slowdown is only 2x when returning 0.0 for large z, making the Alpha 300x as fast as the IA64 in this regime.

Details:
OpenVMS 8.3, Fortran V8.1-10492
FORTRAN Source, Listing, Map and Results at
http://sdphI0.ucsd.edu/Exp_Time_Test_Results.txt

C. Fred Driscoll
Physics, UCSD

Hoff · ‎04-06-2008

Short of rolling your own (faster) code for this function, it is likely you will have to work directly with the HP Fortran RTL engineering team.

For giggles, I might turn on alignment fault monitoring and see if that's where all the time is going here. Such faults can be potentially subtle, as the code is working and is producing the correct answer, albeit glacially slow. (And if it is alignment within the RTL, it's HP's code.)

Regardless, this looks to involve a look at and an update to the Fortran RTL, and -- short of an RTL patch kit (which you've undoubtedly already looked for and tried*) -- there's not much folks out here can do.

--
*If you haven't already looked for kits, I'd start with VMS83I_FORRTL V3, VMS83I_UPDATE V5, and any other mandatory ECOs not already loaded. These may or may not cure this (and I'd lean toward "not"), but if you don't have these installed, you'll likely be asked to install them.

Preemptive installation of ECOs is a technique intended to reduce the time spent waiting for the front-line support spinlock to clear.

Charles F. Driscoll · ‎04-06-2008

Hoff,

Thanks for your thoughtful reply. I had missed the recent ForRTL_v3 update, but I tested it and found no improvement over the ForRTL_v2 initially tested.

I also checked $Monitor Alignment, and saw essentially NO faults.

I have no service contract, so my hope is that this post will attract the attention of the Fortran RTL engineering team. This bug is certainly worth attention, since it can completely "stall" typical particle simulations in physics research.

Best,
Fred

Steven Schweda · ‎04-06-2008

> I have no service contract, so my hope is
> that this post will attract the attention
> of the Fortran RTL engineering team.

You might also try the official unofficial
complaint Web form:

http://h71000.www7.hp.com/fb_business.html

Charles F. Driscoll · ‎04-06-2008

>You might also try the official unofficial
>complaint Web form:
>
>http://h71000.www7.hp.com/fb_business.html

Thanks, I have now submitted the error description.

My (limited) experience has been that when something like this gets to the right person in DEC/Compaq/HP, it gets fixed. Sometimes they even distribute an informal patch that's waiting for an ECO-spinlock to clear :)

John Reagan · ‎04-07-2008

Certainly is slower for the underflow cases. I did 1,000,000 exp(-80) in 0.08 seconds (on my rx2600) and the same number of exp(-120) took 1min:40sec. I did this in Pascal. No Fortran in sight.

The MATH$EXP_S routine that does all the work is hand-crafted Itanium assembly. We'll get this to the right folks.

I can believe that /IEEE_MODE=FAST is slower. In that mode, we ask the hardware to complain more often about IEEE special values. For /IEEE=DENORM, we just close our eyes and let whatever happens, happen.

Charles F. Driscoll · ‎04-12-2008

The day after my Web problem report to HP, Deborah Belcher @ hp.com responded by email that "we'll get this reported formally", and "the math team [now] has the problem queued".
Sounds promising.

Dennis Handly · ‎04-14-2008

There is no slowdown for this on HP-UX in C for Integrity.

John Reagan · ‎04-15-2008

The slowdown isn't with the algorithm itself, but with how the "underflow to 0" errors are caught and dismissed (at least for the /IEEE=FAST case). The OpenVMS code for exception handling has some known performance issues with reading/processing the unwind descriptors.

For the /IEEE=DENORM case, much of the slowdown is from playing (via system services) with the FPSR settings.

Jon Pinkley · ‎04-16-2008

Why can't MATH$EXP_S just explicitly check the input value for min and max that won't cause an exception, thus avoiding the costly exception processing operation?

My point is that there is a value that will cause underflow and there is a value that will not. I would expect those values to be constant. There would be a small constant cost for doing the check before starting the work for cases that won't underflow/overflow, but that cost is probably insignificant compared to the cost of the procedure call, and much less than the cost of an exception.

Jon

it depends

Jon Pinkley · ‎04-16-2008

More precisely:

There exists a value L for which exp(L) does not underflow, but for which all x<L, exp(x) will underflow.
There exists a value U for which exp(U) does not overflow, but for which all x>U, exp(x) will overflow.

Therefore if L<=x<=U then exp(x) will not cause an underflow or overflow exception.
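
A minimal C sketch of this check for single precision follows. It assumes L = ln(FLT_MIN) and U = ln(FLT_MAX), i.e. "underflow" means the result would fall below the smallest normal float; the names exp_range and classify_expf_arg are hypothetical and are not part of MATH$EXP_S.

```c
#include <float.h>
#include <math.h>

typedef enum { EXP_UNDERFLOW = -1, EXP_IN_RANGE = 0, EXP_OVERFLOW = 1 } exp_range;

/* Classify x before calling expf(x), so the costly exception path can be skipped. */
exp_range classify_expf_arg(float x)
{
    const double lower = log((double)FLT_MIN);  /* L: about -87.34 */
    const double upper = log((double)FLT_MAX);  /* U: about  88.72 */

    if ((double)x < lower) return EXP_UNDERFLOW;
    if ((double)x > upper) return EXP_OVERFLOW;
    return EXP_IN_RANGE;  /* a NaN argument also lands here; expf() then returns NaN */
}
```

With these bounds, z = 80 (x = -80) is in range and z = 120 (x = -120) is flagged as an underflow by two comparisons, before any exponential is evaluated.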

it depends

John Reagan · ‎04-17-2008

Yes, the code seems to have some input checking. However, the algorithm (which I don't completely understand) seems to think there are some numbers which might or might not underflow, since you can get smaller results if denorms are enabled. We're still looking at the underlying cause.

John Reagan · ‎04-17-2008

For those of you keeping score at home, more info...

The Math RTL routines need to check with their callers to see if they should raise an error or do something else.

On Alpha, that information is inside the procedure descriptor (PDSC$V_EXCEPTION_MODE) with the default being PDSC$K_EXC_MODE_SIGNAL : Raise exceptions for all error conditions except for underflows producing a 0 result.

So even when the RTL checks the input arguments, it has to walk back up the stack to find the calling routine to see if that routine wanted underflow checking or not (F90 has a /CHECK=UNDERFLOW).

I64, which doesn't have a single data structure like a procedure descriptor, has that information buried down in the unwind descriptors.

Part of the slowdown is the stack walk via the LIB$I64 calling standard routines to see if the caller wants the underflow raised as an error or just mapped to 0.

Dennis Handly · ‎04-18-2008

Ok, my previous numbers were for doubles, which don't underflow. When I call expf, it is about 40 times slower, unless I use flush denorms.

$ time a.out 80
exp(-80.000000) = 1.80485e-35
real 0m02.64s user 0m02.63s sys 0m00.00s

$ time a.out 120
exp(-120.000000) = 0
real 1m44.51s user 0m28.71s sys 1m15.54s

The timing info is real interesting in that it clearly is blaming the kernel and the sloppy hardware that hates denorms.
