Loading an Array (Fortran 90 on Alpha systems)

HDS · 11-10-2009

Hello.

This is a question of curiosity.

On an Alpha ES40 with OpenVMS V8.3, is there a difference in performance if using

a) DO II = 1, MAX_VALUE
      IVALUE(II) = JVALUE(II)
   ENDDO

Or...

b) IVALUE(:) = JVALUE(:)

There are often cases where I would be loading up an array of hundreds of thousands of elements. I sometimes see cases where the latter approach would visibly run faster (clock time), but am not able to maintain that result.

Test evaluations are on Alpha ES40s with 2x 667 MHz EV67 CPUs. I generally compile with the following options:
/CHECK=(BO,UN,OV)
/ALIGN=(COMM=NATU,RECO=PACK)
/OPT=(TUNE=EV67,LEVEL=5)
/REAL=64/EXT
/WARN=(NOALIGN,DECL,NOUSAGE)
/FLOAT=G

Or is it simply a matter of expressing the same thing two ways, both resulting in the same underlying instruction stream? And are my occasional apparently faster executions of method (b) really a matter of caching, timing against other processes, etc.?

Thank you in advance.

-Howard-

Hoff · 11-10-2009

If you're chasing performance (and have not already done so), then instruction tracing and wallclock and code profiling are the gold standard; find your slowest code, and optimize from there.

As for your question, look at the generated machine code, and see what instruction streams get generated for each.

Look at what the process is doing, too. Is it incurring paging overhead, for instance?

On OpenVMS I64 (and to a rather lesser extent on OpenVMS Alpha), look around for alignment faults, too. Alignment faults can really hit OpenVMS I64 application performance hard.

There are also cases when inlining code is actually slower than size-optimized code, as longer code sequences can knock the code out of L1 or sometimes even L2 processor cache.

And depending on how much you want to look at the low-level activities of your code, have a look at DCPI:

http://h71000.www7.hp.com/openvms/products/dcpi/dcpid.html

Traditional variable-based loading and RMS record-based coding practices do tend to have lower performance. Where permissible by the application and within the available algorithms, I generally look to load whatever I can from a cache or a database or whatever using one or a few big wads of data.

Alternatives to implementing your own copies can include double-mapping data (I've posted source code that double-maps between 32- and 64-bit space, for instance), or (when you really have to move big blocks) using the system block move primitives within OpenVMS.

HDS · 11-10-2009

Thank you Hoff.

That is quite a bit of info to digest, so it may take me a little time to take it all in. Allow me to respond at this time with the following:

"look at the generated machine code" I understand and agree that such would likely shed light on my question. However, I cannot say that I have ever done that, and welcome info on methods to do so.

I am all too familiar with the alignment fault concerns on Itanium, and understand that they can also pop up on Alpha. I will certainly check to see whether there is any difference between the two methods.

"cases when inlining code is actually slower than size-optimized code" Very interesting....very.

Thank you for the link to DCPI. I had not heard of that and will look into it....if not for just this question, for future projects.

In this particular case, the loads are occurring from locations in the VM footprint, not from a file or anywhere on disk. As I recall, faulting is similar for both instances...but will confirm. In fact, as I read this info from you, I feel that I could step through in debug and watch the process activity from another session while stepping. (I may have asked my question prematurely...I may already have what is needed to answer it.)

Appreciate the info.

-H-

Hoff · 11-10-2009

>However, I cannot say that I have ever [investigated the instructions], and welcome info on methods to do so.

When you're doing production development, keep the compiler listings and the maps around.

Maintaining these listings and map files is helpful for tracking application crashes and is (also) useful for cases such as this. The former is discussed directly here:

http://labs.hoffmanlabs.com/node/800

And the generated instruction streams are shown in the compiler listings. Which is what you want to see, but for other reasons here.

The best way to go faster is to avoid copying the data altogether where you can (possibly through double-mapping or through the use of indirection and pointers), to work with and to transfer the bigger wads of data as a unit (usually), to look at and implement data caching where that helps, to use system block transfers, and only as a last resort read individual file records or otherwise copy smaller wads of data around. The classic OpenVMS programming techniques are comparatively slow.

Using the source code debugger generally doesn't help all that much with questions involving performance monitoring; that tool is very useful in getting the code running stably and reliably. Then come tools such as the PCS and FLT tools "within" ANALYZE /SYSTEM and the DECset PCA tool and (at the lowest levels) DCPI; tools that specifically target performance monitoring and application profiling.

Hoff · 11-10-2009

ps: Links to various Alpha hardware docs and to other low-level resources including the Alpha Macro64 instruction set reference materials:

http://labs.hoffmanlabs.com/node/407

Jon Pinkley · 11-10-2009

Howard,

RE: "'look at the generated machine code' I understand and agree that such would likely shed light on my question. However, I cannot say that I have ever done that, and welcome info on methods to do so."

Perhaps you know the following already.

To get the compiler to tell you what machine code it is generating:

$ fortran /list[=listing_file]/machine_code

Try with a small program that does nothing but load the array and then at least references the copied array (otherwise the compiler may optimize the code so that it does not even load the array).

Then compile with different optimization levels, using the /LIST and /MACHINE_CODE qualifiers in the compile command.

Then look at the listing.
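As a rough sketch of such a test program (the program name, routine names and array size here are illustrative assumptions, not anything from the original code), each copy style can be put in its own routine so the two instruction streams are easy to find in the listing. At high optimization levels the compiler may still inline the routines, so the listing is the final word on what was generated.

```fortran
! Hypothetical test program: names and sizes are illustrative only.
program copy_test
  implicit none
  integer, parameter :: max_value = 500000
  integer :: ivalue(max_value), jvalue(max_value)
  integer :: ii

  ! Give the source array real contents so the copy cannot be discarded.
  do ii = 1, max_value
     jvalue(ii) = ii
  end do

  call copy_loop(ivalue, jvalue, max_value)
  ! Reference the copied array so the optimizer must keep the copy.
  print *, 'loop copy:  ', ivalue(1), ivalue(max_value)

  ivalue = 0
  call copy_whole(ivalue, jvalue, max_value)
  print *, 'whole array:', ivalue(1), ivalue(max_value)
end program copy_test

! Method (a): explicit DO loop, one element per iteration.
subroutine copy_loop(ivalue, jvalue, n)
  implicit none
  integer, intent(in)  :: n
  integer, intent(out) :: ivalue(n)
  integer, intent(in)  :: jvalue(n)
  integer :: ii
  do ii = 1, n
     ivalue(ii) = jvalue(ii)
  end do
end subroutine copy_loop

! Method (b): Fortran 90 whole-array assignment.
subroutine copy_whole(ivalue, jvalue, n)
  implicit none
  integer, intent(in)  :: n
  integer, intent(out) :: ivalue(n)
  integer, intent(in)  :: jvalue(n)
  ivalue(:) = jvalue(:)
end subroutine copy_whole
```

Compiling this with the /LIST and /MACHINE_CODE qualifiers shown above should place the code for COPY_LOOP and COPY_WHOLE in separate sections of the listing, which makes a side-by-side comparison straightforward.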

If Ivalue and Jvalue have the same dimensions, you may be able to copy the whole array with a block move; I would expect that to be near optimal (assuming aligned arrays). If MAX_VALUE can have a value less than the size of the array dimensions, a block move should still work, given that both arrays are single dimension.

If you have never looked at the Alpha instruction set, it is quite a bit different from the VAX (in some respects it is more like the PDP-8 than the VAX).

You can download the Alpha architecture manual in pdf format (use Google). It has complete descriptions of the instruction set, but I wouldn't consider it to be tutorial in nature. It's more of a technical specification than a user's guide.

Actually, doing this exercise is a reasonably good way to learn the AXP instructions. Sometimes, the compilers use non-standard mnemonics for the machine instructions, so you may need to look at the Hex opcodes (at left) to be able to map what the instruction is called in the Architecture manual. They also make the machine code listing more human friendly by using the variable names instead of register names. The comments on the right have the actual registers used.

Also, the f90 and f77 compilers are quite different in the code they generate.

Jon

it depends

John Gillings · 11-10-2009

Howard,

>is there a difference in performance?

That's an easy question to answer - time it yourself. Look at CPU time, Clock time and pagefaults. It might help to page in both arrays first, then perform your timing runs.
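A minimal timing sketch along those lines, assuming the standard CPU_TIME and SYSTEM_CLOCK intrinsics are available in your compiler (the array size, repeat count and program name are made-up values for illustration). Each copy is repeated many times because a single copy of a few hundred thousand elements is likely to finish in less than one clock tick.

```fortran
! Hypothetical timing harness: sizes and repeat counts are illustrative.
program time_copy
  implicit none
  integer, parameter :: max_value = 500000
  integer, parameter :: nreps = 200
  integer :: ivalue(max_value), jvalue(max_value)
  integer :: ii, rep, c0, c1, rate
  real :: t0, t1

  ! Touch both arrays first so their page faults are not charged
  ! to whichever method happens to run first.
  do ii = 1, max_value
     jvalue(ii) = ii
     ivalue(ii) = 0
  end do

  ! Method (a): DO loop.
  call system_clock(c0, rate)
  call cpu_time(t0)
  do rep = 1, nreps
     do ii = 1, max_value
        ivalue(ii) = jvalue(ii)
     end do
     ! Change the source between repeats so the copies are not identical.
     jvalue(rep) = ivalue(rep) + 1
  end do
  call cpu_time(t1)
  call system_clock(c1)
  print *, 'DO loop     CPU s:', t1 - t0
  if (rate > 0) print *, 'DO loop     wall s:', real(c1 - c0) / real(rate)
  print *, 'check:', ivalue(1), ivalue(max_value)

  ! Method (b): whole-array assignment.
  call system_clock(c0, rate)
  call cpu_time(t0)
  do rep = 1, nreps
     ivalue(:) = jvalue(:)
     jvalue(rep) = ivalue(rep) + 1
  end do
  call cpu_time(t1)
  call system_clock(c1)
  print *, 'Whole array CPU s:', t1 - t0
  if (rate > 0) print *, 'Whole array wall s:', real(c1 - c0) / real(rate)
  print *, 'check:', ivalue(1), ivalue(max_value)
end program time_copy
```

Running it several times, and swapping the order of the two sections, helps separate a real difference from caching effects and interference from other processes.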

In theory, the "whole array" copy gives the compiler the option to do a block memory transfer, possibly avoiding the overhead of the loop (increment, test, branch). For small MAX_VALUEs the compiler may unroll a loop, but probably won't coalesce the elements.

If you want to force a block move, try using one of the library routines:

CALL OTS$MOVE3(%VAL(MAX_VALUE*ValueSize),JVALUE,IVALUE)

On the other hand, if they're large enough, the arrays may need to be moved in and out of memory, in which case the dominant cost to the operation will probably be pagefaults. Plenty of memory and plenty of working set might help.

Note that multi dimensional arrays are a whole different story. If you use nested loops, make sure you use the correct order!
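To illustrate the loop-order point: Fortran stores arrays in column-major order, so the first subscript should vary fastest, which means it belongs in the innermost loop. The array names and sizes below are assumptions chosen only for the sketch.

```fortran
! Hypothetical example: array names and sizes are illustrative.
program loop_order
  implicit none
  integer, parameter :: nrow = 1000, ncol = 1000
  integer :: a(nrow, ncol), b(nrow, ncol)
  integer :: i, j

  b = 1

  ! Preferred order: the first subscript (i) is in the innermost loop,
  ! so consecutive iterations touch adjacent memory locations.
  do j = 1, ncol
     do i = 1, nrow
        a(i, j) = b(i, j)
     end do
  end do

  ! Same result, but the inner loop strides through memory nrow
  ! elements at a time, which tends to be harder on cache and paging.
  do i = 1, nrow
     do j = 1, ncol
        a(i, j) = b(i, j)
     end do
  end do

  ! Reference the result so the copies are not optimized away.
  print *, 'sum of a:', sum(a)
end program loop_order
```

A whole-array assignment such as a = b leaves the traversal order to the compiler, which sidesteps the question entirely.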

A crucible of informative mistakes

Jansen_8 · 11-11-2009

As others already stated, you should test it in your own application.

That said: my experience on an XP1000 with an EV67 processor is that in most cases the F90 array operations are much faster than the equivalent do-loops.

Jouk

Bob Blunt · 11-11-2009

Howard,

At the expense of seeming out of touch and dating myself... In the dark ages, array initialization was very sensitive to the order of initialization. I suspect this was probably architecture-specific (VAX), but I haven't deliberately tested it myself since I have usually just ported working HLL code directly to newer platforms. If you're checking code, then compare slow to fast and see if one is initializing rows first and the other is initializing columns first. On VAX the difference was rather dramatic.

bob

HDS · 11-12-2009

Thank you all.

Some of the responses re-sparked stuff in my head. The rest contained information that is new to me; this has been a learning experience. Very rewarding to say the least.

I will be playing with this again at some time shortly. (As I said, this was simply a question of curiosity.) I will keep this thread open for a couple days so that I can post the results...in case any of you are as curious.

Thanks again

-H-

HDS · 11-19-2009

Hello.

Thank you all for the useful information. As it happens, I took two simple modules, one that uses a DO loop to populate the array and one that uses the single array-assignment statement, reviewed the machine code listings, and ran them.

Bottom line: the single-line approach generates slightly more machine code, as it appears to assign [internal] temporary variables and then manipulate them in the same manner as the DO loop. As to their execution, the two are very close, but by CPU time the DO loop appears to be faster, by as little as a fraction of a 10 ms tick.

In conclusion, I lean towards there being no real difference. Through all of this, I learned quite a bit, so it was well worth it.

Thanks again to all.

-H-
