Re: Compiler Code Generation Question

sccr13plyr · ‎12-08-2008

Hello,

I do not know if this forum is the right place to post this message. If it is not, please let me know a better place to get my question addressed. I am an old C programmer who became a self-taught C++ programmer. So, the attached may simply reflect my ignorance of the C++ language.

I am at a site that has an older application written in C that has over time been migrated to C++. This is a high-performance application that processes 100 million records a day, which are then broken down into component transactions. We have a set of algorithms that process these transactions and contain a lot of duplicate logic. We want to add functionality, simplify the code, and keep the performance.

This approach would collapse multiple algorithms (eight) into one body of logic in GenericLogic. By using constants, my desire is to have the compiler create three separate versions of GenericLogic (one for each of the new class declarations in main), removing code (and comparisons) that would never need to be used.

Should I expect that the compiler is actually removing unreachable code? The machine listing appears to show the comparisons. Should I be utilizing a different C++ language construct?

Thanks in advance for your assistance, and sorry if this is not the right venue...

sccr13plyr

silly site will only allow one attachment...

#include <stdio.h>

#pragma message disable BOOLEXPRCONST

class BaseAlgorithm
{
    public: virtual ~BaseAlgorithm (){}

    public: inline void Display (const int id, const char* output)
    {
        (void) printf ("Algorithm: %d - %s\n", id, output);
    }
};

template <int id, bool checkPrice, bool specificCode, bool currTran>
class GenericAlgorithm : public BaseAlgorithm
{
    private: static const int algorithmId_ = id;
    private: static const bool checkPriceChanged_ = checkPrice;
    private: static const bool specificCode_ = specificCode;
    private: static const bool currentTransOnly_ = currTran;

    public: GenericAlgorithm (){}
    public: virtual ~GenericAlgorithm (){}
    public: inline void GenericLogic (int* data, int* env, char* retInfo)
    {
        if (data == NULL)
            if (env == NULL)
                if (retInfo == NULL)
                {
                    //
                }

        if (checkPriceChanged_)
        {
            if (specificCode_)
            {
                if (currentTransOnly_)
                {
                    char v1[1000][2000];
                    v1[999][999] = 'd';
                    if (v1[999][999] == 'd')
                    {
                    }

                    Display (algorithmId_, "Check Price, Specific Code, Current Tran");
                }
                else
                {
                    Display (algorithmId_, "Check Price, Specific Code, All Trans");
                }
            }
            else
            {
                if (currentTransOnly_)
                {
                    Display (algorithmId_, "Check Price, All Code, Current Tran");
                }
                else
                {
                    char v1[1000][2000];
                    v1[999][999] = 'd';
                    if (v1[999][999] == 'd')
                    {
                    }

                    Display (algorithmId_, "Check Price, All Code, All Tran");
                }
            }
        }
        else
        {
            if (specificCode_)
            {
                if (currentTransOnly_)
                {
                    Display (algorithmId_, "Ignore Price, Specific Code, Current Tran");
                }
                else
                {
                    Display (algorithmId_, "Ignore Price, Specific Code, All Trans");
                }
            }
            else
            {
                if (currentTransOnly_)
                {
                    Display (algorithmId_, "Ignore Price, All Code, Current Tran");
                }
                else
                {
                    Display (algorithmId_, "Ignore Price, All Code, All Trans");
                }
            }
        }
    }
};

#include <stdlib.h>
#include "AlgorithmTemplate.h"

int main (void)
{
    GenericAlgorithm<159, true, true, true> test;
    test.GenericLogic (NULL, NULL, NULL);

    GenericAlgorithm<225, true, false, true> newTest;
    newTest.GenericLogic (NULL, NULL, NULL);

    GenericAlgorithm<246, true, false, false> *oldTest = new GenericAlgorithm<246, true, false, false>;
    oldTest->GenericLogic (NULL, NULL, NULL);

    return EXIT_SUCCESS;
}
HP C++ V7.3-023 on OpenVMS IA64 V8.3-1H1
$ CXX 'P1 /OPTIMIZE=(LEVEL=5,INLINE=SPEED,INTRINSICS,UNROLL=0,OVERRIDE_LIMITS,TUNE=HOST)/STANDARD=LATEST/WARNINGS=(ENABLE=ALL)/list/machine

Hoff · ‎12-08-2008

This is probably not the answer you want.

Out of curiosity, why is the removal of unreachable code important to you? Sure, (if that doesn't happen) it chews up some virtual memory, but at your execution rates I'd be a whole lot more interested in finding the bottlenecks and tuning the code that is reachable.

Load up DECset PCA and such, and find out where the production code is spending its time. Work from there, identifying and investigating the hottest parts and the slowest parts of the application. (Expect to be surprised here too, as the performance bottleneck(s) may well be in completely unexpected regions of the application source code.)

The C and C++ compilers don't traditionally in-line across source code compilation modules. There are ways that you can allow the compilers to remove code and to better optimize code through judicious use of the static keyword on the function declaration. But I've not seen an OM-like tool for OpenVMS.

Recognize too that Itanium doesn't really like to branch, as a general rule. In-line code runs faster than code that branches, or incurs frequent faults or such. And with Itanium, definitely also investigate alignment faults, and other higher-level factors.

And if you really need absolute performance, you may (will) end up identifying, reworking, tweaking or otherwise re-coding parts of your most performance-critical code paths. But identify first where the application is spending its time, then design and prototype and (probably most importantly!) performance-test the tweaks and the changes.

This may well involve re-thinking how the application is designed and how it performs its I/O, for instance. If you're working with 100 million individual records per day through standard APIs using traditional OpenVMS or C coding techniques, I/O is a likely candidate for tweaks. I might well map the whole show into memory with as few I/Os as I could reasonably manage (possibly with overlapping I/Os), and then use 64-bit address space to directly pound on the data structures. This assumes that DECset PCA shows I/O as a major factor in the aggregate application performance, of course.

Put another way, I'd suggest you don't optimize anything until you first know you need to optimize the code (and which is apparently the case here?), and then only after you investigate and profile and know *where* to optimize the application.

sccr13plyr · ‎12-08-2008

Hoff,

Thanks for your response! It has been four years since I was last working at a VMS site. I am glad to be back. I will investigate whether they have ever used PCA here. I started not too long ago and am still coming up to speed regarding the environment.

Based on the declaration of the code, I would expect the compiler to identify and remove the comparisons and branches for constant values. However, the listing seems to show them. Unfortunately, IA64 assembly is less understandable than Macro.

The concept for the test is to validate that the related algorithms can be combined into one with branches that can be optimized away during compilation.

So, I wanted to get some feedback from the OVMS C++ engineering team. Am I misreading the machine code listing? Am I misunderstanding how the compiler should generate the machine code? Or is it that the compiler should identify (it does provide an informational) and eliminate the extra comparison/branch cycles and doesn't?

sccr13plyr

Hoff · ‎12-08-2008

> I will investigate whether they have ever used PCA here. I started not to long ago and still coming up to speed regarding the environment.

EPIC and the Itanium assembler is comparatively impenetrable for the uninitiated. Start with the Itanium manual set from Intel, if you want to look at this level. But I'd start with DECset PCA and with application-wide tuning first.

>Based on the declaration of the code, I would expect the compiler to identify and remove the comparisons and branches for constant values. However, the listing seems to show them. Unfortunately, IA64 assembly is less understandable than Macro.

Allow me to characterize one run-time behavior you're bypassing here.

One alignment fault can require the equivalent of 10 to 15 thousand instructions to process, per one of the HP folks that deals with compilers. That's 10,000 to 15,000 instructions. Per alignment fault.

Have you confirmed few or no alignment faults exist in critical paths?
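To make the alignment point concrete, here is a minimal sketch in portable C++. The helper names `readUnaligned32` and `writeUnaligned32` are hypothetical and not taken from the application. The usual source of an alignment fault is casting a byte pointer at an odd offset to a wider type and dereferencing it; copying through `memcpy` instead leaves the compiler free to emit loads that are safe for any address.

```cpp
#include <cstring>
#include <stdint.h>

// Hypothetical helper: read a 32-bit value from an arbitrary byte offset.
// memcpy avoids the misaligned load that *(int32_t*)p could perform.
inline int32_t readUnaligned32(const unsigned char* p)
{
    int32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Hypothetical helper: store a 32-bit value at an arbitrary byte offset.
inline void writeUnaligned32(unsigned char* p, int32_t value)
{
    std::memcpy(p, &value, sizeof value);
}
```

This only shows the coding pattern; finding where faults actually occur in production code is still a job for the profiling tools mentioned above.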

>The concept for the test is to validate that the related algorithms can be combined into one with branches that can be optimized away during compilation.

So stick a constant or a particular address reference or a debugger trap or such into the code, and see if it gets generated in the output. What does dead code matter here? As a rule, dead code is not a performance issue. And Itanium can intersperse instructions and routines within the bundles, all to avoid branches.

>So, I wanted to get some feedback from the OVMS C++ engineering team.

Ok. This is a customer forum. I haven't seen a member of the compiler team posting in here recently, but one or two of the compiler folks do occasionally stop by.

Here, you're not really even asking a compiler question, you're asking for details of the Intel back-end code generator. (C++ uses the Intel EDG front-end and ECG back-end. It's one of the few OpenVMS compilers that does that right now. Most of the other OpenVMS compilers use the GEM back-end. This IIRC, and based on various published discussions.)

I don't recall having seen any Intel Itanium code generator engineers post here, either.

If you want formal feedback or a direct discussion with the folks that work on the compilers or on the code generator, then you'll likely want to log a support call.

>Am I misreading the machine code listing?

Unclear. Could be. Stick a few printf or some constants or some specific instructions (bread crumbs) into the code, and see. As a general rule, make sure the entry points are declared static, if you want the compiler to have free run of the inlining processing. Traditional syntax on OpenVMS preserves the entry points otherwise; the code doesn't (and can't) get fully inlined when it has an entry point as the compiler doesn't know what (other) module(s) might reference the code when the application is linked.

>Am I misunderstanding how the compiler should generate the machine code? Or is it that the compiler should identify (it does provide an informational) and eliminate the extra comparison/branch cycles and doesn't?

If you *really* want to pursue this, get yourself the four volume set of books from Intel on the Itanium architecture. (The version I've worked with was four books, each of somewhere between typical and large thicknesses. Not sure if the shelf has been revised since then.) There are some really woolly features in the IA64/IPF/EPIC architecture that you'll need to understand before you wade further. Predication comes to mind here, for instance.

Also be careful on how you address those arrays, as some of the ways an array can be traversed can lead to massive numbers of page faults.
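As a sketch of the traversal concern, assume a row-major two-dimensional `char` array like the `v1` in the test code, stored here as a flat buffer. The function names and the flat layout are illustrative assumptions. Walking the column index in the inner loop touches adjacent bytes, while walking the row index in the inner loop jumps a full row per step, so with rows as large as 2000 bytes nearly every access can land on a different page.

```cpp
#include <cstddef>

// Sequential order: the inner loop walks adjacent bytes within one row.
inline long sumRowOrder(const char* data, std::size_t rows, std::size_t cols)
{
    long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += data[r * cols + c];
    return total;
}

// Strided order: the inner loop advances by `cols` bytes per step,
// so consecutive accesses fall in different rows.
inline long sumColumnOrder(const char* data, std::size_t rows, std::size_t cols)
{
    long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += data[r * cols + c];
    return total;
}
```

Both functions compute the same result; only the memory access pattern differs, which is what a profiler would show as a difference in fault counts.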

In the absence of alignment fault data and page fault data and particularly the collection and analysis of PCA data, I'd view instruction-level code optimization as premature.

John Gillings · ‎12-08-2008

Reinforcing Hoff's comments...

Forget everything you know about optimization on VAX and other CISC architectures.

Also forget all the stuff you know about RISC architectures (Alpha).

EPIC is an entirely different beast, with some highly non-intuitive performance characteristics. There's no way any mere human would be able to determine the optimality of a particular machine code sequence by inspection. It just doesn't work like that! Odd concepts like predication and speculative execution would seem to be insanity in a VAX world view.

As Hoff has suggested, the biggest hitter in performance, by several orders of magnitude, is alignment faults. Get them sorted and you won't need to make your brain hurt by trying to figure out how the machine code works.

A crucible of informative mistakes

Dennis Handly · ‎12-08-2008

>By using constants,
>template <int id, bool checkPrice, bool specificCode, bool currTran>
class GenericAlgorithm ... {
static const int algorithmId_ = id;

Why are you making the compiler sweat? Why not simply use these constant template parms ("id", checkPrice, etc.) in the body of the template class? Since there are no variables declared, the compiler can't go too badly wrong.
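A minimal sketch of what that suggestion could look like, trimmed down from the original test code. The class name `DirectAlgorithm` and the `Describe` and `Id` methods are hypothetical. The non-type template parameters are tested directly, with no static data members in between, so every condition is a compile-time constant inside each instantiation.

```cpp
template <int id, bool checkPrice, bool specificCode, bool currTran>
class DirectAlgorithm
{
public:
    int Id() const { return id; }

    // Returns the label for the path selected by the template arguments.
    // Every condition tests a constant, so an optimizer is free to
    // discard the arms that cannot be taken in this instantiation.
    const char* Describe() const
    {
        if (checkPrice)
        {
            if (specificCode)
                return currTran ? "Check Price, Specific Code, Current Tran"
                                : "Check Price, Specific Code, All Trans";
            return currTran ? "Check Price, All Code, Current Tran"
                            : "Check Price, All Code, All Tran";
        }
        if (specificCode)
            return currTran ? "Ignore Price, Specific Code, Current Tran"
                            : "Ignore Price, Specific Code, All Trans";
        return currTran ? "Ignore Price, All Code, Current Tran"
                        : "Ignore Price, All Code, All Trans";
    }
};
```

Whether the dead arms really disappear still depends on the optimizer; the change only removes one level of indirection that the compiler would otherwise have to see through.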

>Should I expect that the compiler is actually removing unreachable code? The machine listing appears to show the comparisons. Should I be utilizing a different C++ language construct?

If you have a "real" C++ compiler and a "real" optimizer it should work fine. I assume you are optimizing?

With aCC6 on HP-UX, the only thing that is in main are the 3 printfs:
Algorithm: 159 - Check Price, Specific Code, Current Tran
Algorithm: 225 - Check Price, All Code, Current Tran
Algorithm: 246 - Check Price, All Code, All Tran

Your OpenVMS compiler should have a similar frontend but the backend is completely different, as Hoff mentions.

>silly site will only allow one attachment.

Then you need to provide comments and a name for each. Or use tar/zip to include both.

>Based on the declaration of the code, I would expect the compiler to identify and remove the comparisons and branches for constant values. However, the listing seems to show them.

Yes, a real compiler does this when optimizing.

>Am I misreading the machine code listing?

Attached is a gzipped +O2 .s file from HP-UX.

>Hoff: why is the removal of unreachable code important to you?

It's the useless compares and cache bloating that may kill you.

>C++ compilers don't traditionally inline across source code compilation modules.

This is C++, where the Standard practically demands it be inlined. And in this case, the first "file" is a .h file.

>And with Itanium, definitely also investigate alignment faults

On HP-UX, the default is to abort, so not much investigation is needed. :-)

>I'd suggest you don't optimize anything until you first know you need to optimize the code

Exactly.

>I haven't seen a member of the compiler team posting in here recently

We now work for the same manager.

>the code doesn't (and can't) get fully inlined when it has an entry point as the compiler doesn't know what (other) module(s) might reference the code when the application is linked.

The C++ Standard allows the compiler to completely delete any inlined function, whether static or not. This isn't true for C99 or for some variants of the C++ ABI.

>The version I've worked with was four books

It is now 3.

Hoff · ‎12-08-2008

For giggles, have you tried this same basic code on a different architecture? It'd be interesting to learn the performance differences. Alpha or x86-64 would be the obvious choices for comparison against EPIC.

>It's the useless compares and cache bloating that may kill you.

You may be right, but I would prefer to see this proved as being a (relevant) performance factor (here). And if this is a factor, it's good fodder for an enhancement request for ECG. EPIC has various (sometimes weird) considerations which can help or can hinder performance. Some can be quite surprising. (If dead code ends up in the processor cache and evicting "real" code, that's not goodness.)

And non-trivial applications can have their own unique surprises -- this might well be your hottest code path. But I'd not assume that.

sccr13plyr · ‎12-11-2008

Dennis/Hoff,

Thanks for your responses. I had roughly tossed together a small prototype that would give the idea of what I was trying to accomplish. The goal was to try to understand whether unreachable code was being generated. It looks like it is... This would seem to be an OpenVMS C++ compiler issue.

To review:
We have two related algorithms at our site and are adding a new third algorithm. There is a lot of redundant code, and we want to refactor out as much as possible and have the compiler do all of the work. Our concept is to combine all three algorithms into one method.

This is fine, but we then have some unneeded comparisons and code making their way into the executable. So, we templatized the class looking for the compiler to generate the specializations for the declared classes.

Alg 1 becomes class 1::GenericLogic()
Alg 2 becomes class 2::GenericLogic()
Alg 3 becomes class 3::GenericLogic()

The desire is to have the compiler use the template parameters to optimize out unused branches for a particular class instantiation. Thus, class1::GenericLogic() machine code looks almost exactly like Alg 1 machine code, class 2::GenericLogic() looks almost exactly like Alg 2, and so on...
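If the optimizer cannot be relied on to fold the constant branches, one alternative construct is to let template specialization select the code, so that the untaken path is never instantiated for a given class. This is a minimal sketch that assumes the price check can be isolated into a helper; the names `PriceStep`, `SplitAlgorithm` and `Accept`, and the meaning given to the check, are hypothetical and not from the attached sources.

```cpp
// Primary template: selected when checkPrice is true.
template <bool checkPrice>
struct PriceStep
{
    static const char* Label() { return "Check Price"; }
    // Illustrative rule: accept only when the price has changed.
    static bool Accept(int price, int lastPrice) { return price != lastPrice; }
};

// Explicit specialization: selected when checkPrice is false.
// The comparison above does not exist in this version at all.
template <>
struct PriceStep<false>
{
    static const char* Label() { return "Ignore Price"; }
    static bool Accept(int, int) { return true; }
};

template <int id, bool checkPrice>
class SplitAlgorithm
{
public:
    int Id() const { return id; }
    const char* Label() const { return PriceStep<checkPrice>::Label(); }

    // No `if` on checkPrice: the call binds to exactly one PriceStep
    // version when the class is instantiated.
    bool Accept(int price, int lastPrice) const
    {
        return PriceStep<checkPrice>::Accept(price, lastPrice);
    }
};
```

The trade-off is that the shared logic has to be split into small steps, which works against the goal of keeping one readable body of code, so it is probably only worth doing for the few branches that matter.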

I placed a "stupid" output in a branch that should be optimized away. However, it shows up in the executables on both IA VMS and Windows.

As you may notice, I disable the BOOLEXPRCONST compiler message. So, the compiler knows that there is branching on an unchanging value. The "good" news is that the compiler seems to be generating three separate methods. So, it should be a simple matter of optimization for each instance.

I have attached a Windows compressed folder containing the two source files and the "build" command. As you can see, I tried to activate every possible tuning on the compiler.

Hoff · ‎12-11-2008

With classic C, non-static function declarations traditionally cannot be fully inlined. The compiler has to leave the entry points.

If I were here, I'd first confirm that this was a performance factor. (I've learned - the hard way - not to assume where the bottlenecks lurk. To always profile the code.) If this dead code was central, I'd then look to use static inlining, and then to moving from C++ to (maybe) C for the most critical of code paths, and (maybe) from smaller and more disjoint code to larger source modules with more aggressive inlining. But I'd only proceed down this path if I'd confirmed that this dead code is a factor.

Comments in the C++ code imply there's a transactional database here.

If there is a support contract in place, do call HP support and ask for discussions with the compiler and code generator teams. As an alternative, look at gaining access to the OpenVMS customer porting lab or such.

Dennis Handly · ‎12-12-2008

>The goal being was to try to understand whether unreachable code was being generated.

The EDG frontend generates code like:
if (1) {
It expects the optimizer to fix it.
The HP-UX optimizer, because it had to deal with a dumb aCC5 frontend, handles this fine.

>Hoff: With classic C, non-static function declarations traditionally cannot be fully inlined. The compiler has to leave the entry points.

Right. But they can be inlined at the call sites.
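A small sketch of the distinction being discussed, with hypothetical function names. A function with internal linkage cannot be referenced from another module, so once its calls are inlined the compiler may drop the out-of-line copy. A function with external linkage may still be inlined at each call site, but its entry point generally has to remain for other modules. What a particular compiler actually does with either is implementation dependent.

```cpp
// Internal linkage: visible only inside this translation unit.
static inline int scaleLocal(int value)
{
    return value * 3;
}

// External linkage: another module could reference this entry point.
int scaleExported(int value)
{
    return value * 3;
}

int useBoth(int value)
{
    // Both calls are candidates for inlining at this call site.
    return scaleLocal(value) + scaleExported(value);
}
```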

>ask for discussions with the compiler and code generator teams

I dropped them a message to have them look at this thread.
