
Compilation Intensive Load rx2660 vs rx5670 performance

 
SOLVED
Michael S Costello
Occasional Advisor

Compilation Intensive Load rx2660 vs rx5670 performance

I have an rx2660 dual-core IA64 running HP-UX 11.23 that is taking an intolerably long time to compile C++ code.

Does anyone have recommendations on optimizing for the "aCC" and "ecom" programs when they are apparently not memory bound, but bound by CPU performance and perhaps disk I/O?

I've been unable to build in less than 6 hours on a dual-core 1398 MHz rx2660 what a slightly beefier dual-core 1500 MHz rx5670 can build in 1.5 hours. The latter machine actually has less RAM installed.

I guess my question is: is the hardware and I/O performance difference between these two models really that pronounced?
If not, should I look to tuning and/or hardware failure as an explanation for the difference?
Dennis Handly
Acclaimed Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

It depends on your source sizes and what opt levels you are using.
How many sources?

Are your source and object directories on NFS?
Gokul Chandola
Trusted Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

Hi,
No sir, there are many factors; it requires a lot of deep analysis.

Regards,
Gokul Chandola
There is always some scope for improvement.
Michael S Costello
Occasional Advisor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

To answer Dennis:
Number of Source Files: 8218

# find . -type f -name '*.c*' | wc -l
8218

NFS: No, NFS is not used for any of the files involved except when a build completes; compilation is performed on a local filesystem.

Building 64-bit code, using the appropriate flags for that purpose.
Don Morris_1
Honored Contributor
Solution

Re: Compilation Intensive Load rx2660 vs rx5670 performance

Since there are a bunch of reads/writes going on during compilations -- did you perchance lower dbc_min_pct / dbc_max_pct on the rx2660 and shrink the buffer cache? Throttling the cache could be aggravating I/O bottlenecks.
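
For reference, a minimal sketch of checking and resetting those tunables with kctune on 11.23; the values shown are just the usual defaults, not a recommendation, and some tunables may need a reboot to take effect:

# kctune dbc_min_pct
# kctune dbc_max_pct
# kctune dbc_min_pct=5
# kctune dbc_max_pct=50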
Michael S Costello
Occasional Advisor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

Don, thanks for the response; here are the values I've already tried.
I had been tuning the rx2660's dbc_max_pct and dbc_min_pct across the following ranges with no difference in compilation time; it stays fixed at 6 hours with each.

* defaults 50/5 max/min
* 20/20 (as configured on the dev machine rx5670)
* 90/5 (most recent setting)

Top-down make time was unaffected by each, though with the middle setting other tunables were also changed, essentially attempting to slavishly match the 5670's values. I suspect the 5670 had been tuned by HP hosts during a 64-bit porting seminar some years ago.

I'm going to do a complete reset to default kctune values later today to establish a baseline time, see what information I can get with those settings, and then apply whatever different values anyone suggests here.

I'm seeing about getting a copy of Glance as well, but it's hard to know when I'll be able to run that tool for this.
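
In the meantime, the stock tools can give a rough CPU-versus-I/O picture while a build runs; a minimal sketch, with arbitrary intervals and counts:

# vmstat 5           (run queue, paging pi/po, and the CPU user/sys/idle split)
# sar -u 5 12        (CPU utilization, one minute of 5-second samples)
# sar -d 5 12        (per-disk %busy, average queue and service times)
# iostat 5           (basic per-disk throughput)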
Dennis Handly
Acclaimed Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

You might want to look at caliper.
Also what version of aC++ do you have?
You didn't mention how many CPU cores you have or whether you are doing parallel makes.
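
For what it's worth, a minimal way to collect that information on 11.23 (the output will of course differ per box):

# aCC -V                     (compiler version string)
# machinfo                   (CPU count, clock, family/model/revision, firmware)
# ioscan -fnkC processor     (another view of the processor count)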
Michael S Costello
Occasional Advisor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

Dennis:
Caliper is something that is available, but I honestly don't know what to look for with it.

aCC version: HP C/aC++ Developer's Bundle C.11.23.15.2

The builds are non-parallel makes; gmake performs them, and enabling extra jobs causes compilation to fail, so it's not running parallel makes on the dev machine either.

Both machines have a single CPU with dual cores. The rx5670 has "beefier" MHz values, but I believe the rx2660 has a family 32 CPU and the 5670 a family 31 model 1 revision 5 (an older, pre-9000-series Itanium 2), which makes me wonder if one can downgrade families for better performance. :)

Also, HyperThreading is enabled on both (not 100% sure on the 5670, though); toggling threading on the 2660 didn't have a noticeable effect. Since the jobs are not parallel, I don't know that the additional CPUs will help at all; it would just make the system more responsive while one CPU is loaded.

While the compiles are rolling, 100% utilization alternates between CPU 0 and 1, with the idle CPU hovering at 10-30%, at least as top reports things.
Dennis Handly
Acclaimed Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

>aCC version: HP C/aC++ Developer's Bundle C.11.23.15.2

This is not the version I want, and it takes a secret decoder ring to get A.06.20 out of it. What does "aCC -V" show?
And you didn't mention what opt level you are using.

>The builds are non-parallel makes; gmake performs them, and enabling extra jobs causes compilation to fail, so it's not running parallel makes on the dev machine either.

Then you aren't using the resources you have. Why does it fail? (Out of swap or bad makefiles?) You probably need more memory/swap.

>HyperThreading is enabled on both

I didn't think you could do that on 11.23 and you probably don't want it.

>I don't know that the additional CPUs will help at all

You do multiple compiles at once.
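
For example, assuming GNU make and makefiles with complete dependencies, something along these lines would use both cores (the job count and the "all" target are only placeholders):

# gmake -j 2 all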
Michael S Costello
Occasional Advisor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

aCC -V output:

aCC: HP C/aC++ B3910B A.06.20 [May 13 2008]

Optimization level: the -fast flag is used for most of the .o files produced, along with the -D64 and -W*** flags (which aren't optimizations per se); no -O# optimizations are being used.

Memory/Swap:
Physical memory: 4079.1 MB
Real memory: active/total ~ 300 MB / 500 MB
Virtual memory: active/total ~ 700 MB / 800 MB
Swap is 8 GB, and approximately 500-700 MB of that is used.

So how do I make the machine forgo using swap, if it is thrashing on swap and seems to have enough physical memory to get by just fine?

Threads: the initial runs were with threading disabled; threading was enabled later in the game to see if it would affect performance either way, and it hasn't. I (admittedly ignorantly) suspect the 11.23 kernel might be ignoring the EFI-set threading behavior anyway, since it doesn't measurably change behavior or introduce breakage.

Multiple compiles: the product's existing makefiles do not support multiple-job makes (apologies, I thought I had written that into an earlier posting), so that's not an option, but it does rule out the other machine using multiple jobs to explain the difference.
Dennis Handly
Acclaimed Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

>-fast flag is used for most of the .o files produced

That's +O2. You might want to see how fast the default (+O1) is.

-D64, you mean +DD64?

> Physical Memory: 4079.1 MB

Probably not enough if you do more than one compile.

>So how do I make the machine forgo using swap, if it is thrashing on swap and seems to have enough physical memory to get by just fine?

You don't have enough physical memory if swap is being used. What does "swapinfo -tam" show?

>I suspect the 11.23 kernel might be ignoring the EFI-set threading behavior

I think so since it is only supported on 11.31.

>The product's existing makefiles do not support multiple-job makes (apologies, I thought I had written that into an earlier posting)

You did write that but I wanted to know why it fails. (Out of swap or using something that hasn't been built yet?) It is well worth the time to add the dependencies so that parallel make works.

Doing that may cut the builds in half.
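
As a minimal sketch of the kind of rule that has to be spelled out before -j is safe, assuming a generated header; the file names, generator script and compile flags are made up for illustration, and recipe lines must start with a tab:

# version.h is produced by another rule, so anything that includes it must
# name it as a prerequisite, or "gmake -j" may compile it too early
version.h: make_version.sh
        ./make_version.sh > version.h

widget.o: widget.C version.h
        aCC +DD64 -c widget.C -o widget.o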
Michael S Costello
Occasional Advisor

Re: Compilation Intensive Load rx2660 vs rx5670 performance



Dennis (thanks for all of your help BTW):
This was taken while in the thick of things running a test build:

# swapinfo -tam
              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         8192      19    8173    0%       0        -    1  /dev/vg00/lvol2
reserve        -    1241   -1241
memory      4075    1360    2715   33%
total      12267    2620    9647   21%       -        0    -

This is in the thick of the compile. I read the dev line as 8192 MB of swap available with 19 MB used, and the reserve line as space merely "reserved"; of the 4075 MB of physical (non-swap) memory, 1360 MB is used.

Physical memory never seems to be entirely consumed, and swapping is minimal. The rx2660 has twice the non-swap RAM and the long build times. If it is RAM related, it baffles me how the other machine gets by at all in a quarter of the time with only 2 GB of RAM and some swap!


I'll try taking out the -fast flag (which should result in +O1), and yes, +DD64 was the flag. Several other flags are on the line that aren't necessarily optimizations, and they vary depending on the stage of the build. I'll see about picking a representative set of flags and posting it.
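
In the meantime, a hypothetical before/after compile line for one file (the source name is a placeholder, and the real lines also carry -W*** and other flags that vary by build stage as noted above):

# aCC +DD64 -fast -c widget.C -o widget.o      (current: -fast, roughly +O2 per Dennis)
# aCC +DD64 -c widget.C -o widget.o            (experiment: drop -fast and take the default, +O1)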
Dennis Handly
Acclaimed Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

>Physical memory never seems to be entirely consumed, and swapping is minimal.

Yes.
Olivier Masse
Honored Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

I find it really hard to believe that an rx5670 could outperform a Montecito-based rx6600. There's really something wrong here.

This might be a long shot, but I've seen cases in the past of mysteriously bad performance on a server that didn't have the advanced OnlineJFS features enabled. I think this can happen if you initially install the wrong OE. If you're entitled to a license, compare /etc/inittab on both servers and look for a "vxen" entry. Then compare /etc/fstab on each server, paying attention to the mount options.
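
A minimal way to make that comparison on both servers, assuming VxFS filesystems (the exact inittab entry name may differ from "vxen"):

# grep -i vxen /etc/inittab       (OnlineJFS enablement entry, if present)
# swlist | grep -i jfs            (which JFS/OnlineJFS products are installed)
# grep vxfs /etc/fstab            (configured mount options, e.g. delaylog)
# mount -v | grep vxfs            (options actually in effect right now)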

Good luck
Olivier Masse
Honored Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

Er, I meant rx2660 of course. :)
Michael S Costello
Occasional Advisor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

Dennis:

swapinfo -tam for the rx5670 idle and during build:

Idle
~ % swapinfo -tam
              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         4096     199    3897    5%       0        -    1  /dev/vg00/lvol2
reserve        -     280    -280
memory      2037    1257     780   62%
total       6133    1736    4397   28%       -        0    -

During initial component build
~ % swapinfo -tam
              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         4096     199    3897    5%       0        -    1  /dev/vg00/lvol2
reserve        -     280    -280
memory      2037    1257     780   62%
total       6133    1736    4397   28%       -        0    -

Later measurements
              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         4096     199    3897    5%       0        -    1  /dev/vg00/lvol2
reserve        -     338    -338
memory      2037    1260     777   62%
total       6133    1797    4336   29%       -        0    -

Two samples during the most resource-intensive components:
              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         4096     450    3646   11%       0        -    1  /dev/vg00/lvol2
reserve        -     871    -871
memory      2037    1261     776   62%
total       6133    2582    3551   42%       -        0    -

              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         4096     731    3365   18%       0        -    1  /dev/vg00/lvol2
reserve        -     595    -595
memory      2037    1262     775   62%
total       6133    2588    3545   42%       -        0    -

Back to a less intensive phase
              Mb      Mb      Mb   PCT  START/       Mb
TYPE       AVAIL    USED    FREE  USED   LIMIT  RESERVE  PRI  NAME
dev         4096     210    3886    5%       0        -    1  /dev/vg00/lvol2
reserve        -     389    -389
memory      2037    1262     775   62%
total       6133    1861    4272   30%       -        0    -

Dennis Handly
Acclaimed Contributor

Re: Compilation Intensive Load rx2660 vs rx5670 performance

>swapinfo -tam for the rx5670 idle and during build:
>memory      2037    1261     776   62%
>total       6133    2582    3551   42%

This shows you might be able to do two or more builds at a time.