topic Re: Performance slower on RX8640 then Blade BL860c and rx2620 in Operating System - HP-UX

Performance slower on RX8640 then Blade BL860c and rx2620

isaac_loven — Wed, 25 Nov 2009 05:29:27 GMT

Hi All.
we have 2 RX8640 Servers, each with 3 fully populated cell boards ( 24cpu/192Gig in one npar) runing vpars and HPUX 11.23.
Oracle RAC 10G was running at 60% CPU load.
To increase capacity HP added a fourth cell board to each RX8640 (took it from a working Dev Rx8640 - same model number).

Servers booted fine, but when users started to use the database, all CPUs hit 100%, and i/o dropped to almost zero. We had to

delete the cell board from the npar to restore normal operation. Has anybody else had this problem?

As these are production servers it is hard to reproduce the error. So we created a test 64bit C program that adds the

contents of 2 large arrays and uses 1 Gig of memory. Average run times are as follows:
BL870c 10 seconds
BL860c 12 seconds
rx2620 12.2 sec
rx8640 with once cell board 17.2 sec
rx8640 with once cell board 23.5 sec
rx8640 with once cell board 28.5 sec

Do you think we have missed something when adding the cell board ( parmodify â p 0 â a 3:base:y:ri )? It this a memory

interleaving problem ?

This is the code. I am enclosing the executible. Can anybody else benchmark their servers please.

/* memtest.c */

#define BIG 100000000

int a[BIG],b[BIG],c[BIG];

int main()
{
int i,j;
for(j=0;j<10;++j)
for(i=0;i a[i]=b[i]+c[i];
return 0;
}

# cc +DD64 -o memtest memtest.c
# time ./memtest
Isaac Loven

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Kranti Mahmud — Wed, 25 Nov 2009 06:21:42 GMT

Hi Issac,

Check the firmware revision of the MP. Log to MP and run the command sysrev.

Rgds-Kranti

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Torsten. — Wed, 25 Nov 2009 06:26:02 GMT

It is only slow if the cell is added?

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Dennis Handly — Wed, 25 Nov 2009 08:28:13 GMT

> It this a memory interleaving problem?

Yes, most likely. Though "all CPUs hit 100%, and I/O dropped to almost zero" seem rather extreme.

Those blade and non-cell based systems can run rings around a cell based system if you don't get interleaving right.

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Don Morris_1 — Wed, 25 Nov 2009 11:52:41 GMT

The DB problem is one thing [and I can only make wild guesses given we have no real data - so I won't], but as far as the raw time issue... this isn't interleave being wrong, that's how interleave _works_.

Your program has no threads and I assume you run it such that only a single instance is instantiated at a time (multiple runs for the average, but as multiple executions sequentially, not a massive parallel execution). [I'd also comment that your program uses 100,000,000 * 3 64-bit integers, so you're really looking at 24 * 100,000,000 bytes or 2.235Gb [not 1Gb]].

So -- what you have is the following scenarios.

Assume a 1 to 4 cell nPartition.

The cost of accessing memory in the same cell as you are running is X.

The cost of accessing memory in an adjacent cell is Y (where X < Y).

For a single cell rx8640 -- all accesses are X.

When you add a cell, if all memory is configured to be interleaved, your private object has a 50% chance of getting a cache line in the same cell and a 50% chance of getting a remote cell (assuming balanced ILV, each cell contributes equal cache lines). Hence your application accesses at .5*X + .5*Y (which since Y is strictly greater than X is greater than X).

When you add _another_ cell, your accesses become .33*X + .66*Y [approx, 1/3 and 2/3 really].

And when you add the fourth cell, your accesses become .25*X + .75*Y.

As you can tell -- you approach Y instead of X (and if you went 8 cell this actually gets more interesting in that usually there's a higher cost for some cells rather than others). Hence why you see your output climbing with additional cells (that's how I interpret your "with once cell board" since otherwise, I'm not sure what you're saying).

Now if your application were moving processor context such that your accesses were also across the machine, you'd have a better chance of any given access being local -- this is what ILV is meant for, objects shared across the entire platform.

What you would want for your application to perform here is Cell Local memory (you'd configure each cell to only give 75% [or 50%, etc.]) to the interleave). Then all accesses in your program would stay X regardless of how many cells were in the system (assuming sufficient CLM in the cell the program is executing in, of course).

With 64Gb per cell (and a need for 2.235Gb for your program) configuring each cell to have 1/8th of memory cell local would probably be enough [assuming little else running to steal your CLM in the given execution context] to see more performant behavior. Since Oracle does work with a large shared memory set which is typically accessed from everywhere in the partition -- you certainly want to leave significant ILV configured, but you may want to consider reducing ILV. At a bare minimum you'd want to have enough for the SGA, your SYS memory load (since this is v2) and a reasonable extra for things like binaries, shared libraries, etc. Having some CLM available will help the process-private data accesses.

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Don Morris_1 — Wed, 25 Nov 2009 14:19:31 GMT

(No points on this, just a clarification).. sometimes I need to drink coffee first. 64-bit declaration of "int" is 4 bytes, of course -- not 8. One would think I'd remember that but my brain jumped to "64-bit.... int" and glued them together as "64-bit integer" (long).

So yes, 1Gb -- or close enough. Kind of irrelevant, but worth precluding a post where you have to tell me I screwed up.

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Torsten. — Wed, 25 Nov 2009 14:23:55 GMT

Nice explanation Don. But with vPars in use (configuration details???) all this become even a bit more complicated. So there isn't a solution possible without knowing all the details.

Re: Performance slower on RX8640 then Blade BL860c and rx2620

Don Morris_1 — Wed, 25 Nov 2009 16:42:42 GMT

Ah... good point, missed the "and vpars" in there since all the talk was about nPar operations.

Well then, the big thing to check would be that your cpu and memory align reasonably in the vPar.

Most especially, if you have a vPar with only a cell's (or sub-cell or really, really close to a cell) worth of processors and the I/O hubs are off of the same cell -- it would be worth your while to configure CLM such that the vPar can have either a little ILV and almost all memory as CLM. Effectively, you want a vPar on a multi-cell nPar to look like the smallest number of cells as possible if you want performance.

And with the caveat that this isn't official "This is supported" doctrine -- I configure IPF vPars with no ILV all the time. I swear every time folks ask this there's some firmware or vPar reason to keep some ILV around -- but in my opinion it is worth it if you have a vPar running in a single or sub-cell to be 100% local to that cell. And hence, it is worth a try to see if you can configure a vPar that way (since if the vPar won't load, you can just add some ILV [keep the nPar with some] and reload the vPar).

Any vPars you have which require more than a cell or two of resources (say you use 3 vPars, 2 fit in less than a cell apiece [maybe the same cell] and the other requires the other 3 cells), you can plan your nPar for it in a way that's good for the vPars but not the usual pattern for nPar mode. (In this case, you could configure the nPar with only 3 cells contributing to the ILV by 50%, and the 4th cell 100% CLM and place the two sub-cell vPars in the 4th cell with the multi-cell vPar getting all the ILV and resources of the other 3 cells).

Re: Performance slower on RX8640 then Blade BL860c and rx2620

isaac_loven — Wed, 13 Jan 2010 04:01:35 GMT

As this problem hit 2 production servers running Oracle 10G RAC. We had Oracle and HP-Mission critical support scratching their heads.
So we organised to replicate the problem at DR ( identical HW, adding 4th cell board to rx8640). With 3 cell boards, oracle ran fine. With 4 cell boards, the RAC slowed enormoursly and CPUs hit 100%.
After much testing we found the huge load is during the "sqlplus as /" connection, even before an sql is run ( we had to run 100 in parallel to see the problem).
Finally Oracle identified Bug 9205576: CONNECTION TAKES MORE TIME WITH PRE_PAGE_SGA=TRUE IN 4 CELL COMPARED TO 3 CELLS. The SGA was fully scanned on each sqlplus connection. Poblem is solved by changing pre_page_sga to false.

Thanks to all for taking the time to respond to my question.

Re: Performance slower on RX8640 then Blade BL860c and rx2620

isaac_loven — Wed, 13 Jan 2010 04:03:36 GMT

Solution found and 4th cell board added to production successfully.