Operating System - HP-UX

SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

 


Hello,

I am having a performance issue when running multiple processes on a Superdome that open and read the same input file. The Superdome has 8 CPUs.
There is obviously contention, but it is much worse on the Superdome than on other SMP systems.
Are there any known issues regarding the scheduler on the Superdome? Also, how can I profile the kernel to help isolate this?
In the supporting data you will note the much higher than expected system time.

The test program is a simple C program that performs an open, 1000 single-character reads and then a close.
The program scales much better on a 4-processor L3000 and also on a 4-processor rp5670 (HPUX 11.22).

If I start the test program 6 times and bind all 6 processes to a single CPU, the problem is not apparent. As soon as I let the scheduler decide, or bind each of the processes to a separate CPU, the run times are much longer: over four times as long as 6 parallel tasks on a single CPU.
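For reference, binding can also be done from inside a program rather than with mpsched. This is only a minimal sketch, assuming HP-UX's mpctl(2) interface with the MPC_GETNUMSPUS, MPC_GETFIRSTSPU and MPC_SETPROCESS requests; it is not the method used in the tests below, and the mpctl(2) man page on your release should be checked before relying on it.

/* Sketch: bind the calling process to one processor on HP-UX.
 * Assumes mpctl(2) with MPC_GETNUMSPUS / MPC_GETFIRSTSPU / MPC_SETPROCESS. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mpctl.h>

int main(void)
{
    int nspus = mpctl(MPC_GETNUMSPUS, 0, 0);   /* number of processors */
    int spu   = mpctl(MPC_GETFIRSTSPU, 0, 0);  /* first usable SPU id  */

    if (nspus < 0 || spu < 0) {
        perror("mpctl");
        exit(1);
    }

    /* Bind this process to the chosen SPU. */
    if (mpctl(MPC_SETPROCESS, spu, getpid()) < 0) {
        perror("mpctl(MPC_SETPROCESS)");
        exit(1);
    }

    printf("bound pid %ld to SPU %d of %d\n", (long)getpid(), spu, nspus);
    return 0;
}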


Supporting data
----------------
Same file opened allowing the scheduler to decide which CPU to use for each of the processes.

for i in 1 2 3 4 5 6^Jdo^Jtimex ./iotestparm sys.cfg &^Jdone
[1] 14861
[2] 14862
[3] 14863
[4] 14864
[5] 14865
[6] 14866

real 1:23.77
user 0.72
sys 1:19.12

real 1:23.86
user 0.72
sys 1:19.16

real 1:23.97
user 0.72
sys 1:19.32

real 1:24.09
user 0.72
sys 1:19.36

real 1:24.17
user 0.72
sys 1:19.46

real 1:24.16
user 0.72
sys 1:19.46

Same file opened, but all six processes forced to run on CPU 2
for i in 1 2 3 4 5 6^Jdo^Jmpsched -c 2 timex ./iotestparm sys.cfg &^Jdone
[1] 14945
[2] 14946
[3] 14947
[4] 14948
[5] 14949
[6] 14950
#
real 26.27
user 0.71
sys 3.76

real 26.34
user 0.71
sys 3.76

real 26.49
user 0.71
sys 3.76

real 26.28
user 0.71
sys 3.76

real 26.26
user 0.71
sys 3.76

real 26.48
user 0.71
sys 3.76

C program

/* Test program: repeatedly open the file named on the command line,
 * read it one character at a time until EOF, then close it. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int fd;
    char c[1];
    int iread;
    int j;

    for (j = 0; j < 2000; j++)
    {
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
        {
            printf("open file error ");
            exit(1);
        }

        for (;;)
        {
            iread = read(fd, c, 1);   /* one byte per read() system call */
            if (iread == 0)           /* end of file */
            {
                close(fd);
                break;
            }
        }
    }
    return 0;
}

7 REPLIES
Mic V.
Esteemed Contributor

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

Laurence,

Nice command line. :)

Sorry I can't solve it, but maybe some additional data will help someone else answer.

Has anything changed in the environment? Did it always behave this way (i.e., worse on the Superdome than on other systems), or is this the first time it's being tried (i.e., no baseline)? Which C compiler did you use?

Maybe describe the *PAR configuration, as well as the system's patch status.

Mic

What kind of a name is 'Wolverine'?

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

Hi Mic,
This issue has been apparent from the start on the Superdome. The C compiler is HP's ANSI C compiler.

I already ran cxperf but this pretty much told me the same stuff I am seeing in Glance.
The read system call rate is much lower.

I am a Tru64 guy, and we have a tool called DLPI that will tell me where in the kernel I am spending time. This is probably what I need for HPUX.

Thanks for the help so far.

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

I meant to add:

The issue that puzzles me most is that smaller systems seem to scale better. Both
a Keystone and an L3000 show better scaling. That is the real question here.

The system has the latest Gold patch bundle.
The problem appears to be a lock contention issue, because even when I force each of the six processes not to migrate (i.e. run one on each processor), the problem is still there.

The only way to improve the performance is to schedule all on a single processor.

I understand the SMP scaling issues quite well on Tru64 and Solaris. I am a Digital/Compaq convert, so my knowledge of the kernel behavior on HPUX is limited.

Todd McDaniel_1
Honored Contributor

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

Laurence,

I get the impression this is a multi-partitioned box... is that the case? It seems rather strange that a box that can support 32/64 CPUs has only 8 at the moment, unless this is an nPar. I'm only curious...

To your question:
It seems to me that your issue is kernel related, but more directly your kernel parms...

What are your hard and soft file kernel parms, maxfiles and maxfiles_lim?

I had a situation where maxfiles_lim was too low and the number of files that could be open concurrently was restricted...
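As a quick cross-check of those per-process limits from inside a program, here is a minimal sketch. It assumes only the standard getrlimit(2) interface with RLIMIT_NOFILE; the kernel parms themselves set the defaults those limits start from.

/* Sketch: print the per-process soft and hard open-file limits. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("open files: soft limit = %ld, hard limit = %ld\n",
           (long)rl.rlim_cur, (long)rl.rlim_max);
    return 0;
}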

I will include my kernel parms for my 48-way Superdome. It may offer you a guide, even though I have 72 GB of memory...

These are not all, but the major ones...

max_async_ports 50 - 50
max_fcp_reqs 512 - 512
max_mem_window 3 - 3
max_thread_proc 256 - 256
maxdsiz 0x50000000 - 0X50000000
maxdsiz_64bit 0x40000000 - 0X40000000
maxfiles 500 - 500
maxfiles_lim 1024 Y 1024
maxqueuetime - - 0
maxssiz 0x800000 - 0X800000
maxssiz_64bit 0x800000 - 0X800000
maxswapchunks 5000 - 5000
maxtsiz 0x4000000 Y 0X4000000
maxtsiz_64bit 0x40000000 Y 0X40000000
maxuprc 4000 Y 4000
maxusers 2000 - 2000
maxvgs 255 - 255
modstrmax 500 - 500
msgmax 8192 Y 8192
netmemmax - - 0
nfile 29529 - (16*(NPROC+16+MAXUSERS)/10+32+2*(NPTY+NSTRPTY+NSTRTEL))
scsi_max_qdepth 8 Y 8
scsi_maxphys 1048576 - 1048576
sema 1 - 1
semaem 16384 - 16384
semmap 1026 - (SEMMNI+2)
semmni 1024 - 1024
semmns 8000 - 8000
semmnu 1024 - 1024
semmsl 2048 Y 2048
semmsl_override - - 2048
semume 64 - 64
semvmx 32767 - 32767
sendfile_max 0 - 0
shmmax 1073741824 Y 1073741824
Unix, the other white meat.

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

Hello,

Thanks for your information so far.

After further analysis using kgmon and kprof, the issue appears to be related to SMP lock contention and gets worse the more CPUs you have.

I got performance equal to a Keystone when I removed a cell and ran the 6 processes on only 4 CPUs.


After performing kernel tracing it is clear that the problem is due to lock contention when multiple CPUs have
the file open. The scheduler does not appear to have any effect on the problem.

The more CPUs there are in the system, the worse the lock contention, and that is why the Superdome shows the
worst performance. The high system time is also much more apparent when the lock contention takes place.
As already mentioned, keeping each process running on a separate CPU makes no difference.

I performed further tests after reducing the number of CPUs on the Superdome to four.
The performance was then equal to, if not a little better than, the Keystone.


Supporting data
------------------

kgmon/kprof data when running all 6 processes on a single CPU

(6) K-spitfire::root cat kgmon_second_run_on_CPU2.txt
kprof compiled Oct 29 2003 15:35:08
kprof invoked Fri Jan 9 13:25:24 2004


granularity: each 2 byte sample hit represents
0.01 seconds spent executing 4 bytes of instructions.
The total CPU time sampled is 172.11 seconds across 8 CPUs (21.51 wallclock)


flat profile:

%time the percentage of the total running time of the
program used by this function.

cumsecs a running sum of the number of seconds accounted
for by this function and those listed above it.

seconds the number of seconds accounted for by this
function alone. This is the major sort for this
listing.

calls the number of times this function was invoked, if
this function is profiled, else blank.

name the name of the function. This is the minor sort
for this listing.


%time cumsecs seconds calls msec/call name
91.3 157.19 157.19 idle
1.2 159.17 1.98 IN_USER_MODE
0.8 160.50 1.33 syscall
0.5 161.34 0.84 spinlock
0.4 161.97 0.63 syscallinit
0.3 162.49 0.52 $syscallrtn
0.3 162.96 0.47 read
0.3 163.42 0.46 vx_fast_read
0.2 163.75 0.33 lbzero_pcxu_method
0.2 164.08 0.33 vx_rdwr
0.2 164.34 0.26 brelse
0.2 164.60 0.26 ftlb_patch9_end
0.1 164.85 0.25 setjmp
0.1 165.08 0.23 getf
0.1 165.31 0.23 spinunlock
0.1 165.53 0.22 spluser
0.1 165.75 0.22 vx_read1
0.1 165.96 0.21 vasusage
0.1 166.15 0.19 lacc
.
.Truncated
0.0 172.09 0.01 vx_logbuf_get
0.0 172.10 0.01 vx_readnomap

kgmon/kprof data when running across CPUs
(10) K-spitfire::root cat kgmon_second_run_on_any_CPU.txt
kprof compiled Oct 29 2003 15:35:08
kprof invoked Fri Jan 9 13:28:00 2004


granularity: each 2 byte sample hit represents
0.01 seconds spent executing 4 bytes of instructions.
The total CPU time sampled is 608.35 seconds across 8 CPUs (76.04 wallclock)


flat profile:

%time the percentage of the total running time of the
program used by this function.

cumsecs a running sum of the number of seconds accounted
for by this function and those listed above it.

seconds the number of seconds accounted for by this
function alone. This is the major sort for this
listing.

calls the number of times this function was invoked, if
this function is profiled, else blank.

name the name of the function. This is the minor sort
for this listing.


%time cumsecs seconds calls msec/call name
60.7 369.22 369.22 idle
12.9 447.69 78.47 preArbitration
11.2 515.71 68.02 wait_for_lock
2.4 530.54 14.83 spinlock
1.5 539.44 8.90 spinunlock
0.9 544.93 5.49 brelse
0.8 549.63 4.70 getblk1
0.7 554.10 4.47 vx_read1
0.6 557.98 3.88 ogetblk
0.6 561.38 3.40 owns_spinlock
0.6 564.75 3.37 IN_USER_MODE
0.4 567.47 2.72 vx_fast_read
0.4 570.13 2.66 dbc_reference
0.4 572.33 2.20 wakeup
0.3 574.20 1.87 spinlockx
0.3 575.82 1.62 syscall
0.3 577.35 1.53 vx_active_common_flush
0.2 578.80 1.45 _setrq
0.2 580.25 1.45 find_thread_my_spu
0.2 581.58 1.33 vx_read_ahead
0.2 582.90 1.32 _sleep
0.2 584.15 1.25 vx_notavail
0.2 585.35 1.20 vx_bmap
0.2 586.36 1.01 vx_ilock
.
.Truncated
0.0 608.27 0.01 vx_do_pageout
0.0 608.28 0.01 vx_event_wait
0.0 608.29 0.01 vx_lookup
0.0 608.30 0.01 vx_mapstrategy
Mic V.
Esteemed Contributor

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

I seem to remember seeing a lot of patches relating to Superdome. I wonder if some might be applicable. This thing is annoying.

I don't think I can be much more help -- I don't have a Superdome to test with -- but my thoughts are: see if the ANSI C documentation discusses kernel profiling; maybe try a GCC compile in case it's related to the generated code; upgrade your C compiler if applicable; and perhaps watching things in Glance (you can install it free from the Application CDs for a trial run) might suggest a solution. You can trace a variety of things related to a specific process (in the curses version, at least).

Good luck,
Mic
What kind of a name is 'Wolverine'?

Re: SMP scaling issue only on Superdome with HPUX 11i, how do I profile the kernel

Hi everybody,

I finally put this issue to rest using spinwatcher, as this enabled me to isolate the particular locks in question. Special thanks to Ken Johnson and Marcus Ostrowicki.

I learned a lot about profiling the kernel on HP-UX using kgmon/kprof and spinwatcher.
Hey, I need to, as Tru64 Unix goes away.

Here is my case update. We will be asking the customer to tune his application.
----------------------------------------

ICBC Summary
The problem seen at ICBC was characterized and the kernel was traced to analyze the slow
performance seen on the Superdome.

MSE was able to show that the issue is specifically SMP lock contention on two kernel buffer-cache locks,
and that it becomes more apparent the more CPUs a machine has installed.
The locks in particular are:
spn%kern cumpct seconds spn%cpu lock name
17.27 17.27 68.25 17.24 bufhash lock <<<<<<
4.37 21.65 17.28 4.36 buf_free_list_lock <<<<<

The method used in the test application causes the kernel to spend far too much time managing locks
rather than moving data, because of the constant thrashing of these locks.
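One way to tune the application without changing its character-at-a-time logic is to let a user-space buffer absorb most of the read() system calls. The following is only a minimal sketch, not the change MSE made (their fix, shown further below, simply reads 32 bytes per call); it assumes nothing beyond standard stdio, whose internal buffering means each fgetc() only occasionally reaches the kernel.

/* Sketch: same per-character processing, but stdio's internal buffer
 * means only occasional read() system calls reach the kernel. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    FILE *fp;
    int ch;
    int j;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        exit(1);
    }

    for (j = 0; j < 2000; j++) {
        fp = fopen(argv[1], "r");
        if (fp == NULL) {
            perror("fopen");
            exit(1);
        }
        while ((ch = fgetc(fp)) != EOF) {
            /* process one character at a time, as before */
        }
        fclose(fp);
    }
    return 0;
}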

By reducing the number of CPUs on the Superdome we saw a significant reduction in lock contention.

We further believe that the IBM system will also see degradation when more CPUs are added.

MSE analyzed the test program and made a small change:
instead of reading one byte at a time, we tested with 8-, 16- and 32-byte reads.

One-byte reads completed 32 simultaneous tasks in 249 seconds;
32-byte reads completed 32 simultaneous tasks in 8.33 seconds.

That is roughly a 30x speedup, i.e. about a 3000% improvement.

Test with one byte reads
#!/usr/bin/ksh
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
do
timex ./iotestparm sys.cfg &
done

real 4:09.32
user 0.34
sys 1:01.93

Test with 32 byte reads
#!/usr/bin/ksh
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
do
timex ./iotestparm32 sys.cfg &
done

real 8.33
user 0.01
sys 1.98


Modified Program reading 32 bytes at a time.

/* Modified test program: identical structure, but reads 32 bytes per
 * read() call instead of 1, which drastically reduces the system-call
 * and buffer-cache lock traffic. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int fd;
    char c[32];
    int iread;
    int j;

    for (j = 0; j < 1000; j++)
    {
        /* Note: O_RDONLY|O_RDWR resolves to O_RDWR, since O_RDONLY is 0. */
        fd = open(argv[1], O_RDONLY|O_RDWR, 0600);
        if (fd < 0)
        {
            printf("open file error ");
            exit(1);
        }
        for (;;)
        {
            iread = read(fd, c, 32);   /* 32 bytes per read() system call */
            if (iread == 0)            /* end of file */
            {
                close(fd);
                break;
            }
        }
    }
    return 0;
}