Is this system overworked? What can be done?
05-24-2005 01:40 AM
I have a system with the following uname output:
UNAME: HP-UX boom B.11.11 U 9000/800 2364765051 unlimited-user license
16GB of physical memory.
This system runs Veritas NetBackup and is the only master server in the environment. It processes thousands of jobs per day and can back up upwards of 20TB per night (with the help of 50 SSO media servers).
Recently, there have been issues where the message queue used by NetBackup (0x412a4250) becomes nearly full and the system comes to a crawl. This results in having to shut down NBU, ipcrm the NBU message queues, and restart NBU.
The point of this post is to determine whether there are any pointers on kernel tuning, and whether there are other OS-related items that could be addressed to handle the load presented.
I'm attaching a kmtune -l output. Any advice is appreciated.
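(For reference, a rough sketch of how the fill level of that queue and the related kernel limits can be checked before resorting to ipcrm; the key 0x412a4250 is the one mentioned above:)
ipcs -qa | grep 0x412a4250     # compare CBYTES/QNUM against QBYTES to see how full the queue is
kmtune -l | grep msg           # msgmnb, msgtql, msgmni, msgseg, msgssz limits on 11.11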
05-24-2005 01:45 AM
Question for you: Is the system actually slow? Are there complaints or issues?
To measure performance and workload:
top
glance/gpm (disk 1 of the application CDs)
sar scripts (attached)
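(The attached scripts aren't reproduced here; a plain sar sampling loop along these lines collects similar data, and glance/gpm live under /opt/perf/bin if GlancePlus is installed. Intervals and output paths are just examples:)
# GlancePlus tools, if installed
/opt/perf/bin/glance           # character-mode interface
/opt/perf/bin/gpm &            # Motif GUI, needs a DISPLAY

# Basic sar sampling: CPU, run queue, disk, every 60 seconds for an hour
sar -u 60 60 > /tmp/sar_cpu.out &
sar -q 60 60 > /tmp/sar_runq.out &
sar -d 60 60 > /tmp/sar_disk.out &
# Afterwards look for sustained %sys or %wio, runq-sz above the CPU count, or busy disks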
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
05-24-2005 02:21 AM
Re: Is this system overworked? What can be done?
Thanks for the script, I'll give that a try. From a quick look, I assume it runs once and then you collect the output for review.
I have been using top, usually only when the message queue issue surfaces. It usually shows me NetBackup bpsched zombie processes. These processes are getting wiped and I'm not sure how or why; I'm thinking they don't get a chance to clean up correctly, which causes the message queue to start filling. Here's a recent top that I saved while the message queue issue was persisting:
System: boom Tue May 10 16:07:24 2005
Load averages: 0.00, 0.01, 0.02
296 processes: 268 sleeping, 22 running, 6 zombies
Cpu states:
CPU LOAD USER NICE SYS IDLE BLOCK SWAIT INTR SSYS
0 0.01 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
1 0.00 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
2 0.00 0.0% 0.0% 2.0% 98.0% 0.0% 0.0% 0.0% 0.0%
3 0.00 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
4 0.01 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
5 0.00 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
6 0.00 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
7 0.00 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 0.0%
--- ---- ----- ----- ----- ----- ----- ----- ----- -----
avg 0.00 0.0% 0.0% 1.0% 99.0% 0.0% 0.0% 0.0% 0.0%
Memory: 593164K (432480K) real, 1383724K (892260K) virtual, 12019980K free Page# 1/9
CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU COMMAND
4 ? 17800 root 149 20 32K 32K zomb 0:00 3.57 3.56 bpsched
5 ? 17798 root 134 20 32K 32K zomb 0:00 3.54 3.54 bpsched
1 ? 4644 b1patrol 154 21 38864K 30984K sleep 671:38 0.93 0.93 PatrolAgent
6 ? 42 root 152 20 7424K 7424K run 345:21 0.73 0.73 vxfsd
6 ? 9693 b1patrol 154 20 14564K 9924K sleep 163:43 0.66 0.65 bgscollect
5 ? 2387 root -16 20 40616K 18804K run 1284:44 0.44 0.44 midaemon
3 ? 2469 root 152 20 242M 24636K run 119:59 0.38 0.38 prm3d
0 ? 1880 root 152 20 214M 22340K run 71:56 0.29 0.29 java
5 pts/3 21492 root 168 20 6996K 5064K sleep 0:00 0.26 0.24 top
5 ? 3431 root 152 20 50892K 13672K run 16:24 0.22 0.21 vxsald
4 ? 3545 root 152 20 62072K 31480K run 50:39 0.20 0.20 mhragent
1 ? 5984 root 152 20 19152K 7612K run 3:18 0.18 0.18 rep_server
7 ? 5992 root 152 20 12872K 2372K run 2:25 0.18 0.18 agdbserver
5 ? 3310 root 152 20 21220K 5900K run 20:58 0.16 0.16 mstragent
0 ? 721 root 152 20 2632K 472K run 7:59 0.16 0.16 syncer
3 ? 5640 b1patrol 154 20 31760K 25336K sleep 97:24 0.15 0.15 bgsagent
4 pts/tb 21530 ma0496 158 20 2636K 252K sleep 0:00 0.25 0.14 ksh
6 ? 2907 root 152 10 5716K 1056K run 0:09 0.12 0.12 memlogd
1 ? 5994 root 152 20 25904K 14760K run 4:45 0.12 0.12 alarmgen
7 pts/tb 21549 root 158 20 556K 184K sleep 0:00 0.27 0.10 sh
6 ? 3086 root 152 20 64464K 45212K run 3:19 0.10 0.10 vxsvc
3 ? 3574 root 152 20 14536K 2804K run 0:41 0.10 0.10 samd
6 pts/tb 21548 ma0496 158 20 2152K 244K sleep 0:00 0.27 0.10 osh
5 pts/tb 21579 root 178 20 6864K 4940K run 0:00 2.00 0.10 top
4 ? 14112 root 152 20 32K 32K zomb 0:00 0.07 0.07 bpsched
7 ? 1926 root 154 20 1968K 356K sleep 36:10 0.06 0.06 pwgrd
6 ? 3242 root 168 20 9116K 704K sleep 13:12 0.06 0.06 mstragent
2 ? 1726 root 152 20 32676K 4728K run 0:00 0.06 0.06 vxsalperf
5 ? 2921 root 154 10 7220K 4244K sleep 18:47 0.06 0.06 psmctd
0 ? 19649 root 152 20 13568K 3568K run 0:34 0.06 0.06 mad
1 ? 3 root 128 20 32K 32K sleep 43:19 0.06 0.06 statdaemon
6 ? 2423 root 127 20 46752K 15960K sleep 62:06 0.05 0.05 scopeux
2 ? 7941 b1patrol 154 21 16060K 8732K sleep 10:37 0.05 0.05 dcm
I'll have to have a look at the glance product. I don't have any knowledge of what it does or what it can offer in this scenario.
Thanks again for the pointers.
05-24-2005 02:40 AM
Re: Is this system overworked? What can be done?
ps -ef | grep efu (this catches the <defunct> zombie entries), then look up the parent by PPID and kill both (except when the PPID is 1...).
The load should drop soon.
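Something along these lines (a rough sketch; double-check the PIDs before killing anything):
ps -ef | grep defunct | grep -v grep     # list zombie entries; note the PID and PPID columns
# For each zombie whose PPID is not 1, try terminating the parent so it reaps the child:
# kill <parent PID>     (substitute the PPID found above; never kill PID 1)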
All the best
Victor
05-24-2005 02:44 AM
Re: Is this system overworked? What can be done?
4 ? 17800 root 149 20 32K 32K zomb 0:00 3.57 3.56 bpsched
5 ? 17798 root 134 20 32K 32K zomb 0:00 3.54 3.54 bpsched
4 ? 14112 root 152 20 32K 32K zomb 0:00 0.07 0.07 bpsched
While you are doing this cleanup, I would also stop the Patrol monitoring; it's of no help here and is just adding extra load....
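To see what the Patrol agents are costing before stopping them (the b1patrol user and process names come from the top output above; the proper stop procedure depends on how Patrol was set up):
ps -fu b1patrol      # lists PatrolAgent, bgscollect, bgsagent, dcm, etc.
# Stop them through Patrol's own shutdown script if one is installed; killing the
# PatrolAgent PID directly may leave the collectors running, so check afterwards.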
All the best
Victor
05-24-2005 03:06 AM
Re: Is this system overworked? What can be done?
My comments on your top output:
All your processors are at almost 100% usage, but not good usage => mostly system, so you have an issue there. The first two processes shown in top are zombies, and you should find a way of killing them if you can (true zombies are difficult to kill...). If you had glance, it would tell you which processes are hogging or why they are blocked....
So the question to solve your issue is: why do my bpsched processes finish up as zombies?
I'm no NetBackup specialist (we don't use it here...), but I've seen our backup processes (TSM) do the same at times...
It seems to come from a communication failure. In my case it's dsmsched (notice the last letters?..): when either the server or the client is busy, the backup scheduler ends up waiting to schedule a backup that will never be made...
All the best
Victor
05-24-2005 03:47 AM
Re: Is this system overworked? What can be done?
From the output it looks like all the CPUs are almost 100% idle, and the load is 0.
Regards
05-24-2005 03:59 AM
Re: Is this system overworked? What can be done?
You are right! I misread a column...
I apologize....
All the best
Victor
05-24-2005 04:11 AM
Re: Is this system overworked? What can be done?
I notice you have MeasureWare running on the system. You should review its logs and see where the bottleneck is (disk? LAN card?), or perhaps it is a configuration issue with NetBackup. I'm not familiar with NB, so I can't offer any assistance with respect to the product.
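If digging through the MeasureWare logs is not convenient, a quick live check of the usual suspects works too (intervals are just examples):
sar -d 30 10      # per-disk %busy and average queue; sustained %busy near 100 means a bottleneck
sar -u 30 10      # CPU split; high %wio points at I/O waits, high %sys at kernel overhead
netstat -i        # per-interface packet and error/collision counts for the LAN cards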
Regards.