cpu bottlenecks and telnetd errors in syslog

Rich Fink · ‎05-17-2006

Hi all,

I'm hoping someone can point me in the right direction here, as I'm about to pull out what little hair I have left.

We've got an old K460 running 10.20 which was rebooted in April. Last reboot before that was in January, before I started here.

We are experiencing multiple slowdowns every day, and a lot of messages in syslog. The system slowdowns generally last a few seconds, but can be up to 2 minutes.

Glance shows these cpu bottlenecks:

10:24:29 RED CPU Bottleneck probability= 100.00%
10:25:03 END End of CPU Bottleneck Alert
12:48:51 RED CPU Bottleneck probability= 100.00%
12:49:39 END End of CPU Bottleneck Alert
12:59:57 RED CPU Bottleneck probability= 100.00%
13:01:10 END End of CPU Bottleneck Alert
13:04:37 RED CPU Bottleneck probability= 100.00%
13:04:42 END End of CPU Bottleneck Alert
14:34:52 RED CPU Bottleneck probability= 100.00%
14:35:44 END End of CPU Bottleneck Alert
15:13:14 RED CPU Bottleneck probability= 100.00%
15:13:24 END End of CPU Bottleneck Alert
15:25:29 RED CPU Bottleneck probability= 100.00%
15:27:14 END End of CPU Bottleneck Alert

and syslog has numerous entries like this:

May 17 10:24:07 zaphod telnetd[28010]: recv: Connection reset by peer
May 17 10:24:08 zaphod telnetd[12440]: recv: Connection reset by peer
May 17 10:24:49 zaphod telnetd[28010]: Error checking child termination status: error 4: Interrupted system call
May 17 10:25:35 zaphod telnetd[17835]: recv: Connection reset by peer
.
May 17 10:51:29 zaphod telnetd[23752]: recv: Connection reset by peer
May 17 10:52:01 zaphod telnetd[23958]: Error checking child termination status: error 4: Interrupted system call
May 17 11:15:34 zaphod telnetd[1672]: recv: Connection reset by peer
.
May 17 13:50:55 zaphod telnetd[22592]: recv: Connection reset by peer
May 17 14:10:14 zaphod telnetd[1511]: recv: Connection timed out
May 17 14:35:37 zaphod telnetd[471]: setsockopt (TCP_NODELAY): Invalid argument
May 17 14:38:47 zaphod telnetd[1609]: recv: Connection reset by peer
.
May 17 15:12:23 zaphod telnetd[8146]: recv: Connection reset by peer
May 17 15:13:14 zaphod telnetd[8719]: setsockopt (TCP_NODELAY): Invalid argument
May 17 15:16:22 zaphod telnetd[6805]: recv: Connection reset by peer
May 17 15:25:55 zaphod telnetd[194]: recv: Connection reset by peer
May 17 15:26:59 zaphod telnetd[194]: Error checking child termination status: error 4: Interrupted system call

(you get the idea)

I have contacted HP, but they tell me first that "10.20 is no longer supported", and then to change things like /etc/shells, /etc/nsswitch.conf, and clear out wtmp/btmp. I have a hard time believing that /etc/shells or nsswitch, which haven't been changed in over 2 years, can suddenly be the culprit.

Any ideas/suggestions would be greatly appreciated!

-Rich

"UNIX is a user-friendly Operating System .. it's just picky about choosing its friends."
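
One way to check whether the telnetd errors cluster inside the slowdowns is to line up the two logs from the post above. The following is only a rough sketch in Python, not something tested on the 10.20 box itself: the function names are made up for illustration, the log lines are copied from the post, and it assumes every timestamp falls on the same day.

```python
from datetime import datetime

# Glance alert lines copied from the original post.
ALERTS = """\
10:24:29 RED CPU Bottleneck probability= 100.00%
10:25:03 END End of CPU Bottleneck Alert
12:48:51 RED CPU Bottleneck probability= 100.00%
12:49:39 END End of CPU Bottleneck Alert
12:59:57 RED CPU Bottleneck probability= 100.00%
13:01:10 END End of CPU Bottleneck Alert
13:04:37 RED CPU Bottleneck probability= 100.00%
13:04:42 END End of CPU Bottleneck Alert
14:34:52 RED CPU Bottleneck probability= 100.00%
14:35:44 END End of CPU Bottleneck Alert
15:13:14 RED CPU Bottleneck probability= 100.00%
15:13:24 END End of CPU Bottleneck Alert
15:25:29 RED CPU Bottleneck probability= 100.00%
15:27:14 END End of CPU Bottleneck Alert
"""

# A subset of the syslog lines from the original post.
SYSLOG = """\
May 17 10:24:07 zaphod telnetd[28010]: recv: Connection reset by peer
May 17 10:24:49 zaphod telnetd[28010]: Error checking child termination status: error 4: Interrupted system call
May 17 10:52:01 zaphod telnetd[23958]: Error checking child termination status: error 4: Interrupted system call
May 17 14:35:37 zaphod telnetd[471]: setsockopt (TCP_NODELAY): Invalid argument
May 17 15:13:14 zaphod telnetd[8719]: setsockopt (TCP_NODELAY): Invalid argument
May 17 15:26:59 zaphod telnetd[194]: Error checking child termination status: error 4: Interrupted system call
"""

TIME_FORMAT = "%H:%M:%S"


def parse_alert_windows(text):
    """Pair each RED line with the next END line; return (start, end) tuples."""
    windows = []
    start = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        stamp = datetime.strptime(parts[0], TIME_FORMAT)
        if parts[1] == "RED":
            start = stamp
        elif parts[1] == "END" and start is not None:
            windows.append((start, stamp))
            start = None
    return windows


def durations_seconds(windows):
    """Length of each alert window in whole seconds."""
    return [int((end - start).total_seconds()) for start, end in windows]


def messages_in_windows(syslog_text, windows):
    """Return the syslog lines whose time-of-day falls inside any alert window."""
    hits = []
    for line in syslog_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        # Fields are: month, day, HH:MM:SS, host, message...
        stamp = datetime.strptime(parts[2], TIME_FORMAT)
        if any(start <= stamp <= end for start, end in windows):
            hits.append(line)
    return hits


if __name__ == "__main__":
    windows = parse_alert_windows(ALERTS)
    print(durations_seconds(windows))  # [34, 48, 73, 5, 52, 10, 105]
    for line in messages_in_windows(SYSLOG, windows):
        print(line)
```

On the data posted, the longest alert lasts 105 seconds, which matches the "up to 2 minutes" description. Of the six sample syslog lines, four fall inside an alert window; the 10:52:01 "child termination" error does not.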

Tim Nelson · ‎05-17-2006

Rich,

From the information given, I would initially suspect that the CPU bottleneck is causing some telnet clients to time out, or that your users are ending their sessions because of the slowdown.

Resolve the CPU 100% and the telnet connection issue may go away.

Again, this is just a first thought based on the information provided.

sar, top, glance may lead you further.

Bill Hassell · ‎05-17-2006

From managing several K460's with 10.20, the "recv: Connection reset by peer" and related messages are normal and cannot be fixed. The problem is related to disconnected sessions: users who just blow away their telnet windows (rather than logging out), PCs that crash, or, for broadband users, connections that drop. You can try loading the last SupportPlus package of HWE and QPK patches, but I patched mine to the max (which is basically 2001-2002) and there was no change.

As far as CPU maxing out, I NEVER pay attention to the bottleneck alerts from Glance. They are simply a dumb message that pops up when CPU usage is maxed out, something that anyone can do with a 4-line script. Instead, run top and see who is using the most CPU time. Then post the results. It is not a problem to use a lot of CPU if the programs are supposed to do that. However, there are several HP-UX programs (like pwgrd and smtpd) that can be fixed at 10.20.

Bill Hassell, sysadmin

Victor BERRIDGE · ‎05-17-2006

Hi Rich,
Believe me, unless you put a stick of dynamite under your KXXX, it is very unlikely to fail because of load. Slowdowns, yes. Difficulty connecting, yes again. But that would mean that if you had top or glance running, you would see 100% CPU for at least 15 minutes...
So my 2 cents now:
The messages you see may be artifacts of the load on the dinosaur. Check whether you have zombies or abnormal load due to something such as a printer not responding, a badly terminated process, or a program stuck in a loop of some sort.
Uptime on these machines can be very high: my 11.00 box was rebooted on 31.12.1999 for Y2K reasons, and again last year because I had to move it to a new site.
The 10.20 box I reinstalled and patched in June last year, and it has been running fine since (although some days its load exceeds 11, with CPU at 100% for 15-30 minutes).

All the best
Victor

Rich Fink · ‎05-18-2006

Thanks for the comments. The 100% CPU spikes we're seeing are only of a short duration. But when they hit, everything slows to a crawl.

I've been running both glance and top at the time, and have not found any particular process that's eating up all the cpu cycles. Sometimes it's a user switching between menus, once it was unhashdaemon, and another time it was me running top. And even then, top only showed those processes at <40%, out of 400%.

Could this be a symptom of a cpu about to fail? We've had others go bad recently in other boxes, and with the age of these systems, it wouldn't surprise me.

Thanks again for your suggestions. I'll keep monitoring and see if I can get any more data when it spikes.

-Rich


Ninad_1 · ‎05-18-2006

Hi,

My observation is that a system responds very slowly when there are too many runnable processes. Run sar -uq and vmstat while the system is slow. In fact, since the slowdowns happen daily, you can start them before the problem occurs and keep them running, so that you can compare the stats during a slowdown against the stats when the system is running fine. Then see whether the run queue is too large and the system is performing too many context switches.

Regards,
Ninad
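
To make the comparison above less manual, captured vmstat output can be scanned for samples with a long run queue or heavy context switching. This is a minimal sketch in Python under several assumptions: the sample text is invented for illustration (it is not from Rich's system), the column layout with "r" for run queue and "cs" for context switches should be checked against what vmstat prints on your box, and the thresholds are arbitrary starting points.

```python
# Hypothetical vmstat capture: one quiet sample, one busy sample, one quiet sample.
SAMPLE = """\
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
    1     0     0    52340   10412    3    1     0    0     0    0     0    310   1200   410  12  5 83
    9     2     0    53110    9870    5    2     0    0     0    0     0    890   9400  7300  55 44  1
    2     0     0    52800   10100    2    1     0    0     0    0     0    330   1500   520  15  6 79
"""


def flag_busy_samples(vmstat_text, max_runq=4, max_cs=5000):
    """Return (sample_number, run_queue, context_switches) for busy samples.

    A sample is flagged when its run queue exceeds max_runq or its
    context-switch count exceeds max_cs. Both limits are illustrative.
    """
    rows = [line.split() for line in vmstat_text.splitlines() if line.strip()]
    # The header is the first row that names both columns we need.
    header_idx = next(i for i, cols in enumerate(rows) if "r" in cols and "cs" in cols)
    header = rows[header_idx]
    r_col = header.index("r")
    cs_col = header.index("cs")

    flagged = []
    for number, cols in enumerate(rows[header_idx + 1:], start=1):
        # Skip repeated headers or truncated lines.
        if len(cols) != len(header) or not cols[0].isdigit():
            continue
        runq = int(cols[r_col])
        switches = int(cols[cs_col])
        if runq > max_runq or switches > max_cs:
            flagged.append((number, runq, switches))
    return flagged


if __name__ == "__main__":
    for number, runq, switches in flag_busy_samples(SAMPLE):
        print(f"sample {number}: run queue {runq}, context switches {switches}")
```

With the invented sample, only the second line is flagged. Comparing the flagged sample times against the Glance alert times would show whether the run queue really is backing up during the slowdowns.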

Steven E. Protter · ‎05-18-2006

Shalom,

You might want to adapt this script:
http://www.hpux.ws/system.perf.sh

It's not tested on 10.20, but a lot of it should work.

Short term jumps to 100% are not a big problem.

In general, I'd suggest patching the system up to the latest patch set you have access to.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Bill Hassell · ‎05-18-2006

If you have disk mirroring, you may be hitting a huge bug in 10.20's handling of superpages. The symptoms are exhaustion of superpages in the kernel table, very high workloads (10 to 50 as seen by uptime), and very high system% as seen in sar. The problem can also be caused by defects in the VxFS code. You'll need to use an adb change to disable superpages. The adb change is:

adb -w /stand/vmunix
allow_superpage_text?W 0

which turns off superpages. This should eliminate those giant spikes of very high system overhead.

Bill Hassell, sysadmin

Rich Fink · ‎05-24-2006

UPDATE:

I did a simple reboot of the server this past Saturday morning. Since then we've had no system slowdowns or CPU bottlenecks. We still get the "connection reset by peer" messages in syslog, but no more "child termination" errors.

I'd like to thank everyone for their suggestions and advice. My main goal was to get the system stable enough so the users would stop yelling at me. :-) Now that they're off my back, hopefully I'll be able to get rid of the "connection reset" messages based on your input.

Thanks again,

-Rich


Rich Fink · ‎06-07-2006

Looks like the reboot resolved our problem.

We've been running fine for 2 weeks now, with no recurrence of the CPU Bottlenecks or telnetd 'Child Termination' errors.

Thanks to all for your suggestions. Now that the crisis has passed, hopefully I can work on the 'connection reset' messages - but they're not a cause of major concern right now.

-Rich

