<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic High percentage of processes blocked on STREAMS in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213621#M570929</link>
    <description>Hi,&lt;BR /&gt; &lt;BR /&gt;It's still this performance drain that we are trying to trace.&lt;BR /&gt;We have looked at all sorts of resource consumption that could manifest as a performance bottleneck,&lt;BR /&gt;but couldn't find evidence of a real bottleneck on the part of either the server or the network.&lt;BR /&gt;So I've come to the conclusion that it must be due to badly managed IPC on the part of the application.&lt;BR /&gt;What puzzles me is that the overwhelming wait reason for processes is STREAMS.&lt;BR /&gt;I first had to consult the knowledge base to find out what was actually meant by STRMS, and there I discovered this interesting paper:&lt;BR /&gt; &lt;BR /&gt;&lt;A href="http://www4.itrc.hp.com/service/cki/docDisplay.do?hpweb_printable=true&amp;amp;docLocale=en_US&amp;amp;admit=-938907319+1078851566775+28353475&amp;amp;docId=200000066917792" target="_blank"&gt;http://www4.itrc.hp.com/service/cki/docDisplay.do?hpweb_printable=true&amp;amp;docLocale=en_US&amp;amp;admit=-938907319+1078851566775+28353475&amp;amp;docId=200000066917792&lt;/A&gt;&lt;BR /&gt; &lt;BR /&gt;It confirmed my assumption that this mainly refers to IPC via BSD sockets&lt;BR /&gt;(or, to be more precise, to API calls into the HP-UX STREAMS transport layer interface, viz. the functions listed in /opt/perf/lib/mikslp.text).&lt;BR /&gt; &lt;BR /&gt;The paper says that a high percentage of STRMS block reasons does not necessarily mean a network bottleneck; it may simply mean a lot of IPC over sockets.&lt;BR /&gt; &lt;BR /&gt;My question now is whether I can pass the buck to the application developers/implementers to have a closer look at why their procs are being blocked.&lt;BR /&gt; &lt;BR /&gt;&lt;BR /&gt;To give you an impression:&lt;BR /&gt;Right now, a sample taken with glance, while the peak of connections is already over, reveals some 79% of procs blocked on STRMS.&lt;BR /&gt; &lt;BR /&gt;# cat yyy&lt;BR /&gt;print gbl_stream_wait_pct, gbl_pri_wait_pct&lt;BR /&gt;# glance -adviser_only -syntax yyy -j 10 -iterations 10 2&amp;gt;/dev/null&lt;BR /&gt;  78.8   0.0&lt;BR /&gt;  78.9   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.6   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.6   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;</description>
    <pubDate>Tue, 09 Mar 2004 12:43:10 GMT</pubDate>
    <dc:creator>Ralph Grothe</dc:creator>
    <dc:date>2004-03-09T12:43:10Z</dc:date>
    <item>
      <title>High percentage of processes blocked on STREAMS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213621#M570929</link>
      <description>Hi,&lt;BR /&gt; &lt;BR /&gt;It's still this performance drain that we are trying to trace.&lt;BR /&gt;We have looked at all sorts of resource consumption that could manifest as a performance bottleneck,&lt;BR /&gt;but couldn't find evidence of a real bottleneck on the part of either the server or the network.&lt;BR /&gt;So I've come to the conclusion that it must be due to badly managed IPC on the part of the application.&lt;BR /&gt;What puzzles me is that the overwhelming wait reason for processes is STREAMS.&lt;BR /&gt;I first had to consult the knowledge base to find out what was actually meant by STRMS, and there I discovered this interesting paper:&lt;BR /&gt; &lt;BR /&gt;&lt;A href="http://www4.itrc.hp.com/service/cki/docDisplay.do?hpweb_printable=true&amp;amp;docLocale=en_US&amp;amp;admit=-938907319+1078851566775+28353475&amp;amp;docId=200000066917792" target="_blank"&gt;http://www4.itrc.hp.com/service/cki/docDisplay.do?hpweb_printable=true&amp;amp;docLocale=en_US&amp;amp;admit=-938907319+1078851566775+28353475&amp;amp;docId=200000066917792&lt;/A&gt;&lt;BR /&gt; &lt;BR /&gt;It confirmed my assumption that this mainly refers to IPC via BSD sockets&lt;BR /&gt;(or, to be more precise, to API calls into the HP-UX STREAMS transport layer interface, viz. the functions listed in /opt/perf/lib/mikslp.text).&lt;BR /&gt; &lt;BR /&gt;The paper says that a high percentage of STRMS block reasons does not necessarily mean a network bottleneck; it may simply mean a lot of IPC over sockets.&lt;BR /&gt; &lt;BR /&gt;My question now is whether I can pass the buck to the application developers/implementers to have a closer look at why their procs are being blocked.&lt;BR /&gt; &lt;BR /&gt;&lt;BR /&gt;To give you an impression:&lt;BR /&gt;Right now, a sample taken with glance, while the peak of connections is already over, reveals some 79% of procs blocked on STRMS.&lt;BR /&gt; &lt;BR /&gt;# cat yyy&lt;BR /&gt;print gbl_stream_wait_pct, gbl_pri_wait_pct&lt;BR /&gt;# glance -adviser_only -syntax yyy -j 10 -iterations 10 2&amp;gt;/dev/null&lt;BR /&gt;  78.8   0.0&lt;BR /&gt;  78.9   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.6   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.6   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;  78.7   0.0&lt;BR /&gt;</description>
      <pubDate>Tue, 09 Mar 2004 12:43:10 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213621#M570929</guid>
      <dc:creator>Ralph Grothe</dc:creator>
      <dc:date>2004-03-09T12:43:10Z</dc:date>
    </item>
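As a rough illustration of how the adviser script in the post above could be taken one step further, the same run can also name the processes that contribute most of the STRMS blocking. This is a sketch only: the process loop construct, the if/then form, and the metric names proc_proc_name, proc_proc_id and proc_stream_wait_pct are recalled from the GlancePlus adviser syntax and should be checked against the adviser documentation shipped with glance before use; the file name zzz is just a placeholder like yyy.

  # cat zzz
  print gbl_stream_wait_pct, gbl_pri_wait_pct
  process loop {
    if proc_stream_wait_pct > 50 then
      print "  ", proc_proc_name, " pid ", proc_proc_id, " strm wait ", proc_stream_wait_pct
  }
  # glance -adviser_only -syntax zzz -j 10 -iterations 10 2>/dev/null

If the same few server processes show up in every interval, that points at the application's IPC pattern rather than at a system-wide shortage.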
    <item>
      <title>Re: High percentage of processes blocked on STREAMS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213622#M570930</link>
      <description>Hi Ralph:&lt;BR /&gt;&lt;BR /&gt;There is still not enough data for a definitive answer. This may be perfectly normal behavior. Your cooperating processes may be sending large numbers of rather small packets, each side waiting for the other to complete some task. Yours is a prime example of why I insist that developers work on some old, sluggish boxes. That way, potentially poor designs manifest themselves before they are discovered only on production boxes. This would be my test: intentionally throttle your network (I assume 100Mbit or Gigabit) down to 10Mbit, and make sure this is done on both ends of each connection. Your developers (I'm sure they are blaming the hardware/network/tuning) will tell you that this will slow you down tremendously (10X if we assume a 100Mbit to 10Mbit decrease), but I am willing to bet that the actual slowdown will be extremely small. If so, the network is not the bottleneck; the software protocol is the more likely suspect.&lt;BR /&gt;</description>
      <pubDate>Tue, 09 Mar 2004 13:02:34 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213622#M570930</guid>
      <dc:creator>A. Clay Stephenson</dc:creator>
      <dc:date>2004-03-09T13:02:34Z</dc:date>
    </item>
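A sketch of how the 10 Mbit test described above might be set up on the HP-UX side. The lanadmin speed/duplex options shown are driver-dependent assumptions recalled from memory, to be checked against lanadmin(1M) and the NIC's documentation, and the switch port at the other end of each link has to be forced to match.

  lanscan                 # list the LAN interfaces and their PPA numbers
  lanadmin -x 0           # show the current speed/duplex of PPA 0 (assumed option)
  lanadmin -X 10FD 0      # force PPA 0 to 10 Mbit full duplex (assumed, driver-dependent value)
  lanadmin -x 0           # verify the change took effect

If response times barely move at 10 Mbit, the link speed was never the limiting factor and the application's request/reply pattern is the better suspect.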
    <item>
      <title>Re: High percentage of processes blocked on STREAMS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213623#M570931</link>
      <description>Ralph,&lt;BR /&gt;&lt;BR /&gt;As you already discovered, this doesn't in itself mean a network bottleneck. I treat it just like disk IO, where the process waits on its request to complete. In the case of sockets, processes show as blocked on streams instead of showing as sleeping. I would first try to understand the nature of the transactions my system does before going to the developers.&lt;BR /&gt;&lt;BR /&gt;If there is a network bottleneck, it would be reflected in the GBL_NETWORK_ERR*, GBL_NETWORK_SUBSYSTEM_QUEUE and BYNETIF* metrics.&lt;BR /&gt;&lt;BR /&gt;-Sri&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 09 Mar 2004 13:51:44 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213623#M570931</guid>
      <dc:creator>Sridhar Bhaskarla</dc:creator>
      <dc:date>2004-03-09T13:51:44Z</dc:date>
    </item>
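For completeness, the metrics Sri mentions can be sampled with the same adviser mechanism the original post used. A sketch only: gbl_network_subsystem_queue is quoted in the reply above, while the netif loop keyword and the bynetif_name / bynetif_error_rate names are assumptions from memory that should be verified against the glance metric help before relying on them.

  # cat netchk
  print gbl_stream_wait_pct, gbl_network_subsystem_queue
  netif loop {
    print "  ", bynetif_name, " err/s ", bynetif_error_rate
  }
  # glance -adviser_only -syntax netchk -j 10 -iterations 10 2>/dev/null

A stream wait percentage near 80 alongside a network subsystem queue near zero and no interface errors supports the "lots of socket IPC, no network shortage" reading.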
    <item>
      <title>Re: High percentage of processes blocked on STREAMS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213624#M570932</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt; Here is some data on Glance and the Blocked On Streams message. As stated in the earlier posts, this does not indicate a network bottleneck but is typically an application issue. Good luck, hope this helps. &lt;BR /&gt;&lt;BR /&gt;A. Global and Process Blocked States:&lt;BR /&gt;&lt;BR /&gt;   On HP-UX 11.X GlancePlus assigns a wait state to a process&lt;BR /&gt;   using a text file, /opt/perf/lib/mikslp.text, which provides a&lt;BR /&gt;   translation between the specific kernel routine that caused the&lt;BR /&gt;   process to last change its state and the wait state which will&lt;BR /&gt;   be assigned to the process.  The global wait states are simply&lt;BR /&gt;   the sum of the individual process wait states.  Therefore, the&lt;BR /&gt;   sum of all the processes blocked on streams as a percentage of&lt;BR /&gt;   the total number of processes on the system gives the % blocked&lt;BR /&gt;   on streams as a global wait state.&lt;BR /&gt;&lt;BR /&gt;   The GlancePlus Help subsystem in both glance and GPM provides&lt;BR /&gt;   help on this metric (GBL_STREAM_WAIT_PCT).&lt;BR /&gt;&lt;BR /&gt;   The definition states the following:&lt;BR /&gt;&lt;BR /&gt;     "The percentage of time processes or threads were blocked&lt;BR /&gt;      on streams IO (waiting for a streams IO operation to&lt;BR /&gt;      complete) during the interval.&lt;BR /&gt;&lt;BR /&gt;      This is calculated as the accumulated time that all&lt;BR /&gt;      processes or threads spent blocked on STRMS (that is,&lt;BR /&gt;      streams IO) divided by the accumulated time that all&lt;BR /&gt;      processes or threads were alive during the interval..."&lt;BR /&gt;&lt;BR /&gt;   So to start with we are looking at a global metric that buckets&lt;BR /&gt;   all the individual process and thread wait states when they are&lt;BR /&gt;   blocked on STREAMS.&lt;BR /&gt;&lt;BR /&gt;   Basically we sum the wait state values of all running threads and&lt;BR /&gt;   processes during each measurement interval.&lt;BR /&gt;&lt;BR /&gt;   So the interesting component now becomes the individual thread&lt;BR /&gt;   and process wait state for being blocked on streams.&lt;BR /&gt;&lt;BR /&gt;   GlancePlus also collects this data in the metrics&lt;BR /&gt;   PROC_STREAM_WAIT_PCT and THREAD_STREAM_WAIT_PCT.  The definitions&lt;BR /&gt;   of these metrics simply state that they represent the percentage&lt;BR /&gt;   of time that a process or thread was blocked on streams IO.&lt;BR /&gt;&lt;BR /&gt;   All the data that glance collects on process and thread wait&lt;BR /&gt;   states is either via kernel instrumentation or by calls to&lt;BR /&gt;   pstat().&lt;BR /&gt;&lt;BR /&gt;B. Blocked on Streams:&lt;BR /&gt;&lt;BR /&gt;   Since the old 10.20 'blocked on SOCKET' has now transferred&lt;BR /&gt;   to the stream IO subsystem, perhaps it may be useful to revisit&lt;BR /&gt;   what blocked on socket means.&lt;BR /&gt;&lt;BR /&gt;   A socket wait does not mean there is a network bottleneck.&lt;BR /&gt;   Programs using socket communication will spend a large chunk&lt;BR /&gt;   of their time blocked on socket, due to the nature of sockets.&lt;BR /&gt;   Only one side can have access at a time, and the size of the&lt;BR /&gt;   socket is only 8k.  
This means if you have two processes&lt;BR /&gt;   communicating, one of them will always be blocked on socket&lt;BR /&gt;   waiting for the other one to get done.&lt;BR /&gt;&lt;BR /&gt;   Also, rather than blocking on SLEEP when it is waiting for the&lt;BR /&gt;   next request, it will block on the socket looking for the data&lt;BR /&gt;   that will appear.&lt;BR /&gt;&lt;BR /&gt;   This is the important point:  The wait state of streams may&lt;BR /&gt;   simply be indicating that processes (or threads) are&lt;BR /&gt;   conducting IPC and waiting (listening) on a socket (or its&lt;BR /&gt;   streams replacement) for some data.  Where IPC is involved&lt;BR /&gt;   it becomes very difficult to know if the wait state is due&lt;BR /&gt;   to some sort of system-wide resource shortage, or if it is&lt;BR /&gt;   simply normal IPC for the applications involved.&lt;BR /&gt;&lt;BR /&gt;   There is a good chance that a system showing a large amount&lt;BR /&gt;   of time blocked on streams is simply running an application&lt;BR /&gt;   that is IPC intensive.  As with most System V IPC mechanisms,&lt;BR /&gt;   it is better to make sure that there are adequate (or better)&lt;BR /&gt;   values for the relevant kernel parameters than to suffer a shortage.&lt;BR /&gt;&lt;BR /&gt;   One simple test to prove a point would be for an administrator who&lt;BR /&gt;   is experiencing a high value for GBL_STREAM_WAIT_PCT to run glance&lt;BR /&gt;   prior to starting up an application (perhaps after a reboot or&lt;BR /&gt;   scheduled maintenance).  The value will probably remain low until&lt;BR /&gt;   the application comes up.&lt;BR /&gt;&lt;BR /&gt;   All the data that Glance collects on process and thread wait&lt;BR /&gt;   states is via kernel instrumentation or by calls to 'pstat()'.&lt;BR /&gt;&lt;BR /&gt;   NOTE:  The following is an excerpt from /opt/perf/lib/mikslp.text.&lt;BR /&gt;          It is a list of specific kernel routines that relate to the&lt;BR /&gt;          state of 'blocked on streams' in Glance:&lt;BR /&gt;&lt;BR /&gt;          str_sched_up_daemon             BLOCKED_STREAM&lt;BR /&gt;          str_sched_mp_daemon             BLOCKED_STREAM&lt;BR /&gt;          str_sched_blk_daemon            BLOCKED_STREAM&lt;BR /&gt;          str_mem_daemon                  BLOCKED_STREAM&lt;BR /&gt;          str_weld_daemon                 BLOCKED_STREAM&lt;BR /&gt;          read_sleep                      BLOCKED_STREAM&lt;BR /&gt;          hpstreams_read_tty              BLOCKED_STREAM&lt;BR /&gt;          write_sleep                     BLOCKED_STREAM&lt;BR /&gt;          getmsg                          BLOCKED_STREAM&lt;BR /&gt;          getpmsg                         BLOCKED_STREAM&lt;BR /&gt;          putmsg                          BLOCKED_STREAM&lt;BR /&gt;          putpmsg                         BLOCKED_STREAM&lt;BR /&gt;          runq_remove                     BLOCKED_STREAM&lt;BR /&gt;          _csq_acquire                    BLOCKED_STREAM&lt;BR /&gt;          streams_mpsleep                 BLOCKED_STREAM&lt;BR /&gt;          hpstreams_close_int             BLOCKED_STREAM&lt;BR /&gt;          hpstreams_write_int             BLOCKED_STREAM&lt;BR /&gt;          ioctl_sleep                     BLOCKED_STREAM&lt;BR /&gt;          str_istr_ioctl                  BLOCKED_STREAM&lt;BR /&gt;          str_plumb_ioctl                 BLOCKED_STREAM&lt;BR /&gt;          str_head_ioctl                  BLOCKED_STREAM&lt;BR /&gt;          str_trans_ioctl                 BLOCKED_STREAM&lt;BR /&gt;     
     str_async_ioctl                 BLOCKED_STREAM&lt;BR /&gt;          str_alive_ioctl                 BLOCKED_STREAM&lt;BR /&gt;          str_socket_ioctl                BLOCKED_STREAM&lt;BR /&gt;          hpstreams_option1               BLOCKED_STREAM&lt;BR /&gt;          str_tty_ioctl                   BLOCKED_STREAM&lt;BR /&gt;          streams_poll                    BLOCKED_STREAM&lt;BR /&gt;          streams_poll1                   BLOCKED_STREAM&lt;BR /&gt;          osr_run                         BLOCKED_STREAM&lt;BR /&gt;          osr_close_subr                  BLOCKED_STREAM&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 10 Mar 2004 08:39:52 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213624#M570932</guid>
      <dc:creator>Todd Whitcher</dc:creator>
      <dc:date>2004-03-10T08:39:52Z</dc:date>
    </item>
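To make the definition quoted in the post above concrete with illustrative round numbers (not measurements from Ralph's system): in a 10-second interval with 100 processes alive, the accumulated alive time is 1000 process-seconds. If 80 of those processes are socket servers that spend essentially the whole interval sitting in a blocking read waiting for the next request, they accumulate roughly 800 process-seconds blocked on STRMS, and GBL_STREAM_WAIT_PCT reports about 800/1000 = 80%, much like the 78.x% figures in the first post, without the network being short of anything.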
    <item>
      <title>Re: High percentage of processes blocked on STREAMS</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213625#M570933</link>
      <description>You might take some tusc traces of the application(s) and examine those; it might be best to make them verbose traces with timestamps and printing of pid and lwpid.&lt;BR /&gt;&lt;BR /&gt;Before you go there, you might look at process system calls in glance ("L" IIRC).&lt;BR /&gt;&lt;BR /&gt;Also examine your netstat statistics.  You may find beforeafter useful there:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="ftp://ftp.cup.hp.com/dist/networking/tools/" target="_blank"&gt;ftp://ftp.cup.hp.com/dist/networking/tools/&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;and then &lt;BR /&gt;&lt;BR /&gt;&lt;A href="ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_netstat.txt" target="_blank"&gt;ftp://ftp.cup.hp.com/dist/networking/briefs/annotated_netstat.txt&lt;/A&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 10 Mar 2004 13:58:21 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/high-percentage-of-processes-blocked-on-streams/m-p/3213625#M570933</guid>
      <dc:creator>rick jones</dc:creator>
      <dc:date>2004-03-10T13:58:21Z</dc:date>
    </item>
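A sketch of how the suggestions above might be strung together. Diffing two netstat -s snapshots is the usual way beforeafter is applied, but the tusc option letters below are assumptions from memory and worth confirming against tusc's usage message before tracing a production process.

  # snapshot the netstat counters, let the workload run, then diff the snapshots
  netstat -s > /tmp/net.before
  sleep 300
  netstat -s > /tmp/net.after
  beforeafter /tmp/net.before /tmp/net.after > /tmp/net.delta

  # attach tusc to one suspect process and write the trace to a file
  # (-f assumed to follow forked children, -o assumed to name the output file;
  #  add the verbose/timestamp/pid options mentioned above once confirmed)
  tusc -f -o /tmp/app.tusc PID_OF_SUSPECT_PROCESS

The annotated_netstat.txt brief linked above then helps decide which of the counters in /tmp/net.delta actually matter.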
  </channel>
</rss>

