Operating System - OpenVMS

$GETRMI returning SS$_SUSPENDED

 
SOLVED
Mark Finn
Occasional Visitor

$GETRMI returning SS$_SUSPENDED

This error code is not documented as a possible return value of $GETRMI. What does it mean, and what might I be doing wrong to cause it? Thanks for any help anyone can give.
13 REPLIES
Jon Pinkley
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

Can you show us the code?

Just the minimum to reproduce what you see?

Jon
it depends
John Gillings
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

Mark,

You're probably not doing anything wrong. The exact reason is most likely dependent on what your item list is asking for and the state of the system at the time of your call.

Generally the system uses SS$_SUSPENDED when "something" is preventing the collection of data from a process. For example, SHOW PROCESS/CONTINUOUS will say "Process is suspended" when it's really in a MWAIT state, and thus can't respond.

Although it's not really a correct usage of SS$_SUSPENDED, it's expedient.

I'd guess that you're asking for a statistic that needs to be gathered from another process, and at the time the process is not responding. This may be a symptom of a real problem, or could be just timing (for example, the process is in RWSCS, waiting for a response from another node).

Post a summary of your item list, and maybe we can have a guess as to which item is the cause.

Of course, I'm assuming that this is an occasional, transient error. If it's repeatable, can you trim down your item list to the minimum required to get the error?

If it's transient, you need to decide what data to use for your missing sample point. Zero? Infinity? Missing? What data does $GETRMI return (if any)?
A crucible of informative mistakes
Mark Finn
Occasional Visitor

Re: $GETRMI returning SS$_SUSPENDED

$GETRMI returns good status; the SS$_SUSPENDED is coming in iosb.L0.

The error happens every few minutes, so repeating it is not a problem. Unfortunately, while I could iteratively remove itemcodes until I find the culprit, it would be very impractical because I'd have to release the software each time (it's running in a production environment and is not having this problem in the development environment - of course), and each release takes months.

I can list the itemcodes. Here they are:
RMI$_CPUIDLE, RMI$_CPUINTSTK, RMI$_CPUMPSYNCH, RMI$_CPUKERNEL, RMI$_CPUEXEC, RMI$_CPUSUPER, RMI$_CPUUSER, RMI$_DIRIO, RMI$_BUFIO
Jon Pinkley
Honored Contributor
Solution

Re: $GETRMI returning SS$_SUSPENDED

Mark,

What is different between the development environment and production environment? Lack of load? Single processor vs. SMP? Different versions of VMS? Different architectures?

What version of VMS is running on your production server, and what type of processor is it?

Here's the description from the SSREF manual (July 2006, OpenVMS I64 Version 8.3 OpenVMS Alpha Version 8.3)

--------------------------------------------------------------------------------

$GETRMI

Returns system performance information about the local system. $GETRMI is an asynchronous system service and requires the $SYNCH service or another wait-state synchronous mechanism to guarantee that the required information is available. There is no synchronous wait form for this system service.
For additional information about system service completion, see the Synchronize ($SYNCH) service.


--------------------------------------------------------------------------------

Format
SYS$GETRMI [efn] [,nullarg] [,nullarg] ,itmlst [,iosb] [,astadr] [,astprm]


--------------------------------------------------------------------------------

So, if you are using only the documented functionality, it isn't clear to me what process it could be waiting on (re John Gillings' comment). It isn't like $GETJPI where process specific information is being retrieved, and the documentation suggests it can't return information from another node in a cluster.

The data being requested is coming from cells in S0, for the itemcodes you list.

Are param 2 and 3 specifying 0 by value, or are you treating them like a $GETSYI call?

Does your $getrmi call look similar to this (lifted from http://www.eight-cubed.com/examples/framework.php?file=sys_getrmi.c )

r0_status = sys$getrmi (efn,
                        0,
                        0,
                        itemlist,
                        &iosb,
                        0,
                        0);


Are you using an event flag or ast completion for notification that the data is ready? If you aren't synchronizing, that could explain why it appears to work on a lightly loaded system, but sometimes fails on your production system.

$GETRMI is much more likely to have bugs than $GETSYI, since it is relatively new (7.3-2?) and is probably used much less frequently than $GETSYI. So it is possible that there is a bug or undocumented feature. But there is also the possibility that your code has a bug, and since we can't see how you are calling the service, and what synchronization you are using, a bug there can't be ruled out.

Can you try running the program on your test system with a low priority, and generate some load with something like
sys$test:uetp.com and possibly some compute intensive processes?

Good luck,

Jon
it depends
Hein van den Heuvel
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

I concur with Jon's initial question, and detailed follow up... show me the code!

>> This error code is not documented as a possible return value of $GETRMI.

Correct, and it looks like it is not a system service return code.

>> $GETRMI returns good status; the SS$_SUSPENDED is coming in iosb.L0.

Is the code waiting to look into the iosb until it is done, typically after a $synch call?

What is the scope of the iosb variable?

The only system service documented to return SS$_SUSPENDED is SYS$GETJPI. Is the program also using that service?

Is the program using the same iosb for both calls?

Good luck,
Hein.



John Gillings
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

Mark,

> $GETRMI returns good status; the SS$_SUSPENDED is coming in iosb.L0.

This is normal for an asynch service. The good return status means your call was well formed. The iosb status is the result of the request. I'm assuming you're properly synchronizing with $SYNCH, or equivalent?
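To make the two-stage check concrete, here is a minimal portable C sketch. It is not real VMS code: the demo_iosb layout and the demo_ names are simplified stand-ins I made up for illustration, and on VMS you would use the definitions from the system headers instead. The only VMS convention it relies on is that a condition value with the low bit set means success.

```c
#include <stdint.h>

/* Low bit set = success, low bit clear = failure (VMS condition value rule). */
#define DEMO_SUCCESS(status) (((status) & 1u) != 0)

/* Simplified, hypothetical stand-in for an I/O status block. */
typedef struct {
    uint32_t status;    /* completion status, valid only after the request is done */
    uint32_t reserved;
} demo_iosb;

/*
 * Pick the status that actually describes the outcome of an asynchronous
 * request. call_status is what the service call returned; iosb is the
 * block the request was given.
 */
uint32_t demo_final_status(uint32_t call_status, const demo_iosb *iosb)
{
    if (!DEMO_SUCCESS(call_status))
        return call_status;   /* request was never queued; the IOSB means nothing */
    return iosb->status;      /* request was queued; the IOSB holds the real result */
}
```

A good return status with a bad IOSB status, which is what Mark describes, comes out of demo_final_status as the IOSB value. The sketch assumes the caller has already waited for completion before reading the block.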

Of your item codes, my suspect is DIRIO and BUFIO. All the CPU stuff is readily available from system data cells, but, depending on the definition, I/O counts might require gathering data from all(?) processes on the system. So even one process in an MWAIT state might give you SS$_SUSPENDED.

Looking at the time series of the data you're gathering, do you see any pattern in the returned data depending on the status? Try outputting your samples in T4 format, add a column with 0/1 depending on the state of SS$_SUSPENDED. Look at it under TLVIZ. First cut just do a "CORRELATE" against the status column.

>it would be very impractical because I'd
>have to release the software each time
>(it's running in a production environment
>and is not having this problem in the
>development environment - of course), and
>each release takes months.

So you need to write a baby program that just exercises this issue. I'd break the item list into two. One with all the CPU stuff, and one with the IO. Run it on your production system, in parallel with your production code. Yes, you'll get arguments, but do they want to answer this question or not? I'd also add items checking counters of process states. See if there are any MWAIT states reported, and if they correlate with the SS$_SUSPENDED.
A crucible of informative mistakes
Jon Pinkley
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

With high probability, RMI$_DIRIO is coming from PMS$GL_DIRIO, and RMI$_BUFIO is coming from PMS$GL_BUFIO. If that were not the case, and instead the PHD$L_DIOCNT and PHD$L_BIOCNT fields from every PHD were being summed on each call, the values returned could decrease between calls, as some processes may have terminated since the previous call.
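The argument about summed per-process counts can be illustrated with a small portable C sketch. Everything here is hypothetical (the demo_process structure and the numbers are invented for illustration and have nothing to do with the real PHD layout); it only shows why a total computed over live processes can go down when a process exits, whereas a system-wide cell that is only ever incremented cannot.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process record, standing in for a per-process I/O count. */
typedef struct {
    int      alive;      /* nonzero while the process exists */
    uint32_t dio_count;  /* direct I/O operations charged to this process */
} demo_process;

/* Sum the counts of the processes that still exist. */
uint32_t demo_sum_live_dio(const demo_process *procs, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].alive)
            total += procs[i].dio_count;
    }
    return total;
}

/* Returns 1 if the summed total dropped after one process terminated. */
int demo_sum_can_decrease(void)
{
    demo_process procs[] = { {1, 500}, {1, 300}, {1, 200} };
    uint32_t before = demo_sum_live_dio(procs, 3);

    procs[1].alive = 0;          /* one process terminates */
    procs[0].dio_count += 10;    /* the others keep doing I/O */

    uint32_t after = demo_sum_live_dio(procs, 3);
    return after < before;
}
```

If the values returned for RMI$_DIRIO never decrease between samples, that is consistent with a single system-wide counter rather than a per-call sum.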

I think Hein's conjecture about the IOSB being shared with a $GETJPI call is much more likely.

The following is a great checklist when programs fail intermittently. It specifically addresses synchronization bugs that SMP systems tend to bring out of hiding.

http://h71000.www7.hp.com/wizard/wiz_1661.html

Since you have specified you are using an IOSB, and assuming you are using $synch, check three things. First, make sure that the IOSB is in memory that remains valid for the duration between the initial $getrmi and the $synch (using static storage is by far the easiest method to ensure that). Second, make sure that the $getrmi and the $synch are using the same iosb. Third, make sure that nothing else is using the memory used by the IOSB: don't share this static storage with other concurrent asynch operations. For example, an asynch $getjpi using the same IOSB as a concurrent $getrmi could produce results like you see.
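Here is a portable C simulation of the shared-IOSB failure mode. It makes no real system service calls: demo_complete() is a made-up stand-in for the system writing a completion status, and the DEMO_ values are placeholders chosen only so that the odd one means success and the even one means failure. Real code would take its condition values from ssdef.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for an I/O status block. */
typedef struct {
    uint32_t status;
    uint32_t reserved;
} demo_iosb;

/* Placeholder condition values (not the real SS$_ numbers). */
enum { DEMO_NORMAL = 1, DEMO_SUSPENDED = 2 };

/* Stand-in for the system writing the final status when a request completes. */
static void demo_complete(demo_iosb *iosb, uint32_t final_status)
{
    iosb->status = final_status;
}

/* One block per outstanding request, in static storage so it stays valid. */
static demo_iosb rmi_iosb;
static demo_iosb jpi_iosb;

/* Wrong: two concurrent requests share one block; the later completion wins. */
uint32_t demo_shared_iosb_result(void)
{
    static demo_iosb shared;
    demo_complete(&shared, DEMO_NORMAL);     /* the "$getrmi" finishes fine */
    demo_complete(&shared, DEMO_SUSPENDED);  /* a "$getjpi" finishes afterwards */
    return shared.status;                    /* the "$getrmi" caller reads this */
}

/* Right: each request has its own block, so the statuses cannot mix. */
uint32_t demo_separate_iosb_result(void)
{
    demo_complete(&rmi_iosb, DEMO_NORMAL);
    demo_complete(&jpi_iosb, DEMO_SUSPENDED);
    return rmi_iosb.status;
}
```

With the shared block, the caller of the first request sees the second request's failure status even though its own request succeeded. That is the kind of symptom described in this thread, though the sketch cannot show that this is what is actually happening in Mark's program.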

The following are some threads that describe problems that can arise with incorrect IOSB usage.

sys$qiow(efn$c_enf,...,iosb,...) - must iosb be specified?

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1163915

ASTs corrupting stack frames in DECC 6.5 /optimize

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=942947

Good luck,

Jon
it depends
Jon Pinkley
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

Here is good circumstantial evidence that the PMS cells are the source of RMI$_DIRIO and RMI$_BUFIO items.

The sys_getrmi_dir_buf is a slightly modified version of sys_getrmi.c from James Duff's examples. See attached .zip file that contains everything you need to recreate the source code.

Here's an example run on a 4 processor ES40 running VMS 7.3-2

OT$ analyze/system

OpenVMS (TM) system analyzer

SDA> read sys$loadable_images:sysdef
%SDA-I-READSYM, 10724 symbols read from SYS$COMMON:[SYS$LDR]SYSDEF.STB;1
SDA> eval @pms$gl_dirio
Hex = 00000000.13C1D0BB Decimal = 331468987
SDA> eval @pms$gl_bufio
Hex = 00000000.07F87943 Decimal = 133724483
SDA> spawn run sys_getrmi_dir_buf
DIRIO: 331470726
BUFIO: 133725603
SDA> eval @pms$gl_dirio
Hex = 00000000.13C1D979 Decimal = 331471225
SDA> eval @pms$gl_bufio
Hex = 00000000.07F87EBC Decimal = 133725884
SDA>

Jon
it depends
Richard J Maher
Trusted Contributor

Re: $GETRMI returning SS$_SUSPENDED

Hi Mark,

I agree with others here, in so much as something else could be stomping on your IOSB. (What is ss$_suspended in ascii perhaps?)

One other option may be a TCP/IP $qio which can perfectly-well return ss$_suspended (A quick glance says I have some Multinet-specific code that I think is more to do with spurious ss$_shut rather than suspended)

Can't see why a $getrmi for the local system would ever use a TCP/IP call, but who knows?

Cheers Richard Maher
Mark Finn
Occasional Visitor

Re: $GETRMI returning SS$_SUSPENDED

Thank you for all the responses! Before I get into trying them out, though, I should ask about a possible explanation.

Mr. Pinkley quoted the SSREF manual for OpenVMS Alpha Version 8.3 as saying "There is no synchronous wait form for this system service". We are running OpenVMS Alpha Version 7.3-2, and that apparently is not the case for that version. I am using $GETRMIW. So, to all of you who asked I must say that I'm not using $SYNCH, nor am I waiting for the event flag or the AST. My call is as follows:

stat := $getrmiw (,,, %REF items, iosb);

Is $GETRMIW possibly no longer available because there were problems with it - like the one I'm having?
Hoff
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

Rather than attempting to figure out when you can get away with something, a task which has inherent risk...

IMO, it is safest to assume that neither the EFN nor the IOSB is an optional argument. For the EFN, either allocate one via pairs of lib$get_ef and lib$free_ef, or use the V7.1 and later "Do Not Care" EF value EFN$C_ENF from efndef.

Both an allocated EF and the IOSB should be unique over the entire lifetime of the call.

The same risk aversion plays into the condition value processing, as well. Don't assume what's documented are the only possible errors (as new or weird errors can sometimes arise). Rather, catch the specific failure(s) or specific success(es) that you specifically care about, then use the architected low-bit tests for the blanket failure (low bit clear) or blanket success (low bit set) tests.
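A minimal portable C sketch of that ordering, specific cases first and the blanket low-bit test last. The demo_ names and the outcome categories are invented for illustration; the caller passes in whichever condition value it wants treated specially (on VMS that would come from ssdef), so no real SS$_ numbers are hard-coded here.

```c
#include <stdint.h>

typedef enum {
    DEMO_OK,           /* any success, documented or not */
    DEMO_RETRY_LATER,  /* the one specific condition the caller cares about */
    DEMO_FAILED        /* any other failure, documented or not */
} demo_outcome;

/*
 * Classify a condition value. transient_code is the specific status the
 * caller wants to handle on its own, for example by skipping a sample.
 */
demo_outcome demo_classify(uint32_t status, uint32_t transient_code)
{
    /* 1. Catch the specific condition first. */
    if (status == transient_code)
        return DEMO_RETRY_LATER;

    /* 2. Everything else falls through to the architected low-bit test. */
    return (status & 1u) ? DEMO_OK : DEMO_FAILED;
}
```

The point of the fall-through is that an unexpected or undocumented status still lands in a sensible bucket instead of being compared against a fixed list of known values.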

That list of common coding bugs over in ATW topic (1661) didn't come from anywhere strange; those are all coding bugs I've encountered over the years. (Though what is strange here: I'm delivering a training presentation on this same topic tomorrow...) Those bugs can lead to subtle and hard-to-find errors, too.

Stephen Hoffman
HoffmanLabs LLC


Jon Pinkley
Honored Contributor

Re: $GETRMI returning SS$_SUSPENDED

From the V7.3 SSREF manual (the oldest online manual, and perhaps the first to document the $GETRMI service).

http://h71000.www7.hp.com/doc/73final/4527/4527pro_053.html#jul_3077

--------------------------------------------------------------------------------

$GETRMI

Returns system performance information about the local system.

--------------------------------------------------------------------------------

Format
SYS$GETRMI [efn] [,nullarg] [,nullarg] ,itmlst [,iosb] [,astadr] [,astprm]


--------------------------------------------------------------------------------

There is no $GETRMIW listed, but no explicit statement that it does not exist.

I can't get a call to sys$getrmiw to link on VMS 7.3-2.

I just edited sys_getrmi_dir_buf.c, changed the sys$getrmi to sys$getrmiw, saved to sys_getrmiw_dir_buf.c, recompiled and relinked.

Here is the original

OT$ sho sys/nopro
OpenVMS V7.3-2 on node OMEGA 11-AUG-2008 16:28:59.42 Uptime 55 05:30:05
OT$ cc sys_getrmi_dir_buf

r0_status = sys$getrmi (EFN$C_ENF,
................^
%CC-I-IMPLICITFUNC, In this statement, the identifier "sys$getrmi" is implicitly declared as a function.
at line number 41 in file ROOT$USERS:[JON.SYS_GETRMI]SYS_GETRMI_DIR_BUF.C;1
OT$ link sys_getrmi_dir_buf

Here is the modified

OT$ cc sys_getrmiw_dir_buf

r0_status = sys$getrmiw (EFN$C_ENF,
................^
%CC-I-IMPLICITFUNC, In this statement, the identifier "sys$getrmiw" is implicitly declared as a function.
at line number 41 in file ROOT$USERS:[JON.SYS_GETRMI]SYS_GETRMIW_DIR_BUF.C;2
OT$ link sys_getrmiw_dir_buf
%LINK-W-NUDFSYMS, 1 undefined symbol:
%LINK-I-UDFSYM, SYS$GETRMIW
%LINK-W-USEUNDEF, undefined symbol SYS$GETRMIW referenced
in psect $LINK$ offset %X00000050
in module SYS_GETRMIW_DIR_BUF file ROOT$USERS:[JON.SYS_GETRMI]SYS_GETRMIW_DIR_BUF.OBJ;4
OT$

Note that the link failed with undefined symbol SYS$GETRMIW

Your example call appears to be using Pascal

stat := $getrmiw (,,, %REF items, iosb);

Did you have to do anything special in the link statement?

Perhaps the STARLET.PEN file is mapping $GETRMIW to $GETRMI (if so I would consider that a bug).

If you want your code to work, and be supported, use $getrmi, followed by $synch. I would use EFN$C_ENF as the event flag instead of the implicit use of EFN 0, and make certain the iosb isn't being shared by anything but the $GETRMI and $SYNCH, and is not an automatic variable allocated on the stack.

Good luck,

Jon
it depends
Mark Finn
Occasional Visitor

Re: $GETRMI returning SS$_SUSPENDED

I was unable to write a minimal version of the program which reproduced the error, but I did find a machine on which I can run a test version of the software. I think either using $SYNCH or making sure the IOSB was unique to the $GETRMI call (or both) must have helped because I haven't received that error on the test machine for a couple days now. Thanks again.