OPCOM cannot be stopped - KILL needed?

Paul Jerrom
Valued Contributor

Howdy all,

IA64 cluster of 2xRX2620s, running VMS V8.3. I haven't found out why yet, but OPCOM is running in a tight CPU loop. I cannot STOP/ID or STOP/ID/EXIT= it, or even kill it using a bit of Macro that does a $FORCEX. There are no reads outstanding and no I/Os being clocked; the process is not reading its mailbox (so I've had to write a DCL routine to clear it out, otherwise other processes trying to communicate with OPCOM get a mailbox full error).
I HAVE managed to set the priority down to 0!!
Anyone know how I can kill this process? [Short of running OPCCRASH - I have a steel works attached to this cluster so really don't want to shut down if I can help it, and the next scheduled downtime is a week or so away!!]

Thanks,
PJ
Have fun,

Peejay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If it can't be done with a VT220, who needs it?
17 REPLIES
John Gillings
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

PJ,

Can you see what it's doing? Or even what it thinks it's doing? If STOP/ID doesn't help, the process is most likely in an inner mode, or at AST level (which is blocking the $FORCEX AST).

Does SET PROCESS/SUSPEND help? Otherwise, take some CPU samples and examine the instruction streams (though that's not exactly easy on an Integrity). If you're really desperate, you may be able to find something in memory you can change to break out of the CPU loop; otherwise it's reboot time!

On the other hand, if you can SUSPEND the process, or can tolerate it running at priority 0, you may be able to start up another OPCOM process to service the mailbox (that will probably take a manual RUN command to change the process name, and it depends on what, if any, exclusive resources OPCOM is holding).
A crucible of informative mistakes
Paul Jerrom
Valued Contributor

Re: OPCOM cannot be stopped - KILL needed?

Hi John,

No, cannot suspend process, and if I try to create another OPCOM manually it stack dumps with a 'device allocated to another user' error.

Ho hum.
Volker Halle
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

Paul,

Consider escalating this problem to HP. As far as I remember, there may still be a problem that causes an OPCOM loop, and OpenVMS engineering was working on it the last time I heard.

You can easily obtain PC samples with the PCS$SDA extension:

$ ANAL/SYS
SDA> PCS ! for help
SDA> PCS LOAD
SDA> PCS START TRACE/PID=
...
SDA> PCS STOP TRACE
SDA> PCS SHOW TRACE
SDA> PCS UNLOAD

If you can't stop OPCOM, the loop must be in the image/process rundown code in the operating system itself and may therefore also affect other processes ...

Are you up to the current patch level?

Volker.
Paul Jerrom
Valued Contributor

Re: OPCOM cannot be stopped - KILL needed?

Howdy Volker,

As far as I am aware I am up to date, but will check...

Attached is the PCS log. I will attempt to log a call tomorrow (it's been too long a day to struggle with logging a support call now...).

Cheers,

PJ
Volker Halle
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

Paul,

It's looping in LIBRTL!

This is the instruction reported most of the time in your PCS trace:

{ .mib
LIBRTL+001C8740:
cmp4.lt p6, p0 = r8, r0
mov r1 = r51
(p6) br.cond.dptk.few 1FFFFE0 ;;
}

So it 'looks' like a branch !!!
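
As a rough illustration of what "reported most of the time" means, PC samples can be tallied by address to see which one dominates. This is only a sketch: it assumes the sampled PCs have already been pulled out of the trace as strings (the layout of the PCS SHOW TRACE output is not shown in this thread, so no parsing is attempted), the helper name `hottest_pcs` is made up, and the sample list is hypothetical apart from LIBRTL+001C8740, which is the address quoted above.

```python
from collections import Counter


def hottest_pcs(samples, top=3):
    """Return the `top` most frequently sampled PCs as (pc, count) pairs."""
    return Counter(samples).most_common(top)


# Hypothetical sample list for illustration only. LIBRTL+001C8740 is the
# address quoted in the thread; the other addresses and all counts are made up.
samples = (
    ["LIBRTL+001C8740"] * 7
    + ["LIBRTL+001C8730"] * 2
    + ["LIBRTL+001C8750"]
)

for pc, count in hottest_pcs(samples):
    print(f"{pc}  {count}/{len(samples)} samples")
```

A single address, or a small cluster of neighbouring addresses, taking nearly all the samples is what points at a tight loop rather than normal activity.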

SDA> SET PROC OPCOM
SDA> SHOW CALL/SUMM

would report the call stack.

As far as I remember, this matches the symptoms engineering is/was working on...

Volker.
Volker Halle
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

Paul,

Did you try STOP/ID=.../EXIT=mode?

Start with USER, then SUPER, then EXEC, then KERNEL.

Volker.
Hoff
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

Use SDA, and take a look at the loop.

There are kernel-mode tools around which allow clearing the NODELET flag, after which the process can be nuked.

nb: I'm not where I can check an existing OpenVMS OPCOM process PCB right now, to see if this PCB$V_NODELET flag is set for this process.

If the bit _is_ set, here's an example Really Big Hammer for this task:

http://mvb.saic.com/freeware/vmslt00b/vu/stop-i-mean-it-src.txt

This is kernel-mode code and it writes to the process PCB, with all the risks inherent.

Personally, I'd tend to let this process mimic the null process for a week or two, assuming this is a production server and it can be held together, pending a reboot or input from HP. If you need to use the RBH approach, I'd first test it on an OpenVMS I64 box off to the side.

Stephen Hoffman
HoffmanLabs LLC

Volker Halle
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

OPCOM does not have the NODELET bit set.

Process index: 0011 Name: OPCOM Extended PID: 22000411
--------------------------------------------------------------------
Process status: 00140001 RES,PHDRES,LOGIN
status2: 00000111 QUANTUM_RESCHED,TCB
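
For anyone who wants to check the flag by hand, the status longword can be read as a bitmask. The sketch below is an illustration, not taken from any VMS tool: the positions for RES, PHDRES and LOGIN are inferred from the display above (00140001 decodes to those three names), while bit 23 for NODELET is an assumption from memory of $PCBDEF and should be verified against the definitions on your own system. The helper name `decode_sts` is made up.

```python
# Bit positions: RES, PHDRES and LOGIN are inferred from the SDA display in
# this thread (00140001 -> RES,PHDRES,LOGIN). NODELET at bit 23 is an
# ASSUMPTION; verify it against $PCBDEF on the system in question.
STS_BITS = {0: "RES", 18: "PHDRES", 20: "LOGIN", 23: "NODELET"}


def decode_sts(hex_word):
    """Return the names of the bits set in a hex status longword."""
    value = int(hex_word, 16)
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits without a known name are reported by number.
            names.append(STS_BITS.get(bit, f"bit{bit}"))
    return names


flags = decode_sts("00140001")
print(flags)
print("NODELET set:", "NODELET" in flags)
```

Under those assumptions, a status word with NODELET set would have the 00800000 bit on, which 00140001 does not.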

Volker.
Dean McGorrill
Valued Contributor

Re: OPCOM cannot be stopped - KILL needed?

Thanks for the hammer pointer, Hoff.

Someone in here had some code to set quotas; one could kick the quotas down and hope the process goes into RWAST. But if it's really in a tight loop, that might not work.

Dean
Ian Miller.
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

For altering quotas use Availability Manager. If, for some strange reason, you do not have that set up then use

http://www.quadratrix.be/qapq.html
____________________
Purely Personal Opinion
John Gillings
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

PJ,

An even bigger, uglier and more dangerous hammer... only if you're truly desperate.

If the process really is looping in LIBRTL, and you can work out where (use the sources), and you can be fairly sure it's a rarely executed path... Since LIBRTL is resident, you could "patch" an instruction in on the fly from another process. Make it an illegal operand to force the process to crash.

Downsides - first you have to hope the process doesn't keep that particular instruction in an I-cache. Second you might kill innocent bystanders (including processes like the ones that might be controlling a pour of a few hundred tons of molten steel?)
Hoff
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

John, I was refraining from suggesting it, having gone as far as the RBH, but OK, you started it. Just clonk a register or the stack from within kernel mode, within the context of the looping process. Preferably reset whatever register is causing the loop, if you can sort that out. This assumes the loop is in user mode. No I-cache work required.

John Gillings
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

PJ,

Perhaps obvious... if you leave the system in its current bad state until you get a chance to shut down, make sure you force a crash, and send the dump to HP for analysis. Anything that happens on OpenVMS which forces a reboot needs to be escalated to engineering.

Hoff, I was assuming that the failure of STOP/ID to affect a process which doesn't have NODELET set means it must be in inner mode, or at AST level. Wouldn't that prevent getting anything done in process context? (can't remember the rules for lobbing ASTs at other processes - does a SPKAST always get through?)

Volker Halle
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

Paul,

just to confirm: OpenVMS engineering is still working on a looping OPCOM problem in OpenVMS I64 V8.3...

Please raise a call and provide information about your problem. The more information about such a problem is collected at the engineering level, the better the chances of pinning it down and solving it.

In any case, try to force a crash instead of just shutting down and rebooting the system.

Volker.
Robert_Boyd
Respected Contributor

Re: OPCOM cannot be stopped - KILL needed?

Paul,

As I recall -- OPCOM is where accounting runs -- do you have accounting enabled? If so, try stopping accounting and see if anything changes.

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Jim_McKinney
Honored Contributor

Re: OPCOM cannot be stopped - KILL needed?

> OPCOM is where accounting runs

How so? My recollection is that accounting is hooked into LOGINOUT.
Paul Jerrom
Valued Contributor

Re: OPCOM cannot be stopped - KILL needed?

Howdy all,
I was granted a small outage window so I could reboot this server, but that meant I was not able to generate dumps etc. So I'll keep moseying on and see if the same thing occurs again.

Thanks for the interest, all.

PJ