Processes Mysteriously Being Deleted

Jon Pinkley · ‎01-22-2008

I had a typo, I meant GBSINVRUN not DBSINVRUN.

it depends

Willem Grooters · ‎01-22-2008

What follows WAIT? It could be the process is deleted AFTER wait and before the buffers are flushed (a crash can ahppen within 5 seconds :)

Willem Grooters
OpenVMS Developer & System Manager

Wim Van den Wyngaert · ‎01-22-2008

Willem,

In that case the wait is not working correctly. He asked to wait 10 minutes ...

Wim

Wim

Wim Van den Wyngaert · ‎01-22-2008

Rob,

Is the job keeping a task-to-task open ?
If it doesn't receives any information for 75 seconds, the remote stuff stops (75 coming from our already gone 6.2 systems). May be this caused the bugcheck because the abort is about 60 seconds after the wait started.

Fwiw

Wim

Wim

Robert Atkinson · ‎01-22-2008

Presumably you're talking about DECnet or something?

Nothing like that happens in the process.

Wim Van den Wyngaert · ‎01-22-2008

Yes decnet. Something like
$ open/read/write x node::"task=xx.com".

Don't know if rms is involved but could be imo.

Or may be a spaw/nowait doing something ?

Wim

Wim

Willem Grooters · ‎01-22-2008

What you see in the logfile has been flushed BEFORE the process was deleted. The last DCL acivity you see is WAIT. Either WAIT ran into an error after the log buffer was flushed (unlikely, IMHO), or DCL ran into the RMS bugcheck on reading the next line of the commandprocedure, or the acivity on the next line. If your flush interval is set to 5 seconds, it still is worthwhile to know what could be executed in the 5 seconds after WAIT expired, because if the process is killed in that period, you won't see this in the logfile.

Willem Grooters
OpenVMS Developer & System Manager

Robert Atkinson · ‎01-22-2008

Willem, you're logic is wrong my friend.

If the process terminated at 07:45:52, the wait started at 07:45:00 and was 10 minutes long.....then it MUST have been running the wait when the DELPRC occurred.

This is how I know it was running a DCL WAIT, regardless of what information was flushed or not.

Rob.

Wim Van den Wyngaert · ‎01-22-2008

But the spawn could do it.

I just did a test where the parent was doing a wait while a sun terminated. The termination caused the parent to do a few IO's.

Wim

Wim

Jess Goodman · ‎01-22-2008

sure hope the sun hasn't terminated. The stock markets are already way down. :)

I have one, but it's personal.

Jon Pinkley · ‎01-22-2008

Rob,

Is it always the GBSINVRUN job that getting the RMS Bugchecks?

If so, what is special about that job?

Are there other processes in the job? If not then we can eliminate subprocesses as having an effect.

Can you do a

SDA> set proc/id=PID of GBSINVRUN job
SDA> show process/chan

while the GBSINVRUN job is in a WAIT?

Are any special drivers or execlets loaded?

SDA> show exec/summary

Is the job copying files from other systems? If so, how are these transfers happening? Any NFS or other software providing "disks" to VMS?

Are you still using HSG80, or something else?

Jon

it depends

Jon Pinkley · ‎01-22-2008

Does this ASCII string mean anything to you?

ETA$MGA

Look at the SP reported in the BUGCHECK, it does not appear to be a valid address, and it is not aligned correctly. And it has valid ASCII charaters in it.

FP x000000007FF8BC80

SP x0041474D24415445 <--

PC xFFFFFFFF804A3550

0x ASCII
45 E
54 T
41 A
24 $
4D M
47 G
41 A

It seems to me that something has trashed something that later got loaded into the SP.

it depends

Volker Halle · ‎01-22-2008

Rob,

the most important question to be answered: is there always a RMS non-fatal bugcheck associated with the forced process exit ?

Could the flushing of the logfile write have initiated the bugcheck ?

Can you look at the instruction stream around the reported PC ?

$ ANAL/SYS
SDA> EXA/INS 804A3550-40;50

Volker.

Robert Atkinson · ‎01-23-2008

> ETA$MGA

This looks like one of our tape drives, BETA$MGAn:

There have been around 6 different job streams that have failed, all doing completely different tasks.

Trying to find similarities is difficult because these are 3 years apart, so I can't tell you much more detail than you already know.

Even the backups we took have now expired, so I can't look at the original log files, only use what we wrote down about the problem.

Remember that the first time we saw this, we put it down to a glitch, and didn't take much notice of it.

Rob.

Volker Halle · ‎01-23-2008

Rob,

if these batch process exits are always associated with RMS non-fatal bugchecks, you don't need to activate all the auditing around process deletion. The best information you can get right now are the bugcheck entries from ERRLOG.SYS. Please look at all of them (consider to post them in an attached .TXT file) and look for similarities.

It will be very hard - if not impossible - to diagnose this problem based on this minimum amount of information. As Hein said, the only way to get real information is a system crash (setting BUGCHECKFATAL=1 dynamically once before expecting this problem to happen).

Some information may be spotted in the registers (as Jon found out !), but it would be pure luck, if you can nail the problem this way. Did all those batch jobs do anything with your Fibre Channel tapes ?

Please report the failing instruction stream, so that we can try to find out, if other registers may contain important information.

Volker.

Wim Van den Wyngaert · ‎01-23-2008

Tou could do "set watch file/class=all".

Wim

Wim

Volker Halle · ‎01-23-2008

Rob,

a plain call to $DELPRC in a program running in batch yields the same retained-on-error status as you've reported. RMS explicitly does a $DELPRC call after reporting the RMS internal error with BUG_CHECK RMSBUG. I would have expected an exit status of %RMS-F-RMSBUG, but I have no way to test this.

Running Hein's cmexec_crash example gives a SYSTEM-F-ACCVIO exit status and a non-fatal bugcheck, but it does not exactly create a RMS bugcheck !

Volker.

Richard W Hunt · ‎01-25-2008

This is a stab, admittedly. Is there a third-party interface to the tapes involved here? Or some third-party ACP? I find it most interesting that the string ETA$MGA is in SP as opposed to some other register.

This sounds (vaguely) like a context switch or condition handler tried to handle a condition and in the process, mis-aligned the stack so that you have data where addresses should have been. The crash would then occur because the real error is that I'll bet that SP address is no good.

So the RMS BUGCHECK error is merely that RMS suddenly found itself with an impossible stack, when in fact the true error should be some sort of hardware misalignment. The text leaves "45" as the low-order two hex digits in the SP register, which is clearly misaligned, can't use that as an SP on an Alpha 'cause it'll barf every time.

If you can enable a full-blown crash dump, you might be able to look at what else was running that might have hosed the EXEC stack, which is of course shared among all processes (I think I remember that from my years-ago Internals class...) - so the task that is dying is only partly the culprit and in fact might be a victim of a third-party ACP task that hoses the stack for you and leaves the mess for someone else to clean up.

Sr. Systems Janitor

Volker Halle · ‎01-26-2008

Richard,

the EXEC stack, which is of course shared among all processes

the EXEC stack is per process, as are the USER, SUPERVISOR and KERNEL stacks.

If those processes have not done anything to those tapes in EXEC (or higher) mode previously, it is hard to imagine, how the EXEC mode stack pointer could be corrupted with what looks like part of a tape device name.

Without a system crash (BUGCEHCKFATAL=1), it will be very unlikely to successfully diagnose this problem and even with a crash, it may be very hard.

Volker.

Hein van den Heuvel · ‎01-26-2008

While the ASCII value spotted in the SP is certainly interesting, and a good find (which I normally spot myself :-), I would not pay too much attention to that here.
In the very area of the reported RMS problem the system is busily juggling the stack pointer(s), getting ready to switch it to a fresh area to field an RMS requested AST.

Here is a few more interpreted register values:

R6 x0000000000018001 RMS$_STALL (or long at IFB+IFI for SYS$OUTPUT)

R11 x000000007FFD00B8 PIO$GW_PIOIMPA
R12 x000000007FFCDA60 CTL$AG_CLIDATA

R16 x00000000000003A0 BUG$_RMSBUG
R17 xFFFFFFFFFFFFFFFA ASBALLFAIL

R19 x000000007FFD0010 PIO$GW_STATUS

R16/r17 are consistent with and RMS BUG.

R19 suggests RMS was active as well.

R11 reads Process IO, not Image IO, suggesting it is DCL releated IO.
R12 is a global value used by DCL all the time.

So far so good.
Now ASBALLFAIL is only used from RM$ASY_THREAD_START.
That in turn is only (supposed to be) called from LOCKFILE for the file lock, and RELEASE for a blocking AST on a Deferred Write buffer.
Since DCL does not do DFW, it must be a file lock.
Typically DCL will do an (RMS) file lock either. It only grabs those for $OPEN/SHARE.

Robert, Is that the case for this script?
Can you run submit it, wait for it to be waiting, and do a ANAL/SYSTEM... SET PROC BATCH_xxx... SHOW PROC/LOCK ?
Is there a resource wih name RMS$...
Does it have a blocking AST perhaps for 86829400
Do you see that lock back with:
SDA> SHOW PROC/RMS=(PIO,SFSB)

[ Also, if interested, check out things like SDA> SHOW PROC/RMS=(PIO,BDBSUM) and examine the buffer contents. ]

Anyway, If so, any other process opening the same file will trigger an RMS file lock blocking AST out of context, at any time. During an image, during a wait, anytime.
RMS takes the AST, tries to allocate a thread control structure for it. If that fails, there is no place to report back, and it crashes the process as only way to report the fact that it could not do what it was asked to do (downgrade the lock).

What may have caused this?

Well, I suspect that the process must simply have run put of Process IO memory at an unfortunate time.
I suspect that if you change the script to have it open (a few) more files, notably indexed or shared, right where it is doing that WAIT, that it might error out with DME (can you try?!)

The amount of memory RMS allocated is MIN(2, SGN$GL_KSTACKPAG) times PAGE_SIZE (8192).
So it is also possible that the system has an excessive value for SGN$GL_KSTACKPAG.
Check with SDA or MCR SYSGEN SHOW KSTACK.
3 is typical. What is it on the system?

btw... the little tester Volker refers to is in my (underappreciated :^) reply to http://forums12.itrc.hp.com/service/forums/questionanswer.do?threadId=1195742

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting

Jon Pinkley · ‎01-27-2008

Hein and Volker,

Perhaps I should open another thread, but how does RMS handle indefinite postponement in the case where a process is holding a lock, and someone with CMKRNL does a

$ set process/suspend=kernel

on the process holding the lock. That would seem to prevent the blocking AST from being delivered and therefore acted upon. I would assume that the can lead to other processes blocking, but that should not end in a bugcheck. And as this is the result of a user action, the fact that it can block other processes is to be expected. (I believe that set proc/susp used to have this effect, and that the "soft suspend in SUPERVISOR mode" was to address this issue.)

Another case that could delay for a long time is if the job holding the lock is running at a low process priority, and there are other compute bound jobs that are keeping the processor(s) busy. PIXSCAN would eventually give the process some CPU, but I am not aware of any mechanism to boost the software priority when exec mode ASTs are queued. (Special Kernel mode AST's can specify a priority boost, but I am not aware of any priority boost applied to processes that are blocking EXEC mode locks.)

I see that there was just a new patch released for 7.3-2 dealing with RMS locks, global buffers and CPU busy.

ftp://ftp.itrc.hp.com/openvms_patches/alpha/V7.3-2/VMS732_RMS-V0500.txt

5.2.2 Process Hang

5.2.2.1 Problem Description:

Under very rare conditions, processes on a system may
hang while accessing a file with global buffers enabled.
One of the processes will be in a compute bound loop.

it depends

Hein van den Heuvel · ‎01-27-2008

>> Perhaps I should open another thread,

Yeah you should.

>> but how does RMS handle indefinite postponement in the case where a process is holding a lock, and someone with CMKRNL does a
$ set process/suspend=kernel

I would hope it waits forever possibly triggering deadlock search, but it does not seem worth my limited brain cycles.
I hope it hurts! (Doctor it hurts...)
You should get what you ask for.
OpenVMS gives you the gun.
You can choose to shoot yourself in the foor or not.

>> but that should not end in a bugcheck.

I don't see why that would bugcheck.

>> Another case that could delay for a long time is if the job holding the lock is running at a low process priority, and there are other compute bound jobs that are keeping the processor(s) busy. PIXSCAN would eventually give the process some CPU,

That is the very reason pixscan was invented.

Mind you, rms will not generally hold 'bucket' locks. It does that only if the application request Deferred write. RMS will by default hold 1, the last accessed, record locked in a write share environment.

>> I see that there was just a new patch released for 7.3-2 dealing with RMS locks, global buffers and CPU busy.

I don't see that mentioning locks, even though there will of course be locks, for there to be global buffers.
I suspect that patch is for a day 1 bug with a timing window on the global buffer replacement algoritme, which is now easier to trigger with more, and faster, CPUs.
If it is that problem, then it also requires a severely undersized number of global buffers to be likely to trigger (like just 2 global buffers, which should be a little silly no matter what).

hth,
Hein.

Robert Atkinson · ‎01-28-2008

ALPHA_ROB$ MCR SYSGEN SHOW KSTACK
Parameter Name Current Default Min. Max. Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
KSTACKPAGES 6 1 1 768 Pages
ALPHA_ROB$

If I completely understood all of the low level stuff that you've been discussing, I'm not sure how we could take it any further, because the problem cannot be reproduced.

I've suggested we need to crash the system to get more information, but it's been decided that taking the hit on the process dying versus a full system crash is preferrable.

Unless the problem frequency increases, this is possibly one we'll never get to the bottom of. I'm hoping a move to Itanium and/or VMS 8.3 might magically cure the problem.

Rob.

Hein van den Heuvel · ‎01-28-2008

>> ALPHA_ROB$ MCR SYSGEN SHOW KSTACK
KSTACKPAGES 6 1 1 768 Pages

So that's a little increase, but to come from a limited area (PIOPAGES). Is the reason why this was changed from the default still valid?

>> I'm not sure how we could take it any further, because the problem cannot be reproduced.

Well, if you could run the script, in a test environment perhaps, or pointing to scratch copies of files, then you can possibly get to the bottom. Either analitically with ANAL/SYS checking memory consumption, or emperically by opening a few more (indexed) files right where that wait would be. See if you get RMS$_DME. With the DME error almost there, have an other process open/share some of the files opened by DCL in the waiting process. That might just trigger it.

>> I'm hoping a move to Itanium and/or VMS 8.3 might magically cure the problem.

Not if it is was I suspect, RMS simply running out of process (versus image) memory space.
That's much like the DME error, but at a bad time.
$ help /mes/fac=rms dme
So after all this, maybe just increase (double?) PIOPAGES a bunch (200? 500?) and hope for the best !?

Hein.

Robert Atkinson · ‎01-28-2008

We increased KSTACKPAGES because of the failure in Dec '06.

Trying to recreate the failure is impossible. We already run this script about 8 times every day, as it forms part of our Invoicing suite.

It's also run on various test systems, so I can't see how we could possibly make it go wrong, especially as we already know the problem only happens once a year, and it's not always the same script anyway.

Rob.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted

Re: Processes Mysteriously Being Deleted