Processes Mysteriously Being Deleted

Richard W Hunt · 01-28-2008

Compared to the other guys here, I'm probably not a system debugging heavyweight, but "not always the same script anyway" tells me it isn't the script, it is something else you are running. Your script is merely a victim of "collateral damage" caused by something else. Wish I had a dollar for every time I've seen that, starting from VMS 2.0 to current.

The only way I know to fix this kind of problem is to somehow know EXACTLY what is running at the time of the event, which means briefly turning on image accounting.

Also, be absolutely sure you have enough space set aside for the crash dump and for retention of the errorlog buffers so that you don't lose any more "evidence" than you have to. Keep everything logging full-out at the critical time, starting just before that job's scheduled run. Turn it off after next year's crash. (Egad, that sounds weird to say, but that's what is being described.)

Do you have ANY third-party software running as an ACP? Is there any third-party device driver talking to a non-HP / COMPAQ / DEC device? (Don't know how old your system is, but we've got some of each on our old clunker of an Alpha cluster.) Did you buy your system from DEC / COMPAQ / HP or was there an OEM in the picture?

Do you have ANY performance tools from a third party?

Even if the EXEC stack is private vs. common to all, having text in the SP is not good because it means something is terminating and you are missing the real cause of the problem - an abort is leaving the stack useless and it is the stuff that has just been popped off the stack that tells you what really happened. If it is still there, which certainly is not guaranteed.

Sr. Systems Janitor

Jon Pinkley · 01-28-2008

Hein>"The amount of memory RMS allocated is MIN(2, SGN$GL_KSTACKPAG) times PAGE_SIZE (8192).
So it is also possible that the system has an excessive value for SGN$GL_KSTACKPAG."

Did you mean MAX(2, SGN$GL_KSTACKPAG) ?

The AXP V1.5 IDSM says "The Process I/O segment contains RMS data structures describing process-permanent files, those that can and usually do remain open across image activations. The SYSGEN parameter PIOPAGES specifies the size in pagelets"

If I understand what you wrote, the PIO segment is also used for the "thread control structure" you mention in [RMS takes the AST, tries to allocate a thread control structure for it. If that fails, there is no place to report back, and it crashes the process as only way to report the fact that it could not do what it was asked to do (downgrade the lock).].

Since PIOPAGES is expressed in 512 byte pagelets, there are 16 pagelets in an Alpha 8192 byte page. So if RMS is allocating 6 Alpha pages, that is 96 pagelets.

Hein's suggestion to try opening some additional file in the context of the job to see if you got a DME (dynamic memory exhausted) error prompted me to do a

$ help/message dme

That has a lot of useful information. I wasn't aware there was a limit of 63 Process Permanent Files. The other interesting thing is that it mentions buffers, so if set RMS /buff=x /block=y and x*y is a large value, and the command file is opening files, then the 2000 pagelets (125 Alpha Pages) for PIOPAGES may not be as big as it seems.
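To put the numbers in this post side by side, here is a small Python sketch of the pagelet arithmetic. The page and pagelet sizes, the 6-page ASB and the 2000-pagelet PIOPAGES come from this thread; the function names and the /BUFF=8 /BLOCK=64 setting are made up for illustration, and the buffer estimate assumes one 512-byte block per pagelet with no per-buffer overhead, which is a simplification rather than how RMS actually accounts for the space.

```python
PAGELET_BYTES = 512        # size of a pagelet (and of a disk block)
ALPHA_PAGE_BYTES = 8192    # Alpha page size quoted in this thread
PAGELETS_PER_PAGE = ALPHA_PAGE_BYTES // PAGELET_BYTES   # 16


def pages_to_pagelets(pages):
    """Convert Alpha pages to 512-byte pagelets."""
    return pages * PAGELETS_PER_PAGE


def pagelets_to_pages(pagelets):
    """Convert pagelets to whole Alpha pages (rounding down)."""
    return pagelets // PAGELETS_PER_PAGE


def buffer_pagelets(buffers, blocks_per_buffer):
    """Rough buffer space for one open file, in pagelets.

    Assumption: one block occupies one pagelet and there is no overhead.
    """
    return buffers * blocks_per_buffer


piopages = 2000                        # PIOPAGES value mentioned in this thread
asb = pages_to_pagelets(6)             # an ASB of 6 Alpha pages
per_file = buffer_pagelets(8, 64)      # hypothetical /BUFF=8 /BLOCK=64

print(asb)                             # 96
print(pagelets_to_pages(piopages))     # 125
print(per_file, piopages // per_file)  # 512 3
```

With those hypothetical defaults, three such files would already account for most of a 2000-pagelet PIO segment before any ASB is requested.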

What would the SDA command to display available PIO segment memory be?

Rob,

I think the answer to your question "to see if anyone else had seen anything like this at yearend" is that no one (here) is aware of a similar problem to what you are seeing.

Did the increase in PIOPAGES reduce the frequency? (If you have had only 6 events in 3 years, I suppose that is not easy to say.)

Do you know how many Process-Permanent Files (PPF) are open, and if so how many are using open/share?

It still isn't clear from what you have told us whether there is an RMS bugcheck each time the process is deleted. For last December, you hinted that there were at least two unexplained process deletions, but showed us only one RMS Bugcheck errorlog entry. Was that the only one? If so, then the cause of the process deletions could be totally unrelated.

And I can understand management does not want to allow a crash, especially since there is no guarantee that you will be able to find the cause even with a crash dump. However, the probability of being able to find it is much higher with a valid crash dump than with only the extremely limited context that the bugcheck errorlog entry provides.

Jon

it depends

Hein van den Heuvel · 01-28-2008

Richard H wrote>> "Even if the EXEC stack is private vs. common to all, having text in the SP is not good because it means something is terminating and you are missing the real cause of the problem - an abort is leaving the stack useless and it is the stuff that has just been popped off the stack that tells you what really happened. If it is still there, which certainly is not guaranteed."

I appreciate the sentiment, but in this particular case the bogus SP is 99.99% likely to be a red herring. Remember, on Alpha, register numbers are just a matter of convention/convenience.
Nothing at all special about the SP. Just a register with a nickname.
If my ramblings are correct, then RMS is busy trying to come up with a fresh value to stick into the register conveniently labeled SP, and because it failed to find a new value it had to crash out.
That crash is perfectly controlled, with reasonable looking output. So the bad looking SP is an effect, not a cause. This is reinforced by the fine looking values for KSP, ESP, SSP and USP early in the error log entry.

Jon P wrote> Did you mean MAX(2, SGN$GL_KSTACKPAG) ?
Jon... there you go again. Hijacking the thread some more! Anyway... Yes of course.

> So if RMS is allocating 6 Alpha pages, that is 96 pagelets.
Yes. An ASB makes a big dent in PIOPAGES. It is the largest structure (except for some IO buffers).

>> Hein's suggestion to try opening some additional file in the context of the job
I actually tried this also, opening an indexed file several times (bucket sized buffers) and using up the remainder with sequential file opens.
Easy enough to get the DME.
Could not trigger an ASBALLFAIL for now.
The ASB lookaside probably gave one ASB.
So we'd need a second interrupt while the first is stalling.

Jon> so if set RMS /buff=x /block=y and x*y is a large value, and the command file is opening files, then the 2000 pagelets (125 Alpha Pages) for PIOPAGES may not be as big as it seems.

Yeah, but DCL/RMS does not always use the defaults. Easy enough to verify with the earlier mentioned SDA> SHOW PROC/RMS=(PIO,BDBSUM).

Jon>> What would the SDA command to display available PIO segment memory be?

SDA> READ RMSDEF.STB
SDA> FORMAT/TYPE=IMP PIO$GW_PIOIMPA
SDA> VALIDATE QUEUE PIO$GW_PIOIMPA+IMP$L_ASB_LOOKASIDE_LIST
SDA> VALIDATE QUEUE PIO$GW_PIOIMPA+IMP$L_FREEPGLH
SDA> EXA @(PIO$GW_PIOIMPA+IMP$L_FREEPGLH);8
SDA> EXA @.;8 ! Repeat for each element in the queue. Total free is the sum of the 6th longwords.

But of course you'd need a single one big enough for an ASB. So if you open files A, B, C, D, E, F and close A, D and E, then it may look like plenty of room, but it might not be usable for an ASB.
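The fragmentation point can be made concrete with a small Python sketch. Everything in it is hypothetical: six files A to F are assumed to sit next to each other in the PIO segment, each taking 40 pagelets, and adjacent freed chunks are assumed to merge into one. Only the 96-pagelet ASB size comes from this thread.

```python
ASB_PAGELETS = 96   # 6 Alpha pages, as worked out earlier in the thread


def free_runs(slots, closed):
    """Return the sizes of the contiguous free runs left behind.

    slots  : list of (name, size_in_pagelets) in address order
    closed : set of names whose space has been released
    Assumption: adjacent free chunks coalesce into one larger chunk.
    """
    runs, current = [], 0
    for name, size in slots:
        if name in closed:
            current += size          # extend the current free run
        elif current:
            runs.append(current)     # an open file ends the run
            current = 0
    if current:
        runs.append(current)
    return runs


slots = [(name, 40) for name in "ABCDEF"]   # hypothetical 40 pagelets each
runs = free_runs(slots, {"A", "D", "E"})

print(runs)                                   # [40, 80]
print(sum(runs) >= ASB_PAGELETS)              # True: total looks sufficient
print(max(runs) >= ASB_PAGELETS)              # False: no single chunk fits
```

So 120 pagelets are free in total, but the largest single chunk is 80, which is too small for a 96-pagelet ASB.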

Jon> "to see if anyone else had seen anything like this at yearend" is that no one (here) is aware of a similar problem to what you are seeing.

That's a valid and good reason for a post though! I do remember a couple of reported issues with ASBALLFAIL during my time @HP.
I think ACMS was involved at times,
and I think increasing PIOPAGES always brought relief.

$! P1 = file to open repeatedly (big buffers), P2 = optional maximum number of opens
$max = 1000 ! Fool's guard
$if p2 .nes. "" then max = f$integer(p2)
$i = 0
$open:
$i = i + 1
$open/share=write/read/write/error=error x'i 'p1 ! Big buffers
$if i .lt. max then goto open ! Compare against max, not p2, which may be empty
$error:
$error = $status
$write sys$output "Error ''error' after ''i' files. ", f$message(error)
$more:
$open/share=write/read/write/error=next x'i sys$login:login.com ! A few more small buffers
$i = i + 1
$if i .lt. 1000 then goto more
$next:
$error = $status
$write sys$output "Next error ''error' after ''i' files. ", f$message(error)
$inquire/nopun ok "Close files? "
$close:
$i = i - 1
$close/nolog x'i
$if i .gt. 1 then goto close

Volker Halle · 01-28-2008

Rob,

If you're seeing this problem (seldom enough) with different DCL procedures, there should be something in common with those scripts PLUS some additional factor (like end-of-year load?) to trigger the problem.

Do these scripts really open files at DCL level and keep them open during the 10-minute WAIT ?

Instead of allowing a crash (with BUGCHECKFATAL=1) you could - theoretically - replace the BUG_CHECK RMSBUG instruction inside RMS with a 'BR .', sending the process into a compute loop in EXEC mode. This would just affect this process, which would have been killed anyway, but let the system survive. Then you can lower the priority or suspend that looping process and force a crash at a convenient time...

Volker.

Robert Atkinson · 01-29-2008

Gentlemen, I think we could go round in circles with this one for some time to come.

The one thing we do all agree on is that there isn't enough information at the moment to determine the real cause.

I have noted everyone's suggestions, and will determine what changes we can make towards the end of the year.

Although I'm closing this rather abruptly, I genuinely do appreciate everyone's help with this, and I'll try and update you all when I get something.

Thanks, Robert.

Jon Pinkley · 12-15-2009

Rob,

It's that time of year! If you are going to prepare to put in extra monitoring, or plans to trap the process generating the bugcheck, now is the time to get ready.

Good luck,

Jon

it depends

Robert Atkinson · 12-21-2009

Cheers Jon.

We've been running a 'watcher' process for the last year. Hopefully our move to Itanium will see the end of this nasty little critter :)

Rob.

John McL · 12-21-2009

Two thoughts.

1 - What else was running at the time that would have sufficient privilege to kill a process? I'm particularly interested in images that are not part of VMS because I'm pondering a "victim of mistaken identity" when trying to terminate another process (i.e. attempted termination but used wrong PID or wrong process name). This is a long shot because it doesn't account for the RMS error.

2 - Does any of your software do anything highly privileged like mess with S0 space in Executive or Kernel modes? I'm wondering about a data corruption deep in VMS data structures.
