
Re: Processes Mysteriously Being Deleted

 
Robert Atkinson
Respected Contributor


> Are you up to date on OpenVMS and patches?

Yes, VMS 7.3-2 fully patched - we will probably be on VMS 8.3 when this comes around next year.

BUGCHECKing the system is probably not an option, as it happens at our busiest time of the year - we're a book distributor!

Rob.
Gregg Parmentier
Frequent Advisor

Re: Processes Mysteriously Being Deleted


Robert,

Do the programs process future data times? Is it possible that at 7:45 AM they are processing data associated with the following year for the first time?
If you're calculating something based on differences in time (that affects memory allocation or some other resource allocation), and you get a negative number because the program is using only the month and day and ignoring the year, I can see a problem that would occur every late December.
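
To make the failure mode above concrete, here is a minimal sketch in Python (the thread does not name the application language, so this is purely illustrative). The function names and the dates are hypothetical; the point is only that a difference computed from the day of the year, with the year dropped, turns negative once the target date falls in the following January.

```python
from datetime import date


def days_until_ignoring_year(today: date, due: date) -> int:
    """Buggy version: compares day-of-year only, so the year is lost."""
    return due.timetuple().tm_yday - today.timetuple().tm_yday


def days_until(today: date, due: date) -> int:
    """Correct version: subtracts full dates, year included."""
    return (due - today).days


# Hypothetical dates: a late-December run handling an early-January record.
today = date(2007, 12, 28)
due = date(2008, 1, 4)

print(days_until_ignoring_year(today, due))  # -358
print(days_until(today, due))                # 7
```

If a value like the first result were used to size a buffer or a loop count, it would only misbehave in the last days of December, which would fit the seasonal pattern.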

Robert Atkinson
Respected Contributor

Re: Processes Mysteriously Being Deleted

The programming language takes care of dates/times, so that wouldn't cause a problem, plus it wouldn't explain why the error can occur in DCL.
Willem Grooters
Honored Contributor

Re: Processes Mysteriously Being Deleted

It doesn't have to be DCL. SYS$DELPRC - something stopped the process:

Process ID x007E0097

Process Name GBSINVRUN

A few thoughts:

I think accounting won't tell more, just DELPRC. If a RMS bugcheck forces process deletion using SYS$DELPRC, it's nasty if that information is lost....

Out of stack + busiest time of the year: If it is an RMS issue, I think it might be something in the program causing the stack to get exhausted, where 'normal' load will be within limits. Too many concurrent reads/writes, perhaps? Too many channels/files open?
Willem Grooters
OpenVMS Developer & System Manager
Robert Atkinson
Respected Contributor

Re: Processes Mysteriously Being Deleted

Willem, I think you misunderstood my last comment.

What I meant was that the failure can happen inside an application program and whilst the batch job is running DCL commands, so it's probably nothing to do with the application.

Rob.
Willem Grooters
Honored Contributor

Re: Processes Mysteriously Being Deleted

Agreed - a batch process runs a DCL procedure - but the procedure can run an image. The problem could happen in the image, causing the process to be deleted (to prevent even more damage). The message is clear enough: "forced exit of image or process". So it might well be that the problem is within the image, if this RMS bugcheck kills the process.

My thought is that if this happens in a top-load period, AND it seems to be something related to RMS, there is some issue in either the DCL code or the image concerning file access. IMHO, date or time are no issues here, assuming your statement that the program does handle future times properly.
Willem Grooters
OpenVMS Developer & System Manager
Willem Grooters
Honored Contributor

Re: Processes Mysteriously Being Deleted

I read your remark on DCL WAIT when it happened.... Weird, indeed.
Willem Grooters
OpenVMS Developer & System Manager
Jon Pinkley
Honored Contributor

Re: Processes Mysteriously Being Deleted

Rob,

Perhaps you weren't telling us all you knew about the problem at the start to see if anyone could come up with an alternate possibility than what your support organization suggested.

However, now that the RMS BUGCHECK is out of the bag, can you please tell us more?

Timestamps provide good circumstantial evidence that the termination of the GBSINVRUN job and the RMS BUGCHECK were related.

As Hein stated, an RMS BUGCHECK will definitely kill the process that detected the inconsistency.

Was there an RMS BUGCHECK at the time the process in a DCL WAIT was deleted? (BTW, how did you determine that was the state of the process? WAIT is a CLIROUTINE, and as such doesn't have an image associated with it.) What other events are showing up in the errlog?

What changed 3 years ago? Is that when you upgraded to the ES45?

Are you running T4 or some other system data collection software?

Issues caused by lack of proper synchronization are most likely to become evident when the system is busy, so it would not surprise me if the underlying cause is synchronization related. Unfortunately, these are in general not easy problems to debug, as often the detection of the corrupted data isn't discovered until later, and all VMS can do when it finds the inconsistency is to bugcheck. The point being that the process that detects the inconsistency may not be related to the cause. Since you don't want to allow a crash, is there a time that you can stress the system with BUGCHECKFATAL set to 1?

Is there other software involved you are neglecting to tell us about? Are you using RMS Journaling, Rdb, some third party remote journaling or shadowing product, etc.? Especially anything that does stuff in exec or kernel mode.

If these are related to Exec mode bugchecks, I am not sure auditing is going to tell you much. It definitely won't hurt, though.

Jon
it depends
Robert Atkinson
Respected Contributor

Re: Processes Mysteriously Being Deleted

> how did you determine that was the state of the process

I can see from the batch logfile that the process was at a WAIT statement at the same time the BUGCHECK occurred.

As far as I'm aware, nothing in particular changed 3 years ago. We were running HSG80/ES40 when this first occurred. We copied the system disk over to the new hardware, so if the problem is VMS related, it would have been moved over.

Although it is the end of the year this happens, the system is no busier than at any other point in time. The end of November is our peak processing point, so if it were load related, this is when I'd expect problems.

We have T4 installed, but I'm not sure what data it's collecting.

I checked the timing of the previous errors against this year, and some happen before our yearend processing, and some after, so that seems to rule out a rogue yearend program.

My main query with the group was to see if anyone else had seen anything like this at yearend, which seems not to be the case. I'm glad of everyone's input here, especially the audit suggestions, so I'm happy to close this unless anyone would like to make more comments/suggestions?

Rob.
Jon Pinkley
Honored Contributor

Re: Processes Mysteriously Being Deleted

Rob,

RE:"I can see from the batch logfile that the process was at a WAIT statement at the same time the BUGCHECK occurred."

Does that mean that the logfile ended with something like:

$ WAIT 00:05:00

and that was the last line in the file?

Much more conclusive would be if

$ SET PREFIX "(!8%T) "

was in effect and you have something like:

(23:03:18) $ WAIT 00:05:00

and you knew the bugcheck occurred at 23:04:39, i.e. before the wait had expired.
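
The comparison above is simple arithmetic, sketched here in Python as a hedged illustration. The function name `wait_still_pending` is hypothetical, the timestamps are the example values from this post, and the sketch assumes all three times fall on the same day (it does not handle a wait that spans midnight).

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"


def wait_still_pending(wait_start: str, wait_length: str, event_time: str) -> bool:
    """True if event_time falls after the WAIT began and before it expired.

    Assumes all times are on the same day (no midnight rollover).
    """
    start = datetime.strptime(wait_start, TIME_FORMAT)
    hours, minutes, seconds = (int(part) for part in wait_length.split(":"))
    expiry = start + timedelta(hours=hours, minutes=minutes, seconds=seconds)
    event = datetime.strptime(event_time, TIME_FORMAT)
    return start <= event < expiry


# Prefix shows the WAIT started at 23:03:18 for 00:05:00, so it expires at 23:08:18.
print(wait_still_pending("23:03:18", "00:05:00", "23:04:39"))  # True
print(wait_still_pending("23:03:18", "00:05:00", "23:09:00"))  # False
```

Only the first case would support the claim that the process was sitting in the WAIT when the bugcheck was logged.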

I would not be surprised if the logfile cannot be trusted to have flushed the buffers, because that's an RMS function, and there was an RMS bugcheck. Hein or Volker will know. My point is that I think there is a good possibility that code after the WAIT had completed was actually executing at the time of the Bugcheck.

Is every one of these unexplained process deletion events paired with an RMS Bugcheck?
If so, you really need to try to determine what the cause is. Making the EXEC mode Bugcheck fatal will give you the best chance of being able to determine the cause, but before you do that, make 100% sure your system is set up to be able to save the crash dump. It would be a waste to take the crash but not end up with a valid dump.

RE:"Although it is the end of the year this happens, the system is no busier than at any other point in time. The end of November is our peak processing point, so if it were load related, this is when I'd expect problems."

I was just going by your statement that you couldn't allow an avoidable crash because it was the busiest time of the year. Perhaps it is all the activity in the files at the end of Nov that is causing the files (I am assuming indexed files here) to be badly tuned, although that in itself shouldn't cause a Bugcheck. I am not familiar enough with RMS internals or decoding the errorlog contents to be able to know if anything useful can be obtained from your errorlog entries. Hein was able to derive some info; how much more is possible, I have no idea.

RE: "We have T4 installed, but I'm not sure what data it's collecting."

Having it installed without collecting data is possible, but not very useful. The default collection command procedure will collect a lot of useful info, but the raw data normally gets deleted after a relatively short period, so it is questionable if you still have the raw monitor.dat file.

RE:"I checked the timing of the previous errors against this year, and some happen before our yearend processing, and some after, so that seems to rule out a rogue yearend program."

Well it is possible that the fact these only happen at year end is a coincidence, but I would at least look for something that is triggering the bugchecks.

Is it always the same job that is affected, or just random batch jobs? If more than one job is affected, are there common files the jobs are using?
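
If it turns out that several jobs are affected, the common-file question reduces to a set intersection. A minimal Python sketch follows, assuming you have already listed by hand which files each affected job opens. GBSINVRUN is the job named earlier in this thread; the other job names and all file names are hypothetical placeholders.

```python
def files_shared_by_all(files_by_job: dict[str, set[str]]) -> set[str]:
    """Return the files that every listed job opens."""
    file_sets = list(files_by_job.values())
    if not file_sets:
        return set()
    return set.intersection(*file_sets)


# Hypothetical inventory of the files each affected job opens.
files_by_job = {
    "GBSINVRUN": {"INVOICE.IDX", "CUSTOMER.IDX", "STOCK.IDX"},
    "JOB_B": {"ORDERS.IDX", "CUSTOMER.IDX"},
    "JOB_C": {"CUSTOMER.IDX", "STOCK.IDX"},
}

print(sorted(files_shared_by_all(files_by_job)))  # ['CUSTOMER.IDX']
```

A file that shows up for every affected job would be the first candidate to check for tuning or sharing problems.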

Jon
it depends