Re: Zillions of untraceable (?) subprocesses.

Jan van den Ende · 04-19-2005

Anybody recognise this?

We have all kinds of alarms set, but none were triggered.

Then it became apparent that something was eating our cluster-common disk at a fierce rate: 250 - 500 MB/day. Modern disks are big, but over a few weeks that adds up to a lot of disk space.

It proved to be SECURITY.AUDIT$JOURNAL.

Every day we open a new one and rename the old one to its date, and every week we archive to tape all those older than two months.

It appears that there are VERY MANY short-lived subprocesses from username SYSTEM (use of which we try to avoid as much as possible).

Read very many as 8K+ /node /hour.
That is 35000 processes per hour for the cluster!
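
As a rough sanity check, the sketch below relates the process rate to the journal growth. It assumes that the whole 250 - 500 MB/day is caused by these subprocesses and that a megabyte is 10^6 bytes; both are assumptions, not measurements from this cluster.

```python
# Back-of-the-envelope check: how much journal space per subprocess?
PROCESSES_PER_HOUR_CLUSTER = 35_000   # cluster-wide figure quoted in the post
JOURNAL_MB_PER_DAY = (250, 500)       # observed growth range of the journal


def bytes_per_process(mb_per_day, processes_per_hour):
    """Average journal bytes per subprocess, assuming all growth is due to them."""
    processes_per_day = processes_per_hour * 24
    return mb_per_day * 1_000_000 / processes_per_day


low, high = (
    bytes_per_process(mb, PROCESSES_PER_HOUR_CLUSTER) for mb in JOURNAL_MB_PER_DAY
)
print(f"{PROCESSES_PER_HOUR_CLUSTER * 24:,} processes per day")
print(f"roughly {low:.0f} to {high:.0f} bytes of journal per process")
```

If the result comes out at a few hundred bytes per process, the subprocess rate alone could account for the growth; a much larger figure would suggest something else is also being audited.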

Between ACCOUNTING and ANALYSE/AUDIT I have tried to trace it.

Usually there appear to be 2 subprocesses of the same Owner ID within a few centiseconds, each lasting 2 to 4 centiseconds, using 1 or 2 centiseconds of CPU.

And now for my BIG RIDDLE:
I have always been convinced that Owner ID was the PID of the parent process.

That means that either that PID is still there, or it is not any more.
Case 1: I should find it with $ SHOW PROCESS/ID=
Case 2: ACCOUNTING /PID= should provide the info.

What am I doing wrong, or what do I not know, when BOTH return NOTHING?

The only changes that I can think of in our system recently are:
--- we started using a web application that uses APACHE for queries (but if that is the bad guy, shouldn't the user be WWW$APACHE?), and
--- we started using SANMGR Hostagent.

Can it be this?
If yes, how do we get it under control?

Proost.

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Volker Halle · 04-19-2005

Jan,

could the parent process have been a detached process run with /NOACCOUNTING ?

Or it could have 'disappeared' without any trace due to a non-fatal bugcheck causing process deletion ?

Can you temporarily enable SET AUDIT/ALARM/ENA=PROC=(CREPRC) to catch all process creations ?

Or temporarily disable SANMGR Hostagent ?

Volker.

Uwe Zessin · 04-19-2005

Yes, the OVSAM Host Agent is always good for a blame ;-)

There is patch SANMGR_00014 for version 3.2 which deals with CPU problems, but it does not say anything about lots of processes.

I agree with Volker that it makes sense to remove it from the system to help diagnose this problem.

The management server will no longer be able to map the hosts properly and you won't get any performance data (only if you have licensed and use that feature), but I think diagnosis comes first.


Robert Brooks_1 · 04-19-2005

It wouldn't surprise me that the SANMGR Hostagent stuff was the culprit. Until recently, OVSAM would spawn a process to parse SHOW DEVICE /FULL output to determine if a path to a multipath device was operational (by looking for the "Not responding" text).

Due to a request by the SANMGR folks, I added another parameter to $GETDVI to allow the specification of a path, so that one can now get path-specific information via $GETDVI.

The enhanced $GETDVI first shipped as part of V8.2 (both I64 and Alpha). It has been backported to V7.3-2, and is available with the SYS 600 kit (but I'd wait for the soon-to-be-released 700 kit, which contains some useful $GETDVI fixes, although these fixes aren't related to the pathname parameter).

The V8.2 change is for F$GETDVI and LIB$GETDVI as well; the backported changes do not include LIB$GETDVI. The enhancement to F$GETDVI for V7.3-2 will appear in the next DCL kit, which should be available soon.

All of this is documented in the V8.2 System Services and DCL manuals. This work includes several new item codes to return path-specific information.

-- Rob

Wim Van den Wyngaert · 04-19-2005

Jan,

I try to monitor such things by monitoring the number of buffered I/Os that JOB_CONTROL is doing. If it is more than e.g. 3, then investigation is needed.

Wim


Wim Van den Wyngaert · 04-19-2005

Of course I mean 3 per second.
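
A minimal sketch of that kind of rate check, assuming two readings of the cumulative buffered I/O count of JOB_CONTROL taken some seconds apart. How the readings are collected is not shown; the `Sample` type, the function names and the numbers in the usage lines are hypothetical and only illustrate the "more than 3 per second" rule.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds: float     # time at which the counter was read
    buffered_io: int   # cumulative buffered I/O count at that time


def bio_rate(earlier: Sample, later: Sample) -> float:
    """Buffered I/Os per second between two readings of the counter."""
    elapsed = later.seconds - earlier.seconds
    if elapsed <= 0:
        raise ValueError("samples must be taken in increasing time order")
    return (later.buffered_io - earlier.buffered_io) / elapsed


def needs_investigation(earlier: Sample, later: Sample,
                        threshold_per_second: float = 3.0) -> bool:
    """True when the rate exceeds the threshold (3 per second by default)."""
    return bio_rate(earlier, later) > threshold_per_second


# Made-up readings one minute apart: 600 buffered I/Os in 60 s is 10 per second.
busy = needs_investigation(Sample(0.0, 1000), Sample(60.0, 1600))
quiet = needs_investigation(Sample(0.0, 1000), Sample(60.0, 1120))
print(busy, quiet)
```

Working from counter differences rather than a single reading means the check does not depend on how long the process has been running.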

Wim

Jan van den Ende · 04-19-2005

Volker,

That would explain my seeking in vain.
I will walk through the procedures to check for it.

Uwe,

I just had time to enter this thread before going home (today several personal activities, not at work, and tomorrow the Dutch Interex Security seminar, also not at work), but I HAD already made an appointment with the SAN managers to disable the hostagent for some hours. It is my prime suspect, but I need proof before a verdict.

Robert,

good to know.

Perhaps you could extend your help to the OVSAM people with a little bit of sound VMS education?
< :-) >

Willem,
if they become of interest, I certainly will dive into such statistics as well!

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · 04-21-2005

Update:

Stopped the OVSAM agent clusterwide.

$ ACCOUNT /SIN=

On all nodes the subprocesses suddenly stopped appearing!

I do not think I will start it again.

Our SAN manager _HAS ALREADY_ received some patches for the VMS agent.
There is some indication that an issue like this may be addressed.
I will work through the patches and, depending on the need for a reboot, will either implement them soon or they will have to wait for our next update/reboot cycle.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · 06-16-2005

We received the new version of HPOVSANAGENT.

Minor remark: to make things easier it carries the exact same ID and version, although it IS definitely different.

The first thing noted is that it needs a higher patch level.
At least that gave us a sound reason to get up to date on patches again :-)

After the prerequisite red tape we did a nice rolling upgrade. Apart from minor issues with using new COPY qualifiers that were not yet in the processes' command tables, everything went smoothly.

We did, however, have no desire to install this stuff shortly before leaving for Nashua.

Today we finally had occasion to do the install.

Lo and behold: the big difference is that we see nothing different!

Starting the SAN agent again results in approx 10 subprocesses per second, which use 2 or 3 centiseconds CPU in ca 5 or ca centisecs clocktime, and end in Normal Termination.
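
To put those numbers in perspective, here is a small sketch of the arithmetic. It takes the 10 subprocesses per second and 2 or 3 centiseconds of CPU each at face value; whether the rate is per node or cluster-wide is not stated, so the result is only an order-of-magnitude estimate.

```python
# Rough load estimate from the figures quoted in the post.
SUBPROCESSES_PER_SECOND = 10       # approximate rate after restarting the agent
CPU_CENTISECONDS_EACH = (2, 3)     # CPU time used by each subprocess

per_hour = SUBPROCESSES_PER_SECOND * 3600

# CPU-seconds consumed per wall-clock second, i.e. the fraction of one CPU.
cpu_fraction = [SUBPROCESSES_PER_SECOND * cs / 100 for cs in CPU_CENTISECONDS_EACH]

print(f"{per_hour} subprocesses per hour")
print(f"{cpu_fraction[0]:.0%} to {cpu_fraction[1]:.0%} of one CPU")
```

The hourly figure is in the same range as the 35000 per hour reported at the start of the thread, which fits the observation that the new agent version behaves no differently.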

Sufficient reason to kill the stuff again.

Anybody got any better results? If so, I would like to dig deep in the details of any differences we can find.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Willem Grooters · 06-16-2005

Looks like it, and not just that: it seems they need to be educated in programming style as well...

Willem

Willem Grooters
OpenVMS Developer & System Manager
