Operating System - OpenVMS

Phantom interactive login reset

 
Richard W Hunt
Valued Contributor

Phantom interactive login reset

Config: Alpha ES40 mod 2, reasonably up to date in firmware; OpenVMS 7.3-2, reasonably up to date in patches. My last patch window was about three months ago; the next one is at the end of this month.

We've had a problem in the past where something resets interactive logins to zero. It happens rarely: sometimes several months apart, often more than a year apart. It is so rare that I cannot tie it to any job that I run in batch. I've searched for whatever might do that with some pretty far-reaching SEARCH commands. No joy in the searches so far.

After this had happened a couple of times, I put a fragment in another job that runs every 15 minutes so that I could monitor login resets. It checks the current setting for interactive logins and writes it to a file.
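
A monitoring fragment of the kind described might look like the following. This is a minimal sketch, not the poster's actual code: the log file name is hypothetical, and it assumes that F$GETSYI("IJOBLIM") returns the current interactive limit, as the test scripts mentioned later in this thread do.

```dcl
$! Sketch only. IJOBLIM_MONITOR.LOG is a hypothetical file name.
$ logfile = "SYS$MANAGER:IJOBLIM_MONITOR.LOG"
$ ijl = F$GETSYI("IJOBLIM")              ! current interactive login limit
$! Append to the log, creating it on the first run.
$ OPEN/APPEND/ERROR=create_log mon 'logfile'
$ GOTO write_rec
$create_log:
$ OPEN/WRITE mon 'logfile'
$write_rec:
$ WRITE mon F$TIME(), " IJOBLIM=", ijl
$ CLOSE mon
$! Raise an operator message as soon as the limit is seen at zero.
$ IF ijl .EQ. 0 THEN REQUEST/TO=CENTRAL "Interactive logins are disabled (IJOBLIM=0)"
```

Running a fragment like this more often than every 15 minutes would narrow the window in which a reset can be placed, at the cost of a larger log.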

System had been up over 80 days, no problems. Yesterday according to the logs, between 16:15 and 16:30, something or someone reset interactive logins to zero. I had walked out the door at 15:00 so wasn't me, and my operators know better than to futz around like that. The production support crew also claims ignorance of the matter. (I know there's a straight line in there somewhere, but this isn't funny so I'll skip it.)

I keep the binary audit logs for processing, so I looked in there with a /FULL for everything from 15:00 to 18:00. No joy. Nothing showed up as changing anything. I am set up to audit all of the following (bear with me):

ACL, Mount, Authorization, Install, SYSGEN, Breakin:(dialup,local,remote,network,detached,server), Logfailure:(batch,dialup,local,remote,network,subprocess,detached,server), privilege use (Security), Privilege failure (Security), File Access Bypass:(write,delete,control), Queue access: Other (Create)
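
For reference, a narrowed query over that window could be sketched as below. The assumptions are loud ones: the journal is taken to be in its default location, the output file name is hypothetical, and the incident date (not given in this post) is passed in as P1 in dd-mmm-yyyy form. The qualifiers should be checked against HELP ANALYZE/AUDIT on the system in question.

```dcl
$! Sketch only. P1 = incident date, e.g. as typed by the operator (dd-mmm-yyyy).
$ IF P1 .EQS. "" THEN WRITE SYS$OUTPUT "Usage: @procedure dd-mmm-yyyy"
$ IF P1 .EQS. "" THEN EXIT
$! Report only SYSGEN parameter changes and privilege-use events in the window.
$ ANALYZE/AUDIT/FULL -
        /SINCE='P1':15:00 -
        /BEFORE='P1':18:00 -
        /EVENT_TYPE=(SYSGEN,PRIVILEGE) -
        /OUTPUT=SYS$SCRATCH:IJOBLIM_AUDIT.LIS -
        SYS$MANAGER:SECURITY.AUDIT$JOURNAL
```

A query like this can only report event classes that were enabled when the event happened; it cannot recover an event that was never audited.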

I also referred to my operator.log file from the same time frame. No joy there.

The only code that I have to zero out the interactive logins like that is something that only runs during my customized, multi-threaded system startup. The synchronizer that controls the threads sets logins to zero and won't enable them until all of my other startup threads have completely exited. The code works fine and has worked for literally ten years without causing this kind of problem. The particular code segment that has the SET LOGIN commands in it has long since run (AND exited) and won't run again until I reboot. So I don't think my culprit is that startup module.

Question 1: Has anyone else seen a phantom reset of login/interactive like this?

Question 2: I cannot find ANY reference to this in either the audit logs or the operator log. Is there some other parameter I should set so I can trap this event in the logs?
Sr. Systems Janitor
20 REPLIES
Thomas Ritter
Respected Contributor

Re: Phantom interactive login reset

We used to run LAT-based load-balancing software for juggling users across the cluster. It would set logins to zero to move new connections to another node. It was written in-house.
My guess is that the command is located in some command procedure. Have you tried a big search across all your directories? It may be a buggy ops menu.
Bets are there is a $ SET LOGIN/INTER=0 somewhere on the system.
Hoff
Honored Contributor

Re: Phantom interactive login reset

Scan your audits for changes to the IJOBLIM system parameter; that's the core knob here.

Please modify the local IJOBLIM-related code to report (or to audit or to log) its own activation and its status, using syslog or whatever site-local means is in use here to track activity.

The relative age of the code does not point to the absence of bugs; I've encountered bugs that were latent and lurking for twenty or thirty years. That was in heavily-used code, too.

Trust, but verify.
Ian Miller.
Honored Contributor

Re: Phantom interactive login reset

Also look for something setting the IJOBLIM system parameter. Although I see that you have SYSGEN auditing so it should have shown up there.
____________________
Purely Personal Opinion
Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

To Thomas Ritter and Hoff: Thanks for your comments. However, I think I'm ahead of you there.

I have searched my personal directory, all operator-class directories, the system directory, and my production-support team's directories. I have searched my special SYS$TOOLS folder where I keep home-grown useful programs. I have searched the user account support tools. I have searched the startup tools. Nobody has any code that contains the sequence "/intera" (looking for set login/interactive, of course). I found a few /inter that were other keywords. Nothing juxtaposed with login.

We are no longer a cluster. My nodes are now standalone. We do not run LAT support any more, either. Our security guys block LAT protocols at the closest "smart" switch. So it wouldn't be anything LAT based that I can see.

As to logging when my specialty startup code runs: That code does a $ REQUEST/TO=CENTRAL "message" any time it does anything. That REQUEST message shows up in the operator.log file among other places.

I have reviewed the operator logs, audit logs, and the accounts of everyone who has sufficient privilege to do this. I cannot find anything using

$ SEARCH domain "login","/inte"/match=AND

The domains I have used span at least 50-75 user home directories and their sub-directories. Not all of the users have the required privileges, but they are on the same disks as a few users who DO have that level of privilege, and I didn't constrain the search. I even checked my batch job list for that time. Nothing obvious.

This has happened before, perhaps a couple of years ago, and a couple of years before that. It's why I built the interactive login monitor feature. BUT... the last code fragment I found that could do that to me has been fixed long ago (and no, a backup restoration hasn't occurred since then).

While I freely admit there could be an elusive snippet of code somewhere that could do this, I've searched maybe a couple of hundred directories in total and so far, no joy. Given my system's disk architecture, I'm running out of places to look.
Sr. Systems Janitor
Ian Miller.
Honored Contributor

Re: Phantom interactive login reset

What privileged code do you run?
____________________
Purely Personal Opinion
Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

Ian, as to privileged code:

Most significant of what we run is ORACLE client to a back-end server. I searched their .COM and .LOG files anyway, nothing in their directories seems relevant.

We have a product called SmartStar that is a piece of middleware that speaks ORACLE-ese. The account is privileged enough to disable logins, but I looked both at the files that SmartStar runs and the last couple of days of log files. No joy in that search.

We run normal things like TCPIP Services for OpenVMS (v 5.4 ECO 7 at the moment). I searched that, too. Nada.

The only other site-specific things we run all run in user context and have only the permissions and privileges of the individual application users, i.e., they are not installed with privileges.

This problem occurred in the past but I can't find any reference to it subsequent to about summer, 2006 when we were still on a lower version of OpenVMS, v 7.2-something. My notes don't tell me when it last happened, and my (admittedly far from perfect) memory says it is VERY infrequent. Once every couple of years seems right. In fact, if I didn't have to account for it to the government supervisors, I would have blown it off and just enabled logins again. Logins ARE enabled, but I can't stop searching for an explanation just yet.

I'm a bit frustrated because I am audit-logging SYSGEN events, which should be sensitive to this. I have other events in the audit log, including when one of my operators had to reset a password to something pre-expired for a new operator, and then again when the new operator reset his own password. Password resets showed up, a couple of security queries showed up (using SECURITY privilege to look at something), but no reset of interactive login counts.
Sr. Systems Janitor
Hoff
Honored Contributor

Re: Phantom interactive login reset

If you're not seeing any IJOBLIM parameter audits but are seeing the value zeroed, then the next obvious step is a look at the kernel-mode code here, and a look for CMEXEC or CMKRNL audits around the time of the error. Privileged-mode code executing on this server would appear to be writing to the SYS$GW_IJOBLIM cell.

Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

Hoff, where would I look for this CMKRNL or CMEXEC level event if not in the AUDIT logs? Surely you don't mean the error logs? I thought that kernel mode exceptions would crash the system and that exec mode exceptions would crash a service. I'm looking at over 82 days uptime at the moment and have no service outages. Also, SHOW ERROR says only 1 count each from DQB0 and DQB1, both of them boot-up events.

I don't have a sources kit so can't search for anything in the executive code anyway. I'm willing to do some research but I've started to run out of options.
Sr. Systems Janitor
Ian Miller.
Honored Contributor

Re: Phantom interactive login reset

You can enable audits for the use of CMEXEC and CMKRNL.
____________________
Purely Personal Opinion
Hoff
Honored Contributor

Re: Phantom interactive login reset

"The error" here being the errant change to the IJOBLIM value.

Here, I'd review all of the add-on code with kernel-mode components eg: TCP/IP Services, Oracle, add-on device drivers or UWSS images or such, and basically anything else installed here that needs CMEXEC or CMKRNL to load or to INSTALL or to operate at run-time. Check for ECOs and updates, including for all layered products and for OpenVMS itself.

You're not particularly looking at the OpenVMS software itself here, as (while that's certainly a potential culprit here) any obvious or overt bugs within the OpenVMS software would tend to be visible on multiple systems; there'd be wider references to problems with IJOBLIM around the 'net. (Isolated manifestations of these sorts of bugs within OpenVMS are rare, though not unheard of.)

And I'd enable audits for CMEXEC and CMKRNL, as mentioned, to capture local activity with these privileges.
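
Enabling those privilege audits might look like the sketch below. It assumes the PRIVILEGE keyword of SET AUDIT/ENABLE accepts a list of privilege names after SUCCESS, and that the account has SECURITY privilege; confirm the exact syntax with HELP SET AUDIT before relying on it.

```dcl
$! Sketch only; requires SECURITY privilege.
$! Audit successful use of the two change-mode privileges.
$ SET AUDIT/AUDIT/ENABLE=PRIVILEGE=(SUCCESS:(CMKRNL,CMEXEC))
$ SHOW AUDIT          ! confirm that the new settings took effect
```

System components use these privileges routinely, so the audit journal may grow noticeably faster once this is on.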

And I'd SEARCH ddcu:[*...]*.*/WINDOW=0 IJOBLIM and related permutations, and investigate the matches. (There will be a few matches normally, for the library definitions and such.)

Also search for evidence of an aborted system SHUTDOWN sequence. It is certainly unlikely, but that too will trigger a change to IJOBLIM.

Or (if you have support) ring up HP and have them search their support databases for stuff that sets IJOBLIM to zero.

Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

OK, the simple cases: No aborted system shutdowns. I modified that to add a logging event (silently) in a non-standard place besides whatever else it might write to the operator logs. Nothing to see.

I've searched the code for TCPIP, ORACLE, and SmartStar (my only products of note) plus LEGATO NETWORKER, which I forgot earlier. But so far, nothing is turning up. What really bothers me is that this is a truly infrequent event based on past cases.

The AUDIT settings now include CMKRNL and CMEXEC successes. I'm not worried about failures because that won't kill my system. It is only the successes that worry me.

I'm really torn between hoping I catch it right away and hoping that there will be nothing to catch before we upgrade to OpenVMS 8.3, which is on our schedule.

Sadly, at the moment we don't have s/w support, though we are also getting ready to have that reinstated. Hurricane Katrina screwed over our support budgets for the last couple of years because of all the other stuff we had to replace. (We are in New Orleans.) Support is on our budget for next year, so I guess that's a sign that we are getting back closer to normal. Anyway, I digress. As far as researching this, I'm on my own for a little while longer, except for whatever support I can get (with much thanks) from this forum and online searches.
Sr. Systems Janitor
Hoff
Honored Contributor

Re: Phantom interactive login reset

So a search of all files on all disks

SEARCH ddcu:[*...]*.*/WINDOW=0 IJOBLIM

turns up nothing?
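
A loop that applies this search to every mounted disk could be sketched as follows. It is only an illustration under stated assumptions: it uses the F$DEVICE and F$GETDVI lexical functions to walk the disk devices, skips disks that are not mounted or are mounted foreign, and takes no special care over shadow-set members or volume sets.

```dcl
$! Sketch only. Searches all files, all types, all versions, on each disk.
$ SET NOON                                   ! keep going after access errors
$next_disk:
$ dev = F$DEVICE("*", "DISK")                ! next disk device, "" when done
$ IF dev .EQS. "" THEN GOTO done
$ IF F$GETDVI(dev, "MNT") .EQS. "FALSE" THEN GOTO next_disk
$ IF F$GETDVI(dev, "FOR") .EQS. "TRUE" THEN GOTO next_disk
$ WRITE SYS$OUTPUT "Searching ", dev
$! /WINDOW=0 lists only the names of files that contain a match.
$ SEARCH/WINDOW=0/NOWARNINGS 'dev'[*...]*.*;* "IJOBLIM"
$ GOTO next_disk
$done:
$ EXIT
```

Because the file specification is fully wildcarded, this also reads object files and executable images, which is where code that writes the data cell directly would show up.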
Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

Searching for IJOBLIM turns up a few jobs that test it ($ IJL = F$GETSYI("IJOBLIM") type of thing) to see if logins are disabled at the time before attempting something that depends on enabled logins. But I found no references to anything that would attempt to set that value.

I specifically have avoided writing any privileged code for this site, or an image that could be installed privileged, when a script could do the job running under a privileged account. My tool scripts are all written to operate in user context precisely to avoid this kind of problem. There is nothing privileged at our site except well-known utility programs and scripts that could possibly do something if run from a hot enough account. But doing it that way would mean that the audit log would show the account under which this putative thing ran. And in this case, no audit entries show any such actions.
Sr. Systems Janitor
Hoff
Honored Contributor

Re: Phantom interactive login reset

So to confirm (as the discussion keeps wending its way back to *.COM files), the SEARCH targeted [*...]*.*;* on all disks? All files and all file types? And turned up nothing of interest?
Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

Hoff, you are almost correct. My searches have seen nothing suspect, but I haven't searched all disks visible to my systems.

I have over 70 disks to search, but only 6 of them house user accounts. The rest are database files and historical logs, old applications (payroll) data, and the like. I've searched the user home disks, the system disks, the disks dedicated to specific products and applications, the disks holding logs and reports, and anything else that can be used as a log-file archive. Also, some disks weren't mounted at the time and cannot have been involved because nothing here except OpenVMS itself uses physical disk I/O.

I have not touched the disks that retain separate personnel-related info and our applications reports, but I did look for the existence of log files there. I have not tried to search the disks that weren't mounted at the time.

Everything else I searched turns up no matches to

SEARCH [*...]*.*;* "login","/inte"/match=and

except for my startup code that manipulates the login settings. And I've already eliminated that based on the other logging that would have occurred if that code were somehow involved.

I ran a DIAGNOSE/TRANSLATE/FULL/OUT=xxxx on the active errorlog file. The good news is that no errors are showing up. (Yay! I've got a clean system!) The bad news is that no events of any kind are recorded except for a few timestamps, and none of those fall anywhere near the time in question.

The outage isn't an illusion because our Help Desk staff confirmed that they saw a problem. They all thought it must have been their passwords expiring and were waiting for one of the operators to come in who would reset their passwords. And of course, the operators, with OPER privilege, were able to log in anyway. In fact, quite a few passwords WERE reset because of the event. So it was a real event.

I had already enabled logins before I changed the audit log settings to include the CMKRNL and CMEXEC privileges. Therefore, the ENABLE event didn't get logged either. I'll have to set aside a time to do a test on one of my stand-by servers. I just have to be sure to warn our network guys so they don't go bonkers when a server stands up and says Howdy on the network. (Government sites are like that, you know...)
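
Such a stand-by test could be sketched as below: toggle the limit, then see what the audit journal recorded. The sketch assumes OPER and SECURITY privileges, the default journal location, and that SET AUDIT/SERVER=FLUSH pushes buffered records to the journal before it is read; none of this comes from the thread itself.

```dcl
$! Sketch only, for a stand-by system. Needs OPER and SECURITY privileges.
$ saved = F$GETSYI("IJOBLIM")                ! remember the current limit
$ now = F$EDIT(F$TIME(), "TRIM")             ! "dd-mmm-yyyy hh:mm:ss.cc"
$! Rebuild the timestamp in dd-mmm-yyyy:hh:mm:ss.cc form for /SINCE.
$ since = F$ELEMENT(0, " ", now) + ":" + F$ELEMENT(1, " ", now)
$ SET LOGINS/INTERACTIVE=0                   ! the event under investigation
$ SET LOGINS/INTERACTIVE='saved'             ! restore the original limit
$ SET AUDIT/SERVER=FLUSH                     ! write buffered audit records
$ ANALYZE/AUDIT/FULL/SINCE='since' SYS$MANAGER:SECURITY.AUDIT$JOURNAL
```

If nothing appears for the two SET LOGINS commands, the enabled audit classes do not cover this event and the settings need widening before the next occurrence.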

As it stands, the event occurred and the only programmatic thing that saw it was my system health check that runs every 15 minutes. Nothing else was logged in the audit journal, operator log, or system error log. It would take some time to reconstruct a picture of who was logged in at the time of the original event by back-tracking accounting logs, though I could do that.

I'm beginning to think that I will see nothing there, either, because this is a Navy site. Our prime shift ended at 1500. The event in question occurred over an hour later, and NOBODY from any other time zone than ours has adequate privilege to trigger this kind of event.

In summary, I have to believe that my audit log settings weren't adequate to detect the event and that's why I can't assign a cause.

I'm going to leave the thread open a while longer because I might post a solution myself later if I find one. But I won't be holding my breath until then.
Sr. Systems Janitor
EdgarZamora_1
Respected Contributor

Re: Phantom interactive login reset

For the sake of completeness (and your sanity), since you've already turned on auditing for CMKRNL and CMEXEC, you might as well turn on OPER:SUCCESS. The DCL command, SET LOGIN/INTER=0 (which I would suspect to be the culprit here since it's the most expedient way to turn off logins), will not generate CMKRNL or CMEXEC alerts, but will show up as an OPER alert.
Hoff
Honored Contributor

Re: Phantom interactive login reset

You're (still?) presuming with your DCL search commands that the offending code here is using the DCL interface, and that such usage should have shown in the audits.

Remember to try this search across your disks, too:

SEARCH [*...]*.*;* IJOBLIM

as this search is somewhat more likely to turn up (object and executable) code that goes directly after the OpenVMS kernel-mode data cell.
Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

I'll set the OPER logging, too.

Hoff, I searched for IJOBLIM references as well as the SET LOGIN commands. As noted before, the only IJOBLIM references were found where I was trying to decide from a script if logins were enabled at the time.

Nothing and no one programmatically sets IJOBLIM that I can find. My searches included source code in DCL, BASIC, and Ada and log files from all batch jobs that ran on that date. No other languages are supported on this system, but the searches would have even caught samples from the examples sub-folders of SYS$LIBRARY.

In my mind, I've exhausted the possibilities and cannot show any evidence of why this event happened. Probably because my audits weren't set correctly. Live and learn!
Sr. Systems Janitor
Hoff
Honored Contributor

Re: Phantom interactive login reset

I'm not referring to source-level file searches. Apparently we're talking past each other here, or I'm really dense today. I meant *.*.* here. Wildcards for everything. All files. All disks. Ah, well; I give up. I don't know what's going on here.
Richard W Hunt
Valued Contributor

Re: Phantom interactive login reset

Hoff, I appreciate your trying. I have done some searches that included .OBJ and .EXE files, but only in folders outside of SYS$SYSTEM and SYS$LIBRARY. (Because nobody has put anything there - we install our binaries from a different location.)

As was pointed out in a prior post, if this were a commonplace event, we would have heard about it long before this from a lot more irate users. So if it is original executables from SYS$SYSTEM or SYS$LIBRARY or whatever else is considered as "standard" system locations, I have pretty much ruled it out.

I'm still not going to close the thread, but the event has not recurred since first posted. Given its low frequency, I'm not going to hold my breath. Blue just isn't my color.

Sr. Systems Janitor