Re: Strange things in DECNET+

Wim Van den Wyngaert · ‎12-06-2005

This morning alarms led me to a node that was saturated at the DECnet level. VMS 7.3, patched up to about 25-Aug-2003. DECnet 7.3 ECO 3, dated 28-Oct-2002.

Further info in attachment.

Does anyone have any idea what happened, and how to investigate after the reboot?

Wim

Heinz W Genhart · ‎12-06-2005

Hi Wim

As you can see, maximum transport connections is defined as 500.

So the connection limit has been reached.

You can change this with sys$startup:net$configure or by editing the file SYS$SPECIFIC:[SYSMGR]NET$NSP_TRANSPORT_STARTUP.NCL;

As a guideline:

Select 1000 transport connections with a maximum window of 20 and maximum receive buffers of 20000.

Be aware that maximum window has an upper limit of 65535 (not absolutely sure; it may be less).

Hope that helps

Regards

Heinz
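The figures in the guideline above are consistent with one receive buffer per window slot per connection (1000 x 20 = 20000). A minimal Python sketch of that arithmetic follows; the helper name `receive_buffers` is hypothetical, and the relation itself is an assumption inferred from the numbers quoted in the post, not a documented DECnet-Plus rule.

```python
# Hypothetical sizing helper. The relation buffers = connections * window is an
# assumption inferred from the guideline figures, not a documented rule.
def receive_buffers(max_connections: int, max_window: int) -> int:
    """Buffers needed if every connection may fill its whole window."""
    if max_connections <= 0 or max_window <= 0:
        raise ValueError("both values must be positive")
    return max_connections * max_window


# Guideline values from the post: 1000 connections, window of 20.
print(receive_buffers(1000, 20))  # 20000
```

Under the same assumption, keeping the existing limit of 500 connections with a window of 20 would correspond to 10000 receive buffers.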

Wim Van den Wyngaert · ‎12-06-2005

H.,

Forgot to mention that this is an AS1000 and that 500 is very high for this node.

Normally about 50 connections are open.

But I killed (almost) every process using connections and still 500 were in use. One process I killed freed about 40 connections, but they were taken again within a minute.
So I guess there was some kind of attack over DECnet.

Wim

Robert Gezelter · ‎12-06-2005

Wim,

I would recommend NOT killing processes, but first displaying the list of active connections (and their originating and receiving processes) to a file.

Deleting processes is like washing down a crime scene: it destroys evidence of what is happening (or has happened).

The syntax for DECnet+ escapes me at the moment, but the Phase IV (NCP) syntax would be SHOW KNOWN LINKS.

- Bob Gezelter, http://www.rlgsc.com

Wim Van den Wyngaert · ‎12-07-2005

Bob,

I tried that, but the programs (ncl and net$mgmt) both hung. So I tried to kill the processes one by one until I killed the one that held the connections.

In the meantime I found out that many other nodes logged "reject received" in the operator log file.

Wim

Peter Zeiszler · ‎12-07-2005

Sometimes getting the machine operational is given a higher priority than finding out what caused the problem.

Is there anything in the operator log or security audit of failed connections or attempts?

Do you have any monitoring that might show what time the additional connections started?

Were any processes in a resource wait state? (DECnet primarily)

Edwin Gersbach_2 · ‎12-07-2005

Just a hint to the crash reason:

---------------
I found the TNS1 process still active and tried to kill it. That restarted the system.
MXM01/MGRWVW>stop/id=000000A0
---------------

But:

000000A0 TCPIP$INETACP HIB 10 691 0 00:00:13.57 217 144
0000012E AUDIT_CLIENT LEF 6 1469 0 00:00:06.41 323 176
0001B62F TCPIP$TNS1 HIB 6 120 0 00:00:00.27 532 32

So you killed INETACP, which I guess is hooked rather deeply into the kernel.

Edwin
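The check Edwin describes, confirming which process owns a PID before stopping it, can be sketched in a few lines of Python. This is only an illustration: it uses the three listing lines quoted in the post as input, the function names are hypothetical, and it assumes the first two whitespace-separated columns are the PID and a process name without embedded spaces.

```python
from typing import Dict, Optional

# The three lines quoted in the post, used here as sample input.
LISTING = """\
000000A0 TCPIP$INETACP HIB 10 691 0 00:00:13.57 217 144
0000012E AUDIT_CLIENT LEF 6 1469 0 00:00:06.41 323 176
0001B62F TCPIP$TNS1 HIB 6 120 0 00:00:00.27 532 32
"""


def pid_table(listing: str) -> Dict[str, str]:
    """Map PID -> process name (assumes names contain no spaces)."""
    table = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            table[fields[0].upper()] = fields[1]
    return table


def owner_of(pid: str, listing: str = LISTING) -> Optional[str]:
    """Return the name of the process that owns the PID, or None if unknown."""
    return pid_table(listing).get(pid.upper())


print(owner_of("000000A0"))  # TCPIP$INETACP
print(owner_of("0001B62F"))  # TCPIP$TNS1
```

Looking up 000000A0 returns TCPIP$INETACP, which is the mismatch pointed out above: the PID given to stop/id did not belong to TCPIP$TNS1.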

Wim Van den Wyngaert · ‎12-07-2005

Peter,

The monitoring was stuck itself.

The process in RWxxx was a TPU session.

No audit alarm.

Nothing special in accounting.

No log files with other error messages (on client + server).

Because almost all processes using DECnet were killed and 500 connections were still in use, I think it must be a DECnet bug. All nodes that were connected over DECnet had still delivered messages to the node (they were accepted and found afterwards), but they had also received reject messages.

Wim

Wim Van den Wyngaert · ‎12-07-2005

Edwin,

Very good. I made that mistake. But even that should not halt the system. Why was there no crash?

Wim

Edwin Gersbach_2 · ‎12-07-2005

Why no crash?

>> halted CPU 0
>> halt code = 2
>> kernel stack not valid halt
>> PC = ffffffff801551a4

Probably because the designers decided that after a kernel stack corruption it was too dangerous to perform even a dump. It could corrupt the disk if the dump code or parameters got weird.

Edwin

John Travell · ‎12-07-2005

Edwin,
>> halted CPU 0
>> halt code = 2
>> kernel stack not valid halt
>> PC = ffffffff801551a4
>probably because the designers decided that after a kernel stack
>corruption it was too dangerous to perform even a dump. It could
>corrupt the disk if the dump code or parameters got weird.

Not so. If your console is set up correctly, i.e. AUTO_ACTION is set to RESTART, then VMS will restart for the explicit purpose of taking an appropriate bugcheck. In this case it would have been KRNLSTAKNV.
There are of course situations where even this restricted restart is not possible, and others where the bugcheck code cannot write the dumpfile, but the ability to preserve the evidence after a pathological halt has always been present in VMS.

JT:

Wim Van den Wyngaert · ‎12-07-2005

AUTO_ACTION is set to BOOT.

Has anyone had bad experiences with AUTO_ACTION=RESTART, e.g. the automatic reboot failing?

Wim

Peter Zeiszler · ‎12-08-2005

We have not had a problem with auto_action=RESTART. If there are any issues with having this as RESTART I would like to know also.

We had to set all of ours to RESTART after a memory issue where the system did not create the crash dump.

Jan van den Ende · ‎12-08-2005

Wim,

I have no clue as to what happened, but
>> kernel stack not valid halt
should DEFINITELY be a reason to write a dump!
During the CrashDumpAnalysis course prior to last Bootcamp one whole chapter was dedicated to just that kind of dumps.
But, you DO need the dumpfile... :-(

So, basically you now have two problems:
a- What happened to DECnet?
b- WHY is there no dumpfile?

Some help I am, eh?
Sorry.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Wim Van den Wyngaert · ‎12-08-2005

Jan,

I was surprised to discover that no dump was done. I don't quite understand why in this case you have to specify "restart" to get the dump. What is the logic?

In the meantime, I discovered that the problem began on 6 Dec at 0:05. I guess it was a collision between several DECnet things happening at the same time (T2T, ncl).

In any case, I will classify this problem as a "very rare bug" and hope I will never see it again. But if I see it, I will crash the system myself.

Wim

John Travell · ‎12-08-2005

All, the point is that a KRNLSTAKNV is a pathological halt. When one of these has occurred, the CPU is in CONSOLE code, not VMS code.
Since the bugcheck code is part of VMS it will not happen UNLESS the CONSOLE takes action to restart VMS for the purposes of taking a restart bugcheck.

AUTO_ACTION is NOT just for BOOT, it comes into play EVERY time an uncontrolled entry to console code occurs. Just about the ONLY thing that constitutes a controlled entry is the end of a shutdown (or a bugcheck), where VMS (and probably Unix) tells the console to expect a halt, and perhaps what other action to take as well (think power off, reboot).

Power on, Kernel stack not valid, double error, halt instruction. All are considered uncontrolled console entries and cause the console to do whatever AUTO_ACTION dictates.

I DID once have a case where RESTART caused a problem, back in the days when V6.1 was current. A problem caused corruption of the system page table, which led to code winding down the stack. The KRNLSTAKNV triggered auto_action restart, the attempt to restart fell over the corrupted SPT, which led to another KRNLSTAKNV restart, which led to...
This problem was a bit of a bitch, it took us three weeks to fix it.

The issues around RESTART are mainly related to whether you want a cluster node to rejoin immediately or not. There are some sites where a failed machine is left failed until the next 'reboot opportunity', whenever that was. Turning off RESTART causes loss of the dump if a pathological halt occurs. For such a situation, a better solution may be to always stop at SYSBOOT and wait for a continue command.

JT:
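The console behaviour John describes can be summarised as a small decision function. The Python sketch below is a model of the explanation in this thread, not console firmware logic: the names `AutoAction`, `Outcome` and `console_response` are hypothetical, and the HALT branch (staying at the console prompt) is an assumption, since the thread only discusses BOOT and RESTART.

```python
from enum import Enum
from typing import NamedTuple


class AutoAction(Enum):
    HALT = "HALT"
    BOOT = "BOOT"
    RESTART = "RESTART"


class Outcome(NamedTuple):
    action: str
    restart_bugcheck: bool  # True if VMS is restarted so it can write a dump


def console_response(auto_action: AutoAction, controlled_entry: bool) -> Outcome:
    """Model what the console does when the CPU enters console code."""
    if controlled_entry:
        # End of a shutdown or bugcheck: VMS told the console what to expect.
        return Outcome("follow the operating system's request", False)
    # Uncontrolled entry: power on, kernel stack not valid, double error, halt.
    if auto_action is AutoAction.RESTART:
        return Outcome("restart VMS to take the restart bugcheck", True)
    if auto_action is AutoAction.BOOT:
        return Outcome("boot from scratch", False)
    return Outcome("stay at the console prompt", False)  # assumed for HALT


# The situation in this thread: kernel stack not valid with AUTO_ACTION = BOOT.
print(console_response(AutoAction.BOOT, controlled_entry=False))
```

With BOOT the model returns restart_bugcheck=False, which matches what was observed here: the node rebooted and no dump was written. With RESTART the same halt would have given VMS the chance to take the KRNLSTAKNV bugcheck.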
