TCPIP$INETACP process goes to CUR state, becomes unresponsive, can't be restarted.

witchy · ‎01-07-2021

Hi folks,

I have a couple of Integrity servers running OpenVMS 8.4 with TCP/IP Services 5.7. They are both very stable machines usually and had been up for a year before this problem started just before xmas last year.

Basically incoming SSH connections hang, and on examining the networking processes I see that TCPIP$INETACP is in CUR state rather than HIB, and because it's in this state I can't do anything with the process. Any command that looks at the network settings, like TCPIP SHO COMM, results in my process hanging. Trying to restart the services with @sys$startup:tcpip$shutdown also results in a process hang which subsequently goes into RWAST state and can't be killed. TCPIP$INETACP similarly can't be killed. Because these are legacy servers now, I can reboot them, so that's usually the only way out.

Obviously TCPIP$INETACP is getting hung up on something but what? The activity on the machines has been reduced to almost nothing in recent months, in fact the 2nd machine literally does nothing but sit there. The primary machine is mostly a Samba target these days, a role it's been doing for years.

Any clues on how I can see what's causing TCPIP$INETACP to stay in CUR state?

Cheers

Witchy

Volker Halle · ‎01-07-2021

Witchy,

TCPIP$INETACP will be in a tight compute loop. As you seem to have a multi-CPU system, the process is in CUR state all of the time. On a single-CPU system, you would always see it as COM.

Using $ SHOW PROC/CONT/ID=<pid-of-TCPIP$INETACP> you could monitor the program counter (Current PC) and find out, whether this process does any Direct I/Os or Buffered I/Os.

This could be a TCPIP software problem - as always: consider obtaining and installing the most recent TCPIP patches.

Or it could be some kind of incoming IP traffic which triggers this problem. You should then see the TCPIP$INETACP process doing IOs. You may be able to find out about existing IP sessions etc. using $ ANA/SYSTEM and the TCPIP subcommands (see SDA> TCPIP HELP).

Instead of just rebooting the OpenVMS system, you could try to force a crash, which allows you to record the system state and later analyze it from the system dump file.

Volker.

witchy · ‎01-11-2021

Hi Volker,

I didn't consider crashing the system, good idea. I have to reboot it every time anyway. TCPIP$INETACP is stuck in so tight a loop that the only data on the $SH PROC that changes is the CPU time; PC is constantly %x7AE6BB80. It also gets moved between CPUs so I can't blame a failing processor, and why it started happening on two totally separate machines at the same time is baffling. The environment has been stable for over a year with no change to the workload - one server is hit fairly heavily with Samba traffic but always has been, and the 2nd server these days just sits there with nothing running.

Patching is unavailable unfortunately. VMS support for these servers ended a long time ago.

I've been looking at external networking too but there's a 3rd machine in the rack that's just sitting doing almost as much nothing as the 2nd machine. Waiting to see the external switch configs which we don't look after.

Cheers

Witchy

Volker Halle · ‎01-11-2021

Witchy,

are you sure about the Current PC value ? Isn't that the Current user SP ?

You could use SDA PC sampling to find out more about the PC values...

$ ANALYZE/SYS

SDA> PCS ! will show you the PC Sampling help information

SDA> PCS LOAD

SDA> PCS START TRACE ! maybe use /PID=<pid-of-TCPIP$INETACP>

and let it run for a while

SDA> PCS STOP TRACE

SDA> PCS SHOW TRACE/STAT

When you're finished: SDA> PCS UNLOAD

Maybe you can get an idea, in which code TCPIP$INETACP is looping.

Does SDA> TCPIP SHOW xxx work (see SDA> TCPIP HELP SHOW) ? Anything suspicious ?

If both systems are starting to show the problem at the same time of day, it is very likely, that the problem is coming 'from the network'. If the 3rd system is not affected, it might not be running a specific piece of software, that is affected.

Volker.

Volker Halle · ‎01-11-2021

Witchy,

exactly which version of TCPIP are you using ($ TCPIP SHOW VERSION) ?

Note: none of the patches ECO 1 ... 5 in the TCPIP V5.7 ECO 5 release notes, which have a deliverable of TCPIP$INETACP.EXE describe a similar symptom.

Volker.

witchy · ‎02-23-2021

Hi Volker,

We're running V5.7 ECO 2 and it was rock steady until the week before xmas last year when some network changes were made to the building-wide wifi and almost immediately these 2 servers started acting up. The 3rd box is just running, no actual apps or network shares are accessed on it any more.

Yesterday I tried swapping LAN cables between the main server and this idle one to see if the fault would follow the cables but no. The fault usually occurs overnight so I have to reboot one or both machines every morning. I've just done it again otherwise I'd have tried the SDA things you posted earlier. There's always tomorrow

We also don't have access to the network itself so I can't find out how the Cisco switches are configured, which makes things interesting. Like you I think the problem is coming from the network rather than the servers themselves.

Cheers

Witchy.

witchy · ‎02-24-2021

OK, both servers were in their usual inaccessible state this morning so I've done the PCS tracing. Both are looping at IMAGE_MANAGEMENT+40780

PC sampler information:
-----------------------
PC                 IPL  Pid              Count  Routine                           Module
-----------------  ---  --------  ------------  --------------------------------  -----------------------
FFFFFFFF.80A22280    0  00000424         23424  IMAGE_MANAGEMENT+40780            IMAGE_MANAGEMENT
FFFFFFFF.81A59BC0    8  00000424             1  TCPIP$INTERNET_SERVICES+002A3FC0  TCPIP$INTERNET_SERVICES
FFFFFFFF.81A4D710    8  00000424             1  TCPIP$INTERNET_SERVICES+00297B10  TCPIP$INTERNET_SERVICES
FFFFFFFF.81848500    8  00000424             1  TCPIP$INTERNET_SERVICES+92900     TCPIP$INTERNET_SERVICES
FFFFFFFF.817EFD40    8  00000424             1  TCPIP$INTERNET_SERVICES+3A140     TCPIP$INTERNET_SERVICES
FFFFFFFF.80E40240    8  00000424             1  SYS$PGQDRIVER+9A640               SYS$PGQDRIVER
FFFFFFFF.801281A0    8  00000424             1  EXE$DEANONPGDSIZ_C+00050          SYSTEM_PRIMITIVES_MIN
FFFFFFFF.800CA860    8  00000424             1  EXE$IPINT_VECTOR_C+000A0          SYSTEM_PRIMITIVES_MIN

PC sampler information:
-----------------------
Time = 24-FEB 10:03:40.475453
CPU = 02
IPL = 0
Mode = Kernel
PID = 00000424 TCPIP$INETACP
PC = FFFFFFFF.80A22280 IMAGE_MANAGEMENT+40780 IMAGE_MANAGEMENT
B0 = FFFFFFFF.80A22240 IMAGE_MANAGEMENT+40740

R2 = 00000000.0100004B
R3 = 00000000.00000000
R4 = FFFFFFFF.9081E540 PCB
R5 = 00000000.00000000
R6 = 00000000.7FF43F40
R7 = 00000000.00000001
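As a side note for anyone repeating this on a longer trace: the sampler table above can be summarised off-line to confirm where the time goes. The following is only a minimal Python sketch, assuming the PCS SHOW TRACE/STAT listing has been saved as text with six whitespace-separated fields per row (PC, IPL, PID, count, routine, module). The names PCS_OUTPUT, tally_by_routine and hottest are made up for this illustration and are not part of SDA.

```python
from collections import Counter

# Rows copied from the PCS SHOW TRACE/STAT listing in this thread.
PCS_OUTPUT = """\
PC IPL Pid Count Routine Module
----------------- --- -------- ------------ -------------------------------- ------
FFFFFFFF.80A22280 0 00000424 23424 IMAGE_MANAGEMENT+40780 IMAGE_MANAGEMENT
FFFFFFFF.81A59BC0 8 00000424 1 TCPIP$INTERNET_SERVICES+002A3FC0 TCPIP$INTERNET_SERVICES
FFFFFFFF.81A4D710 8 00000424 1 TCPIP$INTERNET_SERVICES+00297B10 TCPIP$INTERNET_SERVICES
FFFFFFFF.81848500 8 00000424 1 TCPIP$INTERNET_SERVICES+92900 TCPIP$INTERNET_SERVICES
FFFFFFFF.817EFD40 8 00000424 1 TCPIP$INTERNET_SERVICES+3A140 TCPIP$INTERNET_SERVICES
FFFFFFFF.80E40240 8 00000424 1 SYS$PGQDRIVER+9A640 SYS$PGQDRIVER
FFFFFFFF.801281A0 8 00000424 1 EXE$DEANONPGDSIZ_C+00050 SYSTEM_PRIMITIVES_MIN
FFFFFFFF.800CA860 8 00000424 1 EXE$IPINT_VECTOR_C+000A0 SYSTEM_PRIMITIVES_MIN
"""


def tally_by_routine(text):
    """Sum the sample counts per routine; header and rule lines are skipped."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # A data row has six fields and a numeric count in the fourth one.
        if len(fields) != 6 or not fields[3].isdigit():
            continue
        counts[fields[4]] += int(fields[3])
    return counts


def hottest(text):
    """Return (routine, samples, share of all samples), or None if no rows parsed."""
    counts = tally_by_routine(text)
    total = sum(counts.values())
    if total == 0:
        return None
    routine, hits = counts.most_common(1)[0]
    return routine, hits, hits / total


if __name__ == "__main__":
    print(hottest(PCS_OUTPUT))
```

On the listing above nearly every sample lands on one address, which is what a tight loop looks like; the handful of single samples at IPL 8 are most likely ordinary interrupt-level activity that happened to be caught while the process was current.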

Looking around other messages here I've also done

SDA> SH SUMM
SDA> SET PROC/IND=24
SDA> SH CALL/SUMM

Call Frame Summary
------------------

Frame Type (mode) Handle Current PC
------------------------ ----------------- -----------------
Memory Stack Frame (K) 00000000.7FF2E0E0 FFFFFFFF.80A22280 IMAGE_MANAGEMENT+40780
Memory Stack Frame (K) 00000000.7FF43F20 FFFFFFFF.800F40E0 EXE$CALL_SHSBA_SERVICE_C+00810
SS Dispatcher (K) 00000000.7FF2E000 FFFFFFFF.800EC2B0 SYSTEM_PRIMITIVES_MIN+BC2B0
Memory Stack Frame (E) 00000000.7FF461B0 FFFFFFFF.80A209E0 IMG$GET_NEXT_ISD_C+006D0
Memory Stack Frame (E) 00000000.7FF460F0 FFFFFFFF.80A23B70 IMAGE_MANAGEMENT+42070
Bottom of stack

SDA> READ/EXEC
SDA> SH CALL/SUMM

Call Frame Summary
------------------

Frame Type (mode) Handle Current PC
------------------------ ----------------- -----------------
Memory Stack Frame (K) 00000000.7FF2E0E0 FFFFFFFF.80A22280 EXE$EXIT_INT_C+003C0
Memory Stack Frame (K) 00000000.7FF43F20 FFFFFFFF.800F40E0 EXE$SS_DISP_C+006A0
SS Dispatcher (K) 00000000.7FF2E000 FFFFFFFF.800EC2B0 SWIS$$ENTER_KERNEL_MODE_FRAME_C+00020
Memory Stack Frame (E) 00000000.7FF461B0 FFFFFFFF.80A209E0 SYS$EXIT_C+00300
Memory Stack Frame (E) 00000000.7FF460F0 FFFFFFFF.80A23B70 EXIT_CALLBACK_C+00A20
Bottom of stack

Looks like the process is trying to exit something and can't?

Cheers

Witchy

Volker Halle · ‎02-24-2021

Witchy,

did you force a crash ? ctrl-p twice on the console then Y to force a crash.

What is the process priority of TCPIP$INETACP when it's looping like this ? $ SHOW SYS/PROC=TCPIP$INETACP should tell.

I think I can answer this question from the PC sampling data:

PC                 IPL  Pid              Count  Routine                           Module
-----------------  ---  --------  ------------  --------------------------------  ----------------
FFFFFFFF.80A22280    0  00000424         23424  IMAGE_MANAGEMENT+40780            IMAGE_MANAGEMENT

What does SDA> EXA/INS EXE$EXIT_INT_C+3C0 report ?

What is the value of the system parameter BUGCHECKFATAL ? Consider setting it to 1 - it's a dynamic parameter. Then the system would crash and automatically reboot if this problem happens !

There is some code in EXE$EXIT_INT to declare a non-fatal bugcheck and issue a $DELPRC on itself. If that fails, the process will go into a tight loop at priority 0.

I'm assuming, that PING to those servers still works, as it should be handled at driver level and not use the ACP. But TELNET/SSH and other services, which require the ACP to form a connection, would hang.

Would SDA> TCPIP SHOW DEV work ? It should in a crash ! Anything suspicious ?

So depending on the above data, TCPIP$INETACP has found some internal inconsistency, declared a non-fatal bugcheck and tried to delete itself. If BUGCHECKFATAL had been 1, an INCONSTATE system crash would have been taken exactly at the time of the problem - and this should help to investigate this problem further.

Volker.

Volker Halle · ‎02-24-2021

Witchy,

this is what you could do:

$ MC SYSGEN

SYSGEN> USE ACTIVE

SYSGEN> SHOW BUGCHECKFATAL

SYSGEN> SET . 1

SYSGEN> WRITE ACTIVE

SYSGEN> EXIT

Then the system should crash with INCONSTATE at night and when it automatically boots, it will come up with BUGCHECKFATAL=0, so no further crash. If TCPIP$INETACP is running o.k. in the morning, then you know that this problem is only happening once per night.

And you have a crash for further analysis.

In a next step, you could turn on LAN device tracing, which may be able to capture the 'suspicious' packets in the dump...

Volker.

Volker Halle · ‎02-24-2021

Witchy,

you could also look at the non-fatal bugchecks of the past using:

$ analyze/err/elv translate/incl=bugcheck/brief

This should tell you the exact timestamps of those INCONSTATE non-fatal bugchecks, leading to the TCPIP$INETACP looping.

Volker.

witchy · ‎03-01-2021

Morning Volker,

>>did you force a crash ? ctrl-p twice on the console then Y to force a crash.

I did a while back, there never seems to be time to look at it though. As it happens the idle 2nd machine went into CUR state a couple of days ago so I've got some answers for you.

>>What is the process priority of TCPIP$INETACP when it's looping like this ? $ SHOW SYS/PROC=TCPIP$INETACP should tell.

0

>>What does SDA> EXA/INS EXE$EXIT_INT_C+3C0 report ?

SDA> EXA/INS EXE$EXIT_INT_C+3C0
{ .mfb
EXE$EXIT_INT_C+003C0: nop.m 000000
nop.f 000000
br.many 0000000 ;;
}

>>What is the value of the system parameter BUGCHECKFATAL ?

It was 0 so I've set it to 1 for the next time. The 2nd box is ideal for this.

>>I'm assuming, that PING to those servers still works, as it should be handled at driver level and not use the ACP. But TELNET/SSH and other services, which require the ACP to form a connection, would hang.

Exactly, yes.

>>Would SDA> TCPIP SHOW DEV work ? It should in a crash ! Anything suspicious ?

An awful lot of BG devices for an idle system, I must admit. This server used to be a dev box but that role ceased a couple of years ago so it just sits there waiting to be useful again.

And I've just broken it by doing $TCPIP SHO SERV SMBD. Dammit. I'll crash it then set BUGCHECKFATAL when it comes back up.

I did the $**bleep**/err/elv and it didn't show all the crashes, very few in fact. The machine has locked up pretty much every day for the last few weeks but the output of that command reckoned that there had only been half a dozen crashes, 4 on Feb 2nd for some reason.

Cheers!

Witchy

Volker Halle · ‎03-01-2021

Witchy,

TCPIP$INETACP is a NODELETE process, so it can't delete itself (SDA> SHOW PROC shows Process status: ... NODELETE), so it's entering a CPU loop at priority 0 to 'signal a problem'.

You can use some of the TCPIP commands from SDA in the running system or on a dump. So SDA> TCPIP SHOW DEVICE against the dump should tell you something about all those BG devices and what they are being used for. There is a limit on the no. of possible BG devices (either 32k or 64k - as far as I remember).

SDA> CLUE MEM/STAT should also tell, whether there have been nonpaged pool expansion failures.

Volker.

witchy · ‎03-03-2021

Morning Volker,

Hehehe the forum bleeped out my $ANALYZE command

Interestingly both machines locked up yesterday and the 2nd one didn't crash despite BUGCHECKFATAL being set. TCPIP NETSTAT -a shows a lot of the BG devices are connected to port 7920 and 7905 with what looks like loops:

tcp 0 0 LOCALHOST.7920 LOCALHOST.56004 ESTABLISHED
tcp 0 0 LOCALHOST.56004 LOCALHOST.7920 ESTABLISHED
tcp 0 0 LOCALHOST.7920 LOCALHOST.52980 ESTABLISHED
tcp 0 0 LOCALHOST.52980 LOCALHOST.7920 ESTABLISHED

tcp 0 0 LOCALHOST.7901 LOCALHOST.49163 ESTABLISHED
tcp 0 0 LOCALHOST.49163 LOCALHOST.7901 ESTABLISHED
tcp 0 0 LOCALHOST.7901 LOCALHOST.49174 ESTABLISHED
tcp 0 0 LOCALHOST.49174 LOCALHOST.7901 ESTABLISHED

tcp 0 0 LOCALHOST.7905 LOCALHOST.49171 ESTABLISHED
tcp 0 0 LOCALHOST.49171 LOCALHOST.7905 ESTABLISHED
tcp 0 0 LOCALHOST.7920 LOCALHOST.49670 ESTABLISHED
tcp 0 0 LOCALHOST.49670 LOCALHOST.7920 ESTABLISHED

tcp 0 0 LOCALHOST.7905 LOCALHOST.49175 ESTABLISHED
tcp 0 0 LOCALHOST.49175 LOCALHOST.7905 ESTABLISHED
tcp 0 0 LOCALHOST.7905 LOCALHOST.49176 ESTABLISHED
tcp 0 0 LOCALHOST.49176 LOCALHOST.7905 ESTABLISHED
tcp 0 0 LOCALHOST.7905 LOCALHOST.49177 ESTABLISHED
tcp 0 0 LOCALHOST.49177 LOCALHOST.7905 ESTABLISHED
tcp 0 0 LOCALHOST.7905 LOCALHOST.49178 ESTABLISHED
tcp 0 0 LOCALHOST.49178 LOCALHOST.7905 ESTABLISHED
tcp 0 0 LOCALHOST.7905 LOCALHOST.49179 ESTABLISHED
tcp 0 0 LOCALHOST.49179 LOCALHOST.7905 ESTABLISHED

Hm. Examining the BG devices shows that they're all owned by DESTA processes, and stopping the director closes every single one of them. The 3rd system also has quite a number of BG devices also owned by DESTA but hardly any open sockets in comparison and no loops. I'll leave the director off for a while, see if that improves things.
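For readers wondering about the "loops": a TCP connection between two sockets on the same host normally shows up in netstat twice, once from each end, so each mirrored pair of lines above is a single local session rather than a loop. The sketch below counts such sessions per service port so that two systems can be compared. It is a rough Python illustration only, assuming the netstat text has been saved with six whitespace-separated fields per line and numeric ports; treating the lower-numbered port as the listening service is a guess made for this example, and the names NETSTAT, parse_endpoint and loopback_sessions are hypothetical.

```python
from collections import Counter

# A few lines copied from the TCPIP NETSTAT -a output in this thread.
NETSTAT = """\
tcp 0 0 LOCALHOST.7920 LOCALHOST.56004 ESTABLISHED
tcp 0 0 LOCALHOST.56004 LOCALHOST.7920 ESTABLISHED
tcp 0 0 LOCALHOST.7920 LOCALHOST.52980 ESTABLISHED
tcp 0 0 LOCALHOST.52980 LOCALHOST.7920 ESTABLISHED
tcp 0 0 LOCALHOST.7901 LOCALHOST.49163 ESTABLISHED
tcp 0 0 LOCALHOST.49163 LOCALHOST.7901 ESTABLISHED
tcp 0 0 LOCALHOST.7905 LOCALHOST.49171 ESTABLISHED
tcp 0 0 LOCALHOST.49171 LOCALHOST.7905 ESTABLISHED
tcp 0 0 LOCALHOST.7905 LOCALHOST.49175 ESTABLISHED
tcp 0 0 LOCALHOST.49175 LOCALHOST.7905 ESTABLISHED
"""


def parse_endpoint(text):
    """Split 'HOST.port' at the last dot; raises ValueError for non-numeric ports."""
    host, _, port = text.rpartition(".")
    return host, int(port)


def loopback_sessions(text):
    """Count same-host ESTABLISHED sessions per (assumed) service port."""
    seen = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] != "tcp" or fields[5] != "ESTABLISHED":
            continue
        try:
            seen.add((parse_endpoint(fields[3]), parse_endpoint(fields[4])))
        except ValueError:
            continue  # wildcard or named port; ignored in this sketch
    per_port = Counter()
    for local, remote in seen:
        same_host = local[0] == remote[0]
        mirrored = (remote, local) in seen
        # Count each session once, from the end with the lower port number.
        if same_host and mirrored and local[1] < remote[1]:
            per_port[local[1]] += 1
    return per_port


if __name__ == "__main__":
    for port, sessions in sorted(loopback_sessions(NETSTAT).items()):
        print(port, sessions)
```

Running the same count on the healthy 3rd system would give a like-for-like comparison of how many local sessions DESTA holds open on each box.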

Cheers

Witchy

Volker Halle · ‎03-03-2021

Witchy,

ah DESTA ... This is not my favourite piece of software.

Did the 2nd machine 'lock up' in the same way ? TCPIP$INETACP looping at prio 0 at EXE$EXIT_INT_C+003C0 ? If not, then it may be a similar but different symptom of the underlying problem.

Volker.

witchy · ‎03-03-2021

Hi Volker,

Yes, they both lock up with the same symptoms and both started doing it around the 23rd Dec last year. The annoying thing is there's nothing shared by the machines, they're completely independent. Yet they both do this.

Cheers

Witchy

Volker Halle · ‎03-04-2021

Witchy,

Quote: the 2nd one didn't crash despite BUGCHECKFATAL being set.

There is no code path in [SYS]SYSEXIT leading to the loop at prio 0 and NOT executing the BUG_CHECK INCONSTATE first. Please re-check whether BUGCHECKFATAL really was set to 1.

Volker.

witchy · ‎03-04-2021

Hi Volker,

Yep -

$ mc sysman
SYSMAN> param use current
SYSMAN> param show bugcheckfatal
Node VISDEV: Parameters in use: CURRENT
Parameter Name Current Default Minimum Maximum Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
BUGCHECKFATAL 1 0 0 1 Boolean D

SYSMAN> exit

I have them both set now so we'll see what happens.

Interestingly I had to boot the primary again this morning. Just afterwards I logged in and shut DESTA down again only to get this message and TCPIP$INETACP going CUR immediately:

$ desta stop
Stopping the Director.
%%%%%%%%%%% OPCOM 4-MAR-2021 10:02:22.61 %%%%%%%%%%%
Message from user INTERnet on VISLIV
INTERnet ACP SSH Abort Request from Host: ::0 Port: 0

%%%%%%%%%%% OPCOM 4-MAR-2021 10:02:22.61 %%%%%%%%%%%
Message from user INTERnet on VISLIV
INTERnet ACP AUXS failure Status = %SYSTEM-F-SUSPENDED

A strange one indeed!

Cheers

Witchy

Dave Lennon · ‎03-04-2021

Hi,

Technically, you should use the ACTIVE parameter set to show that something is in effect. CURRENT is the current file set which would become ACTIVE at next boot. They may be different.

Just another thought, we had a situation where our IT security group began running scanning software periodically that would tickle all ports on the box and we correspondingly had failures on various network listening processes that were not, shall we say, coded with that in mind. You may want to check your system logs to see if it looked like something external was scanning or probing the box at the same time and/or ask your network and security groups.

- Dave

witchy · ‎03-08-2021

Hi Dave,

I realised that after posting my message. Active BUGCHECKFATAL is also 1.

I've asked them to ask the network team if they're running any nessus-style port scanning because it's definitely something that started happening on the 23rd Dec which was the day after networks had done some upgrades. I'll mention it again though.

Cheers

Witchy

witchy · ‎03-18-2021

Still doing daily updates.

I now have a TCPDUMP running, tracking everything. This runs alongside a utility that monitors the state of TCPIP$INETACP and reports to all terminals if it goes CUR and Priority 0 for 5 minutes. The idea is that I can then go through 5 minutes' worth of TCPDUMP to see if there's anything obvious going on that might upset the interface. So far nothing leaps out, though I'm puzzled about one particular IP address that seems to be doing regular probes against ports 22/135/137/139/161/443/445. The networks team are trying to find out what it is because it doesn't respond to pings or name lookups. The servers themselves never talk outbound to it; the traffic is all inbound requests.

I'm beginning to wonder if it's something in the AWS cloud.
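The detection rule described above (CUR at priority 0 for 5 minutes) can be sketched as follows. This is not the utility used in this thread, just a minimal Python illustration of the logic, assuming something else has already sampled the process and produced (timestamp, state, priority) tuples in time order. The function name stuck_since and the sample data are invented for the example.

```python
from datetime import datetime, timedelta


def stuck_since(samples, window=timedelta(minutes=5)):
    """Return the start of the first CUR/priority-0 run lasting at least
    `window`, or None if there is no such run.

    samples: iterable of (timestamp, state, priority) in time order.
    """
    run_start = None
    for when, state, priority in samples:
        if state == "CUR" and priority == 0:
            if run_start is None:
                run_start = when  # first sample of a new suspicious run
            if when - run_start >= window:
                return run_start
        else:
            run_start = None  # any healthy sample resets the run
    return None


if __name__ == "__main__":
    t0 = datetime(2021, 3, 18, 2, 0, 0)
    minute = timedelta(minutes=1)
    # One healthy sample, then the process sits in CUR at priority 0.
    samples = [(t0, "HIB", 10)]
    samples += [(t0 + minute * i, "CUR", 0) for i in range(1, 8)]
    print(stuck_since(samples))
```

The returned timestamp marks where the run began, which is the point in the TCPDUMP capture worth reading from. On a single-CPU system the looping process would show as COM rather than CUR, so the state test would need adjusting there.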

Cheers

Witchy

witchy · ‎03-25-2021

It's been a week now with no hangups. That external address was indeed in AWS and was probing the estate every few minutes, so I blocked the address at a COMM level ($TCPIP SET {CONF} COMM/REJECT=HOST=<address>) and the machines have been stable ever since, though that address sometimes still manages to break through, bizarrely. I blocked it at the SSH level too, but the COMM level should be the gatekeeper?

Anyway, thanks for the pointers Volker!

Cheers

Witchy
