TCPIP$INETACP process goes to CUR state, becomes unresponsive, can't be restarted.

 
witchy
Advisor

Hi folks,

I have a couple of Integrity servers running OpenVMS 8.4 with TCP/IP Services 5.7. They are both very stable machines usually and had been up for a year before this problem started just before xmas last year.

Basically incoming SSH connections hang, and on examining the networking processes I see that TCPIP$INETACP is in CUR state rather than HIB, and because it's in this state I can't do anything with the process. Any command that looks at the network settings, like TCPIP SHO COMM, results in my process hanging. Trying to restart the services with @sys$startup:tcpip$shutdown also results in a process hang; that process subsequently goes into RWAST state and can't be killed. TCPIP$INETACP similarly can't be killed. Because these are legacy servers now, I'm able to reboot them, so usually that's the only way out.

Obviously TCPIP$INETACP is getting hung up on something but what? The activity on the machines has been reduced to almost nothing in recent months, in fact the 2nd machine literally does nothing but sit there. The primary machine is mostly a Samba target these days, a role it's been doing for years.

Any clues on how I can see what's causing TCPIP$INETACP to stay in CUR state?

Cheers

Witchy

Volker Halle
Honored Contributor

Witchy,

TCPIP$INETACP will be in a tight compute loop. As you seem to have a multi-CPU system, the process is in CUR state all of the time. On a single-CPU system, you would always see it as COM.

Using $ SHOW PROC/CONT/ID=<pid-of-TCPIP$INETACP> you could monitor the program counter (Current PC) and find out whether this process does any Direct I/Os or Buffered I/Os.

This could be a TCPIP software problem. As always: consider obtaining and installing the most recent TCPIP patches.

Or it could be some kind of incoming IP traffic which triggers this problem. You should then see the TCPIP$INETACP process doing I/Os. You may be able to find out about existing IP sessions etc. using $ ANA/SYSTEM and the TCPIP subcommands (see SDA> TCPIP HELP).

Instead of just rebooting the OpenVMS system, you could try to force a crash, which allows you to record the system state and later analyze it from the system dump file.

Volker.

witchy
Advisor

Hi Volker,

I didn't consider crashing the system, good idea. I have to reboot it every time anyway. TCPIP$INETACP is stuck in so tight a loop that the only data in the $ SH PROC display that changes is the CPU time; the PC is constantly %x7AE6BB80. It also gets moved between CPUs, so I can't blame a failing processor, and why it started happening on two totally separate machines at the same time is baffling. The environment has been stable for over a year with no change to the workload: one server is hit fairly heavily with Samba traffic but always has been, and the 2nd server these days just sits there with nothing running.

Patching is unavailable unfortunately. VMS support for these servers ended a long time ago.

I've been looking at external networking too but there's a 3rd machine in the rack that's just sitting doing almost as much nothing as the 2nd machine. Waiting to see the external switch configs which we don't look after.

Cheers

Witchy

 

Volker Halle
Honored Contributor

Witchy,

are you sure about the Current PC value ? Isn't that the Current user SP ?

You could use SDA PC sampling to find out more about the PC values...

$ ANALYZE/SYS

SDA> PCS                ! will show you the PC Sampling help information

SDA> PCS LOAD

SDA> PCS START TRACE      ! maybe use /PID=<pid-of-TCPIP$INETACP>

and let it run for a while

SDA> PCS STOP TRACE

SDA> PCS SHOW TRACE/STAT

When you're finished: SDA> PCS UNLOAD

Maybe you can get an idea of which code TCPIP$INETACP is looping in.

Does SDA> TCPIP SHOW xxx work (see SDA> TCPIP HELP SHOW)? Anything suspicious?

If both systems start to show the problem at the same time of day, it is very likely that the problem is coming 'from the network'. If the 3rd system is not affected, it might not be running the specific piece of software that is affected.

Volker.

Volker Halle
Honored Contributor

Witchy,

exactly which version of TCPIP are you using ($ TCPIP SHOW VERSION) ?

Note: none of the patches ECO 1 ... 5 in the TCPIP V5.7 ECO 5 release notes, which have a deliverable of TCPIP$INETACP.EXE describe a similar symptom.

Volker.

witchy
Advisor

Hi Volker,

We're running V5.7 ECO 2 and it was rock steady until the week before xmas last year when some network changes were made to the building-wide wifi and almost immediately these 2 servers started acting up. The 3rd box is just running, no actual apps or network shares are accessed on it any more.

Yesterday I tried swapping LAN cables between the main server and this idle one to see if the fault would follow the cables, but no. The fault usually occurs overnight, so I have to reboot one or both machines every morning. I've just done it again, otherwise I'd have tried the SDA things you posted earlier. There's always tomorrow.

We also don't have access to the network itself so I can't find out how the Cisco switches are configured, which makes things interesting. Like you I think the problem is coming from the network rather than the servers themselves.

Cheers

Witchy.

witchy
Advisor

OK, both servers were in their usual inaccessible state this morning so I've done the PCS tracing. Both are looping at IMAGE_MANAGEMENT+40780

PC sampler information:
-----------------------
PC                IPL Pid             Count Routine                          Module
----------------- --- -------- ------------ -------------------------------- ------
FFFFFFFF.80A22280   0 00000424        23424 IMAGE_MANAGEMENT+40780           IMAGE_MANAGEMENT
FFFFFFFF.81A59BC0   8 00000424            1 TCPIP$INTERNET_SERVICES+002A3FC0 TCPIP$INTERNET_SERVICES
FFFFFFFF.81A4D710   8 00000424            1 TCPIP$INTERNET_SERVICES+00297B10 TCPIP$INTERNET_SERVICES
FFFFFFFF.81848500   8 00000424            1 TCPIP$INTERNET_SERVICES+92900    TCPIP$INTERNET_SERVICES
FFFFFFFF.817EFD40   8 00000424            1 TCPIP$INTERNET_SERVICES+3A140    TCPIP$INTERNET_SERVICES
FFFFFFFF.80E40240   8 00000424            1 SYS$PGQDRIVER+9A640              SYS$PGQDRIVER
FFFFFFFF.801281A0   8 00000424            1 EXE$DEANONPGDSIZ_C+00050         SYSTEM_PRIMITIVES_MIN
FFFFFFFF.800CA860   8 00000424            1 EXE$IPINT_VECTOR_C+000A0         SYSTEM_PRIMITIVES_MIN
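
To put a number on how dominant that first row is, here is a minimal Python sketch that totals the sample counts from the table above. It is not part of SDA: the helper name `parse_pcs_rows` is made up for illustration, and it assumes each data row is six whitespace-separated fields (PC, IPL, PID, count, routine, module) as in this particular output.

```python
# Rows copied verbatim from the PCS SHOW TRACE/STAT output in this post.
PCS_OUTPUT = """\
PC IPL Pid Count Routine Module
----------------- --- -------- ------------ -------------------------------- ------
FFFFFFFF.80A22280 0 00000424 23424 IMAGE_MANAGEMENT+40780 IMAGE_MANAGEMENT
FFFFFFFF.81A59BC0 8 00000424 1 TCPIP$INTERNET_SERVICES+002A3FC0 TCPIP$INTERNET_SERVICES
FFFFFFFF.81A4D710 8 00000424 1 TCPIP$INTERNET_SERVICES+00297B10 TCPIP$INTERNET_SERVICES
FFFFFFFF.81848500 8 00000424 1 TCPIP$INTERNET_SERVICES+92900 TCPIP$INTERNET_SERVICES
FFFFFFFF.817EFD40 8 00000424 1 TCPIP$INTERNET_SERVICES+3A140 TCPIP$INTERNET_SERVICES
FFFFFFFF.80E40240 8 00000424 1 SYS$PGQDRIVER+9A640 SYS$PGQDRIVER
FFFFFFFF.801281A0 8 00000424 1 EXE$DEANONPGDSIZ_C+00050 SYSTEM_PRIMITIVES_MIN
FFFFFFFF.800CA860 8 00000424 1 EXE$IPINT_VECTOR_C+000A0 SYSTEM_PRIMITIVES_MIN
"""


def parse_pcs_rows(text):
    """Hypothetical helper: return one dict per data row, skipping headers."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Header and separator lines also split into six fields, so accept
        # only rows whose IPL and Count columns are plain decimal numbers.
        if len(fields) != 6 or not (fields[1].isdigit() and fields[3].isdigit()):
            continue
        pc, ipl, pid, count, routine, module = fields
        rows.append({"pc": pc, "ipl": int(ipl), "pid": pid,
                     "count": int(count), "routine": routine, "module": module})
    return rows


rows = parse_pcs_rows(PCS_OUTPUT)
total = sum(row["count"] for row in rows)
hot = max(rows, key=lambda row: row["count"])
share = hot["count"] / total

print(f"{hot['routine']}: {hot['count']} of {total} samples ({share:.2%})")
# prints: IMAGE_MANAGEMENT+40780: 23424 of 23431 samples (99.97%)
```

All but seven of the samples land on the one PC at IPL 0, which fits a process spinning on a single instruction rather than doing real work.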

PC sampler information:
-----------------------
Time = 24-FEB 10:03:40.475453
CPU = 02
IPL = 0
Mode = Kernel
PID = 00000424 TCPIP$INETACP
PC = FFFFFFFF.80A22280 IMAGE_MANAGEMENT+40780 IMAGE_MANAGEMENT
B0 = FFFFFFFF.80A22240 IMAGE_MANAGEMENT+40740


R2 = 00000000.0100004B
R3 = 00000000.00000000
R4 = FFFFFFFF.9081E540 PCB
R5 = 00000000.00000000
R6 = 00000000.7FF43F40
R7 = 00000000.00000001

 

Looking around other messages here I've also done

SDA> SH SUMM
SDA> SET PROC/IND=24
SDA> SH CALL/SUMM

Call Frame Summary
------------------

Frame Type (mode)        Handle            Current PC
------------------------ ----------------- -----------------
Memory Stack Frame (K)   00000000.7FF2E0E0 FFFFFFFF.80A22280 IMAGE_MANAGEMENT+40780
Memory Stack Frame (K)   00000000.7FF43F20 FFFFFFFF.800F40E0 EXE$CALL_SHSBA_SERVICE_C+00810
SS Dispatcher (K)        00000000.7FF2E000 FFFFFFFF.800EC2B0 SYSTEM_PRIMITIVES_MIN+BC2B0
Memory Stack Frame (E)   00000000.7FF461B0 FFFFFFFF.80A209E0 IMG$GET_NEXT_ISD_C+006D0
Memory Stack Frame (E)   00000000.7FF460F0 FFFFFFFF.80A23B70 IMAGE_MANAGEMENT+42070
Bottom of stack

SDA> READ/EXEC
SDA> SH CALL/SUMM

Call Frame Summary
------------------

Frame Type (mode)        Handle            Current PC
------------------------ ----------------- -----------------
Memory Stack Frame (K)   00000000.7FF2E0E0 FFFFFFFF.80A22280 EXE$EXIT_INT_C+003C0
Memory Stack Frame (K)   00000000.7FF43F20 FFFFFFFF.800F40E0 EXE$SS_DISP_C+006A0
SS Dispatcher (K)        00000000.7FF2E000 FFFFFFFF.800EC2B0 SWIS$$ENTER_KERNEL_MODE_FRAME_C+00020
Memory Stack Frame (E)   00000000.7FF461B0 FFFFFFFF.80A209E0 SYS$EXIT_C+00300
Memory Stack Frame (E)   00000000.7FF460F0 FFFFFFFF.80A23B70 EXIT_CALLBACK_C+00A20
Bottom of stack
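
The two call frame summaries list the same addresses; only the symbol names change once READ/EXEC has loaded the executive's symbols. A small Python sketch of the arithmetic makes that explicit. The helper names `parse_sda_addr` and `format_sda_addr` are hypothetical, they only handle the `HHHHHHHH.LLLLLLLL` form shown above, and "offset base" here just means whatever address SDA treats as offset zero for IMAGE_MANAGEMENT (an assumption about how the offsets are reported, not a statement about the image's layout).

```python
def parse_sda_addr(text):
    """Turn an SDA-style 'FFFFFFFF.80A22280' string into an integer."""
    return int(text.replace(".", ""), 16)


def format_sda_addr(value):
    """Format a 64-bit integer back into SDA's dotted hex form."""
    high = (value >> 32) & 0xFFFFFFFF
    low = value & 0xFFFFFFFF
    return f"{high:08X}.{low:08X}"


looping_pc = parse_sda_addr("FFFFFFFF.80A22280")

# Before READ/EXEC the PC is shown as IMAGE_MANAGEMENT+40780.
offset_base = looping_pc - 0x40780
# After READ/EXEC the same PC is shown as EXE$EXIT_INT_C+003C0.
routine_start = looping_pc - 0x3C0

print("IMAGE_MANAGEMENT offset base:", format_sda_addr(offset_base))
# prints: IMAGE_MANAGEMENT offset base: FFFFFFFF.809E1B00
print("EXE$EXIT_INT_C starts at:    ", format_sda_addr(routine_start))
# prints: EXE$EXIT_INT_C starts at:     FFFFFFFF.80A21EC0
print("Routine offset in module:    ", hex(routine_start - offset_base))
# prints: Routine offset in module:     0x403c0

# Cross-check against another frame in the listing: IMAGE_MANAGEMENT+42070
# should be the address of the bottom frame, FFFFFFFF.80A23B70.
assert format_sda_addr(offset_base + 0x42070) == "FFFFFFFF.80A23B70"
```

So IMAGE_MANAGEMENT+40780 and EXE$EXIT_INT_C+003C0 are one and the same instruction, 3C0 (hex) bytes into the exit routine.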

Looks like the process is trying to exit something and can't?

Cheers

Witchy

Volker Halle
Honored Contributor

Witchy,

did you force a crash ? ctrl-p twice on the console then Y to force a crash.

What is the process priority of TCPIP$INETACP when it's looping like this ? $ SHOW SYS/PROC=TCPIP$INETACP should tell.

I think I can answer this question from the PC sampling data:

PC                IPL Pid             Count Routine                          Module
----------------- --- -------- ------------ -------------------------------- ------
FFFFFFFF.80A22280   0 00000424        23424 IMAGE_MANAGEMENT+40780           IMAGE_MANAGEMENT

What does SDA> EXA/INS EXE$EXIT_INT_C+3C0 report ?

What is the value of the system parameter BUGCHECKFATAL? Consider setting it to 1; it's a dynamic parameter. Then the system would crash and automatically reboot if this problem happens!

There is some loop in EXE$EXIT_INT to declare a non-fatal bugcheck and issue a $DELPRC on itself. If that fails, the process will go into a tight loop at priority 0.

I'm assuming that PING to those servers still works, as it should be handled at driver level and not use the ACP. But TELNET/SSH and other services, which require the ACP to form a connection, would hang.

Would SDA> TCPIP SHOW DEV work? It should in a crash! Anything suspicious?

So depending on the above data, TCPIP$INETACP has found some internal inconsistency, declared a non-fatal bugcheck and is trying to delete itself. If BUGCHECKFATAL had been 1, an INCONSTATE system crash would have been taken exactly at the time of the problem, and this should help to investigate this problem further.

Volker.

Volker Halle
Honored Contributor

Witchy,

this is what you could do:

$ MC SYSGEN

SYSGEN> USE ACTIVE

SYSGEN> SHOW BUGCHECKFATAL

SYSGEN> SET . 1

SYSGEN> WRITE ACTIVE

SYSGEN> EXIT

Then the system should crash with INCONSTATE at night, and when it automatically reboots it will come up with BUGCHECKFATAL=0 again (the change was only written to ACTIVE, not CURRENT), so no further crash. If TCPIP$INETACP is running o.k. in the morning, then you know that this problem is only happening once per night.

And you have a crash for further analysis.

In a next step, you could turn on LAN device tracing, which may be able to capture the 'suspicious' packets in the dump...

Volker.

Volker Halle
Honored Contributor

Witchy,

you could also look at the non-fatal bugchecks of the past using:

$ analyze/err/elv translate/incl=bugcheck/brief

This should tell you the exact timestamps of those INCONSTATE non-fatal bugchecks, leading to the TCPIP$INETACP looping.

Volker.