Operating System - Tru64 Unix


 
Cosmin_2
Occasional Contributor

Frozen process -- really weird behavior

Hi guys,

I have a really weird problem on our production system, which runs Tru64 Unix v5.1B2 PK4.
We have a Universe database from IBM installed on it, and at night we must run some procedures that end with a database backup.
The procedure is initiated remotely over ssh. On Tuesday night something strange happened: the operators called us to say that their screen had frozen.
Suspecting a connection problem, I logged on to the machine and found the following situation:

1 (init)
...
  101 sshd
    102 /usr/local/pty /usr/uv/bin/uv
      103 /usr/uv/bin/uv
        104 sh backup.sh
          105 globus.backup 2
            106 vdump -DUv -f dev/tape/tape0_d1 /u01

The process IDs are not real, and the indentation represents the hierarchy.

Process 106 (vdump) should have taken 20 minutes at most, but it had already been running for 2 hours. The CPU load was 0 for all of these processes.
The tape was in the tape drive; it was tested afterwards and it was good.
I took the following steps:
1. kill 106 - didn't work
2. kill -9 106 - worked
At this point I was expecting the script (104) to detect that vdump didn't return 0 on exit and to make a disk backup, but to my surprise nothing happened! The script was stuck too.
3. kill 105 - didn't work
4. kill -9 105 - worked
Process 104 died.
Process 103 remained listed.
Unfortunately, I didn't know how this situation was going to turn out, and I didn't keep full logs of what I did that night.
I tried a test restore from the tape, and at some point it asked for a second tape, even though the first one was only 25% full.

The next day I went through all the logs trying to find something. The following things stood out:
1. At 23:00 the line went bad -> 10% packet loss
2. At 23:00 the processor load on the machine dropped from 25% to 0%
3. The amount of data found on the tape is consistent with the idea that the backup stopped at 23:00, for whatever reason.
4. The operator closed the ssh client window at 01:00. The sshd process that corresponded to this connection exited 2 hours later! In tests conducted on the same machine, sshd detects a lost connection within minutes.

What is really weird?
1. How did that sshd process remain active after the connection was lost?
In tests done on the same machine, process 103 changes its parent to 1 if the ssh connection dies, so it's not process 103 that kept 101 and 102 open.
2. How could vdump get stuck?
3. How could the script that launched vdump get stuck? The script is 2-3 years old, and in all the tests done afterwards on the machine it behaved consistently.

I really don't know where to investigate further. We tried to get the machine into the same state this weekend, and despite all our efforts to get it stuck it behaved beautifully. This worries me, as I'm beginning to suspect an unstable configuration.
Any ideas are welcome.
If more info is required, please ask and I'll provide it!

Thanks,

Cosm
Erich Wimmer
Valued Contributor

Re: Frozen process -- really weird behavior

Cosm,
I believe your first problem happened because your script does not handle the prompt for a new tape. If your tape holds only 25% of the expected data, the tape media may have been defective (check the binary errlog).
Or you were using a tape initialized by another drive with a different density, which shouldn't be a problem if your tape drive is using the newest firmware (SDLT).
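
If the backup can end up waiting forever (for example on a request for another tape), one way to protect the script is to bound its run time. Below is a minimal sketch, assuming a POSIX sh; the function name run_with_timeout is made up for illustration and the limit is whatever you consider safe (you said vdump normally needs about 20 minutes). Note that the watchdog kills by PID, so in principle it could hit a reused PID if the command had already exited and cleanup failed.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the background and sends it
# SIGKILL if it is still running after SECONDS.
run_with_timeout() {
    limit=$1
    shift

    "$@" &
    cmd_pid=$!

    # Watchdog: sleep for the limit, then kill the command.
    # Output is discarded so the watchdog never holds the caller's pipes open.
    (
        sleep "$limit"
        kill -9 "$cmd_pid"
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    # Collect the command's exit status (above 128 if it was killed by a signal).
    rc=0
    wait "$cmd_pid" || rc=$?

    # The command is gone, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true

    return "$rc"
}
```

In the backup script this could be used as "run_with_timeout 3600 vdump ..." with your usual vdump arguments, followed by a check of the return status.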

Why does the script not detect that you have killed vdump? Maybe the script is programmed to ignore the death of its child.
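
For comparison, here is a minimal sketch of a wrapper that does notice when its child fails or is killed, assuming a POSIX sh. run_backup and fallback_disk_backup are hypothetical stand-ins (I don't know what your backup.sh really contains); here run_backup just simulates a child that gets killed with kill -9.

```shell
#!/bin/sh
# Hypothetical stand-in for the real vdump call:
# simulates a child process that is killed with SIGKILL.
run_backup() {
    sh -c 'kill -9 $$'
}

# Hypothetical stand-in for the real disk backup.
fallback_disk_backup() {
    echo "falling back to disk backup"
}

check_backup() {
    if run_backup; then
        echo "tape backup ok"
        return 0
    else
        rc=$?
    fi

    # An exit status above 128 means the child was terminated by a signal.
    if [ "$rc" -gt 128 ]; then
        echo "backup killed by signal $((rc - 128))"
    else
        echo "backup failed with status $rc"
    fi
    fallback_disk_backup
}

check_backup
```

A script that runs the backup and then continues without looking at the exit status at all would never take the fallback branch, whatever happens to vdump.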

The second problem may have been caused by the bad line. The sshd didn't recognize that your client had exited, so the connection was only closed by the keep-alive timeout (default 2 hours).
You didn't say how your script depends on the line, apart from being started over it.
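
If the backup does not need the operator's terminal once it has been started, it can be detached from the ssh session so that a bad line cannot block it on terminal I/O. This is only a sketch under that assumption; start_detached and the log file path are made-up names, and I can't say whether your uv setup allows starting the script this way.

```shell
#!/bin/sh
# start_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND immune to hangups, with no terminal
# attached to stdin, stdout or stderr, and prints the PID of the job.
start_detached() {
    log=$1
    shift
    nohup "$@" >"$log" 2>&1 </dev/null &
    echo $!
}
```

For example, "start_detached /tmp/backup.log sh backup.sh" would return immediately, and the operators could follow the log file instead of the screen.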
Erich.