12-13-2004 12:51 AM
Frozen process -- really weird behavior
Hi guys,
I have a really weird problem on our production system. The system runs Tru64 Unix v5.1B2 PK4.
We have a Universe database from IBM installed on it, and every night we must run some procedures that end with a database backup.
The procedure is started remotely over ssh. Tuesday night something strange happened: the operators called us saying that their screen had frozen.
Suspecting a connection problem, I logged on to the machine and found the following situation:
1 (init)
  ...
  101 sshd
    102 /usr/local/pty /usr/uv/bin/uv
      103 /usr/uv/bin/uv
        104 sh backup.sh
          105 globus.backup 2
            106 vdump -DUv -f dev/tape/tape0_d1 /u01
The process IDs are not the real ones, and the indentation represents the hierarchy.
Process 106 (vdump) should have taken 20 minutes at most, but it had already been running for 2 hours. The CPU load was 0 for all of these processes.
The tape was in the tape drive; it was tested afterwards and it was good.
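Since vdump is expected to finish within about 20 minutes, a watchdog around the backup command would turn a silent hang like this into a visible failure. The following is only a sketch, assuming a POSIX-compatible shell; `run_with_timeout` is a hypothetical helper name, not something from the original script, and the 5-second grace period is an arbitrary choice.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of the original backup script).
# Runs COMMAND in the background, sends TERM after SECONDS and KILL
# 5 seconds later if it is still alive. Returns COMMAND's exit status
# (greater than 128 if it was ended by a signal).
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    (
        sleep "$limit"
        kill -TERM "$cmd_pid" 2>/dev/null
        sleep 5
        kill -KILL "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!
    status=0
    wait "$cmd_pid" || status=$?
    # The command is done; stop the watchdog so it cannot fire later.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null || true
    return "$status"
}

# Possible use, with a limit of 30 minutes (1800 seconds):
# run_with_timeout 1800 vdump -DUv -f dev/tape/tape0_d1 /u01
```

Note that in this incident plain `kill` had no effect on vdump and only `kill -9` worked, which is why the sketch escalates from TERM to KILL.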
I did the following steps:
1. kill 106 - didn't work
2. kill -9 106 - worked
At this point I expected the script (104) to detect that vdump had not returned 0 on exit and to make a disk backup instead, but to my surprise nothing happened. The script was stuck as well.
3. kill 105 - didn't work
4. kill -9 105 - worked
Process 104 died.
Process 103 remained listed.
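The fallback described above depends on the script checking the exit status of its direct child. The real backup.sh and globus.backup are not shown in this thread, so the following is only a sketch of that pattern, assuming a POSIX-compatible shell; `backup_with_fallback` and its messages are hypothetical. A status above 128 conventionally means the child was ended by a signal (137 for `kill -9`).

```shell
#!/bin/sh
# backup_with_fallback PRIMARY_CMD FALLBACK_CMD
# Hypothetical helper: both arguments are command strings supplied by
# the caller. Runs the primary backup and, if it exits non-zero or is
# killed by a signal, runs the fallback backup instead.
backup_with_fallback() {
    primary=$1
    fallback=$2
    rc=0
    sh -c "$primary" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "primary backup succeeded"
        return 0
    fi
    if [ "$rc" -gt 128 ]; then
        echo "primary backup killed by signal $((rc - 128)), falling back"
    else
        echo "primary backup failed with status $rc, falling back"
    fi
    sh -c "$fallback"
}
```

A shell only resumes when the process it is waiting for exits. In the tree above, vdump (106) appears to have been started by 105 rather than by backup.sh (104) directly, so killing 106 alone would not wake 104 while 105 was still stuck.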
Unfortunately, I didn't know how this situation was going to develop, and I didn't keep full logs of what I did that night.
I tried a test restore from the tape, and at some point it asked for a second tape, even though the first one was only 25% used.
The next day I ran through all the logs trying to find something. The following things stood out:
1. At 23:00 the line went bad -> 10% packet loss
2. At 23:00 the processor load on the machine dropped from 25% to 0%
3. The amount of data found on the tape is consistent with the idea that the backup stopped at 23:00, for whatever reason.
4. The operator closed the ssh client window at 01:00. The sshd process that corresponded to this connection exited 2 hours later! In tests conducted on the same machine, sshd detects a lost connection within minutes.
What is really weird?
1. How did that sshd process remain active after the connection was lost?
From tests done on the same machine, process 103 changes its parent to 1 if the ssh connection dies, so it was not process 103 that kept 101 and 102 open.
2. How could vdump get stuck?
3. How could the script that launched vdump get stuck? The script is 2-3 years old, and in all the tests done on the machine afterwards it behaved consistently.
I really don't know where to investigate further. We tried to get the machine into the same state this weekend, and despite all our efforts to make it hang it behaved beautifully. This worries me, as I'm beginning to suspect an unstable configuration.
Any ideas are welcome.
If more info is required, please ask and I'll provide it!
Thanks,
Cosm
12-19-2004 07:19 PM
Re: Frozen process -- really weird behavior
Cosm,
I believe your first problem happened because your script does not deal with the prompt for a new tape. If your tape holds only 25% of the expected data, the tape media may have been defective (check the binary errlog).
Or you were using a tape initialized by another drive with a different density, which shouldn't be a problem if your tape drive is using the newest firmware (SDLT).
As for why the script did not detect that you had killed vdump: maybe the script is programmed to ignore the death of a child process.
The second problem may have been caused by the bad line. The sshd did not recognize that your client had exited, so the connection was only closed by the keep-alive timeout (default time 2 hours).
You didn't say how your script depends on the line, apart from being started over it.
Erich.
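The question of how the job depends on the line can be sidestepped by detaching the job from the ssh session entirely, so that its output goes to a log file rather than to the terminal and a hangup does not reach it. Whether terminal output played any role in this hang is not established in the thread; the following is only a sketch, assuming a POSIX-compatible shell. `start_detached`, the log file, and the status file are hypothetical names.

```shell
#!/bin/sh
# start_detached LOGFILE STATUSFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND in the background with no
# dependency on the calling terminal. Output goes to LOGFILE, and the
# exit status is written to STATUSFILE when COMMAND finishes.
start_detached() {
    log=$1
    statusfile=$2
    shift 2
    rm -f "$statusfile"
    (
        # Ignore hangup so a dropped session does not kill the job;
        # the ignored signal is inherited by the command.
        trap '' HUP
        rc=0
        "$@" || rc=$?
        echo "$rc" > "$statusfile"
    ) </dev/null >"$log" 2>&1 &
    echo "started job $! (log: $log, status: $statusfile)"
}

# Possible use (paths are placeholders):
# start_detached /tmp/backup.log /tmp/backup.status sh backup.sh
```

The operator can then reconnect at any time and read the log and status files instead of relying on one long-lived session.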