V class hang

 
Bryan D. Quinn
Respected Contributor

V class hang

We had a situation this past Sunday night/Monday morning with our production V class server. Just after midnight the OS became inaccessible: I could not log in. SAP and Oracle were humming along without a care. I could see some OS information from within SAP, and there was one process that had one CPU pegged at 100%. That was CPU 0, but we have 16 CPUs in this box and the other 15 were 97% idle across the board. Our memory utilization looked fine; nothing was pegged there.

So, my question is, what would keep me from logging in? I got a login prompt and entered root, but it never prompted me for a password; it would just hang.

To add a little history to this situation, this is the third time this has happened in about the past 3 to 3 1/2 years. Also, every time it has happened, it was just after midnight on Sunday night/Monday morning. Prior to this Sunday, the last time it happened was on Sunday, Sept. 18, 2005.

Any thoughts on this issue would be greatly appreciated.

-Bryan
28 REPLIES
Coolmar
Esteemed Contributor

Re: V class hang

Hi Bryan,

Do you have any batch jobs that run at that time that every once in a while drain all your system resources?
Do you see anything in the syslog?
How long was it before you were finally able to log in?
Bryan D. Quinn
Respected Contributor

Re: V class hang

Hello

No, nothing in syslog that says there is a problem going on. The last entry in syslog was at 12:07.

As for finally being able to log in, I don't know. We ended up shutting SAP and the database down from within SAP and performing a do_reset on the V from its test station.

I did, however, try to log in after SAP and Oracle were down. It still hung after I entered the username.

After the reset everything came back up without a hitch.
Coolmar
Esteemed Contributor

Re: V class hang

What shell is the default for root on your system? I wonder if it might be a shell problem that requires a patch or something, especially since your resources seem fine.
Bryan D. Quinn
Respected Contributor

Re: V class hang

The default shell is POSIX.
Coolmar
Esteemed Contributor

Re: V class hang

What version of the OS are you running?
Bryan D. Quinn
Respected Contributor

Re: V class hang

HP-UX 11.0
spex
Honored Contributor

Re: V class hang

Hi Bryan,

What scheme are you using for authentication? /etc/passwd? NIS? NIS+? LDAP?

Any recent changes to /etc/passwd, /etc/group, /etc/nsswitch.conf, /etc/resolv.conf, or /etc/hosts?

What do 'pwck' and 'grpck' return?

PCS
Bryan D. Quinn
Respected Contributor

Re: V class hang

We use /etc/hosts

results from pwck:
I got a couple of 'Login directory not found' for the oracle and sap users.

No results from grpck.
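
The 'Login directory not found' message refers to passwd entries whose home directory does not exist on disk. A minimal Python sketch of that one check, assuming a Python interpreter is available (nothing in this thread says so); the function name `missing_home_dirs` is made up for illustration, and this is not how pwck itself is implemented:

```python
import os


def missing_home_dirs(passwd_text):
    """Return (user, home) pairs whose home directory does not exist."""
    missing = []
    for line in passwd_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            continue  # not a complete seven-field passwd entry
        user, home = fields[0], fields[5]
        if not os.path.isdir(home):
            missing.append((user, home))
    return missing


if __name__ == "__main__":
    with open("/etc/passwd") as handle:
        for user, home in missing_home_dirs(handle.read()):
            print(f"{user}: login directory {home} not found")
```

A home directory on a filesystem that is unmounted at the time of the check would be reported the same way as one that was never created.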
Coolmar
Esteemed Contributor

Re: V class hang

Did you try logging in with another ID or just root?
Coolmar
Esteemed Contributor

Re: V class hang

Also, how are you trying to log in - telnet? CDE? SSH?
Bryan D. Quinn
Respected Contributor

Re: V class hang

I only tried the root id, but one of our operators tried logging on with the operations id and he had the same problem.
Bryan D. Quinn
Respected Contributor

Re: V class hang

I tried to log in with telnet and rlogin from another box.
Coolmar
Esteemed Contributor

Re: V class hang

Were any filesystems full at that time? Perhaps Oracle or SAP filled one, and even shutting the apps down didn't release the process and therefore the resources, which a reset would do?
Bryan D. Quinn
Respected Contributor

Re: V class hang

No, looking at the OS from within SAP we did not see any file systems that were full. After shutting down SAP/Oracle, I took down the SG package also, which unmounts all of the SAP/Oracle file systems. I still could not log in.
Coolmar
Esteemed Contributor

Re: V class hang

I am wondering more if any of the OS filesystems were full, like /, /var, /usr, etc.
Did you happen to see if any of them were full?
Matti_Kurkela
Honored Contributor

Re: V class hang

We've had similar situations once in a while, luckily mostly on our testing servers when testing application performance under maximum load ("torture tests"). It is usually a cue to re-examine the kernel parameter values and the amount of swap space.

Lack of various system resources may cause a situation like this. Often the system is critically out of usable memory (i.e. there may be free real memory, but all of the swap space is reserved, so the system will not accept new memory allocations), or the process table is almost completely full.

The problem with logging in might be as follows: as the "getty" process is already established and running, it can display the "login:" prompt just fine. But after you enter the username, getty will try to exec() the "login" process, and that step fails. The problem may be anything that keeps you from allocating more memory and/or starting up new processes.
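
One way to watch for this failure mode is a small canary that tries to start a trivial child process and records whether that works. A minimal Python sketch, assuming a Python interpreter is available on the host (the thread does not say so); the name `can_spawn` and the 5-second timeout are illustrative choices:

```python
import subprocess
import sys


def can_spawn(cmd, timeout=5.0):
    """Return True if `cmd` starts and exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a failed fork()/exec(), for example a full process
        # table or no memory; TimeoutExpired covers a child that never finishes.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # The child is just the interpreter running an empty program.
    ok = can_spawn([sys.executable, "-c", "pass"])
    print("spawn ok" if ok else "spawn FAILED")
```

A check like this only helps if it is called from a process that is already running before the trouble starts, since anything launched afterwards would hit the same limit it is trying to detect.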

A nasty side effect of this situation is that the syslog daemon may sometimes die when the system is struggling with the lack of resources, so you may not have all the kernel error messages stored in syslog. They should be in the console output, though. Check the V-class console log on the test station; it might offer some additional clues.

I would recommend monitoring/logging "swapinfo -t" on Sunday nights.
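
For logging, the one number worth pulling out of each "swapinfo -t" run is the percentage on the "total" line. A minimal Python sketch, assuming Python is available and that the total line carries a field ending in "%". The sample text below only illustrates the general shape of the output; its numbers are made up, and the exact columns should be checked against a real system:

```python
def total_swap_pct_used(swapinfo_output):
    """Return the percent-used value from the 'total' line, or None if absent."""
    for line in swapinfo_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "total":
            for field in fields[1:]:
                if field.endswith("%"):
                    return int(field.rstrip("%"))
    return None


# Illustrative sample only: made-up numbers, assumed layout.
SAMPLE = """\
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev     4194304       0 4194304    0%       0       -    1  /dev/vg00/lvol2
reserve       - 1031264 -1031264
memory  3133664  843540 2290124   27%
total   7327968 1874804 5453164   26%       -       0    -
"""

if __name__ == "__main__":
    print(total_swap_pct_used(SAMPLE))
```

On a live system the input could come from `subprocess.run(["swapinfo", "-t"], capture_output=True, text=True).stdout`, with the result appended to a timestamped log file every few minutes on Sunday nights.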

If you have Glance available, monitoring its "system tables" page might be useful too. If any of the system tables is over 75% utilized, consider increasing the respective kernel parameter.
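
The 75% rule of thumb is simple to apply to a set of readings. A minimal Python sketch, assuming the used and limit values have already been collected by some other means; the function name and the sample figures are hypothetical, and nproc, nfile and nflocks are used only as example table names:

```python
def tables_over_threshold(tables, threshold_pct=75.0):
    """Return (name, pct) pairs, highest first, for tables above `threshold_pct`.

    `tables` maps a table name to a (used, limit) pair.
    """
    over = []
    for name, (used, limit) in tables.items():
        if limit <= 0:
            continue  # skip entries without a meaningful limit
        pct = 100.0 * used / limit
        if pct > threshold_pct:
            over.append((name, round(pct, 1)))
    return sorted(over, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical readings, not taken from the system in this thread.
    readings = {
        "nproc": (3900, 4096),
        "nfile": (20000, 65536),
        "nflocks": (160, 200),
    }
    for name, pct in tables_over_threshold(readings):
        print(f"{name}: {pct}% used")
```

With the hypothetical readings above, nproc and nflocks are flagged and nfile is not.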

You said you could see one CPU pegged at 100% busy. Did you see the name of the process it was running?
MK
Coolmar
Esteemed Contributor

Re: V class hang

I don't know that I would change kernel parameters just yet as this seems to happen only once a year...but I do agree that monitoring resources with glance, sar, and swapinfo would be a very good idea.
Bryan D. Quinn
Respected Contributor

Re: V class hang

From what I saw in SAP, assuming that information was correct at the time, there were no file systems full. Definitely not / or /var; the base OS file systems were the first ones I checked.

As for memory, that is what I am thinking, but we have 28 GB of memory in this box and have never come close to having any problems with memory. It would have to be one heck of a hog to eat up our memory, at least I would think.

The process that had CPU 0 pegged was telalertm, which might have been valid since our archives were failing and it was trying to page us via our beepers and mobiles. One more thing: the telalert messages were not coming through, but I also send the same messages out via sendmail and those were coming through just fine.
Coolmar
Esteemed Contributor

Re: V class hang

Something else you might try next time it happens: after you shut down Oracle and SAP, run "lsof" and see if there are still processes hanging around chewing up resources. Did you also try killing the telalertm process and logging in before the reboot?
Bryan D. Quinn
Respected Contributor

Re: V class hang

I was unable to log in even after shutting down SAP/Oracle and halting the Serviceguard package. We could see the process from SAP but did not have a way of killing it from within SAP.
Coolmar
Esteemed Contributor

Re: V class hang

Of course, sorry about that. Now I see that you did everything from SAP. Originally I thought you happened to have a session open to see processes, etc., and just couldn't open any others after that one. OK, well, I am thinking that telalert might be the culprit.
Bryan D. Quinn
Respected Contributor

Re: V class hang

I am kind of suspicious of telalert also. I am going to nose around and see if I can find some logs for telalert. Maybe there is something in there that might give me some information.
Steve Post
Trusted Contributor

Re: V class hang

I don't mean to send you off on a different course, but here are the times when I had a locked V-class server.

The DNS server the V-Class box used was down. The system was hung until that DNS server came back up. Now this was during a maintenance day. The V-class server was hung while going to multiuser mode.

I have also had a bad/full disk stop the login process. But in that case, I would be able to log in, THEN it would hang. Apparently the login process checks for space on the filesystems.

If you had a unix prompt at the time no one could log in, you could run bdf and see if it hangs or not.

I just remembered another lame reason for logins failing. A guy accidentally killed the telnet process. Running "inetd -c" fixed it immediately. Of course, this is also from the unix prompt.

So in summary possible causes:
- a full or bad filesystem
- loss of DNS
- the telnet daemon killed

Those were the three I had at least.
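
Two of the causes listed above can be probed with a few lines of code. A minimal Python sketch, assuming Python is available; `port_accepts` and `fs_pct_used` are names made up here, and 23 is the standard telnet port. The DNS case is left out because a plain resolver call can itself block for a long time while the DNS server is down:

```python
import os
import socket


def port_accepts(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def fs_pct_used(path):
    """Return the percentage of blocks in use on the filesystem holding `path`."""
    st = os.statvfs(path)
    if st.f_blocks == 0:
        return 0.0
    return 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks


if __name__ == "__main__":
    print("telnet port open:", port_accepts("127.0.0.1", 23))
    print("/ used: %.1f%%" % fs_pct_used("/"))
```

The port check can be run from another box, which matters when nobody can get a prompt on the affected server. The filesystem check has to run locally, so like bdf it is only useful from a session that was already open.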