
Aco Blazeski
Regular Advisor

GS80 fails due to high temperature - how to set threshold?

Hi folks,

[I've also posted this in the servers forum, but not everyone follows it.]

Whenever the temperature in the data center rises above 29°C, one of our two GS80s fails.

Q1: I'm not sure why only one of them fails, although the one that fails is the more heavily loaded.

Q2: Is there a way to see/set the temperature thresholds for the QBBs and the PCI drawer?

As far as I can see, the SCM console only shows the current temperature.
Q3: Currently the QBB/PCI temperature is around 30°C. Is this OK, or is it high?

I've gone through the "AlphaServer Management Station User's guide". It contains a note that "The warning limits are not user-configurable." (with regard to temperature).

Q4: That leads to another question: it seems that this software is not installed on my GS80s. Where can I find it? I can't find an installation guide for AMS on the HP site.

Thanks in advance for your comments (which will be appreciated :-) )
13 REPLIES
Michael Schulte zur Sur
Honored Contributor

Re: GS80 fails due to high temperature - how to set threshold?

Hi,

you can see the current temperature via
sysconfig -q envmon
See man envconfig

greetings,

Michael
Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Thanks Michael,

Here's another question then:

jerry:root# envconfig -q
ENVMON_CONFIGURED = 1
ENVMON_GRACE_PERIOD =
ENVMON_MONITOR_PERIOD =
ENVMON_HIGH_THRESH = 50
ENVMON_USER_SCRIPT =
jerry:root#

According to the above, envmon is configured, but on my GS80s it does not start:

"Environmental Monitoring Daemon did not start...trying again"

So if envmond is not running, it cannot be what shuts the server down when the temperature threshold is exceeded. Is that right?
Does the server then shut down by itself?
Michael Schulte zur Sur
Honored Contributor

Re: GS80 fails due to high temperature - how to set threshold?

Aco,

are there any relevant error messages in /var/adm/messages?

If so, please post them.

Michael
Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Regarding the server crash due to high temperature: there are no relevant messages in /var/adm/messages or syslog.dated.

Regarding envmond, during server startup:
"Environmental Monitoring Daemon did not start...trying again", three times.
Johan Brusche
Honored Contributor

Re: GS80 fails due to high temperature - how to set threshold?


In some versions, i.e. configurations with V5.1A plus PK#3/PK#4/PK#5 (you did not mention what you have), envmond does not start if the "community public" entry has been removed from /etc/snmpd.conf and/or if the system is not configured as a DNS client (i.e. nslookup exits unsuccessfully).
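The two preconditions above can be checked by hand before debugging further. The following is only a sketch: the helper names `has_public_community` and `dns_lookup_ok` are made up for illustration, the name passed to nslookup defaults to the local hostname as a guess, and it assumes nslookup returns a non-zero exit status on failure.

```shell
#!/bin/sh
# Sketch of pre-checks for the two conditions described above.
# Helper names are hypothetical; they are not part of envmond or Tru64.

# Succeeds if the file has an uncommented "community public ..." line.
has_public_community() {
    conf=${1:-/etc/snmpd.conf}
    awk '$1 == "community" && $2 == "public" { found = 1 } END { exit !found }' "$conf"
}

# Succeeds if nslookup can resolve the given name.
# Assumption: the local hostname is a reasonable name to test with.
dns_lookup_ok() {
    name=${1:-$(hostname)}
    nslookup "$name" >/dev/null 2>&1
}
```

Usage would be along the lines of `has_public_community || echo "no public community"` followed by `dns_lookup_ok || echo "nslookup failed"`.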

If that is not the issue you can debug the envmon startup as follows:

cd /usr/sbin
cp -p envmond envmond.orig
vi envmond (and comment out the line Env_Daemonize)
(see extract below between --------)
---------------------------------
proc envmon_main {} {

# Env_Daemonize

global SYSMANUI
set SYSMANUI cli
---------------------------------

Then run envmond interactively as follows:

envconfig stop

/usr/sbin/envmond -ui cli

In the output on your terminal, you might see things like:

No response from server
while executing
"exec /bin/nslookup $cluAlais "
(procedure "getLocalIPv4List" line 15)
invoked from within
"getLocalIPv4List "
(procedure "getCommunityString" line 52)
invoked from within
"getCommunityString "
(procedure "envmon_main" line 54)
invoked from within
"envmon_main"
(file "/usr/sbin/envmond" line 852)


__ Johan /.

_JB_
Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Thanks Johan,

I have Tru64 5.1 with PK5. Here's what I found following your suggestions; there is no output when envmond starts :-( :

jerry:root# grep public /etc/snmpd.conf
community public 0.0.0.0 read

jerry:root# grep Dae envmond
# Env_Daemonize

jerry:root# envconfig stop
jerry:root# envmond -ui cli

jerry:root# ps -Af | grep env
root 525311 524289 0.0 Oct 24 ?? 0:23.12 /usr/openv/volmgr/bin/ltid
root 919597 902628 0.0 16:37:19 pts/3 0:00.01 grep env
Aaron Biver_2
Frequent Advisor
Solution

Re: GS80 fails due to high temperature - how to set threshold?

Aco,

Are you sure it was an environmental event that brought it down? Please paste a console excerpt, or suggest other evidence. Any messages output by the firmware before taking your system down will not show up in the messages file, and must be copied from your personal console logs.

On the GS80-class system, the thermal shutdown threshold is 50C, though I think the firmware may take other action at 45C. If you are saying your lab temperature was near 30C, then maybe your internal system temperature was near 45C, and maybe the rug was properly pulled out from under you. I cannot explain why envmond did not do it for you, but it should have.

FYI, at the SCM prompt, "show status" will show you the last alert, which may give you more info on the environmental event that brought the system down. ...or it may simply tell you that you pulled the plug last week (or whatever).

As for getting envmond to run, one reason it will fail to start is if /var/run/envmond.pid is lying around and stale, though I'm not sure how this could happen. Check for this file and delete it if found.

It is possible that, if a valid thermal shutdown was initiated by firmware (and envmond was somehow unable to let you down gracefully), the envmond script did not have time to delete its pid file, and won't start now.
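A stale pid file can be told apart from a live one before deleting it. This is a minimal sketch, assuming the file holds the daemon's PID on its first line (not confirmed here for envmond) and that the check runs as root, since `kill -0` can also fail for lack of permission. The name `pid_file_state` is made up for illustration.

```shell
#!/bin/sh
# Sketch: classify a pid file as absent, live, or stale.
# Assumption: the first line of the file is the PID of the daemon.
# pid_file_state is a hypothetical helper, not a Tru64 command.

pid_file_state() {
    pidfile=${1:-/var/run/envmond.pid}
    if [ ! -f "$pidfile" ]; then
        echo "absent"
        return 0
    fi
    pid=""
    read pid < "$pidfile" || true
    case "$pid" in
        ''|*[!0-9]*)
            # Empty or non-numeric content cannot name a running process.
            echo "stale"
            return 0
            ;;
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "live"
    else
        echo "stale"
    fi
}
```

If the result is "stale", removing the file and running `envconfig start` again is the step suggested above. A "live" result only shows that some process has that PID; confirming it really is envmond with ps is still worthwhile.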

Also, I believe envmond can be run in debug mode:

envconfig stop
rcmgr set ENVMON_DEBUG 1
envconfig start

This may give you some output in syslog's user.log.
(note that this mode is unsupported, so do not run in production like this, though I really don't see much harm except chatty log files)
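To look at that debug output, something like the sketch below could pick the newest user.log and filter it. It assumes the dated syslog directories live under /var/adm/syslog.dated and that the relevant lines contain the string "envmon"; both are assumptions, and the helper names are made up for illustration.

```shell
#!/bin/sh
# Sketch: show envmon-related lines from the most recent user.log.
# Assumptions: logs are kept as <base>/<date>/user.log and the daemon's
# messages contain "envmon". Helper names are hypothetical.

# Print the path of the most recently modified user.log under the base directory.
latest_user_log() {
    base=${1:-/var/adm/syslog.dated}
    ls -t "$base"/*/user.log 2>/dev/null | head -n 1
}

# Print lines mentioning envmon from that file, if one exists.
envmon_log_lines() {
    log=$(latest_user_log "$1")
    [ -n "$log" ] && grep -i envmon "$log"
}
```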


Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Aaron 10x for your reply,

First of all, you're right about /var/run/envmond.pid. Removing it solved the envmond issue :), and just for the record:

jerry:root# envconfig stop
jerry:root# rcmgr set ENVMON_DEBUG 1
jerry:root# envconfig start
Environmental Monitoring Daemon did not start...trying again
Environmental Monitoring Daemon did not start...trying again
Environmental Monitoring Daemon did not start after 3 tries.
jerry:root# rm /var/run/envmond.pid
jerry:root# envconfig start
Environmental Monitoring Daemon started.

"show status" displays no alerts, but alerts were not enabled. So I've issued "enable alert" and now I'll wait for another failure.

I'll post any new info regarding this.

Again thanks to all folks who spent some time on this thread !
Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Hello people,

The temperature in the data center got high again and the server failed again.

For the record, the GS80 is composed of two QBBs:
QBB0 with 2 CPUs and 2 Memory modules
QBB1 with 4 CPUs and 4 Memory modules

Unfortunately there is no alert on the SCM console :-(.
However, "show system" on the SCM shows CPU 0 and the QBB backplane on QBB0 as faulted:

Par  hrd/csb  CPU   Mem   IOR3  IOR2  IOR1  IOR0  GP   QBB  Dir  PS   Temp
     QBB#     3210  3210     (pci_box.rio)        Mod  BP   Mod  321  (°C)

(-)  0/30     --pf  --pp  --.-  --.-  P0.1  P0.0  P    f    P    PPP  25.5
(-)  1/31     PPPP  PPPP  --.-  --.-  --.-  --.-  P    P    P    PPP  25.0

But after 2-3 hours, when the temperature has dropped, the GS80 boots OK and the "show system" output shows no faults.
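Since the fault flags disappear after a successful boot, it can help to capture the "show system" output to a file right after a failure and scan it later. The sketch below assumes the row layout pasted above (rows begin with a parenthesised partition field, the second field is QBB#/csb, the last field is the temperature) and that a lowercase "f" marks a faulted item. The name `faulted_qbbs` is made up for illustration.

```shell
#!/bin/sh
# Sketch: list QBB rows of a saved "show system" capture that carry an "f" flag.
# Assumptions: layout as pasted above; lowercase "f" means faulted.
# faulted_qbbs is a hypothetical helper, not an SCM or Tru64 command.

faulted_qbbs() {
    awk '
        $1 ~ /^\(/ {
            split($2, id, "/")        # id[1] is the QBB number
            flags = ""
            # Skip the partition, QBB#/csb and temperature fields.
            for (i = 3; i < NF; i++)
                if ($i ~ /f/)
                    flags = flags " " $i
            if (flags != "")
                print "QBB" id[1] ":" flags
        }
    ' "$@"
}
```

Called as `faulted_qbbs capture.txt`, where capture.txt is a saved console log, it prints one line per QBB that has at least one flagged field.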

Do you have some opinion regarding this?

10x to all,
Regards
Aaron Biver_2
Frequent Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Are there any machine checks in the binary errorlog?
Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

On 2 Nov 2005 the server failed shortly after 9:00.
We booted the server manually at about a quarter past 13:00.

Attached is an extract of binary.errorlog for that time interval; it shows that the server booted with no problem.

Of course I noticed the PSM error, but since the server DID boot without problems, the power supplies must be OK?!

Aaron Biver_2
Frequent Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Aco,

I theorize there is a bad power supply, and the system has been receiving two types of failure notices: the polite kind, which has been ignored, and the not-so-polite kind, which can shut a system down without notice.

A power failure is one of the only things that can bring a system down like this, and it could also be responsible for the apparent POST failures of the QBB and CPU components that show up as "f" in your SCM display. Like all hardware failures, these types of failures are flaky and unpredictable.

If you delve into the power supply failure messages a little more deeply, I think you can identify the failed component. Cybrary suggests that the 48V supplies allow redundancy, so if you have an extra, you may be able to remove one. If you do not have an extra, consider consolidating your equipment to one QBB while you consider repair/replacement options.

Aco Blazeski
Regular Advisor

Re: GS80 fails due to high temperature - how to set threshold?

Hi Aaron,

10x for your comments.
The strange thing for me here is that the error messages are generated AFTER the failure, just before booting (after switching the power button on).

However, I'll check the power supplies again; they seem to be fine.

I'll certainly post if anything new comes up :)