Re: Server was down this morning. Need help.

Kevin Farrell_4 · ‎08-22-2005

Our server was down this morning. This is an Oracle web server for Oracle Applications (web/forms tier).

We could ping it and got a login prompt, but after we logged in the session never came back.

Is there a system log somewhere I might be able to look at that might give me a clue of what happened? We eventually just power cycled the box. Now all is well.

#uname -a
HP-UX falcon B.11.11 U 9000/800 128921567 unlimited-user license
root(falcon)/root:
#model
9000/800/L2000-5X
root(falcon)/root:
#
Thanks, Kevin

Magnus Linner_2 · ‎08-22-2005

Have a look at /var/adm/syslog/syslog.log.
Also run the command dmesg and make sure it doesn't show any strange problems.
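
For a long log it can help to filter for lines that look like trouble first. The following is only a sketch: the function name scan_log and the keyword list are assumptions, not a standard tool, and the pattern should be tuned for your system.

```shell
#!/bin/sh
# Sketch: print the lines of a syslog-style file that mention common
# trouble keywords. The keyword list is an assumption; adjust as needed.
scan_log() {
    # $1 = log file to scan
    grep -i -E 'critical|warning|error|fail|panic' "$1"
}

# Typical use on HP-UX:
#   scan_log /var/adm/syslog/syslog.log
#   scan_log /var/adm/syslog/OLDsyslog.log   # log from before the last reboot
```

The same filter can be applied to dmesg output by piping it through grep with the same pattern.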

Devender Khatana · ‎08-22-2005

Hi,

You could check /var/adm/syslog/OLDsyslog.log.

This is the syslog.log that was active prior to the reboot. Also check the GSP logs; the error should be reported there if it was a hardware issue.

HTH,
Devender

Impossible itself mentions "I m possible"

Kevin Farrell_4 · ‎08-22-2005

Here's the dmesg output. Not sure what to make of it. I'm not a strong Unix guy.

#dmesg

Aug 22 08:56
gate64: sysvec_vaddr = 0xc0002000 for 2 pages
NOTICE: autofs_link(): File system was registered at index 3.
NOTICE: cachefs_link(): File system was registered at index 5.
NOTICE: nfs3_link(): File system was registered at index 6.
0 sba
0/0 lba
0/0/0/0 btlan
0/0/1/0 c720
0/0/1/0.7 tgt
0/0/1/0.7.0 sctl
0/0/1/1 c720
0/0/1/1.2 tgt
0/0/1/1.2.0 sdisk
0/0/1/1.7 tgt
0/0/1/1.7.0 sctl
0/0/2/0 c720
0/0/2/0.0 tgt
0/0/2/0.0.0 sdisk
0/0/2/0.2 tgt
0/0/2/0.2.0 sdisk
0/0/2/0.7 tgt
0/0/2/0.7.0 sctl
0/0/2/1 c720
0/0/2/1.2 tgt
0/0/2/1.2.0 sdisk
0/0/2/1.7 tgt
0/0/2/1.7.0 sctl
0/0/4/0 asio0
0/0/5/0 asio0
0/1 lba
0/2 lba
0/3 lba
0/4 lba
0/4/0/0 PCItoPCI
0/4/0/0/4/0 btlan
0/4/0/0/5/0 btlan
0/4/0/0/6/0 btlan
0/4/0/0/7/0 btlan
0/5 lba
0/6 lba
0/7 lba
8 memory
160 processor
166 processor
btlan: Initializing 10/100BASE-TX card at 0/0/0/0....

System Console is on the Built-In Serial Interface
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/4/0....
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/5/0....
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/6/0....
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/7/0....
Entering cifs_init...
Initialization finished successfully... slot is 9
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
Swap device table: (start & size given in 512-byte blocks)
entry 0 - major is 64, minor is 0x2; start = 0, size = 8388608
Dump device table: (start & size given in 1-Kbyte blocks)
entry 0000000000000000 - major is 31, minor is 0x12000; start = 101216, size = 4194304
Starting the STREAMS daemons-phase 1
Create STCP device files
Starting the STREAMS daemons-phase 2
$Revision: vmunix: vw: -proj selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108' Wed Nov 8 19:24:56 PST 2000 $
Memory Information:
physical page size = 4096 bytes, logical page size = 4096 bytes
Physical: 4194304 Kbytes, lockable: 3897160 Kbytes, available: 3703032 Kbytes

root(falcon)/root:
#

Geoff Wild · ‎08-22-2005

What never came back? The web server?

If the server was okay from an admin point of view (i.e., you can log in and issue commands), then check the application logs and/or Oracle logs.

Rgds...Geoff

Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.

Victor Fridyev · ‎08-22-2005

Hi,

I had such a problem, which was resolved in one case by replacing the system disk and in another case by a GSP console firmware upgrade.
Run
dmesg -

Try to find SCSI errors.

HTH

Entities are not to be multiplied beyond necessity - RTFM

Kevin Farrell_4 · ‎08-22-2005

Here is what the old syslog (OLDsyslog.log) shows:

Aug 21 15:25:21 falcon EMS [1680]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100490 -a
Aug 21 16:20:55 falcon EMS [1680]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100491 -a
Aug 21 17:02:13 falcon EMS [1680]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100492 -a
Aug 21 17:19:13 falcon vmunix: DIAGNOSTIC SYSTEM WARNING:
Aug 21 17:19:13 falcon vmunix: The diagnostic logging facility has started receiving excessive
Aug 21 17:19:13 falcon vmunix: errors from the I/O subsystem. I/O error entries will be lost
Aug 21 17:19:13 falcon vmunix: until the cause of the excessive I/O logging is corrected.
Aug 21 17:19:13 falcon vmunix: If the diaglogd daemon is not active, use the Daemon Startup command
Aug 21 17:19:13 falcon vmunix: in stm to start it.
Aug 21 17:19:13 falcon vmunix: If the diaglogd daemon is active, use the logtool utility in stm
Aug 21 17:19:13 falcon vmunix: to determine which I/O subsystem is logging excessive errors.
Aug 21 17:19:24 falcon vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f020000) is POWERFAILED
root(falcon)/var/adm/syslog:
#

Rgomes · ‎08-22-2005

Hi,

Please run the command below:

#/opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100492 -a

You can also check the log file /var/opt/resmon/log/event.log to see if there is any hardware issue.
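
If event.log holds many entries, an awk filter can reduce each event to one line. This is only a sketch: it assumes each event carries "Event Time...:" and "Severity...:" lines plus a "Summary:" line followed by the summary text, and the function name summarize_events is made up for illustration.

```shell
#!/bin/sh
# Sketch: print one "time | severity | summary" line per EMS event.
# Assumes the Event Time / Severity / Summary layout described above.
summarize_events() {
    # $1 = EMS event log to read
    awk '
        { sub(/^[ \t]+/, "") }                       # drop leading indentation
        /^Event Time/ { sub(/^Event Time[.]*: */, ""); when = $0; next }
        /^Severity/   { sub(/^Severity[.]*: */, "");   sev  = $0; next }
        # first non-empty line after "Summary:" is the summary text
        want_summary && NF { print when " | " sev " | " $0; want_summary = 0; next }
        /^Summary:/   { want_summary = 1 }
    ' "$1"
}

# Typical use:
#   summarize_events /var/opt/resmon/log/event.log
```

Repeated identical summaries at close intervals, as opposed to a single event, suggest an ongoing condition rather than a one-off glitch.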

Regards,
Richard

Rajesh SB · ‎08-22-2005

Hi,

The POWERFAILED message is an indication of a disk failure in the VG.

"
Aug 21 17:19:13 falcon vmunix: to determine which I/O subsystem is logging excessive errors.
Aug 21 17:19:24 falcon vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f020000) is POWERFAILED
"

First, back up the critical data from the server.
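
To see which disk the message refers to, the dev_t value can be split into its fields. The sketch below assumes the legacy HP-UX 11.x layout for sdisk devices (8-bit major, then 8-bit card instance, 4-bit target, 4-bit LUN); the function name decode_dev_t is hypothetical, and the result should be confirmed with ioscan -fnC disk and vgdisplay -v before acting on it.

```shell
#!/bin/sh
# Sketch: split a 32-bit HP-UX dev_t into major number and legacy
# /dev/dsk name. The bit layout is an assumption (11.x sdisk minors).
decode_dev_t() {
    # $1 = dev_t as printed by the kernel, e.g. 0x1f020000
    n=$(printf '%d' "$1")
    major=$(( (n >> 24) & 0xff ))    # top 8 bits: driver major number
    card=$(( (n >> 16) & 0xff ))     # next 8 bits: interface card instance
    target=$(( (n >> 12) & 0xf ))    # next 4 bits: SCSI target
    lun=$(( (n >> 8) & 0xf ))        # next 4 bits: LUN
    echo "major=$major /dev/dsk/c${card}t${target}d${lun}"
}

# Typical use:
#   decode_dev_t 0x1f020000
```

Under that assumption, 0x1f020000 is major 31 (the block sdisk driver) and corresponds to /dev/dsk/c2t0d0.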

Regards,
Rajesh

Kevin Farrell_4 · ‎08-22-2005

Summary:
Processor cabinet intake temperature is too hot

Description of Error:

The system intake temperature is too high.

Probable Cause / Recommended Action:

Something is blocking the cooling intakes in the system processing unit
(SPU).
Check for obstructions.

The room containing the SPU is too hot.
Check for problems with the room air conditioning.

Additional Event Data:
System IP Address...: 192.2.2.40
Event Id............: 0x4308d52100000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x4308d52100000000
Additional System Data:
System Model Number.............: 9000/800/L2000-5X
EMS Version.....................: A.03.20
STM Version.....................: A.28.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#33

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v

>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Sun Aug 21 16:20:55 2005

falcon sent Event Monitor notification information:

/system/events/core_hw/core_hw is >= 1.
Its current value is CRITICAL(5).

Event data from monitor:

Event Time..........: Sun Aug 21 16:20:55 2005
Severity............: CRITICAL
Monitor.............: dm_core_hw
Event #.............: 33
System..............: falcon

Summary:
Processor cabinet intake temperature is too hot

Description of Error:

The system intake temperature is too high.

Probable Cause / Recommended Action:

Something is blocking the cooling intakes in the system processing unit
(SPU).
Check for obstructions.

The room containing the SPU is too hot.
Check for problems with the room air conditioning.

Additional Event Data:
System IP Address...: 192.2.2.40
Event Id............: 0x4308e22700000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x4308e22700000000
Additional System Data:
System Model Number.............: 9000/800/L2000-5X
EMS Version.....................: A.03.20
STM Version.....................: A.28.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#33

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v

>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Sun Aug 21 17:02:13 2005

falcon sent Event Monitor notification information:

/system/events/core_hw/core_hw is >= 1.
Its current value is CRITICAL(5).

Event data from monitor:

Event Time..........: Sun Aug 21 17:02:13 2005
Severity............: CRITICAL
Monitor.............: dm_core_hw
Event #.............: 33
System..............: falcon

Summary:
Processor cabinet intake temperature is too hot

Description of Error:

The system intake temperature is too high.

Probable Cause / Recommended Action:

Something is blocking the cooling intakes in the system processing unit
(SPU).
Check for obstructions.

The room containing the SPU is too hot.
Check for problems with the room air conditioning.

Additional Event Data:
System IP Address...: 192.2.2.40
Event Id............: 0x4308ebd500000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x4308ebd500000000
Additional System Data:
System Model Number.............: 9000/800/L2000-5X
EMS Version.....................: A.03.20
STM Version.....................: A.28.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#33

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v

>---------- End Event Monitoring Service Event Notification ----------<
root(falcon)/var/opt/resmon/log:
#

TwoProc · ‎08-22-2005

I'm thinking your computer got too hot. Do you have other machines in there with logs as well? What about the A/C unit in the computer room? Does it have a log of the temperature warnings?

We are the people our parents warned us about --Jimmy Buffett

Kevin Farrell_4 · ‎08-22-2005

We think there might have been a power outage.
Thanks for the help.

Kevin

Bill Hassell · ‎08-22-2005

High temperature warnings are a serious problem because they indicate that something very serious went wrong in your computer room, namely, an air conditioner shut down and no one was notified. While your HP-UX server may have been able to shut down before it melted, the disks, tape drives, network boxes and everything else in the computer room may be damaged.

No, you don't want to hook up some sort of log monitoring script to email your pager when the temperature goes too high. While it sounds whizzy, there are serious reliability issues with such schemes. The first is whether the email system is still running (may be down due to overtemp). Second is delays that can exist (out of your control) in forwarding email to your pager. Third is whether the pager is in range or even turned on.

I would make a disaster prevention plan the very first priority. You might have only lost one air conditioner, but consider the consequences of some electrician turning off all the air conditioners for your server room. How long would it take to destroy every piece of equipment in the room? 1 hour? 10 minutes? If you don't know, then the equipment is at serious risk of complete destruction because (according to Murphy's Law) it will happen at a time when no one is around and it will take hours to get to someone who can drive in and at least pull the plug to prevent further damage.

Loss of air conditioning in a computer room is equal in seriousness to a fire. The difference is that the fire might spread to other parts of the building. So contact your alarm company and get them to add temperature sensors to the fire sensors they have now. Then set up a series of trained people who can do something quickly. Be very careful of high-turnover security guards in this role. I would also add a remote-controlled power contactor for the entire room, one that will open when the temperature goes above 95-105 deg F. No time to notify computers to shut down, just pull the plug on everything in the room. Better to clean up some filesystems than to order new equipment.

Bill Hassell, sysadmin

Raj D. · ‎08-22-2005

Hi Kevin,

i) Is it due to a power outage,
ii) or to overheating, as per the resmon EMS data ("Processor cabinet intake temperature is too hot")?

Then you can isolate the problem accordingly.
Check whether another server in the same server room experienced the same problem. You can check its syslog and dmesg.

Cheers,

RajD.
----

" If u think u can , If u think u cannot , - You are always Right . "
