08-22-2005 12:43 AM
Server was down this morning. Need help.
We could ping it and got a login prompt, but after we logged in the session never came back.
Is there a system log somewhere I might be able to look at that might give me a clue of what happened? We eventually just power cycled the box. Now all is well.
#uname -a
HP-UX falcon B.11.11 U 9000/800 128921567 unlimited-user license
root(falcon)/root:
#model
9000/800/L2000-5X
root(falcon)/root:
#
Thanks, Kevin
08-22-2005 12:50 AM
Re: Server was down this morning. Need help.
Also run the command #dmesg and make sure it doesn't report any strange problems.
08-22-2005 12:52 AM
Re: Server was down this morning. Need help.
You could check /var/adm/syslog/OLDsyslog.log.
This is the syslog.log that was active prior to the reboot. Also check the GSP logs; the error should be reported there if it was a hardware issue.
HTH,
Devender
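As a rough illustration of the OLDsyslog.log check suggested above, here is a minimal Python sketch that pulls out lines worth a closer look. The keyword list is an assumption chosen for illustration, not an official HP-UX list, and the function names are hypothetical. It can be run on any machine that has a copy of the log.

```python
import re

# Illustrative keywords only (an assumption, not an official HP-UX list).
PATTERNS = re.compile(
    r"POWERFAILED|CRITICAL|DIAGNOSTIC SYSTEM WARNING|SCSI|panic",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the lines that match any of the warning keywords."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]


def scan_file(path="/var/adm/syslog/OLDsyslog.log"):
    """Scan a syslog file; the default path is the one named in the thread."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Calling scan_file() with no arguments reads the pre-reboot log; pass a different path to scan a copied file.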
08-22-2005 12:53 AM
Re: Server was down this morning. Need help.
#dmesg
Aug 22 08:56
gate64: sysvec_vaddr = 0xc0002000 for 2 pages
NOTICE: autofs_link(): File system was registered at index 3.
NOTICE: cachefs_link(): File system was registered at index 5.
NOTICE: nfs3_link(): File system was registered at index 6.
0 sba
0/0 lba
0/0/0/0 btlan
0/0/1/0 c720
0/0/1/0.7 tgt
0/0/1/0.7.0 sctl
0/0/1/1 c720
0/0/1/1.2 tgt
0/0/1/1.2.0 sdisk
0/0/1/1.7 tgt
0/0/1/1.7.0 sctl
0/0/2/0 c720
0/0/2/0.0 tgt
0/0/2/0.0.0 sdisk
0/0/2/0.2 tgt
0/0/2/0.2.0 sdisk
0/0/2/0.7 tgt
0/0/2/0.7.0 sctl
0/0/2/1 c720
0/0/2/1.2 tgt
0/0/2/1.2.0 sdisk
0/0/2/1.7 tgt
0/0/2/1.7.0 sctl
0/0/4/0 asio0
0/0/5/0 asio0
0/1 lba
0/2 lba
0/3 lba
0/4 lba
0/4/0/0 PCItoPCI
0/4/0/0/4/0 btlan
0/4/0/0/5/0 btlan
0/4/0/0/6/0 btlan
0/4/0/0/7/0 btlan
0/5 lba
0/6 lba
0/7 lba
8 memory
160 processor
166 processor
btlan: Initializing 10/100BASE-TX card at 0/0/0/0....
System Console is on the Built-In Serial Interface
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/4/0....
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/5/0....
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/6/0....
btlan: Initializing 10/100BASE-TX card at 0/4/0/0/7/0....
Entering cifs_init...
Initialization finished successfully... slot is 9
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
Swap device table: (start & size given in 512-byte blocks)
entry 0 - major is 64, minor is 0x2; start = 0, size = 8388608
Dump device table: (start & size given in 1-Kbyte blocks)
entry 0000000000000000 - major is 31, minor is 0x12000; start = 101216, size = 4194304
Starting the STREAMS daemons-phase 1
Create STCP device files
Starting the STREAMS daemons-phase 2
$Revision: vmunix: vw: -proj selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108' Wed Nov 8 19:24:56 PST 2000 $
Memory Information:
physical page size = 4096 bytes, logical page size = 4096 bytes
Physical: 4194304 Kbytes, lockable: 3897160 Kbytes, available: 3703032 Kbytes
root(falcon)/root:
#
08-22-2005 12:53 AM
Re: Server was down this morning. Need help.
If the server was okay from an admin point of view (i.e., you can log in and issue commands), then check the application logs and/or Oracle logs.
Rgds...Geoff
08-22-2005 12:55 AM
Re: Server was down this morning. Need help.
I had a similar problem, which was resolved in one case by replacing the system disk and in another case by a GSP console firmware upgrade.
Run
dmesg -
and look for SCSI errors in the output.
HTH
08-22-2005 12:58 AM
Re: Server was down this morning. Need help.
Aug 21 15:25:21 falcon EMS [1680]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100490 -a
Aug 21 16:20:55 falcon EMS [1680]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100491 -a
Aug 21 17:02:13 falcon EMS [1680]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/system/events/core_hw/core_hw" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100492 -a
Aug 21 17:19:13 falcon vmunix: DIAGNOSTIC SYSTEM WARNING:
Aug 21 17:19:13 falcon vmunix: The diagnostic logging facility has started receiving excessive
Aug 21 17:19:13 falcon vmunix: errors from the I/O subsystem. I/O error entries will be lost
Aug 21 17:19:13 falcon vmunix: until the cause of the excessive I/O logging is corrected.
Aug 21 17:19:13 falcon vmunix: If the diaglogd daemon is not active, use the Daemon Startup command
Aug 21 17:19:13 falcon vmunix: in stm to start it.
Aug 21 17:19:13 falcon vmunix: If the diaglogd daemon is active, use the logtool utility in stm
Aug 21 17:19:13 falcon vmunix: to determine which I/O subsystem is logging excessive errors.
Aug 21 17:19:24 falcon vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f020000) is POWERFAILED
root(falcon)/var/adm/syslog:
#
08-22-2005 01:03 AM
Re: Server was down this morning. Need help.
Please run the command below:
#/opt/resmon/bin/resdata -R 110100482 -r /system/events/core_hw/core_hw -n 110100492 -a
You can also check the log file /var/opt/resmon/log/event.log to see whether there is a hardware issue.
Regards,
Richard
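Each EMS entry in syslog carries its own resdata command, so when there are several notifications it can help to collect them all rather than copy one by hand. The following is a minimal Python sketch under the assumption that the commands always have the shape seen in the syslog excerpt in this thread; the function name is hypothetical.

```python
import re

# Matches the command shape seen in this thread's syslog excerpt (an assumption
# about the general format, based only on those lines).
RESDATA = re.compile(
    r"(/opt/resmon/bin/resdata\s+-R\s+\d+\s+-r\s+\S+\s+-n\s+\d+\s+-a)"
)


def resdata_commands(lines):
    """Collect the distinct resdata commands quoted in EMS syslog entries, in order."""
    seen = []
    for line in lines:
        for command in RESDATA.findall(line):
            if command not in seen:
                seen.append(command)
    return seen
```

The returned list can be printed and the commands run one at a time on the server as root.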
08-22-2005 01:07 AM
Re: Server was down this morning. Need help.
The POWERFAILED message is an indication of a disk failure in the VG.
"
Aug 21 17:19:13 falcon vmunix: to determine which I/O subsystem is logging excessive errors.
Aug 21 17:19:24 falcon vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f020000) is POWERFAILED
"
First, back up the critical data on the server.
Regards,
Rajesh
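To work out which disk the POWERFAILED message refers to, the dev_t value can be split into its major and minor numbers. The Python sketch below assumes the 8-bit major / 24-bit minor split and the conventional sdisk minor layout (card instance, target, LUN); that layout is an assumption here, although it agrees with the dump device in the dmesg output above (major 31, minor 0x12000, disk at target 2). The function names are hypothetical, and the result should be confirmed with ioscan -fnC disk and vgdisplay -v.

```python
def decode_dev_t(dev_t):
    """Split a 32-bit dev_t into (major, minor), assuming 8-bit major, 24-bit minor."""
    major = (dev_t >> 24) & 0xFF
    minor = dev_t & 0xFFFFFF
    return major, minor


def disk_name_from_minor(minor):
    """Guess the cXtYdZ name, assuming the minor is laid out as 0xCCTLxx."""
    card = (minor >> 16) & 0xFF   # card instance
    target = (minor >> 12) & 0xF  # SCSI target
    lun = (minor >> 8) & 0xF      # LUN
    return f"c{card}t{target}d{lun}"


major, minor = decode_dev_t(0x1F020000)
print(major, hex(minor), disk_name_from_minor(minor))
```

Under these assumptions the value from the log decodes to major 31 and a disk named c2t0d0.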
08-22-2005 01:11 AM
Re: Server was down this morning. Need help.
Summary:
Processor cabinet intake temperature is too hot
Description of Error:
The system intake temperature is too high.
Probable Cause / Recommended Action:
Something is blocking the cooling intakes in the system processing unit
(SPU).
Check for obstructions.
The room containing the SPU is too hot.
Check for problems with the room air conditioning.
Additional Event Data:
System IP Address...: 192.2.2.40
Event Id............: 0x4308d52100000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x4308d52100000000
Additional System Data:
System Model Number.............: 9000/800/L2000-5X
EMS Version.....................: A.03.20
STM Version.....................: A.28.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#33
v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v
>---------- End Event Monitoring Service Event Notification ----------<
>------------ Event Monitoring Service Event Notification ------------<
Notification Time: Sun Aug 21 16:20:55 2005
falcon sent Event Monitor notification information:
/system/events/core_hw/core_hw is >= 1.
Its current value is CRITICAL(5).
Event data from monitor:
Event Time..........: Sun Aug 21 16:20:55 2005
Severity............: CRITICAL
Monitor.............: dm_core_hw
Event #.............: 33
System..............: falcon
Summary:
Processor cabinet intake temperature is too hot
Description of Error:
The system intake temperature is too high.
Probable Cause / Recommended Action:
Something is blocking the cooling intakes in the system processing unit
(SPU).
Check for obstructions.
The room containing the SPU is too hot.
Check for problems with the room air conditioning.
Additional Event Data:
System IP Address...: 192.2.2.40
Event Id............: 0x4308e22700000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x4308e22700000000
Additional System Data:
System Model Number.............: 9000/800/L2000-5X
EMS Version.....................: A.03.20
STM Version.....................: A.28.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#33
v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v
>---------- End Event Monitoring Service Event Notification ----------<
>------------ Event Monitoring Service Event Notification ------------<
Notification Time: Sun Aug 21 17:02:13 2005
falcon sent Event Monitor notification information:
/system/events/core_hw/core_hw is >= 1.
Its current value is CRITICAL(5).
Event data from monitor:
Event Time..........: Sun Aug 21 17:02:13 2005
Severity............: CRITICAL
Monitor.............: dm_core_hw
Event #.............: 33
System..............: falcon
Summary:
Processor cabinet intake temperature is too hot
Description of Error:
The system intake temperature is too high.
Probable Cause / Recommended Action:
Something is blocking the cooling intakes in the system processing unit
(SPU).
Check for obstructions.
The room containing the SPU is too hot.
Check for problems with the room air conditioning.
Additional Event Data:
System IP Address...: 192.2.2.40
Event Id............: 0x4308ebd500000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x4308ebd500000000
Additional System Data:
System Model Number.............: 9000/800/L2000-5X
EMS Version.....................: A.03.20
STM Version.....................: A.28.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#33
v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v
>---------- End Event Monitoring Service Event Notification ----------<
root(falcon)/var/opt/resmon/log:
#
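The event log repeats the same long block for every notification, so a short summary per event is easier to read. This is a minimal Python sketch that assumes the block markers and the dotted "Name....: value" field style shown in the output above; the function name is hypothetical, and blocks whose start marker is missing are skipped.

```python
import re

# "Name.....: value" fields as they appear in the pasted event.log output.
FIELD = re.compile(r"^\s*([A-Za-z #]+?)\.{2,}:\s*(.*)$")


def parse_events(text):
    """Return one dict per complete EMS notification block found in text."""
    events, current, want_summary = [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        # Check the End marker first: it also contains the start marker's text.
        if line.startswith(">") and "End Event Monitoring" in line:
            if current:
                events.append(current)
            current = None
            continue
        if line.startswith(">") and "Event Monitoring Service" in line:
            current, want_summary = {}, False
            continue
        if current is None:
            continue  # outside a block, or in a block whose start was cut off
        if want_summary and line:
            current["Summary"] = line  # first non-empty line after "Summary:"
            want_summary = False
            continue
        if line == "Summary:":
            want_summary = True
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1).strip()] = match.group(2).strip()
    return events
```

Printing the "Event Time", "Severity" and "Summary" keys of each returned dict gives a one-line-per-event view of the log.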
08-22-2005 01:18 AM
Re: Server was down this morning. Need help.
08-22-2005 02:11 AM
Re: Server was down this morning. Need help.
Thanks for the help.
Kevin
08-22-2005 06:09 AM
Re: Server was down this morning. Need help.
No, you don't want to hook up some sort of log monitoring script to email your pager when the temperature goes too high. While it sounds whizzy, there are serious reliability issues with such schemes. The first is whether the email system is still running (may be down due to overtemp). Second is delays that can exist (out of your control) in forwarding email to your pager. Third is whether the pager is in range or even turned on.
I would make a disaster prevention plan the very first priority. You might have only lost one air conditioner, but consider the consequences of some electrician turning off all the air conditioners for your server room. How long would it take to destroy every piece of equipment in the room? 1 hour? 10 minutes? If you don't know, then the equipment is at serious risk of complete destruction because (according to Murphy's Law) it will happen at a time when no one is around and it will take hours to get to someone who can drive in and at least pull the plug to prevent further damage.
Loss of air conditioning in a computer room is equal in seriousness to a fire. The difference is that the fire might spread to other parts of the building. So contact your alarm company and get them to add temperature sensors to the fire sensors they have now. Then set up a series of trained people who can do something quickly. Be very careful of high-turnover security guards in this role. I would also add a remote-controlled power contactor for the entire room, one that will open when the temperature goes above 95-105 deg F. No time to notify computers to shut down, just pull the plug on everything in the room. Better to clean up some filesystems than to order new equipment.
Bill Hassell, sysadmin
08-22-2005 06:38 AM
Re: Server was down this morning. Need help.
i) Is it due to a power outage,
ii) or an over-temperature condition, as per the resmon EMS data: "Processor cabinet intake temperature is too hot."
Then you can isolate the problem accordingly.
Check whether another server in the same server room experienced the same problem. You can check its syslog and dmesg.
Cheers,
RajD.