
loop in system reboot

 
SOLVED
mobidyc
Trusted Contributor

Hello,

One N4000 server running HP-UX 11.00 has been hanging repeatedly for the last 3 weeks.

HP support decided to change the following components:
a power supply
power monitor
GSP
mainboard
add a trim card
a CPU

The last intervention to change this hardware was a very bad experience: the server rebooted continuously after software alert 12 ;(
HP support told me to reinstall the OS, on the grounds that it had been corrupted by all these hangs and RS.

After two days, the server was reinstalled, patched and tuned.
I rebooted it twice to verify its stability before launching the applications, and there was no problem.

After launching Sybase and Oracle on this server, I lost control of it: the existing connections were dropped and no new connection could be opened, even from the console...

I decided to do a TC, but now I can't restart the server: it reboots itself continuously and I don't know what to do.
I tried to boot from the previous kernel and I have the same problem.

HP support does not seem able to solve this problem; they have changed many parts, and HP-UX 11.00 is no longer supported.
Do you have any advice?

During the boot test sequence I get these errors(?):
processor slave rendezvous 1C40
processor test 1142
processor test 113B
platform test 612A
I/O config 8238
platform test 6129
I/O config 8236
platform test 613F
platform test 613D
unknown, no source stated legacy PA HEX chassis-code CE00
unknown, no source stated legacy PA HEX chassis-code CE01
unknown, no source stated legacy PA HEX chassis-code CE0F
unknown, no source stated legacy PA HEX chassis-code CEC0
unknown, no source stated legacy PA HEX chassis-code CED0
unknown, no source stated legacy PA HEX chassis-code CEDB
unknown, no source stated legacy PA HEX chassis-code CEDF
processor legacy PA HEX chassis-code CEE0
processor legacy PA HEX chassis-code CEE1
processor legacy PA HEX chassis-code CEF0
processor display_activity() update 1F00
processor legacy PA HEX chassis-code CEF2
processor slave rendezvous 1C4E
processor slave rendezvous 1C4E

When I try to boot the system, I get the following alerts:
ISL> hpux /stand/vmunix.prev

Boot
: disk(0/0/2/0.6.0.0.0.0.0;0)/stand/vmunix.prev
8531968 + 1205976 + 2365784 start 0x25a0e8


alloc_pdc_pages: Relocating PDC from 0xffff800000 to 0x7fa03000.
gate64: sysvec_vaddr = 0xc0002000 for 1 pages
NOTICE: autofs_link(): File system was registered at index 3.
NOTICE: nfs3_link(): File system was registered at index 5.
td: claimed Tachyon TL/TS Fibre Channel Mass Storage card at 0/4/0/0
td: claimed Tachyon TL/TS Fibre Channel Mass Storage card at 1/10/0/0

System Console is on the Built-In Serial Interface
Logical volume 64, 0x3 configured as ROOT

************* SYSTEM ALERT **************
SYSTEM NAME: nr0u0170
DATE: 03/14/2007 TIME: 14:57:59
ALERT LEVEL: 12 = Software failure

REASON FOR ALERT
SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 0 = no problem detail

LEDs: RUN ATTENTION FAULT REMOTE POWER
FLASH FLASH OFF OFF ON
LED State: Running non-OS code. Non-critical error detected.
Check Chassis and Console Logs for error messages.

0xF8E000C01100B800 00000000 0000B800 - type 31 = legacy PA HEX chassis-code
0x58E008C01100B800 00006B02 0E0E393B - type 11 = Timestamp 03/14/2007 14:57:59
A: ack read of this entry - X: Disable all future alert messages
Anything else skip redisplay the log entry
->Choice:Timeout!
*****************************************
->Choice:a
*****************************************

************* SYSTEM ALERT **************
SYSTEM NAME: nr0u0170
DATE: 03/14/2007 TIME: 14:58:13
ALERT LEVEL: 3 = System blocked waiting for operator input

REASON FOR ALERT
SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 0 = no problem detail

LEDs: RUN ATTENTION FAULT REMOTE POWER
FLASH FLASH OFF OFF ON
LED State: Running non-OS code. Non-critical error detected.
Check Chassis and Console Logs for error messages.

0xF8E000301100E000 00000000 0000E000 - type 31 = legacy PA HEX chassis-code
0x58E008301100E000 00006B02 0E0E3A0D - type 11 = Timestamp 03/14/2007 14:58:13
A: ack read of this entry - X: Disable all future alert messages
Anything else skip redisplay the log entry
->Choice:a
*****************************************

************* SYSTEM ALERT **************
SYSTEM NAME: nr0u0170
DATE: 03/14/2007 TIME: 14:58:15
ALERT LEVEL: 3 = System blocked waiting for operator input

REASON FOR ALERT
SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 0 = no problem detail

LEDs: RUN ATTENTION FAULT REMOTE POWER
FLASH FLASH OFF OFF ON
LED State: Running non-OS code. Non-critical error detected.
Check Chassis and Console Logs for error messages.

0xF8E000301100EFFF 00000000 0000EFFF - type 31 = legacy PA HEX chassis-code
0x58E008301100EFFF 00006B02 0E0E3A0F - type 11 = Timestamp 03/14/2007 14:58:15
A: ack read of this entry - X: Disable all future alert messages
Anything else skip redisplay the log entry
->Choice:a
*****************************************

********** VIRTUAL FRONT PANEL **********
System Boot detected
*****************************************
LEDs: RUN ATTENTION FAULT REMOTE POWER
FLASH FLASH OFF OFF ON
LED State: Running non-OS code. Non-critical error detected.
Check Chassis and Console Logs for error messages.

platform config 626F
processor slave rendezvous 1C17
processor slave rendezvous 1C17
processor test 1142
processor test 1100
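
Editorial aside on the log lines above: the two 32-bit words printed on each `type 11 = Timestamp` line appear to encode the date byte by byte. The minimal Python sketch below assumes a layout inferred only from the entries quoted in this thread (year as an offset from 1900, zero-based month, then day, hour, minute, second). It is not taken from HP documentation, and `decode_gsp_timestamp` is a hypothetical helper name.

```python
from datetime import datetime


def decode_gsp_timestamp(word1: str, word2: str) -> datetime:
    """Decode the two hex words of a 'type 11 = Timestamp' log line.

    Assumed layout (inferred from the samples in this thread only):
      word1 = 0000YYMM  -> YY = years since 1900, MM = zero-based month
      word2 = DDhhmmss  -> day, hour, minute, second (one byte each)
    """
    w1 = int(word1, 16)
    w2 = int(word2, 16)
    year = 1900 + ((w1 >> 8) & 0xFF)
    month = (w1 & 0xFF) + 1          # stored zero-based, so add 1
    day = (w2 >> 24) & 0xFF
    hour = (w2 >> 16) & 0xFF
    minute = (w2 >> 8) & 0xFF
    second = w2 & 0xFF
    return datetime(year, month, day, hour, minute, second)


# Words copied from the first SYSTEM ALERT above.
print(decode_gsp_timestamp("00006B02", "0E0E393B"))  # 2007-03-14 14:57:59
```

Being able to decode the words by hand is mainly useful when only the raw hex columns of a log have been saved and the order of events needs to be reconstructed.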


What can I do?
I don't know whether I can do anything from a recovery shell (I haven't tested it).

Regards,
Cedrick Gaillard
Best regards, Cedrick Gaillard
Peter Godron
Honored Contributor

Re: loop in system reboot

Hi,
I would remove any external connection (network/Fibre Channel etc) and try again.

Is there anything odd in the pim output from the Service Menu?
What do the commands au and bid return from the Configuration Menu?
goldboy
Trusted Contributor

Re: loop in system reboot

It seems as if you are not getting to the BCH prompt.
I would suggest pulling all the I/O interfaces out of the server (after labeling them) and trying to run the POST again.

Tal
"Life is what you make out of them!"
mobidyc
Trusted Contributor

Re: loop in system reboot

Friday evening, I pulled out all the interface cards (Fibre Channel, Ethernet and SCSI), but the problem is still the same.

The system was completely reinstalled on HP support's advice, but that solved nothing for this server.

Yesterday evening, I took the disks and put them in another N4000. The system can boot correctly now, so I'm 99% sure that this is a hardware problem (but which part?).

Peter, where is the Service or Configuration menu? In the GSP?

thanks for your help.

Cheers,
Cedrick Gaillard
Best regards, Cedrick Gaillard
Laurent Menase
Honored Contributor

Re: loop in system reboot

Cedrick,
Try pulling out your interfaces.
Otherwise, did you try installing 11.11 just to see? If you can reproduce it on 11.11, support can't say anything. And it should reproduce, because your problem really looks like a hardware problem.

You can also try to look at the hard logs and find the address where the system is when the problem happens.

Otherwise, did you try to get support to solve the non-OS problem:
unknown, no source stated legacy PA HEX chassis-code CE00
unknown, no source stated legacy PA HEX chassis-code CE01
unknown, no source stated legacy PA HEX chassis-code CE0F
unknown, no source stated legacy PA HEX chassis-code CEC0
unknown, no source stated legacy PA HEX chassis-code CED0
unknown, no source stated legacy PA HEX chassis-code CEDB
unknown, no source stated legacy PA HEX chassis-code CEDF
goldboy
Trusted Contributor

Re: loop in system reboot

Cedrick,
The Service Menu is located in the BCH (that is the menu that appears after the POST and gives you 10 seconds to interact and stop before the system boots... it seems as if you are not getting to that phase).

Since you tried to run the system without the interfaces, I would suggest that, if you have multiple memory carriers and processors, you extract them all and try to get the system up with a single processor and minimum memory.

If the system passes the POST without all the extra parts, then one of them is causing the problem.

Tal
"Life is what you make out of them!"
Tim Nelson
Honored Contributor

Re: loop in system reboot

Here is just a shot in the dark.

There is a firmware update for the GSP that resolves an issue with system crashes for unknown reasons. We experienced this on a couple N4000 servers way back. HP should have checked this.

Is your GSP firmware up to date ?

At the GSP prompt, enter he. The help screen prints the firmware version at the top.

As I said, this is a left field item.

mobidyc
Trusted Contributor

Re: loop in system reboot

Thanks all for your input.

The server was replaced with another N4000 with the same characteristics, and the system booted correctly.

Problem:
I have the same symptoms as on the original server: under load, the server goes down... and I am unable to power it on through the GSP:
GSP> pc

PC

System Power: Off Power Switch: On
Do you wish to turn power On ? (Y/[N]) Y
Y
Enter delay (in minutes) of delay to power On (CR for no delay)?

System will be powered On in 0 minutes.
Confirm? (Y/[N]): Y
Y

The system did not respond to the Power-On command.
Turn the system power switch Off and On again to reset the system.


I had exactly the same problems with the original server (before the reboot loop).

The GSP then reports the following errors:
ALERT LEVEL: 13 = System hang detected via timer popping
CALLER SUBACTIVITY: 00 = implementation dependent

ALERT LEVEL: 14 = Fatal power or environmental problem prevents operation
CALLER SUBACTIVITY: 04 = low voltage power supply

ALERT LEVEL: 14 = Fatal power or environmental problem prevents operation
CALLER SUBACTIVITY: 04 = low voltage power supply

HP is coming to add a trim card to this server, but I'm surprised, because this server was shut down a year ago and had been heavily loaded, and it never had the problem I am encountering.
I'll keep this post updated after the trim card is added.

Regards,
Cedrick Gaillard
Best regards, Cedrick Gaillard
goldboy
Trusted Contributor

Re: loop in system reboot

Cedrick,
You would be amazed how many times I have seen loaded servers work for a while and then crash because they did not have the trim card.

The system will not power up unless 2 or more power supplies are installed and have power. Also, the power has to be 220V!

Maybe one of these conditions is not being met?

Tal
"Life is what you make out of them!"
mobidyc
Trusted Contributor

Re: loop in system reboot

Hello,

The trim card solved my problem for only 5 days ;(

The server shut itself down and it is impossible to restart it with the PC command in the GSP.
(The system did not respond to the Power-On command.
Turn the system power switch Off and On again to reset the system.)

I really don't understand why it switches off, because:
- the system was reinstalled from scratch on another server (same characteristics)
- it is fully patched as of now (all special instructions and patch conflict warnings were read)
- no error is logged in the system.

The only errors I have are in the GSP:
Log Entry # 0 :
SYSTEM NAME: nr0u0170
DATE: 03/20/2007 TIME: 20:29:51
ALERT LEVEL: 13 = System hang detected via timer popping

SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 4 = timeout

CALLER ACTIVITY: F = display_activity() update STATUS: 0
CALLER SUBACTIVITY: 00 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 00

0x78E000D41100F000 00000003 00000000 type 15 = Activity Level/Timeout
0x58E008D41100F000 00006B02 14141D33 type 11 = Timestamp 03/20/2007 20:29:51
Type CR for next entry, Q CR to quit.



Log Entry # 1 :
SYSTEM NAME: nr0u0170
DATE: 03/20/2007 TIME: 20:29:55
ALERT LEVEL: 14 = Fatal power or environmental problem prevents operation

SOURCE: 4 = power
SOURCE DETAIL: 4 = high voltage DC power SOURCE ID: FF
PROBLEM DETAIL: 0 = no problem detail

CALLER ACTIVITY: 4 = monitor STATUS: F
CALLER SUBACTIVITY: 04 = low voltage power supply
REPORTING ENTITY TYPE: 2 = power monitor REPORTING ENTITY ID: 00

0x002000E044FF404F 00000000 00000000 type 0 = Data Field Unused
0x582008E044FF404F 00006B02 14141D37 type 11 = Timestamp 03/20/2007 20:29:55
Type CR for next entry, - CR for previous entry, Q CR to quit.

Log Entry # 2 :
SYSTEM NAME: nr0u0170
DATE: 03/20/2007 TIME: 20:29:55
ALERT LEVEL: 14 = Fatal power or environmental problem prevents operation

SOURCE: 4 = power
SOURCE DETAIL: 4 = high voltage DC power SOURCE ID: FF
PROBLEM DETAIL: 4 = output undervoltage

CALLER ACTIVITY: 4 = monitor STATUS: F
CALLER SUBACTIVITY: 04 = low voltage power supply
REPORTING ENTITY TYPE: 2 = power monitor REPORTING ENTITY ID: 00

0x002000E444FF404F 00000000 00000000 type 0 = Data Field Unused
0x582008E444FF404F 00006B02 14141D37 type 11 = Timestamp 03/20/2007 20:29:55
Type CR for next entry, - CR for previous entry, Q CR to quit.
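
Editorial aside on the entries above: the decoded fields the GSP prints (alert level, source, problem detail and so on) appear to map onto fixed hex digits of the first 64-bit word on each raw line. The Python sketch below is a guess at that mapping, inferred only from the entries quoted in this thread and not from any HP specification. `decode_status_word` and the dictionary keys are hypothetical names, and one hex digit whose meaning could not be inferred is skipped.

```python
def decode_status_word(word: str) -> dict:
    """Split a 64-bit GSP log word into the fields printed next to it.

    Digit positions are an assumption derived from the samples in this
    thread (e.g. 0x002000E444FF404F), not from HP documentation.
    """
    digits = word.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16:
        raise ValueError("expected 16 hex digits, got %r" % word)
    return {
        "data_type": int(digits[0:2], 16) >> 3,        # e.g. 11 = Timestamp
        "reporting_entity_type": int(digits[2], 16),   # e.g. 2 = power monitor
        "reporting_entity_id": int(digits[3:5], 16),
        # digits[5] is skipped: its meaning is not clear from the samples
        "alert_level": int(digits[6], 16),             # e.g. 14 = fatal power
        "problem_detail": int(digits[7], 16),          # e.g. 4 = undervoltage
        "source": int(digits[8], 16),                  # e.g. 4 = power
        "source_detail": int(digits[9], 16),
        "source_id": int(digits[10:12], 16),
        "caller_activity": int(digits[12], 16),        # e.g. 4 = monitor
        "caller_subactivity": int(digits[13:15], 16),  # e.g. 04 = low voltage PS
        "status": int(digits[15], 16),
    }


# First raw word of Log Entry # 2 above.
entry = decode_status_word("0x002000E444FF404F")
print(entry["alert_level"], entry["source"], entry["problem_detail"])  # 14 4 4
```

If the mapping holds, filtering a saved log for `alert_level` 14 with `source` 4 would pull out just the power events, which is the pattern that recurs on both servers in this thread.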

However, note that there is no cooling problem and no electrical problem.
HP support doesn't know how to solve the problem, and we don't know whether it's hardware or software.
It's HP-UX 11.00, and there is no level 2 support for this release since it reached end of life in December 2006.

I'm lost.

Regards,
Cedrick Gaillard
Best regards, Cedrick Gaillard
Steven E. Protter
Exalted Contributor

Re: loop in system reboot

Shalom Cedrick,

HP has not fully resolved the hardware problem.

You may need a new system board, based on the number of failures going on. Hardware support needs to come out again, do a full diagnosis and replace the part that is broken.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Laurent Menase
Honored Contributor

Re: loop in system reboot

Hi Cedrick,

Clearly the symptoms you describe are hardware.

Hardware support should do a full diagnostic of the system.


goldboy
Trusted Contributor
Solution

Re: loop in system reboot

According to your prior response, the server has already been replaced,
and the error messages from the GSP occur during the POST and indicate power issues.

If the system was replaced by HP and worked for 5 days until it failed, I would suggest checking your power outlets/UPS.

Since it is for sure a hardware problem, I would start again with that.

Tal
"Life is what you make out of them!"