HPE 9000 and HPE e3000 Servers

 
klb
Valued Contributor

rp7410 fan failure causing shutdown?

 

I've got a 7410 that was down; I went into the MP and could not get it to power on.  All I could find in the SL output was this error:

EXTERNAL_RD_ERR

 

... and couldn't get the power to ON for the I/O chassis.  I saw FAULT detected at the I/O Chassis level and FAILURE DETECTED at the Cell level with INVALID CELL.

 

Had the site guy pull the power completely from everything and reseat the power cables and the system booted.  Now it's going unresponsive again and this is what shows in the SL:

 

MP:VWR (<CR>,<sp>,+,-,D,F,L,J,V,H,^B) >
# Location Alert Keyword Timestamp
504 CLMP 0 *4 0x202007446400405f 0x00ffffffff00ff64 IOFAN_FAIL
504 CLMP 0 *4 0x58200f0000004050 0x00006f0a0e0d3622 11/14/2011 13:54:34
503 CLMP 0 *2 0x582008294400302f 0x00006f0a0e0d3527 CABPWR_OFF
502 PDHC 0,1 *2 0x246013229201303f 0x00ffff01ff00ff91 POWER_OFF
502 PDHC 0,1 *2 0x58601b0000003030 0x00006f0a0e0d3522 11/14/2011 13:53:34
501 CLMP 0 *2 0x585008236900404f 0x00006f0a0e0d3522 HLSB_POWER_OFF
500 CLMP 0 *2 0x202004228d01303f 0x000001ffffffff8d PCIBP_PWR_OFF
500 CLMP 0 *2 0x58200c0000003030 0x00006f0a0e0d3521 11/14/2011 13:53:33
499 CLMP 0 *2 0x202003229201303f 0x00ffffffff01ff64 CELL_PWR_OFF
499 CLMP 0 *2 0x58200b0000003030 0x00006f0a0e0d3521 11/14/2011 13:53:33
498 CLMP 0 *2 0x582008226500400f 0x00006f0a0e0d3520 PSWITCH_OFF
497 CLMP 0 *6 0x582008644102401f 0x00006f0a0e0d3311 AC_UNDERVOLTAGE
496 CLMP 0 *6 0x582008644103401f 0x00006f0a0e0d3311 AC_UNDERVOLTAGE
495 CLMP 0 *14 0x582008e44501404f 0x00006f0a0e0d330f HKP_UNDRVOLTAGE
494 CLMP 0 *6 0x582008694101401e 0x00006f0a0e0d330f AC_DELETED
493 CLMP 0 *6 0x582008694100401f 0x00006f0a0e0d330f AC_DELETED
492 CLMP 0 *6 0x582008644100401f 0x00006f0a0e0d330f AC_UNDERVOLTAGE
491 CLMP 0 *6 0x582008644101401f 0x00006f0a0e0d330f AC_UNDERVOLTAGE
490 CLMP 0 *6 0x582008644100401f 0x00006f0a0e0d330f AC_UNDERVOLTAGE
489 CLMP 0 *10 0x202012aa91012fcf 0x00ffff01ff00ff91 EXTERNAL_RD_ERR
489 CLMP 0 *10 0x58201a0000002fc0 0x00006f0a0e0d3307 11/14/2011 13:51:07
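
For anyone who wants to read a paste like the one above oldest-first, here is a minimal Python sketch. It assumes the line shape seen in this paste only (entry number, source, location, `*` alert level, two hex words, then either a keyword or a timestamp); `parse_sl` and `keywords_oldest_first` are hypothetical helper names, not part of any HP tool, and the hex words are not decoded.

```python
import re

# Assumed line shape, inferred only from the SL paste in this thread:
#   <entry#> <source> <location> *<alert> <hex word> <hex word> <keyword or timestamp>
LINE_RE = re.compile(
    r"^(\d+)\s+(\w+)\s+(\S+)\s+\*(\d+)\s+0x[0-9a-fA-F]+\s+0x[0-9a-fA-F]+\s+(.+)$"
)
TIMESTAMP_RE = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}$")


def parse_sl(text):
    """Return {entry number: fields}, merging keyword and timestamp lines."""
    entries = {}
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip the prompt, the column header and blank lines
        num, source, location, alert, tail = m.groups()
        entry = entries.setdefault(
            int(num),
            {"source": source, "location": location, "alert": int(alert),
             "keyword": None, "timestamp": None},
        )
        tail = tail.strip()
        if TIMESTAMP_RE.match(tail):
            entry["timestamp"] = tail
        else:
            entry["keyword"] = tail
    return entries


def keywords_oldest_first(entries):
    """List the keywords in ascending entry-number order."""
    return [entries[n]["keyword"] for n in sorted(entries) if entries[n]["keyword"]]
```

Run over the paste above, this lists EXTERNAL_RD_ERR (entry 489) first and IOFAN_FAIL (entry 504) last, which is the order in which the MP logged them.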

 

Just before it went unresponsive, I found a syslog entry that stated this:

 

Event Time..........: Mon Nov 14 15:13:43 2011
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 43
System..............: tsfvtftest.verizon.com

Summary:
I/O fan 0 failure in compute cabinet 0 I/O enclosure chassis 0

 

So, is my problem a FAN and the system is shutting itself down to prevent thermal damage?

 

Thanks,

 

-klb

14 REPLIES
Torsten.
Acclaimed Contributor

Re: rp7410 fan failure causing shutdown?

A single fan can fail, because it is n+1, but 2 failed fans cause a shutdown.
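
Put as a minimal sketch of the N+1 rule stated above (the function name and the `spare_fans` parameter are hypothetical; the actual MP firmware logic is not shown anywhere in this thread):

```python
def must_shut_down(failed_fans, spare_fans=1):
    """N+1 cooling: up to `spare_fans` failed fans are tolerated."""
    if failed_fans < 0:
        raise ValueError("failed_fans cannot be negative")
    # One failure is covered by the spare; a second one is not.
    return failed_fans > spare_fans
```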

 

Check from MP with "ps".


Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

klb
Valued Contributor

Re: rp7410 fan failure causing shutdown?

 

I know there's a fan failure, but maybe that's not the entire problem.  I've attached a file, ps_output.txt, that contains the PS command output, but I've pasted some of it right here to show power status for I/O chassis is FAULT and the dead fan:

 

Running PS and selecting T - Cabinet shows that a Warning / Fault was detected on IO Chassis 0 (see attachment).

 

T - Cabinet
S - System Backplane
G - MP (Core I/O)
P - IO Chassis
C - Cell
Select Device: P
Enter IO Chassis number: 0

HW status for IO Chassis 0 : FAULT DETECTED

Local Power Monitor Version is 1.2

Power is off, 12V Rail VOLTAGE FAULT
Power Module's   Brick 0 VRM 0
  Present      :   *       *
  OK           :   *
  Enabled      :

 

What's happening is that the machine boots after a hard power cycle, but only runs for a few minutes, then goes unresponsive ( the NIC returns a ping, but no connections are honored and shell sessions that were in progress on the box just hang ).  Prior to going unresponsive, I see notification in syslog that a fan has failed.

 

Going into the MP, I see the above noted power fault on IO chassis 0, but that's it.  Everything else, aside from the dead fan, looks good.

 

Any ideas?  We're going to have to pay time and materials on this, so I want to be able to diagnose it as best I can before making the service call.

 

Thanks,

 

-klb

 

 

klb
Valued Contributor

Re: rp7410 fan failure causing shutdown?

Just sent the chassis log info to HP for diagnosis; the guy there recommended replacing the IO Backplane and is working up a Time and Materials quote. That's going to be expensive.

Based on what you guys see, does this sound like a good stab in the dark or a bad stab in the dark?
Robert_Jewell
Honored Contributor

Re: rp7410 fan failure causing shutdown?

Based upon what is shown in the PS output, I would say that VRM0 is faulty.

 

Power is off, 12V Rail VOLTAGE FAULT
Power Module's   Brick 0 VRM 0
  Present      :   *       *
  OK           :   *        
  Enabled      :           

 

This is easy enough to test since there are two VRMs in the server.  Swap them around, and if VRM1 fails afterwards (as would be seen on Chassis 1), then you know you have a bad VRM.
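
The reasoning behind the swap test can be sketched as below. This is only an illustration: `interpret_swap_test` is a hypothetical helper, the position labels are whatever PS reports as not OK, and the "fault cleared" branch is an assumption of mine rather than something stated in this thread.

```python
def interpret_swap_test(position_before, position_after):
    """Compare which position reported not-OK before and after swapping two parts.

    Each argument is a label such as "VRM 0" or "VRM 1", or None if no fault shows.
    """
    if position_before is None:
        return "no fault before the swap, nothing to test"
    if position_after is None:
        # Assumption: a fault that disappears may have been a seating problem.
        return "fault cleared, possibly a seating problem"
    if position_after != position_before:
        return "fault followed the part, suspect the part"
    return "fault stayed in place, suspect something other than the swapped part"
```

For example, `interpret_swap_test("VRM 0", "VRM 1")` points at the VRM itself, while `interpret_swap_test("VRM 0", "VRM 0")` points away from it.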

 

EDIT:   I should have added that the VRMs I refer to are on the IO chassis backplane.  They are accessed from the right side (front facing) behind the small metal panel.  VRM 0 is the one closest to the front.

 

-Bob

----------------
Was this helpful? Like this post by giving me a thumbs up below!
klb
Valued Contributor

Re: rp7410 fan failure causing shutdown?

 

We swapped the two VRMs from the side panel and found the same condition:

 

T - Cabinet
S - System Backplane
G - MP (Core I/O)
P - IO Chassis
C - Cell
Select Device: P
Enter IO Chassis number: 0

HW status for IO Chassis 0 : FAULT DETECTED

Local Power Monitor Version is 1.2

Power is off, 12V Rail VOLTAGE FAULT
Power Module's   Brick 0 VRM 0
  Present      :   *       *
  OK           :   *
  Enabled      :

 

Seems odd that the same IO chassis is complaining about the same VRM being not OK.  Maybe that does point to the IO Planar board?  

 

Are there any other things we can do to try and narrow this down to a particular piece of hardware?

 

thanks,

 

-klb

 

 

klb
Valued Contributor

Re: rp7410 fan failure causing shutdown?

 

I've attached ps listings for 

 

T- Cabinet

P - IO Chassis 0 and 1

Robert_Jewell
Honored Contributor

Re: rp7410 fan failure causing shutdown?

You could also try swapping around the PCI power bricks (located in the front of the server).  Each one of these serves power to one half of the chassis.

 

You can also remove all IO cards from the chassis and try again.

 

Lastly, you can actually try to swap the cell boards around.  It is the cell board that initiates the power on of the IO Chassis (without a cell in a slot, the associated IO Chassis will not power on).  I would not think this to be the problem, but you might as well try.

 

If you always see Chassis 0 reporting an error, then I would think you have little choice but to replace that backplane assembly.

 

-Bob

klb
Valued Contributor

Re: rp7410 fan failure causing shutdown?

Thanks. HP quoted us a large number to replace the backplane ( 6k ). We're going to swap around the PCI power modules and see if that makes a difference.

BTW, this system has only 1 cell and it's in slot 1. This means we don't even need IO Chassis 0 as I understand it. Any way we can disable that entirely and just run with cell 1 IO Chassis 1?

Thanks,

-klb
Robert_Jewell
Honored Contributor

Re: rp7410 fan failure causing shutdown?

>BTW, this system has only 1 cell and it's in slot 1.

 

Well, this would explain why IO Chassis 0 is "faulting".  Without a cell in slot 0, IO Chassis 0 will not power on.
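
As a minimal sketch of that dependency (hypothetical function name; it assumes IO Chassis N is paired with cell slot N, which is how the chassis and slots are described in this thread):

```python
def io_chassis_can_power_on(chassis, populated_cell_slots):
    """An IO chassis powers on only if the matching cell slot holds a cell."""
    return chassis in set(populated_cell_slots)
```

With a single cell in slot 1, `io_chassis_can_power_on(0, [1])` is `False` and `io_chassis_can_power_on(1, [1])` is `True`.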

 

From the MP Command Menu run SYSREV and post it here.  I would like to check out the system firmware levels.  Also can you attach the entire output of the Service Logs (SL, E, D)?  Now I am thinking this is not a power issue.

 

-Bob
