RX4640 freezes randomly

 
l.x.
Occasional Visitor

Hi, I am a newcomer to this forum, so please forgive me if I am asking in the wrong place. Here is the issue:

We have an RX4640 server running HP-UX 11 in our lab. Since last year it has been freezing randomly; the interval between occurrences ranges from a couple of days to a few weeks. When it "freezes", the server actually powers off. I have to restart it with the iLO MP command-line command (PC-->ON) to bring it back online. Very rarely that does not work, and I have to physically unplug the power cords from the server and plug them back in to make it reboot. Once the server is powered up, all applications and databases work fine.

The server's firmware versions are as follows:
MP FW : E.03.30
BMC FW : 04.05
EFI FW : 05.48
System FW : 04.21

I recently noticed that when the server froze, some fatal alerts were logged in the system event log. The following is the log record for the latest freeze (obtained from MP-->SL):

Log Entry 532: 22 Mar 2011 22:10:55
Alert Level 7: Fatal
Keyword: Type-02 127008 1208328
System shut down or reset caused by sensor reading
Logged by: Baseboard Management Controller;
Sensor: System Event - 5V
Data2: OEM Code2: 0xC3
0x204D891E6F022660 C300A870C3120300


I also looked at the logs in the OS log directory with the command "/usr/sbin/diag/contrib/slview -f /var/stm/logs/os/fpl.log.02". However, it showed two entries for this freeze instead of just one:

Log Entry 8402:
Alert Level 7: Fatal
Keyword: SHUTDOWN_OR_RESET_ON_SENSOR
System shut down or reset caused by sensor reading
System shut-down or reset caused by sensor reading.
Logged by: Baseboard Management Controller
Data: 0x204d891e6f022660 0xc300a870c3120300
Tue Mar 22 22:10:55 2011
Generator: Baseboard Management Controller
Sensor Type: System Event
Sensor Number: 195
Cause: A sensor reading in the system was determined to be non-recoverable and the system was shut down or reset.
Action: Read the system logs to find which sensor was out of range.

Log Entry 8401:
Keyword: IPMI Type-02 Event
Logged by: Baseboard Management Controller
Data: 0x204d891e6f022650 0x76255401c3020300
Tue Mar 22 22:10:55 2011
Generator: Baseboard Management Controller
Sensor Type: Voltage
Sensor Number: 195
Cause/Action : No information available.


While inspecting earlier logs I found similar fatal alerts around the time of each freeze, but they came from different "sensor numbers" (97, 205, and others).

My questions are:
1) What are these sensors? Is there a document that explains what each sensor number represents?
2) Given these alerts, does anyone know what exactly triggered the system shutdown?
3) Is there a quick way to fix it? (This server is out of warranty, and we don't have much budget to replace parts.)
4) If there is no quick or easy fix, is there a way to set the server to reboot automatically each time it freezes (shuts down)? (Say, using MP command scripts, if there is such a thing.)

Thanks in advance for your advice!

- Leon
Robert_Jewell
Honored Contributor
Solution

Re: RX4640 freezes randomly

With the information you have provided, there does seem to be a problem with voltages. From these particular alerts I can see a problem with the 5VDC voltage rail. This could be the result of a faulty Voltage Regulator Module (VRM). If other voltage rails are reporting problems, then it could be a faulty system board or even a faulty power supply.

>Log Entry 532: 22 Mar 2011 22:10:55
>.
>.
>Sensor: System Event - 5V

Check your logs in further detail. If every sensor refers to the 5V rail, then I would suspect that the 5V VRM on the system board is faulty. This could mean replacing the entire system board, unless you can find that VRM; there are three installed in total (5V, 12V, and 3.3V).
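
If you capture the MP SL output to a text file, a small script can tally the fatal alerts per sensor so you can see at a glance whether they all point at the same rail. The Python sketch below is only an illustration: it assumes the text has the same line layout as the entries quoted in this thread ("Log Entry", "Alert Level 7", "Sensor:"), and `count_fatal_sensors` is a hypothetical helper name, not an HP tool.

```python
import re
from collections import Counter

# Matches the first line of each record, e.g. "Log Entry 532: 22 Mar 2011 22:10:55"
ENTRY_RE = re.compile(r"^Log Entry\s+\d+:")

def count_fatal_sensors(log_text):
    """Count 'Sensor:' lines that belong to Alert Level 7 (fatal) entries.

    Assumes the MP SL text layout quoted in this thread; other firmware
    revisions may format the log differently.
    """
    counts = Counter()
    fatal = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if ENTRY_RE.match(line):
            fatal = False            # a new record starts; reset the flag
        elif line.startswith("Alert Level 7"):
            fatal = True
        elif fatal and line.startswith("Sensor:"):
            counts[line[len("Sensor:"):].strip()] += 1
    return counts

SAMPLE = """\
Log Entry 532: 22 Mar 2011 22:10:55
Alert Level 7: Fatal
Keyword: Type-02 127008 1208328
System shut down or reset caused by sensor reading
Logged by: Baseboard Management Controller;
Sensor: System Event - 5V
Data2: OEM Code2: 0xC3
0x204D891E6F022660 C300A870C3120300

Log Entry 545: 23 Mar 2011 15:01:24
Alert Level 7: Fatal
Keyword: Type-02 127008 1208328
System shut down or reset caused by sensor reading
Logged by: Baseboard Management Controller;
Sensor: System Event - 1.5V MR PwrGood
Data2: OEM Code2: 0xCD
0x204D8A0B44022750 CD00A870CD120300
"""

if __name__ == "__main__":
    for sensor, n in count_fatal_sensors(SAMPLE).most_common():
        print(f"{n:3d}  {sensor}")
```

If the tally shows only the 5V sensor, that points toward the 5V VRM; if several rails appear, the system board or power supply becomes more likely.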

To answer your other questions:

> What are these sensors?
The system has circuitry called the Baseboard Management Controller (BMC). It is responsible for monitoring and reporting the base functions of the system, such as voltages.
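
One observation that may help when reading the raw data in your log entries: the first byte of the second data word (0xC3) equals the sensor number that slview reports (195), and bytes 2 to 5 of the first word (0x4D891E6F), read as a Unix timestamp in UTC, give 22 Mar 2011 22:10:55, the time shown on the entry. The Python sketch below pulls out those two fields. The byte layout is inferred only from the sample entries in this thread, not from HP documentation, and `decode_event` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def _hex_to_bytes(word):
    """Convert '0x204D...' or '204D...' into raw bytes."""
    word = word.strip()
    if word.lower().startswith("0x"):
        word = word[2:]
    return bytes.fromhex(word)

def decode_event(word1, word2):
    """Extract timestamp and sensor number from the two 64-bit data words.

    ASSUMPTION (inferred from the sample logs only):
      - bytes 1..4 of word1 are a big-endian Unix timestamp
      - byte 0 of word2 is the sensor number
    """
    w1 = _hex_to_bytes(word1)
    w2 = _hex_to_bytes(word2)
    if len(w1) != 8 or len(w2) != 8:
        raise ValueError("expected two 8-byte hex words")
    seconds = int.from_bytes(w1[1:5], "big")
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "sensor_number": w2[0],
    }

if __name__ == "__main__":
    event = decode_event("0x204D891E6F022660", "C300A870C3120300")
    print(event["sensor_number"], event["timestamp"].strftime("%d %b %Y %H:%M:%S"))
```

This only shows which sensor number an entry refers to; it does not tell you what that sensor measures, so the sensor name in the MP SL output is still the thing to go by.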

> is there a way we can set the server to automatically reboot after each time
I don't believe so. The server shuts down to try to prevent further damage from occurring. Continuing to operate with voltages out of range could lead to further damage.

l.x.
Occasional Visitor

Re: RX4640 freezes randomly

Robert,

Thanks for the insights. They are very helpful. I checked the other alerts and saw that another sensor (#97) also reported the 5V voltage out of range. There were also several fatal alerts in which sensor number 205 reported "1.5V MR PwrGood", such as the log record below:


Log Entry 545: 23 Mar 2011 15:01:24
Alert Level 7: Fatal
Keyword: Type-02 127008 1208328
System shut down or reset caused by sensor reading
Logged by: Baseboard Management Controller;
Sensor: System Event - 1.5V MR PwrGood
Data2: OEM Code2: 0xCD
0x204D8A0B44022750 CD00A870CD120300



I have no idea what this log means except that it is a fatal alert. I wonder whether there are any other logs (in addition to the event log) I should look into for more information...