RHES3U4 system was auto rebooted twice within 11 hours

 
Gary L
Super Advisor

Hi

I have a physical Red Hat Linux server (HP blade, RHEL ES 3 Update 4) with Oracle 10g RAC installed. I have no idea why this server rebooted itself twice between Dec 12 21:57 and Dec 13 07:30.

#last reboot
reboot system boot 2.4.21-27.ELsmp Thu Dec 13 07:30 (04:04)
reboot system boot 2.4.21-27.ELsmp Wed Dec 12 21:57 (13:37)

I didn't find any useful information in the system log file or in dmesg.

How can I find the cause of this kind of reboot?

Thank you very much. Any answers will be much appreciated.

-Gary
11 REPLIES
Ivan Ferreira
Honored Contributor
Solution

Oracle Clusterware can trigger server reboots if it detects a communication error between the nodes. Check the Oracle log files.

If this is the cause, then start troubleshooting the node interconnect.

Another possibility is an environmental problem, for example a failed fan: the server will reboot if the temperature gets too high.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Gary L
Super Advisor

Hi Ivan

Thank you very much for your fast reply.

Questions for you:

1. Apart from a fan failure, what else could cause the system to reboot? Disk(s), NIC(s), etc.? Note that since the two automatic reboots, the server has been running normally.

2. How can I check the health status of the hardware through the server's console or GUI desktop?

Thanks a lot
Ivan Ferreira
Honored Contributor

>>> 1. Except the Fan reason, whatelse could cause the system reboot? Disk(s), NIC(s) etc.? As you know, after the twice system auto rebooted, so far the server running as normal.

So far, I have only seen controlled reboots like yours caused by fan or power supply failures, issued through ACPI.

>>> 2. How to check the health status of hardwares through server's console or GUI Desktop?

This depends on the hardware model. On Itanium-based machines, you have a console where you can check the hardware logs. On ProLiant servers, you should rely on the "ProLiant Support Pack" and email notifications.
Gary L
Super Advisor

Thanks, Ivan.

My physical server is a ProLiant box.
Gary L
Super Advisor

Checking the "server status" page via iLO, I found that the processor information on this server differs from that of the other servers:

This server: (3.2G Dual CPU installed on ProLiant BL20p G3)
Proc 1: 3200 MHz
Processor 1 Internal L1 Cache: 16 KB
Processor 1 Internal L2 Cache: 1024 KB
Proc 2: unavailable

Others (same configuration as above):
Proc 1: 3200 MHz
Processor 1 Internal L1 Cache: 16 KB
Processor 1 Internal L2 Cache: 1024 KB
Proc 2: 3200 MHz
Processor 2 Internal L1 Cache: 16 KB
Processor 2 Internal L2 Cache: 1024 KB

Could a failed CPU have caused the system reboots?

How can I confirm it?

Thanks
Gary L
Super Advisor

But checking the system with the "top" command, the CPU does not look failed:

CPU states: cpu user nice system irq softirq iowait idle
total 0.8% 0.0% 0.0% 0.0% 0.0% 1.1% 97.9%
cpu00 1.5% 0.0% 0.0% 0.0% 0.0% 1.1% 97.2%
cpu01 0.1% 0.0% 0.0% 0.0% 0.0% 1.1% 98.6%

what's going on?
Ivan Ferreira
Honored Contributor

>>> Whether there is a CPU failed caused system reboot?
>>> How to make sure it?

Are you sure this server had 2 physical CPUs? Or has it always had 1 physical CPU? CPU failures normally cause the server to panic.

>>> But through check the system via command "top", it looks not failed,

If you have only one CPU, and it is dual-core or has hyper-threading enabled, the OS will see 2 (or more) CPUs for each physical CPU.

Have you already verified oracle CRS logs?
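
The difference between logical and physical CPUs described above can be checked from the OS side. The following is only a minimal sketch: it assumes the running kernel reports a `physical id` line for each logical CPU in /proc/cpuinfo, which not every kernel does, and the function name `count_cpus` is made up for illustration.

```shell
# Sketch: count logical vs. physical CPUs from a cpuinfo-style file.
# Assumption: the kernel reports a "physical id" line per logical CPU.
# If it does not, physical is printed as 0 and the result is inconclusive.
count_cpus() {
    awk -F: '
        /^processor/   { logical++ }        # one entry per logical CPU
        /^physical id/ { ids[$2] = 1 }      # remember each distinct package id
        END {
            physical = 0
            for (id in ids) physical++
            printf "logical=%d physical=%d\n", logical, physical
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Running `count_cpus` with no argument reads /proc/cpuinfo. A result with two logical CPUs but one physical id would be consistent with a single hyper-threaded or dual-core processor rather than two populated sockets.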
Gary L
Super Advisor

Hi Ivan

I'm not sure how many physical CPUs it has, because this server is located in another city. I will go there to check tomorrow. Maybe it is one physical CPU with dual cores.

I have been checking the Oracle log files with the Oracle team.

Thanks a lot.
Venilton Junior
Valued Contributor

Gary,

If it's a hardware problem, you can check the IML (Integrated Management Log) in iLO.

But this sounds more like something in your OS or your RAC.

Search /var/log/messages for the events logged just before the reboot line.
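
One way to pull out the events just before each boot is sketched below. It assumes that syslogd writes a line containing `restart` when it starts at boot (as sysklogd typically does) and that GNU grep is available for the `-B` context option; the function name `lines_before_boot` is made up for illustration.

```shell
# Sketch: print the N lines that precede each syslogd startup marker.
# Assumption: the first line logged after boot matches "syslogd ... restart".
# Arguments: log file (default /var/log/messages), context lines (default 20).
lines_before_boot() {
    logfile="${1:-/var/log/messages}"
    context="${2:-20}"
    grep -B "$context" 'syslogd.*restart' "$logfile"
}
```

For example, `lines_before_boot /var/log/messages 30` shows the 30 lines before each startup marker. If shutdown messages appear just before the marker, the reboot was probably an orderly one; if the log simply stops and then restarts, the machine more likely reset or lost power without warning.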

RTFM
skt_skt
Honored Contributor

Go through the dmidecode output and search for the CPU entries. On some models you may see an entry like "populated and enabled"; a failed CPU shows a different status.

# dmidecode > dmidecode.out
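
To narrow the saved output down to the processor sockets, something like the following sketch may help. It assumes the dump contains `Processor Information` sections with `Socket Designation:` and `Status:` fields, each section introduced by a line starting with `Handle`; the exact layout and wording vary between dmidecode versions and BIOSes, and the function name `cpu_socket_status` is made up for illustration.

```shell
# Sketch: list socket name and status for each processor section
# in a saved dmidecode dump (default file: dmidecode.out).
# Assumption: every DMI record starts with a line beginning with "Handle".
cpu_socket_status() {
    awk '
        /^Handle /              { in_cpu = 0 }        # a new record begins
        /Processor Information/ { in_cpu = 1; next }  # it is a CPU record
        in_cpu && /Socket Designation:|Status:/ {
            sub(/^[ \t]+/, "")                        # drop the indentation
            print
        }
    ' "${1:-dmidecode.out}"
}
```

Running `cpu_socket_status dmidecode.out` prints one designation/status pair per socket, so a socket whose status differs from the others stands out without scrolling through the full dump.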
Gary L
Super Advisor

Thanks, Venilton and Santhosh.
Have a great day.