HPE 9000 and HPE e3000 Servers

Event 646: Partition being reset due to watchdog timeout expiring

 

Hello Everyone,

I'm receiving the following Critical Error from the Event Monitor:

"Event 646: Partition being reset due to watchdog timeout expiring"

The entire message is attached to this post.

Should I be concerned about this? I receive it about once a month.

Also, any recommendations on what I can do to resolve it?

Any help/insight is appreciated.

Joseph Loo
Honored Contributor

Re: Event 646: Partition being reset due to watchdog timeout expiring

hi mike,

Have you done what the action statement asks you to do?

Action: Find out why the partition's OS had hung. The cause could be bad HW that crashed the partition, or in rare cases, a combination of events that caused the OS to be unable to refresh the watchdog timer. Look for other events preceding the timeout for clues to the root cause of the partition being unresponsive.

Are there any errors in /var/adm/syslog/syslog.log, or did the dmesg output show any SCSI errors, etc.?

regards.
(p.s. please remember to assign points.
http://forums1.itrc.hp.com/service/forums/pageList.do?userId=CA1176297&listType=unassigned&forumId=1)
what you do not see does not mean you should not believe
Mohanasundaram_1
Honored Contributor

Re: Event 646: Partition being reset due to watchdog timeout expiring

Hi Mike,

I assume that the server is an rp7420 and the firmware is not the latest. I had a similar problem and after the message the MP was not receiving any further events.

You can try this workaround to recover from the watchdog reset and get the OS talking to the MP again:
(This is safe to do with the partition up and running)
- Connect to the MP
- Go into the command menu with 'cm'
- Reset the utility interface to the core cell of the partition using the 'ru' command.

Here is an example where Cell-0 is the core cell:
[test-mp] MP:CM> ru
This command resets the selected MP bus device.
B - BPS (Bulk Power Supply)
A - PACI (Partition Console Interface)
G - MP (Management Processor)
H - PDHC (Cell Board Controller)
Select device: h
Enter cell number: 0
Do you want to reset the Cell PDH Controller Slot 0? (Y/[N]) y
-> The selected MP bus device will be reset.
[test-mp] MP:CM>

Then you should plan for the firmware upgrade.

With regards,
Mohan.
Attitude, Not aptitude, determines your altitude

Re: Event 646: Partition being reset due to watchdog timeout expiring

Joseph,

I checked the syslog.log and it doesn't look like there is anything that caused the hang. I am rather new to this, so I attached a snippet of the log from the time of the problem.

dmesg: The file hasn't been updated since the 12th so I don't think anything has been logged there. But I could be wrong since I've never looked at it. Actually I don't know how to view the file correctly.


Mohan,

The watchdog eventually resets itself. I put a call in to support and they told me not to worry about it.

Regarding the firmware update: I am running an rp7420. Where can I find out whether there are firmware updates?

Joseph Loo
Honored Contributor

Re: Event 646: Partition being reset due to watchdog timeout expiring

hi,

The syslog.log snippet is too small to show whether there are any errors. You may like to grep for "warning" or "error" in that file.
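A minimal sketch of that grep, assuming the log lives at /var/adm/syslog/syslog.log as mentioned earlier in the thread. The function name scan_log is only a hypothetical helper, not an HP-UX command:

```shell
#!/bin/sh
# scan_log (hypothetical helper): print every line of a log file that
# mentions "warning" or "error", ignoring case, prefixed with its line number.
scan_log() {
    # $1: path to the log file to search
    grep -i -n -E 'warning|error' "$1"
}
```

It could then be run as `scan_log /var/adm/syslog/syslog.log`. The line numbers make it easier to go back and read what was logged just before each hit.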

Post your dmesg output:

# dmesg


for OS installable firmware updates:

http://www2.itrc.hp.com/service/patch/search.do?BC=patch.breadcrumb.main|&pageContextName=firmware:

regards.
what you do not see does not mean you should not believe
Mohanasundaram_1
Honored Contributor

Re: Event 646: Partition being reset due to watchdog timeout expiring

Hi Mike,

You can check for the firmware on the ITRC site: select the "patch/firmware database" section, then the "firmware" subsection.

Select "CPU" as the firmware type, type "rp7420" in the search string, and perform a search.

Or just see if the below URL helps,

http://www5.itrc.hp.com/service/patch/patchDetail.do?BC=patch.breadcrumb.main|patch.breadcrumb.search|&patchid=PF_CRAIMED0310&context=firmware:cpu

You need to have a valid ITRC login to access this page. Firmware 3.10 is the latest for rp7420 and rp8420.

With regards,
Mohan.
Attitude, Not aptitude, determines your altitude

Re: Event 646: Partition being reset due to watchdog timeout expiring

Joe, I've attached the output from dmesg. Please let me know what you think.


Mohan, how can I find out what firmware my cpu currently has?

Again, sorry everyone, I'm very new to the whole HP mainframe / Unix deal.
Mohanasundaram_1
Honored Contributor
Solution

Re: Event 646: Partition being reset due to watchdog timeout expiring

Hi Mike,

The current firmware version can be checked from the MP.

1) Connect to the console and press Ctrl-B.
2) If prompted for a login and password, enter Admin/Admin. This is the default login and password.
3) Once you get the MP prompt, type "cm".
4) You should get a "CM>" prompt.
5) Type "sysrev" and capture that output.
6) The firmware documentation contains a matrix that shows which firmware you are on.

Otherwise, send that sysrev output to us and we can tell you the firmware version.

Since you indicated that a call was logged to HP, you can also take the CE's help to ascertain your firmware version.

Please let us know if you are receiving any further messages in the live logs, error logs, or forward progress logs on the MP after you got the watchdog reset message.

With regards
Mohan.
Attitude, Not aptitude, determines your altitude
Joseph Loo
Honored Contributor

Re: Event 646: Partition being reset due to watchdog timeout expiring

hi,

The last line of the dmesg output, "Line 1232 in /ux/core/kern/common/io/pat_psm.c: pat_heartbeat-send log - rc -1 status -5", suggests you need to update your PDC firmware if the error appears repeatedly:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=868351
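
To check whether that message keeps reappearing, something like the following could count its occurrences in the dmesg output. This is only a sketch: count_heartbeat_errors is a made-up helper name, and the match string is taken from the dmesg line quoted above.

```shell
#!/bin/sh
# count_heartbeat_errors (hypothetical helper): read text on stdin and
# print how many lines mention pat_heartbeat.
count_heartbeat_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c 'pat_heartbeat' || true
}
```

It would be used as `dmesg | count_heartbeat_errors`. Keep in mind that dmesg only shows what is still in the kernel message buffer, so the count covers recent history only.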


regards.
(p.s. please remember to assign points.)
what you do not see does not mean you should not believe